CN114581502A - Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Info

Publication number
CN114581502A
CN114581502A (application CN202210233442.5A)
Authority
CN
China
Prior art keywords
model
smplx
monocular image
dimensional
image
Prior art date
Legal status
Pending
Application number
CN202210233442.5A
Other languages
Chinese (zh)
Inventor
张亮
朱光明
冯明涛
梅林
周海超
沈沛意
徐旭
宋娟
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210233442.5A
Publication of CN114581502A
Status: Pending

Classifications

    • G06T 7/50 — Image analysis: depth or shape recovery
    • G06T 17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • G06T 2207/30196 — Subject of image: human being; person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular image-based three-dimensional human body model joint reconstruction method, an electronic device and a storage medium. The method comprises the following steps: acquiring a body monocular image, a hand monocular image and a face monocular image from an original monocular image; extracting the two-dimensional key point coordinates in the body, hand and face monocular images to obtain a body candidate frame, a hand candidate frame and a face candidate frame; cutting the original monocular image according to the candidate frames to obtain three feature images; extracting the features of each feature image with a local feature extraction network and splicing them to obtain a cascade feature; and training an SMPLX model to perform three-dimensional human body reconstruction based on the cascade feature. The depth information extracted by the method is richer, the constructed three-dimensional human body model is more accurate, and less time is consumed.

Description

Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a monocular image-based three-dimensional human body model joint reconstruction method, an electronic device and a storage medium.
Background
In recent years, billions of human daily activities have been recorded as video and uploaded to public internet websites, capturing diverse human behaviors in real-world scenes. Technologies that digitize the human actions in these videos have great potential in applications including human-computer interaction, social artificial intelligence and robotics. Three-dimensional human body reconstruction from a monocular image has therefore drawn much attention, but monocular reconstruction suffers from problems such as the loss of depth information.
At present, monocular three-dimensional human body reconstruction methods recover only the body pose and shape and omit the reconstruction of hand poses and facial expressions. Whole-body reconstruction must also cope with the small proportion of the image occupied by the hands and face, which makes their precise poses difficult to capture. These limitations restrict the accuracy of human body models in pose-capture systems and prevent computers from understanding human actions more intelligently in human-computer interaction.
Disclosure of Invention
The embodiment of the invention aims to provide a monocular image-based three-dimensional human body model joint reconstruction method that combines the hand poses, eye poses and jaw pose of the human body with the body pose, covers richer depth information, reconstructs a more accurate three-dimensional human body model, and has a wider application range.
The embodiment of the invention also aims to provide an electronic device and a storage medium.
In order to solve the above technical problems, the invention adopts the following technical scheme: a monocular image-based three-dimensional human body model joint reconstruction method comprising the following steps:
s1, processing an original monocular image to be reconstructed to obtain a body monocular image, a hand monocular image and a face monocular image, and predicting two-dimensional key point coordinates in each monocular image respectively to obtain a body candidate frame, a hand candidate frame and a face candidate frame;
s2, cropping the original monocular image using the body candidate frame, the hand candidate frame and the face candidate frame, and resizing the cropped images to obtain a body feature image, a face feature image and a hand feature image;
s3, training three local feature extraction networks, respectively extracting body features, face features and hand features in the body feature image, the face feature image and the hand feature image, and splicing the body features, the face features and the hand features to obtain cascade features;
and S4, training the SMPLX model, and performing three-dimensional human body reconstruction based on the cascade characteristics.
Further, in S1 the candidate frame is Box = {(l, r)}, where l denotes the two-dimensional key point coordinates of the upper-left corner of the body, hand or face region, and r denotes the two-dimensional key point coordinates of the lower-right corner.
Further, the training process of the local feature extraction network is as follows:
screening the HUMBI data set to obtain original monocular images containing the whole human body, cutting the original monocular images to obtain local monocular images, predicting the two-dimensional key point coordinates in each local monocular image by using an OpenPose model, obtaining a local candidate frame according to the distribution of the two-dimensional key points, cutting the original monocular image according to the local candidate frame to obtain a local feature image, and obtaining the real model parameters corresponding to each local feature image;
capturing the local features in the local feature images using a ResNet50 network;
predicting the camera parameters and model parameters from the local features using a multilayer perceptron;
and calculating the loss function of the feature extraction network based on the real model parameters, the camera parameters and the predicted model parameters, and adjusting the parameters of the feature extraction network based on the loss value to obtain the optimized feature extraction network.
Further, the local feature extraction network includes a body feature extraction network, a hand feature extraction network, and a face feature extraction network, and a loss function in the local feature extraction network is as follows:
L = L_p + τ1·L_{joint,3D} + τ2·L_{reproj}

where L_p represents the loss between the predicted model and the real model; L_{joint,3D} represents the loss between the three-dimensional key points extracted from the predicted model and the three-dimensional human body key points extracted from the real model; L_{reproj} represents the loss between the two-dimensional key points obtained by projecting the predicted model's three-dimensional key points through the camera and the two-dimensional key points extracted from the real model; and τ1, τ2 are weighting coefficients that balance the loss terms.
Further, the process of training the SMPLX model in S4 is as follows:
taking monocular images with SMPLX model labels as training data, acquiring the local features in the training data using the method of S1 to S3, and splicing all the local features to obtain the cascade feature;

inputting the cascade feature F into the multilayer perceptron and predicting the camera parameters K and the SMPLX model pose parameters θ̂_smplx, body shape parameters β̂_smplx and expression parameters ψ̂_smplx;

respectively calculating the projection error, the three-dimensional key point error and the SMPLX model parameter error according to the real SMPLX model parameters and the human body key point coordinates corresponding to each monocular image in the training data, thereby obtaining the loss function L';

and repeating the above process, calculating the loss function at each iteration, and updating the SMPLX model parameters based on the new loss value to obtain the optimized SMPLX model.
Further, the loss function L' is as follows:
L' = L'_p + τ1·L'_{joint,3D} + τ2·L'_{reproj}

L'_p = ‖θ_smplx − θ̂_smplx‖² + ‖β_smplx − β̂_smplx‖² + ‖ψ_smplx − ψ̂_smplx‖²

L'_{joint,3D} = Σ_{m=1}^{M} v_m · ‖(x‴_3d, y‴_3d, z‴_3d)_m − (x̂‴_3d, ŷ‴_3d, ẑ‴_3d)_m‖²

L'_{reproj} = Σ_{m=1}^{M} v_m · ‖(x′_2d, y′_2d)_m − (x̂′_2d, ŷ′_2d)_m‖²

where L'_p represents the error between the real SMPLX model parameters and the predicted SMPLX model parameters; θ_smplx, β_smplx, ψ_smplx respectively represent the pose, body shape and expression parameters of the real SMPLX model, and θ̂_smplx, β̂_smplx, ψ̂_smplx the SMPLX pose, body shape and expression parameters predicted by the multilayer perceptron; m denotes the human body key point index and M the total number of human body key points, m = 1, 2, …, M, M = 137; v_m indicates whether the mth human body key point is visible; (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) are the three-dimensional key point coordinates extracted from the predicted SMPLX model and (x‴_3d, y‴_3d, z‴_3d) those extracted from the real SMPLX model, with L'_{joint,3D} the loss between them; (x̂′_2d, ŷ′_2d) are the two-dimensional key point coordinates obtained by projecting (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) with the camera parameters K and (x′_2d, y′_2d) those obtained by projecting (x‴_3d, y‴_3d, z‴_3d), with L'_reproj the loss between them; τ1, τ2 are weighting coefficients that balance the loss terms.
Further, the training data is constructed as follows:
screening the HUMBI data set to obtain original monocular images including the whole body of a human body, storing SMPL model parameters and camera parameters corresponding to the original monocular images, and obtaining two-dimensional key point coordinates in the original monocular images by using an OpenPose model;
converting the SMPL model of each original monocular image into an SMPLX model by using SMPL2SMPLX to obtain SMPLX model parameters with the same pose as the SMPL model;
projecting the three-dimensional key point coordinates extracted by the SMPLX model to two-dimensional key point coordinates by using a projection matrix, and calculating an energy function between the two-dimensional key point coordinates obtained by projection and the two-dimensional key point coordinates obtained by the OpenPose model;
and repeating the process, updating the SMPLX model parameters based on the newly calculated energy function, and adding the updated SMPLX model parameters serving as new labels into the HUMBI data set to obtain the monocular image with the SMPLX model labels.
An electronic device comprises a processor, a memory and a communication bus, the processor and the memory communicating with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing the steps of the above method when executing the program stored in the memory.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the above method steps.
The invention has the following beneficial effects. Starting from the SMPL model of a monocular image, the embodiment computes the SMPLX model parameters that fit the monocular image most closely, yielding a data set with SMPLX model labels that provides training data for the subsequent reconstruction of the three-dimensional human body model. The body feature, hand feature and face feature of the human monocular image are spliced, and three-dimensional human body reconstruction is performed on the cascade feature; because the reconstruction combines the body pose, face pose and hand poses, it captures rich depth information, and the reconstructed model is more accurate and more widely applicable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is the overall framework of monocular image-based three-dimensional human body model joint reconstruction.
Fig. 2 is a flow chart of a pre-training human feature extraction network.
Fig. 3 is a diagram of the input images and the reconstructed human body model results.
FIG. 4 is a diagram of the effects of a three-dimensional human body model reconstructed by an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The flow of the monocular image-based three-dimensional human body model joint reconstruction method is shown in Fig. 1; the method comprises the following steps:
step S1, cropping the monocular image to be reconstructed to obtain the body monocular image I_B, the hand monocular image I_H and the face monocular image I_F, inputting them into an OpenPose model to predict the two-dimensional key point coordinates, and respectively calculating the candidate frames of the body, hands and face;
the two-dimensional key point coordinates
Figure BDA0003541196590000041
M represents the number of two-dimensional key points, M represents the total number of two-dimensional key points, M is 137, including 25 body key points, 2 × 21 hand key points, and 70 face key points, (x, y)mDenotes the mth keypoint coordinate, vmThe visibility of the mth key point is set to be 0 or 1, 0 represents that the key point is invisible, and 1 represents that the key point is visible;
the candidate frame Box { (l, r) }, where l denotes the body, hand, upper left two-dimensional keypoint coordinates of the face (l)x,ly) And r represents the coordinates of two-dimensional key points at the lower right corner of the body, hand, and face (r)x,ry) Respectively obtaining a body candidate frame BoxbHand candidate BoxhAnd face candidate Box BoxfThe candidate frame is similar to a rectangle only with a skeleton and is used for representing the minimum circumscribed interval of the target area;
step S2, cropping the monocular image using the candidate frames and resizing each cropped image to 224 × 224, obtaining the body feature image I_body, the face feature image I_face and the hand feature image I_hand;
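For illustration only (not part of the claimed method), steps S1 and S2 can be sketched as follows: the candidate frame is the minimum bounding rectangle of the visible two-dimensional key points, and the crop is resized to 224 × 224. The (M, 3) key point layout and the use of OpenCV are assumptions of this sketch.

```python
import numpy as np
import cv2  # OpenCV, assumed here for cropping and resizing

def candidate_box(keypoints):
    """Minimum bounding rectangle (l, r) of the visible 2D key points.

    keypoints: (M, 3) array of (x, y, v) rows, with v in {0, 1} as in the patent.
    Returns ((lx, ly), (rx, ry)), or None if no key point is visible.
    """
    visible = keypoints[keypoints[:, 2] > 0][:, :2]
    if len(visible) == 0:
        return None
    l = visible.min(axis=0)  # upper-left corner
    r = visible.max(axis=0)  # lower-right corner
    return l, r

def crop_and_resize(image, box, size=224):
    """Crop the original image to the candidate frame and resize to size x size (step S2)."""
    (lx, ly), (rx, ry) = box
    h, w = image.shape[:2]
    x0, y0 = max(int(lx), 0), max(int(ly), 0)
    x1, y1 = min(int(rx), w), min(int(ry), h)
    crop = image[y0:y1, x0:x1]
    return cv2.resize(crop, (size, size))
```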
Step S3, training the local feature extraction networks, which comprise a body feature extraction network, a face feature extraction network and a hand feature extraction network; the three networks have the same structure but different internal parameters, each adopting a ResNet50 network, a multi-layer structure comprising down-sampling and residual blocks;
step S4, respectively extracting the body feature, facial feature and hand feature from the body feature image I_body, the facial feature image I_face and the hand feature image I_hand with the local feature extraction networks, and cascading them to obtain the cascade feature F:
F = [E_body(I_body); E_face(I_face); E_hand(I_hand)]

where [·;·] denotes the splicing (concatenation) operation; E_body, E_face and E_hand respectively denote the body, face and hand feature extraction networks; E_body(I_body) denotes the body feature extracted from the body feature image by the body feature extraction network, E_face(I_face) the facial feature extracted from the facial feature image by the facial feature extraction network, and E_hand(I_hand) the hand feature extracted from the hand feature image by the hand feature extraction network;
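As an illustrative sketch of this step (assumptions: PyTorch/torchvision, and a linear projection of ResNet50's 2048-dimensional pooled output down to the 1024-dimensional local feature mentioned in step S32 below):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LocalEncoder(nn.Module):
    """One local feature extraction network: ResNet50 backbone -> 1024-d feature.

    The linear projection from 2048 to 1024 dimensions is an assumption; the
    patent only states that the local feature is a 1024-d vector.
    """
    def __init__(self, out_dim=1024):
        super().__init__()
        backbone = models.resnet50()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.proj = nn.Linear(2048, out_dim)

    def forward(self, x):                # x: (B, 3, 224, 224)
        f = self.features(x).flatten(1)  # (B, 2048) pooled backbone output
        return self.proj(f)              # (B, 1024)

E_body, E_face, E_hand = LocalEncoder(), LocalEncoder(), LocalEncoder()

def cascade_feature(I_body, I_face, I_hand):
    # F = [E_body(I_body); E_face(I_face); E_hand(I_hand)]
    return torch.cat([E_body(I_body), E_face(I_face), E_hand(I_hand)], dim=1)
```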
and step S5, training the SMPLX model and performing three-dimensional human body reconstruction with the SMPLX model based on the cascade feature F.
Since the local feature extraction networks share the same structure, their construction is described here taking the body monocular image as an example; the flow is shown in Fig. 2 and is as follows:
step S31, screening the HUMBI data set to obtain original monocular images containing the whole human body, and storing the real SMPL model parameters and camera parameters K corresponding to each original monocular image;
the SMPL model parameters include the body shape parameter β_smpl and the pose parameter θ_smpl; the camera parameters refer to the camera extrinsic matrix, which represents the position and orientation of the camera in the world coordinate system and mainly comprises a rotation matrix R and a translation vector T, K = [R | T];
cutting the original monocular image to obtain the body monocular image, the hand monocular image and the face monocular image;

predicting the two-dimensional body key point coordinates with the OpenPose model, counting the distribution of the key points to determine the two-dimensional key point coordinates of the upper-left and lower-right corners of the body and thus the body candidate frame, cutting the original monocular image according to the candidate frame, and resizing the image to obtain the body feature image I_body;
Step S32, using a ResNet50 network as the body feature extraction network to capture the body feature F_body = E_body(I_body) in the body feature image I_body, where F_body is a 1024-dimensional feature vector;
step S33, inputting the body feature F_body into the multilayer perceptron and predicting the camera parameters K and, for each body feature, the SMPL model pose parameters θ̂_smpl and body shape parameters β̂_smpl;
The multilayer perceptron (MLP) is a feed-forward artificial neural network that maps a set of input vectors to a set of output vectors. This embodiment uses a three-layer fully connected network: the first two layers have 1024 neurons each, the last layer has as many neurons as the dimension of the output result, and ReLU is used as the activation function;
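A minimal sketch of this perceptron head (layer sizes from the text; the input and output dimensions are placeholders chosen by the caller):

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim):
    """Three-layer fully connected head as described: 1024, 1024, out_dim, with ReLU."""
    return nn.Sequential(
        nn.Linear(in_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, out_dim),
    )
```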
step S34, respectively calculating the projection error, the three-dimensional key point error and the SMPL model parameter error according to the real SMPL model parameters and body key points saved in step S31 for each original monocular image, thereby obtaining the loss function L of the body feature extraction network, as follows:
L = L_p + τ1·L_{joint,3D} + τ2·L_{reproj}

In the body feature extraction network, the terms of the loss function are expressed as follows:

L_p = ‖θ_smpl − θ̂_smpl‖² + ‖β_smpl − β̂_smpl‖²

L_{joint,3D} = Σ_{j=1}^{J} v_j · ‖(x_3d, y_3d, z_3d)_j − (x̂_3d, ŷ_3d, ẑ_3d)_j‖²

L_{reproj} = Σ_{j=1}^{J} v_j · ‖(x_2d, y_2d)_j − (x̂_2d, ŷ_2d)_j‖²

where L_p represents the loss between the predicted model and the real model; L_{joint,3D} represents the loss between the three-dimensional key points extracted from the predicted model and those extracted from the real model; L_{reproj} represents the loss between the two-dimensional key points obtained by projecting the predicted model's three-dimensional key points through the camera and those extracted from the real model; τ1, τ2 are weighting coefficients balancing the loss terms, with values in [0, 1], here τ1 = 0.7 and τ2 = 0.5;

j denotes the body key point index and J the total number of body key points, j = 1, 2, …, J, J = 25; v_j indicates whether the jth body key point is visible; (x̂_3d, ŷ_3d, ẑ_3d) are the three-dimensional key point coordinates extracted from the predicted SMPL model and (x_3d, y_3d, z_3d) those extracted from the real SMPL model; (x̂_2d, ŷ_2d) are the two-dimensional key point coordinates obtained by projecting (x̂_3d, ŷ_3d, ẑ_3d) with the camera parameters K, and (x_2d, y_2d) those obtained by projecting (x_3d, y_3d, z_3d);
and step S35, repeating steps S32 to S34, calculating the error between the predicted result and the real result at each iteration, optimizing the parameters of the body feature extraction network and the multilayer perceptron through back-propagation based on this error, and gradually reducing the loss value; when the loss no longer decreases and stabilizes, the iteration terminates, yielding the final body feature extraction network and multilayer perceptron.
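The training loss of steps S32–S34 could be sketched as follows. The tensor layout, the squared-error form of each term and the project() helper are assumptions consistent with the loss terms above; the patent itself only gives the camera parameters as extrinsics [R | T] and takes the first two dimensions of the projected result as 2D coordinates.

```python
import torch

def project(joints3d, R, T):
    """Assumed projection: apply extrinsics and keep the first two dimensions."""
    cam = joints3d @ R.T + T      # (J, 3) world -> camera coordinates
    return cam[:, :2]             # 2D key point coordinates

def body_loss(pred, gt, R, T, tau1=0.7, tau2=0.5):
    """L = Lp + tau1 * Ljoint,3D + tau2 * Lreproj for the body network.

    pred and gt are dicts with 'theta', 'beta' (SMPL parameters) and
    'joints3d' (J, 3); gt['vis'] holds the visibility flags v_j.
    """
    L_p = ((gt["theta"] - pred["theta"]) ** 2).sum() \
        + ((gt["beta"] - pred["beta"]) ** 2).sum()
    vis = gt["vis"]
    L_joint3d = (vis * ((gt["joints3d"] - pred["joints3d"]) ** 2).sum(dim=1)).sum()
    j2d_gt = project(gt["joints3d"], R, T)
    j2d_pred = project(pred["joints3d"], R, T)
    L_reproj = (vis * ((j2d_gt - j2d_pred) ** 2).sum(dim=1)).sum()
    return L_p + tau1 * L_joint3d + tau2 * L_reproj
```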
The hand feature extraction network and the face feature extraction network are trained through similar steps, except that when training the hand feature extraction network, the MANO model parameters corresponding to each hand monocular image are stored in step S31 and the loss between the real and predicted MANO model parameters is computed in the loss function; when training the face feature extraction network, the Surry model parameters corresponding to each face monocular image are stored in step S31 and the loss between the real and predicted Surry model parameters is computed.

Since the camera parameters cannot be recovered from the hand features or the face features alone, the hand and face feature extraction networks do not derive camera parameters and only extract the hand and face features; the third term of the loss function is therefore zero in both networks.
In the hand feature extraction network, the first two terms of the loss function are expressed as follows:

L_p = ‖θ_MANO − θ̂_MANO‖² + ‖β_MANO − β̂_MANO‖²

L_{joint,3D} = Σ_{h=1}^{H} v_h · ‖(x′_3d, y′_3d, z′_3d)_h − (x̂′_3d, ŷ′_3d, ẑ′_3d)_h‖²

where θ̂_MANO, β̂_MANO respectively denote the pose and body shape parameters of the MANO model predicted by the multilayer perceptron, and θ_MANO, β_MANO those of the real MANO model; h denotes the hand key point index and H the total number of hand key points, h = 1, 2, …, H, H = 2 × 21; v_h indicates whether the hth hand key point is visible; (x′_3d, y′_3d, z′_3d) are the three-dimensional key point coordinates extracted from the real MANO model and (x̂′_3d, ŷ′_3d, ẑ′_3d) those extracted from the predicted MANO model; τ1 = 0.8.
In the facial feature extraction network, the first two terms of the loss function are expressed as follows:

L_p = ‖ρ_Surry − ρ̂_Surry‖² + ‖β_Surry − β̂_Surry‖²

L_{joint,3D} = Σ_{e=1}^{E} v_e · ‖(x″_3d, y″_3d, z″_3d)_e − (x̂″_3d, ŷ″_3d, ẑ″_3d)_e‖²

where ρ̂_Surry, β̂_Surry respectively denote the expression and face shape parameters of the Surry model predicted by the multilayer perceptron, and ρ_Surry, β_Surry those of the real Surry model; e denotes the face key point index and E the total number of face key points, e = 1, 2, …, E, E = 70; v_e indicates whether the eth face key point is visible; (x″_3d, y″_3d, z″_3d) are the three-dimensional key point coordinates extracted from the real Surry model and (x̂″_3d, ŷ″_3d, ẑ″_3d) those extracted from the predicted Surry model; τ1 = 0.5.
The SMPLX model in step S5 is constructed as follows:

step S51, taking the data set with SMPLX model labels as training data, using the local feature extraction networks constructed in step S3 to respectively obtain the body feature, the face feature and the hand feature in the training data, and splicing them to obtain the cascade feature F = [E_body(I_body); E_face(I_face); E_hand(I_hand)];

step S52, inputting the cascade feature F into the multilayer perceptron and predicting the camera parameters K and the SMPLX model pose parameters θ̂_smplx, body shape parameters β̂_smplx and expression parameters ψ̂_smplx;
Step S53, respectively calculating the projection error L'_reproj, the three-dimensional key point error L'_{joint,3D} and the SMPLX model parameter error L'_p according to the real SMPLX model parameters and the human body key point coordinates corresponding to each monocular image in the training data, thereby obtaining the loss function L':
L' = L'_p + τ1·L'_{joint,3D} + τ2·L'_{reproj}

L'_p = ‖θ_smplx − θ̂_smplx‖² + ‖β_smplx − β̂_smplx‖² + ‖ψ_smplx − ψ̂_smplx‖²

L'_{joint,3D} = Σ_{m=1}^{M} v_m · ‖(x‴_3d, y‴_3d, z‴_3d)_m − (x̂‴_3d, ŷ‴_3d, ẑ‴_3d)_m‖²

L'_{reproj} = Σ_{m=1}^{M} v_m · ‖(x′_2d, y′_2d)_m − (x̂′_2d, ŷ′_2d)_m‖²

where θ_smplx, β_smplx, ψ_smplx respectively denote the pose, body shape and expression parameters of the real SMPLX model, and θ̂_smplx, β̂_smplx, ψ̂_smplx those predicted by the multilayer perceptron; m denotes the human body key point index and M = 137 the total number of human body key points; v_m indicates whether the mth key point is visible; (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) are the three-dimensional key point coordinates extracted from the predicted SMPLX model and (x‴_3d, y‴_3d, z‴_3d) those extracted from the real SMPLX model; (x̂′_2d, ŷ′_2d) are the two-dimensional key point coordinates obtained by projecting (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) with the camera parameters K, and (x′_2d, y′_2d) those obtained by projecting (x‴_3d, y‴_3d, z‴_3d);
and step S54, repeating steps S51 to S53, calculating the error between the predicted result and the real result at each iteration and optimizing the parameters of the multilayer perceptron through back-propagation, until the loss no longer decreases and stabilizes; the iteration then terminates, yielding the final SMPLX model parameters predicted by the multilayer perceptron.
The data set with the SMPLX model tag in step S51 is constructed as follows:
step S511: acquiring the HUMBI data set and screening it to obtain original monocular images containing the whole human body, and storing the SMPL model parameters and camera parameters corresponding to each original monocular image;

step S512: inputting the screened monocular images into an OpenPose model and detecting the two-dimensional key point coordinates in each monocular image; in this embodiment 25 body key points, 2 × 21 hand key points and 70 face key points are generated, and the key point coordinates are stored in JSON format;
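For illustration, OpenPose's standard JSON output stores each person's key points as flat [x, y, confidence] triples under fields such as pose_keypoints_2d; a minimal loader for such files (the per-file layout beyond OpenPose's standard fields is an assumption) might look like:

```python
import json
import numpy as np

def load_openpose_keypoints(path):
    """Read one OpenPose JSON file and return (N, 3) arrays of (x, y, c)
    for the body, hands and face of the first detected person."""
    with open(path) as f:
        person = json.load(f)["people"][0]

    def arr(key):
        return np.asarray(person[key], dtype=np.float32).reshape(-1, 3)

    body = arr("pose_keypoints_2d")                        # 25 x 3
    hands = np.concatenate([arr("hand_left_keypoints_2d"),
                            arr("hand_right_keypoints_2d")])  # 42 x 3
    face = arr("face_keypoints_2d")                        # 70 x 3
    return body, hands, face
```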
step S513: converting the SMPL model of each monocular image into an SMPLX model through SMPL2SMPLX, obtaining an SMPLX model and model parameters with the same pose as the SMPL model;
step S514: projecting the three-dimensional key point coordinates extracted by the SMPLX model to two-dimensional key point coordinates with the projection matrix, as follows:
J_2d = Π_K(J_3d)

where J_3d are the three-dimensional key point coordinates and J_2d the two-dimensional key point coordinates taken from the first two dimensions of the calculation result;
calculating the energy function E_J between the projected two-dimensional key points J_2d and the two-dimensional key points predicted by the OpenPose model, updating the model parameters, and performing the next iteration until the iteration count reaches the set value; the final SMPLX model parameters are obtained and added as labels to the corresponding monocular images, giving the labelled data set.
The energy function is

E_J = Σ_i ‖Π_K(J_3d,i) − J_est,i‖²

where i denotes the key point index, Π_K(·) denotes the projection function, J_3d,i denotes the ith key point coordinate extracted by the SMPLX model, Π_K(J_3d,i) denotes the two-dimensional key point obtained by projecting J_3d,i with the camera parameters K, and J_est,i denotes the ith two-dimensional key point predicted by the OpenPose model;
the energy function represents the error between the key point coordinates: the smaller the energy function, the more accurate the SMPLX model parameters obtained through SMPL2SMPLX and the closer the fitted SMPLX model is to the target. The SMPLX model parameters are fitted with the LBFGS optimizer in PyTorch, with the fitting parameters set as follows: a learning rate of 0.1 and 30 iterations; when the iteration count reaches the preset value, the iteration terminates, yielding the final SMPLX model parameters.
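A minimal sketch of this fitting loop with PyTorch's LBFGS optimizer follows; the smplx_joints3d and project_K callables stand in for the SMPLX forward pass and the projection Π_K, and, like the single-tensor parameter layout, are assumptions of this sketch.

```python
import torch

def fit_smplx_params(params_init, smplx_joints3d, project_K, j2d_openpose,
                     lr=0.1, iters=30):
    """Minimize E_J = sum_i || Pi_K(J_3d,i) - J_est,i ||^2 over the SMPLX parameters."""
    params = params_init.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([params], lr=lr, max_iter=iters)

    def closure():
        opt.zero_grad()
        j2d = project_K(smplx_joints3d(params))     # Pi_K(J_3d)
        energy = ((j2d - j2d_openpose) ** 2).sum()  # E_J
        energy.backward()
        return energy

    opt.step(closure)  # LBFGS re-evaluates the closure up to max_iter times
    return params.detach()
```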
The SMPL model fits only the pose of the body, so the reconstructed human body model has limited accuracy; the SMPLX model additionally fits the eye, jaw and hand poses. Using the SMPLX model, the hand poses and facial expression are combined with the body pose, and the reconstructed three-dimensional human body model is more accurate. Moreover, because this embodiment builds on the SMPL model, the body pose and camera parameters of the SMPLX model are already accurate and only the hand poses and facial expression need to be fitted; the whole process is simple and computationally light, and compared with the optimization method SMPLify-X it is faster while producing accurate reconstruction results. Performing three-dimensional human body reconstruction on each input image with the method of this embodiment yields models that fit the expression and poses of the face, hands and body and accurately express body language, emotion and the like, giving the method broad application prospects.
The present invention also encompasses an electronic device comprising a memory for storing computer program instructions and a processor for executing them to perform all or part of the steps recited above. The electronic device may communicate with one or more external devices, with one or more devices that enable user interaction with the electronic device, and/or with any device that enables the electronic device to communicate with one or more other computing devices, and may also communicate with one or more networks (e.g., local area networks, wide area networks and/or public networks) through a network adapter.
The present invention also includes a computer-readable storage medium storing a computer program executable by a processor. The storage medium may include, but is not limited to, magnetic storage devices, optical disks, digital versatile disks, smart cards and flash memory devices; the term "machine-readable media" covers one or more devices for storing information and includes, without limitation, wireless channels and various other media (and/or storage media) capable of storing, containing and/or carrying code, instructions and/or data.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. The three-dimensional human body model joint reconstruction method based on the monocular image is characterized by comprising the following steps of:
s1, processing an original monocular image to be reconstructed to obtain a body monocular image, a hand monocular image and a face monocular image, and predicting two-dimensional key point coordinates in each monocular image respectively to obtain a body candidate frame, a hand candidate frame and a face candidate frame;
s2, cropping the original monocular image using the body candidate frame, the hand candidate frame and the face candidate frame, and resizing the cropped images to obtain a body feature image, a face feature image and a hand feature image;
s3, training three local feature extraction networks, respectively extracting body features, face features and hand features in the body feature image, the face feature image and the hand feature image, and splicing the body features, the face features and the hand features to obtain cascade features;
and S4, training the SMPLX model, and performing three-dimensional human body reconstruction based on the cascade characteristics.
2. The method for three-dimensional human model joint reconstruction based on monocular image according to claim 1, wherein in S1 the candidate frame is Box = {(l, r)}, where l denotes the two-dimensional key point coordinates of the upper-left corner of the body, hand or face region, and r denotes the two-dimensional key point coordinates of the lower-right corner.
3. The monocular image-based three-dimensional human body model joint reconstruction method according to claim 1, wherein the training process of the local feature extraction network is as follows:
screening the HUMBI data set to obtain original monocular images containing the whole human body, cutting the original monocular images to obtain local monocular images, predicting the two-dimensional key point coordinates in each local monocular image by using an OpenPose model, obtaining a local candidate frame according to the distribution of the two-dimensional key points, cutting the original monocular image according to the local candidate frame to obtain a local feature image, and obtaining the real model parameters corresponding to each local feature image;
capturing the local features in the local feature images using a ResNet50 network;
predicting the camera parameters and model parameters from the local features using a multilayer perceptron;
and calculating the loss function of the feature extraction network based on the real model parameters, the camera parameters and the predicted model parameters, and adjusting the parameters of the feature extraction network based on the loss value to obtain the optimized feature extraction network.
4. The monocular image-based three-dimensional human body model joint reconstruction method according to claim 3, wherein the local feature extraction network comprises a body feature extraction network, a hand feature extraction network and a face feature extraction network, and a loss function in the local feature extraction network is as follows:
L = L_p + τ1·L_{joint,3D} + τ2·L_{reproj}

where L_p represents the loss between the predicted model and the real model; L_{joint,3D} represents the loss between the three-dimensional key points extracted from the predicted model and the three-dimensional human body key points extracted from the real model; L_{reproj} represents the loss between the two-dimensional key points obtained by projecting the predicted model's three-dimensional key points through the camera and the two-dimensional key points extracted from the real model; and τ1, τ2 are weighting coefficients that balance the loss terms.
5. The method for the three-dimensional human model joint reconstruction based on the monocular image as set forth in claim 1, wherein the process of training the SMPLX model in S4 is as follows:
taking monocular images with SMPLX model labels as training data, acquiring the local features in the training data using the method of S1 to S3, and splicing all the local features to obtain a cascade feature;

inputting the cascade feature F into the multilayer perceptron and predicting the camera parameters K and the SMPLX model pose parameters θ̂_smplx, body shape parameters β̂_smplx and expression parameters ψ̂_smplx;

respectively calculating the projection error, the three-dimensional key point error and the SMPLX model parameter error according to the real SMPLX model parameters and the human body key point coordinates corresponding to each monocular image in the training data, thereby obtaining the loss function L';

and repeating the above process, calculating the loss function at each iteration, and updating the SMPLX model parameters based on the new loss value to obtain the optimized SMPLX model.
6. The monocular image-based three-dimensional human model joint reconstruction method of claim 5, wherein the loss function L' is as follows:
L' = L'_p + τ1·L'_{joint,3D} + τ2·L'_{reproj}

L'_p = ‖θ_smplx − θ̂_smplx‖² + ‖β_smplx − β̂_smplx‖² + ‖ψ_smplx − ψ̂_smplx‖²

L'_{joint,3D} = Σ_{m=1}^{M} v_m · ‖(x‴_3d, y‴_3d, z‴_3d)_m − (x̂‴_3d, ŷ‴_3d, ẑ‴_3d)_m‖²

L'_{reproj} = Σ_{m=1}^{M} v_m · ‖(x′_2d, y′_2d)_m − (x̂′_2d, ŷ′_2d)_m‖²

where L'_p represents the error between the real SMPLX model parameters and the predicted SMPLX model parameters; θ_smplx, β_smplx, ψ_smplx respectively represent the pose, body shape and expression parameters of the real SMPLX model, and θ̂_smplx, β̂_smplx, ψ̂_smplx the SMPLX pose, body shape and expression parameters predicted by the multilayer perceptron; m denotes the human body key point index and M the total number of human body key points, m = 1, 2, …, M, M = 137; v_m indicates whether the mth human body key point is visible; (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) are the three-dimensional key point coordinates extracted from the predicted SMPLX model and (x‴_3d, y‴_3d, z‴_3d) those extracted from the real SMPLX model, with L'_{joint,3D} the loss between them; (x̂′_2d, ŷ′_2d) are the two-dimensional key point coordinates obtained by projecting (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) with the camera parameters K and (x′_2d, y′_2d) those obtained by projecting (x‴_3d, y‴_3d, z‴_3d), with L'_reproj the loss between them; τ1, τ2 are weighting coefficients that balance the loss terms.
7. The monocular image-based three-dimensional human body model joint reconstruction method according to claim 5, wherein the training data is constructed as follows:
screening the HUMBI data set to obtain original monocular images including the whole body of a human body, storing SMPL model parameters and camera parameters corresponding to the original monocular images, and obtaining two-dimensional key point coordinates in the original monocular images by using an OpenPose model;
converting the SMPL model of each original monocular image into an SMPLX model by using SMPL2SMPLX to obtain SMPLX model parameters with the same pose as the SMPL model;
projecting the three-dimensional key point coordinates extracted by the SMPLX model to two-dimensional key point coordinates by using a projection matrix, and calculating an energy function between the two-dimensional key point coordinates obtained by projection and the two-dimensional key point coordinates obtained by the OpenPose model;
and repeating the process, updating the SMPLX model parameters based on the newly calculated energy function, and adding the updated SMPLX model parameters serving as new labels into the HUMBI data set to obtain the monocular image with the SMPLX model labels.
8. An electronic device is characterized by comprising a processor, a memory and a communication bus, wherein the processor and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202210233442.5A 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium Pending CN114581502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210233442.5A CN114581502A (en) 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210233442.5A CN114581502A (en) 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114581502A 2022-06-03

Family

ID=81774045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210233442.5A Pending CN114581502A (en) 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114581502A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457104A (en) * 2022-10-28 2022-12-09 北京百度网讯科技有限公司 Human body information determination method and device and electronic equipment
CN115496864A (en) * 2022-11-18 2022-12-20 苏州浪潮智能科技有限公司 Model construction method, model reconstruction device, electronic equipment and storage medium
CN115830642A (en) * 2023-02-13 2023-03-21 粤港澳大湾区数字经济研究院(福田) 2D whole body key point labeling method and 3D human body grid labeling method
CN115830642B (en) * 2023-02-13 2024-01-12 粤港澳大湾区数字经济研究院(福田) 2D whole body human body key point labeling method and 3D human body grid labeling method
CN116958450A (en) * 2023-09-14 2023-10-27 南京邮电大学 Human body three-dimensional reconstruction method for two-dimensional data
CN116958450B (en) * 2023-09-14 2023-12-12 南京邮电大学 Human body three-dimensional reconstruction method for two-dimensional data

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
WO2019228358A1 (en) Deep neural network training method and apparatus
CN114581502A (en) Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
CN107492121B (en) Two-dimensional human body bone point positioning method of monocular depth video
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN110222718B (en) Image processing method and device
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN109711356B (en) Expression recognition method and system
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
CN110008839A (en) A kind of intelligent sign language interactive system and method for adaptive gesture identification
CN114529984A (en) Bone action recognition method based on learnable PL-GCN and ECLSTM
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
CN113516227A (en) Neural network training method and device based on federal learning
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN112906520A (en) Gesture coding-based action recognition method and device
Sun et al. Two-stage deep regression enhanced depth estimation from a single RGB image
CN112801069B (en) Face key feature point detection device, method and storage medium
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN114494543A (en) Action generation method and related device, electronic equipment and storage medium
CN111738092B (en) Method for recovering occluded human body posture sequence based on deep learning
CN117576149A (en) Single-target tracking method based on attention mechanism
Ding et al. Enhance Image-to-Image Generation with LLaVA Prompt and Negative Prompt
WO2023142886A1 (en) Expression transfer method, model training method, and device
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination