CN114581502A - Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Info

Publication number
CN114581502A
CN114581502A (application CN202210233442.5A)
Authority
CN
China
Prior art keywords
model
smplx
monocular image
dimensional
image
Prior art date
Legal status
Pending
Application number
CN202210233442.5A
Other languages
Chinese (zh)
Inventor
张亮
朱光明
冯明涛
梅林
周海超
沈沛意
徐旭
宋娟
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210233442.5A
Publication of CN114581502A
Status: Pending

Classifications

    • G06T 7/50 — Image analysis: depth or shape recovery
    • G06T 17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • G06T 2207/30196 — Subject of image: human being; person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular image-based three-dimensional human body model joint reconstruction method, an electronic device and a storage medium. The method comprises the following steps: acquiring a body monocular image, a hand monocular image and a face monocular image from an original monocular image; extracting the two-dimensional key point coordinates in the body, hand and face monocular images to obtain a body candidate frame, a hand candidate frame and a face candidate frame; cutting the original monocular image according to the candidate frames to obtain three feature images; extracting the features of each feature image with a local feature extraction network and splicing them to obtain a cascade feature; and training an SMPLX model to perform three-dimensional human body reconstruction based on the cascade feature. The depth information extracted by the method is richer, the constructed three-dimensional human body model is more accurate, and less time is consumed.

Description

Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a monocular image-based three-dimensional human body model joint reconstruction method, an electronic device and a storage medium.
Background
In recent years, billions of human daily activities have been recorded as video and uploaded to public internet websites, capturing diverse human behaviors in real-world scenes. Technologies that digitize the human actions in these videos have great potential in applications including human-computer interaction, social artificial intelligence and robotics. Three-dimensional human body reconstruction from a monocular image has therefore drawn much attention, but monocular reconstruction suffers from problems such as the loss of depth information.
At present, monocular three-dimensional human body reconstruction methods recover only the body pose and shape and omit the reconstruction of hand poses and facial expressions. Whole-body reconstruction must also cope with the small proportion of the image occupied by the hands and face, which makes their precise poses difficult to capture. These limitations restrict the accuracy of human body models in pose-capture systems and prevent computers from understanding human actions more intelligently in human-computer interaction.
Disclosure of Invention
The embodiment of the invention aims to provide a monocular image-based three-dimensional human body model joint reconstruction method that combines the hand poses, eye poses and jaw pose of the human body with the body pose, covers richer depth information, reconstructs a more accurate three-dimensional human body model, and has a wider application range.
The embodiment of the invention also aims to provide an electronic device and a storage medium.
In order to solve the above technical problems, the invention adopts the following technical scheme: a monocular image-based three-dimensional human body model joint reconstruction method comprising the following steps:
s1, processing an original monocular image to be reconstructed to obtain a body monocular image, a hand monocular image and a face monocular image, and predicting two-dimensional key point coordinates in each monocular image respectively to obtain a body candidate frame, a hand candidate frame and a face candidate frame;
s2, cropping the original monocular image using the body candidate frame, the hand candidate frame and the face candidate frame, and resizing the cropped images to obtain a body feature image, a face feature image and a hand feature image;
s3, training three local feature extraction networks, respectively extracting body features, face features and hand features in the body feature image, the face feature image and the hand feature image, and splicing the body features, the face features and the hand features to obtain cascade features;
and S4, training the SMPLX model, and performing three-dimensional human body reconstruction based on the cascade characteristics.
Further, in S1 the candidate frame is Box = {(l, r)}, where l denotes the two-dimensional key point coordinates of the upper-left corner of the body, hand or face region, and r denotes the two-dimensional key point coordinates of the lower-right corner.
Further, the training process of the local feature extraction network is as follows:
screening the HUMBI data set to obtain original monocular images containing the whole human body, cutting the original monocular images to obtain local monocular images, predicting the two-dimensional key point coordinates in each local monocular image by using an OpenPose model, obtaining a local candidate frame according to the distribution of the two-dimensional key points, cutting the original monocular image according to the local candidate frame to obtain a local feature image, and obtaining the real model parameters corresponding to each local feature image;
capturing the local features in the local feature images using a ResNet50 network;
predicting the camera parameters and model parameters from the local features using a multilayer perceptron;
and calculating the loss function of the feature extraction network based on the real model parameters, the camera parameters and the predicted model parameters, and adjusting the parameters of the feature extraction network based on the loss value to obtain the optimized feature extraction network.
Further, the local feature extraction network includes a body feature extraction network, a hand feature extraction network, and a face feature extraction network, and a loss function in the local feature extraction network is as follows:
L = L_p + τ1·L_{joint,3D} + τ2·L_{reproj}

where L_p represents the loss between the predicted model and the real model; L_{joint,3D} represents the loss between the three-dimensional key points extracted from the predicted model and the three-dimensional human body key points extracted from the real model; L_{reproj} represents the loss between the two-dimensional key points obtained by projecting the predicted model's three-dimensional key points through the camera and the two-dimensional key points extracted from the real model; and τ1, τ2 are weighting coefficients that balance the loss terms.
Further, the process of training the SMPLX model in S4 is as follows:
taking monocular images with SMPLX model labels as training data, acquiring the local features in the training data using the method of S1 to S3, and splicing all the local features to obtain the cascade feature;

inputting the cascade feature F into the multilayer perceptron and predicting the camera parameters K and the SMPLX model pose parameters θ̂_smplx, body shape parameters β̂_smplx and expression parameters ψ̂_smplx;

respectively calculating the projection error, the three-dimensional key point error and the SMPLX model parameter error according to the real SMPLX model parameters and the human body key point coordinates corresponding to each monocular image in the training data, thereby obtaining the loss function L';

and repeating the above process, calculating the loss function at each iteration, and updating the SMPLX model parameters based on the new loss value to obtain the optimized SMPLX model.
Further, the loss function L' is as follows:
L' = L'_p + τ1·L'_{joint,3D} + τ2·L'_{reproj}

L'_p = ‖θ_smplx − θ̂_smplx‖² + ‖β_smplx − β̂_smplx‖² + ‖ψ_smplx − ψ̂_smplx‖²

L'_{joint,3D} = Σ_{m=1}^{M} v_m · ‖(x‴_3d, y‴_3d, z‴_3d)_m − (x̂‴_3d, ŷ‴_3d, ẑ‴_3d)_m‖²

L'_{reproj} = Σ_{m=1}^{M} v_m · ‖(x′_2d, y′_2d)_m − (x̂′_2d, ŷ′_2d)_m‖²

where L'_p represents the error between the real SMPLX model parameters and the predicted SMPLX model parameters; θ_smplx, β_smplx, ψ_smplx respectively represent the pose, body shape and expression parameters of the real SMPLX model, and θ̂_smplx, β̂_smplx, ψ̂_smplx the SMPLX pose, body shape and expression parameters predicted by the multilayer perceptron; m denotes the human body key point index and M the total number of human body key points, m = 1, 2, …, M, M = 137; v_m indicates whether the mth human body key point is visible; (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) are the three-dimensional key point coordinates extracted from the predicted SMPLX model and (x‴_3d, y‴_3d, z‴_3d) those extracted from the real SMPLX model, with L'_{joint,3D} the loss between them; (x̂′_2d, ŷ′_2d) are the two-dimensional key point coordinates obtained by projecting (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) with the camera parameters K and (x′_2d, y′_2d) those obtained by projecting (x‴_3d, y‴_3d, z‴_3d), with L'_reproj the loss between them; τ1, τ2 are weighting coefficients that balance the loss terms.
Further, the training data is constructed as follows:
screening the HUMBI data set to obtain original monocular images including the whole body of a human body, storing SMPL model parameters and camera parameters corresponding to the original monocular images, and obtaining two-dimensional key point coordinates in the original monocular images by using an OpenPose model;
converting the SMPL model of each original monocular image into an SMPLX model by using SMPL2SMPLX to obtain SMPLX model parameters with the same pose as the SMPL model;
projecting the three-dimensional key point coordinates extracted by the SMPLX model to two-dimensional key point coordinates by using a projection matrix, and calculating an energy function between the two-dimensional key point coordinates obtained by projection and the two-dimensional key point coordinates obtained by the OpenPose model;
and repeating the process, updating the SMPLX model parameters based on the newly calculated energy function, and adding the updated SMPLX model parameters serving as new labels into the HUMBI data set to obtain the monocular image with the SMPLX model labels.
An electronic device comprises a processor, a memory and a communication bus, the processor and the memory communicating with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing the steps of the above method when executing the program stored in the memory.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the above method steps.
The invention has the following beneficial effects. Starting from the SMPL model of a monocular image, the embodiment computes the SMPLX model parameters that fit the monocular image most closely, yielding a data set with SMPLX model labels that provides training data for the subsequent reconstruction of the three-dimensional human body model. The body feature, hand feature and face feature of the human monocular image are spliced, and three-dimensional human body reconstruction is performed on the cascade feature; because the reconstruction combines the body pose, face pose and hand poses, it captures rich depth information, and the reconstructed model is more accurate and more widely applicable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is the overall framework of monocular image-based three-dimensional human body model joint reconstruction.
Fig. 2 is a flow chart of a pre-training human feature extraction network.
Fig. 3 is a diagram of the input images and the reconstructed human body model results.
FIG. 4 is a diagram of the effects of a three-dimensional human body model reconstructed by an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The flow of the monocular image-based three-dimensional human body model joint reconstruction method is shown in Fig. 1; the method comprises the following steps:
step S1, cropping the monocular image to be reconstructed to obtain the body monocular image I_B, the hand monocular image I_H and the face monocular image I_F, inputting them into an OpenPose model to predict the two-dimensional key point coordinates, and respectively calculating the candidate frames of the body, hands and face;
the two-dimensional key point coordinates
Figure BDA0003541196590000041
M represents the number of two-dimensional key points, M represents the total number of two-dimensional key points, M is 137, including 25 body key points, 2 × 21 hand key points, and 70 face key points, (x, y)mDenotes the mth keypoint coordinate, vmThe visibility of the mth key point is set to be 0 or 1, 0 represents that the key point is invisible, and 1 represents that the key point is visible;
the candidate frame Box { (l, r) }, where l denotes the body, hand, upper left two-dimensional keypoint coordinates of the face (l)x,ly) And r represents the coordinates of two-dimensional key points at the lower right corner of the body, hand, and face (r)x,ry) Respectively obtaining a body candidate frame BoxbHand candidate BoxhAnd face candidate Box BoxfThe candidate frame is similar to a rectangle only with a skeleton and is used for representing the minimum circumscribed interval of the target area;
step S2, cropping the monocular image using the candidate frames and resizing each cropped image to 224 × 224, obtaining the body feature image I_body, the face feature image I_face and the hand feature image I_hand;
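For illustration only (not part of the claimed method), steps S1 and S2 can be sketched as follows: the candidate frame is the minimum bounding rectangle of the visible two-dimensional key points, and the crop is resized to 224 × 224. The (M, 3) key point layout and the use of OpenCV are assumptions of this sketch.

```python
import numpy as np
import cv2  # OpenCV, assumed here for cropping and resizing

def candidate_box(keypoints):
    """Minimum bounding rectangle (l, r) of the visible 2D key points.

    keypoints: (M, 3) array of (x, y, v) rows, with v in {0, 1} as in the patent.
    Returns ((lx, ly), (rx, ry)), or None if no key point is visible.
    """
    visible = keypoints[keypoints[:, 2] > 0][:, :2]
    if len(visible) == 0:
        return None
    l = visible.min(axis=0)  # upper-left corner
    r = visible.max(axis=0)  # lower-right corner
    return l, r

def crop_and_resize(image, box, size=224):
    """Crop the original image to the candidate frame and resize to size x size (step S2)."""
    (lx, ly), (rx, ry) = box
    h, w = image.shape[:2]
    x0, y0 = max(int(lx), 0), max(int(ly), 0)
    x1, y1 = min(int(rx), w), min(int(ry), h)
    crop = image[y0:y1, x0:x1]
    return cv2.resize(crop, (size, size))
```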
Step S3, training the local feature extraction networks, which comprise a body feature extraction network, a face feature extraction network and a hand feature extraction network; the three networks have the same structure but different internal parameters, each adopting a ResNet50 network, a multi-layer structure comprising down-sampling and residual blocks;
step S4, respectively extracting the body feature, facial feature and hand feature from the body feature image I_body, the facial feature image I_face and the hand feature image I_hand with the local feature extraction networks, and cascading them to obtain the cascade feature F:
F = [E_body(I_body); E_face(I_face); E_hand(I_hand)]

where [·;·] denotes the splicing (concatenation) operation; E_body, E_face and E_hand respectively denote the body, face and hand feature extraction networks; E_body(I_body) denotes the body feature extracted from the body feature image by the body feature extraction network, E_face(I_face) the facial feature extracted from the facial feature image by the facial feature extraction network, and E_hand(I_hand) the hand feature extracted from the hand feature image by the hand feature extraction network;
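As an illustrative sketch of this step (assumptions: PyTorch/torchvision, and a linear projection of ResNet50's 2048-dimensional pooled output down to the 1024-dimensional local feature mentioned in step S32 below):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LocalEncoder(nn.Module):
    """One local feature extraction network: ResNet50 backbone -> 1024-d feature.

    The linear projection from 2048 to 1024 dimensions is an assumption; the
    patent only states that the local feature is a 1024-d vector.
    """
    def __init__(self, out_dim=1024):
        super().__init__()
        backbone = models.resnet50()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.proj = nn.Linear(2048, out_dim)

    def forward(self, x):                # x: (B, 3, 224, 224)
        f = self.features(x).flatten(1)  # (B, 2048) pooled backbone output
        return self.proj(f)              # (B, 1024)

E_body, E_face, E_hand = LocalEncoder(), LocalEncoder(), LocalEncoder()

def cascade_feature(I_body, I_face, I_hand):
    # F = [E_body(I_body); E_face(I_face); E_hand(I_hand)]
    return torch.cat([E_body(I_body), E_face(I_face), E_hand(I_hand)], dim=1)
```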
and step S5, training the SMPLX model and performing three-dimensional human body reconstruction with the SMPLX model based on the cascade feature F.
Since the local feature extraction networks share the same structure, their construction is described here taking the body monocular image as an example; the flow is shown in Fig. 2 and is as follows:
step S31, screening the HUMBI data set to obtain original monocular images containing the whole human body, and storing the real SMPL model parameters and camera parameters K corresponding to each original monocular image;
the SMPL model parameters include the body shape parameter β_smpl and the pose parameter θ_smpl; the camera parameters refer to the camera extrinsic matrix, which represents the position and orientation of the camera in the world coordinate system and mainly comprises a rotation matrix R and a translation vector T, K = [R | T];
cutting the original monocular image to obtain the body monocular image, the hand monocular image and the face monocular image;

predicting the two-dimensional body key point coordinates with the OpenPose model, counting the distribution of the key points to determine the two-dimensional key point coordinates of the upper-left and lower-right corners of the body and thus the body candidate frame, cutting the original monocular image according to the candidate frame, and resizing the image to obtain the body feature image I_body;
Step S32, using a ResNet50 network as the body feature extraction network to capture the body feature F_body = E_body(I_body) in the body feature image I_body, where F_body is a 1024-dimensional feature vector;
step S33, inputting the body feature F_body into the multilayer perceptron and predicting the camera parameters K and, for each body feature, the SMPL model pose parameters θ̂_smpl and body shape parameters β̂_smpl;
The multilayer perceptron (MLP) is a feed-forward artificial neural network that maps a set of input vectors to a set of output vectors. This embodiment uses a three-layer fully connected network: the first two layers have 1024 neurons each, the last layer has as many neurons as the dimension of the output result, and ReLU is used as the activation function;
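A minimal sketch of this perceptron head (layer sizes from the text; the input and output dimensions are placeholders chosen by the caller):

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim):
    """Three-layer fully connected head as described: 1024, 1024, out_dim, with ReLU."""
    return nn.Sequential(
        nn.Linear(in_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, out_dim),
    )
```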
step S34, respectively calculating the projection error, the three-dimensional key point error and the SMPL model parameter error according to the real SMPL model parameters and body key points saved in step S31 for each original monocular image, thereby obtaining the loss function L of the body feature extraction network, as follows:
L = L_p + τ1·L_{joint,3D} + τ2·L_{reproj}

In the body feature extraction network, the terms of the loss function are expressed as follows:

L_p = ‖θ_smpl − θ̂_smpl‖² + ‖β_smpl − β̂_smpl‖²

L_{joint,3D} = Σ_{j=1}^{J} v_j · ‖(x_3d, y_3d, z_3d)_j − (x̂_3d, ŷ_3d, ẑ_3d)_j‖²

L_{reproj} = Σ_{j=1}^{J} v_j · ‖(x_2d, y_2d)_j − (x̂_2d, ŷ_2d)_j‖²

where L_p represents the loss between the predicted model and the real model; L_{joint,3D} represents the loss between the three-dimensional key points extracted from the predicted model and those extracted from the real model; L_{reproj} represents the loss between the two-dimensional key points obtained by projecting the predicted model's three-dimensional key points through the camera and those extracted from the real model; τ1, τ2 are weighting coefficients balancing the loss terms, with values in [0, 1], here τ1 = 0.7 and τ2 = 0.5;

j denotes the body key point index and J the total number of body key points, j = 1, 2, …, J, J = 25; v_j indicates whether the jth body key point is visible; (x̂_3d, ŷ_3d, ẑ_3d) are the three-dimensional key point coordinates extracted from the predicted SMPL model and (x_3d, y_3d, z_3d) those extracted from the real SMPL model; (x̂_2d, ŷ_2d) are the two-dimensional key point coordinates obtained by projecting (x̂_3d, ŷ_3d, ẑ_3d) with the camera parameters K, and (x_2d, y_2d) those obtained by projecting (x_3d, y_3d, z_3d);
and step S35, repeating steps S32 to S34, calculating the error between the predicted result and the real result at each iteration, optimizing the parameters of the body feature extraction network and the multilayer perceptron through back-propagation based on this error, and gradually reducing the loss value; when the loss no longer decreases and stabilizes, the iteration terminates, yielding the final body feature extraction network and multilayer perceptron.
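The training loss of steps S32–S34 could be sketched as follows. The tensor layout, the squared-error form of each term and the project() helper are assumptions consistent with the loss terms above; the patent itself only gives the camera parameters as extrinsics [R | T] and takes the first two dimensions of the projected result as 2D coordinates.

```python
import torch

def project(joints3d, R, T):
    """Assumed projection: apply extrinsics and keep the first two dimensions."""
    cam = joints3d @ R.T + T      # (J, 3) world -> camera coordinates
    return cam[:, :2]             # 2D key point coordinates

def body_loss(pred, gt, R, T, tau1=0.7, tau2=0.5):
    """L = Lp + tau1 * Ljoint,3D + tau2 * Lreproj for the body network.

    pred and gt are dicts with 'theta', 'beta' (SMPL parameters) and
    'joints3d' (J, 3); gt['vis'] holds the visibility flags v_j.
    """
    L_p = ((gt["theta"] - pred["theta"]) ** 2).sum() \
        + ((gt["beta"] - pred["beta"]) ** 2).sum()
    vis = gt["vis"]
    L_joint3d = (vis * ((gt["joints3d"] - pred["joints3d"]) ** 2).sum(dim=1)).sum()
    j2d_gt = project(gt["joints3d"], R, T)
    j2d_pred = project(pred["joints3d"], R, T)
    L_reproj = (vis * ((j2d_gt - j2d_pred) ** 2).sum(dim=1)).sum()
    return L_p + tau1 * L_joint3d + tau2 * L_reproj
```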
The hand feature extraction network and the face feature extraction network are trained through similar steps, except that when training the hand feature extraction network, the MANO model parameters corresponding to each hand monocular image are stored in step S31 and the loss between the real and predicted MANO model parameters is computed in the loss function; when training the face feature extraction network, the Surry model parameters corresponding to each face monocular image are stored in step S31 and the loss between the real and predicted Surry model parameters is computed.

Since the camera parameters cannot be recovered from the hand features or the face features alone, the hand and face feature extraction networks do not derive camera parameters and only extract the hand and face features; the third term of the loss function is therefore zero in both networks.
In the hand feature extraction network, the first two terms of the loss function are expressed as follows:

L_p = ‖θ_MANO − θ̂_MANO‖² + ‖β_MANO − β̂_MANO‖²

L_{joint,3D} = Σ_{h=1}^{H} v_h · ‖(x′_3d, y′_3d, z′_3d)_h − (x̂′_3d, ŷ′_3d, ẑ′_3d)_h‖²

where θ̂_MANO, β̂_MANO respectively denote the pose and body shape parameters of the MANO model predicted by the multilayer perceptron, and θ_MANO, β_MANO those of the real MANO model; h denotes the hand key point index and H the total number of hand key points, h = 1, 2, …, H, H = 2 × 21; v_h indicates whether the hth hand key point is visible; (x′_3d, y′_3d, z′_3d) are the three-dimensional key point coordinates extracted from the real MANO model and (x̂′_3d, ŷ′_3d, ẑ′_3d) those extracted from the predicted MANO model; τ1 = 0.8.
In the facial feature extraction network, the first two terms of the loss function are expressed as follows:

L_p = ‖ρ_Surry − ρ̂_Surry‖² + ‖β_Surry − β̂_Surry‖²

L_{joint,3D} = Σ_{e=1}^{E} v_e · ‖(x″_3d, y″_3d, z″_3d)_e − (x̂″_3d, ŷ″_3d, ẑ″_3d)_e‖²

where ρ̂_Surry, β̂_Surry respectively denote the expression and face shape parameters of the Surry model predicted by the multilayer perceptron, and ρ_Surry, β_Surry those of the real Surry model; e denotes the face key point index and E the total number of face key points, e = 1, 2, …, E, E = 70; v_e indicates whether the eth face key point is visible; (x″_3d, y″_3d, z″_3d) are the three-dimensional key point coordinates extracted from the real Surry model and (x̂″_3d, ŷ″_3d, ẑ″_3d) those extracted from the predicted Surry model; τ1 = 0.5.
The SMPLX model in step S5 is constructed as follows:

step S51, taking the data set with SMPLX model labels as training data, using the local feature extraction networks constructed in step S3 to respectively obtain the body feature, the face feature and the hand feature in the training data, and splicing them to obtain the cascade feature F = [E_body(I_body); E_face(I_face); E_hand(I_hand)];

step S52, inputting the cascade feature F into the multilayer perceptron and predicting the camera parameters K and the SMPLX model pose parameters θ̂_smplx, body shape parameters β̂_smplx and expression parameters ψ̂_smplx;
Step S53, respectively calculating the projection error L'_reproj, the three-dimensional key point error L'_{joint,3D} and the SMPLX model parameter error L'_p according to the real SMPLX model parameters and the human body key point coordinates corresponding to each monocular image in the training data, thereby obtaining the loss function L':
L' = L'_p + τ1·L'_{joint,3D} + τ2·L'_{reproj}

L'_p = ‖θ_smplx − θ̂_smplx‖² + ‖β_smplx − β̂_smplx‖² + ‖ψ_smplx − ψ̂_smplx‖²

L'_{joint,3D} = Σ_{m=1}^{M} v_m · ‖(x‴_3d, y‴_3d, z‴_3d)_m − (x̂‴_3d, ŷ‴_3d, ẑ‴_3d)_m‖²

L'_{reproj} = Σ_{m=1}^{M} v_m · ‖(x′_2d, y′_2d)_m − (x̂′_2d, ŷ′_2d)_m‖²

where θ_smplx, β_smplx, ψ_smplx respectively denote the pose, body shape and expression parameters of the real SMPLX model, and θ̂_smplx, β̂_smplx, ψ̂_smplx those predicted by the multilayer perceptron; m denotes the human body key point index and M = 137 the total number of human body key points; v_m indicates whether the mth key point is visible; (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) are the three-dimensional key point coordinates extracted from the predicted SMPLX model and (x‴_3d, y‴_3d, z‴_3d) those extracted from the real SMPLX model; (x̂′_2d, ŷ′_2d) are the two-dimensional key point coordinates obtained by projecting (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) with the camera parameters K, and (x′_2d, y′_2d) those obtained by projecting (x‴_3d, y‴_3d, z‴_3d);
and step S54, repeating steps S51 to S53, calculating the error between the predicted result and the real result at each iteration and optimizing the parameters of the multilayer perceptron through back-propagation, until the loss no longer decreases and stabilizes; the iteration then terminates, yielding the final SMPLX model parameters predicted by the multilayer perceptron.
The data set with the SMPLX model tag in step S51 is constructed as follows:
step S511: acquiring the HUMBI data set and screening it to obtain original monocular images containing the whole human body, and storing the SMPL model parameters and camera parameters corresponding to each original monocular image;

step S512: inputting the screened monocular images into an OpenPose model and detecting the two-dimensional key point coordinates in each monocular image; in this embodiment 25 body key points, 2 × 21 hand key points and 70 face key points are generated, and the key point coordinates are stored in JSON format;
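For illustration, OpenPose's standard JSON output stores each person's key points as flat [x, y, confidence] triples under fields such as pose_keypoints_2d; a minimal loader for such files (the per-file layout beyond OpenPose's standard fields is an assumption) might look like:

```python
import json
import numpy as np

def load_openpose_keypoints(path):
    """Read one OpenPose JSON file and return (N, 3) arrays of (x, y, c)
    for the body, hands and face of the first detected person."""
    with open(path) as f:
        person = json.load(f)["people"][0]

    def arr(key):
        return np.asarray(person[key], dtype=np.float32).reshape(-1, 3)

    body = arr("pose_keypoints_2d")                        # 25 x 3
    hands = np.concatenate([arr("hand_left_keypoints_2d"),
                            arr("hand_right_keypoints_2d")])  # 42 x 3
    face = arr("face_keypoints_2d")                        # 70 x 3
    return body, hands, face
```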
step S513: converting the SMPL model of each monocular image into an SMPLX model through SMPL2SMPLX, obtaining an SMPLX model and model parameters with the same pose as the SMPL model;
step S514: projecting the three-dimensional key point coordinates extracted by the SMPLX model to two-dimensional key point coordinates with the projection matrix, as follows:
J_2d = Π_K(J_3d)

where J_3d are the three-dimensional key point coordinates and J_2d the two-dimensional key point coordinates taken from the first two dimensions of the calculation result;
calculating the energy function E_J between the projected two-dimensional key points J_2d and the two-dimensional key points predicted by the OpenPose model, updating the model parameters, and performing the next iteration until the iteration count reaches the set value; the final SMPLX model parameters are obtained and added as labels to the corresponding monocular images, giving the labelled data set.
The energy function is

E_J = Σ_i ‖Π_K(J_3d,i) − J_est,i‖²

where i denotes the key point index, Π_K(·) denotes the projection function, J_3d,i denotes the ith key point coordinate extracted by the SMPLX model, Π_K(J_3d,i) denotes the two-dimensional key point obtained by projecting J_3d,i with the camera parameters K, and J_est,i denotes the ith two-dimensional key point predicted by the OpenPose model;
the energy function represents the error between the key point coordinates: the smaller the energy function, the more accurate the SMPLX model parameters obtained through SMPL2SMPLX and the closer the fitted SMPLX model is to the target. The SMPLX model parameters are fitted with the LBFGS optimizer in PyTorch, with the fitting parameters set as follows: a learning rate of 0.1 and 30 iterations; when the iteration count reaches the preset value, the iteration terminates, yielding the final SMPLX model parameters.
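A minimal sketch of this fitting loop with PyTorch's LBFGS optimizer follows; the smplx_joints3d and project_K callables stand in for the SMPLX forward pass and the projection Π_K, and, like the single-tensor parameter layout, are assumptions of this sketch.

```python
import torch

def fit_smplx_params(params_init, smplx_joints3d, project_K, j2d_openpose,
                     lr=0.1, iters=30):
    """Minimize E_J = sum_i || Pi_K(J_3d,i) - J_est,i ||^2 over the SMPLX parameters."""
    params = params_init.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([params], lr=lr, max_iter=iters)

    def closure():
        opt.zero_grad()
        j2d = project_K(smplx_joints3d(params))     # Pi_K(J_3d)
        energy = ((j2d - j2d_openpose) ** 2).sum()  # E_J
        energy.backward()
        return energy

    opt.step(closure)  # LBFGS re-evaluates the closure up to max_iter times
    return params.detach()
```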
The SMPL model fits only the pose of the body, so the reconstructed human body model has limited accuracy; the SMPLX model additionally fits the eye, jaw and hand poses. Using the SMPLX model, the hand poses and facial expression are combined with the body pose, and the reconstructed three-dimensional human body model is more accurate. Moreover, because this embodiment builds on the SMPL model, the body pose and camera parameters of the SMPLX model are already accurate and only the hand poses and facial expression need to be fitted; the whole process is simple and computationally light, and compared with the optimization method SMPLify-X it is faster while producing accurate reconstruction results. Performing three-dimensional human body reconstruction on each input image with the method of this embodiment yields models that fit the expression and poses of the face, hands and body and accurately express body language, emotion and the like, giving the method broad application prospects.
The present invention also encompasses an electronic device comprising a memory for storing computer program instructions and a processor for executing them to perform all or part of the steps recited above. The electronic device may communicate with one or more external devices, with one or more devices that enable user interaction with the electronic device, and/or with any device that enables the electronic device to communicate with one or more other computing devices, and may also communicate with one or more networks (e.g., local area networks, wide area networks and/or public networks) through a network adapter.
The present invention also includes a computer-readable storage medium storing a computer program executable by a processor. The storage medium may include, but is not limited to, magnetic storage devices, optical disks, digital versatile disks, smart cards and flash memory devices; the term "machine-readable media" covers one or more devices for storing information and includes, without limitation, wireless channels and various other media (and/or storage media) capable of storing, containing and/or carrying code, instructions and/or data.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. The three-dimensional human body model joint reconstruction method based on the monocular image is characterized by comprising the following steps of:
s1, processing an original monocular image to be reconstructed to obtain a body monocular image, a hand monocular image and a face monocular image, and predicting two-dimensional key point coordinates in each monocular image respectively to obtain a body candidate frame, a hand candidate frame and a face candidate frame;
s2, cropping the original monocular image using the body candidate frame, the hand candidate frame and the face candidate frame, and resizing the cropped images to obtain a body feature image, a face feature image and a hand feature image;
s3, training three local feature extraction networks, respectively extracting body features, face features and hand features in the body feature image, the face feature image and the hand feature image, and splicing the body features, the face features and the hand features to obtain cascade features;
and S4, training the SMPLX model, and performing three-dimensional human body reconstruction based on the cascade characteristics.
2. The method for three-dimensional human model joint reconstruction based on monocular image according to claim 1, wherein in S1 the candidate frame is Box = {(l, r)}, where l denotes the two-dimensional key point coordinates of the upper-left corner of the body, hand or face region, and r denotes the two-dimensional key point coordinates of the lower-right corner.
3. The monocular image-based three-dimensional human body model joint reconstruction method according to claim 1, wherein the training process of the local feature extraction network is as follows:
screening the HUMBI data set to obtain original monocular images containing the whole human body, cutting the original monocular images to obtain local monocular images, predicting the two-dimensional key point coordinates in each local monocular image by using an OpenPose model, obtaining a local candidate frame according to the distribution of the two-dimensional key points, cutting the original monocular image according to the local candidate frame to obtain a local feature image, and obtaining the real model parameters corresponding to each local feature image;
capturing the local features in the local feature images using a ResNet50 network;
predicting the camera parameters and model parameters from the local features using a multilayer perceptron;
and calculating the loss function of the feature extraction network based on the real model parameters, the camera parameters and the predicted model parameters, and adjusting the parameters of the feature extraction network based on the loss value to obtain the optimized feature extraction network.
4. The monocular image-based three-dimensional human body model joint reconstruction method according to claim 3, wherein the local feature extraction network comprises a body feature extraction network, a hand feature extraction network and a face feature extraction network, and a loss function in the local feature extraction network is as follows:
L = L_p + τ1·L_{joint,3D} + τ2·L_{reproj}

where L_p represents the loss between the predicted model and the real model; L_{joint,3D} represents the loss between the three-dimensional key points extracted from the predicted model and the three-dimensional human body key points extracted from the real model; L_{reproj} represents the loss between the two-dimensional key points obtained by projecting the predicted model's three-dimensional key points through the camera and the two-dimensional key points extracted from the real model; and τ1, τ2 are weighting coefficients that balance the loss terms.
5. The method for the three-dimensional human model joint reconstruction based on the monocular image as set forth in claim 1, wherein the process of training the SMPLX model in S4 is as follows:
taking monocular images with SMPLX model labels as training data, acquiring the local features in the training data using the method of S1 to S3, and splicing all the local features to obtain a cascade feature;

inputting the cascade feature F into the multilayer perceptron and predicting the camera parameters K and the SMPLX model pose parameters θ̂_smplx, body shape parameters β̂_smplx and expression parameters ψ̂_smplx;

respectively calculating the projection error, the three-dimensional key point error and the SMPLX model parameter error according to the real SMPLX model parameters and the human body key point coordinates corresponding to each monocular image in the training data, thereby obtaining the loss function L';

and repeating the above process, calculating the loss function at each iteration, and updating the SMPLX model parameters based on the new loss value to obtain the optimized SMPLX model.
6. The monocular image-based three-dimensional human model joint reconstruction method of claim 5, wherein the loss function L' is as follows:
L' = L'_p + τ1·L'_{joint,3D} + τ2·L'_{reproj}

L'_p = ‖θ_smplx − θ̂_smplx‖² + ‖β_smplx − β̂_smplx‖² + ‖ψ_smplx − ψ̂_smplx‖²

L'_{joint,3D} = Σ_{m=1}^{M} v_m · ‖(x‴_3d, y‴_3d, z‴_3d)_m − (x̂‴_3d, ŷ‴_3d, ẑ‴_3d)_m‖²

L'_{reproj} = Σ_{m=1}^{M} v_m · ‖(x′_2d, y′_2d)_m − (x̂′_2d, ŷ′_2d)_m‖²

where L'_p represents the error between the real SMPLX model parameters and the predicted SMPLX model parameters; θ_smplx, β_smplx, ψ_smplx respectively represent the pose, body shape and expression parameters of the real SMPLX model, and θ̂_smplx, β̂_smplx, ψ̂_smplx the SMPLX pose, body shape and expression parameters predicted by the multilayer perceptron; m denotes the human body key point index and M the total number of human body key points, m = 1, 2, …, M, M = 137; v_m indicates whether the mth human body key point is visible; (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) are the three-dimensional key point coordinates extracted from the predicted SMPLX model and (x‴_3d, y‴_3d, z‴_3d) those extracted from the real SMPLX model, with L'_{joint,3D} the loss between them; (x̂′_2d, ŷ′_2d) are the two-dimensional key point coordinates obtained by projecting (x̂‴_3d, ŷ‴_3d, ẑ‴_3d) with the camera parameters K and (x′_2d, y′_2d) those obtained by projecting (x‴_3d, y‴_3d, z‴_3d), with L'_reproj the loss between them; τ1, τ2 are weighting coefficients that balance the loss terms.
7. The monocular image-based three-dimensional human body model joint reconstruction method according to claim 5, wherein the training data is constructed as follows:
screening the HUMBI data set to obtain original monocular images including the whole body of a human body, storing SMPL model parameters and camera parameters corresponding to the original monocular images, and obtaining two-dimensional key point coordinates in the original monocular images by using an OpenPose model;
converting the SMPL model of each original monocular image into an SMPLX model by using SMPL2SMPLX to obtain SMPLX model parameters with the same pose as the SMPL model;
projecting the three-dimensional key point coordinates extracted by the SMPLX model to two-dimensional key point coordinates by using a projection matrix, and calculating an energy function between the two-dimensional key point coordinates obtained by projection and the two-dimensional key point coordinates obtained by the OpenPose model;
and repeating the process, updating the SMPLX model parameters based on the newly calculated energy function, and adding the updated SMPLX model parameters serving as new labels into the HUMBI data set to obtain the monocular image with the SMPLX model labels.
8. An electronic device is characterized by comprising a processor, a memory and a communication bus, wherein the processor and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202210233442.5A 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium Pending CN114581502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210233442.5A CN114581502A (en) 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210233442.5A CN114581502A (en) 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114581502A 2022-06-03

Family

ID=81774045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210233442.5A Pending CN114581502A (en) 2022-03-10 2022-03-10 Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114581502A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457104A (en) * 2022-10-28 2022-12-09 北京百度网讯科技有限公司 Human body information determination method and device and electronic equipment
CN115496864A (en) * 2022-11-18 2022-12-20 苏州浪潮智能科技有限公司 Model construction method, model reconstruction device, electronic equipment and storage medium
CN115830642A (en) * 2023-02-13 2023-03-21 粤港澳大湾区数字经济研究院(福田) 2D whole body key point labeling method and 3D human body grid labeling method
CN115830642B (en) * 2023-02-13 2024-01-12 粤港澳大湾区数字经济研究院(福田) 2D whole body human body key point labeling method and 3D human body grid labeling method
CN116958450A (en) * 2023-09-14 2023-10-27 南京邮电大学 Human body three-dimensional reconstruction method for two-dimensional data
CN116958450B (en) * 2023-09-14 2023-12-12 南京邮电大学 Human body three-dimensional reconstruction method for two-dimensional data

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
WO2019228358A1 (en) Deep neural network training method and apparatus
CN114581502A (en) Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
CN107492121B (en) Two-dimensional human body bone point positioning method of monocular depth video
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN110222718B (en) Image processing method and device
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN109711356B (en) Expression recognition method and system
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
CN110008839A (en) A kind of intelligent sign language interactive system and method for adaptive gesture identification
CN114529984A (en) Bone action recognition method based on learnable PL-GCN and ECLSTM
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
CN113516227A (en) Neural network training method and device based on federal learning
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN112906520A (en) Gesture coding-based action recognition method and device
Sun et al. Two-stage deep regression enhanced depth estimation from a single RGB image
CN112801069B (en) Face key feature point detection device, method and storage medium
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN114494543A (en) Action generation method and related device, electronic equipment and storage medium
CN111738092B (en) Method for recovering occluded human body posture sequence based on deep learning
CN117576149A (en) Single-target tracking method based on attention mechanism
Ding et al. Enhance Image-to-Image Generation with LLaVA Prompt and Negative Prompt
WO2023142886A1 (en) Expression transfer method, model training method, and device
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination