CN113379904B - Hidden space motion coding-based multi-person human body model reconstruction method - Google Patents


Info

Publication number
CN113379904B
CN113379904B
Authority
CN
China
Prior art keywords
dimensional
human body
motion
human
frame
Prior art date
Legal status
Active
Application number
CN202110758141.XA
Other languages
Chinese (zh)
Other versions
CN113379904A
Inventor
王雁刚 (Yangang Wang)
黄步真 (Buzhen Huang)
束愿 (Yuan Shu)
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110758141.XA
Publication of CN113379904A
Application granted
Publication of CN113379904B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence


Abstract

The invention discloses a multi-person human body model reconstruction method based on hidden space (latent space) motion coding, comprising the following steps: S1, training a hidden space motion prior; S2, building a multi-view camera system and collecting multi-person motion videos; S3, preprocessing the multi-person motion video collected by the multi-view camera system built in step S2 with existing open-source two-dimensional pose estimation and tracking to obtain two-dimensional joint point trajectories; S4, acquiring initial camera extrinsic parameters; S5, completing the alignment of each human body to be reconstructed in the multi-person motion video; S6, iteratively optimizing the hidden space codes and the initial camera extrinsics obtained in step S4 so that the projected joint points of the three-dimensional human body models agree with the two-dimensional joint point trajectories obtained in step S3, and finally skinning the human body model in each frame to obtain the reconstruction result for the whole sequence. The invention reduces the influence of mutual occlusion among multiple persons and of tracking errors on the reconstruction.

Description

Hidden space motion coding-based multi-person human body model reconstruction method
Technical Field
The invention relates to a hidden space motion coding-based multi-person human body model reconstruction method, and belongs to the field of computer vision and three-dimensional vision.
Background
The reconstruction of multi-person human body models plays an important role in applications such as holographic communication, traffic behavior monitoring and group behavior analysis. Driven by technologies related to artificial intelligence, the market demand for human body reconstruction grows daily. Current single-view human body model reconstruction schemes face depth ambiguity and a lack of robustness to occlusion: on the one hand, the unfixed depth prevents an accurate human model from being reconstructed; on the other hand, when multiple persons occlude one another, the loss of image information prevents a good reconstruction result. Although multi-view human body model reconstruction schemes can compensate for the shortcomings of single-view schemes to a certain extent, existing methods must first calibrate the cameras and then reconstruct the human models, a complex process. In addition, mutual occlusion of the human models and the views lost to it reduce the effective information, making it difficult for multi-view methods to reconstruct accurate interacting multi-person human models. For multi-person images with occlusion, temporal constraints can well compensate for the human model information missing due to occlusion. However, reconstruction methods that add temporal constraints solve in a space of large dimension, are computationally complex, and easily fall into local minima. Therefore, adding temporal information to achieve accurate multi-person human body model reconstruction while simplifying the reconstruction pipeline is a new and challenging problem.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a multi-person human body model reconstruction method based on hidden space motion coding, in which several synchronized cameras collect multi-person videos; through pose detection and tracking, the camera parameters and the human body models are optimized by combining hidden space motion coding with a physical motion prior, and the three-dimensional human body models are reconstructed directly from an uncalibrated multi-view video sequence.
The technical scheme is as follows: the invention relates to a multi-person human body model reconstruction method based on hidden space motion coding, which comprises the following steps:
s1, hidden space motion prior training: extracting human motion sequences from an existing three-dimensional human body model data set containing motion sequences and using them to train a variational self-coding network; after training, the decoder parameters of the variational self-coding network are fixed for the subsequent iterative optimization;
s2, building a multi-view camera system and collecting multi-human motion videos;
s3, preprocessing the multi-person motion video collected by the multi-view camera system built in step S2 with existing open-source two-dimensional pose estimation and tracking to obtain two-dimensional joint point trajectories;
s4, acquiring the initial camera extrinsic parameters from the three-dimensional pose of the initial frame of the multi-view video;
s5, calculating the three-dimensional position of each joint point of each human body to be reconstructed in each frame by geometric triangulation from the two-dimensional joint point trajectories obtained in step S3 and the initial camera extrinsics obtained in step S4, aligning the joint points of the initial human model with these three-dimensional joint positions, and, taking the aligned initial three-dimensional human body model as the initial state of the fitting, completing the alignment of each human body to be reconstructed in the multi-person motion video in turn;
s6, with the three-dimensional human body models initialized in step S5 as the initial state, decoding the hidden space codes with the decoder of the self-coding network trained in step S1 to drive the deformation of the three-dimensional human body models, and iteratively optimizing the hidden space codes and the initial camera extrinsics obtained in step S4 so that the projected joint points of the three-dimensional human body models agree with the two-dimensional joint point trajectories obtained in step S3; finally, the human body model in each frame is skinned to obtain the human reconstruction result of the whole sequence.
Further, the specific method of step S1 includes:
s11, extracting a human body motion sequence from the existing three-dimensional human body model data set containing the motion sequence, wherein human body motion information in each frame is represented by a three-dimensional human body model of a skeleton skin, the skeleton posture of the human body motion sequence is represented by joint point rotation, human body grids are driven to deform by the joint point rotation, and the T frame is represented by T multiplied by N multiplied by 6 parameters;
s12, firstly, constructing a variational self-coding network to learn the human motion information in the step S11 and constructing a hidden space representation of the human motion information, wherein an encoder of the variational self-coding network consists of an encoding gating cycle unit, a mean value encoding network and a variance encoding network, a decoder of the variational self-coding network consists of a decoding full-connection network and a decoding gating cycle unit, a T-frame human motion sequence is used as input, a mean value and a variance corresponding to the sequence are obtained through the encoder of the variational self-coding network, sampling is carried out from the distribution formed by the mean value variance, and the sampling code of the sampling code is recovered into the input T-frame human motion sequence after passing through the decoder of the variational self-coding network;
s13, monitoring the mean value and the variance corresponding to the sequence obtained by an encoder of the variational self-coding network by using standard normal distribution, and constraining the implicit space distribution by using Kullback-Leibler divergence between the two distributions, wherein the formula is as follows:
Figure BDA0003148001580000021
wherein N (mu, sigma)2) Representing a standard normal distribution, P (x) representing the distribution obeyed by a mean encoding network and a variance encoding network, and KL (P | | | N) representing the Kullback-Leibler divergence between the distribution P and the distribution N;
in addition, an L2 loss between the input parameters and the output parameters is used as a further constraint:

L_param = ‖X_in − X_out‖_2

wherein X_in, X_out are the input and output three-dimensional joint rotation parameters of the human body, ‖·‖_2 denotes the two-norm, and L_param denotes the L2 loss between the input and output parameters;
the rotation parameters further drive the mesh deformation, with the L2 loss between the input and output meshes as a constraint:

L_mesh = ‖M(X_in) − M(X_out)‖_2

wherein M is the differentiable skinning operation, which takes rotation parameters as input and outputs the deformed three-dimensional human body mesh, and L_mesh denotes the L2 loss between the input and output meshes;
finally, the training loss of the variational self-coding network is:

L = λ_kl·KL(P‖N) + λ_param·L_param + λ_mesh·L_mesh

wherein λ_kl, λ_param, λ_mesh are weights;
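Assuming the encoder outputs a diagonal Gaussian, the KL term above has the usual closed form, and the full training loss can be sketched as follows (a NumPy sketch; weight values and names such as `lam_kl` are illustrative, not from the patent):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def vae_loss(mu, log_var, x_in, x_out, mesh_in, mesh_out,
             lam_kl=1.0, lam_param=1.0, lam_mesh=1.0):
    """Weighted sum of the three terms: KL, rotation-parameter L2,
    and skinned-mesh L2 (M(X) is assumed to be precomputed here)."""
    l_kl = kl_to_standard_normal(mu, log_var)
    l_param = np.linalg.norm(x_in - x_out)        # L2 on rotation parameters
    l_mesh = np.linalg.norm(mesh_in - mesh_out)   # L2 on skinned vertices
    return lam_kl * l_kl + lam_param * l_param + lam_mesh * l_mesh
```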
s14, because the human motion data are limited, the generalization performance of the trained model is enhanced by using data augmentation in the process of training the variational self-coding network, and the strategy comprises the following steps:
s141, performing frame-by-frame sampling on the existing motion sequence: sampling one frame every m frames in the original sequence to form a motion sequence of T frames as input;
s142, sampling in a reverse order: reversing the sequence of the sampled motion sequence to be used as input;
s143, mirror image turning: rotationally exchanging the joint points corresponding to the symmetrical limbs of the human body to form a new motion sequence as input, and training the network until convergence by the supervision of the step S13;
and S15, after training, the hidden space code represents a complete human motion sequence with a small spatial dimension; the decoder network parameters are fixed, a hidden space code given at random is decoded by the decoder into a human motion sequence, and adjusting the hidden space code produces corresponding changes in the decoded motion.
Further, in step S2, the multi-view camera system performs multi-view acquisition with several FLIR Blackfly industrial cameras; the industrial cameras are fixed on tripods and placed around the captured area, synchronous acquisition is triggered in hardware, and the captured data are stored to a solid-state drive through USB 3.0 and PCIE interfaces.
Further, the specific method of step S3 is: preprocessing the collected multi-person motion videos with existing open-source two-dimensional pose estimation and tracking; in each frame, the human joint points obtained by pose estimation are linked into continuous joint-point motion trajectories by human tracking, and the obtained trajectories are screened using the displacement vector of the human joint points as a constraint, the displacement vector being calculated as:

V_i = (S_i − S_{i−n}) / n

wherein S_i, S_{i−n} respectively denote the positions of the human joint points in frame i and frame i−n; when the modulus of V_i exceeds a threshold θ, the tracking is considered erroneous; if the number of erroneous frames exceeds K frames, the error is considered uncorrectable and all erroneous results are discarded; if the number of erroneous frames is less than K frames, V_i is updated using the current frame and the last frame before the error; this processing is applied to every human object detected in the video to obtain the two-dimensional joint point trajectories.
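The screening rule above can be sketched as follows, under the assumption that the displacement vector is the average per-frame displacement over the gap to the last good frame (the parameter names `n`, `theta`, `K` mirror the text but the exact bookkeeping is illustrative):

```python
import numpy as np

def screen_trajectory(joints, n=1, theta=10.0, K=30):
    """Flag tracking errors in one person's 2D joint trajectory.

    joints: (F, J, 2) array of pixel coordinates over F frames.
    A frame is marked erroneous when the mean per-joint displacement
    relative to the last good frame exceeds theta; a run longer than
    K frames is treated as uncorrectable and dropped entirely.
    Returns a boolean mask of frames considered valid.
    """
    F = joints.shape[0]
    ok = np.ones(F, dtype=bool)
    last_good = 0
    for i in range(n, F):
        # average displacement per frame across the gap (the V_i above)
        v = (joints[i] - joints[last_good]) / max(i - last_good, 1)
        if np.linalg.norm(v, axis=-1).mean() > theta:
            ok[i] = False
            if i - last_good > K:            # too long to correct: drop the run
                ok[last_good + 1:i + 1] = False
        else:
            last_good = i                    # update using the current frame
    return ok
```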
Further, the specific method of step S4 includes: firstly, predicting the three-dimensional pose, in relative coordinates, of the same human object in the initial frames of different views of the collected video with an existing open-source single-view three-dimensional pose estimation method; because the three-dimensional poses of different views at the same moment differ only by a rigid rotation, and this rigid rotation of the human body equals the rotation of the camera extrinsics, one view is selected as the initial view and the rigid rotation of the human three-dimensional pose in every other view relative to that of the initial view is computed; then, the camera rotation of the initial view is set to the identity matrix, and its camera translation is obtained by least squares from the camera intrinsics, the three-dimensional pose estimate and the two-dimensional pose estimate:

t* = argmin_t ‖π(R·J + t) − p‖_2

wherein π is the projection, R and t are the rotation and translation of the camera, J is the three-dimensional joint position obtained by three-dimensional pose estimation, p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection, and t* is the solved translation;

after the extrinsics of the initial-view camera are obtained, the rigid rotation of the three-dimensional pose of the second view is used as the rotation of its camera extrinsics, and the camera extrinsics of that view are obtained through the above formula using the three-dimensional pose of the initial view and the two-dimensional pose of the current view; the above process is repeated to obtain the camera extrinsics of every view.
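The least-squares problem above can be made linear by rearranging the pinhole projection equations, so t comes out of a single `lstsq` call. A sketch assuming a standard intrinsic matrix K (names are illustrative):

```python
import numpy as np

def solve_translation(K, R, J3d, p2d):
    """Linear least-squares camera translation.

    For each joint, pi(K (R J + t)) = p is rearranged into two equations
    linear in t = (tx, ty, tz):
        fx*tx + (cx - u)*tz = -fx*qx - (cx - u)*qz
        fy*ty + (cy - v)*tz = -fy*qy - (cy - v)*qz
    where q = R J. K: 3x3 intrinsics, R: 3x3 rotation,
    J3d: (M, 3) estimated 3D joints, p2d: (M, 2) detected 2D joints.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    q = J3d @ R.T                           # rotated joints
    A, b = [], []
    for (qx, qy, qz), (u, v) in zip(q, p2d):
        A.append([fx, 0.0, cx - u]); b.append(-fx * qx - (cx - u) * qz)
        A.append([0.0, fy, cy - v]); b.append(-fy * qy - (cy - v) * qz)
    t, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return t
```

At least two non-coincident joints are needed to constrain the three unknowns.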
Further, the specific method of step S6 includes:
performing bundle adjustment optimization with the initialized three-dimensional models, the camera parameters and the two-dimensional joint points obtained by two-dimensional pose estimation:

min over R, t, z of: λ_reproj·L_reproj(R, t, z) + λ_motion·L_motion(z) + λ_latent·L_latent(z)

wherein λ_reproj, λ_motion, λ_latent are weights, R, t, z are the camera rotations and translations of each view and the hidden space codes of the multiple persons, L_reproj(R, t, z) is the reprojection term, L_motion(z) is the physical motion prior term, and L_latent(z) is the hidden space motion prior term;
wherein the reprojection term is:

L_reproj(R, t, z) = Σ_{f=1..F} Σ_{v=1..V} Σ_{n=1..N} m·ω·‖π(FK(D(z_n))_f) − p‖_2

wherein m indicates whether the two-dimensional joint point is visible in the frame (1 if visible, 0 otherwise), p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection, ω is the confidence of the two-dimensional joint point, z is the hidden space code, D(z) is the three-dimensional joint rotation of the human body decoded from the hidden space, FK is the forward kinematics transform, and F, V and N are respectively the number of sequence frames, the number of views and the number of persons to be reconstructed;
the physical motion prior term L_motion(z) is:

L_motion(z) = Σ_{n=1..N} Σ_f ‖θ^n_{f+1} − 2·θ^n_f + θ^n_{f−1}‖_2

wherein θ^n_f denotes the three-dimensional joint rotation of the f-th frame of the n-th human object to be reconstructed, θ^n_{f−1} that of the (f−1)-th frame, and θ^n_{f+1} that of the (f+1)-th frame;
the hidden space motion prior term L_latent(z) is:

L_latent(z) = Σ_{n=1..N} ‖z_n‖_2

wherein z_n denotes the hidden space code of the n-th human object to be reconstructed; the hidden space codes obtained by the optimization are decoded, and the human body model in each frame is skinned to obtain the human reconstruction result of the whole sequence.
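The two prior terms are cheap to evaluate on their own. In this sketch the physical motion prior is written as a second-order difference (penalising joint-angle acceleration), which is one plausible reading of the f−1/f/f+1 terms in the text rather than a form the patent states explicitly:

```python
import numpy as np

def motion_prior(theta):
    """Physical motion prior: penalise acceleration of the decoded
    rotation parameters. theta: (F, N_joints, 6) per-frame rotations."""
    acc = theta[2:] - 2.0 * theta[1:-1] + theta[:-2]
    return np.sum(np.linalg.norm(acc.reshape(acc.shape[0], -1), axis=1))

def latent_prior(z):
    """Hidden-space prior: keep each person's code z_n near the origin
    of the standard-normal latent distribution. z: (N_people, dim)."""
    return np.sum(np.linalg.norm(z, axis=-1))
```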
Compared with the prior art, the invention has the following beneficial effects: 1. Multi-person videos are captured synchronously by multi-view RGB cameras, and calibration and reconstruction are performed simultaneously; the equipment and the workflow are simple. 2. Hidden space codes represent human motion, providing motion prior information during the optimization, reducing the dimensionality and the difficulty of the optimization. 3. In the joint optimization, the human structure prior, the physical motion prior and the hidden space motion prior improve the accuracy of the camera parameter solution. 4. For video frames in which a human body is severely occluded, the result for the frame can still be reconstructed from the physical motion prior and the hidden space motion prior even without two-dimensional joint constraints, reducing the influence of mutual occlusion among multiple persons and of tracking errors on the reconstruction.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a hidden spatial motion prior network architecture;
FIG. 3 is a schematic view of a multi-view acquisition system;
FIG. 4 is a schematic representation of a mannequin;
FIG. 5 is a two-dimensional pose estimation versus joint point trajectory results diagram;
fig. 6 is a graph of the reconstruction results.
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings. As shown in fig. 1, the method for reconstructing a multi-person human body model based on hidden space motion coding according to the present invention includes the following steps:
(1) In this embodiment, the human motion information refers to the pose of a human body at a certain moment; the poses at several moments form a motion sequence, the skeleton pose is represented by joint point rotations, and these rotations drive the deformation of the human mesh. For a continuous representation of three-dimensional rotation in a deep neural network, each of the N joint points of the human skeleton represents its three-dimensional rotation with a 6-dimensional parameter. The 6-dimensional parameter consists of the first two column vectors of the three-dimensional rotation matrix; since the third column vector is determined up to sign, it can be obtained by cross-multiplying the first two column vectors. After the network estimates the 6-dimensional parameters, they are split into two column vectors, Gram-Schmidt orthogonalization is applied, and the third column vector is determined by the cross product, so the rotation of each joint point can be represented as a 3 × 3 rotation matrix. Thus a human motion sequence of T frames with N skeletal joint points can be represented by T × N × 6 parameters.
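The 6D-to-rotation-matrix recovery described above can be sketched as:

```python
import numpy as np

def rot6d_to_matrix(x6):
    """Recover a 3x3 rotation matrix from the 6D representation:
    the 6 numbers are two 3-vectors, Gram-Schmidt orthonormalised,
    and the third column is their cross product."""
    a, b = x6[:3], x6[3:]
    c1 = a / np.linalg.norm(a)
    b = b - np.dot(c1, b) * c1        # remove the component along c1
    c2 = b / np.linalg.norm(b)
    c3 = np.cross(c1, c2)
    return np.stack([c1, c2, c3], axis=1)
```

Because the cross product fixes the orientation, the result always has determinant +1, i.e. a proper rotation.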
(2) First, a variational self-coding network as shown in fig. 2 is constructed to learn the above-mentioned human motion information, and a hidden space representation of the human motion information is constructed. The encoder consists of an encoding gating cycle unit, a mean value encoding network and a variance encoding network. The decoder consists of a decoding full-connection network and a decoding gating cyclic unit. Taking the T-frame human motion sequence as input, obtaining the mean value and variance corresponding to the sequence through an encoder, sampling from the distribution formed by the mean value and variance, and recovering the sampling code into the input T-frame human motion sequence after passing through a decoder.
(3) The mean and variance obtained by the encoder are supervised with a standard normal distribution, and the hidden space distribution is constrained by the Kullback-Leibler divergence between the two distributions:

L_KL = KL(P(x) ‖ N(0, I))

wherein N(0, I) denotes the standard normal distribution, P(x) denotes the distribution given by the mean encoding network and the variance encoding network, and KL(P‖N) denotes the Kullback-Leibler divergence between distribution P and distribution N.
Furthermore, an L2 loss between the input and output parameters is used as an additional constraint:

L_param = ‖X_in − X_out‖_2

wherein X_in, X_out are the input and output three-dimensional joint rotation parameters of the human body, ‖·‖_2 denotes the two-norm, and L_param denotes the L2 loss between the input and output parameters. To obtain finer motion details, the rotation parameters further drive the mesh deformation, with the L2 loss between the input and output meshes as a constraint:

L_mesh = ‖M(X_in) − M(X_out)‖_2

wherein M is the differentiable skinning operation, which takes rotation parameters as input and outputs the deformed three-dimensional human body mesh, and L_mesh denotes the L2 loss between the input and output meshes.
Finally, the training loss of the variational self-coding network is:

L = λ_kl·KL(P‖N) + λ_param·L_param + λ_mesh·L_mesh

wherein λ_kl, λ_param, λ_mesh are weights.
(4) Because human motion data is limited, data augmentation is used when training the network to enhance the generalization of the trained model. The strategies are: 1. Strided sampling of existing motion sequences: one frame is sampled every m frames of the original sequence to form a T-frame motion sequence as input. 2. Reverse-order sampling: the sampled motion sequence is reversed in order and used as input. 3. Mirror flipping: the rotations of the joint points of symmetric limbs (such as the left and right hands) are exchanged to form a new motion sequence as input. The network is trained under the supervision of step (3) until convergence.
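The three augmentation strategies can be sketched on a (frames × joints × 6) motion array. The left/right joint index pairs are placeholders (a real skeleton defines its own symmetry map), and a full mirror flip would also negate the appropriate rotation components, which is skipped here:

```python
import numpy as np

def augment(seq, m=2, T=16, left_ids=(1, 3), right_ids=(2, 4)):
    """Return [strided, reversed, mirrored] variants of a motion array.

    seq: (frames, joints, 6) rotation parameters.
    1. sample one frame every m frames, keep at most T frames;
    2. reverse the sampled sequence in time;
    3. swap the rotations of symmetric left/right joints.
    """
    strided = seq[::m][:T]                 # 1. strided sampling
    rev = strided[::-1].copy()             # 2. reverse order
    mirrored = strided.copy()              # 3. swap symmetric limbs
    mirrored[:, list(left_ids)], mirrored[:, list(right_ids)] = \
        strided[:, list(right_ids)].copy(), strided[:, list(left_ids)].copy()
    return [strided, rev, mirrored]
```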
(5) After the network training is completed, the hidden space encoding can represent a complete human motion sequence with smaller space dimension. And (3) fixing the network parameters of the decoder, randomly giving a hidden space code, and decoding by the decoder to obtain a human motion sequence. By adjusting the implicit spatial coding, the decoded motion sequence also produces corresponding motion variations.
Establishing a multi-view camera system:
(6) The acquisition system has a simple structure; multi-view acquisition is performed with only a few FLIR Blackfly industrial cameras. The cameras are fixed on tripods and placed appropriately around the area to be captured. Synchronous acquisition is triggered in hardware, and the captured data are stored to a solid-state drive through USB 3.0 and a PCIE interface, as shown in fig. 3.
Estimating and tracking the two-dimensional posture of the human body:
(7) The collected multi-person motion video is preprocessed with existing open-source two-dimensional pose estimation and tracking. In each frame, the human joint points obtained by pose estimation are linked into continuous joint-point motion trajectories by human tracking. Because of occlusion and interaction caused by the motion of multiple persons, the tracking results contain noise. To filter out erroneous tracking results, the acquired trajectories are screened using the displacement vector of the human joint points as a constraint, calculated as:

V_i = (S_i − S_{i−n}) / n

wherein S_i, S_{i−n} respectively denote the positions of the human joint points in frame i and frame i−n. When the modulus of V_i exceeds a threshold θ (generally determined by the camera frame rate and focal length; θ is 10 in this embodiment), the tracking is considered erroneous; if the number of erroneous frames exceeds t frames (t is generally chosen according to the camera frame rate; t is 30 in this embodiment), the error is considered uncorrectable and all erroneous results are discarded, as shown in fig. 5. If the number of erroneous frames is less than t frames, V_i is updated using the current frame and the last frame before the error. Step (7) is applied to every human object detected in the video to obtain the two-dimensional joint point trajectories.
Initial camera parameter acquisition:
(8) The joint solution of the camera parameters and the three-dimensional human body model is a highly non-convex optimization problem, and the accuracy of the initial values strongly affects the accuracy of the final result. Since the camera intrinsics are determined only by the camera hardware, the intrinsics are assumed known in the present invention. The invention provides a method for acquiring the initial camera extrinsics from the three-dimensional pose of the initial frame of the multi-view video. Firstly, the three-dimensional pose, in relative coordinates, of the same human object in the initial frames of different views is predicted with an existing open-source single-view three-dimensional pose estimation method. Because the three-dimensional poses of different views at the same moment differ only by a rigid rotation, this rigid rotation of the human body equals the rotation of the camera extrinsics. Therefore, one view is selected as the initial view, and the rigid rotation of the human three-dimensional pose in every other view relative to that of the initial view is computed. Then, the camera rotation of the initial view is set to the identity matrix, and its camera translation is obtained by least squares from the camera intrinsics, the three-dimensional pose estimate and the two-dimensional pose estimate:

t* = argmin_t ‖π(R·J + t) − p‖_2

wherein π is the projection, R and t are the rotation and translation of the camera, J is the three-dimensional joint position obtained by three-dimensional pose estimation, and p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection. After the extrinsics of the initial-view camera are obtained, the rigid rotation of the three-dimensional pose of the second view is used as the rotation of its camera extrinsics, and the camera extrinsics of that view are obtained through the above formula using the three-dimensional pose of the initial view and the two-dimensional pose of the current view. The above process is repeated to obtain the camera extrinsics of every view.
Obtaining an initial three-dimensional joint point and initializing a human body model:
(9) The three-dimensional position of each joint point of each human body to be reconstructed in each frame is calculated by geometric triangulation from the two-dimensional pose results of step (7) and the initial camera parameters of step (8).
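Geometric triangulation of one joint from several views is typically done with the linear DLT method; a sketch assuming known per-view projection matrices P = K[R | t] (the patent does not name the exact algorithm):

```python
import numpy as np

def triangulate(P_list, pts2d):
    """DLT triangulation of one 3D point from multiple views.

    P_list: per-view 3x4 projection matrices K [R | t].
    pts2d: (V, 2) pixel observations of the same joint.
    Each view contributes two rows u*P[2]-P[0], v*P[2]-P[1];
    the point is the null vector of the stacked system (smallest
    right singular vector), dehomogenised at the end.
    """
    A = []
    for P, (u, v) in zip(P_list, pts2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]
```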
(10) To reduce the difficulty of the optimization, the three-dimensional human body model must be initialized. For each frame of the sequence to be fitted, the joint points of the initial human model are aligned with the three-dimensional joint points obtained in step (9), and the aligned initial three-dimensional human body model is taken as the initial state of the fitting. The alignment of each human body to be reconstructed in the video sequence is completed in turn.
Solving by bundle adjustment based on hidden-space motion coding:
(11) Bundle adjustment optimization is performed using the initialized three-dimensional model, the camera parameters, and the two-dimensional joints obtained by two-dimensional pose estimation. The formula is as follows:
$$\min_{R,t,z}\; \lambda_{reproj} L_{reproj}(R,t,z) + \lambda_{motion} L_{motion}(z) + \lambda_{latent} L_{latent}(z)$$

where λ_reproj, λ_motion, λ_latent are weights, and R, t, z are the camera rotation and translation of each view and the hidden-space codes of the multiple human bodies. The reprojection term is:
$$L_{reproj}(R,t,z) = \sum_{f=1}^{F}\sum_{v=1}^{V}\sum_{n=1}^{N} m_{f,v}^{n}\, \omega_{f,v}^{n} \left\| \pi\left(FK\left(D(z_{n})\right)_{f},\, R_{v}, t_{v}\right) - p_{f,v}^{n} \right\|^{2}$$
where m indicates whether the two-dimensional joint is visible in the frame (1 if visible, 0 otherwise), p is the two-dimensional joint position, ω the confidence of the two-dimensional joint, and z the hidden-space code. D(z) is the three-dimensional joint rotation obtained by decoding the hidden-space code, and FK is the forward kinematics transformation. F, V and N are the number of frames in the sequence, the number of views, and the number of people to be reconstructed, respectively.
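As a hedged illustration of this reprojection term, the visibility- and confidence-weighted squared error over frames, views and people can be computed as below. The (F, V, N, J, …) array layout is our assumption, and the projection π and forward kinematics FK are taken as already applied, so the inputs are 2D joint arrays:

```python
import numpy as np

def reprojection_loss(pred2d, obs2d, vis, conf):
    """Visibility- and confidence-weighted squared reprojection error.
    pred2d, obs2d: (F, V, N, J, 2) projected model joints and detections;
    vis, conf:     (F, V, N, J) visibility mask and detector confidence."""
    err = np.sum((pred2d - obs2d) ** 2, axis=-1)   # per-joint squared error
    return float(np.sum(vis * conf * err))
```

Joints that the detector missed (m = 0) contribute nothing, so occlusions in individual views do not corrupt the fit.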
The physical motion prior term is:
$$L_{motion}(z) = \sum_{n=1}^{N}\sum_{f=2}^{F-1} \left\| 2\,\theta_{f}^{n} - \theta_{f-1}^{n} - \theta_{f+1}^{n} \right\|^{2}$$

where θ_f^n denotes the three-dimensional joint rotation of frame f of the n-th human body to be reconstructed, and θ_{f-1}^n and θ_{f+1}^n denote those of frames f-1 and f+1.
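The physical motion prior is a second-difference (acceleration) smoothness penalty on the per-frame joint rotations; a minimal numpy sketch (the (N, F, D) array layout is our assumption):

```python
import numpy as np

def motion_prior(theta):
    """Second-difference smoothness: penalizes
    || 2*theta_f - theta_{f-1} - theta_{f+1} ||^2 summed over persons
    and interior frames.  theta: (N, F, D) rotation parameters."""
    accel = 2 * theta[:, 1:-1] - theta[:, :-2] - theta[:, 2:]
    return float(np.sum(accel ** 2))
```

Motions with constant per-frame velocity incur zero penalty; only accelerations (jitter) are penalized, which is what makes this a physical plausibility prior rather than a damping term.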
The hidden spatial motion prior term is:
$$L_{latent}(z) = \sum_{n=1}^{N} \left\| z_{n} \right\|^{2}$$

where z_n denotes the hidden-space code of the n-th human body to be reconstructed. The hidden-space codes obtained by the optimization are decoded, and the human model is skinned in each frame to obtain the human reconstruction result of the whole sequence.
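The whole hidden-space bundle adjustment can be illustrated on a toy problem. Everything below is a simplified stand-in for the patent's optimizer, not its actual implementation: a fixed random linear map plays the role of the trained decoder D(z), an orthographic camera replaces π ∘ FK, a single person is fitted, and the latent code is optimized by plain gradient descent with finite-difference gradients over the three terms of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
F, J, Z = 8, 4, 6                        # frames, joints, latent dim
W = rng.normal(size=(F * J * 3, Z))      # stand-in linear "decoder" D(z)

def decode(z):
    """Toy decoder: latent code -> (F, J, 3) joint positions."""
    return (W @ z).reshape(F, J, 3)

def objective(z, obs2d, lam_motion=0.01, lam_latent=0.01):
    """Reprojection + physical motion prior + latent prior, mirroring
    the bundle-adjustment objective (toy orthographic camera)."""
    joints = decode(z)
    proj = joints[..., :2]                               # orthographic pi
    reproj = np.sum((proj - obs2d) ** 2)
    accel = 2 * joints[1:-1] - joints[:-2] - joints[2:]  # 2nd difference
    motion = np.sum(accel ** 2)
    latent = np.sum(z ** 2)
    return reproj + lam_motion * motion + lam_latent * latent

def num_grad(f, z, eps=1e-5):
    """Central finite-difference gradient (fine for a 6-dim toy problem)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

# synthesize observations from a ground-truth code, then fit from zero
z_true = rng.normal(size=Z)
obs2d = decode(z_true)[..., :2]
z = np.zeros(Z)
for _ in range(300):
    z -= 1e-3 * num_grad(lambda w: objective(w, obs2d), z)
```

The actual method additionally optimizes per-view camera rotations and translations, decodes through the trained GRU decoder, and projects perspectively; the structure of the objective is the same.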
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the specific embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is intended to be covered by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (6)

1. A multi-person human body model reconstruction method based on hidden space motion coding is characterized by comprising the following steps:
s1, hidden space motion prior training: extracting a human body motion sequence from the existing three-dimensional human body model data set containing the motion sequence, using the human body motion sequence for variational self-coding network training, and fixing decoder parameters of the variational self-coding network after the training is finished for subsequent iterative optimization;
s2, building a multi-view camera system and collecting multi-human motion videos;
s3, preprocessing a multi-human motion video acquired by the multi-view camera system set up in the step S2 by using the existing open-source two-dimensional attitude estimation and tracking to obtain a two-dimensional joint point track;
s4, acquiring initial camera external parameters by utilizing the three-dimensional posture of the initial frame of the multi-view video;
s5, calculating the three-dimensional position of each joint point of each human body to be reconstructed in each frame through geometric triangulation by using the two-dimensional joint point track obtained in the step S3 and the initial camera external reference obtained in the step S4, aligning the joint points of the human body initial model with the three-dimensional position of each joint point, and sequentially finishing the alignment of each human body to be reconstructed in the multi-human body motion video by taking the aligned initial three-dimensional human body model as a fitting initial state;
s6, with the three-dimensional human body model initialized in the step S5 as an initial state, decoding the hidden space coding by using a decoder of the self-coding network trained in the step S1, driving the three-dimensional human body model to deform, iteratively optimizing the hidden space coding and the initial camera external parameters obtained in the step S4, enabling the joint point projection of the three-dimensional human body model to be consistent with the two-dimensional joint point track obtained in the step S3, and finally covering the human body model in each frame to obtain the human body reconstruction result of the whole sequence.
2. The hidden spatial motion coding based multi-person human body model reconstruction method according to claim 1, wherein the step S1 includes the following specific steps:
s11, extracting a human motion sequence from the existing three-dimensional human model data set containing motion sequences, wherein the human motion information in each frame is represented by a three-dimensional human model with a skinned skeleton, the skeletal pose of the human motion sequence is represented by joint rotations, the human mesh is driven to deform by the joint rotations, and a T-frame sequence is represented by T × N × 6 parameters;
s12, firstly, constructing a variational self-coding network to learn the human motion information in the step S11 and constructing a hidden space representation of the human motion information, wherein an encoder of the variational self-coding network consists of an encoding gating cycle unit, a mean value encoding network and a variance encoding network, a decoder of the variational self-coding network consists of a decoding full-connection network and a decoding gating cycle unit, a T-frame human motion sequence is used as input, a mean value and a variance corresponding to the sequence are obtained through the encoder of the variational self-coding network, sampling is carried out from distribution consisting of the mean value variances, and the sampling codes are recovered into the input T-frame human motion sequence after passing through the decoder of the variational self-coding network;
s13, the mean and variance obtained for the sequence by the encoder of the variational self-coding network are supervised with a standard normal distribution, and the hidden-space distribution is constrained by the Kullback-Leibler divergence between the two distributions, with the formula:

$$L_{kl} = KL\left(P(x)\,\|\,N(0, I)\right)$$

where N(0, I) denotes the standard normal distribution, P(x) denotes the distribution given by the mean encoding network and the variance encoding network, i.e. N(μ, σ²), and KL(P‖N) denotes the Kullback-Leibler divergence between distribution P and distribution N;
in addition, an L2 loss between the input parameters and the output parameters is used as a further constraint:

$$L_{param} = \left\| X_{in} - X_{out} \right\|_{2}$$

where X_in and X_out are the input and output three-dimensional joint rotation parameters respectively, ‖·‖₂ denotes the two-norm, and L_param denotes the L2 loss between the input and the output parameters;
the mesh deformation is further driven with the rotation parameters, with the L2 loss between the input and output meshes as constraints:
$$L_{mesh} = \left\| M(X_{in}) - M(X_{out}) \right\|_{2}$$

where M is the differentiable skinning operation, which takes the rotation parameters as input and outputs the deformed three-dimensional human mesh, and L_mesh denotes the L2 loss between the input and the output meshes;
finally, the training loss of the variational self-coding network is:
$$L = \lambda_{kl} L_{kl} + \lambda_{param} L_{param} + \lambda_{mesh} L_{mesh}$$

where λ_kl, λ_param, λ_mesh are weights;
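The three training losses of this claim can be written out directly. The log-variance parameterization of the diagonal Gaussian is our assumption (standard in VAE implementations), and the closed form of the KL term follows from comparing N(μ, σ²) with N(0, I); skinned vertices are passed in directly since the skinning function M is model-specific:

```python
import numpy as np

def kl_loss(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian,
    with logvar = log sigma^2: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * float(np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0))

def param_loss(x_in, x_out):
    """L2 loss on the joint-rotation parameters."""
    return float(np.linalg.norm(x_in - x_out))

def mesh_loss(v_in, v_out):
    """L2 loss on skinned mesh vertices (M(X_in) vs M(X_out))."""
    return float(np.linalg.norm(v_in - v_out))

def total_loss(mu, logvar, x_in, x_out, v_in, v_out,
               lam_kl=1e-3, lam_param=1.0, lam_mesh=1.0):
    """Weighted training loss L = lam_kl*L_kl + lam_param*L_param
    + lam_mesh*L_mesh (the weight values are illustrative)."""
    return (lam_kl * kl_loss(mu, logvar)
            + lam_param * param_loss(x_in, x_out)
            + lam_mesh * mesh_loss(v_in, v_out))
```

The KL term vanishes exactly when the encoder outputs μ = 0, σ = 1, i.e. when the posterior already matches the standard normal prior.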
s14, because human motion data are limited, data augmentation is used in the process of training the variational self-coding network to enhance the generalization of the trained model; the strategies are as follows:
s141, strided sampling of the existing motion sequences: one frame is sampled every m frames of the original sequence to form a T-frame motion sequence as input;
s142, sampling in a reverse order: reversing the sequence of the sampled motion sequence to be used as input;
s143, mirror flipping: the rotations of the joints corresponding to the symmetric limbs of the human body are exchanged to form a new motion sequence as input; the network is trained under the supervision of step S13 until convergence;
s15, after the steps S11-S14 are completed, the hidden-space code can represent a complete human motion sequence with a much smaller dimensionality; the network parameters of the decoder are fixed, a hidden-space code is given at random, and a human motion sequence is obtained after decoding by the decoder; adjusting the hidden-space code produces a corresponding change of motion in the decoded sequence.
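The three augmentation strategies S141-S143 can be sketched as follows. The array shapes and joint-pair indices are illustrative; note that a fully faithful mirror flip would also negate the rotation components across the sagittal plane, a detail this sketch omits:

```python
import numpy as np

def stride_sample(seq, m, T):
    """S141: take every m-th frame of a longer sequence to build
    a T-frame training clip.  seq: (F, J, D) with F >= m*T."""
    return seq[::m][:T]

def reverse_order(seq):
    """S142: play the sampled clip backwards."""
    return seq[::-1].copy()

def mirror_flip(seq, swap_pairs):
    """S143: exchange the rotation channels of symmetric joints.
    seq: (T, J, D); swap_pairs: [(left_idx, right_idx), ...].
    (A faithful mirror also flips rotation signs; omitted here.)"""
    out = seq.copy()
    for l, r in swap_pairs:
        out[:, [l, r]] = out[:, [r, l]]
    return out
```

All three operations are involutive or cheap to compose, so each original clip yields several distinct training sequences.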
3. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein in step S2 the multi-view camera system performs multi-view acquisition with a plurality of FLIR Blackfly industrial cameras, which are fixed on tripods, placed around the capture area, and synchronized by hardware triggering, the captured data being stored on a solid state disk through a USB3.0 PCIE interface.
4. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein the specific method of step S3 is: the collected multi-human motion videos are preprocessed with existing open-source two-dimensional pose estimation and tracking; the human joints obtained by pose estimation in each frame of image are linked by human tracking into continuous joint trajectories, and the obtained trajectories are screened with the displacement vectors of the human joints as a constraint, the displacement vector being computed as:
$$\vec{d} = S_{i} - S_{i-n}$$

where S_i and S_{i-n} respectively denote the human joint positions of the i-th and (i-n)-th frames; when the modulus of the displacement vector exceeds a threshold θ, the tracking is considered erroneous; if the number of erroneous frames exceeds K, the error is considered uncorrectable and all erroneous results are discarded; if fewer than K frames are erroneous, the displacement vector is updated using the current frame and the last frame before the error;
And performing the processing on each human body object detected in the video to obtain a two-dimensional joint point track.
5. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein the specific method of step S4 includes: firstly, an existing open-source single-view three-dimensional pose estimation method is used to predict, in relative coordinates, the three-dimensional pose of the same human subject in the initial frame of each view of the collected video; because the three-dimensional poses of different views at the same moment differ only by a rigid rotation, and this rigid rotation of the human body equals the rotation of the camera extrinsics, one view is selected as the initial view, and the rigid rotation of the three-dimensional pose of each other view relative to that of the initial view is computed; then, the camera rotation of the initial view is set to the identity matrix, and the camera translation is solved by least squares from the camera intrinsics, the three-dimensional pose estimate, and the two-dimensional pose estimate, with the formula:
$$t^{*} = \arg\min_{t} \sum_{j} \left\| \pi\left(R J_{j} + t\right) - p_{j} \right\|^{2}$$

where π is the projection, R and t are the camera rotation and translation respectively, J the three-dimensional joint positions obtained by three-dimensional pose estimation, p the two-dimensional joint pixel coordinates obtained by two-dimensional pose detection, and t* the solved translation;
after the camera extrinsics of the initial view are obtained, the rigid rotation of the three-dimensional pose of the second view is used as the rotation of its camera extrinsics, and the extrinsics of the current view are obtained through the above formula using the three-dimensional pose of the initial view and the two-dimensional pose of the current view; the process is repeated to obtain the camera extrinsics of each view.
6. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein the specific method of step S6 includes:
bundle adjustment optimization is performed using the initialized three-dimensional model, the camera parameters, and the two-dimensional joints obtained by two-dimensional pose estimation, with the formula:
$$\min_{R,t,z}\; \lambda_{reproj} L_{reproj}(R,t,z) + \lambda_{motion} L_{motion}(z) + \lambda_{latent} L_{latent}(z)$$

where λ_reproj, λ_motion, λ_latent are weights, R, t, z are the camera rotation and translation of each view and the hidden-space codes of the multiple human bodies, L_reproj(R,t,z) is the reprojection term, L_motion(z) the physical motion prior term, and L_latent(z) the hidden-space motion prior term;
wherein the reprojection term is:
$$L_{reproj}(R,t,z) = \sum_{f=1}^{F}\sum_{v=1}^{V}\sum_{n=1}^{N} m_{f,v}^{n}\, \omega_{f,v}^{n} \left\| \pi\left(FK\left(D(z_{n})\right)_{f},\, R_{v}, t_{v}\right) - p_{f,v}^{n} \right\|^{2}$$

where m indicates whether the two-dimensional joint is visible (1 if visible, 0 otherwise), p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection, ω the confidence of the two-dimensional joint, z the hidden-space codes of the multiple human bodies, D(z) the three-dimensional joint rotation obtained by decoding the hidden-space code, FK the forward kinematics transformation, and π the projection; F, V and N are the number of frames of the sequence, the number of views, and the number of people to be reconstructed respectively, and f, v and n the corresponding indices, f ∈ (1, 2, 3, ..., F), v ∈ (1, 2, 3, ..., V), n ∈ (1, 2, 3, ..., N);
the physical motion prior term L_motion(z) is:

$$L_{motion}(z) = \sum_{n=1}^{N}\sum_{f=2}^{F-1} \left\| 2\,\theta_{f}^{n} - \theta_{f-1}^{n} - \theta_{f+1}^{n} \right\|^{2}$$

where θ_f^n is the three-dimensional joint rotation of frame f of the n-th human body to be reconstructed, and θ_{f-1}^n and θ_{f+1}^n those of frames f-1 and f+1;
the hidden-space motion prior term L_latent(z) is:

$$L_{latent}(z) = \sum_{n=1}^{N} \left\| z_{n} \right\|^{2}$$

where z_n denotes the hidden-space code of the n-th human body to be reconstructed; the hidden-space codes obtained by the optimization solution are decoded, and the human model is skinned in each frame to obtain the human reconstruction result of the whole sequence.
CN202110758141.XA 2021-07-05 2021-07-05 Hidden space motion coding-based multi-person human body model reconstruction method Active CN113379904B (en)

Publications (2)

Publication Number Publication Date
CN113379904A CN113379904A (en) 2021-09-10
CN113379904B (en) 2022-02-15



