CN113379904B - Hidden space motion coding-based multi-person human body model reconstruction method - Google Patents


Info

Publication number
CN113379904B
CN113379904B
Authority
CN
China
Prior art keywords
dimensional
human body
motion
human
frame
Prior art date
Legal status
Active
Application number
CN202110758141.XA
Other languages
Chinese (zh)
Other versions
CN113379904A
Inventor
王雁刚 (Yangang Wang)
黄步真 (Buzhen Huang)
束愿 (Yuan Shu)
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110758141.XA
Publication of CN113379904A
Application granted
Publication of CN113379904B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence


Abstract

The invention discloses a multi-person human body model reconstruction method based on hidden space (latent space) motion coding, comprising the following steps: S1, training a hidden space motion prior; S2, building a multi-view camera system and collecting multi-person motion videos; S3, preprocessing the multi-person motion video collected by the multi-view camera system built in step S2 with existing open-source two-dimensional pose estimation and tracking to obtain two-dimensional joint point trajectories; S4, acquiring initial camera extrinsic parameters; S5, completing the alignment of each human body to be reconstructed in the multi-person motion video; S6, iteratively optimizing the hidden space codes and the initial camera extrinsics obtained in step S4 so that the projected joint points of the three-dimensional human body models agree with the two-dimensional joint point trajectories obtained in step S3, and finally skinning the human body model in each frame to obtain the reconstruction result for the whole sequence. The invention reduces the influence of mutual occlusion among multiple persons and of tracking errors on the reconstruction.

Description

Hidden space motion coding-based multi-person human body model reconstruction method
Technical Field
The invention relates to a hidden space motion coding-based multi-person human body model reconstruction method, and belongs to the field of computer vision and three-dimensional vision.
Background
The reconstruction of multi-person human body models plays an important role in applications such as holographic communication, traffic behavior monitoring and group behavior analysis. Driven by technologies related to artificial intelligence, the market demand for human body reconstruction grows daily. Current single-view human body model reconstruction schemes face depth ambiguity and a lack of robustness to occlusion: on the one hand, the unfixed depth prevents an accurate human model from being reconstructed; on the other hand, when multiple persons occlude one another, the loss of image information prevents a good reconstruction result. Although multi-view human body model reconstruction schemes can compensate for the shortcomings of single-view schemes to a certain extent, existing methods must first calibrate the cameras and then reconstruct the human models, a complex process. In addition, mutual occlusion of the human models and the views lost to it reduce the effective information, making it difficult for multi-view methods to reconstruct accurate interacting multi-person human models. For multi-person images with occlusion, temporal constraints can well compensate for the human model information missing due to occlusion. However, reconstruction methods that add temporal constraints solve in a space of large dimension, are computationally complex, and easily fall into local minima. Therefore, adding temporal information to achieve accurate multi-person human body model reconstruction while simplifying the reconstruction pipeline is a new and challenging problem.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a multi-person human body model reconstruction method based on hidden space motion coding, in which several synchronized cameras collect multi-person videos; through pose detection and tracking, the camera parameters and the human body models are optimized by combining hidden space motion coding with a physical motion prior, and the three-dimensional human body models are reconstructed directly from an uncalibrated multi-view video sequence.
The technical scheme is as follows: the invention relates to a multi-person human body model reconstruction method based on hidden space motion coding, which comprises the following steps:
s1, hidden space motion prior training: extracting human motion sequences from an existing three-dimensional human body model data set containing motion sequences and using them to train a variational self-coding network; after training, the decoder parameters of the variational self-coding network are fixed for the subsequent iterative optimization;
s2, building a multi-view camera system and collecting multi-human motion videos;
s3, preprocessing the multi-person motion video collected by the multi-view camera system built in step S2 with existing open-source two-dimensional pose estimation and tracking to obtain two-dimensional joint point trajectories;
s4, acquiring the initial camera extrinsic parameters from the three-dimensional pose of the initial frame of the multi-view video;
s5, calculating the three-dimensional position of each joint point of each human body to be reconstructed in each frame by geometric triangulation from the two-dimensional joint point trajectories obtained in step S3 and the initial camera extrinsics obtained in step S4, aligning the joint points of the initial human model with these three-dimensional joint positions, and, taking the aligned initial three-dimensional human body model as the initial state of the fitting, completing the alignment of each human body to be reconstructed in the multi-person motion video in turn;
s6, with the three-dimensional human body models initialized in step S5 as the initial state, decoding the hidden space codes with the decoder of the self-coding network trained in step S1 to drive the deformation of the three-dimensional human body models, and iteratively optimizing the hidden space codes and the initial camera extrinsics obtained in step S4 so that the projected joint points of the three-dimensional human body models agree with the two-dimensional joint point trajectories obtained in step S3; finally, the human body model in each frame is skinned to obtain the human reconstruction result of the whole sequence.
Further, the specific method of step S1 includes:
s11, extracting a human body motion sequence from the existing three-dimensional human body model data set containing the motion sequence, wherein human body motion information in each frame is represented by a three-dimensional human body model of a skeleton skin, the skeleton posture of the human body motion sequence is represented by joint point rotation, human body grids are driven to deform by the joint point rotation, and the T frame is represented by T multiplied by N multiplied by 6 parameters;
s12, firstly, constructing a variational self-coding network to learn the human motion information in the step S11 and constructing a hidden space representation of the human motion information, wherein an encoder of the variational self-coding network consists of an encoding gating cycle unit, a mean value encoding network and a variance encoding network, a decoder of the variational self-coding network consists of a decoding full-connection network and a decoding gating cycle unit, a T-frame human motion sequence is used as input, a mean value and a variance corresponding to the sequence are obtained through the encoder of the variational self-coding network, sampling is carried out from the distribution formed by the mean value variance, and the sampling code of the sampling code is recovered into the input T-frame human motion sequence after passing through the decoder of the variational self-coding network;
s13, monitoring the mean value and the variance corresponding to the sequence obtained by an encoder of the variational self-coding network by using standard normal distribution, and constraining the implicit space distribution by using Kullback-Leibler divergence between the two distributions, wherein the formula is as follows:
Figure BDA0003148001580000021
wherein N (mu, sigma)2) Representing a standard normal distribution, P (x) representing the distribution obeyed by a mean encoding network and a variance encoding network, and KL (P | | | N) representing the Kullback-Leibler divergence between the distribution P and the distribution N;
in addition, an L2 loss between the input parameters and the output parameters is used as a further constraint:

L_param = ‖X_in − X_out‖_2

wherein X_in, X_out are the input and output three-dimensional joint rotation parameters of the human body, ‖·‖_2 denotes the two-norm, and L_param denotes the L2 loss between the input and output parameters;
the rotation parameters further drive the mesh deformation, with the L2 loss between the input and output meshes as a constraint:

L_mesh = ‖M(X_in) − M(X_out)‖_2

wherein M is the differentiable skinning operation, which takes rotation parameters as input and outputs the deformed three-dimensional human body mesh, and L_mesh denotes the L2 loss between the input and output meshes;
finally, the training loss of the variational self-coding network is:

L = λ_kl·KL(P‖N) + λ_param·L_param + λ_mesh·L_mesh

wherein λ_kl, λ_param, λ_mesh are weights;
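Assuming the encoder outputs a diagonal Gaussian, the KL term above has the usual closed form, and the full training loss can be sketched as follows (a NumPy sketch; weight values and names such as `lam_kl` are illustrative, not from the patent):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def vae_loss(mu, log_var, x_in, x_out, mesh_in, mesh_out,
             lam_kl=1.0, lam_param=1.0, lam_mesh=1.0):
    """Weighted sum of the three terms: KL, rotation-parameter L2,
    and skinned-mesh L2 (M(X) is assumed to be precomputed here)."""
    l_kl = kl_to_standard_normal(mu, log_var)
    l_param = np.linalg.norm(x_in - x_out)        # L2 on rotation parameters
    l_mesh = np.linalg.norm(mesh_in - mesh_out)   # L2 on skinned vertices
    return lam_kl * l_kl + lam_param * l_param + lam_mesh * l_mesh
```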
s14, because the human motion data are limited, the generalization performance of the trained model is enhanced by using data augmentation in the process of training the variational self-coding network, and the strategy comprises the following steps:
s141, performing frame-by-frame sampling on the existing motion sequence: sampling one frame every m frames in the original sequence to form a motion sequence of T frames as input;
s142, sampling in a reverse order: reversing the sequence of the sampled motion sequence to be used as input;
s143, mirror image turning: rotationally exchanging the joint points corresponding to the symmetrical limbs of the human body to form a new motion sequence as input, and training the network until convergence by the supervision of the step S13;
and S15, after training, the hidden space code represents a complete human motion sequence with a small spatial dimension; the decoder network parameters are fixed, a hidden space code given at random is decoded by the decoder into a human motion sequence, and adjusting the hidden space code produces corresponding changes in the decoded motion.
Further, in step S2, the multi-view camera system performs multi-view acquisition with several FLIR Blackfly industrial cameras; the industrial cameras are fixed on tripods and placed around the captured area, synchronous acquisition is triggered in hardware, and the captured data are stored to a solid-state drive through USB 3.0 and PCIE interfaces.
Further, the specific method of step S3 is: preprocessing the collected multi-person motion videos with existing open-source two-dimensional pose estimation and tracking; in each frame, the human joint points obtained by pose estimation are linked into continuous joint-point motion trajectories by human tracking, and the obtained trajectories are screened using the displacement vector of the human joint points as a constraint, the displacement vector being calculated as:

V_i = (S_i − S_{i−n}) / n

wherein S_i, S_{i−n} respectively denote the positions of the human joint points in frame i and frame i−n; when the modulus of V_i exceeds a threshold θ, the tracking is considered erroneous; if the number of erroneous frames exceeds K frames, the error is considered uncorrectable and all erroneous results are discarded; if the number of erroneous frames is less than K frames, V_i is updated using the current frame and the last frame before the error; this processing is applied to every human object detected in the video to obtain the two-dimensional joint point trajectories.
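The screening rule above can be sketched as follows, under the assumption that the displacement vector is the average per-frame displacement over the gap to the last good frame (the parameter names `n`, `theta`, `K` mirror the text but the exact bookkeeping is illustrative):

```python
import numpy as np

def screen_trajectory(joints, n=1, theta=10.0, K=30):
    """Flag tracking errors in one person's 2D joint trajectory.

    joints: (F, J, 2) array of pixel coordinates over F frames.
    A frame is marked erroneous when the mean per-joint displacement
    relative to the last good frame exceeds theta; a run longer than
    K frames is treated as uncorrectable and dropped entirely.
    Returns a boolean mask of frames considered valid.
    """
    F = joints.shape[0]
    ok = np.ones(F, dtype=bool)
    last_good = 0
    for i in range(n, F):
        # average displacement per frame across the gap (the V_i above)
        v = (joints[i] - joints[last_good]) / max(i - last_good, 1)
        if np.linalg.norm(v, axis=-1).mean() > theta:
            ok[i] = False
            if i - last_good > K:            # too long to correct: drop the run
                ok[last_good + 1:i + 1] = False
        else:
            last_good = i                    # update using the current frame
    return ok
```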
Further, the specific method of step S4 includes: firstly, predicting the three-dimensional pose, in relative coordinates, of the same human object in the initial frames of different views of the collected video with an existing open-source single-view three-dimensional pose estimation method; because the three-dimensional poses of different views at the same moment differ only by a rigid rotation, and this rigid rotation of the human body equals the rotation of the camera extrinsics, one view is selected as the initial view and the rigid rotation of the human three-dimensional pose in every other view relative to that of the initial view is computed; then, the camera rotation of the initial view is set to the identity matrix, and its camera translation is obtained by least squares from the camera intrinsics, the three-dimensional pose estimate and the two-dimensional pose estimate:

t* = argmin_t ‖π(R·J + t) − p‖_2

wherein π is the projection, R and t are the rotation and translation of the camera, J is the three-dimensional joint position obtained by three-dimensional pose estimation, p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection, and t* is the solved translation;

after the extrinsics of the initial-view camera are obtained, the rigid rotation of the three-dimensional pose of the second view is used as the rotation of its camera extrinsics, and the camera extrinsics of that view are obtained through the above formula using the three-dimensional pose of the initial view and the two-dimensional pose of the current view; the above process is repeated to obtain the camera extrinsics of every view.
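The least-squares problem above can be made linear by rearranging the pinhole projection equations, so t comes out of a single `lstsq` call. A sketch assuming a standard intrinsic matrix K (names are illustrative):

```python
import numpy as np

def solve_translation(K, R, J3d, p2d):
    """Linear least-squares camera translation.

    For each joint, pi(K (R J + t)) = p is rearranged into two equations
    linear in t = (tx, ty, tz):
        fx*tx + (cx - u)*tz = -fx*qx - (cx - u)*qz
        fy*ty + (cy - v)*tz = -fy*qy - (cy - v)*qz
    where q = R J. K: 3x3 intrinsics, R: 3x3 rotation,
    J3d: (M, 3) estimated 3D joints, p2d: (M, 2) detected 2D joints.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    q = J3d @ R.T                           # rotated joints
    A, b = [], []
    for (qx, qy, qz), (u, v) in zip(q, p2d):
        A.append([fx, 0.0, cx - u]); b.append(-fx * qx - (cx - u) * qz)
        A.append([0.0, fy, cy - v]); b.append(-fy * qy - (cy - v) * qz)
    t, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return t
```

At least two non-coincident joints are needed to constrain the three unknowns.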
Further, the specific method of step S6 includes:
performing bundle adjustment optimization with the initialized three-dimensional models, the camera parameters and the two-dimensional joint points obtained by two-dimensional pose estimation:

min over R, t, z of: λ_reproj·L_reproj(R, t, z) + λ_motion·L_motion(z) + λ_latent·L_latent(z)

wherein λ_reproj, λ_motion, λ_latent are weights, R, t, z are the camera rotations and translations of each view and the hidden space codes of the multiple persons, L_reproj(R, t, z) is the reprojection term, L_motion(z) is the physical motion prior term, and L_latent(z) is the hidden space motion prior term;
wherein the reprojection term is:

L_reproj(R, t, z) = Σ_{f=1..F} Σ_{v=1..V} Σ_{n=1..N} m·ω·‖π(FK(D(z_n))_f) − p‖_2

wherein m indicates whether the two-dimensional joint point is visible in the frame (1 if visible, 0 otherwise), p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection, ω is the confidence of the two-dimensional joint point, z is the hidden space code, D(z) is the three-dimensional joint rotation of the human body decoded from the hidden space, FK is the forward kinematics transform, and F, V and N are respectively the number of sequence frames, the number of views and the number of persons to be reconstructed;
the physical motion prior term L_motion(z) is:

L_motion(z) = Σ_{n=1..N} Σ_f ‖θ^n_{f+1} − 2·θ^n_f + θ^n_{f−1}‖_2

wherein θ^n_f denotes the three-dimensional joint rotation of the f-th frame of the n-th human object to be reconstructed, θ^n_{f−1} that of the (f−1)-th frame, and θ^n_{f+1} that of the (f+1)-th frame;
the hidden space motion prior term L_latent(z) is:

L_latent(z) = Σ_{n=1..N} ‖z_n‖_2

wherein z_n denotes the hidden space code of the n-th human object to be reconstructed; the hidden space codes obtained by the optimization are decoded, and the human body model in each frame is skinned to obtain the human reconstruction result of the whole sequence.
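The two prior terms are cheap to evaluate on their own. In this sketch the physical motion prior is written as a second-order difference (penalising joint-angle acceleration), which is one plausible reading of the f−1/f/f+1 terms in the text rather than a form the patent states explicitly:

```python
import numpy as np

def motion_prior(theta):
    """Physical motion prior: penalise acceleration of the decoded
    rotation parameters. theta: (F, N_joints, 6) per-frame rotations."""
    acc = theta[2:] - 2.0 * theta[1:-1] + theta[:-2]
    return np.sum(np.linalg.norm(acc.reshape(acc.shape[0], -1), axis=1))

def latent_prior(z):
    """Hidden-space prior: keep each person's code z_n near the origin
    of the standard-normal latent distribution. z: (N_people, dim)."""
    return np.sum(np.linalg.norm(z, axis=-1))
```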
Compared with the prior art, the invention has the following beneficial effects: 1. Multi-person videos are captured synchronously by multi-view RGB cameras, and calibration and reconstruction are performed simultaneously; the equipment and the workflow are simple. 2. Hidden space codes represent human motion, providing motion prior information during the optimization, reducing the dimensionality and the difficulty of the optimization. 3. In the joint optimization, the human structure prior, the physical motion prior and the hidden space motion prior improve the accuracy of the camera parameter solution. 4. For video frames in which a human body is severely occluded, the result for the frame can still be reconstructed from the physical motion prior and the hidden space motion prior even without two-dimensional joint constraints, reducing the influence of mutual occlusion among multiple persons and of tracking errors on the reconstruction.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a hidden spatial motion prior network architecture;
FIG. 3 is a schematic view of a multi-view acquisition system;
FIG. 4 is a schematic representation of a mannequin;
FIG. 5 is a two-dimensional pose estimation versus joint point trajectory results diagram;
fig. 6 is a graph of the reconstruction results.
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings. As shown in fig. 1, the method for reconstructing a multi-person human body model based on hidden space motion coding according to the present invention includes the following steps:
(1) In this embodiment, the human motion information refers to the pose of a human body at a certain moment; the poses at several moments form a motion sequence, the skeleton pose is represented by joint point rotations, and these rotations drive the deformation of the human mesh. For a continuous representation of three-dimensional rotation in a deep neural network, each of the N joint points of the human skeleton represents its three-dimensional rotation with a 6-dimensional parameter. The 6-dimensional parameter consists of the first two column vectors of the three-dimensional rotation matrix; since the third column vector is determined up to sign, it can be obtained by cross-multiplying the first two column vectors. After the network estimates the 6-dimensional parameters, they are split into two column vectors, Gram-Schmidt orthogonalization is applied, and the third column vector is determined by the cross product, so the rotation of each joint point can be represented as a 3 × 3 rotation matrix. Thus a human motion sequence of T frames with N skeletal joint points can be represented by T × N × 6 parameters.
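The 6D-to-rotation-matrix recovery described above can be sketched as:

```python
import numpy as np

def rot6d_to_matrix(x6):
    """Recover a 3x3 rotation matrix from the 6D representation:
    the 6 numbers are two 3-vectors, Gram-Schmidt orthonormalised,
    and the third column is their cross product."""
    a, b = x6[:3], x6[3:]
    c1 = a / np.linalg.norm(a)
    b = b - np.dot(c1, b) * c1        # remove the component along c1
    c2 = b / np.linalg.norm(b)
    c3 = np.cross(c1, c2)
    return np.stack([c1, c2, c3], axis=1)
```

Because the cross product fixes the orientation, the result always has determinant +1, i.e. a proper rotation.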
(2) First, a variational self-coding network as shown in fig. 2 is constructed to learn the above-mentioned human motion information, and a hidden space representation of the human motion information is constructed. The encoder consists of an encoding gating cycle unit, a mean value encoding network and a variance encoding network. The decoder consists of a decoding full-connection network and a decoding gating cyclic unit. Taking the T-frame human motion sequence as input, obtaining the mean value and variance corresponding to the sequence through an encoder, sampling from the distribution formed by the mean value and variance, and recovering the sampling code into the input T-frame human motion sequence after passing through a decoder.
(3) The mean and variance obtained by the encoder are supervised with a standard normal distribution, and the hidden space distribution is constrained by the Kullback-Leibler divergence between the two distributions:

L_KL = KL(P(x) ‖ N(0, I))

wherein N(0, I) denotes the standard normal distribution, P(x) denotes the distribution given by the mean encoding network and the variance encoding network, and KL(P‖N) denotes the Kullback-Leibler divergence between distribution P and distribution N.
Furthermore, an L2 loss between the input and output parameters is used as an additional constraint:

L_param = ‖X_in − X_out‖_2

wherein X_in, X_out are the input and output three-dimensional joint rotation parameters of the human body, ‖·‖_2 denotes the two-norm, and L_param denotes the L2 loss between the input and output parameters. To obtain finer motion details, the rotation parameters further drive the mesh deformation, with the L2 loss between the input and output meshes as a constraint:

L_mesh = ‖M(X_in) − M(X_out)‖_2

wherein M is the differentiable skinning operation, which takes rotation parameters as input and outputs the deformed three-dimensional human body mesh, and L_mesh denotes the L2 loss between the input and output meshes.
Finally, the training loss of the variational self-coding network is:

L = λ_kl·KL(P‖N) + λ_param·L_param + λ_mesh·L_mesh

wherein λ_kl, λ_param, λ_mesh are weights.
(4) Because human motion data is limited, data augmentation is used when training the network to enhance the generalization of the trained model. The strategies are: 1. Strided sampling of existing motion sequences: one frame is sampled every m frames of the original sequence to form a T-frame motion sequence as input. 2. Reverse-order sampling: the sampled motion sequence is reversed in order and used as input. 3. Mirror flipping: the rotations of the joint points of symmetric limbs (such as the left and right hands) are exchanged to form a new motion sequence as input. The network is trained under the supervision of step (3) until convergence.
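The three augmentation strategies can be sketched on a (frames × joints × 6) motion array. The left/right joint index pairs are placeholders (a real skeleton defines its own symmetry map), and a full mirror flip would also negate the appropriate rotation components, which is skipped here:

```python
import numpy as np

def augment(seq, m=2, T=16, left_ids=(1, 3), right_ids=(2, 4)):
    """Return [strided, reversed, mirrored] variants of a motion array.

    seq: (frames, joints, 6) rotation parameters.
    1. sample one frame every m frames, keep at most T frames;
    2. reverse the sampled sequence in time;
    3. swap the rotations of symmetric left/right joints.
    """
    strided = seq[::m][:T]                 # 1. strided sampling
    rev = strided[::-1].copy()             # 2. reverse order
    mirrored = strided.copy()              # 3. swap symmetric limbs
    mirrored[:, list(left_ids)], mirrored[:, list(right_ids)] = \
        strided[:, list(right_ids)].copy(), strided[:, list(left_ids)].copy()
    return [strided, rev, mirrored]
```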
(5) After the network training is completed, the hidden space encoding can represent a complete human motion sequence with smaller space dimension. And (3) fixing the network parameters of the decoder, randomly giving a hidden space code, and decoding by the decoder to obtain a human motion sequence. By adjusting the implicit spatial coding, the decoded motion sequence also produces corresponding motion variations.
Establishing a multi-view camera system:
(6) The acquisition system has a simple structure; multi-view acquisition is performed with only a few FLIR Blackfly industrial cameras. The cameras are fixed on tripods and placed appropriately around the area to be captured. Synchronous acquisition is triggered in hardware, and the captured data are stored to a solid-state drive through USB 3.0 and a PCIE interface, as shown in fig. 3.
Estimating and tracking the two-dimensional posture of the human body:
(7) The collected multi-person motion video is preprocessed with existing open-source two-dimensional pose estimation and tracking. In each frame, the human joint points obtained by pose estimation are linked into continuous joint-point motion trajectories by human tracking. Because of occlusion and interaction caused by the motion of multiple persons, the tracking results contain noise. To filter out erroneous tracking results, the acquired trajectories are screened using the displacement vector of the human joint points as a constraint, calculated as:

V_i = (S_i − S_{i−n}) / n

wherein S_i, S_{i−n} respectively denote the positions of the human joint points in frame i and frame i−n. When the modulus of V_i exceeds a threshold θ (generally determined by the camera frame rate and focal length; θ is 10 in this embodiment), the tracking is considered erroneous; if the number of erroneous frames exceeds t frames (t is generally chosen according to the camera frame rate; t is 30 in this embodiment), the error is considered uncorrectable and all erroneous results are discarded, as shown in fig. 5. If the number of erroneous frames is less than t frames, V_i is updated using the current frame and the last frame before the error. Step (7) is applied to every human object detected in the video to obtain the two-dimensional joint point trajectories.
Initial camera parameter acquisition:
(8) The joint solution of the camera parameters and the three-dimensional human body model is a highly non-convex optimization problem, and the accuracy of the initial values strongly affects the accuracy of the final result. Since the camera intrinsics are determined only by the camera hardware, the intrinsics are assumed known in the present invention. The invention provides a method for acquiring the initial camera extrinsics from the three-dimensional pose of the initial frame of the multi-view video. Firstly, the three-dimensional pose, in relative coordinates, of the same human object in the initial frames of different views is predicted with an existing open-source single-view three-dimensional pose estimation method. Because the three-dimensional poses of different views at the same moment differ only by a rigid rotation, this rigid rotation of the human body equals the rotation of the camera extrinsics. Therefore, one view is selected as the initial view, and the rigid rotation of the human three-dimensional pose in every other view relative to that of the initial view is computed. Then, the camera rotation of the initial view is set to the identity matrix, and its camera translation is obtained by least squares from the camera intrinsics, the three-dimensional pose estimate and the two-dimensional pose estimate:

t* = argmin_t ‖π(R·J + t) − p‖_2

wherein π is the projection, R and t are the rotation and translation of the camera, J is the three-dimensional joint position obtained by three-dimensional pose estimation, and p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection. After the extrinsics of the initial-view camera are obtained, the rigid rotation of the three-dimensional pose of the second view is used as the rotation of its camera extrinsics, and the camera extrinsics of that view are obtained through the above formula using the three-dimensional pose of the initial view and the two-dimensional pose of the current view. The above process is repeated to obtain the camera extrinsics of every view.
Obtaining an initial three-dimensional joint point and initializing a human body model:
(9) The three-dimensional position of each joint point of each human body to be reconstructed in each frame is calculated by geometric triangulation from the two-dimensional pose results of step (7) and the initial camera parameters of step (8).
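Geometric triangulation of one joint from several views is typically done with the linear DLT method; a sketch assuming known per-view projection matrices P = K[R | t] (the patent does not name the exact algorithm):

```python
import numpy as np

def triangulate(P_list, pts2d):
    """DLT triangulation of one 3D point from multiple views.

    P_list: per-view 3x4 projection matrices K [R | t].
    pts2d: (V, 2) pixel observations of the same joint.
    Each view contributes two rows u*P[2]-P[0], v*P[2]-P[1];
    the point is the null vector of the stacked system (smallest
    right singular vector), dehomogenised at the end.
    """
    A = []
    for P, (u, v) in zip(P_list, pts2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]
```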
(10) To reduce the difficulty of the optimization, the three-dimensional human body model must be initialized. For each frame of the sequence to be fitted, the joint points of the initial human model are aligned with the three-dimensional joint points obtained in step (9), and the aligned initial three-dimensional human body model is taken as the initial state of the fitting. The alignment of each human body to be reconstructed in the video sequence is completed in turn.
Solving by bundle adjustment based on hidden-space motion coding:
(11) Bundle adjustment optimization is performed using the initialized three-dimensional model, the camera parameters, and the two-dimensional joints obtained by two-dimensional pose estimation. The formula is as follows:
$$\min_{R,t,z}\; \lambda_{reproj} L_{reproj}(R,t,z) + \lambda_{motion} L_{motion}(z) + \lambda_{latent} L_{latent}(z)$$

where λ_reproj, λ_motion, λ_latent are weights, and R, t, z are the camera rotation and translation of each view and the hidden-space codes of the multiple human bodies. The reprojection term is:
$$L_{reproj}(R,t,z) = \sum_{f=1}^{F}\sum_{v=1}^{V}\sum_{n=1}^{N} m_{f,v}^{n}\, \omega_{f,v}^{n} \left\| \pi\left(FK\left(D(z_{n})\right)_{f},\, R_{v}, t_{v}\right) - p_{f,v}^{n} \right\|^{2}$$
where m indicates whether the two-dimensional joint is visible in the frame (1 if visible, 0 otherwise), p is the two-dimensional joint position, ω the confidence of the two-dimensional joint, and z the hidden-space code. D(z) is the three-dimensional joint rotation obtained by decoding the hidden-space code, and FK is the forward kinematics transformation. F, V and N are the number of frames in the sequence, the number of views, and the number of people to be reconstructed, respectively.
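As a hedged illustration of this reprojection term, the visibility- and confidence-weighted squared error over frames, views and people can be computed as below. The (F, V, N, J, …) array layout is our assumption, and the projection π and forward kinematics FK are taken as already applied, so the inputs are 2D joint arrays:

```python
import numpy as np

def reprojection_loss(pred2d, obs2d, vis, conf):
    """Visibility- and confidence-weighted squared reprojection error.
    pred2d, obs2d: (F, V, N, J, 2) projected model joints and detections;
    vis, conf:     (F, V, N, J) visibility mask and detector confidence."""
    err = np.sum((pred2d - obs2d) ** 2, axis=-1)   # per-joint squared error
    return float(np.sum(vis * conf * err))
```

Joints that the detector missed (m = 0) contribute nothing, so occlusions in individual views do not corrupt the fit.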
The physical motion prior term is:
$$L_{motion}(z) = \sum_{n=1}^{N}\sum_{f=2}^{F-1} \left\| 2\,\theta_{f}^{n} - \theta_{f-1}^{n} - \theta_{f+1}^{n} \right\|^{2}$$

where θ_f^n denotes the three-dimensional joint rotation of frame f of the n-th human body to be reconstructed, and θ_{f-1}^n and θ_{f+1}^n denote those of frames f-1 and f+1.
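The physical motion prior is a second-difference (acceleration) smoothness penalty on the per-frame joint rotations; a minimal numpy sketch (the (N, F, D) array layout is our assumption):

```python
import numpy as np

def motion_prior(theta):
    """Second-difference smoothness: penalizes
    || 2*theta_f - theta_{f-1} - theta_{f+1} ||^2 summed over persons
    and interior frames.  theta: (N, F, D) rotation parameters."""
    accel = 2 * theta[:, 1:-1] - theta[:, :-2] - theta[:, 2:]
    return float(np.sum(accel ** 2))
```

Motions with constant per-frame velocity incur zero penalty; only accelerations (jitter) are penalized, which is what makes this a physical plausibility prior rather than a damping term.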
The hidden spatial motion prior term is:
$$L_{latent}(z) = \sum_{n=1}^{N} \left\| z_{n} \right\|^{2}$$

where z_n denotes the hidden-space code of the n-th human body to be reconstructed. The hidden-space codes obtained by the optimization are decoded, and the human model is skinned in each frame to obtain the human reconstruction result of the whole sequence.
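The whole hidden-space bundle adjustment can be illustrated on a toy problem. Everything below is a simplified stand-in for the patent's optimizer, not its actual implementation: a fixed random linear map plays the role of the trained decoder D(z), an orthographic camera replaces π ∘ FK, a single person is fitted, and the latent code is optimized by plain gradient descent with finite-difference gradients over the three terms of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
F, J, Z = 8, 4, 6                        # frames, joints, latent dim
W = rng.normal(size=(F * J * 3, Z))      # stand-in linear "decoder" D(z)

def decode(z):
    """Toy decoder: latent code -> (F, J, 3) joint positions."""
    return (W @ z).reshape(F, J, 3)

def objective(z, obs2d, lam_motion=0.01, lam_latent=0.01):
    """Reprojection + physical motion prior + latent prior, mirroring
    the bundle-adjustment objective (toy orthographic camera)."""
    joints = decode(z)
    proj = joints[..., :2]                               # orthographic pi
    reproj = np.sum((proj - obs2d) ** 2)
    accel = 2 * joints[1:-1] - joints[:-2] - joints[2:]  # 2nd difference
    motion = np.sum(accel ** 2)
    latent = np.sum(z ** 2)
    return reproj + lam_motion * motion + lam_latent * latent

def num_grad(f, z, eps=1e-5):
    """Central finite-difference gradient (fine for a 6-dim toy problem)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

# synthesize observations from a ground-truth code, then fit from zero
z_true = rng.normal(size=Z)
obs2d = decode(z_true)[..., :2]
z = np.zeros(Z)
for _ in range(300):
    z -= 1e-3 * num_grad(lambda w: objective(w, obs2d), z)
```

The actual method additionally optimizes per-view camera rotations and translations, decodes through the trained GRU decoder, and projects perspectively; the structure of the objective is the same.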
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the specific embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is intended to be covered by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (6)

1. A multi-person human body model reconstruction method based on hidden space motion coding is characterized by comprising the following steps:
s1, hidden space motion prior training: extracting a human body motion sequence from the existing three-dimensional human body model data set containing the motion sequence, using the human body motion sequence for variational self-coding network training, and fixing decoder parameters of the variational self-coding network after the training is finished for subsequent iterative optimization;
s2, building a multi-view camera system and collecting multi-human motion videos;
s3, preprocessing a multi-human motion video acquired by the multi-view camera system set up in the step S2 by using the existing open-source two-dimensional attitude estimation and tracking to obtain a two-dimensional joint point track;
s4, acquiring initial camera external parameters by utilizing the three-dimensional posture of the initial frame of the multi-view video;
s5, calculating the three-dimensional position of each joint point of each human body to be reconstructed in each frame through geometric triangulation by using the two-dimensional joint point track obtained in the step S3 and the initial camera external reference obtained in the step S4, aligning the joint points of the human body initial model with the three-dimensional position of each joint point, and sequentially finishing the alignment of each human body to be reconstructed in the multi-human body motion video by taking the aligned initial three-dimensional human body model as a fitting initial state;
s6, with the three-dimensional human body model initialized in the step S5 as an initial state, decoding the hidden space coding by using a decoder of the self-coding network trained in the step S1, driving the three-dimensional human body model to deform, iteratively optimizing the hidden space coding and the initial camera external parameters obtained in the step S4, enabling the joint point projection of the three-dimensional human body model to be consistent with the two-dimensional joint point track obtained in the step S3, and finally covering the human body model in each frame to obtain the human body reconstruction result of the whole sequence.
2. The hidden spatial motion coding based multi-person human body model reconstruction method according to claim 1, wherein the step S1 includes the following specific steps:
s11, extracting a human motion sequence from the existing three-dimensional human model data set containing motion sequences, wherein the human motion information in each frame is represented by a three-dimensional human model with a skinned skeleton, the skeletal pose of the human motion sequence is represented by joint rotations, the human mesh is driven to deform by the joint rotations, and a T-frame sequence is represented by T × N × 6 parameters;
s12, firstly, constructing a variational self-coding network to learn the human motion information in the step S11 and constructing a hidden space representation of the human motion information, wherein an encoder of the variational self-coding network consists of an encoding gating cycle unit, a mean value encoding network and a variance encoding network, a decoder of the variational self-coding network consists of a decoding full-connection network and a decoding gating cycle unit, a T-frame human motion sequence is used as input, a mean value and a variance corresponding to the sequence are obtained through the encoder of the variational self-coding network, sampling is carried out from distribution consisting of the mean value variances, and the sampling codes are recovered into the input T-frame human motion sequence after passing through the decoder of the variational self-coding network;
s13, the mean and variance obtained for the sequence by the encoder of the variational self-coding network are supervised with a standard normal distribution, and the hidden-space distribution is constrained by the Kullback-Leibler divergence between the two distributions, with the formula:

$$L_{kl} = KL\left(P(x)\,\|\,N(0, I)\right)$$

where N(0, I) denotes the standard normal distribution, P(x) denotes the distribution given by the mean encoding network and the variance encoding network, i.e. N(μ, σ²), and KL(P‖N) denotes the Kullback-Leibler divergence between distribution P and distribution N;
in addition, an L2 loss between the input parameters and the output parameters is used as a further constraint:

$$L_{param} = \left\| X_{in} - X_{out} \right\|_{2}$$

where X_in and X_out are the input and output three-dimensional joint rotation parameters respectively, ‖·‖₂ denotes the two-norm, and L_param denotes the L2 loss between the input and the output parameters;
the mesh deformation is further driven with the rotation parameters, with the L2 loss between the input and output meshes as constraints:
$$L_{mesh} = \left\| M(X_{in}) - M(X_{out}) \right\|_{2}$$

where M is the differentiable skinning operation, which takes the rotation parameters as input and outputs the deformed three-dimensional human mesh, and L_mesh denotes the L2 loss between the input and the output meshes;
finally, the training loss of the variational self-coding network is:
$$L = \lambda_{kl} L_{kl} + \lambda_{param} L_{param} + \lambda_{mesh} L_{mesh}$$

where λ_kl, λ_param, λ_mesh are weights;
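The three training losses of this claim can be written out directly. The log-variance parameterization of the diagonal Gaussian is our assumption (standard in VAE implementations), and the closed form of the KL term follows from comparing N(μ, σ²) with N(0, I); skinned vertices are passed in directly since the skinning function M is model-specific:

```python
import numpy as np

def kl_loss(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian,
    with logvar = log sigma^2: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * float(np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0))

def param_loss(x_in, x_out):
    """L2 loss on the joint-rotation parameters."""
    return float(np.linalg.norm(x_in - x_out))

def mesh_loss(v_in, v_out):
    """L2 loss on skinned mesh vertices (M(X_in) vs M(X_out))."""
    return float(np.linalg.norm(v_in - v_out))

def total_loss(mu, logvar, x_in, x_out, v_in, v_out,
               lam_kl=1e-3, lam_param=1.0, lam_mesh=1.0):
    """Weighted training loss L = lam_kl*L_kl + lam_param*L_param
    + lam_mesh*L_mesh (the weight values are illustrative)."""
    return (lam_kl * kl_loss(mu, logvar)
            + lam_param * param_loss(x_in, x_out)
            + lam_mesh * mesh_loss(v_in, v_out))
```

The KL term vanishes exactly when the encoder outputs μ = 0, σ = 1, i.e. when the posterior already matches the standard normal prior.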
s14, because human motion data are limited, data augmentation is used in the process of training the variational self-coding network to enhance the generalization of the trained model; the strategies are as follows:
s141, strided sampling of the existing motion sequences: one frame is sampled every m frames of the original sequence to form a T-frame motion sequence as input;
s142, sampling in a reverse order: reversing the sequence of the sampled motion sequence to be used as input;
s143, mirror flipping: the rotations of the joints corresponding to the symmetric limbs of the human body are exchanged to form a new motion sequence as input; the network is trained under the supervision of step S13 until convergence;
s15, after the steps S11-S14 are completed, the hidden-space code can represent a complete human motion sequence with a much smaller dimensionality; the network parameters of the decoder are fixed, a hidden-space code is given at random, and a human motion sequence is obtained after decoding by the decoder; adjusting the hidden-space code produces a corresponding change of motion in the decoded sequence.
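The three augmentation strategies S141-S143 can be sketched as follows. The array shapes and joint-pair indices are illustrative; note that a fully faithful mirror flip would also negate the rotation components across the sagittal plane, a detail this sketch omits:

```python
import numpy as np

def stride_sample(seq, m, T):
    """S141: take every m-th frame of a longer sequence to build
    a T-frame training clip.  seq: (F, J, D) with F >= m*T."""
    return seq[::m][:T]

def reverse_order(seq):
    """S142: play the sampled clip backwards."""
    return seq[::-1].copy()

def mirror_flip(seq, swap_pairs):
    """S143: exchange the rotation channels of symmetric joints.
    seq: (T, J, D); swap_pairs: [(left_idx, right_idx), ...].
    (A faithful mirror also flips rotation signs; omitted here.)"""
    out = seq.copy()
    for l, r in swap_pairs:
        out[:, [l, r]] = out[:, [r, l]]
    return out
```

All three operations are involutive or cheap to compose, so each original clip yields several distinct training sequences.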
3. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein in step S2 the multi-view camera system performs multi-view acquisition with a plurality of FLIR Blackfly industrial cameras, which are fixed on tripods, placed around the capture area, and synchronized by hardware triggering, the captured data being stored on a solid state disk through a USB3.0 PCIE interface.
4. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein the specific method of step S3 is: the collected multi-human motion videos are preprocessed with existing open-source two-dimensional pose estimation and tracking; the human joints obtained by pose estimation in each frame of image are linked by human tracking into continuous joint trajectories, and the obtained trajectories are screened with the displacement vectors of the human joints as a constraint, the displacement vector being computed as:
$$\vec{d} = S_{i} - S_{i-n}$$

where S_i and S_{i-n} respectively denote the human joint positions of the i-th and (i-n)-th frames; when the modulus of the displacement vector exceeds a threshold θ, the tracking is considered erroneous; if the number of erroneous frames exceeds K, the error is considered uncorrectable and all erroneous results are discarded; if fewer than K frames are erroneous, the displacement vector is updated using the current frame and the last frame before the error;
And performing the processing on each human body object detected in the video to obtain a two-dimensional joint point track.
5. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein the specific method of step S4 includes: firstly, an existing open-source single-view three-dimensional pose estimation method is used to predict, in relative coordinates, the three-dimensional pose of the same human subject in the initial frame of each view of the collected video; because the three-dimensional poses of different views at the same moment differ only by a rigid rotation, and this rigid rotation of the human body equals the rotation of the camera extrinsics, one view is selected as the initial view, and the rigid rotation of the three-dimensional pose of each other view relative to that of the initial view is computed; then, the camera rotation of the initial view is set to the identity matrix, and the camera translation is solved by least squares from the camera intrinsics, the three-dimensional pose estimate, and the two-dimensional pose estimate, with the formula:
$$t^{*} = \arg\min_{t} \sum_{j} \left\| \pi\left(R J_{j} + t\right) - p_{j} \right\|^{2}$$

where π is the projection, R and t are the camera rotation and translation respectively, J the three-dimensional joint positions obtained by three-dimensional pose estimation, p the two-dimensional joint pixel coordinates obtained by two-dimensional pose detection, and t* the solved translation;
after the camera extrinsics of the initial view are obtained, the rigid rotation of the three-dimensional pose of the second view is used as the rotation of its camera extrinsics, and the extrinsics of the current view are obtained through the above formula using the three-dimensional pose of the initial view and the two-dimensional pose of the current view; the process is repeated to obtain the camera extrinsics of each view.
6. The hidden spatial motion coding-based multi-person human body model reconstruction method according to claim 1, wherein the specific method of step S6 includes:
bundle adjustment optimization is performed using the initialized three-dimensional model, the camera parameters, and the two-dimensional joints obtained by two-dimensional pose estimation, with the formula:
$$\min_{R,t,z}\; \lambda_{reproj} L_{reproj}(R,t,z) + \lambda_{motion} L_{motion}(z) + \lambda_{latent} L_{latent}(z)$$

where λ_reproj, λ_motion, λ_latent are weights, R, t, z are the camera rotation and translation of each view and the hidden-space codes of the multiple human bodies, L_reproj(R,t,z) is the reprojection term, L_motion(z) the physical motion prior term, and L_latent(z) the hidden-space motion prior term;
wherein the reprojection term is:
$$L_{reproj}(R,t,z) = \sum_{f=1}^{F}\sum_{v=1}^{V}\sum_{n=1}^{N} m_{f,v}^{n}\, \omega_{f,v}^{n} \left\| \pi\left(FK\left(D(z_{n})\right)_{f},\, R_{v}, t_{v}\right) - p_{f,v}^{n} \right\|^{2}$$

where m indicates whether the two-dimensional joint is visible (1 if visible, 0 otherwise), p is the two-dimensional joint pixel coordinate obtained by two-dimensional pose detection, ω the confidence of the two-dimensional joint, z the hidden-space codes of the multiple human bodies, D(z) the three-dimensional joint rotation obtained by decoding the hidden-space code, FK the forward kinematics transformation, and π the projection; F, V and N are the number of frames of the sequence, the number of views, and the number of people to be reconstructed respectively, and f, v and n the corresponding indices, f ∈ (1, 2, 3, ..., F), v ∈ (1, 2, 3, ..., V), n ∈ (1, 2, 3, ..., N);
the physical motion prior term L_motion(z) is:

$$L_{motion}(z) = \sum_{n=1}^{N}\sum_{f=2}^{F-1} \left\| 2\,\theta_{f}^{n} - \theta_{f-1}^{n} - \theta_{f+1}^{n} \right\|^{2}$$

where θ_f^n is the three-dimensional joint rotation of frame f of the n-th human body to be reconstructed, and θ_{f-1}^n and θ_{f+1}^n those of frames f-1 and f+1;
the hidden-space motion prior term L_latent(z) is:

$$L_{latent}(z) = \sum_{n=1}^{N} \left\| z_{n} \right\|^{2}$$

where z_n denotes the hidden-space code of the n-th human body to be reconstructed; the hidden-space codes obtained by the optimization solution are decoded, and the human model is skinned in each frame to obtain the human reconstruction result of the whole sequence.
CN202110758141.XA 2021-07-05 2021-07-05 Hidden space motion coding-based multi-person human body model reconstruction method Active CN113379904B (en)

Publications (2)

Publication Number Publication Date
CN113379904A CN113379904A (en) 2021-09-10
CN113379904B (en) 2022-02-15



