CN114926594A - Single-view occluded human motion reconstruction method based on self-supervised spatio-temporal motion prior - Google Patents


Info

Publication number
CN114926594A
Authority
CN
China
Prior art keywords
human body
dimensional
motion
network
shielding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210684494.4A
Other languages
Chinese (zh)
Inventor
王雁刚
黄步真
束愿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202210684494.4A
Publication of CN114926594A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a method for reconstructing occluded human motion sequences based on a self-supervised spatio-temporal motion prior, comprising the following steps: S1, synthesizing and representing human motion; S2, constructing an occluded-human spatio-temporal prior network; S3, training the occluded-human spatio-temporal prior network; S4, constructing a three-dimensional motion reconstruction network; S5, training the three-dimensional motion reconstruction network; S6, estimating the global position; and S7, reconstructing single-view occluded human motion in real time. The method can rapidly synthesize a large amount of occlusion data without impairing the model's generalization to real data, and removes the strong dependence of existing methods on real occluded human data.

Description

Single-view occluded human motion reconstruction method based on self-supervised spatio-temporal motion prior
Technical Field
The invention relates to the fields of computer vision and three-dimensional vision, and in particular to a single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior.
Background Art
Human motion reconstruction plays an important role in applications such as holographic communication, behavior analysis, and motion capture. With the continuing development of artificial intelligence and related technologies, single-view human motion reconstruction from a single RGB camera enjoys broad market demand thanks to its low cost and easy deployment. However, existing single-view methods struggle to reconstruct human motion under occlusion, so single-view occluded human motion reconstruction has become an urgent problem. Current reconstruction schemes that address occlusion face two main difficulties: first, occluded three-dimensional human motion data are scarce, making it hard to train a well-generalizing model by deep learning; second, the missing depth and the occlusion in a single-view image leave the problem under-determined, causing severe ambiguity and making it difficult for a network model to regress reliable, accurate three-dimensional motion. It is therefore a practically meaningful and challenging problem to reconstruct occluded human motion by a self-supervised method that learns spatio-temporal motion priors of occluded human bodies from synthetic occluded human motion data and models spatio-temporal features at the joint point level.
Disclosure of Invention
Objective: the invention provides a single-view occluded human motion sequence reconstruction method based on a self-supervised spatio-temporal motion prior. It synthesizes occluded motion sequences from real two-dimensional human data to build a training data set, and learns the spatio-temporal features of occluded human motion at both the local joint point level and the global motion level from the synthetic two-dimensional training data, by constructing a human-motion spatio-temporal prior encoder based on dilated convolutions and a Transformer. The trained spatio-temporal prior encoder is then used to assist the training of a three-dimensional motion reconstruction network. For real-time single-view occluded human motion reconstruction, a two-dimensional occluded-human-motion joint point map is extracted from a single-view video of an occluded person and fed into the trained network for forward inference and global position estimation, yielding real-time three-dimensional reconstruction of the occluded human motion.
Technical scheme: the invention discloses an occluded human motion sequence reconstruction method based on a self-supervised spatio-temporal motion prior, comprising the following steps:
S1, human motion synthesis and representation: to avoid the high cost of producing occluded human data sets, the invention proposes a joint point map representation of two-dimensional occluded human motion together with a human occlusion data generation method, which can rapidly synthesize a large amount of occlusion data without impairing the model's generalization to real data. Occluders are obtained by instance segmentation on web videos and randomly pasted over pictures from an occlusion-free human data set to synthesize occlusion, and two-dimensional occluded-human-motion joint point maps are then generated from the two-dimensional complete-human-motion joint point maps for network training.
S2, constructing the occluded-human spatio-temporal prior network: occluded two-dimensional human joints are highly ambiguous, and accurate three-dimensional complete human motion is difficult to estimate from them directly. The invention constructs an encoder comprising a local joint-point-level spatio-temporal feature extraction network and a global motion-level spatio-temporal feature extraction network, which extract spatio-temporal features at the two levels respectively. In addition, a decoder consisting of a fully connected network is constructed to regress the two-dimensional complete-human-motion joint point map from the code estimated by the encoder.
S3, training the occluded-human spatio-temporal prior network: taking the synthesized two-dimensional occluded-human-motion joint point maps as input and the corresponding occlusion-free two-dimensional complete-human-motion joint point maps as supervision, the occluded-human spatio-temporal prior network constructed in S2 is trained in a self-supervised manner until convergence, learning prior knowledge of occluded two-dimensional human motion.
S4, constructing the three-dimensional motion reconstruction network: a network identical to the encoder of the occluded-human spatio-temporal prior network in S2 is built as the encoder, and a fully connected layer plus two Transformer modules serve as the decoder, estimating a three-dimensional complete-human-motion map from the two-dimensional occluded-human-motion joint point map.
S5, training the three-dimensional motion reconstruction network: the encoder parameters of the occluded-human spatio-temporal prior network trained in S3 are used as pre-training parameters and assigned to the encoder of the three-dimensional motion reconstruction network constructed in S4, so that the reconstruction network inherits the prior knowledge of occluded two-dimensional human motion. The network is then trained with the synthesized two-dimensional occluded-human-motion joint point maps as input and the three-dimensional complete-human-motion maps as supervision, fine-tuning the encoder parameters during training, until convergence.
S6, global position estimation: the three-dimensional complete-human-motion map estimated in S4 is sampled to obtain a three-dimensional human motion sequence. Because this sequence lives in a local coordinate system, the invention constructs a least-squares objective from the joint positions of the three-dimensional skeleton and the input occluded two-dimensional joint positions to estimate the global translation, which is then applied to the three-dimensional human motion sequence to obtain absolute positions.
S7, real-time single-view occluded human motion reconstruction: after network training is complete, a single-view motion capture system is built with one RGB camera. A human motion video captured by the single-view camera is input, and an existing open-source two-dimensional joint detector estimates per-frame joint positions and confidences, producing a two-dimensional occluded-human-motion joint point map. The three-dimensional motion reconstruction network then estimates the three-dimensional complete-human-motion map, and the method of step S6 recovers a three-dimensional human motion sequence with absolute positions, achieving single-view occluded three-dimensional human motion reconstruction. The global human motion sequence can further be skinned to obtain a deforming human mesh model.
Further, the specific method of step S1 includes:
s11, obtaining a random shielding object: and (4) obtaining an obstruction Mask and an RGB picture by Mask-RCNN in a network video by using example segmentation for simulating obstruction.
S12, acquiring two-dimensional joint point maps of complete human motion: two-dimensional human motion sequences are extracted from the human data set, where the two-dimensional pose of each frame is represented by the coordinates of the K human body joint points. For a two-dimensional human motion sequence of F frames with K joint points, to ease network computation the root joint coordinates (x_root, y_root) are first subtracted from every joint coordinate (x, y), and the result is divided by the skeleton bounding-box size w × h to achieve normalization:

(x̂, ŷ) = ((x − x_root) / w, (y − y_root) / h)

The normalized coordinates are stored in a two-dimensional joint point map I_2d ∈ R^{F×K×2}, in which each row holds the normalized positions of all joint points of one frame.
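The normalization above can be sketched in a few lines of numpy. This is a minimal illustration, not the patent's implementation: the function name `normalize_joints` is invented here, and the skeleton bounding box is assumed to be the per-frame bounding box of the joints themselves.

```python
import numpy as np

def normalize_joints(joints, root_idx=0):
    """Normalize a 2D motion sequence as described in S12.

    joints: (F, K, 2) array of per-frame 2D joint coordinates.
    Subtracts the root joint and divides by the skeleton
    bounding-box size (w, h) of each frame.
    """
    root = joints[:, root_idx:root_idx + 1, :]      # (F, 1, 2) root per frame
    centered = joints - root                        # (x - x_root, y - y_root)
    wh = joints.max(axis=1) - joints.min(axis=1)    # (F, 2) bounding-box size
    wh = np.maximum(wh, 1e-8)                       # avoid division by zero
    return centered / wh[:, None, :]                # I_2d in R^{F x K x 2}
```

Because every coordinate lies inside its frame's bounding box, the normalized values fall in [-1, 1] and the root row is exactly zero.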
S13, acquiring two-dimensional joint point maps of occluded human motion: because the joint point map of a human motion is unaffected by image appearance, occluded human motion data can be obtained by adding an occluder to an occlusion-free human picture. The intersection-over-union between the skeleton bounding box and the occluder obtained in S11 is computed and constrained to lie between X_1 and X_2, realizing occlusions of different proportions. From the occluder's mask map, a confidence map I_c ∈ R^{F×K×1} is generated by setting the value of a joint point lying inside the mask to 0 and of a joint point lying outside the mask to 1. Multiplying the confidence map with the two-dimensional joint point map (element-wise, broadcast over the two coordinate channels) yields the joint point map of the two-dimensional occluded human motion. This representation can rapidly synthesize a large amount of occlusion data without impairing the model's generalization to real data.
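The masking step described above can be sketched as follows; this is a hedged illustration, assuming the occluder mask has already been pasted at its target location in image coordinates. `synthesize_occlusion` is a hypothetical name, and the IoU-constrained placement between X_1 and X_2 is omitted.

```python
import numpy as np

def synthesize_occlusion(joints, mask):
    """Sketch of the S13 masking step.

    joints: (F, K, 2) pixel coordinates of 2D joints.
    mask:   (H, W) binary occluder mask pasted onto each frame
            (1 = occluder pixel), as obtained from instance
            segmentation.
    Returns (occluded_joints, confidence), where confidence
    plays the role of the I_c map in R^{F x K x 1}.
    """
    xs = joints[..., 0].astype(int).clip(0, mask.shape[1] - 1)
    ys = joints[..., 1].astype(int).clip(0, mask.shape[0] - 1)
    inside = mask[ys, xs]                     # 1 where a joint is covered
    confidence = (1 - inside)[..., None].astype(joints.dtype)  # (F, K, 1)
    return joints * confidence, confidence
```

Joints under the occluder are zeroed out, exactly mirroring how occluded detections would drop out at inference time.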
S14, obtaining three-dimensional complete-human-motion maps: three-dimensional human motion sequences are extracted from an existing human data set, where the three-dimensional pose of each frame is represented by a skeleton-skinned three-dimensional human model. For a sequence of F frames with N joint points, the skeletal pose of the three-dimensional human model is represented by joint rotations, which drive the deformation of the human mesh; the three-dimensional complete human motion sequence is stored in a three-dimensional complete-human-motion map I_3d ∈ R^{F×N×6}.
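The six channels per joint in I_3d are consistent with the widely used continuous 6D rotation representation; this is an assumption, since the text does not name the representation. Under that assumption, a rotation matrix can be recovered from a 6D vector by Gram-Schmidt orthogonalization:

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Recover a rotation matrix from a 6D rotation vector by
    Gram-Schmidt (assumed interpretation of the 6 channels per
    joint in I_3d; not stated explicitly in the patent text)."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)          # first basis vector
    b2 = a2 - np.dot(b1, a2) * b1         # remove the b1 component
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                 # right-handed third axis
    return np.stack([b1, b2, b3], axis=1) # columns are the basis
```

The resulting matrix is orthonormal with determinant +1, which is what makes the 6D parameterization convenient for regression networks.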
Further, the specific method of step S2 is:
s21, because the time sequence relation and the spatial relation are considered at different stages respectively, information loss can be caused, and the constructed encoder comprises 2 modules: and extracting the space-time characteristics of different layers by a local joint point layer space-time characteristic extraction network and a global motion layer space-time characteristic extraction network. The local spatio-temporal relationship module is used to model local spatio-temporal features, which contain 4 convolutional layers, where the first 3 convolutional layers are 3 convolutional layers with expansion coefficients of 1, 2, 5, respectively. And splicing the outputs of the three expansion convolutional layers, and fusing the space-time characteristics through the last convolutional layer.
S22, because convolutional layers have limited capacity for modeling temporal continuity, a global motion-level spatio-temporal feature extraction network is further constructed to model global spatio-temporal features. It comprises a global spatial relation module and a global temporal relation module and adopts a Transformer network structure. A spatial embedding is added for the first module and a temporal embedding for the second module.
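At its core, each Transformer module applies scaled dot-product self-attention. A stripped-down numpy sketch of the spatial-then-temporal attention pattern follows; learned projections, layer norm, and feed-forward blocks are omitted, and the point at which the embeddings are added is an assumption, since the text is ambiguous about it.

```python
import numpy as np

def attention(x):
    """Self-attention with Q = K = V = x (no learned projections,
    to keep the sketch minimal). x: (N, D)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)             # rows sum to 1
    return p @ x

def global_module(feat, spatial_emb, temporal_emb):
    """feat: (F, K, D). Spatial attention mixes joints within a
    frame; temporal attention mixes frames per joint, with the
    respective embedding added beforehand (assumed placement)."""
    F_, K_, _ = feat.shape
    feat = feat + spatial_emb                      # (K, D), broadcast over frames
    feat = np.stack([attention(feat[f]) for f in range(F_)])
    feat = feat + temporal_emb[:, None, :]         # (F, D), broadcast over joints
    return np.stack([attention(feat[:, k]) for k in range(K_)], axis=1)
```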
S23, a decoder consisting of a fully connected network is constructed, which regresses the two-dimensional complete-human-motion joint point map from the code estimated by the encoder.
Further, the specific method of step S3 is:
S31, to enhance the generalization ability of the network, the training data are augmented. The strategies are: 1) mirror flipping: because the human body is bilaterally symmetric, the data are multiplied by interchanging the rotations of symmetric joints; 2) multi-rate sampling: new human motion sequences are obtained by sampling the original sequence at different sampling rates; 3) reverse-order sampling: the original human motion sequence is reversed and used as a new motion sequence.
S32, the synthesized two-dimensional occluded-human-motion joint point map is taken as input, and the occluded-human spatio-temporal prior network outputs a two-dimensional complete-human-motion joint point map, which is supervised by the ground-truth two-dimensional complete-human-motion joint point map. An L1 loss is used as the constraint:

L_2d = ‖I_2d − Î_2d‖_1

where I_2d and Î_2d are the reconstructed two-dimensional complete-human-motion joint point map and its ground truth respectively, ‖·‖_1 denotes the L1 norm, and L_2d is the L1 loss between the output map and the corresponding ground truth.
In addition, the output two-dimensional complete human body motion joint point graph is further constrained by a smoothing term:
L_smooth,2d = Σ_{t=1}^{F−1} ‖I_2d^{(t+1)} − I_2d^{(t)}‖²

where I_2d^{(t)} and I_2d^{(t+1)} denote the rows of the estimated two-dimensional complete human motion joint point graph at the current and next moments, i.e. its t-th and (t+1)-th rows, and L_smooth,2d denotes the smoothness constraint on the output graph.
Finally, the training loss of the occluded-human spatio-temporal prior network is:

L_self = ω_1 L_2d + ω_2 L_smooth,2d

where ω_1 and ω_2 are weights.
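The two-dimensional losses above are straightforward to express directly; a numpy sketch with invented helper names:

```python
import numpy as np

def l2d_loss(pred, gt):
    """L1 reconstruction loss between the predicted and ground-truth
    two-dimensional joint point maps, both of shape (F, K, 2)."""
    return np.abs(pred - gt).sum()

def smooth_loss(pred):
    """Penalize differences between consecutive rows (frames)."""
    return ((pred[1:] - pred[:-1]) ** 2).sum()

def self_supervised_loss(pred, gt, w1=1.0, w2=1.0):
    """Weighted sum of the reconstruction and smoothness terms."""
    return w1 * l2d_loss(pred, gt) + w2 * smooth_loss(pred)
```

A static prediction that matches its target incurs zero loss, while any frame-to-frame jitter is penalized by the smoothness term.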
Further, the specific method of step S4 is:
s41, constructing a network which is the same as the encoder of the occlusion human body space-time prior network in the S2 and taking a full connection layer and two transform modules which are respectively used for estimating a time sequence relation and a space relation as a decoder, and estimating a three-dimensional complete human body motion graph from a two-dimensional occlusion human body motion closed node graph.
Further, the specific method of step S5 is:
and S51, taking the parameters of the coder of the shielding human body space-time prior network trained in the step S3 as pre-training parameters, and assigning the pre-training parameters to the coder of the three-dimensional motion reconstruction network constructed in the step S4, so that the three-dimensional motion reconstruction network obtains the prior knowledge of the shielding two-dimensional human body motion.
S52, the synthesized two-dimensional occluded-human-motion joint point map is taken as input, and after encoding and decoding by the three-dimensional motion reconstruction network a three-dimensional complete-human-motion map is output, constrained by an L2 reconstruction loss:

L_rec = ‖I_3d − Î_3d‖_2 + ‖V_3d − V̂_3d‖_2 + ‖J_3d − Ĵ_3d‖_2 + ‖β − β̂‖_2

where I_3d, V_3d, and J_3d are the reconstructed complete three-dimensional human motion map, the vertex coordinates, and the joint coordinates respectively, β is the parameter representing the human body shape, the superscript ^ denotes the ground-truth value of each quantity obtained from the data set, and L_rec denotes the L2 reconstruction loss.
In addition, a regularization constraint is used to prevent abnormal body shapes:

L_reg = ‖β‖_2
in addition, the output three-dimensional complete human motion map is further constrained by a smoothing term:
L_smooth,3d = Σ_{t=1}^{F−1} ‖M(θ^{(t+1)}) − M(θ^{(t)})‖²

where θ denotes the human motion parameters sampled from I_3d, M(θ) is the motion feature obtained by the motion-prior encoding, and L_smooth,3d denotes the smoothness constraint on the output three-dimensional complete human motion map.
Finally, the training loss of the three-dimensional motion reconstruction network is:

L_3dbranch = L_rec + ω_3 L_reg + ω_4 L_smooth,3d

where ω_3 and ω_4 are weights. The encoder parameters are fine-tuned during training, which proceeds until convergence.
Further, the specific method of step S6 is:
and S61, sampling the three-dimensional complete human body motion image obtained by estimation in the S4 to obtain a three-dimensional human body motion sequence. Because the three-dimensional human motion sequence obtained by sampling is in a local coordinate system, in order to obtain the three-dimensional human motion sequence with absolute positions, a least square function is constructed by using the joint point positions of a three-dimensional human sequence skeleton and the input shielding two-dimensional joint point positions, and global translation is estimated. And applying the estimated translation to the three-dimensional human motion sequence. The following least squares function was constructed:
Figure BDA0003699559320000061
where K is camera internal reference, P is the coordinates of the two-dimensional joint points of the human body recovered by sampling in the joint point diagram of the two-dimensional human body motion c Is the confidence level of the corresponding joint point,
Figure BDA0003699559320000062
is the translation of the three-dimensional mannequin of the skeletal skin. The human body three-dimensional motion containing the absolute position can be obtained by increasing the translation amount on the three-dimensional human body model of the skeleton skin.
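The translation estimate admits a closed-form solution once the perspective constraint u = f_x (X + T_x)/(Z + T_z) + c_x is multiplied through by the depth, making each equation linear in T. A numpy sketch follows (the function name is hypothetical, and a pinhole intrinsic matrix with per-joint confidences as weights is assumed):

```python
import numpy as np

def estimate_translation(joints3d, joints2d, conf, K):
    """Weighted linear least-squares estimate of the global
    translation T, a sketch of the S61 step.

    joints3d: (J, 3) root-relative 3D joints.
    joints2d: (J, 2) observed (possibly occluded) 2D joints.
    conf:     (J,)  confidences (0 for occluded joints).
    K:        (3, 3) pinhole camera intrinsics.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X, Y, Z = joints3d.T
    u = joints2d[:, 0] - cx
    v = joints2d[:, 1] - cy
    zeros = np.zeros_like(u)
    # fx*Tx - u*Tz = u*Z - fx*X  (and the analogous row for y)
    A = np.concatenate([
        np.stack([fx * np.ones_like(u), zeros, -u], axis=1),
        np.stack([zeros, fy * np.ones_like(v), -v], axis=1)])
    b = np.concatenate([u * Z - fx * X, v * Z - fy * Y])
    w = np.sqrt(np.concatenate([conf, conf]))   # confidence weights
    T, *_ = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)
    return T
```

With noise-free projections the linear system is exact, so the true translation is recovered to numerical precision.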
Further, the specific method of step S7 is:
s71, video acquisition is carried out by using a single camera. Fix the camera with the tripod, gather the human body that is sheltered from.
S72, for the video sequence captured in S71, an existing open-source two-dimensional joint detector estimates per-frame joint positions and confidences. The joint positions are stored as a two-dimensional joint point map following the method of S13. The confidences are binarized (values above the threshold X are set to 1, values below to 0) and stored in an F × N × 1 confidence map. Multiplying the confidence map with the two-dimensional joint point map yields the joint point map of the two-dimensional occluded human motion.
S73, the two-dimensional occluded-human-motion joint point map is input, and the occluded-human spatio-temporal prior network and the three-dimensional motion reconstruction network output the reconstructed complete three-dimensional human motion map.
S74, the global position of each pose is obtained via S61, and the translation parameter is finally added to the skeleton-skinned three-dimensional human model to obtain three-dimensional human motion with absolute positions. The global human motion sequence can further be skinned to obtain a deforming human mesh model.
Beneficial effects: compared with the prior art, the invention has the following advantages. 1. A joint point map representation of two-dimensional occluded human motion and a human occlusion data generation method are proposed, which can rapidly synthesize a large amount of occlusion data without impairing the model's generalization to real data, removing the strong dependence of existing methods on real occluded human data. 2. A self-supervised training strategy over synthetic data trains the spatio-temporal motion prior of occluded two-dimensional human motion, improving the accuracy of occluded motion reconstruction. 3. A local joint-point-level spatio-temporal feature extraction network based on dilated convolution and a global motion-level spatio-temporal feature extraction network based on the Transformer are proposed, effectively extracting spatio-temporal features at different levels, reducing the ambiguity of occluded two-dimensional human motion, and improving the accuracy of three-dimensional human motion reconstruction. 4. The method achieves real-time occluded human motion reconstruction with only a single RGB camera, offering easy deployment and low cost.
Drawings
FIG. 1 is a diagram of the occlusion spatio-temporal motion prior training framework;
FIG. 2 is a diagram of the occluded human motion reconstruction network training framework;
FIG. 3 is a flow chart of occluded human motion reconstruction;
FIG. 4 is a schematic diagram of occluders, where the top row shows the occluder pictures and the bottom row the corresponding mask maps;
FIG. 5 is a schematic diagram of synthesized occlusion, where the top row shows the synthesized occlusion pictures and the bottom row the corresponding original pictures;
FIG. 6 is a schematic diagram of a synthesized occluded two-dimensional motion sequence, where the left picture shows the synthesized occlusion, the middle picture the synthesized occlusion mask, and the right picture the two-dimensional occluded-human-motion joint point map;
FIG. 7 is a schematic diagram of three-dimensional human motion maps, where the left picture is the three-dimensional complete human motion map and the right picture the reconstruction result;
FIG. 8 shows the reconstructed three-dimensional motion after skinning, where the left column contains the original pictures, the middle column the skinned result renderings, and the right column renderings from different viewpoints.
Detailed Description
The present invention is described in further detail below with reference to the drawings. The occluded human motion sequence reconstruction method based on a self-supervised spatio-temporal motion prior comprises the following steps:
s11, obtaining a random shielding object: as shown in FIG. 4, an obstruction Mask and RGB pictures are obtained by Mask-RCNN in a web video using example segmentation for simulating occlusion.
S12, a two-dimensional human motion sequence is extracted from the human data set, where the two-dimensional pose of each frame is represented by the coordinates of 24 human joint points. For a 24-frame sequence with 24 joint points, to ease network computation the root joint coordinates (x_root, y_root) are first subtracted from every joint coordinate (x, y), and the result is divided by the skeleton bounding-box size w × h for normalization; the normalized coordinates are stored in a two-dimensional joint point map I_2d ∈ R^{24×24×2}, in which each row holds the normalized positions of all joint points of one frame.
S13, acquiring a joint point map of occluded two-dimensional human motion: since the joint point representation of human motion is not affected by image appearance, as shown in FIG. 5, occluded human motion data can be obtained by adding an occluder to an occlusion-free human picture. As shown in FIG. 6, the intersection-over-union between the skeleton bounding box and the occluder obtained in S11 is computed and constrained to lie between X_1 and X_2, realizing occlusion at different occlusion ratios. For the mask map of an occluder, a confidence map I_c ∈ R^(24×24×1) is generated by assigning the value 0 to joint points lying inside the mask and the value 1 to joint points lying outside it. Multiplying the confidence map and the two-dimensional joint point map yields the joint point map of occluded two-dimensional human motion

I_2d^occ = I_c ⊙ I_2d.

The ordinate of the map represents time, the abscissa represents the joint points, and each point of the map stores the two-dimensional joint coordinates at the corresponding moment. This representation can rapidly synthesize a large amount of occlusion data without harming the generalization of the model to real data.
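A minimal sketch of this occlusion synthesis, assuming a binary occluder mask and pixel-space joint positions (all names are hypothetical):

```python
import numpy as np

def box_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used to keep the
    overlap between the skeleton box and the occluder in a target range."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

def occlude_joint_map(i2d, joints_px, occluder_mask):
    """Zero out joints covered by the occluder, as in S13.

    i2d:           (F, K, 2) complete 2D joint point map.
    joints_px:     (F, K, 2) joint positions in the pixel grid of the mask.
    occluder_mask: (H, W) binary mask, 1 where the occluder covers a pixel.
    Returns the occluded map and the (F, K, 1) confidence map I_c.
    """
    h, w = occluder_mask.shape
    xs = np.clip(joints_px[..., 0].astype(int), 0, w - 1)
    ys = np.clip(joints_px[..., 1].astype(int), 0, h - 1)
    conf = (1 - occluder_mask[ys, xs]).astype(float)[..., None]
    return i2d * conf, conf
```

Joints whose pixel falls inside the occluder mask get confidence 0 and are zeroed in the map, matching the confidence-map multiplication above.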
S14, obtaining a complete three-dimensional human motion map: a three-dimensional human motion sequence is extracted from an existing human body data set, the three-dimensional human pose of each frame being represented by a skinned three-dimensional human model. For a 24-frame three-dimensional human motion sequence with 24 joint points, the skeleton pose of the three-dimensional human model is represented by joint rotations, which deform the human mesh, and the complete three-dimensional motion sequence is stored in a complete three-dimensional human motion map I_3d ∈ R^(24×24×6). As shown in FIG. 7, the ordinate represents time, the abscissa represents the joint points, and each point of the map stores the three-dimensional pose parameters of the corresponding joint at the corresponding moment.
S21, as shown in FIG. 1: since considering the temporal and spatial relationships in separate stages causes information loss, the constructed encoder comprises 2 modules, a local joint-level spatio-temporal feature extraction network and a global motion-level spatio-temporal feature extraction network, which extract spatio-temporal features at different levels. The local module models local spatio-temporal features and contains 4 convolutional layers, the first 3 being dilated convolutions with dilation rates 1, 2 and 5, respectively. The outputs of the three dilated convolutional layers are spliced, and the spatio-temporal features are fused by the last convolutional layer.
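The local module can be sketched as follows. For brevity this uses depthwise (per-channel) dilated convolutions in plain NumPy rather than full learned convolution layers, so the shapes and the dilation rates (1, 2, 5) match the description while the weights are placeholders:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded depthwise 1D dilated convolution along the time axis.
    x: (T, C) features; w: (ksize, C) per-channel kernel."""
    t = x.shape[0]
    ksize = w.shape[0]
    pad = dilation * (ksize - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(ksize):
        out += w[i] * xp[i * dilation : i * dilation + t]
    return out

def local_spatiotemporal_features(x, kernels, fuse_w):
    """Local joint-level module of S21: three dilated branches with
    dilation rates 1, 2 and 5, spliced (concatenated) along channels,
    then fused by a final mixing layer."""
    branches = [dilated_conv1d(x, w, d) for w, d in zip(kernels, (1, 2, 5))]
    concat = np.concatenate(branches, axis=-1)   # (T, 3C) spliced outputs
    return concat @ fuse_w                       # (T, C_out), the fusing layer
```

The increasing dilation rates enlarge the temporal receptive field without adding parameters, which is the usual motivation for this branch design.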
S22, since convolutional layers have limited capacity for modeling temporal continuity, a global motion-level spatio-temporal feature extraction network is further constructed to model global spatio-temporal features. It comprises a global spatial relationship module and a global temporal relationship module, both adopting a Transformer network structure. A spatial embedding is added to the first module and a temporal embedding to the second module.
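The global module can be sketched as below; a single-head self-attention with identity Q/K/V projections stands in for a full Transformer block, and the embedding shapes are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention with identity projections (illustration only)."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def global_features(x, spatial_emb, temporal_emb):
    """Global motion-level module of S22: a spatial block relating the K joints
    within each frame, then a temporal block relating the F frames, with the
    corresponding embedding added before each block.

    x: (F, K, C) features; spatial_emb: (K, C); temporal_emb: (F, K * C).
    """
    f, k, c = x.shape
    x = x + spatial_emb[None]                               # spatial embedding
    x = np.stack([self_attention(x[t]) for t in range(f)])  # joints attend within a frame
    x = x.reshape(f, k * c) + temporal_emb                  # temporal embedding
    return self_attention(x)                                # frames attend over time
```

Splitting attention into a per-frame spatial pass and a cross-frame temporal pass keeps the attention matrices small (K × K and F × F) compared with attending over all F · K tokens at once.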
S23, a decoder composed of a fully connected network is constructed, which regresses the joint point map of complete two-dimensional human motion from the code estimated by the encoder.
S31, since data augmentation plays an important role in network training, the existing data are augmented. The specific strategies are: 1) mirror flipping: because the human body is bilaterally symmetric, human data are multiplied by interchanging the rotation angles of symmetric joints; 2) different-rate sampling: new human motion sequences are obtained by sampling the original sequence at different sampling rates; 3) reverse-order sampling: the original human motion sequence is reversed and used as a new motion sequence.
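The three augmentations can be sketched on a (F, K, 2) 2D joint sequence as follows; the symmetric-pair list is a hypothetical example, and mirroring is shown on joint positions (negating x) rather than on rotation angles:

```python
import numpy as np

def mirror(seq, symmetric_pairs):
    """Mirror flip (strategy 1): negate x and exchange symmetric left/right
    joints; `symmetric_pairs` is a list of (left, right) joint indices."""
    out = seq.copy()
    out[..., 0] *= -1.0
    for left, right in symmetric_pairs:
        out[:, [left, right]] = out[:, [right, left]]
    return out

def resample(seq, rate):
    """Different-rate sampling (strategy 2): keep every `rate`-th frame."""
    return seq[::rate]

def reverse(seq):
    """Reverse-order sampling (strategy 3): play the motion backwards."""
    return seq[::-1]
```

Each strategy produces a physically plausible new sequence from an existing one, which is why all three multiply the training data at essentially no cost.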
S32, the synthesized joint point map of occluded two-dimensional human motion is taken as input, and the occluded-human spatio-temporal prior network outputs a joint point map of complete two-dimensional human motion, which is supervised by the ground-truth joint point map of complete two-dimensional human motion. The L1 loss is used as the constraint:

L_2d = ‖I_2d − Î_2d‖_1

where I_2d and Î_2d respectively denote the reconstructed joint point map of complete two-dimensional human motion and its ground truth, ‖·‖_1 denotes the L1 norm, and L_2d is the L1 loss between the output joint point map of complete two-dimensional human motion and the corresponding ground truth.
In addition, the output joint point map of complete two-dimensional human motion is further constrained by a smoothness term:

L_smooth,2d = Σ_t ‖p_t − p_{t+1}‖_1

where p_t and p_{t+1} respectively denote the parameters of the t-th and (t+1)-th rows of the estimated joint point map of complete two-dimensional human motion, i.e., the two-dimensional joints at the current moment and the next moment, and L_smooth,2d denotes the smoothness constraint on the output joint point map of complete two-dimensional human motion.
Finally, the training loss of the occluded-human spatio-temporal prior network is:

L_self = ω_1 L_2d + ω_2 L_smooth,2d

where ω_1 = 1 and ω_2 = 0.5 are the weights.
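The self-supervised training loss can be sketched as below; mean reduction inside each term is an assumption, since the patent only specifies the norms and the weights:

```python
import numpy as np

def l2d(pred, gt):
    """L_2d: L1 loss between the predicted complete 2D joint map and its
    ground truth (mean reduction is an assumption)."""
    return np.abs(pred - gt).mean()

def l_smooth_2d(pred):
    """L_smooth,2d: penalize the difference between consecutive rows
    (frames t and t+1) of the predicted joint map."""
    return np.abs(pred[1:] - pred[:-1]).mean()

def l_self(pred, gt, w1=1.0, w2=0.5):
    """Total self-supervised loss L_self = w1 * L_2d + w2 * L_smooth,2d."""
    return w1 * l2d(pred, gt) + w2 * l_smooth_2d(pred)
```

Note that the smoothness term depends only on the prediction, so it also regularizes frames whose joints were occluded in the input.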
S41, a network identical to the encoder of the occluded-human spatio-temporal prior network of S2 is constructed as the encoder, and a fully connected layer plus two Transformer modules, used respectively to estimate the temporal and spatial relationships, serve as the decoder, which estimates the complete three-dimensional human motion map from the joint point map of occluded two-dimensional human motion.
S51, the encoder parameters of the occluded-human spatio-temporal prior network trained in step S3 are taken as pre-training parameters and assigned to the encoder of the three-dimensional motion reconstruction network constructed in step S4, so that the three-dimensional motion reconstruction network acquires the prior knowledge of occluded two-dimensional human motion.
S52, as shown in FIG. 2, the synthesized joint point map of occluded two-dimensional human motion is taken as input, and the complete three-dimensional human motion map is output after encoding and decoding by the three-dimensional motion reconstruction network. The output map is constrained by a reconstruction L2 loss:

L_rec = ‖I_3d − Î_3d‖_2 + ‖V_3d − V̂_3d‖_2 + ‖J_3d − Ĵ_3d‖_2 + ‖β − β̂‖_2

where I_3d, V_3d and J_3d respectively denote the reconstructed complete three-dimensional human motion map, the vertex coordinates and the joint coordinates, β is the parameter describing the human body shape, the superscript ^ denotes the ground-truth value of each parameter obtained from the data set, and L_rec denotes the reconstruction L2 loss.
In addition, a regularization constraint is used to prevent the occurrence of abnormal body shapes:

L_reg = ‖β‖_2
in addition, the output three-dimensional complete human motion map is further constrained by a smoothing term:
L smooth,3d =‖M(θ)-M(θ gt )‖ 2 +‖M(θ)‖ 2
wherein θ is from I 3d The human body motion parameters obtained by middle sampling, M (theta) is the motion characteristics obtained by motion prior coding, and L smooth,3d Representing a smooth term constraint on the output three-dimensional complete human motion map.
Finally, the training loss of the three-dimensional motion reconstruction network is:

L_3dbranch = L_rec + ω_3 L_reg + ω_4 L_smooth,3d

where ω_3 = 1 and ω_4 = 0.5 are the weights. During training, the encoder parameters are simultaneously fine-tuned, and training proceeds until convergence.
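The three-dimensional training loss can be sketched as below; the motion prior M(·) is passed in as an opaque function because its form is not specified here, and all names are hypothetical:

```python
import numpy as np

def l2(a, b):
    """Plain L2 norm of the difference, matching the ‖·‖_2 terms above."""
    return float(np.sqrt(((np.asarray(a) - np.asarray(b)) ** 2).sum()))

def l_rec(i3d, v3d, j3d, beta, i3d_gt, v3d_gt, j3d_gt, beta_gt):
    """L_rec: L2 terms over the motion map, vertices, joints and shape."""
    return (l2(i3d, i3d_gt) + l2(v3d, v3d_gt)
            + l2(j3d, j3d_gt) + l2(beta, beta_gt))

def l_3d_branch(i3d, v3d, j3d, beta, gt, motion_prior, theta, theta_gt,
                w3=1.0, w4=0.5):
    """L_3dbranch = L_rec + w3 * ‖β‖_2 + w4 * L_smooth,3d, with the motion
    prior M(·) supplied as a callable."""
    reg = l2(beta, np.zeros_like(beta))                 # L_reg = ‖β‖_2
    m_theta = motion_prior(theta)
    smooth = l2(m_theta, motion_prior(theta_gt)) + l2(m_theta, np.zeros_like(m_theta))
    return l_rec(i3d, v3d, j3d, beta, *gt) + w3 * reg + w4 * smooth
```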
S61, the three-dimensional human motion sequence is obtained by sampling the complete three-dimensional human motion map estimated in S4. Because the sampled sequence lies in a local coordinate system, to obtain a three-dimensional human motion sequence with absolute positions, a least-squares objective is constructed from the joint positions of the three-dimensional skeleton and the input occluded two-dimensional joint positions, and the global translation is estimated and applied to the three-dimensional human motion sequence. The following least-squares objective is constructed:

T = argmin_T Σ P_c ‖π(K(J_3d + T)) − P‖_2^2

where K is the camera intrinsic matrix, π denotes perspective projection, P is the coordinates of the two-dimensional human joint points recovered by sampling the joint point map of two-dimensional human motion, P_c is the confidence of the corresponding joint point, and T is the translation of the skinned three-dimensional human model. The three-dimensional human motion containing absolute positions is obtained by applying this translation to the skinned three-dimensional human model.
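The least-squares estimation of the global translation can be sketched as follows; the Gauss-Newton solver and the initial guess are assumptions, since the patent only specifies the objective:

```python
import numpy as np

def project(k_mat, pts):
    """Perspective projection of (N, 3) camera-space points with intrinsics K."""
    p = pts @ k_mat.T
    return p[:, :2] / p[:, 2:3]

def estimate_translation(j3d, p2d, conf, k_mat, t0=(0.0, 0.0, 5.0), iters=20):
    """Solve min_T sum_i conf_i * ||project(K, j3d_i + T) - p2d_i||^2
    by Gauss-Newton with an analytic Jacobian."""
    fx, fy = k_mat[0, 0], k_mat[1, 1]
    t = np.asarray(t0, dtype=float)
    w = np.repeat(np.asarray(conf, dtype=float), 2)   # one weight per residual
    for _ in range(iters):
        x = j3d + t
        r = (project(k_mat, x) - p2d).ravel()         # reprojection residuals
        jac = np.zeros((2 * len(j3d), 3))             # d(residual)/dT
        jac[0::2, 0] = fx / x[:, 2]
        jac[0::2, 2] = -fx * x[:, 0] / x[:, 2] ** 2
        jac[1::2, 1] = fy / x[:, 2]
        jac[1::2, 2] = -fy * x[:, 1] / x[:, 2] ** 2
        jw = jac * w[:, None]
        t = t - np.linalg.solve(jw.T @ jac + 1e-9 * np.eye(3), jw.T @ r)
    return t
```

Because occluded joints carry confidence 0, they contribute nothing to the normal equations, so the translation is anchored only by reliably detected joints.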
S71, video acquisition is performed with a single camera. The camera is fixed on a tripod, and the occluded human body is captured.
S72, as shown in FIG. 3, the video sequence acquired in step S71 is taken as input, and the joint positions and their confidences are detected for each frame with an existing open-source two-dimensional joint point detector. The joint positions are stored as a two-dimensional joint point map following the method of S13. The confidences are binarized with a threshold of 0.6: values above the threshold are assigned 1, values below it 0, and the result is stored in a 24 × 24 × 1 confidence map. Multiplying the confidence map and the two-dimensional joint point map yields the joint point map of occluded two-dimensional human motion.
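This binarization step mirrors the synthetic pipeline and can be sketched as below (names hypothetical):

```python
import numpy as np

def detections_to_occluded_map(joint_map, det_conf, threshold=0.6):
    """Binarize detector confidences and mask the joint point map, as in S72.

    joint_map: (F, K, 2) two-dimensional joint point map (already normalized).
    det_conf:  (F, K) per-joint confidences from the 2D joint detector.
    Values above the threshold become 1, the rest 0; the resulting (F, K, 1)
    confidence map then multiplies the joint map, exactly as in S13.
    """
    c = (np.asarray(det_conf) > threshold).astype(float)[..., None]
    return joint_map * c, c
```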
S73, the joint point map of occluded two-dimensional human motion is taken as input, and the reconstructed complete three-dimensional human motion map is output through the occluded-human spatio-temporal prior network and the three-dimensional motion reconstruction network.
S74, the global position of each pose is obtained through S61. Finally, as shown in FIG. 8, the three-dimensional human motion containing absolute positions is obtained by applying the translation parameter to the skinned three-dimensional human model. The global human motion sequence can further be skinned to obtain a deformed human mesh model.

Claims (8)

1. A single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior, characterized by comprising the following steps:
S1, human motion synthesis and representation: a joint point map representation of occluded two-dimensional human motion and an occluded-human data generation method are provided to rapidly synthesize a large amount of occlusion data; occluders are obtained by instance segmentation of web videos and randomly overlaid on pictures of an occlusion-free human data set to synthesize occlusion, and joint point maps of occluded two-dimensional human motion are generated from the joint point maps of complete two-dimensional human motion for network training;
S2, constructing an occluded-human spatio-temporal prior network: an encoder comprising a local joint-level spatio-temporal feature extraction network and a global motion-level spatio-temporal feature extraction network is constructed to extract spatio-temporal features at different levels; in addition, a decoder composed of a fully connected network is constructed, which regresses the joint point map of complete two-dimensional human motion from the code estimated by the encoder;
S3, training the occluded-human spatio-temporal prior network: the synthesized joint point map of occluded two-dimensional human motion is taken as input, the corresponding occlusion-free joint point map of complete two-dimensional human motion as supervision, and the occluded-human spatio-temporal prior network constructed in step S2 is trained in a self-supervised manner until convergence, learning the prior knowledge of occluded two-dimensional human motion;
S4, constructing a three-dimensional motion reconstruction network: a network identical to the encoder of the occluded-human spatio-temporal prior network of step S2 is constructed as the encoder, and a fully connected layer plus two Transformer modules serve as the decoder, which estimates the complete three-dimensional human motion map from the joint point map of occluded two-dimensional human motion;
S5, training the three-dimensional motion reconstruction network: the encoder parameters of the occluded-human spatio-temporal prior network trained in step S3 are taken as pre-training parameters and assigned to the encoder of the three-dimensional motion reconstruction network constructed in step S4, so that the three-dimensional motion reconstruction network acquires the prior knowledge of occluded two-dimensional human motion; the synthesized joint point map of occluded two-dimensional human motion is further taken as input and the complete three-dimensional human motion map as supervision for network training; during training, the encoder parameters are fine-tuned, and training proceeds until convergence;
S6, global position estimation: the three-dimensional human motion sequence is obtained by sampling the complete three-dimensional human motion map estimated in step S4; because the sampled sequence lies in a local coordinate system, to obtain a three-dimensional human motion sequence with absolute positions, a least-squares objective is constructed from the joint positions of the three-dimensional skeleton and the input occluded two-dimensional joint positions, the global translation is estimated, and the estimated translation is applied to the three-dimensional human motion sequence;
S7, real-time single-view occluded human motion reconstruction: after network training is completed, a single-view motion capture system is built from a single RGB camera; the human motion video sequence captured by the single-view camera is taken as input, the joint positions and confidences of each frame are detected with an existing open-source two-dimensional joint point detector, and the joint point map of occluded two-dimensional human motion is obtained; the complete three-dimensional human motion map is then estimated by the three-dimensional motion reconstruction network, the three-dimensional human motion sequence with absolute positions is acquired by the method of step S6, realizing single-view occluded three-dimensional human motion reconstruction, and the global human motion sequence is further skinned to obtain a deformed human mesh model.
2. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 1, wherein step S1 specifically comprises:
S11, obtaining random occluders: occluder masks and RGB pictures are obtained from web videos by Mask R-CNN instance segmentation, for simulating occlusion;
S12, acquiring a joint point map of complete two-dimensional human motion: a two-dimensional human motion sequence is extracted from a human body data set, the two-dimensional human pose of each frame being represented by the coordinates of K human joint points; for an F-frame two-dimensional human motion sequence with K joint points, the root-joint coordinates (x_root, y_root) are first subtracted from all joint coordinates (x, y), which are then divided by the size w × h of the skeleton bounding box to achieve normalization; the normalized joint coordinates
p̄ = ((x − x_root) / w, (y − y_root) / h)
are stored in a two-dimensional joint point map I_2d ∈ R^(F×K×2); each row of the map represents the normalized positions of all joint points of one frame;
S13, acquiring a joint point map of occluded two-dimensional human motion: occluded human motion data are obtained by adding an occluder to an occlusion-free human picture; the intersection-over-union between the skeleton bounding box and the occluder obtained in S11 is computed and constrained to lie between X_1 and X_2, realizing occlusion at different occlusion ratios; for the mask map of an occluder, a confidence map I_c ∈ R^(F×K×1) is generated by assigning the value 0 to joint points lying inside the mask and the value 1 to joint points lying outside it; the confidence map and the two-dimensional joint point map are multiplied to obtain the joint point map of occluded two-dimensional human motion
I_2d^occ = I_c ⊙ I_2d;
S14, obtaining a complete three-dimensional human motion map: a three-dimensional human motion sequence is extracted from an existing human body data set, the three-dimensional human pose of each frame being represented by a skinned three-dimensional human model; for an F-frame three-dimensional human motion sequence with N joint points, the skeleton pose of the three-dimensional human model is represented by joint rotations, the deformation of the human mesh is driven by the joint rotations, and the complete three-dimensional motion sequence is stored in a complete three-dimensional human motion map I_3d ∈ R^(F×N×6).
3. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 1, wherein step S2 specifically comprises:
S21, since considering the temporal and spatial relationships in separate stages causes information loss, the constructed encoder comprises 2 modules, a local joint-level spatio-temporal feature extraction network and a global motion-level spatio-temporal feature extraction network, which extract spatio-temporal features at different levels; the local module models local spatio-temporal features and contains 4 convolutional layers, the first 3 being dilated convolutions with dilation rates 1, 2 and 5, respectively; the outputs of the three dilated convolutional layers are spliced, and the spatio-temporal features are fused by the last convolutional layer;
S22, since convolutional layers have limited capacity for modeling temporal continuity, a global motion-level spatio-temporal feature extraction network is further constructed to model global spatio-temporal features; it comprises a global spatial relationship module and a global temporal relationship module, both adopting a Transformer network structure; a spatial embedding is added to the first module and a temporal embedding to the second module;
S23, a decoder composed of a fully connected network is constructed, which regresses the joint point map of complete two-dimensional human motion from the code estimated by the encoder.
4. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 1, wherein step S3 specifically comprises:
S31, to enhance the generalization performance of the network, the training data are augmented; the specific strategies are: 1) mirror flipping: because the human body is bilaterally symmetric, human data are multiplied by interchanging the rotation angles of symmetric joints; 2) different-rate sampling: new human motion sequences are obtained by sampling the original sequence at different sampling rates; 3) reverse-order sampling: the original human motion sequence is reversed and used as a new motion sequence;
S32, the synthesized joint point map of occluded two-dimensional human motion is taken as input, and the occluded-human spatio-temporal prior network outputs a joint point map of complete two-dimensional human motion, supervised by the ground-truth joint point map of complete two-dimensional human motion; the L1 loss is used as the constraint:
L_2d = ‖I_2d − Î_2d‖_1
where I_2d and Î_2d respectively denote the reconstructed joint point map of complete two-dimensional human motion and its ground truth, ‖·‖_1 denotes the L1 norm, and L_2d denotes the L1 loss between the output joint point map of complete two-dimensional human motion and the corresponding ground truth;
in addition, the output joint point map of complete two-dimensional human motion is further constrained by a smoothness term:
L_smooth,2d = Σ_t ‖p_t − p_{t+1}‖_1
where p_t and p_{t+1} respectively denote the parameters of the t-th and (t+1)-th rows of the estimated joint point map of complete two-dimensional human motion, i.e., the two-dimensional joints at the current moment and the next moment; L_smooth,2d denotes the smoothness constraint on the output joint point map of complete two-dimensional human motion;
finally, the training loss of the occluded-human spatio-temporal prior network is:
L_self = ω_1 L_2d + ω_2 L_smooth,2d
where ω_1 and ω_2 are the weights.
5. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 1, wherein step S4 specifically comprises:
S41, a network identical to the encoder of the occluded-human spatio-temporal prior network of step S2 is constructed as the encoder, and a fully connected layer plus two Transformer modules, used respectively to estimate the temporal and spatial relationships, serve as the decoder, which estimates the complete three-dimensional human motion map from the joint point map of occluded two-dimensional human motion.
6. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 1, wherein step S5 specifically comprises:
S51, the encoder parameters of the occluded-human spatio-temporal prior network trained in step S3 are taken as pre-training parameters and assigned to the encoder of the three-dimensional motion reconstruction network constructed in step S4, so that the three-dimensional motion reconstruction network acquires the prior knowledge of occluded two-dimensional human motion;
S52, the synthesized joint point map of occluded two-dimensional human motion is taken as input, and the complete three-dimensional human motion map is output after encoding and decoding by the three-dimensional motion reconstruction network; the output map is constrained by a reconstruction L2 loss:
L_rec = ‖I_3d − Î_3d‖_2 + ‖V_3d − V̂_3d‖_2 + ‖J_3d − Ĵ_3d‖_2 + ‖β − β̂‖_2
where I_3d, V_3d and J_3d respectively denote the reconstructed complete three-dimensional human motion map, the vertex coordinates and the joint coordinates, β is the parameter describing the human body shape, the superscript ^ denotes the ground-truth value of each parameter obtained from the data set, and L_rec denotes the reconstruction L2 loss;
moreover, a regularization constraint is used to prevent abnormal body shapes:
L_reg = ‖β‖_2
in addition, the output complete three-dimensional human motion map is further constrained by a smoothness term:
L_smooth,3d = ‖M(θ) − M(θ_gt)‖_2 + ‖M(θ)‖_2
where θ denotes the human motion parameters sampled from I_3d, M(θ) denotes the motion features obtained by motion-prior encoding, and L_smooth,3d denotes the smoothness constraint on the output complete three-dimensional human motion map;
finally, the training loss of the three-dimensional motion reconstruction network is:
L_3dbranch = L_rec + ω_3 L_reg + ω_4 L_smooth,3d
where ω_3 and ω_4 are the weights; during training, the encoder parameters are simultaneously fine-tuned, and training proceeds until convergence.
7. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 1, wherein step S6 specifically comprises:
S61, the three-dimensional human motion sequence is obtained by sampling the complete three-dimensional human motion map estimated in step S4; because the sampled sequence lies in a local coordinate system, to obtain a three-dimensional human motion sequence with absolute positions, a least-squares objective is constructed from the joint positions of the three-dimensional skeleton and the input occluded two-dimensional joint positions, the global translation is estimated, and the estimated translation is applied to the three-dimensional human motion sequence; the following least-squares objective is constructed:
T = argmin_T Σ P_c ‖π(K(J_3d + T)) − P‖_2^2
where K is the camera intrinsic matrix, π denotes perspective projection, P is the coordinates of the two-dimensional human joint points recovered by sampling the joint point map of two-dimensional human motion, P_c is the confidence of the corresponding joint point, and T is the translation of the skinned three-dimensional human model; the three-dimensional human motion containing absolute positions is obtained by applying the translation to the skinned three-dimensional human model.
8. The single-view occluded human motion reconstruction method based on a self-supervised spatio-temporal motion prior according to claim 7, wherein step S7 specifically comprises:
S71, video acquisition is performed with a single camera; the camera is fixed on a tripod, and the occluded human body is captured;
S72, for the video sequence acquired in step S71, the joint positions and their confidences are detected for each frame with an existing open-source two-dimensional joint point detector; the joint positions are stored as a two-dimensional joint point map following the method of S13; the confidences are binarized: values above the threshold X are assigned 1, values below it 0, and the result is stored in an F × N × 1 confidence map; multiplying the confidence map and the two-dimensional joint point map yields the joint point map of occluded two-dimensional human motion;
S73, the joint point map of occluded two-dimensional human motion is taken as input, and the reconstructed complete three-dimensional human motion map is output through the occluded-human spatio-temporal prior network and the three-dimensional motion reconstruction network;
S74, the global position of each pose is obtained through step S61; finally, the three-dimensional human motion containing absolute positions is obtained by applying the translation parameter to the skinned three-dimensional human model; the global human motion sequence is further skinned to obtain a deformed human mesh model.
CN202210684494.4A 2022-06-17 2022-06-17 Single-view-angle shielding human body motion reconstruction method based on self-supervision space-time motion prior Pending CN114926594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210684494.4A CN114926594A (en) 2022-06-17 2022-06-17 Single-view-angle shielding human body motion reconstruction method based on self-supervision space-time motion prior


Publications (1)

Publication Number Publication Date
CN114926594A true CN114926594A (en) 2022-08-19


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726821A (en) * 2024-02-05 2024-03-19 武汉理工大学 Medical behavior identification method for region shielding in medical video
CN117726821B (en) * 2024-02-05 2024-05-10 武汉理工大学 Medical behavior identification method for region shielding in medical video


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination