CN116681838A - Monocular video dynamic human body three-dimensional reconstruction method based on pose optimization - Google Patents

Monocular video dynamic human body three-dimensional reconstruction method based on pose optimization

Info

Publication number
CN116681838A
Authority
CN
China
Prior art keywords
human body
frame
target
dimensional
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310831742.8A
Other languages
Chinese (zh)
Inventor
陈志刚
汪海波
肖祎龙
庾永昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Xigua Network Technology Co ltd
Central South University
Original Assignee
Hunan Xigua Network Technology Co ltd
Central South University
Application filed by Hunan Xigua Network Technology Co., Ltd. and Central South University
Priority to CN202310831742.8A
Publication of CN116681838A
Legal status: Pending

Classifications

    • G06T 17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06T 15/005 — 3D image rendering; general purpose rendering architectures
    • G06T 7/75 — Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06V 10/774 — Image or video recognition using machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Image or video recognition using neural networks
    • G06V 20/46 — Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06T 2207/30196 — Subject of image: human being; person
    • G06T 2207/30244 — Subject of image: camera pose
    • Y02T 10/40 — Engine management systems


Abstract

The embodiment of the disclosure provides a monocular video dynamic human body three-dimensional reconstruction method based on pose optimization, which belongs to the technical field of image processing and comprises the following steps: inputting a monocular human action video sequence and sequentially performing target detection, person foreground extraction, camera and SMPL human body parameter estimation, and SMPL parameter temporal optimization to construct a human body reconstruction training data set; initializing a monocular video dynamic human body three-dimensional reconstruction model, obtaining a synthesized image through a volume rendering algorithm, taking the error between the synthesized image and the actual image as the main loss function and the error between the volume-rendered opacity and the person foreground as an auxiliary loss function, and training the model with mixed-precision training to obtain a target model; inputting any frame of the target video into the target model, adjusting the camera pose and position according to the three-dimensional pose information of the person in the target frame, and rendering a human action image from a new viewpoint. This scheme improves reconstruction efficiency, accuracy, and adaptability.

Description

Monocular video dynamic human body three-dimensional reconstruction method based on pose optimization
Technical Field
The embodiments of the present disclosure relate to the technical field of image processing, and in particular to a monocular video dynamic human body three-dimensional reconstruction method based on pose optimization.
Background
At present, a common dynamic three-dimensional human body reconstruction approach is Multi-View Stereo (MVS) reconstruction, in which the three-dimensional coordinates of corresponding points are inferred from images taken at multiple viewpoints, yielding a point cloud representation of the human body; the point cloud is then triangulated into a three-dimensional mesh and surface textures are filled in. However, this approach depends heavily on the number of views and the density of the point cloud. The neural radiance field, based on an implicit representation, simplifies the reconstruction process through a differentiable volume rendering procedure, and reconstruction can be achieved from multi-view observation images alone. However, the original neural radiance field is suited to static scenes, where reconstruction is achieved from multi-view inputs and camera parameter information. For dynamic scenes, the mainstream approach divides reconstruction into two steps: first, establish a mapping from the observation space (one frame of the dynamic video) to a template space; then query color and volume density values in the template space and perform volume rendering to generate the rendered image. Existing monocular video human body reconstruction methods mainly suffer from three problems:
1. Person foreground extraction is not accurate enough and requires manual correction; a binary 0/1 foreground-background separation makes edge transitions unnatural. 2. Monocular SMPL parameter estimates are not accurate enough, and pose consistency between video frames cannot be guaranteed. 3. The human neural radiance field converges slowly and demands substantial computation.
Therefore, a monocular video dynamic human body three-dimensional reconstruction method based on pose optimization, with high reconstruction efficiency, accuracy, and adaptability, is needed.
Disclosure of Invention
In view of the above, the embodiments of the present disclosure provide a monocular video dynamic human body three-dimensional reconstruction method based on pose optimization, which at least partially solves the problems of poor reconstruction efficiency, accuracy, and adaptability in the prior art.
The embodiment of the disclosure provides a monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization, which comprises the following steps:
Step 1, inputting a monocular human action video sequence, sequentially performing target detection, person foreground extraction, camera and SMPL human body parameter estimation, and SMPL parameter temporal optimization, and constructing a human body reconstruction training data set;
Step 2, initializing a monocular video dynamic human body three-dimensional reconstruction model, obtaining a synthesized image through a volume rendering algorithm, taking the error between the synthesized image and the actual image as the main loss function and the error between the volume-rendered opacity and the person foreground as an auxiliary loss function, and training the model on the human body reconstruction training data set with mixed-precision training to obtain a target model;
Step 3, inputting any frame of the target video into the target model as the target frame, adjusting the camera pose and position according to the three-dimensional pose information of the person in the target frame, and rendering a human action image from a new viewpoint.
According to a specific implementation manner of the embodiment of the present disclosure, the step 1 specifically includes:
step 1.1, extracting target detection boxes from the monocular human action video sequence with a YOLOv3 detector, and screening valid video frames according to detection confidence and detection category;
step 1.2, extracting the person foreground in each valid video frame with the PP-Matting human image matting method;
step 1.3, estimating the camera parameters and SMPL human body parameters of each valid video frame with the SPEC human SMPL parameter estimation method, wherein the camera parameters comprise the vertical field of view, pitch angle, roll angle, and the relative displacement between the camera and the human body, and the SMPL human body parameters comprise 10-dimensional shape parameters and 24 bone pose parameters in axis-angle representation;
step 1.4, smoothing the pitch and roll angles of each valid video frame with the OneEuro filtering algorithm, taking the mean of the per-frame vertical fields of view as the global vertical field of view, extracting the per-frame two-dimensional human pose points $J_i^{est}$ with the HRNet two-dimensional pose estimation algorithm, and then performing temporal optimization of the SMPL parameters with the improved SMPLify method;
step 1.5, assembling the data from steps 1.1 to 1.4 into the human body reconstruction training data set.
According to a specific implementation manner of the embodiment of the present disclosure, the step of performing the temporal optimization with the improved SMPLify method includes:
taking minimization of the summed error over the video sequence as the optimization objective, adding two temporal smoothing terms to the SMPLify error function to obtain the target error function, and performing the temporal optimization with the L-BFGS algorithm.
According to a specific implementation manner of the embodiment of the disclosure, the expression of the target error function is

$$E(\beta,\theta)=\sum_{i=1}^{N}E_{\mathrm{SMPLify}}\big(\beta_i,\theta_i,K_i,J_i^{est}\big)+\lambda_{s\text{-}2d}\sum_{i=1}^{N-1}\big\|\hat{J}_{i+1}^{2d}-\hat{J}_i^{2d}\big\|_1+\lambda_{s\text{-}3d}\sum_{i=1}^{N-1}\big\|J_{i+1}^{3d}-J_i^{3d}\big\|_1$$

wherein $N$ denotes the number of valid video frames, $\beta_i$ and $\theta_i$ denote the SMPL shape and pose parameters of the $i$-th frame, $K_i$ is the camera projection matrix of that frame, $\lambda_{s\text{-}2d}$ and $\lambda_{s\text{-}3d}$ are the loss weights of the two-dimensional and three-dimensional temporal smoothing terms respectively, $J_i^{3d}$ denotes the three-dimensional human skeleton keypoints of the $i$-th frame under the given parameters, $\hat{J}_i^{2d}$ denotes the two-dimensional skeleton keypoints obtained by projection with $K_i$, and $J^{est}$ are the frame-by-frame two-dimensional human keypoints obtained with the HRNet two-dimensional human pose estimation method.
According to a specific implementation manner of the embodiment of the disclosure, the monocular video dynamic human body three-dimensional reconstruction model comprises a coordinate mapping module and a template neural radiance field module.
According to a specific implementation manner of the embodiment of the present disclosure, the step 2 specifically includes:
step 2.1, the coordinate mapping module maps a coordinate point $x_o$ of the observation space to a template-space point $x_c$;
step 2.2, any coordinate point $x_c$ in the template space is input into the template neural radiance field module, which outputs the color $c$ and volume density $\sigma$ corresponding to $x_c$, and a synthesized image is obtained through a ray marching algorithm;
step 2.3, calculating the volume rendering values and the opacity of the synthesized image;
step 2.4, using the rendering error between the synthesized image and the actual image together with an opacity loss term as the target loss function, and training the model on the human body reconstruction training data set with mixed-precision training to obtain the target model.
According to a specific implementation manner of the embodiment of the present disclosure, the step 2.1 specifically includes:
acquiring an initial coordinate mapping point $x'_c$ with the inverse linear blend skinning method;
fine-tuning the mapping point $x'_c$ with a coordinate correction network to obtain the final mapping point $x_c$, wherein the coordinate correction network is a multi-layer perceptron with multi-resolution hash encoding whose input comprises the coordinate mapping point $x'_c$ and the feature vector $f_i$ corresponding to the frame.
According to a specific implementation manner of the embodiment of the present disclosure, the step 2.2 specifically includes:
for any sample point in the ray marching algorithm, the feature code corresponding to the sample point is obtained with multi-resolution hash encoding and input into the network to obtain the color and density values at that point; for the color value of any pixel in the picture, the ray marching algorithm casts a ray $r$ from the camera through the pixel, samples along $r$, and computes the volume rendering value from the colors and densities at the sample points with the volume rendering formula, whose expression is

$$\hat{C}(r)=\sum_{i=1}^{M} w_i\,c_i,\qquad w_i=T_i\big(1-e^{-\sigma_i\delta_i}\big)$$

wherein $\hat{C}(r)$ is the color value of the pixel that the ray $r$ passes through, $M$ is the number of sample points along the ray, $w_i$ denotes the weight at the $i$-th sample point, $\delta_i=t_{i+1}-t_i$ denotes the distance between two adjacent sample points, and $T_i$ denotes the transmittance from the start position $t_n$ to position $t_i$, computed as

$$T_i=\exp\Big(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Big)$$

The opacity $\hat{O}(r)$ of the synthesized image then follows from the volume rendering formula as

$$\hat{O}(r)=\sum_{i=1}^{M} w_i$$
According to a specific implementation manner of the embodiment of the disclosure, the expression of the target loss function is

$$\mathcal{L}=\mathcal{L}_{rgb}+\lambda_{opacity}\,\mathcal{L}_{opacity}$$

wherein $\lambda_{opacity}$ is the opacity loss weight, $\mathcal{L}_{opacity}$ denotes the opacity loss term, and $\mathcal{L}_{rgb}$ denotes the rendering error between the synthesized image and the actual image,

$$\mathcal{L}_{rgb}=\frac{1}{N}\sum_{i=1}^{N}\Big[\lambda_{LPIPS}\,\mathrm{LPIPS}\big(\hat{C}_i,C_i\big)+\lambda_{MSE}\,\mathrm{MSE}\big(\hat{C}_i,C_i\big)\Big]$$

where $\hat{C}_i$ and $C_i$ denote the $i$-th synthesized and actual frames respectively, LPIPS denotes the learned perceptual image patch similarity, MSE is the mean squared error, and $\lambda_{LPIPS}$ and $\lambda_{MSE}$ denote the loss weights of LPIPS and MSE respectively.
According to a specific implementation manner of the embodiment of the present disclosure, the step 3 specifically includes:
inputting any frame of the target video into the target model as the target frame; adjusting the camera orientation and position for the target frame; generating rays from the camera through the picture pixels and sampling along them with the ray marching algorithm; mapping the sample points to the template space through the coordinate mapping module; querying the color and volume density values at those points through the template neural radiance field module; computing the color value of each pixel with the volume rendering formula; and synthesizing the final image.
The monocular video dynamic human body three-dimensional reconstruction scheme based on pose optimization in the embodiment of the disclosure comprises: step 1, inputting a monocular human action video sequence, sequentially performing target detection, person foreground extraction, camera and SMPL human body parameter estimation, and SMPL parameter temporal optimization, and constructing a human body reconstruction training data set; step 2, initializing a monocular video dynamic human body three-dimensional reconstruction model, obtaining a synthesized image through a volume rendering algorithm, taking the error between the synthesized image and the actual image as the main loss function and the error between the volume-rendered opacity and the person foreground as an auxiliary loss function, and training the model on the data set with mixed-precision training to obtain a target model; step 3, inputting any frame of the target video into the target model as the target frame, adjusting the camera pose and position according to the three-dimensional pose information of the person in the target frame, and rendering a human action image from a new viewpoint.
The beneficial effects of the embodiments of the disclosure are: with this scheme, the person foreground generation method and the SMPL parameter estimation strategy are optimized, yielding a more accurate human body reconstruction data set. Meanwhile, the position encoding of the human neural radiance field and the training loss function are adjusted, which improves the model convergence efficiency and thus further improves the reconstruction efficiency, accuracy, and adaptability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization provided in an embodiment of the disclosure;
fig. 2 is a schematic diagram of a monocular video data preprocessing flow related to a monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a human body reconstruction model related to a monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The currently common dynamic three-dimensional human body reconstruction approach is Multi-View Stereo (MVS) reconstruction: the three-dimensional coordinates of corresponding points are inferred from images taken at multiple viewpoints, yielding a point cloud representation of the human body; the point cloud is then triangulated into a three-dimensional mesh and surface textures are filled in. However, this approach depends heavily on the number of views and the density of the point cloud. The neural radiance field, based on an implicit representation, simplifies the reconstruction process through a differentiable volume rendering procedure, and reconstruction can be achieved from multi-view observation images alone. However, the original neural radiance field is suited to static scenes, where reconstruction is achieved from multi-view inputs and camera parameter information. For dynamic scenes, the mainstream approach divides reconstruction into two steps: first establish a mapping from the observation space (one frame of the dynamic video) to a template space, then query color and volume density values in the template space and perform volume rendering to generate the rendered image.
When performing dynamic three-dimensional reconstruction of the human body, the person foreground region must be extracted accurately so that it can be fused with the background during synthesis. The accuracy of the person foreground directly affects the quality of the final image. In the currently common instance segmentation methods, the task objective is to segment each target instance in the image and assign a unique identifier to each instance. In the single-person case, instance segmentation in effect separates foreground from background with a Boolean value of 0 or 1, which makes the person foreground transition at the edges quite unnatural; the synthesized image may then suffer from blurred edges and an uncoordinated background, reducing the realism and fidelity of the result.
Then, the three-dimensional skeleton information of the human body is required to establish the mapping from the observation space to the template space. Because monocular data lack depth information and suffer from occlusion, reconstructing the three-dimensional geometry of a human body from two-dimensional observations is inherently an ill-posed problem. To overcome this, a great deal of research attempts to incorporate priors on human shape and pose to guide the reconstruction, converting the reconstruction of human three-dimensional geometry into a parameter estimation problem. SMPL (Skinned Multi-Person Linear Model) is a commonly used parameterized human body model that describes the pose and shape variations of the body with a set of low-dimensional parameters; it is simple in form and easy to optimize. The mainstream methods for estimating SMPL human body parameters, such as SMPLify and SPIN, are all built on the weak perspective projection assumption. Because camera parameters (focal length, orientation, and so on) cannot be recovered from pictures taken in natural environments, weak perspective projection simplifies the computation and eases optimization. The weak perspective assumption holds when the human body is approximately perpendicular to the camera principal axis and far from the camera, but in most real-world pictures of people the perspective effect is obvious, for example the foreshortening in selfies. Ignoring perspective projection biases the estimated human pose and shape and degrades the final reconstruction. Moreover, these methods estimate from single pictures, so for video data the continuity between frames cannot be guaranteed, and the estimated human shape and motion may exhibit jitter.
Finally, since the neural radiance field generates a synthesized image by volume rendering, hundreds of sampled queries are required for each pixel in the image; a 512x512 picture needs on the order of tens of millions of neural network evaluations, and even though modern GPUs support vectorized, parallel computation, rendering a single picture still takes 2 to 3 minutes. In addition, for dynamic human body reconstruction, prior methods (Neural Body and the like) need tens of hours of training to reach a good reconstruction because they must fit multi-frame images. The position encoding used in the neural radiance field is an important factor affecting convergence speed. The original neural radiance field uses frequency encoding as its position encoding, mapping coordinate points into a high-dimensional space with trigonometric functions of different frequencies in an attempt to capture positional detail: finer detail requires higher-frequency encoding, but such high-frequency detail is spatially sparse, so applying high-frequency encoding everywhere wastes a large amount of computation and slows model convergence.
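For reference, frequency encoding maps each coordinate through sine and cosine functions at octave-spaced frequencies. The following minimal sketch illustrates this encoding; the function name, the band count, and the assumption that inputs are normalized to roughly [-1, 1] are illustrative rather than taken from the disclosure.

```python
import torch

def frequency_encode(x: torch.Tensor, num_bands: int = 10) -> torch.Tensor:
    """NeRF-style frequency encoding: (..., D) -> (..., D * 2 * num_bands)."""
    freqs = (2.0 ** torch.arange(num_bands, device=x.device)) * torch.pi  # 2^k * pi
    xb = x[..., None] * freqs                                 # (..., D, num_bands)
    enc = torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)   # (..., D, 2*num_bands)
    return enc.flatten(-2)                                    # concatenate per coordinate
```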
The embodiments of the disclosure provide a monocular video dynamic human body three-dimensional reconstruction method based on pose optimization, which can be applied to three-dimensional human body image reconstruction in scenes such as virtual reality, sports broadcasting, and remote communication.
Referring to fig. 1, a flow diagram of a monocular video dynamic human body three-dimensional reconstruction method based on pose optimization is provided in an embodiment of the present disclosure. As shown in fig. 1, the method mainly comprises the following steps:
Step 1, inputting a monocular human action video sequence, sequentially performing target detection, person foreground extraction, camera and SMPL human body parameter estimation, and SMPL parameter temporal optimization, and constructing a human body reconstruction training data set;
Further, the step 1 specifically includes:
step 1.1, extracting target detection boxes from the monocular human action video sequence with a YOLOv3 detector, and screening valid video frames according to detection confidence and detection category;
step 1.2, extracting the person foreground in each valid video frame with the PP-Matting human image matting method;
step 1.3, estimating the camera parameters and SMPL human body parameters of each valid video frame with the SPEC human SMPL parameter estimation method, wherein the camera parameters comprise the vertical field of view, pitch angle, roll angle, and the relative displacement between the camera and the human body, and the SMPL human body parameters comprise 10-dimensional shape parameters and 24 bone pose parameters in axis-angle representation;
step 1.4, smoothing the pitch and roll angles of each valid video frame with the OneEuro filtering algorithm, taking the mean of the per-frame vertical fields of view as the global vertical field of view, extracting the per-frame two-dimensional human pose points $J_i^{est}$ with the HRNet two-dimensional pose estimation algorithm, and then performing temporal optimization of the SMPL parameters with the improved SMPLify method;
step 1.5, assembling the data from steps 1.1 to 1.4 into the human body reconstruction training data set.
Further, the step of performing the temporal optimization with the improved SMPLify method includes:
taking minimization of the summed error over the video sequence as the optimization objective, adding two temporal smoothing terms to the SMPLify error function to obtain the target error function, and performing the temporal optimization with the L-BFGS algorithm.
Further, the expression of the target error function is

$$E(\beta,\theta)=\sum_{i=1}^{N}E_{\mathrm{SMPLify}}\big(\beta_i,\theta_i,K_i,J_i^{est}\big)+\lambda_{s\text{-}2d}\sum_{i=1}^{N-1}\big\|\hat{J}_{i+1}^{2d}-\hat{J}_i^{2d}\big\|_1+\lambda_{s\text{-}3d}\sum_{i=1}^{N-1}\big\|J_{i+1}^{3d}-J_i^{3d}\big\|_1$$

wherein $N$ denotes the number of valid video frames, $\beta_i$ and $\theta_i$ denote the SMPL shape and pose parameters of the $i$-th frame, $K_i$ is the camera projection matrix of that frame, $\lambda_{s\text{-}2d}$ and $\lambda_{s\text{-}3d}$ are the loss weights of the two-dimensional and three-dimensional temporal smoothing terms respectively, $J_i^{3d}$ denotes the three-dimensional human skeleton keypoints of the $i$-th frame under the given parameters, $\hat{J}_i^{2d}$ denotes the two-dimensional skeleton keypoints obtained by projection with $K_i$, and $J^{est}$ are the frame-by-frame two-dimensional human keypoints obtained with the HRNet two-dimensional human pose estimation method.
In implementation, at the data processing stage a YOLOv3 detector is used to extract target detection boxes in the video frames, and valid frames are screened according to the detected category and confidence probability, ensuring that each video frame contains only a single person in motion. The target detection stage of fig. 2 illustrates this step.
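A minimal frame-screening sketch follows; `detect` stands in for a YOLOv3 inference call, and its return format, the class name "person", and the confidence threshold are assumptions for illustration.

```python
def screen_valid_frames(frames, detect, conf_thresh=0.9):
    """Keep frames whose detections contain exactly one confident person."""
    valid = []
    for idx, frame in enumerate(frames):
        persons = [(cls, conf, box) for cls, conf, box in detect(frame)
                   if cls == "person" and conf >= conf_thresh]
        if len(persons) == 1:                  # single-person frames only
            valid.append((idx, frame, persons[0][2]))
    return valid
```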
In order to extract the person foreground more accurately and achieve a more realistic human body reconstruction, the PP-Matting human image matting algorithm is used. Image matting estimates the foreground opacity information $O$ pixel by pixel from an image; each pixel value is a floating-point number between 0 and 1 describing the mixing degree of foreground and background at that pixel, which allows a natural transition between foreground and background.
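The role of a fractional opacity map can be seen in the standard matting composite below; this is a generic alpha-blending sketch, not code from the disclosure.

```python
import numpy as np

def composite(foreground: np.ndarray, background: np.ndarray,
              alpha: np.ndarray) -> np.ndarray:
    """Blend a (H, W, 3) foreground over a background with a (H, W) alpha in [0, 1].

    Fractional alpha values from a matting network give soft, natural edges,
    unlike a binary 0/1 segmentation mask.
    """
    a = alpha[..., None]                       # broadcast to (H, W, 1)
    return a * foreground + (1.0 - a) * background
```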
The invention uses the SPEC method to improve the estimation accuracy of the SMPL human body parameters. The SPEC method is based on a perspective projection camera model: a camera pose estimator first estimates the approximate camera rotation (roll and pitch angles) and camera focal length, and the SMPL parameters are then estimated on the basis of these camera parameters. Because SPEC estimates from single frames and does not consider temporal continuity between video frames, the invention first smooths the camera parameters with the OneEuro filtering algorithm; according to the rate of change of the signal, the OneEuro algorithm adaptively adjusts its filter parameters so as to smooth the signal and suppress noise. In this way the algorithm effectively reduces jitter and noise while maintaining responsiveness. For the SMPL human body parameters, the invention optimizes with an improved SMPLify method: based on SMPLify, the optimization objective is adjusted to minimize the summed error over the video sequence, and two temporal smoothing terms are added to the SMPLify error function. The specific error function is defined as follows:

$$E(\beta,\theta)=\sum_{i=1}^{N}E_{\mathrm{SMPLify}}\big(\beta_i,\theta_i,K_i,J_i^{est}\big)+\lambda_{s\text{-}2d}\sum_{i=1}^{N-1}\big\|\hat{J}_{i+1}^{2d}-\hat{J}_i^{2d}\big\|_1+\lambda_{s\text{-}3d}\sum_{i=1}^{N-1}\big\|J_{i+1}^{3d}-J_i^{3d}\big\|_1$$

wherein $N$ denotes the number of valid video frames, $\beta$ and $\theta$ denote the SMPL human body shape and pose parameters, and $K$ is the perspective projection matrix constructed from the smoothed camera parameters, used to project the three-dimensional human keypoints $J^{3d}$ into the observation space as $J^{2d}$; $J^{est}$ are the frame-by-frame two-dimensional human keypoints obtained with the HRNet two-dimensional human pose estimation method. The key idea of SMPLify is to optimize the re-projection error between three-dimensional human keypoints and two-dimensional joint points while adding regularization terms on pose and shape, improving the robustness of SMPL human body parameter estimation. The added two-dimensional and three-dimensional temporal smoothing terms reduce inter-frame pose jitter by optimizing the absolute temporal error between frames, making the person's motion smoother and more natural. A sketch of this temporal optimization appears below.
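The helpers `smpl_joints3d`, `project`, and `smplify_error` in this PyTorch sketch are assumed stand-ins for an SMPL forward pass, perspective projection with $K_i$, and the per-frame SMPLify term; the loss weights and iteration count are illustrative.

```python
import torch

def temporal_smplify(beta, theta, K, j2d_est, smpl_joints3d, project,
                     smplify_error, lam_2d=1.0, lam_3d=1.0, iters=30):
    """Jointly refine per-frame SMPL parameters over the whole sequence."""
    beta = beta.clone().requires_grad_(True)    # (N, 10) shape parameters
    theta = theta.clone().requires_grad_(True)  # (N, 24, 3) axis-angle poses
    opt = torch.optim.LBFGS([beta, theta], max_iter=iters,
                            line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        j3d = smpl_joints3d(beta, theta)               # (N, J, 3) 3D keypoints
        j2d = project(K, j3d)                          # (N, J, 2) projected keypoints
        data = smplify_error(beta, theta, j2d, j2d_est).sum()  # per-frame SMPLify term
        smooth2d = (j2d[1:] - j2d[:-1]).abs().sum()    # 2D temporal smoothing (L1)
        smooth3d = (j3d[1:] - j3d[:-1]).abs().sum()    # 3D temporal smoothing (L1)
        loss = data + lam_2d * smooth2d + lam_3d * smooth3d
        loss.backward()
        return loss

    opt.step(closure)
    return beta.detach(), theta.detach()
```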
The finally generated human body reconstruction training data set comprises: the frame-by-frame images and the corresponding person foreground opacity maps, camera intrinsic parameters, extrinsic parameters, human body poses, the human skeleton, and the template human skeleton, as illustrated by the sample layout below.
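For concreteness, one training sample might be organized as follows; all field names and shapes are illustrative assumptions consistent with the list above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameSample:
    """One entry of the reconstruction training set (names and shapes illustrative)."""
    image: np.ndarray            # (H, W, 3) RGB frame
    alpha: np.ndarray            # (H, W) person-foreground opacity map
    intrinsics: np.ndarray       # (3, 3) camera intrinsic matrix
    extrinsics: np.ndarray       # (4, 4) camera pose (world-to-camera)
    pose: np.ndarray             # (24, 3) SMPL axis-angle pose parameters
    joints: np.ndarray           # (24, 3) posed 3D skeleton keypoints
    template_joints: np.ndarray  # (24, 3) skeleton in the template (rest) pose
```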
Step 2, initializing a monocular video dynamic human body three-dimensional reconstruction model, obtaining a synthetic image through a volume rendering algorithm, taking the error between the synthetic image and an actual image as a main loss function, taking the error between volume rendering weight and the foreground of a person as an auxiliary loss function, and obtaining a target model by using a mixed precision training mode and a human body reconstruction training data set training model;
On the basis of the above embodiment, the monocular video dynamic human body three-dimensional reconstruction model comprises a coordinate mapping module and a template neural radiance field module.
Further, the step 2 specifically includes:
step 2.1, the coordinate mapping module maps a coordinate point $x_o$ of the observation space to a template-space point $x_c$;
step 2.2, any coordinate point $x_c$ in the template space is input into the template neural radiance field module, which outputs the color $c$ and volume density $\sigma$ corresponding to $x_c$, and a synthesized image is obtained through a ray marching algorithm;
step 2.3, calculating the volume rendering values and the opacity of the synthesized image;
step 2.4, using the rendering error between the synthesized image and the actual image together with an opacity loss term as the target loss function, and training the model on the human body reconstruction training data set with mixed-precision training to obtain the target model.
Further, the step 2.1 specifically includes:
acquiring an initial coordinate mapping point $x'_c$ with the inverse linear blend skinning method;
fine-tuning the mapping point $x'_c$ with a coordinate correction network to obtain the final mapping point $x_c$, wherein the coordinate correction network is a multi-layer perceptron with multi-resolution hash encoding whose input comprises the coordinate mapping point $x'_c$ and the feature vector $f_i$ corresponding to the frame.
Further, the step 2.2 specifically includes:
for any sample point in the ray marching algorithm, the feature code corresponding to the sample point is obtained with multi-resolution hash encoding and input into the network to obtain the color and density values at that point; for the color value of any pixel in the picture, the ray marching algorithm casts a ray $r$ from the camera through the pixel, samples along $r$, and computes the volume rendering value from the colors and densities at the sample points with the volume rendering formula, whose expression is

$$\hat{C}(r)=\sum_{i=1}^{M} w_i\,c_i,\qquad w_i=T_i\big(1-e^{-\sigma_i\delta_i}\big)$$

wherein $\hat{C}(r)$ is the color value of the pixel that the ray $r$ passes through, $M$ is the number of sample points along the ray, $w_i$ denotes the weight at the $i$-th sample point, $\delta_i=t_{i+1}-t_i$ denotes the distance between two adjacent sample points, and $T_i$ denotes the transmittance from the start position $t_n$ to position $t_i$, computed as

$$T_i=\exp\Big(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Big)$$

The opacity $\hat{O}(r)$ of the synthesized image then follows from the volume rendering formula as

$$\hat{O}(r)=\sum_{i=1}^{M} w_i$$
Further, the expression of the target loss function is

$$\mathcal{L}=\mathcal{L}_{rgb}+\lambda_{opacity}\,\mathcal{L}_{opacity}$$

wherein $\lambda_{opacity}$ is the opacity loss weight, $\mathcal{L}_{opacity}$ denotes the opacity loss term, and $\mathcal{L}_{rgb}$ denotes the rendering error between the synthesized image and the actual image,

$$\mathcal{L}_{rgb}=\frac{1}{N}\sum_{i=1}^{N}\Big[\lambda_{LPIPS}\,\mathrm{LPIPS}\big(\hat{C}_i,C_i\big)+\lambda_{MSE}\,\mathrm{MSE}\big(\hat{C}_i,C_i\big)\Big]$$

where $\hat{C}_i$ and $C_i$ denote the $i$-th synthesized and actual frames respectively, LPIPS denotes the learned perceptual image patch similarity, MSE is the mean squared error, and $\lambda_{LPIPS}$ and $\lambda_{MSE}$ denote the loss weights of LPIPS and MSE respectively.
In implementation, fig. 3 shows the main structure of the human body reconstruction model used in the method, which comprises two main modules: the position mapping module and the template neural radiance field.
The position mapping module comprises two substeps: first, a coordinate point $x_o$ in the observation space is transformed into the template space as $x'_c$ through the inverse linear blend skinning method; then a position fine-tuning network adjusts $x'_c$ to obtain the final template-space mapping point $x_c$.
The inverse linear blend skinning method is based on the classical linear blend skinning idea, which ties the coordinate transformation of a three-dimensional human model to the skeletal animation. For any point $x_o$ in the observation space, the point is linearly interpolated according to the bone transformations $T$ to obtain the coordinate point $x'_c$ deformed into the template space. The calculation formula is as follows:

$$x' = \Big(\sum_{i=1}^{B} w_i(x)\, T_i\Big)\, x$$

wherein $x$ and $x'$ denote the coordinate point before and after the transformation respectively, $w_i(x)$ denotes the linear blend weight of the $i$-th bone at $x$, $T_i$ denotes the homogeneous transformation matrix of the $i$-th bone (for the inverse mapping, the inverse of the per-bone transformation from template to observation space), and $B$ is the number of bones. The inverse linear blend skinning differs from linear blend skinning in the weights $w_i$: linear blend skinning only needs to transform the vertices on the surface of the three-dimensional human model, so a skinning weight can be stored directly for each vertex, whereas inverse linear blend skinning must query the skinning weight of an arbitrary point in space. For this reason, the method stores the skinning weights in a fixed-size volume with an explicit volumetric representation, and the skinning weight at any point in space is obtained by trilinear interpolation. In order to optimize the skinning weights during training as well, the skinning weight volume is generated by a deconvolution network: starting from a randomly initialized fixed-dimensional latent variable $z$, the final voxel grid is obtained through multiple deconvolution layers. A sketch of this mapping appears below.
The position fine-tuning network is a multi-layer perceptron with position encoding; its inputs are the current human body pose $p$ and the mapped coordinate point $x'_c$. The network's objective is to learn the deformation of the human body under different poses, such as clothing wrinkles and hairstyle changes, achieving a more accurate human body reconstruction.
The template neural radiance field is a multi-layer perceptron with position encoding; its input is a coordinate point $x_c$ in the template space, and its outputs are the color value $c$ and volume density $\sigma$ at that point. Compared with the original neural radiance field, the method ignores view-dependent visual effects (such as specular highlights), i.e. it drops the direction input $d$ of the original neural radiance field, because the monocular video contains only one observation viewpoint. The image is then synthesized from the color and density values with the volume rendering formula:

$$C(r)=\int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t))\,dt$$

wherein the calculation formula of $T(t)$ is:

$$T(t)=\exp\Big(-\int_{t_n}^{t}\sigma(r(s))\,ds\Big)$$
where $r$ denotes a ray emitted from the camera through a pixel of the image, $o$ is the camera position, $d$ is the ray direction, and $r(t)=o+td$ denotes a point on the ray. $t_n$ and $t_f$ denote the start and end positions of the volume rendering, which can be obtained by intersecting the ray with the axis-aligned bounding box of the object. Volume rendering assumes that space is filled with countless tiny colored particles; a camera ray stops when it collides with a particle and returns that particle's color. From a probabilistic point of view, the volume density $\sigma(x)$ describes the probability that a ray stops due to a particle collision within a small differential distance at point $x$, and $T(t)$ describes the probability that the ray travels from $t_n$ to $t$ without any collision, i.e. the transmittance. The relationship between transmittance and volume density can be derived from a simple differential equation:

$$T(t+dt)=T(t)\,\big(1-\sigma(t)\,dt\big)\;\Longrightarrow\;T'(t)=-T(t)\,\sigma(t)$$
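Solving this differential equation with the boundary condition $T(t_n)=1$ yields the exponential transmittance used above:

```latex
\frac{dT}{T} = -\sigma(t)\,dt
\;\Longrightarrow\;
\ln T(t) = -\int_{t_n}^{t} \sigma(s)\,ds
\;\Longrightarrow\;
T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(s)\,ds\Big).
```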
It can be seen that the color value of a pixel in the image requires solving a nested integral. This integral can be approximated by numerical integration, computed as follows:

$$\hat{C}(r)=\sum_{i=1}^{M}T_i\big(1-e^{-\sigma_i\delta_i}\big)\,c_i$$

and likewise for $T_i$:

$$T_i=\exp\Big(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Big)$$

wherein $\delta_i=t_{i+1}-t_i$ denotes the distance between two adjacent sample points, $M$ denotes the number of sample points, and $\sigma_i$ and $c_i$ denote the volume density and color value at the $i$-th sample point respectively.
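A numerically equivalent implementation of these sums, in the style of common NeRF code, might read as follows; the batch shapes and stabilizing epsilon are illustrative.

```python
import torch

def volume_render(sigma, color, t_vals):
    """Numerical volume rendering along a batch of rays (a sketch).

    sigma:  (R, M) volume densities at the samples.
    color:  (R, M, 3) colors at the samples.
    t_vals: (R, M+1) sample depths, so delta_i = t_{i+1} - t_i is exact.
    Returns the per-ray color (R, 3) and opacity (R,).
    """
    delta = t_vals[:, 1:] - t_vals[:, :-1]                    # (R, M)
    alpha = 1.0 - torch.exp(-sigma * delta)                   # 1 - e^{-sigma_i delta_i}
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)        # prod_{j<=i} (1 - alpha_j)
    trans = torch.cat([torch.ones_like(trans[:, :1]),
                       trans[:, :-1]], dim=-1)                # shift -> T_i = prod_{j<i}
    weights = trans * alpha                                   # w_i
    rgb = (weights[..., None] * color).sum(dim=1)             # sum_i w_i c_i
    opacity = weights.sum(dim=1)                              # hat O(r) = sum_i w_i
    return rgb, opacity
```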
The whole computation is differentiable: through the volume rendering algorithm, the neural radiance field can be optimized directly by computing the rendering error between the synthesized image and the real image and back-propagating the gradient. Common rendering error metrics include the mean squared error (MSE) and the absolute error (L1). The method additionally adopts the learned perceptual image patch similarity (LPIPS) to measure the rendering error between the synthesized and real images; compared with MSE and L1, LPIPS measures the difference between whole images through a convolutional neural network and agrees better with human perception, so it guides model optimization better. The rendering loss used by the final model is a combination of LPIPS and MSE, with LPIPS dominant, as follows:

$$\mathcal{L}_{rgb}=\frac{1}{N}\sum_{i=1}^{N}\Big[\lambda_{LPIPS}\,\mathrm{LPIPS}\big(\hat{C}_i,C_i\big)+\lambda_{MSE}\,\mathrm{MSE}\big(\hat{C}_i,C_i\big)\Big]$$

where $N$ denotes the number of valid video frames, and $\hat{C}_i$ and $C_i$ denote the $i$-th rendered and actual frames respectively.
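As a sketch, the combined loss can be written with the publicly available `lpips` package, assuming its usual convention of (N, 3, H, W) inputs scaled to [-1, 1]; the weight values are illustrative.

```python
import torch
import lpips  # pip install lpips; the common LPIPS reference implementation

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual image-patch similarity

def rendering_loss(pred, target, lam_lpips=1.0, lam_mse=0.2):
    """LPIPS-dominated rendering loss between synthesized and real frames."""
    l_lpips = lpips_fn(pred, target).mean()
    l_mse = torch.nn.functional.mse_loss(pred, target)
    return lam_lpips * l_lpips + lam_mse * l_mse
```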
To increase the convergence efficiency of the neural radiance field, the method also adds an opacity loss as a regularization term. By the definitions of $T(t)$ and $\sigma(r(t))$, integrating along the ray gives the expected probability that the ray collides between $t_n$ and $t_f$. A collision during propagation means the camera ray sees the object, and no collision means it sees the background, so this expected probability can be understood as the opacity $\hat{O}(r)$ in the synthesized image. Its numerical integral is computed as:

$$\hat{O}(r)=\sum_{i=1}^{M}T_i\big(1-e^{-\sigma_i\delta_i}\big)$$
In the data processing stage the PP-Matting method produced the person foreground opacity map; by optimizing the error between the real and synthesized opacities, the model can learn an implicit geometric representation of the person as early as possible, accelerating convergence. The opacity loss is computed as follows:

$$\mathcal{L}_{opacity}=\frac{1}{N}\sum_{i=1}^{N}\big\|\hat{O}_i-O_i\big\|$$

The complete loss function of the final model is defined as follows:

$$\mathcal{L}=\mathcal{L}_{rgb}+\lambda_{opacity}\,\mathcal{L}_{opacity}$$

wherein $\lambda_{opacity}$ is the opacity loss weight; it is set to a larger value (still below 1) at the beginning of training to accelerate the convergence of the volume density, and the weight of this partial loss is reduced as training proceeds, so that the main loss of the model remains the rendering loss. A scheduling sketch follows.
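One possible weight schedule consistent with this description is sketched below; the initial weight, decay horizon, and the L1 form of the opacity error are illustrative assumptions.

```python
import torch.nn.functional as F

def total_loss(pred_rgb, gt_rgb, pred_opacity, gt_alpha,
               render_loss_fn, step, max_steps, lam0=0.5):
    """Rendering loss plus a decaying opacity regularizer.

    render_loss_fn: e.g. the LPIPS + MSE sketch above.
    pred_opacity, gt_alpha: (N, H, W) rendered opacity vs. matting alpha.
    """
    # Start below 1 and decay to 0 over the first quarter of training,
    # so the rendering loss dominates afterwards.
    lam_opacity = lam0 * max(0.0, 1.0 - step / (0.25 * max_steps))
    return (render_loss_fn(pred_rgb, gt_rgb)
            + lam_opacity * F.l1_loss(pred_opacity, gt_alpha))
```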
Aiming at the slow convergence of the position encoding used in existing approaches, the method changes the position encoding to multi-resolution hash encoding: space is represented by several grids of different resolutions, and because the high-frequency feature distribution is sparse, a fixed-size hash table stores the feature codes at each resolution; during optimization the high-frequency features are continually written into the hash tables along the gradients, which effectively improves computational efficiency. Each resolution corresponds to one hash table; during encoding, the 4 (two-dimensional case) or 8 (three-dimensional case) grid points neighboring the coordinate point $x_c$ are queried first, and the encoding result is obtained by linear interpolation. For the multi-layer perceptron part, because tens of millions of network evaluations are needed, the method stores the network parameters as 16-bit half-precision floating points, computes the forward pass in 16-bit precision, and computes the backward gradient propagation in 32-bit full precision, preserving the model's convergence. The optimizer used by the model is Adam with the learning rate set to 1e-4; the relevant parameters can be adjusted as needed in practical applications.
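The mixed-precision regime described above corresponds to the standard PyTorch AMP pattern sketched below; apart from the Adam optimizer and the 1e-4 learning rate stated in the text, everything here (names, loss hook, loop structure) is illustrative.

```python
import torch

def train(model, loader, num_steps, device="cuda"):
    """Half-precision forward pass, full-precision gradient handling."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # per the text
    scaler = torch.cuda.amp.GradScaler()                  # keeps fp16 training stable
    for step, batch in zip(range(num_steps), loader):
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(dtype=torch.float16):  # 16-bit forward pass
            loss = model.training_loss(batch)   # assumed model-specific loss hook
        scaler.scale(loss).backward()           # scaled backward, fp32 master step
        scaler.step(opt)
        scaler.update()
```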
And 3, inputting any frame in the target video as a target frame into a target model, adjusting the posture and the position of a camera according to the three-dimensional posture information of the person in the target frame, and rendering and generating a human body action image under a new visual angle.
On the basis of the above embodiment, the step 3 specifically includes:
inputting any frame of the target video into the target model as the target frame; adjusting the camera orientation and position for the target frame; generating rays from the camera through the picture pixels and sampling along them with the ray marching algorithm; mapping the sample points to the template space through the coordinate mapping module; querying the color and volume density values at those points through the template neural radiance field module; computing the color value of each pixel with the volume rendering formula; and synthesizing the final image.
In implementation, after the model is trained, the camera orientation and position are adjusted for any frame; rays of the picture pixels are generated from the camera and sampled; the sample points are mapped to the template space through the coordinate mapping module; the color and volume density values at those points are then queried through the template neural radiance field module; the color value of each pixel is computed with the volume rendering formula; and the final image is synthesized as the human action image from the new viewpoint, as sketched below.
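The inference path might be organized as follows; `camera.generate_rays`, `sample_along_rays`, and the model's `map_to_template`/`template_field` methods are assumed names mirroring the modules described above, and `volume_render` refers to the earlier rendering sketch.

```python
import torch

@torch.no_grad()
def render_novel_view(model, camera, image_hw, chunk=4096):
    """Render a full frame from an adjusted camera, chunking rays to bound memory."""
    rays_o, rays_d = camera.generate_rays(image_hw)          # one ray per pixel, flattened
    pixels = []
    for i in range(0, rays_o.shape[0], chunk):
        t_vals, x_obs = sample_along_rays(rays_o[i:i + chunk], rays_d[i:i + chunk])
        x_tpl = model.map_to_template(x_obs)                 # observation -> template space
        color, sigma = model.template_field(x_tpl)           # query the radiance field
        rgb, _ = volume_render(sigma, color, t_vals)         # earlier sketch
        pixels.append(rgb)
    h, w = image_hw
    return torch.cat(pixels, dim=0).view(h, w, 3)
```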
According to the monocular video dynamic human body three-dimensional reconstruction method based on pose optimization provided by the embodiments of the disclosure, a more accurate human body reconstruction data set can be obtained by optimizing the person foreground generation method and the SMPL parameter estimation strategy. Meanwhile, the position encoding of the human neural radiance field and the training loss function are adjusted, which improves the model convergence efficiency and thus further improves the reconstruction efficiency, accuracy, and adaptability.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A monocular video dynamic human body three-dimensional reconstruction method based on pose optimization, characterized by comprising the following steps:
step 1, inputting a monocular human action video sequence, sequentially performing target detection, person foreground extraction, camera and SMPL human body parameter estimation, and SMPL parameter temporal optimization, and constructing a human body reconstruction training data set;
step 2, initializing a monocular video dynamic human body three-dimensional reconstruction model, obtaining a synthesized image through a volume rendering algorithm, taking the error between the synthesized image and the actual image as the main loss function and the error between the volume-rendered opacity and the person foreground as an auxiliary loss function, and training the model on the human body reconstruction training data set with mixed-precision training to obtain a target model;
step 3, inputting any frame of the target video into the target model as the target frame, adjusting the camera pose and position according to the three-dimensional pose information of the person in the target frame, and rendering a human action image from a new viewpoint.
2. The method according to claim 1, wherein the step 1 specifically comprises:
step 1.1, extracting target detection boxes from the monocular person action video sequence with a YOLOv3 object detector, and screening valid video frames according to detection confidence and detection category;
step 1.2, extracting the person foreground of each valid video frame with the PP-Matting human image segmentation method;
step 1.3, estimating the camera parameters and SMPL human body parameters of each valid video frame with the SPEC SMPL parameter estimation method, wherein the camera parameters comprise the vertical field of view, the pitch angle, the roll angle, and the relative displacement between the camera and the human body, and the SMPL human body parameters comprise 10-dimensional shape parameters and 24 bone pose parameters represented as axis angles;
step 1.4, smoothing the pitch and roll angles of each valid video frame with the One-Euro filter, taking the mean of the per-frame vertical fields of view as the global vertical field of view, extracting the two-dimensional human pose points $J_i^{est}$ of each valid video frame with the HRNet two-dimensional pose estimation algorithm, and then performing temporal optimization with the improved SMPLify method;
step 1.5, assembling the data of steps 1.1 to 1.4 into the human body reconstruction training data set.
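For illustration, a minimal Python implementation of the One-Euro filter referenced in step 1.4 for smoothing the per-frame pitch and roll angles; the filter follows the published algorithm (Casiez et al., CHI 2012), while the default parameter values are illustrative, not values from the patent:

```python
import math

class OneEuroFilter:
    """One-Euro filter for a scalar signal, e.g. per-frame camera pitch/roll."""

    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq = freq              # sampling rate in Hz (video frame rate)
        self.min_cutoff = min_cutoff  # minimum cutoff frequency
        self.beta = beta              # speed coefficient (adapts the cutoff)
        self.d_cutoff = d_cutoff      # cutoff for the derivative filter
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor of a first-order low-pass filter at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through unchanged
            self.x_prev = x
            return x
        dx = (x - self.x_prev) * self.freq            # raw derivative
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)  # adaptive cutoff
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev       # smoothed value
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```

Usage is one filter instance per signal, applied frame by frame, e.g. `smooth = OneEuroFilter(freq=fps); pitch_smoothed = [smooth(p) for p in pitch_angles]`.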
3. The method according to claim 2, wherein the step of performing temporal optimization with the improved SMPLify method comprises:
taking the minimized sum of errors over the video sequence as the optimization target, adding two temporal smoothing terms to the SMPLify error function to obtain the target error function, and performing the temporal optimization with the L-BFGS optimization algorithm.
4. The method according to claim 3, wherein the expression of the target error function is
$$E = \sum_{i=1}^{N} E_{\mathrm{SMPLify}}\left(\beta_i, \theta_i, K_i, J_i^{est}\right) + \lambda_{s\text{-}2d} \sum_{i=2}^{N} \left\| \Pi_{K_i}\!\left(J_i^{3D}\right) - \Pi_{K_{i-1}}\!\left(J_{i-1}^{3D}\right) \right\|^2 + \lambda_{s\text{-}3d} \sum_{i=2}^{N} \left\| J_i^{3D} - J_{i-1}^{3D} \right\|^2$$
wherein $N$ denotes the number of valid video frames, $\beta_i$ and $\theta_i$ denote the SMPL shape and pose parameters of the $i$-th frame, $K_i$ is the camera projection matrix of that frame, $\lambda_{s\text{-}2d}$ and $\lambda_{s\text{-}3d}$ are the loss weights of the two-dimensional and three-dimensional temporal smoothing terms, $J_i^{3D}$ denotes the three-dimensional human skeleton key points of the $i$-th frame under the given parameters, $\Pi_{K_i}(J_i^{3D})$ denotes the two-dimensional human skeleton key points obtained after projection with $K_i$, and $J^{est}$ denotes the frame-wise two-dimensional human key points obtained with the HRNet two-dimensional human pose estimation method.
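A hedged PyTorch sketch of this temporal optimization: the pose parameters of all frames are refined jointly with L-BFGS under a data term plus the two smoothing terms. Here `smpl_joints3d` and `project` are hypothetical placeholders, and the SMPLify data term is simplified to a plain reprojection error:

```python
import torch

def temporal_refine(betas, thetas, K, J_est, lam2d=1.0, lam3d=1.0, iters=20):
    """betas: (N, 10), thetas: (N, 72), K: cameras, J_est: (N, J, 2)."""
    thetas = thetas.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([thetas], max_iter=iters)

    def closure():
        opt.zero_grad()
        J3d = smpl_joints3d(betas, thetas)       # (N, J, 3) 3D key points
        J2d = project(J3d, K)                    # (N, J, 2) projections
        data = ((J2d - J_est) ** 2).sum()        # simplified data term
        s2d = ((J2d[1:] - J2d[:-1]) ** 2).sum()  # 2D temporal smoothness
        s3d = ((J3d[1:] - J3d[:-1]) ** 2).sum()  # 3D temporal smoothness
        loss = data + lam2d * s2d + lam3d * s3d
        loss.backward()
        return loss

    opt.step(closure)                            # L-BFGS needs a closure
    return thetas.detach()
```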
5. The method according to claim 4, wherein the monocular video dynamic human body three-dimensional reconstruction model comprises a coordinate mapping module and a template neural radiance field module.
6. The method according to claim 5, wherein the step 2 specifically comprises:
step 2.1, the coordinate mapping module maps a coordinate point $x_o$ in observation space to a point $x_c$ in template space;
step 2.2, an arbitrary coordinate point $x_c$ in template space is input into the template neural radiance field module, which outputs the color $c$ and volume density $\sigma$ corresponding to $x_c$; a synthetic image is then obtained through a ray marching algorithm;
step 2.3, calculating the volume rendering values and the opacity of the synthetic image;
step 2.4, taking the rendering error between the synthetic image and the actual image together with an opacity loss term as the target loss function, and training the model on the human body reconstruction training data set with mixed-precision training to obtain the target model.
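A sketch of the mixed-precision training step of step 2.4, assuming a PyTorch implementation: the `model` interface, the batch layout, and the loss weight are illustrative placeholders, while the GradScaler/autocast calls are standard PyTorch AMP:

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()  # scales the loss to keep fp16 stable

def train_step(model, batch, optimizer, lam_opacity=0.1):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # forward pass in fp16
        rgb, weights = model(batch["rays"])           # volume-rendered colors
        l_render = F.mse_loss(rgb, batch["gt_rgb"])   # main loss term
        opacity = weights.sum(dim=-1)                 # per-ray opacity
        l_opacity = F.mse_loss(opacity, batch["fg_mask"])  # auxiliary term
        loss = l_render + lam_opacity * l_opacity
    scaler.scale(loss).backward()                     # scaled backward pass
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```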
7. The method according to claim 6, wherein the step 2.1 specifically comprises:
acquiring an initial coordinate mapping point $x_{c'}$ with the inverse linear blend skinning method;
fine-tuning the mapping point $x_{c'}$ with a coordinate correction network to obtain the final mapping point $x_c$, wherein the coordinate correction network is a multi-layer perceptron with multi-resolution hash encoding, and its input comprises the coordinate mapping point $x_{c'}$ and the feature information $f_i$ corresponding to the frame.
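A sketch of this coordinate mapping under stated assumptions: `skinning_weights`, `bone_transforms`, `hash_encode`, and the per-frame latent code `f_i` (shape (1, D)) are hypothetical placeholders; only the inverse-skinning-plus-residual structure is taken from the claim:

```python
import torch

def observation_to_template(x_o, bones, correction_mlp, f_i):
    """x_o: (P, 3) observation-space points -> (P, 3) template-space points."""
    w = skinning_weights(x_o, bones)          # (P, 24) blend weights
    G = bone_transforms(bones)                # (24, 4, 4) per-bone transforms
    T = torch.einsum("pb,bij->pij", w, G)     # (P, 4, 4) blended transform
    x_h = torch.cat([x_o, torch.ones_like(x_o[:, :1])], dim=-1)
    # Inverse linear blend skinning: solve T @ x_c' = x_o per point.
    x_c0 = torch.linalg.solve(T, x_h.unsqueeze(-1)).squeeze(-1)[:, :3]
    # Residual correction: hash-encoded position plus per-frame feature.
    feat = torch.cat([hash_encode(x_c0), f_i.expand(len(x_c0), -1)], dim=-1)
    return x_c0 + correction_mlp(feat)        # refined mapping point x_c
```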
8. The method according to claim 7, wherein the step 2.2 specifically comprises:
for any sampling point in the ray marching algorithm, obtaining the feature code of the sampling point with multi-resolution hash encoding and feeding the code into the network to obtain the color and density values at the sampling point; for the color value of any pixel in the picture, the ray marching algorithm emits a ray $r$ from the camera through the pixel, samples along $r$, and computes the volume rendering value from the colors and densities at the sampling points with the volume rendering formula, whose expression is
$$\hat{C}(r) = \sum_{i=1}^{M} w_i c_i, \qquad w_i = T_i \left(1 - e^{-\sigma_i \delta_i}\right)$$
wherein $\hat{C}(r)$ is the color value of the pixel that the ray $r$ passes through, $M$ is the number of sampling points along the ray, $w_i$ denotes the weight at sampling point $i$, $\delta_i = t_{i+1} - t_i$ denotes the distance between two adjacent sampling points, and $T_i$ denotes the transmittance from the starting position $t_n$ to position $t_i$, computed as
$$T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$
and the opacity $\hat{O}(r)$ of the synthetic image is obtained from the volume rendering formula as
$$\hat{O}(r) = \sum_{i=1}^{M} w_i$$
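The volume rendering formula above transcribes directly into code; a minimal PyTorch sketch, with tensor shapes chosen purely for illustration:

```python
import torch

def volume_render(colors, sigmas, t_vals):
    """colors: (R, M, 3), sigmas: (R, M), t_vals: (R, M+1) sample depths."""
    deltas = t_vals[..., 1:] - t_vals[..., :-1]        # delta_i = t_{i+1}-t_i
    alpha = 1.0 - torch.exp(-sigmas * deltas)          # per-sample opacity
    # Transmittance T_i = prod_{j<i} (1 - alpha_j); shift so T_1 = 1.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    T = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = T * alpha                                # w_i = T_i (1 - e^-..)
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=-2) # C_hat(r)
    opacity = weights.sum(dim=-1)                      # O_hat(r)
    return rgb, opacity, weights
```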
9. The method according to claim 8, wherein the expression of the target loss function is
$$L = L_{render} + \lambda_{opacity} L_{opacity}$$
wherein $\lambda_{opacity}$ is the opacity loss weight, $L_{opacity}$ denotes the opacity loss term, and $L_{render}$ denotes the rendering error between the synthetic image and the actual image,
$$L_{render} = \lambda_{LPIPS}\,\mathrm{LPIPS}\left(\hat{C}_i, C_i\right) + \lambda_{MSE}\,\mathrm{MSE}\left(\hat{C}_i, C_i\right)$$
wherein $\hat{C}_i$ and $C_i$ denote the $i$-th frame synthetic image and the actual image respectively, LPIPS denotes the learned perceptual image patch similarity, MSE denotes the mean square error, and $\lambda_{LPIPS}$ and $\lambda_{MSE}$ denote the loss weights of LPIPS and MSE respectively.
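A sketch of this target loss, assuming the open-source lpips package for the LPIPS term; the loss weights are illustrative defaults rather than values from the patent:

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; learned perceptual image patch similarity

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance network

def target_loss(pred_img, gt_img, pred_opacity, fg_mask,
                lam_lpips=0.1, lam_mse=1.0, lam_opacity=0.1):
    # Images are (B, 3, H, W) scaled to [-1, 1], as LPIPS expects.
    l_render = (lam_lpips * lpips_fn(pred_img, gt_img).mean()
                + lam_mse * F.mse_loss(pred_img, gt_img))
    # Auxiliary term: rendered per-ray opacity vs. the person foreground mask.
    l_opacity = F.mse_loss(pred_opacity, fg_mask)
    return l_render + lam_opacity * l_opacity
```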
10. The method according to claim 9, wherein the step 3 specifically comprises:
inputting any frame of the target video into the target model as the target frame; adjusting the camera orientation and position for the target frame; generating, from the camera, rays through the picture pixels; sampling along the rays with the ray marching algorithm; mapping the sampling points to template space with the coordinate mapping module; querying the color and volume density values of those points with the template neural radiance field module; computing the color values of the pixels with the volume rendering formula; and synthesizing the final image.
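A hypothetical usage sketch of this novel-view synthesis: orbit the camera around the subject of one target frame and re-render at each azimuth. Here `orbit_camera`, `save_image`, `subject_root`, and the `target_model.render` interface are all assumed for illustration, not defined by the patent:

```python
import numpy as np

# Render 36 novel views around the person in frame 0.
for azimuth in np.linspace(0.0, 2.0 * np.pi, 36, endpoint=False):
    cam = orbit_camera(center=subject_root, radius=2.5, azimuth=azimuth)
    image = target_model.render(frame_idx=0, camera=cam)
    save_image(image, f"novel_view_{azimuth:.2f}.png")
```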
CN202310831742.8A 2023-07-07 2023-07-07 Monocular video dynamic human body three-dimensional reconstruction method based on pose optimization Pending CN116681838A (en)

Priority Applications (1)

Application Number: CN202310831742.8A
Priority Date: 2023-07-07
Filing Date: 2023-07-07
Title: Monocular video dynamic human body three-dimensional reconstruction method based on pose optimization

Publications (1)

Publication Number: CN116681838A
Publication Date: 2023-09-01

Family ID: 87785639

Country Status (1)

CN: CN116681838A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173343A * 2023-11-03 2023-12-05 北京渲光科技有限公司 Relighting method and relighting system based on a neural radiance field
CN117173343B * 2023-11-03 2024-02-23 北京渲光科技有限公司 Relighting method and relighting system based on a neural radiance field
CN117496072A * 2023-12-27 2024-02-02 南京理工大学 Three-dimensional digital human generation and interaction method and system
CN117496072B * 2023-12-27 2024-03-08 南京理工大学 Three-dimensional digital human generation and interaction method and system
CN117994708A * 2024-04-03 2024-05-07 哈尔滨工业大学(威海) Human body video generation method based on a temporally consistent latent-space-guided diffusion model
CN117994708B * 2024-04-03 2024-05-31 哈尔滨工业大学(威海) Human body video generation method based on a temporally consistent latent-space-guided diffusion model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination