CN111932678A - Multi-view real-time human motion, gesture, expression and texture reconstruction system - Google Patents

Multi-view real-time human motion, gesture, expression and texture reconstruction system

Info

Publication number
CN111932678A
Authority
CN
China
Prior art keywords
human body
camera
joint
parameters
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010810382.XA
Other languages
Chinese (zh)
Other versions
CN111932678B (en)
Inventor
张宇翔 (Zhang Yuxiang)
安亮 (An Liang)
戴翘楚 (Dai Qiaochu)
于涛 (Yu Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Weilan Technology Co ltd
Original Assignee
Beijing Weilan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Weilan Technology Co ltd filed Critical Beijing Weilan Technology Co ltd
Priority to CN202010810382.XA priority Critical patent/CN111932678B/en
Publication of CN111932678A publication Critical patent/CN111932678A/en
Application granted granted Critical
Publication of CN111932678B publication Critical patent/CN111932678B/en
Withdrawn - After Issue


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 7/90 Determination of colour characteristics
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/04 Texture mapping
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G06T 15/205 Image-based rendering
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tessellation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-view real-time human body motion, gesture, expression and texture reconstruction system, in which: a plurality of cameras are arranged to enclose a capture area, and their intrinsic and extrinsic parameters are calibrated by a camera calibration method; human body images are acquired by the calibrated cameras, transcoded into RGB images, and used for monocular human body pose estimation; the monocular pose estimation yields a heatmap and joint affinities for each joint of the human body, and non-maximum suppression on the joint heatmaps gives the joint coordinates; from these, the three-dimensional joint coordinates of the human body are obtained, and from them a three-dimensional reconstruction model of the body. The system estimates human body pose by deep learning and can fit and render multi-person human body models in real time in a test environment.

Description

Multi-view real-time human motion, gesture, expression and texture reconstruction system
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-view real-time human body motion, gesture, expression and texture reconstruction system.
Background
With the growth of computing power and the continual iteration of graphics cards, deep learning has developed rapidly and greatly advanced the field of computer vision. Current reconstruction technology falls mainly into two categories: one uses ordinary RGB cameras and obtains depth information through multi-view feature-point matching and triangulation; the other uses a depth camera directly to acquire a depth map for reconstruction. For example, Apple's iPhone X carries a depth camera to perform face reconstruction, pushing the technology into the consumer market.
However, compared with RGB cameras, depth cameras suffer from strong interference from ambient light, limited depth detection range and high price, so widely available RGB cameras have greater potential for human body reconstruction, with applications in virtual fitting, CG games and similar fields. Traditional human body reconstruction methods mostly rely on wearable sensors or green-screen segmentation and therefore place high demands on the environment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multi-view real-time human body motion, gesture, expression and texture reconstruction system.
It addresses the problem that traditional human body reconstruction methods mostly rely on wearable sensors or green-screen segmentation and therefore place high demands on the environment.
The invention is realized by the following technical scheme:
the invention provides a multi-viewpoint real-time human body motion, gesture, expression and texture reconstruction system, which comprises the following steps:
S1, arranging a plurality of cameras to enclose a capture area, and calibrating the intrinsic and extrinsic parameters of the cameras by a camera calibration method;
S2, acquiring human body images with the calibrated cameras, transcoding them into RGB images, and then performing monocular human body pose estimation;
S3, obtaining a heatmap and joint affinities for each joint of the human body from the monocular pose estimation, and applying non-maximum suppression to the joint heatmaps to obtain the joint coordinates;
S4, constraining the monocular detections with the epipolar constraint across views to obtain the two-dimensional joint coordinates of the human body in each view, and constructing a sparse 4D graph from the epipolar geometric constraint, the joint affinities, the heatmaps and a temporal constraint from the previous frame's three-dimensional result;
S5, partitioning the sparse 4D graph with a greedy algorithm to obtain each person's 2D joint coordinates in each view, and triangulating the matched joint coordinates across views using the camera parameters to obtain three-dimensional joint coordinates and thus each person's three-dimensional skeleton;
S6, re-projecting the hand joints of the three-dimensional skeleton onto each camera image, obtaining the rectangular hand regions using scale information, estimating hand parameters with a hand detector to obtain the PCA (principal component analysis) coefficients and affine transform of the hand pose, and then denoising the hand detections across views;
S7, projecting the nose joint of the three-dimensional skeleton into each view, obtaining the rectangular face region in each view from scale information and face orientation, estimating face parameters with a face detector, and then denoising the face detections across views;
S8, fitting a three-dimensional human body model to the obtained three-dimensional joint coordinates and the hand and face parameters, re-projecting the optimized model so that it aligns with the original image in each view, and projecting the color information of the images back onto the parametric model to complete the texture map;
S9, feeding the texture map into a neural network that predicts the normal offset of each vertex, completing the reconstruction of surface details.
Preferably, calibrating the intrinsic and extrinsic camera parameters of the plurality of cameras comprises:
calibrating the intrinsic and extrinsic parameters of the cameras by a checkerboard calibration method.
Preferably, the human body images acquired by the calibrated cameras are transmitted to a capture card in the host over a PCIe interface, transcoded by a CUDA program and scaled into a three-channel RGB matrix to obtain the RGB images.
Preferably, the RGB images are gamma-corrected to improve their quality.
The beneficial effects of the invention are: human body pose estimation is performed by deep learning, and multi-person human body models can be fitted and rendered in real time.
Drawings
Fig. 1 is a flowchart of a multi-view real-time human motion, gesture, expression, and texture reconstruction system according to an embodiment of the present invention.
Detailed Description
The following is a detailed description of specific embodiments of the invention with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are given by way of illustration and explanation only and do not limit the invention.
First, to aid understanding of the multi-view real-time human body motion, gesture, expression and texture reconstruction system provided by the embodiments of the application, its application scenario is explained: the system reconstructs a three-dimensional model of the human body. Traditional human body reconstruction methods mostly rely on wearable sensors or green-screen segmentation and place high demands on the environment. The system provided by the embodiments of the application is described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of the multi-view real-time human body motion, gesture, expression and texture reconstruction system according to an embodiment of the present invention. As shown in fig. 1, the system comprises the following steps:
s1, enclosing a plurality of camera frames to form a capture area, and calibrating the camera internal parameters and the camera external parameters of the plurality of cameras by a camera calibration method. When a plurality of camera frames are used for enclosing a capture area, four industrial cameras can be erected on a test site, the height from the ground is about 1.2m, the distance between the cameras is about 3-5 m, and the capture area is enclosed in a rectangular shape. In practical application, the number of cameras, the capturing area and the setting parameters can be set according to requirements.
Further, the camera parameters are calibrated, and as a way, a chessboard calibration method can be adopted to calibrate the camera.
Calibrating camera parameters, wherein the parameters to be calibrated comprise camera internal parameters and camera external parameters, and calibrating the camera internal parameters by using a checkerboard. For each camera, about 20 pieces of checkerboards in different handheld postures need to be photographed, and then a matlab calibration tool box is called to calibrate the internal reference, wherein calibration parameters comprise the focal length, distortion parameters and the like of the camera. And after calibrating the internal reference, continuously calibrating the external reference of the camera, similarly calibrating by using a checkerboard and a matlab toolbox, and if the precision requirement is higher, paving a texture-rich material on the center of the scene and using a photoscan to perform auxiliary calibration.
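The core of checkerboard (Zhang-style) calibration is estimating, for each photograph, the homography between the planar board and the image; intrinsics are then extracted from several such homographies. A minimal sketch of the homography step with synthetic correspondences (the numeric camera values here are made up for illustration, not taken from the patent):

```python
import numpy as np

def estimate_homography(board_pts, img_pts):
    """DLT estimate of the 3x3 homography mapping planar checkerboard
    coordinates to image pixels (the core step of Zhang's method)."""
    A = []
    for (X, Y), (u, v) in zip(board_pts, img_pts):
        # Each correspondence contributes two linear constraints on H.
        A.append([-X, -Y, -1, 0, 0, 0, u * X, u * Y, u])
        A.append([0, 0, 0, -X, -Y, -1, v * X, v * Y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)       # null vector of A, up to scale
    return H / H[2, 2]

# Synthetic board: project 4x4 corners through a known homography, recover it.
H_true = np.array([[800., 5., 320.], [2., 790., 240.], [0.001, 0.002, 1.]])
board = [(x, y) for x in range(4) for y in range(4)]
img = []
for X, Y in board:
    p = H_true @ np.array([X, Y, 1.0])
    img.append((p[0] / p[2], p[1] / p[2]))
H_est = estimate_homography(board, img)
print(bool(np.allclose(H_est, H_true, atol=1e-6)))  # True
```

With noiseless correspondences the null vector of the DLT system recovers the homography exactly; real calibration photos would add corner detection and a nonlinear refinement.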
S2, acquiring human body images with the calibrated cameras, transcoding them into RGB images, and then performing monocular human body pose estimation. Further, the human body images acquired by the calibrated cameras are transmitted to a capture card in the host over a PCIe interface, transcoded by a CUDA program and scaled into a three-channel RGB matrix to obtain RGB images.
The image quality can be improved by gamma correction. Specifically, each camera's three-channel image is used to perform monocular human body pose estimation, for which open-source frameworks such as OpenPose, AlphaPose or Pose Proposal Networks can be used.
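The gamma-correction step can be sketched as a simple lookup-table transform on the 8-bit RGB matrix; the gamma value of 2.2 is an assumption for illustration:

```python
import numpy as np

def gamma_correct(rgb, gamma=2.2):
    """Brighten mid-tones of an HxWx3 uint8 RGB image via a 256-entry LUT."""
    lut = (np.linspace(0.0, 1.0, 256) ** (1.0 / gamma) * 255).astype(np.uint8)
    return lut[rgb]  # fancy indexing applies the LUT to every channel

frame = np.full((4, 4, 3), 64, dtype=np.uint8)   # a dark test patch
out = gamma_correct(frame)
print(bool(out[0, 0, 0] > frame[0, 0, 0]))  # True: mid-tones are brightened
```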
S3, obtaining a heatmap and joint affinities for each joint of the human body from the monocular pose estimation, and applying non-maximum suppression to the joint heatmaps to obtain the joint coordinates.
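The non-maximum suppression on a joint heatmap can be sketched as keeping strict local maxima above a confidence threshold; the 3x3 neighbourhood and the threshold value are assumptions for illustration:

```python
import numpy as np

def heatmap_peaks(hm, thresh=0.3):
    """Keep pixels that are strict local maxima of the joint heatmap in
    their 3x3 neighbourhood and whose confidence is at least thresh."""
    pad = np.pad(hm, 1, mode="constant", constant_values=-np.inf)
    h, w = hm.shape
    neigh = np.stack([pad[dy:dy + h, dx:dx + w]          # 8 shifted copies
                      for dy in range(3) for dx in range(3)
                      if (dy, dx) != (1, 1)])
    peaks = (hm > neigh.max(axis=0)) & (hm >= thresh)
    return [(int(r), int(c)) for r, c in zip(*np.nonzero(peaks))]

hm = np.zeros((6, 6))
hm[2, 3] = 0.9   # a confident joint detection
hm[2, 2] = 0.5   # suppressed: weaker neighbour of the peak above
hm[4, 1] = 0.8   # a second joint
print(heatmap_peaks(hm))  # [(2, 3), (4, 1)]
```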
S4, constraining the monocular detections with the epipolar constraint across views to obtain the two-dimensional joint coordinates of the human body in each view, and constructing a sparse 4D graph from the epipolar geometric constraint, the joint affinities, the heatmaps and a temporal constraint from the previous frame's three-dimensional result.
S5, partitioning the sparse 4D graph with a greedy algorithm to obtain each person's 2D joint coordinates in each view, and triangulating the matched joint coordinates across views using the camera parameters to obtain three-dimensional joint coordinates and each person's three-dimensional skeleton.
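Once 2D joints are matched across views, each 3D joint follows from linear (DLT) triangulation with the calibrated projection matrices. A self-contained sketch; the intrinsics and the 1 m baseline below are made-up values for illustration:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint from two calibrated views.
    P1, P2 are 3x4 projection matrices; x1, x2 are pixel coordinates."""
    A = np.stack([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                      # homogeneous 3D point, up to scale
    return X[:3] / X[3]

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # reference view
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])  # 1 m baseline
X_true = np.array([0.2, -0.1, 3.0])                              # a joint in 3D
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(bool(np.allclose(triangulate(P1, P2, x1, x2), X_true)))  # True
```

With more than two matched views, one row pair per view is appended to the same linear system.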
S6, re-projecting the hand joints of the three-dimensional skeleton onto each camera image, obtaining the rectangular hand regions using scale information, estimating hand parameters with a hand detector to obtain the PCA (principal component analysis) coefficients and affine transform of the hand pose, and then denoising the hand detections across views. Further, distances between the detections from different views are constructed from the Frobenius norm between the multi-view geometric information and the PCA coefficients; from these a Laplacian matrix is computed, and the eigenvector corresponding to the largest eigenvalue, obtained by SVD, yields the denoised hand parameters.
S7, projecting the nose joint of the three-dimensional skeleton into each view, obtaining the rectangular face region in each view from scale information and face orientation, estimating face parameters with a face detector, and then denoising the face detections across views. Further, distances between the detections from different views are constructed in the same way from the Frobenius norm between the multi-view geometric information and the PCA coefficients; a Laplacian matrix is computed, and the eigenvector corresponding to the largest eigenvalue, obtained by SVD, yields the denoised face parameters.
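The spirit of this spectral denoising can be sketched as follows: per-view parameter vectors are compared by (Frobenius) norm, an affinity matrix is built, and the leading eigenvector obtained via SVD weights each view's vote so that outlier views are suppressed. The Gaussian affinity, its bandwidth, and the weighted average are assumptions for illustration; the patent does not spell out its exact Laplacian construction:

```python
import numpy as np

def denoise_views(params, sigma=1.0):
    """params: (num_views, dim) per-view detector outputs for one hand/face.
    Returns a consensus vector weighted by the leading eigenvector of the
    view-to-view affinity matrix (computed with SVD)."""
    d = np.linalg.norm(params[:, None, :] - params[None, :, :], axis=-1)
    W = np.exp(-(d ** 2) / (2 * sigma ** 2))   # affinity: near views agree
    U, _, _ = np.linalg.svd(W)                  # symmetric -> eigenvectors
    w = np.abs(U[:, 0])                         # leading eigenvector
    w /= w.sum()
    return w @ params                           # outliers get ~zero weight

views = np.array([[1.0, 2.0],
                  [1.1, 1.9],
                  [0.9, 2.1],
                  [5.0, -3.0]])                 # last view is an outlier
out = denoise_views(views)
print(bool(np.linalg.norm(out - np.array([1.0, 2.0])) < 0.2))  # True
```

The three consistent views dominate the leading eigenvector, so the consensus stays near their mean despite the gross outlier.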
S8, fitting a three-dimensional human body model to the obtained three-dimensional joint coordinates and the hand and face parameters, re-projecting the optimized model so that it aligns with the original image in each view, and projecting the color information of the images back onto the parametric model to complete the texture map.
Further, the pose and shape parameters of the SMPL human body model are optimized using the three-dimensional joint coordinates and the hand and face parameter information, and temporal optimization stabilizes the result and removes jitter.
Specifically, a human body model is fitted using the three-dimensional joint coordinates. The preset human body model is the open-source linear model SMPL. The joints defined by SMPL do not coincide exactly with those output by the pose detector (OpenPose), so the coordinates must be converted through a regression matrix, which is trained in advance on a dataset.
Furthermore, parameter estimation can use the obtained three-dimensional joint coordinates as constraints, but since joint coordinates cannot constrain the rotation of a limb about its own axis, distorted poses may occur; a Gaussian mixture model is therefore added as a constraint. The Gaussian mixture model represents the distribution of plausible human poses and must be trained in advance.
Further, the shape parameters cannot be estimated well from joint coordinates alone, so human body silhouette information is added to the optimization, and temporal optimization stabilizes the shape parameters.
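The fitting stage can be sketched with a toy linear stand-in for SMPL: joints are a linear function of shape coefficients, J(beta) = J0 + B @ beta, and the objective is joint error plus a prior that keeps beta plausible. The basis B, the prior weight, and the closed-form ridge solution below are illustrative assumptions, not the patent's actual optimizer (which uses the full SMPL model, a Gaussian mixture pose prior and iterative optimization):

```python
import numpy as np

rng = np.random.default_rng(0)
J0 = rng.normal(size=12)           # rest-pose joints (4 joints x 3D, flattened)
B = rng.normal(size=(12, 3))       # toy linear shape blend basis (assumed)
beta_true = np.array([0.5, -0.2, 0.1])
target = J0 + B @ beta_true        # "observed" triangulated joints

lam = 1e-3                         # Gaussian-prior weight (assumed)
# Closed-form minimiser of ||J0 + B beta - target||^2 + lam ||beta||^2
beta = np.linalg.solve(B.T @ B + lam * np.eye(3), B.T @ (target - J0))
print(bool(np.allclose(beta, beta_true, atol=1e-2)))  # True
```

The quadratic prior here plays the role the Gaussian mixture model plays in the patent: it regularizes parameters the joint positions alone cannot pin down.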
S9, feeding the texture map into a neural network that predicts the normal offset of each vertex, completing the reconstruction of surface details.
An embodiment of the multi-view real-time human motion, gesture, expression and texture reconstruction system is described as follows:
Step 1: platform setup. Cameras are erected 3-5 m apart to enclose a rectangular area, at a height of about 1.2 m above the ground, and their intrinsic and extrinsic parameters are calibrated with the MATLAB toolbox.
Step 2: data processing. The digital signal acquired by each industrial camera is transcoded into an RGB image, and a pre-trained convolutional neural network detects the human joint information in each view. Matched joints are found through multi-view constraints and triangulated to obtain three-dimensional joint coordinates. The three-dimensional skeleton is re-projected to locate the face and hand regions, and a neural network estimates the corresponding parameters.
Step 3: pose reconstruction. The pose and shape parameters of the SMPL human body model are estimated and optimized from the three-dimensional joint coordinates and the face and hand parameters; temporal optimization stabilizes the result and removes jitter.
Step 4: texture reconstruction. The reconstructed three-dimensional model is re-projected onto the original images for texture mapping, and a neural network then predicts surface vertex offsets from the texture map to complete the surface reconstruction.
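The projection at the heart of the texture-mapping step can be sketched as follows: each model vertex is projected into a calibrated view with the 3x4 camera matrix and the pixel color at the projection is copied back to the vertex. The camera values, the test image and the nearest-pixel sampling are illustrative assumptions:

```python
import numpy as np

def sample_vertex_colors(vertices, P, image):
    """Project Nx3 vertices with 3x4 camera matrix P and sample the
    nearest pixel's color for each vertex."""
    h, w = image.shape[:2]
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])
    proj = (P @ homo.T).T
    uv = proj[:, :2] / proj[:, 2:3]                  # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return image[v, u]                                # per-vertex RGB

K = np.array([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
img = np.zeros((64, 64, 3), dtype=np.uint8)
img[32, 32] = (255, 0, 0)                # red pixel at the principal point
verts = np.array([[0.0, 0.0, 2.0]])      # one vertex on the optical axis
print(tuple(int(c) for c in sample_vertex_colors(verts, P, img)[0]))  # (255, 0, 0)
```

A full implementation would additionally test visibility per view and blend colors from the cameras that see each vertex.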
All the components used in the present application are components commonly used in the prior art.
In the above embodiments, the multi-view real-time human motion, gesture, expression and texture reconstruction system provided by the embodiments of the application estimates human body pose through deep learning and can fit and render multi-person human body models in real time.
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements of the technical solutions of the present invention may be made by those skilled in the art without departing from its design concept; the scope of protection is defined by the claims.

Claims (4)

1. A multi-view real-time human body motion, gesture, expression and texture reconstruction system, characterized by comprising the following steps:
S1, arranging a plurality of cameras to enclose a capture area, and calibrating the intrinsic and extrinsic parameters of the cameras by a camera calibration method;
S2, acquiring human body images with the calibrated cameras, transcoding them into RGB images, and then performing monocular human body pose estimation;
S3, obtaining a heatmap and joint affinities for each joint of the human body from the monocular pose estimation, and applying non-maximum suppression to the joint heatmaps to obtain the joint coordinates;
S4, constraining the monocular detections with the epipolar constraint across views to obtain the two-dimensional joint coordinates of the human body in each view, and constructing a sparse 4D graph from the epipolar geometric constraint, the joint affinities, the heatmaps and a temporal constraint from the previous frame's three-dimensional result;
S5, partitioning the sparse 4D graph with a greedy algorithm to obtain each person's 2D joint coordinates in each view, and triangulating the matched joint coordinates across views using the camera parameters to obtain three-dimensional joint coordinates and thus each person's three-dimensional skeleton;
S6, re-projecting the hand joints of the three-dimensional skeleton onto each camera image, obtaining the rectangular hand regions using scale information, estimating hand parameters with a hand detector to obtain the PCA (principal component analysis) coefficients and affine transform of the hand pose, and then denoising the hand detections across views;
S7, projecting the nose joint of the three-dimensional skeleton into each view, obtaining the rectangular face region in each view from scale information and face orientation, estimating face parameters with a face detector, and then denoising the face detections across views;
S8, fitting a three-dimensional human body model to the obtained three-dimensional joint coordinates and the hand and face parameters, re-projecting the optimized model so that it aligns with the original image in each view, and projecting the color information of the images back onto the parametric model to complete the texture map;
S9, feeding the texture map into a neural network that predicts the normal offset of each vertex, completing the reconstruction of surface details.
2. The system of claim 1, wherein calibrating the intrinsic and extrinsic camera parameters of the plurality of cameras comprises:
calibrating the intrinsic and extrinsic parameters of the cameras by a checkerboard calibration method.
3. The system of claim 1, wherein the human body images acquired by the calibrated cameras are transmitted to a capture card in the host over a PCIe interface, transcoded by a CUDA program and scaled into a three-channel RGB matrix to obtain the RGB images.
4. The system of claim 3, wherein the RGB images are gamma-corrected to improve their quality.
CN202010810382.XA 2020-08-13 2020-08-13 Multi-view real-time human motion, gesture, expression and texture reconstruction system Withdrawn - After Issue CN111932678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010810382.XA CN111932678B (en) 2020-08-13 2020-08-13 Multi-view real-time human motion, gesture, expression and texture reconstruction system


Publications (2)

Publication Number Publication Date
CN111932678A (en) 2020-11-13
CN111932678B (en) 2021-05-14

Family

ID=73311597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010810382.XA Withdrawn - After Issue CN111932678B (en) 2020-08-13 2020-08-13 Multi-view real-time human motion, gesture, expression and texture reconstruction system

Country Status (1)

Country Link
CN (1) CN111932678B (en)



Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120020548A1 (en) * 2010-07-20 2012-01-26 Meng-Chao Kao Method for Generating Images of Multi-Views
US20120128234A1 (en) * 2010-07-20 2012-05-24 Meng-Chao Kao System for Generating Images of Multi-Views
CN106056050A (en) * 2016-05-23 2016-10-26 武汉盈力科技有限公司 Multi-view gait identification method based on adaptive three dimensional human motion statistic model
CN107992858A (en) * 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 A kind of real-time three-dimensional gesture method of estimation based on single RGB frame
CN108108722A (en) * 2018-01-17 2018-06-01 深圳市唯特视科技有限公司 A kind of accurate three-dimensional hand and estimation method of human posture based on single depth image
US20190380792A1 (en) * 2018-06-19 2019-12-19 Tornier, Inc. Virtual guidance for orthopedic surgical procedures
CN110070605A (en) * 2019-03-28 2019-07-30 东南大学 A kind of unmarked movement catching method of real-time body
CN110599540A (en) * 2019-08-05 2019-12-20 清华大学 Real-time three-dimensional human body shape and posture reconstruction method and device under multi-viewpoint camera
CN110458944A (en) * 2019-08-08 2019-11-15 西安工业大学 A kind of human skeleton method for reconstructing based on the fusion of double-visual angle Kinect artis
CN110598590A (en) * 2019-08-28 2019-12-20 清华大学 Close interaction human body posture estimation method and device based on multi-view camera
CN111105493A (en) * 2019-12-04 2020-05-05 东南大学 Human hand three-dimensional acquisition method based on multi-view stereoscopic vision

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUXIANG ZHANG, ET AL: "4D Association Graph for Realtime Multi-person Motion Capture", arXiv *
ZHANG JIAN: "Research on key technologies of content-based intelligent video surveillance and their application in public security", China Doctoral Dissertations Full-text Database, Information Science and Technology *
LIANG QINGHUA: "Research on key motion capture technologies in the digitization of dynamic art", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116653A (en) * 2020-11-23 2020-12-22 华南理工大学 Object posture estimation method for multiple RGB pictures
WO2022120843A1 (en) * 2020-12-11 2022-06-16 中国科学院深圳先进技术研究院 Three-dimensional human body reconstruction method and apparatus, and computer device and storage medium
CN112907631A (en) * 2021-02-20 2021-06-04 北京未澜科技有限公司 Multi-RGB camera real-time human body motion capture system introducing feedback mechanism
CN112860072A (en) * 2021-03-16 2021-05-28 河南工业职业技术学院 Virtual reality multi-person interactive cooperation method and system
CN113284249A (en) * 2021-06-11 2021-08-20 清华大学深圳国际研究生院 Multi-view three-dimensional human body reconstruction method and system based on graph neural network
CN113284249B (en) * 2021-06-11 2022-05-10 清华大学深圳国际研究生院 Multi-view three-dimensional human body reconstruction method and system based on graph neural network
CN113658303A (en) * 2021-06-29 2021-11-16 清华大学 Monocular vision-based virtual human generation method and device
CN113421286A (en) * 2021-07-12 2021-09-21 北京未来天远科技开发有限公司 Motion capture system and method
CN113487674A (en) * 2021-07-12 2021-10-08 北京未来天远科技开发有限公司 Human body pose estimation system and method
CN113421286B (en) * 2021-07-12 2024-01-02 北京未来天远科技开发有限公司 Motion capturing system and method
CN113487674B (en) * 2021-07-12 2024-03-08 未来元宇数字科技(北京)有限公司 Human body pose estimation system and method

Also Published As

Publication number Publication date
CN111932678B (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN111932678B (en) Multi-view real-time human motion, gesture, expression and texture reconstruction system
CN110599540B (en) Real-time three-dimensional human body shape and posture reconstruction method and device under multi-viewpoint camera
CN108154550B (en) RGBD camera-based real-time three-dimensional face reconstruction method
JP6560480B2 (en) Image processing system, image processing method, and program
CN112907631B (en) Multi-RGB camera real-time human body motion capture system introducing feedback mechanism
US9235928B2 (en) 3D body modeling, from a single or multiple 3D cameras, in the presence of motion
CN108564041B (en) Face detection and restoration method based on RGBD camera
JP4284664B2 (en) Three-dimensional shape estimation system and image generation system
WO2024007478A1 (en) Three-dimensional human body modeling data collection and reconstruction method and system based on single mobile phone
CN109147051B (en) Human body 3D modeling method based on mobile phone scanning
CN104599317B (en) A mobile terminal and method for realizing 3D scanning and modeling functions
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
JP2019096113A (en) Processing device, method and program relating to keypoint data
CN111862299A (en) Human body three-dimensional model construction method and device, robot and storage medium
CN107862718B (en) 4D holographic video capture method
CN109655011B (en) Method and system for measuring dimension of human body modeling
CN111047678B (en) Three-dimensional face acquisition device and method
CN113077519A (en) Multi-phase external parameter automatic calibration method based on human skeleton extraction
CN111127642A (en) Human face three-dimensional reconstruction method
JP2006107145A (en) Face shape modeling system and face shape modeling method
CN112365589B (en) Virtual three-dimensional scene display method, device and system
CN113989434A (en) Human body three-dimensional reconstruction method and device
CN113516755A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN112102504A (en) Three-dimensional scene and two-dimensional image mixing method based on mixed reality
Jian et al. Realistic face animation generation from videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
AV01 Patent right actively abandoned

Granted publication date: 20210514

Effective date of abandoning: 20230608