CN113658303A - Monocular vision-based virtual human generation method and device - Google Patents


Info

Publication number: CN113658303A
Application number: CN202110726704.7A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 徐枫, 周玉枭
Original and current assignee: Tsinghua University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Tsinghua University
Prior art keywords: human body, image, face, color image, limbs
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a method for generating a virtual human based on monocular vision, which comprises the following steps: acquiring a monocular color image; extracting image features of the monocular color image and estimating the postures of the trunk and limbs of the human body, thereby reconstructing the trunk and limbs of the virtual human model; locating the hand positions in the monocular color image, cropping the hand image regions, extracting their image features, and estimating hand rotation parameters; locating the face position in the monocular color image, cropping the face image region, extracting its image features, estimating the shape parameters and expression coefficients of the face, and generating a face model; and completing the reconstruction of the hands with the hand rotation parameters, then replacing the face region of the three-dimensional human body model with the face model to complete the reconstruction of the face and obtain the virtual human model. The application thus provides a method for generating a virtual human model in real time from a monocular color image.

Description

Monocular vision-based virtual human generation method and device
Technical Field
The present application relates to the field of computer vision and computer graphics technologies, and in particular, to a method and an apparatus for generating a virtual human based on monocular vision, and a storage medium.
Background
The generation of virtual humans has very wide application in virtual reality and mixed reality. It requires the reconstructed result to be consistent with the real person and to contain many details, so the target subject must undergo complete motion capture; that is, the large-scale motion of the trunk and limbs, the gesture motion of both hands, and the expression changes of the face must be reconstructed simultaneously.
However, in the related art, generating a three-dimensional virtual human from only a monocular color image is difficult because a large amount of input information is missing.
Disclosure of Invention
The present disclosure provides a method, an apparatus, and a storage medium for generating a virtual human based on monocular vision, so as to enable virtual human generation from a monocular color image.
An embodiment of a first aspect of the present application provides a method for generating a virtual human based on monocular vision, including:
acquiring a monocular color image;
extracting image features of the monocular color image, and estimating postures of the trunk and limbs of the human body based on the image features of the monocular color image so as to complete reconstruction of the trunk and the limbs of the virtual human model;
respectively positioning hand position information in the monocular color image by utilizing the estimation results of the postures of the trunk and the limbs of the human body, intercepting a hand image area in the monocular color image according to the hand position information, extracting the image characteristics of the hand image area, and estimating hand rotation parameters based on the image characteristics of the hand image area;
positioning face position information in the monocular color image by using the estimation result of the postures of the trunk and the limbs of the human body, intercepting a face image area in the monocular color image according to the face position information, extracting image features of the face image area, estimating shape parameters and expression coefficients of the face based on the image features of the face image area, and generating a face model according to the shape parameters and the expression coefficients of the face;
and respectively applying the hand rotation parameters to a three-dimensional human body model obtained after the trunk and the limbs of the virtual human body model are reconstructed, completing the reconstruction of the hand, replacing the face area of the three-dimensional human body model with the human face model, and completing the reconstruction of the face to obtain the virtual human body model.
The embodiment of the second aspect of the present application provides a virtual human generating device based on monocular vision, including:
the acquisition module is used for acquiring a monocular color image;
the first estimation module is used for extracting the image characteristics of the monocular color image and estimating the postures of the trunk and the limbs of the human body based on the image characteristics of the monocular color image so as to complete the reconstruction of the trunk and the limbs of the virtual human model;
the second estimation module is used for respectively positioning hand position information in the monocular color image by utilizing the estimation results of the postures of the trunk and the limbs of the human body, intercepting a hand image area in the monocular color image according to the hand position information, extracting the image characteristics of the hand image area, and estimating hand rotation parameters based on the image characteristics of the hand image area;
the third estimation module is used for positioning the face position information in the monocular color image by utilizing the estimation result of the postures of the trunk and the limbs of the human body, intercepting a face image area in the monocular color image according to the face position information, extracting the image characteristics of the face image area, estimating the shape parameters and the expression coefficients of the face based on the image characteristics of the face image area, and generating a face model according to the shape parameters and the expression coefficients of the face;
a reconstruction module, configured to apply the hand rotation parameters to the three-dimensional human body model obtained after the trunk and limbs of the virtual human model are reconstructed to complete the reconstruction of the hands, and to replace the face region of the three-dimensional human body model with the face model to complete the reconstruction of the face, so as to obtain the virtual human model.
An embodiment of a third aspect of the present application provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect above.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
in the monocular vision-based virtual human generation method, device, and storage medium, the postures of the trunk and limbs of the human body are estimated by extracting the image features of a monocular color image, and the trunk and limbs of the virtual human model are reconstructed. The estimation result is then used to locate the positions of the hands and face in the monocular color image, crop the corresponding hand and face image regions, and extract their image features, from which the hand rotation parameters and the shape parameters and expression coefficients of the face are estimated; this completes the reconstruction of the hands and face and yields the virtual human model. The application thus provides a method for generating a virtual human model in real time from a monocular color image; the model can also be rendered to a display device for the user to watch, achieving a better interaction effect.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a method for generating a virtual human based on monocular vision according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a virtual human generating device based on monocular vision according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes a method and an apparatus for generating a virtual human in an embodiment of the present application with reference to the drawings.
Example one
Fig. 1 is a schematic flowchart of a method for generating a virtual human based on monocular vision according to an embodiment of the present application, and as shown in fig. 1, the method may include:
Step 101, obtaining a monocular color image.
Step 102, extracting image features of the monocular color image, and estimating the postures of the trunk and limbs of the human body based on the image features, so as to complete the reconstruction of the trunk and limbs of the virtual human model.
In this step, the acquired monocular color image is processed with a deep residual network to obtain image features of the monocular color image in a high-dimensional space. Based on these image features, a convolutional neural network estimates the three-dimensional coordinates of the key points of the human body. A fully connected neural network then regresses the rotation parameters of the human joints and the body shape parameters from the estimated key-point coordinates. Finally, the joint rotation parameters and body shape parameters are applied to a predefined parameterized three-dimensional human body model to obtain the postures of the trunk and limbs, completing the reconstruction of the trunk and limbs of the virtual human model.
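The pipeline of this step (residual-network features, convolutional key-point estimation, fully connected regression, then a parameterized body model) can be sketched as follows. This is a minimal NumPy illustration with random stand-in weights, not the patented implementation; the key-point count (24), 72 rotation parameters, and 10 body shape parameters follow the widely used SMPL-style parameterization, which is an assumption, since the patent does not name its parameterized model.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone_features(image):
    """Stand-in for the deep residual network: maps an H x W x 3 color
    image to a high-dimensional feature vector (2048-D, as in ResNet-50)."""
    return rng.standard_normal(2048)

def estimate_keypoints(features):
    """Stand-in for the convolutional network that regresses the 3D
    coordinates of 24 body key points from the image features."""
    W = rng.standard_normal((24 * 3, features.shape[0])) * 0.01
    return (W @ features).reshape(24, 3)

def regress_body_params(keypoints):
    """Stand-in for the fully connected network that regresses per-joint
    rotations (24 joints x 3 axis-angle values) and 10 body-shape
    coefficients from the 3D key points."""
    x = keypoints.ravel()
    W = rng.standard_normal((24 * 3 + 10, x.shape[0])) * 0.01
    out = W @ x
    return out[:72].reshape(24, 3), out[72:]

image = np.zeros((256, 256, 3))          # placeholder monocular color image
feats = backbone_features(image)
keypoints3d = estimate_keypoints(feats)
joint_rotations, shape_params = regress_body_params(keypoints3d)
print(keypoints3d.shape, joint_rotations.shape, shape_params.shape)
```

The joint rotations and shape coefficients would then drive the predefined parameterized body model; only the data flow and tensor shapes are meaningful here.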
Moreover, the deep residual network, convolutional neural network, and fully connected network in the present application are trained on publicly available human motion capture data. Because this data is captured from the motion of real humans, it implicitly contains prior knowledge of human movement; the networks therefore learn this prior knowledge during training, can anticipate human motion, and make the resulting virtual character more realistic.
Step 103, respectively positioning hand position information in the monocular color image by using the estimation results of the postures of the trunk and limbs, cropping the hand image regions in the monocular color image according to the hand position information, extracting the image features of the hand image regions, and estimating hand rotation parameters based on those features.
In this step, the positions of the arms can be obtained from the estimation results of the postures of the trunk and limbs in step 102, from which the positions of the two hands in the image are inferred. For example, the estimated posture gives the positions of the left elbow and left wrist, from which the position of the left hand is deduced.
A convolutional neural network then extracts features from the cropped hand image region; these hand features are combined with the body image features extracted in step 102, and a fully connected neural network estimates the joint rotations of the left hand, i.e., the estimated left-hand gesture (the right hand is handled symmetrically).
The convolutional neural network and the fully connected neural network used in step 103 are trained on hand motion capture data.
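The hand localization in step 103 can be illustrated geometrically: given the elbow and wrist image positions from the body-pose estimate, the hand center is extrapolated past the wrist along the forearm direction, and a square region is cropped around it. The extrapolation factor (0.3) and crop size (64) below are illustrative assumptions, not values stated in the patent.

```python
import numpy as np

def hand_crop_box(elbow_xy, wrist_xy, extend=0.3, crop_size=64):
    """Infer a square hand crop from elbow and wrist image coordinates.

    The hand center is placed past the wrist, continuing along the
    elbow-to-wrist direction by `extend` times the forearm length
    (an illustrative heuristic). Returns (x_min, y_min, x_max, y_max)."""
    elbow = np.asarray(elbow_xy, dtype=float)
    wrist = np.asarray(wrist_xy, dtype=float)
    forearm = wrist - elbow
    center = wrist + extend * forearm
    half = crop_size / 2
    return (center[0] - half, center[1] - half,
            center[0] + half, center[1] + half)

# Left elbow at (100, 120) and left wrist at (140, 120): the hand box
# is centered slightly beyond the wrist, continuing to the right.
box = hand_crop_box((100, 120), (140, 120))
print(box)
```

The cropped region would then be fed to the hand feature-extraction network described above.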
Step 104, positioning the face position information in the monocular color image by using the estimation result of the postures of the trunk and limbs, cropping the face image region according to the face position information, extracting the image features of the face image region, estimating the shape parameters and expression coefficients of the face based on those features, and generating a face model from the shape parameters and expression coefficients.
In this step, the position of the neck is obtained from the estimation result of the postures of the trunk and limbs, from which the position of the face in the monocular color image is inferred.
A convolutional neural network extracts features from the cropped face image region, and a fully connected network estimates the face shape parameters and expression coefficients, from which the corresponding face model is generated.
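Generating a face model from shape parameters and expression coefficients commonly follows a linear blendshape formulation: the vertices equal a mean face plus a shape basis weighted by the shape parameters plus an expression basis weighted by the expression coefficients. The sketch below assumes this formulation with random bases; the vertex count and basis sizes are illustrative, as the patent does not specify its face model.

```python
import numpy as np

rng = np.random.default_rng(1)

N_VERTS, N_SHAPE, N_EXPR = 500, 100, 50   # assumed model dimensions

mean_face = rng.standard_normal((N_VERTS, 3))
shape_basis = rng.standard_normal((N_VERTS, 3, N_SHAPE)) * 0.01
expr_basis = rng.standard_normal((N_VERTS, 3, N_EXPR)) * 0.01

def face_model(shape_params, expr_coeffs):
    """Linear blendshape face: mean geometry plus shape-dependent and
    expression-dependent vertex offsets."""
    return (mean_face
            + shape_basis @ shape_params
            + expr_basis @ expr_coeffs)

beta = rng.standard_normal(N_SHAPE)   # identity (shape) parameters
psi = np.zeros(N_EXPR)                # neutral expression
verts = face_model(beta, psi)
print(verts.shape)
```

With zero shape parameters and a neutral expression, the model reduces to the mean face, which is the expected behavior of this formulation.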
Step 105, applying the hand rotation parameters to the three-dimensional human body model obtained after the trunk and limbs of the virtual human model are reconstructed to complete the reconstruction of the hands, and replacing the face region of the three-dimensional human body model with the face model to complete the reconstruction of the face, so as to obtain the virtual human model.
The virtual character model obtained in this way exhibits not only the motion of the trunk and limbs but also the details of hand gestures and facial expression.
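The assembly in step 105 can be sketched as two substitutions on the reconstructed body model: the estimated hand rotation parameters are merged into the full pose, and the generated face mesh replaces the body mesh's face-region vertices. The joint counts and face-region indices below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def assemble_avatar(body_pose, left_hand_rot, right_hand_rot,
                    body_verts, face_verts, face_vertex_ids):
    """Merge body, hand, and face estimates into one virtual human model.

    body_pose:        (J, 3) axis-angle joint rotations from the body network
    left_hand_rot,
    right_hand_rot:   (H, 3) finger-joint rotations from the hand network
    body_verts:       (V, 3) vertices of the reconstructed body mesh
    face_verts:       (F, 3) vertices of the generated face model
    face_vertex_ids:  indices of the body mesh's face region (assumed known)
    """
    # Append the finger-joint rotations to the body pose. This layout is
    # illustrative; real parametric models keep hand joints at fixed indices.
    pose = np.concatenate([body_pose, left_hand_rot, right_hand_rot], axis=0)
    # Swap the generated face model into the body mesh's face region.
    verts = body_verts.copy()
    verts[face_vertex_ids] = face_verts
    return pose, verts

# Toy example: 22 body joints, 15 joints per hand, a 100-vertex body mesh
# whose first 10 vertices form the (assumed) face region.
body_pose = np.zeros((22, 3))
left_hand = np.ones((15, 3))
right_hand = np.full((15, 3), 2.0)
body_verts = np.zeros((100, 3))
face_verts = np.ones((10, 3))
face_ids = np.arange(10)
pose, verts = assemble_avatar(body_pose, left_hand, right_hand,
                              body_verts, face_verts, face_ids)
print(pose.shape)
```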
The method for generating a virtual human based on monocular vision estimates the postures of the trunk and limbs of the human body by extracting the image features of a monocular color image and reconstructs the trunk and limbs of the virtual human model. It then uses the estimation result to locate the positions of the hands and face in the monocular color image, crops the corresponding hand and face image regions, extracts their image features, and estimates the hand rotation parameters and the shape parameters and expression coefficients of the face, completing the reconstruction of the hands and face to obtain the virtual human model. The method thus generates a virtual human model in real time from a monocular color image; the model can also be rendered to a display device for the user to watch, achieving a better interaction effect.
Example two
Further, based on the method for generating a virtual human based on monocular vision provided in the foregoing embodiment, an embodiment of the present application further provides a device 200 for generating a virtual human based on monocular vision. Fig. 2 is a schematic structural diagram of the device, and as shown in fig. 2, the device may include:
an obtaining module 201, configured to obtain a monocular color image;
the first estimation module 202 is configured to extract image features of a monocular color image, and estimate postures of a trunk and limbs of a human body based on the image features of the monocular color image, so as to complete reconstruction of the trunk and the limbs of the virtual human body model;
the second estimation module 203 is configured to separately position hand position information in the monocular color image according to estimation results of postures of the trunk and the limbs of the human body, intercept a hand image region in the monocular color image according to the hand position information, extract image features of the hand image region, and estimate hand rotation parameters based on the image features of the hand image region;
the third estimation module 204 is configured to locate face position information in the monocular color image according to an estimation result of postures of a trunk and limbs of the human body, intercept a face image region in the monocular color image according to the face position information, extract image features of the face image region, estimate shape parameters and expression coefficients of a face based on the image features of the face image region, and generate a face model according to the shape parameters and the expression coefficients of the face;
a reconstruction module 205, configured to apply the hand rotation parameters to the three-dimensional human body model obtained after the trunk and limbs of the virtual human model are reconstructed to complete the reconstruction of the hands, and to replace the face region of the three-dimensional human body model with the face model to complete the reconstruction of the face, so as to obtain the virtual human model.
Wherein the first estimating module may further include:
the processing module is used for processing the monocular color image by utilizing the depth residual error network to obtain the image characteristics of the monocular color image in a high-dimensional space;
the regression module is used for estimating the three-dimensional coordinates of the key points of the human body by using a convolutional neural network based on the image characteristics, and regressing the rotation parameters of the joints of the human body and the body type parameters of the human body by using a full-connection neural network according to the three-dimensional coordinates of the key points of the human body;
and the reconstruction module is used for applying the rotation parameters of the human joints and the body type parameters of the human body to a predefined parameterized three-dimensional human body model to obtain the postures of the trunk and the limbs of the human body so as to complete the reconstruction of the postures of the trunk and the limbs of the virtual human model.
The deep residual network, convolutional neural network, and fully connected network in the first estimation module are trained on publicly available human motion capture data. Because this data is captured from the motion of real humans, it implicitly contains prior knowledge of human movement; the networks therefore learn this prior knowledge during training, can anticipate human motion, and make the resulting virtual character more realistic.
To implement the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium provided by an embodiment of the present disclosure stores a computer program; when executed by a processor, the computer program implements the method for generating a virtual human based on monocular vision shown in fig. 1.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (9)

1. A method for generating a virtual human based on monocular vision is characterized by comprising the following steps:
acquiring a monocular color image;
extracting image features of the monocular color image, and estimating postures of the trunk and limbs of the human body based on the image features of the monocular color image so as to complete reconstruction of the trunk and the limbs of the virtual human model;
respectively positioning hand position information in the monocular color image by utilizing the estimation results of the postures of the trunk and the limbs of the human body, intercepting a hand image area in the monocular color image according to the hand position information, extracting the image characteristics of the hand image area, and estimating hand rotation parameters based on the image characteristics of the hand image area;
positioning face position information in the monocular color image by using the estimation result of the postures of the trunk and the limbs of the human body, intercepting a face image area in the monocular color image according to the face position information, extracting image features of the face image area, estimating shape parameters and expression coefficients of the face based on the image features of the face image area, and generating a face model according to the shape parameters and the expression coefficients of the face;
and respectively applying the hand rotation parameters to a three-dimensional human body model obtained after the trunk and the limbs of the virtual human body model are reconstructed, completing the reconstruction of the hand, replacing the face area of the three-dimensional human body model with the human face model, and completing the reconstruction of the face to obtain the virtual human body model.
2. The virtual human generation method according to claim 1, wherein the extracting of the image features of the monocular color image and the estimating of the postures of the trunk and the limbs of the human body based on the image features of the monocular color image to complete the reconstruction of the trunk and the limbs of the virtual human model comprises:
processing the monocular color image by using a depth residual error network to obtain the image characteristics of the monocular color image in a high-dimensional space;
estimating three-dimensional coordinates of key points of the human body by using a convolutional neural network based on the image characteristics, and regressing rotation parameters of joints of the human body and body type parameters of the human body by using a fully-connected neural network according to the three-dimensional coordinates of the key points of the human body;
and applying the rotation parameters of the human body joints and the body type parameters of the human body to a predefined parameterized three-dimensional human body model to obtain the postures of the trunk and the limbs of the human body so as to complete the reconstruction of the postures of the trunk and the limbs of the virtual human body model.
3. The method for generating a virtual human according to claim 2, wherein the deep residual network, the convolutional neural network, and the fully connected network are trained using publicly available human motion capture data.
4. The method for generating a virtual human according to claim 3, wherein the publicly available human body motion capture data implicitly contains prior knowledge of human body motion, which is implicitly learned during network training.
5. A monocular vision-based virtual human generation device, characterized by comprising:
an acquisition module, configured to acquire a monocular color image;
a first estimation module, configured to extract image features of the monocular color image and estimate the posture of the human torso and limbs from those features, so as to reconstruct the torso and limbs of the virtual human model;
a second estimation module, configured to locate hand positions in the monocular color image from the estimated torso and limb posture, crop the hand image regions of the monocular color image according to those positions, extract image features of the hand regions, and estimate hand rotation parameters from those features;
a third estimation module, configured to locate the face position in the monocular color image from the estimated torso and limb posture, crop the face image region of the monocular color image according to that position, extract image features of the face region, estimate face shape parameters and expression coefficients from those features, and generate a face model from the shape parameters and expression coefficients;
and a reconstruction module, configured to apply the hand rotation parameters to the three-dimensional human model obtained after reconstructing the torso and limbs of the virtual human model, thereby completing hand reconstruction, and to replace the face region of the three-dimensional human model with the generated face model, thereby completing face reconstruction and obtaining the virtual human model.
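The region-cropping step shared by the second and third estimation modules can be sketched as below. This is only an illustrative sketch: the fixed square crop size and the keypoint-centred window (a wrist or face-centre keypoint derived from the estimated body posture) are assumptions for illustration, not details specified by the claims.

```python
import numpy as np

def crop_region(image, center_xy, size=64):
    """Crop a square window around an estimated keypoint (e.g. a wrist
    or the face centre), clamping the window to the image bounds.

    image:     H x W x 3 color image
    center_xy: (x, y) pixel coordinates of the estimated keypoint
    size:      side length of the square crop (assumed fixed here)
    """
    h, w = image.shape[:2]
    half = size // 2
    x, y = int(round(center_xy[0])), int(round(center_xy[1]))
    # Clamp the window so it stays fully inside the image.
    x0 = min(max(x - half, 0), max(w - size, 0))
    y0 = min(max(y - half, 0), max(h - size, 0))
    return image[y0:y0 + size, x0:x0 + size]

# Example: crop a hand region around a wrist keypoint near the image corner.
img = np.zeros((480, 640, 3), dtype=np.uint8)
hand_crop = crop_region(img, center_xy=(600, 30), size=64)
print(hand_crop.shape)  # (64, 64, 3)
```

The clamping matters in practice: hands estimated near the image border would otherwise yield undersized crops that a fixed-input-size feature extractor cannot consume.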
6. The virtual human generation device according to claim 5, wherein the first estimation module comprises:
a processing module, configured to process the monocular color image with a deep residual network to obtain image features of the monocular color image in a high-dimensional space;
a regression module, configured to estimate three-dimensional coordinates of human body key points from the image features using a convolutional neural network, and to regress human joint rotation parameters and body shape parameters from those key-point coordinates using a fully-connected network;
and a reconstruction module, configured to apply the joint rotation parameters and body shape parameters to a predefined parameterized three-dimensional human model to obtain the posture of the torso and limbs, thereby completing reconstruction of the torso and limb posture of the virtual human model.
7. The virtual human generation device according to claim 6, wherein the deep residual network, the convolutional neural network, and the fully-connected network are trained on publicly available human motion capture data.
8. The virtual human generation device according to claim 7, wherein the publicly available human motion capture data implicitly contains prior knowledge of human motion, which is implicitly learned by the networks during training.
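What the reconstruction module computes when it "applies" the regressed parameters to a predefined parameterized model can be sketched minimally as follows. The two-vertex template, single linear shape basis, and one-joint rigid posing below are stand-in assumptions for a full parameterized body model (the claims do not name a specific one, e.g. SMPL); a real model would use many blend-shapes and per-vertex skinning weights over a joint hierarchy.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle joint rotation parameter to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def pose_model(template, shape_basis, betas, joint_rot):
    """Apply body shape parameters and one joint rotation to a template mesh.

    template:    N x 3 mean-shape vertices
    shape_basis: N x 3 x B linear shape blend-shapes
    betas:       B body shape coefficients
    joint_rot:   3-vector axis-angle rotation for a single joint
    """
    shaped = template + shape_basis @ betas  # shape deformation
    R = rodrigues(joint_rot)
    return shaped @ R.T                      # rigid posing

# Toy example: two vertices, one shape direction, 90-degree rotation about z.
template = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
basis = np.zeros((2, 3, 1))
basis[0, 0, 0] = 1.0  # the shape coefficient stretches the first vertex in x
verts = pose_model(template, basis, np.array([0.5]), np.array([0.0, 0.0, np.pi / 2]))
print(np.round(verts, 3))
```

The key property mirrored from the claim is the ordering: shape parameters deform the template first, and joint rotations then pose the shaped mesh.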
9. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the monocular vision-based virtual human generation method according to any one of claims 1 to 4.
CN202110726704.7A 2021-06-29 2021-06-29 Monocular vision-based virtual human generation method and device Pending CN113658303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110726704.7A CN113658303A (en) 2021-06-29 2021-06-29 Monocular vision-based virtual human generation method and device


Publications (1)

Publication Number Publication Date
CN113658303A true CN113658303A (en) 2021-11-16

Family

ID=78489206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110726704.7A Pending CN113658303A (en) 2021-06-29 2021-06-29 Monocular vision-based virtual human generation method and device

Country Status (1)

Country Link
CN (1) CN113658303A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104700452A (en) * 2015-03-24 2015-06-10 中国人民解放军国防科学技术大学 Three-dimensional body posture model matching method for any posture
CN104700433A (en) * 2015-03-24 2015-06-10 中国人民解放军国防科学技术大学 Vision-based real-time general movement capturing method and system for human body
JP6506443B1 (en) * 2018-04-27 2019-04-24 DeNA Co., Ltd. Image generation apparatus and image generation program
JP2019192224A (en) * 2019-03-28 2019-10-31 DeNA Co., Ltd. Image generation device and image generation program
CN110599540A (en) * 2019-08-05 2019-12-20 清华大学 Real-time three-dimensional human body shape and posture reconstruction method and device under multi-viewpoint camera
CN111739161A (en) * 2020-07-23 2020-10-02 之江实验室 Human body three-dimensional reconstruction method and device under shielding condition and electronic equipment
CN111784818A (en) * 2020-06-01 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for generating three-dimensional human body model and computer readable storage medium
CN111932678A (en) * 2020-08-13 2020-11-13 北京未澜科技有限公司 Multi-view real-time human motion, gesture, expression and texture reconstruction system
CN112401369A (en) * 2020-11-23 2021-02-26 叠境数字科技(上海)有限公司 Body parameter measuring method, system, equipment, chip and medium based on human body reconstruction
CN112530019A (en) * 2020-12-11 2021-03-19 中国科学院深圳先进技术研究院 Three-dimensional human body reconstruction method and device, computer equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821675A (en) * 2022-06-29 2022-07-29 阿里巴巴达摩院(杭州)科技有限公司 Object processing method and system and processor
CN114821675B (en) * 2022-06-29 2022-11-15 阿里巴巴达摩院(杭州)科技有限公司 Object processing method and system and processor

Similar Documents

Publication Publication Date Title
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
CN106023288B (en) A kind of dynamic scapegoat's building method based on image
Davis et al. A sketching interface for articulated figure animation
Igarashi et al. As-rigid-as-possible shape manipulation
KR20120072128A (en) Apparatus and method for generating digital clone
Jiang et al. Transferring and fitting fixed-sized garments onto bodies of various dimensions and postures
CN105427386A (en) Garment deformation method based on input human body posture real-time generation
Yu et al. Regression-based facial expression optimization
CN113393550B (en) Fashion garment design synthesis method guided by postures and textures
Ichim et al. Building and animating user-specific volumetric face rigs
Jacka et al. A comparison of linear skinning techniques for character animation
Schröder et al. Design and evaluation of reduced marker layouts for hand motion capture
CN113658303A (en) Monocular vision-based virtual human generation method and device
US20170193677A1 (en) Apparatus and method for reconstructing experience items
WO2023160074A1 (en) Image generation method and apparatus, electronic device, and storage medium
Fondevilla et al. Fashion transfer: Dressing 3d characters from stylized fashion sketches
CN113239835B (en) Model-aware gesture migration method
Csongei et al. ClonAR: Rapid redesign of real-world objects
Wang et al. A Generative Human-Robot Motion Retargeting Approach Using a Single RGBD Sensor.
Liu et al. GEA: Reconstructing Expressive 3D Gaussian Avatar from Monocular Video
Cha et al. Mobile. Egocentric human body motion reconstruction using only eyeglasses-mounted cameras and a few body-worn inertial sensors
CN112116673B (en) Virtual human body image generation method and system based on structural similarity under posture guidance and electronic equipment
Chen et al. Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail
CN114283228A (en) Virtual character driving method and system based on monocular color camera
Kim et al. GALA: Generating Animatable Layered Assets from a Single Scan

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination