CN118053189A - Sparse multi-view dynamic face reconstruction method and device - Google Patents
- Publication number
- CN118053189A (application CN202410130872.3A)
- Authority
- CN
- China
- Prior art keywords
- face
- rendering
- model
- geometry
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a sparse multi-view dynamic face reconstruction method and device. The method comprises: acquiring a multi-view face video; inputting the face video into a deformed face parameterized geometric model to output a deformed face geometry; performing neural rendering and rasterization on the face geometry to obtain multi-view neural rendering images; and performing pixel-by-pixel image loss calculation between the neural rendering images and the images in the face video, then reconstructing a complete face sequence according to the loss calculation result to complete dynamic face reconstruction. The invention can reconstruct high-quality face geometry from sparse multi-view video.
Description
Technical Field
The invention relates to the technical field of computer graphics and computer vision, in particular to a sparse multi-view dynamic face reconstruction method and device.
Background
A 3DMM is a three-dimensional model of the human face expressed through a statistical model. In general, it controls the identity information of the face through shape parameters (for example, how full or slender the face is), and controls the expression information of the face through expression parameters. A commonly used 3DMM model is FLAME. Neural rendering is a rendering method that uses a neural network; unlike conventional physically based rendering, it can implicitly express rendering factors such as illumination and materials.
However, current face reconstruction work mainly targets static face reconstruction from a single monocular picture, with limited reconstruction quality, and few attempts have been made to reconstruct a dynamic face with high precision and high quality from a sparse multi-view video.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention provides a sparse multi-view dynamic face reconstruction method which can reconstruct a complete face sequence from a sparse multi-view face video shot by a mobile phone and express it in the form of a 3DMM.
Another object of the present invention is to provide a sparse multi-view dynamic face reconstruction device.
A third object of the invention is to propose a computer device.
A fourth object of the present invention is to propose a non-transitory computer readable storage medium.
In order to achieve the above objective, an aspect of the present invention provides a sparse multi-view dynamic face reconstruction method, including:
Acquiring a multi-view face video;
inputting the face video into a deformed face parameterized geometric model to output deformed face geometry;
performing neural rendering and rasterization processing on the face geometry to obtain a multi-view neural rendering image;
and performing pixel-by-pixel image loss calculation on the neural rendering image and the image in the face video, and reconstructing a complete face sequence according to a loss calculation result to complete dynamic face reconstruction.
In one embodiment of the present invention, before inputting the face video into the deformed face parameterized geometric model to output the deformed face geometry, the method further comprises:
Constructing a face parameterized geometric model;
adding an offset to each vertex of the face parameterized geometric model to obtain an offset addition result;
and obtaining the deformed face parameterized geometric model based on the offset addition result and the face parameterized geometric model.
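The per-vertex offset step above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation; the function and variable names are assumptions, and in practice the offsets would be free variables optimized against the image loss.

```python
import numpy as np

def deform_face_model(base_vertices: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Add a per-vertex offset to a coarse 3DMM mesh.

    base_vertices: (N, 3) vertices output by the parameterized model.
    offsets:       (N, 3) per-vertex displacements (the optimizable detail layer).
    Returns the deformed (N, 3) vertex array.
    """
    assert base_vertices.shape == offsets.shape
    return base_vertices + offsets

# Toy example: a 4-vertex patch displaced by a small constant offset.
base = np.zeros((4, 3))
delta = np.full((4, 3), 0.01)
deformed = deform_face_model(base, delta)
```

The base 3DMM vertices stay intact, so identity and expression parameters remain usable while the offset layer captures person-specific detail.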
In one embodiment of the invention, the face geometry is represented based on a Mesh three-dimensional geometric representation method.
In one embodiment of the invention, the face parameterized geometric model is a face 3DMM model.
In one embodiment of the invention, the neural rendering of the face geometry comprises:
Obtaining a face geometric sample and rendering parameters; the rendering parameters comprise illumination, materials and geometric information under different scenes;
training a neural network model by using the face geometric sample and the rendering parameters to update network parameters based on the iteration result of the neural network model to obtain a trained neural network model;
and inputting the face geometry into a trained neural network model for rendering to obtain a face rendering result.
In order to achieve the above object, another aspect of the present invention provides a sparse multi-view dynamic face reconstruction device, including:
The face video acquisition module is used for acquiring a multi-view face video;
The face geometry acquisition module is used for inputting the face video into the deformed face parameterized geometric model to output deformed face geometry;
the neural network rendering module is used for performing neural rendering and rasterization processing on the face geometry to obtain a multi-view neural rendering image;
And the dynamic face reconstruction module is used for performing pixel-by-pixel image loss calculation on the neural rendering image and the image in the face video, and reconstructing a complete face sequence according to the loss calculation result to complete dynamic face reconstruction.

The sparse multi-view dynamic face reconstruction method and device of the invention can reconstruct a complete face sequence from a sparse multi-view face video shot by a mobile phone and express it in 3DMM form, and the reconstructed face has higher precision and a more pronounced effect.
To achieve the above object, an embodiment of a third aspect of the present application provides a computer apparatus, including: a processor and a memory; wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, for implementing the sparse multi-view dynamic face reconstruction method according to the embodiment of the first aspect.
To achieve the above object, a fourth aspect of the present application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the sparse multi-view dynamic face reconstruction method according to the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a sparse multi-view dynamic face reconstruction method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a sparse multi-view dynamic face reconstruction device according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
The following describes a sparse multi-view dynamic face reconstruction method, a system, a computer device and a storage medium according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a flowchart of a sparse multi-view dynamic face reconstruction method according to an embodiment of the present invention.
As shown in fig. 1, the method includes, but is not limited to, the steps of:
s1, acquiring a multi-view face video.
Specifically, the invention can acquire multi-view face videos in a variety of scenes using cameras, and extract RGB frame pictures from the face videos.
Illustratively, each RGB frame picture may be a single face picture.
S2, inputting the face video into the deformed face parameterized geometric model to output deformed face geometry.
It will be appreciated that the present invention utilizes a pre-trained face parameterized geometric model, which may be a face 3DMM model, to produce the output.
It can be understood that the face 3DMM model is a statistics-based face parameterized geometric model; it is in effect a three-dimensional face model that can output different shapes and expressions according to different parameters. However, 3DMM models are typically relatively coarse and lack detail.
In one embodiment of the invention, an offset is added to each vertex of the 3DMM model. Based on the offset addition result, the 3DMM model can express a personalized, detailed three-dimensional face model. The relationship with the multi-view face video is as follows: the input is the multi-view face video, the expected output is the personalized face geometry, and the main supervision information of the method is the RGB frame pictures of the multi-view face video.
Further, the deformed face geometry can be represented by the Mesh three-dimensional geometric representation, which can be understood as a geometric surface formed by a plurality of triangular patches.
It is understood that a Mesh model is a data structure used to represent a three-dimensional model. It consists of a set of points, lines and faces, and is commonly used in computer graphics. Specifically, a mesh model contains the following elements:
1) Vertex: a location point in 3D space containing X, Y, Z coordinates;
2) Face: a polygon, typically a triangle, made up of segments connecting different vertices;
3) Normal: each face has a normal vector used to control the illumination effect;
4) Texture coordinates: used to specify the location on the surface to which the texture maps.
Thus, in most 3D modeling software, a user can construct a mesh model by creating mesh objects of various shapes. For example, a set of vertices may be assembled into a cube, and then each face assigned texture coordinates and a normal vector to form a complete 3D model.
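The mesh elements listed above can be sketched as a minimal data structure. This is an illustrative container only, not the patent's representation; the class and field names are assumptions, and per-face normals are derived here with a simple cross product.

```python
import numpy as np

class Mesh:
    """Minimal triangle-mesh container: vertices, faces, per-face normals."""
    def __init__(self, vertices: np.ndarray, faces: np.ndarray):
        self.vertices = vertices   # (V, 3) float: X, Y, Z coordinates
        self.faces = faces         # (F, 3) int: vertex indices per triangle
        self.face_normals = self._compute_face_normals()

    def _compute_face_normals(self) -> np.ndarray:
        # Gather each triangle's three vertices, then take the normalized
        # cross product of two edge vectors.
        tri = self.vertices[self.faces]                          # (F, 3, 3)
        n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
        return n / np.linalg.norm(n, axis=1, keepdims=True)

# A single triangle in the z = 0 plane; its normal points along +Z.
m = Mesh(np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]]),
         np.array([[0, 1, 2]]))
```

Texture coordinates would be a further (V, 2) array; they are omitted here to keep the sketch minimal.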
And S3, performing neural rendering and rasterization processing on the face geometry to obtain a multi-view neural rendering image.
It can be appreciated that the neural rendering technique is an image rendering method based on an artificial neural network, which uses principles of deep learning and computer vision to generate realistic images in a more intelligent and efficient manner. Traditional rendering methods often require manually setting complex parameters and illumination conditions, whereas by training a neural network, the neural rendering technique can automatically learn and understand elements such as illumination, materials and geometry, thereby generating more realistic images.
In one embodiment of the invention, the invention utilizes a neural network for rendering. Firstly, obtaining a face geometric sample and rendering parameters; the rendering parameters comprise illumination, materials and geometric information under different scenes; training a neural network model by using the face geometric sample and the rendering parameters to update network parameters based on the iteration result of the neural network model to obtain a trained neural network model; and inputting the face geometry into the trained neural network model for rendering to obtain a face rendering result.
Illustratively, the neural rendering of the present invention refers to attaching a layer of neural material to the face geometry (mesh) and then regressing the color under a specific viewing angle through a neural network. Rasterization refers to the process of projecting the three-dimensional mesh onto the pixel plane for visualization.
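The projection step at the heart of rasterization can be sketched with a simple pinhole camera model. This is a generic illustration, not the patent's camera model; the intrinsic parameters (`f`, `cx`, `cy`) and the function name are assumptions.

```python
import numpy as np

def project_vertices(verts: np.ndarray, f: float, cx: float, cy: float) -> np.ndarray:
    """Pinhole projection of camera-space vertices (N, 3) onto the pixel plane.

    f: focal length in pixels; (cx, cy): principal point.
    Returns (N, 2) pixel coordinates u = f*x/z + cx, v = f*y/z + cy.
    """
    x, y, z = verts[:, 0], verts[:, 1], verts[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

# A vertex on the optical axis projects to the principal point; an
# off-axis vertex lands proportionally off-center.
uv = project_vertices(np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]]),
                      f=100.0, cx=64.0, cy=64.0)
```

A full rasterizer additionally resolves visibility (depth testing) and interpolates per-vertex attributes across each projected triangle; only the projection itself is shown here.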
Specifically, the invention uses the neural rendering technology to render the face, and can be divided into the following key steps:
data collection and preparation: a large number of images and associated rendering parameters need to be prepared as training data. These data will be used to train the neural network so that it can understand the illumination, texture and geometry information in different scenarios.
Training a neural network: a neural network model is constructed using deep learning techniques. The model iterates over the training data, gradually optimizing its rendering capabilities by continually adjusting the parameters. After training is completed, the neural network will be able to generate highly realistic images from the entered scene information.
Image generation: after training, the neural network can accept a description of the scene as input and generate a realistic image that matches it. It can simulate the propagation and interaction of light rays, generating images that are almost indistinguishable from the real world.
Thus, a face rendering result is obtained based on the trained neural network model.
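The training loop described above can be sketched in miniature. As an illustration only, a linear model stands in for the neural renderer, and the synthetic features stand in for scene descriptions; none of this reflects the patent's actual network, but the loop structure (predict, measure loss, adjust parameters) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for training data: "scene descriptions" and target pixels.
features = rng.normal(size=(256, 4))
true_w = np.array([0.5, -0.2, 0.1, 0.3])
target = features @ true_w

w = np.zeros(4)     # renderer parameters, initialized to zero
lr = 0.1
for _ in range(200):
    pred = features @ w                                  # render
    grad = features.T @ (pred - target) / len(features)  # loss gradient
    w -= lr * grad                                       # adjust parameters

final_loss = float(np.mean((features @ w - target) ** 2))
```

After the iterations the parameters closely match the generating weights, mirroring how the trained network's rendering converges toward the captured images.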
And S4, performing pixel-by-pixel image loss calculation on the neural rendering image and the image in the face video, and reconstructing a complete face sequence according to the loss calculation result to complete dynamic face reconstruction.
Specifically, based on the vertex-deformed three-dimensional face model, the embodiment of the invention can obtain multi-view rendering pictures using rasterization and neural rendering, and compute a pixel-by-pixel picture loss between the rendering pictures and the captured video.
Specifically, the pixel-by-pixel picture loss is calculated using a landmark loss and a render loss, and the gradients are back-propagated. The landmark loss concerns the 68 key points of the face: it is the distance loss between the projected face key points and the ground-truth (GT) key points.
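The two losses named above can be sketched as follows. The choice of L1 for the render loss, the mean Euclidean distance for the landmarks, and the 0.1 weight are illustrative assumptions; the patent does not specify these details.

```python
import numpy as np

def render_loss(rendered: np.ndarray, captured: np.ndarray) -> float:
    """Pixel-by-pixel L1 loss between the neural rendering and the video frame."""
    return float(np.mean(np.abs(rendered - captured)))

def landmark_loss(projected: np.ndarray, gt: np.ndarray) -> float:
    """Mean distance between the 68 projected face key points and GT key points."""
    return float(np.mean(np.linalg.norm(projected - gt, axis=1)))

# Toy inputs: an 8x8 RGB frame pair and 68 2D key points shifted by (1, 1).
img_a = np.zeros((8, 8, 3))
img_b = np.full((8, 8, 3), 0.5)
lm = np.zeros((68, 2))
total = render_loss(img_a, img_b) + 0.1 * landmark_loss(lm + 1.0, lm)
```

In the full pipeline these scalar losses would be computed per view and per frame, and their gradients back-propagated to the vertex offsets and neural material.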
Further, reconstructing the complete face sequence based on the loss calculation result to complete dynamic face reconstruction.
According to the sparse multi-view dynamic face reconstruction method provided by the embodiment of the invention, high-quality face geometry can be reconstructed from a sparse multi-view video; with the neural-material representation, the face material is not limited to diffuse reflection, and the reconstruction effect is good and the accuracy is high.
In order to implement the above embodiment, as shown in fig. 2, a sparse multi-view dynamic face reconstruction device 10 is further provided in this embodiment, where the device 10 includes a face video acquisition module 100, a face geometry acquisition module 200, a neural network rendering module 300, and a dynamic face reconstruction module 400;
The face video acquisition module 100 is configured to acquire a multi-view face video;
the face geometry acquisition module 200 is configured to input a face video to the deformed face parameterized geometric model to output a deformed face geometry;
The neural network rendering module 300 is configured to perform neural rendering and rasterization processing on the face geometry to obtain a multi-view neural rendering image;
the dynamic face reconstruction module 400 is configured to perform pixel-by-pixel image loss calculation on the neural rendering image and the image in the face video, and reconstruct a complete face sequence according to the loss calculation result to complete dynamic face reconstruction.
Further, upstream of the face geometry acquisition module 200, the device further includes a face parameterized geometric model deformation module, configured for:
Constructing a face parameterized geometric model;
adding an offset to each vertex of the face parameterized geometric model to obtain an offset addition result;
and obtaining the deformed face parameterized geometric model based on the offset addition result and the face parameterized geometric model.
Further, the face geometry is represented based on a Mesh three-dimensional geometry representation method.
Further, the face parameterized geometric model is a face 3DMM model.
Further, the neural network rendering module 300 is further configured to:
Obtaining a face geometric sample and rendering parameters; the rendering parameters comprise illumination, materials and geometric information under different scenes;
Training a neural network model by using the face geometric sample and the rendering parameters to update network parameters based on the iteration result of the neural network model to obtain a trained neural network model;
and inputting the face geometry into the trained neural network model for rendering to obtain a face rendering result.
The specific manner in which the various modules perform their operations in the device of the above embodiment has been described in detail in the embodiments of the method, and will not be elaborated here.
According to the sparse multi-view dynamic face reconstruction device provided by the embodiment of the invention, high-quality face geometry can be reconstructed from a sparse multi-view video; with the neural-material representation, the face material is not limited to diffuse reflection, and the reconstruction effect is good and the accuracy is high.
In order to implement the above-described embodiments, the present application also proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a method as described in the previous embodiments.
FIG. 3 illustrates a schematic block diagram of an example computer device 700 that may be used to implement an embodiment of the present application. Computer devices are intended to represent various forms of digital computers, such as laptops, desktops, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The computer device may also represent various forms of mobile apparatuses, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 3, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, such as the sparse multi-view dynamic face reconstruction method. For example, in some embodiments, the sparse multi-view dynamic face reconstruction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the sparse multi-view dynamic face reconstruction method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the sparse multi-view dynamic face reconstruction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present application are achieved, and the present application is not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example, two or three, unless specifically defined otherwise.
Claims (12)
1. A sparse multi-view dynamic face reconstruction method, characterized by comprising the following steps:
acquiring a multi-view face video;
inputting the face video into a deformed face parameterized geometric model to output a deformed face geometry;
performing neural rendering and rasterization processing on the face geometry to obtain multi-view neural rendering images;
and performing pixel-by-pixel image loss calculation between the neural rendering images and the images in the face video, and reconstructing a complete face sequence according to the loss calculation result to complete the dynamic face reconstruction.
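The pixel-by-pixel image loss in the last step of claim 1 can be sketched as a mean squared error over all camera views. This is a minimal illustration only, not the patent's exact loss; the array shapes and the use of an L2 metric are assumptions:

```python
import numpy as np

def photometric_loss(rendered, captured):
    """Pixel-by-pixel image loss between the multi-view neural
    renderings and the corresponding frames of the face video.
    Both arrays are assumed to have shape (views, H, W, 3)."""
    rendered = np.asarray(rendered, dtype=np.float64)
    captured = np.asarray(captured, dtype=np.float64)
    assert rendered.shape == captured.shape
    return float(np.mean((rendered - captured) ** 2))

# Identical images give zero loss; a constant offset of 1 gives loss 1.
frames = np.ones((2, 4, 4, 3))
print(photometric_loss(frames, frames))      # 0.0
print(photometric_loss(frames + 1, frames))  # 1.0
```

In an actual optimization loop, this scalar would be minimized with respect to the geometry and rendering-network parameters to drive the reconstruction toward the captured video.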
2. The method of claim 1, wherein before inputting the face video into the deformed face parameterized geometric model to output the deformed face geometry, the method further comprises:
constructing a face parameterized geometric model;
adding an offset to each vertex of the face parameterized geometric model to obtain an offset addition result;
and obtaining the deformed face parameterized geometric model based on the offset addition result and the face parameterized geometric model.
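The per-vertex offset in claim 2 amounts to adding a learnable displacement to each vertex of the parameterized model. A hypothetical sketch, where `base_vertices` stands in for the (N, 3) vertex array produced by the parameterized model:

```python
import numpy as np

class DeformableFaceModel:
    """Face parameterized geometric model augmented with one offset per
    vertex (an illustrative stand-in; the patent does not specify how
    the offsets are stored or optimized)."""

    def __init__(self, base_vertices):
        self.base = np.asarray(base_vertices, dtype=np.float64)  # (N, 3)
        self.offsets = np.zeros_like(self.base)                  # (N, 3)

    def deformed_vertices(self):
        # Deformed geometry = parameterized geometry + per-vertex offsets.
        return self.base + self.offsets

model = DeformableFaceModel(np.zeros((5, 3)))
model.offsets[0] = [0.01, 0.0, 0.0]  # nudge the first vertex along x
print(model.deformed_vertices()[0])  # first vertex shifted by 0.01
```

Starting the offsets at zero means the deformed model initially coincides with the parameterized model, which is a common choice when the offsets are later refined by optimization.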
3. The method according to claim 1, wherein the face geometry is represented using a Mesh-based three-dimensional geometric representation.
4. The method of claim 1, wherein the face parameterized geometric model is a face 3DMM model.
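A 3DMM, as referenced in claim 4, represents a face as a mean shape plus linear combinations of shape and expression bases. A toy sketch with made-up dimensions (real models such as the Basel Face Model use tens of thousands of vertices and dozens of components):

```python
import numpy as np

def morphable_face(mean_shape, shape_basis, exp_basis, alpha, beta):
    """Linear 3DMM: V = mean + B_shape @ alpha + B_exp @ beta.
    mean_shape: (3N,), shape_basis: (3N, Ks), exp_basis: (3N, Ke)."""
    flat = mean_shape + shape_basis @ alpha + exp_basis @ beta
    return flat.reshape(-1, 3)  # (N, 3) vertex positions

# Toy model: 4 vertices, 2 shape and 2 expression components.
rng = np.random.default_rng(0)
mean_shape = rng.normal(size=12)
B_shape = rng.normal(size=(12, 2))
B_exp = rng.normal(size=(12, 2))
verts = morphable_face(mean_shape, B_shape, B_exp, np.zeros(2), np.zeros(2))
print(verts.shape)  # (4, 3)
```

With all coefficients at zero the output is exactly the mean face, which is why 3DMM fitting typically starts from the mean and optimizes `alpha` and `beta` against image evidence.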
5. The method of claim 1, wherein the neural rendering of the face geometry comprises:
obtaining a face geometry sample and rendering parameters, wherein the rendering parameters comprise illumination, material, and geometry information under different scenes;
training a neural network model using the face geometry sample and the rendering parameters, and updating the network parameters based on the iteration results of the neural network model to obtain a trained neural network model;
and inputting the face geometry into the trained neural network model for rendering to obtain a face rendering result.
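The training step in claim 5 can be sketched as fitting a small network that maps geometry, illumination, and material descriptors to pixel colors. This is a toy one-layer stand-in on synthetic data; the patent's actual renderer architecture and inputs are not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples: each row concatenates geometry, illumination and
# material descriptors (made-up stand-ins for real scene parameters).
X = rng.normal(size=(256, 9))          # 3 geometry + 3 light + 3 material
true_w = 0.3 * rng.normal(size=(9, 3))
Y = np.tanh(X @ true_w)                # target "rendered" RGB per sample

# One-layer tanh network trained by gradient descent on the image loss;
# this mimics iteratively updating network parameters during training.
W = np.zeros((9, 3))
for _ in range(500):
    pred = np.tanh(X @ W)
    grad = X.T @ ((pred - Y) * (1.0 - pred ** 2)) / len(X)
    W -= 0.5 * grad

final_loss = float(np.mean((np.tanh(X @ W) - Y) ** 2))
print(final_loss)  # close to zero after training
```

Because the toy network matches the data-generating model, the loss drives toward zero; a real neural renderer would instead be trained on rendered/captured image pairs under varied illumination and materials.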
6. A sparse multi-view dynamic face reconstruction device, comprising:
a face video acquisition module, used for acquiring a multi-view face video;
a face geometry acquisition module, used for inputting the face video into a deformed face parameterized geometric model to output a deformed face geometry;
a neural network rendering module, used for performing neural rendering and rasterization processing on the face geometry to obtain multi-view neural rendering images;
and a dynamic face reconstruction module, used for performing pixel-by-pixel image loss calculation between the neural rendering images and the images in the face video, and reconstructing a complete face sequence according to the loss calculation result to complete the dynamic face reconstruction.
7. The apparatus of claim 6, further comprising a face parameterized geometric model deformation module, used for:
constructing a face parameterized geometric model;
adding an offset to each vertex of the face parameterized geometric model to obtain an offset addition result;
and obtaining a deformed face parameterized geometric model based on the offset addition result and the face parameterized geometric model.
8. The apparatus of claim 6, wherein the face geometry is represented using a Mesh-based three-dimensional geometric representation.
9. The apparatus of claim 6, wherein the face parameterized geometric model is a face 3DMM model.
10. The apparatus of claim 6, wherein the neural network rendering module is further configured to:
obtaining a face geometry sample and rendering parameters, wherein the rendering parameters comprise illumination, material, and geometry information under different scenes;
training a neural network model using the face geometry sample and the rendering parameters, and updating the network parameters based on the iteration results of the neural network model to obtain a trained neural network model;
and inputting the face geometry into the trained neural network model for rendering to obtain a face rendering result.
11. A computer device comprising a processor and a memory;
wherein the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code to implement the sparse multi-view dynamic face reconstruction method according to any one of claims 1-4.
12. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the sparse multi-view dynamic face reconstruction method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410130872.3A CN118053189A (en) | 2024-01-30 | 2024-01-30 | Sparse multi-view dynamic face reconstruction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118053189A true CN118053189A (en) | 2024-05-17 |
Family
ID=91046129
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410130872.3A Pending CN118053189A (en) | 2024-01-30 | 2024-01-30 | Sparse multi-view dynamic face reconstruction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118053189A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110889890B (en) | Image processing method and device, processor, electronic equipment and storage medium | |
CN114820905B (en) | Virtual image generation method and device, electronic equipment and readable storage medium | |
CN113379932B (en) | Human body three-dimensional model generation method and device | |
CN110458924B (en) | Three-dimensional face model establishing method and device and electronic equipment | |
CN111754622B (en) | Face three-dimensional image generation method and related equipment | |
CN115222879B (en) | Model face reduction processing method and device, electronic equipment and storage medium | |
CN115100337A (en) | Whole body portrait video relighting method and device based on convolutional neural network | |
CN116342782A (en) | Method and apparatus for generating avatar rendering model | |
CN111754431A (en) | Image area replacement method, device, equipment and storage medium | |
CN113808249B (en) | Image processing method, device, equipment and computer storage medium | |
CN113313631B (en) | Image rendering method and device | |
CN115965735B (en) | Texture map generation method and device | |
CN115375847B (en) | Material recovery method, three-dimensional model generation method and model training method | |
CN113961746B (en) | Video generation method, device, electronic equipment and readable storage medium | |
CN115375816A (en) | Micro-rendering method, device, electronic equipment and storage medium | |
CN118053189A (en) | Sparse multi-view dynamic face reconstruction method and device | |
CN114998514A (en) | Virtual role generation method and equipment | |
CN115953553B (en) | Avatar generation method, apparatus, electronic device, and storage medium | |
CN116012666B (en) | Image generation, model training and information reconstruction methods and devices and electronic equipment | |
CN117745915B (en) | Model rendering method, device, equipment and storage medium | |
CN114820908B (en) | Virtual image generation method and device, electronic equipment and storage medium | |
CN115147578B (en) | Stylized three-dimensional face generation method and device, electronic equipment and storage medium | |
CN116385643B (en) | Virtual image generation method, virtual image model training method, virtual image generation device, virtual image model training device and electronic equipment | |
CN116363331B (en) | Image generation method, device, equipment and storage medium | |
CN116206046B (en) | Rendering processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||