CN113744129A - Semantic neural rendering-based face image generation method and system - Google Patents
- Publication number
- CN113744129A (application number CN202111050013.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- face image
- deformation
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
      - G06T3/00—Geometric image transformations in the plane of the image
        - G06T3/04—Context-preserving transformations, e.g. by using an importance map
      - G06T7/00—Image analysis
        - G06T7/20—Analysis of motion
          - G06T7/207—Analysis of motion for motion estimation over a hierarchy of resolutions
      - G06T2207/00—Indexing scheme for image analysis or image enhancement
        - G06T2207/10—Image acquisition modality
          - G06T2207/10016—Video; Image sequence
Abstract
A face image generation method based on semantic neural rendering comprises the following steps: S1, a mapping network generates a hidden vector from a target face motion descriptor; S2, under the guidance of the hidden vector, a deformation network estimates the accurate deformation between a source face image and the desired target image, and warps the source face image with the estimated deformation parameters to generate a coarse deformed image; and S3, an editing network generates a final fine image from the coarse deformed image. The method can generate images with more accurate motion, producing more realistic results and accurate movement while still retaining the identity information of the source face image. It can generate not only realistic images with the correct global pose, but also vivid micro-expressions, such as pouting and raising eyebrows. In addition, motion-irrelevant information in the source face image is well preserved.
Description
Technical Field
The invention relates to face image generation and neural rendering, in particular to a face image generation method and a face image generation system based on semantic neural rendering.
Background
A face image is one of the most important photographic contents, widely used in daily life. Editing a portrait image by modifying the pose and expression of a given face is an important task with a variety of application scenarios. However, achieving such editing is extremely challenging, as it requires automatically perceiving the true 3D geometry of any given face. At the same time, the acuity of the human visual system toward portrait images requires the algorithm to generate realistic faces and backgrounds, which makes the task even more difficult.
To achieve intuitive control, the motion descriptors should be semantically meaningful, which requires representing facial expressions, head rotations, and translations as completely decoupled variables. Parametric face modeling methods provide a powerful tool for describing 3D faces with semantic parameters, allowing the shape, expression, and other characteristics of a 3D face to be controlled through parameters. With the priors provided by these techniques, one can hope to control the generation of realistic face images in a manner similar to a graphics rendering process. Currently, some model-based methods combine rendered images of a three-dimensional morphable face model (3DMM) and edit portrait images by modifying expression or pose parameters. These methods achieve impressive results, but they are person-specific, which means they cannot be applied to arbitrary portraits.
In a 3DMM, the 3D shape S of a face is parameterized as S = S̄ + B_id α + B_exp β, where S̄ is the average 3D face shape, and B_id and B_exp are the identity and expression bases obtained by scanning 200 faces and performing principal component analysis. The parameters α and β are 80-dimensional and 64-dimensional, respectively, and describe the identity and expression features of the face. The rotation and translation of the face are expressed as R ∈ SO(3) and t ∈ R³. Thus, the motion information of a face can be completely expressed by (β, R, t) in the 3DMM.
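The linear 3DMM shape model above can be sketched in a few lines of numpy. The vertex count and random bases here are toy placeholders (real models use tens of thousands of vertices and PCA bases learned from scans); only the 80/64-dimensional coefficient sizes come from the description.

```python
import numpy as np

# Illustrative dimensions: 80-dim identity and 64-dim expression coefficients
# (from the description); the tiny vertex count is a placeholder.
N_VERTS = 5
DIM_ID, DIM_EXP = 80, 64

rng = np.random.default_rng(0)
S_mean = rng.normal(size=(3 * N_VERTS,))        # average face shape S̄, flattened (x,y,z per vertex)
B_id = rng.normal(size=(3 * N_VERTS, DIM_ID))   # identity basis (PCA in a real model)
B_exp = rng.normal(size=(3 * N_VERTS, DIM_EXP)) # expression basis (PCA in a real model)

def face_shape(alpha, beta):
    """Standard 3DMM shape: S = S̄ + B_id @ alpha + B_exp @ beta."""
    return S_mean + B_id @ alpha + B_exp @ beta

# Zero coefficients recover the mean face exactly.
assert np.allclose(face_shape(np.zeros(DIM_ID), np.zeros(DIM_EXP)), S_mean)
```

A target motion descriptor p = (β, R, t) then carries only the expression coefficients plus rigid head pose, leaving identity α untouched.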
Disclosure of Invention
The invention provides a face image generation method and system based on semantic neural rendering, which can generate an image with more accurate action.
The technical scheme of the invention is as follows:
according to one aspect of the invention, a face image generation method based on semantic neural rendering is provided, which comprises the following steps: s1, a mapping network generates a hidden vector from a target face motion descriptor; s2, under the guidance of the hidden vector, a deformation network estimates accurate deformation between a source face image and a required target image, and deforms the source face image by using the estimated deformation parameters to generate a rough deformed image; and S3, generating a final fine image from the roughly deformed image by the editing network.
Preferably, in the above face image generation method based on semantic neural rendering, in step S1, the target face motion descriptor includes the expression, rotation, and translation information of the target face; after the target face motion descriptor is obtained, the mapping network generates a hidden vector from it.
Preferably, in the above method for generating a face image based on semantic neural rendering, in step S2, under the guidance of the hidden vector z, the deformation network estimates an accurate deformation between the source face image and the desired target image to obtain an optical flow field, and deforms the source face image by using the estimated optical flow field to generate a rough deformed image.
Preferably, in the above method for generating a face image based on semantic neural rendering, in step S3, the editing network receives the coarse deformed image obtained in the previous step, and combines the source face image and the hidden vector to obtain a final fine image.
According to another aspect of the invention, a face image generation system based on semantic neural rendering is provided, which comprises a mapping network, a deformation network and an editing network, wherein the mapping network is used for mapping an object motion descriptor to a hidden vector; the deformation network is used for estimating the accurate deformation between the source face image and the required target image under the guidance of the hidden vector, and deforming the source face image by using the estimated deformation parameters to generate a rough deformed image; and an editing network for generating a clear image with rich details by editing the coarse morphed image, and generating a final fine image from the coarse morphed image.
According to the technical scheme of the invention, the beneficial effects are as follows:
the semantic neural rendering-based face image generation method and system can generate images with more accurate actions, can generate more vivid results and accurate movement, and simultaneously still retain the identity information of the source face image. Not only can a realistic image with the correct global pose be generated, but also vivid micro-presentations, such as pounding mouth and raising eyebrows, can be generated. In addition, information in irrelevant source face images is well preserved.
For a better understanding and appreciation of the concepts, principles of operation, and effects of the invention, reference will now be made in detail to the following examples, taken in conjunction with the accompanying drawings, in which:
drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below.
FIG. 1 is a flow chart of a semantic neural rendering based face image generation method of the present invention;
FIG. 2 is a network overall frame diagram of the semantic neural rendering-based face image generation method of the present invention;
FIG. 3 is a qualitative comparison graph of the present invention and other algorithms on the task of intuitive face image control;
fig. 4 is an effect diagram of the indirect human face image editing task according to the present invention.
Detailed Description
In order to make the objects, technical means and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific examples. These examples are merely illustrative and not restrictive of the invention.
A face image generation method and system based on semantic neural rendering relate to a novel neural rendering model: given a source face image and target 3DMM parameters, the model can generate a realistic result with accurate target motion. The proposed system model can be divided into three parts: a mapping network, a deformation network, and an editing network. The mapping network generates a hidden vector from the motion descriptor. Under the guidance of the hidden vector, the deformation network estimates the accurate deformation between the source face image and the desired target image, and warps the source face image with the estimated deformation parameters to generate a coarse result. Finally, the editing network generates the final fine image from the coarse image.
Fig. 1 is a flowchart of a semantic-neural-rendering-based face image generation method of the present invention, and fig. 2 is an overall framework diagram of a semantic-neural-rendering-based face image generation system of the present invention, which is described with reference to fig. 1 and fig. 2, and includes the following steps:
S1. The mapping network generates a hidden vector from the target face motion descriptor (as shown in FIG. 2). In this step, the target face motion descriptor p includes the expression, rotation, and translation information of the target face. After the target face motion descriptor p is obtained, the mapping network generates a hidden vector z from p.
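The patent does not specify the mapping network's architecture, so the following is only a minimal sketch: a two-layer MLP mapping the motion descriptor p = (β, R, t) to a hidden vector z. All dimensions (70-dim descriptor, 128 hidden units, 256-dim z) and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: p packs a 64-dim expression beta, a 3-dim rotation,
# and a 3-dim translation -> 70 dims; z is chosen as 256-dim.
P_DIM, HIDDEN, Z_DIM = 70, 128, 256

W1 = rng.normal(scale=0.1, size=(HIDDEN, P_DIM)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(Z_DIM, HIDDEN)); b2 = np.zeros(Z_DIM)

def mapping_network(p):
    """Toy two-layer MLP: motion descriptor p -> hidden vector z."""
    h = np.maximum(W1 @ p + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

p = rng.normal(size=P_DIM)  # a random stand-in for (beta, R, t)
z = mapping_network(p)
assert z.shape == (Z_DIM,)
```

In the full system, z is then injected into the deformation and editing networks as conditioning information.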
S2. Under the guidance of the hidden vector, the deformation network estimates the accurate deformation between the source face image and the desired target image, and warps the source face image with the estimated deformation parameters to generate a coarse deformed image. In this step, guided by the hidden vector z, the deformation network estimates the accurate deformation between the source face image I_s and the desired target image to obtain an optical flow field w, and warps I_s using the estimated w to generate the coarse deformed image.
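The warping step above — sampling the source image through an estimated flow field — can be sketched with plain numpy bilinear sampling. This is a generic backward-warp, not the patent's specific deformation network; the flow convention (per-pixel (dx, dy) offsets into the source) is an assumption.

```python
import numpy as np

def warp_with_flow(img, flow):
    """Backward-warp a grayscale image img (H, W) by a flow field (H, W, 2):
    output pixel (y, x) is bilinearly sampled from source (y + dy, x + dx),
    with out-of-range coordinates clamped to the image border."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy

img = np.arange(16, dtype=float).reshape(4, 4)
# A flow of +1 in x: every output pixel reads one column to its right.
flow = np.zeros((4, 4, 2)); flow[..., 0] = 1.0
out = warp_with_flow(img, flow)
assert np.allclose(out[:, :3], img[:, 1:])
```

In the actual system the flow w is predicted by the deformation network conditioned on z; the sampling itself is differentiable, which is what lets the flow be learned end-to-end.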
S3. The editing network generates the final fine image from the coarse deformed image. In this step, the editing network receives the coarse deformed image obtained in the previous step and combines it with the source face image I_s and the hidden vector z to obtain the final fine image, i.e., the generated image in FIG. 2.
FIG. 2 is the overall network framework diagram of the semantic neural rendering-based face image generation method of the present invention. Given a source face image (the source image I_s in FIG. 2) and a target face motion descriptor, the output of the model is a face image with accurate target motion that retains the other information of the source face image, such as identity, lighting, and background. As shown in FIG. 2, the semantic neural rendering-based face image generation system model of the present invention can be divided into three parts: a mapping network, a deformation network, and an editing network. First, the target motion descriptor is mapped to a hidden vector; then a coarse image is generated by the deformation network; finally, the editing network is responsible for generating a sharp image with rich details by editing the coarse result (i.e., the generated image).
The invention also provides a face image generation system based on semantic neural rendering, which comprises a mapping network, a deformation network and an editing network, wherein the mapping network is used for mapping the target motion descriptor to the hidden vector; the deformation network is used for estimating the accurate deformation between the source face image and the required target image under the guidance of the hidden vector, and deforming the source face image by using the estimated deformation parameters to generate a rough deformed image; and an editing network for generating a clear image with rich details by editing the coarse morphed image, and generating a final fine image from the coarse morphed image.
Fig. 3 shows a qualitative comparison of the present invention (labeled "the present model" in FIG. 3) with other algorithms on the task of intuitive face image control. It can be seen that the compared StyleRig model produces impressive results with realistic details. However, it tends to generate images with a conservative strategy: face motions far from the distribution center are attenuated or ignored in exchange for better image quality. Meanwhile, some factors unrelated to the facial motion (such as glasses and clothing) are changed during the modification process. Although the proposed system was not trained on the FFHQ dataset, it still achieves impressive results when tested on it. The system model of the present invention can generate not only realistic images with the correct global pose, but also vivid micro-expressions, such as pouting and raising eyebrows. In addition, motion-irrelevant information in the source face image is well preserved.
Compared with existing face image generation methods, the method provided by the invention has two advantages: better generation quality and higher accuracy of the face motion. The two concepts of generation quality and face motion accuracy, and their related evaluation metrics, are explained below:
the quality of generation: and measuring whether the generated face image has higher image quality. On the evaluation index, the evaluation is divided into objective evaluation and subjective evaluation. Fraich perceptual distance is a commonly used objective assessment method of production quality. To calculate the Frey's perception distance of a face image generation model, a batch of face images is first generated using the model, and a batch of images is sampled from the data set for comparison. Then, the characteristics of the two batches of images are extracted, the statistical characteristics of the two batches of images are calculated, and the difference of distribution between the generated image and the real image is measured based on the statistical characteristics to serve as the evaluation of the quality of the generated image.
Face motion accuracy: and measuring whether the generated face image has the target face motion characteristics.
Specifically, the accuracy of the facial motion is measured by computing the average distance between the 3DMM expression parameters of the generated and target images (the average expression distance) and between their pose parameters (the average pose distance).
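These two metrics reduce to mean L2 distances over fitted 3DMM parameters. The sketch below assumes a 64-dim expression vector β and a 6-dim pose vector (rotation + translation); the exact pose encoding and the choice of L2 norm are illustrative assumptions, since the patent only names the metrics.

```python
import numpy as np

def motion_distances(gen_params, tgt_params):
    """Average expression distance (AED) and average pose distance (APD)
    between 3DMM parameters fitted to generated and target images.
    Each entry is a dict with 'beta' (64-dim expression) and 'pose'
    (assumed 6-dim rotation + translation)."""
    aed = np.mean([np.linalg.norm(g["beta"] - t["beta"])
                   for g, t in zip(gen_params, tgt_params)])
    apd = np.mean([np.linalg.norm(g["pose"] - t["pose"])
                   for g, t in zip(gen_params, tgt_params)])
    return aed, apd

gen = [{"beta": np.zeros(64), "pose": np.zeros(6)}]
tgt = [{"beta": np.ones(64), "pose": np.zeros(6)}]
aed, apd = motion_distances(gen, tgt)
assert np.isclose(aed, 8.0) and apd == 0.0  # ||ones(64)|| = sqrt(64) = 8
```

Lower AED/APD means the generated face reproduces the target expression and head pose more faithfully.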
Table 1 shows a quantitative comparison of the present invention with other algorithms on the task of intuitive face image control. As can be seen from Table 1, by using a style-based generative adversarial network (StyleGAN) model as its final generator, the StyleRig model is able to generate more realistic images, resulting in a lower Fréchet Inception Distance (FID) score. However, its higher average expression distance and average pose distance indicate that it may not faithfully reconstruct the target facial motion. Unlike the StyleRig model, the method and system model provided by the invention generate images with more accurate motion.
TABLE 1 quantitative comparison of the present invention to other algorithms on intuitive face image control task
| | Fréchet Inception Distance (FID) | Average expression distance | Average pose distance |
|---|---|---|---|
| StyleRig model | 47.37 | 0.316 | 0.0919 |
| Our model | 65.97 | 0.257 | 0.0252 |
Fig. 4 is an effect diagram of the indirect human face image editing task according to the present invention. It can be seen that the system model proposed by the present invention can generate more realistic results and accurate motion while still preserving the identity information of the source face image.
In summary, in order to realize controllable face image generation, the invention provides a novel neural rendering model. Given the source face image and the target 3DMM parameters, the model will produce a realistic result with accurate target motion. The proposed model can be divided into three parts: mapping networks, morphing networks, and editing networks. The mapping network generates hidden vectors from the motion descriptors. Under the guidance of the implicit vector, the deformation network estimates the accurate deformation between the source face image and the required target image, and deforms the source face image by using the estimated deformation parameters to generate a rough result. Finally, the editing network generates a final fine image from the coarse image.
Experiments demonstrate the superiority and versatility of the proposed model. They show that the model not only enables intuitive image control through user-specified facial motions, but also generates realistic results in the indirect portrait editing task (also called face reenactment), whose goal is to mimic another person's facial motions.
The foregoing description is of the preferred embodiment of the concepts and principles of operation in accordance with the invention. The above-described embodiments should not be construed as limiting the scope of the claims, and other embodiments and combinations of implementations according to the inventive concept are within the scope of the invention.
Claims (5)
1. A face image generation method based on semantic neural rendering is characterized by comprising the following steps:
s1, a mapping network generates a hidden vector from a target face motion descriptor;
s2, under the guidance of the hidden vector, a deformation network estimates accurate deformation between a source face image and a required target image, and deforms the source face image by using an estimated deformation parameter to generate a rough deformed image; and
S3, generating a final fine image from the rough deformed image by the editing network.
2. The method for generating a facial image based on semantic neural rendering of claim 1, wherein in step S1, the target facial motion descriptor comprises expression, rotation and transformation information of a target face, and after obtaining the target facial motion descriptor, the mapping network generates the hidden vector from the target facial motion descriptor.
3. The semantic neural rendering-based face image generation method according to claim 1, wherein in step S2, under the guidance of the implicit vector z, the deformation network estimates an accurate deformation between the source face image and the desired target image, obtains an optical flow field, and generates a rough deformed image by deforming the source face image using the estimated optical flow field.
4. The method for generating a facial image based on semantic neural rendering of claim 1, wherein in step S3, the editing network receives the coarse deformed image obtained in the previous step, and combines the source facial image and the hidden vector to obtain a final fine image.
5. A face image generation system based on semantic neural rendering is characterized by comprising a mapping network, a deformation network and an editing network, wherein,
a mapping network for mapping the target motion descriptor to the hidden vector;
the deformation network is used for estimating the accurate deformation between the source face image and the required target image under the guidance of the hidden vector, and deforming the source face image by using the estimated deformation parameters to generate a rough deformed image; and
and the editing network is used for generating a clear image with rich details by editing the rough deformed image and generating a final fine image from the rough deformed image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111050013.6A CN113744129A (en) | 2021-09-08 | 2021-09-08 | Semantic neural rendering-based face image generation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113744129A (en) | 2021-12-03
Family
ID=78737158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111050013.6A Pending CN113744129A (en) | 2021-09-08 | 2021-09-08 | Semantic neural rendering-based face image generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744129A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114648613A (en) * | 2022-05-18 | 2022-06-21 | 杭州像衍科技有限公司 | Three-dimensional head model reconstruction method and device based on deformable nerve radiation field |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107563323A (en) * | 2017-08-30 | 2018-01-09 | 华中科技大学 | A kind of video human face characteristic point positioning method |
US20180046854A1 (en) * | 2015-02-16 | 2018-02-15 | University Of Surrey | Three dimensional modelling |
CN109961507A (en) * | 2019-03-22 | 2019-07-02 | 腾讯科技(深圳)有限公司 | A kind of Face image synthesis method, apparatus, equipment and storage medium |
CN110660076A (en) * | 2019-09-26 | 2020-01-07 | 北京紫睛科技有限公司 | Face exchange method |
CN110717418A (en) * | 2019-09-25 | 2020-01-21 | 北京科技大学 | Method and system for automatically identifying favorite emotion |
GB202007052D0 (en) * | 2020-05-13 | 2020-06-24 | Facesoft Ltd | Facial re-enactment |
CN111971713A (en) * | 2018-06-14 | 2020-11-20 | 英特尔公司 | 3D face capture and modification using image and time tracking neural networks |
CN113239857A (en) * | 2021-05-27 | 2021-08-10 | 京东科技控股股份有限公司 | Video synthesis method and device |
CN113343761A (en) * | 2021-05-06 | 2021-09-03 | 武汉理工大学 | Real-time facial expression migration method based on generation confrontation |
- 2021-09-08: application CN202111050013.6A filed; published as CN113744129A; status Pending
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20211203 |