CN114581612A

CN114581612A - High-fidelity face reproduction method represented by mixed actions

Info

Publication number: CN114581612A
Application number: CN202210459830.5A
Authority: CN
Inventors: 邵长乐; 耿嘉仪; 练智超; 韦志辉
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2022-04-28
Filing date: 2022-04-28
Publication date: 2022-06-03
Anticipated expiration: 2042-04-28
Also published as: CN114581612B

Abstract

The invention discloses a high-fidelity face reproduction method represented by mixed actions, and belongs to the field of deep face counterfeiting. Extracting action units and posture information of a driving face and key point information of a source face; converting the key point information of the source face by using a key point conversion module according to the action unit and the posture information of the driving face; separating a source face picture into a face region and a background region by using a pre-trained segmentation network; inputting the action units, the converted key point information and the face area into a reproduction network to generate a target face; and inputting the target face and the background area into a background fusion module to generate a final result. The method mixes a plurality of action representations to be used as guide signals for human face reproduction, and inserts action features by utilizing space self-adaptive regularization, so that semantic features can be better maintained in the reproduction process; meanwhile, by combining the background separation technology, the authenticity and the interframe continuity of the generated face are further improved, and high-fidelity face reproduction is realized.

Description

High-fidelity face reproduction method represented by mixed actions

Technical Field

The invention relates to deep face counterfeiting, in particular to a high-fidelity face reproduction method represented by mixed actions.

Background

The face reproduction is a process of generating animation for a source face according to actions (postures and expressions) of a driving face, and has wide application prospects in the fields of film making, augmented reality and the like. In general, the process comprises three main steps:

1) a representation of the identity of the source face is created,

2) the motion of the face is driven by extraction and coding,

3) a fake source face is generated in combination with the identity and the action representation. Each step has a significant impact on the quality of the production.

Currently, the face reconstruction technology can be mainly divided into a synthesis method based on a traditional 3D model and a generation method based on generation of countermeasure networks (GANs). In a 3D face model based approach, identity and motion features are first encoded using 3D model parameters. And then rendering the reproduced face by using the identity parameters of the source face and the motion parameters of the driving face. Although this approach can achieve high quality output, it requires a significant amount of effort to obtain a true 3D representation of a human face. The methods based on the GANs can be classified into methods based on face key points (landworks), methods based on self-supervised learning, and methods based on Action Units (AUs) according to differences in face action representations. The face key point-based method faces the problem of identity leakage because the face key points also contain face shape features while providing expression and pose information. Self-supervision based approaches also have difficulty distinguishing between identity and action. AUs-based methods are less constrained to human face shape and are difficult to produce with high quality reproductions.

Disclosure of Invention

The technical problems solved by the invention are as follows: a high-fidelity face reproduction method with multiple motion representations mixed is provided.

The technical scheme is as follows: in order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a high-fidelity face reproduction method represented by mixed actions mainly comprises the following steps:

step 1: extracting action units and posture information of a driving face and key point information of a source face;

step 2: inputting the extracted action units of the face and the key point information of the source face into a key point conversion module to obtain converted key point information of the source face;

and step 3: separating a source face picture into a face region and a background region by using a pre-trained segmentation network;

and 4, step 4: inputting the action unit of the face in the step 1, the source face key point information converted in the step 2 and the face area in the step 3 into a reproduction network to generate a target face;

and 5: and (4) inputting the target face in the step (4) and the background area in the step (3) into a background fusion module to generate a final result.

Preferably, in step 1, the method of extracting the motion unit and pose information of the driving face and the key point information of the source face is as follows:

step 1.1: let drive the human face picture as

The source face picture is

；

Which represents the linear space in which the picture is located,

the dimensional information representing the picture is displayed on the display,

respectively representing the height and width of the picture;

step 1.2: extracting action units and posture information of the driving human face, and splicing the action units and the posture information to obtain a 20-dimensional vector

；

Representing the linearity of the vectorSpace, 20x1 denotes dimensional information of the vector;

step 1.3: extracting 106 key point information of source face and regulating shape to

；

Representing the linear space in which the keypoints are located, and 212x1 representing the dimensional information of the keypoints.

Preferably, in step 2, the extracted action units and the key point information of the source face are input to a key point conversion module to obtain the converted key point information of the source face, and the method includes:

step 2.1: the key point conversion module comprises two encoders and a decoder, wherein the two encoders are respectively used for extracting the action unit for driving the face and the characteristics of the key point information of the source face, and the decoder is used for predicting the key point information of the source face

Is offset amount of

And the final converted source face key point information is

；

Step 2.2: the keypoint conversion module is trained using three loss functions, pixel level L1 loss, two countermeasures loss.

Preferably, the specific content of the loss function at pixel level L1 is: source face picture during training

And driving the face picture

Taking the same video from the same identity, thus driving the face mapSheet

Key point information of

As converted source face key point information

The true value of (c) is given,

representing a linear space in which the key points are located, and 212x1 representing dimension information of the key points; the loss function is as follows:

use of two discriminators TD against losses_rAnd TD to make the key point converter accurate and robust, wherein TDr is used to judge the converted source face key point information

The true or false of (a) is true,TDfor evaluating converted source face key point information

And source face key point information before conversion

The identity similarity of (2), the loss function of the two is defined as follows:

wherein the content of the first and second substances,

representing keypoint information for driving a face

The expected value of the distribution function of (a),

representing transformed source face keypoint information

Is determined by the expected value of the distribution function of (c),

representing source face keypoint information prior to conversion

And key point information for driving human face

Is determined by the expected value of the distribution function of (c),

representing source face keypoint information prior to conversion

And the converted source face key point information

The expected value of the distribution function of (a);

presentation discriminatorTD _rFor key point information of driving human face

The result of the authentication of the authenticity of (b),

presentation discriminatorTD _rFor the converted source face key point information

The result of the authentication of the authenticity of (b),

presentation discriminatorTDFor the source face key point information before conversion

And key point information for driving human face

The discrimination result of the identity similarity between them,

And the converted key point information of the source face

The identification result of identity similarity between the two groups;

the complete loss function of the final key point conversion module is a linear combination of the three:

in the formula (I), the compound is shown in the specification,

weights representing three loss functions respectivelyAnd (4) heavy.

Preferably, in step 3, the source face image is separated into a face region and a background region by using a pre-trained segmentation network, and the method is as follows: processing source face pictures using a pre-trained BiSeNet based face segmentation network

Obtaining a mask of the face area, respectively filling 0 pixel in the mask area and the area outside the mask to obtain the face area of the source face

And a background region

Two pictures.

Preferably, in step 4, the method for generating the target face in step 1 is as follows:

step 4.1: the source face key point information converted in the step 2 is processed

Mapping into a three-channel picture, and driving action units and posture information of the human faceAU∈R ^20×1Stitching to obtain an action representationM _d ∈R ^23×H×W，R ^23×H×WWhich represents the linear space in which the picture is located,23xHxWthe dimensional information representing the picture is displayed on the display,HandWrespectively representing the height and width of the picture;M _dface region from source face

An input together forming a reproduction network;

step 4.2: face region of source face in prediction

As input to the network, and employing a motion encoder for extracting the motion representationM _dThen inserting the extracted features into the output of 3 groups of ResBlock of the reproduction network to obtain the reproduction human face

；

Step 4.3: during training, the recurrence network is trained by using the following 3 loss functions: pixel level L1 penalty, countermeasure penalty, and perceptual penalty.

Preferably, pixel level L1 loses: using face regions for driving faces during training

As a reproduced face

The loss function is as follows:

。

preferably, the resistance to loss: using two discriminatorsGDAndGD _mto improve the realism of the generated result, whereinGDFor judging and reproducing human face

The true or false of (a) is true,GD _mfor evaluating driving actionsM _dAnd reproducing the face

The loss function is defined as follows:

in the formula (I), the compound is shown in the specification,

face region representing a source face

Is determined by the expected value of the distribution function of (c),

representing a recurring face

Is determined by the expected value of the distribution function of (c),

indicating a driving actionM _dAnd a face region for driving a face

Is determined by the expected value of the distribution function of (c),

indicating a driving actionM _dAnd reproducing the face

Is determined by the expected value of the distribution function of (c),

presentation discriminatorGDFace region of source face

The result of the authentication of the authenticity of (b),

presentation discriminatorGDFor reproducing human face

Is true ofThe result of the sexual authentication is that,

presentation discriminatorGD _mTo the driving actionM _dAnd a face region for driving a face

The result of discrimination of the correlation between them,

presentation discriminatorGD _mTo the driving actionM _dAnd reproducing the face

The correlation between the two groups.

Preferably, for perceptual loss: for minimizing reproduction of human face

And its true value

The loss function is defined as follows, whereinVFeature extraction operation on behalf of the VGG-16 model:

the final integrity loss function of the recurrent network is:

in the formula (I), the compound is shown in the specification,

the weights of the three loss functions are represented separately.

Preferably, the implementation method in step 5 is as follows:

the human face in the step 4 is reproduced

And the background area of the source face in step 3

Splicing is carried out to be used as the input of a background fusion network, and the network generates a picture

And a single-channel maskMThe final fusion result is obtained by the following formula:

in this way, the fused result will retain the input reconstructed face

The pixel content of the picture, and the final fusion result of the module during training

Loss and countermeasure loss with L2:

in the formula:

representing source face pictures

Is determined by the expected value of the distribution function of (c),

indicating the result of fusion

The expected value of the distribution function of (a),

presentation discriminatorDSource face picture

The result of the authentication of the authenticity of (b),

presentation discriminatorDFor the fusion result

The result of authentication of authenticity of (a); the complete loss function of the final background fusion module is a linear combination of the two:

in the formula (I), the compound is shown in the specification,

representing the weights of the two loss functions, respectively.

Has the advantages that: compared with the prior art, the invention has the following advantages:

(1) the invention integrates two characteristic expressions of key point information and an action unit driving the face as a guide signal for face reproduction, and can keep more face details while keeping the identity of the face.

(2) According to the method, the action characteristics are inserted through the space self-adaptive regularization, so that semantic information loss in the reproduction process is reduced, and the authenticity of a generated result is further improved.

(3) The invention utilizes the background separation technology to ensure that the reproduction network can concentrate on generating sharper faces, better maintains background information and realizes high-fidelity face reproduction.

Drawings

Fig. 1 is a flow chart diagram of the high-fidelity face reproduction method represented by the mixed actions of the invention.

Fig. 2 is a model structure diagram of a reproduction network in the method.

Detailed Description

The present invention will be further illustrated with reference to the accompanying drawings and specific examples, which are carried out on the premise of the technical solution of the present invention, and it should be understood that these examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.

The high-fidelity face reproduction method expressed by mixed actions comprises the steps of firstly extracting Action Units (AUs) and attitude information of a driving face and key point (landworks) information of a source face; converting the key point information of the source face by using a key point conversion module according to the action unit and the posture information of the driving face; then, separating the source face picture into a face region and a background region by using a pre-trained segmentation network; inputting the action unit, the converted key point information and the face area into a reproduction network to generate a target face; and finally, fusing the target face and the background area by using a background fusion network to generate a final result. The specific implementation steps of the invention are shown in fig. 1, and mainly comprise the following five steps 1-5:

step 1: extracting action units and posture information of a driving face and key point information of a source face; the specific method comprises the following steps:

step 1.1: let the human face picture be driven

The source face picture is

；

Which represents the linear space in which the picture is located,

respectively representing the height and the width of the picture;

step 1.2: extracting action units and posture information of a driving face by using a face behavior analysis tool OpenFace, wherein the action units of the face comprise the intensity of 17 action units, the posture information comprises rotation angles along 3 axes of pitch, yaw and roll, and the motion units and the posture information are spliced to obtain a 20-dimensional vector

；

Representing the linear space in which the vector is located, and 20x1 representing the dimensional information of the vector;

step 1.3: extracting 106 key point information of a source face by using a face key point detection method HyperLandmark, and adjusting the shape to

；

Step 2: inputting the extracted action units of the face and the key point information of the source face into a key point conversion module to obtain converted key point information of the source face; the method comprises the following steps:

Is offset amount of

And the final converted source face key point information is

；

Step 2.2: the keypoint conversion module is trained by three loss functions, pixel level L1 loss and two countervailing losses.

The specific content of the pixel level L1 loss function is: source face picture during training

And driving the face picture

Taking the same video from the same identity, thus driving a picture of the face

Key point information of

As converted source face key point information

The true value of (c) is given,

use of two discriminators TD against loss_rAnd TD to make the keypoint converter accurate and robust, where TDr is used to determine converted source face keypointsInformation

The true or false of (1) is true or false,TDfor evaluating converted source face key point information

And source face key point information before conversion

The loss function of the two is defined as follows:

wherein the content of the first and second substances,

representing key point information of driving human face

Is determined by the expected value of the distribution function of (c),

representing transformed source face keypoint information

Is determined by the expected value of the distribution function of (c),

representing source face keypoint information prior to conversion

And key point information for driving human face

Is determined by the expected value of the distribution function of (c),

representing source face keypoint information prior to conversion

And the converted source face key point information

The expected value of the distribution function of (a);

presentation discriminatorTD _rFor key point information of driving human face

The result of the authentication of the authenticity of (b),

The result of the authentication of the authenticity of (a),

And key point information for driving human face

The discrimination result of the identity similarity between them,

And the converted key point information of the source face

The identification result of identity similarity between the two groups;

in the formula (I), the compound is shown in the specification,

representing the weights of the three loss functions, respectively.

And step 3: separating a source face picture into a face region and a background region by using a pre-trained segmentation network; the method comprises the following steps:

processing source face pictures using a pre-trained BiSeNet-based face segmentation network

And a background region

Two pictures.

And 4, step 4: inputting the action unit of the face in the step 1, the source face key point information converted in the step 2 and the face area in the step 3 into a reproduction network shown in the figure 2 to generate a target face; the method comprises the following steps:

Mapping into a three-channel picture, and driving action units and posture information of the human faceAU∈R ^20×1Stitching to obtain an action representation

，

Which represents the linear space in which the picture is located,

respectively representing the height and width of the picture;M _dface region from source face

An input together forming a reproduction network;

step 4.2: as shown in fig. 2, the playback network adopts a Pix2 Pix-based network framework, which includes 3 sets of ResBlock residual blocks, and predicts the face region of the source face

As input to the network, and employing a motion encoder for extracting the motion representationM _dThen inserting the extracted features into the output of 3 groups of ResBlock of the recurrent network by utilizing a space regularization (SPADE) module, wherein the SPADE module is mainly used for reducing semantic loss in the generation process and finally obtaining the recurrent face

。

Pixel level L1 penalty: similar to the key point conversion module, the face region for driving the face is adopted during training

As a reproduced face

The loss function is as follows:

。

the resistance loss: using two discriminatorsGDAndGD _mto improve the realism of the generated result, whereinGDFor judging and reproducing human face

The loss function is defined as follows:

in the formula (I), the compound is shown in the specification,

face region representing a source face

Is determined by the expected value of the distribution function of (c),

means of weightModern face

The expected value of the distribution function of (a),

indicating a driving actionM _dAnd a face region for driving a face

Is determined by the expected value of the distribution function of (c),

indicating a driving actionM _dAnd reproducing the face

Is determined by the expected value of the distribution function of (c),

presentation discriminatorGDFace region of source face

The result of the authentication of the authenticity of (b),

presentation discriminatorGDFor reproducing human face

The result of the authentication of the authenticity of (b),

presentation discriminatorGD _mTo drive actionM _dAnd a face region for driving a face

The result of discrimination of the correlation between them,

The correlation between the two groups.

Loss of perception: for minimizing reproduction of human face

And its true value

The loss function is defined as follows, whereinVFeature extraction operations on behalf of the VGG-16 model:

the final integrity loss function of the recurrent network is:

in the formula (I), the compound is shown in the specification,

the weights of the three loss functions are represented separately.

And 5: and (4) inputting the target face in the step (4) and the background area in the step (3) into a background fusion module to generate a final result. The method comprises the following steps:

the human face in the step 4 is reproduced

And the background area of the source face in step 3

in this way, the fused result will retain the input reconstructed face

Loss and countermeasure loss with L2:

in the formula:

representing source face pictures

Is determined by the expected value of the distribution function of (c),

indicating the result of fusion

Is determined by the expected value of the distribution function of (c),

presentation discriminatorDSource face picture

The result of the authentication of the authenticity of (b),

presentation discriminatorDFor the fusion result

in the formula (I), the compound is shown in the specification,

representing the weights of the two loss functions, respectively.

The effectiveness and efficiency of the method of the invention are verified by the following experiments:

the evaluation criteria were Structural Similarity (SSIM) and freschel perceptual distance (FID). SSIM evaluates low-level similarity between the generated image and the truth value, with larger values being better. The FID uses a pre-trained inclusion V3 network to evaluate the perceived distance between the generated image and the real image, with smaller values being better.

The experiment employed a VoxCeleb1 data set that included a cumulative 24997 segments of real video of 1251 different identities. The data set provides a picture of a human face extracted and cropped at 1 frame per second. The experiment used video segments with an average resolution greater than 300x300, resulting in 29891 training pictures and 4284 test pictures. And the pictures are scaled to 256x256, and then the keypoint information of 106 points is extracted by using HyperLandmark, and AUs and the posture information are extracted by using OpenFace.

The results were compared with those generated by FreNet, a landworks-based method, and ICface, a AUs-based method, respectively. The results of the three methods are shown in table 1 for two evaluation indices:

TABLE 1 test results of the inventive method on the VoxColeb 1 data set

The results in table 1 show that the method of the present invention achieves better results than the method based on only the keypoint information (landworks) and only the Action Units (AUs). Specifically, for the SSIM index, the background separation technology of the invention better retains the background, thereby improving the low-level similarity between the generated result and the original image; for the FID index, the method integrates the two characteristic representations to better reserve the details of the source face, thereby reducing the perception distance between the generated result and the original image. The results show that the invention combines two characteristic representations and shows certain effectiveness on separating the background. In general, the method of the invention can fully reserve the semantic features of the human face and generate a more real human face and a more real background. Based on the above-mentioned results, the face reconstruction method using hybrid motion representation creates a high fidelity face falsification tool.

The method mixes a plurality of motion representations to be used as guide signals for human face reproduction, and inserts motion features by utilizing space self-adaptive regularization, so that semantic features can be better maintained in the reproduction process. Meanwhile, by combining the background separation technology, the authenticity and the interframe continuity of the generated face are further improved, and high-fidelity face reproduction is realized.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims

1. A high-fidelity face reproduction method represented by mixed actions is characterized by mainly comprising the following steps:

2. A high fidelity face representation method of mixed motion representation as claimed in claim 1, wherein in step 1, the motion unit and pose information of the driving face and the key point information of the source face are extracted as follows:

step 1.1: let drive the human face picture as

The source face picture is

；

Which represents the linear space in which the picture is located,

respectively representing the height and width of the picture;

；

；

3. The high-fidelity face reproduction method expressed by mixed actions as claimed in claim 1, wherein in step 2, the extracted action units and the key point information of the source face are input into a key point conversion module to obtain the converted key point information of the source face, and the method comprises the following steps:

Is offset amount of

And the final converted source face key point information is

；

4. A high fidelity face representation method as claimed in claim 3, wherein the specific content of the pixel level L1 loss function is: source face picture during training

And driving the face picture

Key point information of

As converted source face key point information

The true value of (c) is given,

use of two discriminators TD against loss_rAnd TD to make the key point converter accurate and robust, wherein TDr is used to judge the converted source face key point information

And source face key point information before conversion

The loss function of the two is defined as follows:

wherein, the first and the second end of the pipe are connected with each other,

representing key point information of driving human face

Is determined by the expected value of the distribution function of (c),

representing transformed source face keypoint information

The expected value of the distribution function of (a),

representing source face keypoint information prior to conversion

And key point information for driving human face

Is determined by the expected value of the distribution function of (c),

representing source face keypoint information prior to conversion

And the converted key point information of the source face

The expected value of the distribution function of (a);

presentation discriminatorTD _rFor key point information of driving human face

The result of the authentication of the authenticity of (b),

The result of the authentication of the authenticity of (b),

And key point information for driving human face

The discrimination result of the identity similarity between them,

And the converted key point information of the source face

The identification result of identity similarity between the two groups;

in the formula (I), the compound is shown in the specification,

representing the weights of the three loss functions, respectively.

5. A high-fidelity face reproduction method of mixed motion representation as claimed in claim 1, characterized in that in step 3, the pre-trained segmentation network is used to separate the source face picture into a face region and a background region, and the method is as follows: processing source face pictures using a pre-trained BiSeNet-based face segmentation network

And a background region

Two pictures.

6. A high fidelity face representation method of mixed motion representation as claimed in claim 1, wherein in step 4, the method of generating the target face in step 1 is as follows:

Mapping the three-channel image into a three-channel image, and driving the action unit and the attitude information of the human faceAU∈R ^20×1Stitching to obtain an action representationM _d ∈R ^23×H×W，R ^23×H×WWhich represents the linear space in which the picture is located,23xHxWthe dimensional information representing the picture is displayed on the display,HandWrespectively representing the height and width of the picture;M _dface region from source face

An input together forming a reproduction network;

step 4.2: face region of source face in prediction

As input to the network, and employing a motion encoder for extracting a representation of the motionM _dThen inserting the extracted features into the output of 3 groups of ResBlock of the reproduction network to obtain the reproduction human face

；

7. The high fidelity face reproduction method of blended motion representations as claimed in claim 6, wherein pixel level L1 loss: using face regions for driving faces during training

As a reproduced face

The loss function is as follows:

。

8. the high-fidelity face reproduction method of hybrid motion representation according to claim 6, characterized in that the face reproduction method is characterized in that the face reproduction method comprises the following steps of: using two discriminatorsGDAndGD _mto improve the realism of the generated result, whereinGDFor judging and reproducing human face

The loss function is defined as follows:

in the formula (I), the compound is shown in the specification,

face representing a source faceRegion(s)

Is determined by the expected value of the distribution function of (c),

representing a recurring face

Is determined by the expected value of the distribution function of (c),

indicating a driving actionM _dAnd a face region for driving a face

Is determined by the expected value of the distribution function of (c),

indicating a driving actionM _dAnd reproducing the face

Is determined by the expected value of the distribution function of (c),

presentation discriminatorGDFace region of source face

The result of the authentication of the authenticity of (b),

indicating signaturePin deviceGDFor reproducing human face

The result of the authentication of the authenticity of (a),

The result of discrimination of the correlation between them,

The discrimination result of the correlation between them.

9. A high fidelity face reproduction method of a hybrid motion representation as claimed in claim 6, characterized in that for perceptual loss: for minimizing reproduction of human face

And its truth value

the final integrity loss function of the recurrent network is:

in the formula (I), the compound is shown in the specification,

the weights of the three loss functions are represented separately.

10. The high-fidelity face reproduction method of mixed motion representation according to claim 1, characterized in that the implementation method in step 5 is as follows:

the human face in the step 4 is reproduced