CN114581612A - High-fidelity face reproduction method represented by mixed actions - Google Patents

High-fidelity face reproduction method represented by mixed actions

Info

Publication number
CN114581612A
CN114581612A
Authority
CN
China
Prior art keywords
face
key point
source
point information
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210459830.5A
Other languages
Chinese (zh)
Other versions
CN114581612B (en)
Inventor
邵长乐
耿嘉仪
练智超
韦志辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210459830.5A priority Critical patent/CN114581612B/en
Publication of CN114581612A publication Critical patent/CN114581612A/en
Application granted granted Critical
Publication of CN114581612B publication Critical patent/CN114581612B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a high-fidelity face reproduction method represented by mixed actions, and belongs to the field of deep face counterfeiting. Extracting action units and posture information of a driving face and key point information of a source face; converting the key point information of the source face by using a key point conversion module according to the action unit and the posture information of the driving face; separating a source face picture into a face region and a background region by using a pre-trained segmentation network; inputting the action units, the converted key point information and the face area into a reproduction network to generate a target face; and inputting the target face and the background area into a background fusion module to generate a final result. The method mixes a plurality of action representations to be used as guide signals for human face reproduction, and inserts action features by utilizing space self-adaptive regularization, so that semantic features can be better maintained in the reproduction process; meanwhile, by combining the background separation technology, the authenticity and the interframe continuity of the generated face are further improved, and high-fidelity face reproduction is realized.

Description

High-fidelity face reproduction method represented by mixed actions
Technical Field
The invention relates to deep face counterfeiting, in particular to a high-fidelity face reproduction method represented by mixed actions.
Background
Face reproduction is the process of animating a source face according to the actions (pose and expression) of a driving face, and it has broad application prospects in fields such as film production and augmented reality. In general, the process comprises three main steps:
1) creating a representation of the identity of the source face,
2) extracting and encoding the motion of the driving face,
3) generating a forged source face by combining the identity and action representations. Each step has a significant impact on the quality of the generated result.
Currently, face reproduction methods can be divided into synthesis methods based on traditional 3D models and generation methods based on generative adversarial networks (GANs). In 3D-face-model-based approaches, identity and motion features are first encoded with 3D model parameters, and the reproduced face is then rendered using the identity parameters of the source face and the motion parameters of the driving face. Although this approach can achieve high-quality output, obtaining a faithful 3D representation of a human face requires considerable effort. GAN-based methods can be classified, according to the face action representation used, into methods based on face key points (landmarks), methods based on self-supervised learning, and methods based on Action Units (AUs). Landmark-based methods face the problem of identity leakage, because face key points carry face shape features in addition to expression and pose information. Self-supervision-based approaches also have difficulty disentangling identity from action. AU-based methods are only weakly constrained by face shape and struggle to produce high-quality reproductions.
Disclosure of Invention
The technical problem solved by the invention is to provide a high-fidelity face reproduction method that mixes multiple action representations.
Technical solution: in order to solve the above technical problem, the invention adopts the following technical solution:
a high-fidelity face reproduction method represented by mixed actions mainly comprises the following steps:
step 1: extracting action units and posture information of a driving face and key point information of a source face;
step 2: inputting the extracted action units of the face and the key point information of the source face into a key point conversion module to obtain converted key point information of the source face;
step 3: separating a source face picture into a face region and a background region by using a pre-trained segmentation network;
step 4: inputting the action units of the driving face in step 1, the source face key point information converted in step 2 and the face region in step 3 into a reproduction network to generate a target face;
step 5: inputting the target face in step 4 and the background region in step 3 into a background fusion module to generate a final result.
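For orientation, the five steps can be sketched as the following Python pseudo-pipeline; every callable passed in (the AU/pose extractor, key point extractor, key point converter, segmenter, reproduction network and fusion network) is a hypothetical placeholder for the corresponding module described above, not an interface defined by the invention:

```python
def reenact(source_img, driving_img,
            extract_au_pose, extract_keypoints,   # step-1 extractors (e.g. OpenFace / HyperLandmark wrappers)
            kp_converter, segmenter, reenact_net, fusion_net):
    """Hypothetical end-to-end sketch of steps 1-5; all callables are placeholders."""
    au = extract_au_pose(driving_img)             # step 1: 20-d AU + pose vector of the driving face
    kp_src = extract_keypoints(source_img)        # step 1: 212-d key point vector of the source face
    kp_conv = kp_converter(au, kp_src)            # step 2: converted source face key points
    face, background = segmenter(source_img)      # step 3: face region and background region
    target_face = reenact_net(face, au, kp_conv)  # step 4: reenacted target face
    return fusion_net(target_face, background)    # step 5: background-fused final result
```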
Preferably, in step 1, the action units and pose information of the driving face and the key point information of the source face are extracted as follows:
Step 1.1: let the driving face picture be I_d ∈ R^{3×H×W} and the source face picture be I_s ∈ R^{3×H×W}, where R^{3×H×W} denotes the linear space in which the pictures lie, 3×H×W denotes the dimensions of a picture, and H and W denote the height and width of the picture, respectively;
Step 1.2: extract the action units and pose information of the driving face and concatenate them into a 20-dimensional vector AU ∈ R^{20×1}, where R^{20×1} denotes the linear space in which the vector lies and 20×1 denotes its dimensions;
Step 1.3: extract the 106 key points of the source face and reshape them into L_s ∈ R^{212×1}, where R^{212×1} denotes the linear space in which the key points lie and 212×1 denotes their dimensions.
Preferably, in step 2, the extracted action units and the key point information of the source face are input into the key point conversion module to obtain the converted key point information of the source face, as follows:
Step 2.1: the key point conversion module comprises two encoders and one decoder. The two encoders respectively extract features from the action units of the driving face and from the key point information L_s of the source face, and the decoder predicts an offset ΔL for the source face key points, so that the final converted source face key point information is L_t = L_s + ΔL.
Step 2.2: the key point conversion module is trained with three loss functions: a pixel-level L1 loss and two adversarial losses.
Preferably, the pixel-level L1 loss is defined as follows: during training, the source face picture I_s and the driving face picture I_d are taken from the same video of the same identity, so the key point information L_d of the driving face picture I_d is used as the ground truth of the converted source face key point information L_t, where L_d ∈ R^{212×1}; R^{212×1} denotes the linear space in which the key points lie and 212×1 denotes their dimensions. The loss function is:
L_{L1}^{T} = || L_d − L_t ||_1
For the adversarial losses, two discriminators TD_r and TD are used to make the key point converter accurate and robust, where TD_r judges the authenticity of the converted source face key point information L_t, and TD evaluates the identity similarity between the converted source face key point information L_t and the source face key point information L_s before conversion. Their loss functions are defined as follows:
L_{TD_r} = E_{L_d}[log TD_r(L_d)] + E_{L_t}[log(1 − TD_r(L_t))]
L_{TD} = E_{(L_s, L_d)}[log TD(L_s, L_d)] + E_{(L_s, L_t)}[log(1 − TD(L_s, L_t))]
where E_{L_d} denotes the expectation over the distribution of the driving face key point information L_d, E_{L_t} the expectation over the distribution of the converted source face key point information L_t, E_{(L_s, L_d)} the expectation over the joint distribution of the source face key point information L_s before conversion and the driving face key point information L_d, and E_{(L_s, L_t)} the expectation over the joint distribution of L_s and the converted key point information L_t; TD_r(L_d) and TD_r(L_t) denote the authenticity judgments of the discriminator TD_r on L_d and L_t, respectively, and TD(L_s, L_d) and TD(L_s, L_t) denote the identity-similarity judgments of the discriminator TD on the pairs (L_s, L_d) and (L_s, L_t), respectively.
The complete loss function of the key point conversion module is a linear combination of the three:
L_T = λ_1 L_{L1}^{T} + λ_2 L_{TD_r} + λ_3 L_{TD}
where λ_1, λ_2 and λ_3 denote the weights of the three loss functions.
Preferably, in step 3, the source face picture is separated into a face region and a background region by using a pre-trained segmentation network, as follows: a pre-trained BiSeNet-based face segmentation network processes the source face picture I_s to obtain a mask of the face region; zero-filling the region outside the mask and the region inside the mask respectively yields two pictures, the face region F_s of the source face and the background region B_s.
Preferably, in step 4, the target face is generated as follows:
Step 4.1: map the converted source face key point information L_t into a three-channel picture and concatenate it with the action units and pose information AU ∈ R^{20×1} of the driving face to obtain an action representation M_d ∈ R^{23×H×W}, where R^{23×H×W} denotes the linear space in which the picture lies, 23×H×W denotes its dimensions, and H and W denote the height and width of the picture, respectively; M_d and the face region F_s of the source face together form the input of the reproduction network;
Step 4.2: the face region F_s of the source face is used as the input of the network, a motion encoder is employed to extract features from the action representation M_d, and the extracted features are inserted into the outputs of the 3 groups of ResBlocks of the reproduction network to obtain the reproduced face F_r;
Step 4.3: during training, the reproduction network is trained with the following 3 loss functions: a pixel-level L1 loss, an adversarial loss and a perceptual loss.
Preferably, for the pixel-level L1 loss: during training, the face region F_d of the driving face is used as the ground truth of the reproduced face F_r, and the loss function is:
L_{L1}^{G} = || F_d − F_r ||_1
Preferably, for the adversarial loss: two discriminators GD and GD_m are used to improve the realism of the generated result, where GD judges the authenticity of the reproduced face F_r and GD_m evaluates the correlation between the driving action M_d and the reproduced face F_r. The loss functions are defined as follows:
L_{GD} = E_{F_s}[log GD(F_s)] + E_{F_r}[log(1 − GD(F_r))]
L_{GD_m} = E_{(M_d, F_d)}[log GD_m(M_d, F_d)] + E_{(M_d, F_r)}[log(1 − GD_m(M_d, F_r))]
where E_{F_s} denotes the expectation over the distribution of the face region F_s of the source face, E_{F_r} the expectation over the distribution of the reproduced face F_r, E_{(M_d, F_d)} the expectation over the joint distribution of the driving action M_d and the face region F_d of the driving face, and E_{(M_d, F_r)} the expectation over the joint distribution of M_d and the reproduced face F_r; GD(F_s) and GD(F_r) denote the authenticity judgments of the discriminator GD on F_s and F_r, respectively, and GD_m(M_d, F_d) and GD_m(M_d, F_r) denote the correlation judgments of the discriminator GD_m on the pairs (M_d, F_d) and (M_d, F_r), respectively.
Preferably, for the perceptual loss: it is used to minimize the semantic difference between the reproduced face F_r and its ground truth F_d, and is defined as follows, where V denotes the feature extraction operation of the VGG-16 model:
L_{per} = || V(F_d) − V(F_r) ||_1
The final complete loss function of the reproduction network is:
L_G = μ_1 L_{L1}^{G} + μ_2 (L_{GD} + L_{GD_m}) + μ_3 L_{per}
where μ_1, μ_2 and μ_3 denote the weights of the three loss functions.
Preferably, step 5 is implemented as follows:
The reproduced face F_r from step 4 and the background region B_s of the source face from step 3 are concatenated as the input of a background fusion network, which generates a picture I_g and a single-channel mask M. The final fusion result is obtained by the following formula:
I_f = M ⊙ F_r + (1 − M) ⊙ I_g
where ⊙ denotes element-wise multiplication. In this way, the fusion result retains the pixel content of the input reproduced face F_r. During training, the final fusion result I_f of this module is supervised with an L2 loss and an adversarial loss:
L_{L2}^{B} = || I_s − I_f ||_2
L_D = E_{I_s}[log D(I_s)] + E_{I_f}[log(1 − D(I_f))]
where E_{I_s} denotes the expectation over the distribution of the source face picture I_s, E_{I_f} the expectation over the distribution of the fusion result I_f, and D(I_s) and D(I_f) denote the authenticity judgments of the discriminator D on the source face picture I_s and the fusion result I_f, respectively. The complete loss function of the background fusion module is a linear combination of the two:
L_B = γ_1 L_{L2}^{B} + γ_2 L_D
where γ_1 and γ_2 denote the weights of the two loss functions.
Advantageous effects: compared with the prior art, the invention has the following advantages:
(1) The invention fuses two feature representations of the driving face, key point information and action units, as the guide signal for face reproduction, and can retain more facial details while preserving the identity of the source face.
(2) The method inserts the action features through spatially-adaptive normalization, which reduces the loss of semantic information during reproduction and further improves the realism of the generated result.
(3) The background separation technique allows the reproduction network to concentrate on generating sharper faces while better preserving the background information, realizing high-fidelity face reproduction.
Drawings
Fig. 1 is a flow chart of the high-fidelity face reproduction method represented by mixed actions according to the invention.
Fig. 2 is a model structure diagram of a reproduction network in the method.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific examples, which are carried out on the premise of the technical solution of the present invention, and it should be understood that these examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The high-fidelity face reproduction method represented by mixed actions first extracts the Action Units (AUs) and pose information of the driving face and the key point (landmark) information of the source face; the key point information of the source face is converted by a key point conversion module according to the action units and pose information of the driving face; the source face picture is then separated into a face region and a background region by a pre-trained segmentation network; the action units, the converted key point information and the face region are input into a reproduction network to generate a target face; finally, the target face and the background region are fused by a background fusion network to generate the final result. The specific implementation, shown in Fig. 1, mainly comprises the following five steps 1-5:
Step 1: extract the action units and pose information of the driving face and the key point information of the source face. The specific method is as follows:
Step 1.1: let the driving face picture be I_d ∈ R^{3×H×W} and the source face picture be I_s ∈ R^{3×H×W}, where R^{3×H×W} denotes the linear space in which the pictures lie, 3×H×W denotes the dimensions of a picture, and H and W denote the height and width of the picture, respectively;
Step 1.2: extract the action units and pose information of the driving face with the facial behavior analysis tool OpenFace. The face action units comprise the intensities of 17 action units, and the pose information comprises the rotation angles about the three axes of pitch, yaw and roll; concatenating them gives a 20-dimensional vector AU ∈ R^{20×1}, where R^{20×1} denotes the linear space in which the vector lies and 20×1 denotes its dimensions;
Step 1.3: extract the 106 key points of the source face with the face key point detection method HyperLandmark and reshape them into L_s ∈ R^{212×1}, where R^{212×1} denotes the linear space in which the key points lie and 212×1 denotes their dimensions.
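For illustration, the assembly of the 20-dimensional AU/pose vector and the 212-dimensional key point vector can be sketched as follows; the random arrays stand in for the outputs of OpenFace and HyperLandmark, which are obtained outside this sketch:

```python
import numpy as np

# Assemble the driving-motion vector and reshape the source key points as described
# in steps 1.2-1.3. The raw values (au_intensities, pose_angles, keypoints_106) are
# placeholders for what OpenFace and HyperLandmark would provide.
au_intensities = np.random.rand(17).astype(np.float32)     # 17 AU intensities (placeholder values)
pose_angles = np.random.rand(3).astype(np.float32)         # pitch, yaw, roll angles (placeholder values)
keypoints_106 = np.random.rand(106, 2).astype(np.float32)  # 106 (x, y) key points (placeholder values)

au = np.concatenate([au_intensities, pose_angles]).reshape(20, 1)  # AU ∈ R^{20×1}
l_s = keypoints_106.reshape(212, 1)                                # L_s ∈ R^{212×1}
print(au.shape, l_s.shape)  # (20, 1) (212, 1)
```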
Step 2: input the extracted action units of the driving face and the key point information of the source face into the key point conversion module to obtain the converted key point information of the source face. The method is as follows:
Step 2.1: the key point conversion module comprises two encoders and one decoder. The two encoders respectively extract features from the action units of the driving face and from the key point information L_s of the source face, and the decoder predicts an offset ΔL for the source face key points, so that the final converted source face key point information is L_t = L_s + ΔL.
Step 2.2: the key point conversion module is trained with three loss functions: a pixel-level L1 loss and two adversarial losses.
The pixel-level L1 loss is defined as follows: during training, the source face picture I_s and the driving face picture I_d are taken from the same video of the same identity, so the key point information L_d of the driving face picture I_d is used as the ground truth of the converted source face key point information L_t, where L_d ∈ R^{212×1}; R^{212×1} denotes the linear space in which the key points lie and 212×1 denotes their dimensions. The loss function is:
L_{L1}^{T} = || L_d − L_t ||_1
For the adversarial losses, two discriminators TD_r and TD are used to make the key point converter accurate and robust, where TD_r judges the authenticity of the converted source face key point information L_t, and TD evaluates the identity similarity between the converted source face key point information L_t and the source face key point information L_s before conversion. Their loss functions are defined as follows:
L_{TD_r} = E_{L_d}[log TD_r(L_d)] + E_{L_t}[log(1 − TD_r(L_t))]
L_{TD} = E_{(L_s, L_d)}[log TD(L_s, L_d)] + E_{(L_s, L_t)}[log(1 − TD(L_s, L_t))]
where E_{L_d} denotes the expectation over the distribution of the driving face key point information L_d, E_{L_t} the expectation over the distribution of the converted source face key point information L_t, E_{(L_s, L_d)} the expectation over the joint distribution of the source face key point information L_s before conversion and the driving face key point information L_d, and E_{(L_s, L_t)} the expectation over the joint distribution of L_s and the converted key point information L_t; TD_r(L_d) and TD_r(L_t) denote the authenticity judgments of the discriminator TD_r on L_d and L_t, respectively, and TD(L_s, L_d) and TD(L_s, L_t) denote the identity-similarity judgments of the discriminator TD on the pairs (L_s, L_d) and (L_s, L_t), respectively.
The complete loss function of the key point conversion module is a linear combination of the three:
L_T = λ_1 L_{L1}^{T} + λ_2 L_{TD_r} + λ_3 L_{TD}
where λ_1, λ_2 and λ_3 denote the weights of the three loss functions.
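A minimal PyTorch sketch of the key point conversion module of step 2.1 (two encoders and one decoder predicting the offset ΔL) might look as follows; the hidden layer sizes are illustrative assumptions, as the patent does not specify them:

```python
import torch
import torch.nn as nn

class KeypointConverter(nn.Module):
    """Sketch of the key point conversion module: one encoder for the AU/pose vector,
    one for the source key points, and a decoder that predicts an offset added back
    to the source key points (L_t = L_s + ΔL)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.au_encoder = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.kp_encoder = nn.Sequential(nn.Linear(212, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.decoder = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 212))

    def forward(self, au, l_s):
        # au: (B, 20), l_s: (B, 212) -> converted key points L_t
        feat = torch.cat([self.au_encoder(au), self.kp_encoder(l_s)], dim=1)
        delta = self.decoder(feat)   # predicted offset ΔL
        return l_s + delta

converter = KeypointConverter()
l_t = converter(torch.randn(4, 20), torch.randn(4, 212))
print(l_t.shape)  # torch.Size([4, 212])
```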
Step 3: separate the source face picture into a face region and a background region using a pre-trained segmentation network. The method is as follows:
A pre-trained BiSeNet-based face segmentation network processes the source face picture I_s to obtain a mask of the face region; zero-filling the region outside the mask and the region inside the mask respectively yields two pictures, the face region F_s of the source face and the background region B_s.
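The zero-filling of the mask and non-mask regions can be sketched as follows; the mask here is a random placeholder standing in for the output of the BiSeNet-based segmentation network:

```python
import numpy as np

def split_face_background(img, mask):
    """Split a source picture into face and background regions with a binary mask,
    as in step 3. `img` is H×W×3, `mask` is H×W with 1 inside the face region."""
    mask3 = mask[..., None].astype(img.dtype)
    face_region = img * mask3              # pixels outside the mask are zero-filled
    background_region = img * (1 - mask3)  # pixels inside the mask are zero-filled
    return face_region, background_region

img = np.random.rand(256, 256, 3).astype(np.float32)
mask = (np.random.rand(256, 256) > 0.5).astype(np.float32)  # placeholder mask
face, background = split_face_background(img, mask)
```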
Step 4: input the action units of the driving face from step 1, the converted source face key point information from step 2 and the face region from step 3 into the reproduction network shown in Fig. 2 to generate the target face. The method is as follows:
Step 4.1: map the converted source face key point information L_t into a three-channel picture and concatenate it with the action units and pose information AU ∈ R^{20×1} of the driving face to obtain an action representation M_d ∈ R^{23×H×W}, where R^{23×H×W} denotes the linear space in which the picture lies, 23×H×W denotes its dimensions, and H and W denote the height and width of the picture, respectively; M_d and the face region F_s of the source face together form the input of the reproduction network;
Step 4.2: as shown in Fig. 2, the reproduction network adopts a Pix2Pix-based framework containing 3 groups of ResBlock residual blocks. The face region F_s of the source face is used as the input of the network, a motion encoder extracts features from the action representation M_d, and the extracted features are inserted into the outputs of the 3 groups of ResBlocks via spatially-adaptive normalization (SPADE) modules; the SPADE modules mainly serve to reduce semantic loss during generation, and the reproduced face F_r is finally obtained.
Step 4.3: during training, the reproduction network is trained with the following 3 loss functions: a pixel-level L1 loss, an adversarial loss and a perceptual loss.
Pixel-level L1 loss: similarly to the key point conversion module, the face region F_d of the driving face is used during training as the ground truth of the reproduced face F_r, and the loss function is:
L_{L1}^{G} = || F_d − F_r ||_1
Adversarial loss: two discriminators GD and GD_m are used to improve the realism of the generated result, where GD judges the authenticity of the reproduced face F_r and GD_m evaluates the correlation between the driving action M_d and the reproduced face F_r. The loss functions are defined as follows:
L_{GD} = E_{F_s}[log GD(F_s)] + E_{F_r}[log(1 − GD(F_r))]
L_{GD_m} = E_{(M_d, F_d)}[log GD_m(M_d, F_d)] + E_{(M_d, F_r)}[log(1 − GD_m(M_d, F_r))]
where E_{F_s} denotes the expectation over the distribution of the face region F_s of the source face, E_{F_r} the expectation over the distribution of the reproduced face F_r, E_{(M_d, F_d)} the expectation over the joint distribution of the driving action M_d and the face region F_d of the driving face, and E_{(M_d, F_r)} the expectation over the joint distribution of M_d and the reproduced face F_r; GD(F_s) and GD(F_r) denote the authenticity judgments of the discriminator GD on F_s and F_r, respectively, and GD_m(M_d, F_d) and GD_m(M_d, F_r) denote the correlation judgments of the discriminator GD_m on the pairs (M_d, F_d) and (M_d, F_r), respectively.
Perceptual loss: used to minimize the semantic difference between the reproduced face F_r and its ground truth F_d; it is defined as follows, where V denotes the feature extraction operation of the VGG-16 model:
L_{per} = || V(F_d) − V(F_r) ||_1
The final complete loss function of the reproduction network is:
L_G = μ_1 L_{L1}^{G} + μ_2 (L_{GD} + L_{GD_m}) + μ_3 L_{per}
where μ_1, μ_2 and μ_3 denote the weights of the three loss functions.
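The three training losses of the reproduction network can be sketched as follows; the discriminators gd and gd_m are assumed callables returning probabilities in (0, 1), the adversarial terms use the common non-saturating form, and the loss weights are illustrative:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# VGG-16 feature extractor for the perceptual loss; in practice ImageNet-pretrained
# weights are loaded, which is omitted here to keep the sketch self-contained.
vgg_features = models.vgg16().features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def generator_loss(f_r, f_d, m_d, gd, gd_m, w_l1=10.0, w_adv=1.0, w_per=10.0):
    """f_r: reproduced face, f_d: driving face region (ground truth), m_d: action representation."""
    l1 = F.l1_loss(f_r, f_d)                               # pixel-level L1 loss
    adv = -torch.log(gd(f_r) + 1e-8).mean() \
          - torch.log(gd_m(m_d, f_r) + 1e-8).mean()        # non-saturating adversarial terms
    per = F.l1_loss(vgg_features(f_r), vgg_features(f_d))  # perceptual loss on VGG-16 features
    return w_l1 * l1 + w_adv * adv + w_per * per
```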
Step 5: input the target face from step 4 and the background region from step 3 into the background fusion module to generate the final result. The method is as follows:
The reproduced face F_r from step 4 and the background region B_s of the source face from step 3 are concatenated as the input of a background fusion network, which generates a picture I_g and a single-channel mask M. The final fusion result is obtained by the following formula:
I_f = M ⊙ F_r + (1 − M) ⊙ I_g
where ⊙ denotes element-wise multiplication. In this way, the fusion result retains the pixel content of the input reproduced face F_r. During training, the final fusion result I_f of this module is supervised with an L2 loss and an adversarial loss:
L_{L2}^{B} = || I_s − I_f ||_2
L_D = E_{I_s}[log D(I_s)] + E_{I_f}[log(1 − D(I_f))]
where E_{I_s} denotes the expectation over the distribution of the source face picture I_s, E_{I_f} the expectation over the distribution of the fusion result I_f, and D(I_s) and D(I_f) denote the authenticity judgments of the discriminator D on the source face picture I_s and the fusion result I_f, respectively. The complete loss function of the background fusion module is a linear combination of the two:
L_B = γ_1 L_{L2}^{B} + γ_2 L_D
where γ_1 and γ_2 denote the weights of the two loss functions.
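The blending step of the background fusion module can be sketched as follows; note that the exact blending formula is reconstructed from the description above and is therefore an assumption:

```python
import torch

def fuse_background(reenacted_face, generated_img, mask):
    """Blend the reenacted face with the fusion network's output picture using the
    predicted single-channel mask (assumed form of the fusion formula)."""
    # mask: (B, 1, H, W) in [0, 1]; broadcast over the 3 color channels
    return mask * reenacted_face + (1 - mask) * generated_img

result = fuse_background(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256),
                         torch.rand(1, 1, 256, 256))
print(result.shape)  # torch.Size([1, 3, 256, 256])
```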
The effectiveness and efficiency of the method of the invention are verified by the following experiments:
the evaluation criteria were Structural Similarity (SSIM) and freschel perceptual distance (FID). SSIM evaluates low-level similarity between the generated image and the truth value, with larger values being better. The FID uses a pre-trained inclusion V3 network to evaluate the perceived distance between the generated image and the real image, with smaller values being better.
The experiments used the VoxCeleb1 data set, which contains a total of 24997 real video segments of 1251 different identities. The data set provides face pictures extracted and cropped at 1 frame per second. Video segments with an average resolution greater than 300x300 were used, yielding 29891 training pictures and 4284 test pictures. The pictures were scaled to 256x256; the 106-point key point information was then extracted with HyperLandmark, and the AUs and pose information were extracted with OpenFace.
The results were compared with those generated by FReeNet, a landmark-based method, and ICface, an AU-based method. Table 1 shows the results of the three methods on the two evaluation indices:
Table 1. Test results of the method of the invention on the VoxCeleb1 data set (the table values appear as an image in the original publication).
The results in Table 1 show that the method of the present invention achieves better results than the methods based only on key point information (landmarks) or only on Action Units (AUs). Specifically, for the SSIM index, the background separation technique of the invention better preserves the background, improving the low-level similarity between the generated result and the original image; for the FID index, fusing the two feature representations better retains the details of the source face, reducing the perceptual distance between the generated result and the original image. The results show that combining the two feature representations and separating the background are both effective. Overall, the method of the invention fully retains the semantic features of the face and generates a more realistic face and background. Based on these results, the face reproduction method using hybrid action representations constitutes a high-fidelity face forgery tool.
The method mixes a plurality of motion representations to be used as guide signals for human face reproduction, and inserts motion features by utilizing space self-adaptive regularization, so that semantic features can be better maintained in the reproduction process. Meanwhile, by combining the background separation technology, the authenticity and the interframe continuity of the generated face are further improved, and high-fidelity face reproduction is realized.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (10)

1. A high-fidelity face reproduction method represented by mixed actions is characterized by mainly comprising the following steps:
step 1: extracting action units and posture information of a driving face and key point information of a source face;
step 2: inputting the extracted action units of the face and the key point information of the source face into a key point conversion module to obtain converted key point information of the source face;
step 3: separating a source face picture into a face region and a background region by using a pre-trained segmentation network;
step 4: inputting the action units of the driving face in step 1, the source face key point information converted in step 2 and the face region in step 3 into a reproduction network to generate a target face;
step 5: inputting the target face in step 4 and the background region in step 3 into a background fusion module to generate a final result.
2. The high-fidelity face reproduction method represented by mixed actions according to claim 1, wherein in step 1 the action units and pose information of the driving face and the key point information of the source face are extracted as follows:
step 1.1: let the driving face picture be I_d ∈ R^{3×H×W} and the source face picture be I_s ∈ R^{3×H×W}, wherein R^{3×H×W} denotes the linear space in which the pictures lie, 3×H×W denotes the dimensions of a picture, and H and W denote the height and width of the picture, respectively;
step 1.2: extracting the action units and pose information of the driving face and concatenating them into a 20-dimensional vector AU ∈ R^{20×1}, wherein R^{20×1} denotes the linear space in which the vector lies and 20×1 denotes its dimensions;
step 1.3: extracting the 106 key points of the source face and reshaping them into L_s ∈ R^{212×1}, wherein R^{212×1} denotes the linear space in which the key points lie and 212×1 denotes their dimensions.
3. The high-fidelity face reproduction method represented by mixed actions according to claim 1, wherein in step 2 the extracted action units and the key point information of the source face are input into the key point conversion module to obtain the converted key point information of the source face, as follows:
step 2.1: the key point conversion module comprises two encoders and one decoder, wherein the two encoders respectively extract features from the action units of the driving face and from the key point information L_s of the source face, and the decoder predicts an offset ΔL for the source face key points, the final converted source face key point information being L_t = L_s + ΔL;
step 2.2: the key point conversion module is trained with three loss functions: a pixel-level L1 loss and two adversarial losses.
4. The high-fidelity face reproduction method represented by mixed actions according to claim 3, wherein the pixel-level L1 loss is defined as follows: during training, the source face picture I_s and the driving face picture I_d are taken from the same video of the same identity, so the key point information L_d of the driving face picture I_d serves as the ground truth of the converted source face key point information L_t, L_d ∈ R^{212×1}, wherein R^{212×1} denotes the linear space in which the key points lie and 212×1 denotes their dimensions; the loss function is:
L_{L1}^{T} = || L_d − L_t ||_1
for the adversarial losses, two discriminators TD_r and TD are used to make the key point converter accurate and robust, wherein TD_r judges the authenticity of the converted source face key point information L_t and TD evaluates the identity similarity between the converted source face key point information L_t and the source face key point information L_s before conversion; their loss functions are defined as follows:
L_{TD_r} = E_{L_d}[log TD_r(L_d)] + E_{L_t}[log(1 − TD_r(L_t))]
L_{TD} = E_{(L_s, L_d)}[log TD(L_s, L_d)] + E_{(L_s, L_t)}[log(1 − TD(L_s, L_t))]
wherein E_{L_d} denotes the expectation over the distribution of the driving face key point information L_d, E_{L_t} denotes the expectation over the distribution of the converted source face key point information L_t, E_{(L_s, L_d)} denotes the expectation over the joint distribution of the source face key point information L_s before conversion and the driving face key point information L_d, and E_{(L_s, L_t)} denotes the expectation over the joint distribution of L_s and L_t; TD_r(L_d) and TD_r(L_t) denote the authenticity judgments of the discriminator TD_r on L_d and L_t, respectively, and TD(L_s, L_d) and TD(L_s, L_t) denote the identity-similarity judgments of the discriminator TD on the pairs (L_s, L_d) and (L_s, L_t), respectively;
the complete loss function of the key point conversion module is a linear combination of the three:
L_T = λ_1 L_{L1}^{T} + λ_2 L_{TD_r} + λ_3 L_{TD}
wherein λ_1, λ_2 and λ_3 denote the weights of the three loss functions.
5. The high-fidelity face reproduction method represented by mixed actions according to claim 1, wherein in step 3 the source face picture is separated into a face region and a background region by using a pre-trained segmentation network, as follows: a pre-trained BiSeNet-based face segmentation network processes the source face picture I_s to obtain a mask of the face region, and zero-filling the region outside the mask and the region inside the mask respectively yields two pictures, the face region F_s of the source face and the background region B_s.
6. The high-fidelity face reproduction method represented by mixed actions according to claim 1, wherein in step 4 the target face is generated as follows:
step 4.1: mapping the converted source face key point information L_t into a three-channel picture and concatenating it with the action units and pose information AU ∈ R^{20×1} of the driving face to obtain an action representation M_d ∈ R^{23×H×W}, wherein R^{23×H×W} denotes the linear space in which the picture lies, 23×H×W denotes its dimensions, and H and W denote the height and width of the picture, respectively; M_d and the face region F_s of the source face together form the input of the reproduction network;
step 4.2: taking the face region F_s of the source face as the input of the network, employing a motion encoder to extract features from the action representation M_d, and inserting the extracted features into the outputs of the 3 groups of ResBlocks of the reproduction network to obtain the reproduced face F_r;
step 4.3: during training, the reproduction network is trained with the following 3 loss functions: a pixel-level L1 loss, an adversarial loss and a perceptual loss.
7. The high-fidelity face reproduction method represented by mixed actions according to claim 6, wherein for the pixel-level L1 loss, the face region F_d of the driving face is used during training as the ground truth of the reproduced face F_r, and the loss function is:
L_{L1}^{G} = || F_d − F_r ||_1
8. the high-fidelity face reproduction method of hybrid motion representation according to claim 6, characterized in that the face reproduction method is characterized in that the face reproduction method comprises the following steps of: using two discriminatorsGDAndGD m to improve the realism of the generated result, whereinGDFor judging and reproducing human face
Figure 179086DEST_PATH_IMAGE040
The true or false of (a) is true,GD m for evaluating driving actionsM d And reproducing the face
Figure 265991DEST_PATH_IMAGE038
The loss function is defined as follows:
Figure 677381DEST_PATH_IMAGE041
in the formula (I), the compound is shown in the specification,
Figure 849736DEST_PATH_IMAGE042
face representing a source faceRegion(s)
Figure 4774DEST_PATH_IMAGE035
Is determined by the expected value of the distribution function of (c),
Figure 211764DEST_PATH_IMAGE043
representing a recurring face
Figure 528476DEST_PATH_IMAGE038
Is determined by the expected value of the distribution function of (c),
Figure 689592DEST_PATH_IMAGE044
indicating a driving actionM d And a face region for driving a face
Figure 648321DEST_PATH_IMAGE037
Is determined by the expected value of the distribution function of (c),
Figure 444239DEST_PATH_IMAGE045
indicating a driving actionM d And reproducing the face
Figure 463010DEST_PATH_IMAGE038
Is determined by the expected value of the distribution function of (c),
Figure 609958DEST_PATH_IMAGE046
presentation discriminatorGDFace region of source face
Figure 106798DEST_PATH_IMAGE035
The result of the authentication of the authenticity of (b),
Figure 757223DEST_PATH_IMAGE047
indicating signaturePin deviceGDFor reproducing human face
Figure 946895DEST_PATH_IMAGE038
The result of the authentication of the authenticity of (a),
Figure 581139DEST_PATH_IMAGE048
presentation discriminatorGD m To the driving actionM d And a face region for driving a face
Figure 616091DEST_PATH_IMAGE037
The result of discrimination of the correlation between them,
Figure 386601DEST_PATH_IMAGE049
presentation discriminatorGD m To the driving actionM d And reproducing the face
Figure 747175DEST_PATH_IMAGE038
The discrimination result of the correlation between them.
9. A high fidelity face reproduction method of a hybrid motion representation as claimed in claim 6, characterized in that for perceptual loss: for minimizing reproduction of human face
Figure 101671DEST_PATH_IMAGE038
And its truth value
Figure 940314DEST_PATH_IMAGE037
The loss function is defined as follows, whereinVFeature extraction operations on behalf of the VGG-16 model:
Figure 830910DEST_PATH_IMAGE050
the final integrity loss function of the recurrent network is:
Figure 362385DEST_PATH_IMAGE051
in the formula (I), the compound is shown in the specification,
Figure 440063DEST_PATH_IMAGE052
the weights of the three loss functions are represented separately.
10. The high-fidelity face reproduction method represented by mixed actions according to claim 1, wherein step 5 is implemented as follows:
the reproduced face F_r from step 4 and the background region B_s of the source face from step 3 are concatenated as the input of a background fusion network, which generates a picture I_g and a single-channel mask M; the final fusion result is obtained by the following formula:
I_f = M ⊙ F_r + (1 − M) ⊙ I_g
wherein ⊙ denotes element-wise multiplication; in this way, the fusion result retains the pixel content of the input reproduced face F_r; during training, the final fusion result I_f of this module is supervised with an L2 loss and an adversarial loss:
L_{L2}^{B} = || I_s − I_f ||_2
L_D = E_{I_s}[log D(I_s)] + E_{I_f}[log(1 − D(I_f))]
wherein E_{I_s} denotes the expectation over the distribution of the source face picture I_s, E_{I_f} denotes the expectation over the distribution of the fusion result I_f, and D(I_s) and D(I_f) denote the authenticity judgments of the discriminator D on the source face picture I_s and the fusion result I_f, respectively; the complete loss function of the background fusion module is a linear combination of the two:
L_B = γ_1 L_{L2}^{B} + γ_2 L_D
wherein γ_1 and γ_2 denote the weights of the two loss functions.
CN202210459830.5A 2022-04-28 2022-04-28 High-fidelity face reproduction method represented by mixed actions Active CN114581612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210459830.5A CN114581612B (en) 2022-04-28 2022-04-28 High-fidelity face reproduction method represented by mixed actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210459830.5A CN114581612B (en) 2022-04-28 2022-04-28 High-fidelity face reproduction method represented by mixed actions

Publications (2)

Publication Number Publication Date
CN114581612A true CN114581612A (en) 2022-06-03
CN114581612B CN114581612B (en) 2022-08-02

Family

ID=81785017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210459830.5A Active CN114581612B (en) 2022-04-28 2022-04-28 High-fidelity face reproduction method represented by mixed actions

Country Status (1)

Country Link
CN (1) CN114581612B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233012A (en) * 2020-08-10 2021-01-15 上海交通大学 Face generation system and method
CN112734634A (en) * 2021-03-30 2021-04-30 中国科学院自动化研究所 Face changing method and device, electronic equipment and storage medium
CN113343878A (en) * 2021-06-18 2021-09-03 北京邮电大学 High-fidelity face privacy protection method and system based on generation countermeasure network
CN113762147A (en) * 2021-09-06 2021-12-07 网易(杭州)网络有限公司 Facial expression migration method and device, electronic equipment and storage medium
CN113807265A (en) * 2021-09-18 2021-12-17 山东财经大学 Diversified human face image synthesis method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233012A (en) * 2020-08-10 2021-01-15 上海交通大学 Face generation system and method
CN112734634A (en) * 2021-03-30 2021-04-30 中国科学院自动化研究所 Face changing method and device, electronic equipment and storage medium
CN113343878A (en) * 2021-06-18 2021-09-03 北京邮电大学 High-fidelity face privacy protection method and system based on generation countermeasure network
CN113762147A (en) * 2021-09-06 2021-12-07 网易(杭州)网络有限公司 Facial expression migration method and device, electronic equipment and storage medium
CN113807265A (en) * 2021-09-18 2021-12-17 山东财经大学 Diversified human face image synthesis method and system

Also Published As

Publication number Publication date
CN114581612B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
Cao et al. Semi-automatic 2D-to-3D conversion using disparity propagation
Ye et al. Audio-driven talking face video generation with dynamic convolution kernels
CN113112416B (en) Semantic-guided face image restoration method
CN115908659A (en) Method and device for synthesizing speaking face based on generation countermeasure network
CN115527276A (en) Deep pseudo video detection method based on fusion of facial optical flow field and texture characteristics
CN115908789A (en) Cross-modal feature fusion and asymptotic decoding saliency target detection method and device
CN117671764A (en) Transformer-based dynamic speaker face image generation system and method
CN115424310A (en) Weak label learning method for expression separation task in human face rehearsal
CN114581612B (en) High-fidelity face reproduction method represented by mixed actions
CN114119694A (en) Improved U-Net based self-supervision monocular depth estimation algorithm
CN113989709A (en) Target detection method and device, storage medium and electronic equipment
CN117152283A (en) Voice-driven face image generation method and system by using diffusion model
CN115908661A (en) Method for generating singing video from drama character picture based on GAN network
CN116721320A (en) Universal image tampering evidence obtaining method and system based on multi-scale feature fusion
CN113673567B (en) Panorama emotion recognition method and system based on multi-angle sub-region self-adaption
CN115345781A (en) Multi-view video stitching method based on deep learning
CN106023120B (en) Human face portrait synthetic method based on coupling neighbour's index
Gao et al. RGBD semantic segmentation based on global convolutional network
CN116091326A (en) Face image shielding repair algorithm based on deep learning and application research thereof
Xiao et al. Multi-modal weights sharing and hierarchical feature fusion for RGBD salient object detection
CN107770511A (en) A kind of decoding method of multi-view point video, device and relevant device
CN114693565B (en) GAN image restoration method based on jump connection multi-scale fusion
Xu et al. Multi-modal learning with text merging for textvqa
CN115861457A (en) Face replay method based on face action representation fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant