CN111783658B - Two-stage expression animation generation method based on dual generative adversarial networks - Google Patents

Two-stage expression animation generation method based on dual generative adversarial networks

Info

Publication number
CN111783658B
Authority
CN
China
Prior art keywords
expression
stage
image
loss
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010621885.2A
Other languages
Chinese (zh)
Other versions
CN111783658A (en)
Inventor
郭迎春
王静洁
刘依
朱叶
郝小可
于洋
师硕
阎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202010621885.2A priority Critical patent/CN111783658B/en
Publication of CN111783658A publication Critical patent/CN111783658A/en
Application granted granted Critical
Publication of CN111783658B publication Critical patent/CN111783658B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application relates to a two-stage expression animation generation method based on dual generative adversarial networks. In the first stage, an expression migration network FaceGAN extracts the expression features in a target expression profile map, transfers them to a source face, and generates a first-stage prediction image. In the second stage, a detail generation network FineGAN supplements and enriches the details of the eye and mouth regions, which contribute most to expression change, in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes a face video animation; both the expression migration network FaceGAN and the detail generation network FineGAN are implemented as generative adversarial networks. The application proposes two-stage generative adversarial networks for expression animation generation: the first stage performs the expression conversion and the second stage optimizes the image details; designated regions of the image are extracted through mask vectors for emphasized optimization and, combined with the use of local discriminators, the important parts are generated with better quality.

Description

Two-stage expression animation generation method based on dual generative adversarial networks
Technical Field
The technical scheme of the application relates to image data processing in computer vision, and in particular to a two-stage expression animation generation method based on dual generative adversarial networks.
Background
Facial expression synthesis refers to the process of transferring an expression from a target expression reference face to a source face: the identity information of the newly synthesized source face image is kept unchanged, while its expression is made consistent with the target expression reference face. The technology is gradually being applied in fields such as film and television production, virtual reality, and criminal investigation. Facial expression synthesis has important research value in both academia and industry, and how to robustly synthesize natural and lifelike facial expressions has become a challenging hot research topic.
The existing facial expression synthesis methods can be divided into two main categories: traditional graphics methods and image generation methods based on deep learning. The first category, traditional graphics methods, generally parameterizes the source face image with a parametric model and designs the model to convert expressions and generate a new image, or warps the face image using feature correspondences and optical flow maps, or fuses existing expression data into face patches; however, the model design process is intricate and complex, incurs a high computational cost, and generalizes poorly.
The second category is expression synthesis methods based on deep learning. A deep neural network first extracts facial features, mapping the image from a high-dimensional space to feature vectors; the source expression features are then modified by adding expression labels, and a deep neural network synthesizes the target face image by mapping back to the high-dimensional space. The advent of GAN networks subsequently opened the way to sharp image synthesis and attracted considerable attention as soon as it was proposed. In the field of image synthesis, a large number of research methods, such as GAN variants, have emerged to generate images. For example, the conditional generative adversarial network (Conditional Generative Adversarial Network, CGAN) can generate images under specific supervision information; in the field of facial expression generation, expression labels can be used as conditional supervision information so that face images with different expressions can be generated. At present, related methods based on GAN networks still have some shortcomings: when generating expression animation, problems such as unreasonable artifacts, blurred generated images, and low resolution can occur.
Facial expression generation is image-to-image conversion, whereas the aim of the application is to generate facial animation, which belongs to image-to-video conversion and, compared with the facial expression generation task, adds the challenge of the temporal dimension. Xing et al. use a gender-preserving network in GP-GAN: Gender Preserving GAN for Synthesizing Faces from Landmarks to enable the network to learn more gender information, but the method still has a defect in preserving the identity information of the face, and the generated face and the target face may have different identity characteristics. CN108288072A discloses a facial expression synthesis method based on a generative adversarial network which does not consider fine-grained generation of face images and omits detailed feature extraction from the source face image, so the generated results are blurred and of low resolution. CN110084121A discloses a facial expression migration method based on a spectrally normalized cycle-consistent generative adversarial network; the method uses expression one-hot vectors to supervise the training process of the network, and the discreteness of one-hot vectors limits the learning ability of the network, so the network can only learn the expression of a target emotion, such as happiness, sadness or surprise, and cannot learn the degree of the emotion, leaving a defect in continuous expression generation. CN105069830A discloses a method and a device for generating expression animation which can only generate expression animations of six designated templates; since human expressions are extremely rich and complex, the method has poor extensibility and cannot generate an arbitrary designated expression animation according to the user's requirements. CN107944358A discloses a face generation method based on a deep convolutional adversarial network model which cannot guarantee the invariance of face identity information during expression generation, so the generated face may be inconsistent with the target face.
Disclosure of Invention
The technical problem to be solved by the application is as follows: a two-stage expression animation generation method based on dual generative adversarial networks is provided. In the first stage, an expression migration network extracts the features of the target expression and transfers them to the source face to generate a first-stage prediction image; the first-stage expression migration network is named FaceGAN (Face Generative Adversarial Network). In the second stage, a detail generation network enriches some facial details in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes the video animation; the second-stage detail generation network is named FineGAN (Fine Generative Adversarial Network). The method of the application solves problems of the prior art such as blurred or low-resolution generated images and unreasonable artifacts in the generated results.
The technical scheme adopted by the application to solve the technical problem is as follows: in the first stage, driven by the target expression profile map, the expression migration network FaceGAN captures the expression features in the target expression profile map and transfers them to the source face to generate a first-stage prediction image; in the second stage, the detail generation network FineGAN, as a supplement, enriches the details of the eye and mouth regions, which contribute most to expression change, in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes the face animation. The specific steps are as follows:
the first step, a facial expression profile of each frame of image in a data set is obtained:
collecting a facial expression video sequence data set, extracting the face in each frame image of a video sequence using the Dlib machine learning library, obtaining a number of feature points in each face, and connecting the feature points in order with line segments to obtain the expression profile of each frame of the video sequence, denoted e = (e_1, e_2, ···, e_i, ···, e_n), where e represents the set of all expression profiles in a video sequence, i.e., the expression profile sequence, n represents the number of video frames, and e_i represents the expression profile of the i-th frame of a given video sequence;
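For illustration, a minimal sketch of how this first step could be implemented with the Dlib machine learning library and OpenCV is given below; the predictor model file name, the 68-point grouping, and the drawing parameters are illustrative assumptions rather than details taken from the patent.

```python
# Sketch of the first step: detect the face, extract 68 landmarks with Dlib, and
# connect them with line segments to form the expression profile e_i of one frame.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

# dlib 68-point layout: jaw 0-16, brows 17-26, nose 27-35, eyes 36-47, mouth 48-67
GROUPS = [range(0, 17), range(17, 22), range(22, 27), range(27, 36),
          range(36, 42), range(42, 48), range(48, 60), range(60, 68)]

def expression_contour(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    contour = np.zeros(gray.shape, dtype=np.uint8)
    if len(faces) == 0:
        return contour
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=np.int32)
    for g in GROUPS:  # connect the feature points of each group in order with line segments
        cv2.polylines(contour, [pts[list(g)]], isClosed=False, color=255, thickness=1)
    return contour    # single-channel expression profile e_i for this frame
```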
the first stage, setting up an expression migration network FaceGAN, including the second to fourth steps:
secondly, extracting identity features of a source face and expression features of a target expression profile map, and initially generating a first-stage prediction map:
the expression migration network FaceGAN includes a generator G_1 and a discriminator D_1, where the generator G_1 comprises three sub-networks: two encoders, Enc_id and Enc_exp, and a decoder Dec_1;
firstly, a neutral expressionless image I_N of the source face and a target expression profile sequence e are input; the identity encoder Enc_id is then used to extract the identity feature vector f_id of the neutral expressionless image I_N of the source face, while the expression encoder Enc_exp is used to extract the set of expression feature vectors f_exp of the target expression profile sequence e, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n); the formulas are:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
the identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain a feature vector f, with f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec_1 for decoding to generate the first-stage prediction image I_pre-target, with I_pre-target = Dec_1(f); finally, I_pre-target is input to the discriminator D_1 to judge whether the image is real or fake;
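A simplified sketch of the generator G_1 forward pass described in this step follows (PyTorch). The block counts (4 convolution blocks for Enc_id, 3 for Enc_exp, 4 deconvolution blocks for Dec_1) follow the architecture described later in the text, but it omits the CBAM attention modules and skip connections; the layer widths, input resolution, and the fusion of f_id and f_exp_i as a channel-wise concatenation are assumptions.

```python
# Minimal sketch of FaceGAN generator G1: Enc_id + Enc_exp -> fused feature f -> Dec_1.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True))

class FaceGANGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # identity encoder: 4 convolution blocks; expression encoder: 3 convolution blocks
        self.enc_id = nn.Sequential(conv_block(3, 64), conv_block(64, 128),
                                    conv_block(128, 256), conv_block(256, 256))
        self.enc_exp = nn.Sequential(conv_block(1, 64), conv_block(64, 128), conv_block(128, 256))
        # decoder: 4 deconvolution blocks back to a 3-channel image
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, I_N, e_i):                 # I_N: (B,3,128,128), e_i: (B,1,128,128)
        f_id = self.enc_id(I_N)                  # identity features of the neutral source face (8x8)
        f_exp = self.enc_exp(e_i)                # expression features of the i-th profile (16x16)
        f_exp = nn.functional.avg_pool2d(f_exp, 2)   # align spatial sizes (an assumption)
        f = torch.cat([f_id, f_exp], dim=1)      # "series connection" of identity and expression features
        return self.dec(f)                       # first-stage prediction image I_pre-target
```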
thirdly, taking the first-stage predictive image as input, and reconstructing a source face neutral image by adopting the concept of CycleGAN:
the first-stage prediction image I_pre-target and the expression profile e_N corresponding to the neutral expressionless image I_N of the second step are used again as input to the expression migration network FaceGAN: the identity encoder Enc_id extracts the identity features of the image I_pre-target while the expression encoder Enc_exp extracts the expression features of the expression profile e_N, and the second step is repeated so that the decoder decodes and generates a reconstructed image I_recon of I_N; the formula for generating the reconstructed image I_recon is expressed as:
I_recon = Dec_1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
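The CycleGAN-style reconstruction step can then be wired by reusing the same generator, as in the sketch below; the dummy tensors and the L1 form of the reconstruction distance are assumptions for illustration only.

```python
# Wiring of the second and third steps, reusing the FaceGANGenerator sketched above.
import torch

G1 = FaceGANGenerator()
I_N = torch.randn(1, 3, 128, 128)    # neutral expressionless source-face image (dummy tensor)
e_i = torch.randn(1, 1, 128, 128)    # target expression profile of frame i (dummy tensor)
e_N = torch.randn(1, 1, 128, 128)    # expression profile extracted from the neutral image

I_pre_target = G1(I_N, e_i)          # second step: expression transfer
I_recon = G1(I_pre_target, e_N)      # third step: reconstruct the neutral source face, Eq. (3)
recon_loss = torch.nn.functional.l1_loss(I_recon, I_N)   # feeds the reconstruction loss, Eq. (9)
```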
fourth, calculating a loss function in the expression migration network FaceGAN in the first stage:
The specific loss function of the generator G_1 in the first-stage expression migration network FaceGAN is as follows: I_real is the target ground-truth value; equation (5) is the adversarial loss of the generator, with D_1(·) denoting the probability that the discriminator D_1 judges its input to be real; in equation (6) the SSIM(·) function is used to measure the similarity between two images; equation (7) is the pixel loss, in which the MAE(·) function (mean absolute error) measures the gap between the true value and the predicted value; equation (8) is the perceptual loss, for which VGG-19 is used to extract the perceptual features of an image, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features, and the perceptual loss between the generated image and the real image is computed; equation (9) is the reconstruction loss, which computes the distance between the neutral expressionless source-face image I_N and its reconstructed image I_recon;
The specific loss function of the discriminator D_1 in the first-stage expression migration network FaceGAN is as follows: equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image, where λ_1 and λ_2 are the weights of the similarity loss and the perceptual loss in the FaceGAN generator G_1 loss, and λ_3 is the weight of the reconstruction adversarial loss in the FaceGAN discriminator loss;
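Since equations (4) to (12) are only described in words above, the sketch below stands in for them with common loss formulations (BCE adversarial loss, 1−SSIM, L1/MAE pixel loss, VGG-19 feature distance); these concrete forms, the weighting of the terms, and the third-party SSIM package are assumptions, not the patent's exact definitions.

```python
# Hedged stand-in for the FaceGAN generator losses described around Eqs. (4)-(9).
import torch
import torch.nn.functional as F
from torchvision.models import vgg19
from pytorch_msssim import ssim            # third-party SSIM implementation (assumption)

vgg_features = vgg19(weights="DEFAULT").features.eval()   # last-conv-layer features used as perceptual features

def perceptual(x, y):
    # ImageNet normalization is skipped here for brevity.
    with torch.no_grad():
        fy = vgg_features(y)
    return F.l1_loss(vgg_features(x), fy)

def g1_loss(D1, I_pre_target, I_real, I_N, I_recon, lam1, lam2):
    d_out = D1(I_pre_target)
    adv   = F.binary_cross_entropy(d_out, torch.ones_like(d_out))            # Eq. (5), assumed BCE form
    sim   = 1.0 - ssim((I_pre_target + 1) / 2, (I_real + 1) / 2, data_range=1.0)  # Eq. (6)
    pix   = F.l1_loss(I_pre_target, I_real)                                  # Eq. (7), MAE pixel loss
    per   = perceptual(I_pre_target, I_real)                                 # Eq. (8)
    recon = F.l1_loss(I_recon, I_N)                                          # Eq. (9)
    return adv + lam1 * sim + pix + lam2 * per + recon                       # Eq. (4), weighting assumed
```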
the second stage, setting up a detail generation network FineGAN, including the fifth to seventh steps:
fifth, generating a local mask vector adapted to the individual:
the feature points obtained for each face in the first step are used to locate the eye region I_eye and the mouth region I_mouth, and an eye mask vector M_eye and a mouth mask vector M_mouth are set respectively; taking the eyes as an example, the eye mask vector M_eye is formed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0, and the mouth mask vector M_mouth is formed in the same way as the eye mask vector M_eye;
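A possible way to rasterize the eye mask vector M_eye and the mouth mask vector M_mouth from the 68 Dlib landmarks is sketched below; filling the convex hull of each landmark group is an assumption about how the regions are turned into 0/1 masks.

```python
# Build per-person eye and mouth masks from the 68 landmarks (dlib indexing:
# eyes 36-47, mouth 48-67); region pixels are set to 1, all other pixels to 0.
import cv2
import numpy as np

def region_mask(landmarks, indices, shape_hw):
    mask = np.zeros(shape_hw, dtype=np.uint8)
    pts = landmarks[indices].astype(np.int32)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 1)   # fill the convex hull with 1
    return mask.astype(np.float32)

def eye_and_mouth_masks(landmarks, shape_hw):
    M_eye = np.maximum(region_mask(landmarks, list(range(36, 42)), shape_hw),
                       region_mask(landmarks, list(range(42, 48)), shape_hw))
    M_mouth = region_mask(landmarks, list(range(48, 68)), shape_hw)
    return M_eye, M_mouth
```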
step six, inputting the first-stage prediction image into the second-stage network for detail optimization:
the detail generation network FineGAN comprises a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N of the second step are input to the generator G_2, which generates a second-stage prediction image I_target with more facial details; the second-stage prediction image I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on the second-stage prediction image I_target so that it is as close as possible to the target real image I_real, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further optimize the eye and mouth regions of the second-stage prediction image I_target so that it is more realistic; the formula for the second-stage prediction image I_target is expressed as:
I_target = G_2(I_pre-target, I_N) (13);
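The sketch below shows how the second-stage forward pass and the three discriminators could be wired together; the architectures of G_2 and the discriminators, and the dummy masks, are placeholders chosen for illustration only.

```python
# Second-stage wiring: G2 refines the first-stage prediction; the result is judged by a
# global discriminator and by two mask-based local discriminators (eyes and mouth).
import torch
import torch.nn as nn

class FineGANGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                       # takes [I_pre_target, I_N] on the channel axis
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, I_pre_target, I_N):
        return self.net(torch.cat([I_pre_target, I_N], dim=1))   # Eq. (13)

def patch_discriminator(in_ch=3):
    return nn.Sequential(nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                         nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                         nn.Conv2d(128, 1, 4, 1, 1), nn.Sigmoid())

G2 = FineGANGenerator()
D_global, D_eye, D_mouth = patch_discriminator(), patch_discriminator(), patch_discriminator()

I_pre_target = torch.randn(1, 3, 128, 128)
I_N = torch.randn(1, 3, 128, 128)
M_eye = torch.zeros(1, 1, 128, 128);   M_eye[..., 40:60, 30:100] = 1    # dummy masks for illustration
M_mouth = torch.zeros(1, 1, 128, 128); M_mouth[..., 85:110, 45:85] = 1

I_target = G2(I_pre_target, I_N)
scores = (D_global(I_target), D_eye(I_target * M_eye), D_mouth(I_target * M_mouth))
```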
seventh, calculating the loss functions in the second-stage FineGAN:
The specific loss function of the generator G_2 is as follows: equation (15) is the adversarial loss, comprising the global adversarial loss and the local adversarial losses, in which the Hadamard product operator is used; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, which compute the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; equation (19) is the local perceptual loss; the total loss function of the generator G_2 is the weighted sum of these loss functions;
The specific loss function of the discriminator D_2 is as follows: equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators, where λ_4 and λ_5 are the weights of the local adversarial losses in the FineGAN generator G_2 loss, λ_6 and λ_7 are respectively the weights of the eye pixel loss L_eye and the mouth pixel loss in the FineGAN generator G_2 loss, λ_8 is the weight of the local perceptual loss in the FineGAN generator G_2 loss, and λ_9 is the weight of the global adversarial loss in the FineGAN discriminator D_2 loss;
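A hedged sketch of the mask-based local losses around equations (16) to (19) follows: the mask is applied by Hadamard product and the L1 norm of the masked difference is taken; the normalization and the masked-input form of the local perceptual loss are assumptions rather than the patent's exact formulas.

```python
# Mask-based local pixel loss and local perceptual loss (assumed forms).
import torch
import torch.nn.functional as F

def local_pixel_loss(I_target, I_real, M):
    # Eqs. (17)/(18): L1 norm of the pixel difference restricted to the masked (eye or mouth) region
    return torch.abs((I_target - I_real) * M).sum() / (M.sum() * I_target.size(1) + 1e-8)

def local_perceptual_loss(I_target, I_real, M, feat_extractor):
    # Eq. (19): perceptual distance between the masked regions (assumed form)
    return F.l1_loss(feat_extractor(I_target * M), feat_extractor(I_real * M))
```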
eighth step, synthesizing video:
each frame is generated independently; therefore, after the n frame images (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, the video frame sequence is synthesized into the final face animation;
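The eighth step can be implemented, for example, by writing the generated frames to a video file as sketched below; the codec, frame rate, and tensor-to-image conversion are assumptions.

```python
# Stack the independently generated frames into a video file.
import cv2
import numpy as np

def tensor_to_bgr(t):                      # convert a [-1,1] CHW tensor to a uint8 BGR frame
    img = ((t.permute(1, 2, 0).cpu().numpy() + 1.0) * 127.5).clip(0, 255).astype(np.uint8)
    return cv2.cvtColor(img, cv2.COLOR_RGB2BGR)

def frames_to_video(frames, path="animation.mp4", fps=25):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for f in frames:                       # frames: list of HxWx3 uint8 BGR images
        writer.write(f)
    writer.release()
```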
thus, the two-stage expression animation generation based on dual generative adversarial networks is completed: the expression in the face image is converted and the image details are optimized.
In particular, the identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3 convolution blocks; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the final convolution block; the decoder Dec_1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3 deconvolution blocks. The encoder and decoder of the network are connected with skip connections: the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec_1, the output of layer 2 of Enc_id is connected to the input of the second-to-last layer of Dec_1, and the output of layer 3 of Enc_id is connected to the input of the third-to-last layer of Dec_1. The CBAM attention modules are added so that the network learns to focus on the more important regions of the image, while the skip connections combine the higher and lower layers of the network so that it can learn low-level details such as facial texture.
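A compact sketch of this wiring is given below: a CBAM block (channel attention followed by spatial attention) inside the first three convolution blocks of Enc_id, and the outputs of layers 1-3 collected for the skip connections to Dec_1; the channel widths and the reduction ratio are illustrative assumptions.

```python
# CBAM attention block and an identity encoder that exposes skip features for Dec_1.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, ch, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // r, ch, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca                                                   # channel attention
        sa = torch.sigmoid(self.spatial(torch.cat([x.mean(1, keepdim=True),
                                                   x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                                # spatial attention

class EncIdWithSkips(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 256]
        self.blocks = nn.ModuleList()
        for i in range(4):                                           # 4 convolution blocks
            layers = [nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                      nn.ReLU(inplace=True)]
            if i < 3:
                layers.append(CBAM(chans[i + 1]))                    # CBAM in the first 3 blocks
            self.blocks.append(nn.Sequential(*layers))

    def forward(self, x):
        skips = []
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i < 3:
                skips.append(x)   # layer 1-3 outputs, concatenated to the last three Dec_1 blocks
        return x, skips
```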
In the two-stage expression animation generation method based on dual generative adversarial networks, GAN is the English abbreviation of the generative adversarial network model, whose full name is Generative Adversarial Networks, an algorithm well known in the technical field; the Dlib library is a publicly available library.
The beneficial effects of the application are as follows: compared with the prior art, the method has the following advantages.
the remarkable progress of the application is as follows:
(1) Compared with CN108288072A, the proposed detail generation network ensures fine-grained generation of the facial animation and places emphasis on optimizing the two important regions of the mouth and the eyes, so the generated result is more vivid and natural.
(2) Compared with CN110084121A, the expression profile is used to supervise the learning process of the FaceGAN network, so the network can learn the continuous presentation of an expression, learn the degree of the emotion, and generate a smooth facial animation.
(3) Compared with CN105069830A, the target expression profile is used to guide the network to learn the presentation of the target expression; the expression type is not restricted, and an expression animation of any emotion required by the user can be generated.
(4) Compared with CN107944358A, the model is trained with the ring-shaped network structure of CycleGAN, and skip connections are added in FaceGAN, so as to ensure that the identity information of the generated face is consistent with the source face.
(5) By arranging the global discriminator, the local discriminators, and the local loss functions (equations (17) and (18)), the method ensures the realism of the whole generated image and finely generates the two important regions of the eyes and the mouth.
(6) By adding the attention module in FaceGAN and the second-stage detail generation network, the method ensures local detail generation and fine-grained expression of the image.
The outstanding essential characteristics of the application are as follows:
1) The application proposes two-stage generative adversarial networks for expression animation generation: the first stage performs expression conversion and the second stage optimizes image details. A mask-based local loss function is proposed, in which designated regions of the image are extracted through mask vectors for emphasized optimization; combined with the use of local discriminators, the important parts are generated with better quality.
2) In the application, each frame image of the video sequence is generated from the neutral image, and the video frame sequence is generated in a non-recursive way, which avoids error propagation, i.e., the problem that errors produced in earlier frames are passed on to later frames so that the generation quality of later frames becomes progressively worse. On the other hand, this input mode makes the network learn larger changes from the neutral expression to other expressions, which increases the difficulty of model training. Therefore, after the first-stage network generates the prediction image, the prediction image is fed into the network again and the source input image is reconstructed following the ring-shaped network idea of CycleGAN, which forces the network to retain the identity features without increasing the number of model parameters; the loss functions include the adversarial loss, the SSIM similarity loss, the pixel loss, the VGG perceptual loss, and the reconstruction loss. The second-stage network of the application includes a generator, a global discriminator, and two local discriminators, with mask-based local discriminators and local loss functions added.
3) In FaceGAN, the method uses the CycleGAN idea of feeding the expression-converted image back into the network as input and reconstructing the source face image, so the network is forced to keep the identity of the face and change only the expression. Meanwhile, the high-level and low-level features of the FaceGAN network are fused through skip connections, so the network can learn more face identity information from the low-level features. The method can thus realize expression conversion without changing the identity information of the face.
4) The application proposes a detail optimization network FineGAN that focuses on generating image details and emphasizes optimizing the important eye and mouth regions. Appropriate weights are set to balance the pixel loss and the adversarial loss, and a perceptual loss is added to remove artifacts, so that the generated image contains no unreasonable artifacts and the network produces high-quality, lifelike images with rich details that conform to human vision.
5) The method has relatively few network parameters and low space and time complexity; a unified network can learn the transfer of any expression type as well as the continuous change of emotion intensity, so the method has a good application prospect.
Drawings
The application will be further described with reference to the drawings and examples.
Fig. 1 is a schematic block flow diagram of the method of the present application.
In fig. 2, the odd rows are schematic diagrams of the facial feature points of the method of the application, and the even rows are the facial expression contour maps.
Fig. 3 is a schematic mask diagram of the present application, wherein the first row is a facial region image extracted from a preprocessed raw data set, the second and fourth rows are visualizations of an eye mask vector and a mouth mask vector, respectively, and the third and fifth rows are partial region images extracted from the source image after the eye mask vector and the mouth mask vector are applied thereto.
FIG. 4 shows 3 experimental results of the application, wherein the odd rows are the inputs to the method of the application, comprising a neutral image of the source face and a contour map sequence of the target expression; the even rows are the experimental results, i.e., the output video frame sequences of the expression animation.
Detailed Description
The embodiment shown in fig. 1 shows that the flow of the two-stage expression animation generation method based on the dual generation countermeasure network of the application is as follows:
A facial expression contour map is obtained for each frame image in the data set; the identity features of the source face and the expression features of the target expression contour map are extracted to initially generate a first-stage prediction image; with the first-stage prediction image as input, the neutral source face image is reconstructed using the CycleGAN idea; the loss functions in the first-stage FaceGAN are calculated; local mask vectors adapted to the individual are generated; the first-stage prediction image is input into the second-stage network for detail optimization; the loss functions in the second-stage FineGAN are calculated; and the video is synthesized.
Example 1
The two-stage expression animation generation method based on dual generative adversarial networks of this embodiment comprises the following specific steps:
the first step, a facial expression profile of each frame of image in a data set is obtained:
collecting a facial expression video sequence data set, extracting the face in each frame image of a video sequence using the Dlib machine learning library, and obtaining 68 feature points in each face (in the expression transfer field, 68 feature points constitute the face contour and the contours of the eyes, mouth and nose; 5 or 81 feature points can also be used), as shown in the odd rows of fig. 2; the feature points are then connected in order with line segments to obtain the expression contour map of each frame of the video sequence, as shown in the even rows of fig. 2, denoted e = (e_1, e_2, ···, e_i, ···, e_n), where e represents the set of all facial expression contour maps in a video sequence, n represents the number of video frames, and e_i represents the facial expression contour map of the i-th frame of a video sequence;
the first stage, setting up an expression migration network FaceGAN, including the second to fourth steps:
secondly, extracting identity features of a source face and expression features of a target expression profile map, and initially generating a first-stage prediction map:
FaceGAN comprises a generator G_1 and a discriminator D_1, where the generator G_1 comprises three sub-networks: two encoders, Enc_id and Enc_exp, and a decoder Dec_1;
firstly, a neutral expressionless image I_N of the source face and a target expression profile sequence e are input; in this embodiment the input is the neutral face of user S010, the target expression profile sequence is the process from an expressionless face to showing a smile, and the expression profile extracted from the neutral expressionless image I_N is denoted e_N, the specific input being shown in the first row of fig. 4; the identity encoder Enc_id is then used to extract the identity feature vector f_id of user S010, while the expression encoder Enc_exp is used to extract the set of expression feature vectors f_exp of the target expression profiles, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n); the formulas are:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
the identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain a feature vector f, with f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec_1 for decoding to generate the first-stage prediction image I_pre-target, with I_pre-target = Dec_1(f); finally, I_pre-target is input to the discriminator D_1 to judge whether the image is real or fake;
thirdly, taking the first-stage predictive image as input, and reconstructing a source face neutral image by adopting the concept of CycleGAN:
the first-stage prediction image I_pre-target and the expression profile e_N extracted from the neutral expressionless image I_N of the second step are used again as the input of FaceGAN, and the operation of the second step is repeated to generate a reconstructed image I_recon of the neutral expression of user S010; the formula for generating I_recon is expressed as:
I_recon = Dec_1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourth, calculating the loss function in the first-stage FaceGAN:
The specific loss function of the generator G_1 in the first-stage FaceGAN is as follows: I_real is the target ground-truth value (i.e., the ground truth, a source face image with the target expression, the real image the model finally aims to predict), here the real image of user S010 smiling; equation (5) is the adversarial loss of the generator; in equation (6) the SSIM(·) function is used to measure the similarity between two images; equation (7) is the pixel loss, in which the mean absolute error function measures the gap between the true value and the predicted value; equation (8) is the perceptual loss, for which VGG-19 is used to extract the perceptual features of an image, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features, and the perceptual loss between the generated image and the real image is computed; equation (9) is the reconstruction loss, which computes the distance between the neutral expressionless source-face image I_N and the reconstructed image I_recon; the loss of the generator G_1 is the weighted sum of the loss functions of these parts;
The specific loss function of the discriminator D_1 in the first-stage FaceGAN is as follows: equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image;
The identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3 convolution blocks; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the final convolution block; the decoder Dec_1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3 deconvolution blocks. The higher and lower layers of the network are combined with skip connections: the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec_1, the output of layer 2 of Enc_id is connected to the input of the second-to-last layer of Dec_1, and the output of layer 3 of Enc_id is connected to the input of the third-to-last layer of Dec_1. All convolution kernel sizes in this patent are 3×3.
The second stage, setting up a detail generation network FineGAN, including the fifth to seventh steps:
fifth, generating a local mask vector adapted to the individual:
the 68 feature points in each face obtained in the first step are used to locate the eye region I_eye and the mouth region I_mouth, and the eye mask vector M_eye and the mouth mask vector M_mouth are set respectively, as shown in the second and fourth rows of fig. 3; taking the eyes as an example, M_eye is formed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0, and the mouth mask vector M_mouth is formed in the same way as M_eye;
step six, inputting the first-stage prediction image into the second-stage network for detail optimization:
FineGAN contains a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N of the second step are input to the generator G_2 to generate the second-stage prediction image I_target of user S010 containing more facial details; I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on the generated I_target so that it is as close as possible to the real image I_real of user S010 smiling, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further emphasize the eye and mouth regions of I_target so that the generated image I_target is more realistic; the formula is described as follows:
I_target = G_2(I_pre-target, I_N) (13);
seventh, calculating the loss functions in the second-stage FineGAN:
The specific loss function of the generator G_2 is as follows: equation (15) is the adversarial loss, comprising the global adversarial loss and the local adversarial losses, in which the Hadamard product operator is used; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, which compute the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; equation (19) is the local perceptual loss; the total generator loss function is the weighted sum of these loss functions;
The specific loss function of the discriminator D_2 is as follows: equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators;
eighth step, synthesizing video:
each frame is generated independently; therefore, after the n frame images (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, an expression gradual-change process from the expressionless face of user S010 to a smile is obtained, and the video frame sequence is synthesized into the facial animation of user S010, as shown in the second row of fig. 4;
thus, the two-stage expression animation generation based on dual generative adversarial networks is completed: the expression in the face image is converted and the image details are optimized.
In this embodiment, the weight parameter settings involved in the steps are shown in Table 1 and give good results on the whole sample database.
Table 1 weight parameter settings for each loss in this embodiment
In the two-stage expression animation generation method based on dual generative adversarial networks, GAN is the English abbreviation of the generative adversarial network model, whose full name is Generative Adversarial Networks, an algorithm well known in the technical field.
Figure 4 shows the results of 3 embodiments of the application. The second row is the generated video frame sequence of user S010 going from a neutral expression to smiling, the fourth row is the generated video frame sequence of user S022 going from a neutral expression to a surprised open mouth, and the sixth row is the generated video frame sequence of user S032 going from a neutral expression to a downturned (pouting) mouth. Fig. 4 shows that the method of the application can complete the transfer of the expression while keeping the identity information of the face, and can generate a continuously and gradually changing video frame sequence to synthesize an animation video with a designated identity and a designated expression.
Matters not described in the application are applicable to the prior art.

Claims (4)

1. A two-stage expression animation generation method based on dual generative adversarial networks, characterized in that, first, in a first stage, an expression migration network FaceGAN is used to extract the expression features in a target expression profile map and transfer them to a source face to generate a first-stage prediction image; in a second stage, a detail generation network FineGAN is used to supplement and enrich the details of the eye and mouth regions, which contribute more to expression change, in the first-stage prediction image, generate a fine-grained second-stage prediction image, and synthesize a face video animation; both the expression migration network FaceGAN and the detail generation network FineGAN are implemented as generative adversarial networks;
the expression migration network FaceGAN includes a generator G 1 And a discriminator D 1 Wherein generator G 1 Comprising three sub-networks, respectively an identity encoder Enc id And an expression encoder Enc exp A decoder Dec 1
The detail generation network FineGAN comprises a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global, an eye local discriminator D_eye, and a mouth local discriminator D_mouth;
the method comprises the following specific steps:
the first step, a facial expression profile of each frame of image in a data set is obtained:
collecting a facial expression video sequence data set, extracting the face in each frame image of a video sequence using the Dlib machine learning library, obtaining a number of feature points in each face, and connecting the feature points in order with line segments to obtain the expression profile of each frame of the video sequence, denoted e = (e_1, e_2, ···, e_i, ···, e_n), where e represents the set of all expression profiles in a video sequence, i.e., the expression profile sequence, n represents the number of video frames, and e_i represents the expression profile of the i-th frame of a given video sequence;
the first stage, setting up an expression migration network FaceGAN, including the second to fourth steps:
secondly, extracting identity features of a source face and expression features of a target expression profile map, and initially generating a first-stage prediction map:
the expression migration network FaceGAN includes a generator G_1 and a discriminator D_1, where the generator G_1 comprises three sub-networks: two encoders, Enc_id and Enc_exp, and a decoder Dec_1;
firstly, a neutral expressionless image I_N of the source face and a target expression profile sequence e are input; the identity encoder Enc_id is then used to extract the identity feature vector f_id of the neutral expressionless image I_N of the source face, while the expression encoder Enc_exp is used to extract the set of expression feature vectors f_exp of the target expression profile sequence e, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n); the formulas are:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
the identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain a feature vector f, with f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec_1 for decoding to generate the first-stage prediction image I_pre-target, with I_pre-target = Dec_1(f); finally, I_pre-target is input to the discriminator D_1 to judge whether the image is real or fake;
thirdly, taking the first-stage predictive image as input, and reconstructing a source face neutral image by adopting the concept of CycleGAN:
the first-stage prediction image I_pre-target and the expression profile e_N corresponding to the neutral expressionless image I_N of the second step are used again as input to the expression migration network FaceGAN: the identity encoder Enc_id extracts the identity features of the image I_pre-target while the expression encoder Enc_exp extracts the expression features of the expression profile e_N, and the second step is repeated so that the decoder decodes and generates a reconstructed image I_recon of I_N; the formula for generating the reconstructed image I_recon is expressed as:
I_recon = Dec_1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourth, calculating a loss function in the expression migration network FaceGAN in the first stage:
The specific loss function of the generator G_1 in the first-stage expression migration network FaceGAN is as follows: I_real is the target ground-truth value; equation (5) is the adversarial loss of the generator, with D_1(·) denoting the probability that the discriminator D_1 judges its input to be real; in equation (6) the SSIM(·) function is used to measure the similarity between two images; equation (7) is the pixel loss, in which the MAE(·) function (mean absolute error) measures the gap between the true value and the predicted value; equation (8) is the perceptual loss, for which VGG-19 is used to extract the perceptual features of an image, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features, and the perceptual loss between the generated image and the real image is computed; equation (9) is the reconstruction loss, which computes the distance between the neutral expressionless source-face image I_N and its reconstructed image I_recon;
The specific loss function of the discriminator D_1 in the first-stage expression migration network FaceGAN is as follows: equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image, where λ_1 and λ_2 are the weights of the similarity loss and the perceptual loss in the FaceGAN generator G_1 loss, and λ_3 is the weight of the reconstruction adversarial loss in the FaceGAN discriminator loss;
the second stage, setting up a detail generation network FineGAN, including the fifth to seventh steps:
fifth, generating a local mask vector adapted to the individual:
the feature points obtained for each face in the first step are used to locate the eye region I_eye and the mouth region I_mouth, and an eye mask vector M_eye and a mouth mask vector M_mouth are set respectively; taking the eyes as an example, the eye mask vector M_eye is formed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0, and the mouth mask vector M_mouth is formed in the same way as the eye mask vector M_eye;
step six, inputting the first-stage prediction image into the second-stage network for detail optimization:
the detail generation network FineGAN comprises a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N of the second step are input to the generator G_2, which generates a second-stage prediction image I_target with more facial details; the second-stage prediction image I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on the second-stage prediction image I_target so that it is as close as possible to the target real image I_real, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further optimize the eye and mouth regions of the second-stage prediction image I_target so that it is more realistic; the formula for the second-stage prediction image I_target is expressed as:
I_target = G_2(I_pre-target, I_N) (13);
seventh, calculating the loss functions in the second-stage FineGAN:
The specific loss function of the generator G_2 is as follows: equation (15) is the adversarial loss, comprising the global adversarial loss and the local adversarial losses, in which the Hadamard product operator is used; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, which compute the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; equation (19) is the local perceptual loss; the total loss function of the generator G_2 is the weighted sum of these loss functions;
The specific loss function of the discriminator D_2 is as follows: equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators, where λ_4 and λ_5 are the weights of the local adversarial losses in the FineGAN generator G_2 loss, λ_6 and λ_7 are respectively the weights of the eye pixel loss and the mouth pixel loss in the FineGAN generator G_2 loss, λ_8 is the weight of the local perceptual loss in the FineGAN generator G_2 loss, and λ_9 is the weight of the global adversarial loss in the FineGAN discriminator D_2 loss;
eighth step, synthesizing video:
each frame is generated independently; therefore, after the n frame images (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, the video frame sequence is synthesized into the final face animation;
thus, the two-stage expression animation generation based on dual generative adversarial networks is completed: the expression in the face image is converted and the image details are optimized.
2. The method of claim 1, wherein the identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3 convolution blocks; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the final convolution block; the decoder Dec_1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3 deconvolution blocks; and the higher and lower layers of the network are combined with skip connections, namely, the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec_1, the output of layer 2 of Enc_id is connected to the input of the second-to-last layer of Dec_1, and the output of layer 3 of Enc_id is connected to the input of the third-to-last layer of Dec_1.
3. The generating method according to claim 1, wherein the weight parameter of each loss is set as:
4. The method according to claim 1, wherein the number of feature points obtained in the first step is 68, and the 68 feature points constitute the face contour and the contours of the eyes, mouth and nose.
CN202010621885.2A 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks Active CN111783658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010621885.2A CN111783658B (en) 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010621885.2A CN111783658B (en) 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks

Publications (2)

Publication Number Publication Date
CN111783658A CN111783658A (en) 2020-10-16
CN111783658B true CN111783658B (en) 2023-08-25

Family

ID=72761358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010621885.2A Active CN111783658B (en) 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks

Country Status (1)

Country Link
CN (1) CN111783658B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541477B (en) 2020-12-24 2024-05-31 北京百度网讯科技有限公司 Expression pack generation method and device, electronic equipment and storage medium
CN113033288B (en) * 2021-01-29 2022-06-24 浙江大学 Method for generating front face picture based on side face picture for generating confrontation network
CN113343761A (en) * 2021-05-06 2021-09-03 武汉理工大学 Real-time facial expression migration method based on generation confrontation
CN113326934B (en) * 2021-05-31 2024-03-29 上海哔哩哔哩科技有限公司 Training method of neural network, method and device for generating images and videos
US11900519B2 (en) * 2021-11-17 2024-02-13 Adobe Inc. Disentangling latent representations for image reenactment
CN115100329B (en) * 2022-06-27 2023-04-07 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN115311261A (en) * 2022-10-08 2022-11-08 石家庄铁道大学 Method and system for detecting abnormality of cotter pin of suspension device of high-speed railway contact network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002304638A (en) * 2001-04-03 2002-10-18 Atr Ningen Joho Tsushin Kenkyusho:Kk Device and method for generating expression animation
WO2019228317A1 (en) * 2018-05-28 2019-12-05 华为技术有限公司 Face recognition method and device, and computer readable medium
CN110689480A (en) * 2019-09-27 2020-01-14 腾讯科技(深圳)有限公司 Image transformation method and device
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002304638A (en) * 2001-04-03 2002-10-18 Atr Ningen Joho Tsushin Kenkyusho:Kk Device and method for generating expression animation
WO2019228317A1 (en) * 2018-05-28 2019-12-05 华为技术有限公司 Face recognition method and device, and computer readable medium
CN110689480A (en) * 2019-09-27 2020-01-14 腾讯科技(深圳)有限公司 Image transformation method and device
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Facial expression transfer model based on conditional generative adversarial networks (基于条件生成式对抗网络的面部表情迁移模型); 陈军波; 刘蓉; 刘明; 冯杨; Computer Engineering (计算机工程), No. 04; full text *

Also Published As

Publication number Publication date
CN111783658A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111783658B (en) Two-stage expression animation generation method based on dual generative adversarial networks
Zhou et al. Photorealistic facial expression synthesis by the conditional difference adversarial autoencoder
CN111080511B (en) End-to-end face exchange method for high-resolution multi-feature extraction
CN112149459B (en) Video saliency object detection model and system based on cross attention mechanism
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
KR102602112B1 (en) Data processing method, device, and medium for generating facial images
CN113807265B (en) Diversified human face image synthesis method and system
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN115170559A (en) Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
CN115359534B (en) Micro-expression identification method based on multi-feature fusion and double-flow network
CN116071494A (en) High-fidelity three-dimensional face reconstruction and generation method based on implicit nerve function
CN113362422A (en) Shadow robust makeup transfer system and method based on decoupling representation
Zhou et al. Generative adversarial network for text-to-face synthesis and manipulation with pretrained BERT model
Huang et al. IA-FaceS: A bidirectional method for semantic face editing
CN114549387A (en) Face image highlight removal method based on pseudo label
Liu et al. WSDS-GAN: A weak-strong dual supervised learning method for underwater image enhancement
CN117689592A (en) Underwater image enhancement method based on cascade self-adaptive network
Otto et al. A perceptual shape loss for monocular 3D face reconstruction
CN112686830A (en) Super-resolution method of single depth map based on image decomposition
Fan et al. Facial Expression Transfer Based on Conditional Generative Adversarial Networks
CN115527275A (en) Behavior identification method based on P2CS _3DNet
CN115578298A (en) Depth portrait video synthesis method based on content perception
CN111767842B (en) Micro-expression type discrimination method based on transfer learning and self-encoder data enhancement
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant