CN111783658B - Two-stage expression animation generation method based on dual generative adversarial networks - Google Patents

Two-stage expression animation generation method based on dual generative adversarial networks

Info

Publication number
CN111783658B
Authority
CN
China
Prior art keywords
expression
stage
image
loss
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010621885.2A
Other languages
Chinese (zh)
Other versions
CN111783658A (en)
Inventor
郭迎春
王静洁
刘依
朱叶
郝小可
于洋
师硕
阎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202010621885.2A priority Critical patent/CN111783658B/en
Publication of CN111783658A publication Critical patent/CN111783658A/en
Application granted granted Critical
Publication of CN111783658B publication Critical patent/CN111783658B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application relates to a two-stage expression animation generation method based on dual generative adversarial networks. In the first stage, an expression migration network FaceGAN extracts the expression features in a target expression profile map, transfers them to a source face, and generates a first-stage prediction image. In the second stage, a detail generation network FineGAN supplements and enriches the details of the eye and mouth regions, which contribute most to expression change, in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes a face video animation; both the expression migration network FaceGAN and the detail generation network FineGAN are implemented as generative adversarial networks. The application proposes two-stage generative adversarial networks for expression animation generation: the first stage performs the expression conversion and the second stage optimizes the image details; designated regions of the image are extracted through mask vectors for emphasized optimization and, combined with the use of local discriminators, the important parts are generated with better quality.

Description

Two-stage expression animation generation method based on dual generative adversarial networks
Technical Field
The technical scheme of the application relates to image data processing in computer vision, and in particular to a two-stage expression animation generation method based on dual generative adversarial networks.
Background
Facial expression synthesis refers to the process of transferring an expression from a target expression reference face to a source face: the identity information of the newly synthesized source face image is kept unchanged, while its expression is made consistent with the target expression reference face. The technology is gradually being applied in fields such as film and television production, virtual reality, and criminal investigation. Facial expression synthesis has important research value in both academia and industry, and how to robustly synthesize natural and lifelike facial expressions has become a challenging hot research topic.
The existing facial expression synthesis methods can be divided into two main categories: traditional graphics methods and image generation methods based on deep learning. The first category, traditional graphics methods, generally parameterizes the source face image with a parametric model and designs the model to convert expressions and generate a new image, or warps the face image using feature correspondences and optical flow maps, or fuses existing expression data into face patches; however, the model design process is intricate and complex, incurs a high computational cost, and generalizes poorly.
The second category is expression synthesis methods based on deep learning. A deep neural network first extracts facial features, mapping the image from a high-dimensional space to feature vectors; the source expression features are then modified by adding expression labels, and a deep neural network synthesizes the target face image by mapping back to the high-dimensional space. The advent of GAN networks subsequently opened the way to sharp image synthesis and attracted considerable attention as soon as it was proposed. In the field of image synthesis, a large number of research methods, such as GAN variants, have emerged to generate images. For example, the conditional generative adversarial network (Conditional Generative Adversarial Network, CGAN) can generate images under specific supervision information; in the field of facial expression generation, expression labels can be used as conditional supervision information so that face images with different expressions can be generated. At present, related methods based on GAN networks still have some shortcomings: when generating expression animation, problems such as unreasonable artifacts, blurred generated images, and low resolution can occur.
Facial expression generation is image-to-image conversion, whereas the aim of the application is to generate facial animation, which belongs to image-to-video conversion and, compared with the facial expression generation task, adds the challenge of the temporal dimension. Xing et al. use a gender-preserving network in GP-GAN: Gender Preserving GAN for Synthesizing Faces from Landmarks to enable the network to learn more gender information, but the method still has a defect in preserving the identity information of the face, and the generated face and the target face may have different identity characteristics. CN108288072A discloses a facial expression synthesis method based on a generative adversarial network which does not consider fine-grained generation of face images and omits detailed feature extraction from the source face image, so the generated results are blurred and of low resolution. CN110084121A discloses a facial expression migration method based on a spectrally normalized cycle-consistent generative adversarial network; the method uses expression one-hot vectors to supervise the training process of the network, and the discreteness of one-hot vectors limits the learning ability of the network, so the network can only learn the expression of a target emotion, such as happiness, sadness or surprise, and cannot learn the degree of the emotion, leaving a defect in continuous expression generation. CN105069830A discloses a method and a device for generating expression animation which can only generate expression animations of six designated templates; since human expressions are extremely rich and complex, the method has poor extensibility and cannot generate an arbitrary designated expression animation according to the user's requirements. CN107944358A discloses a face generation method based on a deep convolutional adversarial network model which cannot guarantee the invariance of face identity information during expression generation, so the generated face may be inconsistent with the target face.
Disclosure of Invention
The technical problem to be solved by the application is as follows: a two-stage expression animation generation method based on dual generative adversarial networks is provided. In the first stage, an expression migration network extracts the features of the target expression and transfers them to the source face to generate a first-stage prediction image; the first-stage expression migration network is named FaceGAN (Face Generative Adversarial Network). In the second stage, a detail generation network enriches some facial details in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes the video animation; the second-stage detail generation network is named FineGAN (Fine Generative Adversarial Network). The method of the application solves problems of the prior art such as blurred or low-resolution generated images and unreasonable artifacts in the generated results.
The technical scheme adopted by the application to solve the technical problem is as follows: in the first stage, driven by the target expression profile map, the expression migration network FaceGAN captures the expression features in the target expression profile map and transfers them to the source face to generate a first-stage prediction image; in the second stage, the detail generation network FineGAN, as a supplement, enriches the details of the eye and mouth regions, which contribute most to expression change, in the first-stage prediction image, generates a fine-grained second-stage prediction image, and synthesizes the face animation. The specific steps are as follows:
the first step, a facial expression profile of each frame of image in a data set is obtained:
collecting a facial expression video sequence data set, extracting the face in each frame image of a video sequence using the Dlib machine learning library, obtaining a number of feature points in each face, and connecting the feature points in order with line segments to obtain the expression profile of each frame of the video sequence, denoted e = (e_1, e_2, ···, e_i, ···, e_n), where e represents the set of all expression profiles in a video sequence, i.e., the expression profile sequence, n represents the number of video frames, and e_i represents the expression profile of the i-th frame of a given video sequence;
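For illustration, a minimal sketch of how this first step could be implemented with the Dlib machine learning library and OpenCV is given below; the predictor model file name, the 68-point grouping, and the drawing parameters are illustrative assumptions rather than details taken from the patent.

```python
# Sketch of the first step: detect the face, extract 68 landmarks with Dlib, and
# connect them with line segments to form the expression profile e_i of one frame.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

# dlib 68-point layout: jaw 0-16, brows 17-26, nose 27-35, eyes 36-47, mouth 48-67
GROUPS = [range(0, 17), range(17, 22), range(22, 27), range(27, 36),
          range(36, 42), range(42, 48), range(48, 60), range(60, 68)]

def expression_contour(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    contour = np.zeros(gray.shape, dtype=np.uint8)
    if len(faces) == 0:
        return contour
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=np.int32)
    for g in GROUPS:  # connect the feature points of each group in order with line segments
        cv2.polylines(contour, [pts[list(g)]], isClosed=False, color=255, thickness=1)
    return contour    # single-channel expression profile e_i for this frame
```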
the first stage, setting up an expression migration network FaceGAN, including the second to fourth steps:
secondly, extracting identity features of a source face and expression features of a target expression profile map, and initially generating a first-stage prediction map:
the expression migration network FaceGAN includes a generator G_1 and a discriminator D_1, where the generator G_1 comprises three sub-networks: two encoders, Enc_id and Enc_exp, and a decoder Dec_1;
firstly, a neutral expressionless image I_N of the source face and a target expression profile sequence e are input; the identity encoder Enc_id is then used to extract the identity feature vector f_id of the neutral expressionless image I_N of the source face, while the expression encoder Enc_exp is used to extract the set of expression feature vectors f_exp of the target expression profile sequence e, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n); the formulas are:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
the identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain a feature vector f, with f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec_1 for decoding to generate the first-stage prediction image I_pre-target, with I_pre-target = Dec_1(f); finally, I_pre-target is input to the discriminator D_1 to judge whether the image is real or fake;
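A simplified sketch of the generator G_1 forward pass described in this step follows (PyTorch). The block counts (4 convolution blocks for Enc_id, 3 for Enc_exp, 4 deconvolution blocks for Dec_1) follow the architecture described later in the text, but it omits the CBAM attention modules and skip connections; the layer widths, input resolution, and the fusion of f_id and f_exp_i as a channel-wise concatenation are assumptions.

```python
# Minimal sketch of FaceGAN generator G1: Enc_id + Enc_exp -> fused feature f -> Dec_1.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True))

class FaceGANGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # identity encoder: 4 convolution blocks; expression encoder: 3 convolution blocks
        self.enc_id = nn.Sequential(conv_block(3, 64), conv_block(64, 128),
                                    conv_block(128, 256), conv_block(256, 256))
        self.enc_exp = nn.Sequential(conv_block(1, 64), conv_block(64, 128), conv_block(128, 256))
        # decoder: 4 deconvolution blocks back to a 3-channel image
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, I_N, e_i):                 # I_N: (B,3,128,128), e_i: (B,1,128,128)
        f_id = self.enc_id(I_N)                  # identity features of the neutral source face (8x8)
        f_exp = self.enc_exp(e_i)                # expression features of the i-th profile (16x16)
        f_exp = nn.functional.avg_pool2d(f_exp, 2)   # align spatial sizes (an assumption)
        f = torch.cat([f_id, f_exp], dim=1)      # "series connection" of identity and expression features
        return self.dec(f)                       # first-stage prediction image I_pre-target
```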
thirdly, taking the first-stage predictive image as input, and reconstructing a source face neutral image by adopting the concept of CycleGAN:
the first-stage prediction image I_pre-target and the expression profile e_N corresponding to the neutral expressionless image I_N of the second step are used again as input to the expression migration network FaceGAN: the identity encoder Enc_id extracts the identity features of the image I_pre-target while the expression encoder Enc_exp extracts the expression features of the expression profile e_N, and the second step is repeated so that the decoder decodes and generates a reconstructed image I_recon of I_N; the formula for generating the reconstructed image I_recon is expressed as:
I_recon = Dec_1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
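The CycleGAN-style reconstruction step can then be wired by reusing the same generator, as in the sketch below; the dummy tensors and the L1 form of the reconstruction distance are assumptions for illustration only.

```python
# Wiring of the second and third steps, reusing the FaceGANGenerator sketched above.
import torch

G1 = FaceGANGenerator()
I_N = torch.randn(1, 3, 128, 128)    # neutral expressionless source-face image (dummy tensor)
e_i = torch.randn(1, 1, 128, 128)    # target expression profile of frame i (dummy tensor)
e_N = torch.randn(1, 1, 128, 128)    # expression profile extracted from the neutral image

I_pre_target = G1(I_N, e_i)          # second step: expression transfer
I_recon = G1(I_pre_target, e_N)      # third step: reconstruct the neutral source face, Eq. (3)
recon_loss = torch.nn.functional.l1_loss(I_recon, I_N)   # feeds the reconstruction loss, Eq. (9)
```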
fourth, calculating a loss function in the expression migration network FaceGAN in the first stage:
The specific loss function of the generator G_1 in the first-stage expression migration network FaceGAN is as follows: I_real is the target ground-truth value; equation (5) is the adversarial loss of the generator, with D_1(·) denoting the probability that the discriminator D_1 judges its input to be real; in equation (6) the SSIM(·) function is used to measure the similarity between two images; equation (7) is the pixel loss, in which the MAE(·) function (mean absolute error) measures the gap between the true value and the predicted value; equation (8) is the perceptual loss, for which VGG-19 is used to extract the perceptual features of an image, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features, and the perceptual loss between the generated image and the real image is computed; equation (9) is the reconstruction loss, which computes the distance between the neutral expressionless source-face image I_N and its reconstructed image I_recon;
The specific loss function of the discriminator D_1 in the first-stage expression migration network FaceGAN is as follows: equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image, where λ_1 and λ_2 are the weights of the similarity loss and the perceptual loss in the FaceGAN generator G_1 loss, and λ_3 is the weight of the reconstruction adversarial loss in the FaceGAN discriminator loss;
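Since equations (4) to (12) are only described in words above, the sketch below stands in for them with common loss formulations (BCE adversarial loss, 1−SSIM, L1/MAE pixel loss, VGG-19 feature distance); these concrete forms, the weighting of the terms, and the third-party SSIM package are assumptions, not the patent's exact definitions.

```python
# Hedged stand-in for the FaceGAN generator losses described around Eqs. (4)-(9).
import torch
import torch.nn.functional as F
from torchvision.models import vgg19
from pytorch_msssim import ssim            # third-party SSIM implementation (assumption)

vgg_features = vgg19(weights="DEFAULT").features.eval()   # last-conv-layer features used as perceptual features

def perceptual(x, y):
    # ImageNet normalization is skipped here for brevity.
    with torch.no_grad():
        fy = vgg_features(y)
    return F.l1_loss(vgg_features(x), fy)

def g1_loss(D1, I_pre_target, I_real, I_N, I_recon, lam1, lam2):
    d_out = D1(I_pre_target)
    adv   = F.binary_cross_entropy(d_out, torch.ones_like(d_out))            # Eq. (5), assumed BCE form
    sim   = 1.0 - ssim((I_pre_target + 1) / 2, (I_real + 1) / 2, data_range=1.0)  # Eq. (6)
    pix   = F.l1_loss(I_pre_target, I_real)                                  # Eq. (7), MAE pixel loss
    per   = perceptual(I_pre_target, I_real)                                 # Eq. (8)
    recon = F.l1_loss(I_recon, I_N)                                          # Eq. (9)
    return adv + lam1 * sim + pix + lam2 * per + recon                       # Eq. (4), weighting assumed
```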
the second stage, setting up a detail generation network FineGAN, including the fifth to seventh steps:
fifth, generating a local mask vector adapted to the individual:
the feature points obtained for each face in the first step are used to locate the eye region I_eye and the mouth region I_mouth, and an eye mask vector M_eye and a mouth mask vector M_mouth are set respectively; taking the eyes as an example, the eye mask vector M_eye is formed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0, and the mouth mask vector M_mouth is formed in the same way as the eye mask vector M_eye;
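A possible way to rasterize the eye mask vector M_eye and the mouth mask vector M_mouth from the 68 Dlib landmarks is sketched below; filling the convex hull of each landmark group is an assumption about how the regions are turned into 0/1 masks.

```python
# Build per-person eye and mouth masks from the 68 landmarks (dlib indexing:
# eyes 36-47, mouth 48-67); region pixels are set to 1, all other pixels to 0.
import cv2
import numpy as np

def region_mask(landmarks, indices, shape_hw):
    mask = np.zeros(shape_hw, dtype=np.uint8)
    pts = landmarks[indices].astype(np.int32)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 1)   # fill the convex hull with 1
    return mask.astype(np.float32)

def eye_and_mouth_masks(landmarks, shape_hw):
    M_eye = np.maximum(region_mask(landmarks, list(range(36, 42)), shape_hw),
                       region_mask(landmarks, list(range(42, 48)), shape_hw))
    M_mouth = region_mask(landmarks, list(range(48, 68)), shape_hw)
    return M_eye, M_mouth
```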
step six, inputting the first-stage prediction image into the second-stage network for detail optimization:
the detail generation network FineGAN comprises a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N of the second step are input to the generator G_2, which generates a second-stage prediction image I_target with more facial details; the second-stage prediction image I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on the second-stage prediction image I_target so that it is as close as possible to the target real image I_real, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further optimize the eye and mouth regions of the second-stage prediction image I_target so that it is more realistic; the formula for the second-stage prediction image I_target is expressed as:
I_target = G_2(I_pre-target, I_N) (13);
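The sketch below shows how the second-stage forward pass and the three discriminators could be wired together; the architectures of G_2 and the discriminators, and the dummy masks, are placeholders chosen for illustration only.

```python
# Second-stage wiring: G2 refines the first-stage prediction; the result is judged by a
# global discriminator and by two mask-based local discriminators (eyes and mouth).
import torch
import torch.nn as nn

class FineGANGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                       # takes [I_pre_target, I_N] on the channel axis
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, I_pre_target, I_N):
        return self.net(torch.cat([I_pre_target, I_N], dim=1))   # Eq. (13)

def patch_discriminator(in_ch=3):
    return nn.Sequential(nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                         nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
                         nn.Conv2d(128, 1, 4, 1, 1), nn.Sigmoid())

G2 = FineGANGenerator()
D_global, D_eye, D_mouth = patch_discriminator(), patch_discriminator(), patch_discriminator()

I_pre_target = torch.randn(1, 3, 128, 128)
I_N = torch.randn(1, 3, 128, 128)
M_eye = torch.zeros(1, 1, 128, 128);   M_eye[..., 40:60, 30:100] = 1    # dummy masks for illustration
M_mouth = torch.zeros(1, 1, 128, 128); M_mouth[..., 85:110, 45:85] = 1

I_target = G2(I_pre_target, I_N)
scores = (D_global(I_target), D_eye(I_target * M_eye), D_mouth(I_target * M_mouth))
```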
seventh, calculating the loss functions in the second-stage FineGAN:
The specific loss function of the generator G_2 is as follows: equation (15) is the adversarial loss, comprising the global adversarial loss and the local adversarial losses, in which the Hadamard product operator is used; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, which compute the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; equation (19) is the local perceptual loss; the total loss function of the generator G_2 is the weighted sum of these loss functions;
The specific loss function of the discriminator D_2 is as follows: equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators, where λ_4 and λ_5 are the weights of the local adversarial losses in the FineGAN generator G_2 loss, λ_6 and λ_7 are respectively the weights of the eye pixel loss L_eye and the mouth pixel loss in the FineGAN generator G_2 loss, λ_8 is the weight of the local perceptual loss in the FineGAN generator G_2 loss, and λ_9 is the weight of the global adversarial loss in the FineGAN discriminator D_2 loss;
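A hedged sketch of the mask-based local losses around equations (16) to (19) follows: the mask is applied by Hadamard product and the L1 norm of the masked difference is taken; the normalization and the masked-input form of the local perceptual loss are assumptions rather than the patent's exact formulas.

```python
# Mask-based local pixel loss and local perceptual loss (assumed forms).
import torch
import torch.nn.functional as F

def local_pixel_loss(I_target, I_real, M):
    # Eqs. (17)/(18): L1 norm of the pixel difference restricted to the masked (eye or mouth) region
    return torch.abs((I_target - I_real) * M).sum() / (M.sum() * I_target.size(1) + 1e-8)

def local_perceptual_loss(I_target, I_real, M, feat_extractor):
    # Eq. (19): perceptual distance between the masked regions (assumed form)
    return F.l1_loss(feat_extractor(I_target * M), feat_extractor(I_real * M))
```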
eighth step, synthesizing video:
each frame is generated independently; therefore, after the n frame images (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, the video frame sequence is synthesized into the final face animation;
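The eighth step can be implemented, for example, by writing the generated frames to a video file as sketched below; the codec, frame rate, and tensor-to-image conversion are assumptions.

```python
# Stack the independently generated frames into a video file.
import cv2
import numpy as np

def tensor_to_bgr(t):                      # convert a [-1,1] CHW tensor to a uint8 BGR frame
    img = ((t.permute(1, 2, 0).cpu().numpy() + 1.0) * 127.5).clip(0, 255).astype(np.uint8)
    return cv2.cvtColor(img, cv2.COLOR_RGB2BGR)

def frames_to_video(frames, path="animation.mp4", fps=25):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for f in frames:                       # frames: list of HxWx3 uint8 BGR images
        writer.write(f)
    writer.release()
```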
thus, the two-stage expression animation generation based on dual generative adversarial networks is completed: the expression in the face image is converted and the image details are optimized.
In particular, the identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3 convolution blocks; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the final convolution block; the decoder Dec_1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3 deconvolution blocks. The encoder and decoder of the network are connected with skip connections: the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec_1, the output of layer 2 of Enc_id is connected to the input of the second-to-last layer of Dec_1, and the output of layer 3 of Enc_id is connected to the input of the third-to-last layer of Dec_1. The CBAM attention modules are added so that the network learns to focus on the more important regions of the image, while the skip connections combine the higher and lower layers of the network so that it can learn low-level details such as facial texture.
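A compact sketch of this wiring is given below: a CBAM block (channel attention followed by spatial attention) inside the first three convolution blocks of Enc_id, and the outputs of layers 1-3 collected for the skip connections to Dec_1; the channel widths and the reduction ratio are illustrative assumptions.

```python
# CBAM attention block and an identity encoder that exposes skip features for Dec_1.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, ch, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // r, ch, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca                                                   # channel attention
        sa = torch.sigmoid(self.spatial(torch.cat([x.mean(1, keepdim=True),
                                                   x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                                # spatial attention

class EncIdWithSkips(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 256]
        self.blocks = nn.ModuleList()
        for i in range(4):                                           # 4 convolution blocks
            layers = [nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                      nn.ReLU(inplace=True)]
            if i < 3:
                layers.append(CBAM(chans[i + 1]))                    # CBAM in the first 3 blocks
            self.blocks.append(nn.Sequential(*layers))

    def forward(self, x):
        skips = []
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i < 3:
                skips.append(x)   # layer 1-3 outputs, concatenated to the last three Dec_1 blocks
        return x, skips
```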
In the two-stage expression animation generation method based on dual generative adversarial networks, GAN is the English abbreviation of the generative adversarial network model, whose full name is Generative Adversarial Networks, an algorithm well known in the technical field; the Dlib library is a publicly available library.
The beneficial effects of the application are as follows: compared with the prior art, the method has the following advantages.
the remarkable progress of the application is as follows:
(1) Compared with CN108288072A, the proposed detail generation network ensures fine-grained generation of the facial animation and places emphasis on optimizing the two important regions of the mouth and the eyes, so the generated result is more vivid and natural.
(2) Compared with CN110084121A, the expression profile is used to supervise the learning process of the FaceGAN network, so the network can learn the continuous presentation of an expression, learn the degree of the emotion, and generate a smooth facial animation.
(3) Compared with CN105069830A, the target expression profile is used to guide the network to learn the presentation of the target expression; the expression type is not restricted, and an expression animation of any emotion required by the user can be generated.
(4) Compared with CN107944358A, the model is trained with the ring-shaped network structure of CycleGAN, and skip connections are added in FaceGAN, so as to ensure that the identity information of the generated face is consistent with the source face.
(5) By arranging the global discriminator, the local discriminators, and the local loss functions (equations (17) and (18)), the method ensures the realism of the whole generated image and finely generates the two important regions of the eyes and the mouth.
(6) By adding the attention module in FaceGAN and the second-stage detail generation network, the method ensures local detail generation and fine-grained expression of the image.
The outstanding essential characteristics of the application are as follows:
1) The application proposes two-stage generative adversarial networks for expression animation generation: the first stage performs expression conversion and the second stage optimizes image details. A mask-based local loss function is proposed, in which designated regions of the image are extracted through mask vectors for emphasized optimization; combined with the use of local discriminators, the important parts are generated with better quality.
2) In the application, each frame image of the video sequence is generated from the neutral image, and the video frame sequence is generated in a non-recursive way, which avoids error propagation, i.e., the problem that errors produced in earlier frames are passed on to later frames so that the generation quality of later frames becomes progressively worse. On the other hand, this input mode makes the network learn larger changes from the neutral expression to other expressions, which increases the difficulty of model training. Therefore, after the first-stage network generates the prediction image, the prediction image is fed into the network again and the source input image is reconstructed following the ring-shaped network idea of CycleGAN, which forces the network to retain the identity features without increasing the number of model parameters; the loss functions include the adversarial loss, the SSIM similarity loss, the pixel loss, the VGG perceptual loss, and the reconstruction loss. The second-stage network of the application includes a generator, a global discriminator, and two local discriminators, with mask-based local discriminators and local loss functions added.
3) In FaceGAN, the method uses the CycleGAN idea of feeding the expression-converted image back into the network as input and reconstructing the source face image, so the network is forced to keep the identity of the face and change only the expression. Meanwhile, the high-level and low-level features of the FaceGAN network are fused through skip connections, so the network can learn more face identity information from the low-level features. The method can thus realize expression conversion without changing the identity information of the face.
4) The application proposes a detail optimization network FineGAN that focuses on generating image details and emphasizes optimizing the important eye and mouth regions. Appropriate weights are set to balance the pixel loss and the adversarial loss, and a perceptual loss is added to remove artifacts, so that the generated image contains no unreasonable artifacts and the network produces high-quality, lifelike images with rich details that conform to human vision.
5) The method has relatively few network parameters and low space and time complexity; a unified network can learn the transfer of any expression type as well as the continuous change of emotion intensity, so the method has a good application prospect.
Drawings
The application will be further described with reference to the drawings and examples.
Fig. 1 is a schematic block flow diagram of the method of the present application.
In fig. 2, the odd rows are schematic diagrams of the facial feature points of the method of the application, and the even rows are the facial expression contour maps.
Fig. 3 is a schematic mask diagram of the present application, wherein the first row is a facial region image extracted from a preprocessed raw data set, the second and fourth rows are visualizations of an eye mask vector and a mouth mask vector, respectively, and the third and fifth rows are partial region images extracted from the source image after the eye mask vector and the mouth mask vector are applied thereto.
FIG. 4 shows 3 experimental results of the application, wherein the odd rows are the inputs to the method of the application, comprising a neutral image of the source face and a contour map sequence of the target expression; the even rows are the experimental results, i.e., the output video frame sequences of the expression animation.
Detailed Description
The embodiment shown in fig. 1 shows that the flow of the two-stage expression animation generation method based on the dual generation countermeasure network of the application is as follows:
A facial expression contour map is obtained for each frame image in the data set; the identity features of the source face and the expression features of the target expression contour map are extracted to initially generate a first-stage prediction image; with the first-stage prediction image as input, the neutral source face image is reconstructed using the CycleGAN idea; the loss functions in the first-stage FaceGAN are calculated; local mask vectors adapted to the individual are generated; the first-stage prediction image is input into the second-stage network for detail optimization; the loss functions in the second-stage FineGAN are calculated; and the video is synthesized.
Example 1
The two-stage expression animation generation method based on dual generative adversarial networks of this embodiment comprises the following specific steps:
the first step, a facial expression profile of each frame of image in a data set is obtained:
collecting a facial expression video sequence data set, extracting the face in each frame image of a video sequence using the Dlib machine learning library, and obtaining 68 feature points in each face (in the expression transfer field, 68 feature points constitute the face contour and the contours of the eyes, mouth and nose; 5 or 81 feature points can also be used), as shown in the odd rows of fig. 2; the feature points are then connected in order with line segments to obtain the expression contour map of each frame of the video sequence, as shown in the even rows of fig. 2, denoted e = (e_1, e_2, ···, e_i, ···, e_n), where e represents the set of all facial expression contour maps in a video sequence, n represents the number of video frames, and e_i represents the facial expression contour map of the i-th frame of a video sequence;
the first stage, setting up an expression migration network FaceGAN, including the second to fourth steps:
secondly, extracting identity features of a source face and expression features of a target expression profile map, and initially generating a first-stage prediction map:
FaceGAN comprises a generator G_1 and a discriminator D_1, where the generator G_1 comprises three sub-networks: two encoders, Enc_id and Enc_exp, and a decoder Dec_1;
firstly, a neutral expressionless image I_N of the source face and a target expression profile sequence e are input; in this embodiment the input is the neutral face of user S010, the target expression profile sequence is the process from an expressionless face to showing a smile, and the expression profile extracted from the neutral expressionless image I_N is denoted e_N, the specific input being shown in the first row of fig. 4; the identity encoder Enc_id is then used to extract the identity feature vector f_id of user S010, while the expression encoder Enc_exp is used to extract the set of expression feature vectors f_exp of the target expression profiles, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n); the formulas are:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
the identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain a feature vector f, with f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec_1 for decoding to generate the first-stage prediction image I_pre-target, with I_pre-target = Dec_1(f); finally, I_pre-target is input to the discriminator D_1 to judge whether the image is real or fake;
thirdly, taking the first-stage predictive image as input, and reconstructing a source face neutral image by adopting the concept of CycleGAN:
the first-stage prediction image I_pre-target and the expression profile e_N extracted from the neutral expressionless image I_N of the second step are used again as the input of FaceGAN, and the operation of the second step is repeated to generate a reconstructed image I_recon of the neutral expression of user S010; the formula for generating I_recon is expressed as:
I_recon = Dec_1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourth, calculating the loss function in the first-stage FaceGAN:
The specific loss function of the generator G_1 in the first-stage FaceGAN is as follows: I_real is the target ground-truth value (i.e., the ground truth, a source face image with the target expression, the real image the model finally aims to predict), here the real image of user S010 smiling; equation (5) is the adversarial loss of the generator; in equation (6) the SSIM(·) function is used to measure the similarity between two images; equation (7) is the pixel loss, in which the mean absolute error function measures the gap between the true value and the predicted value; equation (8) is the perceptual loss, for which VGG-19 is used to extract the perceptual features of an image, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features, and the perceptual loss between the generated image and the real image is computed; equation (9) is the reconstruction loss, which computes the distance between the neutral expressionless source-face image I_N and the reconstructed image I_recon; the loss of the generator G_1 is the weighted sum of the loss functions of these parts;
The specific loss function of the discriminator D_1 in the first-stage FaceGAN is as follows: equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image;
The identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3 convolution blocks; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the final convolution block; the decoder Dec_1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3 deconvolution blocks. The higher and lower layers of the network are combined with skip connections: the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec_1, the output of layer 2 of Enc_id is connected to the input of the second-to-last layer of Dec_1, and the output of layer 3 of Enc_id is connected to the input of the third-to-last layer of Dec_1. All convolution kernel sizes in this patent are 3×3.
The second stage, setting up a detail generation network FineGAN, including the fifth to seventh steps:
fifth, generating a local mask vector adapted to the individual:
the 68 feature points in each face obtained in the first step are used to locate the eye region I_eye and the mouth region I_mouth, and the eye mask vector M_eye and the mouth mask vector M_mouth are set respectively, as shown in the second and fourth rows of fig. 3; taking the eyes as an example, M_eye is formed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0, and the mouth mask vector M_mouth is formed in the same way as M_eye;
step six, inputting the first-stage prediction image into the second-stage network for detail optimization:
FineGAN contains a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N of the second step are input to the generator G_2 to generate the second-stage prediction image I_target of user S010 containing more facial details; I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on the generated I_target so that it is as close as possible to the real image I_real of user S010 smiling, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further emphasize the eye and mouth regions of I_target so that the generated image I_target is more realistic; the formula is described as follows:
I_target = G_2(I_pre-target, I_N) (13);
seventh, calculating the loss functions in the second-stage FineGAN:
The specific loss function of the generator G_2 is as follows: equation (15) is the adversarial loss, comprising the global adversarial loss and the local adversarial losses, in which the Hadamard product operator is used; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, which compute the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; equation (19) is the local perceptual loss; the total generator loss function is the weighted sum of these loss functions;
The specific loss function of the discriminator D_2 is as follows: equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators;
eighth step, synthesizing video:
each frame is generated independently; therefore, after the n frame images (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, an expression gradual-change process from the expressionless face of user S010 to a smile is obtained, and the video frame sequence is synthesized into the facial animation of user S010, as shown in the second row of fig. 4;
thus, the two-stage expression animation generation based on dual generative adversarial networks is completed: the expression in the face image is converted and the image details are optimized.
In this embodiment, the weight parameter settings involved in the steps are shown in Table 1 and give good results on the whole sample database.
Table 1 weight parameter settings for each loss in this embodiment
In the two-stage expression animation generation method based on dual generative adversarial networks, GAN is the English abbreviation of the generative adversarial network model, whose full name is Generative Adversarial Networks, an algorithm well known in the technical field.
Figure 4 shows the results of 3 embodiments of the application. The second row is the generated video frame sequence of user S010 going from a neutral expression to smiling, the fourth row is the generated video frame sequence of user S022 going from a neutral expression to a surprised open mouth, and the sixth row is the generated video frame sequence of user S032 going from a neutral expression to a downturned (pouting) mouth. Fig. 4 shows that the method of the application can complete the transfer of the expression while keeping the identity information of the face, and can generate a continuously and gradually changing video frame sequence to synthesize an animation video with a designated identity and a designated expression.
Matters not described in the application are applicable to the prior art.

Claims (4)

1. A two-stage expression animation generation method based on dual generative adversarial networks, characterized in that, first, in a first stage, an expression migration network FaceGAN is used to extract the expression features in a target expression profile map and transfer them to a source face to generate a first-stage prediction image; in a second stage, a detail generation network FineGAN is used to supplement and enrich the details of the eye and mouth regions, which contribute more to expression change, in the first-stage prediction image, generate a fine-grained second-stage prediction image, and synthesize a face video animation; both the expression migration network FaceGAN and the detail generation network FineGAN are implemented as generative adversarial networks;
the expression migration network FaceGAN includes a generator G 1 And a discriminator D 1 Wherein generator G 1 Comprising three sub-networks, respectively an identity encoder Enc id And an expression encoder Enc exp A decoder Dec 1
The detail generation network FineGAN comprises a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global, an eye local discriminator D_eye, and a mouth local discriminator D_mouth;
the method comprises the following specific steps:
the first step, a facial expression profile of each frame of image in a data set is obtained:
collecting a facial expression video sequence data set, extracting the face in each frame image of a video sequence using the Dlib machine learning library, obtaining a number of feature points in each face, and connecting the feature points in order with line segments to obtain the expression profile of each frame of the video sequence, denoted e = (e_1, e_2, ···, e_i, ···, e_n), where e represents the set of all expression profiles in a video sequence, i.e., the expression profile sequence, n represents the number of video frames, and e_i represents the expression profile of the i-th frame of a given video sequence;
the first stage, setting up an expression migration network FaceGAN, including the second to fourth steps:
secondly, extracting identity features of a source face and expression features of a target expression profile map, and initially generating a first-stage prediction map:
the expression migration network FaceGAN includes a generator G_1 and a discriminator D_1, where the generator G_1 comprises three sub-networks: two encoders, Enc_id and Enc_exp, and a decoder Dec_1;
firstly, a neutral expressionless image I_N of the source face and a target expression profile sequence e are input; the identity encoder Enc_id is then used to extract the identity feature vector f_id of the neutral expressionless image I_N of the source face, while the expression encoder Enc_exp is used to extract the set of expression feature vectors f_exp of the target expression profile sequence e, where f_exp = (f_exp_1, f_exp_2, ···, f_exp_i, ···, f_exp_n); the formulas are:
f_id = Enc_id(I_N) (1),
f_exp_i = Enc_exp(e_i) (2),
the identity feature vector f_id and the expression feature vector f_exp_i of the i-th frame are concatenated to obtain a feature vector f, with f = f_id + f_exp_i; the feature vector f is fed to the decoder Dec_1 for decoding to generate the first-stage prediction image I_pre-target, with I_pre-target = Dec_1(f); finally, I_pre-target is input to the discriminator D_1 to judge whether the image is real or fake;
thirdly, taking the first-stage predictive image as input, and reconstructing a source face neutral image by adopting the concept of CycleGAN:
the first-stage prediction image I_pre-target and the expression profile e_N corresponding to the neutral expressionless image I_N of the second step are used again as input to the expression migration network FaceGAN: the identity encoder Enc_id extracts the identity features of the image I_pre-target while the expression encoder Enc_exp extracts the expression features of the expression profile e_N, and the second step is repeated so that the decoder decodes and generates a reconstructed image I_recon of I_N; the formula for generating the reconstructed image I_recon is expressed as:
I_recon = Dec_1(Enc_id(I_pre-target) + Enc_exp(e_N)) (3);
fourth, calculating a loss function in the expression migration network FaceGAN in the first stage:
The specific loss function of the generator G_1 in the first-stage expression migration network FaceGAN is as follows: I_real is the target ground-truth value; equation (5) is the adversarial loss of the generator, with D_1(·) denoting the probability that the discriminator D_1 judges its input to be real; in equation (6) the SSIM(·) function is used to measure the similarity between two images; equation (7) is the pixel loss, in which the MAE(·) function (mean absolute error) measures the gap between the true value and the predicted value; equation (8) is the perceptual loss, for which VGG-19 is used to extract the perceptual features of an image, the features output by the last convolution layer of the VGG-19 network serve as the perceptual features, and the perceptual loss between the generated image and the real image is computed; equation (9) is the reconstruction loss, which computes the distance between the neutral expressionless source-face image I_N and its reconstructed image I_recon;
The specific loss function of the discriminator D_1 in the first-stage expression migration network FaceGAN is as follows: equation (11) is the adversarial loss, and equation (12) is the adversarial loss of the reconstructed image, where λ_1 and λ_2 are the weights of the similarity loss and the perceptual loss in the FaceGAN generator G_1 loss, and λ_3 is the weight of the reconstruction adversarial loss in the FaceGAN discriminator loss;
the second stage, setting up a detail generation network FineGAN, including the fifth to seventh steps:
fifth, generating a local mask vector adapted to the individual:
the feature points obtained for each face in the first step are used to locate the eye region I_eye and the mouth region I_mouth, and an eye mask vector M_eye and a mouth mask vector M_mouth are set respectively; taking the eyes as an example, the eye mask vector M_eye is formed by setting the pixel values of the eye region in the image to 1 and the pixel values of all other regions to 0, and the mouth mask vector M_mouth is formed in the same way as the eye mask vector M_eye;
step six, inputting the first-stage prediction image into the second-stage network for detail optimization:
the detail generation network FineGAN comprises a generator G_2 and a discriminator D_2, where D_2 is composed of a global discriminator D_global and two local discriminators, D_eye and D_mouth;
the first-stage prediction image I_pre-target and the neutral expressionless image I_N of the second step are input to the generator G_2, which generates a second-stage prediction image I_target with more facial details; the second-stage prediction image I_target is then input to the three discriminators simultaneously: the global discriminator D_global performs global discrimination on the second-stage prediction image I_target so that it is as close as possible to the target real image I_real, while the eye local discriminator D_eye and the mouth local discriminator D_mouth further optimize the eye and mouth regions of the second-stage prediction image I_target so that it is more realistic; the formula for the second-stage prediction image I_target is expressed as:
I_target = G_2(I_pre-target, I_N) (13);
seventh, calculating the loss functions in the second-stage FineGAN:
The specific loss function of the generator G_2 is as follows: equation (15) is the adversarial loss, comprising the global adversarial loss and the local adversarial losses, in which the Hadamard product operator is used; equation (16) is the pixel loss; equations (17) and (18) are the local pixel losses, which compute the L1 norm of the pixel difference between a local region of the generated image and the corresponding local region of the real image; equation (19) is the local perceptual loss; the total loss function of the generator G_2 is the weighted sum of these loss functions;
The specific loss function of the discriminator D_2 is as follows: equation (21) is the adversarial loss of the global discriminator, and equations (22) and (23) are the adversarial losses of the local discriminators, where λ_4 and λ_5 are the weights of the local adversarial losses in the FineGAN generator G_2 loss, λ_6 and λ_7 are respectively the weights of the eye pixel loss and the mouth pixel loss in the FineGAN generator G_2 loss, λ_8 is the weight of the local perceptual loss in the FineGAN generator G_2 loss, and λ_9 is the weight of the global adversarial loss in the FineGAN discriminator D_2 loss;
eighth step, synthesizing video:
each frame is generated independently; therefore, after the n frame images (I_target_1, I_target_2, ···, I_target_i, ···, I_target_n) have been generated, the video frame sequence is synthesized into the final face animation;
thus, the two-stage expression animation generation based on dual generative adversarial networks is completed: the expression in the face image is converted and the image details are optimized.
2. The method of claim 1, wherein the identity encoder Enc_id comprises 4 convolution blocks, with a CBAM attention module added to the first 3 convolution blocks; the expression encoder Enc_exp comprises 3 convolution blocks, with a CBAM attention module added to the final convolution block; the decoder Dec_1 comprises 4 deconvolution blocks, with a CBAM attention module added to the first 3 deconvolution blocks; and the higher and lower layers of the network are combined with skip connections, namely, the output of layer 1 of the identity encoder Enc_id is connected to the input of the last layer of the decoder Dec_1, the output of layer 2 of Enc_id is connected to the input of the second-to-last layer of Dec_1, and the output of layer 3 of Enc_id is connected to the input of the third-to-last layer of Dec_1.
3. The generating method according to claim 1, wherein the weight parameter of each loss is set as:
4. The method according to claim 1, wherein the number of feature points obtained in the first step is 68, and the 68 feature points constitute the face contour and the contours of the eyes, mouth and nose.
CN202010621885.2A 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks Active CN111783658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010621885.2A CN111783658B (en) 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010621885.2A CN111783658B (en) 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks

Publications (2)

Publication Number Publication Date
CN111783658A CN111783658A (en) 2020-10-16
CN111783658B true CN111783658B (en) 2023-08-25

Family

ID=72761358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010621885.2A Active CN111783658B (en) 2020-07-01 2020-07-01 Two-stage expression animation generation method based on dual generative adversarial networks

Country Status (1)

Country Link
CN (1) CN111783658B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541477B (en) 2020-12-24 2024-05-31 北京百度网讯科技有限公司 Expression pack generation method and device, electronic equipment and storage medium
CN113033288B (en) * 2021-01-29 2022-06-24 浙江大学 Method for generating front face picture based on side face picture for generating confrontation network
CN113343761A (en) * 2021-05-06 2021-09-03 武汉理工大学 Real-time facial expression migration method based on generation confrontation
CN113326934B (en) * 2021-05-31 2024-03-29 上海哔哩哔哩科技有限公司 Training method of neural network, method and device for generating images and videos
US11900519B2 (en) * 2021-11-17 2024-02-13 Adobe Inc. Disentangling latent representations for image reenactment
CN115100329B (en) * 2022-06-27 2023-04-07 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN115311261A (en) * 2022-10-08 2022-11-08 石家庄铁道大学 Method and system for detecting abnormality of cotter pin of suspension device of high-speed railway contact network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002304638A (en) * 2001-04-03 2002-10-18 Atr Ningen Joho Tsushin Kenkyusho:Kk Device and method for generating expression animation
WO2019228317A1 (en) * 2018-05-28 2019-12-05 华为技术有限公司 Face recognition method and device, and computer readable medium
CN110689480A (en) * 2019-09-27 2020-01-14 腾讯科技(深圳)有限公司 Image transformation method and device
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002304638A (en) * 2001-04-03 2002-10-18 Atr Ningen Joho Tsushin Kenkyusho:Kk Device and method for generating expression animation
WO2019228317A1 (en) * 2018-05-28 2019-12-05 华为技术有限公司 Face recognition method and device, and computer readable medium
CN110689480A (en) * 2019-09-27 2020-01-14 腾讯科技(深圳)有限公司 Image transformation method and device
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Facial expression transfer model based on conditional generative adversarial networks (基于条件生成式对抗网络的面部表情迁移模型); 陈军波; 刘蓉; 刘明; 冯杨; Computer Engineering (计算机工程), No. 04; full text *

Also Published As

Publication number Publication date
CN111783658A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111783658B (en) Two-stage expression animation generation method based on dual generative adversarial networks
Zhou et al. Photorealistic facial expression synthesis by the conditional difference adversarial autoencoder
CN111080511B (en) End-to-end face exchange method for high-resolution multi-feature extraction
CN112149459B (en) Video saliency object detection model and system based on cross attention mechanism
CN111275518A (en) Video virtual fitting method and device based on mixed optical flow
KR102602112B1 (en) Data processing method, device, and medium for generating facial images
CN113807265B (en) Diversified human face image synthesis method and system
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN115170559A (en) Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
CN115359534B (en) Micro-expression identification method based on multi-feature fusion and double-flow network
CN116071494A (en) High-fidelity three-dimensional face reconstruction and generation method based on implicit nerve function
CN113362422A (en) Shadow robust makeup transfer system and method based on decoupling representation
Zhou et al. Generative adversarial network for text-to-face synthesis and manipulation with pretrained BERT model
Huang et al. IA-FaceS: A bidirectional method for semantic face editing
CN114549387A (en) Face image highlight removal method based on pseudo label
Liu et al. WSDS-GAN: A weak-strong dual supervised learning method for underwater image enhancement
CN117689592A (en) Underwater image enhancement method based on cascade self-adaptive network
Otto et al. A perceptual shape loss for monocular 3D face reconstruction
CN112686830A (en) Super-resolution method of single depth map based on image decomposition
Fan et al. Facial Expression Transfer Based on Conditional Generative Adversarial Networks
CN115527275A (en) Behavior identification method based on P2CS _3DNet
CN115578298A (en) Depth portrait video synthesis method based on content perception
CN111767842B (en) Micro-expression type discrimination method based on transfer learning and self-encoder data enhancement
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant