CN115100329B - Multi-mode driving-based emotion controllable facial animation generation method - Google Patents

Multi-mode driving-based emotion controllable facial animation generation method

Info

Publication number
CN115100329B
CN115100329B
Authority
CN
China
Prior art keywords
coordinate
emotion
facial
audio
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210744504.9A
Other languages
Chinese (zh)
Other versions
CN115100329A (en)
Inventor
李瑶
赵子康
李峰
郭浩
杨艳丽
程忱
曹锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202210744504.9A priority Critical patent/CN115100329B/en
Publication of CN115100329A publication Critical patent/CN115100329A/en
Application granted granted Critical
Publication of CN115100329B publication Critical patent/CN115100329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to image processing technology, and in particular to a multi-modal-driving-based method for generating facial animation with controllable emotion. Step S1: preprocess the images of a portrait video to obtain a facial 3D feature coordinate sequence; step S2: preprocess the audio of the portrait video and decouple it into an audio content vector and an audio style vector; step S3: train a facial lip-sound coordinate animation generation network, composed of a multi-layer perceptron and a long short-term memory network, based on the facial 3D feature coordinate sequence and the audio content vector. The invention introduces an emotion portrait as the emotion source and remodels the emotion of the target portrait through the joint driving of the emotion-source portrait and the audio, providing diversified emotional facial animation. Under multi-modal driving, the method avoids the low robustness of audio as a single driving source, removes the dependence of emotion generation on emotional speech recognition, enhances the complementarity among data, and achieves more realistic emotional expression in the generated facial animation.

Description

Multi-mode driving-based emotion controllable facial animation generation method
Technical Field
The invention relates to image processing technology, and in particular to an emotion-controllable facial animation generation method based on multi-modal driving.
Background
Facial animation generation is a popular research area in computer vision generative models. Its purpose is to transform a still portrait into a realistic facial animation driven by arbitrary audio. It has broad applications in assistive treatment systems for visual and hearing impairments, virtual anchors, games with customizable characters, and similar fields. However, owing to the limitations of their principles and characteristics, existing facial animation generation methods still lack maturity in the emotional aspect of the generated portrait animation, which seriously limits its application value.
In recent years, many studies in facial animation generation have achieved realistic lip movement and head pose swing, but have paid little attention to portrait emotion, which is also an important factor. Portrait emotional information strongly influences the expressiveness of synthesized facial animation: different facial expressions often give the same sentence different emotional colors, and perceiving emotional information in the visual modality is one of the important channels of human audiovisual speech communication. However, most facial animation generation methods use audio as a single-modality driving source, which performs well for the lip movement of generated syllables but relatively poorly for generating facial expressions. The reason is that direct audio driving is affected by the complexity of audio emotion and by noise, so the generated facial expressions often show ghosting and distortion, resulting in poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by the accuracy of emotional speech recognition, so their efficiency is low and the emotion in the generated facial video lacks diversity and naturalness.
Disclosure of Invention
To solve the problem that existing facial animation generation methods lack emotion regulation capability, the invention provides an emotion-controllable facial animation generation method based on multi-modal driving.
The invention is realized by adopting the following technical scheme:
the method for generating the emotion controllable facial animation based on multi-mode driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
And step S3: and training a face lip voice coordinate animation generation network consisting of a Multi-Layer Perceptron (MLP) and a Long-Short-Term Memory (LSTM) network based on the face coordinate sequence obtained in the step S1 and the audio content vector obtained in the step S2.
And step S4: and training a facial emotion coordinate animation generation network consisting of MLP, LSTM, self-attention mechanism (Self-attention mechanism) and generation countermeasure network (GAN) based on the facial coordinate sequence obtained in the step S1 and the audio content vector and the audio style vector obtained in the step S2.
Step S5: and training a coordinate-to-video network consisting of GANs based on the face coordinate sequence obtained in the step S1.
Step S6: and (4) inputting any two portrait pictures (one representing identity source and one representing emotion source) and any one section of audio based on the facial lip sound coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, and generating a lip sound synchronous video of the target portrait with the emotion corresponding to the emotion source.
The emotion-controllable facial animation generation method based on multi-modal driving uses computer vision generative models and deep neural network models as technical support to construct the emotion-controllable facial animation generation network.
The invention has the following beneficial effects. Compared with existing facial animation generation methods, the method addresses the ghosting and distortion of facial expressions caused by relying on audio features alone, as well as the limited accuracy of emotional speech recognition. It introduces an emotion portrait as the emotion source, remodels the emotion of the target portrait through multi-modal driving by the emotion-source portrait features and the audio features, and generates facial animation with controllable emotion. The dual driving by the emotion portrait and the audio avoids the dependence of emotion generation on speech information alone, so the generated video has controllable emotion while meeting the requirements of lip-sound synchronization and spontaneous head swing; that is, it ensures the diversity and naturalness of the facial animation and achieves more realistic emotional expression.
The method effectively solves the inefficiency of existing facial animation generation methods, whose facial expressions are limited by the accuracy of speech emotion recognition, and can be used in assistive treatment systems for visual and hearing impairments, virtual anchors, games with customizable characters, and other fields.
Drawings
FIG. 1 is a schematic diagram of a multi-modal driven emotion controllable facial animation generation structure according to an embodiment of the invention.
Fig. 2 is a schematic diagram comparing the present invention with a conventional face animation method.
Fig. 3 is a sample video schematic of an embodiment of the invention.
Detailed Description
In this embodiment, the portrait video data set used is the public Multi-view Emotional Audio-visual Dataset (MEAD).
As shown in FIG. 1, the method for generating emotion controllable facial animation based on multi-modal driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used for obtaining a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
And step S3: and training a face lip sound coordinate animation generation network consisting of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network based on the face coordinate sequence obtained in the step S1 and the audio content vector obtained in the step S2.
And step S4: and training a face emotion coordinate animation generation network consisting of MLP, LSTM, self-attention mechanism (Self-attention mechanism) and generation countermeasure network (GAN) based on the face coordinate sequence obtained in the step S1 and the audio content vector and style vector obtained in the step S2.
Step S5: and training a coordinate-to-video network consisting of GANs based on the face coordinate sequence obtained in the step S1. During this step of training, a loss function is used to calculate the minimum distance in pixels between the reconstructed face and the training target face.
Step S6: and (4) inputting any two portrait pictures (one representing identity source and one representing emotion source) and a section of any audio based on the facial lip sound coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, and generating a lip sound synchronous video of the target portrait with the emotion corresponding to the emotion source.
In step S1, the images of the portrait video are preprocessed; the preprocessing comprises frame rate conversion, image resampling, and facial coordinate extraction.
First, the video is converted to a frame rate of 62.5 frames per second. The frames are then resampled and cropped into 256 × 256 videos containing the face. Finally, facial coordinates are extracted with the face recognition algorithm face-alignment, and the 3D coordinates of the face in each frame (dimension 68 × 3) are obtained to form the facial 3D feature coordinate sequence.
In addition, the facial 3D feature coordinate sequence is saved as an emotion-source portrait coordinate sequence (emotion-source facial coordinates) and an identity-source portrait coordinate sequence (identity-source facial coordinates). Compared with raw portrait pixels, facial coordinates provide a natural low-dimensional representation of the portrait and a high-quality bridge to the downstream emotion reenactment task.
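For illustration, a minimal Python sketch of this preprocessing step is given below. It assumes the open-source face_alignment package and OpenCV; the file names and the exact cropping strategy are hypothetical, and the patent does not prescribe these tools in code.

```python
# Illustrative sketch of step S1 (assumed tools: face_alignment + OpenCV;
# paths and the resampling strategy are hypothetical).
import cv2
import numpy as np
import face_alignment

# 3D landmark detector returning 68 x 3 points per face. Note: the enum name
# differs across face_alignment versions (_3D in older releases, THREE_D in newer ones).
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.THREE_D, device="cpu")

def extract_coordinate_sequence(video_path):
    """Return an array of shape (num_frames, 68, 3) of facial 3D landmarks."""
    cap = cv2.VideoCapture(video_path)   # frame-rate conversion to 62.5 fps would
    coords = []                          # normally be done beforehand, e.g. with ffmpeg
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (256, 256))          # resample / crop to 256 x 256
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        landmarks = fa.get_landmarks(rgb)              # list of (68, 3) arrays, or None
        if landmarks:
            coords.append(landmarks[0])
    cap.release()
    return np.stack(coords)

identity_coords = extract_coordinate_sequence("identity_source.mp4")
emotion_coords = extract_coordinate_sequence("emotion_source.mp4")
```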
In step S2, the audio of the portrait video is preprocessed; the preprocessing comprises sampling-rate conversion, audio vector extraction, and audio vector decoupling.
The audio is first converted to a sampling rate of 16000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group). Audio vector extraction is then performed, with the audio vector obtained using the Python Resemblyzer library. Finally, the audio vector is fed into the voice conversion model AutoVC, which after decoupling yields an audio content vector independent of the speaker and an audio style vector dependent on the speaker.
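A compact sketch of this audio pipeline is shown below. Resemblyzer is a real library with the calls used here; the AutoVC content-encoder call is a hypothetical placeholder, since the patent only names the model and the released AutoVC code is loaded from its own checkpoint.

```python
# Illustrative sketch of step S2. The ffmpeg call and Resemblyzer usage are real;
# the AutoVC wrapper is a hypothetical stand-in for the voice conversion model.
import subprocess
from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

# Resample the extracted audio to 16 kHz mono with ffmpeg.
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-ar", "16000", "-ac", "1", "audio.wav"],
               check=True)

wav = preprocess_wav(Path("audio.wav"))               # load and normalize the waveform
speaker_encoder = VoiceEncoder()
style_vector = speaker_encoder.embed_utterance(wav)   # speaker-dependent style vector

def autovc_content_encoder(wav, style_vector):
    """Hypothetical stand-in for AutoVC's content encoder (not a real API)."""
    raise NotImplementedError("Load the released AutoVC checkpoint and run its content encoder here.")

# content_vectors = autovc_content_encoder(wav, style_vector)  # speaker-independent content
```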
Step S3 completes the training of the facial lip-sound coordinate animation generation network.
The network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder composed of two MLP layers and a speech content encoder composed of three LSTM layers, and the decoder is a facial lip-sound coordinate decoder composed of three MLP layers. To generate an optimal sequence of facial lip-sound coordinate offsets, the network uses a loss function to continuously adjust its weights and biases until the error between the predicted coordinates and the reference coordinates is minimized.
The custom encoder-decoder network structure is as follows:
firstly, two-layer MLP is used for extracting the identity feature of the face 3D feature coordinate sequence (namely the first time point of the face 3D feature coordinate sequence) of the first frame of the video obtained in the step S1. And then, based on the identity characteristics and the audio content vector obtained in the step S2, carrying out linear fusion and extracting the dependency relationship between audio continuous syllables and lip coordinates by using the LSTM of the three-layer unit. Then, based on the output of the encoder in the step, a decoder composed of three layers of MLPs is used for predicting a facial lip voice coordinate offset sequence, and the specific calculation formula is as follows:
ΔP t =MLP c (LSTM c (Ec t→t+λ ,MLP L (L;W mlp,l )W lstm );W mlp,c ) (1)
in the formula (1), Δ P t Indicating the predicted lip sound coordinate offset of the t frame face, wherein t represents the current frame of the portrait video; MLP L Representing a face coordinate encoder, L being the face coordinates of the first frame of the portrait video, W mlp,l Representing facial coordinate encoder learnable parameters; LSTM c Representing phonetic content compilationsA coder, ec representing the audio content vector, t → t + λ representing that the audio content vector is input to the speech content coder in a batch size of λ =18 per frame t, W lstm Representing speech content encoder learnable parameters; MLP c Coordinate decoder for lip voice of face, W mlp,c The coordinate decoder for lip sound on the face can learn the parameters.
The first-frame coordinates of the portrait video are corrected with the predicted facial lip-sound coordinate offset sequence to obtain the lip-sound-synchronized coordinate sequence, calculated as follows:
P_t = L + ΔP_t    (2)
In formula (2), P_t denotes the lip-sound-synchronized facial coordinates of the t-th frame, where t is the current frame of the portrait video; L denotes the facial coordinates of the first frame of the portrait video, and ΔP_t denotes the predicted facial lip-sound coordinate offset of the t-th frame.
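For readers who want a concrete picture of this encoder-decoder, the following PyTorch sketch is a minimal interpretation of formulas (1) and (2). Only the layer counts follow the patent text; the hidden sizes, content-vector dimension, and class name are assumptions.

```python
# Minimal PyTorch sketch of the facial lip-sound coordinate animation network
# (formulas (1) and (2)). Hidden sizes and the content-vector dimension are assumed.
import torch
import torch.nn as nn

class LipSoundCoordinateNet(nn.Module):
    def __init__(self, content_dim=256, hidden=256):
        super().__init__()
        # facial coordinate encoder MLP_L: two MLP layers on the 68x3 first-frame coordinates
        self.face_encoder = nn.Sequential(
            nn.Linear(68 * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # speech content encoder LSTM_c: three LSTM layers over the content vectors Ec
        self.content_encoder = nn.LSTM(content_dim + hidden, hidden,
                                       num_layers=3, batch_first=True)
        # facial lip-sound coordinate decoder MLP_c: three MLP layers
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 68 * 3))

    def forward(self, first_frame_coords, content_vectors):
        # first_frame_coords: (B, 68, 3); content_vectors: (B, T, content_dim)
        identity = self.face_encoder(first_frame_coords.flatten(1))      # MLP_L(L)
        fused = torch.cat([content_vectors,
                           identity.unsqueeze(1).expand(-1, content_vectors.size(1), -1)],
                          dim=-1)                                        # linear fusion
        hidden_seq, _ = self.content_encoder(fused)                      # LSTM_c(...)
        delta_p = self.decoder(hidden_seq).view(-1, content_vectors.size(1), 68, 3)
        return first_frame_coords.unsqueeze(1) + delta_p                 # P_t = L + ΔP_t
```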
In order to generate an optimal sequence of facial lip-sound coordinate offsets, a loss function is set, based on the encoder-decoder structure of the facial lip-sound coordinate animation generation network, to adjust the weights and biases of the network. The objective of the loss function is to minimize the error between the predicted coordinates and the coordinates obtained in step S1, calculated as follows:
L_lip = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖²    (3)
In formula (3), L_lip denotes the loss function of the facial lip-sound coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; P_{i,t} denotes the i-th predicted facial coordinate of the t-th frame, P̂_{i,t} denotes the i-th facial coordinate of the t-th frame obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖² denotes the squared Euclidean norm of their difference.
When the loss function levels off, i.e., when L_lip reaches its minimum, training of the facial lip-sound coordinate animation generation network is complete.
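A compact sketch of how this coordinate loss could drive a training step, continuing the hypothetical LipSoundCoordinateNet above, is shown below; the optimizer and learning rate are assumptions.

```python
# Hypothetical training step for formula (3): squared Euclidean distances between
# predicted and reference coordinates (optimizer choice is an assumption).
import torch

model = LipSoundCoordinateNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(first_frame_coords, content_vectors, reference_coords):
    pred = model(first_frame_coords, content_vectors)          # (B, T, 68, 3)
    loss = ((pred - reference_coords) ** 2).sum(dim=-1).sum()  # Σ_t Σ_i ||P - P̂||²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```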
Step S4 completes the training of the facial emotion coordinate animation generation network, which adds rich visual emotional expression to the generated video.
Humans rely on visual information when interpreting emotion, and rich visual emotional expression gives a stronger sense of realism and greater practicality. Most existing facial animation generation algorithms focus on expressing lip movement and head pose swing from the audio modality alone. Audio-only driving works well for the lip movement of generated syllables but relatively poorly for facial expressions, because direct audio driving is affected by the complexity of audio emotion and by noise: the generated facial expressions often show ghosting and distortion, resulting in poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by its accuracy, so their efficiency is low and the emotion in the generated facial video lacks diversity and naturalness.
This patent therefore proposes a multi-modally driven facial emotion coordinate animation generation network: an emotion portrait is introduced as the emotion source and, driven jointly with the audio features, achieves more accurate emotion remodeling of the target portrait.
The network is a custom encoder-decoder structure; the encoder comprises an audio encoder and facial coordinate encoders, and the decoder comprises a coordinate decoder. The encoder obtains the audio features, the portrait identity features, and the portrait emotion features. The decoder processes these multi-modal features, which are jointly driven by the audio features and the portrait emotion features, and generates the coordinate offset sequence after the emotion of the target portrait is remodeled, adding rich visual emotional expression to the video. Under multi-modal driving, the method avoids the low robustness of audio as a single driving source, removes the dependence of emotion generation on emotional speech recognition, enhances the complementarity among data, and achieves more realistic emotional expression in the facial animation.
In order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network. The first computes the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1. The second and third are discriminator loss functions that respectively judge whether the generated facial coordinates are real and measure the similarity between interval frames of the facial coordinates.
The custom encoder-decoder structure of the facial emotion coordinate animation generation network is as follows:
The encoder consists of an audio encoder, an identity-source facial coordinate encoder, and an emotion-source facial coordinate encoder. The audio encoder captures audio features through three LSTM layers, three MLP layers, and a self-attention mechanism.
Specifically, the LSTM first extracts features from the audio content vector obtained in step S2; the MLP then extracts features from the audio style vector obtained in step S2; the audio content features and the audio style features are then linearly fused; finally, a self-attention mechanism captures the longer-range structural dependency between the audio content vector and the audio style vector, yielding audio features with stronger temporal dependency, calculated as follows:
S_t = Attn(LSTM_c'(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})    (4)
In formula (4), S_t denotes the processed audio feature of the t-th frame, where t is the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es denotes the audio style vector, and W_{mlp,s} denotes the learnable parameters of the audio style vector encoder; LSTM_c' denotes the audio content vector encoder, Ec denotes the audio content vector, t → t+λ indicates that the audio content vector is fed to the audio content vector encoder in batches of λ = 18 per frame t, and W'_{lstm} denotes the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism, and W_{attn} denotes its learnable parameters.
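A rough illustration of this audio encoder is sketched below, under the assumption that nn.MultiheadAttention can stand in for the patent's self-attention mechanism; all dimensions are invented.

```python
# Sketch of the audio encoder of the facial emotion coordinate network (formula (4)).
# nn.MultiheadAttention is used as a stand-in for the self-attention mechanism;
# all dimensions are assumed, not taken from the patent.
import torch
import torch.nn as nn

class EmotionAudioEncoder(nn.Module):
    def __init__(self, content_dim=256, style_dim=256, hidden=256, heads=4):
        super().__init__()
        self.content_lstm = nn.LSTM(content_dim, hidden, num_layers=3, batch_first=True)
        self.style_mlp = nn.Sequential(                      # three MLP layers on Es
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, content_vectors, style_vector):
        # content_vectors: (B, T, content_dim); style_vector: (B, style_dim)
        content_feat, _ = self.content_lstm(content_vectors)            # (B, T, hidden)
        style_feat = self.style_mlp(style_vector).unsqueeze(1)          # (B, 1, hidden)
        fused = content_feat + style_feat                               # linear fusion
        audio_feat, _ = self.attn(fused, fused, fused)                  # self-attention
        return audio_feat                                               # S_t per frame
```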
The two facial coordinate encoders are lightweight neural networks, each composed of seven MLP layers. They are similar in structure but different in function: one extracts the geometric information of identity, and the other extracts the geometric information of facial emotion.
Based on the two different facial coordinate sets obtained in step S1 (one taken as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence), the identity-source facial coordinate encoder composed of seven MLP layers first extracts the portrait identity features of the identity source; the emotion-source facial coordinate encoder composed of seven MLP layers then extracts the portrait emotion features of the emotion source; finally, the portrait identity features, the portrait emotion features, and the audio features obtained by formula (4) are linearly fused to obtain the fusion feature, calculated as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)    (5)
In formula (5), F_t denotes the fusion feature of the t-th frame after linear fusion, and concat denotes linear fusion; MLP_{LA} denotes the identity-source facial coordinate encoder, L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and W_{mlp,la} denotes the learnable parameters of the identity-source facial coordinate encoder; MLP_{LB} denotes the emotion-source facial coordinate encoder, L_b denotes the facial coordinates of the first frame of the emotion-source portrait video, and W_{mlp,lb} denotes its learnable parameters; S_t denotes the t-th frame audio feature obtained by formula (4).
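The two coordinate encoders and the fusion of formula (5) could be sketched as follows; the seven-layer depth follows the patent text, while the layer width and helper names are assumptions.

```python
# Sketch of the identity-source / emotion-source facial coordinate encoders and
# the fusion of formula (5). Layer width is assumed; only the depth follows the text.
import torch
import torch.nn as nn

def seven_layer_mlp(in_dim=68 * 3, hidden=128):
    layers, dim = [], in_dim
    for _ in range(7):                       # seven MLP layers per encoder
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    return nn.Sequential(*layers)

identity_encoder = seven_layer_mlp()         # MLP_LA: geometric identity information
emotion_encoder = seven_layer_mlp()          # MLP_LB: geometric emotion information

def fuse(identity_coords, emotion_coords, audio_feat):
    # identity_coords, emotion_coords: (B, 68, 3); audio_feat: (B, T, 256)
    id_feat = identity_encoder(identity_coords.flatten(1))    # (B, 128)
    emo_feat = emotion_encoder(emotion_coords.flatten(1))     # (B, 128)
    T = audio_feat.size(1)
    return torch.cat([id_feat.unsqueeze(1).expand(-1, T, -1),
                      emo_feat.unsqueeze(1).expand(-1, T, -1),
                      audio_feat], dim=-1)                     # F_t = concat(..., S_t)
```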
Based on the fusion feature of the portrait identity features, the portrait emotion features, and the audio features obtained by formula (5), a coordinate decoder composed of three MLP layers predicts the facial emotion coordinate offset sequence, calculated as follows:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})    (6)
In formula (6), ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame, where t is the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t denotes the fused t-th frame feature from formula (5), and W_{mlp,ld} denotes the learnable parameters of the decoder.
The first-frame coordinates of the identity-source portrait video are corrected with the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence, calculated as follows:
Q_t = L_a + ΔQ_t    (7)
In formula (7), Q_t denotes the facial emotion coordinates of the t-th frame, where t is the current frame of the portrait video; L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame.
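Continuing the hypothetical building blocks above, formulas (6) and (7) reduce to a small decoder plus an additive correction; widths match the earlier sketches and are assumptions.

```python
# Sketch of the three-layer MLP coordinate decoder (formulas (6) and (7)).
# The input width (128 + 128 + 256) matches the fusion sketch above; all widths are assumed.
import torch.nn as nn

emotion_decoder = nn.Sequential(             # MLP_LD
    nn.Linear(128 + 128 + 256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 68 * 3))

def predict_emotion_coords(fused_feat, identity_first_frame):
    # fused_feat: (B, T, 512); identity_first_frame: (B, 68, 3)
    delta_q = emotion_decoder(fused_feat).view(fused_feat.size(0), -1, 68, 3)  # ΔQ_t
    return identity_first_frame.unsqueeze(1) + delta_q                         # Q_t = L_a + ΔQ_t
```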
In order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network; the specific formula is as follows:
L_total = λ_1·L_emo + λ_2·L_{D_L} + λ_3·L_{D_T}    (8)
In formula (8), L_total denotes the total loss function of the facial emotion coordinate animation generation network, L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, L_{D_L} denotes the loss function of the facial coordinate discriminator D_L, and L_{D_T} denotes the loss function of the facial coordinate interval-frame similarity discriminator D_T; λ_1, λ_2, and λ_3 are the respective weight parameters.
The facial coordinate loss function computes the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1 (the identity-source portrait coordinate sequence carrying the same emotion as the emotion source), calculated as follows:
L_emo = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖²    (9)
In formula (9), L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; Q_{i,t} denotes the i-th predicted facial coordinate of the t-th frame, Q̂_{i,t} denotes the i-th facial coordinate of the t-th frame obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖² denotes the squared Euclidean norm of their difference.
During training of the facial emotion coordinate animation generation network, the discriminator loss function L_{D_L} is used to judge whether the generated facial coordinates are real, and the discriminator loss function L_{D_T} is used to estimate the similarity between interval frames of the facial coordinates; they are given by formulas (10) and (11) (not reproduced here).
In formulas (10) and (11), t denotes the current frame of the portrait video; D_L denotes the discriminator that judges whether the facial coordinates are real, and L_{D_L} denotes its loss function; D_T denotes the facial coordinate interval-frame similarity discriminator, and L_{D_T} denotes its loss function; Q_t denotes the predicted facial emotion coordinates of the t-th frame, Q̂_t denotes the t-th frame facial coordinates obtained in step S1, and the facial coordinates of the frame preceding Q̂_t also enter formula (11).
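Because formulas (10) and (11) are not recoverable from this text, the fragment below is only a hedged sketch of one conventional way such losses are written: binary cross-entropy GAN objectives for a per-frame coordinate discriminator and an interval-frame similarity discriminator, combined with the coordinate loss as in formula (8). The formulation, the weights, and all names are assumptions, not the patent's exact equations.

```python
# Hedged sketch of formulas (8), (10), (11): an assumed BCE-style GAN loss for the
# coordinate discriminator D_L and the interval-frame similarity discriminator D_T.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_losses(D_L, D_T, q_pred, q_ref, q_ref_prev):
    # q_pred: predicted coordinates Q_t; q_ref: reference coordinates Q̂_t;
    # q_ref_prev: reference coordinates of the preceding frame.
    real_l = D_L(q_ref)                                   # D_L: real vs. generated coordinates
    fake_l = D_L(q_pred.detach())
    loss_dl = bce(real_l, torch.ones_like(real_l)) + bce(fake_l, torch.zeros_like(fake_l))

    real_t = D_T(torch.cat([q_ref, q_ref_prev], dim=-1))  # D_T: interval-frame similarity
    fake_t = D_T(torch.cat([q_pred.detach(), q_ref_prev], dim=-1))
    loss_dt = bce(real_t, torch.ones_like(real_t)) + bce(fake_t, torch.zeros_like(fake_t))
    return loss_dl, loss_dt

def total_generator_loss(coord_loss, adv_dl, adv_dt, lambdas=(1.0, 0.1, 0.1)):
    # Formula (8): weighted combination of the coordinate loss and the two
    # adversarial terms; the weight values here are placeholders.
    l1, l2, l3 = lambdas
    return l1 * coord_loss + l2 * adv_dl + l3 * adv_dt
```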
When the loss function levels off, training of the facial emotion coordinate animation generation network is complete.
In step S5, the coordinate-to-video network is trained.
Based on the facial coordinate sequence obtained in step S1, the discrete coordinates are connected in numbered order and rendered with colored line segments to create a three-channel facial sketch sequence of size 256 × 256. This sequence is channel-concatenated with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256. With this sequence as input, the coordinate-to-video network generates the reconstructed face video.
To generate an optimal face video, an L1 loss function (L1-norm loss) is set, based on this image translation network, to adjust the weights and biases of the network. The loss aims to minimize the pixel distance between the reconstructed face video and the training target face video.
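The input construction for the coordinate-to-video network could look like the sketch below; the line color, the simple consecutive-landmark connection, and the use of 2D projected coordinates are assumptions made for illustration.

```python
# Illustrative sketch of the coordinate-to-video input: landmarks connected in
# numbered order, drawn as colored line segments, then channel-concatenated with
# the first video frame. Color and landmark grouping are assumptions.
import cv2
import numpy as np

def render_face_sketch(coords_2d, size=256):
    """coords_2d: (68, 2) pixel coordinates -> (size, size, 3) sketch image."""
    sketch = np.zeros((size, size, 3), dtype=np.uint8)
    pts = coords_2d.astype(int)
    for i in range(len(pts) - 1):                      # connect consecutive landmarks
        p1 = tuple(int(v) for v in pts[i])
        p2 = tuple(int(v) for v in pts[i + 1])
        cv2.line(sketch, p1, p2, (0, 255, 0), 1)
    return sketch

def make_network_input(coords_seq, first_frame):
    """coords_seq: (T, 68, 2); first_frame: (size, size, 3) -> (T, size, size, 6)."""
    inputs = []
    for coords in coords_seq:
        sketch = render_face_sketch(coords)
        inputs.append(np.concatenate([sketch, first_frame], axis=-1))  # six channels
    return np.stack(inputs)
```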
In step S6, based on the facial lip-sound coordinate animation generation network, the facial emotion coordinate animation generation network, and the coordinate-to-video network obtained in steps S3, S4, and S5, any two portrait pictures (one representing the identity source and the other the emotion source) and any segment of audio are input to generate the target video.
The face recognition algorithm face-alignment is used to obtain the corresponding identity-source and emotion-source portrait coordinates, and the voice conversion method is used to obtain the audio content vector and audio style vector of the audio. The audio content vector and the identity-source portrait coordinates pass through the facial lip-sound coordinate animation generation network obtained in step S3 to generate the lip-sound-synchronized facial coordinate offset sequence. The audio content vector, the audio style vector, the identity-source portrait coordinates, and the emotion-source portrait coordinates pass through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence. The identity-source portrait coordinates are corrected with these two offset sequences to obtain the final coordinate sequence, which is fed into the coordinate-to-video network obtained in step S5 to generate the lip-synchronized video of the target portrait carrying the emotion of the emotion source.
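The overall inference flow of step S6 can be summarized as a schematic composition; every argument below is a placeholder for a component built earlier (landmark detector, trained networks, audio preprocessor), and the shapes and return conventions are assumptions.

```python
# Schematic inference composition for step S6; all arguments are placeholders for
# previously trained or constructed components, not real APIs from the patent.
def generate_emotional_talking_video(identity_img, emotion_img, audio_path,
                                     landmark_detector, audio_preprocessor,
                                     lip_offset_net, emotion_offset_net, coord2video_net):
    identity_coords = landmark_detector(identity_img)        # identity-source face coordinates
    emotion_coords = landmark_detector(emotion_img)          # emotion-source face coordinates
    content_vecs, style_vec = audio_preprocessor(audio_path) # step-S2 content/style vectors

    delta_p = lip_offset_net(identity_coords, content_vecs)  # lip-sound coordinate offsets
    delta_q = emotion_offset_net(identity_coords, emotion_coords,
                                 content_vecs, style_vec)    # facial emotion coordinate offsets
    final_coords = identity_coords + delta_p + delta_q       # corrected final coordinate sequence
    return coord2video_net(final_coords, identity_img)       # lip-synced, emotion-controlled video
```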
The multi-modally driven emotion-controllable facial animation generation method is realized through a voice conversion method, multi-layer perceptrons, long short-term memory networks, a self-attention mechanism, and generative adversarial networks. As shown in FIGS. 2-3, the invention can generate videos with different emotions by adjusting the emotion-source portrait, which gives it high application value and overcomes the lack of emotion and poor robustness of existing facial animation generation methods.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A method for generating emotion-controllable facial animation based on multi-modal driving, characterized by comprising the following steps:
step S1: preprocessing the images of a portrait video and extracting a facial 3D feature coordinate sequence from the preprocessed images with a face recognition algorithm;
step S2: preprocessing the audio of the portrait video, and then decoupling the preprocessed audio, using a voice conversion method, into an audio content vector independent of the speaker and an audio style vector dependent on the speaker;
step S3: training a facial lip-sound coordinate animation generation network composed of a multi-layer perceptron and a long short-term memory network, based on the facial 3D feature coordinate sequence and the audio content vector;
step S4: training a facial emotion coordinate animation generation network composed of multi-layer perceptrons, long short-term memory networks, a self-attention mechanism, and a generative adversarial network, based on the facial 3D feature coordinate sequence, the audio content vector, and the audio style vector;
step S5: training a coordinate-to-video network composed of a generative adversarial network, based on the facial 3D feature coordinate sequence;
step S6: based on the trained facial lip-sound coordinate animation generation network, facial emotion coordinate animation generation network, and coordinate-to-video network, inputting any two portrait pictures, one representing the identity source and the other the emotion source, together with any segment of audio, and generating a lip-synchronized video of the target portrait carrying the emotion of the emotion source;
in step S3, the facial lip-sound coordinate animation generation network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder composed of two MLP layers and a speech content encoder composed of three LSTM layers, and the decoder is a facial lip-sound coordinate decoder composed of three MLP layers; the facial lip-sound coordinate animation generation network is provided with a loss function used to continuously adjust the weights and biases of the network until the error between the predicted coordinates and the reference coordinates is minimized;
in step S3, the training process of the facial lip-sound coordinate animation generation network is as follows:
first, two MLP layers extract the identity feature from the facial 3D feature coordinates of the first video frame obtained in step S1, i.e., the identity feature at the first time point of the facial 3D feature coordinate sequence;
then, based on the identity feature and the audio content vector obtained in step S2, after linear fusion, a three-layer LSTM extracts the dependency between consecutive audio syllables and the lip coordinates;
then, based on the encoder output, a decoder composed of three MLP layers predicts the facial lip-sound coordinate offset sequence, calculated as follows:
ΔP_t = MLP_c(LSTM_c(Ec_{t→t+λ}, MLP_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c})
where ΔP_t denotes the predicted facial lip-sound coordinate offset of the t-th frame, and t denotes the current frame of the portrait video; MLP_L denotes the facial coordinate encoder, L denotes the facial coordinates of the first frame of the portrait video, and W_{mlp,l} denotes the learnable parameters of the facial coordinate encoder; LSTM_c denotes the speech content encoder, Ec denotes the audio content vector, t → t+λ indicates that the audio content vector is fed to the speech content encoder in batches of λ = 18 per frame t, and W_{lstm} denotes the learnable parameters of the speech content encoder; MLP_c denotes the facial lip-sound coordinate decoder, and W_{mlp,c} denotes its learnable parameters;
the first-frame coordinates of the portrait video are corrected with the predicted facial lip-sound coordinate offset sequence to obtain the lip-sound-synchronized coordinate sequence, calculated as follows:
P_t = L + ΔP_t
where P_t denotes the lip-sound-synchronized facial coordinates of the t-th frame, and t denotes the current frame of the portrait video; L denotes the facial coordinates of the first frame of the portrait video, and ΔP_t denotes the predicted facial lip-sound coordinate offset of the t-th frame;
in order to generate an optimal sequence of facial lip-sound coordinate offsets, a loss function is set, based on the encoder-decoder structure of the facial lip-sound coordinate animation generation network, to adjust the weights and biases of the network, the loss function being calculated as follows:
L_lip = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖²
where L_lip denotes the loss function of the facial lip-sound coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; P_{i,t} denotes the i-th predicted coordinate of the t-th frame, P̂_{i,t} denotes the i-th coordinate of the t-th frame obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖² denotes the squared Euclidean norm of their difference;
when the loss function levels off, i.e., when L_lip reaches its minimum, training of the facial lip-sound coordinate animation generation network is complete;
in step S4, the facial emotion coordinate animation generation network adopts a custom encoder-decoder structure:
the encoder comprises an audio encoder and facial coordinate encoders, the facial coordinate encoders comprising an identity-source facial coordinate encoder and an emotion-source facial coordinate encoder, and the audio encoder captures audio features through three LSTM layers, three MLP layers, and a self-attention mechanism;
the decoder comprises a coordinate decoder;
the encoder is used to obtain the audio features, the portrait identity features, and the portrait emotion features, and the decoder is used to process the multi-modal features, which are jointly driven by the audio features and the portrait emotion features to generate the coordinate offset sequence after the emotion of the target portrait is remodeled;
the facial emotion coordinate animation generation network is provided with three different loss functions to adjust the weights and biases of the network: the first computes the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1, and the second and third are discriminator loss functions that respectively judge whether the generated facial coordinates are real and measure the similarity between interval frames of the facial coordinates;
in step S4, the network training process for generating the facial emotion coordinate animation is as follows:
firstly, using LSTM to extract the characteristics of the audio content vector obtained in the step S2;
then, using MLP to extract the characteristics of the audio style vector obtained in the step S2;
then, carrying out linear fusion on the audio content vector characteristics and the audio style vector characteristics;
and finally, capturing the longer-range structural dependency between the audio content vector and the audio style vector with a self-attention mechanism to obtain audio features with stronger temporal dependency, calculated as follows:
S_t = Attn(LSTM_c'(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})
where S_t denotes the processed audio feature of the t-th frame, and t denotes the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es denotes the audio style vector, and W_{mlp,s} denotes the learnable parameters of the audio style vector encoder; LSTM_c' denotes the audio content vector encoder, Ec denotes the audio content vector, t → t+λ indicates that the audio content vector is fed to the audio content vector encoder in batches of λ = 18 per frame t, and W'_{lstm} denotes the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism, and W_{attn} denotes its learnable parameters;
the two facial coordinate encoders are lightweight neural networks each composed of seven MLP layers, one extracting the geometric information of identity and the other the geometric information of facial emotion;
based on the two different facial coordinate sets obtained in step S1, one taken as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence, the identity-source facial coordinate encoder composed of seven MLP layers first extracts the portrait identity features of the identity source; the emotion-source facial coordinate encoder composed of seven MLP layers then extracts the portrait emotion features of the emotion source; finally, the portrait identity features, the portrait emotion features, and the obtained audio features are linearly fused to obtain the fusion feature, calculated as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)
where F_t denotes the t-th frame feature after linear fusion, and concat denotes linear fusion; MLP_{LA} denotes the identity-source facial coordinate encoder, L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and W_{mlp,la} denotes the learnable parameters of the identity-source facial coordinate encoder; MLP_{LB} denotes the emotion-source facial coordinate encoder, L_b denotes the facial coordinates of the first frame of the emotion-source portrait video, and W_{mlp,lb} denotes its learnable parameters; S_t denotes the t-th frame audio feature of step S4;
based on the fusion feature of the portrait identity features, the portrait emotion features, and the audio features, a coordinate decoder composed of three MLP layers predicts the facial emotion coordinate offset sequence, calculated as follows:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})
where ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame, and t denotes the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t denotes the fused t-th frame feature after the linear fusion in step S4, and W_{mlp,ld} denotes the learnable parameters of the decoder;
the first-frame coordinates of the identity-source portrait video are corrected with the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence, calculated as follows:
Q_t = L_a + ΔQ_t
where Q_t denotes the facial emotion coordinates of the t-th frame, and t denotes the current frame of the portrait video; L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame;
in order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network, with the specific formula:
L_total = λ_1·L_emo + λ_2·L_{D_L} + λ_3·L_{D_T}
where L_total denotes the total loss function of the facial emotion coordinate animation generation network, L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, L_{D_L} denotes the loss function of the facial coordinate discriminator D_L, and L_{D_T} denotes the loss function of the facial coordinate interval-frame similarity discriminator D_T; λ_1, λ_2, and λ_3 are the respective weight parameters;
wherein the coordinate loss function of the facial emotion coordinate animation generation network computes the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1, with the specific calculation formula:
L_emo = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖²
where L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; Q_{i,t} denotes the i-th predicted facial coordinate of the t-th frame, Q̂_{i,t} denotes the i-th facial coordinate of the t-th frame obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖² denotes the squared Euclidean norm of their difference;
during training of the facial emotion coordinate animation generation network, the discriminator loss function L_{D_L} is used to judge whether the generated facial coordinates are real, and the discriminator loss function L_{D_T} is used to estimate the similarity between interval frames of the facial coordinates (the corresponding formulas are not reproduced here), where t denotes the current frame of the portrait video; D_L denotes the discriminator that judges whether the facial coordinates are real, and L_{D_L} denotes its loss function; D_T denotes the facial coordinate interval-frame similarity discriminator, and L_{D_T} denotes its loss function; Q_t denotes the predicted facial emotion coordinates of the t-th frame, Q̂_t denotes the t-th frame facial coordinates obtained in step S1, and the facial coordinates of the frame preceding Q̂_t also enter the formula;
when the loss function levels off, training of the facial emotion coordinate animation generation network is complete.
2. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S1 specifically comprises:
firstly, performing frame rate conversion on the video, converting it to 62.5 frames per second;
then, resampling the images and cropping them into 256 × 256 videos containing the face;
extracting facial coordinates with a face recognition algorithm, obtaining the 3D coordinates of the face in each frame with dimension 68 × 3, and forming the facial 3D feature coordinate sequence;
and saving the facial 3D feature coordinate sequence as an emotion-source portrait coordinate sequence and an identity-source portrait coordinate sequence, i.e., the emotion-source facial coordinates and the identity-source facial coordinates.
3. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S2 specifically comprises:
performing sampling-rate conversion on the audio, converting the sampling rate to 16000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group);
then performing audio vector extraction, obtaining the audio vector with the Python Resemblyzer library;
and finally, inputting the audio vector into the voice conversion model AutoVC, and obtaining the decoupled audio content vector independent of the speaker and the audio style vector dependent on the speaker.
4. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S5, the training process of the coordinate-to-video network is as follows:
based on the facial coordinate sequence obtained in step S1, connecting the discrete coordinates in numbered order and rendering them with colored line segments to create a three-channel facial sketch sequence of size 256 × 256;
channel-concatenating this sequence with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256;
using this sequence as input and generating the reconstructed face video with the coordinate-to-video network;
and, to generate an optimal face video, setting an L1 loss function to adjust the weights and biases of the coordinate-to-video network.
5. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S6 the lip-synchronized video of the target portrait carrying the emotion of the emotion source is generated using the three trained network models, specifically comprising:
inputting any two portrait pictures and any segment of audio, obtaining the identity-source portrait coordinates and the emotion-source portrait coordinates with a face recognition algorithm, and obtaining the audio content vector and audio style vector of the audio with a voice conversion method;
passing the audio content vector and the identity-source portrait coordinates through the facial lip-sound coordinate animation generation network obtained in step S3 to generate the lip-sound-synchronized facial coordinate offset sequence;
passing the audio content vector, the audio style vector, the identity-source portrait coordinates, and the emotion-source portrait coordinates through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence;
and correcting the identity-source portrait coordinates with the two offset sequences to obtain the final coordinate sequence, inputting it into the coordinate-to-video network obtained in step S5, and generating the lip-synchronized video of the target portrait carrying the emotion of the emotion source.
CN202210744504.9A 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method Active CN115100329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Publications (2)

Publication Number Publication Date
CN115100329A CN115100329A (en) 2022-09-23
CN115100329B true CN115100329B (en) 2023-04-07

Family

ID=83295794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210744504.9A Active CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Country Status (1)

Country Link
CN (1) CN115100329B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631275B (en) * 2022-11-18 2023-03-31 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN116433807A (en) * 2023-04-21 2023-07-14 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116843798B (en) * 2023-07-03 2024-07-05 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1320497C (en) * 2002-07-03 2007-06-06 中国科学院计算技术研究所 Statistics and rule combination based phonetic driving human face carton method
US9613450B2 (en) * 2011-05-03 2017-04-04 Microsoft Technology Licensing, Llc Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN111783658B (en) * 2020-07-01 2023-08-25 河北工业大学 Two-stage expression animation generation method based on dual-generation reactance network
CN113158727A (en) * 2020-12-31 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on video and voice information
CN113408449B (en) * 2021-06-25 2022-12-06 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN114202604A (en) * 2021-11-30 2022-03-18 长城信息股份有限公司 Voice-driven target person video generation method and device and storage medium
CN114663539B (en) * 2022-03-09 2023-03-14 东南大学 2D face restoration technology under mask based on audio drive

Also Published As

Publication number Publication date
CN115100329A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN115100329B (en) Multi-mode driving-based emotion controllable facial animation generation method
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
US20220084273A1 (en) System and method for synthesizing photo-realistic video of a speech
Wang et al. Seeing what you said: Talking face generation guided by a lip reading expert
US20210027511A1 (en) Systems and Methods for Animation Generation
CN115116109B (en) Virtual character speaking video synthesizing method, device, equipment and storage medium
CN115588224B (en) Virtual digital person generation method and device based on face key point prediction
CN114202604A (en) Voice-driven target person video generation method and device and storage medium
WO2021023869A1 (en) Audio-driven speech animation using recurrent neutral network
CN115457169A (en) Voice-driven human face animation generation method and system
CN113470170A (en) Real-time video face region space-time consistent synthesis method using voice information
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
EP0710929A2 (en) Acoustic-assisted image processing
CN115393949A (en) Continuous sign language recognition method and device
Zhua et al. Audio-driven talking head video generation with diffusion model
CN116828129B (en) Ultra-clear 2D digital person generation method and system
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
Wen et al. 3D Face Processing: Modeling, Analysis and Synthesis
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
CN117171392A (en) Virtual anchor generation method and system based on nerve radiation field and hidden attribute
CN115937375A (en) Digital body-separating synthesis method, device, computer equipment and storage medium
CN114494930A (en) Training method and device for voice and image synchronism measurement model
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Pan et al. Research on face video generation algorithm based on speech content
Deena Visual speech synthesis by learning joint probabilistic models of audio and video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant