CN115457169A - Voice-driven face animation generation method and system


Info

Publication number
CN115457169A
Authority
CN
China
Prior art keywords: face, lip, image, model, animation
Legal status: Pending
Application number
CN202211005678.XA
Other languages
Chinese (zh)
Inventor
谢榕
李耀鹏
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202211005678.XA
Publication of CN115457169A

Classifications

    • G06T13/205 - 3D [Three Dimensional] animation driven by audio data
    • G06T5/00 - Image enhancement or restoration
    • G06V40/161 - Human faces: Detection; Localisation; Normalisation
    • G06T2207/20081 - Training; Learning
    • G06T2207/20084 - Artificial neural networks [ANN]


Abstract

The invention provides a voice-driven face animation generation method and system. Face key points are extracted and normalized, taking the lips as the main reference and correcting the geometric positions of the key points using the positional relation between the eyes and the lips. Lip key points are then predicted from audio features, comprising audio feature extraction, data preprocessing, Audio2MKP modeling and training, and parameter optimization, where Audio2MKP is a model that maps speech to lip key points. A reference image is generated from the lip key points, comprising mask image generation, face region division, FTGAN modeling and training, and parameter optimization, where FTGAN is a model that converts a face mask image into a face reference image. Finally, on the basis of the reference image, the audio features guide the generation of the face animation, comprising A2FGAN modeling and training, parameter optimization, and face animation synthesis, where A2FGAN is a model that produces face animation with a lip-sync effect.

Description

Voice-driven face animation generation method and system
Technical Field
The invention belongs to the technical field of artificial-intelligence virtual face applications, and particularly relates to a voice-driven face animation generation method and system.
Background
Voice-driven face animation generation aims to produce smooth, natural, lip-synchronized face animation from input speech and face images. It has broad application prospects in many fields such as virtual anchors, virtual customer service, online education, film special effects and games. A high-quality, highly realistic, lip-synchronized face animation greatly enhances the user's sense of immersion and experience.
In recent years, with the continued development of deep learning, models such as convolutional neural networks and generative adversarial networks have been proposed and widely applied, giving face animation generation a new research direction: a learning mechanism is adopted so that a trained face model can express mouth shapes well.
Current approaches fall mainly into two categories: face animation generation based on intermediate features, and end-to-end speech-synchronized face animation generation.
(1) Face animation generation based on intermediate features first generates lip key points, which are then used as intermediate features to guide the face image, producing face animation synchronized with speech. Suwajanakorn et al. (2017) used deep learning to obtain lip motion information from speech as an intermediate feature, and then synthesized facial textures and generated face animation with traditional computer vision methods. Although such methods can generate strongly synchronized face animation, the edges of the synthesized lip textures are blurred or occluded, and the frame quality needs to be improved. In addition, the method was developed for a specific subject, so the model does not generalize to other target subjects. Kumar et al. (2017) proposed the ObamaNet model based on the work of Suwajanakorn et al. It masks the face image using the predicted lip motion information and then converts the mask image into face animation frames with the image translation model Pix2Pix. Although replacing the traditional computer vision pipeline with trainable neural network modules improves the efficiency of face animation generation to some extent, the lip edges of the generated face images remain blurred and generalization to new target subjects is very limited.
(2) End-to-end speech-synchronized face animation generation feeds the audio features and face images directly into a single neural network for training and generates the face animation. Vougioukas et al. (2018) used a generative adversarial network to establish the mapping from audio to face animation. Their model contains one generator and three discriminators. The generator is an encoder-decoder that receives audio and a face image as input and generates face animation frames; the discriminators consist of a frame discriminator, a sequence discriminator and a synchronization discriminator, which guide face reconstruction, inter-frame smoothness and lip synchronization, respectively. Although this method achieves good lip synchronization and some generalization ability, the animation is generated from only a single face image, so the character still lacks natural mouth movement and realism. Vougioukas et al. (2020) later added blinking to the animation, but blinking alone is not enough to make the face animation realistic. Chung et al. (2016) proposed the lip-sync discrimination network SyncNet, which judges the similarity of the audio and the face image in a common parameter space and computes the cross-entropy loss between the audio features and the face image features in that space to reflect the lip-sync effect. Addressing the lack of natural facial movement in the animations generated by Vougioukas et al. (2018), Prajwal et al. further proposed LipGAN (2019), based on generative adversarial networks, and its improved model Wav2Lip (2020), building on the work of Chung et al. Unlike Vougioukas et al., which uses a single face image, Wav2Lip receives an image frame sequence as input and uses the audio features to drive the lip movement while preserving the character's facial motion from the original frame sequence. The model can generate face animation with natural head movement and a good lip-sync effect, but the generated face frames are still not sharp enough and the frame quality needs to be improved.
In summary, the main problems with current voice-driven face animation generation are a mediocre lip-sync effect, low face animation frame quality, weak generalization ability of the face models, and the lack of natural head movement of the character in the generated animation. Controlling the lip shape and facial movement while preserving local detail to generate a realistic, lip-synchronized face still faces a great technical challenge.
Disclosure of Invention
To address the problems in the prior art, the invention provides a new voice-driven face animation generation scheme that, given audio and an image or video of a face, generates face animation in which the character is the speaking subject and is synchronized with the audio. The generated face animation performs well and satisfies the following requirements: (1) a good lip-sync effect; (2) high-quality animation frames; (3) a generation model with generalization ability; (4) natural head movement of the character.
To this end, the invention provides a voice-driven face animation generation method, comprising the following steps for automatically generating the face animation.
Step S1: face key point extraction and normalization. The normalization takes the lips as the main reference and corrects the geometric positions of the face key points using the positional relation between the eyes and the lips.
Step S2: lip key point prediction from audio features, comprising audio feature extraction, data preprocessing, modeling and training of the lip key point prediction model Audio2MKP, and parameter optimization of Audio2MKP. Audio2MKP is a model that maps speech to lip key points.
Step S3: reference image generation based on the lip key points, comprising mask image generation, face region division, modeling and training of the face translation generative adversarial network FTGAN, and parameter optimization of FTGAN. FTGAN is a model that converts a face mask image into a face reference image.
Step S4: on the basis of the reference image obtained in step S3, the audio features obtained in step S2 guide the generation of the face animation, comprising modeling and training of the speech-to-face generative adversarial network A2FGAN, parameter optimization of A2FGAN, and face animation synthesis. A2FGAN is a model that produces face animation with a lip-sync effect.
Moreover, when extracting the face key points, the key facial parts of the face image, including the eyebrows, eyes, nose, lips and outer face contour, are first located, and the basic face key points are determined.
Moreover, in step S2, the Audio2MKP model is built on a convolutional neural network to obtain accurate lip key point coordinates from the speech information. The implementation comprises the following steps:
1) Audio2MKP modeling. The lip key point prediction model Audio2MKP is based on a convolutional neural network and realizes the mapping from speech to lip key points. It comprises several convolutional layers and several fully connected layers connected in sequence, and residual connections from the preceding convolutional layer are added to some of the convolutional layers to form residual blocks.
2) Audio2MKP training. The input audio features are received, the lip key point prediction model Audio2MKP is trained, and the model parameters are optimized by back-propagation to obtain accurately predicted lip key points.
Moreover, when the mask image is generated in step S3, the size of the mask region is determined by extending the lip region outward.
Moreover, when the face region is divided in step S3, the face image is divided, in order of decreasing importance, into a lip region, a face region and a background region. A weight is set for each region such that the weight of the lip region is greater than that of the face region, which in turn is greater than that of the background region.
Moreover, in step S3, the face translation generative adversarial network FTGAN is built by modifying the architecture of a generative adversarial network and converts the face mask image into a face reference image. The implementation comprises the following steps:
1) FTGAN modeling. FTGAN converts the mask image obtained from the predicted lip key points into a face reference image. It consists of a generator network and a discriminator network. The generator receives the input mask image and outputs the face reference image; its STN module, face encoder module, CBAM module and face decoder module are connected in sequence. The discriminator network comprises a frame discriminator module: the reference image produced by the generator and the corresponding ground-truth image are fed into the frame discriminator, and the mean squared error between the generated label and the ground-truth label is computed to evaluate the quality of the generated image.
2) FTGAN training. The input mask image is received, FTGAN is trained, and the model parameters are optimized by back-propagation to obtain a high-quality face reference image.
Furthermore, in step S4, the proposed speech-to-face generative adversarial network A2FGAN is built by modifying the architecture of a generative adversarial network to obtain the lip-sync effect of the face animation. The implementation comprises the following steps:
1) A2FGAN modeling. A2FGAN generates the face animation from the speech information on the basis of the reference image. It comprises one generator network and two discriminator networks: the generator consists of a face encoder module, a face decoder module, an STN module and a CBAM module, and the two discriminators are a frame discriminator module and a lip-sync discriminator module.
2) A2FGAN training. The input audio features and reference images are received, A2FGAN is trained, and the model parameters are optimized by back-propagation to obtain lip-synchronized face animation frames.
Moreover, when the A2FGAN discriminators are trained, the lip-sync discriminator module is trained separately first, and the frame discriminator module is then trained with the lip-synchronized face animation images, finally obtaining face animation frames that are lip-synchronized and preserve a realistic image effect.
In another aspect, the invention also provides a voice-driven face animation generation system for implementing the voice-driven face animation generation method described above.
Moreover, the system includes a processor and a memory, the memory storing program instructions and the processor invoking the stored instructions to execute the voice-driven face animation generation method described above.
The invention has the following features and beneficial effects:
1. To obtain accurate lip key point coordinates from the speech information, a convolutional neural network (CNN) based lip key point prediction model, Audio2MKP, is proposed. It overcomes the heavy computational workload and low degree of automation of early image processing methods for obtaining lip key points. Compared with methods based on long short-term memory (LSTM) networks, the error between the predicted and ground-truth lip key points is smaller, so the predicted lip key points are more accurate.
2. To improve the quality of the face animation frames, a face translation generative adversarial network, FTGAN, is proposed to convert a face mask image into a face reference image. Considering the uncertainty of the face's size and position relative to the image, a combination of an STN module and a CBAM module is introduced into the FTGAN generator to improve the quality of the generated reference image without additional overhead; the FTGAN discriminator uses a multi-scale discriminator (PatchGAN) to better guide the generator to output the reference image. Considering the different importance of the regions of a face image, an attention mechanism is introduced and an image reconstruction loss function based on face region division is designed. The designed model can therefore generate a high-quality face reference image to the greatest possible extent.
3. A speech-to-face generative adversarial network, A2FGAN, is proposed. Exploiting SyncNet's strength in judging lip synchronization, a lip-sync discriminator is introduced into the A2FGAN discriminators. The model uses the speech information to guide face animation generation on the basis of the reference image, further improving the lip-sync effect of the face animation.
4. To address the lack of natural head movement when the character is given as a single image, a solution based on an intermediate transition face is provided using the face pose transfer model Few_Shot_Vid2Vid, transferring the face animation of the intermediate character to the given character. This ensures that the character in a face animation generated from a single image has natural head movement and that the generated face animation has generalization ability.
In summary, the voice-driven face animation generation scheme provided by the invention can automatically generate face animation with high-quality frames, a good lip-sync effect and natural head movement, and the generation model has generalization ability. It has broad application prospects in many fields such as virtual anchors, virtual customer service, online education, film special effects and games, and has significant market value.
Drawings
FIG. 1 is a flow chart of speech-driven face animation generation according to an embodiment of the present invention;
fig. 2 is a schematic diagram of key points of a human face according to an embodiment of the present invention, in which part (a) of fig. 2 is a schematic diagram of positioning key portions of a human face, and part (b) of fig. 2 is a schematic diagram of key points of a human face;
FIG. 3 is a lip keypoint prediction model Audio2MKP according to an embodiment of the present invention;
fig. 4 is a schematic diagram of some unsuitable mask region processing manners according to an embodiment of the present invention, in which part (a) of fig. 4 is a schematic diagram of covering the lower half of the whole face image, part (b) of fig. 4 is a schematic diagram of covering the lower half of the face image, and part (c) of fig. 4 is a schematic diagram of only covering the lip region;
fig. 5 is a schematic diagram of an embodiment of the present invention in which there is an error in predicting lip keypoints, where fig. 5 (a) is a schematic diagram of a mask region of an original image, and fig. 5 (b) is a schematic diagram of a real and predicted error;
FIG. 6 is a schematic diagram of selecting a mask region according to an embodiment of the present invention, in which part (a) of FIG. 6 is a schematic diagram of an original image, and part (b) of FIG. 6 is a schematic diagram of selecting a mask region;
FIG. 7 is a schematic diagram of a mask region segmentation process according to an embodiment of the present invention, in which part (a) of FIG. 7 is a schematic diagram of an original image, and part (b) of FIG. 7 is a schematic diagram of a segmented mask region;
FIG. 8 is a schematic diagram of face region division according to an embodiment of the present invention;
FIG. 9 shows the face translation generative adversarial network FTGAN of an embodiment of the present invention;
fig. 10 is an STN module network structure of an embodiment of the present invention;
FIG. 11 is a network structure of a CBAM module of an embodiment of the present invention;
FIG. 12 shows the speech-to-face generative adversarial network A2FGAN of an embodiment of the present invention;
fig. 13 is a lip synchronization decision network according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is specifically described below with reference to the accompanying drawings and examples.
According to given audio and a face image or video, the invention generates a face animation in which the given face is the speaking subject and is synchronized with the audio. The method comprises face key point extraction and normalization, lip key point prediction from audio features, reference image generation based on the lip key points, and face animation generation based on the audio features and the reference image. The invention provides a complete set of steps for producing the face animation, and the generated animation satisfies the following requirements: (1) a good lip-sync effect; (2) high-quality animation frames; (3) a generation model with generalization ability; (4) natural head movement of the character.
As shown in FIG. 1, the voice-driven face animation generation method of an embodiment of the invention includes the following steps.
Step 1 face key point extraction and normalization
Face key point extraction is the foundation of face animation generation; the accuracy of the key point positions directly affects the lip region processing in later steps and the quality of the generated animation. In this step, face key points are extracted from the input face image or video and normalized. The embodiment further provides steps 1.1 and 1.2.
Step 1.1 face key point extraction
Subsequent training of the lip key point prediction model requires lip key point data corresponding to the audio. Therefore, the key facial parts of the face image, including the eyebrows, eyes, nose, lips and outer face contour, are first located and the basic face key points are determined; see FIG. 2 for the extracted key points. In practice, a face recognition tool such as the open-source library Dlib can be used to locate the key facial parts in the face image of FIG. 2(a) and extract 68 face key points, 20 of which belong to the lips (FIG. 2(b)). The 1st key point is generally the upper-left point of the outer face contour; the key points are numbered in sequence as shown in FIG. 2(b), and the first lip key point is number 49.
Step 1.2 normalization processing of face key points
The position, size and rotation angle of the face relative to the whole image differ between face image frames. Therefore, the face key points obtained in step 1.1 are normalized: the lips are taken as the main reference, and the geometric positions of the key points are corrected using the positional relation between the eyes and the lips. This removes the influence of the face's tilt within the image and ensures that the face is not deformed.
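A minimal sketch of steps 1.1 and 1.2 (not the patent's exact implementation), using the standard Dlib 68-point predictor to extract the key points, taking the 20 lip points (indices 48-67, i.e. points 49-68), and normalizing them about the lip centre. The tilt correction from the eye-lip relation is illustrated with a simple rotation estimated from the eye line; the patent does not give the exact formula, so that part is an assumption.

import numpy as np
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
# standard Dlib landmark model, assumed to be available locally
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_keypoints(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]                       # assume one face per frame
    shape = predictor(gray, face)
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)], dtype=np.float64)

def normalize_lips(pts):
    lips = pts[48:68]                              # the 20 lip key points
    eye_l, eye_r = pts[36:42].mean(0), pts[42:48].mean(0)
    angle = np.arctan2(*(eye_r - eye_l)[::-1])     # face tilt estimated from the eye line
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])
    centre = lips.mean(0)
    centred = (lips - centre) @ rot.T              # rotate about the lip centre to remove the tilt
    norm = np.linalg.norm(centred)                 # two-norm recorded for later de-normalization
    return centred / norm, centre, norm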
Step 2, lip key point prediction from audio features
Audio features are extracted from the input raw speech and, after data preprocessing, the speech is represented as vectors. A lip key point prediction model is then built and trained to accurately predict the lip key point coordinates from the audio features. The embodiment further provides steps 2.1 to 2.4.
Step 2.1 Audio feature extraction
Audio features of the speech are extracted from the raw speech. In practice, a speech recognition system such as DeepSpeech may be used. DeepSpeech is an open-source speech recognition system based on a deep learning framework that provides end-to-end automatic speech recognition. The embodiment of the invention preferably uses DeepSpeech to apply a fast Fourier transform to the input speech, converting the time-domain signal into a frequency-domain signal; a feature vector is obtained for each audio segment in the two-dimensional time-frequency space; and the feature vectors are windowed to obtain the audio features. The extracted audio features are 29-dimensional and are referred to below as DSAudio_features.
It should be noted that current deep learning techniques can extract audio features from raw audio automatically. Therefore, instead of the conventional MFCC feature parameters, the embodiment of the invention preferably extracts the DSAudio_features audio features.
Step 2.2 data preprocessing
Data preprocessing handles the audio data and the video data separately; the embodiment further provides steps 2.2.1 and 2.2.2.
Step 2.2.1 Audio data processing
In one embodiment, the raw speech may be preprocessed with a multimedia processing tool such as FFmpeg and uniformly converted into mono speech data with a sampling rate of 16,000 Hz. On the basis of the DSAudio_features audio features, one dimension is further added to each audio feature so that the data format meets the requirements of the convolutional layers used later. After preprocessing, each second of speech can be represented as one audio feature vector; the preferred uniform format is (number of audio frames, dimension, window size, audio feature size), for example (n, 1, 16, 29), where n is the number of audio frames.
Step 2.2.2 video data processing
In practice, face image frames may be extracted at 25 frames per second with a multimedia processing tool such as FFmpeg. On the basis of the face key points extracted in step 1, the 20 lip key points of each frame are reduced in dimensionality with principal component analysis (PCA) and represented by a certain number of principal components. After preprocessing, each second of video can be represented as one lip key point feature vector with the uniform format (number of video frames, number of PCA components), for example (25, 8).
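A minimal sketch of the PCA reduction described in step 2.2.2, assuming the 20 normalized lip key points of each frame are flattened to 40 values and reduced to 8 principal components (matching the (25, 8) per-second shape given above); the random array is only a placeholder for real landmark data.

import numpy as np
from sklearn.decomposition import PCA

lip_frames = np.random.rand(25 * 60, 20, 2)        # placeholder: 60 s of video at 25 fps
flat = lip_frames.reshape(len(lip_frames), -1)     # (frames, 40)

pca = PCA(n_components=8)
reduced = pca.fit_transform(flat)                  # (frames, 8) training targets

# at inference time the predicted components are mapped back to 20 key points
restored = pca.inverse_transform(reduced).reshape(-1, 20, 2)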
Step 2.3 lip key point prediction model Audio2MKP modeling and training
The embodiment of the invention proposes a convolutional-neural-network-based lip key point prediction model (Audio to Mouth Key Points, Audio2MKP) that realizes the mapping from speech to lip key points. It receives the input DSAudio_features audio features and, in combination with the subsequent step 2.4, the Audio2MKP model is trained repeatedly until accurately predicted lip key points are obtained. It involves steps 2.3.1 and 2.3.2.
Step 2.3.1 Audio2MKP modeling
Early image processing methods involve a heavy computational workload and a low degree of automation when acquiring lip key points, while lip key points obtained with long short-term memory (LSTM) networks are not accurate enough. A convolutional neural network (CNN) is a neural network specialized for grid-structured data; it can capture local properties of an image and performs well in image recognition. The Audio2MKP model is therefore designed on a CNN.
The network structure of the Audio2MKP model is shown in FIG. 3. It comprises several convolutional layers and several fully connected layers connected in sequence, with residual connections from the preceding convolutional layer added to some layers to form residual blocks.
In the embodiment, 12 convolutional layers (conv, denoted conv-1 to conv-12) and 2 fully connected layers (denoted line-1 and line-2) are preferably used to learn the mapping from the DSAudio_features to the lip key points.
In this model, the convolutional layers use convolution kernels of uniform size 3 × 3 × 3; each layer performs a vector convolution, and the audio feature map is activated with the ReLU function before entering the next convolutional layer. The function is defined by equation (2):
ReLU(x) = max(0, x)    (2)
where x is the feature map of the audio features.
Residual connections from the preceding convolutional layer are added to some layers (conv-2, conv-3, conv-5, conv-6, conv-8, conv-9 and conv-12 in the figure) to form residual blocks (the 7 residual blocks indicated by the dash-dot arrows in FIG. 3), reducing the growth of the training error as the network depth increases. Each residual block adds its input to the convolutional layer output before activation. Taking the connection between conv-1 and conv-2 as an example, the input of conv-2 (i.e. the activated output of conv-1) is added directly to the conv-2 convolution result, which is then activated again; the other residual blocks work in the same way. Each audio feature vector of shape (1, 16, 29) passes through the 12 convolutional layers to produce a feature map of shape (256, 1, 1), which is then reshaped into a one-dimensional vector of 256 feature values.
The convolutional layers are connected to the fully connected layers through this reshape. The first fully connected layer (line-1) extracts a vector of 64 feature values and applies batch normalization; the second fully connected layer (line-2) then outputs 8 feature values representing the lip key points. The predicted coordinates of the 20 lip key points are obtained through PCA restoration and de-normalization. For the PCA restoration, the inverse_transform() method of sklearn's PCA may be called in Python. For the de-normalization, the restored two-dimensional matrix is multiplied by the two-norm recorded during normalization and the lip centre coordinates are added, yielding the de-normalized lip key point coordinates.
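A simplified PyTorch sketch of the Audio2MKP idea: stacked convolutions with residual connections over the (1, 16, 29) audio feature map, followed by two fully connected layers that output the 8 values representing the lip key points. The channel counts, strides and layer count here are assumptions for illustration and do not reproduce the exact 12-convolution configuration described above.

import torch
import torch.nn as nn

class ResidualConv(nn.Module):
    """Convolutional layer whose input is added to its output before activation."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(x + self.conv(x))

class Audio2MKP(nn.Module):
    def __init__(self, out_dim=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            ResidualConv(64),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            ResidualConv(128),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            ResidualConv(256),
            nn.AdaptiveAvgPool2d(1),                 # -> (N, 256, 1, 1)
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                            # reshape to a 256-value vector
            nn.Linear(256, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, out_dim),                  # 8 values representing the lip key points
        )

    def forward(self, audio_feat):                   # audio_feat: (N, 1, 16, 29)
        return self.fc(self.features(audio_feat))

pred = Audio2MKP()(torch.randn(4, 1, 16, 29))        # -> (4, 8)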
Step 2.3.2 Audio2MKP training
As shown in FIG. 3, the DSAudio_features audio features are input into the Audio2MKP model, and the model is trained in combination with step 2.4 to obtain the predicted lip key point feature vector representation.
Step 2.4 Audio2MKP model parameter optimization
The root mean square error (RMSE) between the predicted and ground-truth lip key point feature representations is calculated to evaluate the accuracy of the predicted lip key points. It is denoted Loss_RMSE and computed by equation (1):
Loss_RMSE = sqrt( (1/n) · Σ_{i=1..n} (y_i − ŷ_i)² )    (1)
where n is the number of lip key points, and y_i and ŷ_i are the ground-truth and predicted lip key point coordinates corresponding to the i-th frame of the face image, respectively.
The model training of step 2.3 is an iterative process: the error is calculated in step 2.4 and the model parameters are updated by back-propagation, finally yielding a well-trained Audio2MKP model.
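A minimal training-loop sketch for steps 2.3.2 and 2.4, assuming `model` is an Audio2MKP instance like the one sketched above and `loader` yields (audio feature, ground-truth PCA component) pairs; the optimizer choice and hyperparameters are assumptions. Loss_RMSE from equation (1) is the square root of the mean squared error.

import torch

def train_audio2mkp(model, loader, epochs=50, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio_feat, y_true in loader:
            y_pred = model(audio_feat)
            loss = torch.sqrt(torch.mean((y_pred - y_true) ** 2))  # RMSE of equation (1)
            opt.zero_grad()
            loss.backward()                          # back-propagate and update the parameters
            opt.step()
    return model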
Step 3 reference image generation based on lip key points
The lip region of the face image is processed with the predicted lip key points obtained in step 2 to generate a mask image; a face translation generative adversarial network is then built and trained to obtain a high-quality face reference image. The embodiment further provides steps 3.1 to 3.4.
Step 3.1 mask image generation
With the predicted lip key points as a reference, a suitable region of the face image is selected and masked so as to cover the original lips; the lip contour is then drawn according to the predicted lip key points, forming the mask image. In the invention, part or all of the selected image is covered to control the processed region, and the covered image is defined as the mask image. The quality of the face image masking directly affects the performance of the FTGAN model in step 3.3. It involves steps 3.1.1 and 3.1.2.
Step 3.1.1 mask region selection
The mask image prepares for the subsequent FTGAN model training and reference image generation, so a suitable mask region must be selected. If the mask region is too large, the model's attention is dispersed, which hinders reconstruction of the lip region; if it is too small, errors between the predicted and ground-truth lip key points distort the generated reference image. FIG. 4 shows several unsuitable choices of mask region.
FIG. 4(a) masks the lower half of the whole image. This simply covers the lower half of the image, losing many facial details and giving the model no help in learning the mapping from the mask region to the face. FIG. 4(b) masks the lower half of the face, losing facial details of the lower half such as the chin. FIG. 4(c) masks only the lip region. Although this preserves the facial details to the greatest extent, the lip key points predicted by the trained model contain some error, which directly affects the realism of the reference image.
FIG. 5 illustrates the error in the predicted lip key points. In FIGS. 5(a) and 5(b), the area enclosed by the light-gray connected points is the real lip region of the given face, and the area enclosed by the dark-gray connected points is the lip region predicted from the audio. If the predicted lip key points contain errors, masking the image according to the predicted lip region cannot completely cover the lip region of the original face image, i.e. the light-gray connected segment outside the black mask region remains visible.
Therefore, when selecting the mask region, the following two principles are preferably considered:
(1) the mask region should cover as little of the given face image as possible outside the lip region;
(2) the mask region should tolerate a certain degree of error in the predicted lip key points.
The invention determines the size of the mask region by extending the lip region outward. As shown in FIG. 6, for the original image in FIG. 6(a), the rectangular region enclosing the lip key points is taken as the reference; it is extended to the left and right by a fixed fraction of the lip region width, and up and down by a fixed fraction of the lip region height (the exact fractions are given by formula images in the original text). At the same time, the predicted lip key points are shifted to align with the first lip key point (the point numbered 49 in FIG. 2(b)), further reducing the influence of prediction errors, giving the mask region shown in FIG. 6(b).
In practice, a preferred approach is to extract the 20 lip key points of the face image with the Dlib tool as in step 1.1, enclose them in a rectangular lip region, and extend this rectangle outward by the chosen fractions to form the mask region.
Step 3.1.2 mask region segmentation
Although step 3.1.1 selects a suitable mask region, the region contains different components. So that the subsequent model can be trained on these components, the selected mask region is segmented into lips, tongue, teeth and other parts, which are then filled separately to obtain the mask image. FIG. 7 illustrates the segmentation: FIG. 7(b) is the result of segmenting the mask region of the original image in FIG. 7(a), in which the area enclosed by the connected points represents the lips, the white filled areas represent the teeth and tongue, and the black area is the rest of the mask region.
Step 3.2 face region partitioning
When the face image is converted, the non-mask regions must be preserved in the reference image, that is, their pixel values are mapped one by one into the generated reference image, while the mask region is restored to the real lip region of the face.
The quality of a generated image is generally judged by the face portion; the face region, and particularly the lip region, is more important than the background region. Therefore, as shown in FIG. 8, the face image is divided, in order of decreasing importance, into a lip region, a face region and a background region. A weight is set for each region such that the weight of the lip region is greater than that of the face region, which in turn is greater than that of the background region.
In FIG. 8, the solid box, the dash-dot box and the dotted box represent the background region, the face region and the lip region, respectively. In the embodiment, the solid box covers the whole face image; the dash-dot box is the rectangle enclosing the 68 face key points and contains the whole head information, including the facial features; the dotted box is the rectangle enclosing the 20 lip key points. The solid box contains the dash-dot box, which contains the dotted box. The image is therefore not divided into several independent regions; the divided regions have an inclusion relationship.
When the background-region image reconstruction loss is calculated in the subsequent step 3.4, the term λ3·Loss_MAE(x_bg, G(z)_bg) in equation (5) is weighted by λ3 and also covers the face region and the lip region in one pass. Likewise, when the face-region loss is calculated, the term λ4·Loss_MAE(x_f, G(z)_f) is weighted by λ4 and also covers the lip region. That is, with this division the face region and the lip region enter the image reconstruction loss calculation several times. These are exactly the regions the model cares most about, so in essence this is a way of assigning larger weights to the face and lip regions. Dividing the face image with this inclusion relationship therefore adapts well to the face image translation of the FTGAN model.
The dash-dot box is actually drawn from the 68 face key points, which include the positions of the facial features and the facial contour. The face region represented by the dash-dot box does not include the whole head, for two reasons: (1) the head region has an irregular shape, which is inconvenient for calculating the face-region image reconstruction loss in step 3.4; (2) the face region is not the region the model cares most about, and treating the rest of the head as background still suits the face image translation task well.
The lip region represented by the dotted box is in fact the mask region, i.e. the lip region extended outward in both the width and height directions by the fractions chosen in step 3.1.1. This is the region the model cares most about; the quality of the generated reference image depends to a large extent on the accuracy of this region, including whether the lips are accurate, the details of the teeth and tongue, and whether the boundaries are smooth.
Step 3.3 face translation generative adversarial network FTGAN modeling and training
A face translation generative adversarial network (Face Translation GAN, FTGAN) is constructed and, in combination with the subsequent step 3.4, the FTGAN model is trained to convert the mask image obtained from the predicted lip key points into a high-quality face reference image. The invention defines the reference image as the image obtained by restoring the mask image to a realistic face image. It involves steps 3.3.1 and 3.3.2.
Step 3.3.1 FTGAN modeling
A generative adversarial network (GAN) (2014) is an unsupervised deep learning model that obtains good outputs through adversarial learning between a generator (G) and a discriminator (D): the generator learns to produce images that fool the discriminator, while the discriminator becomes stronger through adversarial learning and better distinguishes real images from generated ones. The invention modifies the internal architecture of the GAN to design the FTGAN model.
As shown in FIG. 9, the FTGAN model consists of a generator network and a discriminator network. Considering the uncertainty of the face's size and position relative to the image, the invention introduces an STN (Spatial Transformer Network) (2015) module and a CBAM (Convolutional Block Attention Module) (2018) module into the conventional generator; combining STN and CBAM improves the quality of the generated reference image. As lightweight general-purpose modules, STN and CBAM can be integrated into the network to improve the model's expressive power without extra computational overhead. For the discriminator, a multi-scale discriminator (PatchGAN) architecture (2017) is adopted. Unlike a conventional discriminator that outputs only one evaluation value, PatchGAN evaluates the generated image in multiple dimensions and better guides the generator to output the face reference image.
1. Generator network architecture
The generator network consists of a face encoder module, a face decoder module, an STN module and a CBAM module. It receives the input mask image and outputs the face reference image; from input to output, the STN module, face encoder module, CBAM module and face decoder module are connected in sequence. The encoder-decoder extracts the feature-map representation of the mask image during encoding and restores the image information during decoding. Because the size and position of the face in the image frame are often uncertain, the STN module is used to eliminate this effect during face image translation. The CBAM module attends to the intermediate channels that most influence the task during the model's convolution operations.
Encoder and decoder modules. The encoder and decoder consist of down-sampling convolutional layers, residual connection blocks and up-sampling convolutional layers. The down-sampling convolutional layers extract features and reduce their dimensionality; the residual blocks, containing a direct mapping part and a residual part, alleviate vanishing gradients and network degradation; the up-sampling convolutional layers decode the features to restore the image.
STN module. The STN network structure is shown in FIG. 10. It consists of a localization network, a grid generator and a sampler. The mask image is input; the localization network produces the affine transformation parameters θ, which are passed to the grid generator; the grid generator performs the spatial transformation according to θ to obtain T_θ(G), which is passed to the sampler; the input mask image is also fed into the sampler, which combines the two to output the feature map U1 and handles the case where an output coordinate is fractional and no pixel can be indexed directly. The STN module thus obtains a mapping between its input and output via an affine transformation, and the parameters are optimized by network back-propagation so that the data reach their optimal spatial positions.
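A compact PyTorch sketch of the STN idea used in the FTGAN and A2FGAN generators: a small localization network predicts the affine parameters theta, affine_grid builds the sampling grid T_theta(G), and grid_sample resamples the input. The channel sizes and layer widths here are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.loc = nn.Sequential(                     # localization network
            nn.Conv2d(in_ch, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # initialize the affine parameters to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)            # affine transformation parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # grid generator
        return F.grid_sample(x, grid, align_corners=False)           # sampler -> U1

u1 = STN()(torch.randn(2, 3, 128, 128))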
CBAM module. The CBAM network structure is shown in FIG. 11; it consists of a channel attention module (CAM) and a spatial attention module (SAM). The CAM implements channel attention using global max pooling and average pooling followed by a shared multilayer perceptron; the SAM implements spatial attention by aggregating the channel features with global max pooling and average pooling. The input feature U2 is fed into the CAM; the CAM output is multiplied element-wise with U2, and the resulting intermediate feature is fed into the SAM; the SAM output is multiplied element-wise with the intermediate feature to give the refined feature U3. The CBAM thus obtains the locally refined feature U3 through the attention mechanism.
Here, multiplication refers to element-wise multiplication of the feature maps.
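A minimal CBAM sketch matching the description above: a channel attention module (global max and average pooling through a shared MLP) followed by a spatial attention module (channel-wise max and mean maps through a 7 × 7 convolution). The reduction ratio and kernel size are common defaults, assumed here.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP of the channel attention module
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        n, c, _, _ = x.shape
        # channel attention (CAM): pooled descriptors through the shared MLP
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # spatial attention (SAM): channel-wise mean and max maps through a convolution
        sp = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))    # refined feature U3

u3 = CBAM(256)(torch.randn(2, 256, 32, 32))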
2. Network structure of discriminator
The discriminator network mainly comprises a frame discriminator module, which preferably adopts the PatchGAN structure. PatchGAN is a fully convolutional discriminator for generative adversarial networks. The input ground-truth image is cut into an N × N matrix X, and each patch is fed into the PatchGAN discriminator, where each X_ij is the evaluation value (real or fake) of the area at coordinate (i, j) of the image. The X_ij values are averaged to obtain the final discriminator output, which evaluates the image produced by the generator. A conventional GAN discriminator maps its input to a single real number and outputs only one evaluation value (real or fake); PatchGAN, being fully convolutional, evaluates the generated image in multiple dimensions and guides the generator to attend to more details when generating the image.
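A standard PatchGAN-style frame discriminator sketch: a fully convolutional network whose output is a grid of real/fake scores rather than a single value. The layer widths follow a common PatchGAN configuration and are an assumption, not the patent's exact settings.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, norm=True):
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return layers

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            *conv_block(in_ch, 64, norm=False),
            *conv_block(64, 128),
            *conv_block(128, 256),
            nn.Conv2d(256, 1, 4, padding=1),          # one score per image patch
        )

    def forward(self, img):
        return self.net(img)                          # (N, 1, H', W') patch scores X_ij

scores = PatchDiscriminator()(torch.randn(2, 3, 256, 256))
# averaging `scores`, or computing an MSE against an all-real / all-fake label map,
# gives the discriminator evaluation used in equation (4)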
Step 3.3.2 FTGAN training
When the FTGAN model is trained, as shown in FIG. 9, the mask image is input to the STN module of the generator, whose affine transformation produces the feature map U1. U1 is fed into the face encoder; the down-sampling convolutional layers and residual blocks produce the feature map U2. U2 is fed into the CBAM module; the combined action of its CAM and SAM produces the feature map U3. U3 is fed into the face decoder, and the up-sampling convolutional layers produce a preliminary reference image.
On this basis, the reference image produced by the generator and the corresponding ground-truth image are input to the frame discriminator network, and the mean squared error between the generated label and the ground-truth label is computed to evaluate the quality of the generated image. The model parameters are updated by back-propagation in step 3.4, and the FTGAN model is trained repeatedly to obtain a well-generated reference image. Here the ground-truth image is the digital image that reflects the characteristics of the original image.
Step 3.4 FTGAN model parameter optimization
The image reconstruction loss of the reference image is computed with an image reconstruction loss function based on the face region division; the reference image and the corresponding ground-truth image are fed into the frame discriminator to compute the error between them, and the model parameters are updated by back-propagation.
The total loss Loss_total of the FTGAN model consists of the general GAN loss Loss_gan and the image reconstruction loss Loss_rec, computed by equation (3):
Loss_total = λ1·Loss_gan + λ2·Loss_rec    (3)
where λ1 and λ2 are the weights of the GAN loss and the image reconstruction loss, respectively.
Loss_gan and Loss_rec are computed by equations (4) and (5):
Loss_gan = Loss_MSE(D(x), D(G(z)))    (4)
Loss_rec = λ3·Loss_MAE(x_bg, G(z)_bg) + λ4·Loss_MAE(x_f, G(z)_f) + λ5·Loss_MAE(x_m, G(z)_m)    (5)
where x and z are the ground-truth image and the noise image respectively, G(z) is the generated image, x_bg, x_f and x_m are the background region, face region and lip region of the divided face image, G(z)_bg, G(z)_f and G(z)_m are the background region, face region and lip region of the generated image G(z), and λ3, λ4 and λ5 are the loss weights of the background region, face region and lip region. When setting these weights, the lip region weight should be greater than the face region weight, which should be greater than the background region weight.
Loss_MSE and Loss_MAE are the mean squared error and the mean absolute error, computed by equations (6) and (7):
Loss_MSE(D(x), D(G(z))) = (1/n1) · Σ_{i=1..n1} (D(x)_i − D(G(z))_i)²    (6)
Loss_MAE(x, G(z)) = (1/n2) · Σ_{i=1..n2} |x_i − G(z)_i|    (7)
where n1 and n2 are the evaluation dimension of the discriminator and the number of pixels in the image, respectively.
The loss function thus computes the adversarial loss with the mean squared error and the image reconstruction loss with the absolute error, giving the two parts different weights.
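A sketch of the region-weighted reconstruction loss of equation (5), assuming the face and lip crops come from the rectangles produced by the face region division of step 3.2 (the whole image serves as the background region because of the inclusion relationship) and that the weights satisfy lambda3 < lambda4 < lambda5. The box arguments and weight values are hypothetical.

import torch
import torch.nn.functional as F

def reconstruction_loss(real, fake, face_box, lip_box, lambdas=(1.0, 2.0, 4.0)):
    l_bg, l_face, l_lip = lambdas                      # background < face < lip weights
    fx0, fy0, fx1, fy1 = face_box
    mx0, my0, mx1, my1 = lip_box
    loss = l_bg * F.l1_loss(fake, real)                # whole image (background region)
    loss += l_face * F.l1_loss(fake[..., fy0:fy1, fx0:fx1], real[..., fy0:fy1, fx0:fx1])
    loss += l_lip * F.l1_loss(fake[..., my0:my1, mx0:mx1], real[..., my0:my1, mx0:mx1])
    return loss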
The model training of step 3.3 is an iterative process: the error is calculated in step 3.4 and the model parameters are updated by back-propagation, finally yielding a well-trained FTGAN model.
Step 4 face animation generation based on the audio features and the reference image
In this step, on the basis of the reference image obtained in step 3, the audio features obtained in step 2 are used again to guide the generation of the face animation. To obtain the lip-sync effect of the face animation, a speech-to-face generative adversarial network is built and trained, and face animation synthesis is performed to obtain the final face animation. The embodiment further provides steps 4.1 and 4.2.
Step 4.1 speech-to-face generative adversarial network A2FGAN modeling and training
A speech-to-face generative adversarial network (A2FGAN) is built; it receives the DSAudio_features and the reference images as input and, in combination with the subsequent step 4.2, the A2FGAN model is trained to obtain the face animation frames. It involves steps 4.1.1 and 4.1.2.
Step 4.1.1 A2FGAN modeling
The preferred scheme of the embodiment designs the A2FGAN model on the GAN architecture. Unlike a conventional GAN, however, it contains one generator network and two discriminator networks, as shown in FIG. 12. In the generator, a combination of STN and CBAM modules is again introduced to improve the quality of the generated images. For the discriminators, a frame discriminator based on the PatchGAN structure is constructed; in addition, to judge whether the audio and video are synchronized, SyncNet's (2016) strength in judging lip synchronization is exploited and a SyncNet-based lip-sync discriminator is designed.
1. Generator network architecture
The generator network consists of a face coder, a decoder module, an STN module and a CBAM module.
The encoder module contains a face encoder and a DSAudio _ features audio encoder. The face encoder has the same task as the face encoder in the FTGAN model of step 3.3.1, namely extracting the feature map of the input image and its coded representation. The DSAudio _ features audio encoder extracts a feature map of DSAudio _ features audio features and an encoded representation thereof, and is composed of convolutional layers with residual connection. The face decoder is similar to the face decoder in the FTGAN model. The difference is that the face decoder in the A2FGAN model needs to decode after splicing the feature maps extracted by the two encoders.
The reference image is input into the STN module, and an affine transformation of the reference image yields the feature map U_1; U_1 is input into the face encoder to obtain the feature map U_2. Then, the DSAudio_features audio features are input into the DSAudio_features audio encoder to obtain the feature map U_3. Next, U_2 and U_3 are concatenated and input into the CBAM module to obtain the feature map U_4. Finally, U_4 is input into the face decoder to obtain a face animation image frame.
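The data flow described above can be summarized by the following PyTorch-style sketch of the generator. The sub-modules are passed in as placeholders; their internal layer structure, and the assumption that U_2 and U_3 have matching spatial dimensions for channel-wise concatenation, are simplifications of this sketch rather than details taken from the embodiment.

```python
import torch
import torch.nn as nn

class A2FGANGeneratorSketch(nn.Module):
    """Sketch of the generator data flow: STN -> face encoder, audio encoder,
    channel-wise concatenation -> CBAM -> face decoder."""

    def __init__(self, stn, face_encoder, audio_encoder, cbam, face_decoder):
        super().__init__()
        self.stn = stn                      # affine alignment of the reference image
        self.face_encoder = face_encoder    # down-sampling convolutions + residual blocks
        self.audio_encoder = audio_encoder  # convolutions with residual connections
        self.cbam = cbam                    # channel (CAM) + spatial (SAM) attention
        self.face_decoder = face_decoder    # up-sampling convolutions

    def forward(self, reference_image, audio_features):
        u1 = self.stn(reference_image)              # feature map U1
        u2 = self.face_encoder(u1)                  # feature map U2
        u3 = self.audio_encoder(audio_features)     # feature map U3
        u4 = self.cbam(torch.cat([u2, u3], dim=1))  # concatenate, then CBAM -> U4
        return self.face_decoder(u4)                # face animation image frame
```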
2. Discriminator network architecture
The two discriminator networks are a frame discriminator module and a lip-sound synchronization discriminator module, respectively.
Frame discriminator module: it adopts the same structure as the PatchGAN discriminator in the FTGAN model.
Lip-sound synchronization discriminator module: the SyncNet-based lip synchronization judgment network is shown in fig. 13. It judges the similarity of the audio features and the face image features in a common parameter space, so as to evaluate the synchronization between the generated face animation image frames and the corresponding audio frames. Considering the uncertainty of the position and size of the face relative to the whole image, and the practical fact that the audio features correspond to lip movement, the SyncNet model is adapted: instead of using the lower half of the face image as input, the lip mask region on the face image, centered on the lip coordinates, is used as input when computing the similarity of the audio frame and the image frame in the common parameter space. In this way, the region still covers the lip area, and model parameters pre-trained at this scale can be migrated.
In this network, as shown in fig. 13, two convolutional neural networks CNN_1 and CNN_2 extract the feature information of the audio frame and of the face image frame of the same length, respectively. A weight sharing module sets shared weights for the two kinds of information and converts the feature information of the audio frames and image frames, which belong to different modalities, into the same parameter space. Then, a loss function measures the difference between the cosine similarity label vector and the true-value label vector.
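A simplified sketch of such a two-branch discriminator is given below. The branch networks CNN_1 and CNN_2 are passed in as placeholders, and the single shared linear projection standing in for the weight sharing module is an assumption of the sketch rather than the actual SyncNet adaptation.

```python
import torch.nn as nn
import torch.nn.functional as F

class LipSyncDiscriminatorSketch(nn.Module):
    """Embeds an audio frame and the lip region of an image frame into a common
    parameter space and returns their cosine similarity."""

    def __init__(self, audio_cnn, image_cnn, embed_dim=512):
        super().__init__()
        self.audio_cnn = audio_cnn          # CNN_1: encodes the audio frame
        self.image_cnn = image_cnn          # CNN_2: encodes the lip mask region of the image frame
        self.shared_proj = nn.Linear(embed_dim, embed_dim)  # shared-weight projection into the common space

    def forward(self, audio_frame, lip_region):
        a = self.shared_proj(self.audio_cnn(audio_frame))  # audio embedding A
        v = self.shared_proj(self.image_cnn(lip_region))   # visual embedding V
        return F.cosine_similarity(a, v, dim=-1)           # similarity label in [-1, 1]
```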
Step 4.1.2 A2FGAN training
When training the A2FGAN model, as shown in fig. 12, the reference image is first input into the STN module and affine-transformed to obtain the feature map U_1. U_1 is input into the face encoder, and the feature map U_2 is obtained through down-sampling convolution layers and residual convolution blocks. Then, the DSAudio_features audio features are input into the DSAudio_features audio encoder, and the feature map U_3 is obtained through convolution layers with residual connections. Next, U_2 and U_3 are concatenated and input into the CBAM module, and the feature map U_4 is obtained through the CAM and SAM sub-modules of the CBAM module. Finally, U_4 is input into the face decoder, and a preliminary face animation image frame is obtained through up-sampling convolution layers.
Discriminator training is then carried out. In the preferred scheme of the present invention, the lip-sound synchronization discriminator and the frame discriminator are trained separately: the lip-sound synchronization discriminator is trained first, and the frame discriminator afterwards.
1. Lip-sound synchronization discriminator training
First, the lip-sound synchronization discriminator is used for lip-sound synchronization training, that is, judging whether the speaking mouth shape and the voice of the person are synchronized. If they are synchronized, the model is considered trained; otherwise, training is repeated.
When training the lip-sound synchronization discriminator, as shown in fig. 12, the face animation image frame produced by the generator and the corresponding ground-truth audio are input into the lip-sound synchronization discriminator, and the feature representations of the face image frame and the audio frame in a common parameter space are obtained through CNN_1, CNN_2 and the weight sharing module. The similarity between the feature representations of the face image frames and the audio frames in the common parameter space is calculated using cosine similarity, giving a cosine similarity label vector between the two. The cosine similarity is calculated by equation (8), i.e.
cosine_similarity(A, V) = ( Σ_{i=1}^{n} A_i · V_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} V_i²) )   (8)
where n represents the dimension of the image frame and audio frame representations in the common parameter space, and A_i and V_i represent the feature values of the audio frame and of the image frame sequence in the i-th dimension of the common parameter space, respectively.
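As a small numerical illustration of equation (8) (the embedding values below are made up for the example, not taken from the model):

```python
import numpy as np

def cosine_similarity(a, v):
    """Equation (8): sum_i A_i * V_i divided by the product of the two vector norms."""
    a, v = np.asarray(a, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(a, v) / (np.linalg.norm(a) * np.linalg.norm(v)))

# Two toy 4-dimensional embeddings of an audio frame and an image frame:
print(cosine_similarity([1.0, 0.5, 0.0, 2.0], [0.9, 0.6, 0.1, 1.8]))  # ~0.997, i.e. near-synchronous
```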
2. Frame discriminator training
When training the frame discriminator, the lip-synchronized face animation image frames obtained from training the lip-sound synchronization discriminator are used. At this point, although lip-sound synchronization has been achieved, the frames still differ from the ground-truth images, so training continues until the effect of the ground-truth images is fully reached.
The lip-synchronized face animation image frame and the corresponding ground-truth image are input into the frame discriminator, and the mean square error between the generated label and the true-value label is calculated to evaluate the quality of the generated image. Combined with step 4.2, the A2FGAN model is trained repeatedly by updating the model parameters through back propagation, finally obtaining face animation image frames that are lip-synchronized and preserve the effect of the ground-truth images.
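For illustration, one evaluation step of such a frame discriminator could look as follows. The helper name and the use of all-ones/all-zeros label maps are assumptions of this sketch; the optimizer update is omitted.

```python
import torch
import torch.nn.functional as F

def frame_discriminator_loss(frame_disc, real_frame, generated_frame):
    """PatchGAN-style frame discriminator loss: real frames should score 1,
    generated frames 0, measured by mean square error against those labels."""
    real_score = frame_disc(real_frame)
    fake_score = frame_disc(generated_frame.detach())  # do not backpropagate into the generator here
    return (F.mse_loss(real_score, torch.ones_like(real_score)) +
            F.mse_loss(fake_score, torch.zeros_like(fake_score)))
```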
Step 4.2 A2FGAN model parameter optimization
The loss function Loss_total of the A2FGAN model consists of three parts, namely the generation-versus-discrimination (adversarial) loss Loss_gan of the frame discriminator, the lip-sound synchronization loss Loss_sync of the lip-sound synchronization discriminator, and the face image reconstruction loss Loss_rec. The calculation formula is shown in formula (9).
Loss_total = λ_1 · Loss_gan + λ_2 · Loss_sync + λ_3 · Loss_rec   (9)
where λ_1, λ_2 and λ_3 represent the weights of the respective loss terms.
Loss_gan and Loss_rec are calculated using equations (4) and (5), in the same way as in the FTGAN model.
Loss_sync is obtained by calculating the binary cross entropy loss between the cosine similarity label vector and the true-value label vector; it is calculated by formula (10).
Loss_sync = Loss_BCE(cosine_similarity(D_sync(audio, G(x))), y)   (10)
where audio represents the audio corresponding to the ground-truth image, G(x) represents the image generated by the generator from the reference image, D_sync denotes the lip-sound synchronization discriminator constructed on the SyncNet model, D_sync(audio, G(x)) returns the representations of the audio and of G(x) in the same parameter space, the cosine_similarity() function calculates the cosine similarity between them according to expression (8), and y represents the true-value label vector corresponding to lip-sound synchronization. Loss_BCE is calculated as in equation (11), i.e.
Loss_BCE = −(1/N) · Σ_{i=1}^{N} [ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ]   (11)
where N represents the dimension of the label vector, and y_i and p_i represent the true-value label and the cosine similarity label in the i-th dimension, respectively.
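The following sketch combines equations (9)–(11). Rescaling the cosine similarity from [-1, 1] into (0, 1) before the binary cross entropy, and the concrete λ values, are assumptions added for numerical validity of the sketch; the embodiment only specifies the weighted sum and the BCE between the cosine similarity labels and the true-value labels.

```python
import torch.nn.functional as F

def a2fgan_total_loss(loss_gan, cos_sim_labels, true_sync_labels, loss_rec,
                      lambda1=1.0, lambda2=1.0, lambda3=10.0):
    """Equation (9): Loss_total = l1*Loss_gan + l2*Loss_sync + l3*Loss_rec,
    where Loss_sync is the BCE (eq. 11) between the cosine similarity labels
    (eq. 10) and the ground-truth synchronization labels."""
    # cosine similarity lies in [-1, 1]; map it into (0, 1) so BCE is well defined
    probs = ((cos_sim_labels.clamp(-1.0, 1.0) + 1.0) / 2.0).clamp(1e-7, 1.0 - 1e-7)
    loss_sync = F.binary_cross_entropy(probs, true_sync_labels)
    return lambda1 * loss_gan + lambda2 * loss_sync + lambda3 * loss_rec
```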
The model training of step 4.1 is an iterative process: the error is calculated as in step 4.2 and the model parameters are updated by back propagation, finally yielding a well-trained A2FGAN model.
Step 4.3 face animation synthesis
The input face image generally takes one of two forms: a face video or a single face image. For the former, i.e. a given sequence of face image frames, the faces in the video already have natural head motion; for the latter, additional processing of the image is required to generate a face animation with natural head motion. Therefore, face animation synthesis handles the following two cases separately.
1. When the human face image is given in the form of video
Because the faces in the video already have natural head motion, the face animation can be generated directly with the trained A2FGAN model.
2. When the human face image is given in the form of a single image
When the face image is given as a single image, in order to ensure that the face in the animation generated by driving this single image with speech has natural head motion, the procedure is as follows: given the speech and the single target face image, a standard intermediate face image frame sequence is selected, and an intermediate face animation synchronized with the audio is generated from the given audio and the intermediate frame sequence. Then, a face pose migration model is constructed using the Few_Shot_Vid2Vid model (2019), the intermediate face animation is migrated onto the given single face image, and the final face animation is generated and output. Here, the Few_Shot_Vid2Vid model is a GAN-based conditional video synthesis network that can synthesize realistic motion video of the same person from a small number of target images and has scene generalization capability, and is therefore suitable for face migration and face animation video generation.
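At a high level, the two cases of step 4.3 can be organized as in the following sketch. Both `a2fgan` and `pose_transfer` are hypothetical wrappers introduced only for illustration (the latter standing in for the Few_Shot_Vid2Vid-based pose migration model); neither name reflects an actual API of those models, and the per-frame chunking of the audio is an assumption of the sketch.

```python
def synthesize_face_animation(audio_chunks, face_input, a2fgan, pose_transfer=None,
                              intermediate_frames=None):
    """Case 1: face_input is a list of video frames -> drive them directly.
    Case 2: face_input is a single image -> animate a standard intermediate
    frame sequence first, then migrate that motion onto the target face."""
    if isinstance(face_input, list):  # case 1: face video with natural head motion
        return [a2fgan(frame, chunk) for frame, chunk in zip(face_input, audio_chunks)]
    # case 2: single target image
    intermediate_animation = [a2fgan(frame, chunk)
                              for frame, chunk in zip(intermediate_frames, audio_chunks)]
    return pose_transfer(source_frames=intermediate_animation, target_image=face_input)
```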
Abbreviation annotation table

Table: abbreviation annotation table (image not reproduced; it annotates the abbreviations used above, including Audio2MKP, FTGAN, A2FGAN, DSAudio_features, STN, CBAM, CAM, SAM, SyncNet and PatchGAN).
In specific implementation, a person skilled in the art can implement the above automatic process using computer software technology. System devices implementing the method, such as a computer-readable storage medium storing the corresponding computer program of the technical solution of the present invention, and a computer device including and able to run the corresponding computer program, should also fall within the scope of the present invention.
In some possible embodiments, a speech-driven face animation generation system is provided, comprising a processor and a memory, the memory storing program instructions, the processor being configured to invoke the stored instructions in the memory to perform a speech-driven face animation generation method as described above.
In some possible embodiments, a speech-driven face animation generation system is provided, which includes a readable storage medium, on which a computer program is stored, and when the computer program is executed, the speech-driven face animation generation method is implemented.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A voice-driven face animation generation method, characterized in that it comprises the following steps for automatically generating the face animation:
step S1, extracting and standardizing face key points, wherein the standardization of the face key points comprises taking the lips as the main reference and correcting the geometric positions of the face key points using the positional relationship between the eyes and the lips;
step S2, predicting lip key points from the audio features, comprising audio feature extraction, data preprocessing, modeling and training of the lip key point prediction model Audio2MKP, and parameter optimization of the lip key point prediction model Audio2MKP; the lip key point prediction model Audio2MKP is a model realizing the mapping from speech to lip key points;
step S3, generating a reference image based on the lip key points, comprising mask image generation, face region division, modeling and training of the face conversion generative adversarial network model FTGAN, and parameter optimization of the face conversion generative adversarial network model FTGAN; the face conversion generative adversarial network model FTGAN is a model realizing the conversion of a face mask image into a face reference image;
step S4, on the basis of the reference image obtained in step S3, using the audio features obtained in step S2 to guide the generation of the face animation, comprising modeling and training of the speech-to-face generative adversarial network A2FGAN, parameter optimization of the speech-to-face generative adversarial network A2FGAN, and face animation synthesis; the speech-to-face generative adversarial network A2FGAN is a model realizing face animation with a lip-sound synchronization effect.
2. The speech-driven face animation generation method of claim 1, wherein: when extracting the face key points, the key parts of the face in the face image, including the eyebrows, eyes, nose, lips and the outer contour of the face, are first located, and the basic face key points are determined.
3. The speech-driven face animation generation method of claim 1, wherein: in step S2, the lip key point prediction model Audio2MKP is realized based on a convolutional neural network, and accurate lip key point coordinates are obtained from the speech information; the realization comprises the following steps,
1) Audio2MKP modeling, comprising a convolutional-neural-network-based lip key point prediction model that realizes the mapping from speech to lip key points; the lip key point prediction model Audio2MKP comprises a plurality of convolution layers and a plurality of fully connected layers connected in sequence, and residual connections from the preceding convolution layer are added to some of the convolution layers to form residual blocks;
2) Audio2MKP training, comprising receiving the input audio features, training the lip key point prediction model Audio2MKP, and obtaining accurately predicted lip key points by optimizing the model parameters through back propagation.
4. The speech-driven face animation generation method of claim 1, wherein: when the mask image is generated in step S3, the size of the mask region is determined by extending outward from the lip region.
5. The speech-driven face animation generation method of claim 1, wherein: when the face region is divided in step S3, the face image is divided, in order of importance from high to low, into a lip region, a face region and a background region; meanwhile, a weight is set for each region, with the weight of the lip region greater than that of the face region and the weight of the face region greater than that of the background region.
6. The speech-driven face animation generation method of claim 1, wherein: in step S3, the face conversion generative adversarial network model FTGAN is established based on a structural improvement of the generative adversarial network, and converts a face mask image into a face reference image; the realization comprises the following steps,
1) FTGAN modeling, comprising realizing, based on the face conversion generative adversarial network model FTGAN, the conversion of the mask image obtained from the predicted lip key points into a face reference image; the face conversion generative adversarial network model FTGAN consists of a generator network and a discriminator network, wherein the generator network receives an input mask image and generates an output face reference image, and its STN module, face encoder module, CBAM module and face decoder module are connected in sequence; the discriminator network comprises a frame discriminator module, the reference image generated by the generator network and the corresponding ground-truth image are input into the frame discriminator module together, and the mean square error between the generated label and the true-value label is calculated to evaluate the quality of the generated image;
2) FTGAN training, comprising receiving the input mask image, training the face conversion generative adversarial network model FTGAN, and obtaining a high-quality face reference image by optimizing the model parameters through back propagation.
7. The speech-driven face animation generation method according to claim 1, 2, 3, 4, 5 or 6, wherein: in step S4, the proposed speech-to-face generative adversarial network A2FGAN is established based on a structural improvement of the generative adversarial network, so as to obtain the lip-sound synchronization effect of the face animation; the realization comprises the following steps,
1) A2FGAN modeling, comprising generating the face animation from the speech information on the basis of the reference image, based on the speech-to-face generative adversarial network A2FGAN; the speech-to-face generative adversarial network A2FGAN comprises a generator network and two discriminator networks, wherein the generator network consists of a face encoder module, a face decoder module, an STN module and a CBAM module, and the two discriminator networks are a frame discriminator module and a lip-sound synchronization discriminator module, respectively;
2) A2FGAN training, comprising receiving the input audio features and reference image, training the speech-to-face generative adversarial network A2FGAN, and obtaining lip-synchronized face animation image frames by optimizing the model parameters through back propagation.
8. The speech-driven face animation generation method of claim 7, wherein: when training the A2FGAN discriminators, the lip-sound synchronization discriminator module is trained first and separately, then the frame discriminator module is trained on the lip-synchronized face animation images obtained from that training, finally obtaining face animation image frames that are lip-synchronized and preserve the effect of the ground-truth image.
9. A speech-driven face animation generation system, characterized in that it is used for implementing a speech-driven face animation generation method as claimed in any one of claims 1 to 8.
10. The speech-driven face animation generation system of claim 9, wherein: it comprises a processor and a memory, the memory being used for storing program instructions and the processor being used for calling the instructions stored in the memory to execute a speech-driven face animation generation method as claimed in any one of claims 1 to 8.
CN202211005678.XA 2022-08-22 2022-08-22 Voice-driven human face animation generation method and system Pending CN115457169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211005678.XA CN115457169A (en) 2022-08-22 2022-08-22 Voice-driven human face animation generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211005678.XA CN115457169A (en) 2022-08-22 2022-08-22 Voice-driven human face animation generation method and system

Publications (1)

Publication Number Publication Date
CN115457169A true CN115457169A (en) 2022-12-09

Family

ID=84298557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211005678.XA Pending CN115457169A (en) 2022-08-22 2022-08-22 Voice-driven human face animation generation method and system

Country Status (1)

Country Link
CN (1) CN115457169A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071811A (en) * 2023-04-06 2023-05-05 中国工商银行股份有限公司 Face information verification method and device
CN116433807A (en) * 2023-04-21 2023-07-14 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116664731A (en) * 2023-06-21 2023-08-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116664731B (en) * 2023-06-21 2024-03-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN116884066A (en) * 2023-07-10 2023-10-13 深锶科技(北京)有限公司 Lip synthesis technology-based 2D real person digital avatar generation method
CN116828129A (en) * 2023-08-25 2023-09-29 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system
CN116828129B (en) * 2023-08-25 2023-11-03 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system

Similar Documents

Publication Publication Date Title
CN111489287B (en) Image conversion method, device, computer equipment and storage medium
CN115457169A (en) Voice-driven human face animation generation method and system
CN112887698B (en) High-quality face voice driving method based on nerve radiation field
Chuang et al. Mood swings: expressive speech animation
Ma et al. Styletalk: One-shot talking head generation with controllable speaking styles
CN113793408A (en) Real-time audio-driven face generation method and device and server
CN113822969A (en) Method, device and server for training nerve radiation field model and face generation
US20220398797A1 (en) Enhanced system for generation of facial models and animation
CN115330912B (en) Training method for generating human face speaking video based on audio and image driving
US11887232B2 (en) Enhanced system for generation of facial models and animation
US20220398795A1 (en) Enhanced system for generation of facial models and animation
WO2021228183A1 (en) Facial re-enactment
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN116634242A (en) Speech-driven speaking video generation method, system, equipment and storage medium
CN113470170A (en) Real-time video face region space-time consistent synthesis method using voice information
CN111612687B (en) Automatic makeup method for face image
CN117635771A (en) Scene text editing method and device based on semi-supervised contrast learning
CN114863533A (en) Digital human generation method and device and storage medium
CN117422829A (en) Face image synthesis optimization method based on nerve radiation field
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN117115331B (en) Virtual image synthesizing method, synthesizing device, equipment and medium
US20240013464A1 (en) Multimodal disentanglement for generating virtual human avatars
CN115631285B (en) Face rendering method, device, equipment and storage medium based on unified driving
CN114283181B (en) Dynamic texture migration method and system based on sample
US20230319223A1 (en) Method and system for deep learning based face swapping with multiple encoders

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination