CN115100329B - Multi-mode driving-based emotion controllable facial animation generation method - Google Patents

Multi-mode driving-based emotion controllable facial animation generation method

Info

Publication number
CN115100329B
CN115100329B
Authority
CN
China
Prior art keywords
coordinate
emotion
facial
audio
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210744504.9A
Other languages
Chinese (zh)
Other versions
CN115100329A (en)
Inventor
李瑶
赵子康
李峰
郭浩
杨艳丽
程忱
曹锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202210744504.9A priority Critical patent/CN115100329B/en
Publication of CN115100329A publication Critical patent/CN115100329A/en
Application granted granted Critical
Publication of CN115100329B publication Critical patent/CN115100329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to image processing technology, and in particular to a multi-modal-driving-based method for generating facial animation with controllable emotion. Step S1: preprocess the images of a portrait video to obtain a facial 3D feature coordinate sequence; step S2: preprocess the audio of the portrait video and decouple it into an audio content vector and an audio style vector; step S3: train a facial lip-sound coordinate animation generation network, composed of a multi-layer perceptron and a long short-term memory network, based on the facial 3D feature coordinate sequence and the audio content vector. The invention introduces an emotion portrait as the emotion source and remodels the emotion of the target portrait through the joint driving of the emotion-source portrait and the audio, providing diversified emotional facial animation. Under multi-modal driving, the method avoids the low robustness of audio as a single driving source, removes the dependence of emotion generation on emotional speech recognition, enhances the complementarity among data, and achieves more realistic emotional expression in the generated facial animation.

Description

Multi-mode driving-based emotion controllable facial animation generation method
Technical Field
The invention relates to image processing technology, and in particular to an emotion-controllable facial animation generation method based on multi-modal driving.
Background
Facial animation generation is a popular research area in computer vision generative models. Its purpose is to transform a still portrait into a realistic facial animation driven by arbitrary audio. It has broad applications in assistive treatment systems for visual and hearing impairments, virtual anchors, games with customizable characters, and similar fields. However, owing to the limitations of their principles and characteristics, existing facial animation generation methods still lack maturity in the emotional aspect of the generated portrait animation, which seriously limits its application value.
In recent years, many studies in facial animation generation have achieved realistic lip movement and head pose swing, but have paid little attention to portrait emotion, which is also an important factor. Portrait emotional information strongly influences the expressiveness of synthesized facial animation: different facial expressions often give the same sentence different emotional colors, and perceiving emotional information in the visual modality is one of the important channels of human audiovisual speech communication. However, most facial animation generation methods use audio as a single-modality driving source, which performs well for the lip movement of generated syllables but relatively poorly for generating facial expressions. The reason is that direct audio driving is affected by the complexity of audio emotion and by noise, so the generated facial expressions often show ghosting and distortion, resulting in poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by the accuracy of emotional speech recognition, so their efficiency is low and the emotion in the generated facial video lacks diversity and naturalness.
Disclosure of Invention
To solve the problem that existing facial animation generation methods lack emotion regulation capability, the invention provides an emotion-controllable facial animation generation method based on multi-modal driving.
The invention is realized by adopting the following technical scheme:
the method for generating the emotion controllable facial animation based on multi-mode driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used to obtain a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
And step S3: and training a face lip voice coordinate animation generation network consisting of a Multi-Layer Perceptron (MLP) and a Long-Short-Term Memory (LSTM) network based on the face coordinate sequence obtained in the step S1 and the audio content vector obtained in the step S2.
And step S4: and training a facial emotion coordinate animation generation network consisting of MLP, LSTM, self-attention mechanism (Self-attention mechanism) and generation countermeasure network (GAN) based on the facial coordinate sequence obtained in the step S1 and the audio content vector and the audio style vector obtained in the step S2.
Step S5: and training a coordinate-to-video network consisting of GANs based on the face coordinate sequence obtained in the step S1.
Step S6: and (4) inputting any two portrait pictures (one representing identity source and one representing emotion source) and any one section of audio based on the facial lip sound coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, and generating a lip sound synchronous video of the target portrait with the emotion corresponding to the emotion source.
The emotion-controllable facial animation generation method based on multi-modal driving uses computer vision generative models and deep neural network models as technical support to construct the emotion-controllable facial animation generation network.
The invention has the following beneficial effects. Compared with existing facial animation generation methods, the method addresses the ghosting and distortion of facial expressions caused by relying on audio features alone, as well as the limited accuracy of emotional speech recognition. It introduces an emotion portrait as the emotion source, remodels the emotion of the target portrait through multi-modal driving by the emotion-source portrait features and the audio features, and generates facial animation with controllable emotion. The dual driving by the emotion portrait and the audio avoids the dependence of emotion generation on speech information alone, so the generated video has controllable emotion while meeting the requirements of lip-sound synchronization and spontaneous head swing; that is, it ensures the diversity and naturalness of the facial animation and achieves more realistic emotional expression.
The method effectively solves the inefficiency of existing facial animation generation methods, whose facial expressions are limited by the accuracy of speech emotion recognition, and can be used in assistive treatment systems for visual and hearing impairments, virtual anchors, games with customizable characters, and other fields.
Drawings
FIG. 1 is a schematic diagram of a multi-modal driven emotion controllable facial animation generation structure according to an embodiment of the invention.
Fig. 2 is a schematic diagram comparing the present invention with a conventional face animation method.
Fig. 3 is a sample video schematic of an embodiment of the invention.
Detailed Description
In this embodiment, the portrait video data set used is the public Multi-view Emotional Audio-visual Dataset (MEAD).
As shown in FIG. 1, the method for generating emotion controllable facial animation based on multi-modal driving is realized by adopting the following steps:
step S1: the image of the portrait video is preprocessed, and then a face recognition algorithm face alignment is used for obtaining a face 3D feature coordinate sequence.
Step S2: the audio of the portrait video is preprocessed and then the preprocessed audio is decoupled into an audio content vector irrelevant to the audio speaker and an audio style vector relevant to the audio speaker by using a voice conversion method.
And step S3: and training a face lip sound coordinate animation generation network consisting of a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network based on the face coordinate sequence obtained in the step S1 and the audio content vector obtained in the step S2.
And step S4: and training a face emotion coordinate animation generation network consisting of MLP, LSTM, self-attention mechanism (Self-attention mechanism) and generation countermeasure network (GAN) based on the face coordinate sequence obtained in the step S1 and the audio content vector and style vector obtained in the step S2.
Step S5: and training a coordinate-to-video network consisting of GANs based on the face coordinate sequence obtained in the step S1. During this step of training, a loss function is used to calculate the minimum distance in pixels between the reconstructed face and the training target face.
Step S6: and (4) inputting any two portrait pictures (one representing identity source and one representing emotion source) and a section of any audio based on the facial lip sound coordinate animation generation network, the facial emotion coordinate animation generation network and the coordinate-to-video network obtained in the steps S3, S4 and S5, and generating a lip sound synchronous video of the target portrait with the emotion corresponding to the emotion source.
In step S1, the images of the portrait video are preprocessed; the preprocessing comprises frame rate conversion, image resampling, and facial coordinate extraction.
First, the video is converted to a frame rate of 62.5 frames per second. The frames are then resampled and cropped into 256 × 256 videos containing the face. Finally, facial coordinates are extracted with the face recognition algorithm face-alignment, and the 3D coordinates of the face in each frame (dimension 68 × 3) are obtained to form the facial 3D feature coordinate sequence.
In addition, the facial 3D feature coordinate sequence is saved as an emotion-source portrait coordinate sequence (emotion-source facial coordinates) and an identity-source portrait coordinate sequence (identity-source facial coordinates). Compared with raw portrait pixels, facial coordinates provide a natural low-dimensional representation of the portrait and a high-quality bridge to the downstream emotion reenactment task.
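For illustration, a minimal Python sketch of this preprocessing step is given below. It assumes the open-source face_alignment package and OpenCV; the file names and the exact cropping strategy are hypothetical, and the patent does not prescribe these tools in code.

```python
# Illustrative sketch of step S1 (assumed tools: face_alignment + OpenCV;
# paths and the resampling strategy are hypothetical).
import cv2
import numpy as np
import face_alignment

# 3D landmark detector returning 68 x 3 points per face. Note: the enum name
# differs across face_alignment versions (_3D in older releases, THREE_D in newer ones).
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.THREE_D, device="cpu")

def extract_coordinate_sequence(video_path):
    """Return an array of shape (num_frames, 68, 3) of facial 3D landmarks."""
    cap = cv2.VideoCapture(video_path)   # frame-rate conversion to 62.5 fps would
    coords = []                          # normally be done beforehand, e.g. with ffmpeg
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (256, 256))          # resample / crop to 256 x 256
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        landmarks = fa.get_landmarks(rgb)              # list of (68, 3) arrays, or None
        if landmarks:
            coords.append(landmarks[0])
    cap.release()
    return np.stack(coords)

identity_coords = extract_coordinate_sequence("identity_source.mp4")
emotion_coords = extract_coordinate_sequence("emotion_source.mp4")
```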
In step S2, the audio of the portrait video is preprocessed; the preprocessing comprises sampling-rate conversion, audio vector extraction, and audio vector decoupling.
The audio is first converted to a sampling rate of 16000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group). Audio vector extraction is then performed, with the audio vector obtained using the Python Resemblyzer library. Finally, the audio vector is fed into the voice conversion model AutoVC, which after decoupling yields an audio content vector independent of the speaker and an audio style vector dependent on the speaker.
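A compact sketch of this audio pipeline is shown below. Resemblyzer is a real library with the calls used here; the AutoVC content-encoder call is a hypothetical placeholder, since the patent only names the model and the released AutoVC code is loaded from its own checkpoint.

```python
# Illustrative sketch of step S2. The ffmpeg call and Resemblyzer usage are real;
# the AutoVC wrapper is a hypothetical stand-in for the voice conversion model.
import subprocess
from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

# Resample the extracted audio to 16 kHz mono with ffmpeg.
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-ar", "16000", "-ac", "1", "audio.wav"],
               check=True)

wav = preprocess_wav(Path("audio.wav"))               # load and normalize the waveform
speaker_encoder = VoiceEncoder()
style_vector = speaker_encoder.embed_utterance(wav)   # speaker-dependent style vector

def autovc_content_encoder(wav, style_vector):
    """Hypothetical stand-in for AutoVC's content encoder (not a real API)."""
    raise NotImplementedError("Load the released AutoVC checkpoint and run its content encoder here.")

# content_vectors = autovc_content_encoder(wav, style_vector)  # speaker-independent content
```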
Step S3 completes the training of the facial lip-sound coordinate animation generation network.
The network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder composed of two MLP layers and a speech content encoder composed of three LSTM layers, and the decoder is a facial lip-sound coordinate decoder composed of three MLP layers. To generate an optimal sequence of facial lip-sound coordinate offsets, the network uses a loss function to continuously adjust its weights and biases until the error between the predicted coordinates and the reference coordinates is minimized.
The custom encoder-decoder network structure is as follows:
firstly, two-layer MLP is used for extracting the identity feature of the face 3D feature coordinate sequence (namely the first time point of the face 3D feature coordinate sequence) of the first frame of the video obtained in the step S1. And then, based on the identity characteristics and the audio content vector obtained in the step S2, carrying out linear fusion and extracting the dependency relationship between audio continuous syllables and lip coordinates by using the LSTM of the three-layer unit. Then, based on the output of the encoder in the step, a decoder composed of three layers of MLPs is used for predicting a facial lip voice coordinate offset sequence, and the specific calculation formula is as follows:
ΔP t =MLP c (LSTM c (Ec t→t+λ ,MLP L (L;W mlp,l )W lstm );W mlp,c ) (1)
in the formula (1), Δ P t Indicating the predicted lip sound coordinate offset of the t frame face, wherein t represents the current frame of the portrait video; MLP L Representing a face coordinate encoder, L being the face coordinates of the first frame of the portrait video, W mlp,l Representing facial coordinate encoder learnable parameters; LSTM c Representing phonetic content compilationsA coder, ec representing the audio content vector, t → t + λ representing that the audio content vector is input to the speech content coder in a batch size of λ =18 per frame t, W lstm Representing speech content encoder learnable parameters; MLP c Coordinate decoder for lip voice of face, W mlp,c The coordinate decoder for lip sound on the face can learn the parameters.
The first-frame coordinates of the portrait video are corrected with the predicted facial lip-sound coordinate offset sequence to obtain the lip-sound-synchronized coordinate sequence, calculated as follows:
P_t = L + ΔP_t    (2)
In formula (2), P_t denotes the lip-sound-synchronized facial coordinates of the t-th frame, where t is the current frame of the portrait video; L denotes the facial coordinates of the first frame of the portrait video, and ΔP_t denotes the predicted facial lip-sound coordinate offset of the t-th frame.
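For readers who want a concrete picture of this encoder-decoder, the following PyTorch sketch is a minimal interpretation of formulas (1) and (2). Only the layer counts follow the patent text; the hidden sizes, content-vector dimension, and class name are assumptions.

```python
# Minimal PyTorch sketch of the facial lip-sound coordinate animation network
# (formulas (1) and (2)). Hidden sizes and the content-vector dimension are assumed.
import torch
import torch.nn as nn

class LipSoundCoordinateNet(nn.Module):
    def __init__(self, content_dim=256, hidden=256):
        super().__init__()
        # facial coordinate encoder MLP_L: two MLP layers on the 68x3 first-frame coordinates
        self.face_encoder = nn.Sequential(
            nn.Linear(68 * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # speech content encoder LSTM_c: three LSTM layers over the content vectors Ec
        self.content_encoder = nn.LSTM(content_dim + hidden, hidden,
                                       num_layers=3, batch_first=True)
        # facial lip-sound coordinate decoder MLP_c: three MLP layers
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 68 * 3))

    def forward(self, first_frame_coords, content_vectors):
        # first_frame_coords: (B, 68, 3); content_vectors: (B, T, content_dim)
        identity = self.face_encoder(first_frame_coords.flatten(1))      # MLP_L(L)
        fused = torch.cat([content_vectors,
                           identity.unsqueeze(1).expand(-1, content_vectors.size(1), -1)],
                          dim=-1)                                        # linear fusion
        hidden_seq, _ = self.content_encoder(fused)                      # LSTM_c(...)
        delta_p = self.decoder(hidden_seq).view(-1, content_vectors.size(1), 68, 3)
        return first_frame_coords.unsqueeze(1) + delta_p                 # P_t = L + ΔP_t
```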
In order to generate an optimal sequence of facial lip-sound coordinate offsets, a loss function is set, based on the encoder-decoder structure of the facial lip-sound coordinate animation generation network, to adjust the weights and biases of the network. The objective of the loss function is to minimize the error between the predicted coordinates and the coordinates obtained in step S1, calculated as follows:
L_lip = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖²    (3)
In formula (3), L_lip denotes the loss function of the facial lip-sound coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; P_{i,t} denotes the i-th predicted facial coordinate of the t-th frame, P̂_{i,t} denotes the i-th facial coordinate of the t-th frame obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖² denotes the squared Euclidean norm of their difference.
When the loss function levels off, i.e., when L_lip reaches its minimum, training of the facial lip-sound coordinate animation generation network is complete.
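A compact sketch of how this coordinate loss could drive a training step, continuing the hypothetical LipSoundCoordinateNet above, is shown below; the optimizer and learning rate are assumptions.

```python
# Hypothetical training step for formula (3): squared Euclidean distances between
# predicted and reference coordinates (optimizer choice is an assumption).
import torch

model = LipSoundCoordinateNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(first_frame_coords, content_vectors, reference_coords):
    pred = model(first_frame_coords, content_vectors)          # (B, T, 68, 3)
    loss = ((pred - reference_coords) ** 2).sum(dim=-1).sum()  # Σ_t Σ_i ||P - P̂||²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```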
Step S4 completes the training of the facial emotion coordinate animation generation network, which adds rich visual emotional expression to the generated video.
Humans rely on visual information when interpreting emotion, and rich visual emotional expression gives a stronger sense of realism and greater practicality. Most existing facial animation generation algorithms focus on expressing lip movement and head pose swing from the audio modality alone. Audio-only driving works well for the lip movement of generated syllables but relatively poorly for facial expressions, because direct audio driving is affected by the complexity of audio emotion and by noise: the generated facial expressions often show ghosting and distortion, resulting in poor accuracy and low robustness. Some existing methods introduce emotional speech recognition to avoid these problems, but they are constrained by its accuracy, so their efficiency is low and the emotion in the generated facial video lacks diversity and naturalness.
This patent therefore proposes a multi-modally driven facial emotion coordinate animation generation network: an emotion portrait is introduced as the emotion source and, driven jointly with the audio features, achieves more accurate emotion remodeling of the target portrait.
The network is a custom encoder-decoder structure; the encoder comprises an audio encoder and facial coordinate encoders, and the decoder comprises a coordinate decoder. The encoder obtains the audio features, the portrait identity features, and the portrait emotion features. The decoder processes these multi-modal features, which are jointly driven by the audio features and the portrait emotion features, and generates the coordinate offset sequence after the emotion of the target portrait is remodeled, adding rich visual emotional expression to the video. Under multi-modal driving, the method avoids the low robustness of audio as a single driving source, removes the dependence of emotion generation on emotional speech recognition, enhances the complementarity among data, and achieves more realistic emotional expression in the facial animation.
In order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network. The first computes the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1. The second and third are discriminator loss functions that respectively judge whether the generated facial coordinates are real and measure the similarity between interval frames of the facial coordinates.
The custom encoder-decoder structure of the facial emotion coordinate animation generation network is as follows:
The encoder consists of an audio encoder, an identity-source facial coordinate encoder, and an emotion-source facial coordinate encoder. The audio encoder captures audio features through three LSTM layers, three MLP layers, and a self-attention mechanism.
Specifically, the LSTM first extracts features from the audio content vector obtained in step S2; the MLP then extracts features from the audio style vector obtained in step S2; the audio content features and the audio style features are then linearly fused; finally, a self-attention mechanism captures the longer-range structural dependency between the audio content vector and the audio style vector, yielding audio features with stronger temporal dependency, calculated as follows:
S_t = Attn(LSTM_c'(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})    (4)
In formula (4), S_t denotes the processed audio feature of the t-th frame, where t is the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es denotes the audio style vector, and W_{mlp,s} denotes the learnable parameters of the audio style vector encoder; LSTM_c' denotes the audio content vector encoder, Ec denotes the audio content vector, t → t+λ indicates that the audio content vector is fed to the audio content vector encoder in batches of λ = 18 per frame t, and W'_{lstm} denotes the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism, and W_{attn} denotes its learnable parameters.
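A rough illustration of this audio encoder is sketched below, under the assumption that nn.MultiheadAttention can stand in for the patent's self-attention mechanism; all dimensions are invented.

```python
# Sketch of the audio encoder of the facial emotion coordinate network (formula (4)).
# nn.MultiheadAttention is used as a stand-in for the self-attention mechanism;
# all dimensions are assumed, not taken from the patent.
import torch
import torch.nn as nn

class EmotionAudioEncoder(nn.Module):
    def __init__(self, content_dim=256, style_dim=256, hidden=256, heads=4):
        super().__init__()
        self.content_lstm = nn.LSTM(content_dim, hidden, num_layers=3, batch_first=True)
        self.style_mlp = nn.Sequential(                      # three MLP layers on Es
            nn.Linear(style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, content_vectors, style_vector):
        # content_vectors: (B, T, content_dim); style_vector: (B, style_dim)
        content_feat, _ = self.content_lstm(content_vectors)            # (B, T, hidden)
        style_feat = self.style_mlp(style_vector).unsqueeze(1)          # (B, 1, hidden)
        fused = content_feat + style_feat                               # linear fusion
        audio_feat, _ = self.attn(fused, fused, fused)                  # self-attention
        return audio_feat                                               # S_t per frame
```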
The two facial coordinate encoders are lightweight neural networks, each composed of seven MLP layers. They are similar in structure but different in function: one extracts the geometric information of identity, and the other extracts the geometric information of facial emotion.
Based on the two different facial coordinate sets obtained in step S1 (one taken as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence), the identity-source facial coordinate encoder composed of seven MLP layers first extracts the portrait identity features of the identity source; the emotion-source facial coordinate encoder composed of seven MLP layers then extracts the portrait emotion features of the emotion source; finally, the portrait identity features, the portrait emotion features, and the audio features obtained by formula (4) are linearly fused to obtain the fusion feature, calculated as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)    (5)
In formula (5), F_t denotes the fusion feature of the t-th frame after linear fusion, and concat denotes linear fusion; MLP_{LA} denotes the identity-source facial coordinate encoder, L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and W_{mlp,la} denotes the learnable parameters of the identity-source facial coordinate encoder; MLP_{LB} denotes the emotion-source facial coordinate encoder, L_b denotes the facial coordinates of the first frame of the emotion-source portrait video, and W_{mlp,lb} denotes its learnable parameters; S_t denotes the t-th frame audio feature obtained by formula (4).
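The two coordinate encoders and the fusion of formula (5) could be sketched as follows; the seven-layer depth follows the patent text, while the layer width and helper names are assumptions.

```python
# Sketch of the identity-source / emotion-source facial coordinate encoders and
# the fusion of formula (5). Layer width is assumed; only the depth follows the text.
import torch
import torch.nn as nn

def seven_layer_mlp(in_dim=68 * 3, hidden=128):
    layers, dim = [], in_dim
    for _ in range(7):                       # seven MLP layers per encoder
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    return nn.Sequential(*layers)

identity_encoder = seven_layer_mlp()         # MLP_LA: geometric identity information
emotion_encoder = seven_layer_mlp()          # MLP_LB: geometric emotion information

def fuse(identity_coords, emotion_coords, audio_feat):
    # identity_coords, emotion_coords: (B, 68, 3); audio_feat: (B, T, 256)
    id_feat = identity_encoder(identity_coords.flatten(1))    # (B, 128)
    emo_feat = emotion_encoder(emotion_coords.flatten(1))     # (B, 128)
    T = audio_feat.size(1)
    return torch.cat([id_feat.unsqueeze(1).expand(-1, T, -1),
                      emo_feat.unsqueeze(1).expand(-1, T, -1),
                      audio_feat], dim=-1)                     # F_t = concat(..., S_t)
```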
Based on the fusion feature of the portrait identity features, the portrait emotion features, and the audio features obtained by formula (5), a coordinate decoder composed of three MLP layers predicts the facial emotion coordinate offset sequence, calculated as follows:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})    (6)
In formula (6), ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame, where t is the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t denotes the fused t-th frame feature from formula (5), and W_{mlp,ld} denotes the learnable parameters of the decoder.
The first-frame coordinates of the identity-source portrait video are corrected with the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence, calculated as follows:
Q_t = L_a + ΔQ_t    (7)
In formula (7), Q_t denotes the facial emotion coordinates of the t-th frame, where t is the current frame of the portrait video; L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame.
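Continuing the hypothetical building blocks above, formulas (6) and (7) reduce to a small decoder plus an additive correction; widths match the earlier sketches and are assumptions.

```python
# Sketch of the three-layer MLP coordinate decoder (formulas (6) and (7)).
# The input width (128 + 128 + 256) matches the fusion sketch above; all widths are assumed.
import torch.nn as nn

emotion_decoder = nn.Sequential(             # MLP_LD
    nn.Linear(128 + 128 + 256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 68 * 3))

def predict_emotion_coords(fused_feat, identity_first_frame):
    # fused_feat: (B, T, 512); identity_first_frame: (B, 68, 3)
    delta_q = emotion_decoder(fused_feat).view(fused_feat.size(0), -1, 68, 3)  # ΔQ_t
    return identity_first_frame.unsqueeze(1) + delta_q                         # Q_t = L_a + ΔQ_t
```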
In order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network; the specific formula is as follows:
L_total = λ_1·L_emo + λ_2·L_{D_L} + λ_3·L_{D_T}    (8)
In formula (8), L_total denotes the total loss function of the facial emotion coordinate animation generation network, L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, L_{D_L} denotes the loss function of the facial coordinate discriminator D_L, and L_{D_T} denotes the loss function of the facial coordinate interval-frame similarity discriminator D_T; λ_1, λ_2, and λ_3 are the respective weight parameters.
The facial coordinate loss function computes the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1 (the identity-source portrait coordinate sequence carrying the same emotion as the emotion source), calculated as follows:
L_emo = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖²    (9)
In formula (9), L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; Q_{i,t} denotes the i-th predicted facial coordinate of the t-th frame, Q̂_{i,t} denotes the i-th facial coordinate of the t-th frame obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖² denotes the squared Euclidean norm of their difference.
During training of the facial emotion coordinate animation generation network, the discriminator loss function L_{D_L} is used to judge whether the generated facial coordinates are real, and the discriminator loss function L_{D_T} is used to estimate the similarity between interval frames of the facial coordinates; they are given by formulas (10) and (11) (not reproduced here).
In formulas (10) and (11), t denotes the current frame of the portrait video; D_L denotes the discriminator that judges whether the facial coordinates are real, and L_{D_L} denotes its loss function; D_T denotes the facial coordinate interval-frame similarity discriminator, and L_{D_T} denotes its loss function; Q_t denotes the predicted facial emotion coordinates of the t-th frame, Q̂_t denotes the t-th frame facial coordinates obtained in step S1, and the facial coordinates of the frame preceding Q̂_t also enter formula (11).
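Because formulas (10) and (11) are not recoverable from this text, the fragment below is only a hedged sketch of one conventional way such losses are written: binary cross-entropy GAN objectives for a per-frame coordinate discriminator and an interval-frame similarity discriminator, combined with the coordinate loss as in formula (8). The formulation, the weights, and all names are assumptions, not the patent's exact equations.

```python
# Hedged sketch of formulas (8), (10), (11): an assumed BCE-style GAN loss for the
# coordinate discriminator D_L and the interval-frame similarity discriminator D_T.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_losses(D_L, D_T, q_pred, q_ref, q_ref_prev):
    # q_pred: predicted coordinates Q_t; q_ref: reference coordinates Q̂_t;
    # q_ref_prev: reference coordinates of the preceding frame.
    real_l = D_L(q_ref)                                   # D_L: real vs. generated coordinates
    fake_l = D_L(q_pred.detach())
    loss_dl = bce(real_l, torch.ones_like(real_l)) + bce(fake_l, torch.zeros_like(fake_l))

    real_t = D_T(torch.cat([q_ref, q_ref_prev], dim=-1))  # D_T: interval-frame similarity
    fake_t = D_T(torch.cat([q_pred.detach(), q_ref_prev], dim=-1))
    loss_dt = bce(real_t, torch.ones_like(real_t)) + bce(fake_t, torch.zeros_like(fake_t))
    return loss_dl, loss_dt

def total_generator_loss(coord_loss, adv_dl, adv_dt, lambdas=(1.0, 0.1, 0.1)):
    # Formula (8): weighted combination of the coordinate loss and the two
    # adversarial terms; the weight values here are placeholders.
    l1, l2, l3 = lambdas
    return l1 * coord_loss + l2 * adv_dl + l3 * adv_dt
```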
When the loss function levels off, training of the facial emotion coordinate animation generation network is complete.
In step S5, the coordinate-to-video network is trained.
Based on the facial coordinate sequence obtained in step S1, the discrete coordinates are connected in numbered order and rendered with colored line segments to create a three-channel facial sketch sequence of size 256 × 256. This sequence is channel-concatenated with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256. With this sequence as input, the coordinate-to-video network generates the reconstructed face video.
To generate an optimal face video, an L1 loss function (L1-norm loss) is set, based on this image translation network, to adjust the weights and biases of the network. The loss aims to minimize the pixel distance between the reconstructed face video and the training target face video.
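The input construction for the coordinate-to-video network could look like the sketch below; the line color, the simple consecutive-landmark connection, and the use of 2D projected coordinates are assumptions made for illustration.

```python
# Illustrative sketch of the coordinate-to-video input: landmarks connected in
# numbered order, drawn as colored line segments, then channel-concatenated with
# the first video frame. Color and landmark grouping are assumptions.
import cv2
import numpy as np

def render_face_sketch(coords_2d, size=256):
    """coords_2d: (68, 2) pixel coordinates -> (size, size, 3) sketch image."""
    sketch = np.zeros((size, size, 3), dtype=np.uint8)
    pts = coords_2d.astype(int)
    for i in range(len(pts) - 1):                      # connect consecutive landmarks
        p1 = tuple(int(v) for v in pts[i])
        p2 = tuple(int(v) for v in pts[i + 1])
        cv2.line(sketch, p1, p2, (0, 255, 0), 1)
    return sketch

def make_network_input(coords_seq, first_frame):
    """coords_seq: (T, 68, 2); first_frame: (size, size, 3) -> (T, size, size, 6)."""
    inputs = []
    for coords in coords_seq:
        sketch = render_face_sketch(coords)
        inputs.append(np.concatenate([sketch, first_frame], axis=-1))  # six channels
    return np.stack(inputs)
```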
In step S6, based on the facial lip-sound coordinate animation generation network, the facial emotion coordinate animation generation network, and the coordinate-to-video network obtained in steps S3, S4, and S5, any two portrait pictures (one representing the identity source and the other the emotion source) and any segment of audio are input to generate the target video.
The face recognition algorithm face-alignment is used to obtain the corresponding identity-source and emotion-source portrait coordinates, and the voice conversion method is used to obtain the audio content vector and audio style vector of the audio. The audio content vector and the identity-source portrait coordinates pass through the facial lip-sound coordinate animation generation network obtained in step S3 to generate the lip-sound-synchronized facial coordinate offset sequence. The audio content vector, the audio style vector, the identity-source portrait coordinates, and the emotion-source portrait coordinates pass through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence. The identity-source portrait coordinates are corrected with these two offset sequences to obtain the final coordinate sequence, which is fed into the coordinate-to-video network obtained in step S5 to generate the lip-synchronized video of the target portrait carrying the emotion of the emotion source.
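The overall inference flow of step S6 can be summarized as a schematic composition; every argument below is a placeholder for a component built earlier (landmark detector, trained networks, audio preprocessor), and the shapes and return conventions are assumptions.

```python
# Schematic inference composition for step S6; all arguments are placeholders for
# previously trained or constructed components, not real APIs from the patent.
def generate_emotional_talking_video(identity_img, emotion_img, audio_path,
                                     landmark_detector, audio_preprocessor,
                                     lip_offset_net, emotion_offset_net, coord2video_net):
    identity_coords = landmark_detector(identity_img)        # identity-source face coordinates
    emotion_coords = landmark_detector(emotion_img)          # emotion-source face coordinates
    content_vecs, style_vec = audio_preprocessor(audio_path) # step-S2 content/style vectors

    delta_p = lip_offset_net(identity_coords, content_vecs)  # lip-sound coordinate offsets
    delta_q = emotion_offset_net(identity_coords, emotion_coords,
                                 content_vecs, style_vec)    # facial emotion coordinate offsets
    final_coords = identity_coords + delta_p + delta_q       # corrected final coordinate sequence
    return coord2video_net(final_coords, identity_img)       # lip-synced, emotion-controlled video
```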
The multi-modally driven emotion-controllable facial animation generation method is realized through a voice conversion method, multi-layer perceptrons, long short-term memory networks, a self-attention mechanism, and generative adversarial networks. As shown in FIGS. 2-3, the invention can generate videos with different emotions by adjusting the emotion-source portrait, which gives it high application value and overcomes the lack of emotion and poor robustness of existing facial animation generation methods.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A method for generating emotion-controllable facial animation based on multi-modal driving, characterized by comprising the following steps:
step S1: preprocessing the images of a portrait video and extracting a facial 3D feature coordinate sequence from the preprocessed images with a face recognition algorithm;
step S2: preprocessing the audio of the portrait video, and then decoupling the preprocessed audio, using a voice conversion method, into an audio content vector independent of the speaker and an audio style vector dependent on the speaker;
step S3: training a facial lip-sound coordinate animation generation network composed of a multi-layer perceptron and a long short-term memory network, based on the facial 3D feature coordinate sequence and the audio content vector;
step S4: training a facial emotion coordinate animation generation network composed of multi-layer perceptrons, long short-term memory networks, a self-attention mechanism, and a generative adversarial network, based on the facial 3D feature coordinate sequence, the audio content vector, and the audio style vector;
step S5: training a coordinate-to-video network composed of a generative adversarial network, based on the facial 3D feature coordinate sequence;
step S6: based on the trained facial lip-sound coordinate animation generation network, facial emotion coordinate animation generation network, and coordinate-to-video network, inputting any two portrait pictures, one representing the identity source and the other the emotion source, together with any segment of audio, and generating a lip-synchronized video of the target portrait carrying the emotion of the emotion source;
in step S3, the facial lip-sound coordinate animation generation network adopts a custom encoder-decoder structure: the encoder comprises a facial coordinate encoder composed of two MLP layers and a speech content encoder composed of three LSTM layers, and the decoder is a facial lip-sound coordinate decoder composed of three MLP layers; the facial lip-sound coordinate animation generation network is provided with a loss function used to continuously adjust the weights and biases of the network until the error between the predicted coordinates and the reference coordinates is minimized;
in step S3, the training process of the facial lip-sound coordinate animation generation network is as follows:
first, two MLP layers extract the identity feature from the facial 3D feature coordinates of the first video frame obtained in step S1, i.e., the identity feature at the first time point of the facial 3D feature coordinate sequence;
then, based on the identity feature and the audio content vector obtained in step S2, after linear fusion, a three-layer LSTM extracts the dependency between consecutive audio syllables and the lip coordinates;
then, based on the encoder output, a decoder composed of three MLP layers predicts the facial lip-sound coordinate offset sequence, calculated as follows:
ΔP_t = MLP_c(LSTM_c(Ec_{t→t+λ}, MLP_L(L; W_{mlp,l}); W_{lstm}); W_{mlp,c})
where ΔP_t denotes the predicted facial lip-sound coordinate offset of the t-th frame, and t denotes the current frame of the portrait video; MLP_L denotes the facial coordinate encoder, L denotes the facial coordinates of the first frame of the portrait video, and W_{mlp,l} denotes the learnable parameters of the facial coordinate encoder; LSTM_c denotes the speech content encoder, Ec denotes the audio content vector, t → t+λ indicates that the audio content vector is fed to the speech content encoder in batches of λ = 18 per frame t, and W_{lstm} denotes the learnable parameters of the speech content encoder; MLP_c denotes the facial lip-sound coordinate decoder, and W_{mlp,c} denotes its learnable parameters;
the first-frame coordinates of the portrait video are corrected with the predicted facial lip-sound coordinate offset sequence to obtain the lip-sound-synchronized coordinate sequence, calculated as follows:
P_t = L + ΔP_t
where P_t denotes the lip-sound-synchronized facial coordinates of the t-th frame, and t denotes the current frame of the portrait video; L denotes the facial coordinates of the first frame of the portrait video, and ΔP_t denotes the predicted facial lip-sound coordinate offset of the t-th frame;
in order to generate an optimal sequence of facial lip-sound coordinate offsets, a loss function is set, based on the encoder-decoder structure of the facial lip-sound coordinate animation generation network, to adjust the weights and biases of the network, the loss function being calculated as follows:
L_lip = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖P_{i,t} − P̂_{i,t}‖²
where L_lip denotes the loss function of the facial lip-sound coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; P_{i,t} denotes the i-th predicted coordinate of the t-th frame, P̂_{i,t} denotes the i-th coordinate of the t-th frame obtained in step S1, and ‖P_{i,t} − P̂_{i,t}‖² denotes the squared Euclidean norm of their difference;
when the loss function levels off, i.e., when L_lip reaches its minimum, training of the facial lip-sound coordinate animation generation network is complete;
in step S4, the facial emotion coordinate animation generation network adopts a custom encoder-decoder structure:
the encoder comprises an audio encoder and facial coordinate encoders, the facial coordinate encoders comprising an identity-source facial coordinate encoder and an emotion-source facial coordinate encoder, and the audio encoder captures audio features through three LSTM layers, three MLP layers, and a self-attention mechanism;
the decoder comprises a coordinate decoder;
the encoder is used to obtain the audio features, the portrait identity features, and the portrait emotion features, and the decoder is used to process the multi-modal features, which are jointly driven by the audio features and the portrait emotion features to generate the coordinate offset sequence after the emotion of the target portrait is remodeled;
the facial emotion coordinate animation generation network is provided with three different loss functions to adjust the weights and biases of the network: the first computes the distance between the predicted facial 3D feature coordinate sequence and the facial 3D feature coordinate sequence obtained in step S1, and the second and third are discriminator loss functions that respectively judge whether the generated facial coordinates are real and measure the similarity between interval frames of the facial coordinates;
in step S4, the network training process for generating the facial emotion coordinate animation is as follows:
firstly, using LSTM to extract the characteristics of the audio content vector obtained in the step S2;
then, using MLP to extract the characteristics of the audio style vector obtained in the step S2;
then, carrying out linear fusion on the audio content vector characteristics and the audio style vector characteristics;
and finally, capturing the longer-range structural dependency between the audio content vector and the audio style vector with a self-attention mechanism to obtain audio features with stronger temporal dependency, calculated as follows:
S_t = Attn(LSTM_c'(Ec_{t→t+λ}; W'_{lstm}), MLP_s(Es; W_{mlp,s}); W_{attn})
where S_t denotes the processed audio feature of the t-th frame, and t denotes the current frame of the portrait video; MLP_s denotes the audio style vector encoder, Es denotes the audio style vector, and W_{mlp,s} denotes the learnable parameters of the audio style vector encoder; LSTM_c' denotes the audio content vector encoder, Ec denotes the audio content vector, t → t+λ indicates that the audio content vector is fed to the audio content vector encoder in batches of λ = 18 per frame t, and W'_{lstm} denotes the learnable parameters of the audio content vector encoder; Attn denotes the self-attention mechanism, and W_{attn} denotes its learnable parameters;
the two facial coordinate encoders are lightweight neural networks each composed of seven MLP layers, one extracting the geometric information of identity and the other the geometric information of facial emotion;
based on the two different facial coordinate sets obtained in step S1, one taken as the identity-source facial coordinate sequence and the other as the emotion-source facial coordinate sequence, the identity-source facial coordinate encoder composed of seven MLP layers first extracts the portrait identity features of the identity source; the emotion-source facial coordinate encoder composed of seven MLP layers then extracts the portrait emotion features of the emotion source; finally, the portrait identity features, the portrait emotion features, and the obtained audio features are linearly fused to obtain the fusion feature, calculated as follows:
F_t = concat(MLP_{LA}(L_a; W_{mlp,la}), MLP_{LB}(L_b; W_{mlp,lb}), S_t)
where F_t denotes the t-th frame feature after linear fusion, and concat denotes linear fusion; MLP_{LA} denotes the identity-source facial coordinate encoder, L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and W_{mlp,la} denotes the learnable parameters of the identity-source facial coordinate encoder; MLP_{LB} denotes the emotion-source facial coordinate encoder, L_b denotes the facial coordinates of the first frame of the emotion-source portrait video, and W_{mlp,lb} denotes its learnable parameters; S_t denotes the t-th frame audio feature of step S4;
based on the fusion feature of the portrait identity features, the portrait emotion features, and the audio features, a coordinate decoder composed of three MLP layers predicts the facial emotion coordinate offset sequence, calculated as follows:
ΔQ_t = MLP_{LD}(F_t; W_{mlp,ld})
where ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame, and t denotes the current frame of the portrait video; MLP_{LD} denotes the decoder of the facial emotion coordinate animation generation network, F_t denotes the fused t-th frame feature after the linear fusion in step S4, and W_{mlp,ld} denotes the learnable parameters of the decoder;
the first-frame coordinates of the identity-source portrait video are corrected with the predicted facial emotion coordinate offset sequence to obtain the facial emotion coordinate sequence, calculated as follows:
Q_t = L_a + ΔQ_t
where Q_t denotes the facial emotion coordinates of the t-th frame, and t denotes the current frame of the portrait video; L_a denotes the facial coordinates of the first frame of the identity-source portrait video, and ΔQ_t denotes the predicted facial emotion coordinate offset of the t-th frame;
in order to generate an optimal facial emotion coordinate offset sequence, three different loss functions are set, based on the encoder-decoder structure of the facial emotion coordinate animation generation network, to adjust the weights and biases of the network, with the specific formula:
L_total = λ_1·L_emo + λ_2·L_{D_L} + λ_3·L_{D_T}
where L_total denotes the total loss function of the facial emotion coordinate animation generation network, L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, L_{D_L} denotes the loss function of the facial coordinate discriminator D_L, and L_{D_T} denotes the loss function of the facial coordinate interval-frame similarity discriminator D_T; λ_1, λ_2, and λ_3 are the respective weight parameters;
wherein the coordinate loss function of the facial emotion coordinate animation generation network computes the distance between the predicted facial emotion coordinate sequence and the facial coordinates obtained in step S1, with the specific calculation formula:
L_emo = ∑_{t=1}^{T} ∑_{i=1}^{N} ‖Q_{i,t} − Q̂_{i,t}‖²
where L_emo denotes the coordinate loss function of the facial emotion coordinate animation generation network, T denotes the total number of video frames, t denotes the current frame of the portrait video, N = 68 denotes the total number of facial coordinates, and i denotes the index of the current facial coordinate; Q_{i,t} denotes the i-th predicted facial coordinate of the t-th frame, Q̂_{i,t} denotes the i-th facial coordinate of the t-th frame obtained in step S1, and ‖Q_{i,t} − Q̂_{i,t}‖² denotes the squared Euclidean norm of their difference;
during training of the facial emotion coordinate animation generation network, the discriminator loss function L_{D_L} is used to judge whether the generated facial coordinates are real, and the discriminator loss function L_{D_T} is used to estimate the similarity between interval frames of the facial coordinates (the corresponding formulas are not reproduced here), where t denotes the current frame of the portrait video; D_L denotes the discriminator that judges whether the facial coordinates are real, and L_{D_L} denotes its loss function; D_T denotes the facial coordinate interval-frame similarity discriminator, and L_{D_T} denotes its loss function; Q_t denotes the predicted facial emotion coordinates of the t-th frame, Q̂_t denotes the t-th frame facial coordinates obtained in step S1, and the facial coordinates of the frame preceding Q̂_t also enter the formula;
when the loss function levels off, training of the facial emotion coordinate animation generation network is complete.
2. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S1 specifically comprises:
firstly, performing frame rate conversion on the video, converting it to 62.5 frames per second;
then, resampling the images and cropping them into 256 × 256 videos containing the face;
extracting facial coordinates with a face recognition algorithm, obtaining the 3D coordinates of the face in each frame with dimension 68 × 3, and forming the facial 3D feature coordinate sequence;
and saving the facial 3D feature coordinate sequence as an emotion-source portrait coordinate sequence and an identity-source portrait coordinate sequence, i.e., the emotion-source facial coordinates and the identity-source facial coordinates.
3. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein step S2 specifically comprises:
performing sampling-rate conversion on the audio, converting the sampling rate to 16000 Hz using FFmpeg (Fast Forward Moving Picture Experts Group);
then performing audio vector extraction, obtaining the audio vector with the Python Resemblyzer library;
and finally, inputting the audio vector into the voice conversion model AutoVC, and obtaining the decoupled audio content vector independent of the speaker and the audio style vector dependent on the speaker.
4. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S5, the training process of the coordinate-to-video network is as follows:
based on the facial coordinate sequence obtained in step S1, connecting the discrete coordinates in numbered order and rendering them with colored line segments to create a three-channel facial sketch sequence of size 256 × 256;
channel-concatenating this sequence with the original picture of the first frame of the corresponding video to create a six-channel picture sequence of size 256 × 256;
using this sequence as input and generating the reconstructed face video with the coordinate-to-video network;
and, to generate an optimal face video, setting an L1 loss function to adjust the weights and biases of the coordinate-to-video network.
5. The method for generating emotion-controllable facial animation based on multi-modal driving according to claim 1, wherein in step S6 the lip-synchronized video of the target portrait carrying the emotion of the emotion source is generated using the three trained network models, specifically comprising:
inputting any two portrait pictures and any segment of audio, obtaining the identity-source portrait coordinates and the emotion-source portrait coordinates with a face recognition algorithm, and obtaining the audio content vector and audio style vector of the audio with a voice conversion method;
passing the audio content vector and the identity-source portrait coordinates through the facial lip-sound coordinate animation generation network obtained in step S3 to generate the lip-sound-synchronized facial coordinate offset sequence;
passing the audio content vector, the audio style vector, the identity-source portrait coordinates, and the emotion-source portrait coordinates through the facial emotion coordinate animation generation network obtained in step S4 to generate the facial emotion coordinate offset sequence;
and correcting the identity-source portrait coordinates with the two offset sequences to obtain the final coordinate sequence, inputting it into the coordinate-to-video network obtained in step S5, and generating the lip-synchronized video of the target portrait carrying the emotion of the emotion source.
CN202210744504.9A 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method Active CN115100329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210744504.9A CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Publications (2)

Publication Number Publication Date
CN115100329A CN115100329A (en) 2022-09-23
CN115100329B true CN115100329B (en) 2023-04-07

Family

ID=83295794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210744504.9A Active CN115100329B (en) 2022-06-27 2022-06-27 Multi-mode driving-based emotion controllable facial animation generation method

Country Status (1)

Country Link
CN (1) CN115100329B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631275B (en) * 2022-11-18 2023-03-31 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN116433807A (en) * 2023-04-21 2023-07-14 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116843798B (en) * 2023-07-03 2024-07-05 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1320497C (en) * 2002-07-03 2007-06-06 中国科学院计算技术研究所 Statistics and rule combination based phonetic driving human face carton method
US9613450B2 (en) * 2011-05-03 2017-04-04 Microsoft Technology Licensing, Llc Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
CN111783658B (en) * 2020-07-01 2023-08-25 河北工业大学 Two-stage expression animation generation method based on dual-generation reactance network
CN113158727A (en) * 2020-12-31 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on video and voice information
CN113408449B (en) * 2021-06-25 2022-12-06 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN114202604A (en) * 2021-11-30 2022-03-18 长城信息股份有限公司 Voice-driven target person video generation method and device and storage medium
CN114663539B (en) * 2022-03-09 2023-03-14 东南大学 2D face restoration technology under mask based on audio drive

Also Published As

Publication number Publication date
CN115100329A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN115100329B (en) Multi-mode driving-based emotion controllable facial animation generation method
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
US20220084273A1 (en) System and method for synthesizing photo-realistic video of a speech
Wang et al. Seeing what you said: Talking face generation guided by a lip reading expert
US20210027511A1 (en) Systems and Methods for Animation Generation
CN115116109B (en) Virtual character speaking video synthesizing method, device, equipment and storage medium
CN115588224B (en) Virtual digital person generation method and device based on face key point prediction
CN114202604A (en) Voice-driven target person video generation method and device and storage medium
WO2021023869A1 (en) Audio-driven speech animation using recurrent neutral network
CN115457169A (en) Voice-driven human face animation generation method and system
CN113470170A (en) Real-time video face region space-time consistent synthesis method using voice information
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
EP0710929A2 (en) Acoustic-assisted image processing
CN115393949A (en) Continuous sign language recognition method and device
Zhua et al. Audio-driven talking head video generation with diffusion model
CN116828129B (en) Ultra-clear 2D digital person generation method and system
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
Wen et al. 3D Face Processing: Modeling, Analysis and Synthesis
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
CN117171392A (en) Virtual anchor generation method and system based on nerve radiation field and hidden attribute
CN115937375A (en) Digital body-separating synthesis method, device, computer equipment and storage medium
CN114494930A (en) Training method and device for voice and image synchronism measurement model
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Pan et al. Research on face video generation algorithm based on speech content
Deena Visual speech synthesis by learning joint probabilistic models of audio and video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant