WO2021128173A1 - Speech-signal-driven facial animation generation method - Google Patents

Speech-signal-driven facial animation generation method Download PDF

Info

Publication number
WO2021128173A1
WO2021128173A1 (PCT/CN2019/128739)
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
time
frequency
frame
freq
Prior art date
Application number
PCT/CN2019/128739
Other languages
English (en)
French (fr)
Inventor
周昆
柴宇进
翁彦琳
王律迪
Original Assignee
浙江大学
杭州相芯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学, 杭州相芯科技有限公司 filed Critical 浙江大学
Priority to PCT/CN2019/128739 priority Critical patent/WO2021128173A1/zh
Priority to JP2021504541A priority patent/JP7299572B2/ja
Priority to EP19945413.3A priority patent/EP3866117A4/en
Priority to US17/214,936 priority patent/US11354841B2/en
Publication of WO2021128173A1 publication Critical patent/WO2021128173A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The present invention relates to the field of facial animation, and in particular to a method for generating facial animation driven by speech signals (referred to as speech animation for short).
  • Procedural speech animation technology (Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, pages 131-140. ACM, 2013.) (Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4):127, 2016.) automatically recognizes from speech
  • the phoneme sequence that reflects pronunciation (such as syllables in English or pinyin in Chinese), groups the phonemes into visemes according to the lip shapes humans make when speaking, creates an animation keyframe for each viseme, and connects the entire sequence through certain co-articulation rules to obtain the facial animation.
  • These technologies are usually limited by manually defined keyframes and co-articulation rules and cannot generate realistic speech animation; they are also limited by the accuracy of the phoneme recognition results.
  • Suwajanakorn et al. proposed a delayed unidirectional long short-term memory module (Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.), which uses a short delay to obtain look-ahead information that helps handle co-articulation, achieving high-quality speech animation in real time under a certain delay.
  • The limitation of this technology is that it requires a large amount of data and can only generate facial videos of a specific person.
  • Taylor et al. (Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):93, 2017.)
  • proposed a sliding-window technology that uses a Deep Neural Network (DNN) to map the phonemes within a window length to the Active Appearance Model (AAM) coefficients of the face;
  • Karras et al. further improved the sliding-window technology: the input is a window of Linear Predictive Coding (LPC) speech features, which passes through a two-stage convolutional neural network (a formant-analysis stage over the feature dimension and an articulation stage over the time dimension) and a two-layer fully connected network, and outputs the vertex positions of one frame of a three-dimensional face model.
  • In view of the deficiencies of the prior art, the purpose of the present invention is to provide a facial animation generation method driven by speech signals.
  • The present invention uses the mel spectrum to extract the frequency characteristics of the speech signal; deformation gradients, computed with reference to an expressionless, closed-mouth face model, are used to represent facial motion in the animation.
  • The present invention uses a three-stage deep neural network (corresponding to steps (2) to (4)) to map the mel-spectrum features of one window to one frame of deformation gradients; the deformation gradients can be used to drive any face model, and the output style can be explicitly controlled in the deep neural network by a one-hot vector.
  • A speech-signal-driven facial animation generation method includes the following steps:
  • (1) Extracting speech features: extract mel-spectrogram features from the speech within one window; the features form a three-dimensional tensor composed of the feature-map dimension, the frequency dimension, and the time dimension.
  • (2) Collecting frequency information: for the mel spectrum obtained in step (1), use a neural network along the frequency dimension to abstract and collect all frequency information, obtaining frequency abstract information.
  • (4) Decoding motion features: the time summary information obtained in step (3) is concatenated with a user-supplied one-hot vector controlling the style; after two similar neural network branches, the scaling/shearing coefficients and the rotation coefficients are output respectively, and the output coefficients of the two branches are combined to obtain the deformation gradients representing the facial motion.
  • The method for collecting frequency information in step (2) is designed according to the characteristics of the mel spectrum and can robustly abstract speech features; the method for summarizing temporal information in step (3) is designed according to the pronunciation principles of consonants and vowels and can efficiently learn natural human pronunciation patterns; step (4) is the first to use deformation gradients to represent facial motion in speech-driven facial animation, which describes local changes of facial motion more accurately.
  • This method reaches the current state of the art in speech-driven facial animation and is lightweight, robust, and real-time (under a certain delay).
  • the present invention can use voice signals to drive the generation of facial animations in applications such as VR virtual social interaction, virtual voice assistants and games.
  • Figure 1 is a schematic flow diagram of the method of the present invention
  • Figure 2 is a schematic diagram of the workflow of the memory unit described in sub-step (2.3) of step (2) of the method of the present invention;
  • Figure 3 is a schematic flow chart of step (3) of the method of the present invention;
  • Fig. 4 is an excerpt of the animation frame sequence from an implementation example of the present invention in which a speech signal drives a human face model to say the English word "smash";
  • Fig. 5 is an excerpt of the animation frame sequence from an implementation example of the present invention in which a speech signal drives a cartoon animal face model to say the English word "smash".
  • The core technology of the present invention uses frequency-dimension convolution and a bidirectional long short-term memory (LSTM) module to abstract speech features, a time-dimension bidirectional LSTM and attention module to summarize the temporal context information within a window, and deformation gradients to represent facial motion.
  • The method is divided into six main steps: extracting speech features, collecting frequency information, summarizing temporal information, decoding motion features, driving the face model, and finally sliding the signal window over a speech sequence and repeating the first five steps to obtain the complete animation sequence.
  • Extracting speech features: extract mel-spectrogram features from the speech within one window; the features form a three-dimensional tensor composed of the feature-map dimension, the frequency dimension, and the time dimension.
  • 1.2 Use the first- and second-order derivatives of the mel spectrum with respect to time as auxiliary features and stack them with the original features into a tensor of shape 3 × F_mel × L_frame, where 3 in the first dimension is the number of feature maps, F_mel in the second dimension is the length of the frequency dimension, and L_frame in the third dimension is the length of the time dimension.
  • Collecting frequency information: for the mel spectrum obtained in step (1), use a neural network along the frequency dimension to abstract and collect all frequency information, obtaining frequency abstract information.
  • 2.1 The two-dimensional convolutional network comprises, in order: a first 2D convolutional layer, a first 2D max-pooling layer, a second 2D convolutional layer, and a second 2D max-pooling layer. The two convolutional layers convolve the input with C_freq_conv0 and C_freq_conv1 kernels, respectively, oriented along the frequency dimension (each of size K_freq × 1, where K_freq is the size along the frequency dimension and 1 is the size along the time dimension), obtaining a number of local feature maps equal to the number of kernels. Both convolutional layers use Leaky ReLU (LReLU) with a negative slope of 0.2 as the activation function. The two max-pooling layers select the local maximum within a region along the frequency dimension (each of size S_freq × 1), completing the down-sampling; the resulting local frequency features form a tensor of shape C_freq_conv1 × (F_mel / S_freq²) × L_frame.
  • 2.2 The local frequency features obtained in step (2.1) are projected with C_freq_conv2 convolution kernels of size 1 × 1 (size 1 along both the frequency and time dimensions), using Leaky ReLU (LReLU) with a negative slope of 0.2 as the activation function; the output is a tensor of shape C_freq_conv2 × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_conv2 is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension.
  • 2.3 The LSTM unit has a state cell (used to store the historical information of the memory unit) and three gates. The input gate i_t acts on each frequency feature x_t (x denotes the input, the subscript t the t-th input step) and the previous output h_{t-1} of the memory unit (h denotes the output, the subscript t-1 the (t-1)-th input step, i.e., the previous step), indicating whether new frequency-feature information is allowed into the memory unit;
  • its value ranges from 0 to 1 (inclusive).
  • The forget gate f_t acts on the state cell of the memory unit and indicates whether to retain the historical frequency information S_{t-1} stored at the previous step (S denotes the state of the state cell, the subscript t-1 the previous step); its value ranges from 0 to 1 (inclusive): if the forget-gate value is 1 (the gate is open) the stored information is retained, and if it is 0 (the gate is closed) the stored information is reset to a zero vector.
  • The output gate o_t acts on the state cell of the memory unit and indicates whether the current state S_t of the memory unit (the subscript t the t-th input step) is used as output; its value ranges from 0 to 1 (inclusive): if it is 1 (the gate is open) the current state of the memory unit is used as the output, and if it is 0 (the gate is closed) a zero vector is output.
  • x_t is the current input;
  • h_{t-1} is the previous output of the memory unit;
  • i_t is the input-gate value;
  • W_i and b_i are the weight and bias parameters of the input gate;
  • f_t is the forget-gate value;
  • W_f and b_f are the weight and bias parameters of the forget gate;
  • o_t is the output-gate value;
  • W_o and b_o are the weight and bias parameters of the output gate;
  • the candidate state is the projection of the current input and the previous output, with its own weight and bias parameters;
  • S_{t-1} and S_t are the states of the memory unit's state cell at the previous and current step, respectively;
  • h_t is the output of the current memory unit.
  • The number of feature maps of the LSTM unit in each direction is C_freq_LSTM / 2, and the sum over the two directions is C_freq_LSTM, so the output of the two directional LSTM units in this step is a tensor of shape C_freq_LSTM × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_LSTM is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension.
  • The state cell of the LSTM unit, together with the three gates operating around it, allows the features of other frequencies to be fully considered when analyzing a given frequency feature, in accordance with the natural phenomenon that formants (resonance peaks) appear in human speech.
  • 2.4 The outputs of the forward and backward frequency-dimension LSTM units in step (2.3) are all concatenated into one vector, giving a tensor of shape (C_freq_LSTM · F_mel / S_freq²) × L_frame, where the first dimension is the number of feature maps and the second dimension L_frame is the length of the time dimension; a fully connected layer with C_freq feature maps is then used for projection, collecting all frequency information. The resulting frequency abstract information z_freq is a tensor of shape C_freq × L_frame, where the first dimension C_freq is the number of feature maps and the second dimension L_frame is the length of the time dimension. At this point, the frequency dimension has been completely collected and abstracted into the feature-map dimension.
  • 3.1 For the frequency abstract information obtained in step (2), two hidden layers are used to propagate temporal context information along the time dimension; in each hidden layer, one LSTM unit along each of the forward and backward directions of the time dimension processes every frame on the time dimension recurrently and propagates temporal information.
  • These LSTM units have the same structural principle as the LSTM unit described in step (2.3), but act along the time dimension; each has a state cell (used to store the historical information of the memory unit) and three gates. The input gate acts on the temporal features of each frame and the previous output of the memory unit, indicating whether new time-frame information is allowed into the state cell of the memory unit, with a value from 0 to 1.
  • The forget gate acts on the state cell of the memory unit and indicates whether to retain the historical temporal information stored at the previous step, with a value from 0 to 1 (inclusive): if the forget-gate value is 1 (the gate is open) the stored information is retained; if it is 0 (the gate is closed) the stored information is reset to a zero vector; for intermediate values the stored information is multiplied by the gate value and then retained.
  • The output gate acts on the state cell of the memory unit and indicates whether the current state of the memory unit is used as output, with a value from 0 to 1 (inclusive): if it is 1 (the gate is open) the current state is used as output; if it is 0 (the gate is closed) a zero vector is output; for intermediate values the current state is multiplied by the gate value and then output. The specific values of the three gates are obtained by concatenating and projecting the current input time frame (or the output of the previous hidden layer) with the previous output of the unit.
  • The number of feature maps of the LSTM unit in each direction is C_time / 2, and the sum over the two directions is C_time, so the temporal context information m_freq obtained in this step is a tensor of shape C_time × L_frame, where the first dimension C_time is the number of feature maps and the second dimension L_frame is the length of the time dimension.
  • 3.2 For the temporal context information obtained in step (3.1), a hidden layer is used to weigh the importance of each frame of information in the context and perform a weighted summarization. In this hidden layer, the central K_qry frames of the temporal context information m_freq are projected by C_att one-dimensional convolution kernels (also of size K_qry) to form the query term q_att (of shape C_att × 1, where C_att is the number of feature maps, equal to the number of kernels, and 1 is the time-dimension length).
  • The entire temporal context information m_freq is linearly projected to form the key term k_att (of shape C_att × L_frame, where C_att is the number of feature maps and L_frame is the length of the time dimension); the sum of the query term q_att and the key term k_att is passed through a tanh activation, a linear projection (from C_att feature maps to 1), and a softmax normalization to obtain the weight of each frame (of shape 1 × L_frame), and this weight is used to weight and summarize the temporal context information m_freq, yielding the time summary information z_att (of shape C_time, where C_time is the number of feature maps). Through the weights over the time dimension, this hidden layer imitates natural human pronunciation patterns; for example, a vowel spans a long time, whereas a consonant is pronounced instantaneously and is related to the transitional vowels before and after it.
  • (4) Decoding motion features: the time summary information obtained in step (3) is concatenated with a user-supplied one-hot vector controlling the style; after two similar neural network branches, the scaling/shearing (Scaling/Shearing) coefficients and the rotation (Rotation) coefficients are output respectively, and the output coefficients of the two branches are combined to obtain the deformation gradients (Deformation Gradients) representing the facial motion.
  • the present invention uses the deformation gradient to represent facial movements in speech-driven speech animation for the first time. Compared with the prior art, the present invention can more accurately describe the local changes of facial movements.
  • the present invention uses the method described in (Robert W Sumner and Jovan Popovic. Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG), 23(3): 399-405, 2004.) to calculate the deformation gradient of the face model.
  • The face model is composed of multiple triangular faces; v_i^1, v_i^2, v_i^3 and ṽ_i^1, ṽ_i^2, ṽ_i^3 respectively denote the three vertices of the i-th triangle in the face model and in the deformed face model.
  • A fourth vertex is calculated for each triangle according to the following formula: v_i^4 = v_i^1 + (v_i^2 - v_i^1) × (v_i^3 - v_i^1) / sqrt(|(v_i^2 - v_i^1) × (v_i^3 - v_i^1)|).
  • The deformation gradient of the i-th triangle is a transformation matrix T_i that satisfies T_i V_i = Ṽ_i,
  • where V_i and Ṽ_i are formed by stacking the three edge vectors of the reference and deformed triangles, respectively: V_i = [v_i^2 - v_i^1, v_i^3 - v_i^1, v_i^4 - v_i^1] and Ṽ_i = [ṽ_i^2 - ṽ_i^1, ṽ_i^3 - ṽ_i^1, ṽ_i^4 - ṽ_i^1]; therefore T_i = Ṽ_i V_i^{-1}.
  • The present invention further adopts the method described in (Qianyi Wu, Juyong Zhang, Yu-Kun Lai, Jianmin Zheng, and Jianfei Cai. Alive caricature from 2d to 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7336-7345, 2018.) to perform a polar decomposition of the transformation matrix, T_i = R_i S_i.
  • S_i represents the scaling/shearing transformation and is a symmetric matrix that can be represented by six parameters;
  • R_i represents the rotation transformation and can be represented by three parameters using Rodrigues' formula. Therefore, the deformation gradient of each triangle is represented by 9 parameters.
  • the deformation gradient in the present invention is defined on a template face model.
  • In each branch, z_dec is again concatenated with the C_speaker-dimensional one-hot control vector and projected by three fully connected layers; the first fully connected layer has C_dec1 feature maps and an LReLU activation with a negative slope of 0.2; the second has C_dec2 feature maps and a tanh activation; the third has no activation function (its number of feature maps is C_pca_s in the scaling/shearing branch and C_pca_r in the rotation branch).
  • At the end of each branch is a fixed linear fully connected layer, whose numbers of feature maps are 6N and 3N in the scaling/shearing and rotation branches, respectively;
  • its parameters are initialized by the principal component analysis basis and mean of the training data corresponding to that branch;
  • 97% of the energy is retained in the principal component analysis.
  • The numbers of retained bases are C_pca_s and C_pca_r (the same as the numbers of feature maps of the third fully connected layer in the respective branches).
  • The two branches decode the parameter s (of size 6N) representing scaling/shearing and the parameter r (of size 3N) representing rotation.
  • If the topology of the given face model differs from that of the template face model, the deformation gradients obtained in step (4) cannot be used directly and the triangle correspondence between the two models must be obtained first; if the topologies are the same, they can be used directly.
  • The present invention adopts the method described in (Robert W Sumner and Jovan Popovic. Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG), 23(3): 399-405, 2004.) to automatically solve the triangle correspondence between two face models of different topologies, given a number of vertex correspondences specified by the user.
  • The automatic solving method first finds a series of transformation matrices (including scaling/shearing and rotation transformations, but not translation) O_i, i ∈ {1, …, M} that deform the given face model to the state closest to the template face model. The following three energy terms E_S, E_I, E_C, and their sum E under constraints, are defined; minimizing E deforms the given face model to the target state:
  • E_S is the energy constraining the smoothness of the deformation;
  • M is the number of triangles in the given face model;
  • adj(i) is the set of triangles adjacent to the i-th triangle;
  • E_I is the energy constraining the amount of deformation;
  • I is the identity matrix;
  • E_C is the energy of the distances between the vertices of the two models after deformation;
  • n is the number of vertices in the given face model;
  • c_i is the position of the vertex in the template face model closest to the i-th vertex of the deformed given face model;
  • E is the sum of the first three energy terms, and ṽ_1, …, ṽ_n denote the positions of the n vertices of the given face model after deformation;
  • w_S, w_I, w_C are the weights corresponding to E_S, E_I, E_C, respectively, and the energy equation is subject to the m vertex correspondences given by the user.
  • A and AᵀA can be precomputed, and each model only needs this precomputation once.
  • Sliding the signal window: repeat steps (1) to (5) to process all speech signal windows and generate the complete facial animation.
  • A series of audio windows is acquired at intervals of 1/fps seconds, and steps (1) to (5) are repeated for each window to generate the complete animation.
  • The frame rate of the animation is fps frames per second.
  • The generation speed reaches real time, with a delay determined by the input audio window length L_audio described in step (1).
  • Loss function: the inventors use supervised learning to train the neural network parameters involved in steps (2) to (4).
  • The speech and animation data are organized into data pairs (x_t, y_t), where x_t denotes the speech signal window corresponding to the t-th frame and y_t the corresponding deformation-gradient parameters.
  • y_t can be further divided into a scaling/shearing part and a rotation part, and the outputs of step (4) are denoted correspondingly.
  • For both parts, the present invention uses similar energy terms as constraints; taking the scaling/shearing part as an example, the energy terms include one on the absolute values and one on the numerical time derivatives.
  • The final loss function is the weighted sum of the four energy terms, with the weights automatically and dynamically balanced using the technique of Karras et al. (Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):94, 2017.).
  • Training example The inventor implemented an example of the present invention on a computer equipped with an Intel Core i7-8700K central processing unit (3.70GHz) and NVIDIA GTX1080Ti graphics processing unit (11GB).
  • The database VOCASET (Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.) is used to train the model.
  • Example time consumption: the VOCASET face model is used as the template face model (consisting of 9976 triangles), and the model is trained on the VOCASET data for 50 iterations, which takes about 5 hours.
  • For the input speech signal, generating one frame of animation for each window (steps (1) to (5), directly driving the template face model in step (5)) takes about 10 milliseconds, reaching a real-time rate.
  • Animation excerpt The inventor implements an example of the present invention and uses voice signals to drive facial animation.
  • Speech animation generated with VOCASET's face model is shown as a sequence of selected frames in Figure 4 (the character in the figure is saying the English word "smash"); speech animation generated with a cartoon animal face model whose topology differs from the template face model is shown as a sequence of selected frames in Figure 5 (the cartoon animal in the figure is saying the English word "smash").

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A speech-signal-driven facial animation generation method, divided into six steps: extracting speech features, collecting frequency information, summarizing temporal information, decoding motion features, driving the face model, and sliding the signal window. Based on the input speech audio signal, it can drive any face model in real time under a certain delay and generate animation; it reaches the state of the art in speech animation and is lightweight and robust. It can be used to generate speech animation in different scenarios, such as VR virtual social interaction, virtual voice assistants, and games.

Description

Speech-signal-driven facial animation generation method
Technical Field
The present invention relates to the field of facial animation, and in particular to a method for generating facial animation driven by speech signals (referred to as speech animation for short).
Background
Procedural speech animation techniques (Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, pages 131-140. ACM, 2013.) (Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4):127, 2016.) automatically recognize from speech the phoneme sequence that reflects pronunciation (e.g., syllables in English, pinyin in Chinese), group the phonemes into visemes according to the lip shapes humans make when speaking, create an animation keyframe for each viseme, and connect the whole sequence through certain co-articulation rules to obtain the facial animation. These techniques are usually limited by manually defined keyframes and co-articulation rules and cannot generate realistic speech animation; they are also limited by the accuracy of the phoneme recognition results.
Sample-based speech animation techniques (Tony Ezzat, Gadi Geiger, and Tomaso Poggio. Trainable videorealistic speech animation, volume 21. ACM, 2002.) (Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 275-284. Eurographics Association, 2012.) also map phoneme sequences to animation, but to improve realism they no longer use manually defined rules and instead extract multiple animation clips directly from data samples and stitch them together. The quality of these techniques is usually limited by the number of samples, artifacts often appear at the clip joints, and they are likewise limited by the accuracy of the phoneme recognition results.
Wang et al. proposed a technique based on hidden Markov models (Lijuan Wang, Wei Han, Frank Soong, and Qiang Huo. Text-driven 3d photo-realistic talking head. In INTERSPEECH 2011. International Speech Communication Association, September 2011.), which extracts Mel-Frequency Cepstral Coefficients (MFCC) from the speech signal as speech features and uses the Principal Component Analysis (PCA) coefficients of facial landmarks in 2D images as animation features. The technique models the mapping between speech features and animation features with a hidden Markov model, mining the natural rules between the two kinds of features and improving data utilization compared with sample-based techniques.
In recent years, deep neural networks have further advanced speech animation. Fan et al. (Bo Fan, Lei Xie, Shan Yang, Lijuan Wang, and Frank K Soong. A deep bidirectional lstm approach for video-realistic talking head. Multimedia Tools and Applications, 75(9):5287-5309, 2016.) use a Bidirectional Long Short-Term Memory (BiLSTM) module to learn the speech-to-animation mapping from data, in particular natural co-articulation patterns; however, BiLSTM requires the entire speech segment as input and cannot generate in real time. On this basis, Suwajanakorn et al. proposed a delayed unidirectional long short-term memory module (Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.), which uses a short delay to obtain look-ahead information that helps handle co-articulation and can generate high-quality speech animation in real time under a certain delay. The limitation of this technique is that it requires a large amount of data and can only generate face videos of a specific person.
Taylor et al. (Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):93, 2017.) proposed a sliding-window technique that uses a Deep Neural Network (DNN) to map the phonemes within a window to Active Appearance Model (AAM) coefficients of the face; the phoneme window contains brief context information that the DNN can exploit to learn natural pronunciation patterns. Karras et al. (Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):94, 2017.) further improved the sliding-window technique: the input is a window of Linear Predictive Coding (LPC) speech features, which passes through a two-stage convolutional neural network (a formant-analysis stage over the feature dimension and an articulation stage over the time dimension) and a two-layer fully connected network, and outputs the vertex positions of one frame of a 3D face model. Both of these techniques generalize poorly, especially when the input speech differs greatly from the training speech. Cudeiro et al. (Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.) improved this further by using an existing speech recognition module to extract speech features, which improves generality; at the same time, however, the introduced speech recognition module is so large that animation generation becomes slow.
Summary of the Invention
The purpose of the present invention is to provide a speech-signal-driven facial animation generation method in view of the deficiencies of the prior art. The present invention uses the mel spectrum to extract the frequency characteristics of the speech signal; deformation gradients, computed with reference to an expressionless, closed-mouth face model, are used to represent facial motion in the animation. The present invention uses a three-stage deep neural network (corresponding to steps (2) to (4)) to map the mel-spectrum features of one window to one frame of deformation gradients; the deformation gradients can be used to drive any face model, and the output style can be explicitly controlled in the deep neural network by a one-hot vector.
The purpose of the present invention is achieved through the following technical solution: a speech-signal-driven facial animation generation method comprising the following steps:
(1) Extracting speech features: extract mel-spectrogram features from the speech within one window; the features form a three-dimensional tensor composed of the feature-map dimension, the frequency dimension, and the time dimension.
(2) Collecting frequency information: for the mel spectrum obtained in step (1), use a neural network along the frequency dimension to abstract and collect all frequency information, obtaining frequency abstract information.
(3) Summarizing temporal information: for the frequency abstract information obtained in step (2), use a neural network along the time dimension to determine the importance of each frame of information in the temporal context and summarize the frames according to their importance, obtaining time summary information.
(4) Decoding motion features: the time summary information obtained in step (3) is concatenated with a user-supplied one-hot vector controlling the style; after two similar neural network branches, the scaling/shearing coefficients and the rotation coefficients are output respectively, and the output coefficients of the two branches are combined into the deformation gradients representing the facial motion.
(5) Driving the face model: for any given face model (expressionless, closed-mouth), the deformation gradients obtained in step (4) are used to drive the face model to make the corresponding facial motion.
(6) Sliding the signal window: repeat steps (1) to (5) to process all speech signal windows and generate the complete facial animation.
The beneficial effects of the present invention are: the method for collecting frequency information in step (2) is designed according to the characteristics of the mel spectrum and can robustly abstract speech features; the method for summarizing temporal information in step (3) is designed according to the pronunciation principles of consonants and vowels and can efficiently learn natural human pronunciation patterns; step (4) is the first to use deformation gradients to represent facial motion in speech-driven facial animation, which describes local changes of facial motion more accurately. The method reaches the state of the art in speech-driven facial animation and is lightweight, robust, and real-time (under a certain delay). The present invention can use speech signals to drive facial animation generation in applications such as VR virtual social interaction, virtual voice assistants, and games.
Brief Description of the Drawings
Figure 1 is a schematic flow diagram of the method of the present invention;
Figure 2 is a schematic diagram of the workflow of the memory unit described in sub-step (2.3) of step (2) of the method of the present invention;
Figure 3 is a schematic flow diagram of step (3) of the method of the present invention;
Figure 4 is an excerpt of the animation frame sequence of an implementation example of the present invention in which a speech signal drives a human face model to say the English word "smash";
Figure 5 is an excerpt of the animation frame sequence of an implementation example of the present invention in which a speech signal drives a cartoon animal face model to say the English word "smash".
Detailed Description of the Embodiments
The core of the present invention uses frequency-dimension convolution and a bidirectional long short-term memory (LSTM) module to abstract speech features, a time-dimension bidirectional LSTM and an attention module to summarize the temporal context information within a window, and deformation gradients to represent facial motion. As shown in Figure 1, the method is divided into six main steps: extracting speech features, collecting frequency information, summarizing temporal information, decoding motion features, driving the face model, and finally sliding the signal window over a speech sequence and repeating the first five steps to obtain the complete animation sequence.
1. Extracting speech features: extract mel-spectrogram features from the speech within one window; the features form a three-dimensional tensor composed of the feature-map dimension, the frequency dimension, and the time dimension.
1.1 Apply the short-time Fourier transform (frame length L_fft, frame interval L_hop) to the speech signal of an input audio window of length L_audio; use F_mel mel filters to convert the Fourier-transform result to mel frequencies, obtaining a mel spectrum with L_frame frames.
1.2 Use the first- and second-order derivatives of the mel spectrum with respect to time as auxiliary features and stack them with the original features into a tensor of shape 3 × F_mel × L_frame, where 3 in the first dimension is the number of feature maps, F_mel in the second dimension is the length of the frequency dimension, and L_frame in the third dimension is the length of the time dimension.
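A minimal sketch of this feature extraction, assuming librosa and a 16 kHz input; whether the patent applies log compression is not stated, so the log below is an assumption, and the default hyper-parameter values are taken from the implementation example later in this description.

```python
import numpy as np
import librosa

def extract_window_features(wav, sr=16000, l_fft=0.064, l_hop=0.008, f_mel=128):
    """Mel spectrogram of one audio window plus its first/second temporal derivatives,
    stacked into a (3, F_mel, L_frame) tensor as in step (1)."""
    n_fft = int(l_fft * sr)   # 1024 samples at 16 kHz
    hop = int(l_hop * sr)     # 128 samples; a 0.568 s window then yields L_frame = 64
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=f_mel, center=False)
    mel = np.log(mel + 1e-6)                  # log compression (assumption)
    d1 = librosa.feature.delta(mel, order=1)  # first derivative w.r.t. time
    d2 = librosa.feature.delta(mel, order=2)  # second derivative w.r.t. time
    return np.stack([mel, d1, d2], axis=0)    # shape (3, F_mel, L_frame)
```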
2. Collecting frequency information: for the mel spectrum obtained in step (1), use a neural network along the frequency dimension to abstract and collect all frequency information, obtaining frequency abstract information.
2.1 For the mel spectrum obtained in step (1), use a two-dimensional convolutional network to extract local frequency features of the mel spectrum. The network comprises, in order: a first 2D convolutional layer, a first 2D max-pooling layer, a second 2D convolutional layer, and a second 2D max-pooling layer. The two convolutional layers convolve the input with C_freq_conv0 and C_freq_conv1 kernels, respectively, oriented along the frequency dimension (each of size K_freq × 1, where K_freq is the size along the frequency dimension and 1 is the size along the time dimension), obtaining a number of local feature maps equal to the number of kernels; both convolutional layers use leaky rectified linear units (Leaky ReLU, LReLU) with a negative slope of 0.2 as the activation function. The two max-pooling layers select the local maximum within a region along the frequency dimension (each of size S_freq × 1), completing the down-sampling. The resulting local frequency features form a tensor of shape C_freq_conv1 × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_conv1 is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension.
2.2 Project the local frequency features obtained in step (2.1) with C_freq_conv2 convolution kernels of size 1 × 1 (size 1 along both the frequency and time dimensions); use LReLU with a negative slope of 0.2 as the activation function. The output is a tensor of shape C_freq_conv2 × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_conv2 is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension.
2.3 For the projected local frequency features obtained in step (2.2), use one long short-term memory (LSTM) unit along each of the forward and backward directions of the frequency dimension to process every feature on the frequency dimension recurrently. As shown in Figure 2, the LSTM unit has a state cell (used to store the historical information of the memory unit) and three gates. The input gate i_t acts on each frequency feature x_t (x denotes the input, the subscript t the t-th input step) and the previous output h_{t-1} of the memory unit (h denotes the output, the subscript t-1 the previous step), and indicates whether new frequency-feature information is allowed into the state cell of the memory unit; its value is from 0 to 1 (inclusive): if the gate value is 1 (the gate is open) the new information is added, if it is 0 (the gate is closed) a zero vector is added, and for intermediate values the new information is multiplied by the gate value before being added. The forget gate f_t acts on the state cell of the memory unit and indicates whether to retain the historical frequency information S_{t-1} stored at the previous step (S denotes the state of the state cell, the subscript t-1 the previous step); its value is from 0 to 1 (inclusive): if the gate value is 1 the stored information is retained, if it is 0 the stored information is reset to a zero vector, and for intermediate values the stored information is multiplied by the gate value before being retained. The output gate o_t acts on the state cell of the memory unit and indicates whether the current state S_t of the memory unit (the subscript t the t-th input step) is used as output; its value is from 0 to 1 (inclusive): if it is 1 the current state is used as output, if it is 0 a zero vector is output, and for intermediate values the current state is multiplied by the gate value before being output. The specific values of the three gates are obtained by concatenating and projecting the current input x_t with the previous output h_{t-1} of the memory unit, as follows:
i_t = σ(W_i [x_t, h_{t-1}] + b_i)
f_t = σ(W_f [x_t, h_{t-1}] + b_f)
o_t = σ(W_o [x_t, h_{t-1}] + b_o)
S̃_t = tanh(W_c [x_t, h_{t-1}] + b_c)
S_t = f_t ⊙ S_{t-1} + i_t ⊙ S̃_t
h_t = o_t ⊙ tanh(S_t)
where x_t is the current input and h_{t-1} the previous output of the memory unit; i_t is the input-gate value and W_i, b_i are the weight and bias parameters of the input gate; f_t is the forget-gate value and W_f, b_f are the weight and bias parameters of the forget gate; o_t is the output-gate value and W_o, b_o are the weight and bias parameters of the output gate; S̃_t is the projection (candidate state) of the current input and the previous output, with weight and bias parameters W_c, b_c; S_{t-1} and S_t are the states of the memory unit's state cell at the previous and current step; and h_t is the current output of the memory unit.
The number of feature maps of the LSTM unit in each direction is C_freq_LSTM / 2, and the sum over the two directions is C_freq_LSTM, so the output of the two directional LSTM units in this step is a tensor of shape C_freq_LSTM × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_LSTM is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension.
The state cell of the LSTM unit, together with the three gates operating around it, allows the features of other frequencies to be fully considered when analyzing a given frequency feature, in accordance with the natural phenomenon that formants appear in human speech.
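For concreteness, a from-scratch sketch of one update of the memory unit described above, using the standard LSTM equations; the stacked parameter layout (W, b as lists of four matrices/vectors) is an illustrative assumption, and in practice a library LSTM would be used.

```python
import numpy as np

def lstm_step(x_t, h_prev, S_prev, W, b):
    """One step of the memory unit of (2.3). W and b hold the stacked parameters
    (W_i, W_f, W_o, W_c) and (b_i, b_f, b_o, b_c) acting on [x_t, h_{t-1}]."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W[0] @ z + b[0])      # input gate
    f_t = sigmoid(W[1] @ z + b[1])      # forget gate
    o_t = sigmoid(W[2] @ z + b[2])      # output gate
    cand = np.tanh(W[3] @ z + b[3])     # projection of current input and previous output
    S_t = f_t * S_prev + i_t * cand     # new state of the state cell
    h_t = o_t * np.tanh(S_t)            # output of the memory unit
    return h_t, S_t
```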
2.4 Concatenate all the outputs of the forward and backward frequency-dimension LSTM units from step (2.3) into one vector, obtaining a tensor of shape (C_freq_LSTM · F_mel / S_freq²) × L_frame, where the first dimension is the number of feature maps and the second dimension L_frame is the length of the time dimension; then project it with a fully connected layer with C_freq feature maps, collecting the information of all frequencies. The resulting frequency abstract information z_freq is a tensor of shape C_freq × L_frame, where the first dimension C_freq is the number of feature maps and the second dimension L_frame is the length of the time dimension. At this point the frequency dimension has been completely collected and abstracted into the feature-map dimension.
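A PyTorch sketch of the whole of step (2), composed as described above; the 'same' padding (consistent with the stated output shapes), the placement of activations, and the default channel counts (taken from the implementation example below) are assumptions.

```python
import torch
import torch.nn as nn

class FrequencyEncoder(nn.Module):
    """Sketch of step (2): convolutions and pooling along frequency, a 1x1 projection,
    a bidirectional LSTM over the frequency axis, and a fully connected collection layer."""
    def __init__(self, f_mel=128, c0=32, c1=64, c2=64, k_freq=3, s_freq=2,
                 c_lstm=64, c_freq=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, c0, kernel_size=(k_freq, 1), padding=(k_freq // 2, 0)),
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(kernel_size=(s_freq, 1)),
            nn.Conv2d(c0, c1, kernel_size=(k_freq, 1), padding=(k_freq // 2, 0)),
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(kernel_size=(s_freq, 1)),
            nn.Conv2d(c1, c2, kernel_size=1),        # 1x1 projection of step (2.2)
            nn.LeakyReLU(0.2),
        )
        self.freq_lstm = nn.LSTM(input_size=c2, hidden_size=c_lstm // 2,
                                 bidirectional=True, batch_first=True)
        f_reduced = f_mel // (s_freq * s_freq)       # 32 with the example values
        self.proj = nn.Linear(f_reduced * c_lstm, c_freq)

    def forward(self, x):                            # x: (B, 3, F_mel, L_frame)
        h = self.conv(x)                             # (B, C2, F', L)
        b, c, f, l = h.shape
        h = h.permute(0, 3, 2, 1).reshape(b * l, f, c)   # frequency as the sequence axis
        h, _ = self.freq_lstm(h)                     # (B*L, F', C_lstm)
        h = h.reshape(b, l, f * h.shape[-1])         # concatenate over frequency positions
        z = self.proj(h)                             # (B, L, C_freq)
        return z.permute(0, 2, 1)                    # z_freq: (B, C_freq, L_frame)
```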
3. Summarizing temporal information: for the frequency abstract information obtained in step (2), use a neural network along the time dimension to determine the importance of each frame of information in the temporal context and summarize the frames according to their importance, obtaining the time summary information; the detailed flow is shown in Figure 3.
3.1 For the frequency abstract information obtained in step (2), use two hidden layers to propagate temporal context information along the time dimension. In each hidden layer, one LSTM unit along each of the forward and backward directions of the time dimension processes every frame on the time dimension recurrently and propagates temporal information. These LSTM units have the same structural principle as the LSTM unit described in step (2.3) but act along the time dimension; each has a state cell (used to store the historical information of the memory unit) and three gates: the input gate acts on the temporal features of each frame and the previous output of the memory unit and indicates whether new time-frame information is allowed into the state cell of the memory unit, with a value from 0 to 1 (inclusive): if the gate value is 1 (the gate is open) the new information is added, if it is 0 (the gate is closed) a zero vector is added, and for intermediate values the new information is multiplied by the gate value before being added. The forget gate acts on the state cell of the memory unit and indicates whether to retain the historical temporal information stored at the previous step, with a value from 0 to 1 (inclusive): if the gate value is 1 the stored information is retained, if it is 0 it is reset to a zero vector, and for intermediate values it is multiplied by the gate value before being retained. The output gate acts on the state cell of the memory unit and indicates whether the current state of the memory unit is used as output, with a value from 0 to 1 (inclusive): if it is 1 the current state is used as output, if it is 0 a zero vector is output, and for intermediate values the current state is multiplied by the gate value before being output. The specific values of the three gates are obtained by concatenating and projecting the current input time frame (or the output of the previous hidden layer) with the previous output of the unit.
The number of feature maps of the LSTM unit in each direction is C_time / 2 and the sum over the two directions is C_time, so the temporal context information m_freq obtained in this step is a tensor of shape C_time × L_frame, where the first dimension C_time is the number of feature maps and the second dimension L_frame is the length of the time dimension.
3.2 For the temporal context information obtained in step (3.1), use a hidden layer to weigh the importance of each frame of information in the context and perform a weighted summarization. In this hidden layer, the central K_qry frames of the temporal context information m_freq are projected by C_att one-dimensional convolution kernels (also of size K_qry) to form the query term q_att (of shape C_att × 1, where C_att is the number of feature maps, equal to the number of kernels, and 1 is the time-dimension length); the entire temporal context information m_freq is linearly projected to form the key term k_att (of shape C_att × L_frame, where C_att is the number of feature maps and L_frame is the time-dimension length). The sum of the query term q_att and the key term k_att is passed through a tanh activation, a linear projection (from C_att feature maps to 1), and a softmax normalization to obtain the weight of each frame (of shape 1 × L_frame), and this weight is used to weight and summarize the temporal context information m_freq, yielding the time summary information z_att (of shape C_time, where C_time is the number of feature maps). Through the weights over the time dimension, this hidden layer imitates natural human pronunciation patterns; for example, a vowel spans a long time, whereas a consonant is pronounced instantaneously and is related to the transitional vowels before and after it.
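A companion sketch of step (3) in the same PyTorch style; whether the two hidden layers are realized as one stacked nn.LSTM (as here) or as two separate modules is an assumption, as is the handling of the central-frame slice.

```python
import torch
import torch.nn as nn

class TemporalSummarizer(nn.Module):
    """Sketch of step (3): two bidirectional LSTM layers over time, then the attention-style
    weighting of (3.2) that collapses the time axis into the summary vector z_att."""
    def __init__(self, c_freq=256, c_time=512, c_att=128, k_qry=3):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=c_freq, hidden_size=c_time // 2,
                              num_layers=2, bidirectional=True, batch_first=True)
        self.query = nn.Conv1d(c_time, c_att, kernel_size=k_qry)   # acts on central frames
        self.key = nn.Linear(c_time, c_att)
        self.score = nn.Linear(c_att, 1)
        self.k_qry = k_qry

    def forward(self, z_freq):                        # z_freq: (B, C_freq, L_frame)
        m, _ = self.bilstm(z_freq.permute(0, 2, 1))   # m_freq: (B, L, C_time)
        l = m.shape[1]
        mid = m[:, (l - self.k_qry) // 2:(l + self.k_qry) // 2, :]  # central K_qry frames
        q = self.query(mid.permute(0, 2, 1)).squeeze(-1)            # q_att: (B, C_att)
        k = self.key(m)                                             # k_att: (B, L, C_att)
        w = torch.softmax(self.score(torch.tanh(k + q.unsqueeze(1))), dim=1)  # (B, L, 1)
        return (w * m).sum(dim=1)                                   # z_att: (B, C_time)
```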
4. Decoding motion features: the time summary information obtained in step (3) is concatenated with a user-supplied one-hot vector controlling the style; after two similar neural network branches, the scaling/shearing coefficients and the rotation coefficients are output respectively, and the output coefficients of the two branches are combined into the deformation gradients representing the facial motion.
The present invention is the first to use deformation gradients to represent facial motion in speech-driven speech animation; compared with previous techniques, it can describe local changes of facial motion more accurately.
4.1 Deformation gradients
The present invention computes the deformation gradients of the face model in the manner described in (Robert W Sumner and Jovan Popovic. Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG), 23(3):399-405, 2004.). The face model consists of several triangular faces; let v_i^1, v_i^2, v_i^3 and ṽ_i^1, ṽ_i^2, ṽ_i^3 denote the three vertices of the i-th triangle in the face model and in the deformed face model, respectively. To handle deformation perpendicular to the triangle, a fourth vertex is computed for each triangle according to the following formula:
v_i^4 = v_i^1 + (v_i^2 - v_i^1) × (v_i^3 - v_i^1) / sqrt(|(v_i^2 - v_i^1) × (v_i^3 - v_i^1)|)
The deformation gradient of the i-th triangle is the transformation matrix T_i that satisfies the following formula:
T_i V_i = Ṽ_i
where V_i and Ṽ_i are formed by stacking the three edge vectors of the reference and deformed triangles, respectively:
V_i = [v_i^2 - v_i^1, v_i^3 - v_i^1, v_i^4 - v_i^1], Ṽ_i = [ṽ_i^2 - ṽ_i^1, ṽ_i^3 - ṽ_i^1, ṽ_i^4 - ṽ_i^1]
Therefore,
T_i = Ṽ_i V_i^{-1}
The present invention further adopts the method described in (Qianyi Wu, Juyong Zhang, Yu-Kun Lai, Jianmin Zheng, and Jianfei Cai. Alive caricature from 2d to 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7336-7345, 2018.) to perform a polar decomposition of the transformation matrix, T_i = R_i S_i, where S_i represents the scaling/shearing transformation and is a symmetric matrix that can be represented by 6 parameters, and R_i represents the rotation transformation and can be represented by 3 parameters using Rodrigues' formula. Therefore the deformation gradient of each triangle is represented by 9 parameters.
The deformation gradients in the present invention are defined on a template face model, which is expressionless, has a closed mouth, and consists of N triangles; the corresponding deformation gradients therefore contain 9N = 6N + 3N parameters.
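The per-triangle computation of section (4.1) can be sketched with NumPy/SciPy as follows; the 6-parameter packing of the symmetric factor (upper-triangular entries) is an assumption, since the ordering is not specified here.

```python
import numpy as np
from scipy.linalg import polar
from scipy.spatial.transform import Rotation

def deformation_gradient(tri_ref, tri_def):
    """Per-triangle deformation gradient T = V_tilde @ inv(V) (Sumner and Popovic 2004),
    followed by the polar decomposition T = R S used in step (4.1).
    tri_ref, tri_def: (3, 3) arrays holding the three vertices of one triangle."""
    def frame(v1, v2, v3):
        n = np.cross(v2 - v1, v3 - v1)
        v4 = v1 + n / np.sqrt(np.linalg.norm(n))       # auxiliary fourth vertex
        return np.column_stack([v2 - v1, v3 - v1, v4 - v1])
    V = frame(*tri_ref)
    V_tilde = frame(*tri_def)
    T = V_tilde @ np.linalg.inv(V)
    R, S = polar(T)                       # T = R S: R rotation, S symmetric (scaling/shear)
    s6 = S[np.triu_indices(3)]            # 6 parameters of the symmetric matrix
    r3 = Rotation.from_matrix(R).as_rotvec()   # 3 parameters (Rodrigues vector)
    return s6, r3
```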
4.2 Decoding motions
The C_speaker-dimensional one-hot vector that controls the speaker style is concatenated with the time summary information z_att obtained in step (3) and passed through a fully connected layer with C_dec0 feature maps and a Leaky ReLU (LReLU) activation with a negative slope of 0.2, yielding z_dec (of shape C_dec0). z_dec then passes through two structurally similar, parallel neural network branches that decode the scaling/shearing and rotation parameters, respectively.
In each branch, z_dec is again concatenated with the C_speaker-dimensional one-hot control vector and projected by three fully connected layers: the first has C_dec1 feature maps and an LReLU activation with a negative slope of 0.2; the second has C_dec2 feature maps and a tanh activation; the third has no activation function (its number of feature maps is C_pca_s in the scaling/shearing branch and C_pca_r in the rotation branch). The last layer of each branch is a fixed linear fully connected layer (with 6N and 3N feature maps in the scaling/shearing and rotation branches, respectively), whose parameters are initialized by the principal component analysis (PCA) basis and mean of the training data corresponding to that branch; 97% of the energy is retained in the PCA, and the numbers of retained bases in the scaling/shearing and rotation branches are C_pca_s and C_pca_r (equal to the number of feature maps of the third fully connected layer of the corresponding branch). The two branches decode the parameter s (of size 6N) representing scaling/shearing and the parameter r (of size 3N) representing rotation.
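A sketch of this decoder under the stated layer sizes; the random initialization of the final fixed layer stands in for the PCA basis and mean of the training data (which are not available here), and freezing it with requires_grad_(False) is one possible way to keep it fixed.

```python
import torch
import torch.nn as nn

class DeformationDecoder(nn.Module):
    """Sketch of step (4.2): a shared projection followed by two parallel branches whose
    final, fixed linear layers map PCA coefficients to 6N and 3N parameters."""
    def __init__(self, c_time=512, c_speaker=8, c_dec0=512, c_dec1=512, c_dec2=256,
                 c_pca_s=85, c_pca_r=180, n_tri=9976):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(c_time + c_speaker, c_dec0), nn.LeakyReLU(0.2))

        def branch(c_pca, out_dim):
            layers = nn.Sequential(
                nn.Linear(c_dec0 + c_speaker, c_dec1), nn.LeakyReLU(0.2),
                nn.Linear(c_dec1, c_dec2), nn.Tanh(),
                nn.Linear(c_dec2, c_pca),               # third layer, no activation
                nn.Linear(c_pca, out_dim))              # fixed PCA layer (placeholder init)
            layers[-1].requires_grad_(False)            # keep the PCA layer fixed
            return layers

        self.scale_shear = branch(c_pca_s, 6 * n_tri)
        self.rotation = branch(c_pca_r, 3 * n_tri)

    def forward(self, z_att, speaker_onehot):
        z_dec = self.shared(torch.cat([z_att, speaker_onehot], dim=-1))
        h = torch.cat([z_dec, speaker_onehot], dim=-1)
        return self.scale_shear(h), self.rotation(h)    # s: (B, 6N), r: (B, 3N)
```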
5. Driving the face model: for any given face model (expressionless, closed-mouth), the deformation gradients obtained in step (4) are used to drive the face model to make the corresponding facial motion.
5.1 Obtain the triangle correspondence between the given face model (consisting of M triangles) and the template face model (consisting of N triangles); this sub-step only needs to be performed once per given model:
If the topology of the given face model differs from that of the template face model, the deformation gradients obtained in step (4) cannot be used directly and the triangle correspondence between the two models must be obtained first; if the topologies are the same, they can be used directly.
The present invention adopts the method described in (Robert W Sumner and Jovan Popovic. Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG), 23(3):399-405, 2004.) to automatically solve the triangle correspondence between two face models of different topologies, given a number of vertex correspondences specified by the user.
The automatic solving method first finds a series of transformation matrices (including scaling/shearing and rotation transformations, but not translation) O_i, i ∈ {1, …, M} that deform the given face model to the state closest to the template face model. The following three energy terms E_S, E_I, E_C and their sum E under constraints are defined; minimizing E deforms the given face model to the target state:
E_S = Σ_{i=1..M} Σ_{j ∈ adj(i)} ||O_i - O_j||_F²
E_I = Σ_{i=1..M} ||O_i - I||_F²
E_C = Σ_{i=1..n} ||ṽ_i - c_i||²
E(ṽ_1, …, ṽ_n) = w_S E_S + w_I E_I + w_C E_C, subject to ṽ_{s_k} = m_k, k ∈ {1, …, m}
where E_S is the energy constraining the smoothness of the deformation, M is the number of triangles in the given face model, and adj(i) is the set of triangles adjacent to the i-th triangle; E_I is the energy constraining the amount of deformation and I is the identity matrix; E_C is the energy of the distances between the vertices of the two models after deformation, n is the number of vertices in the given face model, ṽ_i is the position of the i-th vertex of the deformed given face model, and c_i is the position of the vertex of the template face model closest to ṽ_i; E is the sum of the first three energy terms, ṽ_1, …, ṽ_n are the positions of the n vertices of the deformed given face model, and w_S, w_I, w_C are the weights corresponding to E_S, E_I, E_C; the energy equation is subject to the m vertex correspondences given by the user, where ṽ_{s_k} is the vertex position of the deformed given face model in the k-th vertex correspondence and m_k is the target position of that vertex.
Because minimizing the energy E requires finding c_i, that is, finding the nearest vertex in the template face model for every vertex of the deformed given face model, and because the nearest-vertex relations change as the vertex positions change during optimization, the nearest-vertex search and the minimization of E are iterated for several steps.
After the given face model has been deformed to the state closest to the template face model, the centroids of all triangles in the template face model and in the deformed given face model are computed. For each triangle of the deformed given face model, a valid corresponding triangle is sought in the template face model, requiring that the centroid distance be below a certain (manually adjusted) threshold and that the angle between the two normal vectors be less than 90°. Likewise, for each triangle of the template face model, a valid corresponding triangle is sought in the deformed given face model. All valid correspondences together form the triangle correspondence between the two models.
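A sketch of the final matching test of step (5.1), assuming the centroids and unit normals have already been computed as NumPy arrays; the brute-force search and the single distance threshold are simplifying assumptions (a spatial index would be preferable for large meshes).

```python
import numpy as np

def match_triangles(centroids_a, normals_a, centroids_b, normals_b, dist_thresh):
    """Triangle i of model A corresponds to triangle j of model B when their centroid
    distance is below the threshold and their normals differ by less than 90 degrees."""
    pairs = []
    for i, (ca, na) in enumerate(zip(centroids_a, normals_a)):
        d = np.linalg.norm(centroids_b - ca, axis=1)
        for j in np.flatnonzero(d < dist_thresh):
            if np.dot(na, normals_b[j]) > 0.0:     # angle between normals < 90 degrees
                pairs.append((i, j))
    return pairs
```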
5.2 Transfer the deformation gradients of the template face model to the given face model:
The scaling/shearing parameters s and rotation parameters r obtained in step (4) are restored into the set of transformation matrices {T_i}, i ∈ {1, …, N}, of all triangles of the template face model (where N is the number of transformation matrices of the template face model, equal to its number of triangles). According to the triangle correspondence obtained in step (5.1), the set of transformation matrices {T̃_j}, j ∈ {1, …, M′}, of the given face model is constructed (where M′ is the number of transformation matrices of the given face model). For a triangle k of the given face model: if it has no corresponding triangle in the template face model, the identity matrix is used as its transformation matrix; if it has one corresponding triangle, the transformation matrix of that triangle is used directly; if it has several corresponding triangles, k is duplicated so that each copy corresponds to one of them. Because some triangles have several corresponding triangles, the final number of transformation matrices satisfies M′ ≥ M.
5.3 Solve the vertex positions of the given face model from the transferred deformation gradients:
The vertex positions x̃ of the given face model under the transferred deformation gradients are obtained by minimizing the following energy:
min_{x̃} ||c - A x̃||²
where c is formed by stacking the transferred transformation matrices and A is a large sparse matrix that relates c to x̃. By setting the gradient of the energy to zero, x̃ can be solved from the following formula:
x̃ = (Aᵀ A)⁻¹ Aᵀ c
Since A depends only on the given face model, A and AᵀA can be precomputed, and each model only needs this precomputation once.
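The normal-equation solve of step (5.3) maps naturally onto a sparse factorization that is reused across frames; the sketch below assumes A has already been assembled as a SciPy sparse matrix following Sumner and Popovic, and the use of a sparse LU factorization (rather than Cholesky) is a convenience assumption.

```python
import scipy.sparse.linalg as spla

def precompute_solver(A):
    """A depends only on the given face model, so factorize A^T A once."""
    return spla.splu((A.T @ A).tocsc())

def solve_vertices(lu, A, c):
    """Per frame: x_tilde = (A^T A)^{-1} A^T c, i.e. solve the precomputed system."""
    return lu.solve(A.T @ c)

# usage sketch: lu = precompute_solver(A); x_tilde = solve_vertices(lu, A, c)
```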
6. Sliding the signal window: repeat steps (1) to (5) to process all speech signal windows and generate the complete facial animation.
Over the whole input speech signal, a series of audio windows is taken at intervals of 1/fps seconds, and steps (1) to (5) are repeated for each window to generate the complete animation; the frame rate of the animation is fps frames per second. The generation speed reaches real time, with a delay determined by the input audio window length L_audio described in step (1).
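A sketch of this sliding-window loop; `model` is a hypothetical wrapper around the networks of steps (2) to (4), and `extract_window_features` reuses the step (1) sketch above.

```python
def generate_animation(wav, sr, model, speaker_onehot, fps=60, l_audio=0.568):
    """Step (6): slide a window of length L_audio over the speech at 1/fps intervals
    and run steps (1)-(4) for each window, yielding one frame of parameters per window."""
    hop = sr / fps                        # one window per animation frame
    win = int(l_audio * sr)
    frames, t = [], 0.0
    while int(t) + win <= len(wav):
        chunk = wav[int(t):int(t) + win]
        feats = extract_window_features(chunk, sr=sr)    # step (1)
        frames.append(model(feats, speaker_onehot))      # steps (2)-(4), assumed interface
        t += hop
    return frames
```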
Implementation Examples
Loss function: the inventors train the neural network parameters involved in steps (2) to (4) with supervised learning. The speech and animation data are organized into data pairs (x_t, y_t), where x_t denotes the speech signal window corresponding to the t-th frame and y_t the corresponding deformation-gradient parameters. As described in step (4), y_t can be further divided into a scaling/shearing part and a rotation part, and the corresponding outputs of step (4) during training are denoted accordingly. For the two parts of the parameters, the present invention uses similar energy terms as constraints; taking the scaling/shearing part as an example, the energy terms include a term on the absolute values and a term on the numerical time derivatives, and the terms for the rotation part are defined analogously. The final loss function is the weighted sum of the four energy terms, with the weights automatically and dynamically balanced using the technique proposed by Karras et al. (Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):94, 2017.).
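A minimal sketch of such a loss for one branch; the plain squared errors, the tensor layout (consecutive frames along the first axis), and the fixed weights are assumptions, since the patent balances the four terms dynamically following Karras et al. 2017 rather than with constants.

```python
import torch

def branch_loss(y_true, y_pred, w_abs=1.0, w_vel=1.0):
    """One branch (scaling/shear or rotation): an absolute-value term plus a term on
    numerical time derivatives, as described above. y_*: (T, dim) tensors."""
    abs_term = torch.mean((y_pred - y_true) ** 2)
    vel_term = torch.mean(((y_pred[1:] - y_pred[:-1]) - (y_true[1:] - y_true[:-1])) ** 2)
    return w_abs * abs_term + w_vel * vel_term

# total loss = branch_loss(s_true, s_pred) + branch_loss(r_true, r_pred)
```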
Training example: the inventors implemented an example of the present invention on a computer equipped with an Intel Core i7-8700K CPU (3.70 GHz) and an NVIDIA GTX 1080Ti GPU (11 GB). The database VOCASET (Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.) was used to train the model.
Model parameters: in this implementation example, the parameters involved in steps (1) to (6) are as follows:
(1) Extracting speech features: audio window length L_audio = 0.568 s; short-time Fourier transform frame length L_fft = 0.064 s, frame interval L_hop = 0.008 s; number of mel filters F_mel = 128; resulting number of mel-spectrum frames L_frame = 64.
(2) Collecting frequency information: the numbers of convolution kernels (also the numbers of feature maps after convolution) are C_freq_conv0 = 32, C_freq_conv1 = 64, C_freq_conv2 = 64; the kernel size of the first two convolutional layers is K_freq = 3 and the pooling region size is S_freq = 2; the total number of feature maps of the two directional frequency LSTM units is C_freq_LSTM = 64 (i.e., 32 per direction); the number of feature maps of the fully connected projection is C_freq = 256.
(3) Summarizing temporal information: the total number of feature maps of the two directional time LSTM units is C_time = 512 (i.e., 256 per direction); in the attention module, K_qry = 3 and C_att = 128.
(4) Decoding motion features: the number of triangles of the template face model is N = 9976; the dimension of the speaker-style control vector is C_speaker = 8; the number of feature maps of the first fully connected layer is C_dec0 = 512; in each branch the first two fully connected layers have C_dec1 = 512 and C_dec2 = 256 feature maps; the number of retained PCA bases for the scaling/shearing parameter s (also the number of feature maps of the third fully connected layer of the scaling/shearing branch) is C_pca_s = 85, and for the rotation parameter r (also the number of feature maps of the third fully connected layer of the rotation branch) C_pca_r = 180.
(5) Driving the face model: M is determined by the given model; during the iterative optimization of the energy E in step (5.1), the first step uses w_S = 1.0, w_I = 0.001, w_C = 0, followed by four more iterations in which w_C is increased from 1 to 5000.
(6) Sliding the signal window and repeating steps (1) to (5): animation frame rate fps = 60.
Example time consumption: the VOCASET face model is used as the template face model (consisting of 9976 triangles), and the model is trained on the VOCASET data for 50 iterations, taking about 5 hours. For the input speech signal, generating one frame of animation per window (steps (1) to (5), directly driving the template face model in step (5)) takes about 10 milliseconds, reaching a real-time rate. For other given face models whose topology differs from the template face model, the triangle correspondence must be set up beforehand according to step (5.1); depending on the complexity of the model and the operator's proficiency, this takes about 15 to 40 minutes and only needs to be done once per model.
Animation excerpts: the inventors implemented an example of the present invention and drove facial animation with speech signals. Speech animation generated with the VOCASET face model is shown as a sequence of selected frames in Figure 4 (the character in the figure is saying the English word "smash"); speech animation generated with a cartoon animal face model whose topology differs from the template face model is shown as a sequence of selected frames in Figure 5 (the cartoon animal in the figure is saying the English word "smash").

Claims (6)

  1. A speech-signal-driven facial animation generation method, characterized by comprising the following steps:
    (1) Extracting speech features: extract mel-spectrum features from the speech within one window; the features form a three-dimensional tensor composed of the feature-map dimension, the frequency dimension, and the time dimension.
    (2) Collecting frequency information: for the mel spectrum obtained in step (1), use a neural network along the frequency dimension to abstract and collect all frequency information, obtaining frequency abstract information.
    (3) Summarizing temporal information: for the frequency abstract information obtained in step (2), use a neural network along the time dimension to determine the importance of each frame of information in the temporal context and summarize the frames according to their importance, obtaining time summary information.
    (4) Decoding motion features: the time summary information obtained in step (3) is concatenated with a user-supplied one-hot vector controlling the style; after two neural network branches, the scaling/shearing coefficients and the rotation coefficients are output respectively, and the output coefficients of the two branches are combined into the deformation gradients representing the facial motion.
    (5) Driving the face model: for any given expressionless, closed-mouth face model, the deformation gradients obtained in step (4) are used to drive the face model to make the corresponding facial motion.
    (6) Sliding the signal window: repeat steps (1) to (5) to process all speech signal windows and generate the complete facial animation.
  2. The speech-signal-driven facial animation generation method according to claim 1, characterized in that step (1) comprises the following sub-steps:
    (1.1) Apply the short-time Fourier transform to the speech signal of an input audio window of length L_audio, with frame length L_fft and frame interval L_hop; use F_mel mel filters to convert the Fourier-transform result to mel frequencies, obtaining a mel spectrum with L_frame frames.
    (1.2) Use the first- and second-order derivatives of the mel spectrum with respect to time as auxiliary features and stack them with the original features into a tensor of shape 3 × F_mel × L_frame, where 3 in the first dimension is the number of feature maps, F_mel in the second dimension is the length of the frequency dimension, and L_frame in the third dimension is the length of the time dimension.
  3. The speech-signal-driven facial animation generation method according to claim 1, characterized in that step (2) comprises the following sub-steps:
    (2.1) For the mel spectrum obtained in step (1), use a two-dimensional convolutional network to extract local frequency features of the mel spectrum. The network comprises, in order: a first 2D convolutional layer, a first 2D max-pooling layer, a second 2D convolutional layer, and a second 2D max-pooling layer. The two convolutional layers convolve the input with C_freq_conv0 and C_freq_conv1 kernels of size K_freq × 1 oriented along the frequency dimension, respectively, obtaining a number of local feature maps equal to the number of kernels, where K_freq is the size along the frequency dimension and 1 is the size along the time dimension; both convolutional layers use leaky rectified linear units with a negative slope of 0.2 as the activation function. The two max-pooling layers select the local maximum within regions of size S_freq × 1 along the frequency dimension, completing the down-sampling. The resulting local frequency features form a tensor of shape C_freq_conv1 × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_conv1 is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension;
    (2.2) Project the local frequency features obtained in step (2.1) with C_freq_conv2 convolution kernels of size 1 × 1; use leaky rectified linear units with a negative slope of 0.2 as the activation function; the output is a tensor of shape C_freq_conv2 × (F_mel / S_freq²) × L_frame, where the first dimension C_freq_conv2 is the number of feature maps, the second dimension F_mel / S_freq² is the length of the frequency dimension, and the third dimension L_frame is the length of the time dimension; the size 1 × 1 means that the size along both the frequency and time dimensions equals 1;
    (2.3) For the projected local frequency features obtained in step (2.2), use one long short-term memory unit along each of the forward and backward directions of the frequency dimension to process every feature on the frequency dimension recurrently;
    (2.4) Concatenate all the outputs of the forward and backward frequency-dimension long short-term memory units from step (2.3) into one vector, obtaining a tensor of shape (C_freq_LSTM · F_mel / S_freq²) × L_frame, where the first dimension (C_freq_LSTM · F_mel / S_freq²) is the number of feature maps and the second dimension L_frame is the length of the time dimension; then project it with a fully connected layer with C_freq feature maps, collecting the information of all frequencies. The resulting frequency abstract information z_freq is a tensor of shape C_freq × L_frame, where the first dimension C_freq is the number of feature maps and the second dimension L_frame is the length of the time dimension. At this point the frequency dimension has been completely collected and abstracted into the feature-map dimension.
  4. The speech-signal-driven facial animation generation method according to claim 1, characterized in that step (3) comprises the following sub-steps:
    (3.1) For the frequency abstract information obtained in step (2), use two hidden layers to propagate the temporal context information m_freq along the time dimension. In each hidden layer, one long short-term memory unit along each of the forward and backward directions of the time dimension processes every frame on the time dimension recurrently and propagates temporal information. The number of feature maps of the long short-term memory unit in each direction is C_time / 2 and the sum over the two directions is C_time; the temporal context information m_freq is a tensor of shape C_time × L_frame, where the first dimension C_time is the number of feature maps and the second dimension L_frame is the length of the time dimension;
    (3.2) For the temporal context information obtained in step (3.1), use a hidden layer to weigh the importance of each frame of information in the context and perform a weighted summarization. In this hidden layer, the central K_qry frames of the temporal context information m_freq are projected by C_att one-dimensional convolution kernels of size K_qry to form the query term q_att, and the entire temporal context information m_freq is linearly projected to form the key term k_att; the sum of the query term q_att and the key term k_att is passed through a tanh activation, a linear projection, and a softmax normalization to obtain the weight of each frame, and this weight is used to weight and summarize the temporal context information m_freq, yielding the time summary information z_att. The shape of the query term q_att is C_att × 1, where C_att is the number of feature maps, equal to the number of convolution kernels, and 1 is the time-dimension length; the shape of the key term k_att is C_att × L_frame, where C_att is the number of feature maps and L_frame is the time-dimension length; the linear projection maps the number of feature maps from C_att to 1, and the shape of the weight is 1 × L_frame; the shape of the time summary information z_att is C_time, where C_time is the number of feature maps.
  5. The speech-signal-driven facial animation generation method according to claim 1, characterized in that in step (4) deformation gradients are used to represent facial motion, and the deformation gradients are defined on a template face model that is expressionless, has a closed mouth, and consists of N triangles.
  6. The speech-signal-driven facial animation generation method according to claim 1, characterized in that step (5) comprises the following sub-steps:
    (5.1) Obtain the triangle correspondence between the given face model and the template face model, the given face model consisting of M triangles and the template face model consisting of N triangles;
    (5.2) Transfer the deformation gradients of the template face model to the given face model;
    (5.3) Solve the vertex positions of the given face model from the transferred deformation gradients.
PCT/CN2019/128739 2019-12-26 2019-12-26 Speech-signal-driven facial animation generation method WO2021128173A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2019/128739 WO2021128173A1 (zh) 2019-12-26 2019-12-26 Speech-signal-driven facial animation generation method
JP2021504541A JP7299572B2 (ja) 2019-12-26 2019-12-26 音声信号により駆動される顔アニメーションの生成方法
EP19945413.3A EP3866117A4 (en) 2019-12-26 2019-12-26 VOICE CONTROLLED FACE ANIMATION GENERATION PROCESS
US17/214,936 US11354841B2 (en) 2019-12-26 2021-03-29 Speech-driven facial animation generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/128739 WO2021128173A1 (zh) 2019-12-26 2019-12-26 Speech-signal-driven facial animation generation method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/214,936 Continuation US11354841B2 (en) 2019-12-26 2021-03-29 Speech-driven facial animation generation method

Publications (1)

Publication Number Publication Date
WO2021128173A1 true WO2021128173A1 (zh) 2021-07-01

Family

ID=76573630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/128739 WO2021128173A1 (zh) 2019-12-26 2019-12-26 Speech-signal-driven facial animation generation method

Country Status (4)

Country Link
US (1) US11354841B2 (zh)
EP (1) EP3866117A4 (zh)
JP (1) JP7299572B2 (zh)
WO (1) WO2021128173A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115222856A (zh) * 2022-05-20 2022-10-21 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic device
CN115883753A (zh) * 2022-11-04 2023-03-31 网易(杭州)网络有限公司 Video generation method and apparatus, computing device, and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11244668B2 (en) * 2020-05-29 2022-02-08 TCL Research America Inc. Device and method for generating speech animation
CN113781616B (zh) * 2021-11-08 2022-02-08 江苏原力数字科技股份有限公司 Neural-network-based facial animation rigging acceleration method
CN113822968B (zh) * 2021-11-24 2022-03-04 北京影创信息科技有限公司 Method, system, and storage medium for driving a virtual human in real time by speech
CN114155321B (zh) * 2021-11-26 2024-06-07 天津大学 Face animation generation method based on self-supervision and a mixture density network
US20230394732A1 (en) * 2022-06-06 2023-12-07 Samsung Electronics Co., Ltd. Creating images, meshes, and talking animations from mouth shape data
US20230410396A1 (en) * 2022-06-17 2023-12-21 Lemon Inc. Audio or visual input interacting with video creation
KR20240096011A (ko) * 2022-12-19 2024-06-26 씨제이올리브네트웍스 주식회사 Method and apparatus for generating a talking image video using artificial-intelligence-based landmarks

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279970A (zh) * 2013-05-10 2013-09-04 中国科学技术大学 Real-time speech-driven face animation method
CN107004287A (zh) * 2014-11-05 2017-08-01 英特尔公司 Avatar video apparatus and method
US20180203946A1 (en) * 2013-08-16 2018-07-19 Kabushiki Kaisha Toshiba Computer generated emulation of a subject
CN109448083A (zh) * 2018-09-29 2019-03-08 浙江大学 Method for generating face animation from a single image
CN109599113A (zh) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for processing information

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2162199A1 (en) * 1994-11-07 1996-05-08 Homer H. Chen Acoustic-assisted image processing
CN1860504A (zh) * 2003-09-30 2006-11-08 皇家飞利浦电子股份有限公司 System and method for audio-visual content synthesis
US8797328B2 (en) * 2010-07-23 2014-08-05 Mixamo, Inc. Automatic generation of 3D character animation from 3D meshes
US20140278403A1 (en) * 2013-03-14 2014-09-18 Toytalk, Inc. Systems and methods for interactive synthetic character dialogue
EP3335195A2 (en) * 2015-08-14 2018-06-20 Metail Limited Methods of generating personalized 3d head models or 3d body models
US10559111B2 (en) * 2016-06-23 2020-02-11 LoomAi, Inc. Systems and methods for generating computer ready animation models of a human head from captured data images
US10453476B1 (en) * 2016-07-21 2019-10-22 Oben, Inc. Split-model architecture for DNN-based small corpus voice conversion
US10249314B1 (en) * 2016-07-21 2019-04-02 Oben, Inc. Voice conversion system and method with variance and spectrum compensation
US11004461B2 (en) * 2017-09-01 2021-05-11 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
US11462209B2 (en) * 2018-05-18 2022-10-04 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks
US10755463B1 (en) * 2018-07-20 2020-08-25 Facebook Technologies, Llc Audio-based face tracking and lip syncing for natural facial animation and lip movement
US10593336B2 (en) * 2018-07-26 2020-03-17 Accenture Global Solutions Limited Machine learning for authenticating voice
WO2020072759A1 (en) * 2018-10-03 2020-04-09 Visteon Global Technologies, Inc. A voice assistant system for a vehicle cockpit system
US10846522B2 (en) * 2018-10-16 2020-11-24 Google Llc Speaking classification using audio-visual data
US11238885B2 (en) * 2018-10-29 2022-02-01 Microsoft Technology Licensing, Llc Computing system for expressive three-dimensional facial animation
US11114086B2 (en) * 2019-01-18 2021-09-07 Snap Inc. Text and audio-based real-time face reenactment
US11049308B2 (en) * 2019-03-21 2021-06-29 Electronic Arts Inc. Generating facial position data based on audio data
US10885693B1 (en) * 2019-06-21 2021-01-05 Facebook Technologies, Llc Animating avatars from headset cameras
US10970907B1 (en) * 2019-07-02 2021-04-06 Facebook Technologies, Llc System and method for applying an expression to an avatar
KR102181901B1 (ko) * 2019-07-25 2020-11-23 넷마블 주식회사 Animation generation method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279970A (zh) * 2013-05-10 2013-09-04 中国科学技术大学 Real-time speech-driven face animation method
US20180203946A1 (en) * 2013-08-16 2018-07-19 Kabushiki Kaisha Toshiba Computer generated emulation of a subject
CN107004287A (zh) * 2014-11-05 2017-08-01 英特尔公司 Avatar video apparatus and method
CN109448083A (zh) * 2018-09-29 2019-03-08 浙江大学 Method for generating face animation from a single image
CN109599113A (zh) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for processing information

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
BO FAN, LEI XIE, SHAN YANG, LIJUAN WANG, FRANK K SOONG: "A deep bidirectional lstm approach for video-realistic talking head", MULTIMEDIA TOOLS AND APPLICATIONS, vol. 75, no. 9, 2016, pages 5287 - 5309
DANIEL CUDEIRO, TIMO BOLKART, CASSIDY LAIDLAW, ANURAG RANJAN, MICHAEL BLACK: "Capture, learning, and synthesis of 3D speaking styles", COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2019, pages 10101 - 10111
LIJUAN WANG, WEI HAN, FRANK SOONG, QIANG HUO: "Text-driven 3d photo-realistic talking head", INTERSPEECH 2011, INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, September 2011
PIF EDWARDS, CHRIS LANDRETH, EUGENE FIUME, KARAN SINGH: "Jali: an animator-centric viseme model for expressive lip synchronization", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 35, no. 4, 2016, pages 127
QIANYI WU, JUYONG ZHANG, YU-KUN LAI, JIANMIN ZHENG, JIANFEI CAI: "Alive caricature from 2d to 3d", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018, pages 7336 - 7345, XP033473653, DOI: 10.1109/CVPR.2018.00766
ROBERT W SUMNER, JOVAN POPOVIC: "Deformation transfer for triangle meshes", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 23, no. 3, 2004, pages 399 - 405
SARAH L TAYLOR, MOSHE MAHLER, BARRY-JOHN THEOBALD, IAIN MATTHEWS: "Dynamic units of visual speech", PROCEEDINGS OF THE ACM SIGGRAPH/EUROGRAPHICS SYMPOSIUM ON COMPUTER ANIMATION, EUROGRAPHICS ASSOCIATION, 2012, pages 275 - 284
SARAH TAYLOR, TAEHWAN KIM, YISONG YUE, MOSHE MAHLER, JAMES KRAHE, ANASTASIO GARCIA RODRIGUEZ, JESSICA HODGINS, IAIN MATTHEWS: "A deep learning approach for generalized speech animation", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 36, no. 4, 2017, pages 93, XP058372867, DOI: 10.1145/3072959.3073699
See also references of EP3866117A4
SUPASORN SUWAJANAKORN, STEVEN M SEITZ, IRA KEMELMACHER-SHLIZERMAN: "Synthesizing obama: learning lip sync from audio", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 36, no. 4, 2017, pages 95
TERO KARRAS, TIMO AILA, SAMULI LAINE, ANTTI HERVA, JAAKKO LEHTINEN: "Audio-driven facial animation by joint end-to-end learning of pose and emotion", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 36, no. 4, 2017, pages 94, XP058372868, DOI: 10.1145/3072959.3073658
TONY EZZAT, GADI GEIGER, TOMASO POGGIO: "Trainable video-realistic speech animation", vol. 21, ACM, 2002
YUYU XU, ANDREW W FENG, STACY MARSELLA, ARI SHAPIRO: "A practical and configurable lip sync method for games", PROCEEDINGS OF MOTION ON GAMES, ACM, 2013, pages 131 - 140

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115222856A (zh) * 2022-05-20 2022-10-21 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic device
CN115222856B (zh) * 2022-05-20 2023-09-26 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic device
CN115883753A (zh) * 2022-11-04 2023-03-31 网易(杭州)网络有限公司 Video generation method and apparatus, computing device, and storage medium

Also Published As

Publication number Publication date
JP7299572B2 (ja) 2023-06-28
EP3866117A1 (en) 2021-08-18
EP3866117A4 (en) 2022-05-04
JP2022518989A (ja) 2022-03-18
US11354841B2 (en) 2022-06-07
US20210233299A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
WO2021128173A1 (zh) Speech-signal-driven facial animation generation method
CN111243065B (zh) Speech-signal-driven facial animation generation method
Datcu et al. Semantic audiovisual data fusion for automatic emotion recognition
Hong et al. Real-time speech-driven face animation with expressions using neural networks
Fan et al. Photo-real talking head with deep bidirectional LSTM
Fan et al. A deep bidirectional LSTM approach for video-realistic talking head
Chen Audiovisual speech processing
Pham et al. End-to-end learning for 3d facial animation from speech
CN103279970A (zh) Real-time speech-driven face animation method
Choi et al. Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system
CN113838174B (zh) Audio-driven face animation generation method, apparatus, device and medium
Burton et al. The speaker-independent lipreading play-off; a survey of lipreading machines
Deena et al. Visual speech synthesis using a variable-order switching shared Gaussian process dynamical model
WO2024124680A1 (zh) Speech-signal-driven personalized three-dimensional facial animation generation method and application thereof
Liu et al. Emotional facial expression transfer based on temporal restricted Boltzmann machines
Liu et al. Real-time speech-driven animation of expressive talking faces
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
Shih et al. Speech-driven talking face using embedded confusable system for real time mobile multimedia
Thangthai Computer lipreading via hybrid deep neural network hidden Markov models
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
Deena et al. Speech-driven facial animation using a shared Gaussian process latent variable model
Wu et al. Emotional communication robot based on 3d face model and ASR technology
Ra et al. Visual-to-speech conversion based on maximum likelihood estimation
CN117037255B (zh) Directed-graph-based 3D expression synthesis method
Edge et al. Model-based synthesis of visual speech movements from 3D video

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021504541

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2019945413

Country of ref document: EP

Effective date: 20210416

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945413

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE