CN111724458B - Voice-driven three-dimensional face animation generation method and network structure - Google Patents

Voice-driven three-dimensional face animation generation method and network structure

Info

Publication number: CN111724458B
Application number: CN202010387250.0A
Authority: CN (China)
Prior art keywords: voice, constraint, driven, intermediate variable, encoder
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111724458A
Inventors: 李坤, 刘云珂, 刘景瑛, 惠彬原
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University
Priority application: CN202010387250.0A
Publication of application: CN111724458A
Application granted; publication of grant: CN111724458B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method comprising the following steps: 1) extracting voice features and embedding the identity information of the voice into a feature matrix; 2) mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable; 3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements with a decoder, while applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacements in 3D space; 4) driving a template with the acquired 3D displacements to simulate the facial animation. Compared with the prior art, the invention innovatively constrains the intermediate variable using features of the 3D geometry; by introducing a nonlinear geometric representation and two constraint conditions from different perspectives, the generated 3D facial expressions are more vivid and intuitive. In addition, the invention also provides a voice-driven three-dimensional face animation generation network structure.

Description

Voice-driven three-dimensional face animation generation method and network structure
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method and a network structure.
Background
Speech contains rich information; by simulating facial expressions and motions from speech, animations with a distinctive speaking style that match an individual's identity can be produced. Creating 3D facial animation that matches speech features has wide-ranging applications in movies, games, augmented reality and virtual reality. It is therefore very important to understand the correlation between speech and facial deformation.
Voice-driven 3D facial animation can be divided into speaker-dependent and speaker-independent approaches, according to whether generalization across subjects is supported. Speaker-dependent animation uses large amounts of data to learn a particular setting and can only generate animation for a fixed individual. Current speaker-dependent methods generally rely on high-quality motion capture data, generate video from the voice and footage of a fixed speaker, or generate facial animation in real time with an end-to-end network; being tied to a specific setting, these methods are inconvenient to apply more broadly. More recent research therefore focuses on speaker-independent animation, where the prior art mainly performs effective feature learning with neural networks. Examples include a nonlinear mapping from phoneme labels to mouth motion (Taylor et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 93:1-93:11 (2017)); estimating rotation and activation parameters of a 3D blendshape model with long short-term memory networks (Pham et al.: Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2328-2336 (2017)); further learning an acoustic feature representation with a network (Pham et al.: End-to-end learning for 3D facial animation from speech. In: ICMI '18 (2018)); animating cartoon characters with a three-stage network (Zhou et al.: VisemeNet: audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 161:1-161:10 (2018)); and, from a proposed multi-subject 4D face dataset, a generic voice-driven 3D face framework that works across a range of identities (Cudeiro et al.: Capture, learning, and synthesis of 3D speaking styles. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)). However, none of these approaches take into account the impact of the geometric representation on voice-driven 3D facial animation.
In view of this, it is necessary to propose a new voice-driven three-dimensional face animation generation method.
Disclosure of Invention
The invention aims, in view of the defects of the prior art, to provide a voice-driven three-dimensional facial animation generation method that realizes a speaker-independent, 3D-geometry-guided voice-driven facial animation network; by introducing a nonlinear geometric representation and two constraint conditions from different perspectives, the generated 3D facial expressions are made more vivid and intuitive.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) Extracting voice characteristics, and embedding the identity information of the voice into a characteristic matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacement in 3D space;
4) Driving the template according to the acquired displacement in 3D space to simulate the facial animation.
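For illustration, the following minimal Python/NumPy sketch traces the data flow of steps 1) to 4); the array shapes, the random stand-ins for the learned encoder and decoder, and the placeholder template geometry are assumptions made only for this example.

```python
# Illustrative end-to-end data flow for steps 1)-4); shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a window of speech features (e.g. 16 frames x 29 DeepSpeech features)
# plus a one-hot identity vector embedded (here: tiled and concatenated) into the feature matrix.
speech_feat = rng.standard_normal((16, 29))            # hypothetical speech feature window
identity = np.eye(8)[3]                                 # one-hot identity among 8 training subjects
feat_matrix = np.concatenate([speech_feat, np.tile(identity, (16, 1))], axis=1)

# Step 2: encoder maps the feature matrix to a low-dimensional intermediate variable r.
W_enc = rng.standard_normal((feat_matrix.size, 64)) * 0.01   # stand-in for the learned encoder
r = feat_matrix.reshape(-1) @ W_enc                          # intermediate variable of size 64

# Step 3: decoder maps r to per-vertex 3D displacements (N x 3); during training r is
# additionally constrained by a 3D geometric representation (Huber / HSIC terms).
N = 5023
W_dec = rng.standard_normal((64, N * 3)) * 0.01              # stand-in for the learned decoder
displacements = (r @ W_dec).reshape(N, 3)

# Step 4: drive a neutral template mesh by adding the predicted displacements.
template_vertices = rng.standard_normal((N, 3))              # placeholder template geometry
animated_vertices = template_vertices + displacements
```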
As an improvement of the voice-driven three-dimensional face animation generation method, in step 1), the DeepSpeech engine is adopted to extract the voice features.
As an improvement to the voice-driven three-dimensional face animation generation method, the encoder comprises four convolution layers, and the i-th convolution layer receives all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of 2×1 stride and a ReLU linear activation unit. The main purpose of the encoder in the present invention is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolution layers; unlike an ordinary convolution process, a denser connection pattern is used here, so that deep features and shallow features can be effectively combined.
As an improvement to the voice-driven three-dimensional facial animation generation method, a pooling layer is added after each convolution layer, and the number of feature maps is reduced through the pooling layer. In general, the number of feature maps doubles after each convolution layer; to keep the concatenation process smooth, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer can be effectively reused and the encoder learns richer features.
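For illustration, a sketch of this dense connection pattern is given below in TensorFlow/Keras (the framework mentioned later in the embodiments); the filter counts, the stride-1 convolutions (kept so all feature maps stay the same length for concatenation) and the channel-averaging realisation of the feature-map-reducing pooling are assumptions, not values fixed by this description.

```python
import tensorflow as tf

def dense_speech_encoder(window=16, feat_dim=29, latent_dim=64, filters=(32, 64, 128, 256)):
    """Sketch of the densely connected speech encoder: layer i consumes the
    concatenation [x_0, ..., x_{i-1}], i.e. x_i = H_i([x_0, ..., x_{i-1}])."""
    inp = tf.keras.Input(shape=(window, feat_dim))
    features = [inp]                                   # x_0 is the input feature matrix
    for f in filters:
        x = features[0] if len(features) == 1 else tf.keras.layers.Concatenate(axis=-1)(features)
        x = tf.keras.layers.Conv1D(f, kernel_size=3, strides=1,
                                   padding="same", activation="relu")(x)
        # "Pooling" that halves the number of feature maps after each convolution,
        # realised here by averaging channel pairs (an assumed realisation).
        x = tf.keras.layers.Reshape((window, f // 2, 2))(x)
        x = tf.keras.layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1))(x)
        features.append(x)
    x = tf.keras.layers.Concatenate(axis=-1)(features)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    r = tf.keras.layers.Dense(latent_dim, name="intermediate_variable")(x)
    return tf.keras.Model(inp, r)

encoder = dense_speech_encoder()   # maps a speech feature window to the intermediate variable
```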
As an improvement to the voice-driven three-dimensional face animation generation method of the present invention, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully connected layers. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs. The final output layer is a fully connected layer with linear activation that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices. The invention adds an attention mechanism so that the network learns important information with emphasis.
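For illustration, a minimal NumPy sketch of this attention block follows; the feature size C and the reduced inner dimension of the weights are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_block(x, W1, W2):
    """Attention between the two fully connected decoder layers:
    a = sigma(W2 delta(W1 x)), output = a (element-wise) x.
    Following the text, sigma is taken as ReLU and delta as sigmoid."""
    a = relu(W2 @ sigmoid(W1 @ x))   # one attention value per feature
    return a * x                      # element-wise re-weighting of the input features

# Hypothetical sizes: C features, reduction to C // 4 inside the block.
C = 128
rng = np.random.default_rng(0)
x = rng.standard_normal(C)
W1 = rng.standard_normal((C // 4, C)) * 0.1
W2 = rng.standard_normal((C, C // 4)) * 0.1
x_att = attention_block(x, W1, W2)
```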
As an improvement to the voice-driven three-dimensional face animation generation method described in the present invention, the weights of the output layer are initialized from 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0. PCA here denotes principal component analysis; this setting improves the stability of training the network model.
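For illustration, a sketch of this PCA-based initialisation follows; how the rows of the latent-to-output weight matrix beyond the 50 components are filled is an assumption (zeros here).

```python
import numpy as np

def pca_initialised_output_layer(train_displacements, latent_dim=64, n_components=50):
    """Output-layer initialisation sketch: weights from the first 50 PCA components
    of the training vertex displacements, biases set to 0.
    train_displacements: (num_samples, N*3) flattened per-frame displacements."""
    mean = train_displacements.mean(axis=0)
    centred = train_displacements - mean
    # SVD-based PCA: rows of vt are principal directions in the (N*3)-dim space.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]                       # (50, N*3)
    weights = np.zeros((latent_dim, train_displacements.shape[1]))
    weights[:n_components] = components                  # first 50 rows carry the PCA directions
    bias = np.zeros(train_displacements.shape[1])
    return weights, bias

# Usage with placeholder "training displacements" (random data for illustration only).
rng = np.random.default_rng(0)
demo = rng.standard_normal((200, 5023 * 3))
W_out, b_out = pca_initialised_output_layer(demo)
```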
As an improvement to the voice-driven three-dimensional face animation generation method of the present invention, the 3D graph geometry constraint on the intermediate variables is realized by providing the mesh corresponding to the frames of each audio clip; an automatic encoder can then be used to obtain the corresponding geometric representation, and this geometric representation is used to constrain the intermediate variables. The constraints include a Huber constraint and a Hilbert-Schmidt independence criterion constraint. The Huber constraint is expressed as follows: assume there are two vectors r and r̂; then

L_Huber(r, r̂) = (1/2)(r − r̂)², if |r − r̂| ≤ δ; and δ|r − r̂| − (1/2)δ², otherwise;

where δ is the Huber threshold. In the present invention, the encoder encodes the input face mesh into the intermediate variable r̂, and the decoder decodes it into a 3D mesh. The present invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh; by providing the mesh corresponding to the frames of each audio clip and using the automatic encoder, the corresponding geometric representation can be obtained, and this representation effectively constrains the intermediate variables so that the encoder output is closely related to the 3D geometric representation.
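For illustration, a minimal NumPy sketch of the Huber constraint follows; the threshold value delta = 1.0 is an assumption.

```python
import numpy as np

def huber_constraint(r, r_hat, delta=1.0):
    """Huber penalty between the intermediate variable r and the geometric
    representation r_hat (standard Huber form; the threshold is an assumption)."""
    diff = r - r_hat
    absd = np.abs(diff)
    quadratic = 0.5 * diff ** 2
    linear = delta * absd - 0.5 * delta ** 2
    return np.where(absd <= delta, quadratic, linear).mean()
```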
As an improvement to the voice-driven three-dimensional face animation generation method, the Hilbert-Schmidt independence criterion (HSIC) constraint is used to measure nonlinear and higher-order correlations, and can estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Assume there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. Define a map φ(r) that maps the intermediate variable r to a kernel space F, with the inner product given by k_R(r_i, r_j) = ⟨φ(r_i), φ(r_j)⟩. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(F, G, p_{RR̂}) = ||C_{RR̂}||²_HS;

where k_R and k_R̂ are kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator of R and R̂. Let Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be drawn from p_{RR̂}; the empirical form of HSIC is:

HSIC(Z) = (M − 1)⁻² tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H is the centering matrix that gives the feature space a mean of 0:

H = I − (1/M)·11ᵀ.
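For illustration, a minimal NumPy sketch of the empirical HSIC estimate follows; the Gaussian kernel and its bandwidth are assumptions, since the description does not fix the kernel functions.

```python
import numpy as np

def hsic(R, R_hat, sigma=1.0):
    """Empirical HSIC between a batch of intermediate variables R (M x d) and
    geometric representations R_hat (M x d'): HSIC(Z) = (M-1)^-2 tr(K1 H K2 H)."""
    def gram(X):
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T       # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))              # Gaussian kernel (assumed)
    M = R.shape[0]
    K1, K2 = gram(R), gram(R_hat)
    H = np.eye(M) - np.ones((M, M)) / M                       # centering matrix (mean 0 in feature space)
    return np.trace(K1 @ H @ K2 @ H) / (M - 1) ** 2

# Usage with a small random batch (illustration only).
rng = np.random.default_rng(0)
R = rng.standard_normal((8, 64))                 # batch of intermediate variables
R_hat = R + 0.1 * rng.standard_normal((8, 64))   # correlated geometric representations
print(hsic(R, R_hat))
```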
as an improvement of the voice-driven three-dimensional face animation generation method of the present invention, the loss functions of the steps 1) to 4) include reconstruction loss, constraint loss and speed loss, and the expressions are as follows:
L=L r1 L c2 L v
wherein lambda is 1 And lambda (lambda) 2 Is a positive number to balance loss terms, set lambda 1 Is 0.1 lambda 2 10.0, L r To reconstruct the loss, the distance between the true and predicted values is calculated:
Figure BDA00024842559300000612
constraint loss L c The 3D graph geometric intermediate variable is obtained through the grid, the Huber or Hilbert-Schmidt independence criterion is used for restraining the existing intermediate variable, and the speed loss is expressed as follows, so that the time stability is guaranteed:
Figure BDA00024842559300000613
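For illustration, a sketch of the combined objective follows, with the constraint term computed separately (e.g. by the Huber or HSIC sketches above); the squared-error forms of the reconstruction and velocity terms are assumptions consistent with the description.

```python
import numpy as np

def total_loss(y_true, y_pred, constraint_value, lambda1=0.1, lambda2=10.0):
    """Combined objective L = L_r + lambda1 * L_c + lambda2 * L_v with the
    weights quoted in the text. y_true, y_pred: (T, N, 3) displacement sequences;
    constraint_value: the Huber or HSIC constraint term L_c on the intermediate variables."""
    # Reconstruction loss: distance between true and predicted displacements.
    L_r = np.mean(np.sum((y_true - y_pred) ** 2, axis=-1))
    # Velocity loss: match frame-to-frame differences to encourage temporal stability.
    v_true = y_true[1:] - y_true[:-1]
    v_pred = y_pred[1:] - y_pred[:-1]
    L_v = np.mean(np.sum((v_true - v_pred) ** 2, axis=-1))
    return L_r + lambda1 * constraint_value + lambda2 * L_v
```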
the invention also provides a voice-driven three-dimensional facial animation generation network structure which comprises an encoder, a decoder and a constraint of3D graph geometry for intermediate variables.
The invention has the following beneficial effects: compared with traditional reconstruction methods, the method constrains the intermediate variable with features of the 3D geometry. In the encoder stage, densely connected convolutional layers are designed to strengthen feature propagation and encourage reuse of audio features; in the decoder stage, an attention mechanism enables the network to adaptively adjust key regions; for the intermediate variables, a geometry-guided training strategy with two constraints from different perspectives is proposed to achieve a more powerful animation effect. In addition, the three-dimensional face animation generation network has high accuracy, the generated animation is more accurate and reasonable, and the method generalizes well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a diagram of a network architecture model according to the present invention;
FIG. 3 is a schematic diagram comparing reconstruction results obtained on the VOCASET dataset with other methods according to an embodiment of the present invention; from top to bottom: the result reconstructed by Cudeiro et al., the error visualization of the method of Cudeiro et al., and the error visualization of the present invention.
Detailed Description
Certain terms are used throughout the description and claims to refer to particular components. Those of skill in the art will appreciate that a hardware manufacturer may refer to the same component by different names. The description and claims do not distinguish components by differences in name, but by differences in function. As used throughout the specification and claims, the word "comprise" is an open-ended term and should therefore be interpreted to mean "including, but not limited to". By "substantially" is meant that, within an acceptable error range, a person skilled in the art is able to solve the technical problem and substantially achieve the technical effect.
In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "front", "rear", "left", "right", "horizontal", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The present invention will be described in further detail below with reference to the drawings, but is not limited thereto.
Example 1
As shown in fig. 1 to 3, a voice-driven three-dimensional face animation generation method includes the following steps:
1) Extracting voice features using the DeepSpeech engine, converting the identity information of the voice into a one-hot vector, and embedding the one-hot vector into the feature matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacement in 3D space;
4) Driving the template according to the acquired displacement in 3D space to simulate the facial animation.
Preferably, the primary purpose of the encoder is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolution layers; each layer is connected in series, and its feature map is obtained by downsampling followed by the ReLU activation function. Unlike an ordinary convolution process, a denser connection pattern is adopted here, so that deep features and shallow features can be effectively combined. Specifically, the i-th convolution layer receives all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of 2×1 stride and a ReLU linear activation unit. In general, the number of feature maps doubles after each convolution layer; to keep the concatenation process smooth, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer can be effectively reused and the encoder learns richer features.
Preferably, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully connected layers so that the network learns important information with emphasis. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs. The final output layer is a fully connected layer with linear activation that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices. To make training more stable, in this embodiment, the weights of the output layer are initialized from 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0.
Preferably, the above encoder-decoder structure can be regarded as a cross-modal process, in which the intermediate variable r is referred to as the cross-modal representation; it represents a particular identity and the deformed geometry. The encoder encodes the input face mesh into the intermediate variable r̂, and the decoder decodes it into a 3D mesh. The present invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh. During training, by providing the mesh corresponding to the frames of each audio clip and using the encoder, the corresponding geometric representation can be obtained automatically, and this representation is used to constrain the cross-modal representation so that the encoder output is closely related to the 3D geometric representation. For this purpose, the Huber constraint and the Hilbert-Schmidt independence criterion constraint are employed in the present invention.
The Huber constraint is expressed as follows: assume there are two vectors r and r̂; then

L_Huber(r, r̂) = (1/2)(r − r̂)², if |r − r̂| ≤ δ; and δ|r − r̂| − (1/2)δ², otherwise;

where δ is the Huber threshold.
preferably, the Hilbert-Schmidt independence criterion constraint is used to measure non-linearities and higher-order correlations, enabling the estimation of dependencies between representations without explicitly estimating the joint distribution of random variables, assuming that there are two variables R= [ R ] 1 ,...,r i ,...,r M ]And
Figure BDA0002484255930000113
m is the batch size, defining a map phi (r) mapping the intermediate variable r to kernel space +.>
Figure BDA0002484255930000114
Also the inner product is denoted +.>
Figure BDA0002484255930000115
Hilbert-Schmidt independence criterion constraints are expressed as:
Figure BDA0002484255930000116
wherein k is R And
Figure BDA0002484255930000117
for kernel function +.>
Figure BDA0002484255930000118
And->
Figure BDA0002484255930000119
Is Hilbert space->
Figure BDA00024842559300001110
For R and->
Figure BDA00024842559300001111
Is to order
Figure BDA00024842559300001112
Is taken from->
Figure BDA00024842559300001113
The empirical derivation of HSIC is:
Figure BDA00024842559300001114
where tr denotes the trace of the square matrix, K 1 And K 2 Is k 1,ij =k 1 (r i ,r j ) And
Figure BDA00024842559300001115
is centered with a mean value of 0 in the feature space:
Figure BDA00024842559300001116
in deep, the present embodiment employs a window of w=16, a voice feature of d=29, and the size of the intermediate variable is set to 64. As previously described, the network is divided into encoder and decoder parts and the constraints on the intermediate variables are imposed on the 3D geometry. The encoder has a 4-layer convolution, a 3 x1 filter, a convolution of 2 x1 steps, and a linear activation unit ReLU. The number of feature maps is doubled after each convolution layer, and in order to make the concatenation process smooth, a 2×1 pooling layer is added after each convolution layer to reduce the number of feature maps. The first two fully connected layers of the decoder use the tanh activation function and the final output layer is the fully connected layer with linear activation function, which produces 5023 x 3 outputs corresponding to the three-dimensional displacement vectors of 5023 vertices. The weights of this layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the deviations are initialized by 0. Wherein the loss function includes a reconstruction loss, a constraint loss, and a speed loss, and the expression is:
L=L r1 L c2 L v
wherein lambda is 1 And lambda (lambda) 2 Is a positive number to balance loss terms, set lambda 1 Is 0.1 lambda 2 10.0, L r To reconstruct the loss, the distance between the true and predicted values is calculated:
Figure BDA0002484255930000121
constraint loss L c The 3D graph geometric intermediate variable is obtained through the grid, the Huber or Hilbert-Schmidt independence criterion is used for restraining the existing intermediate variable, and the speed loss is expressed as follows, so that the time stability is guaranteed:
Figure BDA0002484255930000122
it should be noted that the invention is realized based on Tensorflow, and runs on an Indellovely GTX1080Ti video card to train with an Adam optimizer with momentum of 0.9, and trains 50 stages with a fixed learning rate of 1 e-4. For efficient training and testing, the present embodiment divides 12 subjects into a training set, a validation set and a test set. In addition, the remaining objects are also divided into 2 validation sets, 2 test sets. The training set includes all sentences of the eight objects. For the validation set and the test set, 20 unique sentences are selected so that they are not shared with other objects. There is no overlap between training, validation and test set for an object or sentence.
Example 2
A voice-driven three-dimensional face animation generation network structure comprises an encoder, a decoder and a 3D graph geometry constraint on the intermediate variables, where the encoder and the decoder are those of Embodiment 1. The network is guided by the 3D geometry: given a segment of speech as input, low-dimensional speech features are obtained through the encoder and constrained by the 3D geometry; the decoder then produces the face displacements in 3D space, and the animation is generated by driving the template.
While the foregoing description illustrates and describes several preferred embodiments of the present invention, it should be understood that the invention is not limited to the forms disclosed herein. These embodiments should not be construed as excluding other embodiments; the invention is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept described herein, whether in light of the above teachings or the skill and knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (7)

1. A voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) Extracting voice characteristics, and embedding the identity information of the voice into a characteristic matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacement in 3D space; the decoder comprises two fully connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully connected layers; let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the real-valued weights of the attention block; the output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication; the attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs; the final output layer is a fully connected layer with a linear activation function that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices;
the constraint of3D graphics geometry on the intermediate variables is specifically to set a grid corresponding to the frames in each audio, and the encoder can be used to automatically obtain a corresponding geometric representation, and the geometric representation is used to constrain the intermediate variables, including Huber constraint and Hilbert-Schmidt independence criterion constraint, where Huber constraint is expressed as: assume that there are two vectors r and
Figure FDA0004239778600000017
then there is
Figure FDA0004239778600000021
Wherein,,
Figure FDA0004239778600000022
the Hilbert-Schmidt independence constraint is used to measure non-linearities and higher-order correlations, enabling the estimation of dependencies between representations without explicitly estimating the joint distribution of random variables, assuming twoThe variable r= [ R ] 1 ,...,r i ,...,r M ]And
Figure FDA0004239778600000023
m is the batch size, defining a map phi (r) mapping the intermediate variable r to kernel space +.>
Figure FDA0004239778600000024
Also the inner product is denoted +.>
Figure FDA0004239778600000025
Hilbert-Schmidt independence criterion constraints are expressed as:
Figure FDA0004239778600000026
wherein k is R And
Figure FDA0004239778600000027
for kernel function +.>
Figure FDA0004239778600000028
And->
Figure FDA0004239778600000029
Is Hilbert space->
Figure FDA00042397786000000210
For R and->
Figure FDA00042397786000000211
Is (are) desirable to be (are)>
Figure FDA00042397786000000212
For two parameters R and->
Figure FDA00042397786000000213
Is a combination of (2)Distribution of the combination, let->
Figure FDA00042397786000000214
Is taken from->
Figure FDA00042397786000000215
The empirical derivation of HSIC is:
Figure FDA00042397786000000216
the two HSIC formulas are gradually progressive in front and back, and a final deduction formula is adopted;
where tr denotes the trace of the square matrix, K 1 And K 2 Is k 1,ij =k 1 (r i ,r j ) And
Figure FDA00042397786000000217
is centered with a mean value of 0 in the feature space:
Figure FDA00042397786000000218
4) Driving the template according to the acquired displacement in 3D space to simulate the facial animation.
2. The voice-driven three-dimensional facial animation generation method of claim 1, wherein: the DeepSpeech engine is adopted to extract the voice features in step 1).
3. The voice-driven three-dimensional face animation generation method of claim 1, wherein the encoder comprises four convolution layers, and the i-th convolution layer receives all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of 2×1 stride and a ReLU linear activation unit.
4. The voice-driven three-dimensional face animation generation method according to claim 3, wherein: a pooling layer is added after each convolution layer, and the number of feature maps is reduced through the pooling layer.
5. The voice-driven three-dimensional facial animation generation method of claim 1, wherein: the weights of the output layer are initialized from 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0.
6. The voice-driven three-dimensional facial animation generation method according to claim 1, wherein the loss function used in steps 1) to 4) comprises a reconstruction loss, a constraint loss and a velocity loss, expressed as:

L = L_r + λ_1 L_c + λ_2 L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms; λ_1 is set to 0.1 and λ_2 to 10.0; L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements:

L_r = Σ_t ||y_t − ŷ_t||²;

the constraint loss L_c constrains the intermediate variable with the 3D graph geometric representation obtained from the mesh, using the Huber or Hilbert-Schmidt independence criterion; the velocity loss, which guarantees temporal stability, is expressed as:

L_v = Σ_t ||(y_t − y_{t−1}) − (ŷ_t − ŷ_{t−1})||².
7. A voice-driven three-dimensional face animation generation network structure, characterized by comprising an encoder, a decoder and a 3D graph geometry constraint on the intermediate variables, wherein the encoder is the encoder according to any one of claims 1 to 6 and the decoder is the decoder according to any one of claims 1 to 6.
CN202010387250.0A 2020-05-09 2020-05-09 Voice-driven three-dimensional face animation generation method and network structure Active CN111724458B (en)

Priority Applications (1)

Application Number: CN202010387250.0A
Priority Date: 2020-05-09
Filing Date: 2020-05-09
Title: Voice-driven three-dimensional face animation generation method and network structure

Applications Claiming Priority (1)

Application Number: CN202010387250.0A
Priority Date: 2020-05-09
Filing Date: 2020-05-09
Title: Voice-driven three-dimensional face animation generation method and network structure

Publications (2)

Publication Number Publication Date
CN111724458A CN111724458A (en) 2020-09-29
CN111724458B true CN111724458B (en) 2023-07-04

Family

ID=72564794

Family Applications (1)

Application Number: CN202010387250.0A (Active, CN111724458B)
Priority Date: 2020-05-09
Filing Date: 2020-05-09
Title: Voice-driven three-dimensional face animation generation method and network structure

Country Status (1)

Country Link
CN (1) CN111724458B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763532B (en) * 2021-04-19 2024-01-19 腾讯科技(深圳)有限公司 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object
CN113838174B (en) * 2021-11-25 2022-06-10 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN114332315B (en) * 2021-12-07 2022-11-08 北京百度网讯科技有限公司 3D video generation method, model training method and device
CN116385606A (en) * 2022-12-16 2023-07-04 浙江大学 Speech signal driven personalized three-dimensional face animation generation method and application thereof
CN116188649B (en) * 2023-04-27 2023-10-13 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network

Also Published As

Publication number Publication date
CN111724458A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111724458B (en) Voice-driven three-dimensional face animation generation method and network structure
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
Busso et al. Rigid head motion in expressive speech animation: Analysis and synthesis
CN112581569B (en) Adaptive emotion expression speaker facial animation generation method and electronic device
CN113256821B (en) Three-dimensional virtual image lip shape generation method and device and electronic equipment
Granström et al. Audiovisual representation of prosody in expressive speech communication
JP2023545642A (en) Target object movement driving method, device, equipment and computer program
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
EP3866117A1 (en) Voice signal-driven facial animation generation method
Liu et al. Geometry-guided dense perspective network for speech-driven facial animation
CN103258340B (en) Is rich in the manner of articulation of the three-dimensional visualization Mandarin Chinese pronunciation dictionary of emotional expression ability
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN104732590A (en) Sign language animation synthesis method
CN115330911A (en) Method and system for driving mimicry expression by using audio
Vasani et al. Generation of indian sign language by sentence processing and generative adversarial networks
CN113140023A (en) Text-to-image generation method and system based on space attention
Li et al. A survey of computer facial animation techniques
Ma et al. Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data
CN117115316A (en) Voice-driven three-dimensional face animation method based on multi-level voice features
Liu et al. Emotional facial expression transfer based on temporal restricted Boltzmann machines
Balayn et al. Data-driven development of virtual sign language communication agents
CN116309984A (en) Mouth shape animation generation method and system based on text driving
Chu et al. CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation
Huang et al. Visual speech emotion conversion using deep learning for 3D talking head
Li et al. A novel speech-driven lip-sync model with CNN and LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant