CN111724458B - Voice-driven three-dimensional face animation generation method and network structure - Google Patents

Voice-driven three-dimensional face animation generation method and network structure

Info

Publication number: CN111724458B
Application number: CN202010387250.0A
Authority: CN (China)
Prior art keywords: voice, constraint, driven, intermediate variable, encoder
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111724458A
Inventors: 李坤, 刘云珂, 刘景瑛, 惠彬原
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University
Priority application: CN202010387250.0A
Publication of application: CN111724458A
Application granted; publication of grant: CN111724458B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method comprising the following steps: 1) extracting voice features and embedding the identity information of the voice into a feature matrix; 2) mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable; 3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements with a decoder, while applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacements in 3D space; 4) driving a template with the acquired 3D displacements to simulate the facial animation. Compared with the prior art, the invention innovatively constrains the intermediate variable using features of the 3D geometry; by introducing a nonlinear geometric representation and two constraint conditions from different perspectives, the generated 3D facial expressions are more vivid and intuitive. In addition, the invention also provides a voice-driven three-dimensional face animation generation network structure.

Description

Voice-driven three-dimensional face animation generation method and network structure
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method and a network structure.
Background
Speech contains rich information; by simulating facial expressions and motions from speech, animations with a distinctive speaking style that match an individual's identity can be produced. Creating 3D facial animation that matches speech features has wide-ranging applications in movies, games, augmented reality and virtual reality. It is therefore very important to understand the correlation between speech and facial deformation.
Voice-driven 3D facial animation can be divided into speaker-dependent and speaker-independent approaches, according to whether generalization across subjects is supported. Speaker-dependent animation uses large amounts of data to learn a particular setting and can only generate animation for a fixed individual. Current speaker-dependent methods generally rely on high-quality motion capture data, generate video from the voice and footage of a fixed speaker, or generate facial animation in real time with an end-to-end network; being tied to a specific setting, these methods are inconvenient to apply more broadly. More recent research therefore focuses on speaker-independent animation, where the prior art mainly performs effective feature learning with neural networks. Examples include a nonlinear mapping from phoneme labels to mouth motion (Taylor et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 93:1-93:11 (2017)); estimating rotation and activation parameters of a 3D blendshape model with long short-term memory networks (Pham et al.: Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2328-2336 (2017)); further learning an acoustic feature representation with a network (Pham et al.: End-to-end learning for 3D facial animation from speech. In: ICMI '18 (2018)); animating cartoon characters with a three-stage network (Zhou et al.: VisemeNet: audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 161:1-161:10 (2018)); and, from a proposed multi-subject 4D face dataset, a generic voice-driven 3D face framework that works across a range of identities (Cudeiro et al.: Capture, learning, and synthesis of 3D speaking styles. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)). However, none of these approaches take into account the impact of the geometric representation on voice-driven 3D facial animation.
In view of this, it is necessary to propose a new voice-driven three-dimensional face animation generation method.
Disclosure of Invention
The invention aims, in view of the defects of the prior art, to provide a voice-driven three-dimensional facial animation generation method that realizes a speaker-independent, 3D-geometry-guided voice-driven facial animation network; by introducing a nonlinear geometric representation and two constraint conditions from different perspectives, the generated 3D facial expressions are made more vivid and intuitive.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) Extracting voice characteristics, and embedding the identity information of the voice into a characteristic matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacement in 3D space;
4) Driving the template according to the acquired displacement in 3D space to simulate the facial animation.
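For illustration, the following minimal Python/NumPy sketch traces the data flow of steps 1) to 4); the array shapes, the random stand-ins for the learned encoder and decoder, and the placeholder template geometry are assumptions made only for this example.

```python
# Illustrative end-to-end data flow for steps 1)-4); shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a window of speech features (e.g. 16 frames x 29 DeepSpeech features)
# plus a one-hot identity vector embedded (here: tiled and concatenated) into the feature matrix.
speech_feat = rng.standard_normal((16, 29))            # hypothetical speech feature window
identity = np.eye(8)[3]                                 # one-hot identity among 8 training subjects
feat_matrix = np.concatenate([speech_feat, np.tile(identity, (16, 1))], axis=1)

# Step 2: encoder maps the feature matrix to a low-dimensional intermediate variable r.
W_enc = rng.standard_normal((feat_matrix.size, 64)) * 0.01   # stand-in for the learned encoder
r = feat_matrix.reshape(-1) @ W_enc                          # intermediate variable of size 64

# Step 3: decoder maps r to per-vertex 3D displacements (N x 3); during training r is
# additionally constrained by a 3D geometric representation (Huber / HSIC terms).
N = 5023
W_dec = rng.standard_normal((64, N * 3)) * 0.01              # stand-in for the learned decoder
displacements = (r @ W_dec).reshape(N, 3)

# Step 4: drive a neutral template mesh by adding the predicted displacements.
template_vertices = rng.standard_normal((N, 3))              # placeholder template geometry
animated_vertices = template_vertices + displacements
```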
As an improvement of the voice-driven three-dimensional face animation generation method, in step 1), the DeepSpeech engine is adopted to extract the voice features.
As an improvement to the voice-driven three-dimensional face animation generation method, the encoder comprises four convolution layers, and the i-th convolution layer receives all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of 2×1 stride and a ReLU linear activation unit. The main purpose of the encoder in the present invention is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolution layers; unlike an ordinary convolution process, a denser connection pattern is used here, so that deep features and shallow features can be effectively combined.
As an improvement to the voice-driven three-dimensional facial animation generation method, a pooling layer is added after each convolution layer, and the number of feature maps is reduced through the pooling layer. In general, the number of feature maps doubles after each convolution layer; to keep the concatenation process smooth, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer can be effectively reused and the encoder learns richer features.
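For illustration, a sketch of this dense connection pattern is given below in TensorFlow/Keras (the framework mentioned later in the embodiments); the filter counts, the stride-1 convolutions (kept so all feature maps stay the same length for concatenation) and the channel-averaging realisation of the feature-map-reducing pooling are assumptions, not values fixed by this description.

```python
import tensorflow as tf

def dense_speech_encoder(window=16, feat_dim=29, latent_dim=64, filters=(32, 64, 128, 256)):
    """Sketch of the densely connected speech encoder: layer i consumes the
    concatenation [x_0, ..., x_{i-1}], i.e. x_i = H_i([x_0, ..., x_{i-1}])."""
    inp = tf.keras.Input(shape=(window, feat_dim))
    features = [inp]                                   # x_0 is the input feature matrix
    for f in filters:
        x = features[0] if len(features) == 1 else tf.keras.layers.Concatenate(axis=-1)(features)
        x = tf.keras.layers.Conv1D(f, kernel_size=3, strides=1,
                                   padding="same", activation="relu")(x)
        # "Pooling" that halves the number of feature maps after each convolution,
        # realised here by averaging channel pairs (an assumed realisation).
        x = tf.keras.layers.Reshape((window, f // 2, 2))(x)
        x = tf.keras.layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1))(x)
        features.append(x)
    x = tf.keras.layers.Concatenate(axis=-1)(features)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    r = tf.keras.layers.Dense(latent_dim, name="intermediate_variable")(x)
    return tf.keras.Model(inp, r)

encoder = dense_speech_encoder()   # maps a speech feature window to the intermediate variable
```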
As an improvement to the voice-driven three-dimensional face animation generation method of the present invention, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully connected layers. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs. The final output layer is a fully connected layer with linear activation that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices. The invention adds an attention mechanism so that the network learns important information with emphasis.
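For illustration, a minimal NumPy sketch of this attention block follows; the feature size C and the reduced inner dimension of the weights are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_block(x, W1, W2):
    """Attention between the two fully connected decoder layers:
    a = sigma(W2 delta(W1 x)), output = a (element-wise) x.
    Following the text, sigma is taken as ReLU and delta as sigmoid."""
    a = relu(W2 @ sigmoid(W1 @ x))   # one attention value per feature
    return a * x                      # element-wise re-weighting of the input features

# Hypothetical sizes: C features, reduction to C // 4 inside the block.
C = 128
rng = np.random.default_rng(0)
x = rng.standard_normal(C)
W1 = rng.standard_normal((C // 4, C)) * 0.1
W2 = rng.standard_normal((C, C // 4)) * 0.1
x_att = attention_block(x, W1, W2)
```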
As an improvement to the voice-driven three-dimensional face animation generation method described in the present invention, the weights of the output layer are initialized from 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0. PCA here denotes principal component analysis; this setting improves the stability of training the network model.
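For illustration, a sketch of this PCA-based initialisation follows; how the rows of the latent-to-output weight matrix beyond the 50 components are filled is an assumption (zeros here).

```python
import numpy as np

def pca_initialised_output_layer(train_displacements, latent_dim=64, n_components=50):
    """Output-layer initialisation sketch: weights from the first 50 PCA components
    of the training vertex displacements, biases set to 0.
    train_displacements: (num_samples, N*3) flattened per-frame displacements."""
    mean = train_displacements.mean(axis=0)
    centred = train_displacements - mean
    # SVD-based PCA: rows of vt are principal directions in the (N*3)-dim space.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]                       # (50, N*3)
    weights = np.zeros((latent_dim, train_displacements.shape[1]))
    weights[:n_components] = components                  # first 50 rows carry the PCA directions
    bias = np.zeros(train_displacements.shape[1])
    return weights, bias

# Usage with placeholder "training displacements" (random data for illustration only).
rng = np.random.default_rng(0)
demo = rng.standard_normal((200, 5023 * 3))
W_out, b_out = pca_initialised_output_layer(demo)
```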
As an improvement to the voice-driven three-dimensional face animation generation method of the present invention, the 3D graph geometry constraint on the intermediate variables is realized by providing the mesh corresponding to the frames of each audio clip; an automatic encoder can then be used to obtain the corresponding geometric representation, and this geometric representation is used to constrain the intermediate variables. The constraints include a Huber constraint and a Hilbert-Schmidt independence criterion constraint. The Huber constraint is expressed as follows: assume there are two vectors r and r̂; then

L_Huber(r, r̂) = (1/2)(r − r̂)², if |r − r̂| ≤ δ; and δ|r − r̂| − (1/2)δ², otherwise;

where δ is the Huber threshold. In the present invention, the encoder encodes the input face mesh into the intermediate variable r̂, and the decoder decodes it into a 3D mesh. The present invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh; by providing the mesh corresponding to the frames of each audio clip and using the automatic encoder, the corresponding geometric representation can be obtained, and this representation effectively constrains the intermediate variables so that the encoder output is closely related to the 3D geometric representation.
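For illustration, a minimal NumPy sketch of the Huber constraint follows; the threshold value delta = 1.0 is an assumption.

```python
import numpy as np

def huber_constraint(r, r_hat, delta=1.0):
    """Huber penalty between the intermediate variable r and the geometric
    representation r_hat (standard Huber form; the threshold is an assumption)."""
    diff = r - r_hat
    absd = np.abs(diff)
    quadratic = 0.5 * diff ** 2
    linear = delta * absd - 0.5 * delta ** 2
    return np.where(absd <= delta, quadratic, linear).mean()
```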
As an improvement to the voice-driven three-dimensional face animation generation method, the Hilbert-Schmidt independence criterion (HSIC) constraint is used to measure nonlinear and higher-order correlations, and can estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Assume there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. Define a map φ(r) that maps the intermediate variable r to a kernel space F, with the inner product given by k_R(r_i, r_j) = ⟨φ(r_i), φ(r_j)⟩. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(F, G, p_{RR̂}) = ||C_{RR̂}||²_HS;

where k_R and k_R̂ are kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator of R and R̂. Let Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be drawn from p_{RR̂}; the empirical form of HSIC is:

HSIC(Z) = (M − 1)⁻² tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H is the centering matrix that gives the feature space a mean of 0:

H = I − (1/M)·11ᵀ.
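For illustration, a minimal NumPy sketch of the empirical HSIC estimate follows; the Gaussian kernel and its bandwidth are assumptions, since the description does not fix the kernel functions.

```python
import numpy as np

def hsic(R, R_hat, sigma=1.0):
    """Empirical HSIC between a batch of intermediate variables R (M x d) and
    geometric representations R_hat (M x d'): HSIC(Z) = (M-1)^-2 tr(K1 H K2 H)."""
    def gram(X):
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T       # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))              # Gaussian kernel (assumed)
    M = R.shape[0]
    K1, K2 = gram(R), gram(R_hat)
    H = np.eye(M) - np.ones((M, M)) / M                       # centering matrix (mean 0 in feature space)
    return np.trace(K1 @ H @ K2 @ H) / (M - 1) ** 2

# Usage with a small random batch (illustration only).
rng = np.random.default_rng(0)
R = rng.standard_normal((8, 64))                 # batch of intermediate variables
R_hat = R + 0.1 * rng.standard_normal((8, 64))   # correlated geometric representations
print(hsic(R, R_hat))
```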
as an improvement of the voice-driven three-dimensional face animation generation method of the present invention, the loss functions of the steps 1) to 4) include reconstruction loss, constraint loss and speed loss, and the expressions are as follows:
L=L r1 L c2 L v
wherein lambda is 1 And lambda (lambda) 2 Is a positive number to balance loss terms, set lambda 1 Is 0.1 lambda 2 10.0, L r To reconstruct the loss, the distance between the true and predicted values is calculated:
Figure BDA00024842559300000612
constraint loss L c The 3D graph geometric intermediate variable is obtained through the grid, the Huber or Hilbert-Schmidt independence criterion is used for restraining the existing intermediate variable, and the speed loss is expressed as follows, so that the time stability is guaranteed:
Figure BDA00024842559300000613
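For illustration, a sketch of the combined objective follows, with the constraint term computed separately (e.g. by the Huber or HSIC sketches above); the squared-error forms of the reconstruction and velocity terms are assumptions consistent with the description.

```python
import numpy as np

def total_loss(y_true, y_pred, constraint_value, lambda1=0.1, lambda2=10.0):
    """Combined objective L = L_r + lambda1 * L_c + lambda2 * L_v with the
    weights quoted in the text. y_true, y_pred: (T, N, 3) displacement sequences;
    constraint_value: the Huber or HSIC constraint term L_c on the intermediate variables."""
    # Reconstruction loss: distance between true and predicted displacements.
    L_r = np.mean(np.sum((y_true - y_pred) ** 2, axis=-1))
    # Velocity loss: match frame-to-frame differences to encourage temporal stability.
    v_true = y_true[1:] - y_true[:-1]
    v_pred = y_pred[1:] - y_pred[:-1]
    L_v = np.mean(np.sum((v_true - v_pred) ** 2, axis=-1))
    return L_r + lambda1 * constraint_value + lambda2 * L_v
```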
the invention also provides a voice-driven three-dimensional facial animation generation network structure which comprises an encoder, a decoder and a constraint of3D graph geometry for intermediate variables.
The invention has the following beneficial effects: compared with traditional reconstruction methods, the method constrains the intermediate variable with features of the 3D geometry. In the encoder stage, densely connected convolutional layers are designed to strengthen feature propagation and encourage reuse of audio features; in the decoder stage, an attention mechanism enables the network to adaptively adjust key regions; for the intermediate variables, a geometry-guided training strategy with two constraints from different perspectives is proposed to achieve a more powerful animation effect. In addition, the three-dimensional face animation generation network has high accuracy, the generated animation is more accurate and reasonable, and the method generalizes well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a diagram of a network architecture model according to the present invention;
FIG. 3 is a schematic diagram comparing reconstruction results obtained on the VOCASET dataset with other methods according to an embodiment of the present invention; from top to bottom: the result reconstructed by Cudeiro et al., the error visualization of the method of Cudeiro et al., and the error visualization of the present invention.
Detailed Description
Certain terms are used throughout the description and claims to refer to particular components. Those of skill in the art will appreciate that a hardware manufacturer may refer to the same component by different names. The description and claims do not distinguish components by differences in name, but by differences in function. As used throughout the specification and claims, the word "comprise" is an open-ended term and should therefore be interpreted to mean "including, but not limited to". By "substantially" is meant that, within an acceptable error range, a person skilled in the art is able to solve the technical problem and substantially achieve the technical effect.
In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "front", "rear", "left", "right", "horizontal", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The present invention will be described in further detail below with reference to the drawings, but is not limited thereto.
Example 1
As shown in fig. 1 to 3, a voice-driven three-dimensional face animation generation method includes the following steps:
1) Extracting voice features using the DeepSpeech engine, converting the identity information of the voice into a one-hot vector, and embedding the one-hot vector into the feature matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacement in 3D space;
4) Driving the template according to the acquired displacement in 3D space to simulate the facial animation.
Preferably, the primary purpose of the encoder is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolution layers; each layer is connected in series, and its feature map is obtained by downsampling followed by the ReLU activation function. Unlike an ordinary convolution process, a denser connection pattern is adopted here, so that deep features and shallow features can be effectively combined. Specifically, the i-th convolution layer receives all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of 2×1 stride and a ReLU linear activation unit. In general, the number of feature maps doubles after each convolution layer; to keep the concatenation process smooth, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer can be effectively reused and the encoder learns richer features.
Preferably, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully connected layers so that the network learns important information with emphasis. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs. The final output layer is a fully connected layer with linear activation that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices. To make training more stable, in this embodiment, the weights of the output layer are initialized from 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0.
Preferably, the above encoder-decoder structure can be regarded as a cross-modal process, in which the intermediate variable r is referred to as the cross-modal representation; it represents a particular identity and the deformed geometry. The encoder encodes the input face mesh into the intermediate variable r̂, and the decoder decodes it into a 3D mesh. The present invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh. During training, by providing the mesh corresponding to the frames of each audio clip and using the encoder, the corresponding geometric representation can be obtained automatically, and this representation is used to constrain the cross-modal representation so that the encoder output is closely related to the 3D geometric representation. For this purpose, the Huber constraint and the Hilbert-Schmidt independence criterion constraint are employed in the present invention.
The Huber constraint is expressed as follows: assume there are two vectors r and r̂; then

L_Huber(r, r̂) = (1/2)(r − r̂)², if |r − r̂| ≤ δ; and δ|r − r̂| − (1/2)δ², otherwise;

where δ is the Huber threshold.
preferably, the Hilbert-Schmidt independence criterion constraint is used to measure non-linearities and higher-order correlations, enabling the estimation of dependencies between representations without explicitly estimating the joint distribution of random variables, assuming that there are two variables R= [ R ] 1 ,...,r i ,...,r M ]And
Figure BDA0002484255930000113
m is the batch size, defining a map phi (r) mapping the intermediate variable r to kernel space +.>
Figure BDA0002484255930000114
Also the inner product is denoted +.>
Figure BDA0002484255930000115
Hilbert-Schmidt independence criterion constraints are expressed as:
Figure BDA0002484255930000116
wherein k is R And
Figure BDA0002484255930000117
for kernel function +.>
Figure BDA0002484255930000118
And->
Figure BDA0002484255930000119
Is Hilbert space->
Figure BDA00024842559300001110
For R and->
Figure BDA00024842559300001111
Is to order
Figure BDA00024842559300001112
Is taken from->
Figure BDA00024842559300001113
The empirical derivation of HSIC is:
Figure BDA00024842559300001114
where tr denotes the trace of the square matrix, K 1 And K 2 Is k 1,ij =k 1 (r i ,r j ) And
Figure BDA00024842559300001115
is centered with a mean value of 0 in the feature space:
Figure BDA00024842559300001116
in deep, the present embodiment employs a window of w=16, a voice feature of d=29, and the size of the intermediate variable is set to 64. As previously described, the network is divided into encoder and decoder parts and the constraints on the intermediate variables are imposed on the 3D geometry. The encoder has a 4-layer convolution, a 3 x1 filter, a convolution of 2 x1 steps, and a linear activation unit ReLU. The number of feature maps is doubled after each convolution layer, and in order to make the concatenation process smooth, a 2×1 pooling layer is added after each convolution layer to reduce the number of feature maps. The first two fully connected layers of the decoder use the tanh activation function and the final output layer is the fully connected layer with linear activation function, which produces 5023 x 3 outputs corresponding to the three-dimensional displacement vectors of 5023 vertices. The weights of this layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the deviations are initialized by 0. Wherein the loss function includes a reconstruction loss, a constraint loss, and a speed loss, and the expression is:
L=L r1 L c2 L v
wherein lambda is 1 And lambda (lambda) 2 Is a positive number to balance loss terms, set lambda 1 Is 0.1 lambda 2 10.0, L r To reconstruct the loss, the distance between the true and predicted values is calculated:
Figure BDA0002484255930000121
constraint loss L c The 3D graph geometric intermediate variable is obtained through the grid, the Huber or Hilbert-Schmidt independence criterion is used for restraining the existing intermediate variable, and the speed loss is expressed as follows, so that the time stability is guaranteed:
Figure BDA0002484255930000122
it should be noted that the invention is realized based on Tensorflow, and runs on an Indellovely GTX1080Ti video card to train with an Adam optimizer with momentum of 0.9, and trains 50 stages with a fixed learning rate of 1 e-4. For efficient training and testing, the present embodiment divides 12 subjects into a training set, a validation set and a test set. In addition, the remaining objects are also divided into 2 validation sets, 2 test sets. The training set includes all sentences of the eight objects. For the validation set and the test set, 20 unique sentences are selected so that they are not shared with other objects. There is no overlap between training, validation and test set for an object or sentence.
Example 2
A voice-driven three-dimensional face animation generation network structure comprises an encoder, a decoder and a 3D graph geometry constraint on the intermediate variables, where the encoder and the decoder are those of Embodiment 1. The network is guided by the 3D geometry: given a segment of speech as input, low-dimensional speech features are obtained through the encoder and constrained by the 3D geometry; the decoder then produces the face displacements in 3D space, and the animation is generated by driving the template.
While the foregoing description illustrates and describes several preferred embodiments of the present invention, it should be understood that the invention is not limited to the forms disclosed herein. These embodiments should not be construed as excluding other embodiments; the invention is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept described herein, whether in light of the above teachings or the skill and knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (7)

1. A voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) Extracting voice characteristics, and embedding the identity information of the voice into a characteristic matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable, to obtain the displacement in 3D space; the decoder comprises two fully connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully connected layers; let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the real-valued weights of the attention block; the output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication; the attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs; the final output layer is a fully connected layer with a linear activation function that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices;
the constraint of3D graphics geometry on the intermediate variables is specifically to set a grid corresponding to the frames in each audio, and the encoder can be used to automatically obtain a corresponding geometric representation, and the geometric representation is used to constrain the intermediate variables, including Huber constraint and Hilbert-Schmidt independence criterion constraint, where Huber constraint is expressed as: assume that there are two vectors r and
Figure FDA0004239778600000017
then there is
Figure FDA0004239778600000021
Wherein,,
Figure FDA0004239778600000022
the Hilbert-Schmidt independence constraint is used to measure non-linearities and higher-order correlations, enabling the estimation of dependencies between representations without explicitly estimating the joint distribution of random variables, assuming twoThe variable r= [ R ] 1 ,...,r i ,...,r M ]And
Figure FDA0004239778600000023
m is the batch size, defining a map phi (r) mapping the intermediate variable r to kernel space +.>
Figure FDA0004239778600000024
Also the inner product is denoted +.>
Figure FDA0004239778600000025
Hilbert-Schmidt independence criterion constraints are expressed as:
Figure FDA0004239778600000026
wherein k is R And
Figure FDA0004239778600000027
for kernel function +.>
Figure FDA0004239778600000028
And->
Figure FDA0004239778600000029
Is Hilbert space->
Figure FDA00042397786000000210
For R and->
Figure FDA00042397786000000211
Is (are) desirable to be (are)>
Figure FDA00042397786000000212
For two parameters R and->
Figure FDA00042397786000000213
Is a combination of (2)Distribution of the combination, let->
Figure FDA00042397786000000214
Is taken from->
Figure FDA00042397786000000215
The empirical derivation of HSIC is:
Figure FDA00042397786000000216
the two HSIC formulas are gradually progressive in front and back, and a final deduction formula is adopted;
where tr denotes the trace of the square matrix, K 1 And K 2 Is k 1,ij =k 1 (r i ,r j ) And
Figure FDA00042397786000000217
is centered with a mean value of 0 in the feature space:
Figure FDA00042397786000000218
4) Driving the template according to the acquired displacement in 3D space to simulate the facial animation.
2. The voice-driven three-dimensional facial animation generation method of claim 1, wherein: the DeepSpeech engine is adopted to extract the voice features in step 1).
3. The voice-driven three-dimensional face animation generation method of claim 1, wherein the encoder comprises four convolution layers, and the i-th convolution layer receives all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of 2×1 stride and a ReLU linear activation unit.
4. The voice-driven three-dimensional face animation generation method according to claim 3, wherein: a pooling layer is added after each convolution layer, and the number of feature maps is reduced through the pooling layer.
5. The voice-driven three-dimensional facial animation generation method of claim 1, wherein: the weights of the output layer are initialized from 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0.
6. The voice-driven three-dimensional facial animation generation method according to claim 1, wherein the loss function used in steps 1) to 4) comprises a reconstruction loss, a constraint loss and a velocity loss, expressed as:

L = L_r + λ_1 L_c + λ_2 L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms; λ_1 is set to 0.1 and λ_2 to 10.0; L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements:

L_r = Σ_t ||y_t − ŷ_t||²;

the constraint loss L_c constrains the intermediate variable with the 3D graph geometric representation obtained from the mesh, using the Huber or Hilbert-Schmidt independence criterion; the velocity loss, which guarantees temporal stability, is expressed as:

L_v = Σ_t ||(y_t − y_{t−1}) − (ŷ_t − ŷ_{t−1})||².
7. A voice-driven three-dimensional face animation generation network structure, characterized by comprising an encoder, a decoder and a 3D graph geometry constraint on the intermediate variables, wherein the encoder is the encoder according to any one of claims 1 to 6 and the decoder is the decoder according to any one of claims 1 to 6.
CN202010387250.0A 2020-05-09 2020-05-09 Voice-driven three-dimensional face animation generation method and network structure Active CN111724458B (en)

Priority Applications (1)

Application Number: CN202010387250.0A
Priority Date: 2020-05-09
Filing Date: 2020-05-09
Title: Voice-driven three-dimensional face animation generation method and network structure

Applications Claiming Priority (1)

Application Number: CN202010387250.0A
Priority Date: 2020-05-09
Filing Date: 2020-05-09
Title: Voice-driven three-dimensional face animation generation method and network structure

Publications (2)

Publication Number Publication Date
CN111724458A CN111724458A (en) 2020-09-29
CN111724458B true CN111724458B (en) 2023-07-04

Family

ID=72564794

Family Applications (1)

Application Number: CN202010387250.0A (Active, CN111724458B)
Priority Date: 2020-05-09
Filing Date: 2020-05-09
Title: Voice-driven three-dimensional face animation generation method and network structure

Country Status (1)

Country Link
CN (1) CN111724458B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763532B (en) * 2021-04-19 2024-01-19 腾讯科技(深圳)有限公司 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object
CN113838174B (en) * 2021-11-25 2022-06-10 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN114332315B (en) * 2021-12-07 2022-11-08 北京百度网讯科技有限公司 3D video generation method, model training method and device
CN116385606A (en) * 2022-12-16 2023-07-04 浙江大学 Speech signal driven personalized three-dimensional face animation generation method and application thereof
CN116188649B (en) * 2023-04-27 2023-10-13 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network

Also Published As

Publication number Publication date
CN111724458A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111724458B (en) Voice-driven three-dimensional face animation generation method and network structure
Cudeiro et al. Capture, learning, and synthesis of 3D speaking styles
Busso et al. Rigid head motion in expressive speech animation: Analysis and synthesis
CN112581569B (en) Adaptive emotion expression speaker facial animation generation method and electronic device
CN113256821B (en) Three-dimensional virtual image lip shape generation method and device and electronic equipment
Granström et al. Audiovisual representation of prosody in expressive speech communication
JP2023545642A (en) Target object movement driving method, device, equipment and computer program
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
EP3866117A1 (en) Voice signal-driven facial animation generation method
Liu et al. Geometry-guided dense perspective network for speech-driven facial animation
CN103258340B (en) Is rich in the manner of articulation of the three-dimensional visualization Mandarin Chinese pronunciation dictionary of emotional expression ability
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN104732590A (en) Sign language animation synthesis method
CN115330911A (en) Method and system for driving mimicry expression by using audio
Vasani et al. Generation of indian sign language by sentence processing and generative adversarial networks
CN113140023A (en) Text-to-image generation method and system based on space attention
Li et al. A survey of computer facial animation techniques
Ma et al. Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data
CN117115316A (en) Voice-driven three-dimensional face animation method based on multi-level voice features
Liu et al. Emotional facial expression transfer based on temporal restricted Boltzmann machines
Balayn et al. Data-driven development of virtual sign language communication agents
CN116309984A (en) Mouth shape animation generation method and system based on text driving
Chu et al. CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation
Huang et al. Visual speech emotion conversion using deep learning for 3D talking head
Li et al. A novel speech-driven lip-sync model with CNN and LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant