CN117237495A - Three-dimensional face animation generation method and system

Info

Publication number
CN117237495A
Authority
CN
China
Prior art keywords
animation
expression
feature
model
style
Legal status
Granted
Application number
CN202311463584.1A
Other languages
Chinese (zh)
Other versions
CN117237495B (en)
Inventor
王新文
陈珉
林俊江
朱禹
Current Assignee
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Application filed by Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority to CN202311463584.1A
Publication of CN117237495A
Application granted
Publication of CN117237495B
Legal status: Active


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The embodiments of this specification provide a three-dimensional face animation generation method and system, relating to the technical field of facial animation. The method is performed by a processor and includes: generating animation base features based on character base elements, the animation base features comprising at least one of a style feature and a speech feature; generating fused animation features through a feature fusion model based on the animation base features, the feature fusion model being a machine learning model; and generating a facial expression animation through an expression generation model based on the fused animation features, the expression generation model being a machine learning model. The method and system make the facial expressions of the generated animation more natural, improve the accuracy and emotional expressiveness of the animated expressions, and improve the user experience.

Description

Three-dimensional face animation generation method and system
Technical Field
The present disclosure relates to the field of facial animation technologies, and in particular, to a method and a system for generating three-dimensional facial animation.
Background
With the rapid development of computer vision, graphics, and the metaverse, related technologies are increasingly being applied in many fields, including facial animation generation.
At present, facial animation is generated mainly in two ways: real-person motion capture and model-driven generation. Motion capture requires a real person to record genuine facial expressions and consumes a large amount of labor. The model-driven approach uses a model to generate the facial animation data, so the data can be produced automatically and quickly, saving time and labor, but the resulting facial expressions are neutral and unnatural and cannot reflect the character's emotions.
It is therefore desirable to provide a three-dimensional face animation generation method and system that make the facial expressions of the animation more natural, improve the accuracy and emotional expressiveness of the animated expressions, and improve the user experience.
Disclosure of Invention
One or more embodiments of the present specification provide a three-dimensional face animation generation method. The method is performed by a processor and includes: generating an animation base feature based on the character base element, the animation base feature comprising at least one of a style feature and a speech feature; generating a fused animation feature through a feature fusion model based on the animation basic feature, wherein the feature fusion model is a machine learning model; and generating a facial expression animation through an expression generating model based on the fusion animation characteristics, wherein the expression generating model is a machine learning model.
One or more embodiments of the present specification provide a three-dimensional facial animation generation system. The system comprises a base generation module, a fusion generation module, and an expression generation module. The base generation module is configured to generate animation base features based on character base elements, the animation base features comprising at least one of a style feature and a speech feature; the fusion generation module is configured to generate fused animation features through a feature fusion model based on the animation base features, the feature fusion model being a machine learning model; and the expression generation module is configured to generate a facial expression animation through an expression generation model based on the fused animation features, the expression generation model being a machine learning model.
One or more embodiments of the present specification provide a three-dimensional facial animation generating apparatus. The apparatus comprises at least one processor and at least one memory; the at least one memory stores computer instructions, and the at least one processor executes at least part of the computer instructions to implement the three-dimensional face animation generation method.
One of the embodiments of the present specification provides a computer-readable storage medium storing computer instructions; when a computer reads the computer instructions in the storage medium, the computer performs the three-dimensional face animation generation method.
The beneficial effects are as follows: based on the character base elements, voice features and style features matched with those elements can be determined; fused animation features can then be generated quickly and accurately from the voice features and style features using a feature fusion model; and the fused animation features are processed by an expression generation model to produce a 3D facial expression animation with accurate mouth shapes, strong speech synchronization, natural and vivid expressions, and controllable emotion. As a result, the facial expressions of the generated three-dimensional facial animation are more natural, the accuracy and emotional expressiveness of the animated expressions are improved, and the user experience is improved, fully meeting user needs.
Drawings
The present specification will be further elucidated by way of example embodiments, which will be described in detail by means of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is a schematic illustration of an application scenario of a three-dimensional facial animation generating system, according to some embodiments of the present description;
FIG. 2 is an exemplary block diagram of a three-dimensional facial animation generation system, according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart of a three-dimensional facial animation generation method, shown in accordance with some embodiments of the present description;
FIG. 4 is an exemplary schematic diagram of a feature fusion model and a weight determination model shown in accordance with some embodiments of the present description;
FIG. 5 is an exemplary diagram of a phoneme generation model shown in accordance with some embodiments of the present specification;
FIG. 6 is an exemplary overall schematic diagram of a three-dimensional facial animation generation method, according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies at different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
The terms "a," "an," "the," and/or "the" are not specific to the singular, but may include the plural, unless the context clearly indicates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
Fig. 1 is a schematic view of an application scenario of a three-dimensional facial animation generating system according to some embodiments of the present description.
In some embodiments, facial expression animation may be obtained by implementing the methods and/or processes disclosed in the present specification in the application scenario 100 of the three-dimensional facial animation generation system.
As shown in fig. 1, an application scenario 100 of a three-dimensional face animation generation system may include a processor 110, a network 120, a storage device 130, and a terminal 140.
In some embodiments, the processor 110 may be configured to process information and/or data related to the application scenario 100 of the three-dimensional facial animation generation system to perform one or more of the functions described herein. For example, the processor 110 may generate corresponding animation base features based on the character base elements and generate fused animation features through the feature fusion model based on the animation base features. As another example, the processor 110 may generate a facial expression animation through the expression generation model based on the fused animation features. In some embodiments, the processor 110 may include one or more processing engines (e.g., a single-chip processing engine or a multi-chip processing engine). By way of example only, the processor 110 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
In some embodiments, the processor 110 may be connected via the network 120 to other devices (not shown), such as cameras, video cameras, and audio recorders, to acquire character base elements in real time or intermittently and to generate facial expression animation through processing.
The network 120 may include any suitable network capable of facilitating the exchange of information and/or data among components in the application scenario 100 of the three-dimensional facial animation generating system. In some embodiments, one or more components (e.g., the processor 110, the storage device 130, and the terminal 140) in the application scenario 100 may send/receive information and/or data to/from other components in the application scenario 100 over the network 120. For example, the processor 110 may retrieve the character base elements from the storage device 130 via the network 120. In some embodiments, the network 120 may be any form of wired or wireless network, or any combination thereof. In some embodiments, the application scenario 100 of the three-dimensional face animation generation system may include one or more network access points. For example, the application scenario 100 may include wired or wireless network access points, such as base stations and/or wireless access points, through which one or more components of the application scenario 100 may connect to the network 120 to exchange data and/or information.
The storage device 130 may be used to store data and/or instructions related to the application scenario 100 of the three-dimensional facial animation generation system. In some embodiments, the storage device 130 may store materials (e.g., character base elements) retrieved from the terminal 140. In some embodiments, the storage device 130 may store information and/or instructions for execution or use by the processor 110 to perform the exemplary methods described herein. In some embodiments, the storage device 130 may store character base elements (e.g., expression style pictures, expression style videos, style identification numbers, style intensities, voice elements, etc.), facial expression animations, training data sets, and the like. In some embodiments, the storage device 130 may be implemented on a cloud platform.
In some embodiments, the storage device 130 may be connected to the network 120 to communicate with one or more components (e.g., the processor 110, the terminal 140, etc.) of the application scenario 100 of the three-dimensional facial animation generating system. One or more components of the application scenario 100 of the three-dimensional facial animation generation system may access materials or instructions in the storage device 130 via the network 120. For example, the processor 110 may read voice element data or the like from the storage device 130 and perform corresponding processing. In some embodiments, the storage device 130 may be directly connected to or in communication with one or more components (e.g., the processor 110, the terminal 140) in the application scenario 100 of the three-dimensional facial animation generation system. In some embodiments, the storage device 130 may be part of the processor 110.
In some embodiments, the terminal 140 may enable user interaction with the application scenario 100 of the three-dimensional facial animation generation system. In some embodiments, the terminal 140 may include a mobile device 140-1, a tablet 140-2, a notebook 140-3, or the like, or any combination thereof. In some embodiments, the mobile device 140-1 may include a smart home device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the processor 110 may obtain the character base elements from one or more components (e.g., cameras, audio recorders, photo albums, etc.) of the terminal 140 over the network 120.
In some embodiments, the terminal 140 may present the finally generated three-dimensional face animation, etc., to the user.
It should be noted that the application scenario 100 of the three-dimensional facial animation generating system is provided for illustrative purposes only and is not intended to limit the scope of the present application. Many modifications and variations will be apparent to those of ordinary skill in the art in light of the present description. For example, the application scenario 100 of the three-dimensional face animation generation system may implement similar or different functions on other devices. However, such changes and modifications do not depart from the scope of the present application.
It should be noted that the above description of the application scenario 100 of the three-dimensional facial animation generating system is merely for convenience of description, and the present disclosure should not be limited to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily or a subsystem may be constructed in connection with other modules without departing from such principles.
FIG. 2 is an exemplary block diagram of a three-dimensional facial animation generation system, according to some embodiments of the present description.
As shown in fig. 2, the three-dimensional facial animation generation system 200 may include a base generation module 210, a fusion generation module 220, and an expression generation module 230.
In some embodiments, the base generation module 210 may be configured to generate the animation base feature based on the character base elements. Wherein the animation base feature comprises at least one of a style feature and a speech feature. For more on the basic features of the animation, see FIG. 3 and its associated description.
In some embodiments, the character base elements include expression style elements, wherein the expression style elements include at least one of expression style pictures, expression style videos, style identification numbers, and style intensities. In some embodiments, the base generation module 210 may be further configured to generate the style feature from a style coding model based on the expression style element, wherein the style coding model is a machine learning model. See fig. 3 and its associated description for more details of character primitives, animation basis features, expression style elements, and style coding models.
In some embodiments, the character base elements may also include a voice element. In some embodiments, the speech features include at least one of speech energy features and phoneme features. In some embodiments, the base generation module 210 may be further configured to generate speech energy features based on the voice element, and to generate phoneme features through a phoneme generation model based on the voice element, wherein the phoneme generation model is a machine learning model. For more on voice elements, the phoneme generation model, and phoneme features, see FIG. 3 and its associated description.
In some embodiments, the phoneme generation model comprises a speech transcription phoneme layer and a phoneme encoding layer. In some embodiments, generating phoneme features by a phoneme generation model based on the phonetic elements comprises: generating a phoneme time sequence through a voice transcription phoneme layer based on the voice elements; based on the phoneme time sequence, phoneme features are generated by the phoneme encoding layer. For more on the speech transcription phoneme layer, phoneme coding layer, phoneme time sequence, see fig. 5 and its related description.
In some embodiments, the fusion generation module 220 may be configured to generate fused animation features through a feature fusion model based on the animation base features, wherein the feature fusion model is a machine learning model. For more on the feature fusion model and fused animation features, see FIG. 3 and its associated description.
In some embodiments, the input to the feature fusion model further includes weight coefficients for time-point stitching. In some embodiments, the fusion generation module 220 may be further configured to process the animation base features through a weight determination model to determine the weight coefficients for time-point stitching, as described in more detail with respect to FIG. 4.
In some embodiments, the expression generation module 230 may be configured to generate a facial expression animation by an expression generation model based on the fused animation features, wherein the expression generation model is a machine learning model.
In some embodiments, the expression generation model may include a feature encoder and an expression decoder. In some embodiments, expression generation module 230 may be further configured to: determining a coding result by a feature encoder based on the fused animation features; generating an expression coefficient sequence by an expression decoder based on the encoding result; and generating facial expression animation through animation synthesis processing based on the expression coefficient sequence. For more on the expression generation model, see fig. 3 and its related description.
In some embodiments, the three-dimensional facial animation generation system 200 may also include a training module (not shown). The training module may be configured to construct a training dataset and obtain the feature fusion model and the expression generation model by joint training based on the training dataset, for more details see fig. 3 and its associated description.
In some embodiments, the base generation module, the fusion generation module, and the expression generation module disclosed in FIG. 2 may be different modules in a single system, or a single module may implement the functions of two or more of these modules. For example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present description.
Fig. 3 is an exemplary flow chart of a three-dimensional facial animation generation method, according to some embodiments of the present description. In some embodiments, the process 300 may be performed by the processor 110. The process 300 may include the steps of:
at step 310, an animation base feature is generated based on the character base elements. In some embodiments, step 310 may be performed by base generation module 210.
The character basic element refers to data related to a character. In some embodiments, the character base may be various types of data, such as pictures, videos, and/or audio files. In some embodiments, the basic elements of the character may be facial image data and/or voice data when the character utters voice, where the facial image data may be picture and/or video data reflecting facial expression changes when the character utters voice, and the voice data may be content of the character uttering voice.
In some embodiments, the character base elements include an expression style element and a speech element. For more explanation of the expression style elements and the speech elements, reference may be made to the related description below in fig. 3.
In some embodiments, the processor 110 may obtain the character base elements in a variety of ways. For example, the processor 110 may obtain character base elements (e.g., facial image data) by capturing facial expression pictures and/or videos of a character with a camera, video camera, or the like. As another example, the processor 110 may obtain character base elements (e.g., voice data) by recording the character's speech with an audio recorder or the like.
The animation base feature refers to a feature reflecting the facial information of a character that is used to synthesize the animation. In some embodiments, the animation base feature comprises at least one of a style feature and a speech feature. A style feature reflects the expression style of the character; a speech feature reflects the character's voice content. In some embodiments, the animation base feature may comprise style features and/or speech features at a plurality of consecutive time points. The character's expression style may include, but is not limited to, happiness, anger, sadness, fear, and other special expression styles, and the character's voice content may include, but is not limited to, the duration of the speech, the sound intensity of the speech, the information contained in the speech, and the like. In some embodiments, the format of the animation base feature may be a vector and/or matrix format or the like that the processor 110 can recognize and read.
For more explanation of style features and speech features, see the associated description below with respect to FIG. 3.
In some embodiments, the processor 110 may generate the animation base feature in a variety of ways based on the character base element. For example, the processor 110 may generate the animation base features based on the character base elements by way of artificial intelligence algorithms, neural networks, machine learning models, and the like.
In some embodiments, the processor 110 may analyze the character base elements in a variety of ways to generate the style features in the animation base features. For example, the processor 110 may filter a large amount of facial image data in the expression style elements with an image classifier to determine facial image data having prominent emotional features, and treat at least one set of emotional features corresponding to at least one set of facial image data as the style features in the animation base features. The image classifier is a machine learning algorithm for image classification, such as a decision tree, U-Net, naive Bayes, or a combination thereof. Emotional features are features that reflect a person's facial emotion and its changes, for example, facial expressions reflecting crying, happiness, and so on. A prominent emotional feature is a feature that can significantly characterize the person's emotion. For more explanation of the expression style elements, see the related description below in FIG. 3.
In some embodiments, the processor 110 may generate style features from the style coding model based on the expression style elements.
The expression style element refers to data reflecting the expression style of a person. In some embodiments, the expression style element may include at least one of an expression style picture, an expression style video, a style identification number, and a style intensity.
The expression style picture and the expression style video refer to picture data and video data reflecting a person's expression style, for example, pictures and videos reflecting the person's facial expression when happy. For how to obtain the expression style elements (e.g., expression style pictures and expression style videos), reference may be made to the description above in FIG. 3 of how to obtain the character base elements.
The style identification number is an identification number reflecting the different style types of a character's expression. It may be represented in various forms, such as numbers or letters. For example, the processor 110 may number the facial expressions of happiness, anger, sadness, and fear as 001, 002, 003, and 004, respectively; the identification number 003 then reflects the person's sadness.
The style intensity is a numerical value reflecting the degree of similarity between a person's expression style and a standard expression style. In some embodiments, a standard expression style is preset image data with a certain prominent emotional feature, such as a picture that clearly shows a person's facial expression when sad. In some embodiments, there may be multiple standard expression styles, each corresponding to a different emotional feature. In some embodiments, an expression style picture or expression style video may have multiple corresponding style intensities. For example, if expression style picture 1 reflects a sad facial expression, its similarity with standard expression style 1 is 80%, and its similarity with standard expression style 2 is 70%, then the style intensity of expression style picture 1 is (0.8, 0.7). Different standard expression styles correspond to different style identification numbers.
In some embodiments, the processor 110 may determine the expression style elements (e.g., the style identification number and/or style intensity) in a variety of ways. For example, the processor 110 may preset the style identification numbers corresponding to different expression style pictures and videos and create a data table; the processor 110 may then determine the style identification number of an expression style picture or video by querying/retrieving the data table. As another example, the processor 110 may calculate the similarities between an expression style picture and/or video and multiple standard expression styles, where the similarity may be computed by, but is not limited to, a hash similarity algorithm, histogram comparison, cosine similarity, and the like. The processor 110 may take the style identification number of the standard expression style with the greatest similarity to the expression style picture and/or video as the style identification number of that picture and/or video. The processor 110 may also rank the similarities from largest to smallest and take at least one of the top-ranked similarities as the style intensity.
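By way of non-limiting illustration (this sketch is not part of the original specification), the style identification number and style intensity described above could be obtained by ranking cosine similarities between an embedding of an expression style picture and embeddings of the standard expression styles. The embedding dimension, the standard_styles table, and the function names are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def style_id_and_intensity(picture_embedding: np.ndarray,
                           standard_styles: dict,
                           top_k: int = 2):
    # Similarity of the input picture to every standard expression style.
    sims = {sid: cosine_similarity(picture_embedding, emb)
            for sid, emb in standard_styles.items()}
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    style_id = ranked[0][0]                                # most similar standard style
    style_intensity = tuple(round(s, 2) for _, s in ranked[:top_k])
    return style_id, style_intensity

# Hypothetical usage: "001" and "003" are assumed style identification numbers.
standards = {"001": np.random.rand(128), "003": np.random.rand(128)}
print(style_id_and_intensity(np.random.rand(128), standards))
```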
A style coding model is a model used to determine style features. In some embodiments, the style coding model may be a machine learning model, for example, any one of a neural network (NN) model, a 2D or 3D convolutional neural network (CNN) model, a long short-term memory (LSTM) model, a combination of a 2D or 3D CNN model and an LSTM model, and the like.
In some embodiments, the input of the style coding model may include an expression style element, with the output being a style feature.
In some embodiments, the style features may include parameters of the basic emotions (e.g., anger, sadness, etc.) of the character to which the expression style element corresponds. For example, picture data of a person who is extremely angry but smiling may correspond to the style feature of an angry smile. In some embodiments, the format of the style features may be a format that the processor 110 can read and recognize, such as a vector and/or matrix. For example, the processor 110 may represent the style feature corresponding to an angry smile with the vector (2, A, F, 0.8), where 2 denotes the picture or video corresponding to the style feature, A and F denote the facial expression features corresponding to anger and smiling, respectively, and 0.8 denotes the style intensity of the style feature. For more explanation of style intensity, see the related description above in FIG. 3.
In some embodiments, the style coding model may be trained on a plurality of labeled first training samples. A first training sample is a sample expression style element, for example a picture and/or video with a style identification number and style intensity; the label of the first training sample is a style feature, and the label may be determined by manual annotation. For example, the processor 110 may take picture data of a person who is extremely angry but smiling as a first training sample, and the corresponding label may be manually annotated as an angry smile. As another example, the processor 110 may take a plurality of pictures at different times reflecting the fading of a person's smile as a first training sample, and the corresponding label may be manually annotated as a fading smile.
In some embodiments, the processor 110 may input a plurality of labeled first training samples into the initial style coding model, construct a loss function from the labels and the output of the initial style coding model, and iteratively update the parameters of the initial style coding model based on the loss function by gradient descent or other methods. When a preset condition is met, model training is complete and the trained style coding model is obtained. The preset condition may be that the loss function converges, that the number of iterations reaches a threshold, or the like.
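The training procedure described above can be illustrated with the following PyTorch sketch. The placeholder encoder architecture, the mean-squared-error loss, and the fixed epoch count are assumptions for illustration only, not details disclosed in this specification.

```python
import torch
import torch.nn as nn

class StyleCodingModel(nn.Module):
    # Placeholder style coding model: a small feed-forward encoder.
    def __init__(self, in_dim: int = 512, style_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, style_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_style_coder(samples: torch.Tensor, labels: torch.Tensor,
                      epochs: int = 100, lr: float = 1e-3) -> StyleCodingModel:
    model = StyleCodingModel(samples.shape[-1], labels.shape[-1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):                     # or stop once the loss converges
        optimizer.zero_grad()
        loss = loss_fn(model(samples), labels)  # loss built from labels vs. model output
        loss.backward()                         # gradient-descent update
        optimizer.step()
    return model
```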
In some embodiments of the present disclosure, a trained style coding model can obtain the style features corresponding to different expression style elements relatively quickly and accurately, while also capturing expression change features at different times and expression features containing multiple emotions of the user, so that the subsequently generated animation is more vivid and flexible and better reflects the user's true emotions.
In some embodiments, the processor 110 may analyze the character base elements in a variety of ways to generate speech features in the animated base feature. For example, the processor 110 may acquire the sound intensity of the human voice at different time points through the sound intensity sensor according to the voice elements in the basic elements of the human, and use the sound intensity as the voice feature. The sound intensity sensor refers to a device that can monitor sound intensity and frequency, and the processor 110 can monitor sound intensity and frequency of a sound recording file and/or a video file at different points in time through the sound intensity sensor and determine a voice feature based on the sound intensity and frequency.
In some embodiments, the processor 110 may generate the speech energy features based on speech elements in the character base elements and generate the phoneme features based on the speech elements by a phoneme generation model.
In some embodiments, the character base elements include voice elements. A voice element is data related to a person's speech, for example an audio file containing the person's voice, a recording file, call-record data, and the like. For how to obtain the voice elements, reference may be made to the description above in FIG. 3 of how to obtain the character base elements.
In some embodiments, the speech features include at least one of speech energy features and phoneme features.
The speech energy features reflect the intensity, prosody, and similar properties of a person's speech at different time points. The phoneme features relate to the person's speech content at different time points. For example, a speech feature (30, 50, 35, "hello") may indicate that the values of the speech energy feature at time points 1, 2, and 3 are 30 dB, 50 dB, and 35 dB, respectively, and that the phoneme feature corresponds to "hello".
In some embodiments, the processor 110 may obtain the speech energy characteristics in a variety of ways based on the speech elements. For example, the processor 110 may process the speech elements to obtain the speech energy feature using a preset algorithm, a calculation formula, a combination thereof, or the like.
In some embodiments, the processor 110 may process the voice elements based on a short-time Fourier algorithm to generate an energy feature sequence, and determine the speech energy features based on the energy feature sequence.
The short-time Fourier transform is a time-frequency analysis method that uses the signal over a period of time to represent the signal characteristics at a point in time. In the short-time Fourier transform, the length of the time window determines the time resolution and frequency resolution of the energy feature sequence; for example, the longer the window, the lower the time resolution and the higher the frequency resolution. Time resolution and frequency resolution refer to the precision in time and in frequency, respectively. The higher the precision, the more closely the generated energy feature sequence matches its corresponding voice element.
The energy feature sequence refers to sequence data reflecting the intensity of voice of a person at different points in time.
In some embodiments, the processor 110 may obtain the speech energy features by monitoring the intensity of the speech with a sound intensity sensor, by reading a recording file and/or video file, and so on. For example, the processor 110 may process the voice elements with a short-time Fourier algorithm to generate an energy feature sequence, select the parts of the sequence corresponding to the local time periods in which the person is speaking, and splice these partial energy feature sequences together to determine the speech energy features.
In some embodiments, the processor 110 may process each segment of the voice element separately with a short-time Fourier algorithm to obtain time-dependent data of the sound signal, accurately decoupling the spoken content from the speech, and then splice the time-dependent data together to obtain the energy feature sequence. By extracting the speech energy features corresponding to the voice elements from the energy feature sequence, the processor 110 can quickly and accurately determine the changes in speech energy at different time points while the person speaks, improving the efficiency of acquiring the speech energy features.
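For illustration only, the following sketch (assuming a mono waveform, SciPy's short-time Fourier transform, and an arbitrary energy threshold to pick out speaking segments) shows one possible way to compute an energy feature sequence and splice out speech energy features; it is not the implementation claimed here.

```python
import numpy as np
from scipy.signal import stft

def energy_feature_sequence(waveform: np.ndarray, sample_rate: int,
                            frame_len: int = 1024) -> np.ndarray:
    # STFT: longer frames give higher frequency resolution, lower time resolution.
    _, _, spectrum = stft(waveform, fs=sample_rate, nperseg=frame_len)
    return np.sum(np.abs(spectrum) ** 2, axis=0)   # per-frame energy

def speech_energy_feature(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    energy = energy_feature_sequence(waveform, sample_rate)
    threshold = 0.1 * energy.max()                 # crude voice-activity gate (assumed)
    return energy[energy > threshold]              # splice the speaking frames together
```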
The phoneme generation model is a model for obtaining phoneme features. In some embodiments, the phoneme generation model may be a machine learning model, such as a temporal convolutional network (TCN) or a 1D convolutional neural network (1D CNN).
In some embodiments, the input of the phoneme generation model may comprise speech elements and the output may be a phoneme feature. For more explanation of the phonetic elements and phonetic features, see the other description above with respect to fig. 3.
In some embodiments, the phoneme generation model may be trained from a plurality of tagged second training samples. The second training sample is a sample voice element, the label is a phoneme feature, and the label can be determined through manual labeling. The phoneme generation model training process is similar to the style coding model, and for more details on the training process of the phoneme generation model, reference can be made to the description of the style coding model training described above with respect to fig. 3.
In some embodiments, the phoneme generation model includes a speech transcription phoneme layer and a phoneme encoding layer, and more reference is made to fig. 5 of the present specification and its associated description.
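The two-layer structure can be pictured with the simplified PyTorch sketch below. Both sub-modules are stand-ins (a per-frame classifier for the speech transcription phoneme layer and a 1D convolution for the phoneme encoding layer), and the dimensions are assumptions rather than disclosed values.

```python
import torch
import torch.nn as nn

class PhonemeGenerationModel(nn.Module):
    def __init__(self, n_mels: int = 80, n_phonemes: int = 64, feat_dim: int = 128):
        super().__init__()
        # Speech transcription phoneme layer: per-frame phoneme posteriors.
        self.transcribe = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                        nn.Linear(256, n_phonemes))
        # Phoneme encoding layer: 1D convolution over the phoneme time sequence.
        self.encode = nn.Conv1d(n_phonemes, feat_dim, kernel_size=3, padding=1)

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        # audio_frames: (batch, time, n_mels)
        phoneme_sequence = self.transcribe(audio_frames).softmax(dim=-1)
        phoneme_features = self.encode(phoneme_sequence.transpose(1, 2))
        return phoneme_features.transpose(1, 2)    # (batch, time, feat_dim)
```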
In some embodiments of the present disclosure, the processor 110 may relatively accurately obtain the voice energy feature and the phoneme feature related to the voice of the person based on the voice element through the preset algorithm and the trained phoneme generation model, so as to provide data support for the subsequent generation of the accurate and vivid facial expression animation.
Step 320, generating a fused animation feature through the feature fusion model based on the animation base feature. In some embodiments, step 320 may be performed by fusion generation module 220.
For more explanation of the basic features of an animation, see the relevant description above with respect to FIG. 3.
The fused animation features refer to features of related parameters obtained by fusing and splicing animation basic features at different time points. The fused animation features may include relevant parameters that characterize speech content, speech energy size, and expression style, among others. For example, the fused animation features include parameters obtained by fusing and splicing style features and voice features in the animation base features.
In some embodiments, the processor 110 may generate the fused animation features from the feature fusion model based on the animation base features. For more explanation of the animation base features and the fused animation features, see the relevant description above with respect to FIG. 3.
The feature fusion model is a model for generating fused animation features. In some embodiments, the feature fusion model may be a machine learning model, for example, at least one of a 1D convolutional neural network, a long-short-term memory model (LSTM), or the like, or a combination thereof.
In some embodiments, the input of the feature fusion model includes the animation base features and the output is a fused animation feature.
In some embodiments, the feature fusion model may be trained from a plurality of tagged third training samples. The third training sample is a sample animation basic feature, the label of the third training sample is a fusion animation feature, and the label can be determined through manual labeling. The training process of the feature fusion model is similar to that of the style coding model, and for more details on the training of the feature fusion model, reference can be made to the description of the training of the style coding model described above with reference to fig. 3.
In some embodiments, the input of the feature fusion model further includes weight coefficients of the time-point stitching, which may be determined by processing the animation base features through the weight determination model, as described in more detail with reference to fig. 4.
And 330, generating the facial expression animation through the expression generating model based on the fused animation characteristics. In some embodiments, step 330 may be performed by expression generation module 230.
Facial expression animation refers to animation that reflects changes in facial expressions of a person. In some embodiments, the facial expression animation may include a plurality of three-dimensional facial action frames, which are played in time sequence, and may represent three-dimensional facial expression changes of the person, and each action frame may be one piece of picture data. For example, a three-dimensional animation expressing a facial expression when a person is happy to sad while speaking may be included in the facial expression animation.
In some embodiments, the processor 110 may generate the facial expression animation from the expression generation model based on the fused animation features.
In some embodiments, the expression generation model may be a machine learning model, such as a neural network model (NN), or the like.
In some embodiments, the input of the expression generation model comprises the fused animation features and the output comprises an expression deformation sequence.
The expression deformation sequence refers to a sequence of deformation parameters reflecting continuous changes of the facial expression of a person at a plurality of time points. In some embodiments, the sequence of emoji deformation may comprise a sequence of emoji. For more details on the sequence of emoticons, see the associated description below with respect to fig. 3.
In some embodiments, the processor 110 may generate facial expression animations through an animation synthesis process based on the expression morphing sequence. For more details on the animation synthesis process, see the related description below in fig. 3.
In some embodiments, the expression generating model may be trained by a plurality of tagged fourth training samples. The fourth training sample is a sample fusion animation feature, the label of the fourth training sample is an expression deformation sequence, and the label can be determined through manual labeling. The training process of the expression generating model is similar to that of the style coding model, and for more details on the training of the expression generating model, reference can be made to the description of the training of the style coding model described above with reference to fig. 3.
In some embodiments, the expression generation model may include a feature encoder and an expression decoder. The processor 110 may determine the encoding result by the feature encoder based on the fused animation feature, generate an expression coefficient sequence by the expression decoder based on the encoding result, and generate a facial expression animation by the animation synthesis process based on the expression coefficient sequence.
In some embodiments, the feature encoder is an encoder that performs feature extraction on data to obtain an encoding result. The feature encoder may be a machine learning model, such as a Transformer encoder, positional encoding, one-hot encoding, or the like. The input of the feature encoder includes the fused animation features, and the output includes the encoding result.
The coding result refers to the coding result corresponding to the fusion animation feature at different time points.
In some embodiments, the processor 110 may obtain the encoding results from the feature encoder based on the fused animation features.
In some embodiments, the expression decoder is a decoder that decodes data. The expression decoder may be a machine learning model, such as a Transformer decoder, an autoregressive decoder, a non-autoregressive decoder, or the like. The input of the expression decoder comprises the encoding result, and the output comprises the expression coefficient sequence.
The expression coefficient sequence refers to data reflecting the expression information generated by the expression decoder. In some embodiments, the expression coefficient sequence may be a sequence of parameters reflecting the behavior of the person's facial features at different points in time. For example, the expression coefficient sequence may be a sequence of parameters representing the movements of facial muscles (such as the orbicularis oris and orbicularis oculi) at different time points. As another example, the expression coefficient sequence may be a sequence of parameters representing the actions of the facial features (such as mouth movements and opening size, whether the eyes blink, etc.) at different time points.
In some embodiments, the expression decoder may perform a decoding process on the encoding results obtained from the feature encoder to generate an expression coefficient sequence matching the fused animation features of the character (the speaking content and emotion style of the character, etc.).
In some embodiments, the processor 110 may input a plurality of fused animation features output by the feature fusion model, as a fifth training sample set, into the initial feature encoder to obtain the encoding results output by the initial feature encoder; input those encoding results, as training samples, into the initial expression decoder to obtain an expression coefficient sequence; and construct a loss function based on the expression coefficient sequence output by the initial expression decoder and a manually acquired or annotated expression coefficient sequence serving as the label. By adjusting the model parameters of the initial feature encoder and the initial expression decoder and iterating on the loss function until a preset condition is met, the trained feature encoder and expression decoder are obtained.
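As an illustrative sketch only, the feature encoder and expression decoder could be arranged as below in PyTorch. The use of a Transformer encoder with a linear decoding head, the layer sizes, and the number of expression coefficients are assumptions and do not reproduce the exact model of this specification.

```python
import torch
import torch.nn as nn

class ExpressionGenerationModel(nn.Module):
    def __init__(self, fused_dim: int = 256, n_coeffs: int = 52):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=fused_dim, nhead=4,
                                           batch_first=True)
        self.feature_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Expression decoder stand-in: maps each encoded frame to expression
        # coefficients (e.g., per-frame weights for mouth, eyes and other regions).
        self.expression_decoder = nn.Linear(fused_dim, n_coeffs)

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # fused_features: (batch, time, fused_dim)
        encoding = self.feature_encoder(fused_features)
        return self.expression_decoder(encoding)   # (batch, time, n_coeffs)
```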
The animation synthesis process refers to a process of obtaining corresponding facial expression animation based on the expression coefficient sequence. The manner of the animation synthesis process may include various animation synthesis techniques, such as an animation digital synthesis technique, a digital matting technique in stop-motion animation, a rendering technique, and a mapping technique. In some embodiments, the processor 110 may generate the facial expression animation through an animation synthesis technique based on the sequence of the facial expression coefficients obtained from the expression decoder.
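As a hedged example only, one common way to realize such an animation synthesis step is to drive a blendshape face rig with the expression coefficient sequence. The neutral mesh and blendshape deltas below are hypothetical inputs; the specification itself refers only generically to animation synthesis techniques.

```python
import numpy as np

def synthesize_frames(coeff_sequence: np.ndarray,    # (frames, n_coeffs)
                      neutral_mesh: np.ndarray,      # (n_vertices, 3)
                      blendshape_deltas: np.ndarray  # (n_coeffs, n_vertices, 3)
                      ) -> np.ndarray:
    # Each frame: neutral face plus a coefficient-weighted sum of blendshape deltas.
    offsets = np.einsum('fc,cvd->fvd', coeff_sequence, blendshape_deltas)
    return neutral_mesh[None, :, :] + offsets        # (frames, n_vertices, 3)
```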
In some embodiments of the present disclosure, the efficiency of facial expression animation generation is improved by the feature encoder and the expression decoder on the basis of retaining the information of the fused animation features, so that the fused animation features can be more accurately and reliably encoded and decoded, and then the facial expression animation more conforming to the fused animation features is generated by the animation synthesis process.
In some embodiments, the output of the feature fusion model may be used as an input to an expression generation model, and the feature fusion model and the expression generation model may be derived based on joint training. The training process of the joint training comprises the following steps: and constructing a training data set, and acquiring a feature fusion model and an expression generation model through joint training based on the training data set.
Training data sets refer to data sets used to train machine learning models. In some embodiments, the training data set may include multiple sets of training data consisting of training samples and training labels, where the training samples may include sample style features and sample speech features, and the labels are artificially labeled sample expression deformation sequences. For example, the training data set may include multiple sets of training data consisting of sample style features, sample speech features, and sample expression deformation sequences.
In some embodiments, the processor 110 may obtain style characteristics and speech characteristics based on a variety of means, such as by means of models and/or preset algorithms. The processor 110 may determine animation base features based on style features, speech features. The processor 110 may construct a training data set based on the animation base features corresponding to the different points in time. For example, the processor 110 may group style and speech features of the same character at successive time points in a certain time period into one set of training data, and groups of training data of a plurality of characters in different time periods constitute a training data set. For more explanation of style features and speech features, reference may be made to the relevant description above with respect to FIG. 3.
In some embodiments, the label of each set of training data in the training data set may be a manually annotated expression deformation sequence. The label of each set of training data may be manually determined and annotated based on the expression change parameters in an animation, where the animation production methods may include, but are not limited to, 3D Flash, Adobe Flash, AI generation, etc., or combinations thereof.
In some embodiments, the processor 110 may input each set of training data in the constructed training data set into the initial feature fusion model to obtain the fusion animation feature output by the initial feature fusion model; and taking the fusion animation characteristics output by the characteristic fusion model as a training sample, and inputting the training sample into an initial expression generating model to obtain an expression coefficient sequence output by the initial expression generating model. The processor 110 may construct a loss function based on the result output by the initial expression generating model and the manually labeled label, and perform iterative processing on the loss function by adjusting parameters of the model until a preset condition is met, to obtain a trained feature fusion model and expression generating model, where the preset condition may include, but is not limited to, the loss function converging, the iteration number reaching a frequency threshold, and the like.
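The joint training described above can be sketched as follows (PyTorch; the optimizer, loss function, and stopping condition are illustrative choices, not disclosed parameters): the feature fusion model's output feeds the expression generation model, a single loss is computed against the labeled expression deformation sequence, and both models are updated together.

```python
import torch
import torch.nn as nn

def joint_train(fusion_model: nn.Module, expression_model: nn.Module,
                base_features: torch.Tensor, label_sequences: torch.Tensor,
                epochs: int = 200, lr: float = 1e-4) -> None:
    params = list(fusion_model.parameters()) + list(expression_model.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):                          # or until the loss converges
        fused = fusion_model(base_features)          # fused animation features
        predicted = expression_model(fused)          # expression coefficient sequence
        loss = loss_fn(predicted, label_sequences)
        optimizer.zero_grad()
        loss.backward()                              # gradients flow into both models
        optimizer.step()
```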
In some embodiments of the present disclosure, the processor 110 may relatively quickly acquire the model training related data by using a model and/or a preset algorithm, so as to construct a training data set corresponding to the voice and the expression, train the feature fusion model and the expression generation model according to the training data set, and enable the output of the model to more conform to the synchronous change of the actual voice and the expression of the person, thereby meeting the user requirement.
In some embodiments of the present disclosure, based on the character base elements, speech features and style features matched with those elements can be determined; using the feature fusion model, fused animation features can be generated quickly and accurately from the speech features and style features; and the fused animation features are processed by the expression generation model to produce a 3D facial expression animation with accurate mouth shapes, strong speech synchronization, natural and vivid expressions, and controllable emotion. As a result, the facial expressions of the generated three-dimensional facial animation are more natural, the accuracy and emotional expressiveness of the facial animation are improved, and the user experience is improved, fully meeting user needs.
It should be noted that the above description of the process 300 is for purposes of example and illustration only and is not intended to limit the scope of applicability of the present disclosure. Various modifications and changes to flow 300 will be apparent to those skilled in the art in light of the present description. However, such modifications and variations are still within the scope of the present description.
FIG. 4 is an exemplary schematic diagram of a feature fusion model and a weight determination model shown in accordance with some embodiments of the present description.
As shown in FIG. 4, the inputs to the feature fusion model 440 include the animation base features 410 and the weight coefficients 430 for time-point stitching. The weight coefficients 430 may be determined by processing the animation base features 410 through the weight determination model 420. For more on the animation base features and the feature fusion model, see FIG. 3 and its related description.
Time-point stitching refers to the fusion and stitching of animation base features at different time points when the feature fusion model performs temporal convolution.
The weight coefficient 430 of the time point stitching refers to the corresponding weight adopted by the feature fusion model when stitching the animation base features of each time point.
The weight determination model 420 refers to a model for determining weight coefficients for time point stitching. The weight determination model may be a machine learning model. The types of weight determination models may include neural network models, deep neural network models, etc., and the selection of model types may be contingent on the particular situation.
In some embodiments, the input of the weight determination model may include the animation base features 410 and the output of the weight determination model may include the weight coefficients 430 of the time-point stitching.
In some embodiments, the inputs to the feature fusion model 440 may include the animated base feature 410 and the weight coefficients 430 of the point-in-time splice.
In some embodiments, the animation base feature 410 may be represented as at least one feature vector based on a time series; in particular, it may be represented separately according to the content of the animation base feature 410. For example, the feature vector of the style feature is (Vs1, Vs2, ...) and the feature vector of the speech feature is (Vv1, Vv2, ...), where (Vs1, Vs2, ...) indicates that the element value of the style feature at time point 1 is Vs1, the element value at time point 2 is Vs2, and so on; (Vv1, Vv2, ...) indicates that the element value of the speech feature at time point 1 is Vv1, the element value at time point 2 is Vv2, and so on. Time point 1, time point 2, and so on refer to time points arranged in chronological order within the same time period, and an element value refers to the feature data of the style feature and/or speech feature at that time point.
The weight coefficient 430 of the time point stitching may be represented as a numerical value, and the numerical range may be set manually, for example, between 0 and 1. The weight coefficients 430 of the time point stitching may be used to determine the weights applied to the element values of the corresponding time points in the feature vectors. For example, if the weight coefficient of time point n is 0.3, then the weight applied to the element values Vsn and Vvn of time point n in the style feature vector (Vs1, Vs2, ..., Vsn, ...) and the speech feature vector (Vv1, Vv2, ..., Vvn, ...) is 0.3. It should be noted that the weight coefficients 430 of the time point stitching are parameters used in the internal calculation of the feature fusion model.
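For illustration, the following numpy sketch shows one possible reading of time point stitching with per-time-point weight coefficients. The function name stitch_time_points and the choice of stacking the weighted style and speech values per time point are assumptions for this example only, not the specific internal calculation of the feature fusion model.

import numpy as np

def stitch_time_points(style_vec: np.ndarray,
                       speech_vec: np.ndarray,
                       weights: np.ndarray) -> np.ndarray:
    """style_vec, speech_vec and weights all have shape (T,) over the same time points."""
    assert style_vec.shape == speech_vec.shape == weights.shape
    # each time point's element values are scaled by that point's weight coefficient
    weighted_style = weights * style_vec
    weighted_speech = weights * speech_vec
    # one simple stitching choice: stack the weighted features per time point
    return np.stack([weighted_style, weighted_speech], axis=-1)   # shape (T, 2)

# example: a weight of 0.3 at time point n scales both Vs_n and Vv_n by 0.3
style = np.array([0.5, 0.8, 0.2])
speech = np.array([1.0, 0.4, 0.6])
w = np.array([0.3, 0.7, 0.5])
print(stitch_time_points(style, speech, w))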
In some embodiments, the output of the feature fusion model 440 may include a fused animation feature 450, for more details regarding which reference may be made to FIG. 3 and its associated description.
In some embodiments, the output of weight determination model 420 may be an input to feature fusion model 440, and feature fusion model 440 and weight determination model 420 may be co-trained.
In some embodiments, the sample data of the joint training includes sample animation base features, and the labels are sample fused animation features corresponding to those sample animation base features. The sample animation base features are input into the initial weight determination model to obtain the time point stitching weight coefficients output by the initial weight determination model; these weight coefficients, together with the sample animation base features, are then input into the initial feature fusion model to obtain the fused animation features output by the initial feature fusion model. A loss function is constructed based on the sample fused animation features and the fused animation features output by the feature fusion model, and the parameters of the initial feature fusion model and the initial weight determination model are updated synchronously. The trained feature fusion model and weight determination model are obtained through this parameter updating.
In some embodiments, the sample data for the joint training may be obtained based on historical data, and the tag may be determined by means of manual acquisition and labeling.
According to some embodiments of the specification, by determining reasonable time point stitching weight coefficients for the different time points of the animation base features, the fused animation features generated by the feature fusion model can be made more accurate, and the subsequently generated facial expression animation becomes more flexible and expressive, thereby improving the animation effect.
Fig. 5 is an exemplary schematic diagram of a phoneme generation model shown in accordance with some embodiments of the present description.
In some embodiments, phoneme generation model 520 may include a speech transcription phoneme layer 521 and a phoneme encoding layer 523. In some embodiments, processor 110 may generate phoneme time sequence 522 by speech transcription phoneme layer 521 based on speech element 510; based on the phoneme time sequence 522, a phoneme feature 530 is generated by the phoneme encoding layer 523. For more on the phonetic elements, phoneme generation models, phoneme features, see fig. 3 and its associated description.
The speech transcription phoneme layer 521 is a layer structure for processing voice elements to generate a phoneme time sequence. In some embodiments, the speech transcription phoneme layer 521 may be a machine learning model; for example, it may be any one or a combination of an acoustic model (AM), a language model (LM), and the like.
In some embodiments, the input of the speech transcription phoneme layer 521 may comprise a speech element 510 and the output may comprise a phoneme time sequence.
The phoneme time sequence refers to time-ordered data reflecting the voice element at a series of different time points. In some embodiments, the phoneme time sequence may be represented as a vector. For example, the phoneme time sequence may be the sequence of text data obtained by converting the voice element into text at a plurality of consecutive time points. For example, the phoneme time sequence may be expressed as {(time point 1, text data 1), (time point 2, text data 2), ...}, where (time point 1, text data 1) indicates that the voice element content at time point 1 corresponds to text data 1 after conversion into text form, and (time point 2, text data 2) indicates that the voice element content at time point 2 corresponds to text data 2 after conversion into text form.
The phoneme encoding layer refers to a layer structure for processing the phoneme time sequence to generate phoneme features. In some embodiments, the phoneme encoding layer may be a machine learning model, e.g., a long short-term memory (LSTM) model.
In some embodiments, the input of the phoneme encoding layer 523 may comprise a phoneme time sequence 522 output by the speech transcription phoneme layer 521, and the output may comprise a phoneme feature 530.
In some embodiments, the phoneme features 530 are related to a phoneme time sequence. The phoneme feature 530 may be a feature vector value converted from a phoneme time series. For example, the phoneme feature 530 may be represented as { (time point 1, feature vector value 1), (time point 2, feature vector value 2), … … }, where (time point 1, feature vector value 1) represents the corresponding feature vector value 1 after the text data at time point 1 is converted into the feature vector value form; (time point 2, feature vector value 2) means that the text data at time point 2 is converted into feature vector value form and then corresponds to feature vector value 2. For more on the phoneme features, see fig. 3 and its associated description.
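As an illustrative sketch only, the phoneme time sequence and phoneme features described above can be pictured as follows, with a simple embedding plus LSTM standing in for the phoneme encoding layer. The toy vocabulary, tokens, and dimensions are assumed for this example and are not specified by this specification.

import torch
import torch.nn as nn

# phoneme time sequence: (time point, text/phoneme token) pairs
phoneme_time_sequence = [(1, "n"), (2, "i"), (3, "h"), (4, "ao")]
vocab = {"n": 0, "i": 1, "h": 2, "ao": 3}

class PhonemeEncodingLayer(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, T) -> phoneme features: (batch, T, hidden_dim)
        out, _ = self.lstm(self.embed(token_ids))
        return out

ids = torch.tensor([[vocab[t] for _, t in phoneme_time_sequence]])
features = PhonemeEncodingLayer(len(vocab))(ids)   # one feature vector value per time point
print(features.shape)                              # torch.Size([1, 4, 32])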
In some embodiments, the output of the speech transcription phoneme layer 521 may be used as the input of the phoneme encoding layer 523, and the speech transcription phoneme layer 521 and the phoneme encoding layer 523 may be obtained by joint training based on the sixth training sample.
In some embodiments, the samples in the sixth training sample data for the joint training may include sample speech elements, labeled as phoneme features. Wherein the tag may be determined by manual labeling. For a description of how to obtain the phonetic elements, see fig. 3 and its associated description.
In some embodiments, the sample speech element may be input into the initial speech transcription phoneme layer to obtain the phoneme time sequence output by that layer; the phoneme time sequence is then input, as a sample phoneme time sequence, into the initial phoneme encoding layer to obtain the phoneme features output by the initial phoneme encoding layer. A loss function is constructed based on the label and the phoneme features output by the initial phoneme encoding layer, the parameters of the initial speech transcription phoneme layer and the initial phoneme encoding layer are updated synchronously, and the trained speech transcription phoneme layer and phoneme encoding layer are obtained through this parameter updating.
In some embodiments of the present disclosure, the speech transcription phoneme layer and the phoneme encoding layer of the trained phoneme generation model can accurately extract the speaking content of the person from the voice element and generate the phoneme features, which further improves the accuracy of the determined phoneme features, so that the subsequent 3D face model generated based on the phoneme features more closely matches a real person.
Fig. 6 is an exemplary overall schematic of a three-dimensional facial animation generation method, according to some embodiments of the present description.
In some embodiments, processor 110 may determine an expression coefficient sequence 670 via a style coding model 621, a phoneme generation model 622, a feature fusion model 640, a feature encoder 661, an expression decoder 662, etc., based on character base elements 610, and a facial expression animation 690 via an animation synthesis process 680.
In some embodiments, character base elements 610 may include an expression style element 611, a voice element 612, and the like. The expression style element 611 may include at least one of an expression style picture 611-1, an expression style video 611-2, a style identification number 611-3, a style intensity 611-4, and the like. For more details on the character base elements, see fig. 3 and the associated description.
In some embodiments, the animation base features 630 may include the style feature 631 and the speech feature 632. The speech feature 632 may include the speech energy feature 632-1 and the phoneme feature 632-2; for more details, see fig. 3 and its corresponding description.
In some embodiments, the processor 110 may generate the style feature 631 by the style coding model 621 based on the expression style picture 611-1, the expression style video 611-2, the style identification number 611-3, and the style intensity 611-4, as more fully described with reference to fig. 3 and its corresponding description.
In some embodiments, processor 110 may generate phoneme features 632-2 based on speech elements 612 via phoneme generation model 622, see FIG. 3 and its corresponding description for further details.
In some embodiments, the processor 110 may determine the fused animation feature 650 via the feature fusion model 640 based on the style feature 631, the speech energy feature 632-1, and the phoneme feature 632-2. The relevant content of feature fusion model 640 may be found in fig. 3 and its corresponding description.
In some embodiments, the processor 110 may generate the sequence of expression coefficients 670 by the expression generation model 660 based on the fused animation features 650.
In some embodiments, the expression generation model 660 may include the feature encoder 661 and the expression decoder 662. The processor 110 may use the feature encoder 661 to further encode the fused animation features obtained from the feature fusion model 640 to obtain an encoding result. For more details on the feature encoder 661, see fig. 3 and its associated description.
In some embodiments, the expression decoder 662 may process the encoding result obtained from the feature encoder 661 to obtain the expression coefficient sequence 670. For more details on the expression coefficient sequence 670 and the expression decoder 662, see fig. 3 and its associated description.
In some embodiments, the facial expression animation 690 may be obtained through the animation synthesis process 680 based on the expression coefficient sequence 670. For more details regarding the facial expression animation 690, see fig. 3 and its associated description.
In some embodiments, the processor 110 may generate the animation base features 630 via the style coding model 621 and the phoneme generation model 622 based on the character base elements 610. The processor 110 generates the style feature 631 with the style coding model based on the expression style element 611, and generates the speech feature 632 with the phoneme generation model based on the voice element 612. The speech feature 632 includes the speech energy feature 632-1 and the phoneme feature 632-2. For example, the processor may process picture-form input through a 2D convolutional neural network (2D Convolutional Neural Network, 2D CNN), and may process video-form input through a 2D convolutional neural network or a 3D convolutional neural network (3D Convolutional Neural Network, 3D CNN) combined with a long short-term memory (Long Short-Term Memory, LSTM) model, thereby generating the style feature 631. As another example, the processor may process the voice element through a temporal convolutional network (Temporal Convolutional Network, TCN) to obtain the speech feature. The processor may generate the fused animation feature 650 through the feature fusion model 640 based on the animation base features 630 derived from the style coding model 621 and the phoneme generation model 622; the feature fusion model 640 may take a variety of structural forms. The processor may generate the expression coefficient sequence 670 through the expression generation model 660 based on the fused animation feature 650 generated by the feature fusion model 640. The expression generation model 660 includes the feature encoder 661 and the expression decoder 662, and the output dimension of the expression decoder 662 depends on the dimension of the expression coefficient sequence 670. The processor may then generate the facial expression animation 690 via the animation synthesis process 680 based on the expression coefficient sequence 670.
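The forward path of fig. 6 can be summarized by the following condensed PyTorch-style sketch. The concrete layer choices (a small 2D CNN for the style coding model, a one-layer temporal convolution for the phoneme generation model, a linear layer standing in for the feature fusion model, and a linear encoder/decoder pair) and the dimensions, such as a 52-dimensional expression coefficient per frame, are illustrative assumptions rather than requirements of this specification.

import torch
import torch.nn as nn

class StyleCodingModel(nn.Module):
    def __init__(self, style_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(16, style_dim))

    def forward(self, style_image):                      # (B, 3, H, W) expression style picture
        return self.cnn(style_image)                     # (B, style_dim) style feature

class PhonemeGenerationModel(nn.Module):
    def __init__(self, in_dim=80, feat_dim=64):
        super().__init__()
        self.tcn = nn.Conv1d(in_dim, feat_dim, kernel_size=3, padding=1)   # 1D temporal convolution

    def forward(self, speech):                           # (B, in_dim, T) voice element features
        return self.tcn(speech).transpose(1, 2)          # (B, T, feat_dim) phoneme/speech feature

class ExpressionGenerationModel(nn.Module):
    def __init__(self, fused_dim=128, coeff_dim=52):     # coeff_dim = expression coefficient size per frame
        super().__init__()
        self.encoder = nn.Linear(fused_dim, 64)          # feature encoder
        self.decoder = nn.Linear(64, coeff_dim)          # expression decoder; output dim matches the coefficients

    def forward(self, fused):                            # (B, T, fused_dim)
        return self.decoder(torch.relu(self.encoder(fused)))   # (B, T, coeff_dim)

# assemble the pipeline
style_model, phoneme_model = StyleCodingModel(), PhonemeGenerationModel()
fusion = nn.Linear(64 + 64, 128)                         # stands in for the feature fusion model
expr_model = ExpressionGenerationModel()

style_feat = style_model(torch.randn(1, 3, 128, 128))    # style feature
speech_feat = phoneme_model(torch.randn(1, 80, 40))      # speech feature over T = 40 frames
style_seq = style_feat.unsqueeze(1).expand(-1, speech_feat.size(1), -1)
fused = fusion(torch.cat([style_seq, speech_feat], dim=-1))     # fused animation feature
coeff_seq = expr_model(fused)                            # expression coefficient sequence, shape (1, 40, 52)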
The total training data set 6110-1 is a data set used for the joint training of the style coding model, the phoneme generation model, the feature fusion model, the feature encoder, and the expression decoder.
In some embodiments, the total training data set 6110-1 may be a paired data set constructed based on the character base elements (expression style elements and voice elements). The expression style element includes at least one of an expression style picture, an expression style video, a style identification number, a style intensity, and the like. The method for acquiring the character base elements can be found in the related description of fig. 3. The labels of the paired data set may be sample expression coefficient sequences determined by means of manual acquisition and labeling.
Data preprocessing 6110-2 is the process of further processing the total training data set. For example, data preprocessing 6110-2 may include cropping the expression style pictures in the total training data set to a preset size, initializing the noise point number and noise threshold for the expression style videos in the total training data set, and so on.
In some embodiments, on a device containing a graphics processing unit (GPU), the processor 110 may perform the total joint training 6110-3 on the style coding model, the phoneme generation model, the feature fusion model, the feature encoder, and the expression decoder, thereby obtaining the model parameters of each of the above models.
The total joint training 6110-3 is a joint training of the initial style coding model, the initial phoneme generation model, the initial feature fusion model, the initial feature encoder, and the initial expression decoder. In some embodiments, the processor 110 may input the constructed total training data set 6110-1 into the initial style coding model and the initial phoneme generation model to obtain the plurality of animation base features they output; the animation base features output by the initial style coding model and the initial phoneme generation model are then used as a training sample data set and input into the initial feature fusion model to obtain the fused animation features output by the initial feature fusion model; the plurality of fused animation features output by the initial feature fusion model are input, as a training sample data set, into the initial feature encoder to obtain the output encoding results; the encoding results output by the initial feature encoder are input, as a training sample data set, into the initial expression decoder to obtain an expression coefficient sequence, and a loss function is constructed from the obtained expression coefficient sequence and the manually labeled labels. The trained style coding model, phoneme generation model, feature fusion model, feature encoder, and expression decoder are obtained by adjusting the model parameters and iterating until a preset condition is met, where the preset condition may be that the loss function converges, the number of iterations reaches a threshold, or the like.
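For illustration, the total joint training step can be sketched as a single synchronized update in which one optimizer covers the parameters of all the modules; the shapes follow the pipeline sketch given above, and the function name total_joint_step and the mean-squared-error loss are assumptions for this example rather than details fixed by this specification.

import torch
import torch.nn as nn

def total_joint_step(style_model, phoneme_model, fusion, expr_model,
                     optimizer, style_image, speech, label_coeff_seq):
    """One synchronized update of all modules from fig. 6 (shapes as in the sketch above)."""
    style_feat = style_model(style_image)                     # style feature
    speech_feat = phoneme_model(speech)                       # (B, T, D) speech feature
    style_seq = style_feat.unsqueeze(1).expand(-1, speech_feat.size(1), -1)
    fused = fusion(torch.cat([style_seq, speech_feat], dim=-1))
    pred = expr_model(fused)                                  # expression coefficient sequence
    loss = nn.functional.mse_loss(pred, label_coeff_seq)      # compared with the manually labeled label
    optimizer.zero_grad()
    loss.backward()                                           # a single backward pass reaches every module
    optimizer.step()
    return loss.item()

# the optimizer is built once over all module parameters, e.g.:
# optimizer = torch.optim.Adam(
#     [p for m in (style_model, phoneme_model, fusion, expr_model) for p in m.parameters()], lr=1e-4)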
In some embodiments of the present description, facial animation generated by combining style elements and voice elements is more accurate and natural than facial expression animation generated from only a single element or feature. Processing the style elements and voice elements with models such as the style coding model and the phoneme generation model can produce facial expression animation with accurate mouth-shape matching, strong speech synchronization, and more natural expressions, so that the generated facial expression animation can fully meet diverse user requirements.
Some embodiments of the present specification further provide a three-dimensional facial animation generating apparatus, the apparatus including at least one processor and at least one memory, the at least one memory configured to store computer instructions, the at least one processor configured to execute at least some of the computer instructions to implement the three-dimensional facial animation generating method according to any of the embodiments of the present specification.
Some embodiments of the present disclosure also provide a computer-readable storage medium storing computer instructions that, when read by a computer in the storage medium, perform the three-dimensional facial animation generating method according to any one of the embodiments of the present disclosure.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations to the present disclosure may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested by this specification and are therefore intended to fall within the spirit and scope of the exemplary embodiments of this specification.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that, in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in the claims. Indeed, claimed subject matter may lie in less than all features of a single embodiment disclosed above.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers used in the description of the embodiments are, in some examples, modified by the terms "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the stated number allows for a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought by the individual embodiments. In some embodiments, the numerical parameters should retain the specified significant digits and apply ordinary rounding. Although the numerical ranges and parameters set forth in some embodiments of the specification are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

Claims (10)

1. A method of three-dimensional facial animation generation, the method performed by a processor, comprising:
generating an animation base feature based on the character base element, the animation base feature comprising at least one of a style feature and a speech feature;
generating a fused animation feature through a feature fusion model based on the animation base feature, wherein the feature fusion model is a machine learning model;
and generating a facial expression animation through an expression generating model based on the fused animation feature, wherein the expression generating model is a machine learning model.
2. The method of claim 1, wherein the character base elements comprise expression style elements including at least one of expression style pictures, expression style videos, style identification numbers, and style intensities, wherein generating the animation base features based on the character base elements comprises:
and generating the style feature through a style coding model based on the expression style elements, wherein the style coding model is a machine learning model.
3. The method of claim 1, wherein the character base elements comprise a voice element, the speech feature comprises at least one of a speech energy feature and a phoneme feature, and wherein generating the animation base feature based on the character base elements comprises:
generating the speech energy feature based on the voice element; and
generating the phoneme feature through a phoneme generation model based on the voice element, wherein the phoneme generation model is a machine learning model.
4. The method of claim 1, wherein the expression generation model comprises a feature encoder and an expression decoder, and wherein generating a facial expression animation from the expression generation model based on the fused animation features comprises:
determining a coding result by the feature encoder based on the fused animation feature;
generating an expression coefficient sequence by the expression decoder based on the encoding result;
and generating the facial expression animation through animation synthesis processing based on the expression coefficient sequence.
5. The method of claim 1, wherein the feature fusion model and the expression generation model are derived based on a joint training, a training process of the joint training comprising:
constructing a training data set;
and acquiring the feature fusion model and the expression generation model through the joint training based on the training data set.
6. A three-dimensional facial animation generation system is characterized by comprising a basic generation module, a fusion generation module and an expression generation module, wherein,
the base generation module is configured to generate an animation base feature based on the character base element, the animation base feature including at least one of a style feature and a speech feature;
the fusion generation module is configured to generate a fused animation feature through a feature fusion model based on the animation base feature, wherein the feature fusion model is a machine learning model;
the expression generation module is configured to generate a facial expression animation through an expression generation model based on the fused animation feature, wherein the expression generation model is a machine learning model.
7. The system of claim 6, wherein the character base element comprises an expression style element comprising at least one of an expression style picture, an expression style video, a style identification number, and a style intensity, the base generation module further configured to:
and generating the style feature through a style coding model based on the expression style element, wherein the style coding model is a machine learning model.
8. The system of claim 6, wherein the expression generation model comprises a feature encoder and an expression decoder, the expression generation module further configured to:
determining a coding result by the feature encoder based on the fused animation feature;
generating an expression coefficient sequence by the expression decoder based on the encoding result;
and generating the facial expression animation through animation synthesis processing based on the expression coefficient sequence.
9. A three-dimensional facial animation generating device, comprising at least one processor and at least one memory, the at least one memory configured to store computer instructions, the at least one processor configured to execute at least some of the computer instructions to implement the three-dimensional facial animation generating method according to any of claims 1-5.
10. A computer-readable storage medium storing computer instructions, wherein when the computer instructions in the storage medium are read by a computer, the computer performs the three-dimensional face animation generation method according to any one of claims 1 to 5.
CN202311463584.1A 2023-11-06 2023-11-06 Three-dimensional face animation generation method and system Active CN117237495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311463584.1A CN117237495B (en) 2023-11-06 2023-11-06 Three-dimensional face animation generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311463584.1A CN117237495B (en) 2023-11-06 2023-11-06 Three-dimensional face animation generation method and system

Publications (2)

Publication Number Publication Date
CN117237495A true CN117237495A (en) 2023-12-15
CN117237495B CN117237495B (en) 2024-02-23

Family

ID=89089597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311463584.1A Active CN117237495B (en) 2023-11-06 2023-11-06 Three-dimensional face animation generation method and system

Country Status (1)

Country Link
CN (1) CN117237495B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393832A (en) * 2021-06-03 2021-09-14 清华大学深圳国际研究生院 Virtual human animation synthesis method and system based on global emotion encoding
KR20220165666A (en) * 2021-06-08 2022-12-15 네오사피엔스 주식회사 Method and system for generating synthesis voice using style tag represented by natural language
CN113450436A (en) * 2021-06-28 2021-09-28 武汉理工大学 Face animation generation method and system based on multi-mode correlation
CN115830193A (en) * 2022-11-07 2023-03-21 厦门黑镜科技有限公司 Method and device for generating digital human animation, electronic equipment and storage medium
CN115937369A (en) * 2022-11-21 2023-04-07 之江实验室 Expression animation generation method and system, electronic equipment and storage medium
CN116912373A (en) * 2023-05-23 2023-10-20 苏州超次元网络科技有限公司 Animation processing method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN, B ET AL.: "Transformer Based Multi-model Fusion for 3D Facial Animation", 2023 2ND CONFERENCE ON FULLY ACTUATED SYSTEM THEORY AND APPLICATIONS (CFASTA), 29 September 2023 (2023-09-29) *
CHANG Yaoming (ed.); LIU Depei, WANG Chen (chief eds.): "Encyclopedia of China Medicine: Military and Special Medicine - Military Ergonomics", Beijing: China Union Medical University Press, pages 64-66 *
XIAO Lei: "Speech-Driven Highly Natural Face Animation", China Excellent Master's Theses Full-text Database, Information Science and Technology Series, 15 August 2019 (2019-08-15), pages 64-66 *

Also Published As

Publication number Publication date
CN117237495B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
US9082400B2 (en) Video generation based on text
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN113077537B (en) Video generation method, storage medium and device
CN113378806B (en) Audio-driven face animation generation method and system integrating emotion coding
CN110942502B (en) Voice lip fitting method and system and storage medium
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
CN116863038A (en) Method for generating digital human voice and facial animation by text
CN113538641A (en) Animation generation method and device, storage medium and electronic equipment
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
CN116051692B (en) Three-dimensional digital human face animation generation method based on voice driving
CN112735371B (en) Method and device for generating speaker video based on text information
CN113838174A (en) Audio-driven face animation generation method, device, equipment and medium
CN115376482A (en) Face motion video generation method and device, readable medium and electronic equipment
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
CN115050354A (en) Digital human driving method and device
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN117237495B (en) Three-dimensional face animation generation method and system
Liu et al. Emotional facial expression transfer based on temporal restricted Boltzmann machines
CN117078816A (en) Virtual image generation method, device, terminal equipment and storage medium
CN115883753A (en) Video generation method and device, computing equipment and storage medium
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN115731917A (en) Voice data processing method, model training method, device and storage medium
Verma et al. Animating expressive faces across languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant