CN112215926A - Voice-driven human face action real-time transfer method and system - Google Patents

Voice-driven human face action real-time transfer method and system

Info

Publication number
CN112215926A
CN112215926A
Authority
CN
China
Prior art keywords
audio
dimensional
frame
audio features
human face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011027777.9A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huayan Mutual Entertainment Technology Co ltd
Original Assignee
Beijing Huayan Mutual Entertainment Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huayan Mutual Entertainment Technology Co ltd filed Critical Beijing Huayan Mutual Entertainment Technology Co ltd
Priority to CN202011027777.9A priority Critical patent/CN112215926A/en
Publication of CN112215926A publication Critical patent/CN112215926A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/20: 3D [Three Dimensional] animation
    • G06T13/205: 3D [Three Dimensional] animation driven by audio data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00: Animation
    • G06T13/20: 3D [Three Dimensional] animation
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voice-driven human face action real-time transfer method and system. The method comprises the following steps: inputting an audio sequence of a source character; estimating an audio signal characterization for each frame in the audio sequence; driving a three-dimensional face model according to the estimated characterization of each audio frame; acquiring a target video frame; predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result; and synthesizing the prediction result onto the corresponding frame image in the target video, thereby realizing real-time, voice-driven transfer of the human face action. The invention greatly improves the realism of the driven human face action, greatly reduces the complexity of the face-driving algorithm, and effectively ensures the real-time performance of driving the human face action.

Description

Voice-driven human face action real-time transfer method and system
Technical Field
The invention relates to the technical field of face action driving, in particular to a voice-driven face action real-time transfer method and system.
Background
Voice-driven human face animation is a research hotspot in the field of animation simulation. Its technical core is driving a human face model animation with externally input voice information. Currently popular voice-driven face animation techniques mainly work by establishing a correspondence between voice information and face animation videos: all face animation videos are stored in a face animation material library; externally input voice information is recognized; the face animation video corresponding to the recognized voice information is matched from the library according to the established correspondence; and the matched video is finally played back to the user. Such a method cannot realize voice-driven human face animation in real time.
In addition, although some existing methods for driving human face animation by voice ensure the real-time performance of face driving to a certain extent, their algorithms are complex, the real-time effect is not ideal, and the fidelity of the driven face is poor, so the application requirements cannot be met.
Disclosure of Invention
The invention aims to provide a voice-driven human face action real-time transfer method and system, so as to solve the above technical problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for transferring the human face action driven by the voice in real time comprises the following steps:
inputting an audio sequence of a source character;
estimating an audio signal characterization for each frame in the audio sequence;
driving a three-dimensional face model action according to the estimated audio signal representation of each audio frame;
acquiring a target video frame;
predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and synthesizing the predicted face action prediction result to a corresponding frame image in the target video, so as to realize real-time transfer of the face action driven by voice.
Preferably, the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
Preferably, the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 × 29, and the number "16" represents a time window in which each frame of audio contains 16 audio features;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Preferably, the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers which are cascaded in sequence, and the input 16 × 29-dimensional audio features are subjected to one-dimensional feature convolution extraction in the first convolutional layer and then 8 × 32-dimensional audio features are output;
8 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the second convolution layer and then 4 x 32 dimensional audio features are output;
4 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the third convolution layer, and then 2 x 64 dimensional audio features are output;
the 2 x 64-dimensional audio features are subjected to one-dimensional feature convolution extraction of a fourth convolution layer and then 64 audio features are output;
the first fully-connected layer maps 64 audio features to 128;
the second fully-connected layer maps 128 audio features to 64;
the third fully-connected layer maps 64 audio features into an audio characterization vector of length 32.
Preferably, the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The invention also provides a voice-driven human face action real-time transfer system, which can realize the human face action real-time transfer method, and the system comprises:
the audio sequence input module is used for inputting an audio sequence of a source role;
the audio signal expression estimation module is connected with the audio sequence input module and used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module is connected with the audio signal expression estimation module and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
the target video frame acquisition module is used for acquiring a target video frame;
the target frame human face action prediction module is respectively connected with the model action driving module and the target video frame acquisition module and is used for predicting human face actions on the target video frame images based on the driven three-dimensional face model to obtain a human face action prediction result;
and the face action transfer module is connected with the target frame face action prediction module and used for synthesizing the predicted face action prediction result to a corresponding frame image in the target video frame so as to realize real-time transfer of the face action driven by voice.
Preferably, the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
Preferably, the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 × 29, and the number "16" represents a time window in which each frame of audio contains 16 audio features;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Preferably, the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers which are cascaded in sequence, and the input 16 × 29-dimensional audio features are subjected to one-dimensional feature convolution extraction in the first convolutional layer and then 8 × 32-dimensional audio features are output;
8 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the second convolution layer and then 4 x 32 dimensional audio features are output;
4 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the third convolution layer, and then 2 x 64 dimensional audio features are output;
the 2 x 64-dimensional audio features are subjected to one-dimensional feature convolution extraction of a fourth convolution layer and then 64 audio features are output;
the first fully-connected layer maps 64 audio features to 128;
the second fully-connected layer maps 128 audio features to 64;
the third fully-connected layer maps 64 audio features into an audio characterization vector of length 32.
Preferably, the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The invention firstly drives the three-dimensional face model to act through the estimated audio signal representation, then predicts the face action of the target frame image based on the driven three-dimensional face model, synthesizes the prediction result on the target frame image, and realizes the real-time transfer of the face action driven by voice.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a step diagram of a voice-driven real-time transfer method for human face actions according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice-driven real-time human face motion transfer system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of the FacialSpeech speech recognition framework employed in the present invention;
fig. 4 is a schematic diagram of the invention for realizing real-time transfer of human face actions based on voice driving.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for the purpose of illustration only, are schematic rather than depictions of actual form, and are not to be construed as limiting the present patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
An embodiment of the present invention provides a method for transferring a face action in real time driven by voice, as shown in fig. 1, including the following steps:
step S1, inputting an audio sequence of a source role;
step S2, estimating an audio signal characterization (an expression characterizing the audio features) of each frame in the audio sequence;
step S3, driving a three-dimensional face model according to the estimated audio signal representation of each audio frame; fig. 4 is a schematic diagram of a three-dimensional face model;
step S4, acquiring a target video frame;
step S5, based on the driven three-dimensional face model, predicting the face action on the target video frame image to obtain the face action prediction result;
and step S6, synthesizing the predicted human face action prediction result to a corresponding frame image in the target video, and realizing the real-time transfer of the human face action driven by voice.
In step S2, the present invention estimates the audio signal characterization of each frame in the audio sequence based specifically on the FacialSpeech speech recognition framework. FacialSpeech is a speech recognition system developed by Baidu in China. The invention improves the network architecture for estimating the audio signal characterization based on the FacialSpeech framework. The present invention first fixes the feature dimension of each frame of audio in the input audio sequence at 16 x 29, where
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Referring to fig. 3, the improved FacialSpeech speech recognition framework of the present invention comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29 dimensional audio features are subjected to one-dimensional feature convolution extraction in the first convolutional layer and then 8 × 32 dimensional audio features are output;
8 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the second convolution layer and then 4 x 32 dimensional audio features are output;
4 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the third convolution layer, and then 2 x 64 dimensional audio features are output;
the 2 x 64-dimensional audio features are subjected to one-dimensional feature convolution extraction of a fourth convolution layer and then 64 audio features are output;
the first fully-connected layer maps 64 audio features to 128;
the second fully-connected layer maps 128 audio features to 64;
the third fully-connected layer maps 64 audio features into an audio characterization vector of length 32.
The convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
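For concreteness, the layer progression above can be reproduced directly in code. The following is a minimal sketch in PyTorch (the patent names no framework); the padding of 1 and the ReLU activations are assumptions, padding 1 being the only value for which kernel size 3 and stride 2 yield the stated 16 → 8 → 4 → 2 → 1 progression of feature lengths.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch of the 4-convolution, 3-fully-connected audio encoder."""

    def __init__(self):
        super().__init__()
        def conv(c_in, c_out):
            return nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=1)
        self.convs = nn.Sequential(
            conv(29, 32), nn.ReLU(),  # 16 x 29 in -> 8 x 32 out
            conv(32, 32), nn.ReLU(),  # 8 x 32 in  -> 4 x 32 out
            conv(32, 64), nn.ReLU(),  # 4 x 32 in  -> 2 x 64 out
            conv(64, 64), nn.ReLU(),  # 2 x 64 in  -> 64 features out
        )
        self.fcs = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),  # first fully-connected layer
            nn.Linear(128, 64), nn.ReLU(),  # second fully-connected layer
            nn.Linear(64, 32),              # 32-dim audio characterization
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 16, 29); Conv1d expects (batch, channels, time).
        x = self.convs(x.transpose(1, 2))
        return self.fcs(x.flatten(1))

# A batch of 8 audio frames yields 8 characterization vectors of length 32.
assert AudioEncoder()(torch.randn(8, 16, 29)).shape == (8, 32)
```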
The specific estimation process with respect to the characterization of the audio signal is not set forth herein.
In step S3, a preset three-dimensional face model is driven based on the audio signal characterization, which helps improve the fidelity of the face action on the target video frame and reduces the complexity of the face-action synthesis algorithm. If the audio signal characterization (the expression representing the audio features) were computed inaccurately, a face driven directly on the target video frame could look unrealistic or even distorted; the invention therefore first drives the three-dimensional face model and then, with fidelity guaranteed, transfers the model's facial action to the target video frame image, which improves the realism of the face action on the target video frame. Moreover, an algorithm that directly drives the face of the target frame is quite complex and would compromise the real-time performance of face driving.
Referring to fig. 4, the following briefly explains the principle of transferring the facial motion of a three-dimensional face model to a target video frame image:
A face region is extracted from the target video frame image based on the three-dimensional face model; the model's actions are then mapped onto the target face region (many existing face-mapping methods are available and are not described in detail here); finally, the action-mapped target face is composited onto the target video frame image, realizing real-time, voice-driven transfer of the face action.
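The sketch below illustrates one plausible implementation of that final compositing step, using OpenCV's Poisson blending (seamlessClone). The patent does not name a particular blending method, so the choice of blending, the mask input, and the function name are assumptions.

```python
import cv2
import numpy as np

def composite_face(rendered_face: np.ndarray, target_frame: np.ndarray,
                   face_mask: np.ndarray) -> np.ndarray:
    """Blend an action-mapped face region into the target video frame.

    rendered_face: BGR image the same size as target_frame, holding the
        face region after the model's action has been mapped onto it.
    face_mask: uint8 mask, 255 inside the face region and 0 elsewhere.
    """
    ys, xs = np.nonzero(face_mask)
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    # Crop to the mask's bounding box so the source patch fits inside the
    # destination frame, then Poisson-blend it centered on that box.
    center = ((x0 + x1) // 2, (y0 + y1) // 2)
    return cv2.seamlessClone(rendered_face[y0:y1, x0:x1], target_frame,
                             face_mask[y0:y1, x0:x1], center,
                             cv2.NORMAL_CLONE)
```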
The present invention also provides a voice-driven real-time transfer system for human face actions, which can implement the above-mentioned real-time transfer method for human face actions, as shown in fig. 2, the system includes:
the audio sequence input module 1 is used for inputting an audio sequence of a source role;
the audio signal expression estimation module 2 is connected with the audio sequence input module 1 and is used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module 3 is connected with the audio signal expression estimation module 2 and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
a target video frame obtaining module 4, configured to obtain a target video frame;
the target frame human face action prediction module 5 is respectively connected with the model action driving module 3 and the target video frame acquisition module 4 and is used for predicting human face actions on a target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and the human face action transfer module 6 is connected with the target frame human face action prediction module 5 and is used for synthesizing the predicted human face action prediction result to a corresponding frame image in a target video frame so as to realize real-time transfer of the human face action driven by voice.
The human face action real-time transfer system provided by the invention estimates the representation of the audio signal of each frame in the audio sequence based on a FacialSpeech speech recognition framework.
Specifically, as shown in fig. 3, the improved FacialSpeech framework of the present invention comprises 4 convolutional layers and 3 fully-connected layers, which are cascaded in sequence; the input 16 × 29 dimensional audio features (the number "16" represents a time window in which each frame of audio contains 16 audio features; the number "29" represents that the length of the FacialSpeech alphabet is 29) are subjected to one-dimensional feature convolution extraction in the first convolutional layer, and then 8 × 32 dimensional audio features are output;
8 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the second convolution layer and then 4 x 32 dimensional audio features are output;
4 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the third convolution layer, and then 2 x 64 dimensional audio features are output;
the 2 x 64-dimensional audio features are subjected to one-dimensional feature convolution extraction of a fourth convolution layer and then 64 audio features are output;
the first fully-connected layer maps 64 audio features to 128;
the second fully-connected layer maps 128 audio features to 64;
the third fully-connected layer maps 64 audio features into an audio characterization vector of length 32.
The convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims (10)

1. A voice-driven human face action real-time transfer method is characterized by comprising the following steps:
inputting an audio sequence of a source character;
estimating an audio signal characterization for each frame in the audio sequence;
driving a three-dimensional face model action according to the estimated audio signal representation of each audio frame;
acquiring a target video frame;
predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and synthesizing the predicted face action prediction result to a corresponding frame image in the target video, so as to realize real-time transfer of the face action driven by voice.
2. The method of claim 1, wherein the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
3. The method of claim 2, wherein the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 x 29,
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
4. The real-time human face motion transfer method according to claim 3, wherein the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers which are cascaded in sequence, and the input 16 x 29-dimensional audio features are subjected to one-dimensional feature convolution extraction in the first convolutional layer and then 8 x 32-dimensional audio features are output;
8 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the second convolution layer and then 4 x 32 dimensional audio features are output;
4 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the third convolution layer, and then 2 x 64 dimensional audio features are output;
the 2 x 64-dimensional audio features are subjected to one-dimensional feature convolution extraction of a fourth convolution layer and then 64 audio features are output;
the first fully-connected layer maps 64 audio features to 128;
the second fully-connected layer maps 128 audio features to 64;
the third fully-connected layer maps 64 audio features into an audio characterization vector of length 32.
5. The method according to claim 4, wherein the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
6. A voice-driven real-time human face motion transfer system capable of implementing the method of any one of claims 1-5, the system comprising:
the audio sequence input module is used for inputting an audio sequence of a source role;
the audio signal expression estimation module is connected with the audio sequence input module and used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module is connected with the audio signal expression estimation module and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
the target video frame acquisition module is used for acquiring a target video frame;
the target frame human face action prediction module is respectively connected with the model action driving module and the target video frame acquisition module and is used for predicting human face actions on the target video frame images based on the driven three-dimensional face model to obtain a human face action prediction result;
and the face action transfer module is connected with the target frame face action prediction module and used for synthesizing the predicted face action prediction result to a corresponding frame image in the target video frame so as to realize real-time transfer of the face action driven by voice.
7. The system of claim 6, wherein the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
8. The system of claim 7, wherein the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 x 29,
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
9. The system of claim 8, wherein the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers which are cascaded in sequence, and the input 16 x 29-dimensional audio features are subjected to one-dimensional feature convolution extraction in the first convolutional layer and then 8 x 32-dimensional audio features are output;
8 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the second convolution layer and then 4 x 32 dimensional audio features are output;
4 x 32 dimensional audio features are subjected to one-dimensional feature convolution extraction of the third convolution layer, and then 2 x 64 dimensional audio features are output;
the 2 x 64-dimensional audio features are subjected to one-dimensional feature convolution extraction of a fourth convolution layer and then 64 audio features are output;
the first fully-connected layer maps 64 audio features to 128;
the second fully-connected layer maps 128 audio features to 64;
the third fully-connected layer maps 64 audio features into an audio characterization vector of length 32.
10. The system of claim 9, wherein the convolution kernel size of each of the 4 convolutional layers is 3 and the stride is 2.
CN202011027777.9A 2020-09-28 2020-09-28 Voice-driven human face action real-time transfer method and system Pending CN112215926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011027777.9A CN112215926A (en) 2020-09-28 2020-09-28 Voice-driven human face action real-time transfer method and system

Publications (1)

Publication Number Publication Date
CN112215926A true CN112215926A (en) 2021-01-12

Family

ID=74051267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011027777.9A Pending CN112215926A (en) 2020-09-28 2020-09-28 Voice-driven human face action real-time transfer method and system

Country Status (1)

Country Link
CN (1) CN112215926A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054287A (en) * 2009-11-09 2011-05-11 腾讯科技(深圳)有限公司 Facial animation video generating method and device
CN106485774A (en) * 2016-12-30 2017-03-08 当家移动绿色互联网技术集团有限公司 Expression based on voice Real Time Drive person model and the method for attitude
CN109308731A * 2018-08-24 2019-02-05 浙江大学 Voice-driven lip-sync face video synthesis algorithm based on cascaded convolutional LSTM
CN111243065A (en) * 2019-12-26 2020-06-05 浙江大学 Voice signal driven face animation generation method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035198A (en) * 2021-02-26 2021-06-25 北京百度网讯科技有限公司 Lip movement control method, device and medium for three-dimensional face
CN113035198B (en) * 2021-02-26 2023-11-21 北京百度网讯科技有限公司 Three-dimensional face lip movement control method, equipment and medium
CN113132815A (en) * 2021-04-22 2021-07-16 北京房江湖科技有限公司 Video generation method and device, computer-readable storage medium and electronic equipment
CN113160799A (en) * 2021-04-22 2021-07-23 北京房江湖科技有限公司 Video generation method and device, computer-readable storage medium and electronic equipment
CN113408449A (en) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
CN113408449B (en) * 2021-06-25 2022-12-06 达闼科技(北京)有限公司 Face action synthesis method based on voice drive, electronic equipment and storage medium
WO2023088080A1 (en) * 2021-11-22 2023-05-25 上海商汤智能科技有限公司 Speaking video generation method and apparatus, and electronic device and storage medium
CN117729298A (en) * 2023-12-15 2024-03-19 北京中科金财科技股份有限公司 Photo driving method based on action driving and mouth shape driving
CN117831126A (en) * 2024-01-02 2024-04-05 暗物质(北京)智能科技有限公司 Voice-driven 3D digital human action generation method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210112)