CN115375809B - Method, device and equipment for generating virtual image and storage medium

Method, device and equipment for generating virtual image and storage medium

Info

Publication number
CN115375809B
Authority
CN
China
Prior art keywords
emotional
video
video sequence
feature
sequence
Prior art date
Legal status
Active
Application number
CN202211310590.9A
Other languages
Chinese (zh)
Other versions
CN115375809A (en)
Inventor
吴小燕
何山
殷兵
刘聪
周良
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202211310590.9A
Publication of CN115375809A
Application granted
Publication of CN115375809B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V 40/174: Facial expression recognition
    • G06V 40/193: Eye characteristics, e.g. of the iris; Preprocessing; Feature extraction
    • G10L 25/63: Speech or voice analysis techniques specially adapted for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a method, an apparatus, a device and a storage medium for generating an avatar. The specific implementation scheme is as follows: emotional features and facial features are determined based on acquired expression information; emotion editing processing is performed on a specific video sequence based on the emotional features to obtain a video sequence having the emotional features, wherein the specific video sequence comprises a video sequence of a specific object that contains a face; and an avatar of a target object is generated based on at least the video sequence having the emotional features and the facial features. The technical solution of the application can effectively solve the problem that the generated avatar expresses only a single emotion.

Description

Method, device and equipment for generating virtual image and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating an avatar.
Background
An avatar is a new interactive display medium that has developed alongside speech synthesis and video generation technologies. It can greatly improve the naturalness and experience of human-computer interaction, and it has broad application and development prospects in virtual anchoring, customer service and other interactive scenarios. At present, however, a given speech segment of an avatar can correspond to only one emotional expression, so the generated avatar expresses a single emotion and has weak expressiveness, making it difficult to meet user requirements.
Disclosure of Invention
In order to solve the above problems, the present application provides a method, an apparatus, a device and a storage medium for generating an avatar, which can effectively solve the problem that the generated avatar expresses only a single emotion.
According to a first aspect of embodiments of the present application, there is provided a method for generating an avatar, including:
determining emotional characteristics and facial characteristics based on the acquired expression information;
performing emotion editing processing on a specific video sequence based on the emotion characteristics to obtain a video sequence with the emotion characteristics; wherein the specific video sequence comprises a video sequence of a specific object containing a face;
generating an avatar of the target object based on at least the video sequence having the emotional feature and the facial feature.
According to a second aspect of embodiments of the present application, there is provided an avatar generation apparatus, including:
the determining module is used for determining emotional characteristics and facial characteristics based on the acquired expression information;
the processing module is used for carrying out emotion editing processing on the specific video sequence based on the emotion characteristics to obtain a video sequence with the emotion characteristics; wherein the specific video sequence comprises a video sequence of a specific object containing a face;
a generating module for generating an avatar of the target object based at least on the video sequence with the emotional feature and the facial feature.
A third aspect of the present application provides an electronic device comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor realizes the method for generating the virtual image by running the program in the memory.
A fourth aspect of the present application provides a storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the method for generating an avatar as described above is implemented.
One embodiment in the above application has the following advantages or benefits:
the method comprises the steps of determining emotional characteristics and facial characteristics based on acquired expression information, carrying out emotion editing processing on a specific video sequence based on the emotional characteristics to obtain a video sequence with the emotional characteristics, wherein the specific video sequence comprises a video sequence of a specific object and comprises a face, therefore, the video sequence with the emotional characteristics comprises the face, and based on the face characteristics embedded in the video sequence with the emotional characteristics, virtual images comprising different emotions can be generated. The problems that the generated virtual image is single in expression emotion and weak in expressive force are solved, and therefore the user requirements are met.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart illustrating a method for generating an avatar according to an embodiment of the present application;
fig. 2 is a schematic diagram of a method for generating an avatar for a specific object according to an embodiment of the present application;
fig. 3 is a flowchart illustrating a method of generating an avatar according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a training face reconstruction network according to another embodiment of the present application;
fig. 5 is a schematic diagram of a method for generating an avatar for a non-specific object according to an embodiment of the present application;
fig. 6 is a flowchart illustrating a method of generating an avatar according to another embodiment of the present application;
FIG. 7 is a schematic diagram of training a neural network model according to an embodiment of the present application;
fig. 8 is a block diagram of an avatar generation apparatus according to an embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing the avatar generation method according to the embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application are suitable for various human-computer interaction scenarios, such as human-vehicle interaction, VR scenarios, and voice interaction between people and various smart home appliances. By adopting the technical solutions of the embodiments of the present application, personalized avatars can be generated more accurately for different real persons.
The technical solutions of the embodiments of the present application can be applied, for example, to hardware devices such as processors, electronic devices and servers (including cloud servers), or packaged into a software program to be executed. When the hardware device executes the processing procedure of the technical solution, or the software program is executed, the purpose of generating avatars containing different emotions by embedding facial features and emotional features into a specific video sequence can be achieved. The embodiments of the present application only introduce the specific processing procedure by way of example and do not limit the specific implementation form of the technical solution; any technical implementation form that can execute the above processing procedure may be adopted.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Exemplary method
Fig. 1 is a flowchart of a method of generating an avatar according to an embodiment of the present application. In an exemplary embodiment, there is provided an avatar generation method, including:
s110, determining emotional characteristics and facial characteristics based on the acquired expression information;
s120, performing emotion editing processing on the specific video sequence based on the emotion characteristics to obtain a video sequence with the emotion characteristics; wherein the specific video sequence comprises a video sequence of a specific object containing a face;
and S130, generating an avatar of the target object at least based on the video sequence with the emotional characteristic and the facial characteristic.
In step S110, the expression information is information representing the emotion of the user. The expression information may be information input by the user, information extracted from audio or video input by the user, or environment or scene information that can reflect the emotion of the user. Specifically, the expression information may include at least one of: voice data, text data, body movement information, environment information of the user, scene information of the user, and the like.
Alternatively, the voice data may be speech containing one emotion or speech containing a plurality of emotions; likewise, the text data may be a text containing one emotion or a piece of text containing a plurality of emotions. The emotional features represent the emotions conveyed in the expression information, and may include emotions such as happy, sad, neutral, angry and surprised. The facial features represent the states of the facial organs, and may include eye features, lip features, eyebrow features, and so on. Optionally, the eye features comprise the current state of the eyes, determined by comparison with the eyes under a natural expression, e.g., the eyes being open or closed; the lip features comprise the current state of the mouth, determined by comparison with the mouth under a natural expression, e.g., the mouth being open or the mouth corners being raised.
Alternatively, the correspondence between expression information and emotional and facial features may be stored in advance, so that the emotional features and facial features corresponding to the acquired expression information can be determined directly from the pre-stored correspondence. The expression information may also be input into a trained neural network that determines the emotional features and facial features contained in the expression information; such a neural network is trained on different expression information together with the emotional features and facial features each piece of expression information contains.
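As an illustrative sketch of the second option above (a trained network that maps expression information to emotional and facial features), the following Python/PyTorch snippet assumes hypothetical dimensions (a 256-dimensional expression-information vector, five emotion classes, a 128-dimensional facial-feature vector) and a deliberately simple two-head network; the description does not fix a concrete architecture at this point.

```python
import torch
import torch.nn as nn

class ExpressionToEmotionFace(nn.Module):
    """Sketch: map an expression-information feature vector to
    (a) emotion logits over 5 classes and (b) a facial-feature vector.
    Dimensions (256-d input, 128-d facial features) are illustrative assumptions."""
    def __init__(self, in_dim=256, num_emotions=5, face_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                      nn.Linear(256, 256), nn.ReLU())
        self.emotion_head = nn.Linear(256, num_emotions)  # happy/sad/neutral/angry/surprised
        self.face_head = nn.Linear(256, face_dim)         # eye/lip/eyebrow state vector

    def forward(self, expression_feat):
        h = self.backbone(expression_feat)
        return self.emotion_head(h), self.face_head(h)

# Usage: a batch of two expression-information vectors.
model = ExpressionToEmotionFace()
emotion_logits, facial_feat = model(torch.randn(2, 256))
print(emotion_logits.shape, facial_feat.shape)  # torch.Size([2, 5]) torch.Size([2, 128])
```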
Optionally, when the expression information is voice data and/or text data, the emotion to be expressed in the voice data and/or text data may be recognized to obtain the emotional feature; the opening and closing state of the lips may be determined from the content of the voice data and/or text data, and the shapes of the eyes and lips may be adjusted according to the emotional feature. The facial features are then determined by combining the lip opening/closing state with the adjusted eye and lip shapes.
Alternatively, when the expression information is a body movement, the emotion to be expressed may be determined from the body movement, thereby obtaining the emotional feature. For example, clenched hands are associated with tension, and a hand covering the mouth is associated with surprise. The shapes of the eyes and lips are then adjusted according to the emotional feature, thereby determining the facial features.
Optionally, when the expression information includes both body movements and voice data and/or text data, the emotion expressed by the body movements is determined first, then the emotion to be expressed in the voice data and/or text data is determined, and the two are fused to determine the emotional feature. If the two emotions differ greatly, the emotion expressed by the voice data and/or text data may prevail. The opening and closing state of the lips is then determined from the content of the voice data and/or text data, the shapes of the eyes and lips may be adjusted according to the emotional feature, and the facial features are determined by combining the lip opening/closing state with the adjusted eye and lip shapes.
Optionally, when the expression information includes the environment information and scene information of the user, the emotional feature the user would exhibit in that environment or scene may be determined by analyzing the environment information and scene information. For example, if the user is at a classical concert, the user is influenced by the classical music and may be in a peaceful, calm emotional state; if the user is at a World Cup match, the user may be in a nervous, excited emotional state. The emotional state of the user can therefore be recognized to some extent from the environment and scene information. Alternatively, the environment and scene information may be used to help other information, such as voice data and body movements, determine the emotional feature of the user more accurately. For example, if the user is in a heated debate competition and the voice data is the user's debating speech, combining the scene with the voice data makes it possible to determine that the user is in an intense, agitated emotional state.
In step S120, the specific video sequence is a video sequence that contains the face of a specific object and may contain one emotion or a plurality of emotions; the specific object is a living object, such as a designated person or animal. Optionally, the specific video sequence may be an original video obtained by shooting the specific object in advance, a segment of video clipped from the original video, or a video sequence obtained by splicing a plurality of clipped video segments. Optionally, the original video may be captured by a camera, and the camera may be installed on any device, for example, an automobile, a mobile terminal, a computer, an unmanned aerial vehicle, and the like, which is not limited here.
Optionally, the emotional features are added to video frames corresponding to the specific video sequence, so as to obtain a video sequence with the emotional features. Optionally, the emotional characteristics may be sequentially added to each video frame in a specific video sequence to obtain a video sequence with the emotional characteristics; or selecting a specific video frame in a specific video sequence, and adding the emotional characteristic to the specific video frame to obtain the video sequence with the emotional characteristic.
Illustratively, adding the emotional feature to the video frames of the specific video sequence means adjusting the emotional state of the face image in each video frame based on the emotional feature, so that the face image shows the emotional state corresponding to the added emotional feature. The emotional state of the face image may be adjusted by adjusting the state of the facial organs so that they show the emotion to be expressed.
Optionally, when a plurality of different emotional features are determined in the expression information, the plurality of emotional features may be sequentially added to the video frames of the specific video sequence according to the order of the emotional features in the expression information, so as to obtain video sequences with different emotional features. Specifically, each video frame and the emotional features can be represented by a vector respectively, and the emotional features can be added to the video sequence by fusing the two vectors.
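A minimal sketch of the vector fusion mentioned above, assuming every frame has already been encoded into a fixed-length feature vector and the emotional feature shares the same dimension; additive fusion with a small weight is only one possible choice, and all sizes are illustrative.

```python
import numpy as np

def add_emotion_to_sequence(frame_feats: np.ndarray,
                            emotion_feat: np.ndarray,
                            weight: float = 0.1) -> np.ndarray:
    """frame_feats: (num_frames, dim) per-frame feature vectors.
    emotion_feat: (dim,) emotional-feature vector.
    Returns per-frame features with the emotion blended in (the weight is kept
    small so the original frame content is not destroyed)."""
    return frame_feats + weight * emotion_feat[None, :]

frames = np.random.randn(75, 512)   # e.g. 3 s of video at 25 fps, 512-d features (illustrative)
happy = np.random.randn(512)        # stand-in for a learned "happy" embedding
fused = add_emotion_to_sequence(frames, happy)
print(fused.shape)                  # (75, 512)
```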
In step S130, the avatar of the target object may be, for example, an avatar of the specific object or an avatar of a non-specific object. An avatar refers to a non-real (e.g., VR) image used in a virtual environment, a two-dimensional (2D) model made by software, or a three-dimensional (3D) model. Optionally, since the video sequence with the emotional feature contains only the emotion change, the corresponding facial features need to be fused with the video frames with the emotional feature, so that the fused video sequence contains both the emotional feature and the facial features. In this way, avatars with different emotions can be generated from the fused video sequence. After the avatar is obtained, it can be driven by speech, so that the avatar is matched with different emotions according to the speech and thus exhibits different emotions.
Alternatively, as shown in fig. 2, when a specific object is driven, a mask is used to erase the facial details of the specific object in the specific video sequence while the pose information (such as the face shape) in each video frame is retained; the emotional feature is then embedded into the video frames of the specific video sequence to obtain a video sequence with the emotional feature, and the corresponding facial features are embedded into the video frames with the emotional feature to determine the facial details in those frames. The emotional feature and the facial features are each embedded as AdaIN conditions. Emotion editing, e.g., from sad to happy or from neutral to sad, is thereby achieved, so that avatars with different emotions are obtained.
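A minimal sketch of AdaIN-style conditional embedding, in which the emotional (or facial) feature predicts a per-channel scale and bias that modulates the normalized frame features; the channel and condition dimensions are assumptions, and the actual generator of Fig. 2 is more elaborate.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize the content feature map,
    then re-scale/re-shift it with statistics predicted from a condition
    vector (here: the emotional or facial feature)."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_bias = nn.Linear(cond_dim, channels * 2)

    def forward(self, content: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        scale, bias = self.to_scale_bias(condition).chunk(2, dim=1)   # (B, C) each
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(content) + bias

# Usage: modulate a masked frame feature map (B, C, H, W) with a 128-d emotion vector.
adain = AdaIN(channels=64, cond_dim=128)
frame_feat = torch.randn(2, 64, 32, 32)
emotion = torch.randn(2, 128)
out = adain(frame_feat, emotion)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```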
In the technical solution of the application, emotional features and facial features are determined based on the acquired expression information, and emotion editing processing is performed on the specific video sequence based on the emotional features to obtain a video sequence having the emotional features, where the specific video sequence comprises a video sequence of a specific object that contains a face. The video sequence having the emotional features therefore contains the face, and by embedding the facial features into the video sequence having the emotional features, avatars containing different emotions can be generated. This solves the problems that the generated avatar expresses a single emotion and has weak expressiveness, thereby meeting user requirements.
In one embodiment, as shown in fig. 3, the emotion editing processing performed on the specific video sequence based on the emotional feature to obtain the video sequence with the emotional feature, i.e. step S120, includes:
s310, reconstructing a video frame of the specific video sequence to obtain a reconstructed video code;
s320, obtaining a video sequence with the emotion characteristics by fusing the reconstructed video coding and the emotion characteristics.
Illustratively, since the dependency on the expressiveness of the real person needs to be reduced and the emotion in the video needs to be edited, the specific video sequence is reconstructed: the video frames in the specific video sequence are determined, and each video frame is reconstructed to obtain a plurality of reconstructed images. Optionally, features in each video frame may be extracted and the video frame reconstructed from those features. The identification codes of the plurality of reconstructed images are then determined according to the order of the video frames to obtain the reconstructed video coding. Next, a vector w corresponding to the image features of the reconstructed video coding and a vector b corresponding to the emotional feature are determined, and the two are added to obtain a new reconstructed video coding, namely w + xb, where x is a weighting coefficient. In order not to destroy the new reconstructed video coding, the threshold for x is set small enough; it may be set according to the actual situation and is not limited here. In this way, the new reconstructed video coding can carry different emotional features.
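The w + xb editing step can be sketched as follows, with w standing for the per-frame latent codes of the reconstructed video coding, b for the emotional-feature vector and x for the small weighting coefficient; the concrete latent size and the value of x are illustrative assumptions.

```python
import numpy as np

def edit_latents_with_emotion(w: np.ndarray, b: np.ndarray, x: float = 0.05) -> np.ndarray:
    """w: (num_frames, latent_dim) reconstructed latent codes of the video frames.
    b: (latent_dim,) emotional-feature direction.
    Returns w + x * b, i.e. every frame latent shifted slightly along the emotion
    direction so the reconstruction is preserved while the emotion changes."""
    return w + x * b[None, :]

w = np.random.randn(100, 512)   # illustrative: 100 frames, 512-d latents (StyleGAN-like)
b = np.random.randn(512)        # stand-in for a learned emotion direction
w_edited = edit_latents_with_emotion(w, b, x=0.05)
print(np.abs(w_edited - w).max() <= 0.05 * np.abs(b).max() + 1e-9)  # True: small, bounded edit
```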
In one embodiment, reconstructing a video frame of a specific video sequence to obtain a reconstructed video code, and performing a fusion process on the reconstructed video code and the emotional feature to obtain a video sequence with the emotional feature, includes:
inputting the emotional characteristics into a pre-trained face reconstruction network so that the face reconstruction network carries out video frame reconstruction on a specific video sequence to obtain a reconstructed video code, and carrying out fusion processing on the reconstructed video code and the emotional characteristics to obtain a video sequence with the emotional characteristics;
the face reconstruction network is obtained by performing face reconstruction training based on a training video sequence of the specific object, wherein the training video sequence comprises video sequences of the specific object with different emotional characteristics.
Illustratively, the face reconstruction network is used to reconstruct the specific video sequence and to emotionally edit it. Optionally, the face reconstruction network combines an e4e network structure and a style-based generative adversarial network (StyleGAN). Specifically, as shown in fig. 4, during training the StyleGAN network has difficulty reconstructing out-of-domain images but has good attribute-editing performance. Therefore, the e4e network structure is used to obtain reconstructed latent features, and the video frames in the training video sequence are reconstructed from these latent features to obtain reconstructed images. It should be noted that the reconstructed images obtained this way have low identity (ID) similarity to the video frames, but high similarity in appearance (e.g., pose) and expression (e.g., lip shape, eye shape). On this basis, the StyleGAN network is fine-tuned on the reconstructed images so as to reconstruct the training video sequence. Training data for the emotional features is also input into the face reconstruction network: after passing through several fully connected layers, the emotional feature learns an offset to the reconstructed latent feature, and this offset captures the change caused by the emotion on top of the video frames in the training video sequence. That is, if the video-frame feature (i.e., the reconstructed latent feature) in the training video sequence is a vector v and the emotional feature corresponds to a vector a, the changed reconstructed latent feature is v + xa, where x is a weighting coefficient. As before, in order not to destroy the new reconstructed video coding, the threshold for x is set small enough and may be set according to the actual situation. This guarantees the reconstruction of the video while adding a local emotional constraint, without changing the ID similarity or the continuity. Finally, to ensure training accuracy, an ID loss, a VGG (Visual Geometry Group) loss, an expression classification loss and a sequence consistency loss are added, and the trained face reconstruction network is obtained.
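The four losses named above could be combined roughly as follows. The identity network, the expression classifier and the loss weights are stand-ins, and the VGG perceptual term is replaced here by a plain L1 image loss, so this is only an illustrative skeleton of the training objective, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_training_loss(pred_frames, target_frames,
                                 id_embed, expr_classifier, target_expr_labels,
                                 w_id=1.0, w_vgg=1.0, w_expr=1.0, w_seq=1.0):
    """Illustrative combination of the four losses named in the description.
    pred_frames/target_frames: (B, T, C, H, W); id_embed maps a frame to an
    identity embedding; expr_classifier maps a frame to expression logits."""
    B, T = pred_frames.shape[:2]
    flat_pred = pred_frames.reshape(B * T, *pred_frames.shape[2:])
    flat_tgt = target_frames.reshape(B * T, *target_frames.shape[2:])

    # ID loss: cosine distance between identity embeddings of prediction and target.
    id_loss = (1 - F.cosine_similarity(id_embed(flat_pred), id_embed(flat_tgt), dim=1)).mean()
    # Perceptual term (VGG loss in the description; plain L1 used as a placeholder here).
    vgg_loss = F.l1_loss(flat_pred, flat_tgt)
    # Expression classification loss on the generated frames.
    expr_loss = F.cross_entropy(expr_classifier(flat_pred), target_expr_labels.reshape(B * T))
    # Sequence consistency loss: temporal differences should match the target video.
    seq_loss = F.l1_loss(pred_frames[:, 1:] - pred_frames[:, :-1],
                         target_frames[:, 1:] - target_frames[:, :-1])

    return w_id * id_loss + w_vgg * vgg_loss + w_expr * expr_loss + w_seq * seq_loss

# Usage sketch with dummy stand-in networks and random data.
id_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
cls_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 5))
pred = torch.rand(2, 8, 3, 64, 64); tgt = torch.rand(2, 8, 3, 64, 64)
labels = torch.randint(0, 5, (2, 8))
print(reconstruction_training_loss(pred, tgt, id_net, cls_net, labels))
```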
Specifically, the emotional features are input into the trained face reconstruction network and a video sequence with the emotional features is output, which guarantees the reconstruction of the video, generates a video with dynamic emotion changes, and solves the problem of a single emotion.
In one embodiment, in a case that the target object is not the specific object, the generating an avatar of the target object based on at least the video sequence having the emotional feature and the facial feature includes:
generating an avatar of the target object based on the video sequence having the emotional feature, the facial feature, and an appearance feature of the target object.
Illustratively, the appearance features may include: the shapes of the target object's facial organs (e.g., eyes, mouth, nose, eyebrows), the face shape, the hair color, and so on. For example, the eye shapes may include large double-eyelid eyes, small single-eyelid eyes, large single-eyelid eyes, etc.; the face shapes may include square, long and round faces; the nose shapes may include a small nose, an aquiline nose, etc.; and the mouth shapes may include an upper lip thicker than the lower lip, etc. Alternatively, the appearance features of the target object may be determined from an image containing the target object under a natural expression. Specifically, the image of the target object under the natural expression may be processed by a face recognition model (such as an encoder network), which outputs the appearance features of the target object.
Illustratively, when the target object is not the specific object, the target object is a non-specific object. The specific video sequence is a video sequence of the specific object; although the emotion in it can be edited, the appearance of a non-specific person is not present in it, so the appearance features of the target object need to be determined. Alternatively, the appearance features and facial features of the target object may be embedded together into the video frames of the video sequence with the emotional feature to generate avatars containing different emotions; or the appearance features and facial features of the target object may first be fused, and the fused features then embedded into the video frames of the video sequence with the emotional feature. In this way, an avatar can be generated not only for the specific object but also for a non-specific object.
In one embodiment, the generating an avatar of the target object based on the video sequence with the emotional feature, the facial feature, and an appearance feature of the target object includes:
and fusing the facial features and the appearance features of the target object into a video sequence with the emotional features based on the motion features of the key points of the face of the target object to generate an avatar of the target object.
In this embodiment, as shown in fig. 5, directly embedding the appearance features of the target object into the video sequence easily changes the background, so directly replacing the face in the video sequence with the appearance features of the target object is not effective. Therefore, the motion features of the key points of the target object's face can be determined from the motion of the face key points obtained through self-supervised learning, or from the motion of the face key points extracted with a three-dimensional face model. According to these motion features, the appearance features are embedded into the video frames of the video sequence, and the emotional feature and the facial features are then embedded into the video frames; specifically, a StyleGAN network can be used, with the emotional feature and the facial features each embedded as AdaIN conditions. Emotion changes, such as from sad to happy or from neutral to sad, are thereby realized, so that avatars with different emotions are obtained.
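A rough sketch of this step, assuming face key points are already available per frame (from self-supervised detection or a 3D face model fit): key-point offsets from a reference frame serve as motion features and are fused with the appearance feature of the target object into per-frame latents. The decoder (e.g., a StyleGAN-like generator) is not shown, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

def keypoint_motion_features(keypoints: torch.Tensor) -> torch.Tensor:
    """keypoints: (T, K, 2) face key points per frame. Returns (T, K*2) motion
    features as offsets from the first frame, used here as the reference frame."""
    offsets = keypoints - keypoints[0:1]          # motion relative to reference frame
    return offsets.reshape(keypoints.shape[0], -1)

class DrivenFrameGenerator(nn.Module):
    """Sketch of the fusion step: per-frame motion features determine where the
    appearance feature of the target object is placed, producing one latent per
    frame that a StyleGAN-like decoder (not shown) could render."""
    def __init__(self, num_kp=68, app_dim=256, latent_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(num_kp * 2 + app_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim))

    def forward(self, keypoints, appearance):
        motion = keypoint_motion_features(keypoints)          # (T, K*2)
        app = appearance.expand(motion.shape[0], -1)          # broadcast to every frame
        return self.fuse(torch.cat([motion, app], dim=1))     # (T, latent_dim)

# Usage: 50 frames, 68 key points, one 256-d appearance vector of the target object.
gen = DrivenFrameGenerator()
latents = gen(torch.randn(50, 68, 2), torch.randn(1, 256))
print(latents.shape)  # torch.Size([50, 512])
```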
In one embodiment, as shown in fig. 6, the step S110 of determining the emotional characteristic and the facial characteristic based on the acquired expression information includes:
s610, determining an information characteristic sequence in the expression information;
and S620, extracting emotion characteristics and facial characteristics corresponding to each information characteristic in the information characteristic sequence.
Illustratively, the expression information may include voice data or text data. Since the voice data or text data contains some meaningless words or sentences, features need to be extracted from the expression information. Optionally, when the expression information is voice data, speech features can be extracted directly from the voice data, e.g. with a DeepSpeech-style network structure, or the voice data can be converted into text data and features extracted from the converted text. When the expression information is text data, feature extraction can be performed on the text data to obtain text features. Alternatively, the voice data or text data may contain at least one information feature, and the information feature sequence is determined by the order in which the information features are extracted. For example, if speech feature A, speech feature B and speech feature C are extracted from the voice data in that order, the information feature sequence is [A B C]. Optionally, each speech or text feature may be input in turn into a trained neural network to determine the emotional feature and facial features corresponding to it; alternatively, the emotional feature and facial features corresponding to each speech or text feature may be determined from a mapping between speech/text features and emotional and facial features, so that the emotion changes and facial feature changes in the voice data or text data can be determined more accurately. The trained neural network can be obtained in advance by training on emotional features, facial features and speech or text features.
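The following placeholder illustrates only the shape of the result, i.e., an ordered information feature sequence extracted frame by frame from the voice data; it uses a simple log-spectrum per window instead of the DeepSpeech-style network features (or MFCCs) that would be used in practice, and the window/hop sizes are assumptions.

```python
import numpy as np

def speech_feature_sequence(waveform: np.ndarray, sr: int = 16000,
                            win_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Stand-in for the speech feature extraction step: frame the waveform and
    take a log magnitude spectrum per frame. The output is an ordered sequence
    of per-frame information features."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [waveform[i:i + win] for i in range(0, len(waveform) - win + 1, hop)]
    feats = [np.log1p(np.abs(np.fft.rfft(f * np.hanning(win)))) for f in frames]
    return np.stack(feats)          # (num_frames, win // 2 + 1), in temporal order

audio = np.random.randn(16000)      # 1 s of dummy audio
seq = speech_feature_sequence(audio)
print(seq.shape)                    # (98, 201): the information feature sequence [A, B, C, ...]
```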
In one embodiment, the step S620 of extracting an emotional feature and a facial feature corresponding to each information feature in the information feature sequence includes:
determining emotion characteristics corresponding to each information characteristic in the information characteristic sequence;
lip features and eye features are determined from the emotional features.
Illustratively, the expression information may include voice data or text data. Taking voice data as an example, feature extraction is performed on the voice data to obtain at least one speech feature (i.e., information feature). The emotion expressed by each speech feature can be recognized with models such as a discrete speech emotion classifier or a Hidden Markov Model (HMM), so that the emotional feature corresponding to each information feature is determined. Optionally, the shape change of the lips and of the eyes may then be determined from the emotional feature. For example, if the emotional feature is happy, the lip feature may be raised mouth corners and the eye feature may be downward-curving eye corners. Alternatively, the shape changes of the lips and eyes may first be determined preliminarily from the speech feature and then corrected according to the emotional feature to obtain the final lip and eye features, so that the expression of the virtual model generated from the lip feature, eye feature and emotional feature conforms to the way a real person speaks.
In one embodiment, the determining of the information characteristic sequence in the expression information; extracting emotion characteristics and facial characteristics corresponding to each information characteristic in the information characteristic sequence, wherein the steps comprise:
inputting the expression information into a pre-trained neural network model so that the neural network model determines an information characteristic sequence in the expression information; and extracting emotion characteristics and facial characteristics corresponding to each information characteristic in the information characteristic sequence.
Illustratively, the neural network model is used to extract the emotional features and facial features in the expression information, and may employ a long short-term memory network (LSTM), a gated recurrent unit network (GRU), a Transformer network, and the like. Alternatively, the input layer uses an LSTM and the two subsequent layers use a multi-head Transformer structure.
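A sketch of the model layout just described (an LSTM input layer followed by two multi-head Transformer encoder layers), with separate output heads for the emotional, eye and lip features; the hidden size, head count and output dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionFaceNet(nn.Module):
    """Sketch: LSTM input layer, then two multi-head Transformer encoder layers,
    then per-timestep heads for emotion, eye and lip features."""
    def __init__(self, in_dim=201, hidden=256, num_emotions=5, eye_dim=16, lip_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.emotion_head = nn.Linear(hidden, num_emotions)
        self.eye_head = nn.Linear(hidden, eye_dim)
        self.lip_head = nn.Linear(hidden, lip_dim)

    def forward(self, feat_seq):                  # (B, T, in_dim)
        h, _ = self.lstm(feat_seq)
        h = self.encoder(h)                       # (B, T, hidden)
        return self.emotion_head(h), self.eye_head(h), self.lip_head(h)

# Usage: a batch of one 98-step speech feature sequence.
net = EmotionFaceNet()
emo, eye, lip = net(torch.randn(1, 98, 201))
print(emo.shape, eye.shape, lip.shape)  # (1, 98, 5) (1, 98, 16) (1, 98, 32)
```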
Optionally, as shown in fig. 7, during training, in order to enhance the expression ability of the real-person avatar, multi-dimensional expression is added, and facial expressions corresponding to different emotions are obtained by fusing different emotions according to the voice features and/or the text features. And inputting the voice feature, the text feature, the emotion feature, the eye feature and the lip feature into a neural network model for self-supervision training. The trained neural network model can extract emotional features and facial features in the expression information. Wherein the facial features include: an eye feature and a lip feature.
Alternatively, the speech feature may be a deep feature, an MFCC (Mel-frequency cepstral coefficients) feature, or another feature. Because high-level speech features can erase part of the emotional expression, which is not conducive to autonomous emotion learning, multi-modal text features are introduced on top of the speech features; and because the emotional characteristics of a sentence are stable, GPT-2 is used to extract sentence-level text features. Optionally, the emotional features mainly cover five expressions (happy, sad, neutral, angry and surprised), and further emotions, such as fear and disgust, can be set manually as needed. In the specific training there are 5 emotions, each represented by a 128-dimensional feature; after random initialization, the emotion vectors are updated automatically during training. The first frame of the training video sequence can be selected as the reference frame, and the offset between the key points of the reference frame and the standard key points is determined with a self-supervised key point method, so that the emotion change can be learned better. Alternatively, a three-dimensional face model can be fitted to the corrected face key points, and the offset computed from the face key points and the standard key points.
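The trainable emotion embeddings and the key-point offset signal described above can be sketched as follows; the index order of the five emotions and the use of a zero template as the standard key points are assumptions.

```python
import torch
import torch.nn as nn

# Five trainable 128-dimensional emotion embeddings (happy, sad, neutral, angry,
# surprised), randomly initialized and updated during training, as described above.
EMOTIONS = ["happy", "sad", "neutral", "angry", "surprised"]
emotion_table = nn.Embedding(num_embeddings=len(EMOTIONS), embedding_dim=128)

def keypoint_offsets(seq_keypoints: torch.Tensor,
                     standard_keypoints: torch.Tensor) -> torch.Tensor:
    """seq_keypoints: (T, K, 2) face key points of the training video sequence,
    with frame 0 taken as the reference frame; standard_keypoints: (K, 2)
    standard key points. Returns the offset of the reference frame's key points
    from the standard key points, the signal from which emotion-induced change
    is learned."""
    reference = seq_keypoints[0]                 # first frame as reference frame
    return reference - standard_keypoints        # (K, 2) offsets

happy_vec = emotion_table(torch.tensor([EMOTIONS.index("happy")]))   # (1, 128)
offsets = keypoint_offsets(torch.randn(50, 68, 2), torch.zeros(68, 2))
print(happy_vec.shape, offsets.shape)            # torch.Size([1, 128]) torch.Size([68, 2])
```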
Optionally, the lip features are strongly correlated with the speech, whereas the eye state is only weakly correlated with it. To fit the eye state better, an eye regression signal is added for the eye dimensions: the eye image is cropped using an eye-closure detector, and whether the eyes are open or closed can be determined from the ratio of the eye length to the eye width, so as to determine the eye features.
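A small sketch of the eye-state judgment from the ratio of the eye dimensions; the landmark layout and the 0.2 threshold are illustrative assumptions.

```python
import numpy as np

def eye_openness(eye_landmarks: np.ndarray, closed_threshold: float = 0.2) -> tuple:
    """eye_landmarks: (N, 2) points outlining one eye (e.g. the 6 points per eye
    in a 68-point face layout). The eye state is judged from the ratio of the eye
    opening (vertical extent) to the eye length (horizontal extent)."""
    xs, ys = eye_landmarks[:, 0], eye_landmarks[:, 1]
    length = xs.max() - xs.min()                  # horizontal extent of the eye
    height = ys.max() - ys.min()                  # vertical extent (opening)
    ratio = height / (length + 1e-6)
    return ratio, bool(ratio > closed_threshold)  # (ratio, is_open)

open_eye = np.array([[0, 5], [10, 9], [20, 10], [30, 9], [20, 1], [10, 2]], dtype=float)
print(eye_openness(open_eye))                     # roughly (0.3, True): the eye is open
```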
In addition, in order to decouple the extracted eye features and lip features, they are made to relate only to facial movements and not to emotions. During training, a cross-prediction scheme is adopted: different emotional features are combined with the same eye and lip features and input into the neural network model for training.
Exemplary devices
Accordingly, fig. 8 is a schematic structural diagram of an avatar generation apparatus according to an embodiment of the present application. In an exemplary embodiment, there is provided an avatar generation apparatus including:
a determining module 810, configured to determine emotional characteristics and facial characteristics based on the obtained expression information;
a processing module 820, configured to perform emotion editing processing on a specific video sequence based on the emotion feature to obtain a video sequence with the emotion feature; wherein the specific video sequence comprises a video sequence of a specific object containing a face;
a generating module 830 for generating an avatar of the target object based on at least the video sequence with the emotional feature and the facial feature.
In one embodiment, the processing module 820 includes:
the editing module is used for reconstructing video frames of a specific video sequence to obtain reconstructed video codes;
and the fusion module is used for obtaining a video sequence with the emotion characteristics by carrying out fusion processing on the reconstructed video coding and the emotion characteristics.
In one embodiment, reconstructing a video frame of a specific video sequence to obtain a reconstructed video code, and performing a fusion process on the reconstructed video code and the emotional feature to obtain a video sequence with the emotional feature, includes:
inputting the emotional characteristics into a pre-trained face reconstruction network so that the face reconstruction network carries out video frame reconstruction on a specific video sequence to obtain a reconstructed video code, and carrying out fusion processing on the reconstructed video code and the emotional characteristics to obtain a video sequence with the emotional characteristics;
the face reconstruction network is obtained by performing face reconstruction training based on a training video sequence of the specific object, wherein the training video sequence comprises video sequences of the specific object with different emotional characteristics.
In one embodiment, in case that the target object is not the specific object, the generating module 830 includes:
generating an avatar of the target object based on the video sequence having the emotional feature, the facial feature, and an appearance feature of the target object.
In one embodiment, the generating an avatar of the target object based on the video sequence having the emotional feature, the facial feature, and an appearance feature of the target object comprises:
and fusing the facial features and the appearance features of the target object into the video sequence with the emotion features based on the motion features of the key points of the face of the target object to generate the virtual image of the target object.
In one embodiment, the determining module 810 includes:
determining an information characteristic sequence in the expression information;
and extracting emotion characteristics and facial characteristics corresponding to each information characteristic in the information characteristic sequence.
In one embodiment, the extracting of the emotion feature and the facial feature corresponding to each information feature in the information feature sequence includes:
determining emotion characteristics corresponding to each information characteristic in the information characteristic sequence;
lip features and eye features are determined from the emotional features.
In one embodiment, the determining of the information characteristic sequence in the expression information; extracting emotion characteristics and facial characteristics corresponding to each information characteristic in the information characteristic sequence, wherein the steps comprise:
inputting the expression information into a pre-trained neural network model so that the neural network model determines an information characteristic sequence in the expression information; and extracting emotion characteristics and facial characteristics corresponding to each information characteristic in the information characteristic sequence.
The apparatus for generating an avatar provided in this embodiment is the same as the method for generating an avatar provided in the foregoing embodiment of the present application, and can execute the method for generating an avatar provided in any of the foregoing embodiments of the present application, and has functional modules and beneficial effects corresponding to the method for generating an avatar. For details of the technology that are not described in detail in this embodiment, reference may be made to specific processing contents of the method for generating an avatar provided in the foregoing embodiment of the present application, and details are not described here again.
According to the technical solution of the present application, the acquisition, storage and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Exemplary electronic device
Another embodiment of the present application further provides an electronic device, as shown in fig. 9, the electronic device including:
a memory 900 and a processor 910;
wherein, the memory 900 is connected to the processor 910 for storing programs;
the processor 910 is configured to implement the method for generating an avatar disclosed in any of the above embodiments by running the program stored in the memory 900.
Specifically, the electronic device may further include: a bus, a communication interface 920, an input device 930, and an output device 940.
The processor 910, the memory 900, the communication interface 920, the input device 930, and the output device 940 are connected to each other through a bus. Wherein:
a bus may comprise a path that transfers information between components of a computer system.
The processor 910 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 910 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 900 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, memory 900 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
Input devices 930 may include devices that receive data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 940 may include devices that allow output of information to a user, such as a display screen, printer, speakers, etc.
Communication interface 920 may include any apparatus, such as a transceiver, for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), a wireless local area network (WLAN), etc.
The processor 910 executes the program stored in the memory 900 and invokes other devices, which can be used to implement the steps of any avatar generation method provided in the above embodiments of the present application.
Exemplary computer program product and storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of generating an avatar according to various embodiments of the present application described in the above-mentioned "exemplary methods" section of this specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a storage medium having stored thereon a computer program that is executed by a processor to perform steps in the avatar generation method according to various embodiments of the present application described in the above-mentioned "exemplary methods" section of the present specification.
The specific work content of the electronic device, and the specific work content of the computer program product and the computer program on the storage medium when executed by the processor, may refer to the content of the above method embodiment, and are not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments are presented as a series of acts or combinations, it will be appreciated by those of ordinary skill in the art that the present application is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and reference may be made to the partial description of the method embodiment for relevant points.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical function division, and other division manners may be available in actual implementation, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating an avatar, comprising:
determining emotional features and facial features based on acquired expression information;
reconstructing video frames of a specific video sequence to obtain a reconstructed video coding;
obtaining a video sequence with the emotional features by fusing the reconstructed video coding with the emotional features; wherein the specific video sequence comprises a video sequence of a specific object containing a face, and the reconstructed video coding comprises features of the video frames and identification codes of a plurality of reconstructed images reconstructed from the features of the video frames; and
generating an avatar of a target object based on at least the video sequence with the emotional features and the facial features.
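As an illustration only, the flow recited in claim 1 can be sketched in code. The sketch below is a hypothetical PyTorch rendering: the class names, the linear and embedding layers, the codebook-style identification codes, and all dimensions are assumptions made for readability, not the patented implementation.

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Maps expression information (e.g. an audio/text embedding) to
    emotional features and facial features."""
    def __init__(self, in_dim=128, emo_dim=64, face_dim=64):
        super().__init__()
        self.emo_head = nn.Linear(in_dim, emo_dim)
        self.face_head = nn.Linear(in_dim, face_dim)

    def forward(self, expression_info):
        return self.emo_head(expression_info), self.face_head(expression_info)

class FaceReconstructionNetwork(nn.Module):
    """Reconstructs frames of the specific video sequence into a reconstructed
    video coding (per-frame features plus identification codes) and fuses it
    with the emotional features."""
    def __init__(self, frame_dim=256, emo_dim=64, num_codes=512):
        super().__init__()
        self.frame_encoder = nn.Linear(frame_dim, 128)   # features of the video frames
        self.codebook = nn.Embedding(num_codes, 128)     # identification codes
        self.fuse = nn.Linear(128 + emo_dim, 128)

    def forward(self, video_frames, emotional_features):
        feats = self.frame_encoder(video_frames)               # (T, 128)
        logits = feats @ self.codebook.weight.t()              # (T, num_codes)
        codes = self.codebook(logits.argmax(dim=-1))           # quantised identification codes
        emo = emotional_features.expand(feats.size(0), -1)     # broadcast over frames
        return self.fuse(torch.cat([codes, emo], dim=-1))      # video sequence with emotion

def generate_avatar(expression_info, specific_video_frames, renderer):
    encoder = ExpressionEncoder()
    reconstructor = FaceReconstructionNetwork()
    emotional_features, facial_features = encoder(expression_info)
    emotional_sequence = reconstructor(specific_video_frames, emotional_features)
    return renderer(emotional_sequence, facial_features)       # avatar of the target object
```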
2. The method according to claim 1, wherein the reconstructing video frames of a specific video sequence to obtain a reconstructed video coding, and the obtaining a video sequence with the emotional features by fusing the reconstructed video coding with the emotional features, comprise:
inputting the emotional features into a pre-trained face reconstruction network, so that the face reconstruction network reconstructs the video frames of the specific video sequence to obtain the reconstructed video coding, and fuses the reconstructed video coding with the emotional features to obtain the video sequence with the emotional features;
wherein the face reconstruction network is obtained by face reconstruction training based on training video sequences of the specific object, the training video sequences comprising video sequences of the specific object with different emotional features.
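A possible training setup for such a face reconstruction network is sketched below, assuming a simple frame-reconstruction (MSE) objective over emotion-labelled clips of the specific object; the optimiser, loss, and data layout are assumptions rather than details taken from the claims.

```python
import torch
import torch.nn.functional as F

def train_face_reconstruction(network, frame_decoder, training_clips,
                              epochs=10, lr=1e-4):
    """training_clips: iterable of (video_frames, emotional_features) pairs for
    the specific object, covering several different emotional states."""
    params = list(network.parameters()) + list(frame_decoder.parameters())
    optimiser = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for video_frames, emotional_features in training_clips:
            fused = network(video_frames, emotional_features)   # video sequence with emotion
            rebuilt = frame_decoder(fused)                      # map back to frame space
            loss = F.mse_loss(rebuilt, video_frames)            # face reconstruction objective
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return network
```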
3. The method according to claim 1, wherein, in a case where the target object is not the specific object, the generating an avatar of a target object based on at least the video sequence with the emotional features and the facial features comprises:
generating the avatar of the target object based on the video sequence with the emotional features, the facial features, and appearance features of the target object.
4. The method according to claim 3, wherein the generating the avatar of the target object based on the video sequence with the emotional features, the facial features, and the appearance features of the target object comprises:
fusing the facial features and the appearance features of the target object into the video sequence with the emotional features, based on motion features of key points of the face of the target object, to generate the avatar of the target object.
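One plausible reading of this fusion step is a key-point-driven blend, in which the motion features of the facial key points gate how strongly the target object's facial and appearance features are written into each frame of the emotional video sequence. The gating scheme below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class KeypointDrivenFusion(nn.Module):
    def __init__(self, seq_dim=128, face_dim=64, app_dim=64, kp_dim=32):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(kp_dim, seq_dim), nn.Sigmoid())
        self.inject = nn.Linear(face_dim + app_dim, seq_dim)

    def forward(self, emotional_sequence, facial_feature, appearance_feature, kp_motion):
        # kp_motion: (T, kp_dim) motion features of the target's facial key points
        g = self.gate(kp_motion)                                            # (T, seq_dim) in [0, 1]
        injected = self.inject(torch.cat([facial_feature, appearance_feature], dim=-1))
        return emotional_sequence + g * injected                            # per-frame fused sequence

# Usage sketch with random tensors of the assumed shapes.
fusion = KeypointDrivenFusion()
fused = fusion(torch.randn(50, 128), torch.randn(64), torch.randn(64), torch.randn(50, 32))
```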
5. The method according to claim 1, wherein the determining emotional features and facial features based on acquired expression information comprises:
determining an information feature sequence in the expression information; and
extracting the emotional features and the facial features corresponding to each information feature in the information feature sequence.
6. The method according to claim 5, wherein the extracting the emotional features and the facial features corresponding to each information feature in the information feature sequence comprises:
determining the emotional features corresponding to each information feature in the information feature sequence; and
determining lip features and eye features from the emotional features.
7. The method according to claim 5, wherein the determining an information feature sequence in the expression information and the extracting the emotional features and the facial features corresponding to each information feature in the information feature sequence comprise:
inputting the expression information into a pre-trained neural network model, so that the neural network model determines the information feature sequence in the expression information and extracts the emotional features and the facial features corresponding to each information feature in the information feature sequence.
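A minimal sketch of such a neural network model, assuming the expression information arrives as a feature sequence (for example, audio frames) and that lip and eye features are derived from the emotional features as in claim 6; the GRU backbone and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExpressionFeatureModel(nn.Module):
    def __init__(self, in_dim=80, hidden=128, emo_dim=64, lip_dim=32, eye_dim=32):
        super().__init__()
        self.backbone = nn.GRU(in_dim, hidden, batch_first=True)  # information feature sequence
        self.emo_head = nn.Linear(hidden, emo_dim)                 # emotional feature per step
        self.lip_head = nn.Linear(emo_dim, lip_dim)                # lip feature from emotion
        self.eye_head = nn.Linear(emo_dim, eye_dim)                # eye feature from emotion

    def forward(self, expression_info):
        # expression_info: (B, T, in_dim), e.g. spectrogram frames of a driving signal
        seq, _ = self.backbone(expression_info)
        emo = self.emo_head(seq)
        return emo, self.lip_head(emo), self.eye_head(emo)

# Usage sketch: per-step features for a 50-frame driving signal.
model = ExpressionFeatureModel()
emo, lip, eye = model(torch.randn(1, 50, 80))
```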
8. An avatar generation apparatus, comprising:
a determining module, configured to determine emotional features and facial features based on acquired expression information;
a processing module, configured to reconstruct video frames of a specific video sequence to obtain a reconstructed video coding, and to obtain a video sequence with the emotional features by fusing the reconstructed video coding with the emotional features; wherein the specific video sequence comprises a video sequence of a specific object containing a face, and the reconstructed video coding comprises features of the video frames and identification codes of a plurality of reconstructed images reconstructed from the features of the video frames; and
a generating module, configured to generate an avatar of a target object based on at least the video sequence with the emotional features and the facial features.
9. An electronic device, comprising:
a memory and a processor;
wherein the memory is connected with the processor and configured to store a program; and
the processor is configured to implement the method for generating an avatar according to any one of claims 1 to 7 by executing the program stored in the memory.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the method for generating an avatar according to any one of claims 1 to 7.
CN202211310590.9A 2022-10-25 2022-10-25 Method, device and equipment for generating virtual image and storage medium Active CN115375809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211310590.9A CN115375809B (en) 2022-10-25 2022-10-25 Method, device and equipment for generating virtual image and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211310590.9A CN115375809B (en) 2022-10-25 2022-10-25 Method, device and equipment for generating virtual image and storage medium

Publications (2)

Publication Number Publication Date
CN115375809A CN115375809A (en) 2022-11-22
CN115375809B true CN115375809B (en) 2023-03-14

Family

ID=84074221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211310590.9A Active CN115375809B (en) 2022-10-25 2022-10-25 Method, device and equipment for generating virtual image and storage medium

Country Status (1)

Country Link
CN (1) CN115375809B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10748579B2 (en) * 2016-10-26 2020-08-18 Adobe Inc. Employing live camera feeds to edit facial expressions
US10953334B2 (en) * 2019-03-27 2021-03-23 Electronic Arts Inc. Virtual character generation from image or video data
CN110688911B (en) * 2019-09-05 2021-04-02 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN114513678A (en) * 2020-11-16 2022-05-17 阿里巴巴集团控股有限公司 Face information generation method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145282A (en) * 2019-12-12 2020-05-12 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN113923462A (en) * 2021-09-10 2022-01-11 阿里巴巴达摩院(杭州)科技有限公司 Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN113760101A (en) * 2021-09-23 2021-12-07 北京字跳网络技术有限公司 Virtual character control method and device, computer equipment and storage medium
CN114245215A (en) * 2021-11-24 2022-03-25 清华大学 Method, device, electronic equipment, medium and product for generating speaking video
CN114399818A (en) * 2022-01-05 2022-04-26 广东电网有限责任公司 Multi-mode face emotion recognition method and device

Also Published As

Publication number Publication date
CN115375809A (en) 2022-11-22

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN111145282B (en) Avatar composition method, apparatus, electronic device, and storage medium
KR101558202B1 (en) Apparatus and method for generating animation using avatar
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
Chuang et al. Mood swings: expressive speech animation
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN108942919B (en) Interaction method and system based on virtual human
KR102509666B1 (en) Real-time face replay based on text and audio
WO2023284435A1 (en) Method and apparatus for generating animation
US11860925B2 (en) Human centered computing based digital persona generation
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
EP4154093A1 (en) Speech-driven gesture synthesis
KR101738142B1 (en) System for generating digital life based on emotion and controlling method therefore
CN115797488A (en) Image generation method and device, electronic equipment and storage medium
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
KR102437039B1 (en) Learning device and method for generating image
CN115908657A (en) Method, device and equipment for generating virtual image and storage medium
CN113903067A (en) Virtual object video generation method, device, equipment and medium
CN116129013A (en) Method, device and storage medium for generating virtual person animation video
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
Vasani et al. Generation of indian sign language by sentence processing and generative adversarial networks
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
Tao et al. Emotional Chinese talking head system
CN115375809B (en) Method, device and equipment for generating virtual image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant