CN111862277A - Method, apparatus, device and storage medium for generating animation

Method, apparatus, device and storage medium for generating animation

Info

Publication number
CN111862277A
CN111862277A (application CN202010710731.0A)
Authority
CN
China
Prior art keywords
information
mouth
expression
audio
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010710731.0A
Other languages
Chinese (zh)
Inventor
赵洋
杨少雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010710731.0A priority Critical patent/CN111862277A/en
Publication of CN111862277A publication Critical patent/CN111862277A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for generating animation, and relates to the fields of augmented reality, animation and deep learning. The specific implementation scheme is as follows: acquiring a face image set and audio matched with the face image set; analyzing the face images in the face image set to determine head pose information and first expression information of a face object; determining mouth information of the face object according to a pre-trained mouth information determination model and the audio, wherein the mouth information determination model is used for representing the correspondence between audio and mouth information; fusing the first expression information and the mouth information to determine second expression information; and generating the animation based on the head pose information and the second expression information. This implementation makes the generated animation more lifelike and smooth, and improves stability when the mouth is occluded.

Description

Method, apparatus, device and storage medium for generating animation
Technical Field
The present application relates to the field of computer technologies, in particular to the fields of augmented reality, animation and deep learning, and more particularly to a method, apparatus, device, and storage medium for generating animation.
Background
Face animation is increasingly applied to multi-modal human-computer interaction, movie making, computer games, video conferencing, virtual hosts and the like. Face animation drives the blend-shape deformation of a face model in real time with facial expression information. Extracting facial expression information is therefore the key step of the face animation driving technique: only when facial expressions are captured accurately can the blend-shape-driven face model look lifelike.
Current face driving methods take an image as input, extract sparse facial key points, and then obtain facial expression information by solving for the globally optimal match against each preset expression. Because of individual differences between faces, a globally optimal match is hard to achieve, and the extraction of sparse facial key points itself introduces errors.
Disclosure of Invention
A method, apparatus, device, and storage medium for generating an animation are provided.
According to a first aspect, there is provided a method for generating an animation, comprising: acquiring a face image set and audio matched with the face image set; analyzing the face images in the face image set to determine head pose information and first expression information of a face object; determining mouth information of the face object according to a pre-trained mouth information determination model and the audio, wherein the mouth information determination model is used for representing the correspondence between audio and mouth information; fusing the first expression information and the mouth information to determine second expression information; and generating the animation based on the head pose information and the second expression information.
According to a second aspect, there is provided an apparatus for generating an animation, comprising: an acquisition unit configured to acquire a face image set and audio matched with the face image set; a first determining unit configured to analyze the face images in the face image set and determine head pose information and first expression information of a face object; a second determining unit configured to determine mouth information of the face object according to a pre-trained mouth information determination model and the audio, wherein the mouth information determination model is used for representing the correspondence between audio and mouth information; a fusion unit configured to fuse the first expression information and the mouth information to determine second expression information; and a generating unit configured to generate an animation based on the head pose information and the second expression information.
According to a third aspect, there is provided an electronic device for generating an animation, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
According to the technology of the application, the problem of inaccurate facial expression information in existing methods is solved. The facial expression information and the mouth information are extracted separately from the face images and from the audio matched with them, and the two are then fused, so that the generated animation is more lifelike and smooth and stability is improved when the mouth is occluded.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating an animation according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for generating an animation according to the application;
FIG. 4 is a flow diagram of another embodiment of a method for generating an animation according to the present application;
FIG. 5 is a schematic diagram illustrating the structure of one embodiment of an apparatus for generating an animation according to the present application;
FIG. 6 is a block diagram of an electronic device for implementing a method for generating an animation according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding and should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating an animation or the apparatus for generating an animation of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as an animation playing application, a video playing application, an image browsing application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, in-vehicle computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server that provides various services, such as a background server that processes images and audio provided on the terminal devices 101, 102, 103. The backend server can process the images and audio, generate animation, and feed the animation back to the terminal devices 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for generating an animation provided in the embodiment of the present application may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105. Accordingly, the apparatus for generating animation may be provided in the terminal devices 101, 102, 103, or may be provided in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating an animation according to the present application is shown. The method for generating the animation of the embodiment comprises the following steps:
step 201, a face image set and an audio matched with the face image set are obtained.
In this embodiment, an execution subject of the method for generating an animation (for example, the terminal devices 101, 102, 103 or the server 105 shown in fig. 1) may acquire a face image set and audio matched with the face image set. Here, the face image set may include a plurality of images of the same face object. The audio may be audio that matches each face image in the set. Matching means that the mouth movements corresponding to the respective phonemes in the audio correspond to the mouth images in the face images. In some application scenarios, the face image set and the audio come from the same video.
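A minimal sketch of one way to obtain such matched inputs when the face images and the audio come from the same video. The file paths, the choice of OpenCV and librosa, and the assumption that the audio track has already been demuxed to a WAV file are illustrative, not part of the patent.

```python
# Minimal sketch of acquiring a face image set and matched audio.
# Paths and the pre-extracted WAV assumption are illustrative.
import cv2        # OpenCV, for reading video frames
import librosa    # for loading the matched audio track


def acquire_inputs(video_path: str, audio_path: str, sample_rate: int = 16000):
    """Read all frames of a talking-head video and the audio extracted from it."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)          # BGR images of the same face object
    capture.release()

    # The audio is assumed to have been demuxed to a WAV file beforehand
    # (e.g. with ffmpeg), so that it matches the frames of the same video.
    waveform, sr = librosa.load(audio_path, sr=sample_rate)
    return frames, waveform, sr
```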
Step 202, analyzing the facial images in the facial image set, and determining the head posture information and the first expression information of the facial object.
After obtaining the face image set, the execution subject may analyze the face images in it and determine the head pose information and the first expression information of the face object. Here, the execution subject may extract feature information from each face image, or may perform processing such as semantic segmentation on the face images. From the analysis result, the execution subject can obtain the head pose information and the first expression information of the face object. For example, the execution subject may input the extracted feature information into a pre-trained head pose information determination model to determine the head pose information. Alternatively, the execution subject may determine a head contour from the result of the semantic segmentation and derive the head pose information from that contour. Here, the head pose information indicates the position of the head of the face object relative to the camera coordinate system. Similarly, the execution subject may obtain the first expression information of the face object from the analysis result, for example by inputting the extracted feature information into a pre-trained expression information determination model. The first expression information may include a plurality of parameter values, each describing the state of a different part of the face. For example, it may include a parameter describing how far the mouth is open, taking a value between 0 and 1, where a larger value indicates a wider-open mouth; it may further include a parameter describing how far the eyes are open, also taking a value between 0 and 1, where a larger value indicates wider-open eyes.
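For illustration, the analysis result could be represented as follows; the concrete parameter names and the six-value head pose are assumptions consistent with the description above (pose relative to the camera, expression values in [0, 1]), not a format fixed by the application.

```python
# Illustrative data structures for the analysis result of one face image.
# Parameter names and the 0-1 ranges follow the description above; the
# concrete keys are assumptions, not the patent's fixed format.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HeadPose:
    # Rotation (radians) and translation of the head relative to the
    # camera coordinate system.
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0
    tx: float = 0.0
    ty: float = 0.0
    tz: float = 0.0


@dataclass
class Expression:
    # Each value lies in [0, 1]; a larger value means the part is more open.
    params: Dict[str, float] = field(default_factory=lambda: {
        "mouth_open": 0.0,
        "eye_open_left": 1.0,
        "eye_open_right": 1.0,
        "brow_raise": 0.0,
    })
```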
Step 203, determining the mouth information of the face object according to the pre-trained mouth information determination model and the audio.
The execution subject may also process the acquired audio, for example by inputting it into a pre-trained mouth information determination model, to obtain the mouth information of the face object. Here, the mouth information determination model is used to characterize the correspondence between audio and mouth information. The mouth information may include parameters describing the lip shape; the parameters take values in a fixed range, and different parameter values represent different lip shapes. In some applications, the execution subject may also preprocess the audio before inputting it into the mouth information determination model, for example by removing noise from the audio and performing spectral analysis on it.
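A hedged sketch of this step: preprocess the audio with a spectral analysis (MFCC features are an assumed choice) and feed the result to an already-trained mouth information determination model; the model interface shown here is hypothetical.

```python
# Sketch of audio preprocessing followed by mouth-information inference.
# MFCC features are an assumption; the text only mentions denoising and
# spectral analysis before feeding the audio to the model.
import librosa
import numpy as np
import torch


def audio_to_mouth_info(waveform: np.ndarray, sr: int, mouth_model: torch.nn.Module):
    # Spectral analysis: 13 MFCC coefficients per audio frame.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)          # (13, T)
    features = torch.from_numpy(np.ascontiguousarray(mfcc.T)).float()  # (T, 13)
    with torch.no_grad():
        # The (hypothetical) model maps each audio frame to mouth
        # parameters, e.g. lip-shape values in a fixed range.
        mouth_info = mouth_model(features)                             # (T, n_mouth_params)
    return mouth_info
```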
Step 204, fusing the first expression information and the mouth information to determine second expression information.
After obtaining the first expression information and the mouth information, the execution subject may fuse the two to obtain the second expression information. Specifically, the execution subject may combine the mouth information with the information describing the mouth in the first expression information to obtain the second expression information.
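One possible way to combine the mouth information with the mouth-describing part of the first expression information, assuming both are parameter dictionaries; the key-name convention is an assumption.

```python
# One possible fusion: overwrite the mouth-related parameters of the
# image-based expression with the audio-derived mouth information.
# The key names are illustrative assumptions.
def fuse_expression(first_expression: dict, mouth_info: dict) -> dict:
    second_expression = dict(first_expression)
    for key, value in mouth_info.items():
        if key.startswith("mouth") or key.startswith("lip"):
            second_expression[key] = value
    return second_expression
```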
Step 205, generating an animation based on the head posture information and the second expression information.
After obtaining the second expression information, the execution subject may generate an animation based on the head pose information and the second expression information. Specifically, the execution subject may determine the position of the face material according to the head pose information and determine the weighting coefficients of the basic blend shapes (Blendshapes) according to the second expression information, thereby driving the animation. In the animation field, the second expression information and the head pose information may be imported into the content creation software Maya, where the coefficients are passed as parameters to a virtual face to produce the facial animation.
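The underlying blend-shape arithmetic can be sketched as below; the mesh arrays are placeholders for an asset such as one authored in Maya, and the linear blendshape formula is the standard one rather than a detail stated in the application.

```python
# Sketch of how the second expression information can drive basic blend shapes.
# base_mesh and the blendshape deltas are placeholders for a face asset.
import numpy as np


def drive_blendshapes(base_mesh: np.ndarray,
                      blendshape_deltas: np.ndarray,
                      weights: np.ndarray) -> np.ndarray:
    """
    base_mesh:          (V, 3) neutral face vertices
    blendshape_deltas:  (K, V, 3) per-blendshape vertex offsets
    weights:            (K,) coefficients taken from the second expression information
    """
    # Standard linear blendshape model: mesh = base + sum_k w_k * delta_k
    return base_mesh + np.tensordot(weights, blendshape_deltas, axes=1)
```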
With continued reference to FIG. 3, a schematic diagram of one application scenario of a method for generating an animation according to the present application is shown. In the application scenario of fig. 3, a user inputs the face images and the audio of a speech video into the terminal 301, and the terminal 301 performs the processing of steps 202 to 205 on them to obtain an animation.
According to the method for generating the animation, the facial image and the audio matched with the facial image are utilized, the expression information and the mouth information of the face are respectively extracted, and the expression information and the mouth information are fused, so that the obtained animation is more vivid and smooth, and the stability under the condition that the mouth is shielded is improved.
With continued reference to FIG. 4, a flow 400 of another embodiment of a method for generating an animation according to the present application is shown. As shown in fig. 4, the method for generating an animation according to the present embodiment may include the following steps:
step 401, acquiring a face image set and an audio matched with the face image set.
Step 402, extracting feature information of the face images in the face image set.
In this embodiment, the execution subject may perform feature extraction on the face images in the face image set to obtain feature information for each face image. Specifically, the execution subject may use a cascaded Hourglass network for feature extraction. The Hourglass network was proposed by a research team from the University of Michigan at the European Conference on Computer Vision (ECCV 2016) and works well for human pose estimation.
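A heavily simplified, single-module sketch in the spirit of an Hourglass encoder-decoder with skip connections; the real cascaded Hourglass network is much deeper, and the channel counts and depth here are assumptions.

```python
# Heavily simplified Hourglass-style module for feature extraction
# (the actual cascaded Hourglass network is deeper; sizes are assumptions).
import torch
from torch import nn


class MiniHourglass(nn.Module):
    def __init__(self, channels: int = 64, depth: int = 3):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(depth)])
        self.up = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        # Encoder: repeatedly downsample while keeping skip features.
        for conv in self.down:
            skips.append(x)
            x = torch.relu(conv(x))
        # Decoder: upsample back and add the symmetric skip feature.
        for conv, skip in zip(self.up, reversed(skips)):
            x = nn.functional.interpolate(x, size=skip.shape[-2:], mode="nearest")
            x = torch.relu(conv(x)) + skip
        return x
```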
Step 403, determining the head pose information of the face object according to the feature information and the pre-trained head pose information determination model.
The execution subject may input the obtained feature information into a pre-trained head pose information determination model to obtain the head pose information of the face object. Here, the head pose information determination model is used to characterize the correspondence between feature information and head pose information. The model may include several convolutional layers and a fully connected layer: the convolutional layers further extract features from the feature information, and the fully connected layer maps the extracted features to the head pose information of the face object.
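A minimal sketch matching the description above (several convolutional layers followed by a fully connected layer); treating the output as six pose values relative to the camera is an assumption.

```python
# Minimal head-pose regressor: a few convolutional layers followed by a
# fully connected layer. The six-value output (yaw, pitch, roll, tx, ty, tz)
# is an assumption, not fixed by the application.
import torch
from torch import nn


class HeadPoseNet(nn.Module):
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # pool to a fixed-size vector
        )
        self.fc = nn.Linear(256, 6)         # head pose relative to the camera

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        x = self.convs(feature_map).flatten(1)
        return self.fc(x)
```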
Step 404, determining the first expression information of the face object according to the feature information and the pre-trained expression information determination model.
The execution subject may further input the feature information obtained in step 402 into a pre-trained expression information determination model to obtain the first expression information of the face object. Here, the expression information determination model is used to characterize the correspondence between feature information and expression information. The expression information determination model may include a plurality of sub-models, each extracting feature information of a different part of the face. For example, one sub-model extracts feature information of the eyes, another extracts feature information of the mouth, and another extracts feature information of the eyebrows. The expression information determination model may further include a fully connected layer that splices the feature information extracted by the sub-models together to obtain the first expression information.
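A sketch of such a multi-branch expression model: one small sub-model per facial part, with a fully connected layer splicing the branch outputs; the branch set, dimensions and sigmoid output range are assumptions.

```python
# Sketch of an expression model with one sub-model per facial part whose
# outputs are concatenated and passed through a fully connected layer.
# Branch set and dimensions are assumptions.
import torch
from torch import nn


class ExpressionNet(nn.Module):
    def __init__(self, feature_dim: int = 256, n_expression_params: int = 32):
        super().__init__()
        # One small sub-model per face region.
        self.branches = nn.ModuleDict({
            part: nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU())
            for part in ("eyes", "mouth", "eyebrows")
        })
        # Fully connected layer that splices the per-part features together.
        self.fc = nn.Linear(64 * 3, n_expression_params)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        parts = [branch(features) for branch in self.branches.values()]
        first_expression = torch.sigmoid(self.fc(torch.cat(parts, dim=-1)))
        return first_expression   # each value in [0, 1]
```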
Step 405, determining the mouth information of the face object according to the pre-trained mouth information determination model and the audio.
In some optional implementations of this embodiment, the mouth information determination model may be obtained through the following steps not shown in fig. 4: acquiring a training sample set, wherein the training sample comprises audio and labeled mouth information corresponding to each phoneme in the audio; and taking the audio of the training samples in the training sample set as input, taking the labeled mouth information corresponding to each phoneme in the input audio as expected output, and training to obtain a mouth information determination model.
In this implementation, the execution subject may first obtain a set of training samples. The training samples may include audio and labeled mouth information corresponding to each phoneme in the audio. Here, a technician may label the mouth of the face object in a video containing face images, for example by assigning a value according to how far the mouth is open. Then the execution subject may train with the audio of the training samples as input and the labeled mouth information corresponding to each phoneme in the input audio as expected output, obtaining the mouth information determination model.
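A minimal supervised training loop for this setup, assuming the labeled mouth information has been aligned to per-frame audio features; the MSE loss and Adam optimizer are assumed choices, not specified by the application.

```python
# Sketch of training the mouth information determination model:
# audio features in, labeled per-phoneme mouth information as expected output.
# MSE loss and Adam are assumptions.
import torch
from torch import nn


def train_mouth_model(model: nn.Module, dataloader, epochs: int = 10, lr: float = 1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for audio_features, labeled_mouth_info in dataloader:
            optimizer.zero_grad()
            predicted = model(audio_features)              # model output
            loss = criterion(predicted, labeled_mouth_info)
            loss.backward()
            optimizer.step()
    return model
```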
It should be noted that the training of the mouth information determination model may be performed by the execution subject of the present embodiment, or may be performed by other electronic devices besides the execution subject. When executed by another electronic device, the other electronic device may transmit the mouth information determination model obtained by training to the execution subject of the present embodiment.
Step 406, determining a first weight corresponding to the first expression information and a second weight corresponding to the mouth information.
The execution subject may set weights for the first expression information and the mouth information, respectively, when fusing the two. Specifically, the execution subject may determine the first weight according to the sharpness of the face images in the face image set or the proportion of frontal face images among them, and may determine the second weight according to the degree of noise contained in the audio.
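One conceivable way to derive the two weights; Laplacian-variance sharpness and the linear mappings are assumptions, since the application only states that the weights depend on image quality and audio noise.

```python
# One way to derive the two fusion weights. Laplacian-variance sharpness and
# the linear mappings to weights are assumptions.
import cv2
import numpy as np


def estimate_weights(face_images, audio_noise_level: float):
    # Image quality: mean variance of the Laplacian (higher = sharper).
    sharpness = np.mean([
        cv2.Laplacian(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), cv2.CV_64F).var()
        for img in face_images
    ])
    first_weight = float(np.clip(sharpness / 1000.0, 0.0, 1.0))
    # Audio quality: audio_noise_level is assumed normalized to [0, 1];
    # the noisier the audio, the smaller the second weight.
    second_weight = float(np.clip(1.0 - audio_noise_level, 0.0, 1.0))
    return first_weight, second_weight
```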
Step 407, determining second expression information according to the first expression information, the first weight, the mouth information, and the second weight.
After determining the first weight and the second weight, the execution subject may determine the second expression information from the first expression information, the first weight, the mouth information and the second weight. Specifically, the execution subject may compute the product of the first expression information and the first weight and the product of the mouth information and the second weight, and then add the two products to obtain the second expression information.
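The weighted sum described above, applied per expression parameter; it assumes the mouth information and the first expression information share parameter keys.

```python
# The weighted fusion described above: second = w1 * first + w2 * mouth,
# applied per parameter (key alignment between the two dicts is assumed).
def weighted_fusion(first_expression: dict, mouth_info: dict,
                    first_weight: float, second_weight: float) -> dict:
    second_expression = dict(first_expression)
    for key, mouth_value in mouth_info.items():
        base = first_expression.get(key, 0.0)
        second_expression[key] = first_weight * base + second_weight * mouth_value
    return second_expression
```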
Step 408, outputting the second expression information and receiving modification information for the second expression information.
In this embodiment, the execution subject may also output the second expression information directly for the user to view or adjust. In some applications, the face animation the user wants to generate is more exaggerated than the real face, and the second expression information can then be adjusted manually.
If the user needs to adjust the second expression information, the execution subject may receive modification information for the second expression information. The modification information may include the values of the adjusted parameters.
Step 409, generating an animation based on the head pose information and the modification information.
The execution subject may generate an animation based on the head pose information and the modification information. An animation generated in this way better matches the user's requirements.
According to the method for generating an animation provided in this embodiment, the weights of the first expression information and the mouth information can be adjusted dynamically according to the quality of the face images and of the audio, improving the accuracy of the facial expression depiction as far as possible; and the method can interact with the user, allowing the user to adjust the expression information, which improves the user experience.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating an animation, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the animation generation apparatus 500 of the present embodiment includes: an acquisition unit 501, a first determination unit 502, a second determination unit 503, a fusion unit 504, and a generation unit 505.
An obtaining unit 501 configured to obtain a face image set and audio matched with the face image set.
The first determining unit 502 is configured to analyze the facial images in the facial image set and determine the head pose information and the first expression information of the facial object.
A second determination unit 503 configured to determine mouth information of the human face object based on the pre-trained mouth information determination model and the audio. The mouth information determination model is used for representing the corresponding relation between the audio and the mouth information.
A fusion unit 504 configured to fuse the first expression information and the mouth information and determine second expression information.
A generating unit 505 configured to generate an animation based on the head pose information and the second expression information.
In some optional implementations of this embodiment, the fusion unit 504 may be further configured to: determining a first weight corresponding to the first expression information and a second weight corresponding to the mouth information; and determining second expression information according to the first expression information, the first weight, the mouth information and the second weight.
In some optional implementations of this embodiment, the apparatus 500 may further include a receiving unit, not shown in fig. 5, configured to: outputting second expression information; modification information for the second expression information is received.
In some optional implementations of this embodiment, the generating unit 505 may be further configured to: based on the head pose information and the modification information, an animation is generated.
In some optional implementations of this embodiment, the first determining unit 502 may be further configured to: extract feature information of the face images in the face image set; determine the head pose information of the face object according to the feature information and a pre-trained head pose information determination model, wherein the head pose information determination model is used for representing the correspondence between feature information and head pose information; and determine the first expression information of the face object according to the feature information and a pre-trained expression information determination model, wherein the expression information determination model is used for representing the correspondence between feature information and expression information.
In some optional implementations of this embodiment, the apparatus 500 may further include a training unit, not shown in fig. 5, configured to: acquiring a training sample set, wherein the training sample comprises audio and labeled mouth information corresponding to each phoneme in the audio; and taking the audio of the training samples in the training sample set as input, taking the labeled mouth information corresponding to each phoneme in the input audio as expected output, and training to obtain a mouth information determination model.
It should be understood that units 501 to 505, which are described in the apparatus 500 for generating an animation, correspond to respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above for the method for generating an animation are equally applicable to the apparatus 500 and the units included therein, and will not be described again here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for performing the method for generating an animation according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, together with multiple memories. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein for generating an animation. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein for generating an animation.
The memory 602, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the execution of the method for generating an animation in the embodiment of the present application (for example, the acquisition unit 501, the first determination unit 502, the second determination unit 503, the fusion unit 504, and the generation unit 505 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, that is, implementing the method for generating animation performed in the above method embodiment.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device that performs generation of animation, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 602 optionally includes memory located remotely from processor 601, which may be connected over a network to an electronic device executing instructions for generating an animation. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device performing the method for generating an animation may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to performing user settings and function control of the electronic apparatus for generating an animation, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services.
According to the technical scheme of the embodiment of the application, the facial image and the audio matched with the facial image are utilized to respectively extract the expression information and the mouth information of the face, and the expression information and the mouth information are fused, so that the obtained animation is more vivid and smooth, and the stability under the condition that the mouth is shielded is improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method for generating an animation, comprising:
acquiring a face image set and audio matched with the face image set;
analyzing the facial images in the facial image set to determine head pose information and first expression information of a facial object;
determining mouth information of the human face object according to a pre-trained mouth information determination model and the audio, wherein the mouth information determination model is used for representing the corresponding relation between the audio and the mouth information;
fusing the first expression information and the mouth information to determine second expression information;
generating an animation based on the head pose information and the second expression information.
2. The method of claim 1, wherein the fusing the first expression information and the mouth information to determine second expression information comprises:
determining a first weight corresponding to the first expression information and a second weight corresponding to the mouth information;
and determining the second expression information according to the first expression information, the first weight, the mouth information and the second weight.
3. The method of claim 1, wherein the method further comprises:
outputting the second expression information;
receiving modification information for the second expression information.
4. The method of claim 3, wherein the generating an animation based on the head pose information and the second expression information comprises:
generating an animation based on the head pose information and the modification information.
5. The method of claim 1, wherein the analyzing the facial images in the facial image set to determine the head pose information and the first expression information of the human face object comprises:
extracting feature information of the face images in the face image set;
determining the head pose information of the human face object according to the feature information and a pre-trained head pose information determination model, wherein the head pose information determination model is used for representing the corresponding relation between the feature information and the head pose information;
and determining the first expression information of the human face object according to the feature information and a pre-trained expression information determination model, wherein the expression information determination model is used for representing the corresponding relation between the feature information and the expression information.
6. The method of claim 1, wherein the mouth information determination model is trained by:
acquiring a training sample set, wherein the training sample comprises audio and labeled mouth information corresponding to each phoneme in the audio;
and taking the audio of the training samples in the training sample set as input, taking the labeled mouth information corresponding to each phoneme in the input audio as expected output, and training to obtain the mouth information determination model.
7. An apparatus for generating an animation, comprising:
an acquisition unit configured to acquire a face image set and an audio matched with the face image set;
the first determining unit is configured to analyze the facial images in the facial image set and determine head posture information and first expression information of a facial object;
a second determining unit, configured to determine mouth information of the face object according to a pre-trained mouth information determination model and the audio, wherein the mouth information determination model is used for representing a corresponding relation between the audio and the mouth information;
a fusion unit configured to fuse the first expression information and the mouth information and determine second expression information;
a generating unit configured to generate an animation based on the head pose information and the second expression information.
8. The apparatus of claim 7, wherein the fusion unit is further configured to:
determining a first weight corresponding to the first expression information and a second weight corresponding to the mouth information;
and determining the second expression information according to the first expression information, the first weight, the mouth information and the second weight.
9. The apparatus of claim 7, wherein the apparatus further comprises a receiving unit configured to:
outputting the second expression information;
receiving modification information for the second expression information.
10. The apparatus of claim 9, wherein the generating unit is further configured to:
generating an animation based on the head pose information and the modification information.
11. The apparatus of claim 7, wherein the first determining unit is further configured to:
extracting feature information of the face images in the face image set;
determining the head pose information of the human face object according to the feature information and a pre-trained head pose information determination model, wherein the head pose information determination model is used for representing the corresponding relation between the feature information and the head pose information;
and determining the first expression information of the human face object according to the feature information and a pre-trained expression information determination model, wherein the expression information determination model is used for representing the corresponding relation between the feature information and the expression information.
12. The apparatus of claim 7, wherein the apparatus further comprises a training unit configured to:
acquiring a training sample set, wherein the training sample comprises audio and labeled mouth information corresponding to each phoneme in the audio;
and taking the audio of the training samples in the training sample set as input, taking the labeled mouth information corresponding to each phoneme in the input audio as expected output, and training to obtain the mouth information determination model.
13. An electronic device for generating an animation, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010710731.0A 2020-07-22 2020-07-22 Method, apparatus, device and storage medium for generating animation Withdrawn CN111862277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710731.0A CN111862277A (en) 2020-07-22 2020-07-22 Method, apparatus, device and storage medium for generating animation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710731.0A CN111862277A (en) 2020-07-22 2020-07-22 Method, apparatus, device and storage medium for generating animation

Publications (1)

Publication Number Publication Date
CN111862277A true CN111862277A (en) 2020-10-30

Family

ID=73002320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710731.0A Withdrawn CN111862277A (en) 2020-07-22 2020-07-22 Method, apparatus, device and storage medium for generating animation

Country Status (1)

Country Link
CN (1) CN111862277A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101306A (en) * 2020-11-10 2020-12-18 成都市谛视科技有限公司 Fine facial expression capturing method and device based on RGB image
CN112330781A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model and generating human face animation
CN112562027A (en) * 2020-12-02 2021-03-26 北京百度网讯科技有限公司 Face model generation method and device, electronic equipment and storage medium
CN112614213B (en) * 2020-12-14 2024-01-23 杭州网易云音乐科技有限公司 Facial expression determining method, expression parameter determining model, medium and equipment
CN112614213A (en) * 2020-12-14 2021-04-06 杭州网易云音乐科技有限公司 Facial expression determination method, expression parameter determination model, medium and device
CN113077536A (en) * 2021-04-20 2021-07-06 深圳追一科技有限公司 Mouth action driving model training method and assembly based on BERT model
CN113077536B (en) * 2021-04-20 2024-05-28 深圳追一科技有限公司 Mouth action driving model training method and component based on BERT model
WO2023284435A1 (en) * 2021-07-14 2023-01-19 华为云计算技术有限公司 Method and apparatus for generating animation
CN114581570A (en) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114581570B (en) * 2022-03-01 2024-01-26 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
WO2023231712A1 (en) * 2022-05-30 2023-12-07 中兴通讯股份有限公司 Digital human driving method, digital human driving device and storage medium
CN115049016A (en) * 2022-07-20 2022-09-13 聚好看科技股份有限公司 Model driving method and device based on emotion recognition
CN115049016B (en) * 2022-07-20 2024-06-14 聚好看科技股份有限公司 Model driving method and device based on emotion recognition

Similar Documents

Publication Publication Date Title
CN111862277A (en) Method, apparatus, device and storage medium for generating animation
CN111833418B (en) Animation interaction method, device, equipment and storage medium
CN111652828B (en) Face image generation method, device, equipment and medium
CN112667068A (en) Virtual character driving method, device, equipment and storage medium
CN111968203B (en) Animation driving method, device, electronic equipment and storage medium
CN111277912B (en) Image processing method and device and electronic equipment
CN112270711B (en) Model training and posture prediction method, device, equipment and storage medium
CN112529073A (en) Model training method, attitude estimation method and apparatus, and electronic device
CN111860362A (en) Method and device for generating human face image correction model and correcting human face image
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN110806865A (en) Animation generation method, device, equipment and computer readable storage medium
CN111539897A (en) Method and apparatus for generating image conversion model
CN112330781A (en) Method, device, equipment and storage medium for generating model and generating human face animation
CN111709875A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113362263A (en) Method, apparatus, medium, and program product for changing the image of a virtual idol
CN114862992A (en) Virtual digital human processing method, model training method and device thereof
JP7393388B2 (en) Face editing method, device, electronic device and readable storage medium
CN112988100A (en) Video playing method and device
CN111523467A (en) Face tracking method and device
CN114187392A (en) Virtual even image generation method and device and electronic equipment
CN111524123B (en) Method and apparatus for processing image
CN112116548A (en) Method and device for synthesizing face image
CN112464009A (en) Method and device for generating pairing image, electronic equipment and storage medium
CN112529154A (en) Image generation model training method and device and image generation method and device
CN112381927A (en) Image generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201030