CN113870399B - Expression driving method and device, electronic equipment and storage medium - Google Patents

Expression driving method and device, electronic equipment and storage medium

Info

Publication number
CN113870399B
Authority
CN
China
Prior art keywords
image
facial
expression
sample
dimensional
Legal status
Active
Application number
CN202111117185.0A
Other languages
Chinese (zh)
Other versions
CN113870399A
Inventor
梁柏荣
郭知智
洪智滨
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111117185.0A
Publication of CN113870399A
Priority to PCT/CN2022/088311 (published as WO2023045317A1)
Application granted
Publication of CN113870399B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides an expression driving method and device, an electronic device and a storage medium, relates to the field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to scenes such as face image processing and face recognition. The specific implementation scheme is as follows: a source image with an expression and a target image without an expression are respectively input into a three-dimensional expression model to obtain a plurality of first facial attributes and a plurality of second facial attributes; at least part of the first facial attributes are used to replace the corresponding facial attributes in the second facial attributes; three-dimensional facial reconstruction and rendering are performed on the replaced second facial attributes; and expression driving is performed on the rendered three-dimensional facial image through an expression driving model. In this way, the facial expression and the facial pose in the source image and the target image can be decoupled, so that the facial expression and the facial pose of the target image can be controlled independently, which better supports more diverse expression driving.

Description

Expression driving method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning, which can be applied to scenes such as face image processing and face recognition, and more particularly to an expression driving method and apparatus, an electronic device, and a storage medium.
Background
Facial expression driving is an important computer vision technology. Its task is to drive the facial expression of a target picture with an expression picture so that the facial expressions of the two pictures are as consistent as possible. Facial expression driving is widely used in entertainment applications.
Disclosure of Invention
The disclosure provides a method and a device for expression driving, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided an expression driving method including: acquiring a source image with an expression and a target image without an expression; inputting the source image and the target image into a three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; replacing corresponding facial attributes in the plurality of second facial attributes with at least part of the plurality of first facial attributes to obtain a plurality of replaced second facial attributes; performing three-dimensional facial reconstruction and rendering on the face in the target image according to the plurality of replaced second facial attributes to obtain a rendered three-dimensional facial image; and inputting the rendered three-dimensional facial image into an expression driving model so as to perform expression driving on the face in the target image.
According to another aspect of the present disclosure, there is provided an expression driving apparatus including: a first acquisition module, configured to acquire a source image with an expression and a target image without an expression; a second acquisition module, configured to input the source image and the target image into a three-dimensional expression model respectively, so as to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; a replacing module, configured to replace corresponding facial attributes in the plurality of second facial attributes with at least part of the plurality of first facial attributes, so as to obtain a plurality of second facial attributes after replacement processing; a processing module, configured to perform three-dimensional facial reconstruction and rendering on the face in the target image according to the plurality of replaced second facial attributes to obtain a rendered three-dimensional facial image; and a driving module, configured to input the rendered three-dimensional facial image into an expression driving model so as to perform expression driving on the face in the target image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of an embodiment of the first aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a flowchart of an expression driving method according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Facial expression driving is an important computer vision technology. Its task is to drive the facial expression of a target picture with an expression picture so that the facial expressions of the two pictures are as consistent as possible. Facial expression driving is widely used in entertainment applications.
In the related art, 2D facial key points of a driving image are detected, and expression transfer is performed based on these 2D key points to generate the corresponding expression-driven face picture.
However, expression driving based on 2D facial key points cannot decouple the expression from the facial pose: when the pose of the driving picture differs greatly from that of the target image, the pose of the generated picture follows the driving image, the original pose of the target image cannot be maintained, and more diverse expression driving cannot be supported.
In order to solve the above problems, the present disclosure provides an expression driving method, an expression driving apparatus, an electronic device, and a storage medium.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. It should be noted that the expression driving method of the embodiment of the present disclosure may be applied to the expression driving apparatus of the embodiment of the present disclosure, and the apparatus may be configured in an electronic device. The electronic device may be a mobile terminal, for example, a mobile phone, a tablet computer, a personal digital assistant, and other hardware devices with various operating systems.
As shown in fig. 1, the expression driving method may include the steps of:
step 101, obtaining a source image with an expression and a target image without the expression.
In the embodiment of the disclosure, an object may be photographed with an image acquisition device to obtain a source image with an expression and a target image without an expression, or the source image and the target image may be downloaded from a network. The expression in the source image may include a happy, angry, or excited facial expression.
Step 102, inputting the source image and the target image into the three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image.
In order to decouple the individual facial attributes from one another, the source image and the target image may be respectively input into a three-dimensional expression model, and the three-dimensional expression model may output a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image. It should be noted that the first facial attributes and the second facial attributes each include at least one of a facial expression, a facial pose, facial lighting, and a facial shape, and the first facial attributes may differ from the second facial attributes.
In addition, it should be noted that the three-dimensional expression model may include an encoding layer and a decoding layer. The encoding layer maps the source image and the target image to the plurality of first facial attributes and the plurality of second facial attributes respectively, thereby decoupling the individual facial attributes from one another. The decoding layer performs three-dimensional facial reconstruction on the face in the target image according to the plurality of replaced second facial attributes to obtain a reconstructed three-dimensional facial image.
As an application scenario, in face image processing and face recognition scenarios, the three-dimensional expression model may be a statistical 3D face morphable model (3DMM). In order to decouple the individual facial attributes from one another, the source image and the target image may be respectively input into the encoding layer of the 3DMM to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image.
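By way of illustration only, the following sketch shows one possible shape of such an encoding layer. The `Encoder3DMM` class, its toy backbone, and all coefficient dimensions are hypothetical assumptions made for this example; they are not specified by the present disclosure.

```python
import torch
import torch.nn as nn

class Encoder3DMM(nn.Module):
    """Maps a face image to grouped 3DMM coefficients (illustrative dims)."""
    def __init__(self, shape_dim=80, exp_dim=64, pose_dim=6, light_dim=27):
        super().__init__()
        # Toy backbone standing in for a real face-image feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.dims = (shape_dim, exp_dim, pose_dim, light_dim)
        self.head = nn.Linear(32, sum(self.dims))

    def forward(self, image: torch.Tensor) -> dict:
        coeffs = self.head(self.backbone(image))
        shape, exp, pose, light = torch.split(coeffs, self.dims, dim=1)
        # One coefficient group per facial attribute: this grouping is what
        # makes the attributes separately replaceable downstream.
        return {"shape": shape, "exp": exp, "pose": pose, "light": light}
```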
Step 103, replacing the corresponding facial attributes in the second facial attributes with at least part of the first facial attributes to obtain a plurality of replaced second facial attributes.
In order to make the target image keep the original facial pose and only perform expression driving on the target image, in the embodiment of the present disclosure, at least part of the plurality of first facial attributes may be used to replace the corresponding facial attributes in the plurality of second facial attributes, so as to obtain a plurality of second facial attributes after replacement processing. For example, the facial expression in the second facial attribute may be replaced with the facial expression in the first facial attribute, and the second facial attribute after replacing the facial expression may be used as the plurality of second facial attributes after the replacement processing.
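A minimal sketch of this replacement step, assuming the attribute dictionaries produced by the hypothetical encoder sketched above; only the expression coefficients change hands, which is precisely what decouples the expression from the pose:

```python
def replace_expression(source_attrs: dict, target_attrs: dict) -> dict:
    """Overwrite only the target's expression with the source's expression."""
    driven = dict(target_attrs)           # keep target shape, pose, light
    driven["exp"] = source_attrs["exp"]   # take the expression from the source
    return driven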
Step 104, performing three-dimensional face reconstruction and rendering on the face in the target image according to the plurality of replaced second face attributes to obtain a rendered three-dimensional face image.
In order to present the replaced plurality of second facial attributes, the replaced plurality of second facial attributes may be input into a decoding layer of the three-dimensional expression model to obtain a reconstructed three-dimensional face image. Further, a rendered three-dimensional face image is obtained by a 3D rendering technique.
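Sketched below is how this step might be wired together; `decoder` (the decoding layer of the three-dimensional expression model) and `renderer` (a standard 3D rasterizer) are hypothetical callables assumed for illustration, not components named by the present disclosure.

```python
def reconstruct_and_render(decoder, renderer, driven_attrs: dict):
    """3D face reconstruction from the replaced attributes, then rendering."""
    mesh = decoder(driven_attrs["shape"], driven_attrs["exp"],
                   driven_attrs["pose"], driven_attrs["light"])
    return renderer(mesh)  # the rendered three-dimensional face image
```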
Step 105, inputting the rendered three-dimensional face image into an expression driving model so as to perform expression driving on the face in the target image.
It can be understood that the rendered three-dimensional face image has low realism. Therefore, in order to make the expression-driven target image more realistic, in the embodiment of the present disclosure, the rendered three-dimensional face image may be input into the expression driving model to perform expression driving on the face in the target image.
In summary, a source image with an expression and a target image without an expression are acquired; the source image and the target image are respectively input into a three-dimensional expression model to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; corresponding facial attributes in the second facial attributes are replaced with at least part of the first facial attributes to obtain the replaced second facial attributes; three-dimensional facial reconstruction and rendering are performed on the face in the target image according to the plurality of replaced second facial attributes to obtain a rendered three-dimensional facial image; and the rendered three-dimensional facial image is input into an expression driving model to perform expression driving on the face in the target image. In this way, the facial expression and the facial pose in the source image and the target image can be decoupled, so that the facial expression and the facial pose of the target image can be controlled independently, which better supports more diverse expression driving.
In order to keep the original facial pose of the target image and drive only its expression, the facial expression in the second facial attributes may be replaced with the facial expression in the plurality of first facial attributes to obtain the second facial attributes after replacement processing, as shown in fig. 2, which is a schematic diagram according to a second embodiment of the present disclosure. The embodiment shown in fig. 2 may include the following steps:
step 201, a source image with an expression and a target image without the expression are obtained.
Step 202, inputting the source image and the target image into the three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image.
Step 203, performing replacement processing on the facial expression in the second facial attributes according to the facial expression in the plurality of first facial attributes.
In the disclosed embodiment, each of the first and second face attributes may include: facial shape, facial pose, facial expression, and facial illumination, and the facial expression in the second facial attribute may be replaced with the facial expression in the first facial attribute.
Step 204, using the facial expression after the replacement processing in the second facial attributes, together with the facial pose, facial shape, and facial illumination retained in the second facial attributes, as the plurality of second facial attributes after replacement processing.
That is, after the facial expression in the second facial attributes is replaced with the facial expression in the first facial attributes, the replaced facial expression, together with the originally retained facial pose, facial shape, and facial illumination, may be used as the plurality of second facial attributes after replacement processing.
Step 205, performing three-dimensional face reconstruction and rendering on the face in the target image according to the plurality of replaced second face attributes to obtain a rendered three-dimensional face image.
Step 206, inputting the rendered three-dimensional face image into an expression driving model so as to perform expression driving on the face in the target image.
It should be noted that the execution processes of steps 201 to 202 and steps 205 to 206 may be implemented by any one of the embodiments of the present disclosure, and the embodiments of the present disclosure do not limit this and are not described again.
In conclusion, the facial expression in the second facial attribute is replaced according to the facial expression in the plurality of first facial attributes; the facial expression after the replacement processing in the second facial attribute and the facial pose, the facial shape and the facial illumination retained by the replacement processing in the second facial attribute are used as a plurality of second facial attributes after the replacement processing. Therefore, the target image can keep the original facial posture, and only the expression of the target image is driven.
In order to perform face reconstruction on the replaced plurality of second facial attributes, as shown in fig. 3, which is a schematic diagram according to a third embodiment of the present disclosure, three-dimensional face reconstruction and rendering may be performed on the face in the target image according to the replaced plurality of second facial attributes to obtain a rendered three-dimensional face image. The embodiment shown in fig. 3 may include the following steps:
step 301, acquiring a source image with an expression and a target image without the expression.
Step 302, inputting the source image and the target image into the three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image.
And step 303, replacing the corresponding face attribute in the second face attributes with at least part of the first face attributes to obtain a plurality of replaced second face attributes.
Step 304, performing three-dimensional face reconstruction on the face in the target image according to the plurality of replaced second face attribute coefficients to obtain a reconstructed three-dimensional face image.
In the embodiment of the disclosure, the plurality of second facial attribute coefficients after the replacement processing may be input into a decoding layer of the three-dimensional expression model, and the three-dimensional expression model may output a reconstructed three-dimensional facial image.
Step 305, performing three-dimensional face rendering on the reconstructed three-dimensional face image to obtain a rendered three-dimensional face image.
In order to make the acquired three-dimensional face image more accurate and real, a 3D rendering technology can be adopted to perform three-dimensional face rendering on the reconstructed three-dimensional face image so as to obtain a rendered three-dimensional face image.
Step 306, inputting the rendered three-dimensional face image into an expression driving model to perform expression driving on the face in the target image.
It should be noted that the execution processes of steps 301 to 303 and step 306 may be implemented by any one of the embodiments of the present disclosure, and the embodiments of the present disclosure do not limit this, and are not described again.
In conclusion, the face in the target image is subjected to three-dimensional face reconstruction according to the plurality of second face attribute coefficients after the replacement processing, so that a reconstructed three-dimensional face image is obtained; and performing three-dimensional face rendering on the reconstructed three-dimensional face image to obtain a rendered three-dimensional face image, so that the plurality of replaced second face attributes can be subjected to face reconstruction.
In order to enable the expression driving model to perform expression driving on the rendered three-dimensional facial image and obtain a more realistic face driving image, as shown in fig. 4, which is a schematic diagram according to a fourth embodiment of the present disclosure, the expression driving model may be trained to output a more realistic face driving image before the rendered three-dimensional facial image is input into it. The embodiment shown in fig. 4 may include the following steps:
step 401, acquiring a source image with an expression and a target image without the expression.
Step 402, inputting the source image and the target image into the three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image.
Step 403, replacing the corresponding facial attributes in the second facial attributes with at least part of the first facial attributes to obtain a plurality of replaced second facial attributes.
Step 404, performing three-dimensional face reconstruction and rendering on the face in the target image according to the plurality of replaced second face attributes to obtain a rendered three-dimensional face image.
Step 405, acquiring a plurality of sample images with expressions.
In the embodiment of the present disclosure, multiple frames of sample images with expressions may be captured by an image acquisition device or downloaded from a network. It should be noted that the multiple frames of sample images with expressions may be sample images of the same object with different expressions, or sample images of different objects with different expressions.
Step 406, inputting the sample image into a coding layer of the three-dimensional expression model for each frame of sample image to obtain a sample facial attribute corresponding to the sample image; wherein the sample facial attributes comprise: at least one of a sample facial expression, a sample facial shape, a sample facial pose, and a sample facial illumination.
Further, each frame of sample image of a plurality of frames of sample images with expressions may be respectively input into an encoding layer of a three-dimensional expression model, and the three-dimensional expression model may output a sample facial attribute corresponding to each frame of sample image, where it should be noted that the sample facial attribute may include: at least one of a sample facial expression, a sample facial shape, a sample facial pose, and a sample facial illumination.
Step 407, inputting the sample facial expression, the sample facial shape, the sample facial pose and the sample facial illumination into a decoding layer of the three-dimensional expression model to perform three-dimensional facial reconstruction on the face in the sample image, so as to obtain a reconstructed three-dimensional sample facial image.
Furthermore, the sample facial expression, the sample facial shape, the sample facial posture and the sample facial illumination can be input into a decoding layer of the three-dimensional expression model, and the three-dimensional expression model can perform three-dimensional facial reconstruction on the sample facial attribute to obtain a reconstructed three-dimensional sample facial image.
And step 408, performing three-dimensional face rendering on the reconstructed three-dimensional sample face image to obtain a rendered three-dimensional sample face image.
In the embodiment of the disclosure, the three-dimensional face rendering can be performed on the reconstructed three-dimensional sample face image by using a three-dimensional rendering technology, so as to obtain a rendered three-dimensional sample face image.
Step 409, training the initial expression driving model according to the rendered three-dimensional sample facial image and the sample image to generate the expression driving model.
As an example: inputting the rendered three-dimensional sample facial image into an initial expression driving model to obtain an expression prediction image; determining a loss function value according to the difference between the sample image and the expression prediction image; and training the initial expression driving model according to the loss function value so as to minimize the loss function value.
That is, in order to improve the accuracy of the expression driving model, the rendered three-dimensional sample facial image may be input into an initial expression driving model, and the initial expression driving model may output an expression prediction image. The sample image may then be compared with the expression prediction image to determine the difference between them, and a loss function value may be determined according to the difference. For example, the loss function value may include a first sub-loss function value and a second sub-loss function value. The first sub-loss function value may be determined according to the absolute value of the difference between the sample image and the expression prediction image. Meanwhile, the sample image and the expression prediction image may be input into a trained visual graphics generator (VGG) to generate a semantic vector corresponding to the sample image and a semantic vector corresponding to the expression prediction image, and the second sub-loss function value may be determined according to the absolute value of the difference between the two semantic vectors. Furthermore, according to the loss function value, the initial expression driving model may be trained by gradient back-propagation so as to minimize the loss function value.
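As a non-authoritative sketch, the two sub-losses described above could be implemented as follows. The use of torchvision's pretrained VGG16 and the cutoff after relu3_3 are assumptions; the disclosure only requires a trained VGG as the feature extractor.

```python
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen feature extractor for the second (semantic) sub-loss. The layer
# cutoff (up to relu3_3) is an illustrative choice, not mandated above;
# newer torchvision versions pass weights=VGG16_Weights.IMAGENET1K_V1
# instead of pretrained=True.
vgg_features = vgg16(pretrained=True).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def expression_driving_loss(pred, sample):
    l1 = F.l1_loss(pred, sample)                 # first sub-loss: pixel L1
    perceptual = F.l1_loss(vgg_features(pred),   # second sub-loss: distance
                           vgg_features(sample)) # between VGG features
    return l1 + perceptual
```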
As another example: image normalization processing is performed on the rendered three-dimensional sample facial image and the sample image to obtain a target three-dimensional sample facial image; the target three-dimensional sample facial image is input into the initial expression driving model to obtain an expression prediction image; a loss function value is determined according to the difference between the sample image and the expression prediction image; and the initial expression driving model is trained according to the loss function value so as to minimize the loss function value.
In order to distribute the data of the rendered three-dimensional sample facial image and the sample image over the same range, reduce the difference between them, and facilitate training of the initial expression driving model, image normalization processing may be performed on the rendered three-dimensional sample facial image and the sample image to obtain a target three-dimensional sample facial image. For example, the pixel value of each pixel in the rendered three-dimensional sample facial image and the sample image may be divided by 255 and then reduced by 0.5, so that the pixel value of each pixel lies in [-0.5, 0.5]. Then, the target three-dimensional sample facial image may be input into the initial expression driving model, the initial expression driving model may output an expression prediction image, the sample image may be compared with the expression prediction image to determine the difference between them, and a loss function value may be determined according to the difference. According to the loss function value, the initial expression driving model may be trained by gradient back-propagation so as to minimize the loss function value.
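A minimal sketch of this normalization, assuming 8-bit input images; the function name is illustrative:

```python
import numpy as np

def normalize_image(img_uint8: np.ndarray) -> np.ndarray:
    """Scale pixels from [0, 255] to [-0.5, 0.5] before training."""
    return img_uint8.astype(np.float32) / 255.0 - 0.5
```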
Step 410, inputting the rendered three-dimensional face image into an expression driving model to perform expression driving on the face in the target image.
It should be noted that the execution processes of steps 401 to 404 and step 410 may be implemented by any one of the embodiments of the present disclosure, and the embodiments of the present disclosure do not limit this, and are not described again.
In summary, multiple frames of sample images with expressions are acquired; for each frame of sample image, the sample image is input into the encoding layer of the three-dimensional expression model to obtain the sample facial attributes corresponding to the sample image, where the sample facial attributes include at least one of a sample facial expression, a sample facial shape, a sample facial pose, and sample facial illumination; the sample facial expression, the sample facial shape, the sample facial pose, and the sample facial illumination are input into the decoding layer of the three-dimensional expression model to perform three-dimensional facial reconstruction on the face in the sample image and obtain a reconstructed three-dimensional sample facial image; three-dimensional face rendering is performed on the reconstructed three-dimensional sample facial image to obtain a rendered three-dimensional sample facial image; and the initial expression driving model is trained according to the rendered three-dimensional sample facial image and the sample image to generate the expression driving model. In this way, the expression driving model can perform expression driving on the rendered three-dimensional facial image to obtain a more realistic face driving image.
In order to more clearly illustrate the above embodiments, the description will now be made by way of example.
For example, as shown in fig. 5, source may denote the source image with an expression, target may denote the target image without an expression, and 3DMM may denote the three-dimensional expression model. The source image and the target image may be respectively input into the encoding layer of the 3DMM to obtain the shape (facial shape), pose (facial pose), light (facial illumination), and exp (facial expression) corresponding to the source image, and the shape, pose, light, and exp corresponding to the target image. Then, the exp of the target image is replaced by the exp of the source image, so that the replaced facial attributes corresponding to the target image include the replaced exp together with the original shape, pose, and light retained in the target image. Furthermore, the replaced facial attributes corresponding to the target image may be input into the decoding layer of the 3DMM model for three-dimensional facial reconstruction and rendering to obtain a rendered three-dimensional facial image. Finally, the rendered three-dimensional facial image may be input into a translator model (the expression driving model), which outputs the expression-driven image corresponding to the target image.
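Putting the pieces together, the flow of fig. 5 could be sketched as follows, reusing the hypothetical components from the earlier snippets (`Encoder3DMM`, `replace_expression`, `reconstruct_and_render`) plus a `translator` callable standing in for the trained expression driving model; none of these names are defined by the present disclosure.

```python
def drive_expression(encoder, decoder, renderer, translator,
                     source_image, target_image):
    src_attrs = encoder(source_image)    # the expression comes from here
    tgt_attrs = encoder(target_image)    # shape, pose, light are kept
    driven = replace_expression(src_attrs, tgt_attrs)
    rendered = reconstruct_and_render(decoder, renderer, driven)
    return translator(rendered)          # expression-driven target image
```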
According to the expression driving method, a source image with an expression and a target image without an expression are acquired; the source image and the target image are respectively input into a three-dimensional expression model to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; corresponding facial attributes in the second facial attributes are replaced with at least part of the first facial attributes to obtain the replaced second facial attributes; three-dimensional facial reconstruction and rendering are performed on the face in the target image according to the plurality of replaced second facial attributes to obtain a rendered three-dimensional facial image; and the rendered three-dimensional facial image is input into an expression driving model to perform expression driving on the face in the target image. In this way, the facial expression and the facial pose in the source image and the target image can be decoupled, so that the facial expression and the facial pose of the target image can be controlled independently, which better supports more diverse expression driving.
In order to realize the embodiment, the present disclosure further provides an expression driving apparatus.
Fig. 6 is a schematic diagram according to a fifth embodiment of the present disclosure, and as shown in fig. 6, an expression driving apparatus 600 includes: a first acquisition module 610, a second acquisition module 620, a replacement module 630, a processing module 640, and a driving module 650.
The first obtaining module 610 is configured to obtain a source image with an expression and a target image without the expression; a second obtaining module 620, configured to input the source image and the target image into the three-dimensional expression model respectively, so as to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; a replacing module 630, configured to replace, by at least part of the plurality of first facial attributes, corresponding facial attributes in the plurality of second facial attributes to obtain a plurality of second facial attributes after replacement processing; the processing module 640 is configured to perform three-dimensional face reconstruction and rendering on a face in the target image according to the plurality of second face attributes after the replacement processing, so as to obtain a rendered three-dimensional face image; and the driving module 650 is configured to input the rendered three-dimensional facial image into an expression driving model, so as to perform expression driving on the face in the target image.
As a possible implementation manner of the embodiment of the present disclosure, the replacing module 630 is specifically configured to: performing replacement processing on the facial expression in the second facial attribute according to the facial expression in the plurality of first facial attributes; the facial expression after the replacement processing in the second facial attribute and the facial pose, the facial shape and the facial illumination retained by the replacement processing in the second facial attribute are used as a plurality of second facial attributes after the replacement processing.
As a possible implementation manner of the embodiment of the present disclosure, the processing module 640 is specifically configured to: according to the plurality of second face attribute coefficients after the replacement processing, performing three-dimensional face reconstruction on the face in the target image to obtain a reconstructed three-dimensional face image; and performing three-dimensional face rendering on the reconstructed three-dimensional face image to obtain a rendered three-dimensional face image.
As a possible implementation manner of the embodiment of the present disclosure, the three-dimensional expression model includes a coding layer and a decoding layer; the encoding layer is used for respectively inputting a source image and a target image into the three-dimensional expression model so as to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; and the decoding layer is used for carrying out three-dimensional face reconstruction on the face in the target image according to the plurality of replaced second face attributes to obtain a reconstructed three-dimensional face image.
As a possible implementation manner of the embodiment of the present disclosure, the expression driving apparatus 600 further includes: the device comprises a third acquisition module, a fourth acquisition module, a reconstruction module, a rendering module and a training module.
The third acquisition module is used for acquiring multiple frames of sample images with expressions; the fourth acquisition module is used for inputting the sample image into the encoding layer of the three-dimensional expression model for each frame of sample image, so as to acquire the sample facial attributes corresponding to the sample image, where the sample facial attributes include at least one of a sample facial expression, a sample facial shape, a sample facial pose, and sample facial illumination; the reconstruction module is used for inputting the sample facial expression, the sample facial shape, the sample facial pose, and the sample facial illumination into the decoding layer of the three-dimensional expression model, so as to perform three-dimensional facial reconstruction on the face in the sample image and obtain a reconstructed three-dimensional sample facial image; the rendering module is used for performing three-dimensional face rendering on the reconstructed three-dimensional sample facial image to obtain a rendered three-dimensional sample facial image; and the training module is used for training the initial expression driving model according to the rendered three-dimensional sample facial image and the sample image so as to generate the expression driving model.
As a possible implementation manner of the embodiment of the present disclosure, the training module is specifically configured to: inputting the rendered three-dimensional sample facial image into an initial expression driving model to obtain an expression predicted image; determining a loss function value according to the difference between the sample image and the expression predicted image; and training the initial expression driving model according to the loss function value so as to minimize the loss function value.
As a possible implementation manner of the embodiment of the present disclosure, the training module is specifically configured to: perform image normalization processing on the rendered three-dimensional sample facial image and the sample image to obtain a target three-dimensional sample facial image; input the target three-dimensional sample facial image into the initial expression driving model to obtain an expression prediction image; determine a loss function value according to the difference between the sample image and the expression prediction image; and train the initial expression driving model according to the loss function value so as to minimize the loss function value.
The expression driving device of the embodiment of the disclosure acquires a source image with an expression and a target image without an expression; inputs the source image and the target image into a three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image; replaces corresponding facial attributes in the second facial attributes with at least part of the first facial attributes to obtain the replaced second facial attributes; performs three-dimensional facial reconstruction and rendering on the face in the target image according to the plurality of replaced second facial attributes to obtain a rendered three-dimensional facial image; and inputs the rendered three-dimensional facial image into an expression driving model to perform expression driving on the face in the target image. In this way, the facial expression and the facial pose in the source image and the target image can be decoupled, so that the facial expression and the facial pose of the target image can be controlled independently, which better supports more diverse expression driving.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all carried out on the premise of obtaining the consent of the user, and all accord with the regulation of related laws and regulations without violating the good custom of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as the expression driving method. For example, in some embodiments, the expression driver method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the expression driving method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the expression driving method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (13)

1. An expression driving method comprising:
acquiring a source image with an expression and a target image without the expression;
inputting the source image and the target image into a three-dimensional expression model respectively to obtain a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image;
replacing corresponding face attributes in the second face attributes with at least part of face attributes in the first face attributes to obtain replaced second face attributes;
according to the plurality of second facial attributes after the replacement processing, performing three-dimensional facial reconstruction and rendering on the face in the target image to obtain a rendered three-dimensional facial image;
inputting the rendered three-dimensional facial image into an expression driving model so as to perform expression driving on the face in the target image;
before the inputting the rendered three-dimensional facial image into an expression driving model, the method further includes:
acquiring a plurality of frames of sample images with expressions;
inputting the sample image into a coding layer of the three-dimensional expression model aiming at each frame of the sample image so as to obtain a sample face attribute corresponding to the sample image; wherein the sample facial attributes comprise: at least one of a sample facial expression, a sample facial shape, a sample facial pose, and a sample facial illumination;
inputting the sample facial expression, the sample facial shape, the sample facial posture and the sample facial illumination into a decoding layer of the three-dimensional expression model so as to carry out three-dimensional facial reconstruction on the face in the sample image and obtain a reconstructed three-dimensional sample facial image;
performing three-dimensional face rendering on the reconstructed three-dimensional sample face image to obtain a rendered three-dimensional sample face image;
training an initial expression driving model according to the rendered three-dimensional sample facial image and the sample image to generate the expression driving model;
the training of an initial expression driving model according to the rendered three-dimensional sample facial image and the sample image to generate the expression driving model comprises:
inputting the rendered three-dimensional sample facial image into an initial expression driving model to obtain an expression predicted image;
determining a loss function value according to the difference between the sample image and the expression predicted image;
training the initial expression driving model according to the loss function value so as to minimize the loss function value;
wherein the loss function value comprises a first sub-loss function value and a second sub-loss function value, and the determining of the loss function value according to the difference between the sample image and the expression predicted image comprises:
determining the first sub-loss function value according to the absolute value of the difference between the sample image and the expression predicted image;
and inputting the sample image and the expression predicted image into a trained eye view image generator to generate a semantic vector corresponding to the sample image and a semantic vector corresponding to the expression predicted image, and determining the second sub-loss function value according to the absolute value of the difference between the two semantic vectors.
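For orientation, the following Python sketch shows one plausible wiring of the inference pipeline and the two-part training loss recited in claim 1. Every name in it (drive_expression, expr_model, image_generator, the attribute keys, and so on) is an illustrative assumption; the patent discloses no source code or API.

```python
# Illustrative sketch only: the model objects and attribute keys below are
# assumptions made for readability, not names disclosed by the patent.
import torch

def drive_expression(source_img, target_img, expr_model, driver, render_fn):
    # Encode both images into facial attributes
    # (expression, shape, pose, illumination).
    first_attrs = expr_model.encode(source_img)    # from the source image
    second_attrs = expr_model.encode(target_img)   # from the target image

    # Replace only the target's expression coefficients with the source's;
    # the target's shape, pose and illumination are kept, which is what
    # decouples expression from pose.
    second_attrs["expression"] = first_attrs["expression"]

    # Reconstruct a 3D face from the replaced attributes, render it, and
    # let the expression driving model produce the final driven frame.
    mesh = expr_model.decode(second_attrs)
    rendered = render_fn(mesh)
    return driver(rendered)

def training_loss(sample_img, predicted_img, image_generator):
    # First sub-loss: mean absolute (L1) pixel difference between the
    # sample frame and the predicted frame.
    first_loss = (sample_img - predicted_img).abs().mean()

    # Second sub-loss: L1 distance between the semantic vectors produced
    # by a trained image generator for the two images.
    sem_sample = image_generator(sample_img)
    sem_pred = image_generator(predicted_img)
    second_loss = (sem_sample - sem_pred).abs().mean()

    return first_loss + second_loss
```

How the two sub-loss values are combined is not specified in the claim, so the equal-weight sum above is an assumption.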
2. The method of claim 1, wherein the replacing corresponding facial attributes in the plurality of second facial attributes with at least part of the facial attributes in the plurality of first facial attributes to obtain the plurality of second facial attributes after replacement processing comprises:
performing replacement processing on the facial expression in the plurality of second facial attributes according to the facial expression in the plurality of first facial attributes;
and taking the replaced facial expression, together with the facial pose, the facial shape, and the facial illumination that remain unchanged by the replacement processing in the plurality of second facial attributes, as the plurality of second facial attributes after the replacement processing.
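A minimal sketch of this replacement step, assuming the facial attributes are carried in a dictionary keyed by attribute name (a data structure the claims do not prescribe):

```python
def replace_expression(first_attrs: dict, second_attrs: dict) -> dict:
    # Copy the target's attributes so pose, shape and illumination
    # survive untouched, then swap in only the source's expression.
    replaced = dict(second_attrs)
    replaced["expression"] = first_attrs["expression"]
    return replaced
```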
3. The method of claim 1, wherein the performing three-dimensional facial reconstruction and rendering of the face in the target image according to the plurality of second facial attributes after the replacement processing to obtain a rendered three-dimensional facial image comprises:
performing three-dimensional facial reconstruction on the face in the target image according to the plurality of second facial attributes after the replacement processing, to obtain a reconstructed three-dimensional facial image;
and performing three-dimensional face rendering on the reconstructed three-dimensional facial image to obtain the rendered three-dimensional facial image.
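If the decoding layer follows a conventional 3DMM-style linear face model, the reconstruction-then-rendering step of claim 3 might look like the sketch below; the basis matrices, the rotation-translation pose representation, and render_fn are all assumptions rather than details taken from the claims.

```python
import numpy as np

def reconstruct_and_render(attrs, mean_shape, shape_basis, expr_basis, render_fn):
    # Linear 3DMM-style reconstruction: mean face plus shape and
    # expression offsets, reshaped to an (N, 3) array of vertices.
    verts = np.asarray(
        mean_shape
        + shape_basis @ attrs["shape"]
        + expr_basis @ attrs["expression"]
    ).reshape(-1, 3)

    # Apply the target's (retained) rigid head pose.
    R, t = attrs["pose"]          # rotation (3, 3) and translation (3,)
    verts = verts @ R.T + t

    # Rasterize under the target's retained illumination coefficients.
    return render_fn(verts, lighting=attrs["illumination"])
```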
4. The method of claim 3, wherein the three-dimensional expression model comprises an encoding layer and a decoding layer;
the encoding layer is configured to encode the source image and the target image respectively, so as to obtain the plurality of first facial attributes corresponding to the source image and the plurality of second facial attributes corresponding to the target image;
and the decoding layer is configured to perform three-dimensional facial reconstruction on the face in the target image according to the plurality of second facial attributes after the replacement processing, to obtain the reconstructed three-dimensional facial image.
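A minimal PyTorch-style skeleton of the encoding-layer/decoding-layer split of claim 4 is sketched below. The backbone, feature width, and per-attribute head dimensions are placeholder choices made only so the sketch runs; the decoding layer (omitted) would consume these coefficients as in the 3DMM sketch after claim 3.

```python
import torch
import torch.nn as nn

class ThreeDExpressionModel(nn.Module):
    """Skeleton only: layer sizes and backbone are assumed, not disclosed."""

    def __init__(self, feat_dim=512, attr_dims=(64, 80, 6, 27)):
        super().__init__()
        # Encoding layer: image -> shared feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # One regression head per facial attribute.
        names = ("expression", "shape", "pose", "illumination")
        self.heads = nn.ModuleDict(
            {n: nn.Linear(feat_dim, d) for n, d in zip(names, attr_dims)}
        )

    def encode(self, image: torch.Tensor) -> dict:
        feat = self.backbone(image)
        return {name: head(feat) for name, head in self.heads.items()}
```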
5. The method of claim 1, wherein the training of the initial expression driving model according to the rendered three-dimensional sample facial image and the sample image to generate the expression driving model comprises:
performing image normalization processing on the rendered three-dimensional sample facial image and the sample image to obtain a target three-dimensional sample facial image;
inputting the target three-dimensional sample facial image into an initial expression driving model to obtain an expression predicted image;
determining a loss function value according to the difference between the sample image and the expression predicted image;
and training the initial expression driving model according to the loss function value so as to minimize the loss function value.
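Claim 5 does not pin the normalization to particular statistics; a common convention, assumed here purely for illustration, is to map pixel values from [0, 1] into [-1, 1] before the images reach the driving model:

```python
import torch

def normalize_pair(rendered: torch.Tensor, sample: torch.Tensor,
                   mean: float = 0.5, std: float = 0.5):
    # Placeholder statistics: map [0, 1] pixels to roughly [-1, 1].
    # The normalization actually used by the patent is not disclosed.
    return (rendered - mean) / std, (sample - mean) / std
```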
6. An expression driving apparatus comprising:
the first acquisition module is used for acquiring a source image with an expression and a target image without the expression;
the second acquisition module is used for respectively inputting the source image and the target image into the three-dimensional expression model so as to acquire a plurality of first facial attributes corresponding to the source image and a plurality of second facial attributes corresponding to the target image;
a replacing module, configured to replace corresponding facial attributes in the plurality of second facial attributes with at least part of the facial attributes in the plurality of first facial attributes, so as to obtain a plurality of second facial attributes after replacement processing;
the processing module is used for performing three-dimensional facial reconstruction and rendering on the face in the target image according to the plurality of second facial attributes after the replacement processing, to obtain a rendered three-dimensional facial image;
the driving module is used for inputting the rendered three-dimensional facial image into an expression driving model, so as to perform expression driving on the face in the target image;
the device further comprises:
the third acquisition module is used for acquiring a plurality of frames of sample images with expressions;
a fourth obtaining module, configured to, for each frame of the sample image, input the sample image into a coding layer of the three-dimensional expression model to obtain a sample facial attribute corresponding to the sample image; wherein the sample facial attributes comprise: at least one of a sample facial expression, a sample facial shape, a sample facial pose, and a sample facial illumination;
the reconstruction module is used for inputting the sample facial expression, the sample facial shape, the sample facial pose and the sample facial illumination into a decoding layer of the three-dimensional expression model, so as to perform three-dimensional facial reconstruction on the face in the sample image and obtain a reconstructed three-dimensional sample facial image;
the rendering module is used for performing three-dimensional face rendering on the reconstructed three-dimensional sample face image to obtain a rendered three-dimensional sample face image;
the training module is used for training an initial expression driving model according to the rendered three-dimensional sample facial image and the sample image so as to generate the expression driving model;
the training module is specifically configured to:
inputting the rendered three-dimensional sample facial image into an initial expression driving model to obtain an expression predicted image;
determining a loss function value according to the difference between the sample image and the expression predicted image;
training the initial expression driving model according to the loss function value so as to minimize the loss function value;
wherein the loss function value comprises a first sub-loss function value and a second sub-loss function value, and the determining of the loss function value according to the difference between the sample image and the expression predicted image comprises:
determining the first sub-loss function value according to the absolute value of the difference between the sample image and the expression predicted image;
and inputting the sample image and the expression predicted image into a trained eye view image generator to generate a semantic vector corresponding to the sample image and a semantic vector corresponding to the expression predicted image, and determining the second sub-loss function value according to the absolute value of the difference between the two semantic vectors.
7. The apparatus according to claim 6, wherein the replacement module is specifically configured to:
performing replacement processing on the facial expression in the plurality of second facial attributes according to the facial expression in the plurality of first facial attributes;
and taking the replaced facial expression, together with the facial pose, the facial shape, and the facial illumination that remain unchanged by the replacement processing in the plurality of second facial attributes, as the plurality of second facial attributes after the replacement processing.
8. The apparatus according to claim 6, wherein the processing module is specifically configured to:
performing three-dimensional facial reconstruction on the face in the target image according to the plurality of second facial attributes after the replacement processing, to obtain a reconstructed three-dimensional facial image;
and performing three-dimensional face rendering on the reconstructed three-dimensional facial image to obtain the rendered three-dimensional facial image.
9. The apparatus of claim 8, wherein the three-dimensional expression model comprises an encoding layer and a decoding layer;
the encoding layer is configured to encode the source image and the target image respectively, so as to obtain the plurality of first facial attributes corresponding to the source image and the plurality of second facial attributes corresponding to the target image;
and the decoding layer is configured to perform three-dimensional facial reconstruction on the face in the target image according to the plurality of second facial attributes after the replacement processing, to obtain the reconstructed three-dimensional facial image.
10. The apparatus of claim 6, wherein the training module is specifically configured to:
performing image normalization processing on the rendered three-dimensional sample facial image and the sample image to obtain a target three-dimensional sample facial image;
inputting the target three-dimensional sample facial image into an initial expression driving model to obtain an expression predicted image;
determining a loss function value according to the difference between the sample image and the expression predicted image;
and training the initial expression driving model according to the loss function value so as to minimize the loss function value.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1-5.
CN202111117185.0A 2021-09-23 2021-09-23 Expression driving method and device, electronic equipment and storage medium Active CN113870399B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111117185.0A CN113870399B (en) 2021-09-23 2021-09-23 Expression driving method and device, electronic equipment and storage medium
PCT/CN2022/088311 WO2023045317A1 (en) 2021-09-23 2022-04-21 Expression driving method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111117185.0A CN113870399B (en) 2021-09-23 2021-09-23 Expression driving method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113870399A (en) 2021-12-31
CN113870399B (en) 2022-12-02

Family

ID=78993646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111117185.0A Active CN113870399B (en) 2021-09-23 2021-09-23 Expression driving method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113870399B (en)
WO (1) WO2023045317A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870399B (en) * 2021-09-23 2022-12-02 北京百度网讯科技有限公司 Expression driving method and device, electronic equipment and storage medium
CN115984947B (en) * 2023-02-21 2023-06-27 北京百度网讯科技有限公司 Image generation method, training device, electronic equipment and storage medium
CN117115317A (en) * 2023-08-10 2023-11-24 北京百度网讯科技有限公司 Avatar driving and model training method, apparatus, device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298917A * 2019-07-05 2019-10-01 北京华捷艾米科技有限公司 Facial reconstruction method and system
CN110941332A (en) * 2019-11-06 2020-03-31 北京百度网讯科技有限公司 Expression driving method and device, electronic equipment and storage medium
CN111599002A (en) * 2020-05-15 2020-08-28 北京百度网讯科技有限公司 Method and apparatus for generating image
CN111968203A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Animation driving method, animation driving device, electronic device, and storage medium
CN112215050A (en) * 2019-06-24 2021-01-12 北京眼神智能科技有限公司 Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
WO2021012590A1 (en) * 2019-07-22 2021-01-28 广州华多网络科技有限公司 Facial expression shift method, apparatus, storage medium, and computer device
CN112907725A (en) * 2021-01-22 2021-06-04 北京达佳互联信息技术有限公司 Image generation method, image processing model training method, image processing device, and image processing program
US11055514B1 (en) * 2018-12-14 2021-07-06 Snap Inc. Image face manipulation
CN113221847A (en) * 2021-06-07 2021-08-06 广州虎牙科技有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113313085A (en) * 2021-07-28 2021-08-27 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113344777A (en) * 2021-08-02 2021-09-03 中国科学院自动化研究所 Face changing and replaying method and device based on three-dimensional face decomposition

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944238B * 2010-09-27 2011-11-23 浙江大学 Data-driven facial expression synthesis method based on Laplace transformation
US20200020173A1 (en) * 2018-07-16 2020-01-16 Zohirul Sharif Methods and systems for constructing an animated 3d facial model from a 2d facial image
GB2586260B (en) * 2019-08-15 2021-09-15 Huawei Tech Co Ltd Facial image processing
CN110868598B * 2019-10-17 2021-06-22 上海交通大学 Video content replacement method and system based on generative adversarial network
CN113327278B (en) * 2021-06-17 2024-01-09 北京百度网讯科技有限公司 Three-dimensional face reconstruction method, device, equipment and storage medium
CN113870399B (en) * 2021-09-23 2022-12-02 北京百度网讯科技有限公司 Expression driving method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Real-time facial expression transfer method combining 3DMM and GAN; 高翔 (Gao Xiang) et al.; Computer Applications and Software (《计算机应用与软件》); 2020-04-12 (Issue 04); pp. 119-126 *

Also Published As

Publication number Publication date
WO2023045317A1 (en) 2023-03-30
CN113870399A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
CN113870399B (en) Expression driving method and device, electronic equipment and storage medium
CN113643412A (en) Virtual image generation method and device, electronic equipment and storage medium
CN112562069B (en) Method, device, equipment and storage medium for constructing three-dimensional model
EP3876204A2 (en) Method and apparatus for generating human body three-dimensional model, device and storage medium
CN113052962B (en) Model training method, information output method, device, equipment and storage medium
CN113658309A (en) Three-dimensional reconstruction method, device, equipment and storage medium
CN113963110A (en) Texture map generation method and device, electronic equipment and storage medium
CN114549710A (en) Virtual image generation method and device, electronic equipment and storage medium
CN112989970A (en) Document layout analysis method and device, electronic equipment and readable storage medium
CN113365146B (en) Method, apparatus, device, medium and article of manufacture for processing video
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN115147265A (en) Virtual image generation method and device, electronic equipment and storage medium
CN114937478B (en) Method for training a model, method and apparatus for generating molecules
CN114549728A (en) Training method of image processing model, image processing method, device and medium
CN112528995A (en) Method for training target detection model, target detection method and device
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
CN113962845B (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN113380269B (en) Video image generation method, apparatus, device, medium, and computer program product
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN115359166B (en) Image generation method and device, electronic equipment and medium
CN115393488B (en) Method and device for driving virtual character expression, electronic equipment and storage medium
CN115147547B (en) Human body reconstruction method and device
CN113421335B (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN113240780B (en) Method and device for generating animation
CN114078097A (en) Method and device for acquiring image defogging model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant