CN110620884A - Expression-driven-based virtual video synthesis method and device and storage medium - Google Patents
Info
- Publication number
- CN110620884A CN110620884A CN201910885913.9A CN201910885913A CN110620884A CN 110620884 A CN110620884 A CN 110620884A CN 201910885913 A CN201910885913 A CN 201910885913A CN 110620884 A CN110620884 A CN 110620884A
- Authority
- CN
- China
- Prior art keywords
- expression
- image
- images
- synthesized
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23424—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
Abstract
The invention relates to the technical field of video synthesis, and provides an expression-driven virtual video synthesis method, device, and storage medium. The method comprises the following steps: acquiring an image set to be synthesized, and determining an image to be synthesized from the image set; synthesizing the image to be synthesized with a target image based on a GAN network to form a target photo, wherein the target image is an original image of the user; intercepting a plurality of frames of an unprocessed original video as reference images; performing expression driving on the target photo based on the reference images to obtain transmission images corresponding to the virtual video to be transmitted; and splicing the frames of the transmission images to form the virtual composite video. The invention synthesizes a photo for video chat from the user's own photo and a stranger's photo, and then, through expression driving, turns the synthesized photo into a video that resembles the user, so that the video approaches the user's real appearance while protecting the user's privacy.
Description
Technical Field
The invention relates to the technical field of video synthesis, in particular to a virtual video synthesis method and device based on expression driving and a computer readable storage medium.
Background
At present, virtual video synthesis is widely applied in many fields and has a large market. Virtual social networking is an important application in the field of virtual reality, and virtual object driving can be applied to it to drive personalized avatars, thereby enhancing the realism and interactivity of virtual social networking and optimizing the user's virtual reality experience.
However, existing virtual video synthesis, as used in film, animation, and game production, mainly relies on face motion capture devices to track the changes of a real human face and map them onto a virtual character to drive the character's mouth shape and expression; it cannot synthesize a virtual video that resembles the user's own facial features.
Similarly, in today's social applications, video chat between strangers is common. In this scenario, how to chat with a video that is close to one's real appearance yet is not one's real face is a technical problem that urgently needs to be solved.
Disclosure of Invention
The invention provides an expression-driven virtual video synthesis method, an electronic device, and a computer-readable storage medium. Its main purpose is to synthesize a photo for video chat from the user's own photo and a stranger's photo, and then, through expression driving, turn the synthesized photo into a video that resembles the user, thereby approaching the user's real appearance while protecting the user's privacy.
In order to achieve the above object, the present invention provides a virtual video synthesis method based on expression driving, applied to an electronic device, the method including:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably,
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target picture is consistent with the expression of the image with the same expression in the image to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
To achieve the above object, the present invention also provides an electronic device, including a memory and a processor, wherein the memory stores an expression-driven virtual video synthesis program which, when executed by the processor, implements the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium that stores an expression-driven virtual video synthesis program; when the program is executed by a processor, any step of the expression-driven virtual video synthesis method described above is implemented.
According to the expression-driven virtual video synthesis method, the electronic device, and the computer-readable storage medium, a photo for video chat is synthesized from the user's own photo and a stranger's photo, and the synthesized photo is then turned, through expression driving, into a video that resembles the user, so that the video approaches the user's real appearance while protecting the user's privacy.
Drawings
FIG. 1 is a schematic diagram of an application environment of a virtual video synthesis method based on expression driving according to a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a preferred embodiment of the expression-driven-based virtual video composition program of FIG. 1;
FIG. 3 is a flowchart illustrating a method for synthesizing a virtual video based on expression driving according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a virtual video synthesis method based on expression driving, which is applied to an electronic device 1. Fig. 1 is a schematic diagram of an application environment of a virtual video synthesis method based on expression driving according to a preferred embodiment of the present invention.
In the present embodiment, the electronic device 1 may be a terminal device with computing capability, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.
The electronic device 1 includes: a processor 12, a memory 11, an imaging device 13, a network interface 14, and a communication bus 15.
The memory 11 includes at least one type of readable storage medium. The readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments, it may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card.
In the present embodiment, the readable storage medium of the memory 11 is generally used for storing the expression-driven virtual video synthesis program 10 installed in the electronic device 1, a face image sample library, and pre-trained AU classifiers, emotion classifiers, and the like. The memory 11 may also be used to temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), microprocessor or other data Processing chip in some embodiments, and is used to execute program codes stored in the memory 11 or process data, such as executing the expression-driven-based virtual video composition program 10.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the electronic apparatus 1 and other electronic devices.
The communication bus 15 is used to realize connection communication between these components.
Fig. 1 only shows the electronic device 1 with components 11-15, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may alternatively be implemented.
Optionally, the electronic device 1 may further include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or other equipment with voice recognition capability, and a voice output device such as a loudspeaker or headset. Optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 1 may further comprise a display, which may also be referred to as a display screen or a display unit. In some embodiments, the display device may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
Optionally, the electronic device 1 further comprises a touch sensor. The area provided by the touch sensor for the user to perform touch operation is called a touch area. Further, the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor, or may be a plurality of sensors arranged in an array, for example.
The area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, a display is stacked with the touch sensor to form a touch display screen. The device detects touch operation triggered by a user based on the touch display screen.
Optionally, the electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described herein again.
In the apparatus embodiment shown in fig. 1, a memory 11, which is a kind of computer storage medium, may include therein an operating system, and an expression-driven based virtual video composition program 10; the processor 12 executes the expression-driven virtual video composition program 10 stored in the memory 11 to implement the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images corresponding to the actual video;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably,
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target picture is consistent with the expression of the image with the same expression in the image to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
The electronic device 1 provided in the above embodiment reaches real-time or near-real-time speed by optimizing the expression-driving process, and combines it with the GAN network (which cannot run in real time), thereby solving the difficulty that strangers making friends by video do not want to, or cannot fully, reveal their real selves. With the expression-driven virtual video synthesis method, a realistic video closer to the user can be synthesized, which reduces the sense of incongruity in the video while protecting personal privacy; a similar portrait replaces the real portrait in the video conversation, and the expressions are more real and natural, closer to the emotion the user wants to express.
In other embodiments, the expression-driven virtual video composition program 10 may also be divided into one or more modules, which are stored in the memory 11 and executed by the processor 12 to implement the present invention. A module here refers to a series of computer program instruction segments capable of performing a specified function. Referring to fig. 2, a block diagram of a preferred embodiment of the expression-driven-based virtual video composition program 10 of fig. 1 is shown. The program 10 may be divided into: an image-to-be-synthesized determining unit 11, a target photo synthesizing unit 12, a reference image acquiring unit 13, a transmission image determining unit 14, and a transmission video forming unit 15. The functions or operation steps performed by modules 11-15 are similar to those described above and are not detailed here, where:
the image to be synthesized determining unit 11 is configured to acquire an image set to be synthesized, and determine an image to be synthesized from the image set to be synthesized;
a target photo synthesizing unit 12, configured to synthesize the image to be synthesized and a target image based on a GAN network to form a target photo, where the target image is an original image of a user;
a reference image acquisition unit 13 for cutting a plurality of frame images in the unprocessed original video as reference images;
a transmission image determining unit 14, configured to perform expression driving on the target photo based on the reference image to obtain a transmission image corresponding to a to-be-transmitted virtual transmission video;
a transmission video forming unit 15, configured to splice frames of the transmission images to form the virtual composite video.
In addition, the invention also provides a virtual video synthesis method based on expression driving. Fig. 3 is a flowchart illustrating a virtual video composition method based on expression driving according to a preferred embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the expression-driven virtual video synthesis method includes:
s110: and acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized.
This step can also be understood as material selection. The image set to be synthesized contains multiple groups of images; each group consists of photos of the same person with different expressions, and the number of photos equals the number of expression bases. For example, 47 expression bases means the person has 47 mutually independent expressions. The number of expression bases can be set as required: the more expression bases, the more refined and accurate the expression fitting, but the higher the computational complexity, the longer the time needed to process one image, and the lower the frame rate (frames processed per second) may become, so real-time performance may be lost. Conversely, the fewer the expression bases, the faster the processing, but the larger the possible expression error. The specific number of expression bases can therefore be set according to actual requirements.
In this step, the image set to be synthesized is recommended according to the user's preferences and historical usage records. The image set to be synthesized includes multiple groups of images; each group includes each expression image of the same person, and these expression images serve as the expression bases of the image to be synthesized. A group of expression-base images can also be shot for the user according to the requirements of the expression bases.
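The linear expression-base representation described above can be sketched as follows. This is an illustrative toy, not from the patent text: the `blend` helper and the three-coordinate "faces" are hypothetical, and a real model would blend thousands of 3D mesh vertices per basis.

```python
def blend(mean_face, expr_bases, coeffs):
    """Return mean_face + sum_i coeffs[i] * expr_bases[i], per coordinate."""
    assert len(expr_bases) == len(coeffs)
    out = list(mean_face)
    for basis, c in zip(expr_bases, coeffs):
        for j, delta in enumerate(basis):
            out[j] += c * delta
    return out

# Two mutually independent expression bases (the patent mentions e.g. 47).
mean = [0.0, 0.0, 0.0]
bases = [[1.0, 0.0, 0.0],   # basis 0: e.g. a "mouth open" displacement
         [0.0, 1.0, 0.0]]   # basis 1: e.g. a "brow raise" displacement
print(blend(mean, bases, [0.5, 0.25]))  # [0.5, 0.25, 0.0]
```

Under this representation, "fitting an expression" reduces to choosing the coefficient vector, which is why more bases mean finer expressions but more computation per frame.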
S120: and synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user.
The GAN network takes two images as input and outputs one image. The target image is a group of images or photos corresponding to the expression bases of the image to be synthesized. The step of synthesizing the image to be synthesized with the target image includes: controlling the expression features during synthesis, and synthesizing the target image with the image of the same expression in the image to be synthesized, so that the expression of the synthesized target photo is consistent with the expression of the target image.
Specifically, the GAN network, i.e., the generative adversarial network, trains two neural networks against each other: one attempts to generate a synthetic image indistinguishable from a real photograph, while the other attempts to tell synthetic images from real ones. After training for a period of time, the generator network can produce convincingly realistic images. Meanwhile, to ensure that the generated image is as similar as possible to the two input images, the feature space can be adjusted so that the sum of the differences between the features of the generated image and the features of the two input images is minimized, and this sum is used as a supervised loss.
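The feature-space supervision described above might look like the following toy loss. The function name and the flat feature vectors are assumptions for illustration; in practice the features would come from an intermediate layer of a trained network, not raw numbers.

```python
def feature_similarity_loss(gen_feat, feat_a, feat_b):
    """Sum of absolute differences between the generated image's features and
    the features of the two input images. Minimizing this pushes the synthesized
    face to resemble both the user's photo and the selected stranger's photo."""
    diff_a = sum(abs(g - a) for g, a in zip(gen_feat, feat_a))
    diff_b = sum(abs(g - b) for g, b in zip(gen_feat, feat_b))
    return diff_a + diff_b

# A generated feature vector "between" the two inputs scores a low loss.
print(feature_similarity_loss([0.5, 0.5], [0.0, 1.0], [1.0, 0.0]))  # 2.0
```

This supervised term is added on top of the usual adversarial loss, so the generator is rewarded both for realism and for similarity to its two inputs.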
During synthesis, the expression features are controlled so that each expression of the photo taken by the user is synthesized in one-to-one correspondence with the matching expression of the image or photo to be synthesized selected by the user, and the synthesized expression is consistent with the expression before synthesis.
In other words, to ensure that an image exists for each expression of the synthesized portrait, two sets of data are required: one set contains each expression of the "real me" (the target image, i.e., the user's own photos), and one set contains each expression of the "auxiliary image" (the selected image or photo to be synthesized). The images of each expression of the "virtual me" (the target photo) are then obtained by synthesizing them one by one, expression by expression.
Finally, a material map, namely a 3D mesh, needs to be produced for each of the different expressions.
S130: intercepting a plurality of frames from the unprocessed original video as reference images.
S140: performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
S150: splicing the frames of transmission images to form the virtual composite video.
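Steps S130 through S150 can be sketched as a simple per-frame pipeline (the function and field names below are hypothetical placeholders for the operations described, not the patent's actual implementation):

```python
# Hypothetical stub: S140, re-posing the target photo to match a reference frame.
def drive_expression(target_photo, reference_frame):
    return {"photo": target_photo, "expression": reference_frame["expression"]}

def synthesize_video(target_photo, original_frames):
    # S130: every intercepted original frame serves as a reference image;
    # S140: each frame drives the target photo to produce one transmission image.
    transmission_images = [drive_expression(target_photo, f) for f in original_frames]
    # S150: the per-frame transmission images are spliced into the virtual video.
    return transmission_images

frames = [{"expression": "smile"}, {"expression": "neutral"}]
video = synthesize_video("target.jpg", frames)
print([img["expression"] for img in video])  # ['smile', 'neutral']
```

The key property is that the output video has exactly one transmission image per reference frame, each carrying that frame's expression.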
The step of performing expression driving on the target photo based on the reference image, to obtain the transmission image corresponding to the virtual transmission video to be transmitted, comprises the following steps:
1. setting an average face, a group of expression bases and a group of identity bases;
2. setting the coefficients of the average face and the identity bases to fixed values, and controlling the expression base coefficients of the target photo to change with the changes of the expression base coefficients of the reference image, to form a transmission image corresponding to the reference image.
Further, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and forming a transmission image corresponding to the reference image includes:
1. converting the 3D meshes of the average face, the expression bases, and the identity bases into corresponding 2D images, and acquiring the corresponding face key point coordinates from the 2D images;
2. acquiring the face key point coordinates of the reference image;
3. iteratively changing the coefficients of the expression bases so that the Euclidean distance between the face key point coordinates obtained from the 2D image and those of the reference image is minimized, thereby determining a set of expression base coefficients;
4. applying the obtained expression base coefficients to the target photo so that the coefficients of the same expression bases are equal, obtaining the final transmission image.
The above processing is performed on each frame of the actual video, and the processed frames (a set of transmission images) are combined into the final transmission video.
Specifically, an average face S0 (mean face), a set of expression bases Sexp (expression base), and a set of identity bases Sid (identity base) are set first.
The expression bases, also called expression texture maps, are the required images of "true me" and "false me" under the different expressions obtained in step S120. Because the identity is fixed, the coefficients of the average face and the identity bases can be determined once and kept constant, so they can be ignored here. Therefore, by changing only the coefficients of the expression bases, the transmission image ("false me") can be controlled to change with the expression of the reference image ("true me"); this is the expression driving.
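A toy numpy sketch of the underlying blendshape model, assuming the standard linear form S = S0 + Σᵢ αid,ᵢ·Sid,ᵢ + Σⱼ αexp,ⱼ·Sexp,ⱼ (the mesh size and random bases here are purely illustrative, not the patent's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 5                                # toy mesh size
s0 = rng.normal(size=(n_vertices, 3))         # average face S0
s_id = rng.normal(size=(4, n_vertices, 3))    # identity bases Sid
s_exp = rng.normal(size=(47, n_vertices, 3))  # 47 expression bases Sexp

alpha_id = rng.normal(size=4)                 # fixed: identity never changes

def mesh(alpha_exp):
    # S = S0 + sum_i alpha_id[i]*Sid[i] + sum_j alpha_exp[j]*Sexp[j];
    # only alpha_exp varies during expression driving.
    return (s0
            + np.tensordot(alpha_id, s_id, axes=1)
            + np.tensordot(alpha_exp, s_exp, axes=1))

neutral = mesh(np.zeros(47))
smiling = mesh(rng.normal(size=47))
print(neutral.shape)  # (5, 3)
```

With the identity term frozen, two meshes built this way differ only in their expression component, which is exactly why copying the expression coefficients onto the target face transfers the expression without changing the person.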
It should be noted that the average face, expression bases, and identity bases are 3D meshes. When fitting, the 3D mesh is projected onto the 2D plane through three matrices (an identity matrix, a rotation matrix, and a projection matrix) to obtain the corresponding 2D image. The face key point coordinates of the corresponding image (the positions of the eyes, mouth, nose, and so on) are then obtained from the 2D image.
The above conversion from a 3D image to a 2D image involves a series of coordinate system changes, which can be understood simply as multiplying the (x, y, z) coordinates in the 3D coordinate system by a projection matrix, a rotation matrix, and an identity matrix to obtain the (x', y', 0) coordinates in 2D.
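The coordinate change can be illustrated with a toy orthographic example (the concrete matrices and their order here are assumptions for illustration; the patent does not fix them):

```python
import numpy as np

def project_to_2d(points_3d, rotation, projection):
    # Multiply by the identity, rotation, and projection matrices,
    # then keep (x', y') and drop the zeroed z component.
    identity = np.eye(3)
    transformed = points_3d @ identity.T @ rotation.T @ projection.T
    return transformed[:, :2]

# toy case: 90-degree rotation about z, then an orthographic projection
theta = np.pi / 2
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
proj = np.diag([1.0, 1.0, 0.0])   # orthographic: zero out z
pts = np.array([[1.0, 0.0, 2.0]])
print(project_to_2d(pts, rot, proj))  # approximately [[0., 1.]]
```

A real pipeline would use a perspective projection calibrated to the camera, but the chain of matrix multiplications is the same idea.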
During the video call, the camera detects a reference image, from which the positions of a set of key points are obtained. When determining the coefficients of the expression bases, the coefficients are changed iteratively so that the L2 loss (i.e., the Euclidean distance) between the key point positions obtained by projecting the 3D mesh (the x', y' above) and the key point positions of the reference image is minimized; the coefficients of the expression bases are thereby determined.
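The iterative fit can be sketched with a linearized landmark model (the linear model and the gradient-descent update are assumptions for illustration; the patent only specifies that the coefficients are iterated until the L2 loss is minimal):

```python
import numpy as np

rng = np.random.default_rng(1)
n_landmarks, n_bases = 68 * 2, 47   # 68 (x, y) key points, 47 expression bases
base_landmarks = rng.normal(size=n_landmarks)    # projected neutral key points
basis = rng.normal(size=(n_landmarks, n_bases))  # effect of each base on key points

true_alpha = rng.normal(size=n_bases)
reference = base_landmarks + basis @ true_alpha  # key points detected in the frame

# Iterate the expression coefficients so the L2 loss between projected
# and reference key points shrinks (plain gradient descent).
alpha = np.zeros(n_bases)
for _ in range(500):
    residual = base_landmarks + basis @ alpha - reference
    alpha -= 0.001 * 2 * basis.T @ residual      # gradient step on the L2 loss
loss = np.sum((base_landmarks + basis @ alpha - reference) ** 2)
print(loss < 1e-3)  # True once the fit has converged
```

The recovered `alpha` is the set of expression base coefficients that best reproduces the reference frame's key point positions.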
Then, the obtained coefficients of the expression bases are applied to the target photo so that the coefficients of the same expression bases are equal, ensuring that the expression of the "false me" transmission image is consistent with the expression of the reference image; that is, the false expression is driven by the real expression. The final picture of the "false me" is then synthesized from the expression bases; this is the picture that needs to be output to the other party's video.
It should be noted that the "synthesis" in step S120 is the synthesis of photos of two people with the same expression, whereas the "synthesis" of the transmission image combines different expression bases of the same person to obtain a final expression. In other words, in the photo synthesized in step S120, the resulting person differs from the two input people, but the expression is the same; in the photo synthesized for the transmission image, the person is the same, but the final expression does not belong to any one of the 47 expression bases and instead matches the expression of the reference image captured from the video frame.
Finally, each frame of the actual video is processed according to the above steps at the preset frame rate, and the processed frames are synthesized into the final transmission video. During synthesis, the calculation is simplified by the least squares method and linear regression, so real-time video transmission can be achieved.
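Under the same linearized landmark model as above, the least-squares simplification amounts to one closed-form solve per frame instead of a long iteration (a hedged sketch; `np.linalg.lstsq` stands in for whatever solver the implementation actually uses):

```python
import numpy as np

rng = np.random.default_rng(2)
n_landmarks, n_bases = 68 * 2, 47
base_landmarks = rng.normal(size=n_landmarks)
basis = rng.normal(size=(n_landmarks, n_bases))
reference = base_landmarks + basis @ rng.normal(size=n_bases)

# One closed-form least-squares solve per frame; avoiding per-frame
# iteration is what makes (near) real-time expression driving feasible.
alpha, *_ = np.linalg.lstsq(basis, reference - base_landmarks, rcond=None)
residual = np.sum((base_landmarks + basis @ alpha - reference) ** 2)
print(residual < 1e-10)  # True: this toy linear system is solved exactly
```

Since the landmark positions are linear in the coefficients under this model, the fit reduces to ordinary linear regression, consistent with the simplification described above.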
It should be noted that if only a GAN network were used to synthesize a simple face photo, and the face photos were then pasted into the video according to the face key points, the face displayed in the video would look very unnatural. To avoid this, the expression driving operation is carried out using the BlendShape technique or a skeleton-based method, so that real-time expression driving of the video can be realized.
At present, because real-time performance is difficult to achieve, GAN networks are mainly used for generating data for data augmentation, image super-resolution, style transfer, and the like, while expression driving is mainly used to drive virtual avatars, such as a cartoon character resembling the user, or a cat or dog. This gives people a sense of unreality in actual chatting and harms the user experience.
Therefore, the invention optimizes the expression driving process to reach real-time or near real-time speed, and skillfully combines it with the non-real-time GAN network, thereby solving the problem that users making video acquaintances with strangers do not want to fully reveal, or reveal at all, their real selves. With the expression-driven virtual video synthesis method, a real-person video closer to the user can be synthesized, reducing the sense of disharmony in the video while protecting personal privacy: a similar portrait replaces the real portrait in the video conversation, and the expressions are more real and natural, closer to the emotion the user wants to express.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes an expression-driven-based virtual video composition program, and when executed by a processor, the expression-driven-based virtual video composition program implements the following operations:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably,
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and controlling the expression characteristics during synthesis, synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target photo is consistent with the expression of the image with the same expression in the image to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the above-mentioned virtual video synthesis method based on expression driving and the specific implementation of the electronic device, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A virtual video synthesis method based on expression driving is applied to an electronic device, and is characterized in that the method comprises the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
2. The expression-driven-based virtual video synthesis method according to claim 1, wherein the image set to be synthesized contains a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the images to be synthesized.
3. The expression-driven-based virtual video synthesis method according to claim 2, wherein
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target picture is consistent with the expression of the image with the same expression in the image to be synthesized.
4. The expression-driven-based virtual video synthesis method according to claim 1, wherein the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to a virtual transmission video to be transmitted comprises:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
5. The expression-drive-based virtual video synthesis method according to claim 4, wherein the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficient of the expression base of the reference image, and forming the transmission image corresponding to the reference image comprises:
converting the 3D grid graph of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
6. An electronic device, comprising a memory, a processor, and an expression-driven virtual video synthesis program stored on the memory, wherein the program, when executed by the processor, implements the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
7. The electronic device of claim 6,
the image set to be synthesized comprises a plurality of groups of images, each group of images comprises a plurality of expression images of the same person, and the expression images are used as expression bases of the images to be synthesized.
8. The electronic device according to claim 6, wherein the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to a virtual transmission video to be transmitted comprises:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
9. The electronic device according to claim 8, wherein the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficient of the expression base of the reference image, and forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
10. A computer-readable storage medium, wherein the computer-readable storage medium includes an expression-driven virtual video composition program, and when the expression-driven virtual video composition program is executed by a processor, the steps of the expression-driven virtual video composition method according to any one of claims 1 to 5 are implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910885913.9A CN110620884B (en) | 2019-09-19 | 2019-09-19 | Expression-driven-based virtual video synthesis method and device and storage medium |
PCT/CN2019/118285 WO2021051605A1 (en) | 2019-09-19 | 2019-11-14 | Virtual video synthesis method and apparatus based on expression driving, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910885913.9A CN110620884B (en) | 2019-09-19 | 2019-09-19 | Expression-driven-based virtual video synthesis method and device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110620884A true CN110620884A (en) | 2019-12-27 |
CN110620884B CN110620884B (en) | 2022-04-22 |
Family
ID=68923758
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910885913.9A Active CN110620884B (en) | 2019-09-19 | 2019-09-19 | Expression-driven-based virtual video synthesis method and device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110620884B (en) |
WO (1) | WO2021051605A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111614925A (en) * | 2020-05-20 | 2020-09-01 | 广州视源电子科技股份有限公司 | Figure image processing method and device, corresponding terminal and storage medium |
CN113559503A (en) * | 2021-06-30 | 2021-10-29 | 上海掌门科技有限公司 | Video generation method, device and computer readable medium |
CN114429611A (en) * | 2022-04-06 | 2022-05-03 | 北京达佳互联信息技术有限公司 | Video synthesis method and device, electronic equipment and storage medium |
JP2022531055A (en) * | 2020-03-31 | 2022-07-06 | 北京市商▲湯▼科技▲開▼▲發▼有限公司 | Interactive target drive methods, devices, devices, and recording media |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108257195A (en) * | 2018-02-23 | 2018-07-06 | 深圳市唯特视科技有限公司 | A kind of facial expression synthetic method that generation confrontation network is compared based on geometry |
CN108288072A (en) * | 2018-01-26 | 2018-07-17 | 深圳市唯特视科技有限公司 | A kind of facial expression synthetic method based on generation confrontation network |
CN108389239A (en) * | 2018-02-23 | 2018-08-10 | 深圳市唯特视科技有限公司 | A kind of smile face video generation method based on condition multimode network |
CN108875633A (en) * | 2018-06-19 | 2018-11-23 | 北京旷视科技有限公司 | Expression detection and expression driving method, device and system and storage medium |
CN109147017A (en) * | 2018-08-28 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Dynamic image generation method, device, equipment and storage medium |
CN109308727A (en) * | 2018-09-07 | 2019-02-05 | 腾讯科技(深圳)有限公司 | Virtual image model generating method, device and storage medium |
CN109448083A (en) * | 2018-09-29 | 2019-03-08 | 浙江大学 | A method of human face animation is generated from single image |
WO2019056000A1 (en) * | 2017-09-18 | 2019-03-21 | Board Of Trustees Of Michigan State University | Disentangled representation learning generative adversarial network for pose-invariant face recognition |
CN110097086A (en) * | 2019-04-03 | 2019-08-06 | 平安科技(深圳)有限公司 | Image generates model training method, image generating method, device, equipment and storage medium |
CN110148191A (en) * | 2018-10-18 | 2019-08-20 | 腾讯科技(深圳)有限公司 | The virtual expression generation method of video, device and computer readable storage medium |
2019
- 2019-09-19 CN CN201910885913.9A patent/CN110620884B/en active Active
- 2019-11-14 WO PCT/CN2019/118285 patent/WO2021051605A1/en active Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2022531055A (en) * | 2020-03-31 | 2022-07-06 | 北京市商▲湯▼科技▲開▼▲發▼有限公司 | Interactive target drive methods, devices, devices, and recording media |
CN111614925A (en) * | 2020-05-20 | 2020-09-01 | 广州视源电子科技股份有限公司 | Figure image processing method and device, corresponding terminal and storage medium |
CN113559503A (en) * | 2021-06-30 | 2021-10-29 | 上海掌门科技有限公司 | Video generation method, device and computer readable medium |
CN113559503B (en) * | 2021-06-30 | 2024-03-12 | 上海掌门科技有限公司 | Video generation method, device and computer readable medium |
CN114429611A (en) * | 2022-04-06 | 2022-05-03 | 北京达佳互联信息技术有限公司 | Video synthesis method and device, electronic equipment and storage medium |
CN114429611B (en) * | 2022-04-06 | 2022-07-08 | 北京达佳互联信息技术有限公司 | Video synthesis method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110620884B (en) | 2022-04-22 |
WO2021051605A1 (en) | 2021-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110620884B (en) | Expression-driven-based virtual video synthesis method and device and storage medium | |
CN114930399A (en) | Image generation using surface-based neurosynthesis | |
CN109949390B (en) | Image generation method, dynamic expression image generation method and device | |
US11989348B2 (en) | Media content items with haptic feedback augmentations | |
US20220300728A1 (en) | True size eyewear experience in real time | |
US20230120037A1 (en) | True size eyewear in real time | |
US11823346B2 (en) | AR body part tracking system | |
CN116917938A (en) | Visual effect of whole body | |
US20220319059A1 (en) | User-defined contextual spaces | |
CN117136381A (en) | whole body segmentation | |
US20220319125A1 (en) | User-aligned spatial volumes | |
US20220207786A1 (en) | Flow-guided motion retargeting | |
EP4314999A1 (en) | User-defined contextual spaces | |
CN114004922B (en) | Bone animation display method, device, equipment, medium and computer program product | |
US20220210336A1 (en) | Selector input device to transmit media content items | |
US11922587B2 (en) | Dynamic augmented reality experience | |
US20240029382A1 (en) | Ar body part tracking system | |
US11825276B2 (en) | Selector input device to transmit audio signals | |
US20220319124A1 (en) | Auto-filling virtual content | |
US20230393730A1 (en) | User interface including multiple interaction zones | |
US20220373791A1 (en) | Automatic media capture using biometric sensor data | |
US20240248542A1 (en) | Media content items with haptic feedback augmentations | |
US20220377309A1 (en) | Hardware encoder for stereo stitching | |
US20230069614A1 (en) | High-definition real-time view synthesis | |
US20240248546A1 (en) | Controlling augmented reality effects through multi-modal human interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |