CN110620884A - Expression-driven-based virtual video synthesis method and device and storage medium - Google Patents
Info
- Publication number
- CN110620884A CN110620884A CN201910885913.9A CN201910885913A CN110620884A CN 110620884 A CN110620884 A CN 110620884A CN 201910885913 A CN201910885913 A CN 201910885913A CN 110620884 A CN110620884 A CN 110620884A
- Authority
- CN
- China
- Prior art keywords
- expression
- image
- images
- synthesized
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23424—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
Abstract
The invention relates to the technical field of video synthesis, and provides an expression-driven virtual video synthesis method, device, and storage medium. The method comprises the following steps: acquiring an image set to be synthesized, and determining an image to be synthesized from the image set; synthesizing the image to be synthesized with a target image based on a GAN network to form a target photo, wherein the target image is an original image of the user; intercepting a plurality of frames of an unprocessed original video as reference images; performing expression driving on the target photo based on the reference images to obtain transmission images corresponding to the virtual video to be transmitted; and splicing the frames of the transmission images to form the virtual composite video. The invention synthesizes a photo for video chat from the user's own photo and a stranger's photo, and then, through expression driving, turns the synthesized photo into a video that resembles the user, so that the video approaches the user's real appearance while protecting the user's privacy.
Description
Technical Field
The invention relates to the technical field of video synthesis, in particular to a virtual video synthesis method and device based on expression driving and a computer readable storage medium.
Background
At present, virtual video synthesis is widely applied in many fields and has a large market. Virtual social networking is an important application in the field of virtual reality, and virtual object driving can be applied to it to drive personalized avatars, thereby enhancing the realism and interactivity of virtual social networking and optimizing the user's virtual reality experience.
However, existing virtual video synthesis, as used in film, animation, and game production, mainly relies on face motion capture devices to track the changes of a real human face and map them onto a virtual character to drive the character's mouth shape and expression; it cannot synthesize a virtual video that resembles the user's own facial features.
Similarly, in today's social applications, video chat between strangers is common. In this scenario, how to chat with a video that is close to one's real appearance yet is not one's real face is a technical problem that urgently needs to be solved.
Disclosure of Invention
The invention provides an expression-driven virtual video synthesis method, an electronic device, and a computer-readable storage medium. Its main purpose is to synthesize a photo for video chat from the user's own photo and a stranger's photo, and then, through expression driving, turn the synthesized photo into a video that resembles the user, thereby approaching the user's real appearance while protecting the user's privacy.
In order to achieve the above object, the present invention provides a virtual video synthesis method based on expression driving, applied to an electronic device, the method including:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably,
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target picture is consistent with the expression of the image with the same expression in the image to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
To achieve the above object, the present invention also provides an electronic device, including a memory and a processor, wherein the memory stores an expression-driven virtual video synthesis program which, when executed by the processor, implements the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium that stores an expression-driven virtual video synthesis program; when the program is executed by a processor, any step of the expression-driven virtual video synthesis method described above is implemented.
According to the expression-driven virtual video synthesis method, the electronic device, and the computer-readable storage medium, a photo for video chat is synthesized from the user's own photo and a stranger's photo, and the synthesized photo is then turned, through expression driving, into a video that resembles the user, so that the video approaches the user's real appearance while protecting the user's privacy.
Drawings
FIG. 1 is a schematic diagram of an application environment of a virtual video synthesis method based on expression driving according to a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a preferred embodiment of the expression-driven-based virtual video composition program of FIG. 1;
FIG. 3 is a flowchart illustrating a method for synthesizing a virtual video based on expression driving according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a virtual video synthesis method based on expression driving, which is applied to an electronic device 1. Fig. 1 is a schematic diagram of an application environment of a virtual video synthesis method based on expression driving according to a preferred embodiment of the present invention.
In the present embodiment, the electronic device 1 may be a terminal device with computing capability, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.
The electronic device 1 includes: a processor 12, a memory 11, an imaging device 13, a network interface 14, and a communication bus 15.
The memory 11 includes at least one type of readable storage medium. The readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments, it may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card.
In the present embodiment, the readable storage medium of the memory 11 is generally used for storing the expression-driven virtual video synthesis program 10 installed in the electronic device 1, a face image sample library, and pre-trained AU classifiers, emotion classifiers, and the like. The memory 11 may also be used to temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), microprocessor or other data Processing chip in some embodiments, and is used to execute program codes stored in the memory 11 or process data, such as executing the expression-driven-based virtual video composition program 10.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the electronic apparatus 1 and other electronic devices.
The communication bus 15 is used to realize connection communication between these components.
Fig. 1 only shows the electronic device 1 with components 11-15, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may alternatively be implemented.
Optionally, the electronic device 1 may further include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or other equipment with voice recognition capability, and a voice output device such as a loudspeaker or headset. Optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 1 may further comprise a display, which may also be referred to as a display screen or a display unit. In some embodiments, the display device may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
Optionally, the electronic device 1 further comprises a touch sensor. The area provided by the touch sensor for the user to perform touch operation is called a touch area. Further, the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor, or may be a plurality of sensors arranged in an array, for example.
The area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, a display is stacked with the touch sensor to form a touch display screen. The device detects touch operation triggered by a user based on the touch display screen.
Optionally, the electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described herein again.
In the apparatus embodiment shown in fig. 1, a memory 11, which is a kind of computer storage medium, may include therein an operating system, and an expression-driven based virtual video composition program 10; the processor 12 executes the expression-driven virtual video composition program 10 stored in the memory 11 to implement the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images corresponding to the actual video;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably,
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target picture is consistent with the expression of the image with the same expression in the image to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
The electronic device 1 provided in the above embodiment reaches real-time or near-real-time speed by optimizing the expression-driving process, and combines it with the GAN network (which cannot run in real time), thereby solving the difficulty that strangers making friends by video do not want to, or cannot fully, reveal their real selves. With the expression-driven virtual video synthesis method, a realistic video closer to the user can be synthesized, which reduces the sense of incongruity in the video while protecting personal privacy; a similar portrait replaces the real portrait in the video conversation, and the expressions are more real and natural, closer to the emotion the user wants to express.
In other embodiments, the expression-driven virtual video composition program 10 may also be divided into one or more modules, which are stored in the memory 11 and executed by the processor 12 to implement the present invention. A module here refers to a series of computer program instruction segments capable of performing a specified function. Referring to fig. 2, a block diagram of a preferred embodiment of the expression-driven-based virtual video composition program 10 of fig. 1 is shown. The program 10 may be divided into: an image-to-be-synthesized determining unit 11, a target photo synthesizing unit 12, a reference image acquiring unit 13, a transmission image determining unit 14, and a transmission video forming unit 15. The functions or operation steps performed by modules 11-15 are similar to those described above and are not detailed here, where:
the image to be synthesized determining unit 11 is configured to acquire an image set to be synthesized, and determine an image to be synthesized from the image set to be synthesized;
a target photo synthesizing unit 12, configured to synthesize the image to be synthesized and a target image based on a GAN network to form a target photo, where the target image is an original image of a user;
a reference image acquisition unit 13 for cutting a plurality of frame images in the unprocessed original video as reference images;
a transmission image determining unit 14, configured to perform expression driving on the target photo based on the reference image to obtain a transmission image corresponding to a to-be-transmitted virtual transmission video;
a transmission video forming unit 15, configured to splice frames of the transmission images to form the virtual composite video.
In addition, the invention also provides a virtual video synthesis method based on expression driving. Fig. 3 is a flowchart illustrating a virtual video composition method based on expression driving according to a preferred embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the expression-driven virtual video synthesis method includes:
s110: and acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized.
This step can also be understood as material selection. The image set to be synthesized contains multiple groups of images; each group consists of photos of the same person with different expressions, and the number of photos equals the number of expression bases. For example, 47 expression bases means the person has 47 mutually independent expressions. The number of expression bases can be set as required: the more expression bases, the more refined and accurate the expression fitting, but the higher the computational complexity, the longer the time needed to process one image, and the lower the frame rate (frames processed per second) may become, so real-time performance may be lost. Conversely, the fewer the expression bases, the faster the processing, but the larger the possible expression error. The specific number of expression bases can therefore be set according to actual requirements.
In this step, the image set to be synthesized is recommended according to the user's preferences and historical usage records. The image set to be synthesized includes multiple groups of images; each group includes each expression image of the same person, and these expression images serve as the expression bases of the image to be synthesized. A group of expression-base images can also be shot for the user according to the requirements of the expression bases.
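The linear expression-base representation described above can be sketched as follows. This is an illustrative toy, not from the patent text: the `blend` helper and the three-coordinate "faces" are hypothetical, and a real model would blend thousands of 3D mesh vertices per basis.

```python
def blend(mean_face, expr_bases, coeffs):
    """Return mean_face + sum_i coeffs[i] * expr_bases[i], per coordinate."""
    assert len(expr_bases) == len(coeffs)
    out = list(mean_face)
    for basis, c in zip(expr_bases, coeffs):
        for j, delta in enumerate(basis):
            out[j] += c * delta
    return out

# Two mutually independent expression bases (the patent mentions e.g. 47).
mean = [0.0, 0.0, 0.0]
bases = [[1.0, 0.0, 0.0],   # basis 0: e.g. a "mouth open" displacement
         [0.0, 1.0, 0.0]]   # basis 1: e.g. a "brow raise" displacement
print(blend(mean, bases, [0.5, 0.25]))  # [0.5, 0.25, 0.0]
```

Under this representation, "fitting an expression" reduces to choosing the coefficient vector, which is why more bases mean finer expressions but more computation per frame.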
S120: and synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user.
The GAN network takes two images as input and outputs one image. The target image is a group of images or photos corresponding to the expression bases of the image to be synthesized. The step of synthesizing the image to be synthesized with the target image includes: controlling the expression features during synthesis, and synthesizing the target image with the image of the same expression in the image to be synthesized, so that the expression of the synthesized target photo is consistent with the expression of the target image.
Specifically, the GAN network, i.e., the generative adversarial network, trains two neural networks against each other: one attempts to generate a synthetic image indistinguishable from a real photograph, while the other attempts to tell synthetic images from real ones. After training for a period of time, the generator network can produce convincingly realistic images. Meanwhile, to ensure that the generated image is as similar as possible to the two input images, the feature space can be adjusted so that the sum of the differences between the features of the generated image and the features of the two input images is minimized, and this sum is used as a supervised loss.
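The feature-space supervision described above might look like the following toy loss. The function name and the flat feature vectors are assumptions for illustration; in practice the features would come from an intermediate layer of a trained network, not raw numbers.

```python
def feature_similarity_loss(gen_feat, feat_a, feat_b):
    """Sum of absolute differences between the generated image's features and
    the features of the two input images. Minimizing this pushes the synthesized
    face to resemble both the user's photo and the selected stranger's photo."""
    diff_a = sum(abs(g - a) for g, a in zip(gen_feat, feat_a))
    diff_b = sum(abs(g - b) for g, b in zip(gen_feat, feat_b))
    return diff_a + diff_b

# A generated feature vector "between" the two inputs scores a low loss.
print(feature_similarity_loss([0.5, 0.5], [0.0, 1.0], [1.0, 0.0]))  # 2.0
```

This supervised term is added on top of the usual adversarial loss, so the generator is rewarded both for realism and for similarity to its two inputs.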
During synthesis, the expression features are controlled so that each expression of the photo taken by the user is synthesized in one-to-one correspondence with the matching expression of the image or photo to be synthesized selected by the user, and the synthesized expression is consistent with the expression before synthesis.
In other words, to ensure that an image exists for each expression of the synthesized portrait, two sets of data are required: one set contains each expression of the "real me" (the target image, i.e., the user's own photos), and one set contains each expression of the "auxiliary image" (the selected image or photo to be synthesized). The images of each expression of the "virtual me" (the target photo) are then obtained by synthesizing them one by one, expression by expression.
Finally, a material map, namely a 3D mesh, needs to be produced for each of the different expressions.
S130: intercepting a plurality of frames from the unprocessed original video as reference images.
S140: performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
S150: splicing the frames of transmission images to form the virtual composite video.
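Steps S130 through S150 can be sketched as a simple per-frame pipeline (the function and field names below are hypothetical placeholders for the operations described, not the patent's actual implementation):

```python
# Hypothetical stub: S140, re-posing the target photo to match a reference frame.
def drive_expression(target_photo, reference_frame):
    return {"photo": target_photo, "expression": reference_frame["expression"]}

def synthesize_video(target_photo, original_frames):
    # S130: every intercepted original frame serves as a reference image;
    # S140: each frame drives the target photo to produce one transmission image.
    transmission_images = [drive_expression(target_photo, f) for f in original_frames]
    # S150: the per-frame transmission images are spliced into the virtual video.
    return transmission_images

frames = [{"expression": "smile"}, {"expression": "neutral"}]
video = synthesize_video("target.jpg", frames)
print([img["expression"] for img in video])  # ['smile', 'neutral']
```

The key property is that the output video has exactly one transmission image per reference frame, each carrying that frame's expression.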
The step of performing expression driving on the target photo based on the reference image, to obtain the transmission image corresponding to the virtual transmission video to be transmitted, comprises the following steps:
1. setting an average face, a group of expression bases and a group of identity bases;
2. setting the coefficients of the average face and the identity bases to fixed values, and controlling the expression base coefficients of the target photo to change with the changes of the expression base coefficients of the reference image, to form a transmission image corresponding to the reference image.
Further, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and forming a transmission image corresponding to the reference image includes:
1. converting the 3D meshes of the average face, the expression bases, and the identity bases into corresponding 2D images, and acquiring the corresponding face key point coordinates from the 2D images;
2. acquiring the face key point coordinates of the reference image;
3. iteratively changing the coefficients of the expression bases so that the Euclidean distance between the face key point coordinates obtained from the 2D image and those of the reference image is minimized, thereby determining a set of expression base coefficients;
4. applying the obtained expression base coefficients to the target photo so that the coefficients of the same expression bases are equal, obtaining the final transmission image.
The above processing is performed on each frame of the actual video, and the processed frames (a set of transmission images) are combined into the final transmission video.
Specifically, an average face S0 (mean face), a set of expression bases Sexp (expression base), and a set of identity bases Sid (identity base) are set first.
The expression bases, also called expression texture maps, are the required images of "true me" and "false me" under the different expressions obtained in step S120. Because the identity is fixed, the coefficients of the average face and the identity bases can be determined once and kept constant, so they can be ignored here. Therefore, by changing only the coefficients of the expression bases, the transmission image ("false me") can be controlled to change with the expression of the reference image ("true me"); this is the expression driving.
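A toy numpy sketch of the underlying blendshape model, assuming the standard linear form S = S0 + Σᵢ αid,ᵢ·Sid,ᵢ + Σⱼ αexp,ⱼ·Sexp,ⱼ (the mesh size and random bases here are purely illustrative, not the patent's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 5                                # toy mesh size
s0 = rng.normal(size=(n_vertices, 3))         # average face S0
s_id = rng.normal(size=(4, n_vertices, 3))    # identity bases Sid
s_exp = rng.normal(size=(47, n_vertices, 3))  # 47 expression bases Sexp

alpha_id = rng.normal(size=4)                 # fixed: identity never changes

def mesh(alpha_exp):
    # S = S0 + sum_i alpha_id[i]*Sid[i] + sum_j alpha_exp[j]*Sexp[j];
    # only alpha_exp varies during expression driving.
    return (s0
            + np.tensordot(alpha_id, s_id, axes=1)
            + np.tensordot(alpha_exp, s_exp, axes=1))

neutral = mesh(np.zeros(47))
smiling = mesh(rng.normal(size=47))
print(neutral.shape)  # (5, 3)
```

With the identity term frozen, two meshes built this way differ only in their expression component, which is exactly why copying the expression coefficients onto the target face transfers the expression without changing the person.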
It should be noted that the average face, expression bases, and identity bases are 3D meshes. When fitting, the 3D mesh is projected onto the 2D plane through three matrices (an identity matrix, a rotation matrix, and a projection matrix) to obtain the corresponding 2D image. The face key point coordinates of the corresponding image (the positions of the eyes, mouth, nose, and so on) are then obtained from the 2D image.
The above conversion from a 3D image to a 2D image involves a series of coordinate system changes, which can be understood simply as multiplying the (x, y, z) coordinates in the 3D coordinate system by a projection matrix, a rotation matrix, and an identity matrix to obtain the (x', y', 0) coordinates in 2D.
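The coordinate change can be illustrated with a toy orthographic example (the concrete matrices and their order here are assumptions for illustration; the patent does not fix them):

```python
import numpy as np

def project_to_2d(points_3d, rotation, projection):
    # Multiply by the identity, rotation, and projection matrices,
    # then keep (x', y') and drop the zeroed z component.
    identity = np.eye(3)
    transformed = points_3d @ identity.T @ rotation.T @ projection.T
    return transformed[:, :2]

# toy case: 90-degree rotation about z, then an orthographic projection
theta = np.pi / 2
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
proj = np.diag([1.0, 1.0, 0.0])   # orthographic: zero out z
pts = np.array([[1.0, 0.0, 2.0]])
print(project_to_2d(pts, rot, proj))  # approximately [[0., 1.]]
```

A real pipeline would use a perspective projection calibrated to the camera, but the chain of matrix multiplications is the same idea.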
During the video call, the camera detects a reference image, from which the positions of a set of key points are obtained. When determining the coefficients of the expression bases, the coefficients are changed iteratively so that the L2 loss (i.e., the Euclidean distance) between the key point positions obtained by projecting the 3D mesh (the x', y' above) and the key point positions of the reference image is minimized; the coefficients of the expression bases are thereby determined.
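The iterative fit can be sketched with a linearized landmark model (the linear model and the gradient-descent update are assumptions for illustration; the patent only specifies that the coefficients are iterated until the L2 loss is minimal):

```python
import numpy as np

rng = np.random.default_rng(1)
n_landmarks, n_bases = 68 * 2, 47   # 68 (x, y) key points, 47 expression bases
base_landmarks = rng.normal(size=n_landmarks)    # projected neutral key points
basis = rng.normal(size=(n_landmarks, n_bases))  # effect of each base on key points

true_alpha = rng.normal(size=n_bases)
reference = base_landmarks + basis @ true_alpha  # key points detected in the frame

# Iterate the expression coefficients so the L2 loss between projected
# and reference key points shrinks (plain gradient descent).
alpha = np.zeros(n_bases)
for _ in range(500):
    residual = base_landmarks + basis @ alpha - reference
    alpha -= 0.001 * 2 * basis.T @ residual      # gradient step on the L2 loss
loss = np.sum((base_landmarks + basis @ alpha - reference) ** 2)
print(loss < 1e-3)  # True once the fit has converged
```

The recovered `alpha` is the set of expression base coefficients that best reproduces the reference frame's key point positions.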
Then, the obtained coefficients of the expression bases are applied to the target photo so that the coefficients of the same expression bases are equal, ensuring that the expression of the "false me" transmission image is consistent with the expression of the reference image; that is, the false expression is driven by the real expression. The final picture of the "false me" is then synthesized from the expression bases; this is the picture that needs to be output to the other party's video.
It should be noted that the "synthesis" in step S120 is the synthesis of photos of two people with the same expression, whereas the "synthesis" of the transmission image combines different expression bases of the same person to obtain a final expression. In other words, in the photo synthesized in step S120, the resulting person differs from the two input people, but the expression is the same; in the photo synthesized for the transmission image, the person is the same, but the final expression does not belong to any one of the 47 expression bases and instead matches the expression of the reference image captured from the video frame.
Finally, each frame of the actual video is processed according to the above steps at the preset frame rate, and the processed frames are synthesized into the final transmission video. During synthesis, the calculation is simplified by the least squares method and linear regression, so real-time video transmission can be achieved.
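Under the same linearized landmark model as above, the least-squares simplification amounts to one closed-form solve per frame instead of a long iteration (a hedged sketch; `np.linalg.lstsq` stands in for whatever solver the implementation actually uses):

```python
import numpy as np

rng = np.random.default_rng(2)
n_landmarks, n_bases = 68 * 2, 47
base_landmarks = rng.normal(size=n_landmarks)
basis = rng.normal(size=(n_landmarks, n_bases))
reference = base_landmarks + basis @ rng.normal(size=n_bases)

# One closed-form least-squares solve per frame; avoiding per-frame
# iteration is what makes (near) real-time expression driving feasible.
alpha, *_ = np.linalg.lstsq(basis, reference - base_landmarks, rcond=None)
residual = np.sum((base_landmarks + basis @ alpha - reference) ** 2)
print(residual < 1e-10)  # True: this toy linear system is solved exactly
```

Since the landmark positions are linear in the coefficients under this model, the fit reduces to ordinary linear regression, consistent with the simplification described above.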
It should be noted that if only a GAN network were used to synthesize a simple face photo, and the face photos were then pasted into the video according to the face key points, the face displayed in the video would look very unnatural. To avoid this, the expression driving operation is carried out using the BlendShape technique or a skeleton-based method, so that real-time expression driving of the video can be realized.
At present, because real-time performance is difficult to achieve, GAN networks are mainly used for generating data for data augmentation, image super-resolution, style transfer, and the like, while expression driving is mainly used to drive virtual avatars, such as a cartoon character resembling the user, or a cat or dog. This gives people a sense of unreality in actual chatting and harms the user experience.
Therefore, the invention optimizes the expression driving process to reach real-time or near real-time speed, and skillfully combines it with the non-real-time GAN network, thereby solving the problem that users making video acquaintances with strangers do not want to fully reveal, or reveal at all, their real selves. With the expression-driven virtual video synthesis method, a real-person video closer to the user can be synthesized, reducing the sense of disharmony in the video while protecting personal privacy: a similar portrait replaces the real portrait in the video conversation, and the expressions are more real and natural, closer to the emotion the user wants to express.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes an expression-driven-based virtual video composition program, and when executed by a processor, the expression-driven-based virtual video composition program implements the following operations:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
Preferably, the image set to be synthesized includes a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the image set to be synthesized.
Preferably,
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and controlling the expression characteristics during synthesis, synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target photo is consistent with the expression of the image with the same expression in the image to be synthesized.
Preferably, the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the to-be-transmitted virtual transmission video includes:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
Preferably, the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficients of the expression base of the reference image, and the step of forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the above-mentioned virtual video synthesis method based on expression driving and the specific implementation of the electronic device, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A virtual video synthesis method based on expression driving is applied to an electronic device, and is characterized in that the method comprises the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
2. The expression-driven-based virtual video synthesis method according to claim 1, wherein the image set to be synthesized contains a plurality of groups of images, each group of images includes a plurality of expression images of the same person, and the plurality of expression images serve as expression bases of the images to be synthesized.
3. The expression-driven-based virtual video synthesis method according to claim 2, wherein
the target image is a group of images corresponding to the expression base of the image to be synthesized;
the step of synthesizing the image to be synthesized and the target image comprises the following steps: and synthesizing the target image and the image with the same expression in the image to be synthesized, wherein the expression of the synthesized target picture is consistent with the expression of the image with the same expression in the image to be synthesized.
4. The expression-driven-based virtual video synthesis method according to claim 1, wherein the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to a virtual transmission video to be transmitted comprises:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
5. The expression-drive-based virtual video synthesis method according to claim 4, wherein the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficient of the expression base of the reference image, and forming the transmission image corresponding to the reference image comprises:
converting the 3D grid graph of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
6. An electronic device, comprising a memory, a processor, and an expression-driven virtual video synthesis program stored on the memory, wherein the program, when executed by the processor, implements the following steps:
acquiring an image set to be synthesized, and determining an image to be synthesized from the image set to be synthesized;
synthesizing the image to be synthesized and a target image based on a GAN network to form a target photo, wherein the target image is an original image of a user;
intercepting a plurality of frame images in an unprocessed original video as reference images;
performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to the virtual transmission video to be transmitted;
and splicing the frames of the transmission images to form the virtual composite video.
7. The electronic device of claim 6,
the image set to be synthesized comprises a plurality of groups of images, each group of images comprises a plurality of expression images of the same person, and the expression images are used as expression bases of the images to be synthesized.
8. The electronic device according to claim 6, wherein the step of performing expression driving on the target photo based on the reference image to obtain a transmission image corresponding to a virtual transmission video to be transmitted comprises:
setting an average face, a group of expression bases and a group of identity bases;
and setting the coefficients of the average face and the identity base as fixed values, and controlling the expression base coefficient of the target picture to change along with the change of the expression base coefficient of the reference image to form a transmission image corresponding to the reference image.
9. The electronic device according to claim 8, wherein the step of setting the coefficients of the average face and the identity base to fixed values and controlling the expression base coefficient of the target photograph to vary with the variation of the coefficient of the expression base of the reference image, and forming the transmission image corresponding to the reference image includes:
converting the 3D grid images of the average face, the expression base and the identity base into corresponding 2D images, and acquiring corresponding face key point coordinates based on the 2D images;
acquiring the coordinates of the key points of the face of the reference image;
changing the coefficients of expression bases through iteration, and enabling the Euclidean distance between the coordinates of the key points of the face obtained based on the 2D image and the coordinates of the key points of the face of the reference image to be minimum, so that the coefficients of a group of expression bases are determined;
and applying the obtained coefficients of the expression bases to the target photo to enable the coefficients of the same expression bases to be the same, and obtaining a final transmission image.
10. A computer-readable storage medium, wherein the computer-readable storage medium includes an expression-driven virtual video composition program, and when the expression-driven virtual video composition program is executed by a processor, the steps of the expression-driven virtual video composition method according to any one of claims 1 to 5 are implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910885913.9A CN110620884B (en) | 2019-09-19 | 2019-09-19 | Expression-driven-based virtual video synthesis method and device and storage medium |
PCT/CN2019/118285 WO2021051605A1 (en) | 2019-09-19 | 2019-11-14 | Virtual video synthesis method and apparatus based on expression driving, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910885913.9A CN110620884B (en) | 2019-09-19 | 2019-09-19 | Expression-driven-based virtual video synthesis method and device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110620884A true CN110620884A (en) | 2019-12-27 |
CN110620884B CN110620884B (en) | 2022-04-22 |
Family
ID=68923758
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910885913.9A Active CN110620884B (en) | 2019-09-19 | 2019-09-19 | Expression-driven-based virtual video synthesis method and device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110620884B (en) |
WO (1) | WO2021051605A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111614925A (en) * | 2020-05-20 | 2020-09-01 | 广州视源电子科技股份有限公司 | Figure image processing method and device, corresponding terminal and storage medium |
CN113559503A (en) * | 2021-06-30 | 2021-10-29 | 上海掌门科技有限公司 | Video generation method, device and computer readable medium |
CN114429611A (en) * | 2022-04-06 | 2022-05-03 | 北京达佳互联信息技术有限公司 | Video synthesis method and device, electronic equipment and storage medium |
JP2022531055A (en) * | 2020-03-31 | 2022-07-06 | 北京市商▲湯▼科技▲開▼▲發▼有限公司 | Interactive target drive methods, devices, devices, and recording media |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108257195A (en) * | 2018-02-23 | 2018-07-06 | 深圳市唯特视科技有限公司 | A kind of facial expression synthetic method that generation confrontation network is compared based on geometry |
CN108288072A (en) * | 2018-01-26 | 2018-07-17 | 深圳市唯特视科技有限公司 | A kind of facial expression synthetic method based on generation confrontation network |
CN108389239A (en) * | 2018-02-23 | 2018-08-10 | 深圳市唯特视科技有限公司 | A kind of smile face video generation method based on condition multimode network |
CN108875633A (en) * | 2018-06-19 | 2018-11-23 | 北京旷视科技有限公司 | Expression detection and expression driving method, device and system and storage medium |
CN109147017A (en) * | 2018-08-28 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Dynamic image generation method, device, equipment and storage medium |
CN109308727A (en) * | 2018-09-07 | 2019-02-05 | 腾讯科技(深圳)有限公司 | Virtual image model generating method, device and storage medium |
CN109448083A (en) * | 2018-09-29 | 2019-03-08 | 浙江大学 | A method of human face animation is generated from single image |
WO2019056000A1 (en) * | 2017-09-18 | 2019-03-21 | Board Of Trustees Of Michigan State University | Disentangled representation learning generative adversarial network for pose-invariant face recognition |
CN110097086A (en) * | 2019-04-03 | 2019-08-06 | 平安科技(深圳)有限公司 | Image generates model training method, image generating method, device, equipment and storage medium |
CN110148191A (en) * | 2018-10-18 | 2019-08-20 | 腾讯科技(深圳)有限公司 | The virtual expression generation method of video, device and computer readable storage medium |
2019
- 2019-09-19 CN CN201910885913.9A patent/CN110620884B/en active Active
- 2019-11-14 WO PCT/CN2019/118285 patent/WO2021051605A1/en active Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2022531055A (en) * | 2020-03-31 | 2022-07-06 | 北京市商▲湯▼科技▲開▼▲發▼有限公司 | Interactive target drive methods, devices, devices, and recording media |
CN111614925A (en) * | 2020-05-20 | 2020-09-01 | 广州视源电子科技股份有限公司 | Figure image processing method and device, corresponding terminal and storage medium |
CN113559503A (en) * | 2021-06-30 | 2021-10-29 | 上海掌门科技有限公司 | Video generation method, device and computer readable medium |
CN113559503B (en) * | 2021-06-30 | 2024-03-12 | 上海掌门科技有限公司 | Video generation method, device and computer readable medium |
CN114429611A (en) * | 2022-04-06 | 2022-05-03 | 北京达佳互联信息技术有限公司 | Video synthesis method and device, electronic equipment and storage medium |
CN114429611B (en) * | 2022-04-06 | 2022-07-08 | 北京达佳互联信息技术有限公司 | Video synthesis method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110620884B (en) | 2022-04-22 |
WO2021051605A1 (en) | 2021-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110620884B (en) | Expression-driven-based virtual video synthesis method and device and storage medium | |
CN114930399A (en) | Image generation using surface-based neurosynthesis | |
CN109949390B (en) | Image generation method, dynamic expression image generation method and device | |
US11989348B2 (en) | Media content items with haptic feedback augmentations | |
US20220300728A1 (en) | True size eyewear experience in real time | |
US20230120037A1 (en) | True size eyewear in real time | |
US11823346B2 (en) | AR body part tracking system | |
CN116917938A (en) | Visual effect of whole body | |
US20220319059A1 (en) | User-defined contextual spaces | |
CN117136381A (en) | whole body segmentation | |
US20220319125A1 (en) | User-aligned spatial volumes | |
US20220207786A1 (en) | Flow-guided motion retargeting | |
EP4314999A1 (en) | User-defined contextual spaces | |
CN114004922B (en) | Bone animation display method, device, equipment, medium and computer program product | |
US20220210336A1 (en) | Selector input device to transmit media content items | |
US11922587B2 (en) | Dynamic augmented reality experience | |
US20240029382A1 (en) | Ar body part tracking system | |
US11825276B2 (en) | Selector input device to transmit audio signals | |
US20220319124A1 (en) | Auto-filling virtual content | |
US20230393730A1 (en) | User interface including multiple interaction zones | |
US20220373791A1 (en) | Automatic media capture using biometric sensor data | |
US20240248542A1 (en) | Media content items with haptic feedback augmentations | |
US20220377309A1 (en) | Hardware encoder for stereo stitching | |
US20230069614A1 (en) | High-definition real-time view synthesis | |
US20240248546A1 (en) | Controlling augmented reality effects through multi-modal human interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |