CN115278106A - Deep face video editing method and system based on sketch - Google Patents
- Publication number: CN115278106A
- Application number: CN202210698610.8A
- Authority
- CN
- China
- Prior art keywords
- edit
- editing
- sketch
- face
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects
- H04N5/265: Mixing
- G06N3/084: Learning methods; backpropagation, e.g. using gradient descent
- G06V40/168: Human faces; feature extraction; face representation
- G06V40/174: Human faces; facial expression recognition
- G06V40/18: Eye characteristics, e.g. of the iris
Abstract
The invention provides a sketch-based deep face video editing method and system. The method aligns and crops the faces in the original video and encodes them into a latent space to obtain the hidden codes of all frames of the face video; a sketch generation branch is added to StyleGAN, and the image hidden code is optimized in reverse to generate an edit vector δ_edit; superimposing δ_edit on the hidden codes of all frames completes the propagation of timing-independent edits; weighting δ_edit with a piecewise linear function completes the propagation of action or expression edits; weighting δ_edit by the similarity between the expression parameters of the current frame and of the edited frame makes the edit correspond to a specific expression, completing expression-driven edit propagation; finally, a region-aware fusion method fuses the different kinds of edits added in different frames, and the edited face is fused back into the original video.
Description
Technical Field
The invention relates to the technical field of computer graphics and computer vision, and in particular to a method and system for sketch-based synthesis and editing of face videos.
Background
Video editing is a challenging research problem, and with the development of deep learning, work on video editing and modification keeps growing. Most existing video editing methods modify global attributes of a video: converting black-and-white video to color, or stylizing a video to produce artistic editing results. For face videos, the prior art mainly completes tasks such as face swapping, which only modifies the global attribute of identity. Some methods can edit detail regions of a video, but they require professional software such as Photoshop or Premiere and cost considerable time and effort. The sketch is an efficient and accurate interactive tool with high user-friendliness and is widely used for image generation and editing. However, the prior art cannot extend sketch editing from images to videos and struggles with the propagation and composition of editing operations. Video editing has broad application prospects and high value in cultural fields such as film production and new media, yet existing technology cannot simply and quickly edit video details.
For the video editing problem, the prior art can automatically colorize a video, but this function is narrow: only the color information of the video changes. Video stylization can likewise change the color characteristics and give the content an artistic transformation, generating a video with an artistic feel. However, such work can only edit the global features of a video and cannot modify its details. Even methods that represent video in an atlas space extend detail edits to video clips only after the image has been edited in software such as Photoshop; this requires professional tools, and the editing and generation of the video are time-consuming. The sketch is a friendlier interactive tool, making the user's editing simpler and more accurate. The prior art can edit a face image with a sketch, but the editing operation cannot be propagated to the whole video.
Disclosure of Invention
To solve the problems that details of video content are hard to edit and that sketch edits cannot be propagated to a video, the invention builds on the face image generation network StyleGAN, abstractly represents a sketch edit as a latent vector, and designs a novel propagation and fusion mechanism to edit face videos. The invention provides a sketch-based face video editing method and system that can select any one or more frames, edit face details with a sketch, and propagate the edits to the whole video in a specified manner.
Specifically, the invention provides a sketch-based deep face video editing method, which comprises the following steps:
step 1, aligning and cropping the faces in the original video, and encoding them into a latent space to obtain the hidden codes of all frames of the face video;
step 2, adding a sketch generation branch to the StyleGAN generation network, and optimizing the image hidden code in reverse to generate an edit vector δ_edit;
step 3, superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of timing-independent edits;
step 4, weighting the edit vector δ_edit with a piecewise linear function to complete the propagation of action or expression edits;
step 5, weighting the edit vector δ_edit by the similarity between the expression parameters of the current frame and of the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
step 6, fusing the different kinds of edits added in different frames with a region-aware fusion method, and fusing the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing method, step 1 comprises: detecting the face keypoints of the face video, smoothing them with a time window, then aligning and cropping the faces to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and projecting the frame sequence into the latent space W^+ to generate a hidden code sequence w_1, w_2, …, w_N.
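As a concrete illustration of the time-window smoothing in step 1, the following sketch applies a centered moving average to per-frame keypoints before alignment. The window size and the moving-average form are assumptions; the patent only specifies smoothing the keypoint coordinates with a time window.

```python
import numpy as np

def smooth_keypoints(keypoints, window=5):
    """Temporally smooth per-frame face keypoints with a centered moving
    average over a time window (window size is a hypothetical choice).

    keypoints: array of shape (N, K, 2) -- N frames, K landmarks, (x, y).
    Returns an array of the same shape.
    """
    n = len(keypoints)
    out = np.empty_like(keypoints, dtype=float)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out[i] = keypoints[lo:hi].mean(axis=0)  # average over the time window
    return out
```

Because the face detector works frame by frame, its raw keypoints jitter; smoothing them keeps the crop boxes from flickering between frames.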
In the sketch-based deep face video editing method, step 2 comprises:
obtaining the original StyleGAN generation network G, and constructing a generation network that models the joint probability distribution of real face images and sketches. This network contains two branches: the original network G, which generates a realistic face image, and a sketch branch, which generates the corresponding sketch image. Given the hidden code w of an image, the image branch generates feature maps F_1, F_2, …, F_14, where F_1 serves as the initial feature map of the sketch branch; the feature map of the sketch branch is repeatedly upsampled, the convolved residual of F_i is added, and the sketch image corresponding to the hidden code w is generated;
training the sketch generation branch with a dataset of paired images and sketches, taking the face image as input and generating the corresponding sketch. The hidden code w is randomly sampled and fed to the network, generating a highly realistic face image and the corresponding sketch; the sketch generation branch is trained by constructing the loss function
L = α_1·L_VGG + α_2·L_L2,
where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modeled, the user provides an input face image x, a drawn sketch s_edit, and a selected region m_edit. The face image x is projected into the W^+ space to obtain the hidden code; the optimized code w_edit should make the generated sketch match the input sketch inside the editing region while the generated image stays unchanged in the non-editing region. w_edit is obtained with the following loss function:
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; w_edit is obtained by optimization with the generation network fixed;
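To make the structure of this optimization concrete, here is a toy numerical sketch in which the StyleGAN image and sketch branches are replaced by stand-in linear maps and the masked objective β_1·L_sketch + β_2·L_rgb is solved in closed form. Everything except the two-term masked-loss structure is a simplifying assumption: the matrices, dimensions, and the 1-D mask are purely illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # toy latent dimension
A = rng.normal(size=(D, D))             # stand-in for the sketch branch
B = rng.normal(size=(D, D))             # stand-in for the image branch

w = rng.normal(size=D)                  # hidden code w of the input frame
m_edit = np.zeros(D)
m_edit[:3] = 1.0                        # toy 1-D "editing region" mask
s_edit = A @ w + m_edit * 2.0           # sketch target, changed inside the mask

beta1, beta2 = 1.0, 1.0
# L_sketch ties the edit region to the sketch; L_rgb freezes the rest.
# Stacking both masked constraints gives one linear least-squares problem.
M = np.vstack([beta1 * m_edit[:, None] * A,
               beta2 * (1.0 - m_edit)[:, None] * B])
b = np.concatenate([beta1 * m_edit * s_edit,
                    beta2 * (1.0 - m_edit) * (B @ w)])
w_edit, *_ = np.linalg.lstsq(M, b, rcond=None)

delta_edit = w_edit - w                 # the edit vector of step 2
```

In the patent, w_edit is instead found by gradient-based optimization of L_editing through the fixed generation network; the closed-form solve only works here because the stand-in branches are linear.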
The final edit vector is δ_edit = w_edit - w; δ_edit represents the sketch edit and is propagated to the whole face video. For each frame f_i, a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N
Step 3 comprises superimposing the corresponding δ_i onto the hidden code of each frame f_i, propagating the edit to the whole face video and generating the edited frame sequence.
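The timing-independent propagation of step 3 amounts to a single broadcast addition over the per-frame hidden codes. The flat (N, D) code shape below is a simplification; actual W^+ codes carry an extra per-layer axis.

```python
import numpy as np

def propagate_time_independent(latents, delta_edit):
    """Superimpose the same edit vector on every frame's hidden code.

    latents: (N, D) array of codes w_1..w_N; delta_edit: (D,) edit vector.
    Returns the (N, D) edited code sequence.
    """
    return latents + delta_edit  # broadcasts delta over the frame axis
```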
In the sketch-based deep face video editing method, step 4 comprises:
adding a blinking or smiling action at a specific time in the face video: at a specific frame f_t, the edit vector δ_edit is added with an input duration h and transition length l. For each frame f_i, the invention uses a piecewise linear function to generate the smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N
where γ rises linearly from 0 to 1 on [t_1, t_2], equals 1 on [t_2, t_3], falls linearly from 1 to 0 on [t_3, t_4], and is 0 elsewhere, with t_1 = t - h/2 - l, t_2 = t - h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t the time corresponding to the edited frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
Step 5 comprises:
given a plurality of key frames in the face video, extracting the expression parameters of the face by 3D reconstruction, with corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edits are propagated as
δ_i = Σ_{j=1}^{M} w_{i,j}·δ_{k_j},
where the weight w_{i,j} is computed from the cosine similarity between the expression parameters of the current frame f_i and of the key frame k_j.
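A minimal sketch of the expression-driven weighting in step 5 follows. The clamping of negative similarities and the normalization over key frames are assumptions; the patent only states that the weights come from the cosine similarity of expression parameters.

```python
import numpy as np

def expression_driven_deltas(expr, key_idx, key_deltas):
    """Weight each key frame's edit vector by the cosine similarity
    between the current frame's expression parameters and the key frame's.

    expr: (N, P) expression parameters per frame (e.g. from 3D reconstruction)
    key_idx: list of M key-frame indices into expr
    key_deltas: (M, D) edit vectors for the key frames
    Returns (N, D) per-frame edit vectors.
    """
    e = expr / np.linalg.norm(expr, axis=1, keepdims=True)
    sims = e @ e[key_idx].T                    # (N, M) cosine similarities
    w = np.clip(sims, 0.0, None)               # ignore dissimilar expressions
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-8)
    return w @ key_deltas
```

A frame whose expression matches a key frame thus receives (approximately) that key frame's edit vector, so the edit follows the expression rather than the timeline.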
Step 6 comprises:
given a frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated;
for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions, where the j-th new region is m_j deformed by the action and expression; inside each deformed region, the local area of the feature map of the original frame is replaced with the new feature map produced from the edited hidden code, the initial feature map being generated by G, the StyleGAN generation network;
the deformed mark region is downsampled so that its resolution matches the feature maps; the feature map is updated once per editing operation, M updates in total; the middle 5 feature maps of StyleGAN are updated, with resolutions from 32×32 to 128×128, while the high-resolution layers are adjusted from the original hidden code w_i based on the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating the edited and fused aligned face video;
finally, a face segmentation method generates the face mark regions of the input frame and the edited frame; these regions are merged, a smooth edge is generated for the merged region and used as the fusion weight to blend the faces before and after editing, and the fused face images are aligned back to the original video to synthesize the face video editing result.
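The final blend of step 6 can be sketched as a feathered alpha composite: the merged mark region gets a smoothed edge, which then serves as the fusion weight between the edited and original faces. The box blur used for feathering is an assumption; the patent only asks for a smooth edge on the merged mask.

```python
import numpy as np

def fuse_with_soft_edge(original, edited, mask, feather=2):
    """Blend the edited face over the original frame using the merged
    mark region with a smoothed edge as the fusion weight.

    original, edited: (H, W) or (H, W, 3) float images
    mask: (H, W) binary mask (union of the per-edit mark regions)
    """
    soft = mask.astype(float)
    for _ in range(feather):                 # crude cross-shaped box blur
        padded = np.pad(soft, 1, mode="edge")
        soft = (padded[:-2, 1:-1] + padded[1:-1, 1:-1] + padded[2:, 1:-1]
                + padded[1:-1, :-2] + padded[1:-1, 2:]) / 5.0
    if original.ndim == 3:
        soft = soft[..., None]
    return soft * edited + (1.0 - soft) * original
```

Deep inside the mask the result is purely the edited face, far outside it purely the original, and the feathered band in between hides the seam when the face is aligned back into the source frame.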
The invention also provides a sketch-based deep face video editing system, which comprises:
module 1, for aligning and cropping the faces in the original video and encoding them into a latent space to obtain the hidden codes of all frames of the face video;
module 2, for adding a sketch generation branch to the StyleGAN generation network and optimizing the image hidden code in reverse to generate an edit vector δ_edit;
module 3, for superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of timing-independent edits;
module 4, for weighting the edit vector δ_edit with a piecewise linear function to complete the propagation of action or expression edits;
module 5, for weighting the edit vector δ_edit by the similarity between the expression parameters of the current frame and of the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
module 6, for fusing the different kinds of edits added in different frames with a region-aware fusion method and fusing the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing system, module 1 is used to detect the face keypoints of the face video, smooth them with a time window, then align and crop the faces to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and to project the frame sequence into the latent space W^+, generating a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing system, module 2 is used to obtain the original StyleGAN generation network G and construct a generation network that models the joint probability distribution of real face images and sketches. This network contains two branches: the original network G, which generates a realistic face image, and a sketch branch, which generates the corresponding sketch image. Given the hidden code w of an image, the image branch generates feature maps F_1, F_2, …, F_14, where F_1 serves as the initial feature map of the sketch branch; the feature map of the sketch branch is repeatedly upsampled, the convolved residual of F_i is added, and the sketch image corresponding to the hidden code w is generated;
the sketch generation branch is trained with a dataset of paired images and sketches, taking the face image as input and generating the corresponding sketch. The hidden code w is randomly sampled and fed to the network, generating a highly realistic face image and the corresponding sketch; the sketch generation branch is trained by constructing the loss function
L = α_1·L_VGG + α_2·L_L2,
where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modeled, the user provides an input face image x, a drawn sketch s_edit, and a selected region m_edit. The face image x is projected into the W^+ space to obtain the hidden code; the optimized code w_edit should make the generated sketch match the input sketch inside the editing region while the generated image stays unchanged in the non-editing region. w_edit is obtained with the following loss function:
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; w_edit is obtained by optimization with the generation network fixed;
the final edit vector is δ_edit = w_edit - w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i, a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N
module 3 is used to superimpose the corresponding δ_i onto the hidden code of each frame f_i, propagating the edit to the whole face video and generating the edited frame sequence.
In the sketch-based deep face video editing system, module 4 is used to add a blinking or smiling action at a specific time in the face video: at a specific frame f_t, the edit vector δ_edit is added with an input duration h and transition length l; for each frame f_i, the invention uses a piecewise linear function to generate the smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N
where γ rises linearly from 0 to 1 on [t_1, t_2], equals 1 on [t_2, t_3], falls linearly from 1 to 0 on [t_3, t_4], and is 0 elsewhere, with t_1 = t - h/2 - l, t_2 = t - h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t the time corresponding to the edited frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
Module 5 comprises:
given a plurality of key frames in the face video, extracting the expression parameters of the face by 3D reconstruction, with corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edits are propagated as
δ_i = Σ_{j=1}^{M} w_{i,j}·δ_{k_j},
where the weight w_{i,j} is computed from the cosine similarity between the expression parameters of the current frame f_i and of the key frame k_j.
Module 6 is used, given a frame sequence f_1, f_2, …, f_N, to let the user select M key frames k_1, k_2, …, k_M and edit different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated;
for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions, where the j-th new region is m_j deformed by the action and expression; inside each deformed region, the local area of the feature map of the original frame is replaced with the new feature map produced from the edited hidden code, the initial feature map being generated by G, the StyleGAN generation network;
the deformed mark region is downsampled so that its resolution matches the feature maps; the feature map is updated once per editing operation, M updates in total; the middle 5 feature maps of StyleGAN are updated, with resolutions from 32×32 to 128×128, while the high-resolution layers are adjusted from the original hidden code w_i based on the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating the edited and fused aligned face video;
finally, a face segmentation method generates the face mark regions of the input frame and the edited frame; these regions are merged, a smooth edge is generated for the merged region and used as the fusion weight to blend the faces before and after editing, and the fused face images are aligned back to the original video to synthesize the face video editing result.
The invention also provides a storage medium storing a program for executing any of the above sketch-based deep face video editing methods.
The invention also provides a client for use with any of the above sketch-based deep face video editing systems.
According to the above scheme, the advantages of the invention are:
the designed system lets the user select one or more editing frames, draw a sketch and the corresponding editing-region mask, and, after the propagation mode of the edit is specified, realizes the editing and propagation operations on the video.
Drawings
FIG. 1 is a schematic flow chart of the system of the present invention;
FIG. 2 is a schematic diagram of sketch optimization;
FIG. 3 is a diagram of time-independent editing and time window editing results;
FIG. 4 is a diagram of the results of time-independent editing and expression-driven editing;
FIG. 5 is a graph of results for different rendering styles;
FIG. 6 is a diagram of a result of a rotating face edit;
FIG. 7 is a graph of results fused using different approaches after optimizing sketch editing vectors;
FIG. 8 is a diagram of different editing fusion results;
FIG. 9 is a diagram illustrating an intermediate result of face video editing;
FIG. 10 is a graph of the keypoint smoothing results.
Detailed Description
The defects of the prior art arise because sketch editing does not consider propagation across a video: the face in a video changes in expression and action, so an input sketch edit is hard to apply directly to other frames. Meanwhile, a sketch may change the identity features of the face (such as the shape of the facial features) or change expressions and actions (such as adding a smile), so distinguishing these cases and propagating them reasonably is very difficult. Video editing also needs temporal stability; existing methods do not consider the flicker problem of video generation, and the quality of their results is poor.
The inventors found that these defects can be overcome by designing a reasonable video encoding scheme and a method for sketch editing, propagation, and fusion. After a face video is input, the face region is first cropped and aligned via keypoint detection. Then, an image encoding network encodes the face images of all frames into the latent space of the StyleGAN generation network. For the sketch edit input by the user, an optimization strategy abstractly represents the editing operation as an edit vector. During propagation, editing operations are divided into two kinds, timing-independent edits and timing-dependent edits; timing-dependent edits are further divided into time-window edits and expression-driven edits. The user specifies the kind of edit, and propagation proceeds in the corresponding mode. A region-aware fusion strategy then fuses the different editing operations input at different frames and generates the edited face video. Finally, the generated face video is aligned back to the input original video and the face region is fused, generating the sketch video editing result.
The core invention points of the invention comprise:
key point 1, styleGAN based video coding module. After a section of face video is input, a face image is cut and aligned by using a face key point detection technology of dlib, and a time window is used for smoothing. The input frame sequence is encoded into a cryptic code sequence based on an E4E face to StyleGAN cryptic spatial encoding technique. According to the input frame sequence and the hidden code sequence, based on the PTI reconstruction technology, the weight of the StyleGAN generation network is finely adjusted, so that the original video can be perfectly reconstructed, the video coding task is completed, and the subsequent video editing is served;
and 2, a key point 2, a sketch editing and optimizing module. And (4) generating a network based on the pre-trained sketch, expanding the original StyleGAN, and adding a sketch generation branch to the original StyleGAN. Further, an optimization strategy is designed, a user inputs a drawn sketch and an editing area mask, the constraint of the editing area is the same as that of the sketch, the constraint of the editing area is not the same as that of the original image, and the original hidden code is optimized in an iterative mode. And (4) subtracting the optimized hidden code from the input hidden code to obtain an editing vector, and abstractively representing the sketch editing operation.
Key point 4: a time-window propagation technique. Some editing operations represent specific facial actions, including smiling and eye closing. An action is divided into three stages: start, sustain, and end. The edit vector is superimposed with a linearly changing weight during the start and end stages and with a fixed weight during the sustain stage, realizing the whole process of an action starting, continuing, and ending.
Key point 5: an expression-driven propagation technique. Some editing operations relate to expressions, such as closing the eyes when smiling. For such edits, the expression parameters of the face are extracted by 3D reconstruction, the weight is computed from the cosine similarity between the expression parameters of the edited frame and the predicted frame, and the edit vectors are superimposed accordingly.
Key point 6: a region-aware fusion module. During video editing, the user often selects multiple frames and edits different regions simultaneously. Using a deformation operation, this module predicts a deformation field from the changes in the face's actions and expressions and deforms the drawn masks. Finally, the generated face regions are fused and back-projected into the original frame, completing the fusion of the different edits.
In order to make the aforementioned features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The system flow chart is shown in fig. 1; the system comprises edit vector generation, timing-independent edit propagation, time-window edit propagation, expression-driven edit propagation, region-aware fusion, and related techniques.
The sketch optimization flow is shown in fig. 2: the original StyleGAN generation network is extended into two branches, one generating a sketch image and the other generating a highly realistic image. The optimization process contains two loss terms, L_sketch and L_rgb, which respectively constrain the editing region to follow the sketch edit and the other regions to remain unchanged.
As shown in fig. 1, a high-quality face video editing method and system based on sketch interaction includes:
S1: after a video is input, the face is aligned and cropped, and encoded into the hidden space;
S2: the StyleGAN generation network is extended with a sketch generation branch, and the image hidden code is optimized in reverse to generate the edit vector δ_edit;
S3: time-independent editing: the edit vector δ_edit is directly superimposed on the hidden codes of all frames to complete the propagation of time-independent edits;
S4: time-window editing: the edit vector δ_edit is superimposed with weights given by a piecewise linear function to complete the propagation of action or expression edits;
S5: expression-driven editing: weights are computed from the similarity between the expression parameters of the current frame and the editing frame, and the edit vector δ_edit is superimposed accordingly, so that the edit corresponds to a specific expression;
S6: different types of edits added on different frames are fused using the region-aware fusion method, and the face is fused back into the original video;
wherein the method of S1 comprises:
given an input video, face key points are detected with dlib, the face is aligned and cropped using key-point coordinates smoothed over a time window, and a video frame sequence f_1, f_2, …, f_N is generated, where N is the number of frames. The invention uses E4E to project the frame sequence onto the W+ space, generating the hidden code sequence w_1, w_2, …, w_N. The subsequently generated edit vectors are superimposed on this hidden code sequence. The smoothing is applied to the face key-point coordinates across the sequence: because face detection is performed per frame, a certain amount of jitter exists between frames, and the purpose of the smoothing is to eliminate its effects.
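The key-point smoothing above can be sketched as a centered moving average over a small time window. This is a minimal illustration only: the patent does not specify the smoothing filter or window length, so the function name `smooth_landmarks` and its `window` parameter are assumptions.

```python
import numpy as np

def smooth_landmarks(landmarks, window=5):
    """Smooth per-frame facial landmarks with a centered moving average.

    landmarks: array of shape (N, K, 2) -- N frames, K key points, (x, y).
    window: odd temporal window size (hypothetical default; the patent
            only says coordinates are smoothed over a time window).
    """
    n = len(landmarks)
    half = window // 2
    out = np.empty_like(landmarks, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)  # clamp at ends
        out[i] = landmarks[lo:hi].mean(axis=0)           # window average
    return out
```

The smoothed coordinates, rather than the raw per-frame detections, are then used to align and crop each face, which suppresses the inter-frame jitter described above.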
The method of S2 is shown in fig. 2, and includes:
S21: given the original StyleGAN generation network G, the invention designs a new generation network that models the joint probability distribution of real face images and sketches. It comprises two branches: an RGB branch, identical to the generation network G, that generates highly realistic face images, and a sketch branch that generates the corresponding sketch image. Given the hidden code w of an image, the RGB branch generates feature maps F_1, F_2, …, F_14, where F_1 serves as the initial feature map of the sketch branch. The intermediate feature maps of the sketch branch are repeatedly upsampled, and the convolved residual of F_i is added at each step. After this operation is performed for i = 2 to 14, the sketch image corresponding to the hidden code w is generated. The StyleGAN3 original generation network pads its feature maps by 10 pixels; the invention crops the intermediate feature maps and uses only the content inside the crop.
S22: to train the sketch generation branch, a sketch generation network S based on the Pix2PixHD network is first trained using a dataset of images matched with sketches. This network takes a real face image as input and generates the corresponding sketch, and is used to supervise the sketch generation branch. The invention then randomly samples hidden codes w, inputs them to the extended network to generate highly realistic face images and the corresponding sketches, and trains the sketch generation branch using the following loss function:
L = α_1 · L_VGG + α_2 · L_L2,
where L_VGG is a perceptual loss that uses a VGG19 model to measure visual similarity, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights, here α_1 = α_2 = 1.
S23: after modeling the distribution of the real image and the sketch, the invention designs an optimization technology, according to the real image x input by the user, the sketch s drawn by the usereditAnd a mark area meditGenerating an edit vector deltaedit. First, a real image x is projected onto W+And generating an initial hidden code w. Then, the invention optimizes to obtain a new implicit code weditGenerated sketchThe same as the input sketch is in the editing area, and the generated imageIn the non-editing area the same as the original image. To optimize for obtaining weditThe invention uses the following loss function:
wherein L isLPIPSIs LPIPS distance, as a matrix dot product. L issketchThe constraint edit area has the same structure as the sketch result, LrgbThe constrained non-editing regions remain unchanged. The final optimization loss function is:
Lediting(wedit)=β1Lsketch+β2Lrgb,
β1and beta2Is a hyper-parameter. In the optimization process, the weight of the fixed network and the only optimized parameter are wedit。
S24: the final edit vector is:
δ_edit = w_edit − w
δ_edit abstractly represents the sketch edit and is propagated through the entire video.
Wherein the method of S3 comprises:
some editing operations have a significant impact on the entire video but have low relevance to expressions and actions. These operations mainly change the basic shape of the face, such as the face contour and the shape of the facial features. The edit vector δ_edit generated by the invention has decoupling and semantic properties and is applied directly to the whole video frame sequence. For each frame f_i, a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N
these edit vectors propagate the edits throughout the video, generating a sequence of edited frames.
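The edit-vector arithmetic of S24 and the time-independent propagation above reduce to simple operations on the hidden codes. A minimal sketch with toy NumPy arrays standing in for W+ codes (the function names are illustrative, not from the patent):

```python
import numpy as np

def edit_vector(w_edit, w):
    # S24: delta_edit = w_edit - w, the difference between the optimized
    # and the original hidden code in W+ space.
    return w_edit - w

def propagate_time_independent(w_seq, delta_edit):
    """S3: delta_i = delta_edit for i = 1..N -- the same edit vector is
    superimposed on every frame's hidden code."""
    return [w_i + delta_edit for w_i in w_seq]
```

In the real system each code would be a StyleGAN W+ tensor rather than a flat array, but the broadcast addition is the same.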
Wherein the method of S4 comprises:
unlike single-frame editing, video exhibits changes in expression and motion over time. Users often edit time-localized facial actions, such as adding a blink or a smile at a particular time. When adding the edit vector δ_edit at a particular frame f_t, the user also inputs a duration h and a transition time l. Then, for each frame f_i, the invention uses a piecewise linear function to generate a smoothly propagated edit vector δ_i:
δ_i = γ_i · δ_edit, i = 1, 2, …, N,
where γ_i is 0 outside [t_1, t_4], rises linearly from 0 to 1 on [t_1, t_2], equals 1 on [t_2, t_3], and falls linearly from 1 to 0 on [t_3, t_4], with t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t the time corresponding to the editing frame f_t. These new edit vectors δ_i are used to synthesize highly realistic face images. With this editing mode, the invention not only generates the editing effect within a specific time window but also forms a smooth transition, i.e., the appearance and disappearance of the edit, for example from a natural expression to a smile and then back to a natural expression.
Wherein the method of S5 comprises:
in some scenarios, the user may want to add an edit only under certain expressions, while keeping the original state or adding other edits under other expressions. Such editing operations include expression-driven wrinkles (e.g., nasolabial folds, forehead lines) and shape edits that occur only under specific expressions (e.g., the eyes narrowing when smiling). To propagate expression-driven edits, the invention extracts the expression parameters of the face via 3D reconstruction. More specifically, given keyframes of the expression edit, where M is the number of keyframes, the invention extracts their expression parameters e_1, …, e_M and the corresponding edit vectors δ_1, …, δ_M. Some keyframes may carry no editing operation but act as key reference frames, indicating that no edit applies under a certain expression; for these frames the edit vector is the zero vector. The invention propagates the expression-guided edit using:
δ_i = (1/C) Σ_j sim(e_i, e_j) · δ_j,
where e_i is the expression parameter of the input frame f_i, sim(·, ·) is the cosine similarity of expression parameters, and C is a normalization term. In the invention, the edit vectors fused in this way act on the same region.
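A minimal sketch of the expression-driven weighting above: cosine similarities between the current frame's expression parameters and each keyframe's, normalized to sum to 1. The clipping of negative similarities and the exact form of the normalization term C are assumptions; the patent only states that weights come from cosine similarity:

```python
import numpy as np

def expression_weights(e_i, key_exprs):
    """Cosine-similarity weights between the current frame's expression
    parameters e_i and each keyframe's parameters, normalized to sum to 1."""
    sims = np.array([
        float(np.dot(e_i, e_k) / (np.linalg.norm(e_i) * np.linalg.norm(e_k)))
        for e_k in key_exprs
    ])
    sims = np.clip(sims, 0.0, None)   # assumption: ignore negative similarity
    c = sims.sum()                    # normalization term C
    return sims / c if c > 0 else sims

def propagate_expression_edit(e_i, key_exprs, key_deltas):
    # delta_i = (1/C) * sum_j sim(e_i, e_j) * delta_j
    w = expression_weights(e_i, key_exprs)
    return sum(wj * dj for wj, dj in zip(w, key_deltas))
```

A keyframe with the zero vector as its edit then naturally suppresses the edit near its expression, as the text describes for reference frames.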
Wherein the method of S6 comprises:
S6.1: the invention supports editing any number of frames with sketches and fuses the editing effects. After multiple frames are edited, multiple edit vectors are generated. A simple method is to add them directly; however, as shown in fig. 7, this produces artifacts, so the invention designs a region-aware fusion approach.
S6.2: given a frame sequence f_1, f_2, …, f_N, the user selects M keyframes k_1, k_2, …, k_M and edits different regions, corresponding to M editing mark regions m_1, m_2, …, m_M. Using the edit propagation methods above, M edit vectors δ_i^1, …, δ_i^M are generated for each frame f_i, representing the different editing operations.
For each frame f_i to be predicted, the invention uses the first-order motion method to generate a deformation field and deforms the input mark regions to generate M new regions m̂_1, …, m̂_M. The deformed region m̂_j is similar to the input drawn region m_j but accounts for the expression and head motion between frame f_i and the editing keyframe k_j. To fuse the different editing operations, the invention replaces the local region of the original frame's feature map with a new feature map:
F_i ← (1 − m̂_j) ⊙ F_i + m̂_j ⊙ F_i^j,
where F_i^j is the feature map generated from the edited hidden code w_i + δ_i^j, the initial feature map F_i is generated from w_i, and G is the generation network of StyleGAN. The invention downsamples m̂_j so that it has the same resolution as F_i and F_i^j. For M editing operations, this formula is iterated M times, j = 1 to M, completing the fusion of the locally edited regions. The invention updates the middle 5 feature maps of the generation network; these feature maps mainly control facial structure information, with resolutions from 32 × 32 to 128 × 128. Higher resolutions are governed by the original hidden code w_i and adapted using the StyleGAN network's own algorithm. The invention applies the above fusion operation to all frames f_i, i = 1, 2, …, N, generating an edited, fused, aligned face video.
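The masked feature-map replacement can be illustrated with NumPy arrays. The blend F' = (1 − m̂) ⊙ F + m̂ ⊙ F_edit and the nearest-neighbour mask downsampling below are assumed realizations of the "replace the local area" operation; the patent does not fix the resampling method:

```python
import numpy as np

def downsample_mask(mask, size):
    """Nearest-neighbour downsample of a binary mask to size x size,
    a stand-in for matching the mask to the feature-map resolution."""
    h, w = mask.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return mask[np.ix_(ys, xs)]

def fuse_features(feat_orig, feat_edit, mask):
    """Replace the masked local region of the original feature map with
    the edited one: F' = (1 - m) * F + m * F_edit, broadcast over channels.

    feat_orig, feat_edit: arrays of shape (C, H, W); mask: (H0, W0) binary.
    """
    m = downsample_mask(mask, feat_orig.shape[-1]).astype(feat_orig.dtype)
    return feat_orig * (1 - m) + feat_edit * m
```

Iterating this blend once per edit region j = 1..M reproduces the M-step local fusion described above.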
S6.3: the invention fuses the synthesized face to the original video to synthesize the final editing video. First, a face segmentation method is used to generate face region labeling maps of an input frame and an editing frame, and a union of face regions is calculated. The merged area expands further, making the edges transition smoothly. And converting the smoothed face region label graph into fusion weight, wherein the label region weight is 1, the non-label region weight is 0, the transition edge weight is between 0 and 1, and the face before and after editing is fused based on the weight. And finally, reversely aligning the face image to the original video, and synthesizing the final edited video.
As shown in fig. 3, the result of fusing time-independent editing with time-window editing is presented. The time-independent edits on the left add hair and a beard to the edited character, and the time-window edit adds an eyebrow-raising action. The first row on the right is the original video and the second row is the edited result: the edited video has more hair and a beard, together with the eyebrow-raising action.
As shown in fig. 4, the result of fusing time-independent editing with expression-driven editing is presented. The time-independent edit on the left reduces the size of the nose; in the expression-driven edit, the eyes narrow when the mouth opens and keep their original shape when the mouth is closed. The video frames before and after editing are shown on the right; the editing operations are propagated well to the whole video.
As shown in fig. 5, the result of editing a face using sketches of different rendering styles is shown. The first column of images shows the drawn sketch and the selected area, and the second column of images shows the result of a single frame edit. The first line on the right shows the original video frame and the subsequent lines show the results of the edit propagation. Aiming at sketches with different drawing styles, the method generates a result with higher quality and has better robustness.
As shown in fig. 6, the result of editing for a face with angle changes is presented. The first line of images is the original sequence of video frames, the second line of images is the edited sequence of video frames, and the left side is the user-drawn sketch and the selected region. The invention generates high-quality editing results even if the input face video has rotation and angle changes.
As shown in fig. 7, the results of two editorial fusion methods are presented. The user has added two time-independent edits, varying the face and hair, while adding a time window edit. The first row shows the editing results of the edited sketch, the second row shows the original video, the third row shows the result of a fusion mode of directly adding a plurality of editing vectors, and the fourth row shows the result of a region-aware fusion mode. The quality of the result generated by the regional perception fusion mode is higher than that of the direct addition of the editing vectors, and the effectiveness of the regional perception fusion module is proved.
As shown in fig. 8, the fusion results of the different edits are shown. The first line is the original video, the second line is the result of the time sequence irrelevant editing, the hair area of the human face is modified, the third line is the time window editing, the smile is added to the human face, and the last line is the result of the fusion of the two types of editing.
As shown in fig. 9, an intermediate result of the face video editing is shown. The second row shows the real video and the third row shows the result of the alignment. The fourth line shows the result of the drawing mask deformed according to the expression and the action, and the fifth line shows the editing result of the aligned face. The sixth row shows the result of face region segmentation, and the last row shows the final anti-aligned face editing result.
As shown in fig. 10, the results of key-point smoothing are presented. The first three rows show the result without key-point smoothing: the cropped and aligned face jitters severely. The last three rows show the result with key-point smoothing: the cropped and aligned face has no jitter problem.
The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation. The technical details mentioned in the above embodiments remain valid in this embodiment and are not repeated here to reduce repetition; conversely, the technical details of this embodiment can also be applied to the above embodiments.
The invention also provides a sketch-based deep face video editing system, which comprises:
the module 1 is used for aligning and cutting a face in an original video, and coding the face into a hidden space to obtain hidden codes of all frames in the face video;
a module 2 for adding a sketch generation branch to the StyleGAN generation network and reversely optimizing the image hidden code to generate an edit vector δ_edit;
a module 3 for superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4 for superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5, configured to calculate weights according to the similarity between the expression parameters of the current frame and the editing frame and superimpose the edit vector δ_edit, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and the module 6 is used for fusing different types of edits added by different frames by using a regional perception fusion method, and fusing the face to the original video to obtain a face video editing result based on the sketch.
In the sketch-based deep face video editing system, module 1 is used for detecting face key points in the face video, aligning and cropping the face after smoothing with a time window, and generating a video frame sequence f_1, f_2, …, f_N, where N is the number of frames in the face video; the frame sequence is projected into the hidden space W+, generating a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing system, module 2 is used for obtaining the original StyleGAN generation network G and constructing a generation network that models the joint probability distribution of real face images and sketches. The constructed network comprises two branches: an RGB branch, identical to the original generation network G, for generating realistic face images, and a sketch branch for generating the corresponding sketch image. Given the hidden code w of an image, the RGB branch generates feature maps F_1, F_2, …, F_14, where F_1 serves as the initial feature map of the sketch branch; the feature maps of the sketch branch are upsampled and the convolved residual of F_i is added, generating the sketch image corresponding to the hidden code w;
a sketch generation network S is trained using a dataset of images matched with sketches; it takes a face image as input and generates the corresponding sketch, and is used to supervise the sketch generation branch. Hidden codes w are randomly sampled and input into the constructed network to generate highly realistic face images and the corresponding sketches, and the sketch generation branch is trained with the loss function:
L = α_1 · L_VGG + α_2 · L_L2,
where L_VGG is a perceptual loss using a VGG19 model to measure visual similarity, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches is modeled, given an input face image x, a drawn sketch s_edit, and a selected region m_edit, the face image x is projected onto the W+ space to obtain the initial hidden code w; a hidden code w_edit is then obtained by optimization such that the generated sketch matches the input sketch in the editing region and the generated image matches the original image in the non-editing region, using the following loss function:
L_editing(w_edit) = β_1 · L_sketch + β_2 · L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; with the generation network fixed, w_edit is obtained by optimization;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i, a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N
Module 3 propagates the δ_edit corresponding to each frame f_i to the whole face video, generating the edited frame sequence.
In the sketch-based deep face video editing system, module 4 is used for adding a blink or smile action at a specific time in the face video: the edit vector δ_edit is added at a specific frame f_t with an input duration h and transition time l, and for each frame f_i a piecewise linear function generates the smoothly propagated edit vector δ_i:
δ_i = γ_i · δ_edit, i = 1, 2, …, N,
where γ_i rises linearly from 0 to 1 on [t_1, t_2], equals 1 on [t_2, t_3], falls linearly from 1 to 0 on [t_3, t_4], and is 0 elsewhere, with t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used for synthesizing realistic face images;
the module 5 comprises:
given a plurality of keyframes in the face video, their expression parameters e_1, …, e_M and the corresponding edit vectors δ_1, …, δ_M are extracted using 3D reconstruction, where M is the number of keyframes; the expression-guided edit is propagated using:
δ_i = (1/C) Σ_j sim(e_i, e_j) · δ_j,
where sim(·, ·) is the cosine similarity of expression parameters and C is a normalization term.
Module 6 is used as follows: given a frame sequence f_1, f_2, …, f_N, the user selects M keyframes k_1, k_2, …, k_M and edits different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated.
For each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed to obtain M new mark regions m̂_1, …, m̂_M, where m̂_j is the region m_j after deformation by action and expression; the local region of the original frame's feature map is replaced with the new feature map:
F_i ← (1 − m̂_j) ⊙ F_i + m̂_j ⊙ F_i^j,
where F_i^j is the feature map generated from the edited hidden code w_i + δ_i^j, the initial feature map F_i is generated from w_i, and G is the generation network of StyleGAN;
m̂_j is downsampled to the same resolution as F_i and F_i^j; the feature map is updated once for each of the M editing operations, M updates in total; the middle 5 feature maps of StyleGAN, with resolutions from 32 × 32 to 128 × 128, are updated, while higher resolutions are governed by the original hidden code w_i and adjusted based on the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating the edited, fused, aligned face video;
a face segmentation method generates the face mark regions of the input frame and the editing frame; the regions are merged, a smooth edge is generated for the merged mark region and used as the fusion weight, the faces before and after editing are blended, and the blended face image is inverse-aligned to the original video to synthesize the face video editing result.
The invention also provides a storage medium for storing a program for executing the method for editing the deep face video based on the sketch.
The invention also provides a client used for the deep face video editing system based on the sketch.
Claims (10)
1. A deep face video editing method based on sketch is characterized by comprising the following steps:
step 1, aligning and cutting a face in an original video, and coding the face into a hidden space to obtain hidden codes of all frames in the face video;
step 2, adding a sketch generation branch to the StyleGAN generation network, and reversely optimizing the image hidden code to generate an edit vector δ_edit;
step 3, superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
step 4, superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
step 5, calculating weights according to the similarity between the expression parameters of the current frame and the editing frame and superimposing the edit vector δ_edit, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and 6, fusing different types of edits added by different frames by using a region perception fusion method, and fusing the face to the original video to obtain a face video editing result based on the sketch.
2. The sketch-based deep face video editing method according to claim 1, wherein step 1 comprises: detecting face key points in the face video, aligning and cropping the face after smoothing with a time window, and generating a video frame sequence f_1, f_2, …, f_N, where N is the number of frames in the face video; projecting the frame sequence into the hidden space W+, generating a hidden code sequence w_1, w_2, …, w_N.
3. The method for video editing of deep face based on sketch as claimed in claim 2, wherein said step 2 includes:
acquiring the original StyleGAN generation network G, and constructing a generation network that models the joint probability distribution of real face images and sketches, the constructed network comprising two branches: an RGB branch, identical to the original generation network G, for generating realistic face images, and a sketch branch for generating the corresponding sketch image; given the hidden code w of an image, the RGB branch generates feature maps F_1, F_2, …, F_14, where F_1 serves as the initial feature map of the sketch branch; the feature maps of the sketch branch are upsampled and the convolved residual of F_i is added, generating the sketch image corresponding to the hidden code w;
training a sketch generation network S with a dataset of images matched with sketches, the network taking a face image as input and generating the corresponding sketch, for supervising the sketch generation branch; randomly sampling hidden codes w and inputting them into the constructed network to generate highly realistic face images and corresponding sketches, and training the sketch generation branch with the loss function:
L = α_1 · L_VGG + α_2 · L_L2,
where L_VGG is a perceptual loss using a VGG19 model to measure visual similarity, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches is modeled, given an input face image x, a drawn sketch s_edit, and a selected region m_edit, projecting the face image x onto the W+ space to obtain the initial hidden code w, and obtaining a hidden code w_edit by optimization such that the generated sketch matches the input sketch in the editing region and the generated image matches the original image in the non-editing region, using the following loss function:
L_editing(w_edit) = β_1 · L_sketch + β_2 · L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; with the generation network fixed, w_edit is obtained by optimization;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i, a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N
step 3 comprises propagating the δ_edit corresponding to each frame f_i to the whole face video, generating the edited frame sequence.
4. The method for video editing of deep face based on sketch as claimed in claim 3, wherein said step 4 includes:
adding a blink or smile action at a specific time in the face video: adding the edit vector δ_edit at a specific frame f_t with an input duration h and transition time l, and for each frame f_i, using a piecewise linear function to generate a smoothly propagated edit vector δ_i:
δ_i = γ_i · δ_edit, i = 1, 2, …, N,
where γ_i rises linearly from 0 to 1 on [t_1, t_2], equals 1 on [t_2, t_3], falls linearly from 1 to 0 on [t_3, t_4], and is 0 elsewhere, with t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used for synthesizing realistic face images;
the step 5 comprises the following steps:
given a plurality of keyframes in the face video, extracting the expression parameters e_1, …, e_M of the face and the corresponding edit vectors δ_1, …, δ_M using 3D reconstruction, where M is the number of keyframes, and propagating the expression-guided edit using:
δ_i = (1/C) Σ_j sim(e_i, e_j) · δ_j,
where sim(·, ·) is the cosine similarity of expression parameters and C is a normalization term.
the step 6 comprises the following steps:
given a frame sequence f_1, f_2, …, f_N, the user selects M keyframes k_1, k_2, …, k_M and edits different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated;
for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed to obtain M new mark regions m̂_1, …, m̂_M, where m̂_j is the region m_j after deformation by action and expression; the local region of the original frame's feature map is replaced with the new feature map:
F_i ← (1 − m̂_j) ⊙ F_i + m̂_j ⊙ F_i^j,
where F_i^j is the feature map generated from the edited hidden code w_i + δ_i^j, the initial feature map F_i is generated from w_i, and G is the generation network of StyleGAN;
m̂_j is downsampled to the same resolution as F_i and F_i^j; the feature map is updated once for each of the M editing operations, M updates in total; the middle 5 feature maps of StyleGAN, with resolutions from 32 × 32 to 128 × 128, are updated, while higher resolutions are governed by the original hidden code w_i and adjusted based on the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating the edited, fused, aligned face video;
generating the face mark regions of the input frame and the editing frame with a face segmentation method, merging them, generating a smooth edge for the merged mark region, using it as the fusion weight to blend the faces before and after editing, inverse-aligning the blended face image to the original video, and synthesizing the face video editing result.
5. A sketch-based deep face video editing system, comprising:
the module 1 is used for aligning and cutting a face in an original video, and coding the face into a hidden space to obtain hidden codes of all frames in the face video;
a module 2 for adding a sketch generation branch to the StyleGAN generation network and reversely optimizing the image hidden code to generate an edit vector δ_edit;
a module 3 for superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4 for superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5 for calculating weights according to the similarity between the expression parameters of the current frame and the editing frame and superimposing the edit vector δ_edit, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and the module 6 is used for fusing different types of edits added by different frames by using a region perception fusion method, and fusing the face to the original video to obtain a face video editing result based on the sketch.
6. The sketch-based deep face video editing system according to claim 5, wherein module 1 is used for detecting face key points in the face video, aligning and cropping the face after smoothing with a time window, and generating a video frame sequence f_1, f_2, …, f_N, where N is the number of frames in the face video; the frame sequence is projected into the hidden space W+, generating a hidden code sequence w_1, w_2, …, w_N.
7. The sketch-based deep face video editing system according to claim 6, wherein module 2 is used for obtaining the original StyleGAN generation network G and constructing a generation network that models the joint probability distribution of real face images and sketches, the constructed network comprising two branches: an RGB branch, identical to the original generation network G, for generating realistic face images, and a sketch branch for generating the corresponding sketch image; given the hidden code w of an image, the RGB branch generates feature maps F_1, F_2, …, F_14, where F_1 serves as the initial feature map of the sketch branch; the feature maps of the sketch branch are upsampled and the convolved residual of F_i is added, generating the sketch image corresponding to the hidden code w;
training a sketch generation network S with a dataset of images matched with sketches, the network taking a face image as input and generating the corresponding sketch, for supervising the sketch generation branch; randomly sampling hidden codes w and inputting them into the constructed network to generate highly realistic face images and corresponding sketches, and training the sketch generation branch with the loss function:
L = α_1 · L_VGG + α_2 · L_L2,
where L_VGG is a perceptual loss using a VGG19 model to measure visual similarity, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches is modeled, given an input face image x, a drawn sketch s_edit, and a selected region m_edit, projecting the face image x onto the W+ space to obtain the initial hidden code w, and obtaining a hidden code w_edit by optimization such that the generated sketch matches the input sketch in the editing region and the generated image matches the original image in the non-editing region, using the following loss function:
L_editing(w_edit) = β_1 · L_sketch + β_2 · L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; with the generation network fixed, w_edit is obtained by optimization;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i, a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N
module 3 propagates the δ_edit corresponding to each frame f_i to the whole face video, generating the edited frame sequence.
8. The sketch-based deep human face video editing system of claim 7, wherein the module 4 is used for adding blinking or smiling actions at specific times in the human face video, and at specific frames ftAdding an edit vector deltaeditInput duration h and change time l, for each frame fiThe invention uses piecewise linear functions to generate smooth propagation edit vector deltaiTo obtain a new edit vector deltai:
δi=γ·δedit,i=1,2,…,M
t1=t-h/2-l,t2=t-h/2,t3=t+h/2,t4= t + h/2+ l, t is the edit frame ftThe corresponding time;
these new edit vectors δ_i are used for synthesizing the edited face images;
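The piecewise linear weight can be written out directly (a sketch consistent with the breakpoints t1…t4 above: a linear ramp-up over the change time l, full strength for the duration h, then a linear ramp-down):

```python
def gamma(ti, t, h, l):
    # Breakpoints around the edited frame's time t.
    t1, t2 = t - h / 2 - l, t - h / 2
    t3, t4 = t + h / 2, t + h / 2 + l
    if ti <= t1 or ti >= t4:
        return 0.0            # outside the edit window: no effect
    if t2 <= ti <= t3:
        return 1.0            # full edit strength for the duration h
    if ti < t2:
        return (ti - t1) / l  # linear ramp-up over the change time l
    return (t4 - ti) / l      # linear ramp-down

def propagate(delta_edit, times, t, h, l):
    # delta_i = gamma(t_i) * delta_edit for every frame time t_i.
    return [gamma(ti, t, h, l) * delta_edit for ti in times]
```

For example, with t = 10, h = 4, l = 2 the weight is 1 on [8, 12] and 0 outside [6, 14].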
the module 5 comprises:
given a plurality of key frames in the face video, the facial expression parameters are extracted using a 3D reconstruction method, together with the corresponding edit vectors, where M is the number of key frames; the expression-guided edits are then propagated across the remaining frames;
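As one plausible illustration of expression-guided propagation (the inverse-distance weighting over expression parameters used here is an assumption for the sake of example, not necessarily the exact propagation rule of this system):

```python
import numpy as np

def propagate_by_expression(e_i, key_exprs, key_deltas, eps=1e-6):
    # Weight each key frame's edit vector by how close its expression
    # parameters are to the current frame's parameters e_i.
    d = np.array([np.linalg.norm(e_i - e_k) for e_k in key_exprs])
    w = 1.0 / (d + eps)   # closer expressions get larger weights
    w /= w.sum()          # normalize to a convex combination
    # Weighted combination of the M key-frame edit vectors.
    return sum(wi * dk for wi, dk in zip(w, key_deltas))
```

When a frame's expression matches a key frame exactly, it receives (almost entirely) that key frame's edit vector.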
the module 6 is used for: given a continuous frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors are generated;
for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed to produce M new mark regions, where the j-th new mark region is m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced with the new features:
wherein the initial feature map is produced by G, the generation network of StyleGAN;
the new mark regions are downsampled so that their resolution matches the feature maps; the feature map is updated once for each of the M editing operations, M times in total; the 5 intermediate feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated, while the high-resolution layers are adjusted from the original hidden code w_i based on the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating the edited and fused aligned face video;
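The per-region feature replacement can be sketched as a masked blend repeated for each of the M edits (simplified to a single-channel feature map; the real system operates on StyleGAN's multi-channel intermediate feature maps at each resolution):

```python
import numpy as np

def fuse_features(f_orig, edits):
    # `edits` is a list of (mask, f_edit) pairs: the deformed mark
    # region and the edited feature map for each of the M edits,
    # both downsampled to the feature map's resolution.
    f = f_orig.copy()
    for mask, f_edit in edits:
        # Replace the local region of the current feature map with the
        # edited features; outside the mask it is left unchanged.
        f = mask * f_edit + (1.0 - mask) * f
    return f
```

Each iteration updates only the pixels inside one deformed mark region, so M edits touch M disjoint (or overlapping, last-wins-blended) local areas.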
a face segmentation method is used to generate the face mark regions of the input frame and the edited frame, and the two regions are merged; a smooth edge is generated for the merged mark region and used as the fusion weight to fuse the faces before and after editing; the fused face image is then inversely aligned to the original video to synthesize the face video editing result.
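The final fusion step can be sketched as follows (a simple box blur stands in for the smooth-edge generation; any low-pass filter on the merged mask would serve, and the inverse alignment back to the original video is omitted):

```python
import numpy as np

def smooth_mask(mask, k=3):
    # Soften the merged mark region's boundary with a box blur; the
    # blurred mask then serves as the per-pixel fusion weight.
    padded = np.pad(mask.astype(float), k, mode="edge")
    out = np.zeros_like(mask, dtype=float)
    h, w = mask.shape
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out += padded[k + dy:k + dy + h, k + dx:k + dx + w]
    return out / (2 * k + 1) ** 2

def blend(edited, original, mask, k=3):
    # Weighted blend of the faces before and after editing.
    w = smooth_mask(mask, k)
    return w * edited + (1.0 - w) * original
```

Inside the mask the edited face dominates, outside the original remains, and the blurred boundary avoids visible seams.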
9. A storage medium storing a program for executing the sketch-based deep face video editing method according to any one of claims 1 to 5.
10. A client used in the sketch-based deep face video editing system according to any one of claims 6 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210698610.8A CN115278106A (en) | 2022-06-20 | 2022-06-20 | Deep face video editing method and system based on sketch |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115278106A true CN115278106A (en) | 2022-11-01 |
Family
ID=83761741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210698610.8A Pending CN115278106A (en) | 2022-06-20 | 2022-06-20 | Deep face video editing method and system based on sketch |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115278106A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115810215A (en) * | 2023-02-08 | 2023-03-17 | 科大讯飞股份有限公司 | Face image generation method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111489405A (en) * | 2020-03-21 | 2020-08-04 | 复旦大学 | Face sketch synthesis system for generating confrontation network based on condition enhancement |
CN111652828A (en) * | 2020-05-27 | 2020-09-11 | 北京百度网讯科技有限公司 | Face image generation method, device, equipment and medium |
US20210209464A1 (en) * | 2020-01-08 | 2021-07-08 | Palo Alto Research Center Incorporated | System and method for synthetic image generation with localized editing |
CN113112572A (en) * | 2021-04-13 | 2021-07-13 | 复旦大学 | Hidden space search-based image editing method guided by hand-drawn sketch |
CN113901894A (en) * | 2021-09-22 | 2022-01-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Video generation method, device, server and storage medium |
CN114255496A (en) * | 2021-11-30 | 2022-03-29 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
CHENG PEN ET AL: "Analysis of Neural Style Transfer Based on Generative Adversarial Network", 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), 24 December 2021 (2021-12-24) * |
SU JIAYANG: "Customized-Action Face Video Synthesis ***", Master's Thesis, Beijing University of Posts and Telecommunications, 15 January 2022 (2022-01-15) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jiang et al. | Scfont: Structure-guided chinese font generation via deep stacked networks | |
US11880766B2 (en) | Techniques for domain to domain projection using a generative model | |
CN113194348B (en) | Virtual human lecture video generation method, system, device and storage medium | |
CN111915693B (en) | Sketch-based face image generation method and sketch-based face image generation system | |
US9734613B2 (en) | Apparatus and method for generating facial composite image, recording medium for performing the method | |
Seol et al. | Artist friendly facial animation retargeting | |
Chen et al. | PicToon: a personalized image-based cartoon system | |
Cong | Art-directed muscle simulation for high-end facial animation | |
Saunders et al. | Anonysign: Novel human appearance synthesis for sign language video anonymisation | |
CN110310351A (en) | A kind of 3 D human body skeleton cartoon automatic generation method based on sketch | |
CN115278106A (en) | Deep face video editing method and system based on sketch | |
CN112991484B (en) | Intelligent face editing method and device, storage medium and equipment | |
CN111275778A (en) | Face sketch generating method and device | |
Song et al. | FineStyle: Semantic-Aware Fine-Grained Motion Style Transfer with Dual Interactive-Flow Fusion | |
Li et al. | Orthogonal-blendshape-based editing system for facial motion capture data | |
Tejera et al. | Animation control of surface motion capture | |
Kawai et al. | Data-driven speech animation synthesis focusing on realistic inside of the mouth | |
Nakatsuka et al. | Audio-guided Video Interpolation via Human Pose Features. | |
Nakatsuka et al. | Audio-oriented video interpolation using key pose | |
CN115578298A (en) | Depth portrait video synthesis method based on content perception | |
Chen et al. | Animating lip-sync characters with dominated animeme models | |
Cao et al. | AnimeDiffusion: anime diffusion colorization | |
Sistla et al. | A state-of-the-art review on image synthesis with generative adversarial networks | |
Wang et al. | Expression-aware neural radiance fields for high-fidelity talking portrait synthesis | |
Wang et al. | Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||