CN115278106A - Deep face video editing method and system based on sketch - Google Patents

Deep face video editing method and system based on sketch

Info

Publication number
CN115278106A
CN115278106A (Application CN202210698610.8A)
Authority
CN
China
Prior art keywords
edit
editing
sketch
face
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210698610.8A
Other languages
Chinese (zh)
Inventor
高林
陈姝宇
刘锋林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210698610.8A priority Critical patent/CN115278106A/en
Publication of CN115278106A publication Critical patent/CN115278106A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 Mixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a sketch-based deep face video editing method and system. The method aligns and crops the face in an original video and encodes it into a hidden space to obtain the hidden codes of all frames of the face video; adds a sketch generation branch Ĝ_sketch to StyleGAN and inversely optimizes the image hidden code to generate an edit vector δ_edit; superimposes the edit vector δ_edit on the hidden codes of all frames to propagate time-independent edits; superimposes the edit vector δ_edit with weights given by a piecewise linear function to propagate action or expression edits; superimposes the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation; and fuses the different types of edits added on different frames with a region-aware fusion method, blending the face back into the original video.

Description

Deep face video editing method and system based on sketch
Technical Field
The invention relates to the technical field of computer graphics and computer vision, and in particular to a method and system for sketch-based face video synthesis and editing.
Background
Video editing is a challenging research problem, and with the development of deep learning, video editing and modification work has become increasingly common. Most existing video editing methods modify global attributes of a video, for example converting a black-and-white video into a color video or stylizing a video to produce artistic results. For face videos, the prior art mainly performs tasks such as face swapping, which only modify the global attribute of identity. Some methods can edit detail regions of a video, but they require professional software such as Photoshop (PS) or Premiere (PR) and demand considerable time and effort. The sketch is an efficient, accurate and user-friendly interaction tool and is widely used for image generation and editing. However, the prior art cannot extend sketch editing from images to videos and struggles with the propagation and composition of editing operations. Video editing has broad application prospects and high value in fields such as film production and new media, yet existing techniques cannot edit video details simply and quickly.
For the video editing problem, the prior art can automatically colorize a video, but this function is limited to changing the color information of the video. Video stylization can change the color characteristics of a video and apply artistic transformations to its content, generating videos with an artistic feel. However, such work can only edit global features of a video and cannot modify its details. Even methods that represent video in an atlas space only extend detail edits to video clips after the image has been edited in software such as Photoshop. These approaches require professional software, and the editing and generation of the video are time-consuming. The sketch is a friendlier interaction tool that makes the user's editing operations simpler and more precise. The prior art can edit a face image with a sketch, but the editing operation cannot be propagated to the whole video.
Disclosure of Invention
To solve the problems that video content details are difficult to edit and that sketch edits cannot be propagated to a video, the invention builds on the face image generation network StyleGAN, abstractly represents a sketch edit as a hidden-space vector, and designs a novel propagation and fusion mechanism to edit face videos. The invention provides a sketch-based face video editing method and system with which any one or more frames can be selected, face details can be edited with a sketch, and the edit can be propagated to the whole video in a specified manner.
Specifically, the invention provides a deep face video editing method based on a sketch, which comprises the following steps:
step 1, aligning and cropping the face in an original video, and encoding the face into a hidden space to obtain the hidden codes of all frames of the face video;
step 2, adding a sketch generation branch Ĝ_sketch to the StyleGAN generation network, and inversely optimizing the image hidden code to generate an edit vector δ_edit;
step 3, superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
step 4, superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
step 5, superimposing the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and step 6, fusing the different types of edits added on different frames with a region-aware fusion method, and blending the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing method, step 1 comprises: detecting the face key points of the face video, smoothing them with a time window, then aligning and cropping the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and projecting the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing method, step 2 comprises:
acquiring the original StyleGAN generation network G, and constructing a generation network Ĝ that models the joint probability distribution of real face images and sketches; Ĝ contains two branches, Ĝ_img and Ĝ_sketch, where Ĝ_img is the original generation network G and generates realistic face images, and Ĝ_sketch generates the corresponding sketch images; given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch; the feature maps of the Ĝ_sketch branch are upsampled and added to the residual images obtained by convolving the feature maps F_i, generating the sketch image corresponding to the hidden code w;
training a sketch generation network S with a dataset of paired images and sketches; S takes a face image as input and generates the corresponding sketch, and is used to train the sketch generation branch Ĝ_sketch; randomly sampling a hidden code w and feeding it to Ĝ generates a realistic face image x̂ = Ĝ_img(w) and the corresponding sketch ŝ = Ĝ_sketch(w); the loss function
L(Ĝ_sketch) = α_1·L_VGG(ŝ, S(x̂)) + α_2·L_L2(ŝ, S(x̂))
is constructed to train the sketch generation branch Ĝ_sketch, where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modelled, given the input face image x, the drawn sketch s_edit and the selected region m_edit: the face image x is projected into the W+ space and optimized to obtain the hidden code w_edit, such that the generated sketch Ĝ_sketch(w_edit) matches the input sketch inside the editing region and the generated image Ĝ_img(w_edit) matches the original image in the non-editing region; w_edit is obtained with the loss function
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, β_1 and β_2 are hyper-parameters, and the generation network Ĝ is kept fixed while w_edit is optimized;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
Step 3 comprises superimposing the δ_edit corresponding to each frame f_i onto its hidden code, propagating the edit to the whole face video and generating an edited frame sequence.
In the sketch-based deep face video editing method, step 4 comprises:
adding an action such as blinking or smiling at a specific time of the face video; at a specific frame f_t an edit vector δ_edit is added together with an input duration h and transition time l; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
step 5 comprises:
given several key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the face video, extracting the face expression parameters e_{k_1}, …, e_{k_M} by 3D reconstruction, together with the corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edit is propagated with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i, c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term, and the edit vectors δ_{k_j} act on the same region;
step 6 comprises:
given the frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated; for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions m̂_i^1, …, m̂_i^M, where m̂_i^j is the region m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced by a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j, j = 1, 2, …, M,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i);
the mark region m̂_i^j is downsampled so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j); the feature map F_i^j is updated once for each of the M editing operations, M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated in this way, while the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video;
and generating face mask regions of the input frame and the editing frames with a face segmentation method, taking their union, generating a smooth edge for the merged mask region, using it as the fusion weight to blend the faces before and after editing, inversely aligning the blended face image back to the original video, and synthesizing the face video editing result.
The invention also provides a sketch-based deep face video editing system, which comprises:
a module 1, configured to align and crop the face in an original video and encode the face into a hidden space to obtain the hidden codes of all frames of the face video;
a module 2, configured to add a sketch generation branch Ĝ_sketch to the StyleGAN generation network and inversely optimize the image hidden code to generate an edit vector δ_edit;
a module 3, configured to superimpose the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4, configured to superimpose the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5, configured to superimpose the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and a module 6, configured to fuse the different types of edits added on different frames with a region-aware fusion method, and blend the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing system, the module 1 is configured to detect the face key points of the face video, smooth them with a time window, then align and crop the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and to project the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing system, the module 2 is configured to acquire the original StyleGAN generation network G and construct a generation network Ĝ that models the joint probability distribution of real face images and sketches; Ĝ contains two branches, Ĝ_img and Ĝ_sketch, where Ĝ_img is the original generation network G and generates realistic face images, and Ĝ_sketch generates the corresponding sketch images; given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch; the feature maps of the Ĝ_sketch branch are upsampled and added to the residual images obtained by convolving the feature maps F_i, generating the sketch image corresponding to the hidden code w;
a sketch generation network S is trained with a dataset of paired images and sketches; S takes a face image as input and generates the corresponding sketch, and is used to train the sketch generation branch Ĝ_sketch; a hidden code w is randomly sampled and fed to Ĝ, generating a realistic face image x̂ = Ĝ_img(w) and the corresponding sketch ŝ = Ĝ_sketch(w); the loss function
L(Ĝ_sketch) = α_1·L_VGG(ŝ, S(x̂)) + α_2·L_L2(ŝ, S(x̂))
is constructed to train the sketch generation branch Ĝ_sketch, where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modelled, given the input face image x, the drawn sketch s_edit and the selected region m_edit: the face image x is projected into the W+ space and optimized to obtain the hidden code w_edit, such that the generated sketch Ĝ_sketch(w_edit) matches the input sketch inside the editing region and the generated image Ĝ_img(w_edit) matches the original image in the non-editing region; w_edit is obtained with the loss function
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, β_1 and β_2 are hyper-parameters, and the generation network Ĝ is kept fixed while w_edit is optimized;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
The module 3 is configured to superimpose the δ_edit corresponding to each frame f_i onto its hidden code, propagating the edit to the whole face video and generating an edited frame sequence.
In the sketch-based deep face video editing system, the module 4 is configured to add an action such as blinking or smiling at a specific time of the face video; at a specific frame f_t an edit vector δ_edit is added together with an input duration h and transition time l; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
the module 5 comprises:
given several key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the face video, extracting the face expression parameters e_{k_1}, …, e_{k_M} by 3D reconstruction, together with the corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edit is propagated with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i, c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term, and the edit vectors δ_{k_j} act on the same region;
the module 6 is configured, given the frame sequence f_1, f_2, …, f_N, to let the user select M key frames k_1, k_2, …, k_M and edit different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated; for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions m̂_i^1, …, m̂_i^M, where m̂_i^j is the region m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced by a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j, j = 1, 2, …, M,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i);
the mark region m̂_i^j is downsampled so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j); the feature map F_i^j is updated once for each of the M editing operations, M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated in this way, while the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video;
and face mask regions of the input frame and the editing frames are generated with a face segmentation method and merged; a smooth edge is generated for the merged mask region and used as the fusion weight to blend the faces before and after editing; the blended face image is inversely aligned back to the original video, synthesizing the face video editing result.
The invention also provides a storage medium for storing a program that executes any of the above sketch-based deep face video editing methods.
The invention also provides a client for use with any of the above sketch-based deep face video editing systems.
According to the above scheme, the invention has the following advantages:
with the system designed by the invention, one or more editing frames can be selected, the user draws a sketch and a corresponding editing-region mask, and after the propagation mode of the edit is specified, the editing and propagation of the edit over the video are realized.
Drawings
FIG. 1 is a schematic flow chart of the system of the present invention;
FIG. 2 is a schematic diagram of sketch optimization;
FIG. 3 is a diagram of time-independent editing and time window editing results;
FIG. 4 is a diagram of the results of time-independent editing and expression-driven editing;
FIG. 5 is a graph of results for different rendering styles;
FIG. 6 is a diagram of a result of a rotating face edit;
FIG. 7 is a graph of results fused using different approaches after optimizing sketch editing vectors;
FIG. 8 is a diagram of different editing fusion results;
FIG. 9 is a diagram illustrating an intermediate result of face video editing;
fig. 10 is a graph of the key point smoothing results.
Detailed Description
The defects of the prior art arise because the propagation of sketch edits across a video has not been considered: the face in a video changes in expression and motion, so an input sketch edit is difficult to apply directly to other frames; moreover, a sketch may change both identity characteristics of the face (such as the shape of facial features) and expressions or actions (such as adding a smile), and distinguishing these and propagating them reasonably is very difficult. Video editing also requires temporal stability; existing methods do not consider the flicker problem of video generation, and the quality of their results is poor.
The inventors found that these defects can be overcome with sketch-based editing of images and videos by designing a reasonable video encoding scheme together with a sketch editing, propagation and fusion method. After a face video is input, the face region is first cropped and aligned through key point detection. The face images of all frames are then encoded into the hidden space of the StyleGAN generation network with an image encoding network. For the sketch edit input by the user, an optimization strategy is designed that abstractly represents the editing operation as an edit vector. During propagation, editing operations are divided into two types, time-independent edits and time-dependent edits; time-dependent edits are further divided into time-window edits and expression-driven edits; the user specifies the type of edit, and propagation is performed in the corresponding mode. A region-aware fusion strategy is also designed to fuse the different editing operations input on different frames and generate the edited face video. Finally, the generated face video is inversely aligned to the input original video and the face region is blended, producing the sketch-based video editing result.
The core invention points of the invention comprise:
key point 1, styleGAN based video coding module. After a section of face video is input, a face image is cut and aligned by using a face key point detection technology of dlib, and a time window is used for smoothing. The input frame sequence is encoded into a cryptic code sequence based on an E4E face to StyleGAN cryptic spatial encoding technique. According to the input frame sequence and the hidden code sequence, based on the PTI reconstruction technology, the weight of the StyleGAN generation network is finely adjusted, so that the original video can be perfectly reconstructed, the video coding task is completed, and the subsequent video editing is served;
and 2, a key point 2, a sketch editing and optimizing module. And (4) generating a network based on the pre-trained sketch, expanding the original StyleGAN, and adding a sketch generation branch to the original StyleGAN. Further, an optimization strategy is designed, a user inputs a drawn sketch and an editing area mask, the constraint of the editing area is the same as that of the sketch, the constraint of the editing area is not the same as that of the original image, and the original hidden code is optimized in an iterative mode. And (4) subtracting the optimized hidden code from the input hidden code to obtain an editing vector, and abstractively representing the sketch editing operation.
Key point 3, time-independent propagation. Some editing operations are time-independent: the edit should be applied uniformly to the whole video regardless of expression and motion, for example edits of the face shape. For such edits, the edit vector is added directly to the hidden code of each input frame, propagating the edit.
Key point 4: time-window propagation. Some editing operations represent specific facial actions, such as smiling or closing the eyes. These actions are divided into three stages, a starting stage, a holding stage and an ending stage; the edit vector is superimposed with linearly changing weights in the starting and ending stages and with a fixed weight in the holding stage, realizing the whole process of an action beginning, holding and ending.
Key point 5: expression-driven propagation. Some editing operations are related to expressions, for example the eyes narrowing when smiling. For such edits, the expression parameters of the face are extracted by 3D reconstruction, weights are computed from the cosine similarity between the expression parameters of the editing frame and the predicted frame, and the edit vectors are superimposed accordingly.
Key point 6: a region-aware fusion module. In video editing the user often selects multiple frames and edits different regions at the same time. This module uses a deformation operation to predict a deformation field from the change of the face's action and expression, and deforms the drawn masks accordingly. Finally, the generated face regions are fused and back-projected to the original frames, completing the fusion of the different edits.
In order to make the aforementioned features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The system flow chart is shown in fig. 1, and the system comprises multiple technologies of editing vector generation, time sequence independent editing propagation, time window editing propagation, expression driven editing propagation, regional perception fusion and the like.
The sketch optimization flow is shown in fig. 2: the original StyleGAN generation network is extended into two branches, one generating a sketch image and the other generating a highly realistic image. The optimization process contains two loss terms, L_sketch and L_rgb, which respectively constrain the editing region to follow the sketch edit and the other regions to remain unchanged.
As shown in fig. 1, the high-quality face video editing method and system based on sketch interaction comprises:
S1: after a video is input, aligning and cropping the face and encoding the face into the hidden space;
S2: extending the StyleGAN generation network by adding a sketch generation branch Ĝ_sketch, and inversely optimizing the image hidden code to generate an edit vector δ_edit;
S3: time-independent editing, in which the edit vector δ_edit is directly superimposed on the hidden codes of all frames to propagate time-independent edits;
S4: time-window editing, in which the edit vector δ_edit is superimposed with weights given by a piecewise linear function to propagate action or expression edits;
S5: expression-driven editing, in which the edit vector δ_edit is superimposed with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression;
S6: fusing the different types of edits added on different frames with a region-aware fusion method, and blending the face back into the original video;
The method of S1 comprises:
given an input video, detecting the face key points with dlib, smoothing the key point coordinates with a time window, and aligning and cropping the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames. The invention uses E4E to project the frame sequence into the W+ space, generating the hidden code sequence w_1, w_2, …, w_N. The edit vectors generated later are superimposed on this hidden code sequence. The smoothing acts on the face key point coordinates across the sequence: the face detection is performed frame by frame, so there is some jitter between frames, and the smoothing eliminates its influence.
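The following is a minimal sketch of such temporal smoothing of the key points before alignment and cropping, assuming a (N, 68, 2) array of per-frame key points from a detector such as dlib; the centered moving average and the window size are illustrative choices, not details fixed by the patent text.

```python
import numpy as np

def smooth_keypoints(keypoints: np.ndarray, win: int = 5) -> np.ndarray:
    """Average each frame's key points over a centered temporal window."""
    n = len(keypoints)
    smoothed = np.empty_like(keypoints, dtype=np.float64)
    for i in range(n):
        lo, hi = max(0, i - win // 2), min(n, i + win // 2 + 1)
        smoothed[i] = keypoints[lo:hi].mean(axis=0)  # mean inside the window
    return smoothed
```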
The method of S2 is shown in fig. 2, and includes:
S21: given the original StyleGAN generation network G, the invention designs a new generation network Ĝ that models the joint probability distribution of real face images and sketches. It contains two branches: Ĝ_img, the original generation network G, generates a highly realistic face image, and Ĝ_sketch generates the corresponding sketch image. Given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch. The intermediate feature maps of the Ĝ_sketch branch are repeatedly upsampled and added to the residual images obtained by convolving F_i. After this operation has been performed for i = 2 to 14, the sketch image corresponding to the hidden code w is generated. The StyleGAN3 original generation network has a 10-pixel extension; the invention crops the intermediate feature maps and only uses the cropped pixel content.
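To make the branch structure concrete, here is a minimal sketch of a sketch-generation branch of this kind: the running feature map starts from F_1, is upsampled to the resolution of each later RGB-branch feature map F_i, and a convolved residual of F_i is added. The channel handling, the single-channel output head and the sigmoid are assumptions added for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchBranch(nn.Module):
    def __init__(self, channels):            # channels[i] = #channels of F_{i+1}
        super().__init__()
        self.res_convs = nn.ModuleList(
            nn.Conv2d(c, channels[0], 3, padding=1) for c in channels[1:])
        self.to_sketch = nn.Conv2d(channels[0], 1, 1)   # grayscale sketch head

    def forward(self, feats):                 # feats = [F_1, ..., F_14]
        h = feats[0]                           # initial feature map of the branch
        for conv, f in zip(self.res_convs, feats[1:]):
            if f.shape[-2:] != h.shape[-2:]:   # upsample running map to F_i size
                h = F.interpolate(h, size=f.shape[-2:], mode="bilinear")
            h = h + conv(f)                    # add convolved residual of F_i
        return torch.sigmoid(self.to_sketch(h))
```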
S22: to train a sketch to generate branches
Figure BDA0003703089970000107
We first train a sketch generation network S based on the Pix2PixHD network using a data set with images matched to the sketch. The sketch generation network takes a real face image as input to generate a corresponding sketch for training a training sketch generation branch
Figure BDA0003703089970000108
Then, the invention randomly samples the hidden code w and inputs the hidden code w
Figure BDA0003703089970000109
Generating highly realistic face images
Figure BDA00037030899700001010
And corresponding sketch
Figure BDA00037030899700001011
The invention then trains the sketch generation branches using the following loss functions:
Figure BDA00037030899700001012
LVGGis a perceptual loss function, and uses a VGG19 model to measure the visual similarity, LL2Is the pixel L2 loss, α1And alpha2Are all preset weights, in this case α1=α2=1。
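A minimal sketch of this training loss under stated assumptions: `G` returns both the face image and the sketch image for a sampled W+ code, `S` is the pretrained Pix2PixHD-style sketch network used as supervision, and `vgg_features` returns a list of VGG19 feature maps; the L1 distance between VGG features is one common choice for the perceptual term, not necessarily the exact one used here.

```python
import torch
import torch.nn.functional as F

def branch_loss(G, S, vgg_features, w, alpha1=1.0, alpha2=1.0):
    img_hat, sketch_hat = G(w)          # realistic face image and its sketch
    with torch.no_grad():
        sketch_ref = S(img_hat)         # pseudo ground-truth sketch from S
    # perceptual (VGG19 feature) loss between branch output and reference sketch
    l_vgg = sum(F.l1_loss(a, b) for a, b in zip(vgg_features(sketch_hat),
                                                vgg_features(sketch_ref)))
    l_l2 = F.mse_loss(sketch_hat, sketch_ref)   # pixel-wise L2 loss
    return alpha1 * l_vgg + alpha2 * l_l2
```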
S23: after modeling the distribution of the real image and the sketch, the invention designs an optimization technology, according to the real image x input by the user, the sketch s drawn by the usereditAnd a mark area meditGenerating an edit vector deltaedit. First, a real image x is projected onto W+And generating an initial hidden code w. Then, the invention optimizes to obtain a new implicit code weditGenerated sketch
Figure BDA0003703089970000111
The same as the input sketch is in the editing area, and the generated image
Figure BDA0003703089970000112
In the non-editing area the same as the original image. To optimize for obtaining weditThe invention uses the following loss function:
Figure BDA0003703089970000113
Figure BDA0003703089970000114
wherein L isLPIPSIs LPIPS distance, as a matrix dot product. L issketchThe constraint edit area has the same structure as the sketch result, LrgbThe constrained non-editing regions remain unchanged. The final optimization loss function is:
Lediting(wedit)=β1Lsketch2Lrgb,
β1and beta2Is a hyper-parameter. In the optimization process, the weight of the fixed network and the only optimized parameter are wedit
S24: the final edit vector is:
δedit=wedit-w
δeditthe abstraction represents the compilation of the sketch and propagates through to the entire video.
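A minimal sketch of this inverse optimization, assuming `G_img` and `G_sketch` map a W+ code to an image/sketch tensor, `lpips_fn` is a perceptual distance (e.g. an LPIPS model), and `x`, `s_edit`, `m_edit` are (1, C, H, W) tensors; the optimizer, step count and learning rate are illustrative.

```python
import torch

def optimize_edit_vector(G_img, G_sketch, lpips_fn, w, x, s_edit, m_edit,
                         beta1=1.0, beta2=1.0, steps=200, lr=0.02):
    w_edit = w.clone().detach().requires_grad_(True)   # only w_edit is optimized
    opt = torch.optim.Adam([w_edit], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        sketch_hat = G_sketch(w_edit)                   # generated sketch
        img_hat = G_img(w_edit)                         # generated face image
        # editing region must follow the drawn sketch
        l_sketch = lpips_fn(sketch_hat * m_edit, s_edit * m_edit).mean()
        # non-editing region must stay close to the original image
        l_rgb = lpips_fn(img_hat * (1 - m_edit), x * (1 - m_edit)).mean()
        loss = beta1 * l_sketch + beta2 * l_rgb
        loss.backward()
        opt.step()
    return (w_edit - w).detach()                        # edit vector delta_edit
```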
The method of S3 comprises:
some editing operations affect the whole video and have low relevance to expressions and actions. These editing operations mainly change the basic shape of the face, such as the face outline and the shape of facial features. The edit vector δ_edit generated by the invention has decoupled, semantic characteristics and is applied directly to the whole video frame sequence. For each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
These edit vectors propagate the edit throughout the video, generating an edited frame sequence.
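As a minimal sketch, time-independent propagation is just a broadcast addition of the same edit vector to every frame's hidden code; the (N, 18, 512) W+ shape below is an assumption for illustration.

```python
import torch

def propagate_time_independent(latents: torch.Tensor,
                               delta_edit: torch.Tensor) -> torch.Tensor:
    # latents: (N, 18, 512) hidden codes; delta_edit: (18, 512) edit vector
    return latents + delta_edit.unsqueeze(0)   # delta_i = delta_edit for every frame
```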
The method of S4 comprises:
unlike single-frame editing, a video has expressions and motions that change over time. Users often edit time-dependent facial actions, for example adding a blink or a smile at a specific time. When an edit vector δ_edit is added at a specific frame f_t, the user also inputs the duration h and the transition time l. Then, for each frame f_i, the invention uses a piecewise linear function to generate a smoothly propagated edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t. These new edit vectors δ_i are used to synthesize highly realistic face images. With this editing mode, the invention not only produces the editing effect inside a specific time window but also forms a smooth transition in which the edit appears and disappears, for example from a natural expression to a smiling expression and back to a natural expression.
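A minimal sketch of the piecewise linear weight γ described above, treating frame indices as the time variable (an assumption for illustration; the patent speaks of the time t, duration h and transition time l):

```python
def time_window_weight(i: float, t: float, h: float, l: float) -> float:
    """Weight gamma: 0 outside [t1, t4], ramps up over l, holds 1 for h, ramps down."""
    t1, t2 = t - h / 2 - l, t - h / 2
    t3, t4 = t + h / 2, t + h / 2 + l
    if i <= t1 or i >= t4:
        return 0.0
    if t2 <= i <= t3:
        return 1.0
    if i < t2:
        return (i - t1) / (t2 - t1)   # rising transition
    return (t4 - i) / (t4 - t3)       # falling transition

# usage: delta_i = time_window_weight(i, t, h, l) * delta_edit
```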
The method of S5 comprises:
in some scenarios the user may only want to add an edit under certain expressions, while keeping the original state or adding other edits under other expressions. Such editing operations include expression-driven wrinkles (e.g. nasolabial folds or forehead lines) and shape edits that occur only under certain expressions (e.g. the eyes narrowing when smiling). To propagate expression-driven edits, the invention extracts the expression parameters of the face by 3D reconstruction. More specifically, for the key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the given expression edits, where M is the number of key frames, the invention extracts the expression parameters e_{k_1}, …, e_{k_M} and the corresponding edit vectors δ_{k_1}, …, δ_{k_M}. Some key frames may carry no editing operation but act as reference frames, indicating that a certain expression has no edit; for these frames the edit vector is the zero vector. The invention propagates the expression-guided edits with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i and c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term. In the invention, the edit vectors δ_{k_j} act on the same region.
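A minimal sketch of this expression-driven weighting, using the cosine similarity of the 3D expression parameters as the per-key-frame weight; clamping negative similarities to zero and the flattened edit-vector shape are assumptions added for illustration, not requirements stated in the patent.

```python
import numpy as np

def expression_driven_delta(e_i, key_exprs, key_deltas, eps=1e-8):
    # e_i: (d,) expression parameters of frame i
    # key_exprs: (M, d) expression parameters of the edited key frames
    # key_deltas: (M, L) edit vectors of the key frames (flattened W+ offsets)
    sims = []
    for e_k in key_exprs:
        cos = float(np.dot(e_i, e_k) /
                    (np.linalg.norm(e_i) * np.linalg.norm(e_k) + eps))
        sims.append(max(cos, 0.0))          # cosine similarity as the weight
    sims = np.asarray(sims)
    c = sims.sum() + eps                    # normalization term
    return (sims[:, None] * key_deltas).sum(axis=0) / c
```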
Wherein the method of S6 comprises:
S6.1: the invention supports editing any number of frames with sketches and fusing the editing effects. After multiple frames are edited, several edit vectors are produced; a simple method is to add them directly, but as shown in fig. 7 this produces artifacts, so the invention designs a region-aware fusion scheme.
S6.2: given the frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M editing mark regions m_1, m_2, …, m_M. Using the edit propagation methods above, M edit vectors δ_i^1, …, δ_i^M, representing the different editing operations, are generated for each frame f_i.
For each frame f_i to be predicted, the invention uses the First-order motion method to generate a deformation field and deforms the input mark regions, producing M new regions m̂_i^1, …, m̂_i^M. The deformed region m̂_i^j marks a region similar to the drawn region m_j but accounts for the expression and head motion between frame f_i and the edited key frame k_j. To fuse the different editing operations, the invention replaces the local regions of the feature map of the original frame with a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i). The invention downsamples m̂_i^j so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j). For the M editing operations, the formula is iterated M times, j = 1 to M, completing the editing and fusion of the local regions. The invention updates the middle 5 feature maps of the generation network, which mainly control the face structure information and whose resolutions range from 32×32 to 128×128; the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm. The invention applies the above fusion operation to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video.
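A minimal sketch of this per-frame feature-map fusion under stated assumptions: `feat_fn` maps a hidden code to one intermediate feature map, and each deformed mask is resized to the feature resolution and used to paste in the features generated with w_i + δ_i^j. The handling of the several middle StyleGAN layers and of the high-resolution layers is simplified here.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(feat_fn, w_i, deltas, masks):
    # feat_fn: maps a hidden code to an intermediate feature map (1, C, H, W)
    # deltas:  list of M edit vectors; masks: list of M deformed masks (1, 1, h, w)
    fused = feat_fn(w_i)                                       # initial feature map
    _, _, H, W = fused.shape
    for delta, mask in zip(deltas, masks):
        m = F.interpolate(mask, size=(H, W), mode="bilinear")  # match resolution
        edited = feat_fn(w_i + delta)                          # features of this edit
        fused = fused * (1 - m) + edited * m                   # replace local region
    return fused
```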
S6.3: the invention blends the synthesized face into the original video to synthesize the final edited video. First, a face segmentation method is used to generate the face region masks of the input frame and the editing frames, and the union of the face regions is computed. The merged region is then dilated so that its edge transitions smoothly. The smoothed face region mask is converted into fusion weights: the mask region has weight 1, the non-mask region has weight 0, and the transition edge has weights between 0 and 1; the faces before and after editing are blended with these weights. Finally, the face image is inversely aligned to the original video and the final edited video is synthesized.
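A minimal sketch of this final blending step, assuming the edited face has already been inversely aligned to the original frame; Gaussian-blurring the union mask is one simple way to obtain the smooth transition weights described above, and the feather size is an illustrative choice.

```python
import cv2
import numpy as np

def blend_back(original: np.ndarray, edited: np.ndarray,
               mask_union: np.ndarray, feather: int = 21) -> np.ndarray:
    # original, edited: (H, W, 3) uint8 frames already in the same alignment
    # mask_union: (H, W) uint8 mask, 255 inside the union of the face regions
    w = cv2.GaussianBlur(mask_union.astype(np.float32) / 255.0,
                         (feather, feather), 0)[..., None]   # weights in [0, 1]
    out = edited.astype(np.float32) * w + original.astype(np.float32) * (1 - w)
    return out.astype(np.uint8)
```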
As shown in fig. 3, the fusion of time-independent editing and time-window editing is presented. The time-independent edits on the left add hair and a beard to the edited character, and the time-window edit adds an eyebrow-raising action. On the right, the first row is the original video and the second row is the edited video, which has more hair and a beard and also performs the eyebrow-raising action.
As shown in fig. 4, the fusion of time-independent editing and expression-driven editing is presented. The time-independent edit on the left makes the nose of the face smaller; in the expression-driven edit, the eyes narrow when the mouth opens and keep their original shape when the mouth is closed. The right side shows video frames before and after editing; the editing operations are propagated well to the whole video.
As shown in fig. 5, the result of editing a face using sketches of different rendering styles is shown. The first column of images shows the drawn sketch and the selected area, and the second column of images shows the result of a single frame edit. The first line on the right shows the original video frame and the subsequent lines show the results of the edit propagation. Aiming at sketches with different drawing styles, the method generates a result with higher quality and has better robustness.
As shown in fig. 6, the result of editing for a face with angle changes is presented. The first line of images is the original sequence of video frames, the second line of images is the edited sequence of video frames, and the left side is the user-drawn sketch and the selected region. The invention generates high-quality editing results even if the input face video has rotation and angle changes.
As shown in fig. 7, the results of two editorial fusion methods are presented. The user has added two time-independent edits, varying the face and hair, while adding a time window edit. The first row shows the editing results of the edited sketch, the second row shows the original video, the third row shows the result of a fusion mode of directly adding a plurality of editing vectors, and the fourth row shows the result of a region-aware fusion mode. The quality of the result generated by the regional perception fusion mode is higher than that of the direct addition of the editing vectors, and the effectiveness of the regional perception fusion module is proved.
As shown in fig. 8, the fusion results of the different edits are shown. The first line is the original video, the second line is the result of the time sequence irrelevant editing, the hair area of the human face is modified, the third line is the time window editing, the smile is added to the human face, and the last line is the result of the fusion of the two types of editing.
As shown in fig. 9, an intermediate result of the face video editing is shown. The second row shows the real video and the third row shows the result of the alignment. The fourth line shows the result of the drawing mask deformed according to the expression and the action, and the fifth line shows the editing result of the aligned face. The sixth row shows the result of face region segmentation, and the last row shows the final anti-aligned face editing result.
As shown in fig. 10, the results of key point smoothing are presented. The first three rows show the results without key point smoothing: the cropped and aligned faces exhibit very large jitter. The last three rows show the results with key point smoothing, and the cropped and aligned faces have no jitter problem.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a sketch-based deep face video editing system, which comprises:
a module 1, configured to align and crop the face in an original video and encode the face into a hidden space to obtain the hidden codes of all frames of the face video;
a module 2, configured to add a sketch generation branch Ĝ_sketch to the StyleGAN generation network and inversely optimize the image hidden code to generate an edit vector δ_edit;
a module 3, configured to superimpose the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4, configured to superimpose the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5, configured to superimpose the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and a module 6, configured to fuse the different types of edits added on different frames with a region-aware fusion method, and blend the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing system, the module 1 is configured to detect the face key points of the face video, smooth them with a time window, then align and crop the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and to project the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing system, the module 2 is configured to acquire the original StyleGAN generation network G and construct a generation network Ĝ that models the joint probability distribution of real face images and sketches; Ĝ contains two branches, Ĝ_img and Ĝ_sketch, where Ĝ_img is the original generation network G and generates realistic face images, and Ĝ_sketch generates the corresponding sketch images; given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch; the feature maps of the Ĝ_sketch branch are upsampled and added to the residual images obtained by convolving the feature maps F_i, generating the sketch image corresponding to the hidden code w;
a sketch generation network S is trained with a dataset of paired images and sketches; S takes a face image as input and generates the corresponding sketch, and is used to train the sketch generation branch Ĝ_sketch; a hidden code w is randomly sampled and fed to Ĝ, generating a realistic face image x̂ = Ĝ_img(w) and the corresponding sketch ŝ = Ĝ_sketch(w); the loss function
L(Ĝ_sketch) = α_1·L_VGG(ŝ, S(x̂)) + α_2·L_L2(ŝ, S(x̂))
is constructed to train the sketch generation branch Ĝ_sketch, where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modelled, given the input face image x, the drawn sketch s_edit and the selected region m_edit: the face image x is projected into the W+ space and optimized to obtain the hidden code w_edit, such that the generated sketch Ĝ_sketch(w_edit) matches the input sketch inside the editing region and the generated image Ĝ_img(w_edit) matches the original image in the non-editing region; w_edit is obtained with the loss function
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, β_1 and β_2 are hyper-parameters, and the generation network Ĝ is kept fixed while w_edit is optimized;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
The module 3 is configured to superimpose the δ_edit corresponding to each frame f_i onto its hidden code, propagating the edit to the whole face video and generating an edited frame sequence.
In the sketch-based deep face video editing system, the module 4 is configured to add an action such as blinking or smiling at a specific time of the face video; at a specific frame f_t an edit vector δ_edit is added together with an input duration h and transition time l; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
the module 5 comprises:
given several key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the face video, extracting the face expression parameters e_{k_1}, …, e_{k_M} by 3D reconstruction, together with the corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edit is propagated with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i, c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term, and the edit vectors δ_{k_j} act on the same region;
the module 6 is configured, given the frame sequence f_1, f_2, …, f_N, to let the user select M key frames k_1, k_2, …, k_M and edit different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated; for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions m̂_i^1, …, m̂_i^M, where m̂_i^j is the region m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced by a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j, j = 1, 2, …, M,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i);
the mark region m̂_i^j is downsampled so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j); the feature map F_i^j is updated once for each of the M editing operations, M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated in this way, while the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video;
and face mask regions of the input frame and the editing frames are generated with a face segmentation method and merged; a smooth edge is generated for the merged mask region and used as the fusion weight to blend the faces before and after editing; the blended face image is inversely aligned back to the original video, synthesizing the face video editing result.
The invention also provides a storage medium for storing a program for executing the method for editing the deep face video based on the sketch.
The invention also provides a client used for the deep face video editing system based on the sketch.

Claims (10)

1. A deep face video editing method based on sketch is characterized by comprising the following steps:
step 1, aligning and cropping the face in an original video, and encoding the face into a hidden space to obtain the hidden codes of all frames of the face video;
step 2, adding a sketch generation branch Ĝ_sketch to the StyleGAN generation network, and inversely optimizing the image hidden code to generate an edit vector δ_edit;
step 3, superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
step 4, superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
step 5, superimposing the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and step 6, fusing the different types of edits added on different frames with a region-aware fusion method, and blending the face back into the original video to obtain the sketch-based face video editing result.
2. The sketch-based deep face video editing method according to claim 1, wherein the step 1 comprises: detecting face key points in the face video, smoothing them with a time window, then aligning and cropping the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and projecting the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
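The temporal smoothing of the detected key points in claim 2 above can be illustrated with a simple centered moving average; the landmark detector and the W+ encoder are outside the scope of this sketch, and the window length is an assumed value.

```python
import numpy as np


def smooth_landmarks(landmarks, window=5):
    """Temporally smooth per-frame face key points with a centered time window.

    landmarks: array of shape (N, K, 2) -- K detected key points for each of N frames
    window:    odd window length (assumed; the claim fixes no particular size)
    The smoothed key points are then used to compute one alignment transform per
    frame for cropping the face before projecting the frames into W+.
    """
    n = len(landmarks)
    half = window // 2
    smoothed = np.empty_like(landmarks)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed[i] = landmarks[lo:hi].mean(axis=0)
    return smoothed
```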
3. The sketch-based deep face video editing method according to claim 2, wherein the step 2 comprises:
acquiring the original StyleGAN generation network G and constructing a generation network that models the joint probability distribution of real face images and sketches; the generation network contains two branches: the image branch is the original generation network G and generates a simulated face image, and the sketch branch generates the corresponding sketch image; given the hidden code w of an image, the image branch generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the sketch branch; the feature maps of the sketch branch are up-sampled and added to residual images obtained by convolving the feature maps F_i, so as to generate the sketch image corresponding to the hidden code w;
training a sketch generation network S on a data set of paired images and sketches, which takes a face image as input and generates the corresponding sketch and is used to supervise the training of the sketch generation branch; randomly sampling a hidden code w and inputting it into the generation network to produce a highly realistic face image and the corresponding sketch; constructing the loss function α_1·L_VGG + α_2·L_L2 to train the sketch generation branch, where L_VGG is a perceptual loss function that measures visual similarity with a VGG19 model, L_L2 is a pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distributions of real images and sketches have been modeled, a sketch s_edit is drawn on the input face image x and an editing region m_edit is selected; the face image x is projected into the W+ space to obtain the hidden code w_edit such that the generated sketch is the same as the input sketch inside the editing region and the generated image remains unchanged in the non-editing region; w_edit is obtained with the following loss function,
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; w_edit is obtained by optimization with the generation network fixed;
the final edit vector is δ_edit = w_edit - w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N;
the step 3 comprises propagating the edit vector δ_edit corresponding to each frame f_i to the whole face video to generate an edited frame sequence.
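A minimal PyTorch-style sketch of the inversion in step 2 above that yields the edit vector δ_edit; the generator interface returning an (image, sketch) pair, the masked L1 distances standing in for L_sketch and L_rgb, and the optimizer settings are assumptions for illustration, not the exact losses of the filing.

```python
import torch


def optimize_edit_code(generator, w, sketch_edit, mask_edit,
                       beta1=1.0, beta2=1.0, steps=200, lr=0.02):
    """Optimize w_edit so the generated sketch matches the drawn strokes inside the
    editing region while the RGB output stays unchanged outside it; the generator
    (the joint image/sketch network of step 2) is kept fixed.
    generator(w) is assumed to return a (rgb_image, sketch_image) pair."""
    with torch.no_grad():
        rgb_ref, _ = generator(w)                  # appearance to preserve outside the edit
    w_edit = w.clone().requires_grad_(True)
    opt = torch.optim.Adam([w_edit], lr=lr)
    for _ in range(steps):
        rgb, sketch = generator(w_edit)
        l_sketch = ((sketch - sketch_edit).abs() * mask_edit).mean()       # edit region
        l_rgb = ((rgb - rgb_ref).abs() * (1.0 - mask_edit)).mean()         # rest of the face
        loss = beta1 * l_sketch + beta2 * l_rgb
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (w_edit - w).detach()                   # delta_edit = w_edit - w
```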
4. The sketch-based deep face video editing method according to claim 3, wherein the step 4 comprises:
adding a blinking or smiling action at a specific time of the face video: for a specific frame f_t, an edit vector δ_edit, an input duration h and a transition time l are given; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, yielding a new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, M,
[equation image defining the piecewise linear weight γ]
where t_1 = t - h/2 - l, t_2 = t - h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t; these new edit vectors δ_i are used to synthesize the simulated face images;
the step 5 comprises:
given a plurality of key frames in the face video, extracting the face expression parameters of these key frames by 3D reconstruction, together with the corresponding edit vectors, where M is the number of key frames; the expression-guided edits are propagated using:
[equation image]
where e_i is the expression parameter of the input frame f_i, c is a normalization term, and each key-frame expression parameter and its edit vector correspond to the same region;
the step 6 comprises:
given a continuous frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn marker regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors are generated; for each frame f_i to be predicted, a deformation field is generated and the input marker regions are deformed, producing M new marker regions m_1', m_2', …, m_M', where m_j' is the region of m_j after deformation by the action and the expression; the local area of the feature map of the original frame is replaced with a new feature map:
[equation image]
where the edited feature maps and the initial feature map are generated by G, the StyleGAN generation network; each m_j' is downsampled so that its resolution matches that of the original and edited feature maps; the feature map is updated once for each of the M editing operations, i.e. M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated, while the higher resolutions are adjusted by the original hidden code w_i according to the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, to generate an edited and fused aligned face video;
face marker regions of the input frame and the editing frame are generated with a face segmentation method and merged; a smooth edge is generated for the merged marker region and used as the fusion weight to fuse the faces before and after editing; the fused face images are then aligned back to the original video to synthesize the face video editing result.
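One consistent reading of the piecewise linear weight γ used in step 4 of claim 4 above is the trapezoidal ramp below; since the γ formula itself only survives as an equation image, the 0 → 1 → 0 ramp shape is an assumption.

```python
def edit_weight(i, t, h, l):
    """Piecewise linear weight gamma for frame time i, given the editing frame time t,
    the hold duration h and the transition time l (break points as in claim 4).
    The linear ramp-up / hold / ramp-down shape is an assumed reading of the
    equation image in the original filing."""
    t1, t2 = t - h / 2 - l, t - h / 2
    t3, t4 = t + h / 2, t + h / 2 + l
    if i <= t1 or i >= t4:
        return 0.0                       # outside the edited interval: no edit
    if t2 <= i <= t3:
        return 1.0                       # action fully applied while held
    if i < t2:
        return (i - t1) / (t2 - t1)      # linear ramp-up
    return (t4 - i) / (t4 - t3)          # linear ramp-down
```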
5. A sketch-based deep face video editing system, characterized by comprising:
a module 1 for aligning and cropping the face in an original video, and encoding it into a hidden space to obtain the hidden codes of all frames of the face video;
a module 2 for adding a sketch generation branch to a StyleGAN generation network, and reversely optimizing the image hidden code to generate an edit vector δ_edit;
a module 3 for superposing the edit vector δ_edit onto the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4 for superposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5 for superposing the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the editing frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
a module 6 for fusing the different types of edits added at different frames with a region-aware fusion method, and blending the edited face back into the original video to obtain the sketch-based face video editing result.
6. The sketch-based deep face video editing system according to claim 5, wherein the module 1 is configured to detect face key points in the face video, smooth them with a time window, then align and crop the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video, and to project the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
7. The sketch-based deep face video editing system according to claim 6, wherein the module 2 is configured to acquire the original StyleGAN generation network G and construct a generation network that models the joint probability distribution of real face images and sketches; the generation network contains two branches: the image branch is the original generation network G and generates a simulated face image, and the sketch branch generates the corresponding sketch image; given the hidden code w of an image, the image branch generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the sketch branch; the feature maps of the sketch branch are up-sampled and added to residual images obtained by convolving the feature maps F_i, so as to generate the sketch image corresponding to the hidden code w;
a sketch generation network S is trained on a data set of paired images and sketches, takes a face image as input and generates the corresponding sketch, and is used to supervise the training of the sketch generation branch; a hidden code w is randomly sampled and input into the generation network to produce a highly realistic face image and the corresponding sketch; the loss function α_1·L_VGG + α_2·L_L2 is constructed to train the sketch generation branch, where L_VGG is a perceptual loss function that measures visual similarity with a VGG19 model, L_L2 is a pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distributions of real images and sketches have been modeled, a sketch s_edit is drawn on the input face image x and an editing region m_edit is selected; the face image x is projected into the W+ space to obtain the hidden code w_edit such that the generated sketch is the same as the input sketch inside the editing region and the generated image remains unchanged in the non-editing region; w_edit is obtained with the following loss function,
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; w_edit is obtained by optimization with the generation network fixed;
the final edit vector is δ_edit = w_edit - w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N;
the module 3 is configured to propagate the edit vector δ_edit corresponding to each frame f_i to the whole face video to generate an edited frame sequence.
8. The sketch-based deep face video editing system according to claim 7, wherein the module 4 is configured to add a blinking or smiling action at a specific time of the face video: for a specific frame f_t, an edit vector δ_edit, an input duration h and a transition time l are given; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, yielding a new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, M,
[equation image defining the piecewise linear weight γ]
where t_1 = t - h/2 - l, t_2 = t - h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t; these new edit vectors δ_i are used to synthesize the simulated face images;
the module 5 comprises:
given a plurality of key frames in the face video, extracting the face expression parameters of these key frames by 3D reconstruction, together with the corresponding edit vectors, where M is the number of key frames; the expression-guided edits are propagated using:
[equation image]
where e_i is the expression parameter of the input frame f_i, c is a normalization term, and each key-frame expression parameter and its edit vector correspond to the same region;
the module 6 is configured to: given a continuous frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn marker regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors are generated; for each frame f_i to be predicted, a deformation field is generated and the input marker regions are deformed, producing M new marker regions m_1', m_2', …, m_M', where m_j' is the region of m_j after deformation by the action and the expression; the local area of the feature map of the original frame is replaced with a new feature map:
[equation image]
where the edited feature maps and the initial feature map are generated by G, the StyleGAN generation network; each m_j' is downsampled so that its resolution matches that of the original and edited feature maps; the feature map is updated once for each of the M editing operations, i.e. M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated, while the higher resolutions are adjusted by the original hidden code w_i according to the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, to generate an edited and fused aligned face video;
face marker regions of the input frame and the editing frame are generated with a face segmentation method and merged; a smooth edge is generated for the merged marker region and used as the fusion weight to fuse the faces before and after editing; the fused face images are then aligned back to the original video to synthesize the face video editing result.
9. A storage medium storing a program for executing the sketch-based deep face video editing method according to any one of claims 1 to 5.
10. A client for use in the sketch-based deep face video editing system according to any one of claims 6 to 8.
CN202210698610.8A 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch Pending CN115278106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210698610.8A CN115278106A (en) 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210698610.8A CN115278106A (en) 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch

Publications (1)

Publication Number Publication Date
CN115278106A true CN115278106A (en) 2022-11-01

Family

ID=83761741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210698610.8A Pending CN115278106A (en) 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch

Country Status (1)

Country Link
CN (1) CN115278106A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210209464A1 (en) * 2020-01-08 2021-07-08 Palo Alto Research Center Incorporated System and method for synthetic image generation with localized editing
CN111489405A (en) * 2020-03-21 2020-08-04 复旦大学 Face sketch synthesis system for generating confrontation network based on condition enhancement
CN111652828A (en) * 2020-05-27 2020-09-11 北京百度网讯科技有限公司 Face image generation method, device, equipment and medium
CN113112572A (en) * 2021-04-13 2021-07-13 复旦大学 Hidden space search-based image editing method guided by hand-drawn sketch
CN113901894A (en) * 2021-09-22 2022-01-07 腾讯音乐娱乐科技(深圳)有限公司 Video generation method, device, server and storage medium
CN114255496A (en) * 2021-11-30 2022-03-29 北京达佳互联信息技术有限公司 Video generation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG PEN ET AL: "Analysis of Neural Style Transfer Based on Generative Adversarial Network", 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), 24 December 2021 (2021-12-24) *
SU JIAYANG: "Face video synthesis *** with customized actions", Master's Thesis, Beijing University of Posts and Telecommunications, 15 January 2022 (2022-01-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115810215A (en) * 2023-02-08 2023-03-17 科大讯飞股份有限公司 Face image generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Jiang et al. Scfont: Structure-guided chinese font generation via deep stacked networks
US11880766B2 (en) Techniques for domain to domain projection using a generative model
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN111915693B (en) Sketch-based face image generation method and sketch-based face image generation system
US9734613B2 (en) Apparatus and method for generating facial composite image, recording medium for performing the method
Seol et al. Artist friendly facial animation retargeting
Chen et al. PicToon: a personalized image-based cartoon system
Cong Art-directed muscle simulation for high-end facial animation
Saunders et al. Anonysign: Novel human appearance synthesis for sign language video anonymisation
CN110310351A (en) A kind of 3 D human body skeleton cartoon automatic generation method based on sketch
CN115278106A (en) Deep face video editing method and system based on sketch
CN112991484B (en) Intelligent face editing method and device, storage medium and equipment
CN111275778A (en) Face sketch generating method and device
Song et al. FineStyle: Semantic-Aware Fine-Grained Motion Style Transfer with Dual Interactive-Flow Fusion
Li et al. Orthogonal-blendshape-based editing system for facial motion capture data
Tejera et al. Animation control of surface motion capture
Kawai et al. Data-driven speech animation synthesis focusing on realistic inside of the mouth
Nakatsuka et al. Audio-guided Video Interpolation via Human Pose Features.
Nakatsuka et al. Audio-oriented video interpolation using key pose
CN115578298A (en) Depth portrait video synthesis method based on content perception
Chen et al. Animating lip-sync characters with dominated animeme models
Cao et al. AnimeDiffusion: anime diffusion colorization
Sistla et al. A state-of-the-art review on image synthesis with generative adversarial networks
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination