CN115278106A - Deep face video editing method and system based on sketch - Google Patents

Deep face video editing method and system based on sketch

Info

Publication number
CN115278106A
CN115278106A (Application CN202210698610.8A)
Authority
CN
China
Prior art keywords
edit
editing
sketch
face
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210698610.8A
Other languages
Chinese (zh)
Inventor
高林
陈姝宇
刘锋林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210698610.8A priority Critical patent/CN115278106A/en
Publication of CN115278106A publication Critical patent/CN115278106A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 Mixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a sketch-based deep face video editing method and system. The method aligns and crops the face in an original video and encodes it into a hidden space to obtain the hidden codes of all frames of the face video; adds a sketch generation branch Ĝ_sketch to StyleGAN and inversely optimizes the image hidden code to generate an edit vector δ_edit; superimposes the edit vector δ_edit on the hidden codes of all frames to propagate time-independent edits; superimposes the edit vector δ_edit with weights given by a piecewise linear function to propagate action or expression edits; superimposes the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation; and fuses the different types of edits added on different frames with a region-aware fusion method, blending the face back into the original video.

Description

Deep face video editing method and system based on sketch
Technical Field
The invention relates to the technical field of computer graphics and computer vision, and in particular to a method and system for sketch-based face video synthesis and editing.
Background
Video editing is a challenging research problem, and with the development of deep learning, video editing and modification work has become increasingly common. Most existing video editing methods modify global attributes of a video, for example converting a black-and-white video into a color video or stylizing a video to produce artistic results. For face videos, the prior art mainly performs tasks such as face swapping, which only modify the global attribute of identity. Some methods can edit detail regions of a video, but they require professional software such as Photoshop (PS) or Premiere (PR) and demand considerable time and effort. The sketch is an efficient, accurate and user-friendly interaction tool and is widely used for image generation and editing. However, the prior art cannot extend sketch editing from images to videos and struggles with the propagation and composition of editing operations. Video editing has broad application prospects and high value in fields such as film production and new media, yet existing techniques cannot edit video details simply and quickly.
For the video editing problem, the prior art can automatically colorize a video, but this function is limited to changing the color information of the video. Video stylization can change the color characteristics of a video and apply artistic transformations to its content, generating videos with an artistic feel. However, such work can only edit global features of a video and cannot modify its details. Even methods that represent video in an atlas space only extend detail edits to video clips after the image has been edited in software such as Photoshop. These approaches require professional software, and the editing and generation of the video are time-consuming. The sketch is a friendlier interaction tool that makes the user's editing operations simpler and more precise. The prior art can edit a face image with a sketch, but the editing operation cannot be propagated to the whole video.
Disclosure of Invention
To solve the problems that video content details are difficult to edit and that sketch edits cannot be propagated to a video, the invention builds on the face image generation network StyleGAN, abstractly represents a sketch edit as a hidden-space vector, and designs a novel propagation and fusion mechanism to edit face videos. The invention provides a sketch-based face video editing method and system with which any one or more frames can be selected, face details can be edited with a sketch, and the edit can be propagated to the whole video in a specified manner.
Specifically, the invention provides a deep face video editing method based on a sketch, which comprises the following steps:
step 1, aligning and cropping the face in an original video, and encoding the face into a hidden space to obtain the hidden codes of all frames of the face video;
step 2, adding a sketch generation branch Ĝ_sketch to the StyleGAN generation network, and inversely optimizing the image hidden code to generate an edit vector δ_edit;
step 3, superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
step 4, superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
step 5, superimposing the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and step 6, fusing the different types of edits added on different frames with a region-aware fusion method, and blending the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing method, step 1 comprises: detecting the face key points of the face video, smoothing them with a time window, then aligning and cropping the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and projecting the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing method, step 2 comprises:
acquiring the original StyleGAN generation network G, and constructing a generation network Ĝ that models the joint probability distribution of real face images and sketches; Ĝ contains two branches, Ĝ_img and Ĝ_sketch, where Ĝ_img is the original generation network G and generates realistic face images, and Ĝ_sketch generates the corresponding sketch images; given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch; the feature maps of the Ĝ_sketch branch are upsampled and added to the residual images obtained by convolving the feature maps F_i, generating the sketch image corresponding to the hidden code w;
training a sketch generation network S with a dataset of paired images and sketches; S takes a face image as input and generates the corresponding sketch, and is used to train the sketch generation branch Ĝ_sketch; randomly sampling a hidden code w and feeding it to Ĝ generates a realistic face image x̂ = Ĝ_img(w) and the corresponding sketch ŝ = Ĝ_sketch(w); the loss function
L(Ĝ_sketch) = α_1·L_VGG(ŝ, S(x̂)) + α_2·L_L2(ŝ, S(x̂))
is constructed to train the sketch generation branch Ĝ_sketch, where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modelled, given the input face image x, the drawn sketch s_edit and the selected region m_edit: the face image x is projected into the W+ space and optimized to obtain the hidden code w_edit, such that the generated sketch Ĝ_sketch(w_edit) matches the input sketch inside the editing region and the generated image Ĝ_img(w_edit) matches the original image in the non-editing region; w_edit is obtained with the loss function
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, β_1 and β_2 are hyper-parameters, and the generation network Ĝ is kept fixed while w_edit is optimized;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
Step 3 comprises superimposing the δ_edit corresponding to each frame f_i onto its hidden code, propagating the edit to the whole face video and generating an edited frame sequence.
In the sketch-based deep face video editing method, step 4 comprises:
adding an action such as blinking or smiling at a specific time of the face video; at a specific frame f_t an edit vector δ_edit is added together with an input duration h and transition time l; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
step 5 comprises:
given several key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the face video, extracting the face expression parameters e_{k_1}, …, e_{k_M} by 3D reconstruction, together with the corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edit is propagated with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i, c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term, and the edit vectors δ_{k_j} act on the same region;
step 6 comprises:
given the frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated; for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions m̂_i^1, …, m̂_i^M, where m̂_i^j is the region m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced by a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j, j = 1, 2, …, M,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i);
the mark region m̂_i^j is downsampled so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j); the feature map F_i^j is updated once for each of the M editing operations, M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated in this way, while the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video;
and generating face mask regions of the input frame and the editing frames with a face segmentation method, taking their union, generating a smooth edge for the merged mask region, using it as the fusion weight to blend the faces before and after editing, inversely aligning the blended face image back to the original video, and synthesizing the face video editing result.
The invention also provides a sketch-based deep face video editing system, which comprises:
a module 1, configured to align and crop the face in an original video and encode the face into a hidden space to obtain the hidden codes of all frames of the face video;
a module 2, configured to add a sketch generation branch Ĝ_sketch to the StyleGAN generation network and inversely optimize the image hidden code to generate an edit vector δ_edit;
a module 3, configured to superimpose the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4, configured to superimpose the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5, configured to superimpose the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and a module 6, configured to fuse the different types of edits added on different frames with a region-aware fusion method, and blend the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing system, the module 1 is configured to detect the face key points of the face video, smooth them with a time window, then align and crop the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and to project the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing system, the module 2 is configured to acquire the original StyleGAN generation network G and construct a generation network Ĝ that models the joint probability distribution of real face images and sketches; Ĝ contains two branches, Ĝ_img and Ĝ_sketch, where Ĝ_img is the original generation network G and generates realistic face images, and Ĝ_sketch generates the corresponding sketch images; given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch; the feature maps of the Ĝ_sketch branch are upsampled and added to the residual images obtained by convolving the feature maps F_i, generating the sketch image corresponding to the hidden code w;
a sketch generation network S is trained with a dataset of paired images and sketches; S takes a face image as input and generates the corresponding sketch, and is used to train the sketch generation branch Ĝ_sketch; a hidden code w is randomly sampled and fed to Ĝ, generating a realistic face image x̂ = Ĝ_img(w) and the corresponding sketch ŝ = Ĝ_sketch(w); the loss function
L(Ĝ_sketch) = α_1·L_VGG(ŝ, S(x̂)) + α_2·L_L2(ŝ, S(x̂))
is constructed to train the sketch generation branch Ĝ_sketch, where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modelled, given the input face image x, the drawn sketch s_edit and the selected region m_edit: the face image x is projected into the W+ space and optimized to obtain the hidden code w_edit, such that the generated sketch Ĝ_sketch(w_edit) matches the input sketch inside the editing region and the generated image Ĝ_img(w_edit) matches the original image in the non-editing region; w_edit is obtained with the loss function
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, β_1 and β_2 are hyper-parameters, and the generation network Ĝ is kept fixed while w_edit is optimized;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
The module 3 is configured to superimpose the δ_edit corresponding to each frame f_i onto its hidden code, propagating the edit to the whole face video and generating an edited frame sequence.
In the sketch-based deep face video editing system, the module 4 is configured to add an action such as blinking or smiling at a specific time of the face video; at a specific frame f_t an edit vector δ_edit is added together with an input duration h and transition time l; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
the module 5 comprises:
given several key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the face video, extracting the face expression parameters e_{k_1}, …, e_{k_M} by 3D reconstruction, together with the corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edit is propagated with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i, c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term, and the edit vectors δ_{k_j} act on the same region;
the module 6 is configured, given the frame sequence f_1, f_2, …, f_N, to let the user select M key frames k_1, k_2, …, k_M and edit different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated; for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions m̂_i^1, …, m̂_i^M, where m̂_i^j is the region m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced by a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j, j = 1, 2, …, M,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i);
the mark region m̂_i^j is downsampled so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j); the feature map F_i^j is updated once for each of the M editing operations, M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated in this way, while the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video;
and face mask regions of the input frame and the editing frames are generated with a face segmentation method and merged; a smooth edge is generated for the merged mask region and used as the fusion weight to blend the faces before and after editing; the blended face image is inversely aligned back to the original video, synthesizing the face video editing result.
The invention also provides a storage medium for storing a program that executes any of the above sketch-based deep face video editing methods.
The invention also provides a client for use with any of the above sketch-based deep face video editing systems.
According to the above scheme, the invention has the following advantages:
with the system designed by the invention, one or more editing frames can be selected, the user draws a sketch and a corresponding editing-region mask, and after the propagation mode of the edit is specified, the editing and propagation of the edit over the video are realized.
Drawings
FIG. 1 is a schematic flow chart of the system of the present invention;
FIG. 2 is a schematic diagram of sketch optimization;
FIG. 3 is a diagram of time-independent editing and time window editing results;
FIG. 4 is a diagram of the results of time-independent editing and expression-driven editing;
FIG. 5 is a graph of results for different rendering styles;
FIG. 6 is a diagram of a result of a rotating face edit;
FIG. 7 is a graph of results fused using different approaches after optimizing sketch editing vectors;
FIG. 8 is a diagram of different editing fusion results;
FIG. 9 is a diagram illustrating an intermediate result of face video editing;
fig. 10 is a graph of the key point smoothing results.
Detailed Description
The defects of the prior art arise because the propagation of sketch edits across a video has not been considered: the face in a video changes in expression and motion, so an input sketch edit is difficult to apply directly to other frames; moreover, a sketch may change both identity characteristics of the face (such as the shape of facial features) and expressions or actions (such as adding a smile), and distinguishing these and propagating them reasonably is very difficult. Video editing also requires temporal stability; existing methods do not consider the flicker problem of video generation, and the quality of their results is poor.
The inventors found that these defects can be overcome with sketch-based editing of images and videos by designing a reasonable video encoding scheme together with a sketch editing, propagation and fusion method. After a face video is input, the face region is first cropped and aligned through key point detection. The face images of all frames are then encoded into the hidden space of the StyleGAN generation network with an image encoding network. For the sketch edit input by the user, an optimization strategy is designed that abstractly represents the editing operation as an edit vector. During propagation, editing operations are divided into two types, time-independent edits and time-dependent edits; time-dependent edits are further divided into time-window edits and expression-driven edits; the user specifies the type of edit, and propagation is performed in the corresponding mode. A region-aware fusion strategy is also designed to fuse the different editing operations input on different frames and generate the edited face video. Finally, the generated face video is inversely aligned to the input original video and the face region is blended, producing the sketch-based video editing result.
The core invention points of the invention comprise:
key point 1, styleGAN based video coding module. After a section of face video is input, a face image is cut and aligned by using a face key point detection technology of dlib, and a time window is used for smoothing. The input frame sequence is encoded into a cryptic code sequence based on an E4E face to StyleGAN cryptic spatial encoding technique. According to the input frame sequence and the hidden code sequence, based on the PTI reconstruction technology, the weight of the StyleGAN generation network is finely adjusted, so that the original video can be perfectly reconstructed, the video coding task is completed, and the subsequent video editing is served;
and 2, a key point 2, a sketch editing and optimizing module. And (4) generating a network based on the pre-trained sketch, expanding the original StyleGAN, and adding a sketch generation branch to the original StyleGAN. Further, an optimization strategy is designed, a user inputs a drawn sketch and an editing area mask, the constraint of the editing area is the same as that of the sketch, the constraint of the editing area is not the same as that of the original image, and the original hidden code is optimized in an iterative mode. And (4) subtracting the optimized hidden code from the input hidden code to obtain an editing vector, and abstractively representing the sketch editing operation.
Key point 3, time-independent propagation. Some editing operations are time-independent: the edit should be applied uniformly to the whole video regardless of expression and motion, for example edits of the face shape. For such edits, the edit vector is added directly to the hidden code of each input frame, propagating the edit.
Key point 4: time-window propagation. Some editing operations represent specific facial actions, such as smiling or closing the eyes. These actions are divided into three stages, a starting stage, a holding stage and an ending stage; the edit vector is superimposed with linearly changing weights in the starting and ending stages and with a fixed weight in the holding stage, realizing the whole process of an action beginning, holding and ending.
Key point 5: expression-driven propagation. Some editing operations are related to expressions, for example the eyes narrowing when smiling. For such edits, the expression parameters of the face are extracted by 3D reconstruction, weights are computed from the cosine similarity between the expression parameters of the editing frame and the predicted frame, and the edit vectors are superimposed accordingly.
Key point 6: a region-aware fusion module. In video editing the user often selects multiple frames and edits different regions at the same time. This module uses a deformation operation to predict a deformation field from the change of the face's action and expression, and deforms the drawn masks accordingly. Finally, the generated face regions are fused and back-projected to the original frames, completing the fusion of the different edits.
In order to make the aforementioned features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The system flow chart is shown in fig. 1, and the system comprises multiple technologies of editing vector generation, time sequence independent editing propagation, time window editing propagation, expression driven editing propagation, regional perception fusion and the like.
The sketch optimization flow is shown in fig. 2: the original StyleGAN generation network is extended into two branches, one generating a sketch image and the other generating a highly realistic image. The optimization process contains two loss terms, L_sketch and L_rgb, which respectively constrain the editing region to follow the sketch edit and the other regions to remain unchanged.
As shown in fig. 1, the high-quality face video editing method and system based on sketch interaction comprises:
S1: after a video is input, aligning and cropping the face and encoding the face into the hidden space;
S2: extending the StyleGAN generation network by adding a sketch generation branch Ĝ_sketch, and inversely optimizing the image hidden code to generate an edit vector δ_edit;
S3: time-independent editing, in which the edit vector δ_edit is directly superimposed on the hidden codes of all frames to propagate time-independent edits;
S4: time-window editing, in which the edit vector δ_edit is superimposed with weights given by a piecewise linear function to propagate action or expression edits;
S5: expression-driven editing, in which the edit vector δ_edit is superimposed with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression;
S6: fusing the different types of edits added on different frames with a region-aware fusion method, and blending the face back into the original video;
The method of S1 comprises:
given an input video, detecting the face key points with dlib, smoothing the key point coordinates with a time window, and aligning and cropping the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames. The invention uses E4E to project the frame sequence into the W+ space, generating the hidden code sequence w_1, w_2, …, w_N. The edit vectors generated later are superimposed on this hidden code sequence. The smoothing acts on the face key point coordinates across the sequence: the face detection is performed frame by frame, so there is some jitter between frames, and the smoothing eliminates its influence.
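The following is a minimal sketch of such temporal smoothing of the key points before alignment and cropping, assuming a (N, 68, 2) array of per-frame key points from a detector such as dlib; the centered moving average and the window size are illustrative choices, not details fixed by the patent text.

```python
import numpy as np

def smooth_keypoints(keypoints: np.ndarray, win: int = 5) -> np.ndarray:
    """Average each frame's key points over a centered temporal window."""
    n = len(keypoints)
    smoothed = np.empty_like(keypoints, dtype=np.float64)
    for i in range(n):
        lo, hi = max(0, i - win // 2), min(n, i + win // 2 + 1)
        smoothed[i] = keypoints[lo:hi].mean(axis=0)  # mean inside the window
    return smoothed
```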
The method of S2 is shown in fig. 2, and includes:
S21: given the original StyleGAN generation network G, the invention designs a new generation network Ĝ that models the joint probability distribution of real face images and sketches. It contains two branches: Ĝ_img, the original generation network G, generates a highly realistic face image, and Ĝ_sketch generates the corresponding sketch image. Given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch. The intermediate feature maps of the Ĝ_sketch branch are repeatedly upsampled and added to the residual images obtained by convolving F_i. After this operation has been performed for i = 2 to 14, the sketch image corresponding to the hidden code w is generated. The StyleGAN3 original generation network has a 10-pixel extension; the invention crops the intermediate feature maps and only uses the cropped pixel content.
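To make the branch structure concrete, here is a minimal sketch of a sketch-generation branch of this kind: the running feature map starts from F_1, is upsampled to the resolution of each later RGB-branch feature map F_i, and a convolved residual of F_i is added. The channel handling, the single-channel output head and the sigmoid are assumptions added for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchBranch(nn.Module):
    def __init__(self, channels):            # channels[i] = #channels of F_{i+1}
        super().__init__()
        self.res_convs = nn.ModuleList(
            nn.Conv2d(c, channels[0], 3, padding=1) for c in channels[1:])
        self.to_sketch = nn.Conv2d(channels[0], 1, 1)   # grayscale sketch head

    def forward(self, feats):                 # feats = [F_1, ..., F_14]
        h = feats[0]                           # initial feature map of the branch
        for conv, f in zip(self.res_convs, feats[1:]):
            if f.shape[-2:] != h.shape[-2:]:   # upsample running map to F_i size
                h = F.interpolate(h, size=f.shape[-2:], mode="bilinear")
            h = h + conv(f)                    # add convolved residual of F_i
        return torch.sigmoid(self.to_sketch(h))
```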
S22: to train a sketch to generate branches
Figure BDA0003703089970000107
We first train a sketch generation network S based on the Pix2PixHD network using a data set with images matched to the sketch. The sketch generation network takes a real face image as input to generate a corresponding sketch for training a training sketch generation branch
Figure BDA0003703089970000108
Then, the invention randomly samples the hidden code w and inputs the hidden code w
Figure BDA0003703089970000109
Generating highly realistic face images
Figure BDA00037030899700001010
And corresponding sketch
Figure BDA00037030899700001011
The invention then trains the sketch generation branches using the following loss functions:
Figure BDA00037030899700001012
LVGGis a perceptual loss function, and uses a VGG19 model to measure the visual similarity, LL2Is the pixel L2 loss, α1And alpha2Are all preset weights, in this case α1=α2=1。
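A minimal sketch of this training loss under stated assumptions: `G` returns both the face image and the sketch image for a sampled W+ code, `S` is the pretrained Pix2PixHD-style sketch network used as supervision, and `vgg_features` returns a list of VGG19 feature maps; the L1 distance between VGG features is one common choice for the perceptual term, not necessarily the exact one used here.

```python
import torch
import torch.nn.functional as F

def branch_loss(G, S, vgg_features, w, alpha1=1.0, alpha2=1.0):
    img_hat, sketch_hat = G(w)          # realistic face image and its sketch
    with torch.no_grad():
        sketch_ref = S(img_hat)         # pseudo ground-truth sketch from S
    # perceptual (VGG19 feature) loss between branch output and reference sketch
    l_vgg = sum(F.l1_loss(a, b) for a, b in zip(vgg_features(sketch_hat),
                                                vgg_features(sketch_ref)))
    l_l2 = F.mse_loss(sketch_hat, sketch_ref)   # pixel-wise L2 loss
    return alpha1 * l_vgg + alpha2 * l_l2
```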
S23: after modeling the distribution of the real image and the sketch, the invention designs an optimization technology, according to the real image x input by the user, the sketch s drawn by the usereditAnd a mark area meditGenerating an edit vector deltaedit. First, a real image x is projected onto W+And generating an initial hidden code w. Then, the invention optimizes to obtain a new implicit code weditGenerated sketch
Figure BDA0003703089970000111
The same as the input sketch is in the editing area, and the generated image
Figure BDA0003703089970000112
In the non-editing area the same as the original image. To optimize for obtaining weditThe invention uses the following loss function:
Figure BDA0003703089970000113
Figure BDA0003703089970000114
wherein L isLPIPSIs LPIPS distance, as a matrix dot product. L issketchThe constraint edit area has the same structure as the sketch result, LrgbThe constrained non-editing regions remain unchanged. The final optimization loss function is:
Lediting(wedit)=β1Lsketch2Lrgb,
β1and beta2Is a hyper-parameter. In the optimization process, the weight of the fixed network and the only optimized parameter are wedit
S24: the final edit vector is:
δedit=wedit-w
δeditthe abstraction represents the compilation of the sketch and propagates through to the entire video.
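A minimal sketch of this inverse optimization, assuming `G_img` and `G_sketch` map a W+ code to an image/sketch tensor, `lpips_fn` is a perceptual distance (e.g. an LPIPS model), and `x`, `s_edit`, `m_edit` are (1, C, H, W) tensors; the optimizer, step count and learning rate are illustrative.

```python
import torch

def optimize_edit_vector(G_img, G_sketch, lpips_fn, w, x, s_edit, m_edit,
                         beta1=1.0, beta2=1.0, steps=200, lr=0.02):
    w_edit = w.clone().detach().requires_grad_(True)   # only w_edit is optimized
    opt = torch.optim.Adam([w_edit], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        sketch_hat = G_sketch(w_edit)                   # generated sketch
        img_hat = G_img(w_edit)                         # generated face image
        # editing region must follow the drawn sketch
        l_sketch = lpips_fn(sketch_hat * m_edit, s_edit * m_edit).mean()
        # non-editing region must stay close to the original image
        l_rgb = lpips_fn(img_hat * (1 - m_edit), x * (1 - m_edit)).mean()
        loss = beta1 * l_sketch + beta2 * l_rgb
        loss.backward()
        opt.step()
    return (w_edit - w).detach()                        # edit vector delta_edit
```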
The method of S3 comprises:
some editing operations affect the whole video and have low relevance to expressions and actions. These editing operations mainly change the basic shape of the face, such as the face outline and the shape of facial features. The edit vector δ_edit generated by the invention has decoupled, semantic characteristics and is applied directly to the whole video frame sequence. For each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
These edit vectors propagate the edit throughout the video, generating an edited frame sequence.
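As a minimal sketch, time-independent propagation is just a broadcast addition of the same edit vector to every frame's hidden code; the (N, 18, 512) W+ shape below is an assumption for illustration.

```python
import torch

def propagate_time_independent(latents: torch.Tensor,
                               delta_edit: torch.Tensor) -> torch.Tensor:
    # latents: (N, 18, 512) hidden codes; delta_edit: (18, 512) edit vector
    return latents + delta_edit.unsqueeze(0)   # delta_i = delta_edit for every frame
```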
The method of S4 comprises:
unlike single-frame editing, a video has expressions and motions that change over time. Users often edit time-dependent facial actions, for example adding a blink or a smile at a specific time. When an edit vector δ_edit is added at a specific frame f_t, the user also inputs the duration h and the transition time l. Then, for each frame f_i, the invention uses a piecewise linear function to generate a smoothly propagated edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t. These new edit vectors δ_i are used to synthesize highly realistic face images. With this editing mode, the invention not only produces the editing effect inside a specific time window but also forms a smooth transition in which the edit appears and disappears, for example from a natural expression to a smiling expression and back to a natural expression.
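A minimal sketch of the piecewise linear weight γ described above, treating frame indices as the time variable (an assumption for illustration; the patent speaks of the time t, duration h and transition time l):

```python
def time_window_weight(i: float, t: float, h: float, l: float) -> float:
    """Weight gamma: 0 outside [t1, t4], ramps up over l, holds 1 for h, ramps down."""
    t1, t2 = t - h / 2 - l, t - h / 2
    t3, t4 = t + h / 2, t + h / 2 + l
    if i <= t1 or i >= t4:
        return 0.0
    if t2 <= i <= t3:
        return 1.0
    if i < t2:
        return (i - t1) / (t2 - t1)   # rising transition
    return (t4 - i) / (t4 - t3)       # falling transition

# usage: delta_i = time_window_weight(i, t, h, l) * delta_edit
```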
The method of S5 comprises:
in some scenarios the user may only want to add an edit under certain expressions, while keeping the original state or adding other edits under other expressions. Such editing operations include expression-driven wrinkles (e.g. nasolabial folds or forehead lines) and shape edits that occur only under certain expressions (e.g. the eyes narrowing when smiling). To propagate expression-driven edits, the invention extracts the expression parameters of the face by 3D reconstruction. More specifically, for the key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the given expression edits, where M is the number of key frames, the invention extracts the expression parameters e_{k_1}, …, e_{k_M} and the corresponding edit vectors δ_{k_1}, …, δ_{k_M}. Some key frames may carry no editing operation but act as reference frames, indicating that a certain expression has no edit; for these frames the edit vector is the zero vector. The invention propagates the expression-guided edits with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i and c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term. In the invention, the edit vectors δ_{k_j} act on the same region.
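A minimal sketch of this expression-driven weighting, using the cosine similarity of the 3D expression parameters as the per-key-frame weight; clamping negative similarities to zero and the flattened edit-vector shape are assumptions added for illustration, not requirements stated in the patent.

```python
import numpy as np

def expression_driven_delta(e_i, key_exprs, key_deltas, eps=1e-8):
    # e_i: (d,) expression parameters of frame i
    # key_exprs: (M, d) expression parameters of the edited key frames
    # key_deltas: (M, L) edit vectors of the key frames (flattened W+ offsets)
    sims = []
    for e_k in key_exprs:
        cos = float(np.dot(e_i, e_k) /
                    (np.linalg.norm(e_i) * np.linalg.norm(e_k) + eps))
        sims.append(max(cos, 0.0))          # cosine similarity as the weight
    sims = np.asarray(sims)
    c = sims.sum() + eps                    # normalization term
    return (sims[:, None] * key_deltas).sum(axis=0) / c
```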
Wherein the method of S6 comprises:
S6.1: the invention supports editing any number of frames with sketches and fusing the editing effects. After multiple frames are edited, several edit vectors are produced; a simple method is to add them directly, but as shown in fig. 7 this produces artifacts, so the invention designs a region-aware fusion scheme.
S6.2: given the frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M editing mark regions m_1, m_2, …, m_M. Using the edit propagation methods above, M edit vectors δ_i^1, …, δ_i^M, representing the different editing operations, are generated for each frame f_i.
For each frame f_i to be predicted, the invention uses the First-order motion method to generate a deformation field and deforms the input mark regions, producing M new regions m̂_i^1, …, m̂_i^M. The deformed region m̂_i^j marks a region similar to the drawn region m_j but accounts for the expression and head motion between frame f_i and the edited key frame k_j. To fuse the different editing operations, the invention replaces the local regions of the feature map of the original frame with a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i). The invention downsamples m̂_i^j so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j). For the M editing operations, the formula is iterated M times, j = 1 to M, completing the editing and fusion of the local regions. The invention updates the middle 5 feature maps of the generation network, which mainly control the face structure information and whose resolutions range from 32×32 to 128×128; the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm. The invention applies the above fusion operation to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video.
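A minimal sketch of this per-frame feature-map fusion under stated assumptions: `feat_fn` maps a hidden code to one intermediate feature map, and each deformed mask is resized to the feature resolution and used to paste in the features generated with w_i + δ_i^j. The handling of the several middle StyleGAN layers and of the high-resolution layers is simplified here.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(feat_fn, w_i, deltas, masks):
    # feat_fn: maps a hidden code to an intermediate feature map (1, C, H, W)
    # deltas:  list of M edit vectors; masks: list of M deformed masks (1, 1, h, w)
    fused = feat_fn(w_i)                                       # initial feature map
    _, _, H, W = fused.shape
    for delta, mask in zip(deltas, masks):
        m = F.interpolate(mask, size=(H, W), mode="bilinear")  # match resolution
        edited = feat_fn(w_i + delta)                          # features of this edit
        fused = fused * (1 - m) + edited * m                   # replace local region
    return fused
```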
S6.3: the invention blends the synthesized face into the original video to synthesize the final edited video. First, a face segmentation method is used to generate the face region masks of the input frame and the editing frames, and the union of the face regions is computed. The merged region is then dilated so that its edge transitions smoothly. The smoothed face region mask is converted into fusion weights: the mask region has weight 1, the non-mask region has weight 0, and the transition edge has weights between 0 and 1; the faces before and after editing are blended with these weights. Finally, the face image is inversely aligned to the original video and the final edited video is synthesized.
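A minimal sketch of this final blending step, assuming the edited face has already been inversely aligned to the original frame; Gaussian-blurring the union mask is one simple way to obtain the smooth transition weights described above, and the feather size is an illustrative choice.

```python
import cv2
import numpy as np

def blend_back(original: np.ndarray, edited: np.ndarray,
               mask_union: np.ndarray, feather: int = 21) -> np.ndarray:
    # original, edited: (H, W, 3) uint8 frames already in the same alignment
    # mask_union: (H, W) uint8 mask, 255 inside the union of the face regions
    w = cv2.GaussianBlur(mask_union.astype(np.float32) / 255.0,
                         (feather, feather), 0)[..., None]   # weights in [0, 1]
    out = edited.astype(np.float32) * w + original.astype(np.float32) * (1 - w)
    return out.astype(np.uint8)
```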
As shown in fig. 3, the fusion of time-independent editing and time-window editing is presented. The time-independent edits on the left add hair and a beard to the edited character, and the time-window edit adds an eyebrow-raising action. On the right, the first row is the original video and the second row is the edited video, which has more hair and a beard and also performs the eyebrow-raising action.
As shown in fig. 4, the fusion of time-independent editing and expression-driven editing is presented. The time-independent edit on the left makes the nose of the face smaller; in the expression-driven edit, the eyes narrow when the mouth opens and keep their original shape when the mouth is closed. The right side shows video frames before and after editing; the editing operations are propagated well to the whole video.
As shown in fig. 5, the result of editing a face using sketches of different rendering styles is shown. The first column of images shows the drawn sketch and the selected area, and the second column of images shows the result of a single frame edit. The first line on the right shows the original video frame and the subsequent lines show the results of the edit propagation. Aiming at sketches with different drawing styles, the method generates a result with higher quality and has better robustness.
As shown in fig. 6, the result of editing for a face with angle changes is presented. The first line of images is the original sequence of video frames, the second line of images is the edited sequence of video frames, and the left side is the user-drawn sketch and the selected region. The invention generates high-quality editing results even if the input face video has rotation and angle changes.
As shown in fig. 7, the results of two editorial fusion methods are presented. The user has added two time-independent edits, varying the face and hair, while adding a time window edit. The first row shows the editing results of the edited sketch, the second row shows the original video, the third row shows the result of a fusion mode of directly adding a plurality of editing vectors, and the fourth row shows the result of a region-aware fusion mode. The quality of the result generated by the regional perception fusion mode is higher than that of the direct addition of the editing vectors, and the effectiveness of the regional perception fusion module is proved.
As shown in fig. 8, the fusion results of the different edits are shown. The first line is the original video, the second line is the result of the time sequence irrelevant editing, the hair area of the human face is modified, the third line is the time window editing, the smile is added to the human face, and the last line is the result of the fusion of the two types of editing.
As shown in fig. 9, an intermediate result of the face video editing is shown. The second row shows the real video and the third row shows the result of the alignment. The fourth line shows the result of the drawing mask deformed according to the expression and the action, and the fifth line shows the editing result of the aligned face. The sixth row shows the result of face region segmentation, and the last row shows the final anti-aligned face editing result.
As shown in fig. 10, the results of key point smoothing are presented. The first three rows show the results without key point smoothing: the cropped and aligned faces exhibit very large jitter. The last three rows show the results with key point smoothing, and the cropped and aligned faces have no jitter problem.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a sketch-based deep face video editing system, which comprises:
a module 1, configured to align and crop the face in an original video and encode the face into a hidden space to obtain the hidden codes of all frames of the face video;
a module 2, configured to add a sketch generation branch Ĝ_sketch to the StyleGAN generation network and inversely optimize the image hidden code to generate an edit vector δ_edit;
a module 3, configured to superimpose the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4, configured to superimpose the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5, configured to superimpose the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and a module 6, configured to fuse the different types of edits added on different frames with a region-aware fusion method, and blend the face back into the original video to obtain the sketch-based face video editing result.
In the sketch-based deep face video editing system, the module 1 is configured to detect the face key points of the face video, smooth them with a time window, then align and crop the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and to project the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
In the sketch-based deep face video editing system, the module 2 is configured to acquire the original StyleGAN generation network G and construct a generation network Ĝ that models the joint probability distribution of real face images and sketches; Ĝ contains two branches, Ĝ_img and Ĝ_sketch, where Ĝ_img is the original generation network G and generates realistic face images, and Ĝ_sketch generates the corresponding sketch images; given the hidden code w of an image, Ĝ_img generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the Ĝ_sketch branch; the feature maps of the Ĝ_sketch branch are upsampled and added to the residual images obtained by convolving the feature maps F_i, generating the sketch image corresponding to the hidden code w;
a sketch generation network S is trained with a dataset of paired images and sketches; S takes a face image as input and generates the corresponding sketch, and is used to train the sketch generation branch Ĝ_sketch; a hidden code w is randomly sampled and fed to Ĝ, generating a realistic face image x̂ = Ĝ_img(w) and the corresponding sketch ŝ = Ĝ_sketch(w); the loss function
L(Ĝ_sketch) = α_1·L_VGG(ŝ, S(x̂)) + α_2·L_L2(ŝ, S(x̂))
is constructed to train the sketch generation branch Ĝ_sketch, where L_VGG is a perceptual loss that measures visual similarity with a VGG19 model, L_L2 is the pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distribution of real images and sketches has been modelled, given the input face image x, the drawn sketch s_edit and the selected region m_edit: the face image x is projected into the W+ space and optimized to obtain the hidden code w_edit, such that the generated sketch Ĝ_sketch(w_edit) matches the input sketch inside the editing region and the generated image Ĝ_img(w_edit) matches the original image in the non-editing region; w_edit is obtained with the loss function
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch result, L_rgb constrains the non-editing region to remain unchanged, β_1 and β_2 are hyper-parameters, and the generation network Ĝ is kept fixed while w_edit is optimized;
the final edit vector is δ_edit = w_edit − w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N.
The module 3 is configured to superimpose the δ_edit corresponding to each frame f_i onto its hidden code, propagating the edit to the whole face video and generating an edited frame sequence.
In the sketch-based deep face video editing system, the module 4 is configured to add an action such as blinking or smiling at a specific time of the face video; at a specific frame f_t an edit vector δ_edit is added together with an input duration h and transition time l; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, giving the new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, N,
γ = 0 for i ≤ t_1 or i ≥ t_4, γ = (i − t_1)/(t_2 − t_1) for t_1 < i < t_2, γ = 1 for t_2 ≤ i ≤ t_3, γ = (t_4 − i)/(t_4 − t_3) for t_3 < i < t_4,
where t_1 = t − h/2 − l, t_2 = t − h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t;
these new edit vectors δ_i are used to synthesize realistic face images;
the module 5 comprises:
given several key frames f_{k_1}, f_{k_2}, …, f_{k_M} of the face video, extracting the face expression parameters e_{k_1}, …, e_{k_M} by 3D reconstruction, together with the corresponding edit vectors δ_{k_1}, …, δ_{k_M}, where M is the number of key frames; the expression-guided edit is propagated with:
δ_i = (1/c) · Σ_{j=1…M} sim(e_i, e_{k_j}) · δ_{k_j},
where e_i is the expression parameter of the input frame f_i, c = Σ_{j=1…M} sim(e_i, e_{k_j}) is a normalization term, and the edit vectors δ_{k_j} act on the same region;
the module 6 is configured, given the frame sequence f_1, f_2, …, f_N, to let the user select M key frames k_1, k_2, …, k_M and edit different regions, corresponding to M drawn mark regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors δ_i^1, …, δ_i^M are generated; for each frame f_i to be predicted, a deformation field is generated and the input mark regions are deformed, producing M new mark regions m̂_i^1, …, m̂_i^M, where m̂_i^j is the region m_j after deformation by the action and expression; the local regions of the feature map of the original frame are replaced by a new feature map:
F_i^j = F_i^{j−1} ⊙ (1 − m̂_i^j) + G_F(w_i + δ_i^j) ⊙ m̂_i^j, j = 1, 2, …, M,
where G_F(·) denotes the corresponding intermediate feature map of the StyleGAN generation network G and the initial feature map is F_i^0 = G_F(w_i);
the mark region m̂_i^j is downsampled so that its resolution matches F_i^{j−1} and G_F(w_i + δ_i^j); the feature map F_i^j is updated once for each of the M editing operations, M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated in this way, while the high-resolution layers are driven by the original hidden code w_i following the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, generating an edited and fused aligned face video;
and face mask regions of the input frame and the editing frames are generated with a face segmentation method and merged; a smooth edge is generated for the merged mask region and used as the fusion weight to blend the faces before and after editing; the blended face image is inversely aligned back to the original video, synthesizing the face video editing result.
The invention also provides a storage medium for storing a program for executing the method for editing the deep face video based on the sketch.
The invention also provides a client used for the deep face video editing system based on the sketch.

Claims (10)

1. A deep face video editing method based on sketch is characterized by comprising the following steps:
step 1, aligning and cropping the face in an original video, and encoding the face into a hidden space to obtain the hidden codes of all frames of the face video;
step 2, adding a sketch generation branch Ĝ_sketch to the StyleGAN generation network, and inversely optimizing the image hidden code to generate an edit vector δ_edit;
step 3, superimposing the edit vector δ_edit on the hidden codes of all frames to complete the propagation of time-independent edits;
step 4, superimposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
step 5, superimposing the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the edited frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
and step 6, fusing the different types of edits added on different frames with a region-aware fusion method, and blending the face back into the original video to obtain the sketch-based face video editing result.
2. The sketch-based deep face video editing method according to claim 1, wherein the step 1 comprises: detecting face key points in the face video, smoothing them with a time window, then aligning and cropping the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video; and projecting the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
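The temporal smoothing of the detected key points in claim 2 above can be illustrated with a simple centered moving average; the landmark detector and the W+ encoder are outside the scope of this sketch, and the window length is an assumed value.

```python
import numpy as np


def smooth_landmarks(landmarks, window=5):
    """Temporally smooth per-frame face key points with a centered time window.

    landmarks: array of shape (N, K, 2) -- K detected key points for each of N frames
    window:    odd window length (assumed; the claim fixes no particular size)
    The smoothed key points are then used to compute one alignment transform per
    frame for cropping the face before projecting the frames into W+.
    """
    n = len(landmarks)
    half = window // 2
    smoothed = np.empty_like(landmarks)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed[i] = landmarks[lo:hi].mean(axis=0)
    return smoothed
```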
3. The sketch-based deep face video editing method according to claim 2, wherein the step 2 comprises:
acquiring the original StyleGAN generation network G and constructing a generation network that models the joint probability distribution of real face images and sketches; the generation network contains two branches: the image branch is the original generation network G and generates a simulated face image, and the sketch branch generates the corresponding sketch image; given the hidden code w of an image, the image branch generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the sketch branch; the feature maps of the sketch branch are up-sampled and added to residual images obtained by convolving the feature maps F_i, so as to generate the sketch image corresponding to the hidden code w;
training a sketch generation network S on a data set of paired images and sketches, which takes a face image as input and generates the corresponding sketch and is used to supervise the training of the sketch generation branch; randomly sampling a hidden code w and inputting it into the generation network to produce a highly realistic face image and the corresponding sketch; constructing the loss function α_1·L_VGG + α_2·L_L2 to train the sketch generation branch, where L_VGG is a perceptual loss function that measures visual similarity with a VGG19 model, L_L2 is a pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distributions of real images and sketches have been modeled, a sketch s_edit is drawn on the input face image x and an editing region m_edit is selected; the face image x is projected into the W+ space to obtain the hidden code w_edit such that the generated sketch is the same as the input sketch inside the editing region and the generated image remains unchanged in the non-editing region; w_edit is obtained with the following loss function,
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; w_edit is obtained by optimization with the generation network fixed;
the final edit vector is δ_edit = w_edit - w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N;
the step 3 comprises propagating the edit vector δ_edit corresponding to each frame f_i to the whole face video to generate an edited frame sequence.
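A minimal PyTorch-style sketch of the inversion in step 2 above that yields the edit vector δ_edit; the generator interface returning an (image, sketch) pair, the masked L1 distances standing in for L_sketch and L_rgb, and the optimizer settings are assumptions for illustration, not the exact losses of the filing.

```python
import torch


def optimize_edit_code(generator, w, sketch_edit, mask_edit,
                       beta1=1.0, beta2=1.0, steps=200, lr=0.02):
    """Optimize w_edit so the generated sketch matches the drawn strokes inside the
    editing region while the RGB output stays unchanged outside it; the generator
    (the joint image/sketch network of step 2) is kept fixed.
    generator(w) is assumed to return a (rgb_image, sketch_image) pair."""
    with torch.no_grad():
        rgb_ref, _ = generator(w)                  # appearance to preserve outside the edit
    w_edit = w.clone().requires_grad_(True)
    opt = torch.optim.Adam([w_edit], lr=lr)
    for _ in range(steps):
        rgb, sketch = generator(w_edit)
        l_sketch = ((sketch - sketch_edit).abs() * mask_edit).mean()       # edit region
        l_rgb = ((rgb - rgb_ref).abs() * (1.0 - mask_edit)).mean()         # rest of the face
        loss = beta1 * l_sketch + beta2 * l_rgb
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (w_edit - w).detach()                   # delta_edit = w_edit - w
```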
4. The sketch-based deep face video editing method according to claim 3, wherein the step 4 comprises:
adding a blinking or smiling action at a specific time of the face video: for a specific frame f_t, an edit vector δ_edit, an input duration h and a transition time l are given; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, yielding a new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, M,
[equation image defining the piecewise linear weight γ]
where t_1 = t - h/2 - l, t_2 = t - h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t; these new edit vectors δ_i are used to synthesize the simulated face images;
the step 5 comprises:
given a plurality of key frames in the face video, extracting the face expression parameters of these key frames by 3D reconstruction, together with the corresponding edit vectors, where M is the number of key frames; the expression-guided edits are propagated using:
[equation image]
where e_i is the expression parameter of the input frame f_i, c is a normalization term, and each key-frame expression parameter and its edit vector correspond to the same region;
the step 6 comprises:
given a continuous frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn marker regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors are generated; for each frame f_i to be predicted, a deformation field is generated and the input marker regions are deformed, producing M new marker regions m_1', m_2', …, m_M', where m_j' is the region of m_j after deformation by the action and the expression; the local area of the feature map of the original frame is replaced with a new feature map:
[equation image]
where the edited feature maps and the initial feature map are generated by G, the StyleGAN generation network; each m_j' is downsampled so that its resolution matches that of the original and edited feature maps; the feature map is updated once for each of the M editing operations, i.e. M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated, while the higher resolutions are adjusted by the original hidden code w_i according to the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, to generate an edited and fused aligned face video;
face marker regions of the input frame and the editing frame are generated with a face segmentation method and merged; a smooth edge is generated for the merged marker region and used as the fusion weight to fuse the faces before and after editing; the fused face images are then aligned back to the original video to synthesize the face video editing result.
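One consistent reading of the piecewise linear weight γ used in step 4 of claim 4 above is the trapezoidal ramp below; since the γ formula itself only survives as an equation image, the 0 → 1 → 0 ramp shape is an assumption.

```python
def edit_weight(i, t, h, l):
    """Piecewise linear weight gamma for frame time i, given the editing frame time t,
    the hold duration h and the transition time l (break points as in claim 4).
    The linear ramp-up / hold / ramp-down shape is an assumed reading of the
    equation image in the original filing."""
    t1, t2 = t - h / 2 - l, t - h / 2
    t3, t4 = t + h / 2, t + h / 2 + l
    if i <= t1 or i >= t4:
        return 0.0                       # outside the edited interval: no edit
    if t2 <= i <= t3:
        return 1.0                       # action fully applied while held
    if i < t2:
        return (i - t1) / (t2 - t1)      # linear ramp-up
    return (t4 - i) / (t4 - t3)          # linear ramp-down
```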
5. A sketch-based deep face video editing system, characterized by comprising:
a module 1 for aligning and cropping the face in an original video, and encoding it into a hidden space to obtain the hidden codes of all frames of the face video;
a module 2 for adding a sketch generation branch to a StyleGAN generation network, and reversely optimizing the image hidden code to generate an edit vector δ_edit;
a module 3 for superposing the edit vector δ_edit onto the hidden codes of all frames to complete the propagation of time-independent edits;
a module 4 for superposing the edit vector δ_edit with weights given by a piecewise linear function to complete the propagation of action or expression edits;
a module 5 for superposing the edit vector δ_edit with weights computed from the similarity between the expression parameters of the current frame and the editing frame, so that the edit corresponds to a specific expression, completing expression-driven edit propagation;
a module 6 for fusing the different types of edits added at different frames with a region-aware fusion method, and blending the edited face back into the original video to obtain the sketch-based face video editing result.
6. The sketch-based deep face video editing system according to claim 5, wherein the module 1 is configured to detect face key points in the face video, smooth them with a time window, then align and crop the face to generate a video frame sequence f_1, f_2, …, f_N, where N is the number of frames of the face video, and to project the frame sequence into the hidden space W+ to generate a hidden code sequence w_1, w_2, …, w_N.
7. The sketch-based deep face video editing system according to claim 6, wherein the module 2 is configured to acquire the original StyleGAN generation network G and construct a generation network that models the joint probability distribution of real face images and sketches; the generation network contains two branches: the image branch is the original generation network G and generates a simulated face image, and the sketch branch generates the corresponding sketch image; given the hidden code w of an image, the image branch generates feature maps F_1, F_2, …, F_14, where F_1 is used as the initial feature map of the sketch branch; the feature maps of the sketch branch are up-sampled and added to residual images obtained by convolving the feature maps F_i, so as to generate the sketch image corresponding to the hidden code w;
a sketch generation network S is trained on a data set of paired images and sketches, takes a face image as input and generates the corresponding sketch, and is used to supervise the training of the sketch generation branch; a hidden code w is randomly sampled and input into the generation network to produce a highly realistic face image and the corresponding sketch; the loss function α_1·L_VGG + α_2·L_L2 is constructed to train the sketch generation branch, where L_VGG is a perceptual loss function that measures visual similarity with a VGG19 model, L_L2 is a pixel-wise L2 loss, and α_1 and α_2 are preset weights;
after the distributions of real images and sketches have been modeled, a sketch s_edit is drawn on the input face image x and an editing region m_edit is selected; the face image x is projected into the W+ space to obtain the hidden code w_edit such that the generated sketch is the same as the input sketch inside the editing region and the generated image remains unchanged in the non-editing region; w_edit is obtained with the following loss function,
L_editing(w_edit) = β_1·L_sketch + β_2·L_rgb,
where L_sketch constrains the editing region to have the same structure as the sketch, L_rgb constrains the non-editing region to remain unchanged, and β_1 and β_2 are hyper-parameters; w_edit is obtained by optimization with the generation network fixed;
the final edit vector is δ_edit = w_edit - w; δ_edit represents the sketch edit and is propagated to the whole face video; for each frame f_i a corresponding edit vector is generated:
δ_i = δ_edit, i = 1, 2, …, N;
the module 3 is configured to propagate the edit vector δ_edit corresponding to each frame f_i to the whole face video to generate an edited frame sequence.
8. The sketch-based deep face video editing system according to claim 7, wherein the module 4 is configured to add a blinking or smiling action at a specific time of the face video: for a specific frame f_t, an edit vector δ_edit, an input duration h and a transition time l are given; for each frame f_i, a piecewise linear function is used to generate a smoothly propagated edit vector, yielding a new edit vector δ_i:
δ_i = γ·δ_edit, i = 1, 2, …, M,
[equation image defining the piecewise linear weight γ]
where t_1 = t - h/2 - l, t_2 = t - h/2, t_3 = t + h/2, t_4 = t + h/2 + l, and t is the time corresponding to the editing frame f_t; these new edit vectors δ_i are used to synthesize the simulated face images;
the module 5 comprises:
given a plurality of key frames in the face video, extracting the face expression parameters of these key frames by 3D reconstruction, together with the corresponding edit vectors, where M is the number of key frames; the expression-guided edits are propagated using:
[equation image]
where e_i is the expression parameter of the input frame f_i, c is a normalization term, and each key-frame expression parameter and its edit vector correspond to the same region;
the module 6 is configured to: given a continuous frame sequence f_1, f_2, …, f_N, the user selects M key frames k_1, k_2, …, k_M and edits different regions, corresponding to M drawn marker regions m_1, m_2, …, m_M; for each frame f_i, M edit vectors are generated; for each frame f_i to be predicted, a deformation field is generated and the input marker regions are deformed, producing M new marker regions m_1', m_2', …, m_M', where m_j' is the region of m_j after deformation by the action and the expression; the local area of the feature map of the original frame is replaced with a new feature map:
[equation image]
where the edited feature maps and the initial feature map are generated by G, the StyleGAN generation network; each m_j' is downsampled so that its resolution matches that of the original and edited feature maps; the feature map is updated once for each of the M editing operations, i.e. M times in total; the middle 5 feature maps of StyleGAN, with resolutions from 32×32 to 128×128, are updated, while the higher resolutions are adjusted by the original hidden code w_i according to the StyleGAN algorithm; the above fusion operation is applied to all frames f_i, i = 1, 2, …, N, to generate an edited and fused aligned face video;
face marker regions of the input frame and the editing frame are generated with a face segmentation method and merged; a smooth edge is generated for the merged marker region and used as the fusion weight to fuse the faces before and after editing; the fused face images are then aligned back to the original video to synthesize the face video editing result.
9. A storage medium storing a program for executing the sketch-based deep face video editing method according to any one of claims 1 to 5.
10. A client for use in the sketch-based deep face video editing system according to any one of claims 6 to 8.
CN202210698610.8A 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch Pending CN115278106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210698610.8A CN115278106A (en) 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210698610.8A CN115278106A (en) 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch

Publications (1)

Publication Number Publication Date
CN115278106A true CN115278106A (en) 2022-11-01

Family

ID=83761741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210698610.8A Pending CN115278106A (en) 2022-06-20 2022-06-20 Deep face video editing method and system based on sketch

Country Status (1)

Country Link
CN (1) CN115278106A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210209464A1 (en) * 2020-01-08 2021-07-08 Palo Alto Research Center Incorporated System and method for synthetic image generation with localized editing
CN111489405A (en) * 2020-03-21 2020-08-04 复旦大学 Face sketch synthesis system for generating confrontation network based on condition enhancement
CN111652828A (en) * 2020-05-27 2020-09-11 北京百度网讯科技有限公司 Face image generation method, device, equipment and medium
CN113112572A (en) * 2021-04-13 2021-07-13 复旦大学 Hidden space search-based image editing method guided by hand-drawn sketch
CN113901894A (en) * 2021-09-22 2022-01-07 腾讯音乐娱乐科技(深圳)有限公司 Video generation method, device, server and storage medium
CN114255496A (en) * 2021-11-30 2022-03-29 北京达佳互联信息技术有限公司 Video generation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG PEN ET AL: "Analysis of Neural Style Transfer Based on Generative Adversarial Network", 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), 24 December 2021 (2021-12-24) *
SU JIAYANG: "Face video synthesis *** with customized actions", Master's Thesis, Beijing University of Posts and Telecommunications, 15 January 2022 (2022-01-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115810215A (en) * 2023-02-08 2023-03-17 科大讯飞股份有限公司 Face image generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Jiang et al. Scfont: Structure-guided chinese font generation via deep stacked networks
US11880766B2 (en) Techniques for domain to domain projection using a generative model
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN111915693B (en) Sketch-based face image generation method and sketch-based face image generation system
US9734613B2 (en) Apparatus and method for generating facial composite image, recording medium for performing the method
Seol et al. Artist friendly facial animation retargeting
Chen et al. PicToon: a personalized image-based cartoon system
Cong Art-directed muscle simulation for high-end facial animation
Saunders et al. Anonysign: Novel human appearance synthesis for sign language video anonymisation
CN110310351A (en) A kind of 3 D human body skeleton cartoon automatic generation method based on sketch
CN115278106A (en) Deep face video editing method and system based on sketch
CN112991484B (en) Intelligent face editing method and device, storage medium and equipment
CN111275778A (en) Face sketch generating method and device
Song et al. FineStyle: Semantic-Aware Fine-Grained Motion Style Transfer with Dual Interactive-Flow Fusion
Li et al. Orthogonal-blendshape-based editing system for facial motion capture data
Tejera et al. Animation control of surface motion capture
Kawai et al. Data-driven speech animation synthesis focusing on realistic inside of the mouth
Nakatsuka et al. Audio-guided Video Interpolation via Human Pose Features.
Nakatsuka et al. Audio-oriented video interpolation using key pose
CN115578298A (en) Depth portrait video synthesis method based on content perception
Chen et al. Animating lip-sync characters with dominated animeme models
Cao et al. AnimeDiffusion: anime diffusion colorization
Sistla et al. A state-of-the-art review on image synthesis with generative adversarial networks
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination