CN112884636A - Style migration method for automatically generating stylized video - Google Patents

Style migration method for automatically generating stylized video

Info

Publication number
CN112884636A
Authority
CN
China
Prior art keywords
encoder
migration
style
video
lightweight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110117964.4A
Other languages
Chinese (zh)
Other versions
CN112884636B (en)
Inventor
霍静
孔美豪
***
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110117964.4A
Publication of CN112884636A
Application granted
Publication of CN112884636B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a style migration method for automatically generating stylized videos. The method constructs a style migration model comprising a high-compression self-encoder based on knowledge distillation and a feature migration module based on semantic alignment. The self-encoder is divided into an encoder and a decoder: the encoder encodes original video content frames and style images into feature maps; the feature migration module fuses the content and style features produced by the encoder under semantic alignment, yielding semantically aligned fused migration features; and the decoder decodes these features into stylized video frames. The method guarantees the stability of the migrated video, can stylize any video with any style, runs very fast during style migration, and therefore has high practicability.

Description

Style migration method for automatically generating stylized video
Technical Field
The invention belongs to the field of computer application, and particularly relates to a style migration method for automatically generating a stylized video.
Background
With the development and popularization of the internet and the mobile internet, more and more short-video platforms have emerged, and people's artistic demands on video have grown accordingly. Relying on professional artists or professional editors is both inconvenient and costly. Automatically generating videos of arbitrary artistic style from ordinary videos by computer is therefore attracting increasing attention.
Given a content image and a target style image, style migration aims to produce a stylized image that has both the structure of the content image and the texture of the style image. Style migration on single images has already been studied extensively, and much attention has now turned to video style migration because of its very broad application prospects (including artistic conversion of short videos). Clearly, style migration of video is both more practical and more challenging than style migration of single images.
Compared with traditional image style migration, video style migration is harder because stylization quality, stability and computational efficiency must be addressed simultaneously. Existing video style migration methods can be roughly divided into two categories according to whether they use optical flow.
The first category uses optical flow: a temporal-consistency loss is introduced to obtain stability between adjacent frames through the supervised constraint of optical flow. An optimization-based method with optical-flow constraints can obtain stable migrated video, but stylizing each frame of the video takes nearly three minutes, an unacceptably slow migration speed. Later work proposed video style migration methods based on feed-forward networks, but because optical-flow constraints are still used in both the training and testing stages, these methods cannot achieve real-time performance. To solve this problem, some methods use optical flow only in the training stage and avoid it at test time; the speed improves, but the final migration results are far less stable than those of methods that also use optical flow at test time.
The second category does not use optical flow. LST, for example, implements an affine transformation of features and can thus produce stable stylized video. Later studies proposed combining an Avatar-Net-based decoration module with a compound normalization method to ensure video stability. However, all existing methods that avoid optical flow use the original VGG network to encode content and style features, and the VGG network is very bulky: storing the VGG model requires a very large amount of memory, which greatly limits the application of these methods on small terminal devices.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a style migration method for automatically generating stylized videos that achieves real-time, stable style migration of arbitrary videos.
The technical scheme is as follows: the invention provides a style migration method for automatically generating a stylized video, which specifically comprises the following steps:
(1) constructing a video style migration network model comprising a high-compression self-encoder module based on knowledge distillation and a feature migration module based on semantic alignment; the self-encoder module comprises a lightweight encoder and a lightweight decoder;
(2) encoding content video frames and style images with the encoder: knowledge distillation is performed on the lightweight encoder with the VGG network as teacher, so that the encoder learns the encoding capability of the teacher VGG encoder while remaining sufficiently lightweight, and encodes original video content frames and style images into feature maps;
(3) fusing features with the semantic-alignment-based feature migration module: the content and style features produced by the encoder are fused to obtain semantically aligned fused migration features;
(4) performing knowledge distillation on the lightweight decoder with the VGG network as teacher: the decoder learns the decoding capability of the teacher VGG decoder while remaining sufficiently lightweight, decodes the fused migration features into stylized video frames, and finally synthesizes the video.
Further, implementing step (2) requires optimizing the following loss function:

$$\mathcal{L}_{\tilde{E}} = \gamma\,\| I - I' \|_2^2 + \lambda \sum_{k} \| E_k(I) - \tilde{E}_k(I) \|_2^2$$

wherein I is the original image, E is the encoder of the VGG network, Ẽ is the lightweight encoder, I′ is the reconstructed image, E_k(I) is the k-th layer output feature map of the original VGG encoder, Ẽ_k(I) is the k-th layer output feature map of the lightweight encoder, and λ and γ are both hyper-parameters.
Further, the step (3) is realized as follows:
the characteristic diagram of the content image output by the encoder is Fc∈RCx(WxH)The output obtained from the stylized image is Fs∈RCx(WxH)Wherein C is the number of the channels of the feature map, and W and H are the width and the height of the feature map respectively; the feature migration module based on semantic alignment aims at finding a feature migration which enables semantic alignment of content graphs of different video frames through conversion, and supposing that the conversion process can be parameterized into a projection matrix P e RCxCThen the optimized objective function is:
Figure BDA0002921029310000031
wherein ,
Figure BDA0002921029310000032
denotes from FcIn the operation of selecting the i-th position feature vector, AijTo represent
Figure BDA0002921029310000033
And
Figure BDA0002921029310000034
k neighbor matrix of (1);
solving for P as:
Figure BDA0002921029310000035
wherein A is the affine matrix defined above, U is the diagonal matrix, and
Figure BDA0002921029310000036
Figure BDA0002921029310000037
is a matrix with characteristic alignment function, and the projection matrix P is formed as P ═ g (F (F)c)f(Fs)T) In the linear conversion process, g (x) MX and f (x) XTT(ii) a The (x) process chooses to fit with three convolutional layers, and the g () process uses a fully-connected layer fit.
Further, implementing step (4) requires optimizing the following loss function:

$$\mathcal{L}_{\tilde{D}} = \| I - I' \|_2^2 + \lambda \sum_{k} \| E_k(I) - E_k(I') \|_2^2$$

wherein I is the original image, E_k(I) is the k-th layer output feature map of the original VGG encoder, Ẽ_k(I) is the k-th layer output feature map of the lightweight encoder, I′ = D̃(Ẽ(I)) is the image reconstructed by the lightweight decoder D̃, and λ is a hyper-parameter. The distillation aims to make D̃ retain the information of the original E while being able to reconstruct images well from the output features produced by Ẽ.
Beneficial effects: compared with the prior art, the invention has the following advantages: 1. high stylization quality of video frames is achieved while stability and temporal consistency between adjacent frames are taken into account, so the stability of the migrated video is guaranteed; 2. the stylization is richly diverse, and any video can be stylized with any style; 3. video style migration runs in real time, i.e., the style migration process is very fast, and the whole model is lightweight, giving the method high practicability.
Drawings
FIG. 1 is a flow chart of the invention;
FIG. 2 is a block diagram of the knowledge-distillation-based high-compression self-encoder of the present invention;
FIG. 3 is a schematic diagram of a video style migration network structure constructed by the present invention;
FIG. 4 is an exemplary diagram of a video style migration effect according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention provides a style migration method for automatically generating stylized video. The video style migration process consists of three main stages: first, the encoder encodes the content video frames and the style image; second, the encoded content and style features undergo feature style migration and fusion; third, the decoder decodes the migrated and fused features into stylized video frames, from which the final video is synthesized. The sizes of the encoder and decoder largely determine whether the model is lightweight, while the design of the feature migration part directly determines whether the migrated stylized video is stable, whether migration can run in real time, and whether arbitrary styles can be migrated. As shown in fig. 1, the method specifically comprises the following steps:
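To make the three-stage data flow concrete, a minimal sketch follows (in PyTorch, which the patent does not specify; the class and attribute names are illustrative assumptions, not identifiers from the patent):

```python
import torch
import torch.nn as nn

class VideoStyleTransferModel(nn.Module):
    """Three-stage pipeline: lightweight encoder -> FTM -> lightweight decoder."""
    def __init__(self, encoder: nn.Module, ftm: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # stage 1: distilled lightweight encoder
        self.ftm = ftm           # stage 2: semantic-alignment feature migration
        self.decoder = decoder   # stage 3: distilled lightweight decoder

    def forward(self, content_frame: torch.Tensor, style_image: torch.Tensor) -> torch.Tensor:
        f_c = self.encoder(content_frame)   # content feature map
        f_s = self.encoder(style_image)     # style feature map
        f_cs = self.ftm(f_c, f_s)           # semantically aligned fused features
        return self.decoder(f_cs)           # stylized frame

# Per-frame usage over a clip (the style image stays fixed):
#   stylized_frames = [model(frame, style) for frame in frames]
```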
step 1: and constructing a video style migration network model, as shown in FIG. 3, which comprises a high-compression self-encoder module based on knowledge distillation and a feature migration module based on semantic alignment.
The self-encoder is divided into an encoder and a decoder, the encoder can encode original video content frames and style images into feature maps, the feature migration module can fuse content features and style features obtained by encoding of the encoder based on semantic alignment, finally fusion migration features based on semantic alignment are obtained, and finally stylized video frames are obtained through the decoder.
A feature style migration module (FTM) based on semantic alignment, which can ensure the stability between adjacent frames in the video style migration process; the size of the video style migration model is only 2.67MB, and the speed of executing the video style migration can reach 166.67 fps.
Step 2: encode content video frames and style images with the encoder: knowledge distillation is performed on the lightweight encoder with the VGG network as teacher, so that the encoder learns the encoding capability of the teacher VGG encoder while remaining sufficiently lightweight, and encodes original video content frames and style images into feature maps.
As shown in fig. 2, the network structure of the lightweight encoder and decoder is as follows: the lightweight encoder network consists of four symmetric groups of convolutional layers for downsampling (mirrored by upsampling groups in the decoder), max pooling layers and ReLU activation functions, and feature-encodes the input video frames and arbitrary style images. The VGG network is a network structure widely used in style migration; the lightweight encoder network Ẽ is a student network obtained by knowledge distillation from the VGG teacher network, so it realizes the image encoding process with very few parameters. As shown in the network architecture of fig. 2, the encoder network must learn the encoding capability of the teacher VGG encoder while being sufficiently lightweight; the loss function to optimize is

$$\mathcal{L}_{\tilde{E}} = \gamma\,\| I - I' \|_2^2 + \lambda \sum_{k} \| E_k(I) - \tilde{E}_k(I) \|_2^2$$

wherein I is the original image, E is the encoder of the original VGG network, Ẽ is the lightweight encoder, I′ is the image reconstructed by the decoder, E_k(I) is the k-th layer output feature map of the original VGG encoder, Ẽ_k(I) is the k-th layer output feature map of the lightweight encoder, and λ and γ are both hyper-parameters.
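A minimal sketch of such a lightweight encoder and its distillation loss is given below (PyTorch is an assumption; the channel widths, function names, and the requirement that teacher and student feature shapes match per stage are illustrative, not values from the patent):

```python
import torch.nn as nn
import torch.nn.functional as F

class LightEncoder(nn.Module):
    """Sketch of the student encoder: four groups of conv + ReLU,
    with max-pool downsampling between groups (widths are assumptions)."""
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        groups, in_ch = [], 3
        for i, w in enumerate(widths):
            layers = ([nn.MaxPool2d(2)] if i > 0 else []) + [
                nn.Conv2d(in_ch, w, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            groups.append(nn.Sequential(*layers))
            in_ch = w
        self.groups = nn.ModuleList(groups)

    def forward(self, x):
        feats = []
        for g in self.groups:
            x = g(x)
            feats.append(x)   # per-group feature maps, playing the role of E~_k(I)
        return feats

def encoder_distill_loss(img, recon, teacher_feats, student_feats, lam=1.0, gamma=1.0):
    """gamma * ||I - I'||^2 + lam * sum_k ||E_k(I) - E~_k(I)||^2.
    Assumes matching feature shapes per stage; in practice a 1x1
    projection would reconcile differing channel counts."""
    loss = gamma * F.mse_loss(recon, img)
    for t, s in zip(teacher_feats, student_feats):
        loss = loss + lam * F.mse_loss(s, t)
    return loss
```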
Step 3: knowledge distillation is performed on the lightweight decoder with the VGG network as teacher, so that the decoder learns the decoding capability of the teacher VGG decoder while being sufficiently lightweight.
As shown in the network architecture of fig. 2, the lightweight decoder network D̃, which feature-decodes the migrated features, is distilled with the VGG network as the teacher network; the decoder network must learn the decoding capability of the teacher VGG decoder while being sufficiently lightweight. The loss function to optimize is

$$\mathcal{L}_{\tilde{D}} = \| I - I' \|_2^2 + \lambda \sum_{k} \| E_k(I) - E_k(I') \|_2^2$$

wherein I′ = D̃(Ẽ(I)) is the image reconstructed by the lightweight decoder D̃ and λ is a hyper-parameter. The distillation aims to make D̃ retain the information of the original E while reconstructing images well from the output features produced by Ẽ.
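A sketch of this decoder-distillation loss follows (PyTorch assumed; the function and argument names are illustrative, and the encoders are assumed to return lists of per-stage feature maps as in the previous sketch):

```python
import torch.nn.functional as F

def decoder_distill_loss(img, teacher_vgg, light_enc, light_dec, lam=1.0):
    """Reconstruction plus perceptual terms for distilling the lightweight
    decoder, following the loss form reconstructed in the text above."""
    feats = light_enc(img)            # E~_k(I), list of per-stage feature maps
    recon = light_dec(feats[-1])      # I' = D~(E~(I)), decode the deepest features
    loss = F.mse_loss(recon, img)     # || I - I' ||^2
    for a, b in zip(teacher_vgg(img), teacher_vgg(recon)):
        loss = loss + lam * F.mse_loss(b, a)   # || E_k(I) - E_k(I') ||^2
    return loss
```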
Step 4: the semantic-alignment-based feature migration module fuses the content and style features produced by the encoder to obtain semantically aligned fused migration features.
The semantic-alignment-based feature migration module is the key to real-time, stable video style migration: it must complete style feature migration efficiently while keeping features semantically aligned. To achieve this, the idea of manifold alignment is adopted. Assume the feature map of the content image output by the encoder is F_c ∈ R^{C×(W×H)} and the output of the style image from the lightweight encoding network is F_s ∈ R^{C×(W×H)}, where C is the number of channels of the feature map and W and H are its width and height. The FTM module is designed to output the semantically aligned migrated features F_cs and pass them to the decoder to obtain the migrated result image. The goal of the FTM module is to find a transformation that yields semantically aligned feature migration for the content maps of different video frames; assuming the transformation can be parameterized as a projection matrix P ∈ R^{C×C}, the objective function to optimize is

$$\min_{P} \sum_{i,j} A_{ij}\,\bigl\| P F_c^{(i)} - F_s^{(j)} \bigr\|_2^2$$

wherein F_c^{(i)} denotes the feature vector at the i-th position of F_c, and A_{ij} is the k-nearest-neighbour affinity between F_c^{(i)} and F_s^{(j)}. The objective therefore makes the transformed content features similar to their k-nearest-neighbour features in the style feature space. During video style migration there may be moving objects and lighting changes that would otherwise cause jitter after migration; under the above affinity-preserving transformation, stable consistency is kept between adjacent frames, producing a stable video style migration result.
Solving the above equation amounts to computing a closed-form solution for P, obtained by taking the derivative with respect to P and setting it to 0:

$$P = F_s A^{\top} F_c^{\top}\,\bigl(F_c U F_c^{\top}\bigr)^{-1}$$

wherein A is the affinity matrix defined above and U is the diagonal matrix with U_{ii} = Σ_j A_{ij}. Since U is a diagonal matrix, it can be decomposed as T^⊤T, so the projection matrix P can be formalized as P = g(f(F_c) f(F_s)^⊤), where in the above linear transformation g(X) = MX and f(X) = X^⊤T. Even though P can be solved in closed form, the matrix inversion is very time-consuming, so an FTM network module is designed to fit the solution process: the f(·) map is fitted with three convolutional layers and the g(·) map with a fully connected layer.
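To make the closed-form computation concrete, the sketch below implements it directly (PyTorch assumed; the binary kNN affinity, the ε-regularized inverse, and all names are illustrative assumptions). In the patent's design, this exact computation is what the learned f(·) and g(·) modules are trained to approximate:

```python
import torch

def ftm_closed_form(f_c: torch.Tensor, f_s: torch.Tensor, k: int = 5, eps: float = 1e-5):
    """f_c, f_s: C x N feature maps flattened over the W*H spatial positions.
    Returns the semantically aligned content features F_cs = P F_c."""
    # k-nearest-neighbour affinity A between content and style positions
    d = torch.cdist(f_c.t(), f_s.t())               # N_c x N_s pairwise distances
    knn = d.topk(k, dim=1, largest=False).indices   # k closest style positions per content position
    A = torch.zeros_like(d)
    A.scatter_(1, knn, 1.0)                         # binary affinity (an assumption)
    U = torch.diag(A.sum(dim=1))                    # diagonal U with U_ii = sum_j A_ij (dense here)
    C = f_c.shape[0]
    # P = F_s A^T F_c^T (F_c U F_c^T + eps*I)^{-1}; eps added for numerical stability
    gram = f_c @ U @ f_c.t() + eps * torch.eye(C, device=f_c.device, dtype=f_c.dtype)
    P = f_s @ A.t() @ f_c.t() @ torch.linalg.inv(gram)
    return P @ f_c
```

As the text notes, the inversion makes this slow for large feature maps, which is exactly why the patent fits the process with three convolutional layers for f(·) and a fully connected layer for g(·).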
The content images used to train the self-encoder are preprocessed, with each image uniformly resized to 256 × 256 pixels. Each content image is fed into both the student self-encoder network and the teacher self-encoder network; each network comprises an encoding part, which encodes the image, and a decoding part, which reconstructs the input image from the feature codes produced by the encoder. Through the feature perceptual loss and the reconstruction loss, the knowledge-distillation training method shown in fig. 2 ensures that the distilled lightweight self-encoder network has the abilities of multi-level feature extraction and feature-based image reconstruction. The content image and the style image are then fed into the style migration network with the semantic-alignment feature migration module added, as shown in fig. 3; the intermediate feature migration module is trained (with the distilled lightweight self-encoder network fixed) based on the designed content loss Lc and style loss Ls.
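Since the text names the content loss Lc and style loss Ls without spelling them out, the sketch below shows one common perceptual formulation consistent with this training setup (a hedged assumption, not the patent's exact losses); features are taken from the fixed teacher VGG encoder:

```python
import torch.nn.functional as F

def ftm_training_losses(feats_out, feats_content, feats_style, style_weight=1.0):
    """Lc + Ls sketch. Each argument is a list of VGG feature maps
    (B, C, H, W) for the stylized output, the content frame, and the
    style image respectively."""
    # Content loss Lc: deep features of the output match the content frame
    l_c = F.mse_loss(feats_out[-1], feats_content[-1])
    # Style loss Ls: per-layer channel statistics match the style image
    # (mean/std matching is an AdaIN-style assumption)
    l_s = 0.0
    for o, s in zip(feats_out, feats_style):
        l_s = l_s + F.mse_loss(o.mean(dim=(2, 3)), s.mean(dim=(2, 3)))
        l_s = l_s + F.mse_loss(o.std(dim=(2, 3)), s.std(dim=(2, 3)))
    return l_c + style_weight * l_s
```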
In the testing stage, the video frames and the selected style image are fed directly into the trained lightweight style migration model, which automatically and efficiently outputs the stylized results, and a stable stylized video is finally synthesized in real time. Fig. 4 shows style migration results taken every 10 frames of one video; it can be seen that semantically aligned style migration is performed on both foreground and background, producing stable video frame results.

Claims (4)

1. A style migration method for automatically generating stylized video, characterized by comprising the following steps:
(1) constructing a video style migration network model comprising a high-compression self-encoder module based on knowledge distillation and a feature migration module based on semantic alignment; the self-encoder module comprises a lightweight encoder and a lightweight decoder;
(2) encoding content video frames and style images with the encoder: knowledge distillation is performed on the lightweight encoder with the VGG network as teacher, so that the encoder learns the encoding capability of the teacher VGG encoder while remaining sufficiently lightweight, and encodes original video content frames and style images into feature maps;
(3) fusing features with the semantic-alignment-based feature migration module: the content and style features produced by the encoder are fused to obtain semantically aligned fused migration features;
(4) performing knowledge distillation on the lightweight decoder with the VGG network as teacher: the decoder learns the decoding capability of the teacher VGG decoder while remaining sufficiently lightweight, decodes the fused migration features into stylized video frames, and finally synthesizes the video.
2. The style migration method for automatically generating stylized video according to claim 1, wherein step (2) is implemented by optimizing the following loss function:

$$\mathcal{L}_{\tilde{E}} = \gamma\,\| I - I' \|_2^2 + \lambda \sum_{k} \| E_k(I) - \tilde{E}_k(I) \|_2^2$$

wherein I is the original image, E is the encoder of the VGG network, Ẽ is the lightweight encoder, I′ is the reconstructed image, E_k(I) is the k-th layer output feature map of the original VGG encoder, Ẽ_k(I) is the k-th layer output feature map of the lightweight encoder, and λ and γ are both hyper-parameters.
3. The style migration method for automatically generating stylized video according to claim 1, wherein step (3) is implemented as follows:
the feature map of the content image output by the encoder is F_c ∈ R^{C×(W×H)} and that of the style image is F_s ∈ R^{C×(W×H)}, where C is the number of channels of the feature map and W and H are its width and height; the semantic-alignment-based feature migration module aims to find a transformation under which the content features of different video frames are semantically aligned with the style features; assuming the transformation can be parameterized as a projection matrix P ∈ R^{C×C}, the objective function to optimize is

$$\min_{P} \sum_{i,j} A_{ij}\,\bigl\| P F_c^{(i)} - F_s^{(j)} \bigr\|_2^2$$

wherein F_c^{(i)} denotes the feature vector at the i-th position of F_c, and A_{ij} is the k-nearest-neighbour affinity between F_c^{(i)} and F_s^{(j)};

solving for P gives

$$P = F_s A^{\top} F_c^{\top}\,\bigl(F_c U F_c^{\top}\bigr)^{-1}$$

wherein A is the affinity matrix defined above and U is the diagonal matrix with U_{ii} = Σ_j A_{ij}, a matrix with a feature-alignment function; the projection matrix P is formalized as P = g(f(F_c) f(F_s)^⊤), where in the linear transformation g(X) = MX and f(X) = X^⊤T; the f(·) map is fitted with three convolutional layers and the g(·) map with a fully connected layer.
4. The style migration method for automatically generating stylized video according to claim 1, wherein step (4) is implemented by optimizing the following loss function:

$$\mathcal{L}_{\tilde{D}} = \| I - I' \|_2^2 + \lambda \sum_{k} \| E_k(I) - E_k(I') \|_2^2$$

wherein I is the original image, E_k(I) is the k-th layer output feature map of the original VGG encoder, Ẽ_k(I) is the k-th layer output feature map of the lightweight encoder, I′ = D̃(Ẽ(I)) is the image reconstructed by the lightweight decoder D̃, and λ is a hyper-parameter; the distillation aims to make D̃ retain the information of the original E while reconstructing images well from the output features produced by Ẽ.
CN202110117964.4A 2021-01-28 2021-01-28 Style migration method for automatically generating stylized video Active CN112884636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110117964.4A CN112884636B (en) 2021-01-28 2021-01-28 Style migration method for automatically generating stylized video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110117964.4A CN112884636B (en) 2021-01-28 2021-01-28 Style migration method for automatically generating stylized video

Publications (2)

Publication Number | Publication Date
CN112884636A (en) | 2021-06-01
CN112884636B (en) | 2023-09-26

Family

ID=76052976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110117964.4A Active CN112884636B (en) 2021-01-28 2021-01-28 Style migration method for automatically generating stylized video

Country Status (1)

Country Link
CN (1) CN112884636B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236814A1 (en) * 2016-10-21 2019-08-01 Google Llc Stylizing input images
CN110706151A (en) * 2018-09-13 2020-01-17 南京大学 Video-oriented non-uniform style migration method
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
US20200167658A1 (en) * 2018-11-24 2020-05-28 Jessica Du System of Portable Real Time Neurofeedback Training
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110310221A (en) * 2019-06-14 2019-10-08 大连理工大学 A kind of multiple domain image Style Transfer method based on generation confrontation network
CN111325681A (en) * 2020-01-20 2020-06-23 南京邮电大学 Image style migration method combining meta-learning mechanism and feature fusion
CN111932445A (en) * 2020-07-27 2020-11-13 广州市百果园信息技术有限公司 Compression method for style migration network and style migration method, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BAO Yuxuan; LU Tianliang; DU Yanhui: "Survey of Deepfake Video Detection Technology" (深度伪造视频检测技术综述), Computer Science (计算机科学), no. 09, pages 289-298 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989102A (en) * 2021-10-19 2022-01-28 复旦大学 Rapid style migration method with high shape-preserving property
CN114331827A (en) * 2022-03-07 2022-04-12 深圳市其域创新科技有限公司 Style migration method, device, equipment and storage medium
CN118283201A (en) * 2024-06-03 2024-07-02 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112884636B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN112884636A (en) Style migration method for automatically generating stylized video
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN111862294B (en) Hand-painted 3D building automatic coloring network device and method based on ArcGAN network
CN111242844B (en) Image processing method, device, server and storage medium
CN112819833B (en) Large scene point cloud semantic segmentation method
CN113344188A (en) Lightweight neural network model based on channel attention module
WO2023151529A1 (en) Facial image processing method and related device
CN110930342A (en) Depth map super-resolution reconstruction network construction method based on color map guidance
CN111626968B (en) Pixel enhancement design method based on global information and local information
CN114332482A (en) Lightweight target detection method based on feature fusion
CN112381716A (en) Image enhancement method based on generation type countermeasure network
CN115829876A (en) Real degraded image blind restoration method based on cross attention mechanism
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
CN112837212B (en) Image arbitrary style migration method based on manifold alignment
CN117994447A (en) Auxiliary generation method and system for 3D image of vehicle model design oriented to sheet
Yu et al. Stacked generative adversarial networks for image compositing
CN112257464A (en) Machine translation decoding acceleration method based on small intelligent mobile device
CN116311455A (en) Expression recognition method based on improved Mobile-former
CN114118415B (en) Deep learning method of lightweight bottleneck attention mechanism
CN113706572B (en) End-to-end panoramic image segmentation method based on query vector
Wang et al. Boosting light field image super resolution learnt from single-image prior
CN110245677A (en) A kind of image descriptor dimension reduction method based on convolution self-encoding encoder
Lin Virtual reality and its application for producing TV programs
Zhang et al. Fusing Temporally Distributed Multi-Modal Semantic Clues for Video Question Answering
CN114513684B (en) Method for constructing video image quality enhancement model, video image quality enhancement method and device

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant