CN112884636B - Style migration method for automatically generating stylized video - Google Patents
Style migration method for automatically generating stylized video
- Publication number
- CN112884636B CN112884636B CN202110117964.4A CN202110117964A CN112884636B CN 112884636 B CN112884636 B CN 112884636B CN 202110117964 A CN202110117964 A CN 202110117964A CN 112884636 B CN112884636 B CN 112884636B
- Authority
- CN
- China
- Prior art keywords
- encoder
- migration
- style
- video
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000013508 migration Methods 0.000 title claims abstract description 80
- 230000005012 migration Effects 0.000 title claims abstract description 80
- 238000000034 method Methods 0.000 title claims abstract description 48
- 230000008569 process Effects 0.000 claims abstract description 17
- 238000013140 knowledge distillation Methods 0.000 claims abstract description 14
- 230000004927 fusion Effects 0.000 claims abstract description 6
- 238000007906 compression Methods 0.000 claims abstract description 5
- 239000011159 matrix material Substances 0.000 claims description 18
- 238000010586 diagram Methods 0.000 claims description 9
- 238000006243 chemical reaction Methods 0.000 claims description 6
- 238000004821 distillation Methods 0.000 claims description 4
- 238000005457 optimization Methods 0.000 claims description 3
- 230000006870 function Effects 0.000 description 9
- 230000003287 optical effect Effects 0.000 description 6
- 230000000694 effects Effects 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 230000009466 transformation Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000006835 compression Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 239000003795 chemical substances by application Substances 0.000 description 1
- 238000004883 computer application Methods 0.000 description 1
- 238000005034 decoration Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a style migration method for automatically generating stylized video. The method constructs a highly compressed self-encoder model based on knowledge distillation and a style migration model built around a semantically aligned feature migration module for automatically generating stylized video. The self-encoder is divided into an encoder and a decoder: the encoder encodes the original video content frames and the style image into feature maps; the feature migration module fuses the encoded content features and style features on the basis of semantic alignment to obtain semantically aligned fused migration features; and the decoder finally decodes these features into stylized video frames. The method guarantees the stability of the migrated video, can stylize any video in any style, runs the style migration process at very high speed, and is therefore highly practical.
Description
Technical Field
The invention belongs to the field of computer application, and particularly relates to a style migration method for automatically generating stylized videos.
Background
With the development and popularization of the internet and the mobile internet, more and more short-video platforms have emerged, and people's artistic demands on video have grown with them; yet having such content created by professional artists or professional editors is inconvenient and costly. Automatically generating video of any artistic style from ordinary video by computer technology is therefore attracting attention and favor.
Given a content map and a target style map, the purpose of style migration is to produce a stylized image that has both the structure of the content map and the texture of the style map. Style migration based on a single image has been studied extensively, and the field of video style migration is now attracting a great deal of attention because of its very wide application prospects (including artistic conversion of short videos); clearly, style migration of video is both more practical and more challenging than style migration of a single image.
Compared with traditional image style migration, video style migration is more difficult in that stylization quality, stability, and computational efficiency must all be considered simultaneously. Currently existing video style migration methods can be broadly divided into two categories depending on whether or not optical flow is used.
The first category uses optical flow: a temporal-consistency loss is proposed to achieve stability between adjacent frames through the supervised constraint of optical flow. It includes optimization-based optical-flow-constraint methods which, while producing stable migrated videos, need nearly three minutes to migrate the style of each video frame; this extremely slow migration rate is unacceptable. Video style migration methods based on feed-forward networks were proposed later, but because optical-flow constraints are still used in both the training and testing stages, real-time performance cannot be achieved in video migration tasks. To solve this problem, some methods use optical flow only during the training phase and avoid it during testing; although they are faster than the methods that also use optical flow at test time, their final migration effect is very unstable.
The second category does not use optical flow. For example, LST realizes a feature affine transformation so that stable stylized video can be obtained. Later work proposed combining an Avatar-Net-based decorator module with a component normalization method to guarantee video stability. However, the existing methods in this category all use the original VGG network to encode content and style features, and the VGG network is very bulky: a very large memory space is required to store the VGG model, which greatly limits their application on small terminal devices.
Disclosure of Invention
The invention aims to: the invention provides a style migration method for automatically generating stylized videos, which can realize real-time stable arbitrary video style migration.
The technical scheme is as follows: the invention provides a style migration method for automatically generating stylized video, which specifically comprises the following steps:
(1) Constructing a video style migration network model, wherein the model comprises a high-compression self-encoder module based on knowledge distillation and a feature migration module based on semantic alignment; the self-encoder module comprises a lightweight encoder and a lightweight decoder;
(2) The encoder encodes content video frames and style images: knowledge distillation is performed on a lightweight encoder based on a VGG network, enabling the encoder to learn the encoding capability of the teacher network's VGG encoder while being lightweight enough, and the original video content frames and the style image are encoded into feature maps;
(3) Feature migration module based on semantic alignment: fusing the content characteristics and style characteristics obtained by encoding the encoder to obtain fusion migration characteristics based on semantic alignment;
(4) Knowledge distillation is carried out on the lightweight decoder based on the VGG network: the decoder learns the decoding capability of the teacher network's VGG decoder while being light enough; the decoder decodes the fused migration features to obtain stylized video frames, and finally the video is synthesized.
Further, the implementation of step (2) requires optimizing the following loss function:

$$\mathcal{L}_{E} = \big\|I' - I\big\|_2^2 + \lambda \sum_k \big\|\tilde{E}_k(I) - E_k(I)\big\|_2^2 + \gamma \sum_k \big\|E_k(I') - E_k(I)\big\|_2^2$$

wherein I is the original image, the encoder of the VGG network is E, the lightweight encoder is Ẽ, I' is the reconstructed picture, E_k(I) is the feature map output by the k-th layer of the original VGG encoder, Ẽ_k(I) is the feature map output by the k-th layer of the lightweight encoder, and λ and γ are hyper-parameters.
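The loss above can be sketched in PyTorch. This is a hedged reading of the patent's terms: a pixel reconstruction loss, a λ-weighted feature-distillation term per layer, and a γ-weighted perceptual term comparing teacher features of the reconstruction to those of the original (the exact role of γ is an assumption; the patent only names the symbols I, I', E_k, Ẽ_k, λ, γ).

```python
import torch
import torch.nn.functional as F

def encoder_distill_loss(I, I_rec, feats_teacher, feats_student,
                         feats_teacher_rec, lam=1.0, gamma=1.0):
    # Pixel-level reconstruction of the original image by the light autoencoder
    loss = F.mse_loss(I_rec, I)
    # Feature distillation: match the student's k-th layer output to the teacher's
    for t, s in zip(feats_teacher, feats_student):
        loss = loss + lam * F.mse_loss(s, t)
    # Assumed perceptual term: teacher features of I' vs. teacher features of I
    for t, tr in zip(feats_teacher, feats_teacher_rec):
        loss = loss + gamma * F.mse_loss(tr, t)
    return loss
```

With identical inputs every term vanishes, which is a quick sanity check that each term is a pure matching loss.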
Further, the implementation process of the step (3) is as follows:
the characteristic diagram of the output of the content image obtained by the encoder is F c ∈R Cx(WxH) The output of the style image is F s ∈R Cx(WxH) Wherein C is the number of channels of the feature map, and W and H are the width and height of the feature map respectively; feature migration module based on semantic alignment aims at finding a feature migration which converts content graphs of different video frames to enable semantic alignment, and the conversion process is assumed to be parameterized into a projection matrix P epsilon R CxC The optimized objective function is:
wherein ,representing the slave F c An operation of selecting an ith position feature vector, A ij Representation-> and />K neighbor matrices of (a);
solving P is as follows:
wherein A is an affine matrix as defined above, U is a diagonal matrix, and is a matrix with feature alignment function,projection matrix P is formalized as p=g (F) c )f(F s ) T ) In the linear conversion process, g (x) =mx and f (x) =xt T The method comprises the steps of carrying out a first treatment on the surface of the The f (x) procedure chooses to fit with three convolutional layers, and the g () procedure uses a full join layer fit.
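As a concrete reference, the closed form can be computed directly. The sketch below assumes the objective min_P Σ_ij A_ij‖P F_c^(i) − F_s^(j)‖² with a binary k-NN affinity A and degree matrix U (both assumptions reconstructed from the definitions above; a small ridge term is added for invertibility):

```python
import numpy as np

def solve_projection(Fc, Fs, k=3, eps=1e-6):
    """Fc: C x N content features, Fs: C x M style features (columns = positions)."""
    C = Fc.shape[0]
    # Squared distances between every content column and every style column
    d = ((Fc[:, :, None] - Fs[:, None, :]) ** 2).sum(axis=0)   # N x M
    # Binary k-nearest-neighbour affinity A_ij
    A = np.zeros_like(d)
    for i, nn in enumerate(np.argsort(d, axis=1)[:, :k]):
        A[i, nn] = 1.0
    U = np.diag(A.sum(axis=1))                                 # diagonal degree matrix
    # Closed form from setting dL/dP = 0
    P = Fs @ A.T @ Fc.T @ np.linalg.inv(Fc @ U @ Fc.T + eps * np.eye(C))
    return P, A, U
```

The returned P satisfies the (regularized) normal equation P(F_c U F_c^T + εI) = F_s A^T F_c^T, which is exactly the stationarity condition of the objective.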
Further, the implementation of step (4) requires optimizing the following loss function:

$$\mathcal{L}_{D} = \big\|I' - I\big\|_2^2 + \lambda \sum_k \big\|E_k(I') - E_k(I)\big\|_2^2$$

wherein I is the original image, E_k(I) is the feature map output by the k-th layer of the original VGG encoder, Ẽ_k(I) is the feature map output by the k-th layer of the lightweight encoder, I' is the reconstructed picture obtained by decoding Ẽ(I) with the lightweight decoder D̃, and λ is a hyper-parameter. The distillation process aims to let D̃ preserve the information of the original E while reconstructing the image well from the output features of Ẽ.
Beneficial effects: compared with the prior art, the invention has the following beneficial effects: 1. temporal (sequence) consistency between adjacent frames is taken into account while achieving high stylization quality of the video frames, so the stability of the migrated video is guaranteed; 2. the stylization is richly diverse, and any video can be stylized in any style; 3. the style migration process is real-time, i.e. very fast, and the whole model is kept lightweight to ensure high practicality.
Drawings
FIG. 1 is a flow chart of the invention;
FIG. 2 is a schematic diagram of a high compression self-encoder module based on knowledge distillation in accordance with the present invention;
FIG. 3 is a schematic diagram of a video style migration network constructed in accordance with the present invention;
FIG. 4 is an exemplary diagram of a video style migration effect of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides a style migration method for automatically generating stylized video. The video style migration process goes through three stages: in the first stage, an encoder encodes the content video frames and the style map; in the second stage, feature style migration fusion is performed on the encoded content and style features; in the third stage, a decoder decodes the migrated and fused features to obtain the stylized video frames, and finally the video is synthesized. The model sizes of the encoder and decoder largely determine whether the model is lightweight, while the design of the feature migration part directly determines whether the stylized video obtained by migration is stable, whether real-time style migration can be completed, and whether the model can migrate arbitrary styles. As shown in fig. 1, the method specifically comprises the following steps:
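The three stages can be sketched end to end. The encoder, FTM module, and decoder below are stand-ins for the distilled lightweight networks described later; their names and interfaces are assumptions for illustration:

```python
import torch

def stylize_video(frames, style_image, encoder, ftm, decoder):
    """Three-stage pipeline: encode each content frame and the style image,
    fuse the features with semantic alignment, decode to stylized frames."""
    with torch.no_grad():
        Fs = encoder(style_image)              # style features, computed once per style
        stylized = []
        for frame in frames:
            Fc = encoder(frame)                # stage 1: encode the content frame
            Fcs = ftm(Fc, Fs)                  # stage 2: feature style migration fusion
            stylized.append(decoder(Fcs))      # stage 3: decode to a stylized frame
    return stylized                            # frames are then assembled into a video
```

Because the style features are encoded once and reused for every frame, the per-frame cost is one encoder pass, one FTM fusion, and one decoder pass.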
step 1: a video style migration network model is constructed, as shown in fig. 3, comprising a knowledge distillation based high compression self-encoder module and a semantic alignment based feature migration module.
The self-encoder is divided into an encoder and a decoder, the encoder can encode an original video content frame and a style image into a feature map, the feature migration module can fuse the content features and the style features obtained by the encoding of the encoder based on semantic alignment, finally, fusion migration features based on the semantic alignment are obtained, and finally, the stylized video frame is obtained through the decoder.
The feature style migration module (FTM) based on semantic alignment ensures stability between adjacent frames during video style migration; the video style migration model is only 2.67 MB in size, and video style migration can run at up to 166.67 fps.
Step 2: encoder encoding of content video frames and style sheets: and performing knowledge distillation on the lightweight encoder based on the VGG network, so that the encoder learns the encoding capability of the VGG encoder of the teacher network while being lightweight enough, and encoding the original video content frames and the style images into feature maps.
As shown in fig. 2, the lightweight encoder and decoder network architecture specifically includes: four symmetric sets of down-sampling (encoder) and up-sampling (decoder) convolutional layers, max-pooling layers, and ReLU activation functions; the lightweight encoder network performs feature encoding of input video frames as well as arbitrary style images. The VGG network is a network structure widely used in style migration, and the lightweight encoder network is a student network obtained by knowledge distillation based on the VGG teacher network, so that the encoding process of images can be realized with as few parameters as possible. As shown in the encoder part (Ẽ) of the network architecture in fig. 2, the encoder network needs to be lightweight enough while learning the encoding capability of the teacher network's VGG encoder; the loss function to be optimized is:

$$\mathcal{L}_{E} = \big\|I' - I\big\|_2^2 + \lambda \sum_k \big\|\tilde{E}_k(I) - E_k(I)\big\|_2^2 + \gamma \sum_k \big\|E_k(I') - E_k(I)\big\|_2^2$$

wherein I is the original image, the encoder of the original VGG network is E, the lightweight encoder is Ẽ, I' is the reconstructed picture obtained by the decoder, E_k(I) is the feature map output by the k-th layer of the original VGG encoder, Ẽ_k(I) is the feature map output by the k-th layer of the lightweight encoder, and λ and γ are hyper-parameters.
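A minimal sketch of the lightweight encoder follows. The patent specifies four symmetric convolutional blocks with max pooling and ReLU; the channel widths and kernel size here are assumptions:

```python
import torch
import torch.nn as nn

class LightEncoder(nn.Module):
    """Four conv blocks: 3x3 convolution -> ReLU -> 2x max-pool downsampling."""
    def __init__(self, widths=(3, 16, 32, 64, 128)):
        super().__init__()
        layers = []
        for cin, cout in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv2d(cin, cout, 3, padding=1),  # 3x3 convolution
                       nn.ReLU(inplace=True),               # ReLU activation
                       nn.MaxPool2d(2)]                     # down-sample by 2
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

The decoder would mirror this structure with up-sampling in place of max pooling, per the "symmetrical" description.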
Step 3: the light-weight decoder carries out knowledge distillation based on the VGG network, so that the decoder can learn the decoding capability of the VGG decoder of the teacher network while being light enough.
As in the network architecture of figure 2As shown in part, for a lightweight decoder network that performs feature decoding on migrated features, knowledge distillation is performed using a VGG network as a teacher network, and it is necessary to enable the decoder network to learn the decoding capability of the VGG decoder of the teacher network while being lightweight enough, where the loss function to be optimized is as follows:
wherein Is implemented by a lightweight decoder->The reconstructed picture obtained by decoding, the goal of the above distillation procedure is to let +.>While the information of the original E can be preserved, < >>Can well contain->And reconstructing the image by the obtained output characteristic information.
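The decoder distillation loss can be sketched analogously to the encoder's. This hedged reading uses a pixel reconstruction term plus λ-weighted matching of the teacher VGG features of the reconstruction against those of the original (the exact terms are an assumption; the patent names I, I', E_k, Ẽ_k, and λ):

```python
import torch
import torch.nn.functional as F

def decoder_distill_loss(I, I_rec, feats_teacher_I, feats_teacher_rec, lam=1.0):
    # Reconstruction of I by the light decoder from the light encoder's features
    loss = F.mse_loss(I_rec, I)
    # Keep the teacher-VGG features of I' close to those of I, so the decoded
    # image still carries the information that E extracts
    for t, r in zip(feats_teacher_I, feats_teacher_rec):
        loss = loss + lam * F.mse_loss(r, t)
    return loss
```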
Step 4: and the feature migration module based on semantic alignment fuses the content features and style features obtained by encoding of the encoder to obtain fusion migration features based on semantic alignment.
The feature migration module based on semantic alignment is the key to realizing real-time, stable video style migration: feature semantic alignment must be performed while style feature migration is completed efficiently. To achieve this, the idea of manifold alignment is employed. Assume the feature map output by the encoder for the content image is F_c ∈ R^{C×(W×H)} and the output of the lightweight coding network for the style image is F_s ∈ R^{C×(W×H)}, where C is the number of feature map channels and W and H are respectively the width and height of the feature map. The designed FTM module outputs the semantically aligned migrated feature F_cs and passes it to the decoder to obtain the migrated result map. In practice, the goal of the FTM module design is to find a transformation that enables semantically aligned feature migration of the content maps of different video frames. Assuming the transformation process can be parameterized as a projection matrix P ∈ R^{C×C}, the optimized objective function is:

$$\min_P \sum_{i,j} A_{ij}\big\|P F_c^{(i)} - F_s^{(j)}\big\|_2^2$$

wherein F_c^{(i)} denotes the operation of selecting the i-th position feature vector from F_c, and A_{ij} is the k-nearest-neighbour affinity matrix between F_c^{(i)} and F_s^{(j)}. The objective thus makes the transformed content features similar to their k-nearest-neighbour features in the style feature space. During video style migration there may be moving objects and illumination changes, which can cause jitter after migration; but because the above transformation preserves the affinity structure, two adjacent frames remain consistent, thereby generating stable video style migration results.
Solving the above equation is in effect computing the closed-form solution for P, which can be found by taking the derivative with respect to P and setting it to 0:

$$P = F_s A^{\top} F_c^{\top}\big(F_c U F_c^{\top}\big)^{-1}$$

wherein A is the affinity matrix defined above and U is a diagonal matrix with U_{ii} = Σ_j A_{ij}. Since U is diagonal and positive semi-definite, it can be decomposed as U = T^{\top}T, so the projection matrix P can be formed as P = g(F_c) f(F_s)^{\top}, where in the above linear conversion process g(x) = Mx and f(x) = xT^{\top}. Even though P can be solved in closed form, the process of matrix inversion is very time-consuming, so an FTM network module is designed to fit the above solving process: the f(·) process is fitted with three convolutional layers, and the g(·) process is fitted with a fully connected layer.
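A sketch of such an FTM network module follows: f(·) is approximated by three convolutional layers (1×1 here), g(·) by a single fully connected layer, and the fitted projection acts as P = g(F_c)f(F_s)^⊤. Channel width, kernel size, and the normalization by HW are assumptions:

```python
import torch
import torch.nn as nn

class FTM(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        # f(.) fitted with three convolutional layers
        self.f = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                               nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                               nn.Conv2d(c, c, 1))
        # g(.) fitted with a fully connected layer: g(x) = Mx per feature vector
        self.g = nn.Linear(c, c, bias=False)

    def forward(self, Fc, Fs):
        b, c, h, w = Fc.shape
        gc = self.g(Fc.flatten(2).transpose(1, 2)).transpose(1, 2)  # B x C x HW
        fs = self.f(Fs).flatten(2)                                   # B x C x HW
        P = gc @ fs.transpose(1, 2) / fs.shape[-1]                   # B x C x C projection
        Fcs = (P @ Fc.flatten(2)).view(b, c, h, w)                   # project content feats
        return Fcs
```

Since only matrix products are involved (no inversion), this module avoids the costly closed-form solve at test time, which is the stated motivation for fitting it.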
The content images used for training the self-encoder are preprocessed by uniformly resizing them to 256×256 pixels. Each content image is fed into both the student self-encoder network and the teacher self-encoder network; the student self-encoder comprises an encoding part, which encodes the image, and a decoding part, which reconstructs the input image from the feature codes produced by the encoder. Through the feature perception loss and the reconstruction loss, as shown in fig. 2, the knowledge-distillation-based training ensures that the distilled lightweight self-encoder network retains the capabilities of multi-level feature extraction and feature-based image reconstruction. The content image and the style image are then fed into the style migration network with the semantic-alignment feature migration module added, as shown in fig. 3; the middle feature migration module is trained (with the distilled lightweight self-encoder kept fixed) based on the designed content loss Lc and style loss Ls.
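The only preprocessing the text specifies is the resize to 256×256; a minimal sketch, assuming a B×C×H×W tensor input and bilinear interpolation (the interpolation mode is an assumption):

```python
import torch
import torch.nn.functional as F

def preprocess(img):
    # Uniformly resize any B x C x H x W image tensor to the 256x256 training size
    return F.interpolate(img, size=(256, 256), mode='bilinear', align_corners=False)
```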
In the test stage, the video frames and the selected style image are fed directly into the trained lightweight style migration model, which automatically and efficiently outputs the stylized results; finally, a stable stylized video is synthesized in real time. Fig. 4 shows style migration results sampled every 10 frames of one video: for both foreground and background, semantically aligned style migration produces stable video frame results.
Claims (3)
1. A style migration method for automatically generating stylized video is characterized by comprising the following steps:
(1) Constructing a video style migration network model, wherein the model comprises a high-compression self-encoder module based on knowledge distillation and a feature migration module based on semantic alignment; the self-encoder module comprises a lightweight encoder and a lightweight decoder;
(2) The encoder encodes content video frames and style images: knowledge distillation is performed on a lightweight encoder based on a VGG network, enabling the encoder to learn the encoding capability of the teacher network's VGG encoder while being lightweight enough, and the original video content frames and the style image are encoded into feature maps;
(3) Feature migration module based on semantic alignment: fusing the content characteristics and style characteristics obtained by encoding the encoder to obtain fusion migration characteristics based on semantic alignment;
(4) Knowledge distillation is carried out on the lightweight decoder based on the VGG network: the decoder learns the decoding capability of the teacher network's VGG decoder while being light enough; the decoder decodes the fused migration features to obtain stylized video frames, and finally the video is synthesized;
the implementation process of the step (3) is as follows:
the feature map output by the encoder for the content image is F_c ∈ R^{C×(W×H)} and for the style image F_s ∈ R^{C×(W×H)}, wherein C is the number of channels of the feature map and W and H are respectively its width and height; the feature migration module based on semantic alignment aims at finding a transformation of the content features of different video frames that achieves semantic alignment; assuming the transformation process is parameterized as a projection matrix P ∈ R^{C×C}, the optimized objective function is:

$$\min_P \sum_{i,j} A_{ij}\big\|P F_c^{(i)} - F_s^{(j)}\big\|_2^2$$

wherein F_c^{(i)} denotes the operation of selecting the i-th position feature vector from F_c, and A_{ij} is the k-nearest-neighbour affinity matrix between F_c^{(i)} and F_s^{(j)};
solving P is as follows: taking the derivative with respect to P and setting it to 0 yields the closed-form solution

$$P = F_s A^{\top} F_c^{\top}\big(F_c U F_c^{\top}\big)^{-1}$$

wherein A is the affinity matrix defined above and U is a diagonal matrix with a feature-alignment function, U_{ii} = Σ_j A_{ij}; decomposing U = T^{\top}T, the projection matrix P is formalized as P = g(F_c) f(F_s)^{\top}, where in the linear conversion process g(x) = Mx and f(x) = xT^{\top}; the f(·) process is fitted with three convolutional layers, and the g(·) process is fitted with a fully connected layer.
2. The style migration method for automatically generating stylized video of claim 1, wherein the implementation of step (2) requires optimizing the following loss function:

$$\mathcal{L}_{E} = \big\|I' - I\big\|_2^2 + \lambda \sum_k \big\|\tilde{E}_k(I) - E_k(I)\big\|_2^2 + \gamma \sum_k \big\|E_k(I') - E_k(I)\big\|_2^2$$

wherein I is the original image, the encoder of the VGG network is E, the lightweight encoder is Ẽ, I' is the reconstructed picture, E_k(I) is the feature map output by the k-th layer of the original VGG encoder, Ẽ_k(I) is the feature map output by the k-th layer of the lightweight encoder, and λ and γ are hyper-parameters.
3. The style migration method for automatically generating stylized video of claim 1, wherein the implementation of step (4) requires optimizing the following loss function:

$$\mathcal{L}_{D} = \big\|I' - I\big\|_2^2 + \lambda \sum_k \big\|E_k(I') - E_k(I)\big\|_2^2$$

wherein I is the original image, E_k(I) is the feature map output by the k-th layer of the original VGG encoder, Ẽ_k(I) is the feature map output by the k-th layer of the lightweight encoder, I' is the reconstructed picture obtained by decoding Ẽ(I) with the lightweight decoder D̃, and λ is a hyper-parameter; the distillation process aims to let D̃ preserve the information of the original E while reconstructing the image well from the output features of Ẽ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110117964.4A CN112884636B (en) | 2021-01-28 | 2021-01-28 | Style migration method for automatically generating stylized video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110117964.4A CN112884636B (en) | 2021-01-28 | 2021-01-28 | Style migration method for automatically generating stylized video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112884636A CN112884636A (en) | 2021-06-01 |
CN112884636B true CN112884636B (en) | 2023-09-26 |
Family
ID=76052976
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110117964.4A Active CN112884636B (en) | 2021-01-28 | 2021-01-28 | Style migration method for automatically generating stylized video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112884636B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113989102B (en) * | 2021-10-19 | 2023-01-06 | 复旦大学 | Rapid style migration method with high shape-preserving property |
CN114331827B (en) * | 2022-03-07 | 2022-06-07 | 深圳市其域创新科技有限公司 | Style migration method, device, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175951A (en) * | 2019-05-16 | 2019-08-27 | 西安电子科技大学 | Video Style Transfer method based on time domain consistency constraint |
CN110310221A (en) * | 2019-06-14 | 2019-10-08 | 大连理工大学 | A kind of multiple domain image Style Transfer method based on generation confrontation network |
CN110706151A (en) * | 2018-09-13 | 2020-01-17 | 南京大学 | Video-oriented non-uniform style migration method |
CN111325681A (en) * | 2020-01-20 | 2020-06-23 | 南京邮电大学 | Image style migration method combining meta-learning mechanism and feature fusion |
CN111932445A (en) * | 2020-07-27 | 2020-11-13 | 广州市百果园信息技术有限公司 | Compression method for style migration network and style migration method, device and system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116823593A (en) * | 2016-10-21 | 2023-09-29 | 谷歌有限责任公司 | Stylized input image |
US10748324B2 (en) * | 2018-11-08 | 2020-08-18 | Adobe Inc. | Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering |
US20200167658A1 (en) * | 2018-11-24 | 2020-05-28 | Jessica Du | System of Portable Real Time Neurofeedback Training |
-
2021
- 2021-01-28 CN CN202110117964.4A patent/CN112884636B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110706151A (en) * | 2018-09-13 | 2020-01-17 | 南京大学 | Video-oriented non-uniform style migration method |
CN110175951A (en) * | 2019-05-16 | 2019-08-27 | 西安电子科技大学 | Video Style Transfer method based on time domain consistency constraint |
CN110310221A (en) * | 2019-06-14 | 2019-10-08 | 大连理工大学 | A kind of multiple domain image Style Transfer method based on generation confrontation network |
CN111325681A (en) * | 2020-01-20 | 2020-06-23 | 南京邮电大学 | Image style migration method combining meta-learning mechanism and feature fusion |
CN111932445A (en) * | 2020-07-27 | 2020-11-13 | 广州市百果园信息技术有限公司 | Compression method for style migration network and style migration method, device and system |
Non-Patent Citations (1)
Title |
---|
A Survey of Deepfake Video Detection Techniques; Bao Yuxuan; Lu Tianliang; Du Yanhui; Computer Science (No. 9); 289-298 *
Also Published As
Publication number | Publication date |
---|---|
CN112884636A (en) | 2021-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112884636B (en) | Style migration method for automatically generating stylized video | |
CN110533044B (en) | Domain adaptive image semantic segmentation method based on GAN | |
CN111862294B (en) | Hand-painted 3D building automatic coloring network device and method based on ArcGAN network | |
CN110072119B (en) | Content-aware video self-adaptive transmission method based on deep learning network | |
CN107480206A (en) | A kind of picture material answering method based on multi-modal low-rank bilinearity pond | |
CN113034380A (en) | Video space-time super-resolution method and device based on improved deformable convolution correction | |
CN115829876A (en) | Real degraded image blind restoration method based on cross attention mechanism | |
CN108924528B (en) | Binocular stylized real-time rendering method based on deep learning | |
CN114841859A (en) | Single-image super-resolution reconstruction method based on lightweight neural network and Transformer | |
CN112381716A (en) | Image enhancement method based on generation type countermeasure network | |
CN112837212B (en) | Image arbitrary style migration method based on manifold alignment | |
CN113052764A (en) | Video sequence super-resolution reconstruction method based on residual connection | |
CN116962657B (en) | Color video generation method, device, electronic equipment and storage medium | |
Sun et al. | ESinGAN: Enhanced single-image GAN using pixel attention mechanism for image super-resolution | |
CN117237190A (en) | Lightweight image super-resolution reconstruction system and method for edge mobile equipment | |
CN116091319A (en) | Image super-resolution reconstruction method and system based on long-distance context dependence | |
CN113436094B (en) | Gray level image automatic coloring method based on multi-view attention mechanism | |
CN116109510A (en) | Face image restoration method based on structure and texture dual generation | |
CN113780209B (en) | Attention mechanism-based human face attribute editing method | |
CN113393377B (en) | Single-frame image super-resolution method based on video coding | |
Bai et al. | Itstyler: Image-optimized text-based style transfer | |
Wang et al. | Image quality enhancement using hybrid attention networks | |
Zhang et al. | Deep Learning Technology in Film and Television Post-Production | |
CN114513684B (en) | Method for constructing video image quality enhancement model, video image quality enhancement method and device | |
CN116823973B (en) | Black-white video coloring method, black-white video coloring device and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||