CN115641256B - Training method of style migration model, video style migration method and device

Training method of style migration model, video style migration method and device

Info

Publication number
CN115641256B
CN115641256B (application number CN202211310125.5A)
Authority
CN
China
Prior art keywords
style
feature
frames
images
image
Prior art date
Legal status
Active
Application number
CN202211310125.5A
Other languages
Chinese (zh)
Other versions
CN115641256A (en)
Inventor
戈维峰
邱晓文
何博安
徐瑞泽
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202211310125.5A
Publication of CN115641256A
Application granted
Publication of CN115641256B
Active
Anticipated expiration

Landscapes

  • Image Processing (AREA)

Abstract

The application relates to the field of artificial intelligence and discloses a training method of a style migration model, a video style migration method and a device. The training method comprises: obtaining training data, wherein the training data comprises N frames of sample content images and N sample style images; performing image style migration on the N frames of sample content images and the N sample style images through a style removal model to obtain N frames of first real images; performing image style migration on the N frames of sample content images and the N frames of first real images through a style recovery model to obtain N frames of second real images; and determining a first parameter of the style removal model and a second parameter of the style recovery model according to a first loss function between the N frames of sample content images and the N frames of first real images and a second loss function between the N frames of sample content images and the N frames of second real images. The invention can efficiently migrate the style of a style image to video frames without introducing distortion or artifacts and without causing incoherence between preceding and following frames.

Description

Training method of style migration model, video style migration method and device
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a training method of style migration models, a method, an apparatus, a medium, an electronic device, and a computer program product for video style migration in the field of computer vision.
Background
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Image rendering tasks such as image style migration have wide application scenarios on terminal devices. With the rapid improvement of the performance of terminal devices and of network performance, the entertainment requirements on terminal devices are gradually shifting from the image level to the video level, that is, from image style migration of single images to image style migration of videos.
In recent years, video style migration methods have been successful in migrating an artistic style of a picture to a video input. However, if it is desired that the migrated video be as true as the video captured by the camera, then there is a need to address the oil painting-like distortion and incoherence artifacts that may be introduced by the art style migration.
The rapid development of photographic equipment has made video a dominant information carrier. People often post video in different color styles on social media to share their daily lives, express different emotions, and increase exposure. Thus, in many mobile devices, real image style migration or automatic color migration is a popular function. Unlike art style migration, real image style migration or automatic color migration requires that the output image be guaranteed to be a "real image" while changing the color style of the input video with the color style of one or more reference images.
Real image style migration means that the migration result should look like a real image taken with a camera, without any distortion or unrealistic artifacts. Some currently popular real image style migration algorithms are, for example, DeepPhoto, PhotoWCT, WCT2 and PhotoNAS. However, some distortion or artifacts still exist in the migration results of these algorithms.
Disclosure of Invention
The embodiment of the application provides a training method of a style migration model, a video style migration method, a device, a medium, electronic equipment and a computer program product.
In a first aspect, an embodiment of the present application provides a method for training a style migration model, which is used for an electronic device, and the method includes:
an obtaining step of obtaining training data, wherein the training data comprises N frames of sample content images and N sample style images;
a style removing step, namely performing image style migration on the N frames of sample content images and the N frames of sample style images through a style removing model to obtain N frames of first real images;
a style recovery step, namely performing image style migration on the N frames of sample content images and the N frames of first real images through a style recovery model to obtain N frames of second real images;
and a parameter determining step of determining a first parameter of the style removal model and a second parameter of the style restoration model according to a first loss function between the N frames of sample content images and the N frames of first real images and a second loss function between the N frames of sample content images and the N frames of second real images.
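For reference, the four steps above can be summarized in the following minimal training sketch. It assumes a PyTorch-style implementation; the names style_removal, style_recovery, content_loss and the tensor shapes are illustrative placeholders rather than elements of the claimed method.

```python
import torch

def train_step(style_removal, style_recovery, content_loss, optimizer,
               sample_content, sample_style, flows=None, lam=1.0):
    """One training step over N frames.

    sample_content: (N, C, H, W) sample content images
    sample_style:   (N, C, H, W) sample style images
    flows:          optional (N-1, 2, H, W) optical flow maps (see the implementation below)
    """
    # Style removal step: obtain N frames of first real images
    first_real = style_removal(sample_content, sample_style)
    # Style recovery step: obtain N frames of second real images
    second_real = style_recovery(sample_content, first_real, flows)
    # Parameter determination step: first loss plus second loss
    loss = content_loss(first_real, sample_content) \
         + lam * content_loss(second_real, sample_content)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```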
In a possible implementation of the first aspect, in the style recovery step, N-1 optical flow graphs are generated from two consecutive frames of the sample content image, and the N-1 optical flow graphs are input into the style recovery model,
wherein the optical flow map includes optical flow information.
In a possible implementation of the first aspect, the style removal model includes a first feature extraction module, a first feature conversion module, and a first decoding module,
wherein the style removal step further comprises:
a first feature extraction step of inputting the N frames of sample content images and the N sample style images into the first feature extraction module to obtain N sample content feature sets and N sample style feature sets;
a first feature conversion step of inputting the N sample content feature sets and the N sample style feature sets into the first feature conversion module to obtain N first synthesized feature sets;
and a first decoding step, namely inputting the N first synthesized feature sets into the first decoding module to obtain the N frames of first real images.
In a possible implementation of the first aspect, the style recovery model includes a second feature extraction module, a second feature transformation module, a neural network model, and a second decoding module,
Wherein the style restoration step further comprises:
a second feature extraction step of inputting the N frames of first real images into the second feature extraction module to obtain N real image feature sets;
a second feature conversion step of inputting the N real image feature sets and the N sample content feature sets into the second feature conversion module to obtain N second synthesized feature sets;
generating the N-1 optical flow diagrams according to the sample content images of two continuous frames;
a fusion step of inputting the N second synthesis feature sets and the N-1 optical flow diagrams into the neural network model to obtain N fusion synthesis feature sets;
and a second decoding step, namely inputting the N fusion synthesized feature sets into the second decoding module to obtain the N frames of second real images.
In a possible implementation of the first aspect, the first feature transformation module includes a first whitening module, a first convolution module, and a first stylization module,
the first whitening module performs whitening calculation on each input sample content feature set to obtain a first whitening result, the first convolution module performs convolution calculation on the first whitening result and the sample content feature set to obtain a first convolution result, and the first stylization module performs stylization calculation on the first convolution result to obtain the first synthesized feature set.
In a possible implementation of the first aspect, the second feature transformation module includes a second whitening module, a second convolution module, and a second stylization module,
the second whitening module performs whitening calculation on each input real image feature set to obtain a second whitening result, the second convolution module performs convolution calculation on the second whitening result and the sample content feature set to obtain a second convolution result, and the second stylization module performs stylization calculation on the second convolution result to obtain the second synthesized feature set.
In a possible implementation of the first aspect, the first synthesized feature set includes a plurality of first synthesized features; for each input first synthesized feature set, the first decoding module downsamples each first synthesized feature, fuses it with the next first synthesized feature and downsamples again, so as to obtain a downsampled synthesized feature, and then upsamples the downsampled synthesized feature together with each first synthesized feature in sequence, so as to obtain the N frames of first real images.
In a possible implementation of the first aspect, the fused synthesized feature set includes a plurality of fused synthesized features; for each input fused synthesized feature set, the second decoding module downsamples each fused synthesized feature, fuses it with the next fused synthesized feature and downsamples again, so as to obtain a downsampled fused synthesized feature, and then upsamples the downsampled fused synthesized feature together with each fused synthesized feature in sequence, so as to obtain the N frames of second real images.
In a possible implementation of the first aspect, the first loss function and the second loss function are content loss functions.
In a possible implementation of the first aspect, the N sample-style images may be the same or different.
In a second aspect, an embodiment of the present application provides a method for migrating a video style, including:
a step of acquiring a video to be processed, wherein the video to be processed comprises a plurality of frames of content images to be processed, and each frame of content image to be processed corresponds to one style image;
a style migration step of performing image style migration on each frame of to-be-processed content image and the corresponding style image by using a target style migration model, so as to obtain a video after style migration,
wherein the target style migration model is the style restoration model in which the second parameter has been determined, obtained according to the training method described in the first aspect.
In a possible implementation of the above second aspect, a plurality of the style images may be the same or different.
In a third aspect, an embodiment of the present application provides a training apparatus for a style migration model, where the apparatus includes:
the acquisition unit acquires training data, wherein the training data comprises N frames of sample content images and N sample style images;
the style removing unit is used for performing image style migration on the N frames of sample content images and the N frames of sample style images through a style removing model to obtain N frames of first real images;
the style recovery unit is used for carrying out image style migration on the N frames of sample content images and the N first real images through a style recovery model to obtain N frames of second real images;
and the parameter determining unit is used for determining a first parameter of the style removing model and a second parameter of the style recovering model according to a first loss function between the N frames of sample content images and the N frames of first real images and a second loss function between the N frames of sample content images and the N frames of second real images.
In a fourth aspect, an embodiment of the present application provides an apparatus for migrating a video style, where the apparatus includes:
a to-be-processed video acquiring unit, configured to acquire the video to be processed and a plurality of style images, wherein the video to be processed comprises a plurality of frames of content images to be processed, and each frame of content image to be processed corresponds to one style image;
a style migration unit for performing image style migration on each frame of the content image to be processed and the corresponding style image by using a target style migration model to obtain a video after style migration,
wherein the target style migration model is the style restoration model in which the second parameter has been determined, obtained by the training apparatus described in the third aspect.
In a possible implementation of the fourth aspect, a plurality of the style images may be the same or different.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon instructions that, when executed on a computer, cause the computer to perform the method described in the first or second aspect above.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; one or more memories; the one or more memories store one or more programs that, when executed by the one or more processors, cause the electronic device to perform the method described in the first or second aspect above.
In a seventh aspect, embodiments of the present application provide a computer program product comprising computer executable instructions that are executed by a processor to implement the method described in the first or second aspect.
The invention uses a training method of style removal followed by style recovery. Firstly, the style of the input video is covered by the style of a randomly selected picture, and then the style of the input video is recovered using a self-supervised method. Such a self-supervised training method is crucial for the model of the invention to generate video with visual impact. Therefore, the invention can efficiently migrate the style of a style image to video frames without causing distortion and artifacts. Moreover, style migration is carried out on consecutive frames and context information is transferred, so that incoherence between preceding and following frames does not occur.
Drawings
FIG. 1 illustrates a flow chart of a training method of a style migration model according to an embodiment of the present application;
FIG. 2 is a specific flow chart of the style removal step of FIG. 1;
FIG. 3 is a schematic diagram of a training process for a style migration model according to an embodiment of the present invention;
FIG. 4 is a detailed flow chart of the style restoration step of FIG. 1;
FIG. 5 is a schematic diagram of a feature conversion module;
FIG. 6 is a process diagram of decoding by a decoding module;
FIG. 7 is a flow chart of a method of video style migration according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a process of video style migration according to an embodiment of the present invention;
FIG. 9 is a block diagram of a training apparatus for a style migration model according to an embodiment of the present invention;
FIG. 10 is a block diagram of an apparatus for video style migration according to an embodiment of the present invention;
fig. 11 shows a block diagram of an electronic device, according to an embodiment of the application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application; it will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Since embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, related terms and concepts of the neural networks to which embodiments of the present application may relate are first described below.
1. Long-short term memory network
A Long Short-Term Memory network (LSTM) is a recurrent neural network. Compared with the hidden state of the original RNN, the LSTM adds a cell state. The LSTM has three inputs at time t: the cell state C_(t-1), the hidden state h_(t-1), and the input vector X_t at time t; the LSTM has two outputs at time t: the cell state C_t and the hidden state h_t.
2.ConvLSTM
ConvLSTM is a variant of the LSTM in which the input transforms and recurrent transforms are implemented by convolution; most of its parameters can be understood with reference to the LSTM.
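Purely as an illustration of the ConvLSTM concept, a minimal ConvLSTM cell sketch in PyTorch follows; the class and parameter names are generic assumptions and are not taken from the embodiments of the application.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose input and recurrent transforms are 2D convolutions."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, state):
        h_prev, c_prev = state                      # hidden state and cell state at t-1
        gates = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g                      # cell state at t
        h = o * torch.tanh(c)                       # hidden state at t
        return h, c
```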
3.VGGNet
VGGNet is a convolutional neural network developed jointly by the Visual Geometry Group of the University of Oxford and researchers from Google DeepMind. At present, the commonly used network structures mainly include ResNet (152 to 1000 layers), GoogLeNet (22 layers) and VGGNet (19 layers); most models are improvements based on these models, adopting new optimization algorithms, multi-model fusion, and so on. VGGNet is still often used to extract image features.
4. Convolutional neural network
The convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer, which can be regarded as a filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
5. Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really wanted, the predicted value of the current network can be compared with the really wanted target value, and the weight vector of each layer of the neural network is then updated according to the difference between the two (of course, an initialization process is usually carried out before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really wanted target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function), which is an important equation for measuring the difference between the predicted value and the target value. A higher output value (loss) of the loss function indicates a greater difference, so training the deep neural network becomes a process of minimizing this loss.
6. Image style migration
Image style migration refers to fusing the image content of a content image A with the image style of a style image B to produce a composite image C having both the content of A and the style of B.
Illustratively, performing image style migration on the content image 1 according to the style image 1, so as to obtain a composite image 1, wherein the composite image 1 comprises the content in the content image 1 and the style in the style image 1; similarly, the image style migration is performed on the content image 1 according to the style image 2, and a composite image 2 can be obtained, wherein the composite image 2 includes the content in the content image 1 and the style in the style image 2.
The style image can refer to a reference image for style migration, and the style in the image can comprise texture characteristics of the image and artistic expression forms of the image; for example, the art expression of the image may include the style of the image such as cartoon, oil painting, watercolor, ink, etc; the content image may refer to an image requiring style migration, and the content in the image may refer to semantic information in the image, that is, may include high-frequency information, low-frequency information, and the like in the content image.
7. Optical flow information
The optical flow (optical flow or optic flow) is used to represent the instantaneous velocity of the pixel motion of a spatially moving object in an observation imaging plane, and is one method of using the pixel's change in the time domain in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thus calculate the motion information of the object between the adjacent frames.
The training method of the style migration model provided by the embodiment of the application is described in detail below. FIG. 1 illustrates a flow chart of a method of training a style migration model according to an embodiment of the present application.
In the acquiring step S11, training data including N frames of sample content images and N sample style images is acquired. N is an integer greater than or equal to 2, in this embodiment, n=5, for example.
Illustratively, the N-frame sample content image may refer to N-frame consecutive sample content images included in the sample video.
It should be understood that, for the style migration of a single frame image, only the content in the content image and the style in the style image need be considered; however, for a video, since the video comprises a plurality of consecutive video images, the style migration of the video needs to consider not only the stylization effect of each image but also the stability across the frames; the smoothness of the video after style migration needs to be ensured, and noise such as flicker and artifacts needs to be avoided.
The N-frame sample content image refers to an image adjacent to N frames in the sample video.
In this embodiment, 5 consecutive frames I_C1, I_C2, I_C3, I_C4, I_C5 of a sample video are used as the sample content images, and 5 images I_S1, I_S2, I_S3, I_S4, I_S5 randomly selected from a dataset are used as the sample style images. It will be appreciated that one sample style image is used for one sample content image, resulting in sample content image / sample style image pairs (I_Ct, I_St), t = 1, 2, 3, 4, 5, as shown in fig. 3.
It is understood that the N sample style images may be the same or different.
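As a non-authoritative illustration of how such training pairs might be assembled, the following sketch pairs N consecutive video frames with N randomly chosen style images; the dataset objects and the function name are placeholders.

```python
import random

def sample_training_pair(video_frames, style_dataset, n=5):
    """Pick N consecutive frames and N random style images, paired one-to-one."""
    start = random.randint(0, len(video_frames) - n)
    content = [video_frames[start + t] for t in range(n)]        # I_C1 .. I_C5
    styles = [random.choice(style_dataset) for _ in range(n)]    # I_S1 .. I_S5
    return list(zip(content, styles))                            # (I_Ct, I_St) pairs
```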
FIG. 3 is a schematic diagram of a training process for a style migration model according to an embodiment of the present invention.
In the style removal step S12, the 5 frames of sample content images and the 5 sample style images are subjected to image style migration by the style removal model 31 shown in fig. 3, so as to obtain 5 frames of first real images.
By way of example, the 5 frames of sample content images included in a sample video and the 5 sample style images may be input into the style removal model 31.
For example, the 5 sample content images may be input into the style removal model 31 one by one in chronological order, and the 5 sample style images may likewise be input into the style removal model 31 one by one in chronological order. That is, at time t, 1 frame of sample content image I_Ct and the corresponding sample style image I_St are input into the style removal model 31. For example, at time t1, the 1 frame of sample content image I_C1 and the corresponding sample style image I_S1 are input into the style removal model 31.
The style removal model 31 may perform image style migration processing on the one frame of sample content image according to the corresponding sample style image, so as to obtain one frame of first real image corresponding to that frame of sample content image. After the above procedure has been performed 5 times, the 5 frames of first real images corresponding to the 5 frames of sample content images, i.e., I_G1, I_G2, I_G3, I_G4, I_G5 shown in fig. 3, can be obtained.
In the style restoration step S13, the 5 frames of sample content images and the 5 frames of first real images are subjected to image style migration through the style restoration model 32, so as to obtain 5 frames of second real images.
Illustratively, 5 frames of sample content images and 5 frames of first real images may be input into the style recovery model 32.
For example, the 5 frames of sample content images may be input into the style recovery model 32 one frame at a time in chronological order, and the 5 frames of first real images may likewise be input into the style recovery model 32 one frame at a time in chronological order. That is, at time t, 1 frame of sample content image I_Ct and the corresponding 1 frame of first real image I_Gt are input into the style recovery model 32. For example, at time t1, one frame of sample content image I_C1 and the corresponding first real image I_G1 are input into the style recovery model 32.
The style recovery model 32 may perform image style migration processing on the corresponding one frame of first real image I_G1 according to the one frame of sample content image, so as to obtain one frame of second real image corresponding to the first real image I_G1. After the above procedure has been executed 5 times, the 5 frames of second real images corresponding to the 5 frames of first real images, i.e., I'_G1, I'_G2, I'_G3, I'_G4, I'_G5 shown in fig. 3, can be obtained. It will be appreciated that the 5 frames of second real images also correspond to the 5 frames of sample content images, respectively.
In the parameter determination step S14, the first parameter of the style removal model 31 and the second parameter of the style restoration model 32 are determined from the first loss function (content loss 1) between the 5-frame sample content image and the 5-frame first real image, and the second loss function (content loss 2) between the 5-frame sample content image and the 5-frame second real image.
Illustratively, a first loss function (content loss 1) shown in fig. 3 is formed between the 5-frame sample content image and the 5-frame first real image, a second loss function (content loss 2) shown in fig. 3 is formed between the 5-frame sample content image and the 5-frame second real image, and the first parameters of the style removal model 31 and the second parameters of the style restoration model 32 are determined according to the first loss function (content loss 1) and the second loss function (content loss 2).
It should be understood that the first parameter includes one or more parameters and the second parameter also includes one or more parameters.
Fig. 2 is a specific flowchart of the style removal step S12 in fig. 1, and the style removal step S12 is described in detail below in conjunction with fig. 2.
As shown in fig. 3, the style removal model 31 includes a first feature extraction module 311, a first feature conversion module 312 (DecoupledIN 1), and a first decoding module 313 (Decoder 1).
In the first feature extraction step S121, 5 frames of sample content images and 5 sample style images are input to the first feature extraction module 311, resulting in 5 sample content feature sets and 5 sample style feature sets.
The following describes the process in detail, taking the sample content image I_C1 and the sample style image I_S1 as an example.
The sample content image I_C1 and the sample style image I_S1 are input into the first feature extraction module 311; the first feature extraction module 311 is, for example, VGGNet. VGGNet 311 extracts the sample content feature set f_C1 from the sample content image I_C1 and extracts the sample style feature set f_S1 from the sample style image I_S1. Both f_C1 and f_S1 comprise a plurality of feature maps of different sizes, for example 4 feature maps of different sizes, extracted at 4 different scales.
Similar feature extraction is performed for the other 4 sample content images and the other 4 sample style images, resulting in 5 sample content feature sets and 5 sample style feature sets, i.e., f_Ct and f_St, where t = 1, 2, 3, 4, 5.
It will be appreciated that, given an input image pair (I_Ct, I_St), a VGG-19 network \phi_{vgg} with fixed parameters extracts the feature maps of the "conv1_1", "conv2_1", "conv3_1" and "conv4_1" layers (see fig. 6), as shown in the following expression (1):

f_{Ct} = \{f_{Ct,i}\}_{i=1}^{4} = \phi_{vgg}(I_{Ct}), \quad f_{St} = \{f_{St,i}\}_{i=1}^{4} = \phi_{vgg}(I_{St})    (1)

where f_{Ct} and f_{St} are the feature sets of the input images. The use of multi-scale features is important for generating good migration effects and image realism.
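The multi-scale feature extraction of expression (1) can be sketched as follows, assuming PyTorch with a recent torchvision and its pretrained VGG-19; the layer slices correspond to conv1_1, conv2_1, conv3_1 and conv4_1, and the class name is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class MultiScaleVGG(nn.Module):
    """Extracts the conv1_1, conv2_1, conv3_1 and conv4_1 feature maps of VGG-19."""
    def __init__(self):
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():          # fixed parameters, as in expression (1)
            p.requires_grad_(False)
        self.slices = nn.ModuleList([
            features[:1],    # up to conv1_1
            features[1:6],   # up to conv2_1
            features[6:11],  # up to conv3_1
            features[11:20], # up to conv4_1
        ])

    def forward(self, x):
        feats = []
        for block in self.slices:
            x = block(x)
            feats.append(x)
        return feats          # [f_1, f_2, f_3, f_4], four scales
```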
In the first feature conversion step S122, 5 sample content feature sets and 5 sample style feature sets are input to the first feature conversion module 312, resulting in 5 first synthesized feature sets.
For example, f_C1 and f_S1 are input into the first feature conversion module 312, and the multi-scale feature maps pass through the four corresponding scale sub-modules in the first feature conversion module 312 (DecoupledIN1), which map their statistics so as to match the statistics of the style image; the first synthesized feature set g_1 is then output.
It will be appreciated that, in a similar manner, the first feature conversion module 312 may output 5 first synthesized feature sets g_1, g_2, g_3, g_4, g_5.
Fig. 5 is a schematic diagram of the structure of the feature conversion module. For example, the feature conversion module is the first feature conversion module 312. As shown in fig. 5, the first feature conversion module 312 includes a first whitening module 3121, a first convolution module 3122, and a first stylization module 3123.
As shown in fig. 5, the first convolution module 3122 includes 3×3 convolution layers whose number of convolution kernels is 2c (c is the number of channels of the input feature), and the first stylization module 3123 includes a module for performing mean and standard deviation calculation and a 3×3 convolution layer whose number of convolution kernels is c.
By way of example, the sample content feature set f_C1 is input as f_c into the first whitening module 3121, and the first whitening module 3121 performs whitening calculation on f_C1 to obtain a first whitening result; the first convolution module 3122 performs convolution calculation on the first whitening result and the sample content feature set f_C1 to obtain a first convolution result; and the first stylization module 3123 performs stylization calculation on the first convolution result, using the statistics of the sample style feature set f_S1 input as f_S, to obtain the first synthesized feature set g_1.
As shown in fig. 5, the present invention proposes the first feature conversion module 312 (DecoupledIN), which decomposes the feature conversion into feature whitening and stylization. The whitening module removes the style by a linear transformation, while the stylization module migrates the style and protects the image structure from damage. For example, at layer i, given a content input feature set f_{Ct,i} and a style input feature set f_{St,i}, the first whitening module 3121 first removes the style of f_{Ct,i} using the following expression (2) to obtain the first whitening result \bar{f}_{Ct,i}. Then the first whitening result \bar{f}_{Ct,i} and the sample content feature set f_{Ct,i} are fed into the 3×3 convolution layers of the first convolution module 3122, whose number of convolution kernels is 2c (c is the number of channels of the input feature), as shown in expression (3), to obtain the first convolution result \hat{f}_{t,i}. Then, as shown in expression (4), the first stylization module 3123 performs stylization calculation on the first convolution result \hat{f}_{t,i} to obtain the stylized feature g_{t,i}, whose number of channels is reduced back to c by another 3×3 convolution layer.

\bar{f}_{Ct,i} = (f_{Ct,i} - \mu(f_{Ct,i})) / \sigma(f_{Ct,i})    (2)

\hat{f}_{t,i} = Conv_{3\times 3}([\bar{f}_{Ct,i}, f_{Ct,i}])    (3)

g_{t,i} = \sigma(f_{St,i}) \otimes \hat{f}_{t,i} \oplus \mu(f_{St,i})    (4)

where \mu and \sigma represent the mean and standard deviation of the features calculated per channel.
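A rough, non-authoritative sketch of one scale of such a whitening, convolution and stylization transform is given below. It follows the description above (per-channel whitening, a 3×3 convolution with 2c kernels over the whitened and original content features, stylization with per-channel style statistics, and a final 3×3 convolution back to c channels), but the exact layer arrangement of the patented DecoupledIN module may differ.

```python
import torch
import torch.nn as nn

def channel_stats(x, eps=1e-5):
    """Per-channel mean and standard deviation over the spatial dimensions."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    return mu, sigma

class DecoupledINScale(nn.Module):
    """One scale of a whitening -> convolution -> stylization transform (illustrative only)."""
    def __init__(self, c):
        super().__init__()
        self.conv_mix = nn.Conv2d(2 * c, 2 * c, 3, padding=1)   # 3x3, 2c kernels
        self.conv_out = nn.Conv2d(2 * c, c, 3, padding=1)       # reduce back to c channels

    def forward(self, f_c, f_s):
        mu_c, sigma_c = channel_stats(f_c)
        whitened = (f_c - mu_c) / sigma_c                        # expression (2)
        mixed = self.conv_mix(torch.cat([whitened, f_c], dim=1)) # expression (3)
        mu_s, sigma_s = channel_stats(f_s)
        # expression (4): match the statistics of the style features
        stylized = sigma_s.repeat(1, 2, 1, 1) * mixed + mu_s.repeat(1, 2, 1, 1)
        return self.conv_out(stylized)
```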
Thus, the first feature conversion module 312 may output 5 first synthesized feature sets g_1, g_2, g_3, g_4, g_5.
Next, in the first decoding step S123, the 5 first synthesized feature sets g_1, g_2, g_3, g_4, g_5 are respectively input into the first decoding module 313 (Decoder1) to obtain the 5 frames of first real images I_G1, I_G2, I_G3, I_G4, I_G5.
It will be appreciated that the first decoding module 313 is configured to decode each converted first synthesized feature set into a real image, for example, to decode the first synthesized feature set g_1 into the first real image I_G1.
Each first synthesized feature set comprises a plurality of first synthesized features, e.g., 4 first synthesized features, e.g., g_1 = (g_1-1, g_1-2, g_1-3, g_1-4).
Fig. 6 is a process diagram of decoding by the decoding module; here the decoding module is, for example, the first decoding module 313. As shown in fig. 6, for an input first synthesized feature set g_1, the first decoding module 313 downsamples g_1-1, fuses the result with the following g_1-2, and then downsamples again. In a similar manner, after the four first synthesized features have been downsampled, a downsampled synthesized feature g'_1-4 is obtained. Next, the downsampled synthesized feature g'_1-4 is upsampled and then upsampled together with the first synthesized feature g_1-3. Continuing the upsampling in a similar manner, the result is finally convolved together with g_1-1 to output 1 frame of first real image I_G1.
It will be appreciated that upsampling refines the multi-scale features from low resolution to high resolution to produce the final output.
It can be understood that, in a similar manner, the first decoding module 313 can decode the other 4 frames of first real images I_G2, I_G3, I_G4, I_G5.
It will be appreciated that when the decoding module in fig. 6 is the first decoding module 313, what it receives are the 5 first synthesized feature sets g_1, g_2, g_3, g_4, g_5 output by the first feature conversion module 312, and there is no ConvLSTM in fig. 6.
It will be appreciated that "up-sampling" and "down-sampling" in fig. 6 represent the corresponding modules that perform "up-sampling" and "down-sampling".
Fig. 4 is a specific flowchart of the style restoration step S13 in fig. 1, and the style restoration step S13 is described in detail below with reference to fig. 4.
As shown in FIG. 3, the style recovery model 32 includes a second feature extraction module 321 (VGGNet), a second feature transformation module 322 (DecoupledIN 2), a neural network model 323 (convLSTM), and a second decoding module 324 (Decoder 2)
In the second feature extraction step S131, 5 frames of the first real image are input into the second feature extraction module 321, so as to obtain 5 real image feature sets.
The second feature extraction module 321 is, for example, VGGNet.
For example, the 1 frame of first real image I_G1 is input into VGGNet 321, and VGGNet 321 extracts the real image feature set f_G1. Similarly, VGGNet 321 can extract the other 4 real image feature sets f_G2, f_G3, f_G4, f_G5.
In the second feature conversion step S132, the 5 real image feature sets f_G1, f_G2, f_G3, f_G4, f_G5 and the 5 sample content feature sets f_C1, f_C2, f_C3, f_C4, f_C5 are respectively input into the second feature conversion module 322 (DecoupledIN2) to obtain 5 second synthesized feature sets.
It will be appreciated that the 5 sample content feature sets were extracted in step S121.
For example, the first real image feature set f_G1 and the sample content feature set f_C1 are input into the second feature conversion module 322 (DecoupledIN2), which outputs a second synthesized feature set g'_1.
It will be appreciated that, in a similar manner, the second feature conversion module 322 may output the other 4 second synthesized feature sets g'_2, g'_3, g'_4, g'_5.
The second feature conversion module 322 includes a second whitening module, a second convolution module, and a second stylization module. In this embodiment, the structure of the second feature conversion module 322 is as shown in fig. 5 and is the same as the structure of the first feature conversion module 312: the second whitening module is the same as the first whitening module 3121 in fig. 5, the second convolution module is the same as the first convolution module 3122, and the second stylization module is the same as the first stylization module 3123. Therefore, the second feature conversion module 322 will not be described repeatedly herein.
Illustratively, the real image feature set f_G1 is input as f_c into the second whitening module, and the second whitening module performs whitening calculation on f_G1 to obtain a second whitening result; the second convolution module performs convolution calculation on the second whitening result and the sample content feature set f_C1, which is input as f_S, to obtain a second convolution result; and the second stylization module performs stylization calculation on the second convolution result to obtain the second synthesized feature set g'_1. Similarly, the other 4 second synthesized feature sets g'_2, g'_3, g'_4, g'_5 may be output.
In the generating step S133, N-1 optical flow maps are generated from each two consecutive frames of the sample content images. Each optical flow map includes optical flow information.
Illustratively, two consecutive frames of sample content images are input into the optical flow model 33 shown in fig. 3; the optical flow model 33 is, for example, an existing FlowNet, and FlowNet 33 is used for calculating the optical flow.
For example, the sample content images I_C1 and I_C2 of the first frame and the second frame are input into FlowNet 33, and the optical flow map W_C1->C2 is obtained by calculation. Similarly, the other 3 optical flow maps W_C2->C3, W_C3->C4, W_C4->C5 can be obtained.
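As an illustration of how an optical flow map can be used to warp features from one frame towards the next, a standard backward-warping sketch using grid sampling follows; the function name and flow conventions are assumptions, not the patented warping module.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """Warp a feature map (B, C, H, W) with a flow field (B, 2, H, W) given in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # target coordinates
    # normalize to [-1, 1] for grid_sample, which expects (B, H, W, 2) in (x, y) order
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)
```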
In the fusion step S134, the 5 second synthesized feature sets and the 4 optical flow maps are input into the neural network model 323 to obtain 5 fused synthesized feature sets g''_1, g''_2, g''_3, g''_4, g''_5.
The neural network model 323 is, for example, an existing ConvLSTM.
In this embodiment, the feature conversion is guided by predicting the optical flow between the current frame and the next frame with FlowNet. Given two consecutive frames I_C(t-1) and I_Ct, FlowNet computes the optical flow map W_(t-1)->t. At timestamp t, ConvLSTM accepts three inputs: the current stylized feature (i.e., the second synthesized feature set) g'_t, the hidden state h_(t-1) of the same module at the previous timestamp, and the cell state c_(t-1) of the same module at the previous timestamp, and it generates (outputs) the hidden state h_t and the cell state c_t of the current timestamp. In addition, a 3×3 convolutional layer is used instead of the original linear layer.
For example, the second synthesized feature set g'_2, the hidden state h_1 output by ConvLSTM 323 at the previous time t1, and the cell state c_1 are input into ConvLSTM 323, and ConvLSTM 323 outputs the corresponding fused synthesized feature set g''_2. It will be appreciated that ConvLSTM 323 also outputs the cell state c_2 at the current time t2.
In this embodiment, for example, the fused synthesized feature set g''_1 output by ConvLSTM 323 at t1 is warped using the optical flow map W_C1->C2 and then used as the hidden state h_1 output at t1, so that it is matched with the features of the current frame (the second frame).
Illustratively, the fused synthesized feature set g''_1 and the optical flow map W_C1->C2 are input into the warping module 34 shown in fig. 3 to obtain h_1.
It will be appreciated that the 5 fused synthesized feature sets g''_1, g''_2, g''_3, g''_4, g''_5 may be obtained in a similar manner to that described above.
In addition, it can be appreciated that when the first frame sample content image I_C1 and the corresponding sample style image I_S1 are input, only g'_1 is input into ConvLSTM 323, both the hidden state h and the cell state c are 0, and no optical flow map is input.
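Putting the pieces together, a minimal, non-authoritative sketch of this fusion step follows; it reuses the ConvLSTMCell and warp_by_flow sketches given earlier, and the way the warped previous output is fed back as the hidden state is a simplification of the behaviour described above.

```python
import torch

def fuse_over_time(cell, second_feats, flows):
    """Fuse per-frame second synthesized features with ConvLSTM, guided by optical flow.

    second_feats: list of N tensors g'_t of shape (B, C, H, W)
    flows:        list of N-1 flow maps W_(t-1)->t of shape (B, 2, H, W)
    """
    fused = []
    # hidden and cell states start at 0 for the first frame
    # (assumes the ConvLSTM hidden channels equal the feature channels)
    h = torch.zeros_like(second_feats[0])
    c = torch.zeros_like(second_feats[0])
    for t, g in enumerate(second_feats):
        if t > 0:
            # align the previous output with the current frame before feeding it back
            h = warp_by_flow(fused[-1], flows[t - 1])
        h, c = cell(g, (h, c))       # ConvLSTM outputs the fused synthesized feature
        fused.append(h)
    return fused                     # g''_1 ... g''_N
```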
In the second decoding step S135, the 5 fused synthesized feature sets are input into the second decoding module 324 (Decoder2) to obtain the 5 frames of second real images, i.e., I'_G1, I'_G2, I'_G3, I'_G4, I'_G5 shown in fig. 3.
It will be appreciated that the second decoding module 324 is configured to decode each converted and fused set of fused synthesized features into a real image; for example, the fused synthesized feature set g''_1 is input into the second decoding module 324 (Decoder2) to obtain the corresponding 1 frame of second real image I'_G1.
Each set of fusion synthesized features includes a plurality of fusion synthesized features, e.g., 4 fusion synthesized features.
The decoding process performed by the second decoding module 324 is the same as that shown in fig. 6; what is input into the second decoding module 324 is the fused synthesized feature set output by ConvLSTM 323. For each input fused synthesized feature set, the second decoding module 324 downsamples each fused synthesized feature, fuses it with the next fused synthesized feature, and downsamples again. In a similar manner, after the four fused synthesized features have been downsampled, a downsampled fused synthesized feature is obtained. Then, the downsampled fused synthesized feature is upsampled and then upsampled together with each fused synthesized feature in sequence. After the four fused synthesized features have been upsampled in this way, a final convolution outputs 1 frame of second real image.
It will be appreciated that in a similar manner, the second decoding module 324 may decode to obtain the other 4 frame second real image.
It will be appreciated that the manner in which the second decoding module 324 decodes each fused set of composite features is the same as the manner in which the first decoding module 313 decodes each first set of composite features, and will not be repeated here.
The loss function used in the embodiments of the present invention is described below.
It is understood that loss functions include content loss functions, temporal consistency loss functions, perceptual loss functions, and the like. In this embodiment, the content loss function is used to constrain the generated result so that the structure of the stylized result is the same as that of the input image. Because the statistics of the features are converted to match the statistics of the target style, the style migration effect can be achieved by supervising the style removal model and the style recovery model with only the content loss function; the structure of the generated image is kept unchanged, and no artifacts are introduced by a perceptual loss or an adversarial loss.
Given a content image I_C and the stylized output I_G corresponding to it, the content loss is calculated using the features output by the conv4_1 layer of VGG-19. The style removal model and the style restoration model are two style migration networks that are trained end-to-end and do not share parameters. For a segment of video with N frames, the learning objective becomes:
L = \sum_{t=1}^{N} ( \|\phi_{vgg}(I_{Gt}) - \phi_{vgg}(I_{Ct})\|_2^2 + \lambda \|\phi_{vgg}(I'_{Gt}) - \phi_{vgg}(I_{Ct})\|_2^2 )    (5)

where I_{Gt} represents the first real image generated by the style removal model 31, I'_{Gt} represents the second real image generated by the style recovery model 32, I_{Ct} represents the sample content image, and \phi_{vgg} denotes the conv4_1 features of VGG-19. In this embodiment, \lambda = 1, for example. The first term is the first loss function between the N frames of sample content images and the N frames of first real images, and the second term is the second loss function between the N frames of sample content images and the N frames of second real images.
For example, the first and second loss functions are content loss functions.
Using the loss function shown in expression (5), a plurality of first parameters of the style removal model 31 and a plurality of second parameters of the style restoration model 32 can be determined.
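A compact sketch of the learning objective in expression (5) follows; it assumes the MultiScaleVGG sketch given earlier (using its conv4_1 output) and a squared L2 content loss, which may differ in detail from the embodiments.

```python
import torch

def content_loss(vgg, generated, target):
    """L2 distance between the conv4_1 features of the generated and target images."""
    return torch.mean((vgg(generated)[-1] - vgg(target)[-1]) ** 2)

def total_loss(vgg, first_real, second_real, content_frames, lam=1.0):
    """Expression (5): sum of the removal and recovery content losses over N frames."""
    loss = 0.0
    for i_g, i_g2, i_c in zip(first_real, second_real, content_frames):
        loss = loss + content_loss(vgg, i_g, i_c) + lam * content_loss(vgg, i_g2, i_c)
    return loss
```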
It is understood that the plurality of first parameters includes the parameters of DecoupledIN1 and the parameters of Decoder1, and the plurality of second parameters includes the parameters of DecoupledIN2, the parameters of Decoder2, and the parameters of ConvLSTM.
It will be appreciated that the trained style removal model 31 and style restoration model 32 are derived in the manner described above, i.e., the trained style removal model 31 includes a determined plurality of first parameters and the trained style restoration model 32 includes a determined plurality of second parameters.
In the present invention, for each training sample (video), there are, for example, 5 consecutive video frames and 5 style images. For each frame, it is first stylized by a style removal model 31 and then stylized by a style restoration model 32. In style restoration, features of different timestamps are linked by ConvLSTM.
It will be appreciated that the parameters of the style removal model 31 and the style restoration model 32 at different time stamps are shared, i.e. there is only one style removal model 31 and one style restoration model 32 in total.
Fig. 7 is a flow chart of a method of video style migration according to an embodiment of the present invention.
As shown in fig. 7, in the step S71 of acquiring a video to be processed, the video to be processed including a plurality of frames of content images to be processed, each frame of content image to be processed corresponding to one style image, and a plurality of style images are acquired.
FIG. 8 is a schematic diagram of a process of video style migration according to an embodiment of the present invention. As shown in fig. 8, the video to be processed includes, for example, N frames of content images to be processed I_C1, I_C2, I_C3, I_C4, ..., I_Cn. The plurality of style images are, for example, N style images I_S1, I_S2, I_S3, I_S4, ..., I_Sn.
In the style migration step S72, using the target style migration model 80 shown in fig. 8, each frame of the content image to be processed and the corresponding style image are subjected to image style migration, so as to obtain a video after style migration.
The target style migration model 80 is the style restoration model 32 with the second parameters determined according to the training method shown in fig. 1.
The processing performed by the target style migration model 80 on each frame of content image to be processed and the corresponding style image is the same as that of the style recovery model 32 and will not be repeated here.
It will be appreciated that the plurality of style images may be the same or different. Therefore, the invention can carry out any style migration on the video with any length according to the wish of the user.
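For completeness, a minimal inference sketch follows; style_recovery stands for the trained style recovery model 32 with the determined second parameters, and its call signature and state handling are assumptions.

```python
import torch

@torch.no_grad()
def stylize_video(style_recovery, frames, style_images):
    """Apply the trained style recovery model frame by frame with per-frame style images."""
    outputs, state = [], None
    for frame, style in zip(frames, style_images):   # each frame has its own style image
        stylized, state = style_recovery(frame, style, state)
        outputs.append(stylized)
    return outputs
```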
The invention can efficiently migrate the style of a style image to video frames without causing distortion and artifacts. Moreover, style migration is carried out on consecutive frames and context information is transferred, so that incoherence between preceding and following frames does not occur.
The invention can carry out video style migration while ensuring that the migrated video does not exhibit any oil-painting-like distortion or incoherence artifacts.
In the invention, the DecoupledIN module works together with ConvLSTM to realize a feature conversion that preserves the video content structure and the consistency between preceding and following frames. The DecoupledIN module decomposes style migration into linear feature whitening and stylization. At the same time, FlowNet predicts the positions of the pixels of the current frame in the next frame, and the context information is encoded via ConvLSTM. Through this feature conversion, the invention can generate migrated videos without obvious artifacts and with coherent frames.
The invention can continuously use multiple style images to perform style migration on the same section of input video. In addition, the model of the invention has high calculation efficiency and operates faster than most of the existing algorithms.
In real image style migration, the most important principle is to change the style of the input image without generating distortion and artifacts. At the same time, image realism means that the stylized result should appear to have been photographed by a camera. The DecoupledIN module used in the invention does not damage the image structure when performing the linear feature conversion. By comparing the styles of the content image and the style image, the invention can better balance the stylization effect and the image realism. The migration results generated by the present invention do not have exactly the same color distribution as the style image, because doing so would make the result look like a painting rather than a photograph. Such results can be generated because the model of the present invention is trained in a self-supervised manner: its learning target is a real image.
The invention also provides a training device of the style migration model. As shown in fig. 9, the apparatus 90 includes: an acquisition unit 91 that acquires training data including N frames of sample content images and N sample style images; the style removing unit 92 performs image style migration on the N frames of sample content images and the N frames of sample style images through a style removing model to obtain N frames of first real images; a style restoration unit 93, configured to perform image style migration on the N frame sample content images and the N first real images through a style restoration model, to obtain N frame second real images; the parameter determination unit 94 determines a first parameter of the style removal model and a second parameter of the style restoration model based on a first loss function between the N-frame sample content image and the N-frame first real image and a second loss function between the N-frame sample content image and the N-frame second real image.
It is understood that the acquiring unit 91, the style removing unit 92, the style restoring unit 93, and the parameter determining unit 94 may be implemented by the processor 1404 having functions of these modules or units in the electronic device 1400 in fig. 11.
The invention also provides a video style migration device. As shown in fig. 10, the apparatus 100 includes: a to-be-processed video acquiring unit 1001, configured to acquire the video to be processed and a plurality of style images, wherein the video to be processed comprises a plurality of frames of content images to be processed, and each frame of content image to be processed corresponds to one style image; and a style migration unit 1002, configured to perform image style migration on each frame of the content image to be processed and the corresponding style image by using the target style migration model to obtain a video after style migration,
the target style migration model is a style recovery model with the second parameters determined after training by the training device 90 shown in fig. 9.
It will be appreciated that the video unit 1001 to be processed, the style migration unit 1002 may be implemented by the processor 1404 having the functions of these modules or units in the electronic device 1400 in fig. 11.
The present invention also provides a computer readable storage medium having stored thereon instructions which when executed on a computer cause the computer to perform the method shown in fig. 1 or fig. 7.
The present invention also provides a computer program product comprising computer executable instructions that are executed by the processor 1404 to implement the method shown in fig. 1 or 7.
Referring now to fig. 11, fig. 11 schematically illustrates an example electronic device 1400 in accordance with an embodiment of the invention. In one embodiment, the system 1400 may include one or more processors 1404, system control logic 1408 coupled to at least one of the processors 1404, a system memory 1412 coupled to the system control logic 1408, a non-volatile memory (NVM) 1416 coupled to the system control logic 1408, and a network interface 1420 coupled to the system control logic 1408.
In some embodiments, the processor 1404 may include one or more single-core or multi-core processors. In some embodiments, the processor 1404 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where the system 1400 employs an eNB (enhanced Node B) 101 or a RAN (Radio Access Network) controller 102, the processor 1404 may be configured to perform the various disclosed embodiments, such as the embodiment shown in fig. 1.
In some embodiments, the system control logic 1408 may include any suitable interface controller to provide any suitable interface to at least one of the processors 1404 and/or any suitable device or component in communication with the system control logic 1408.
In some embodiments, the system control logic 1408 may include one or more memory controllers to provide an interface to the system memory 1412. The system memory 1412 may be used for loading and storing data and/or instructions. In some embodiments, the system memory 1412 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
The NVM/memory 1416 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/memory 1416 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one of an HDD (Hard Disk Drive), a CD (Compact Disc) drive, or a DVD (Digital Versatile Disc) drive.
The NVM/memory 1416 may include a portion of the storage resources of the device on which the system 1400 is installed, or it may be accessible by, but not necessarily a part of, the device. For example, the NVM/memory 1416 may be accessed over a network via the network interface 1420.
In particular, the system memory 1412 and NVM/storage 1416 may include: a temporary copy and a permanent copy of instructions 1424. The instructions 1424 may include: instructions that, when executed by at least one of the processors 1404, cause the electronic device 1400 to implement a method as shown in fig. 2. In some embodiments, instructions 1424, hardware, firmware, and/or software components thereof may additionally/alternatively be disposed in system control logic 1408, network interface 1420, and/or processor 1404.
The network interface 1420 may include a transceiver to provide a radio interface for the system 1400 to communicate with any other suitable devices (e.g., front end modules, antennas, etc.) over one or more networks. In some embodiments, the network interface 1420 may be integrated with other components of the system 1400. For example, the network interface 1420 may be integrated with at least one of the processors 1404, the system memory 1412, the NVM/memory 1416, and a firmware device (not shown) having instructions which, when executed by at least one of the processors 1404, cause the electronic device 1400 to implement the method as shown in fig. 1.
The network interface 1420 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 1420 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In one embodiment, at least one of the processors 1404 may be packaged together with logic for one or more controllers of the system control logic 1408 to form a System In Package (SiP). In one embodiment, at least one of the processors 1404 may be integrated on the same die with logic for one or more controllers of the system control logic 1408 to form a system on chip (SoC).
The electronic device 1400 may further include: input/output (I/O) devices 1432. The I/O devices 1432 may include a user interface designed to enable a user to interact with the electronic device 1400, and a peripheral component interface designed to enable peripheral components to also interact with the electronic device 1400. In some embodiments, the electronic device 1400 further includes a sensor for determining at least one of environmental conditions and location information associated with the electronic device 1400.
In some embodiments, the user interface may include, but is not limited to, a display (e.g., a liquid crystal display or a touch screen display), a speaker, a microphone, one or more cameras (e.g., still image cameras and/or video cameras), a flash (e.g., a light-emitting diode flash), and a keyboard.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the present application may be implemented as a computer program or program code that is executed on a programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including but not limited to floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or a tangible machine-readable storage used in transmitting information over the Internet in the form of electrical, optical, acoustical or other propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module is a logic unit/module. Physically, one logic unit/module may be one physical unit/module, may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logic unit/module itself is not the most important consideration, and the combination of functions implemented by the logic unit/module is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above-described device embodiments do not introduce units/modules that are less closely related to solving the technical problem addressed by the present application, which does not mean that the above-described device embodiments do not include other units/modules.
It should be noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (17)

1. A method for training a style migration model for an electronic device, the method comprising:
an acquisition step of acquiring training data, wherein the training data comprises N frames of sample content images and N sample style images;
a style removal step of performing image style migration on the N frames of sample content images and the N sample style images through a style removal model to obtain N frames of first real images;
a style restoration step of performing image style migration on the N frames of sample content images and the N frames of first real images through a style restoration model to obtain N frames of second real images; and
a parameter determination step of determining a first parameter of the style removal model and a second parameter of the style restoration model based on a first loss function between the N frames of sample content images and the N frames of first real images and a second loss function between the N frames of sample content images and the N frames of second real images,
wherein N is an integer greater than or equal to 2.
2. The training method of claim 1, wherein in the style restoration step, N-1 optical flow graphs are generated from every two consecutive frames of the sample content images, and the N-1 optical flow graphs are input into the style restoration model,
wherein each optical flow graph includes optical flow information.
3. The training method of claim 2, wherein the style removal model comprises a first feature extraction module, a first feature transformation module, and a first decoding module,
wherein the style removal step further comprises:
a first feature extraction step of inputting the N frames of sample content images and the N sample style images into the first feature extraction module to obtain N sample content feature sets and N sample style feature sets;
a first feature conversion step of inputting the N sample content feature sets and the N sample style feature sets into the first feature transformation module to obtain N first synthesized feature sets;
and a first decoding step of inputting the N first synthesized feature sets into the first decoding module to obtain the N frames of first real images.
4. The training method of claim 3, wherein the style restoration model comprises a second feature extraction module, a second feature transformation module, a neural network model, and a second decoding module,
wherein the style restoration step further comprises:
a second feature extraction step of inputting the N frames of first real images into the second feature extraction module to obtain N real image feature sets;
a second feature conversion step of inputting the N real image feature sets and the N sample content feature sets into the second feature transformation module to obtain N second synthesized feature sets;
a step of generating the N-1 optical flow graphs from every two consecutive frames of the sample content images;
a fusion step of inputting the N second synthesized feature sets and the N-1 optical flow graphs into the neural network model to obtain N fusion synthesized feature sets;
and a second decoding step of inputting the N fusion synthesized feature sets into the second decoding module to obtain the N frames of second real images.
5. The training method of claim 3 wherein the first feature transformation module comprises a first whitening module, a first convolution module, and a first stylization module,
the first whitening module performs whitening calculation on each input sample content feature set to obtain a first whitening result, the first convolution module performs convolution calculation on the first whitening result and the sample content feature set to obtain a first convolution result, and the first stylization module performs stylization calculation on the first convolution result to obtain the first synthesized feature set.
6. The training method of claim 4, wherein the second feature transformation module comprises a second whitening module, a second convolution module, and a second stylization module,
the second whitening module performs whitening calculation on each input real image feature set to obtain a second whitening result, the second convolution module performs convolution calculation on the second whitening result and the sample content feature set to obtain a second convolution result, and the second stylization module performs stylization calculation on the second convolution result to obtain the second synthesized feature set.
7. The training method of claim 3, wherein each first synthesized feature set comprises a plurality of first synthesized features, and for each input first synthesized feature set, the first decoding module downsamples each first synthesized feature, fuses it with the next first synthesized feature, and downsamples again, thereby obtaining a downsampled synthesized feature, and then upsamples the downsampled synthesized feature together with each first synthesized feature in sequence, thereby obtaining the N frames of first real images.
8. The training method of claim 4, wherein each fusion synthesized feature set includes a plurality of fusion synthesized features, and for each input fusion synthesized feature set, the second decoding module downsamples each fusion synthesized feature, fuses it with the next fusion synthesized feature, and downsamples again, thereby obtaining a downsampled fusion synthesized feature, and then upsamples the downsampled fusion synthesized feature together with each fusion synthesized feature in sequence, thereby obtaining the N frames of second real images.
9. The training method of any one of claims 1-8, wherein the first loss function and the second loss function are content loss functions.
10. The training method of any of claims 1-8, wherein the N sample style images may be the same or different.
11. A method of video style migration, comprising:
a video acquisition step of acquiring a video to be processed and a plurality of style images, wherein the video to be processed comprises a plurality of frames of content images to be processed, and each frame of content image to be processed corresponds to one style image;
a style migration step of performing image style migration on each frame of content image to be processed and the corresponding style image by using a target style migration model to obtain a style-migrated video,
wherein the target style migration model is the style restoration model obtained according to the training method of any one of claims 1-10, in which the second parameter has been determined.
12. The method of claim 11, wherein the plurality of style images may be the same or different.
13. A training apparatus for a style migration model, the apparatus comprising:
an acquisition unit, configured to acquire training data, wherein the training data comprises N frames of sample content images and N sample style images;
a style removal unit, configured to perform image style migration on the N frames of sample content images and the N sample style images through a style removal model to obtain N frames of first real images;
a style restoration unit, configured to perform image style migration on the N frames of sample content images and the N frames of first real images through a style restoration model to obtain N frames of second real images; and
a parameter determination unit, configured to determine a first parameter of the style removal model and a second parameter of the style restoration model based on a first loss function between the N frames of sample content images and the N frames of first real images and a second loss function between the N frames of sample content images and the N frames of second real images,
wherein N is an integer greater than or equal to 2.
14. An apparatus for video style migration, the apparatus comprising:
a to-be-processed video acquisition unit, configured to acquire a video to be processed and a plurality of style images, wherein the video to be processed comprises a plurality of frames of content images to be processed, and each frame of content image to be processed corresponds to one style image; and
a style migration unit, configured to perform image style migration on each frame of content image to be processed and the corresponding style image by using a target style migration model to obtain a style-migrated video,
wherein the target style migration model is the style restoration model obtained by the training apparatus of claim 13, in which the second parameter is determined.
15. The apparatus of claim 14, wherein the plurality of style images may be the same or different.
16. A computer readable storage medium having stored thereon instructions which, when executed on a computer, cause the computer to perform the method of any of claims 1 to 10 or 11 to 12.
17. An electronic device, comprising: one or more processors; one or more memories; the one or more memories store one or more programs that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-10 or 11-12.
CN202211310125.5A 2022-10-25 2022-10-25 Training method of style migration model, video style migration method and device Active CN115641256B (en)

Publications (2)

Publication Number Publication Date
CN115641256A CN115641256A (en) 2023-01-24
CN115641256B true CN115641256B (en) 2023-05-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant