CN115393491A - Ink video generation method and device based on instance segmentation and reference frame - Google Patents

Info

Publication number
CN115393491A
CN115393491A (application CN202110571615.XA)
Authority
CN
China
Prior art keywords: ink, real, frame, video, wash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110571615.XA
Other languages
Chinese (zh)
Inventor
刘家瑛
梁浩
汪文靖
杨帅
郭宗明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Peking University
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202110571615.XA
Publication of CN115393491A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/02 Non-photorealistic rendering
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for generating an ink-wash video based on instance segmentation and reference frames. The method comprises: obtaining an instance segmentation map of each frame in a real video, calculating the similarity between frames according to the optical flow estimated between them, and selecting the frame to be converted and its reference frames; and inputting the optical flow estimation, the frame to be converted, the reference frames and their instance segmentation maps into an ink-wash stylization network model to obtain the ink-wash video frame of the frame to be converted, thereby obtaining the ink-wash video of the real video. The invention improves the ink-wash style generation capability of the model and, compared with the prior art, achieves better results in blank-leaving (white space), in avoiding stroke entanglement between different objects, and in handling objects of different scales; it realizes reference-guided ink-wash stylization, avoids the error accumulation of conventional sequential conversion, and reduces flicker and inconsistency in the generated ink-wash video compared with existing methods.

Description

Ink video generation method and device based on instance segmentation and reference frame
Technical Field
The invention belongs to the field of video stylization, and particularly relates to a method and a device for generating an ink-wash video based on instance segmentation and reference frames.
Background
Ink-wash video generation aims, given any real captured video, to generate an ink-wash style video with the same content, so that ink-wash videos can be produced in batches. The task can roughly be divided into two parts: first, each generated frame should have a proper ink-wash style; second, the stylized video should remain coherent and consistent. Each part has been studied separately, but there has been no attempt to combine them properly and obtain better results.
Regarding the former, most current image ink-wash stylization systems are adaptations of general-purpose stylization methods and do not take the characteristics of the ink-wash style into account, so the generated style suffers from the following problems: 1) it is difficult to correctly select the blank (white-space) areas; 2) strokes of different objects easily become entangled; 3) it is difficult to handle objects of different scales simultaneously. Because of these problems, ink-wash video generation systems based on such methods can hardly meet the requirements of practical applications.
Regarding the latter, because ink-wash paintings contain large blank areas, the results of conventional video stylization methods exhibit considerable flicker and inconsistency, so they are difficult to transfer directly to the ink-wash video generation task.
Disclosure of Invention
In view of the above problems, the invention provides a method and a device for generating an ink-wash video based on instance segmentation and reference frames, which obtain a better ink-wash style than existing image ink-wash stylization methods while keeping the generated ink-wash video coherent and consistent. The method converts a real captured video into an ink-wash style video with the same content, realizes batch generation of ink-wash videos, and improves subjective visual quality and artistic effect.
The technical solution adopted by the invention is as follows:
An ink-wash video generation method based on instance segmentation and reference frames comprises the following steps:
1) Acquiring an instance segmentation map of each real video frame in a real video, and calculating the similarity between any two real video frames according to the optical flow estimated between all real video frames;
2) Selecting, according to the similarity, the real video frame to be converted and one or more reference frames for each round, inputting the real video frame to be converted and the corresponding reference frames into an ink-wash stylization network model, and obtaining the ink-wash video frame of the real video frame to be converted through the following steps:
2.1) Extracting real video features and the corresponding reference features from the real video frame to be converted, each reference frame and the corresponding instance segmentation maps, and aligning each reference feature with the real video features using the optical flow estimation;
2.2) Calculating a reference feature score for each reference feature according to the difference between that reference feature and the real video frame features, and fusing all reference features through the reference feature scores to obtain a fused reference feature;
2.3) Decoding the modified real video features, obtained from the real video features and the fused reference feature, to obtain the ink-wash video frame of the real video frame to be converted;
3) Repeating step 2) to generate the ink-wash video of the real video.
Further, the real video frame to be converted and the reference frames of each round are selected through the following steps (a selection sketch follows this list):
1) Selecting, according to the similarity, the frame with the lowest similarity to all converted frames and the highest similarity to all unconverted frames as the real video frame to be converted in this round;
2) Selecting, according to the similarity, several converted frames with the highest similarity to the real video frame to be converted as reference frames; when the first ink-wash video frame is generated, the number of reference frames is 0.
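A minimal sketch of the selection rule above. How the two criteria (low similarity to converted frames, high similarity to unconverted frames) are weighed against each other is not specified in the text, so the sketch simply scores each candidate by their difference; it assumes a precomputed pairwise similarity matrix derived from the optical-flow estimates, and all function names are illustrative.

```python
import numpy as np

def select_frame_to_convert(similarity, converted):
    """Pick the unconverted frame with the lowest total similarity to the
    already-converted frames and the highest total similarity to the
    remaining unconverted frames."""
    n = similarity.shape[0]
    unconverted = [i for i in range(n) if i not in converted]

    def score(i):
        to_converted = sum(similarity[i, j] for j in converted)
        to_unconverted = sum(similarity[i, j] for j in unconverted if j != i)
        return to_unconverted - to_converted   # higher is better

    return max(unconverted, key=score)

def select_reference_frames(similarity, converted, target, k=2):
    """Return up to k converted frames most similar to the target frame;
    the list is empty in the first round, as in the method."""
    ranked = sorted(converted, key=lambda j: similarity[target, j], reverse=True)
    return ranked[:k]

# Example with a toy 4-frame similarity matrix.
sim = np.array([[1.0, 0.9, 0.3, 0.2],
                [0.9, 1.0, 0.4, 0.3],
                [0.3, 0.4, 1.0, 0.8],
                [0.2, 0.3, 0.8, 1.0]])
first = select_frame_to_convert(sim, converted=[])
refs = select_reference_frames(sim, converted=[], target=first)
```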
Further, the real video features are obtained through the following steps (see the feature-extraction sketch below):
1) Concatenating the real video frame to be converted with the corresponding instance segmentation map, and encoding the concatenation result to obtain global features;
2) For each instance i in the instance segmentation map and its corresponding region image x_i, concatenating the region image x_i with its instance segmentation map along the channel dimension, and enlarging the result to the image size of the real video frame to be converted;
3) Extracting image features from the enlarged image and reducing them back to the feature size of the region image x_i, obtaining the instance feature H_i;
4) Fusing the global features with the instance features H_i to obtain the real video features.
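The sketch below illustrates the global-plus-instance feature extraction described above. The encoder is assumed to map a 4-channel input (RGB frame plus one segmentation channel) to a feature map at 1/4 resolution, matching the stride-2 convolutions described later; the exact fusion formula is given only as an image in the original publication, so writing each instance feature back into its region of the global feature map is an assumption, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def extract_real_video_features(encoder, frame, inst_masks):
    """frame: 3 x H x W tensor; inst_masks: N x H x W 0/1 instance masks.
    encoder: any module mapping a 4-channel image to a C x H/4 x W/4 feature map."""
    H, W = frame.shape[-2:]
    seg = (inst_masks > 0).any(dim=0, keepdim=True).float()      # union of instance masks
    h = encoder(torch.cat([frame, seg], dim=0)[None]).clone()    # global feature H_G
    fh, fw = h.shape[-2:]
    for mask in inst_masks:
        ys, xs = mask.nonzero(as_tuple=True)
        if ys.numel() == 0:
            continue
        y0, y1 = int(ys.min()), int(ys.max()) + 1
        x0, x1 = int(xs.min()), int(xs.max()) + 1
        # Enlarge the instance region (image + its mask) to the input size and encode it.
        x_i = torch.cat([frame[:, y0:y1, x0:x1],
                         mask[None, y0:y1, x0:x1].float()], dim=0)[None]
        x_i = F.interpolate(x_i, size=(H, W), mode='bilinear', align_corners=False)
        h_i = encoder(x_i)
        # Reduce the instance feature H_i back to the region size at feature resolution
        # and write it into the corresponding area of the global feature (assumed fusion).
        ry0, rx0 = y0 * fh // H, x0 * fw // W
        rh = max(1, min((y1 - y0) * fh // H, fh - ry0))
        rw = max(1, min((x1 - x0) * fw // W, fw - rx0))
        h_i = F.interpolate(h_i, size=(rh, rw), mode='bilinear', align_corners=False)
        h[:, :, ry0:ry0 + rh, rx0:rx0 + rw] = h_i
    return h
```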
Further, the fused reference feature is obtained by:
1) Applying a softmax transformation to the reference feature scores at each position to obtain the corresponding coefficients;
2) Performing a weighted summation over the reference features at each position according to these coefficients to obtain the fused reference feature.
Further, the modified real video features are obtained through the following steps (see the fusion and modification sketch below):
1) Scoring the difference between the fused reference feature and the real video frame features to obtain a fused reference feature score;
2) Modifying the real video frame features by a weighted summation based on the fused reference feature score, obtaining the modified real video features.
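A compact sketch of steps 2.2) and 2.3) above: reference fusion followed by feature modification. The scoring networks stand in for the reference frame feature fusion module W_r and the feature modifier W described later; since the exact modification formula is given only as an image in the original publication, the per-position blend controlled by a sigmoid of the score is an assumption.

```python
import torch

def fuse_and_modify(score_net, modify_net, h, aligned_refs):
    """h: 1 x C x Hf x Wf feature of the frame to convert; aligned_refs: list of
    flow-aligned reference features of the same shape. score_net and modify_net
    both map a feature difference to a score map of the same shape."""
    if not aligned_refs:
        return h                                   # first round: no references, no change
    # Score each reference against the current frame and softmax across references.
    scores = torch.stack([score_net(r - h) for r in aligned_refs], dim=1)  # 1 x K x C x Hf x Wf
    weights = torch.softmax(scores, dim=1)
    r_f = (torch.stack(aligned_refs, dim=1) * weights).sum(dim=1)          # fused reference R_f
    # Modify the frame feature with the fused reference (blend rule is an assumption).
    w = torch.sigmoid(modify_net(r_f - h))
    return w * r_f + (1.0 - w) * h
```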
Further, the structure of the ink-wash stylization network model comprises: an encoder, a reference frame feature fusion module, a feature modification module and a decoder. The encoder comprises several convolutional layers and several residual blocks, where each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1. The reference frame feature fusion module and the feature modification module each comprise a convolutional layer, several alternating residual blocks and convolutional layers, and a final convolutional layer. The decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, where each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer. A structural sketch is given below.
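A structural sketch of the encoder and decoder, using the layer counts given in the detailed description below (3 convolutions plus 6 residual blocks, mirrored in the decoder); kernel sizes, channel widths and the tanh output layer are assumptions not stated in the text.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

def make_encoder(in_ch=4, ch=64):
    """3 convolutions (BN + ReLU after each, stride 2 from the second) + 6 residual blocks."""
    return nn.Sequential(
        nn.Conv2d(in_ch, ch, 7, padding=3), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.BatchNorm2d(2 * ch), nn.ReLU(inplace=True),
        nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.BatchNorm2d(4 * ch), nn.ReLU(inplace=True),
        *[ResidualBlock(4 * ch) for _ in range(6)])

def make_decoder(out_ch=3, ch=64):
    """Mirror of the encoder: 6 residual blocks, then 3 convolutions with
    nearest-neighbour upsampling before every convolution except the last."""
    return nn.Sequential(
        *[ResidualBlock(4 * ch) for _ in range(6)],
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(4 * ch, 2 * ch, 3, padding=1), nn.BatchNorm2d(2 * ch), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(2 * ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        nn.Conv2d(ch, out_ch, 3, padding=1), nn.Tanh())   # output activation assumed
```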
Further, the ink-wash stylization network model F is trained by:
1) Collecting several sets of training data for iterative training, where each set of training data comprises: a real sample video, a real sample frame taken from the real sample video, several sample reference frames taken from the real sample video, and a randomly selected real ink-wash painting;
2) Acquiring the instance segmentation maps of the real sample frame and of each sample reference frame, calculating the optical flow between the real sample frame and each sample reference frame, and inputting the real sample frame, each sample reference frame, the corresponding instance segmentation maps and the optical flows into the ink-wash stylization network model to obtain an ink-wash video sample frame;
3) Inputting the ink-wash video sample frame and the real ink-wash painting into an ink-wash painting discriminator and an ink-wash style discriminator for discrimination, where each of the two discriminators consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
4) Inputting the real ink-wash painting into a real picture generator to obtain a generated real picture, and inputting the real video sample frame and the generated real picture into a real picture discriminator for discrimination, where the real picture generator consists of an encoder and a decoder: the encoder comprises several convolutional layers and several residual blocks, each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1; the decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer; the real picture discriminator consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
5) Inputting the ink-wash video sample frame into the real picture generator to obtain a reconstructed real video frame, and comparing it with the real sample frame;
6) Inputting the generated real picture and its instance segmentation map into the ink-wash stylization network model to obtain a reconstructed ink-wash painting, and comparing it with the real ink-wash painting;
7) Using the discrimination and comparison of steps 3)-6), computing all parameters of the ink-wash stylization network model from the resulting adversarial loss L_adv, cyclic reconstruction loss L_cycle, consistency loss L_cons, temporal loss L_tmp, white-space loss L_whiten, contour loss L_contour, stroke loss L_stroke and ink loss L_ink.
Further, the adversarial loss L_adv = E_x[log(1 - D_Y(F(x)))] + E_y[log(1 - D_X(B(y)))], where x and y are samples drawn from the real data set and the ink-wash data set respectively, E_x and E_y are the mathematical expectations over these two distributions, D_Y is the ink-wash painting discriminator, D_X is the real picture discriminator, F is the ink-wash stylization network model and B is the real picture generator. The cyclic reconstruction loss L_cycle = γ·E_x[||B(F(x)) - x||_1] + E_y[||F(B(y)) - y||_1], where γ is a coefficient balancing the two parts of L_cycle. The consistency loss L_cons = L_flow + L_scale; the formulas of L_flow and L_scale are given as images in the original publication, with y_ω = ω(F(x), f), H_ω = ω(E(x), f↓), where ω(I, w) denotes warping the picture I according to the optical flow w, f is the optical flow between the real sample frame and a sample reference frame, f↓ is the optical flow downsampled to the feature-matrix size, M_o is the occlusion (shadow) mask of the optical flow, H_n is the feature extracted after enlarging the region of the n-th instance, s is the size-reduction operation and N is the number of instances. The temporal loss L_tmp is likewise given as an image; in it, H' is the modified real sample frame feature, {r_1, ..., r_k} are the reference frames, {f_1, ..., f_k} are the optical flows between the corresponding reference frames and the current frame, {M_o,1, ..., M_o,k} are the occlusion masks of those optical flows, and M_o,c is a combined mask whose formula is given as an image. The white-space loss L_whiten (formula given as an image) uses the foreground mask M_f and the background mask M_b = 1 - M_f. The contour loss L_contour (formula given as an image) uses Haar(I), the high-frequency components of the Haar wavelet transform of image I, and M_c, an edge mask computed from the instance segmentation map that is 1 in object contour regions and 0 elsewhere. The stroke loss L_stroke (formula given as an image) uses the edge extractor HED, denoted Edg, T, the number of pixels in the image, and μ = T_/T, where T_ is the sum of the weights of the non-edge points in the Edg detection result. The ink loss L_ink = E_y[log D_Y(Gauss(y))] + E_x[log(1 - D_Y(Gauss(F(x))))], where Gauss denotes a Gaussian blur operation.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method described above.
Compared with the prior art, the invention adds an instance segmentation input, adjusts the feature extraction scheme and guides training with suitable loss functions, which improves the ink-wash style generation capability of the model and achieves better results than existing methods in white-space selection, in avoiding stroke entanglement between different objects, and in handling objects of different scales. By fusing the features of several reference frames and modifying the features of the real video frame, reference-guided ink-wash stylization is realized; by selecting the conversion order and the reference frames appropriately, the error accumulation of conventional sequential conversion is avoided, and flicker and inconsistency in the generated ink-wash video are reduced compared with existing methods.
Drawings
FIG. 1 is an overall framework diagram of the ink-wash stylization network model used in an embodiment of the present invention.
FIG. 2 is a flow diagram detailing the encoder of the ink-wash stylization network model for a single picture.
FIG. 3 is a network structure diagram of the ink-wash stylization network model used in an embodiment of the present invention.
FIG. 4 is a network structure diagram of the ink-wash style discriminator, the ink-wash painting discriminator, the real picture generator and the real picture discriminator used in an embodiment of the present invention.
FIG. 5 shows an input video frame according to an embodiment of the present invention.
FIG. 6 is an ink-wash video frame generated from FIG. 5.
FIG. 7 compares single-picture to ink-wash-picture conversion between the present invention and existing methods.
FIG. 8 compares real-video to ink-wash-video conversion between the present invention and existing methods.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying figures. It should be noted that the specific numbers of layers and modules, the choices of functions, the arrangement of particular layers, etc. given in the following examples are only a preferred implementation, and the invention is not limited thereto.
This embodiment discloses an ink-wash video generation method based on instance segmentation and reference frames, which uses a training data set to adversarially train an ink-wash stylization network model, an ink-wash style discriminator, an ink-wash painting discriminator, a real picture generator and a real picture discriminator. The ink-wash stylization network model comprises an encoder, a reference frame feature fusion module, a feature modification module and a decoder. The encoder comprises several successive convolutional layers and several residual blocks; each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1. The reference frame feature fusion module and the feature modification module each consist of a convolutional layer, several alternating residual blocks and convolutional layers, and a final convolutional layer. The decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers; each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer. The ink-wash style discriminator, the ink-wash painting discriminator and the real picture discriminator each consist of several convolutional layers; each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, the first layer is followed only by a linear rectification function, and downsampling uses a stride greater than 1. The real picture generator consists of an encoder and a decoder with the same structure as those in the ink-wash stylization network model.
Taking the conversion of a target video into an ink-wash style video as an example, the method is described as follows:
Step 1: Collect several real videos, images selected from the real videos, and several randomly selected ink-wash paintings; the photographic images form the real data set and the ink-wash paintings form the ink-wash data set.
Step 2: Build the ink-wash stylization network model, the ink-wash style discriminator, the ink-wash painting discriminator, the real picture generator and the real picture discriminator.
The network structures are shown in FIGS. 1-4. The ink-wash stylization network model is divided into an encoder E, a reference frame feature fusion module W_r, a feature modifier W and a decoder G; the whole ink-wash stylization network model is denoted F. In addition, there are an ink-wash style discriminator D_ink, an ink-wash painting discriminator D_Y, a real picture generator B and a real picture discriminator D_X.
The encoder E consists of 3 successive convolutional layers followed by 6 residual blocks; each convolutional layer is followed by a batch normalization module and a linear rectification function (ReLU), and the 2nd and 3rd convolutional layers downsample using convolutions of stride 2. In use, the input frame and its instance segmentation map are first concatenated along the channel dimension and passed through this network to obtain the global feature H_G. Then, for each instance i and its corresponding region image x_i, x_i and its instance segmentation map are concatenated along the channel dimension, enlarged to the input frame size, passed through the network to extract features, and reduced back to the feature size of the region image x_i; this is recorded as the instance feature H_i. Finally, the output of the network is obtained by fusing H_G and the instance features according to the instance segmentation maps (the fusion formula is given as an image in the original publication), where S_i is the instance segmentation map of instance i, a single-channel 0/1 matrix of the same size as the original image that is 1 in the region of instance i and 0 elsewhere.
The reference frame feature fusion module W_r consists of 1 convolutional layer, 5 alternating residual blocks and convolutional layers, and a final convolutional layer. In use, it receives the difference between each reference feature and the real video frame feature and generates a score matrix of the same size as the reference features; a softmax transformation is then applied over the reference feature scores at each position, and the reference features are weighted and summed position-wise with these coefficients to obtain the fused reference feature R_f.
The feature modifier W has the same structure as W_r. It receives and scores the difference between the fused reference feature and the real video frame feature; the score is a matrix w of the same size as the fused reference feature and the real video frame feature, and the real video frame feature is modified with the fused reference feature according to this score (the modification formula is given as an image in the original publication), where H' is the real video frame feature produced by the encoder.
The decoder G is symmetric to the encoder and comprises 6 residual blocks and 3 convolutional layers; each convolutional layer is followed by a linear rectification function, and nearest-neighbour upsampling is performed before every convolutional layer except the last. This module converts the modified real video frame features into an ink-wash style frame.
The real picture generator B consists of 3 convolutional layers, 12 residual blocks and 3 further convolutional layers; each convolutional layer is followed by an activation function, ReLU for every layer except the last convolutional layer, which uses tanh, and apart from the first and the last convolutional layers a batch normalization module is added after each convolutional layer. This module converts ink-wash images into realistic photographs.
The ink-wash style discriminator D_ink, the ink-wash painting discriminator D_Y and the real picture discriminator D_X each consist of several convolutional layers; each convolutional layer except the first and the last is followed by a batch normalization module and a leaky linear rectification function (LeakyReLU), the first layer is followed only by a LeakyReLU, and the first convolutional layer downsamples using a convolution of stride 2. These modules judge, respectively, whether an input image has the ink-wash style, whether it is an ink-wash painting, and whether it is a real photograph, and are used for adversarial training against the generative models F and B. A discriminator sketch is given below.
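A sketch of the shared discriminator layout described above; the number of layers, the kernel size and the stride pattern of the intermediate convolutions are assumptions (the text only fixes the behaviour of the first layer), and a PatchGAN-style single-channel output map is likewise assumed.

```python
import torch.nn as nn

def make_discriminator(in_ch=3, ch=64, n_layers=4):
    """First convolution: stride 2, followed only by LeakyReLU; intermediate
    convolutions add batch normalization; the last convolution outputs a
    single-channel score map."""
    layers = [nn.Conv2d(in_ch, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True)]
    c = ch
    for _ in range(n_layers - 2):
        layers += [nn.Conv2d(c, c * 2, 4, stride=2, padding=1),
                   nn.BatchNorm2d(c * 2), nn.LeakyReLU(0.2, inplace=True)]
        c *= 2
    layers += [nn.Conv2d(c, 1, 4, padding=1)]
    return nn.Sequential(*layers)

# D_ink, D_Y and D_X share this layout: D_ink and D_Y see (blurred) ink-wash
# images, D_X sees real photographs.
```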
Step 3: Train the ink-wash stylization network model F, the ink-wash style discriminator D_ink, the ink-wash painting discriminator D_Y, the real picture generator B and the real picture discriminator D_X, alternately optimizing the generative models F, B and the discriminative models D_ink, D_Y, D_X.
For the generative models F and B, the loss function includes: the adversarial loss L_adv, the cyclic reconstruction loss L_cycle, the consistency loss L_cons, the temporal loss L_tmp, the white-space loss L_whiten, the contour loss L_contour, the stroke loss L_stroke and the ink loss L_ink.
Here L_adv = E_x[log(1 - D_Y(F(x)))] + E_y[log(1 - D_X(B(y)))] and L_cycle = γ·E_x[||B(F(x)) - x||_1] + E_y[||F(B(y)) - y||_1], where x and y are samples drawn from the real data set and the ink-wash data set respectively, E_x and E_y are the mathematical expectations over these two distributions, D_Y is the ink-wash painting discriminator, D_X is the real picture discriminator, F is the ink-wash stylization network model, B is the real picture generator, and γ is a coefficient balancing the two parts of L_cycle. These loss terms guide the ink-wash stylization network model to generate a genuine ink-wash style, guide the real picture generator to generate a realistic photo style, and make the two mappings approximately inverse to each other; a sketch follows.
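A direct transcription of L_adv and L_cycle as printed above, with F's extra inputs (segmentation maps, reference features, optical flow) omitted for brevity; the discriminators are assumed to output probabilities in (0, 1), and the value of γ is illustrative.

```python
import torch

def generator_losses(F_gen, B_gen, D_Y, D_X, x, y, gamma=10.0, eps=1e-8):
    """x: batch of real video frames; y: batch of real ink-wash paintings."""
    fake_ink = F_gen(x)            # F(x): real frame -> ink-wash style
    fake_real = B_gen(y)           # B(y): ink-wash painting -> realistic photo
    l_adv = (torch.log(1 - D_Y(fake_ink) + eps).mean()
             + torch.log(1 - D_X(fake_real) + eps).mean())
    l_cycle = (gamma * (B_gen(fake_ink) - x).abs().mean()
               + (F_gen(fake_real) - y).abs().mean())
    return l_adv, l_cycle
```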
L_cons = L_flow + L_scale. The first part, L_flow, is defined by a formula given as an image in the original publication, with y_ω = ω(F(x), f) and H_ω = ω(E(x), f↓), where ω(I, w) denotes warping the picture I according to the optical flow w, f is the optical flow between the sampled video frame and another frame, f↓ is the optical flow downsampled to the feature-matrix size, and M_o is the occlusion (shadow) mask of the optical flow. L_flow guides the model to establish optical-flow consistency between the features learned by the encoder and the ink-wash images generated by the decoder. The second part, L_scale, is also given as an image; in it, H_n is the feature extracted after enlarging the region of the n-th instance (as described for the encoder) and s is the size-reduction operation. L_scale guides the model to establish scale consistency between the features learned by the encoder and the ink-wash images generated by the decoder.
The temporal loss L_tmp is defined by a formula given as an image in the original publication, in which H' is the modified real video frame feature, {r_1, ..., r_k} are the reference frames, {f_1, ..., f_k} are the optical flows between the corresponding reference frames and the current frame, {M_o,1, ..., M_o,k} are the occlusion masks of those optical flows, and M_o,c is a combined mask whose formula is likewise given as an image. This loss guides the model to generate an ink-wash image that is consistent with the reference frames in regions where the optical flow is reliable, and to generate the remaining regions as in the case without reference frames.
The white-space loss L_whiten is defined by a formula given as an image in the original publication, where M_f is the foreground mask, generated directly from the instance segmentation results, and M_b = 1 - M_f is the background mask. This loss guides the model to leave the background area blank and to apply more ink to the foreground.
The contour loss L_contour is defined by a formula given as an image in the original publication, where Haar(I) denotes applying the Haar wavelet transform to image I and extracting its high-frequency components, and M_c is an edge mask computed from the instance segmentation map that is 1 in object contour regions and 0 elsewhere. This loss reduces stroke entanglement between different objects.
The stroke loss L_stroke is defined by a formula given as an image in the original publication, where Edg is the edge extractor HED, T is the number of pixels in the image, and μ = T_/T, where T_ is the sum of the weights of the non-edge points in the Edg detection result; this coefficient balances the loss between edge and non-edge pixels. The loss constrains the generated ink-wash image and the original image to have similar lines and contours.
L_ink = E_y[log D_Y(Gauss(y))] + E_x[log(1 - D_Y(Gauss(F(x))))], where Gauss denotes a Gaussian blur operation. This loss optimizes the ink texture of the generated ink-wash image.
For the discriminative models D_X, D_Y and D_ink, the losses include L_adv_X = E_y[log(1 - D_Y(y))] + E_x[log D_Y(F(x))], L_adv_Y = E_x[log(1 - D_X(x))] + E_y[log D_X(B(y))] and L_adv_ink = E_y[log(1 - D_ink(Gauss(y)))] + E_x[log D_ink(Gauss(F(x)))]. These loss terms guide the discriminative models to distinguish generated ink-wash images from real ink-wash paintings, to distinguish generated real pictures from real photographs, and to distinguish the generated ink texture from the real ink texture, respectively. A sketch of the ink-texture and discriminator losses follows.
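A sketch of L_ink and the three discriminator losses as printed above. Gaussian blurring uses torchvision's gaussian_blur; the kernel size and sigma are illustrative, discriminator outputs are assumed to be probabilities, and the generated images are detached for the discriminator-side terms.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def ink_and_discriminator_losses(F_gen, B_gen, D_X, D_Y, D_ink, x, y,
                                 ksize=21, sigma=5.0, eps=1e-8):
    blur = lambda t: gaussian_blur(t, [ksize, ksize], [sigma, sigma])
    fake_ink, fake_real = F_gen(x), B_gen(y)

    # Generator-side ink loss L_ink (gradients flow back into F).
    l_ink = (torch.log(D_Y(blur(y)) + eps).mean()
             + torch.log(1 - D_Y(blur(fake_ink)) + eps).mean())

    # Discriminator-side losses on detached generator outputs.
    fi, fr = fake_ink.detach(), fake_real.detach()
    l_D_Y = torch.log(1 - D_Y(y) + eps).mean() + torch.log(D_Y(fi) + eps).mean()
    l_D_X = torch.log(1 - D_X(x) + eps).mean() + torch.log(D_X(fr) + eps).mean()
    l_D_ink = (torch.log(1 - D_ink(blur(y)) + eps).mean()
               + torch.log(D_ink(blur(fi)) + eps).mean())
    return l_ink, l_D_Y, l_D_X, l_D_ink
```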
The training procedure comprises:
computing the instance segmentation of the real video frame and the reference frames (marking the regions of different objects) and the optical flow between them, and inputting these into the ink-wash stylization network model to obtain a converted ink-wash video frame;
inputting the generated ink-wash video frame and a real ink-wash painting into the ink-wash painting discriminator and the ink-wash style discriminator for discrimination, and computing the losses (L_adv and L_ink) used for the adversarial training between the discriminators and the generator, so that the generated result looks more realistic;
inputting the ink-wash painting into the real picture generator to obtain a generated real picture;
inputting the real video frame and the generated real picture into the real picture discriminator for discrimination;
inputting the generated ink-wash video frame into the real picture generator to obtain a reconstructed real video frame;
comparing the reconstructed real video frame with the input real video frame and computing the loss (L_cycle), which ensures that no content of the original image is lost during generation, i.e. that the generated ink-wash video frame and the input real video frame share the same content structure;
inputting the generated real picture and its instance segmentation map into the ink-wash stylization network model to obtain a reconstructed ink-wash painting;
comparing the reconstructed ink-wash painting with the input ink-wash painting and likewise computing the loss (L_cycle) to keep the content consistent during generation. One alternating training round is sketched below.
Step 4: Inference. A really captured video is input (see FIG. 5); its instance segmentation is computed and optical flow is estimated between every pair of frames, and the desired ink-wash style video is output (see FIG. 6; for ease of demonstration only some frames of the video are shown). The procedure, summarized by the sketch after this list, comprises the following steps:
1) Computing the optical flows between all frames, and measuring the similarity of two frames by the magnitude of the optical flow between them;
2) Selecting the frame with the lowest similarity to all converted frames and the highest similarity to all other unconverted frames as the real video frame of this round;
3) Selecting several converted frames most similar to the selected real video frame as reference frames; in the first round there are, as a special case, no reference frames;
4) Taking the features corresponding to the reference frames at the decoder of the ink-wash stylization network model as reference features, and inputting them together with the selected real video frame, its instance segmentation map and the optical flows between the real video frame and the reference frames into the ink-wash stylization network model to obtain the corresponding ink-wash video frame; if there are no reference frames, the features extracted from the real video frame are fed directly into the decoder of the ink-wash stylization network to generate the corresponding ink-wash frame;
5) Returning to step 2) until all frames have been converted; finally, the converted frames are assembled into the ink-wash video in their original order.
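The loop below summarizes steps 1)-5) above, reusing the selection helpers sketched earlier in the Disclosure section; the `model` callable wraps the full stylization pipeline (encoder, reference fusion, feature modification, decoder) and `flow_fn` stands for any optical-flow estimator, both illustrative.

```python
def convert_video(frames, seg_maps, similarity, flow_fn, model, k=2):
    """frames / seg_maps: per-frame images and instance segmentation maps;
    similarity: pairwise similarity matrix derived from optical-flow magnitudes."""
    n = len(frames)
    ink_frames = [None] * n
    converted = []
    while len(converted) < n:
        t = select_frame_to_convert(similarity, converted)            # defined earlier
        refs = select_reference_frames(similarity, converted, t, k)   # empty in round 1
        flows = [flow_fn(frames[r], frames[t]) for r in refs]
        ink_frames[t] = model(frames[t], seg_maps[t],
                              [frames[r] for r in refs],
                              [seg_maps[r] for r in refs], flows)
        converted.append(t)
    return ink_frames        # already in the original frame order
```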
Experimental results:
Single-picture to ink-wash-picture conversion was compared with existing image ink-wash stylization and image stylization methods (see FIG. 7; ChipGAN is an existing image ink-wash stylization method, and AdaIN and WCT are existing image stylization methods); the present method obtains a better ink-wash effect on single pictures than the existing methods. Real-video to ink-wash-video conversion was also compared with existing image ink-wash stylization and video stylization methods (see FIG. 8; each pair of adjacent columns shows video frames and enlarged views of the corresponding regions; ChipGAN is an existing image ink-wash stylization method, applied to each frame of the video separately to stylize the video, while Linear and Compound are existing video stylization methods).
It is easy to understand that the specific structure of each network in the method, such as the number of convolutional layers, the type of nonlinear activation function and the regularization method, can be replaced by other structures; the optical flow estimation method, the instance segmentation method and the edge detection method used here may likewise be replaced with other methods having the same or similar functions.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. An ink-wash video generation method based on instance segmentation and reference frames, comprising the following steps:
1) Acquiring an instance segmentation map of each real video frame in a real video, and calculating the similarity between any two real video frames according to the optical flow estimated between all real video frames;
2) Selecting, according to the similarity, the real video frame to be converted and one or more reference frames for each round, inputting the real video frame to be converted and the corresponding reference frames into an ink-wash stylization network model, and obtaining the ink-wash video frame of the real video frame to be converted through the following steps:
2.1) Extracting real video features and the corresponding reference features from the real video frame to be converted, each reference frame and the corresponding instance segmentation maps, and aligning each reference feature with the real video features using the optical flow estimation;
2.2) Calculating a reference feature score for each reference feature according to the difference between that reference feature and the real video frame features, and fusing all reference features through the reference feature scores to obtain a fused reference feature;
2.3) Decoding the modified real video features, obtained from the real video features and the fused reference feature, to obtain the ink-wash video frame of the real video frame to be converted;
3) Repeating step 2) to generate the ink-wash video of the real video.
2. The method of claim 1, wherein the real video frame to be converted and the reference frames of each round are selected through the following steps:
1) Selecting, according to the similarity, the frame with the lowest similarity to all converted frames and the highest similarity to all unconverted frames as the real video frame to be converted in this round;
2) Selecting, according to the similarity, several converted frames with the highest similarity to the real video frame to be converted as reference frames, wherein the number of reference frames is 0 when the first ink-wash video frame is generated.
3. The method of claim 1, wherein the real video features are obtained by:
1) Concatenating the real video frame to be converted with the corresponding instance segmentation map, and encoding the concatenation result to obtain global features;
2) For each instance i in the instance segmentation map and its corresponding region image x_i, concatenating the region image x_i with its instance segmentation map along the channel dimension, and enlarging the result to the image size of the real video frame to be converted;
3) Extracting image features from the enlarged image and reducing them back to the feature size of the region image x_i, obtaining the instance feature H_i;
4) Fusing the global features with the instance features H_i to obtain the real video features.
4. The method of claim 1, wherein the fused reference feature is obtained by:
1) Applying a softmax transformation to the reference feature scores at each position to obtain the corresponding coefficients;
2) Performing a weighted summation over the reference features at each position according to these coefficients to obtain the fused reference feature.
5. The method of claim 1, wherein the modified real video features are obtained by:
1) Scoring the difference between the fused reference feature and the real video frame features to obtain a fused reference feature score;
2) Modifying the real video frame features by a weighted summation based on the fused reference feature score, obtaining the modified real video features.
6. The method of claim 1, wherein the structure of the ink-wash stylization network model comprises: an encoder, a reference frame feature fusion module, a feature modification module and a decoder; the encoder comprises several convolutional layers and several residual blocks, wherein each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1; the reference frame feature fusion module and the feature modification module each comprise a convolutional layer, several alternating residual blocks and convolutional layers, and a final convolutional layer; the decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, wherein each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer.
7. The method of claim 1, wherein the ink-wash stylization network model F is trained by:
1) Collecting several sets of training data for iterative training, wherein each set of training data comprises: a real sample video, a real sample frame taken from the real sample video, several sample reference frames taken from the real sample video, and a randomly selected real ink-wash painting;
2) Acquiring the instance segmentation maps of the real sample frame and of each sample reference frame, calculating the optical flow between the real sample frame and each sample reference frame, and inputting the real sample frame, each sample reference frame, the corresponding instance segmentation maps and the optical flows into the ink-wash stylization network model to obtain an ink-wash video sample frame;
3) Inputting the ink-wash video sample frame and the real ink-wash painting into an ink-wash painting discriminator and an ink-wash style discriminator for discrimination, wherein each of the two discriminators consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
4) Inputting the real ink-wash painting into a real picture generator to obtain a generated real picture, and inputting the real video sample frame and the generated real picture into a real picture discriminator for discrimination, wherein the real picture generator consists of an encoder and a decoder: the encoder comprises several convolutional layers and several residual blocks, each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1; the decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer; the real picture discriminator consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
5) Inputting the ink-wash video sample frame into the real picture generator to obtain a reconstructed real video frame, and comparing it with the real sample frame;
6) Inputting the generated real picture and its instance segmentation map into the ink-wash stylization network model to obtain a reconstructed ink-wash painting, and comparing it with the real ink-wash painting;
7) Using the discrimination and comparison of steps 3)-6), computing all parameters of the ink-wash stylization network model from the resulting adversarial loss L_adv, cyclic reconstruction loss L_cycle, consistency loss L_cons, temporal loss L_tmp, white-space loss L_whiten, contour loss L_contour, stroke loss L_stroke and ink loss L_ink.
8. The method of claim 7, wherein the adversarial loss L_adv = E_x[log(1 - D_Y(F(x)))] + E_y[log(1 - D_X(B(y)))], where x and y are samples drawn from the real data set and the ink-wash data set respectively, E_x and E_y are the mathematical expectations over these two distributions, D_Y is the ink-wash painting discriminator, D_X is the real picture discriminator, F is the ink-wash stylization network model and B is the real picture generator; the cyclic reconstruction loss L_cycle = γ·E_x[||B(F(x)) - x||_1] + E_y[||F(B(y)) - y||_1], where γ is a coefficient balancing the two parts of L_cycle; the consistency loss L_cons = L_flow + L_scale, where the formulas of L_flow and L_scale are given as images in the original publication, with y_ω = ω(F(x), f), H_ω = ω(E(x), f↓), ω(I, w) denoting warping the picture I according to the optical flow w, f the optical flow between the real sample frame and a sample reference frame, f↓ the optical flow downsampled to the feature-matrix size, M_o the occlusion (shadow) mask of the optical flow, H_n the feature extracted after enlarging the region of the n-th instance, s the size-reduction operation and N the number of instances; the temporal loss L_tmp is likewise given as an image, in which H' is the modified real sample frame feature, {r_1, ..., r_k} are the reference frames, {f_1, ..., f_k} are the optical flows between the corresponding reference frames and the current frame, {M_o,1, ..., M_o,k} are the occlusion masks of those optical flows, and M_o,c is a combined mask whose formula is given as an image; the white-space loss L_whiten (formula given as an image) uses the foreground mask M_f and the background mask M_b = 1 - M_f; the contour loss L_contour (formula given as an image) uses Haar(I), the high-frequency components of the Haar wavelet transform of image I, and M_c, an edge mask computed from the instance segmentation map that is 1 in object contour regions and 0 elsewhere; the stroke loss L_stroke (formula given as an image) uses the edge extractor HED, denoted Edg, T, the number of pixels in the image, and μ = T_/T, where T_ is the sum of the weights of the non-edge points in the Edg detection result; and the ink loss L_ink = E_y[log D_Y(Gauss(y))] + E_x[log(1 - D_Y(Gauss(F(x))))], where Gauss denotes a Gaussian blur operation.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202110571615.XA 2021-05-25 2021-05-25 Ink video generation method and device based on instance segmentation and reference frame Pending CN115393491A (en)

Priority Applications (1)

Application Number: CN202110571615.XA; Publication: CN115393491A (en); Priority Date: 2021-05-25; Filing Date: 2021-05-25; Title: Ink video generation method and device based on instance segmentation and reference frame

Applications Claiming Priority (1)

Application Number: CN202110571615.XA; Publication: CN115393491A (en); Priority Date: 2021-05-25; Filing Date: 2021-05-25; Title: Ink video generation method and device based on instance segmentation and reference frame

Publications (1)

Publication Number: CN115393491A (en); Publication Date: 2022-11-25

Family

ID=84114468

Family Applications (1)

Application Number: CN202110571615.XA; Publication: CN115393491A (en), Pending; Priority Date: 2021-05-25; Filing Date: 2021-05-25; Title: Ink video generation method and device based on instance segmentation and reference frame

Country Status (1)

Country Link
CN (1) CN115393491A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117857842A (en) * 2024-03-07 2024-04-09 淘宝(中国)软件有限公司 Image quality processing method in live broadcast scene and electronic equipment
CN117857842B (en) * 2024-03-07 2024-05-28 淘宝(中国)软件有限公司 Image quality processing method in live broadcast scene and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination