CN115393491A - Ink video generation method and device based on instance segmentation and reference frame - Google Patents

Info

Publication number
CN115393491A
CN115393491A (application CN202110571615.XA)
Authority
CN
China
Prior art keywords: ink, real, frame, video, wash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110571615.XA
Other languages
Chinese (zh)
Inventor
刘家瑛
梁浩
汪文靖
杨帅
郭宗明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Peking University
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202110571615.XA
Publication of CN115393491A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/02 Non-photorealistic rendering
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for generating an ink-wash video based on instance segmentation and reference frames. The method comprises: obtaining an instance segmentation map of each frame in a real video, calculating the similarity between frames according to the optical flow estimated between them, and selecting the frame to be converted and its reference frames; and inputting the optical flow estimation, the frame to be converted, the reference frames and their instance segmentation maps into an ink-wash stylization network model to obtain the ink-wash video frame of the frame to be converted, thereby obtaining the ink-wash video of the real video. The invention improves the ink-wash style generation capability of the model and, compared with the prior art, achieves better results in blank-leaving (white space), in avoiding stroke entanglement between different objects, and in handling objects of different scales; it realizes reference-guided ink-wash stylization, avoids the error accumulation of conventional sequential conversion, and reduces flicker and inconsistency in the generated ink-wash video compared with existing methods.

Description

Ink video generation method and device based on instance segmentation and reference frame
Technical Field
The invention belongs to the field of video stylization, and particularly relates to a method and a device for generating an ink-wash video based on instance segmentation and reference frames.
Background
Ink-wash video generation aims, given any real captured video, to generate an ink-wash style video with the same content, so that ink-wash videos can be produced in batches. The task can roughly be divided into two parts: first, each generated frame should have a proper ink-wash style; second, the stylized video should remain coherent and consistent. Each part has been studied separately, but there has been no attempt to combine them properly and obtain better results.
Regarding the former, most current image ink-wash stylization systems are adaptations of general-purpose stylization methods and do not take the characteristics of the ink-wash style into account, so the generated style suffers from the following problems: 1) it is difficult to correctly select the blank (white-space) areas; 2) strokes of different objects easily become entangled; 3) it is difficult to handle objects of different scales simultaneously. Because of these problems, ink-wash video generation systems based on such methods can hardly meet the requirements of practical applications.
Regarding the latter, because ink-wash paintings contain large blank areas, the results of conventional video stylization methods exhibit considerable flicker and inconsistency, so they are difficult to transfer directly to the ink-wash video generation task.
Disclosure of Invention
In view of the above problems, the invention provides a method and a device for generating an ink-wash video based on instance segmentation and reference frames, which obtain a better ink-wash style than existing image ink-wash stylization methods while keeping the generated ink-wash video coherent and consistent. The method converts a real captured video into an ink-wash style video with the same content, realizes batch generation of ink-wash videos, and improves subjective visual quality and artistic effect.
The technical solution adopted by the invention is as follows:
An ink-wash video generation method based on instance segmentation and reference frames comprises the following steps:
1) Acquiring an instance segmentation map of each real video frame in a real video, and calculating the similarity between any two real video frames according to the optical flow estimated between all real video frames;
2) Selecting, according to the similarity, the real video frame to be converted and one or more reference frames for each round, inputting the real video frame to be converted and the corresponding reference frames into an ink-wash stylization network model, and obtaining the ink-wash video frame of the real video frame to be converted through the following steps:
2.1) Extracting real video features and the corresponding reference features from the real video frame to be converted, each reference frame and the corresponding instance segmentation maps, and aligning each reference feature with the real video features using the optical flow estimation;
2.2) Calculating a reference feature score for each reference feature according to the difference between that reference feature and the real video frame features, and fusing all reference features through the reference feature scores to obtain a fused reference feature;
2.3) Decoding the modified real video features, obtained from the real video features and the fused reference feature, to obtain the ink-wash video frame of the real video frame to be converted;
3) Repeating step 2) to generate the ink-wash video of the real video.
Further, the real video frame to be converted and the reference frames of each round are selected through the following steps (a selection sketch follows this list):
1) Selecting, according to the similarity, the frame with the lowest similarity to all converted frames and the highest similarity to all unconverted frames as the real video frame to be converted in this round;
2) Selecting, according to the similarity, several converted frames with the highest similarity to the real video frame to be converted as reference frames; when the first ink-wash video frame is generated, the number of reference frames is 0.
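A minimal sketch of the selection rule above. How the two criteria (low similarity to converted frames, high similarity to unconverted frames) are weighed against each other is not specified in the text, so the sketch simply scores each candidate by their difference; it assumes a precomputed pairwise similarity matrix derived from the optical-flow estimates, and all function names are illustrative.

```python
import numpy as np

def select_frame_to_convert(similarity, converted):
    """Pick the unconverted frame with the lowest total similarity to the
    already-converted frames and the highest total similarity to the
    remaining unconverted frames."""
    n = similarity.shape[0]
    unconverted = [i for i in range(n) if i not in converted]

    def score(i):
        to_converted = sum(similarity[i, j] for j in converted)
        to_unconverted = sum(similarity[i, j] for j in unconverted if j != i)
        return to_unconverted - to_converted   # higher is better

    return max(unconverted, key=score)

def select_reference_frames(similarity, converted, target, k=2):
    """Return up to k converted frames most similar to the target frame;
    the list is empty in the first round, as in the method."""
    ranked = sorted(converted, key=lambda j: similarity[target, j], reverse=True)
    return ranked[:k]

# Example with a toy 4-frame similarity matrix.
sim = np.array([[1.0, 0.9, 0.3, 0.2],
                [0.9, 1.0, 0.4, 0.3],
                [0.3, 0.4, 1.0, 0.8],
                [0.2, 0.3, 0.8, 1.0]])
first = select_frame_to_convert(sim, converted=[])
refs = select_reference_frames(sim, converted=[], target=first)
```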
Further, the real video features are obtained through the following steps (see the feature-extraction sketch below):
1) Concatenating the real video frame to be converted with the corresponding instance segmentation map, and encoding the concatenation result to obtain global features;
2) For each instance i in the instance segmentation map and its corresponding region image x_i, concatenating the region image x_i with its instance segmentation map along the channel dimension, and enlarging the result to the image size of the real video frame to be converted;
3) Extracting image features from the enlarged image and reducing them back to the feature size of the region image x_i, obtaining the instance feature H_i;
4) Fusing the global features with the instance features H_i to obtain the real video features.
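The sketch below illustrates the global-plus-instance feature extraction described above. The encoder is assumed to map a 4-channel input (RGB frame plus one segmentation channel) to a feature map at 1/4 resolution, matching the stride-2 convolutions described later; the exact fusion formula is given only as an image in the original publication, so writing each instance feature back into its region of the global feature map is an assumption, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def extract_real_video_features(encoder, frame, inst_masks):
    """frame: 3 x H x W tensor; inst_masks: N x H x W 0/1 instance masks.
    encoder: any module mapping a 4-channel image to a C x H/4 x W/4 feature map."""
    H, W = frame.shape[-2:]
    seg = (inst_masks > 0).any(dim=0, keepdim=True).float()      # union of instance masks
    h = encoder(torch.cat([frame, seg], dim=0)[None]).clone()    # global feature H_G
    fh, fw = h.shape[-2:]
    for mask in inst_masks:
        ys, xs = mask.nonzero(as_tuple=True)
        if ys.numel() == 0:
            continue
        y0, y1 = int(ys.min()), int(ys.max()) + 1
        x0, x1 = int(xs.min()), int(xs.max()) + 1
        # Enlarge the instance region (image + its mask) to the input size and encode it.
        x_i = torch.cat([frame[:, y0:y1, x0:x1],
                         mask[None, y0:y1, x0:x1].float()], dim=0)[None]
        x_i = F.interpolate(x_i, size=(H, W), mode='bilinear', align_corners=False)
        h_i = encoder(x_i)
        # Reduce the instance feature H_i back to the region size at feature resolution
        # and write it into the corresponding area of the global feature (assumed fusion).
        ry0, rx0 = y0 * fh // H, x0 * fw // W
        rh = max(1, min((y1 - y0) * fh // H, fh - ry0))
        rw = max(1, min((x1 - x0) * fw // W, fw - rx0))
        h_i = F.interpolate(h_i, size=(rh, rw), mode='bilinear', align_corners=False)
        h[:, :, ry0:ry0 + rh, rx0:rx0 + rw] = h_i
    return h
```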
Further, the fused reference feature is obtained by:
1) Applying a softmax transformation to the reference feature scores at each position to obtain the corresponding coefficients;
2) Performing a weighted summation over the reference features at each position according to these coefficients to obtain the fused reference feature.
Further, the modified real video features are obtained through the following steps (see the fusion and modification sketch below):
1) Scoring the difference between the fused reference feature and the real video frame features to obtain a fused reference feature score;
2) Modifying the real video frame features by a weighted summation based on the fused reference feature score, obtaining the modified real video features.
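A compact sketch of steps 2.2) and 2.3) above: reference fusion followed by feature modification. The scoring networks stand in for the reference frame feature fusion module W_r and the feature modifier W described later; since the exact modification formula is given only as an image in the original publication, the per-position blend controlled by a sigmoid of the score is an assumption.

```python
import torch

def fuse_and_modify(score_net, modify_net, h, aligned_refs):
    """h: 1 x C x Hf x Wf feature of the frame to convert; aligned_refs: list of
    flow-aligned reference features of the same shape. score_net and modify_net
    both map a feature difference to a score map of the same shape."""
    if not aligned_refs:
        return h                                   # first round: no references, no change
    # Score each reference against the current frame and softmax across references.
    scores = torch.stack([score_net(r - h) for r in aligned_refs], dim=1)  # 1 x K x C x Hf x Wf
    weights = torch.softmax(scores, dim=1)
    r_f = (torch.stack(aligned_refs, dim=1) * weights).sum(dim=1)          # fused reference R_f
    # Modify the frame feature with the fused reference (blend rule is an assumption).
    w = torch.sigmoid(modify_net(r_f - h))
    return w * r_f + (1.0 - w) * h
```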
Further, the structure of the ink-wash stylization network model comprises: an encoder, a reference frame feature fusion module, a feature modification module and a decoder. The encoder comprises several convolutional layers and several residual blocks, where each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1. The reference frame feature fusion module and the feature modification module each comprise a convolutional layer, several alternating residual blocks and convolutional layers, and a final convolutional layer. The decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, where each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer. A structural sketch is given below.
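A structural sketch of the encoder and decoder, using the layer counts given in the detailed description below (3 convolutions plus 6 residual blocks, mirrored in the decoder); kernel sizes, channel widths and the tanh output layer are assumptions not stated in the text.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

def make_encoder(in_ch=4, ch=64):
    """3 convolutions (BN + ReLU after each, stride 2 from the second) + 6 residual blocks."""
    return nn.Sequential(
        nn.Conv2d(in_ch, ch, 7, padding=3), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.BatchNorm2d(2 * ch), nn.ReLU(inplace=True),
        nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.BatchNorm2d(4 * ch), nn.ReLU(inplace=True),
        *[ResidualBlock(4 * ch) for _ in range(6)])

def make_decoder(out_ch=3, ch=64):
    """Mirror of the encoder: 6 residual blocks, then 3 convolutions with
    nearest-neighbour upsampling before every convolution except the last."""
    return nn.Sequential(
        *[ResidualBlock(4 * ch) for _ in range(6)],
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(4 * ch, 2 * ch, 3, padding=1), nn.BatchNorm2d(2 * ch), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(2 * ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        nn.Conv2d(ch, out_ch, 3, padding=1), nn.Tanh())   # output activation assumed
```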
Further, the ink-wash stylization network model F is trained by:
1) Collecting several sets of training data for iterative training, where each set of training data comprises: a real sample video, a real sample frame taken from the real sample video, several sample reference frames taken from the real sample video, and a randomly selected real ink-wash painting;
2) Acquiring the instance segmentation maps of the real sample frame and of each sample reference frame, calculating the optical flow between the real sample frame and each sample reference frame, and inputting the real sample frame, each sample reference frame, the corresponding instance segmentation maps and the optical flows into the ink-wash stylization network model to obtain an ink-wash video sample frame;
3) Inputting the ink-wash video sample frame and the real ink-wash painting into an ink-wash painting discriminator and an ink-wash style discriminator for discrimination, where each of the two discriminators consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
4) Inputting the real ink-wash painting into a real picture generator to obtain a generated real picture, and inputting the real video sample frame and the generated real picture into a real picture discriminator for discrimination, where the real picture generator consists of an encoder and a decoder: the encoder comprises several convolutional layers and several residual blocks, each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1; the decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer; the real picture discriminator consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
5) Inputting the ink-wash video sample frame into the real picture generator to obtain a reconstructed real video frame, and comparing it with the real sample frame;
6) Inputting the generated real picture and its instance segmentation map into the ink-wash stylization network model to obtain a reconstructed ink-wash painting, and comparing it with the real ink-wash painting;
7) Using the discrimination and comparison of steps 3)-6), computing all parameters of the ink-wash stylization network model from the resulting adversarial loss L_adv, cyclic reconstruction loss L_cycle, consistency loss L_cons, temporal loss L_tmp, white-space loss L_whiten, contour loss L_contour, stroke loss L_stroke and ink loss L_ink.
Further, the adversarial loss L_adv = E_x[log(1 - D_Y(F(x)))] + E_y[log(1 - D_X(B(y)))], where x and y are samples drawn from the real data set and the ink-wash data set respectively, E_x and E_y are the mathematical expectations over these two distributions, D_Y is the ink-wash painting discriminator, D_X is the real picture discriminator, F is the ink-wash stylization network model and B is the real picture generator. The cyclic reconstruction loss L_cycle = γ·E_x[||B(F(x)) - x||_1] + E_y[||F(B(y)) - y||_1], where γ is a coefficient balancing the two parts of L_cycle. The consistency loss L_cons = L_flow + L_scale; the formulas of L_flow and L_scale are given as images in the original publication, with y_ω = ω(F(x), f), H_ω = ω(E(x), f↓), where ω(I, w) denotes warping the picture I according to the optical flow w, f is the optical flow between the real sample frame and a sample reference frame, f↓ is the optical flow downsampled to the feature-matrix size, M_o is the occlusion (shadow) mask of the optical flow, H_n is the feature extracted after enlarging the region of the n-th instance, s is the size-reduction operation and N is the number of instances. The temporal loss L_tmp is likewise given as an image; in it, H' is the modified real sample frame feature, {r_1, ..., r_k} are the reference frames, {f_1, ..., f_k} are the optical flows between the corresponding reference frames and the current frame, {M_o,1, ..., M_o,k} are the occlusion masks of those optical flows, and M_o,c is a combined mask whose formula is given as an image. The white-space loss L_whiten (formula given as an image) uses the foreground mask M_f and the background mask M_b = 1 - M_f. The contour loss L_contour (formula given as an image) uses Haar(I), the high-frequency components of the Haar wavelet transform of image I, and M_c, an edge mask computed from the instance segmentation map that is 1 in object contour regions and 0 elsewhere. The stroke loss L_stroke (formula given as an image) uses the edge extractor HED, denoted Edg, T, the number of pixels in the image, and μ = T_/T, where T_ is the sum of the weights of the non-edge points in the Edg detection result. The ink loss L_ink = E_y[log D_Y(Gauss(y))] + E_x[log(1 - D_Y(Gauss(F(x))))], where Gauss denotes a Gaussian blur operation.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method described above.
Compared with the prior art, the invention adds an instance segmentation input, adjusts the feature extraction scheme and guides training with suitable loss functions, which improves the ink-wash style generation capability of the model and achieves better results than existing methods in white-space selection, in avoiding stroke entanglement between different objects, and in handling objects of different scales. By fusing the features of several reference frames and modifying the features of the real video frame, reference-guided ink-wash stylization is realized; by selecting the conversion order and the reference frames appropriately, the error accumulation of conventional sequential conversion is avoided, and flicker and inconsistency in the generated ink-wash video are reduced compared with existing methods.
Drawings
FIG. 1 is an overall framework diagram of the ink-wash stylization network model used in an embodiment of the present invention.
FIG. 2 is a flow diagram detailing the encoder of the ink-wash stylization network model for a single picture.
FIG. 3 is a network structure diagram of the ink-wash stylization network model used in an embodiment of the present invention.
FIG. 4 is a network structure diagram of the ink-wash style discriminator, the ink-wash painting discriminator, the real picture generator and the real picture discriminator used in an embodiment of the present invention.
FIG. 5 shows an input video frame according to an embodiment of the present invention.
FIG. 6 is an ink-wash video frame generated from FIG. 5.
FIG. 7 compares single-picture to ink-wash-picture conversion between the present invention and existing methods.
FIG. 8 compares real-video to ink-wash-video conversion between the present invention and existing methods.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying figures. It should be noted that the specific numbers of layers and modules, the choices of functions, the arrangement of particular layers, etc. given in the following examples are only a preferred implementation, and the invention is not limited thereto.
This embodiment discloses an ink-wash video generation method based on instance segmentation and reference frames, which uses a training data set to adversarially train an ink-wash stylization network model, an ink-wash style discriminator, an ink-wash painting discriminator, a real picture generator and a real picture discriminator. The ink-wash stylization network model comprises an encoder, a reference frame feature fusion module, a feature modification module and a decoder. The encoder comprises several successive convolutional layers and several residual blocks; each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1. The reference frame feature fusion module and the feature modification module each consist of a convolutional layer, several alternating residual blocks and convolutional layers, and a final convolutional layer. The decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers; each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer. The ink-wash style discriminator, the ink-wash painting discriminator and the real picture discriminator each consist of several convolutional layers; each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, the first layer is followed only by a linear rectification function, and downsampling uses a stride greater than 1. The real picture generator consists of an encoder and a decoder with the same structure as those in the ink-wash stylization network model.
Taking the conversion of a target video into an ink-wash style video as an example, the method is described as follows:
Step 1: Collect several real videos, images selected from the real videos, and several randomly selected ink-wash paintings; the photographic images form the real data set and the ink-wash paintings form the ink-wash data set.
Step 2: Build the ink-wash stylization network model, the ink-wash style discriminator, the ink-wash painting discriminator, the real picture generator and the real picture discriminator.
The network structures are shown in FIGS. 1-4. The ink-wash stylization network model is divided into an encoder E, a reference frame feature fusion module W_r, a feature modifier W and a decoder G; the whole ink-wash stylization network model is denoted F. In addition, there are an ink-wash style discriminator D_ink, an ink-wash painting discriminator D_Y, a real picture generator B and a real picture discriminator D_X.
The encoder E consists of 3 successive convolutional layers followed by 6 residual blocks; each convolutional layer is followed by a batch normalization module and a linear rectification function (ReLU), and the 2nd and 3rd convolutional layers downsample using convolutions of stride 2. In use, the input frame and its instance segmentation map are first concatenated along the channel dimension and passed through this network to obtain the global feature H_G. Then, for each instance i and its corresponding region image x_i, x_i and its instance segmentation map are concatenated along the channel dimension, enlarged to the input frame size, passed through the network to extract features, and reduced back to the feature size of the region image x_i; this is recorded as the instance feature H_i. Finally, the output of the network is obtained by fusing H_G and the instance features according to the instance segmentation maps (the fusion formula is given as an image in the original publication), where S_i is the instance segmentation map of instance i, a single-channel 0/1 matrix of the same size as the original image that is 1 in the region of instance i and 0 elsewhere.
The reference frame feature fusion module W_r consists of 1 convolutional layer, 5 alternating residual blocks and convolutional layers, and a final convolutional layer. In use, it receives the difference between each reference feature and the real video frame feature and generates a score matrix of the same size as the reference features; a softmax transformation is then applied over the reference feature scores at each position, and the reference features are weighted and summed position-wise with these coefficients to obtain the fused reference feature R_f.
The feature modifier W has the same structure as W_r. It receives and scores the difference between the fused reference feature and the real video frame feature; the score is a matrix w of the same size as the fused reference feature and the real video frame feature, and the real video frame feature is modified with the fused reference feature according to this score (the modification formula is given as an image in the original publication), where H' is the real video frame feature produced by the encoder.
The decoder G is symmetric to the encoder and comprises 6 residual blocks and 3 convolutional layers; each convolutional layer is followed by a linear rectification function, and nearest-neighbour upsampling is performed before every convolutional layer except the last. This module converts the modified real video frame features into an ink-wash style frame.
The real picture generator B consists of 3 convolutional layers, 12 residual blocks and 3 further convolutional layers; each convolutional layer is followed by an activation function, ReLU for every layer except the last convolutional layer, which uses tanh, and apart from the first and the last convolutional layers a batch normalization module is added after each convolutional layer. This module converts ink-wash images into realistic photographs.
The ink-wash style discriminator D_ink, the ink-wash painting discriminator D_Y and the real picture discriminator D_X each consist of several convolutional layers; each convolutional layer except the first and the last is followed by a batch normalization module and a leaky linear rectification function (LeakyReLU), the first layer is followed only by a LeakyReLU, and the first convolutional layer downsamples using a convolution of stride 2. These modules judge, respectively, whether an input image has the ink-wash style, whether it is an ink-wash painting, and whether it is a real photograph, and are used for adversarial training against the generative models F and B. A discriminator sketch is given below.
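A sketch of the shared discriminator layout described above; the number of layers, the kernel size and the stride pattern of the intermediate convolutions are assumptions (the text only fixes the behaviour of the first layer), and a PatchGAN-style single-channel output map is likewise assumed.

```python
import torch.nn as nn

def make_discriminator(in_ch=3, ch=64, n_layers=4):
    """First convolution: stride 2, followed only by LeakyReLU; intermediate
    convolutions add batch normalization; the last convolution outputs a
    single-channel score map."""
    layers = [nn.Conv2d(in_ch, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True)]
    c = ch
    for _ in range(n_layers - 2):
        layers += [nn.Conv2d(c, c * 2, 4, stride=2, padding=1),
                   nn.BatchNorm2d(c * 2), nn.LeakyReLU(0.2, inplace=True)]
        c *= 2
    layers += [nn.Conv2d(c, 1, 4, padding=1)]
    return nn.Sequential(*layers)

# D_ink, D_Y and D_X share this layout: D_ink and D_Y see (blurred) ink-wash
# images, D_X sees real photographs.
```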
Step 3: Train the ink-wash stylization network model F, the ink-wash style discriminator D_ink, the ink-wash painting discriminator D_Y, the real picture generator B and the real picture discriminator D_X, alternately optimizing the generative models F, B and the discriminative models D_ink, D_Y, D_X.
For the generative models F and B, the loss function includes: the adversarial loss L_adv, the cyclic reconstruction loss L_cycle, the consistency loss L_cons, the temporal loss L_tmp, the white-space loss L_whiten, the contour loss L_contour, the stroke loss L_stroke and the ink loss L_ink.
Here L_adv = E_x[log(1 - D_Y(F(x)))] + E_y[log(1 - D_X(B(y)))] and L_cycle = γ·E_x[||B(F(x)) - x||_1] + E_y[||F(B(y)) - y||_1], where x and y are samples drawn from the real data set and the ink-wash data set respectively, E_x and E_y are the mathematical expectations over these two distributions, D_Y is the ink-wash painting discriminator, D_X is the real picture discriminator, F is the ink-wash stylization network model, B is the real picture generator, and γ is a coefficient balancing the two parts of L_cycle. These loss terms guide the ink-wash stylization network model to generate a genuine ink-wash style, guide the real picture generator to generate a realistic photo style, and make the two mappings approximately inverse to each other; a sketch follows.
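A direct transcription of L_adv and L_cycle as printed above, with F's extra inputs (segmentation maps, reference features, optical flow) omitted for brevity; the discriminators are assumed to output probabilities in (0, 1), and the value of γ is illustrative.

```python
import torch

def generator_losses(F_gen, B_gen, D_Y, D_X, x, y, gamma=10.0, eps=1e-8):
    """x: batch of real video frames; y: batch of real ink-wash paintings."""
    fake_ink = F_gen(x)            # F(x): real frame -> ink-wash style
    fake_real = B_gen(y)           # B(y): ink-wash painting -> realistic photo
    l_adv = (torch.log(1 - D_Y(fake_ink) + eps).mean()
             + torch.log(1 - D_X(fake_real) + eps).mean())
    l_cycle = (gamma * (B_gen(fake_ink) - x).abs().mean()
               + (F_gen(fake_real) - y).abs().mean())
    return l_adv, l_cycle
```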
L_cons = L_flow + L_scale. The first part, L_flow, is defined by a formula given as an image in the original publication, with y_ω = ω(F(x), f) and H_ω = ω(E(x), f↓), where ω(I, w) denotes warping the picture I according to the optical flow w, f is the optical flow between the sampled video frame and another frame, f↓ is the optical flow downsampled to the feature-matrix size, and M_o is the occlusion (shadow) mask of the optical flow. L_flow guides the model to establish optical-flow consistency between the features learned by the encoder and the ink-wash images generated by the decoder. The second part, L_scale, is also given as an image; in it, H_n is the feature extracted after enlarging the region of the n-th instance (as described for the encoder) and s is the size-reduction operation. L_scale guides the model to establish scale consistency between the features learned by the encoder and the ink-wash images generated by the decoder.
The temporal loss L_tmp is defined by a formula given as an image in the original publication, in which H' is the modified real video frame feature, {r_1, ..., r_k} are the reference frames, {f_1, ..., f_k} are the optical flows between the corresponding reference frames and the current frame, {M_o,1, ..., M_o,k} are the occlusion masks of those optical flows, and M_o,c is a combined mask whose formula is likewise given as an image. This loss guides the model to generate an ink-wash image that is consistent with the reference frames in regions where the optical flow is reliable, and to generate the remaining regions as in the case without reference frames.
The white-space loss L_whiten is defined by a formula given as an image in the original publication, where M_f is the foreground mask, generated directly from the instance segmentation results, and M_b = 1 - M_f is the background mask. This loss guides the model to leave the background area blank and to apply more ink to the foreground.
The contour loss L_contour is defined by a formula given as an image in the original publication, where Haar(I) denotes applying the Haar wavelet transform to image I and extracting its high-frequency components, and M_c is an edge mask computed from the instance segmentation map that is 1 in object contour regions and 0 elsewhere. This loss reduces stroke entanglement between different objects.
The stroke loss L_stroke is defined by a formula given as an image in the original publication, where Edg is the edge extractor HED, T is the number of pixels in the image, and μ = T_/T, where T_ is the sum of the weights of the non-edge points in the Edg detection result; this coefficient balances the loss between edge and non-edge pixels. The loss constrains the generated ink-wash image and the original image to have similar lines and contours.
L_ink = E_y[log D_Y(Gauss(y))] + E_x[log(1 - D_Y(Gauss(F(x))))], where Gauss denotes a Gaussian blur operation. This loss optimizes the ink texture of the generated ink-wash image.
For the discriminative models D_X, D_Y and D_ink, the losses include L_adv_X = E_y[log(1 - D_Y(y))] + E_x[log D_Y(F(x))], L_adv_Y = E_x[log(1 - D_X(x))] + E_y[log D_X(B(y))] and L_adv_ink = E_y[log(1 - D_ink(Gauss(y)))] + E_x[log D_ink(Gauss(F(x)))]. These loss terms guide the discriminative models to distinguish generated ink-wash images from real ink-wash paintings, to distinguish generated real pictures from real photographs, and to distinguish the generated ink texture from the real ink texture, respectively. A sketch of the ink-texture and discriminator losses follows.
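A sketch of L_ink and the three discriminator losses as printed above. Gaussian blurring uses torchvision's gaussian_blur; the kernel size and sigma are illustrative, discriminator outputs are assumed to be probabilities, and the generated images are detached for the discriminator-side terms.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def ink_and_discriminator_losses(F_gen, B_gen, D_X, D_Y, D_ink, x, y,
                                 ksize=21, sigma=5.0, eps=1e-8):
    blur = lambda t: gaussian_blur(t, [ksize, ksize], [sigma, sigma])
    fake_ink, fake_real = F_gen(x), B_gen(y)

    # Generator-side ink loss L_ink (gradients flow back into F).
    l_ink = (torch.log(D_Y(blur(y)) + eps).mean()
             + torch.log(1 - D_Y(blur(fake_ink)) + eps).mean())

    # Discriminator-side losses on detached generator outputs.
    fi, fr = fake_ink.detach(), fake_real.detach()
    l_D_Y = torch.log(1 - D_Y(y) + eps).mean() + torch.log(D_Y(fi) + eps).mean()
    l_D_X = torch.log(1 - D_X(x) + eps).mean() + torch.log(D_X(fr) + eps).mean()
    l_D_ink = (torch.log(1 - D_ink(blur(y)) + eps).mean()
               + torch.log(D_ink(blur(fi)) + eps).mean())
    return l_ink, l_D_Y, l_D_X, l_D_ink
```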
The training procedure comprises:
computing the instance segmentation of the real video frame and the reference frames (marking the regions of different objects) and the optical flow between them, and inputting these into the ink-wash stylization network model to obtain a converted ink-wash video frame;
inputting the generated ink-wash video frame and a real ink-wash painting into the ink-wash painting discriminator and the ink-wash style discriminator for discrimination, and computing the losses (L_adv and L_ink) used for the adversarial training between the discriminators and the generator, so that the generated result looks more realistic;
inputting the ink-wash painting into the real picture generator to obtain a generated real picture;
inputting the real video frame and the generated real picture into the real picture discriminator for discrimination;
inputting the generated ink-wash video frame into the real picture generator to obtain a reconstructed real video frame;
comparing the reconstructed real video frame with the input real video frame and computing the loss (L_cycle), which ensures that no content of the original image is lost during generation, i.e. that the generated ink-wash video frame and the input real video frame share the same content structure;
inputting the generated real picture and its instance segmentation map into the ink-wash stylization network model to obtain a reconstructed ink-wash painting;
comparing the reconstructed ink-wash painting with the input ink-wash painting and likewise computing the loss (L_cycle) to keep the content consistent during generation. One alternating training round is sketched below.
Step 4: Inference. A really captured video is input (see FIG. 5); its instance segmentation is computed and optical flow is estimated between every pair of frames, and the desired ink-wash style video is output (see FIG. 6; for ease of demonstration only some frames of the video are shown). The procedure, summarized by the sketch after this list, comprises the following steps:
1) Computing the optical flows between all frames, and measuring the similarity of two frames by the magnitude of the optical flow between them;
2) Selecting the frame with the lowest similarity to all converted frames and the highest similarity to all other unconverted frames as the real video frame of this round;
3) Selecting several converted frames most similar to the selected real video frame as reference frames; in the first round there are, as a special case, no reference frames;
4) Taking the features corresponding to the reference frames at the decoder of the ink-wash stylization network model as reference features, and inputting them together with the selected real video frame, its instance segmentation map and the optical flows between the real video frame and the reference frames into the ink-wash stylization network model to obtain the corresponding ink-wash video frame; if there are no reference frames, the features extracted from the real video frame are fed directly into the decoder of the ink-wash stylization network to generate the corresponding ink-wash frame;
5) Returning to step 2) until all frames have been converted; finally, the converted frames are assembled into the ink-wash video in their original order.
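The loop below summarizes steps 1)-5) above, reusing the selection helpers sketched earlier in the Disclosure section; the `model` callable wraps the full stylization pipeline (encoder, reference fusion, feature modification, decoder) and `flow_fn` stands for any optical-flow estimator, both illustrative.

```python
def convert_video(frames, seg_maps, similarity, flow_fn, model, k=2):
    """frames / seg_maps: per-frame images and instance segmentation maps;
    similarity: pairwise similarity matrix derived from optical-flow magnitudes."""
    n = len(frames)
    ink_frames = [None] * n
    converted = []
    while len(converted) < n:
        t = select_frame_to_convert(similarity, converted)            # defined earlier
        refs = select_reference_frames(similarity, converted, t, k)   # empty in round 1
        flows = [flow_fn(frames[r], frames[t]) for r in refs]
        ink_frames[t] = model(frames[t], seg_maps[t],
                              [frames[r] for r in refs],
                              [seg_maps[r] for r in refs], flows)
        converted.append(t)
    return ink_frames        # already in the original frame order
```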
Experimental results:
Single-picture to ink-wash-picture conversion was compared with existing image ink-wash stylization and image stylization methods (see FIG. 7; ChipGAN is an existing image ink-wash stylization method, and AdaIN and WCT are existing image stylization methods); the present method obtains a better ink-wash effect on single pictures than the existing methods. Real-video to ink-wash-video conversion was also compared with existing image ink-wash stylization and video stylization methods (see FIG. 8; each pair of adjacent columns shows video frames and enlarged views of the corresponding regions; ChipGAN is an existing image ink-wash stylization method, applied to each frame of the video separately to stylize the video, while Linear and Compound are existing video stylization methods).
It is easy to understand that the specific structure of each network in the method, such as the number of convolutional layers, the type of nonlinear activation function and the regularization method, can be replaced by other structures; the optical flow estimation method, the instance segmentation method and the edge detection method used here may likewise be replaced with other methods having the same or similar functions.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. An ink-wash video generation method based on instance segmentation and reference frames, comprising the following steps:
1) Acquiring an instance segmentation map of each real video frame in a real video, and calculating the similarity between any two real video frames according to the optical flow estimated between all real video frames;
2) Selecting, according to the similarity, the real video frame to be converted and one or more reference frames for each round, inputting the real video frame to be converted and the corresponding reference frames into an ink-wash stylization network model, and obtaining the ink-wash video frame of the real video frame to be converted through the following steps:
2.1) Extracting real video features and the corresponding reference features from the real video frame to be converted, each reference frame and the corresponding instance segmentation maps, and aligning each reference feature with the real video features using the optical flow estimation;
2.2) Calculating a reference feature score for each reference feature according to the difference between that reference feature and the real video frame features, and fusing all reference features through the reference feature scores to obtain a fused reference feature;
2.3) Decoding the modified real video features, obtained from the real video features and the fused reference feature, to obtain the ink-wash video frame of the real video frame to be converted;
3) Repeating step 2) to generate the ink-wash video of the real video.
2. The method of claim 1, wherein the real video frame to be converted and the reference frames of each round are selected through the following steps:
1) Selecting, according to the similarity, the frame with the lowest similarity to all converted frames and the highest similarity to all unconverted frames as the real video frame to be converted in this round;
2) Selecting, according to the similarity, several converted frames with the highest similarity to the real video frame to be converted as reference frames, wherein the number of reference frames is 0 when the first ink-wash video frame is generated.
3. The method of claim 1, wherein the real video features are obtained by:
1) Concatenating the real video frame to be converted with the corresponding instance segmentation map, and encoding the concatenation result to obtain global features;
2) For each instance i in the instance segmentation map and its corresponding region image x_i, concatenating the region image x_i with its instance segmentation map along the channel dimension, and enlarging the result to the image size of the real video frame to be converted;
3) Extracting image features from the enlarged image and reducing them back to the feature size of the region image x_i, obtaining the instance feature H_i;
4) Fusing the global features with the instance features H_i to obtain the real video features.
4. The method of claim 1, wherein the fused reference feature is obtained by:
1) Applying a softmax transformation to the reference feature scores at each position to obtain the corresponding coefficients;
2) Performing a weighted summation over the reference features at each position according to these coefficients to obtain the fused reference feature.
5. The method of claim 1, wherein the modified real video features are obtained by:
1) Scoring the difference between the fused reference feature and the real video frame features to obtain a fused reference feature score;
2) Modifying the real video frame features by a weighted summation based on the fused reference feature score, obtaining the modified real video features.
6. The method of claim 1, wherein the structure of the ink-wash stylization network model comprises: an encoder, a reference frame feature fusion module, a feature modification module and a decoder; the encoder comprises several convolutional layers and several residual blocks, wherein each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1; the reference frame feature fusion module and the feature modification module each comprise a convolutional layer, several alternating residual blocks and convolutional layers, and a final convolutional layer; the decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, wherein each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer.
7. The method of claim 1, wherein the ink-wash stylization network model F is trained by:
1) Collecting several sets of training data for iterative training, wherein each set of training data comprises: a real sample video, a real sample frame taken from the real sample video, several sample reference frames taken from the real sample video, and a randomly selected real ink-wash painting;
2) Acquiring the instance segmentation maps of the real sample frame and of each sample reference frame, calculating the optical flow between the real sample frame and each sample reference frame, and inputting the real sample frame, each sample reference frame, the corresponding instance segmentation maps and the optical flows into the ink-wash stylization network model to obtain an ink-wash video sample frame;
3) Inputting the ink-wash video sample frame and the real ink-wash painting into an ink-wash painting discriminator and an ink-wash style discriminator for discrimination, wherein each of the two discriminators consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
4) Inputting the real ink-wash painting into a real picture generator to obtain a generated real picture, and inputting the real video sample frame and the generated real picture into a real picture discriminator for discrimination, wherein the real picture generator consists of an encoder and a decoder: the encoder comprises several convolutional layers and several residual blocks, each convolutional layer is followed by a batch normalization module and a linear rectification function, and from the second convolutional layer onward each convolutional layer downsamples with a stride greater than 1; the decoder is symmetric to the encoder and comprises several residual blocks and several convolutional layers, each convolutional layer is followed by a linear rectification function, nearest-neighbour upsampling is performed before every convolutional layer except the last, and a batch normalization module is added after each convolutional layer; the real picture discriminator consists of several convolutional layers, each convolutional layer except the first and the last is followed by a batch normalization module and a linear rectification function, and the first layer is followed only by a linear rectification function and downsamples with a stride greater than 1;
5) Inputting the ink-wash video sample frame into the real picture generator to obtain a reconstructed real video frame, and comparing it with the real sample frame;
6) Inputting the generated real picture and its instance segmentation map into the ink-wash stylization network model to obtain a reconstructed ink-wash painting, and comparing it with the real ink-wash painting;
7) Using the discrimination and comparison of steps 3)-6), computing all parameters of the ink-wash stylization network model from the resulting adversarial loss L_adv, cyclic reconstruction loss L_cycle, consistency loss L_cons, temporal loss L_tmp, white-space loss L_whiten, contour loss L_contour, stroke loss L_stroke and ink loss L_ink.
8. The method of claim 7, wherein the adversarial loss L_adv = E_x[log(1 - D_Y(F(x)))] + E_y[log(1 - D_X(B(y)))], where x and y are samples drawn from the real data set and the ink-wash data set respectively, E_x and E_y are the mathematical expectations over these two distributions, D_Y is the ink-wash painting discriminator, D_X is the real picture discriminator, F is the ink-wash stylization network model and B is the real picture generator; the cyclic reconstruction loss L_cycle = γ·E_x[||B(F(x)) - x||_1] + E_y[||F(B(y)) - y||_1], where γ is a coefficient balancing the two parts of L_cycle; the consistency loss L_cons = L_flow + L_scale, where the formulas of L_flow and L_scale are given as images in the original publication, with y_ω = ω(F(x), f), H_ω = ω(E(x), f↓), ω(I, w) denoting warping the picture I according to the optical flow w, f the optical flow between the real sample frame and a sample reference frame, f↓ the optical flow downsampled to the feature-matrix size, M_o the occlusion (shadow) mask of the optical flow, H_n the feature extracted after enlarging the region of the n-th instance, s the size-reduction operation and N the number of instances; the temporal loss L_tmp is likewise given as an image, in which H' is the modified real sample frame feature, {r_1, ..., r_k} are the reference frames, {f_1, ..., f_k} are the optical flows between the corresponding reference frames and the current frame, {M_o,1, ..., M_o,k} are the occlusion masks of those optical flows, and M_o,c is a combined mask whose formula is given as an image; the white-space loss L_whiten (formula given as an image) uses the foreground mask M_f and the background mask M_b = 1 - M_f; the contour loss L_contour (formula given as an image) uses Haar(I), the high-frequency components of the Haar wavelet transform of image I, and M_c, an edge mask computed from the instance segmentation map that is 1 in object contour regions and 0 elsewhere; the stroke loss L_stroke (formula given as an image) uses the edge extractor HED, denoted Edg, T, the number of pixels in the image, and μ = T_/T, where T_ is the sum of the weights of the non-edge points in the Edg detection result; and the ink loss L_ink = E_y[log D_Y(Gauss(y))] + E_x[log(1 - D_Y(Gauss(F(x))))], where Gauss denotes a Gaussian blur operation.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202110571615.XA 2021-05-25 2021-05-25 Ink video generation method and device based on instance segmentation and reference frame Pending CN115393491A (en)

Priority Applications (1)

Application Number: CN202110571615.XA; Publication: CN115393491A (en); Priority Date: 2021-05-25; Filing Date: 2021-05-25; Title: Ink video generation method and device based on instance segmentation and reference frame

Applications Claiming Priority (1)

Application Number: CN202110571615.XA; Publication: CN115393491A (en); Priority Date: 2021-05-25; Filing Date: 2021-05-25; Title: Ink video generation method and device based on instance segmentation and reference frame

Publications (1)

Publication Number: CN115393491A (en); Publication Date: 2022-11-25

Family

ID=84114468

Family Applications (1)

Application Number: CN202110571615.XA; Publication: CN115393491A (en), Pending; Priority Date: 2021-05-25; Filing Date: 2021-05-25; Title: Ink video generation method and device based on instance segmentation and reference frame

Country Status (1)

Country Link
CN (1) CN115393491A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117857842A (en) * 2024-03-07 2024-04-09 淘宝(中国)软件有限公司 Image quality processing method in live broadcast scene and electronic equipment
CN117857842B (en) * 2024-03-07 2024-05-28 淘宝(中国)软件有限公司 Image quality processing method in live broadcast scene and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination