CN114598833A - Video frame interpolation method based on spatio-temporal joint attention - Google Patents

Video frame interpolation method based on spatio-temporal joint attention

Info

Publication number
CN114598833A
Authority
CN
China
Prior art keywords
video frame
convolutional
layer
attention
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210305381.9A
Other languages
Chinese (zh)
Other versions
CN114598833B (en)
Inventor
路文
张弘毅
冯姣姣
张立泽
胡健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210305381.9A priority Critical patent/CN114598833B/en
Publication of CN114598833A publication Critical patent/CN114598833A/en
Application granted granted Critical
Publication of CN114598833B publication Critical patent/CN114598833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0135Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving interpolation processes
    • H04N7/014Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving interpolation processes involving the use of motion vectors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Television Systems (AREA)

Abstract

The invention provides a video frame interpolation method based on spatio-temporal joint attention, which comprises the following steps: (1) acquiring a training sample set and a test sample set; (2) constructing a video frame interpolation network based on spatio-temporal joint attention; (3) iteratively training the video frame interpolation network model; and (4) acquiring the video frame interpolation result. The video frame interpolation model based on spatio-temporal joint attention constructed by the invention uses a spatio-temporal attention mechanism to capture the spatio-temporal relation between input frames and to model complex motion, thereby completing high-quality video frame interpolation. Compared with most existing networks, the algorithm does not use additional optical flow input, which avoids the extra errors introduced by optical flow estimation, while its number of network parameters is low, giving it practical application value.

Description

Video frame interpolation method based on spatio-temporal joint attention
Technical Field
The invention belongs to the technical field of video processing, relates to a video frame interpolation method, and particularly relates to a video frame interpolation method based on spatio-temporal joint attention, which can be used in the fields of slow motion generation, video post-processing and the like.
Background
Low temporal resolution causes image aliasing and artifacts that degrade video quality, making it an important factor affecting video quality. Video frame interpolation inserts one or more intermediate frames between consecutive image frames to increase the temporal resolution and improve video quality.
Video frame interpolation methods typically consist of two parts: motion estimation and pixel synthesis. Motion estimation predicts the positions of the pixels of the intermediate frame by computing the motion of pixels between the preceding and following frames, and is divided into forward estimation and backward estimation; pixel synthesis then fuses the forward-estimated intermediate frame and the backward-estimated intermediate frame to obtain the intermediate frame. Early video frame interpolation mainly used optical flow methods to estimate the bidirectional optical flow between the two frames and synthesized the intermediate frame by forward or backward warping. With the development of deep learning, learning-based methods have also achieved good results in optical flow estimation and video frame interpolation. Most existing interpolation methods apply a pre-trained optical flow estimation network to generate bidirectional optical flow and map the two input frames to the intermediate frame. First, these methods depend on the reliability of the optical flow estimation algorithm: errors produced by the optical flow network propagate through the interpolation network and make the interpolation result inaccurate, and the optical flow estimation adds extra computation that lowers interpolation efficiency. Second, these networks restrict complex motion to linear or quadratic trajectories and therefore have difficulty estimating it. It is therefore important to design an interpolation model that does not depend on optical flow estimation and can accurately estimate complex motion trajectories.
Using attention mechanisms in neural networks allows a network to adaptively recalibrate its inputs according to the task and to focus on the parts that contribute more to completing it; with an attention mechanism, more complex motion can therefore be captured and reconstructed. Meanwhile, end-to-end training eliminates the estimation errors introduced by a separate optical flow model, reduces the number of model parameters, and significantly improves the accuracy and speed of frame interpolation.
The main factor influencing video frame interpolation is the accuracy of the predicted intermediate frame. The objective evaluation indices are the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM): the higher the PSNR, the higher the image quality, and likewise a higher SSIM indicates higher image quality.
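For reference only (not part of the patent), the PSNR between a predicted intermediate frame and its ground truth can be computed as in the following sketch, assuming images are float tensors normalized to [0, 1]:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images whose values lie in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```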
Junheum Park et al. proposed the optical-flow-based ABME algorithm in "Asymmetric Bilateral Motion Estimation for Video Frame Interpolation" at the International Conference on Computer Vision. It first computes the symmetric bilateral optical flow of the input frames and derives an anchor frame from it, then computes the asymmetric bilateral optical flow between the anchor frame and the input frames, uses this asymmetric bilateral optical flow to compute a preliminary intermediate frame, and finally refines the preliminary intermediate frame with a synthesis network to obtain the final interpolation result image.
Choi et al. proposed the CAIN algorithm in "Channel Attention Is All You Need for Video Frame Interpolation", which distributes the spatial feature information of each frame across the network channels and uses a channel attention mechanism to capture motion information. However, this algorithm fails to explicitly capture the dependencies along the time dimension between input frames, resulting in severe artifacts at the motion boundaries of the generated frames and inaccurate interpolation results.
Disclosure of Invention
The purpose of the invention is to provide a video frame interpolation method based on spatio-temporal joint attention that overcomes the defects of the prior art. It aims to effectively exploit the temporal correlation and spatial information of adjacent frames to estimate and generate complex nonlinear motion, thereby effectively improving the accuracy of the video frame interpolation algorithm, while avoiding an excessive number of parameters so as to further improve the efficiency of the algorithm.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set and a testing sample set:
preprocessing each of the selected V original videos, each comprising L image frames; marking respectively the odd-position image frames and the even-position image frames in the video frame sequence corresponding to each preprocessed original video; and then taking the R marked video frame sequences V1 = {V1^r | 1 ≤ r ≤ R} as the training sample set and the remaining S marked video frame sequences V2 = {V2^s | 1 ≤ s ≤ S} as the test sample set, where L > 5, V > 1000, S = V − R, V1^r denotes the r-th training video frame sequence, and V2^s denotes the s-th test video frame sequence;
(2) constructing a video frame interpolation network model f based on space-time joint attention:
constructing a video frame interpolation network model f comprising a feature extraction network, a space-time joint attention network and a 3D convolution network which are connected in sequence; wherein the feature extraction network comprises a plurality of 2D convolutional layers connected in sequence; the spatiotemporal attention network comprises a plurality of spatiotemporal attention modules connected in sequence; the 3D convolutional network comprises a plurality of sequentially connected 3D convolutional layers;
(3) performing iterative training on a video frame interpolation network model f:
(3a) initializing the iteration counter i and the maximum number of iterations I, with I ≥ 100, and the weight parameter θ of the video frame interpolation network model f, and setting i = 0;
(3b) obtaining the intermediate frame of the odd-position image frames of each video frame sequence V1^r in the training sample set V1:
(3b1) taking the training sample set V1 as the input of the video frame interpolation network model f; the feature extraction network performs feature extraction on each odd-position training sample of every video frame sequence V1^r to obtain a feature map set containing the E odd-position training samples of V1^r, where E ≥ 2 and E is even, and each element of the set is the feature map corresponding to the e-th odd-position training sample at the i-th iteration;
(3b2) the spatio-temporal joint attention network computes the temporal and spatial correlations of each feature map in the set and uses these correlations to compute the F depth features of the set;
(3b3) the 3D convolutional network fuses the F depth features to reconstruct the intermediate frame image of the odd-position image frames of the video sequence V1^r;
(3c) using the absolute value loss function L1, compute the loss value L of the video frame interpolation network model from the reconstructed intermediate frame images and the even-position image frames of each video frame sequence; then use gradient descent to update the weight parameter θ of f through the partial derivatives of L, obtaining the video frame interpolation network model f_i of the current iteration;
(3d) judge whether i ≥ I; if so, the trained video frame interpolation network model f* is obtained; otherwise, let i = i + 1 and f = f_i, and perform step (3b);
(4) acquiring the video frame interpolation result:
taking each video frame sequence V2^s as the input of the trained video frame interpolation network model f* and propagating it forward to obtain the intermediate frame image X2_s of the selected image frames in each video frame sequence of the test data set.
Compared with the prior art, the invention has the following advantages:
the invention constructs a spatio-temporal joint attention network included in a video frame interpolation network based on spatio-temporal joint attention, can acquire spatio-temporal correlation between adjacent input image frames in video frame interpolation model training, models object motion according to the spatio-temporal correlation, and finally synthesizes an intermediate frame, wherein the spatio-temporal joint attention network has better video frame interpolation effect than the prior art when dealing with complex nonlinear motion; the method utilizes the space-time joint attention network to acquire the characteristic information of the moving object, avoids the error of a calculation result caused by calculating the optical flow, effectively improves the accuracy of video frame interpolation, and simultaneously utilizes the attention model and the 3D convolution to carry out motion estimation, so that the network parameter quantity is low, the video frame interpolation speed is improved, and the method has practical application value.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a video frame insertion network according to the present invention;
FIG. 3 is a schematic diagram of the principle of the spatio-temporal attention module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
referring to fig. 1, the present invention includes the steps of:
step 1) obtaining a training sample set and a testing sample set:
for selected V original views each including L image framesCutting each frame image through a cutting window with the size of H multiplied by W to obtain a video frame sequence corresponding to each original video after preprocessing, wherein H equal to 448 and W equal to 256 respectively represent the length and the width of the cutting window, respectively marking an image frame with an odd number and an image frame with an even number in the video frame sequence corresponding to each original video after preprocessing, and then respectively marking a video frame V with R image frames provided with marks1={V1 rR is more than or equal to 1 and less than or equal to R is used as a training sample set, and the video frame sequence with marks on the rest S image frames is used
Figure BDA0003564661830000041
As a test sample set, where L ═ 7, V ═ 7564, R ═ 3782, S ═ 3782, V1 rRepresenting the sequence of the r-th video frame,
Figure BDA0003564661830000042
representing an s-th sequence of video frames;
step 2) constructing a video frame interpolation model f of space-time joint attention, wherein the structure of the model f is shown in figure 2:
constructing a video frame interpolation network model f comprising a feature extraction network, a space-time joint attention network and a 3D convolution network which are connected in sequence; wherein, the feature extraction network comprises 4 2D convolutional layers which are connected in sequence; the space-time attention network comprises 7 space-time attention modules which are connected in sequence, and the principle of the space-time attention modules is shown in figure 3; the 3D convolutional network comprises 3 sequentially connected 3D convolutional layers;
each of the 4 2D convolutional layers of the feature extraction network contains a number of convolution kernels and an activation function. The numbers of convolution kernels in the first and second 2D convolutional layers are both 64, and those in the third and fourth 2D convolutional layers are 128 and 256 respectively; the kernel size of all 4 2D convolutional layers is 3 × 3; the convolution strides of the first and third layers are 2 and those of the second and fourth layers are 1; all 4 2D convolutional layers use a padding of 1; and all 4 2D convolutional layers use the ReLU activation function;
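As a concrete illustration only (not the patented implementation; the class and layer names are hypothetical), a feature extraction network with exactly these settings could be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Four 2D conv layers: 64, 64, 128, 256 kernels, 3x3, strides 2/1/2/1, padding 1, ReLU."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: one input image frame, shape (B, 3, H, W); output: (B, 256, H/4, W/4)
        return self.layers(frame)
```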
each spatio-temporal attention module in the spatio-temporal attention network comprises 4 branches. The module searches the input frames along the spatio-temporal dimensions in this multi-branch form, and the branches compute attention over different spatial block sizes, making it easier for the network to capture the changes caused by complex motion: the branch whose attention blocks are the same size as the input feature map models globally, mainly modeling the background, while the branches with smaller attention blocks model locally, modeling the complex moving foreground and acquiring the feature information of moving objects. Each branch comprises 4 2D convolutional layers and one softmax layer, structured as follows: the first, second and third 2D convolutional layers are connected in parallel; the product of the outputs of the first and second 2D convolutional layers is the input of the softmax layer; the product of the output of the softmax layer and the output of the third 2D convolutional layer is the input of the fourth 2D convolutional layer; and the output of the fourth 2D convolutional layer is the depth feature computed by the branch. The spatio-temporal attention network comprises 7 spatio-temporal attention modules; in each of them the numbers of convolution kernels of the first, second and third 2D convolutional layers are all 64 and that of the fourth 2D convolutional layer is 256, the kernel sizes of the first, second and third 2D convolutional layers are 1 × 1 with stride 1, and the kernel size of the fourth 2D convolutional layer is 3 × 3 with stride 1;
the 3D convolutional network contains 3 3D convolutional layers; the numbers of convolution kernels of the first, second and third 3D convolutional layers are 128, 64 and 64 respectively; the kernel sizes of the first and second 3D convolutional layers are both 3 × 3 × 3 and that of the third 3D convolutional layer is 2 × 3 × 3; the convolution stride is 1 × 1 × 1 for all layers; and the third 3D convolutional layer uses the ReLU activation function;
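Likewise for illustration only, a 3D convolutional fusion network matching these settings might look like the sketch below; the module name is hypothetical, and the temporal padding is assumed so that the time dimension evolves as 4 → 4 → 2 → 1, as described in step (3b3) further below:

```python
import torch
import torch.nn as nn

class FusionNet3D(nn.Module):
    """Three 3D conv layers (128, 64, 64 kernels) that fuse depth features over time (4 -> 4 -> 2 -> 1)."""

    def __init__(self, in_channels: int = 256):
        super().__init__()
        # First layer: 128 kernels, 3x3x3, temporal padding 1 keeps the time dimension at 4.
        self.conv1 = nn.Conv3d(in_channels, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
        # Second layer: 64 kernels, 3x3x3, no temporal padding, reduces time 4 -> 2.
        self.conv2 = nn.Conv3d(128, 64, kernel_size=(3, 3, 3), stride=1, padding=(0, 1, 1))
        # Third layer: 64 kernels, 2x3x3, no temporal padding, reduces time 2 -> 1, followed by ReLU.
        self.conv3 = nn.Conv3d(64, 64, kernel_size=(2, 3, 3), stride=1, padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C=256, T=4, H, W) depth features; returns (B, 64, 1, H, W)
        x = self.conv1(feats)
        x = self.conv2(x)
        return self.relu(self.conv3(x))
```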
step 3), performing iterative training on the video frame interpolation network model f:
(3a) initialize the iteration counter i and the maximum number of iterations I with I = 100; let the weight parameter of the video frame interpolation network model f be θ, and set i = 0;
(3b) obtaining the intermediate frame of the odd-position image frames of each video frame sequence V1^r in the training sample set V1:
(3b1) take the training sample set V1 as the input of the video frame interpolation network model f; the feature extraction network performs feature extraction on the 1st, 3rd, 5th and 7th image frames of each video frame sequence V1^r to obtain a feature map set containing the E = 4 odd-position training samples of V1^r, each element of which is the feature map corresponding to the e-th odd-position training sample at the i-th iteration;
(3b2) the spatio-temporal joint attention network computes F = 4 depth features:
(3b2i) apply a convolution with kernel size 1 × 1 to each feature map to obtain its key vector k_e, query vector q_e and value vector v_e;
(3b2ii) partition the key vector k_e, query vector q_e and value vector v_e into blocks of size h_i × w_i × C, where h_1 = 448, w_1 = 256, h_2 = 224, w_2 = 128, h_3 = 112, w_3 = 64, h_4 = 56, w_4 = 32 and C = 256, and feed the extracted feature maps into the 4 branches: the first branch partitions them into blocks of size h_1 × w_1 × C, the second branch into blocks of size h_2 × w_2 × C, the third branch into blocks of size h_3 × w_3 × C and the fourth branch into blocks of size h_4 × w_4 × C, giving 340 feature blocks in total, where each branch produces N = T × H/h_i × W/w_i blocks and T = 4 is the number of input frames. Multiplying q_e and k_e yields the correlations between different feature blocks: after converting each block of q_e and k_e into a one-dimensional vector, matrix multiplication gives the correlation between a query vector block and a key vector block as

x(m, n) = q_e^m (k_e^n)^T / sqrt(d),

where 1 ≤ m ≤ N, 1 ≤ n ≤ N, q_e^m denotes the m-th query vector block, k_e^n denotes the n-th key vector block, d is the dimension of a flattened block, and x(m, n) is the normalized correlation between the m-th query vector block and the n-th key vector block. The block partitioning effectively reduces the computation brought by the matrix operations, and the normalization alleviates the gradient shrinkage caused by the Softmax function. The Softmax operation is then applied to the obtained correlations to obtain the attention weights:
a(m, n) = exp(x(m, n)) / Σ_{n'=1}^{N} exp(x(m, n')),

where exp denotes the exponential operation, the denominator sums the exponentials of the correlations between the m-th query vector block and all key vector blocks, and a(m, n) denotes the attention weight. The attention weights are multiplied by the value vector blocks and the products are summed to obtain the output:
O_m = Σ_{n=1}^{N} a(m, n) v_e^n,

where O_m denotes the motion information captured by the spatio-temporal attention module for the m-th block. All blocks are merged and reshaped back to the initial size H × W × C, and finally the outputs of the spatio-temporal attention branches are concatenated along the channel dimension to obtain the depth features of the 4 time dimensions;
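A minimal, illustrative sketch of one such attention branch is given below; it is not the authors' code, the class and variable names are hypothetical, and the sqrt(d) scaling follows the standard scaled dot-product convention assumed above. It takes depth features of shape (B, C, T, H, W), partitions them into h_b × w_b blocks over all T frames, and applies block-wise attention followed by the 3 × 3 output convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """One branch of a spatio-temporal attention module: 1x1 convs produce q, k, v;
    the features of all T frames are split into (h_b, w_b) blocks; block-wise scaled
    dot-product attention is applied; a 3x3 conv projects the result back to C channels."""

    def __init__(self, channels: int = 256, block_hw=(56, 32)):
        super().__init__()
        self.h_b, self.w_b = block_hw
        self.to_q = nn.Conv2d(channels, 64, kernel_size=1)
        self.to_k = nn.Conv2d(channels, 64, kernel_size=1)
        self.to_v = nn.Conv2d(channels, 64, kernel_size=1)
        self.proj = nn.Conv2d(64, channels, kernel_size=3, padding=1)

    def _blocks(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C', H, W) -> (B, N, C'*h_b*w_b) with N = T * H/h_b * W/w_b
        b, t, c, h, w = x.shape
        x = x.reshape(b, t, c, h // self.h_b, self.h_b, w // self.w_b, self.w_b)
        return x.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, -1, c * self.h_b * self.w_b)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: depth features of the T input frames, shape (B, C, T, H, W)
        b, c, t, h, w = feats.shape
        assert h % self.h_b == 0 and w % self.w_b == 0
        frames = feats.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        q = self._blocks(self.to_q(frames).reshape(b, t, -1, h, w))
        k = self._blocks(self.to_k(frames).reshape(b, t, -1, h, w))
        v = self._blocks(self.to_v(frames).reshape(b, t, -1, h, w))
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
        out = attn @ v                                                        # (B, N, 64*h_b*w_b)
        # Reassemble the blocks into frames of shape (B*T, 64, H, W), then project with the 3x3 conv.
        out = out.reshape(b, t, h // self.h_b, w // self.w_b, 64, self.h_b, self.w_b)
        out = out.permute(0, 1, 4, 2, 5, 3, 6).reshape(b * t, 64, h, w)
        return self.proj(out).reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)   # (B, C, T, H, W)
```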
(3b3) the 3D convolutional network fuses the depth features of the 4 time dimensions. The first 3D convolutional layer keeps the input and output dimensions consistent and performs feature extraction on the depth features; the second 3D convolutional layer, with kernel size 3 × 3 × 3, reduces the time dimension of the output feature map to 2, decoding and fusing depth features that are close in the time dimension; the third 3D convolutional layer, with kernel size 2 × 3 × 3, reduces the time dimension to 1. A single 2D convolutional layer then generates the intermediate frame image (the 4th frame) from the 1st, 3rd, 5th and 7th image frames of each video frame sequence V1^r, where the 2D convolutional layer has 1 convolution kernel of size 7 × 7, stride 1, and zero padding;
(3c) using the absolute value loss function L1, compute the loss value L of the video frame interpolation network model from the reconstructed intermediate frame image and the 4th image frame of each video frame sequence, feed L into an Adam optimizer to update the weight parameter θ of f, and obtain the video frame interpolation network model f_i of the current iteration;
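By way of illustration only, one training iteration under these settings (L1 loss between the predicted intermediate frame and the ground-truth 4th frame, followed by an Adam update) could be sketched as follows; the model, the tensor layout and the function name are hypothetical:

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               inputs: torch.Tensor, target: torch.Tensor) -> float:
    """One training iteration: `inputs` holds the odd-position frames (B, 4, 3, H, W)
    and `target` is the ground-truth middle (4th) frame (B, 3, H, W)."""
    optimizer.zero_grad()
    pred = model(inputs)                          # predicted intermediate frame, (B, 3, H, W)
    loss = nn.functional.l1_loss(pred, target)    # absolute value (L1) loss
    loss.backward()                               # gradients of L with respect to the weights theta
    optimizer.step()                              # Adam update of theta
    return loss.item()
```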
(3d) judge whether i ≥ I; if so, the trained video frame interpolation network model f* is obtained; otherwise, let i = i + 1 and f = f_i, and perform step (3b);
step 4) acquiring the video frame interpolation result:
take each video frame sequence V2^s as the input of the trained video frame interpolation network model f* and propagate it forward to obtain the intermediate frame image X2_s of the selected image frames in each video frame sequence of the test data set.
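A correspondingly minimal inference sketch (again hypothetical, assuming the trained model accepts the four odd-position frames of a test sequence as one tensor) might be:

```python
import torch

@torch.no_grad()
def interpolate(model: torch.nn.Module, odd_frames: torch.Tensor) -> torch.Tensor:
    """odd_frames: the 1st, 3rd, 5th and 7th frames of a test sequence, shape (B, 4, 3, H, W).
    Returns the predicted intermediate frame X2_s, shape (B, 3, H, W)."""
    model.eval()
    return model(odd_frames)
```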
The technical effects of the present invention are further described below in combination with simulation experiments:
1. simulation conditions and contents:
the simulation was trained on two NVIDIA TITAN RTX graphics cards in PyTorch framework. Training the model using an Adam optimizer, where β1=0.9,β2The initial learning rate is set to 10 at 0.99-4The learning rate is reduced to 0.4 times of the original rate after 40 epochs.
When training the video frame interpolation model, the Vimeo90K data set was used as the training set; it contains 3782 scenes of consecutive frames with a spatial resolution of 448 × 256 per frame, and during training the data set was expanded and augmented by cropping, flipping and similar operations on Vimeo90K. The Vimeo90K data set is from Xue Tianfan et al., "Video Enhancement with Task-Oriented Flow", International Journal of Computer Vision, vol. 127, no. 8, pp. 1106-1125, 2019. For testing, the widely adopted test sets Vimeo90K and UCF101 were used. The UCF101 data set is from Soomro et al., "UCF101: A dataset of 101 human actions classes from videos in the wild", arXiv preprint arXiv:1212.0402, 2012, and contains 379 groups of pictures, each group comprising 3 consecutive frames; the Vimeo90K data set contains 3782 groups of pictures, each group comprising 3 consecutive frames.
Video frame interpolation experimental results: Table 1 compares the algorithm of the invention with the existing ABME and CAIN algorithms in terms of parameter count, input, peak signal-to-noise ratio and structural similarity:

Table 1
Method      Params   Input                Vimeo90K PSNR / SSIM   UCF101 PSNR / SSIM
ABME        18.1M    RGB + optical flow   35.84 / 0.973          32.90 / 0.969
CAIN        42.8M    RGB                  33.93 / 0.964          32.28 / 0.965
Invention   14.4M    RGB                  36.40 / 0.976          33.35 / 0.971
As shown in Table 1, the ABME algorithm has 18.1M parameters and its model input requires both RGB images and their optical flow; its peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the Vimeo90K data set are 35.84 and 0.973 respectively, and 32.90 and 0.969 on the UCF101 data set. The CAIN algorithm has 42.8M parameters and its model input requires only RGB images; its PSNR and SSIM are 33.93 and 0.964 on Vimeo90K and 32.28 and 0.965 on UCF101. The invention has 14.4M parameters and its model input requires only RGB images; its PSNR and SSIM are 36.40 and 0.976 on Vimeo90K and 33.35 and 0.971 on UCF101;
as can be seen from Table 1, the present invention achieves the best results in both PSNR and SSIM on both test sets. Meanwhile, under the condition of needing RGB image input, the method can be better than an algorithm only needing RGB image input and is also better than an interpolation algorithm based on optical flow; the parameter number of the invention is much smaller than CAIN algorithm, so that the algorithm of the invention is easier to realize engineering application.
The simulation results show that the proposed method uses a spatio-temporal attention mechanism to capture the spatio-temporal relation between input frames and to model complex motion, thereby achieving accurate video frame interpolation. Compared with the ABME algorithm, the method does not use optical flow estimation and therefore avoids the extra errors it introduces; compared with the CAIN algorithm, it adds time-dimension information and effectively improves interpolation accuracy. At the same time, the number of network parameters is low, so the method has practical application value.

Claims (5)

1. A video frame interpolation method based on spatio-temporal joint attention is characterized by comprising the following steps:
(1) acquiring a training sample set and a testing sample set:
preprocessing each of the selected V original videos, each comprising L image frames; marking respectively the odd-position image frames and the even-position image frames in the video frame sequence corresponding to each preprocessed original video; and then taking the R marked video frame sequences V1 = {V1^r | 1 ≤ r ≤ R} as the training sample set and the remaining S marked video frame sequences V2 = {V2^s | 1 ≤ s ≤ S} as the test sample set, where L > 5, V > 1000, S = V − R, V1^r denotes the r-th training video frame sequence, and V2^s denotes the s-th test video frame sequence;
(2) constructing a video frame interpolation network model f based on space-time joint attention:
constructing a video frame interpolation network model f comprising a feature extraction network, a space-time joint attention network and a 3D convolution network which are connected in sequence; wherein the feature extraction network comprises a plurality of 2D convolutional layers connected in sequence; the spatiotemporal attention network comprises a plurality of spatiotemporal attention modules connected in sequence; the 3D convolutional network comprises a plurality of sequentially connected 3D convolutional layers;
(3) performing iterative training on the video frame interpolation network model f:
(3a) initializing the iteration counter i and the maximum number of iterations I, with I ≥ 100, and the weight parameter θ of the video frame interpolation network model f, and setting i = 0;
(3b) obtaining the intermediate frame of the odd-position image frames of each video frame sequence V1^r in the training sample set V1:
(3b1) taking the training sample set V1 as the input of the video frame interpolation network model f, and performing, by the feature extraction network, feature extraction on each odd-position training sample of every video frame sequence V1^r to obtain a feature map set containing the E odd-position training samples of V1^r, where E ≥ 2 and E is even, and each element of the set is the feature map corresponding to the e-th odd-position training sample at the i-th iteration;
(3b2) computing, by the spatio-temporal joint attention network, the temporal and spatial correlations of each feature map in the set, and using these correlations to compute the F depth features of the set;
(3b3) fusing, by the 3D convolutional network, the F depth features to reconstruct the intermediate frame image of the odd-position image frames of the video sequence V1^r;
(3c) using the absolute value loss function L1, computing the loss value L of the video frame interpolation network model from the reconstructed intermediate frame images and the even-position image frames of each video frame sequence, and then adopting gradient descent to update the weight parameter θ of f through the partial derivatives of L, obtaining the video frame interpolation network model f_i of the current iteration;
(3d) judging whether i ≥ I; if so, obtaining the trained video frame interpolation network model f*; otherwise, letting i = i + 1 and f = f_i, and performing step (3b);
(4) acquiring the video frame interpolation result:
taking each video frame sequence V2^s as the input of the trained video frame interpolation network model f* and propagating it forward to obtain the intermediate frame image X2_s of the selected image frames in each video frame sequence of the test data set.
2. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein the preprocessing of the selected V original videos in step (1) is performed by:
decomposing each original video into L frame images, and cropping each frame image through a cropping window of size H × W to obtain the video frame sequence corresponding to each preprocessed original video, wherein H and W respectively denote the length and the width of the cropping window.
3. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein, in the video frame interpolation network model f in step (2):
the feature extraction network contains 4 2D convolutional layers, each comprising a number of convolution kernels and an activation function; the numbers of convolution kernels in the first and second 2D convolutional layers are both 64 and those in the third and fourth 2D convolutional layers are 128 and 256 respectively; the kernel size of all 4 2D convolutional layers is 3 × 3; the convolution strides of the first and third layers are 2 and those of the second and fourth layers are 1; and all 4 2D convolutional layers use the ReLU activation function;
each spatio-temporal attention module in the spatio-temporal attention network comprises 4 branches, each branch comprising 4 2D convolutional layers and one softmax layer structured as follows: the first, second and third 2D convolutional layers are connected in parallel; the product of the outputs of the first and second 2D convolutional layers is the input of the softmax layer; the product of the output of the softmax layer and the output of the third 2D convolutional layer is the input of the fourth 2D convolutional layer; and the output of the fourth 2D convolutional layer is the depth feature computed by the branch; the spatio-temporal attention network comprises 7 spatio-temporal attention modules, in each of which the numbers of convolution kernels of the first, second and third 2D convolutional layers are all 64 and that of the fourth 2D convolutional layer is 256, the kernel sizes of the first, second and third 2D convolutional layers are 1 × 1 with stride 1, and the kernel size of the fourth 2D convolutional layer is 3 × 3 with stride 1;
the 3D convolutional network contains 3 3D convolutional layers; the numbers of convolution kernels of the first, second and third 3D convolutional layers are 128, 64 and 64 respectively; the kernel sizes of the first and second 3D convolutional layers are both 3 × 3 × 3 and that of the third 3D convolutional layer is 2 × 3 × 3; the convolution stride is 1 × 1 × 1 for all layers; and the third 3D convolutional layer uses the ReLU activation function.
4. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein the spatio-temporal joint attention network in step (3b2) computes the temporal and spatial correlations of each feature map and uses them to compute the F depth features through the following steps:
the spatio-temporal joint attention network convolves each feature map to obtain its key vector k_e, query vector q_e and value vector v_e; the vectors q_e and k_e are multiplied to obtain the correlations corresponding to each feature map; each correlation is multiplied by its value vector v_e; and the products are summed to obtain the depth feature corresponding to each feature map.
5. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein the reconstruction in step (3b3) of the intermediate frame image of the odd-position image frames of each video frame sequence V1^r is implemented through the following steps:
among the three convolutional layers of the 3D convolutional network, the first 3D convolutional layer performs feature extraction on the depth features, the second 3D convolutional layer reduces the time dimension of the output feature map to 2, and the last 3D convolutional layer reduces the time dimension to 1, obtaining the intermediate frame image.
CN202210305381.9A 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention Active CN114598833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210305381.9A CN114598833B (en) 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210305381.9A CN114598833B (en) 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention

Publications (2)

Publication Number Publication Date
CN114598833A true CN114598833A (en) 2022-06-07
CN114598833B CN114598833B (en) 2023-02-10

Family

ID=81810400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210305381.9A Active CN114598833B (en) 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention

Country Status (1)

Country Link
CN (1) CN114598833B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243031A (en) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 Video spatiotemporal feature optimization method and system based on quality attention mechanism, electronic device and storage medium


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080174694A1 (en) * 2007-01-22 2008-07-24 Horizon Semiconductors Ltd. Method and apparatus for video pixel interpolation
CN101903828A (en) * 2007-12-20 2010-12-01 汤姆森许可贸易公司 Device for helping the capture of images
CN107133919A (en) * 2017-05-16 2017-09-05 西安电子科技大学 Time dimension video super-resolution method based on deep learning
US20210383169A1 (en) * 2019-03-01 2021-12-09 Peking University Shenzhen Graduate School Method, apparatus, and device for video frame interpolation
CN111915659A (en) * 2019-05-10 2020-11-10 三星电子株式会社 CNN-based systems and methods for video frame interpolation
CN112734696A (en) * 2020-12-24 2021-04-30 华南理工大学 Face changing video tampering detection method and system based on multi-domain feature fusion
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction
CN113132664A (en) * 2021-04-19 2021-07-16 科大讯飞股份有限公司 Frame interpolation generation model construction method and video frame interpolation method
CN114125455A (en) * 2021-11-23 2022-03-01 长沙理工大学 Bidirectional coding video frame insertion method, system and equipment based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAN XIA: "Multi-Scale Attention Generative Adversarial Networks for Video Frame Interpolation", 《IEEE ACCESS》 *
JUN LI: "Spatio-Temporal Attention Networks for Action Recognition and Detection", 《 IEEE TRANSACTIONS ON MULTIMEDIA》 *
ZHIHAO SHI: "Video frame interpolation via generalized deformable convolution", 《IEEE TRANSACTIONS ON MULTIMEDIA》 *
DONG MENG et al.: "Video super-resolution reconstruction based on attention residual convolutional network", 《Journal of Changchun University of Science and Technology (Natural Science Edition)》 *


Also Published As

Publication number Publication date
CN114598833B (en) 2023-02-10

Similar Documents

Publication Publication Date Title
Zamir et al. Restormer: Efficient transformer for high-resolution image restoration
CN113673307B (en) Lightweight video action recognition method
Reda et al. Unsupervised video interpolation using cycle consistency
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111709895A (en) Image blind deblurring method and system based on attention mechanism
US11978146B2 (en) Apparatus and method for reconstructing three-dimensional image
CN111986105B (en) Video time sequence consistency enhancing method based on time domain denoising mask
CN108924528B (en) Binocular stylized real-time rendering method based on deep learning
CN107194948B (en) Video significance detection method based on integrated prediction and time-space domain propagation
CN112991450B (en) Detail enhancement unsupervised depth estimation method based on wavelet
CN112422870B (en) Deep learning video frame insertion method based on knowledge distillation
CN115953582B (en) Image semantic segmentation method and system
CN110956655A (en) Dense depth estimation method based on monocular image
CN112288788A (en) Monocular image depth estimation method
CN114598833B (en) Video frame interpolation method based on spatio-temporal joint attention
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
CN116205962A (en) Monocular depth estimation method and system based on complete context information
CN113160081A (en) Depth face image restoration method based on perception deblurring
Xiao et al. Progressive motion boosting for video frame interpolation
KR102057395B1 (en) Video generation method using video extrapolation based on machine learning
CN114612305B (en) Event-driven video super-resolution method based on stereogram modeling
CN114820745A (en) Monocular visual depth estimation system, method, computer device, and computer-readable storage medium
Tang et al. A constrained deformable convolutional network for efficient single image dynamic scene blind deblurring with spatially-variant motion blur kernels estimation
CN110827238A (en) Improved side-scan sonar image feature extraction method of full convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant