CN114598833A - Video frame interpolation method based on spatio-temporal joint attention - Google Patents

Video frame interpolation method based on spatio-temporal joint attention

Info

Publication number
CN114598833A
Authority
CN
China
Prior art keywords
video frame
convolutional
layer
attention
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210305381.9A
Other languages
Chinese (zh)
Other versions
CN114598833B (en)
Inventor
路文
张弘毅
冯姣姣
张立泽
胡健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210305381.9A priority Critical patent/CN114598833B/en
Publication of CN114598833A publication Critical patent/CN114598833A/en
Application granted granted Critical
Publication of CN114598833B publication Critical patent/CN114598833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0135Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving interpolation processes
    • H04N7/014Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving interpolation processes involving the use of motion vectors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Television Systems (AREA)

Abstract

The invention provides a video frame interpolation method based on spatio-temporal joint attention, which comprises the following steps: (1) acquiring a training sample set and a test sample set; (2) constructing a video frame interpolation network based on spatio-temporal joint attention; (3) iteratively training the video frame interpolation network model; and (4) acquiring the video frame interpolation result. The video frame interpolation model based on spatio-temporal joint attention constructed by the invention uses a spatio-temporal attention mechanism to capture the spatio-temporal relation between input frames and to model complex motion, thereby completing high-quality video frame interpolation. Compared with most existing networks, the algorithm does not use additional optical flow input, which avoids the extra errors introduced by optical flow estimation, while its number of network parameters is low, giving it practical application value.

Description

Video frame interpolation method based on spatio-temporal joint attention
Technical Field
The invention belongs to the technical field of video processing, relates to a video frame interpolation method, and particularly relates to a video frame interpolation method based on spatio-temporal joint attention, which can be used in the fields of slow motion generation, video post-processing and the like.
Background
Low temporal resolution causes image aliasing and artifacts that degrade video quality, making it an important factor affecting video quality. Video frame interpolation inserts one or more intermediate frames between consecutive image frames to increase the temporal resolution and improve video quality.
Video frame interpolation methods typically consist of two parts: motion estimation and pixel synthesis. Motion estimation predicts the positions of the pixels of the intermediate frame by computing the motion of pixels between the preceding and following frames, and is divided into forward estimation and backward estimation; pixel synthesis then fuses the forward-estimated intermediate frame and the backward-estimated intermediate frame to obtain the intermediate frame. Early video frame interpolation mainly used optical flow methods to estimate the bidirectional optical flow between the two frames and synthesized the intermediate frame by forward or backward warping. With the development of deep learning, learning-based methods have also achieved good results in optical flow estimation and video frame interpolation. Most existing interpolation methods apply a pre-trained optical flow estimation network to generate bidirectional optical flow and map the two input frames to the intermediate frame. First, these methods depend on the reliability of the optical flow estimation algorithm: errors produced by the optical flow network propagate through the interpolation network and make the interpolation result inaccurate, and the optical flow estimation adds extra computation that lowers interpolation efficiency. Second, these networks restrict complex motion to linear or quadratic trajectories and therefore have difficulty estimating it. It is therefore important to design an interpolation model that does not depend on optical flow estimation and can accurately estimate complex motion trajectories.
Using attention mechanisms in neural networks allows a network to adaptively recalibrate its inputs according to the task and to focus on the parts that contribute more to completing it; with an attention mechanism, more complex motion can therefore be captured and reconstructed. Meanwhile, end-to-end training eliminates the estimation errors introduced by a separate optical flow model, reduces the number of model parameters, and significantly improves the accuracy and speed of frame interpolation.
The main factor influencing video frame interpolation is the accuracy of the predicted intermediate frame. The objective evaluation indices are the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM): the higher the PSNR, the higher the image quality, and likewise a higher SSIM indicates higher image quality.
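For reference only (not part of the patent), the PSNR between a predicted intermediate frame and its ground truth can be computed as in the following sketch, assuming images are float tensors normalized to [0, 1]:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images whose values lie in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```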
Junheum Park et al. proposed the optical-flow-based ABME algorithm in "Asymmetric Bilateral Motion Estimation for Video Frame Interpolation" at the International Conference on Computer Vision. It first computes the symmetric bilateral optical flow of the input frames and derives an anchor frame from it, then computes the asymmetric bilateral optical flow between the anchor frame and the input frames, uses this asymmetric bilateral optical flow to compute a preliminary intermediate frame, and finally refines the preliminary intermediate frame with a synthesis network to obtain the final interpolation result image.
Choi et al. proposed the CAIN algorithm in "Channel Attention Is All You Need for Video Frame Interpolation", which distributes the spatial feature information of each frame across the network channels and uses a channel attention mechanism to capture motion information. However, this algorithm fails to explicitly capture the dependencies along the time dimension between input frames, resulting in severe artifacts at the motion boundaries of the generated frames and inaccurate interpolation results.
Disclosure of Invention
The purpose of the invention is to provide a video frame interpolation method based on spatio-temporal joint attention that overcomes the defects of the prior art. It aims to effectively exploit the temporal correlation and spatial information of adjacent frames to estimate and generate complex nonlinear motion, thereby effectively improving the accuracy of the video frame interpolation algorithm, while avoiding an excessive number of parameters so as to further improve the efficiency of the algorithm.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set and a testing sample set:
preprocessing each of the selected V original videos, each comprising L image frames; marking respectively the odd-position image frames and the even-position image frames in the video frame sequence corresponding to each preprocessed original video; and then taking the R marked video frame sequences V1 = {V1^r | 1 ≤ r ≤ R} as the training sample set and the remaining S marked video frame sequences V2 = {V2^s | 1 ≤ s ≤ S} as the test sample set, where L > 5, V > 1000, S = V − R, V1^r denotes the r-th training video frame sequence, and V2^s denotes the s-th test video frame sequence;
(2) constructing a video frame interpolation network model f based on space-time joint attention:
constructing a video frame interpolation network model f comprising a feature extraction network, a space-time joint attention network and a 3D convolution network which are connected in sequence; wherein the feature extraction network comprises a plurality of 2D convolutional layers connected in sequence; the spatiotemporal attention network comprises a plurality of spatiotemporal attention modules connected in sequence; the 3D convolutional network comprises a plurality of sequentially connected 3D convolutional layers;
(3) performing iterative training on a video frame interpolation network model f:
(3a) initializing the iteration counter i and the maximum number of iterations I, with I ≥ 100, and the weight parameter θ of the video frame interpolation network model f, and setting i = 0;
(3b) obtaining the intermediate frame of the odd-position image frames of each video frame sequence V1^r in the training sample set V1:
(3b1) taking the training sample set V1 as the input of the video frame interpolation network model f; the feature extraction network performs feature extraction on each odd-position training sample of every video frame sequence V1^r to obtain a feature map set containing the E odd-position training samples of V1^r, where E ≥ 2 and E is even, and each element of the set is the feature map corresponding to the e-th odd-position training sample at the i-th iteration;
(3b2) the spatio-temporal joint attention network computes the temporal and spatial correlations of each feature map in the set and uses these correlations to compute the F depth features of the set;
(3b3) the 3D convolutional network fuses the F depth features to reconstruct the intermediate frame image of the odd-position image frames of the video sequence V1^r;
(3c) using the absolute value loss function L1, compute the loss value L of the video frame interpolation network model from the reconstructed intermediate frame images and the even-position image frames of each video frame sequence; then use gradient descent to update the weight parameter θ of f through the partial derivatives of L, obtaining the video frame interpolation network model f_i of the current iteration;
(3d) judge whether i ≥ I; if so, the trained video frame interpolation network model f* is obtained; otherwise, let i = i + 1 and f = f_i, and perform step (3b);
(4) acquiring the video frame interpolation result:
taking each video frame sequence V2^s as the input of the trained video frame interpolation network model f* and propagating it forward to obtain the intermediate frame image X2_s of the selected image frames in each video frame sequence of the test data set.
Compared with the prior art, the invention has the following advantages:
the invention constructs a spatio-temporal joint attention network included in a video frame interpolation network based on spatio-temporal joint attention, can acquire spatio-temporal correlation between adjacent input image frames in video frame interpolation model training, models object motion according to the spatio-temporal correlation, and finally synthesizes an intermediate frame, wherein the spatio-temporal joint attention network has better video frame interpolation effect than the prior art when dealing with complex nonlinear motion; the method utilizes the space-time joint attention network to acquire the characteristic information of the moving object, avoids the error of a calculation result caused by calculating the optical flow, effectively improves the accuracy of video frame interpolation, and simultaneously utilizes the attention model and the 3D convolution to carry out motion estimation, so that the network parameter quantity is low, the video frame interpolation speed is improved, and the method has practical application value.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a video frame insertion network according to the present invention;
FIG. 3 is a schematic diagram of the principle of the spatio-temporal attention module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
referring to fig. 1, the present invention includes the steps of:
step 1) obtaining a training sample set and a testing sample set:
for selected V original views each including L image framesCutting each frame image through a cutting window with the size of H multiplied by W to obtain a video frame sequence corresponding to each original video after preprocessing, wherein H equal to 448 and W equal to 256 respectively represent the length and the width of the cutting window, respectively marking an image frame with an odd number and an image frame with an even number in the video frame sequence corresponding to each original video after preprocessing, and then respectively marking a video frame V with R image frames provided with marks1={V1 rR is more than or equal to 1 and less than or equal to R is used as a training sample set, and the video frame sequence with marks on the rest S image frames is used
Figure BDA0003564661830000041
As a test sample set, where L ═ 7, V ═ 7564, R ═ 3782, S ═ 3782, V1 rRepresenting the sequence of the r-th video frame,
Figure BDA0003564661830000042
representing an s-th sequence of video frames;
step 2) constructing a video frame interpolation model f of space-time joint attention, wherein the structure of the model f is shown in figure 2:
constructing a video frame interpolation network model f comprising a feature extraction network, a space-time joint attention network and a 3D convolution network which are connected in sequence; wherein, the feature extraction network comprises 4 2D convolutional layers which are connected in sequence; the space-time attention network comprises 7 space-time attention modules which are connected in sequence, and the principle of the space-time attention modules is shown in figure 3; the 3D convolutional network comprises 3 sequentially connected 3D convolutional layers;
each of the 4 2D convolutional layers of the feature extraction network contains a number of convolution kernels and an activation function. The numbers of convolution kernels in the first and second 2D convolutional layers are both 64, and those in the third and fourth 2D convolutional layers are 128 and 256 respectively; the kernel size of all 4 2D convolutional layers is 3 × 3; the convolution strides of the first and third layers are 2 and those of the second and fourth layers are 1; all 4 2D convolutional layers use a padding of 1; and all 4 2D convolutional layers use the ReLU activation function;
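As a concrete illustration only (not the patented implementation; the class and layer names are hypothetical), a feature extraction network with exactly these settings could be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Four 2D conv layers: 64, 64, 128, 256 kernels, 3x3, strides 2/1/2/1, padding 1, ReLU."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: one input image frame, shape (B, 3, H, W); output: (B, 256, H/4, W/4)
        return self.layers(frame)
```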
each spatio-temporal attention module in the spatio-temporal attention network comprises 4 branches. The module searches the input frames along the spatio-temporal dimensions in this multi-branch form, and the branches compute attention over different spatial block sizes, making it easier for the network to capture the changes caused by complex motion: the branch whose attention blocks are the same size as the input feature map models globally, mainly modeling the background, while the branches with smaller attention blocks model locally, modeling the complex moving foreground and acquiring the feature information of moving objects. Each branch comprises 4 2D convolutional layers and one softmax layer, structured as follows: the first, second and third 2D convolutional layers are connected in parallel; the product of the outputs of the first and second 2D convolutional layers is the input of the softmax layer; the product of the output of the softmax layer and the output of the third 2D convolutional layer is the input of the fourth 2D convolutional layer; and the output of the fourth 2D convolutional layer is the depth feature computed by the branch. The spatio-temporal attention network comprises 7 spatio-temporal attention modules; in each of them the numbers of convolution kernels of the first, second and third 2D convolutional layers are all 64 and that of the fourth 2D convolutional layer is 256, the kernel sizes of the first, second and third 2D convolutional layers are 1 × 1 with stride 1, and the kernel size of the fourth 2D convolutional layer is 3 × 3 with stride 1;
the 3D convolutional network contains 3 3D convolutional layers; the numbers of convolution kernels of the first, second and third 3D convolutional layers are 128, 64 and 64 respectively; the kernel sizes of the first and second 3D convolutional layers are both 3 × 3 × 3 and that of the third 3D convolutional layer is 2 × 3 × 3; the convolution stride is 1 × 1 × 1 for all layers; and the third 3D convolutional layer uses the ReLU activation function;
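Likewise for illustration only, a 3D convolutional fusion network matching these settings might look like the sketch below; the module name is hypothetical, and the temporal padding is assumed so that the time dimension evolves as 4 → 4 → 2 → 1, as described in step (3b3) further below:

```python
import torch
import torch.nn as nn

class FusionNet3D(nn.Module):
    """Three 3D conv layers (128, 64, 64 kernels) that fuse depth features over time (4 -> 4 -> 2 -> 1)."""

    def __init__(self, in_channels: int = 256):
        super().__init__()
        # First layer: 128 kernels, 3x3x3, temporal padding 1 keeps the time dimension at 4.
        self.conv1 = nn.Conv3d(in_channels, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
        # Second layer: 64 kernels, 3x3x3, no temporal padding, reduces time 4 -> 2.
        self.conv2 = nn.Conv3d(128, 64, kernel_size=(3, 3, 3), stride=1, padding=(0, 1, 1))
        # Third layer: 64 kernels, 2x3x3, no temporal padding, reduces time 2 -> 1, followed by ReLU.
        self.conv3 = nn.Conv3d(64, 64, kernel_size=(2, 3, 3), stride=1, padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C=256, T=4, H, W) depth features; returns (B, 64, 1, H, W)
        x = self.conv1(feats)
        x = self.conv2(x)
        return self.relu(self.conv3(x))
```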
step 3), performing iterative training on the video frame interpolation network model f:
(3a) initialize the iteration counter i and the maximum number of iterations I with I = 100; let the weight parameter of the video frame interpolation network model f be θ, and set i = 0;
(3b) obtaining the intermediate frame of the odd-position image frames of each video frame sequence V1^r in the training sample set V1:
(3b1) take the training sample set V1 as the input of the video frame interpolation network model f; the feature extraction network performs feature extraction on the 1st, 3rd, 5th and 7th image frames of each video frame sequence V1^r to obtain a feature map set containing the E = 4 odd-position training samples of V1^r, each element of which is the feature map corresponding to the e-th odd-position training sample at the i-th iteration;
(3b2) the spatio-temporal joint attention network computes F = 4 depth features:
(3b2i) apply a convolution with kernel size 1 × 1 to each feature map to obtain its key vector k_e, query vector q_e and value vector v_e;
(3b2ii) partition the key vector k_e, query vector q_e and value vector v_e into blocks of size h_i × w_i × C, where h_1 = 448, w_1 = 256, h_2 = 224, w_2 = 128, h_3 = 112, w_3 = 64, h_4 = 56, w_4 = 32 and C = 256, and feed the extracted feature maps into the 4 branches: the first branch partitions them into blocks of size h_1 × w_1 × C, the second branch into blocks of size h_2 × w_2 × C, the third branch into blocks of size h_3 × w_3 × C and the fourth branch into blocks of size h_4 × w_4 × C, giving 340 feature blocks in total, where each branch produces N = T × H/h_i × W/w_i blocks and T = 4 is the number of input frames. Multiplying q_e and k_e yields the correlations between different feature blocks: after converting each block of q_e and k_e into a one-dimensional vector, matrix multiplication gives the correlation between a query vector block and a key vector block as

x(m, n) = q_e^m (k_e^n)^T / sqrt(d),

where 1 ≤ m ≤ N, 1 ≤ n ≤ N, q_e^m denotes the m-th query vector block, k_e^n denotes the n-th key vector block, d is the dimension of a flattened block, and x(m, n) is the normalized correlation between the m-th query vector block and the n-th key vector block. The block partitioning effectively reduces the computation brought by the matrix operations, and the normalization alleviates the gradient shrinkage caused by the Softmax function. The Softmax operation is then applied to the obtained correlations to obtain the attention weights:
a(m, n) = exp(x(m, n)) / Σ_{n'=1}^{N} exp(x(m, n')),

where exp denotes the exponential operation, the denominator sums the exponentials of the correlations between the m-th query vector block and all key vector blocks, and a(m, n) denotes the attention weight. The attention weights are multiplied by the value vector blocks and the products are summed to obtain the output:
O_m = Σ_{n=1}^{N} a(m, n) v_e^n,

where O_m denotes the motion information captured by the spatio-temporal attention module for the m-th block. All blocks are merged and reshaped back to the initial size H × W × C, and finally the outputs of the spatio-temporal attention branches are concatenated along the channel dimension to obtain the depth features of the 4 time dimensions;
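A minimal, illustrative sketch of one such attention branch is given below; it is not the authors' code, the class and variable names are hypothetical, and the sqrt(d) scaling follows the standard scaled dot-product convention assumed above. It takes depth features of shape (B, C, T, H, W), partitions them into h_b × w_b blocks over all T frames, and applies block-wise attention followed by the 3 × 3 output convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """One branch of a spatio-temporal attention module: 1x1 convs produce q, k, v;
    the features of all T frames are split into (h_b, w_b) blocks; block-wise scaled
    dot-product attention is applied; a 3x3 conv projects the result back to C channels."""

    def __init__(self, channels: int = 256, block_hw=(56, 32)):
        super().__init__()
        self.h_b, self.w_b = block_hw
        self.to_q = nn.Conv2d(channels, 64, kernel_size=1)
        self.to_k = nn.Conv2d(channels, 64, kernel_size=1)
        self.to_v = nn.Conv2d(channels, 64, kernel_size=1)
        self.proj = nn.Conv2d(64, channels, kernel_size=3, padding=1)

    def _blocks(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C', H, W) -> (B, N, C'*h_b*w_b) with N = T * H/h_b * W/w_b
        b, t, c, h, w = x.shape
        x = x.reshape(b, t, c, h // self.h_b, self.h_b, w // self.w_b, self.w_b)
        return x.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, -1, c * self.h_b * self.w_b)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: depth features of the T input frames, shape (B, C, T, H, W)
        b, c, t, h, w = feats.shape
        assert h % self.h_b == 0 and w % self.w_b == 0
        frames = feats.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        q = self._blocks(self.to_q(frames).reshape(b, t, -1, h, w))
        k = self._blocks(self.to_k(frames).reshape(b, t, -1, h, w))
        v = self._blocks(self.to_v(frames).reshape(b, t, -1, h, w))
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
        out = attn @ v                                                        # (B, N, 64*h_b*w_b)
        # Reassemble the blocks into frames of shape (B*T, 64, H, W), then project with the 3x3 conv.
        out = out.reshape(b, t, h // self.h_b, w // self.w_b, 64, self.h_b, self.w_b)
        out = out.permute(0, 1, 4, 2, 5, 3, 6).reshape(b * t, 64, h, w)
        return self.proj(out).reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)   # (B, C, T, H, W)
```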
(3b3) the 3D convolutional network fuses the depth features of the 4 time dimensions. The first 3D convolutional layer keeps the input and output dimensions consistent and performs feature extraction on the depth features; the second 3D convolutional layer, with kernel size 3 × 3 × 3, reduces the time dimension of the output feature map to 2, decoding and fusing depth features that are close in the time dimension; the third 3D convolutional layer, with kernel size 2 × 3 × 3, reduces the time dimension to 1. A single 2D convolutional layer then generates the intermediate frame image (the 4th frame) from the 1st, 3rd, 5th and 7th image frames of each video frame sequence V1^r, where the 2D convolutional layer has 1 convolution kernel of size 7 × 7, stride 1, and zero padding;
(3c) using the absolute value loss function L1, compute the loss value L of the video frame interpolation network model from the reconstructed intermediate frame image and the 4th image frame of each video frame sequence, feed L into an Adam optimizer to update the weight parameter θ of f, and obtain the video frame interpolation network model f_i of the current iteration;
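By way of illustration only, one training iteration under these settings (L1 loss between the predicted intermediate frame and the ground-truth 4th frame, followed by an Adam update) could be sketched as follows; the model, the tensor layout and the function name are hypothetical:

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               inputs: torch.Tensor, target: torch.Tensor) -> float:
    """One training iteration: `inputs` holds the odd-position frames (B, 4, 3, H, W)
    and `target` is the ground-truth middle (4th) frame (B, 3, H, W)."""
    optimizer.zero_grad()
    pred = model(inputs)                          # predicted intermediate frame, (B, 3, H, W)
    loss = nn.functional.l1_loss(pred, target)    # absolute value (L1) loss
    loss.backward()                               # gradients of L with respect to the weights theta
    optimizer.step()                              # Adam update of theta
    return loss.item()
```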
(3d) judge whether i ≥ I; if so, the trained video frame interpolation network model f* is obtained; otherwise, let i = i + 1 and f = f_i, and perform step (3b);
step 4) acquiring the video frame interpolation result:
take each video frame sequence V2^s as the input of the trained video frame interpolation network model f* and propagate it forward to obtain the intermediate frame image X2_s of the selected image frames in each video frame sequence of the test data set.
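A correspondingly minimal inference sketch (again hypothetical, assuming the trained model accepts the four odd-position frames of a test sequence as one tensor) might be:

```python
import torch

@torch.no_grad()
def interpolate(model: torch.nn.Module, odd_frames: torch.Tensor) -> torch.Tensor:
    """odd_frames: the 1st, 3rd, 5th and 7th frames of a test sequence, shape (B, 4, 3, H, W).
    Returns the predicted intermediate frame X2_s, shape (B, 3, H, W)."""
    model.eval()
    return model(odd_frames)
```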
The technical effects of the present invention are further described below in combination with simulation experiments:
1. simulation conditions and contents:
the simulation was trained on two NVIDIA TITAN RTX graphics cards in PyTorch framework. Training the model using an Adam optimizer, where β1=0.9,β2The initial learning rate is set to 10 at 0.99-4The learning rate is reduced to 0.4 times of the original rate after 40 epochs.
When training the video frame interpolation model, the Vimeo90K data set was used as the training set; it contains 3782 scenes of consecutive frames with a spatial resolution of 448 × 256 per frame, and during training the data set was expanded and augmented by cropping, flipping and similar operations on Vimeo90K. The Vimeo90K data set is from Xue Tianfan et al., "Video Enhancement with Task-Oriented Flow", International Journal of Computer Vision, vol. 127, no. 8, pp. 1106-1125, 2019. For testing, the widely adopted test sets Vimeo90K and UCF101 were used. The UCF101 data set is from Soomro et al., "UCF101: A dataset of 101 human actions classes from videos in the wild", arXiv preprint arXiv:1212.0402, 2012, and contains 379 groups of pictures, each group comprising 3 consecutive frames; the Vimeo90K data set contains 3782 groups of pictures, each group comprising 3 consecutive frames.
Video frame interpolation experimental results: Table 1 compares the algorithm of the invention with the existing ABME and CAIN algorithms in terms of parameter count, input, peak signal-to-noise ratio and structural similarity:

Table 1
Method      Params   Input                Vimeo90K PSNR / SSIM   UCF101 PSNR / SSIM
ABME        18.1M    RGB + optical flow   35.84 / 0.973          32.90 / 0.969
CAIN        42.8M    RGB                  33.93 / 0.964          32.28 / 0.965
Invention   14.4M    RGB                  36.40 / 0.976          33.35 / 0.971
As shown in Table 1, the ABME algorithm has 18.1M parameters and its model input requires both RGB images and their optical flow; its peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the Vimeo90K data set are 35.84 and 0.973 respectively, and 32.90 and 0.969 on the UCF101 data set. The CAIN algorithm has 42.8M parameters and its model input requires only RGB images; its PSNR and SSIM are 33.93 and 0.964 on Vimeo90K and 32.28 and 0.965 on UCF101. The invention has 14.4M parameters and its model input requires only RGB images; its PSNR and SSIM are 36.40 and 0.976 on Vimeo90K and 33.35 and 0.971 on UCF101;
as can be seen from Table 1, the present invention achieves the best results in both PSNR and SSIM on both test sets. Meanwhile, under the condition of needing RGB image input, the method can be better than an algorithm only needing RGB image input and is also better than an interpolation algorithm based on optical flow; the parameter number of the invention is much smaller than CAIN algorithm, so that the algorithm of the invention is easier to realize engineering application.
The simulation results show that the proposed method uses a spatio-temporal attention mechanism to capture the spatio-temporal relation between input frames and to model complex motion, thereby achieving accurate video frame interpolation. Compared with the ABME algorithm, the method does not use optical flow estimation and therefore avoids the extra errors it introduces; compared with the CAIN algorithm, it adds time-dimension information and effectively improves interpolation accuracy. At the same time, the number of network parameters is low, so the method has practical application value.

Claims (5)

1. A video frame interpolation method based on spatio-temporal joint attention is characterized by comprising the following steps:
(1) acquiring a training sample set and a testing sample set:
preprocessing each of the selected V original videos, each comprising L image frames; marking respectively the odd-position image frames and the even-position image frames in the video frame sequence corresponding to each preprocessed original video; and then taking the R marked video frame sequences V1 = {V1^r | 1 ≤ r ≤ R} as the training sample set and the remaining S marked video frame sequences V2 = {V2^s | 1 ≤ s ≤ S} as the test sample set, where L > 5, V > 1000, S = V − R, V1^r denotes the r-th training video frame sequence, and V2^s denotes the s-th test video frame sequence;
(2) constructing a video frame interpolation network model f based on space-time joint attention:
constructing a video frame interpolation network model f comprising a feature extraction network, a space-time joint attention network and a 3D convolution network which are connected in sequence; wherein the feature extraction network comprises a plurality of 2D convolutional layers connected in sequence; the spatiotemporal attention network comprises a plurality of spatiotemporal attention modules connected in sequence; the 3D convolutional network comprises a plurality of sequentially connected 3D convolutional layers;
(3) performing iterative training on the video frame interpolation network model f:
(3a) initializing the iteration counter i and the maximum number of iterations I, with I ≥ 100, and the weight parameter θ of the video frame interpolation network model f, and setting i = 0;
(3b) obtaining the intermediate frame of the odd-position image frames of each video frame sequence V1^r in the training sample set V1:
(3b1) taking the training sample set V1 as the input of the video frame interpolation network model f, and performing, by the feature extraction network, feature extraction on each odd-position training sample of every video frame sequence V1^r to obtain a feature map set containing the E odd-position training samples of V1^r, where E ≥ 2 and E is even, and each element of the set is the feature map corresponding to the e-th odd-position training sample at the i-th iteration;
(3b2) computing, by the spatio-temporal joint attention network, the temporal and spatial correlations of each feature map in the set, and using these correlations to compute the F depth features of the set;
(3b3) fusing, by the 3D convolutional network, the F depth features to reconstruct the intermediate frame image of the odd-position image frames of the video sequence V1^r;
(3c) using the absolute value loss function L1, computing the loss value L of the video frame interpolation network model from the reconstructed intermediate frame images and the even-position image frames of each video frame sequence, and then adopting gradient descent to update the weight parameter θ of f through the partial derivatives of L, obtaining the video frame interpolation network model f_i of the current iteration;
(3d) judging whether i ≥ I; if so, obtaining the trained video frame interpolation network model f*; otherwise, letting i = i + 1 and f = f_i, and performing step (3b);
(4) acquiring the video frame interpolation result:
taking each video frame sequence V2^s as the input of the trained video frame interpolation network model f* and propagating it forward to obtain the intermediate frame image X2_s of the selected image frames in each video frame sequence of the test data set.
2. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein the preprocessing of the selected V original videos in step (1) is performed by:
decomposing each original video into L frame images, and cropping each frame image through a cropping window of size H × W to obtain the video frame sequence corresponding to each preprocessed original video, wherein H and W respectively denote the length and the width of the cropping window.
3. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein, in the video frame interpolation network model f in step (2):
the feature extraction network contains 4 2D convolutional layers, each comprising a number of convolution kernels and an activation function; the numbers of convolution kernels in the first and second 2D convolutional layers are both 64 and those in the third and fourth 2D convolutional layers are 128 and 256 respectively; the kernel size of all 4 2D convolutional layers is 3 × 3; the convolution strides of the first and third layers are 2 and those of the second and fourth layers are 1; and all 4 2D convolutional layers use the ReLU activation function;
each spatio-temporal attention module in the spatio-temporal attention network comprises 4 branches, each branch comprising 4 2D convolutional layers and one softmax layer structured as follows: the first, second and third 2D convolutional layers are connected in parallel; the product of the outputs of the first and second 2D convolutional layers is the input of the softmax layer; the product of the output of the softmax layer and the output of the third 2D convolutional layer is the input of the fourth 2D convolutional layer; and the output of the fourth 2D convolutional layer is the depth feature computed by the branch; the spatio-temporal attention network comprises 7 spatio-temporal attention modules, in each of which the numbers of convolution kernels of the first, second and third 2D convolutional layers are all 64 and that of the fourth 2D convolutional layer is 256, the kernel sizes of the first, second and third 2D convolutional layers are 1 × 1 with stride 1, and the kernel size of the fourth 2D convolutional layer is 3 × 3 with stride 1;
the 3D convolutional network contains 3 3D convolutional layers; the numbers of convolution kernels of the first, second and third 3D convolutional layers are 128, 64 and 64 respectively; the kernel sizes of the first and second 3D convolutional layers are both 3 × 3 × 3 and that of the third 3D convolutional layer is 2 × 3 × 3; the convolution stride is 1 × 1 × 1 for all layers; and the third 3D convolutional layer uses the ReLU activation function.
4. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein the spatio-temporal joint attention network in step (3b2) computes the temporal and spatial correlations of each feature map and uses them to compute the F depth features through the following steps:
the spatio-temporal joint attention network convolves each feature map to obtain its key vector k_e, query vector q_e and value vector v_e; the vectors q_e and k_e are multiplied to obtain the correlations corresponding to each feature map; each correlation is multiplied by its value vector v_e; and the products are summed to obtain the depth feature corresponding to each feature map.
5. The video frame interpolation method based on spatio-temporal joint attention according to claim 1, wherein the reconstruction in step (3b3) of the intermediate frame image of the odd-position image frames of each video frame sequence V1^r is implemented through the following steps:
among the three convolutional layers of the 3D convolutional network, the first 3D convolutional layer performs feature extraction on the depth features, the second 3D convolutional layer reduces the time dimension of the output feature map to 2, and the last 3D convolutional layer reduces the time dimension to 1, obtaining the intermediate frame image.
CN202210305381.9A 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention Active CN114598833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210305381.9A CN114598833B (en) 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210305381.9A CN114598833B (en) 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention

Publications (2)

Publication Number Publication Date
CN114598833A true CN114598833A (en) 2022-06-07
CN114598833B CN114598833B (en) 2023-02-10

Family

ID=81810400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210305381.9A Active CN114598833B (en) 2022-03-25 2022-03-25 Video frame interpolation method based on spatio-temporal joint attention

Country Status (1)

Country Link
CN (1) CN114598833B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243031A (en) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 Video spatiotemporal feature optimization method and system based on quality attention mechanism, electronic device and storage medium


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080174694A1 (en) * 2007-01-22 2008-07-24 Horizon Semiconductors Ltd. Method and apparatus for video pixel interpolation
CN101903828A (en) * 2007-12-20 2010-12-01 汤姆森许可贸易公司 Device for helping the capture of images
CN107133919A (en) * 2017-05-16 2017-09-05 西安电子科技大学 Time dimension video super-resolution method based on deep learning
US20210383169A1 (en) * 2019-03-01 2021-12-09 Peking University Shenzhen Graduate School Method, apparatus, and device for video frame interpolation
CN111915659A (en) * 2019-05-10 2020-11-10 三星电子株式会社 CNN-based systems and methods for video frame interpolation
CN112734696A (en) * 2020-12-24 2021-04-30 华南理工大学 Face changing video tampering detection method and system based on multi-domain feature fusion
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction
CN113132664A (en) * 2021-04-19 2021-07-16 科大讯飞股份有限公司 Frame interpolation generation model construction method and video frame interpolation method
CN114125455A (en) * 2021-11-23 2022-03-01 长沙理工大学 Bidirectional coding video frame insertion method, system and equipment based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAN XIA: "Multi-Scale Attention Generative Adversarial Networks for Video Frame Interpolation", 《IEEE ACCESS》 *
JUN LI: "Spatio-Temporal Attention Networks for Action Recognition and Detection", 《 IEEE TRANSACTIONS ON MULTIMEDIA》 *
ZHIHAO SHI: "Video frame interpolation via generalized deformable convolution", 《IEEE TRANSACTIONS ON MULTIMEDIA》 *
DONG MENG et al.: "Video super-resolution reconstruction based on attention residual convolutional network", 《Journal of Changchun University of Science and Technology (Natural Science Edition)》 *


Also Published As

Publication number Publication date
CN114598833B (en) 2023-02-10

Similar Documents

Publication Publication Date Title
Zamir et al. Restormer: Efficient transformer for high-resolution image restoration
CN113673307B (en) Lightweight video action recognition method
Reda et al. Unsupervised video interpolation using cycle consistency
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111709895A (en) Image blind deblurring method and system based on attention mechanism
US11978146B2 (en) Apparatus and method for reconstructing three-dimensional image
CN111986105B (en) Video time sequence consistency enhancing method based on time domain denoising mask
CN108924528B (en) Binocular stylized real-time rendering method based on deep learning
CN107194948B (en) Video significance detection method based on integrated prediction and time-space domain propagation
CN112991450B (en) Detail enhancement unsupervised depth estimation method based on wavelet
CN112422870B (en) Deep learning video frame insertion method based on knowledge distillation
CN115953582B (en) Image semantic segmentation method and system
CN110956655A (en) Dense depth estimation method based on monocular image
CN112288788A (en) Monocular image depth estimation method
CN114598833B (en) Video frame interpolation method based on spatio-temporal joint attention
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
CN116205962A (en) Monocular depth estimation method and system based on complete context information
CN113160081A (en) Depth face image restoration method based on perception deblurring
Xiao et al. Progressive motion boosting for video frame interpolation
KR102057395B1 (en) Video generation method using video extrapolation based on machine learning
CN114612305B (en) Event-driven video super-resolution method based on stereogram modeling
CN114820745A (en) Monocular visual depth estimation system, method, computer device, and computer-readable storage medium
Tang et al. A constrained deformable convolutional network for efficient single image dynamic scene blind deblurring with spatially-variant motion blur kernels estimation
CN110827238A (en) Improved side-scan sonar image feature extraction method of full convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant