CN111524068B - Variable-length input super-resolution video reconstruction method based on deep learning - Google Patents
Variable-length input super-resolution video reconstruction method based on deep learning
- Publication number: CN111524068B (application CN202010290657.1A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06N 3/045: Combinations of networks
- G06N 3/08: Learning methods
- G06T 3/60: Rotation of whole images or parts thereof
- G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T 2207/20221: Image fusion; image merging
- Y02T 10/40: Engine management systems
Abstract
The invention discloses a variable-length input super-resolution video reconstruction method based on deep learning. The method comprises the following steps: constructing training samples of random length and acquiring a training set; establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence; training the model on the training set to obtain a trained super-resolution video reconstruction network; and sequentially inputting the video to be processed into the trained network to reconstruct it, obtaining the corresponding super-resolution reconstructed video. The invention adopts a gradual alignment fusion mechanism that aligns and fuses the images frame by frame; because each alignment operation acts only on two adjacent frames, the model can handle longer temporal relationships, and using more adjacent video frames means the input contains more scene information, which effectively improves the reconstruction quality.
Description
Technical Field
The invention belongs to the technical field of video restoration, and particularly relates to a variable-length input super-resolution video reconstruction method based on deep learning.
Background
The performance of most image- and video-based applications depends on image quality. In general, the quality of an image is related to the amount of information it contains, and resolution measures that amount, expressed as the number of pixels per unit area (for example, 1024×768). The resolution of an image therefore reflects its quality, so in real-life application scenarios high resolution has become a basic quality demand for images and video.
However, when a video contains occlusion, severe blur, or complex motion with large offsets, it must be reconstructed to recover high-quality information. To effectively fuse the complementary information of multiple frames and obtain a high-quality reconstructed image, all frames in the input sequence must be aligned so that an accurate correspondence can be established for the subsequent reconstruction step. Because the camera or objects are in constant motion, the target frame and each adjacent frame are misaligned, which makes alignment a challenging but crucial problem for video super-resolution. At present, most super-resolution models treat all adjacent frames equally and process them with the same alignment network, ignoring that different adjacent frames lie at different temporal distances from the target frame. In theory, the motion offset of each adjacent frame relative to the target frame is different, and frames farther from the target have larger offsets, so it is difficult for a single alignment network to learn the alignment of all adjacent frames simultaneously.
At present, most multi-frame super-resolution models accept only input sequences of a fixed length, so images at the two ends of a video sequence cannot be processed normally during reconstruction. This is caused by the structural limitation of the models: the input sequence can only be completed by mirroring or by copying the target frame. As shown in fig. 1, in fig. 1 (a) the input length is 9 (the target frame plus 4 frames on each side); when there are not enough remaining frames to the left of the current target frame, the fixed-length input model must pad the sequence by copying other frames, which introduces artificial marks and additional noise. With the variable-length input of fig. 1 (b), no such padding is needed and the sequence can be fed directly into the reconstruction model, which better matches the requirements of practical applications. Moreover, if an appropriate input length (both the total length and the number of adjacent frames on each side) can be chosen for each usage scenario, the applicability of a multi-frame super-resolution reconstruction model is greatly enhanced.
Disclosure of Invention
Aiming at the defects of existing design methods, the invention provides a variable-length input super-resolution video reconstruction method based on deep learning. A variable-length input sequence solves the problem of inaccurate alignment of long input sequences in the video super-resolution task, and the gradual alignment fusion network can align and fuse any number of adjacent frames without affecting the subsequent reconstruction task, giving the method high practicability.
A variable-length input super-resolution video reconstruction method based on deep learning comprises the following steps:
step 1, constructing training samples of random length and acquiring a training set;
step 2, establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence;
step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
step 4, sequentially inputting the video to be processed into the trained super-resolution video reconstruction network to reconstruct it, obtaining the corresponding super-resolution reconstructed video;
the length of each input image sequence of the video to be processed can be chosen by the user.
Further, the training samples of random length are constructed as follows:
first, given an input sequence length K (K > 0), select a data set;
secondly, designate a target frame to be reconstructed;
finally, select x frames to the left of the target frame and K−1−x frames to its right, and arrange the K frames in order from left to right to obtain the input image sequence;
where x is an integer drawn randomly from a uniform distribution, x = 0, 1, …, K−1.
Further, the training set is acquired as follows:
first, apply random horizontal flipping and rotation to each original training sample to obtain spatially transformed training samples;
secondly, introduce an interval variable T (T > 1) and sample an input image sequence of the given length at interval T, so as to simulate a low acquisition frame rate or a fast-moving target, obtaining temporally enhanced training samples;
finally, the training set consists of the original training samples, the spatially transformed training samples and the temporally enhanced training samples.
Further, the training set is used to train the super-resolution video reconstruction network model in the following steps:
3.1, given the maximum number of training iterations, initialize the network model parameters;
3.2, perform feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, apply the gradual alignment fusion module to the feature image sequence to obtain the aligned and fused feature image;
3.4, apply the depth residual module to the aligned and fused feature image to perform nonlinear mapping, obtaining the mapped feature image;
3.5, magnify the mapped feature image by sub-pixel convolution to obtain a feature image of the target size;
3.6, magnify the original target frame image by up-sampling to obtain an original image of the target size;
3.7, use the superposition module to add the feature image of the target size to the original image of the target size, obtaining the reconstructed image of the target frame;
3.8, optimize and update the parameters of the network model;
for each input image sequence, repeat steps 3.2–3.8 until the maximum number of training iterations is reached.
Furthermore, the gradual alignment fusion module performs gradual alignment feature fusion on the feature image sequence as follows:
first, for the feature images to the left of the target frame, let F_l denote the fused left-side feature image. Starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F_2'; let F_l = F_2'. Then align the fused feature image F_2' to the third feature image F_3 and fuse again to obtain F_3'; let F_l = F_3'. Continue in this way until F_{t-1}' is obtained, and set F_l = F_{t-1}';
secondly, for the feature images to the right of the target frame, let F_r denote the fused right-side feature image. Starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain F_{k-1}'; let F_r = F_{k-1}'. Then align F_{k-1}' to the third-to-last feature image F_{k-2} and fuse again to obtain F_{k-2}'; let F_r = F_{k-2}'. Continue in this way until F_{t+1}' is obtained, and set F_r = F_{t+1}';
finally, fuse the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image.
Further, the first feature image F_1 is aligned to the second feature image F_2 as follows. Let both F_1 and F_2 have size W×H×C, where W is the width of the feature map, H its height and C its number of channels:
first, concatenate F_1 and F_2 along the channel dimension to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation, obtaining a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication, completing the alignment of F_1 to F_2.
Further, several feature images are fused as follows:
(a) The M feature images to be fused are preliminarily fused by element-wise addition to obtain the preliminary fusion matrix U:
U = U_1 + U_2 + … + U_M,
where U_i denotes the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s:
s_c = (1/(W×H)) · Σ_{m,n} U_c(m, n),
where s_c is the value of the c-th channel of the pooled result s; U_c is the feature matrix of the c-th channel of U; and U_c(m, n) is the pixel value of U_c at pixel (m, n);
(c) Two fully connected layers model the correlation between the channels of the feature map:
z = W_2 · (δ(W_1 · s)),
where W_1 and W_2 are the weights of the first and second fully connected layers and δ is the ReLU activation function;
(d) A 1×1 convolution layer models the internal correlation of each feature matrix in the spatial dimension:
v_i = CNN_{1×1}(W_3, U_i),
where CNN_{1×1}(·) denotes a convolution layer with a 1×1 kernel and W_3 its weight matrix;
(e) The channel correlation z and the spatial correlation v_i are combined into the weight map a_i:
a_i = v_i · z;
(f) A sigmoid function recalibrates {a_i} to obtain the total weight vectors {b_i}, where j = 1, 2, …, M indexes the feature images to be fused; (m, n, c) denotes the position of a pixel; and b_{i,m,n,c} is the weight of the i-th feature image to be fused at pixel (m, n, c);
(g) Each total weight vector b_i is multiplied element-wise with the corresponding feature image U_i, and the results are added to obtain the fused result:
result = Σ_{i=1}^{M} b_i ⊙ U_i,
where ⊙ denotes element-wise multiplication.
Further, the depth residual module is formed by stacking several improved residual modules.
Still further, each improved residual module comprises four convolution layers. With the number of input channels set to C: the first convolution layer has a 1×1 kernel and 6×C channels; the second has a 1×1 kernel and C/2 channels; the third has a 3×3 kernel and C/2 channels; and the fourth has a 1×1 kernel and C channels.
Compared with the prior art, the invention has the advantages that:
(1) The invention adopts a gradual alignment fusion mechanism that aligns and fuses images frame by frame. Because each alignment operation acts only on two adjacent frames, the model can handle longer temporal relationships, and using more adjacent video frames means the input contains more scene information, which effectively improves the reconstruction quality.
(2) The invention accepts frame sequences of different lengths as input, which is highly practical; the gradual alignment fusion module can align and fuse any number of adjacent frames without affecting the subsequent reconstruction task.
(3) The feature fusion of the invention takes into account that different video frames and different positions contribute differently to the reconstruction quality, so features from different video frames are fused more effectively.
(4) The invention uses an improved depth residual network as the reconstruction network, which has a stronger capability of learning the mapping.
Drawings
FIG. 1 is a schematic comparison of the conventional fixed-length input model and the variable-length input model of the present invention; (a) is a schematic diagram of the conventional fixed-length input model; (b) is a schematic diagram of the variable-length input model of the present invention;
FIG. 2 is a schematic diagram of random length training samples during training according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a super-resolution video reconstruction network structure according to an embodiment of the present invention;
FIG. 4 is a schematic comparison of the conventional residual module and the improved residual module in the embodiment of the present invention; (a) shows the structure of the conventional residual module, and (b) shows the structure of the improved residual module;
fig. 5 is a schematic structural diagram of a feature fusion module according to an embodiment of the present invention.
Detailed Description
To describe the technical contents, operation flow, achieved objects and effects of the present invention in detail, the following description of examples is given.
A variable-length input super-resolution video reconstruction method based on deep learning comprises the following steps:
Step 1, constructing training samples of random length and acquiring a training set.
Illustratively, training samples of random length are acquired as follows:
first, given an input sequence length K (K > 0), select a data set;
secondly, designate a target frame to be reconstructed;
finally, select x frames to the left of the target frame and K−1−x frames to its right, and arrange the K frames in order from left to right to obtain the input image sequence;
where x is an integer drawn randomly from a uniform distribution, x = 0, 1, …, K−1.
The length of the input sequence in the invention can be fixed or varied as required. In this embodiment, REDS is used as the original training sample set during training, and low-resolution images are obtained by bicubic interpolation; a 64×64 RGB block from the low-resolution image is combined with the corresponding high-resolution block to form a training sample. Random horizontal flipping and rotation are used for data enhancement to expand the number of training samples. In addition, the mean RGB value of the whole training set is subtracted from each sample to pre-process all training data. Illustratively, a training sample is constructed as follows: the input length is fixed at K = 15 during the training phase; given the target frame to be reconstructed, an integer x (x = 0, 1, …, K−1) is drawn from a uniform distribution, where x is the length of the input sequence to the left of the target frame and K−1−x the length to its right; the frames are then combined into an input sequence of length K in left-to-right order, as shown in fig. 2. To exploit GPU-accelerated matrix operations, the x values of all training samples in the same batch are identical.
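The random-length sample construction can be sketched as follows; `build_sample`, its arguments and the list-of-frames representation are illustrative helpers, not names from the patent. The caller is assumed to provide a target index with enough frames on both sides.

```python
import random

def build_sample(frames, t, K):
    """Build one variable-length training sample around target index t.

    x frames are taken from the left of the target and K-1-x from the
    right, with x drawn uniformly from {0, ..., K-1}.
    """
    x = random.randint(0, K - 1)          # uniform over 0..K-1, inclusive
    left = frames[t - x:t]                # x frames to the left of the target
    right = frames[t + 1:t + K - x]       # K-1-x frames to the right
    return left + [frames[t]] + right
```

Whatever x is drawn, the result is a contiguous run of K frames that contains the target frame, which is exactly the variable-position window of fig. 2.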
Furthermore, when acquiring the training set, the invention can also perform temporal data enhancement alongside the usual spatial data enhancement (random horizontal flipping and rotation), in order to create training data closer to real application scenarios. An interval variable T represents the sampling interval of the temporal enhancement; when T > 1, a lower acquisition frame rate or a faster-moving object can be simulated. For example, if the target frame to be reconstructed is the i-th frame, the input length is 7 and T is 2, then the input image sequence can be expressed as:
i−6, i−4, i−2, i, i+2, i+4, i+6
With various values of T, more training data with complex motion can be created. Considering the characteristics of the REDS dataset, three temporal enhancement modes are selected (T = 1 corresponds to the original image sequence). Temporal enhancement increases the diversity and complexity of the training data in the time domain and improves super-resolution performance in complex scenes.
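The sampled frame indices for a given interval T can be generated as below; `temporal_indices` is a hypothetical helper name, and the symmetric spacing of the left and right sides follows the example above.

```python
def temporal_indices(i, K, x, T):
    """Frame indices for an input sequence of length K around target frame i,
    sampled at interval T, with x frames to the left of the target."""
    return [i + (j - x) * T for j in range(K)]
```

For the example above (input length 7, T = 2, x = 3) this yields i−6, i−4, i−2, i, i+2, i+4, i+6.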
Step 2, establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence.
Referring to fig. 3, in one embodiment of the invention the feature extractor consists of 5 residual modules (convolution layers) with the batch normalization layers removed, and the depth residual module is a stack of 12 improved residual modules. Illustratively, each improved residual module has the following structure:
the number of input channels is set as C, and four convolution layers are used for mapping learning of the input: the convolution kernel size of the first convolution layer is 1×1, and the channel number is 6×c; the convolution kernel of the second convolution layer is 1 multiplied by 1, and the number of channels is C/2; the convolution kernel size of the third convolution layer is 3 multiplied by 3, and the channel number is C/2; the convolution kernel size of the fourth convolution layer is 1×1, and the number of channels is C.
Fig. 4 compares the structures of the original and the improved residual modules. With the number of input channels set to 128, the improved residual module maps the input with four convolution layers: the first convolution layer has a 1×1 kernel and 768 channels; the second a 1×1 kernel and 64 channels; the third a 3×3 kernel and 64 channels; and the fourth a 1×1 kernel and 128 channels.
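A numpy sketch of the improved residual block's channel layout is given below. The weight shapes, the ReLU placement and the identity shortcut around the four layers are assumptions of this sketch; the patent specifies only the kernel sizes and channel counts.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing product; x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):
    """3x3 convolution with zero padding 1; x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def improved_residual_block(x, w1, w2, w3, w4):
    """1x1 (6C) -> 1x1 (C/2) -> 3x3 (C/2) -> 1x1 (C), plus an identity shortcut."""
    relu = lambda a: np.maximum(a, 0.0)
    y = relu(conv1x1(x, w1))    # C   -> 6C
    y = relu(conv1x1(y, w2))    # 6C  -> C/2
    y = relu(conv3x3(y, w3))    # C/2 -> C/2
    y = conv1x1(y, w4)          # C/2 -> C
    return x + y                # shortcut: output keeps C channels
```

With C = 128 the four layers carry 768, 64, 64 and 128 channels, matching the embodiment above.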
The superposition module is an adder: the mapped features output by the depth residual module are added to the up-sampled original target frame to obtain the final output result.
Step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
specifically, the training set is adopted to train the super-resolution video reconstruction network model, and the specific steps are as follows:
3.1, initializing super-resolution video reconstruction network model parameters given the maximum training times;
In this embodiment, the batch size is set to 16, the maximum number of training iterations to 600000, Adam is used as the optimizer, and the learning rate of all layers of the network is initialized to 4e-4. The L1 distance is used as the loss function, defined as follows:
L(I, Î) = (1/(h·w·c)) · Σ_{h,w,c} √((I(h,w,c) − Î(h,w,c))² + ε²)
where I denotes the real image, Î the predicted image, and h, w, c are the height, width and number of channels of the image, respectively. To ensure numerical stability during training, a very small constant ε is added inside the loss function; ε = 1e-3 is used.
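The embodiment's loss can be sketched in numpy. The exact placement of ε inside the square root (a Charbonnier-style smoothing of the L1 distance) is an assumption consistent with the stability remark above.

```python
import numpy as np

def l1_charbonnier(pred, target, eps=1e-3):
    """Mean L1 distance with a small constant eps for numerical stability."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))
```

When the prediction equals the target the loss degenerates to ε, and for large errors it approaches the plain mean absolute difference, so the smoothing only matters near zero.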
3.2, perform feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, apply the gradual alignment fusion module to the feature image sequence to obtain the aligned and fused feature image. Referring to fig. 3, the specific process is as follows:
first, for the feature images to the left of the target frame, let F_l denote the fused left-side feature image. Starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F_2'; let F_l = F_2'. Then align the fused feature image F_2' to the third feature image F_3 and fuse again to obtain F_3'; let F_l = F_3'. Continue in this way until F_{t-1}' is obtained, and set F_l = F_{t-1}';
secondly, for the feature images to the right of the target frame, let F_r denote the fused right-side feature image. Starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain F_{k-1}'; let F_r = F_{k-1}'. Then align F_{k-1}' to the third-to-last feature image F_{k-2} and fuse again to obtain F_{k-2}'; let F_r = F_{k-2}'. Continue in this way until F_{t+1}' is obtained, and set F_r = F_{t+1}';
finally, fuse the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image.
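The left and right folding in step 3.3 can be expressed as one generic fold; `align`, `fuse` and `fuse3` are stand-ins for the patent's alignment, pairwise-fusion and three-way fusion operations, and the scalar stand-ins in the usage note below are purely illustrative.

```python
def fold(seq, align, fuse):
    """Progressively align the running result to the next frame and fuse:
    F_1 -> F_2 gives F_2', F_2' -> F_3 gives F_3', and so on."""
    cur = seq[0]
    for nxt in seq[1:]:
        cur = fuse(align(cur, nxt), nxt)
    return cur

def progressive_align_fuse(features, t, align, fuse, fuse3):
    """Fuse a feature sequence around target index t (0-based).

    The left side folds toward the target from the leftmost frame, the right
    side folds toward the target from the rightmost frame; either side may be
    empty, which is what allows variable-length input.
    """
    left = fold(features[:t], align, fuse) if t > 0 else None                  # F_l
    right = fold(features[t + 1:][::-1], align, fuse) if t < len(features) - 1 else None  # F_r
    parts = [p for p in (left, features[t], right) if p is not None]
    return fuse3(parts)
```

Because each `align` call only ever sees two adjacent features, the same module handles any sequence length, including target frames at either end of the video.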
The two adjacent feature images in the above process are aligned as follows. For example, the first feature image F_1 is aligned to the second feature image F_2 in this way: let both F_1 and F_2 have size W×H×C, where W is the width of the feature map, H its height and C its number of channels;
first, concatenate F_1 and F_2 along the channel dimension to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation, obtaining a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication, completing the alignment of F_1 to F_2.
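A minimal numpy sketch of this pairwise alignment is given below; `weight_net` stands in for the patent's stack of convolution layers, and the channel-first (C, H, W) layout is an assumption for convenience.

```python
import numpy as np

def align_to(F1, F2, weight_net):
    """Align F1 to F2: concatenate along channels (-> 2C), map to a C-channel
    weight matrix with weight_net, then weight F1 element-wise."""
    cat = np.concatenate([F1, F2], axis=0)   # (2C, H, W) connection matrix
    Wm = weight_net(cat)                     # (C, H, W) weight matrix
    return Wm * F1                           # element-wise weighting of F1
```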
3.4, performing nonlinear mapping on the aligned and fused characteristic images by adopting a depth residual error module to obtain mapped characteristic images;
3.5, carrying out size amplification on the mapped characteristic image through sub-pixel convolution to obtain a characteristic image with a target size;
3.6, performing size amplification on the original target frame image through up-sampling to obtain an original image of the target size; this embodiment up-samples either by bilinear interpolation or by a 5×5 convolution layer followed by a sub-pixel convolution layer.
3.7, overlapping the characteristic image of the target size with the original image of the target size by adopting an overlapping module to obtain a reconstructed image of the target frame;
3.8, optimizing and updating parameters of the super-resolution video reconstruction network model;
for each input image sequence, steps 3.2–3.8 are repeated until the maximum number of training iterations is reached.
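Steps 3.2–3.7 can be summarized as one forward pass; the five callables below are stand-ins for the patent's modules, and their signatures are assumptions of this sketch (the scalar stand-ins in the test are illustrative only).

```python
def reconstruct_target(frames, t, extract, pa_fuse, residual, up_feat, up_img):
    """One forward pass of the super-resolution reconstruction network."""
    feats = [extract(I) for I in frames]   # 3.2 feature extraction per frame
    fused = pa_fuse(feats, t)              # 3.3 gradual alignment fusion
    mapped = residual(fused)               # 3.4 nonlinear mapping (depth residual module)
    hr_feat = up_feat(mapped)              # 3.5 sub-pixel convolution magnification
    hr_img = up_img(frames[t])             # 3.6 up-sample the original target frame
    return hr_feat + hr_img                # 3.7 superposition module (adder)
```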
Further, as shown in Fig. 5, the specific process of fusing a plurality of feature images in the above procedure is as follows:
(a) The M feature images to be fused are preliminarily fused by element-wise addition to obtain a preliminary fusion matrix U = U_1 + U_2 + ... + U_M,
where U_i represents the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s, with s_c = (1/(W·H)) Σ_m Σ_n U_c(m, n),
where s_c denotes the c-th channel of the pooled result s; U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U; and U_c(m, n) denotes the pixel value of matrix U_c at pixel point (m, n);
(c) Two fully connected layers build a correlation model between the channels of the feature map:
z = W_2 · (δ(W_1 · s))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) A 1×1 convolution reduces each input feature matrix U_i to size W×H and learns its internal correlation in the spatial dimension:
v_i = CNN_1×1(W_3, U_i)
where CNN_1×1(·) denotes a convolution layer with a 1×1 kernel and W_3 denotes the weight matrix of the convolution layer;
(e) The spatial correlation v_i and the channel correlation z are combined: a_i = v_i · z;
(f) A sigmoid function recalibrates {a_i} to obtain the total weight vector {b_i},
where j = 1, 2, ..., M; (m, n, c) denotes the position coordinates of a pixel point; b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused; the above computation is carried out independently at each position of the feature map.
(g) The total weight vectors {b_i} are multiplied element-wise with the corresponding feature images to be fused {U_i}, and the products are summed to obtain the fused result,
where ⊙ denotes element-wise multiplication, i.e., multiplication of elements at corresponding positions.
Step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed is self-defined.
The method adopts a deep residual network, and by improving the structure of the residual module it reduces the parameter count while improving the learning capacity of the network. Increasing the number of channels in the middle layer of the residual module helps improve the reconstruction quality of the model, but directly increasing the number of channels greatly increases the amount of computation, so 1×1 convolutions are introduced to change the number of channels of the feature map. The 1×1 convolution is widely used in models such as ResNet, ResNeXt and MobileNetV2 to reduce and increase the number of channels in the feature map. Here a 1×1 convolution first reduces the number of channels, a 3×3 convolution then performs feature extraction and mapping, and a final 1×1 convolution restores the number of channels. Compared with the original residual module, the improved residual module not only reduces the amount of computation but also strengthens the modeling of inter-channel relations, further benefiting the reconstruction capability of the model.
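A quick parameter count illustrates the saving. The sketch below compares a plain residual block of two 3×3 convolutions at C channels against the improved block, using the channel schedule given later in claim 7 (1×1 to 6C, 1×1 to C/2, 3×3 at C/2, 1×1 back to C) and ignoring biases.

```python
def conv_params(cin, cout, k):
    # weight count of a k x k convolution layer, biases ignored
    return cin * cout * k * k

C = 64
# plain residual block: two 3x3 convolutions at C channels
plain = 2 * conv_params(C, C, 3)
# improved block, channel schedule taken from claim 7
improved = (conv_params(C, 6 * C, 1) + conv_params(6 * C, C // 2, 1)
            + conv_params(C // 2, C // 2, 3) + conv_params(C // 2, C, 1))
print(plain, improved)  # 73728 vs 48128 at C = 64
```

Even though the first 1×1 layer expands to 6C channels, the expensive 3×3 convolution runs at only C/2 channels, so the improved block is cheaper overall.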
The invention adopts a gradual alignment fusion mechanism that progressively aligns adjacent frames to the target frame and fuses them frame by frame; each alignment operation involves only two adjacent frames. Compared with models that align every adjacent frame to the target frame independently, this mechanism greatly improves the robustness of the reconstruction model to complex motion. In addition, some optical-flow-based methods align the original images, which is highly susceptible to noise and occlusion, whereas the gradual alignment fusion mechanism aligns feature images obtained after feature extraction, which are less sensitive to occlusion, blur and noise in the original images. The mechanism therefore not only improves alignment accuracy but also allows a larger number of adjacent frames to be aligned and fused, meaning more scene information can be exploited, which helps improve the reconstruction quality of the model.
The invention adopts a random-length training mechanism. Variable-length input requires that the super-resolution reconstruction model allow users to input video image sequences of different lengths without affecting the reconstruction quality, so that a suitable input length can be chosen according to the characteristics of the real data. When no useful complementary information exists between adjacent images, only the target-frame image is input; when adjacent frames can provide additional useful features, an appropriate input length is chosen, which is of great practical significance for applying super-resolution reconstruction. Under the random-length training mechanism, although the input length during training is fixed, the number of video frames aligned and fused by the gradual alignment fusion network on either side of the current frame is random. The network therefore learns feature-fusion mappings for different numbers of video frames, so that at test time the model is unaffected by the number of input video frames and its reconstruction quality is preserved.
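A minimal sketch of constructing one random-length training sample (the helper name and list-based frame handling are ours; x is drawn uniformly, as in claim 2):

```python
import random

def sample_sequence(frames, t, K):
    """Take x frames left of target t and K-1-x frames right, x ~ U{0..K-1}."""
    x = random.randint(0, K - 1)                 # uniformly distributed left-side length
    left = frames[max(0, t - x):t]
    right = frames[t + 1:t + 1 + (K - 1 - x)]
    return left + [frames[t]] + right            # arranged in left-to-right order
```

During training the progressive alignment-fusion network thus sees a random split of frames on either side of the target, which is what makes the trained model insensitive to the input length at test time.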
In summary, by means of the two innovations of gradual alignment fusion and random-length training, the invention not only improves the quality of video super-resolution reconstruction but also allows the model to accept image sequences of any length, including any total input-sequence length and any single-side sequence length, greatly widening the applicability of video super-resolution reconstruction.
While the invention has been described in detail in this specification with reference to the general description and the specific embodiments thereof, it will be apparent to one skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.
Claims (7)
1. The variable-length input super-resolution video reconstruction method based on deep learning is characterized by comprising the following steps of:
step 1, constructing training samples with random lengths, and acquiring a training set;
step 2, establishing a super-resolution video reconstruction network model: the device comprises a feature extractor, a gradual alignment fusion module, a depth residual error module and a superposition module which are connected in sequence;
step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
the training set is adopted to train the super-resolution video reconstruction network model, and the specific steps are as follows:
3.1, given the maximum number of training iterations, initializing the parameters of the super-resolution video reconstruction network model;
3.2, inputting an image sequence (I_1, ..., I_t, ..., I_k) and performing feature extraction on each image in the sequence to obtain the corresponding feature image sequence (F_1, ..., F_t, ..., F_k);
where t is the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, adopting a gradual alignment fusion module to carry out gradual alignment feature fusion on the feature image sequence to obtain an aligned and fused feature image;
3.4, performing nonlinear mapping on the aligned and fused characteristic images by adopting a depth residual error module to obtain mapped characteristic images;
3.5, carrying out size amplification on the mapped characteristic image through sub-pixel convolution to obtain a characteristic image with a target size;
3.6, performing size amplification on the original target frame image through up-sampling to obtain an original image with a target size;
3.7, overlapping the characteristic image of the target size with the original image of the target size by adopting an overlapping module to obtain a reconstructed image of the target frame;
3.8, optimizing and updating parameters of the super-resolution video reconstruction network model;
repeating steps 3.2-3.8 for each input image sequence until the maximum number of training iterations is reached;
the gradual alignment fusion module is used for gradual alignment feature fusion of the feature image sequence, and specifically comprises the following steps:
first, for the feature image sequence to the left of the target frame: let F_l denote the left-side feature image of the target frame; starting from the leftmost feature image F_1, align the first-frame feature image F_1 to the second-frame feature image F_2, fuse the aligned first-frame feature image with the second-frame feature image to obtain the fused feature image F_2′, and let F_l = F_2′; align the fused feature image F_2′ to the third-frame feature image F_3 and fuse again to obtain F_3′, letting F_l = F_3′; and so on until F_{t-1}, whereupon F_l = F_{t-1}′;
secondly, for the feature image sequence to the right of the target frame: let F_r denote the right-side feature image of the target frame; starting from the rightmost feature image F_k, align the last-frame feature image F_k to the second-to-last-frame feature image F_{k-1}, fuse the two aligned feature images to obtain the fused feature image F_{k-1}′, and let F_r = F_{k-1}′; align the fused feature image F_{k-1}′ to the third-to-last-frame feature image F_{k-2} and fuse again to obtain F_{k-2}′, letting F_r = F_{k-2}′; and so on until F_{t+1}, whereupon F_r = F_{t+1}′;
finally, fusing the left-side feature image F_l of the target frame, the target-frame feature image F_t, and the right-side feature image F_r of the target frame to obtain the aligned and fused feature image;
step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed is self-defined.
2. The variable length input super resolution video reconstruction method based on deep learning of claim 1, wherein constructing the training samples of random length comprises:
first, given an input sequence length K, K > 0; selecting a data set;
secondly, giving a target frame to be reconstructed;
finally, selecting x frames to the left of the target frame and K-1-x frames to the right of the target frame, and arranging the K frames in left-to-right order to obtain the input image sequence;
where x is an integer drawn from a uniform distribution, x = 0, 1, ..., K-1.
3. The variable length input super resolution video reconstruction method based on deep learning according to claim 1, wherein acquiring the training set comprises:
firstly, random horizontal overturning and rotation are used for each original training sample, so that a space transformation training sample is obtained;
secondly, introducing an interval variable T, T > 1, and acquiring input image sequences of the given input-sequence length with T as the sampling interval, so as to simulate a low acquisition frame rate or fast-moving targets, obtaining time-enhanced training samples;
finally, the training set is composed of the original training sample, the space transformation training sample and the time enhancement training sample.
4. The variable length input super resolution video reconstruction method based on deep learning as claimed in claim 1, wherein aligning the first-frame feature image F_1 to the second-frame feature image F_2 specifically comprises: letting the first-frame feature image F_1 and the second-frame feature image F_2 both have size W×H×C, where W is the width of the feature map, H is its height, and C is its number of channels;
first, concatenating the first-frame feature image F_1 and the second-frame feature image F_2 along the channel direction to obtain a W×H×2C concatenation matrix;
second, performing mapping and channel-number transformation on the concatenation matrix with several convolution layers to obtain a W×H×C weight matrix;
finally, applying the weight matrix to F_1 by element-wise multiplication, completing the alignment of F_1 to F_2.
5. The variable length input super resolution video reconstruction method based on deep learning as claimed in claim 1, wherein fusing a plurality of feature images specifically comprises:
(a) preliminarily fusing the M feature images to be fused by element-wise addition to obtain a preliminary fusion matrix U = U_1 + U_2 + ... + U_M,
where U_i represents the i-th feature image to be fused;
(b) applying global average pooling to the preliminary fusion matrix U to obtain the pooled result s, with s_c = (1/(W·H)) Σ_m Σ_n U_c(m, n),
where s_c denotes the c-th channel of the pooled result s; U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U; and U_c(m, n) denotes the pixel value of matrix U_c at pixel point (m, n);
(c) building a correlation model between the channels of the feature map using two fully connected layers:
z = W_2 · (δ(W_1 · s))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) establishing the internal correlation of each feature matrix in the spatial dimension using a 1×1 convolution layer:
v_i = CNN_1×1(W_3, U_i)
where CNN_1×1(·) denotes a convolution layer with a 1×1 kernel; W_3 denotes the weight matrix of the convolution layer;
(e) a_i = v_i · z;
(f) recalibrating {a_i} with a sigmoid function to obtain the total weight vector {b_i},
where j = 1, 2, ..., M; (m, n, c) denotes the position coordinates of a pixel point; b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused;
(g) multiplying the total weight vectors {b_i} element-wise with the corresponding feature images to be fused {U_i} and summing the products to obtain the fused result,
where ⊙ denotes element-wise multiplication, i.e., multiplication of elements at corresponding positions.
6. The variable length input super resolution video reconstruction method based on deep learning according to claim 1, wherein the depth residual module is formed by stacking a plurality of improved residual modules.
7. The variable length input super resolution video reconstruction method based on deep learning as claimed in claim 6, wherein the improved residual module comprises four convolution layers; with the number of input channels set to C, the first convolution layer has a 1×1 kernel and 6×C channels; the second convolution layer has a 1×1 kernel and C/2 channels; the third convolution layer has a 3×3 kernel and C/2 channels; the fourth convolution layer has a 1×1 kernel and C channels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010290657.1A CN111524068B (en) | 2020-04-14 | 2020-04-14 | Variable-length input super-resolution video reconstruction method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111524068A CN111524068A (en) | 2020-08-11 |
CN111524068B true CN111524068B (en) | 2023-06-02 |
Family
ID=71902261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010290657.1A Active CN111524068B (en) | 2020-04-14 | 2020-04-14 | Variable-length input super-resolution video reconstruction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111524068B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183353B (en) * | 2020-09-28 | 2022-09-20 | 腾讯科技(深圳)有限公司 | Image data processing method and device and related equipment |
CN112365403B (en) * | 2020-11-20 | 2022-12-27 | 山东大学 | Video super-resolution recovery method based on deep learning and adjacent frames |
CN112700392A (en) * | 2020-12-01 | 2021-04-23 | 华南理工大学 | Video super-resolution processing method, device and storage medium |
CN112580473B (en) * | 2020-12-11 | 2024-05-28 | 北京工业大学 | Video super-resolution reconstruction method integrating motion characteristics |
CN112750094B (en) * | 2020-12-30 | 2022-12-09 | 合肥工业大学 | Video processing method and system |
CN112767247A (en) * | 2021-01-13 | 2021-05-07 | 京东方科技集团股份有限公司 | Image super-resolution reconstruction method, model distillation method, device and storage medium |
CN112950470B (en) * | 2021-02-26 | 2022-07-15 | 南开大学 | Video super-resolution reconstruction method and system based on time domain feature fusion |
CN113099038B (en) * | 2021-03-08 | 2022-11-22 | 北京小米移动软件有限公司 | Image super-resolution processing method, image super-resolution processing device and storage medium |
CN112991183B (en) * | 2021-04-09 | 2023-06-20 | 华南理工大学 | Video super-resolution method based on multi-frame attention mechanism progressive fusion |
CN113052764B (en) * | 2021-04-19 | 2022-11-08 | 东南大学 | Video sequence super-resolution reconstruction method based on residual connection |
CN113507607B (en) * | 2021-06-11 | 2023-05-26 | 电子科技大学 | Compressed video multi-frame quality enhancement method without motion compensation |
CN113592719B (en) * | 2021-08-14 | 2023-11-28 | 北京达佳互联信息技术有限公司 | Training method of video super-resolution model, video processing method and corresponding equipment |
CN113888426B (en) * | 2021-09-28 | 2024-06-14 | 国网安徽省电力有限公司电力科学研究院 | Power monitoring video deblurring method based on depth separable residual error network |
CN113902623A (en) * | 2021-11-22 | 2022-01-07 | 天津大学 | Method for super-resolution of arbitrary-magnification video by introducing scale information |
CN114529456B (en) * | 2022-02-21 | 2022-10-21 | 深圳大学 | Super-resolution processing method, device, equipment and medium for video |
CN114819109B (en) * | 2022-06-22 | 2022-09-16 | 腾讯科技(深圳)有限公司 | Super-resolution processing method, device, equipment and medium for binocular image |
CN115035230B (en) * | 2022-08-12 | 2022-12-13 | 浙江天猫技术有限公司 | Video rendering processing method, device and equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108961186A (en) * | 2018-06-29 | 2018-12-07 | 赵岩 | A kind of old film reparation recasting method based on deep learning |
WO2019120110A1 (en) * | 2017-12-20 | 2019-06-27 | 华为技术有限公司 | Image reconstruction method and device |
CN110136056A (en) * | 2018-02-08 | 2019-08-16 | 华为技术有限公司 | The method and apparatus of image super-resolution rebuilding |
WO2020015167A1 (en) * | 2018-07-17 | 2020-01-23 | 西安交通大学 | Image super-resolution and non-uniform blur removal method based on fusion network |
Non-Patent Citations (2)
Title |
---|
Video super-resolution method based on a multi-scale feature residual learning convolutional neural network; Lin Qi et al.; Signal Processing (No. 01); full text *
Video super-resolution reconstruction algorithm based on a quantization error estimation model; Wang Chunmeng; Journal of Jinling Institute of Technology (No. 01); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111524068A (en) | 2020-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111524068B (en) | Variable-length input super-resolution video reconstruction method based on deep learning | |
CN110324664B (en) | Video frame supplementing method based on neural network and training method of model thereof | |
CN109671023B (en) | Face image super-resolution secondary reconstruction method | |
CN109102462B (en) | Video super-resolution reconstruction method based on deep learning | |
CN108122197B (en) | Image super-resolution reconstruction method based on deep learning | |
CN110136062B (en) | Super-resolution reconstruction method combining semantic segmentation | |
CN111028150B (en) | Rapid space-time residual attention video super-resolution reconstruction method | |
KR20190100320A (en) | Neural Network Model Training Method, Apparatus and Storage Media for Image Processing | |
CN110675321A (en) | Super-resolution image reconstruction method based on progressive depth residual error network | |
CN111835983B (en) | Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network | |
CN110349087B (en) | RGB-D image high-quality grid generation method based on adaptive convolution | |
CN114418853B (en) | Image super-resolution optimization method, medium and equipment based on similar image retrieval | |
Niu et al. | Blind motion deblurring super-resolution: When dynamic spatio-temporal learning meets static image understanding | |
CN111612703A (en) | Image blind deblurring method based on generation countermeasure network | |
CN114339030A (en) | Network live broadcast video image stabilization method based on self-adaptive separable convolution | |
CN112907448A (en) | Method, system, equipment and storage medium for super-resolution of any-ratio image | |
CN114663509A (en) | Self-supervision monocular vision odometer method guided by key point thermodynamic diagram | |
Bare et al. | Real-time video super-resolution via motion convolution kernel estimation | |
Shen et al. | Deeper super-resolution generative adversarial network with gradient penalty for sonar image enhancement | |
CN113096032B (en) | Non-uniform blurring removal method based on image region division | |
CN112396554A (en) | Image super-resolution algorithm based on generation countermeasure network | |
CN112200752B (en) | Multi-frame image deblurring system and method based on ER network | |
CN112435165B (en) | Two-stage video super-resolution reconstruction method based on generation countermeasure network | |
CN112598604A (en) | Blind face restoration method and system | |
CN117196948A (en) | Event data driving-based video super-resolution method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240102 Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province Patentee after: Dragon totem Technology (Hefei) Co.,Ltd. Address before: 710061 No. 33, South Second Ring Road, Shaanxi, Xi'an Patentee before: CHANG'AN University |