CN111524068B - Variable-length input super-resolution video reconstruction method based on deep learning - Google Patents

Variable-length input super-resolution video reconstruction method based on deep learning

Info

Publication number
CN111524068B
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN202010290657.1A
Other languages
Chinese (zh)
Other versions
CN111524068A (en)
Inventor
任卫军
丁国栋
黄金文
张力波
Current Assignee
Dragon Totem Technology Hefei Co ltd
Original Assignee
Changan University
Priority date
Filing date
Publication date
Application filed by Chang'an University
Priority to CN202010290657.1A
Publication of CN111524068A
Application granted
Publication of CN111524068B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 3/60: Rotation of whole images or parts thereof
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/20221: Image fusion; image merging
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a variable-length input super-resolution video reconstruction method based on deep learning. The method comprises the following steps: constructing training samples of random length and acquiring a training set; establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence; training the super-resolution video reconstruction network model with the training set to obtain a trained super-resolution video reconstruction network; and inputting the video to be processed, sequence by sequence, into the trained network to obtain the corresponding super-resolution reconstructed video. The invention adopts a gradual alignment fusion mechanism that aligns and fuses the images frame by frame, with each alignment operation acting only on two adjacent frames, so the model can handle longer temporal relationships. Using more adjacent video frames means the input contains more scene information, which effectively improves the reconstruction quality.

Description

Variable-length input super-resolution video reconstruction method based on deep learning
Technical Field
The invention belongs to the technical field of video restoration, and particularly relates to a variable-length input super-resolution video reconstruction method based on deep learning.
Background
Most image- and video-based applications depend on image quality for their effectiveness. In general, the quality of an image is related to the amount of information it contains, and the resolution of an image, expressed by its number of pixels, for example 1024×768, is used to measure how much information the image carries. The resolution of an image therefore reflects its quality, so in real-life and application scenarios high resolution becomes a quality requirement for images and video.
However, when a video contains occlusion, severe blur, or complex motion with large offsets, it must be reconstructed to recover high-quality information. To effectively fuse the complementary information of multiple frames and obtain a high-quality reconstructed image, all frames of the input sequence must be aligned so that an accurate correspondence can be established for the subsequent reconstruction step. Because the camera or the objects in the scene are constantly moving, the target frame and each adjacent frame are misaligned, which makes alignment a challenging but essential problem for video super-resolution. At present, most super-resolution models treat all adjacent frames equally and process them with the same alignment network, ignoring the different temporal distances between the adjacent frames and the target frame. In theory, the motion offset of each adjacent frame relative to the target frame is different, and frames farther from the target frame have larger offsets, so it is difficult for a single alignment network to learn the alignment of all adjacent frames simultaneously.
At present, most multi-frame super-resolution models accept only input sequences of a fixed length, and images at the two ends of a video sequence cannot be processed normally during reconstruction. This is caused by the structural limitation of the models: the input sequence can only be completed by mirroring or by copying the target frame. As shown in fig. 1, in fig. 1 (a) the input length is 9 (the target frame plus 4 frames on each side); when the number of remaining video frames to the left of the current target frame is insufficient, the fixed-length input model must pad the sequence by copying other image frames, which introduces artificial patterns and additional noise. With the variable-length input of fig. 1 (b), no such padding is needed and the sequence can be fed directly into the reconstruction model, which better matches practical applications. Moreover, if a suitable input length (both the total length and the number of adjacent frames on each side) can be chosen according to the usage scenario, the applicability of multi-frame super-resolution reconstruction is greatly enhanced.
Disclosure of Invention
Aiming at the shortcomings of existing methods, the invention provides a variable-length input super-resolution video reconstruction method based on deep learning. A variable-length input sequence addresses the inaccurate alignment of long input sequences in the video super-resolution task, and a gradual alignment fusion network aligns and fuses any number of adjacent frames without affecting the subsequent reconstruction task, giving the method higher practicability.
A variable-length input super-resolution video reconstruction method based on deep learning comprises the following steps:
step 1, constructing training samples with random lengths, and acquiring a training set;
step 2, establishing a super-resolution video reconstruction network model, which comprises a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence;
step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed can be chosen by the user.
Further, the training samples with random length are constructed as follows:
first, given an input sequence length K, K > 0; selecting a data set;
secondly, giving a target frame to be reconstructed;
finally, selecting x frames of images on the left side of the target frame and K-1-x frames on the right side, and arranging the K frames in order from left to right to obtain the input image sequence;
where x is an integer drawn from a uniform distribution, x = 0, 1, …, K-1.
Further, the training set is acquired as follows:
first, random horizontal flipping and rotation are applied to each original training sample to obtain spatially transformed training samples;
secondly, an interval variable T (T > 1) is introduced, and input image sequences of the given input-sequence length are sampled at interval T to simulate a low acquisition frame rate or a fast-moving target, yielding temporally enhanced training samples;
finally, the training set consists of the original training samples, the spatially transformed training samples and the temporally enhanced training samples.
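For illustration only, a minimal PyTorch-style sketch of the spatial enhancement described above (random horizontal flipping and rotation applied consistently to a low-resolution sequence and its high-resolution target); the helper name, the 90-degree rotation steps and the tensor shapes are assumptions, not part of the claimed method:

```python
import random
import torch

def spatial_augment(lr_seq, hr_frame):
    """Randomly flip and rotate an LR sequence and its HR target consistently.

    lr_seq:   tensor of shape (K, C, h, w)  -- low-resolution input frames
    hr_frame: tensor of shape (C, H, W)     -- high-resolution target frame
    """
    if random.random() < 0.5:                       # random horizontal flip
        lr_seq = torch.flip(lr_seq, dims=[-1])
        hr_frame = torch.flip(hr_frame, dims=[-1])
    k = random.randint(0, 3)                        # random rotation by k * 90 degrees
    lr_seq = torch.rot90(lr_seq, k, dims=[-2, -1])
    hr_frame = torch.rot90(hr_frame, k, dims=[-2, -1])
    return lr_seq, hr_frame
```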
Further, the training set is adopted to train the super-resolution video reconstruction network model, and specifically comprises the following steps:
3.1, initializing super-resolution video reconstruction network model parameters given the maximum training times;
3.2, performing feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, performing gradual alignment feature fusion on the feature image sequence with the gradual alignment fusion module to obtain the aligned and fused feature image;
3.4, performing nonlinear mapping on the aligned and fused feature image with the depth residual module to obtain the mapped feature image;
3.5, enlarging the mapped feature image to the target size through sub-pixel convolution to obtain the feature image of the target size;
3.6, enlarging the original target frame image to the target size through up-sampling to obtain the original image of the target size;
3.7, superimposing the feature image of the target size and the original image of the target size with the superposition module to obtain the reconstructed image of the target frame;
3.8, optimizing and updating the parameters of the super-resolution video reconstruction network model;
for each input image sequence, steps 3.2-3.8 are repeated until the maximum number of training iterations is reached.
Furthermore, the gradual alignment fusion module performs gradual alignment feature fusion on the feature image sequence as follows:
first, for the feature image sequence to the left of the target frame: let F_l denote the fused feature image on the left side of the target frame; starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F'_2, and set F_l = F'_2; align the fused feature image F'_2 to the third feature image F_3 and fuse again to obtain F'_3, and set F_l = F'_3; and so on until F'_{t-1} is obtained, whereupon F_l = F'_{t-1};
secondly, for the feature image sequence to the right of the target frame: let F_r denote the fused feature image on the right side of the target frame; starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain the fused feature image F'_{k-1}, and set F_r = F'_{k-1}; align the fused feature image F'_{k-1} to the third-to-last feature image F_{k-2} and fuse again to obtain F'_{k-2}, and set F_r = F'_{k-2}; and so on until F'_{t+1} is obtained, whereupon F_r = F'_{t+1};
finally, fuse the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image.
Further, aligning the first feature image F_1 to the second feature image F_2 specifically comprises the following steps, where both F_1 and F_2 have size W×H×C (W is the width of the feature map, H the height and C the number of channels):
first, concatenate the first feature image F_1 and the second feature image F_2 along the channel direction to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation to obtain a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication to complete the alignment of F_1 to F_2.
Further, a plurality of feature images are fused, which specifically includes:
(a) The M feature images to be fused are preliminarily fused by element-wise addition to obtain the preliminary fusion matrix U:
U = Σ_{i=1}^{M} U_i
where U_i denotes the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s:
s_c = (1 / (W × H)) Σ_{m=1}^{W} Σ_{n=1}^{H} U_c(m, n)
where s_c denotes the c-th channel of the pooled result s, U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U, and U_c(m, n) denotes the pixel value of U_c at pixel point (m, n);
(c) Two fully connected layers are used to build a correlation model between the channels of the feature map:
z = W_2 · (δ(W_1 · U))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) A 1×1 convolution layer is used to establish the internal correlation of each feature matrix in the spatial dimension:
v_i = CNN_{1×1}(W_3, U_i)
where CNN_{1×1}(·) denotes a convolution layer with a 1×1 convolution kernel and W_3 denotes the weight matrix of that convolution layer;
(e) The total correlation {a_i} of the feature matrices is calculated:
a_i = v_i · z
(f) A sigmoid function is used to recalibrate {a_i} over the M feature images to obtain the total weight vector {b_i}, where j = 1, 2, …, M indexes the feature images to be fused, (m, n, c) denotes the position coordinates of a pixel point, and b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused;
(g) Each total weight vector b_i is multiplied element-wise with the corresponding feature image to be fused U_i, and the products are summed to obtain the fused result:
Σ_{i=1}^{M} b_i ⊙ U_i
where ⊙ denotes element-wise (para-position) multiplication.
Further, the depth residual module is formed by stacking a plurality of improved residual modules.
Still further, the improved residual module comprises four convolution layers: with the number of input channels set to C, the first convolution layer has a 1×1 kernel and 6×C channels; the second has a 1×1 kernel and C/2 channels; the third has a 3×3 kernel and C/2 channels; and the fourth has a 1×1 kernel and C channels.
Compared with the prior art, the invention has the advantages that:
(1) The invention adopts a gradual alignment fusion mechanism that aligns and fuses the images frame by frame, with each alignment operation acting only on two adjacent frames, so the model can handle longer temporal relationships; using more adjacent video frames means the input contains more scene information, which effectively improves the reconstruction quality.
(2) The invention accepts frame sequences of different lengths as input, which makes it more practical; the gradual alignment fusion module can align and fuse any number of adjacent frames without affecting the subsequent reconstruction task.
(3) The feature fusion of the invention takes into account that different video frames and different positions contribute differently to the reconstruction result, and can therefore fuse the features of different video frames more effectively.
(4) The invention uses the improved depth residual network as the reconstruction network, which has a stronger capability of learning mappings.
Drawings
FIG. 1 is a schematic comparison of a conventional fixed-length input model and the variable-length input model of the present invention; wherein (a) is a schematic diagram of the conventional fixed-length input model and (b) is a schematic diagram of the variable-length input model of the present invention;
FIG. 2 is a schematic diagram of random length training samples during training according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a super-resolution video reconstruction network structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the comparison of the conventional residual module and the modified residual module in the embodiment of the present invention; wherein, (a) is a traditional residual module processing structure schematic diagram, and (b) is an improved residual module processing structure schematic diagram;
fig. 5 is a schematic structural diagram of a feature fusion module according to an embodiment of the present invention.
Detailed Description
To describe the technical contents, operation flow, achieved objects and effects of the present invention in detail, the following description of examples is given.
A variable-length input super-resolution video reconstruction method based on deep learning comprises the following steps:
step 1, constructing training samples with random lengths, and acquiring a training set;
illustratively, the process of acquiring training samples of random length:
first, given an input sequence length K, K > 0; selecting a data set;
secondly, giving a target frame to be reconstructed;
finally, selecting an x-frame image on the left side of the target frame and a K-1-x frame image on the right side of the target frame, and arranging the K frame images in sequence from left to right to obtain an input image sequence;
where x is an integer drawn from a uniform distribution, x = 0, 1, …, K-1.
The length of the input sequence in the invention can be fixed or varied as required. In this embodiment, the REDS dataset is used as the original training sample set during training, and the low-resolution images are obtained by bicubic interpolation; a 64×64 RGB image block from the low-resolution image is combined with the corresponding high-resolution image block to form a training sample. Random horizontal flipping and rotation are used for data enhancement to expand the number of training samples, and all training data are preprocessed by subtracting the average RGB value of the whole training set. Illustratively, a training sample is constructed as follows: the total input length K is fixed at 15 during the training phase; given the target frame to be reconstructed, an integer x (x = 0, 1, …, K-1) is drawn from a uniform distribution, where x is the length of the input sequence to the left of the target frame and K-1-x is the length of the input sequence to the right; the frames are then combined into an input sequence of length K in left-to-right order, as shown in fig. 2. To exploit the GPU's accelerated matrix operations, the x values of different training samples in the same batch are kept identical. A minimal code sketch of this sampling procedure is given below.
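For illustration only, the random-length sample construction just described might be sketched as follows; the helper below is hypothetical (frame indexing and boundary handling are assumptions, and in the embodiment the same x would be shared by all samples of a batch):

```python
import random

def build_input_sequence(frames, target_idx, K):
    """Pick x frames to the left and K-1-x frames to the right of the target frame.

    frames:     list of frames (or frame indices) of one video
    target_idx: index of the target frame within `frames`
    K:          total input length (K > 0)
    """
    x = random.randint(0, K - 1)                            # x ~ Uniform{0, ..., K-1}
    # clamping to the frames actually available is an assumption for boundary cases
    x = min(x, target_idx)
    right = min(K - 1 - x, len(frames) - 1 - target_idx)
    left_part = frames[target_idx - x: target_idx]
    right_part = frames[target_idx + 1: target_idx + 1 + right]
    return left_part + [frames[target_idx]] + right_part    # left-to-right order
```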
Furthermore, when the training set is acquired, besides the common spatial data enhancement (random horizontal flipping and rotation), the invention also performs data enhancement in time in order to create training data closer to real application scenarios. An interval variable T is introduced to represent the sampling interval of the temporal data enhancement; when T > 1, a lower acquisition frame rate or a faster-moving object can be simulated. For example, if the target frame to be reconstructed is the i-th frame, the input length is 7 and T is 2, the input image sequence can be expressed as:
i-6,i-4,i-2,i,i+2,i+4,i+6
With T of various sizes, more training data with complex motion can be created. Considering the characteristics of the REDS dataset, three temporal enhancement modes are selected (including T = 1, i.e. the original image sequence). Temporal enhancement increases the diversity and complexity of the training data in the time domain and improves super-resolution performance in complex scenes. A sketch of the interval sampling is given below.
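A small illustrative sketch of the temporal sampling with interval T; the boundary handling is an assumption:

```python
def temporal_sample(num_frames, target_idx, x, K, T):
    """Return frame indices around target_idx with x left frames and K-1-x right frames,
    sampled every T frames (T = 1 reproduces the original consecutive sequence)."""
    idxs = [target_idx + T * offset for offset in range(-x, K - x)]
    # keep only indices that exist in the video (boundary handling is an assumption)
    return [i for i in idxs if 0 <= i < num_frames]

# Example from the text: target frame i = 50, input length 7, T = 2, centered (x = 3)
print(temporal_sample(100, 50, 3, 7, 2))  # -> [44, 46, 48, 50, 52, 54, 56]
```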
Step 2, establishing a super-resolution video reconstruction network model, which comprises a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence;
referring to fig. 3, in one embodiment of the invention, the feature extractor uses a composition of 5 residual modules (convolutional layers) with batch normalization layers removed. The depth residual module stacks the depth residual module using 12 modified residual modules, illustratively having the following structure:
the number of input channels is set as C, and four convolution layers are used for mapping learning of the input: the convolution kernel size of the first convolution layer is 1×1, and the channel number is 6×c; the convolution kernel of the second convolution layer is 1 multiplied by 1, and the number of channels is C/2; the convolution kernel size of the third convolution layer is 3 multiplied by 3, and the channel number is C/2; the convolution kernel size of the fourth convolution layer is 1×1, and the number of channels is C.
A comparison of the original residual module and the improved residual module is shown in fig. 4. With the number of input channels set to 128, the improved residual module performs mapping learning on the input using four convolution layers: the first convolution layer has a 1×1 kernel and 768 channels; the second has a 1×1 kernel and 64 channels; the third has a 3×3 kernel and 64 channels; and the fourth has a 1×1 kernel and 128 channels. A minimal sketch of this module is given below.
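For illustration only, a PyTorch-style sketch of the improved residual module with the channel numbers given above; the placement of the ReLU activations is an assumption, since the text does not specify it:

```python
import torch
import torch.nn as nn

class ImprovedResidualBlock(nn.Module):
    """1x1 expand -> 1x1 reduce -> 3x3 -> 1x1 restore, with a residual shortcut."""
    def __init__(self, channels=128):
        super().__init__()
        c = channels
        self.body = nn.Sequential(
            nn.Conv2d(c, 6 * c, kernel_size=1),                    # expand: C -> 6C
            nn.ReLU(inplace=True),
            nn.Conv2d(6 * c, c // 2, kernel_size=1),               # reduce: 6C -> C/2
            nn.ReLU(inplace=True),
            nn.Conv2d(c // 2, c // 2, kernel_size=3, padding=1),   # spatial mapping at C/2
            nn.ReLU(inplace=True),
            nn.Conv2d(c // 2, c, kernel_size=1),                   # restore: C/2 -> C
        )

    def forward(self, x):
        return x + self.body(x)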
The superposition module is an adder, which adds the mapped features output by the depth residual module to the original features of the target frame to obtain the final output result. A schematic forward pass combining these components is sketched below.
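For orientation only, the sketch below assembles the described components into one forward pass; the ×4 scale factor, the bilinear up-sampling of the target frame, the tensor shapes and the sub-module interfaces are assumptions, and the sub-modules themselves are passed in rather than re-implemented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperResolutionNet(nn.Module):
    def __init__(self, feat_extractor, align_fuse, residual_body, channels=128, scale=4):
        super().__init__()
        self.feat_extractor = feat_extractor      # e.g. 5 residual blocks without BN
        self.align_fuse = align_fuse              # gradual alignment fusion module
        self.residual_body = residual_body        # stack of 12 improved residual modules
        self.upscale = nn.Sequential(             # sub-pixel convolution up-scaling
            nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.scale = scale

    def forward(self, frames, t):
        """frames: tensor (B, K, 3, h, w); t: index of the target frame in the sequence."""
        b, k, c, h, w = frames.shape
        feats = self.feat_extractor(frames.view(b * k, c, h, w)).view(b, k, -1, h, w)
        fused = self.align_fuse(feats, t)                     # aligned and fused feature image
        mapped = self.residual_body(fused)                    # nonlinear mapping
        sr_residual = self.upscale(mapped)                    # feature image at target size
        base = F.interpolate(frames[:, t], scale_factor=self.scale,
                             mode='bilinear', align_corners=False)
        return sr_residual + base                             # superposition module (adder)
```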
Step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
specifically, the training set is adopted to train the super-resolution video reconstruction network model, and the specific steps are as follows:
3.1, initializing super-resolution video reconstruction network model parameters given the maximum training times;
in this embodiment, the batch size is set to 16, the maximum training number is 600000, adam is used as the optimizer, and the learning rate of all the structural layers of the network is initialized to 4e-4. Using the L1 distance as a loss function, the definition is as follows:
Figure BDA0002450269870000101
wherein I represents a real image,
Figure BDA0002450269870000102
representing the predicted image, h, w, c are the height, width, and channel number of the image, respectively. Is thatThe numerical stability in the training process is ensured, and a very small constant E is added in the loss function, and 1e-3 is taken.
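As an illustration of the loss described above, one common realisation of an L1 distance stabilised by a small constant (a Charbonnier-style penalty) is sketched below; whether the sum runs over or is averaged over h, w, c is not stated, so the mean is an assumption:

```python
import torch

def stabilised_l1_loss(pred, target, eps=1e-3):
    """L1 distance between predicted and real images with a small constant
    added for numerical stability (mean over all pixels is an assumption)."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

# optimiser setup described in the embodiment
# optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
```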
3.2, performing feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, adopting a gradual alignment fusion module to carry out gradual alignment feature fusion on the feature image sequence to obtain an aligned and fused feature image; referring to fig. 3, the specific process is as follows:
first, for the feature image sequence to the left of the target frame: let F_l denote the fused feature image on the left side of the target frame; starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F'_2, and set F_l = F'_2; align the fused feature image F'_2 to the third feature image F_3 and fuse again to obtain F'_3, and set F_l = F'_3; and so on until F'_{t-1} is obtained, whereupon F_l = F'_{t-1};
secondly, for the feature image sequence to the right of the target frame: let F_r denote the fused feature image on the right side of the target frame; starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain the fused feature image F'_{k-1}, and set F_r = F'_{k-1}; align the fused feature image F'_{k-1} to the third-to-last feature image F_{k-2} and fuse again to obtain F'_{k-2}, and set F_r = F'_{k-2}; and so on until F'_{t+1} is obtained, whereupon F_r = F'_{t+1};
finally, fuse the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image. A code-level sketch of this progressive schedule is given below.
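By way of illustration, the gradual alignment fusion schedule described above can be sketched as follows; align(a, b) (aligning feature map a to feature map b) and fuse([...]) (the feature fusion of fig. 5) are assumed callables rather than the patent's concrete modules:

```python
def progressive_align_fuse(feats, t, align, fuse):
    """feats: list of per-frame feature maps F_1..F_k; t: zero-based index of the target frame.

    Frames are aligned and fused frame by frame towards the target from both sides,
    so each alignment only ever involves two neighbouring feature maps."""
    # left branch: F_1 -> F_2 -> ... -> F_{t-1}
    left = None
    if t > 0:
        left = feats[0]
        for i in range(1, t):
            left = fuse([align(left, feats[i]), feats[i]])
    # right branch: F_k -> F_{k-1} -> ... -> F_{t+1}
    right = None
    if t < len(feats) - 1:
        right = feats[-1]
        for i in range(len(feats) - 2, t, -1):
            right = fuse([align(right, feats[i]), feats[i]])
    # final fusion of the left result, the target-frame feature and the right result
    parts = [p for p in (left, feats[t], right) if p is not None]
    return fuse(parts)
```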
The alignment of two adjacent feature images in the above process is performed as follows, taking the alignment of the first feature image F_1 to the second feature image F_2 as an example; both F_1 and F_2 have size W×H×C, where W is the width of the feature map, H the height and C the number of channels:
first, concatenate the first feature image F_1 and the second feature image F_2 along the channel direction to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation to obtain a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication to complete the alignment of F_1 to F_2. A minimal sketch of this pairwise alignment is given below.
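A minimal PyTorch-style sketch of this pairwise alignment; the number of convolution layers, the 3×3 kernels and the sigmoid on the weight matrix are assumptions, since the text only specifies several convolution layers producing a W×H×C weight matrix:

```python
import torch
import torch.nn as nn

class PairwiseAlign(nn.Module):
    """Align feature map f1 to its neighbour f2 via a learned element-wise weight matrix."""
    def __init__(self, channels=128):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),  # 2C -> C
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),      # mapping at C
            nn.Sigmoid(),                                                  # weights in (0, 1)
        )

    def forward(self, f1, f2):
        weights = self.weight_net(torch.cat([f1, f2], dim=1))  # (B, C, H, W) weight matrix
        return f1 * weights                                     # element-wise weighting of F_1
```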
3.4, performing nonlinear mapping on the aligned and fused feature image with the depth residual module to obtain the mapped feature image;
3.5, enlarging the mapped feature image to the target size through sub-pixel convolution to obtain the feature image of the target size;
3.6, enlarging the original target frame image to the target size through up-sampling to obtain the original image of the target size; in this embodiment the up-sampling uses bilinear interpolation, or a 5×5 convolution layer followed by a sub-pixel convolution layer;
3.7, superimposing the feature image of the target size and the original image of the target size with the superposition module to obtain the reconstructed image of the target frame;
3.8, optimizing and updating the parameters of the super-resolution video reconstruction network model;
for each input image sequence, steps 3.2-3.8 are repeated until the maximum number of training iterations is reached.
Further, as shown in fig. 5, the specific process of fusing the plurality of feature images in the above process is:
(a) The M feature images to be fused are preliminarily fused by element-wise addition to obtain the preliminary fusion matrix U:
U = Σ_{i=1}^{M} U_i
where U_i denotes the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s:
s_c = (1 / (W × H)) Σ_{m=1}^{W} Σ_{n=1}^{H} U_c(m, n)
where s_c denotes the c-th channel of the pooled result s, U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U, and U_c(m, n) denotes the pixel value of U_c at pixel point (m, n);
(c) Two fully connected layers are used to build a correlation model between the channels of the feature map:
z = W_2 · (δ(W_1 · U))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) A 1×1 convolution layer is used to change each input feature matrix {U_i} to size W×H and to learn its internal correlation in the spatial dimension:
v_i = CNN_{1×1}(W_3, U_i)
where CNN_{1×1}(·) denotes a convolution layer with a 1×1 convolution kernel and W_3 denotes the weight matrix of that convolution layer;
(e) The total correlation {a_i} of the feature matrices is calculated:
a_i = v_i · z
(f) A sigmoid function is used to recalibrate {a_i} over the M feature images to obtain the total weight vector {b_i}, where j = 1, 2, …, M indexes the feature images to be fused, (m, n, c) denotes the position coordinates of a pixel point, and b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused; this calculation is carried out separately at every position of the feature map.
(g) Each total weight vector b_i is multiplied element-wise with the corresponding feature image to be fused U_i, and the products are summed to obtain the fused result:
Σ_{i=1}^{M} b_i ⊙ U_i
where ⊙ denotes element-wise (para-position) multiplication, i.e. multiplication of elements at corresponding positions. A schematic implementation of steps (a)-(g) is sketched below.
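By way of illustration only, the sketch below wires steps (a)-(g) into a single PyTorch-style module; the reduction ratio of the two fully connected layers, the single-channel output of the 1×1 convolution W_3, and the normalisation of the sigmoid-recalibrated weights over the M feature images are assumptions filled in for details that the text leaves to the figures:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse M feature maps: element-wise sum, channel attention via two FC layers,
    per-branch spatial attention via a 1x1 convolution, and a re-weighted sum."""
    def __init__(self, channels=128, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # W_1 (reduction ratio assumed)
        self.fc2 = nn.Linear(channels // reduction, channels)   # W_2
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)    # W_3: C -> 1 channel (v_i)

    def forward(self, feats):
        # (a) preliminary fusion U by element-wise addition of the M branches
        u = torch.stack(feats, dim=1).sum(dim=1)                          # (B, C, H, W)
        # (b) global average pooling of U over the spatial dimensions -> s
        s = u.mean(dim=(2, 3))                                            # (B, C)
        # (c) channel correlation model z
        z = self.fc2(torch.relu(self.fc1(s)))                             # (B, C)
        # (d)-(e) spatial correlation v_i and total correlation a_i = v_i * z
        a = torch.stack([self.spatial(f) * z[:, :, None, None] for f in feats], dim=1)
        # (f) recalibrate across the M branches to get the total weights b_i (assumption)
        b = torch.sigmoid(a)
        b = b / b.sum(dim=1, keepdim=True)
        # (g) weighted element-wise sum of the branches
        return (b * torch.stack(feats, dim=1)).sum(dim=1)                 # (B, C, H, W)
```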
Step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed can be chosen by the user.
The method adopts a depth residual network and, by improving the structure of the residual module, reduces the number of parameters while improving the learning capability of the network. Increasing the number of channels in the middle layers of the residual module helps to improve the reconstruction quality of the model, but directly increasing the number of channels greatly increases the amount of computation, so 1×1 convolutions are introduced to change the number of channels of the feature map. The 1×1 convolution is widely used in models such as ResNet, ResNeXt and MobileNetV2 to reduce and increase the number of channels of the feature map: the number of channels is first reduced with a 1×1 convolution, a 3×3 convolution is then used for feature extraction and mapping, and finally a 1×1 convolution restores the number of channels. Compared with the original residual module, the improved residual module not only reduces the amount of computation but also strengthens the modelling of the relations between channels, which is more conducive to improving the reconstruction capability of the model.
The invention adopts a gradual alignment fusion mechanism that progressively aligns the adjacent frames to the target frame and fuses them frame by frame, with each alignment operation involving only two adjacent frames. Compared with models in which every adjacent frame is aligned to the target frame independently, the gradual alignment fusion mechanism greatly improves the robustness of the reconstruction model to complex motion. In addition, some optical-flow-based methods align the original images, which is highly susceptible to noise and occlusion, whereas the gradual alignment fusion mechanism aligns feature images obtained after feature extraction, which are less affected by occlusion, blur and noise in the original images. The gradual alignment fusion mechanism therefore not only improves alignment accuracy but also allows a larger number of adjacent frames to be aligned and fused, meaning that more scene information can be used, which helps to improve the reconstruction quality of the model.
The invention adopts a random-length training mechanism. Variable-length input requires the super-resolution reconstruction model to accept video image sequences of different lengths without degrading the reconstruction result, so that a suitable input length can be chosen according to the characteristics of the real data. When there is no useful complementary information between adjacent images, only the target frame is input; when the adjacent frames can provide additional useful features, a suitable input length is chosen, which is of great significance for the application of image super-resolution reconstruction. With the random-length training mechanism, although the total input length is fixed during training, the number of video frames aligned and fused by the gradual alignment fusion network on either side of the current frame is random. The gradual alignment fusion network therefore learns the feature-fusion mapping for different numbers of video frames, so that the model is not affected by the number of input video frames at test time and the reconstruction quality is preserved.
In summary, by means of the two mechanisms of gradual alignment fusion and random-length training, the invention not only improves the effect of video super-resolution reconstruction but also allows the model to accept image sequences of arbitrary length, including both the total length of the input sequence and the length of each one-sided sub-sequence, greatly widening the application range of video super-resolution reconstruction.
While the invention has been described in detail in this specification with reference to the general description and the specific embodiments thereof, it will be apparent to one skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims (7)

1. The variable-length input super-resolution video reconstruction method based on deep learning is characterized by comprising the following steps of:
step 1, constructing training samples with random lengths, and acquiring a training set;
step 2, establishing a super-resolution video reconstruction network model, which comprises a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence;
step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
the training set is adopted to train the super-resolution video reconstruction network model, and the specific steps are as follows:
3.1, initializing super-resolution video reconstruction network model parameters given the maximum training times;
3.2, performing feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, performing gradual alignment feature fusion on the feature image sequence with the gradual alignment fusion module to obtain the aligned and fused feature image;
3.4, performing nonlinear mapping on the aligned and fused feature image with the depth residual module to obtain the mapped feature image;
3.5, enlarging the mapped feature image to the target size through sub-pixel convolution to obtain the feature image of the target size;
3.6, enlarging the original target frame image to the target size through up-sampling to obtain the original image of the target size;
3.7, superimposing the feature image of the target size and the original image of the target size with the superposition module to obtain the reconstructed image of the target frame;
3.8, optimizing and updating the parameters of the super-resolution video reconstruction network model;
repeating steps 3.2-3.8 for each input image sequence until the maximum number of training iterations is reached;
the gradual alignment fusion module performs gradual alignment feature fusion on the feature image sequence as follows:
first, for the feature image sequence to the left of the target frame: let F_l denote the fused feature image on the left side of the target frame; starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F'_2, and set F_l = F'_2; align the fused feature image F'_2 to the third feature image F_3 and fuse again to obtain F'_3, and set F_l = F'_3; and so on until F'_{t-1} is obtained, whereupon F_l = F'_{t-1};
secondly, for the feature image sequence to the right of the target frame: let F_r denote the fused feature image on the right side of the target frame; starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain the fused feature image F'_{k-1}, and set F_r = F'_{k-1}; align the fused feature image F'_{k-1} to the third-to-last feature image F_{k-2} and fuse again to obtain F'_{k-2}, and set F_r = F'_{k-2}; and so on until F'_{t+1} is obtained, whereupon F_r = F'_{t+1};
finally, fusing the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image;
step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed can be chosen by the user.
2. The variable-length input super-resolution video reconstruction method based on deep learning according to claim 1, wherein the training samples of random length are constructed as follows:
first, given an input sequence length K, K > 0; selecting a data set;
secondly, giving a target frame to be reconstructed;
finally, selecting x frames of images on the left side of the target frame and K-1-x frames on the right side, and arranging the K frames in order from left to right to obtain the input image sequence;
where x is an integer drawn from a uniform distribution, x = 0, 1, …, K-1.
3. The variable-length input super-resolution video reconstruction method based on deep learning according to claim 1, wherein the training set is acquired as follows:
first, random horizontal flipping and rotation are applied to each original training sample to obtain spatially transformed training samples;
secondly, an interval variable T (T > 1) is introduced, and input image sequences of the given input-sequence length are sampled at interval T to simulate a low acquisition frame rate or a fast-moving target, yielding temporally enhanced training samples;
finally, the training set consists of the original training samples, the spatially transformed training samples and the temporally enhanced training samples.
4. The variable-length input super-resolution video reconstruction method based on deep learning according to claim 1, wherein aligning the first feature image F_1 to the second feature image F_2 specifically comprises the following steps, where both F_1 and F_2 have size W×H×C (W is the width of the feature map, H the height and C the number of channels):
first, concatenate the first feature image F_1 and the second feature image F_2 along the channel direction to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation to obtain a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication to complete the alignment of F_1 to F_2.
5. The variable-length input super-resolution video reconstruction method based on deep learning according to claim 1, wherein a plurality of feature images are fused as follows:
(a) the M feature images to be fused are preliminarily fused by element-wise addition to obtain the preliminary fusion matrix U:
U = Σ_{i=1}^{M} U_i
where U_i denotes the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s:
s_c = (1 / (W × H)) Σ_{m=1}^{W} Σ_{n=1}^{H} U_c(m, n)
where s_c denotes the c-th channel of the pooled result s, U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U, and U_c(m, n) denotes the pixel value of U_c at pixel point (m, n);
(c) Two fully connected layers are used to build a correlation model between the channels of the feature map:
z = W_2 · (δ(W_1 · U))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) A 1×1 convolution layer is used to establish the internal correlation of each feature matrix in the spatial dimension:
v_i = CNN_{1×1}(W_3, U_i)
where CNN_{1×1}(·) denotes a convolution layer with a 1×1 convolution kernel and W_3 denotes the weight matrix of that convolution layer;
(e) The total correlation {a_i} of the feature matrices is calculated:
a_i = v_i · z
(f) A sigmoid function is used to recalibrate {a_i} over the M feature images to obtain the total weight vector {b_i}, where j = 1, 2, …, M indexes the feature images to be fused, (m, n, c) denotes the position coordinates of a pixel point, and b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused;
(g) Each total weight vector b_i is multiplied element-wise with the corresponding feature image to be fused U_i, and the products are summed to obtain the fused result:
Σ_{i=1}^{M} b_i ⊙ U_i
where ⊙ denotes element-wise (para-position) multiplication.
6. The variable length input super resolution video reconstruction method based on deep learning according to claim 1, wherein the depth residual module is formed by stacking a plurality of improved residual modules.
7. The variable-length input super-resolution video reconstruction method based on deep learning according to claim 6, wherein the improved residual module comprises four convolution layers: with the number of input channels set to C, the first convolution layer has a 1×1 kernel and 6×C channels; the second has a 1×1 kernel and C/2 channels; the third has a 3×3 kernel and C/2 channels; and the fourth has a 1×1 kernel and C channels.
CN202010290657.1A 2020-04-14 2020-04-14 Variable-length input super-resolution video reconstruction method based on deep learning Active CN111524068B (en)

Publications (2)

Publication Number Publication Date
CN111524068A (en) 2020-08-11
CN111524068B (en) 2023-06-02






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20240102
Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province
Patentee after: Dragon totem Technology (Hefei) Co.,Ltd.
Address before: 710061 No. 33, South Second Ring Road, Shaanxi, Xi'an
Patentee before: CHANG'AN University