CN111524068B - Variable-length input super-resolution video reconstruction method based on deep learning - Google Patents
Variable-length input super-resolution video reconstruction method based on deep learning
- Publication number: CN111524068B (application CN202010290657.1A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06N 3/045: Combinations of networks
- G06N 3/08: Learning methods
- G06T 3/60: Rotation of whole images or parts thereof
- G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T 2207/20221: Image fusion; image merging
- Y02T 10/40: Engine management systems
Abstract
The invention discloses a variable-length input super-resolution video reconstruction method based on deep learning. The method comprises the following steps: constructing training samples of random length and acquiring a training set; establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence; training the model on the training set to obtain a trained super-resolution video reconstruction network; and sequentially inputting the video to be processed into the trained network to reconstruct it, obtaining the corresponding super-resolution reconstructed video. The invention adopts a gradual alignment fusion mechanism that aligns and fuses the images frame by frame; because each alignment operation acts only on two adjacent frames, the model can handle longer temporal relationships, and using more adjacent video frames means the input contains more scene information, which effectively improves the reconstruction quality.
Description
Technical Field
The invention belongs to the technical field of video restoration, and particularly relates to a variable-length input super-resolution video reconstruction method based on deep learning.
Background
The performance of most image- and video-based applications depends on image quality. In general, the quality of an image is related to the amount of information it contains, and resolution measures that amount, expressed as the number of pixels per unit area (for example, 1024×768). The resolution of an image therefore reflects its quality, so in real-life application scenarios high resolution has become a basic quality demand for images and video.
However, when a video contains occlusion, severe blur, or complex motion with large offsets, it must be reconstructed to recover high-quality information. To effectively fuse the complementary information of multiple frames and obtain a high-quality reconstructed image, all frames in the input sequence must be aligned so that an accurate correspondence can be established for the subsequent reconstruction step. Because the camera or objects are in constant motion, the target frame and each adjacent frame are misaligned, which makes alignment a challenging but crucial problem for video super-resolution. At present, most super-resolution models treat all adjacent frames equally and process them with the same alignment network, ignoring that different adjacent frames lie at different temporal distances from the target frame. In theory, the motion offset of each adjacent frame relative to the target frame is different, and frames farther from the target have larger offsets, so it is difficult for a single alignment network to learn the alignment of all adjacent frames simultaneously.
At present, most multi-frame super-resolution models accept only input sequences of a fixed length, so images at the two ends of a video sequence cannot be processed normally during reconstruction. This is caused by the structural limitation of the models: the input sequence can only be completed by mirroring or by copying the target frame. As shown in fig. 1, in fig. 1 (a) the input length is 9 (the target frame plus 4 frames on each side); when there are not enough remaining frames to the left of the current target frame, the fixed-length input model must pad the sequence by copying other frames, which introduces artificial marks and additional noise. With the variable-length input of fig. 1 (b), no such padding is needed and the sequence can be fed directly into the reconstruction model, which better matches the requirements of practical applications. Moreover, if an appropriate input length (both the total length and the number of adjacent frames on each side) can be chosen for each usage scenario, the applicability of a multi-frame super-resolution reconstruction model is greatly enhanced.
Disclosure of Invention
Aiming at the defects of existing design methods, the invention provides a variable-length input super-resolution video reconstruction method based on deep learning. A variable-length input sequence solves the problem of inaccurate alignment of long input sequences in the video super-resolution task, and the gradual alignment fusion network can align and fuse any number of adjacent frames without affecting the subsequent reconstruction task, giving the method high practicability.
A variable-length input super-resolution video reconstruction method based on deep learning comprises the following steps:
step 1, constructing training samples of random length and acquiring a training set;
step 2, establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence;
step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
step 4, sequentially inputting the video to be processed into the trained super-resolution video reconstruction network to reconstruct it, obtaining the corresponding super-resolution reconstructed video;
the length of each input image sequence of the video to be processed can be chosen by the user.
Further, the training samples of random length are constructed as follows:
first, given an input sequence length K (K > 0), select a data set;
secondly, designate a target frame to be reconstructed;
finally, select x frames to the left of the target frame and K−1−x frames to its right, and arrange the K frames in order from left to right to obtain the input image sequence;
where x is an integer drawn randomly from a uniform distribution, x = 0, 1, …, K−1.
Further, the training set is acquired as follows:
first, apply random horizontal flipping and rotation to each original training sample to obtain spatially transformed training samples;
secondly, introduce an interval variable T (T > 1) and sample an input image sequence of the given length at interval T, so as to simulate a low acquisition frame rate or a fast-moving target, obtaining temporally enhanced training samples;
finally, the training set consists of the original training samples, the spatially transformed training samples and the temporally enhanced training samples.
Further, the training set is used to train the super-resolution video reconstruction network model in the following steps:
3.1, given the maximum number of training iterations, initialize the network model parameters;
3.2, perform feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, apply the gradual alignment fusion module to the feature image sequence to obtain the aligned and fused feature image;
3.4, apply the depth residual module to the aligned and fused feature image to perform nonlinear mapping, obtaining the mapped feature image;
3.5, magnify the mapped feature image by sub-pixel convolution to obtain a feature image of the target size;
3.6, magnify the original target frame image by up-sampling to obtain an original image of the target size;
3.7, use the superposition module to add the feature image of the target size to the original image of the target size, obtaining the reconstructed image of the target frame;
3.8, optimize and update the parameters of the network model;
for each input image sequence, repeat steps 3.2–3.8 until the maximum number of training iterations is reached.
Furthermore, the gradual alignment fusion module performs gradual alignment feature fusion on the feature image sequence as follows:
first, for the feature images to the left of the target frame, let F_l denote the fused left-side feature image. Starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F_2'; let F_l = F_2'. Then align the fused feature image F_2' to the third feature image F_3 and fuse again to obtain F_3'; let F_l = F_3'. Continue in this way until F_{t-1}' is obtained, and set F_l = F_{t-1}';
secondly, for the feature images to the right of the target frame, let F_r denote the fused right-side feature image. Starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain F_{k-1}'; let F_r = F_{k-1}'. Then align F_{k-1}' to the third-to-last feature image F_{k-2} and fuse again to obtain F_{k-2}'; let F_r = F_{k-2}'. Continue in this way until F_{t+1}' is obtained, and set F_r = F_{t+1}';
finally, fuse the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image.
Further, the first feature image F_1 is aligned to the second feature image F_2 as follows. Let both F_1 and F_2 have size W×H×C, where W is the width of the feature map, H its height and C its number of channels:
first, concatenate F_1 and F_2 along the channel dimension to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation, obtaining a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication, completing the alignment of F_1 to F_2.
Further, several feature images are fused as follows:
(a) The M feature images to be fused are preliminarily fused by element-wise addition to obtain the preliminary fusion matrix U:
U = U_1 + U_2 + … + U_M,
where U_i denotes the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s:
s_c = (1/(W×H)) · Σ_{m,n} U_c(m, n),
where s_c is the value of the c-th channel of the pooled result s; U_c is the feature matrix of the c-th channel of U; and U_c(m, n) is the pixel value of U_c at pixel (m, n);
(c) Two fully connected layers model the correlation between the channels of the feature map:
z = W_2 · (δ(W_1 · s)),
where W_1 and W_2 are the weights of the first and second fully connected layers and δ is the ReLU activation function;
(d) A 1×1 convolution layer models the internal correlation of each feature matrix in the spatial dimension:
v_i = CNN_{1×1}(W_3, U_i),
where CNN_{1×1}(·) denotes a convolution layer with a 1×1 kernel and W_3 its weight matrix;
(e) The channel correlation z and the spatial correlation v_i are combined into the weight map a_i:
a_i = v_i · z;
(f) A sigmoid function recalibrates {a_i} to obtain the total weight vectors {b_i}, where j = 1, 2, …, M indexes the feature images to be fused; (m, n, c) denotes the position of a pixel; and b_{i,m,n,c} is the weight of the i-th feature image to be fused at pixel (m, n, c);
(g) Each total weight vector b_i is multiplied element-wise with the corresponding feature image U_i, and the results are added to obtain the fused result:
result = Σ_{i=1}^{M} b_i ⊙ U_i,
where ⊙ denotes element-wise multiplication.
Further, the depth residual module is formed by stacking several improved residual modules.
Still further, each improved residual module comprises four convolution layers. With the number of input channels set to C: the first convolution layer has a 1×1 kernel and 6×C channels; the second has a 1×1 kernel and C/2 channels; the third has a 3×3 kernel and C/2 channels; and the fourth has a 1×1 kernel and C channels.
Compared with the prior art, the invention has the advantages that:
(1) The invention adopts a gradual alignment fusion mechanism that aligns and fuses images frame by frame. Because each alignment operation acts only on two adjacent frames, the model can handle longer temporal relationships, and using more adjacent video frames means the input contains more scene information, which effectively improves the reconstruction quality.
(2) The invention accepts frame sequences of different lengths as input, which is highly practical; the gradual alignment fusion module can align and fuse any number of adjacent frames without affecting the subsequent reconstruction task.
(3) The feature fusion of the invention takes into account that different video frames and different positions contribute differently to the reconstruction quality, so features from different video frames are fused more effectively.
(4) The invention uses an improved depth residual network as the reconstruction network, which has a stronger capability of learning the mapping.
Drawings
FIG. 1 is a schematic comparison of the conventional fixed-length input model and the variable-length input model of the present invention; (a) is a schematic diagram of the conventional fixed-length input model; (b) is a schematic diagram of the variable-length input model of the present invention;
FIG. 2 is a schematic diagram of random length training samples during training according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a super-resolution video reconstruction network structure according to an embodiment of the present invention;
FIG. 4 is a schematic comparison of the conventional residual module and the improved residual module in the embodiment of the present invention; (a) shows the structure of the conventional residual module, and (b) shows the structure of the improved residual module;
fig. 5 is a schematic structural diagram of a feature fusion module according to an embodiment of the present invention.
Detailed Description
To describe the technical contents, operation flow, achieved objects and effects of the present invention in detail, the following description of examples is given.
A variable-length input super-resolution video reconstruction method based on deep learning comprises the following steps:
Step 1, constructing training samples of random length and acquiring a training set.
Illustratively, training samples of random length are acquired as follows:
first, given an input sequence length K (K > 0), select a data set;
secondly, designate a target frame to be reconstructed;
finally, select x frames to the left of the target frame and K−1−x frames to its right, and arrange the K frames in order from left to right to obtain the input image sequence;
where x is an integer drawn randomly from a uniform distribution, x = 0, 1, …, K−1.
The length of the input sequence in the invention can be fixed or varied as required. In this embodiment, REDS is used as the original training sample set during training, and low-resolution images are obtained by bicubic interpolation; a 64×64 RGB block from the low-resolution image is combined with the corresponding high-resolution block to form a training sample. Random horizontal flipping and rotation are used for data enhancement to expand the number of training samples. In addition, the mean RGB value of the whole training set is subtracted from each sample to pre-process all training data. Illustratively, a training sample is constructed as follows: the input length is fixed at K = 15 during the training phase; given the target frame to be reconstructed, an integer x (x = 0, 1, …, K−1) is drawn from a uniform distribution, where x is the length of the input sequence to the left of the target frame and K−1−x the length to its right; the frames are then combined into an input sequence of length K in left-to-right order, as shown in fig. 2. To exploit GPU-accelerated matrix operations, the x values of all training samples in the same batch are identical.
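The random-length sample construction can be sketched as follows; `build_sample`, its arguments and the list-of-frames representation are illustrative helpers, not names from the patent. The caller is assumed to provide a target index with enough frames on both sides.

```python
import random

def build_sample(frames, t, K):
    """Build one variable-length training sample around target index t.

    x frames are taken from the left of the target and K-1-x from the
    right, with x drawn uniformly from {0, ..., K-1}.
    """
    x = random.randint(0, K - 1)          # uniform over 0..K-1, inclusive
    left = frames[t - x:t]                # x frames to the left of the target
    right = frames[t + 1:t + K - x]       # K-1-x frames to the right
    return left + [frames[t]] + right
```

Whatever x is drawn, the result is a contiguous run of K frames that contains the target frame, which is exactly the variable-position window of fig. 2.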
Furthermore, when acquiring the training set, the invention can also perform temporal data enhancement alongside the usual spatial data enhancement (random horizontal flipping and rotation), in order to create training data closer to real application scenarios. An interval variable T represents the sampling interval of the temporal enhancement; when T > 1, a lower acquisition frame rate or a faster-moving object can be simulated. For example, if the target frame to be reconstructed is the i-th frame, the input length is 7 and T is 2, then the input image sequence can be expressed as:
i−6, i−4, i−2, i, i+2, i+4, i+6
With various values of T, more training data with complex motion can be created. Considering the characteristics of the REDS dataset, three temporal enhancement modes are selected (T = 1 corresponds to the original image sequence). Temporal enhancement increases the diversity and complexity of the training data in the time domain and improves super-resolution performance in complex scenes.
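The sampled frame indices for a given interval T can be generated as below; `temporal_indices` is a hypothetical helper name, and the symmetric spacing of the left and right sides follows the example above.

```python
def temporal_indices(i, K, x, T):
    """Frame indices for an input sequence of length K around target frame i,
    sampled at interval T, with x frames to the left of the target."""
    return [i + (j - x) * T for j in range(K)]
```

For the example above (input length 7, T = 2, x = 3) this yields i−6, i−4, i−2, i, i+2, i+4, i+6.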
Step 2, establishing a super-resolution video reconstruction network model comprising a feature extractor, a gradual alignment fusion module, a depth residual module and a superposition module connected in sequence.
Referring to fig. 3, in one embodiment of the invention the feature extractor consists of 5 residual modules (convolution layers) with the batch normalization layers removed, and the depth residual module is a stack of 12 improved residual modules. Illustratively, each improved residual module has the following structure:
the number of input channels is set as C, and four convolution layers are used for mapping learning of the input: the convolution kernel size of the first convolution layer is 1×1, and the channel number is 6×c; the convolution kernel of the second convolution layer is 1 multiplied by 1, and the number of channels is C/2; the convolution kernel size of the third convolution layer is 3 multiplied by 3, and the channel number is C/2; the convolution kernel size of the fourth convolution layer is 1×1, and the number of channels is C.
Fig. 4 compares the structures of the original and the improved residual modules. With the number of input channels set to 128, the improved residual module maps the input with four convolution layers: the first convolution layer has a 1×1 kernel and 768 channels; the second a 1×1 kernel and 64 channels; the third a 3×3 kernel and 64 channels; and the fourth a 1×1 kernel and 128 channels.
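A numpy sketch of the improved residual block's channel layout is given below. The weight shapes, the ReLU placement and the identity shortcut around the four layers are assumptions of this sketch; the patent specifies only the kernel sizes and channel counts.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing product; x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):
    """3x3 convolution with zero padding 1; x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def improved_residual_block(x, w1, w2, w3, w4):
    """1x1 (6C) -> 1x1 (C/2) -> 3x3 (C/2) -> 1x1 (C), plus an identity shortcut."""
    relu = lambda a: np.maximum(a, 0.0)
    y = relu(conv1x1(x, w1))    # C   -> 6C
    y = relu(conv1x1(y, w2))    # 6C  -> C/2
    y = relu(conv3x3(y, w3))    # C/2 -> C/2
    y = conv1x1(y, w4)          # C/2 -> C
    return x + y                # shortcut: output keeps C channels
```

With C = 128 the four layers carry 768, 64, 64 and 128 channels, matching the embodiment above.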
The superposition module is an adder: the mapped features output by the depth residual module are added to the up-sampled original target frame to obtain the final output result.
Step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
specifically, the training set is adopted to train the super-resolution video reconstruction network model, and the specific steps are as follows:
3.1, initializing super-resolution video reconstruction network model parameters given the maximum training times;
In this embodiment, the batch size is set to 16, the maximum number of training iterations to 600000, Adam is used as the optimizer, and the learning rate of all layers of the network is initialized to 4e-4. The L1 distance is used as the loss function, defined as follows:
L(I, Î) = (1/(h·w·c)) · Σ_{h,w,c} √((I(h,w,c) − Î(h,w,c))² + ε²)
where I denotes the real image, Î the predicted image, and h, w, c are the height, width and number of channels of the image, respectively. To ensure numerical stability during training, a very small constant ε is added inside the loss function; ε = 1e-3 is used.
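The embodiment's loss can be sketched in numpy. The exact placement of ε inside the square root (a Charbonnier-style smoothing of the L1 distance) is an assumption consistent with the stability remark above.

```python
import numpy as np

def l1_charbonnier(pred, target, eps=1e-3):
    """Mean L1 distance with a small constant eps for numerical stability."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))
```

When the prediction equals the target the loss degenerates to ε, and for large errors it approaches the plain mean absolute difference, so the smoothing only matters near zero.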
3.2, perform feature extraction on each image of the input image sequence (I_1, …, I_t, …, I_k) to obtain the corresponding feature image sequence (F_1, …, F_t, …, F_k);
where t is the index of the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, apply the gradual alignment fusion module to the feature image sequence to obtain the aligned and fused feature image. Referring to fig. 3, the specific process is as follows:
first, for the feature images to the left of the target frame, let F_l denote the fused left-side feature image. Starting from the leftmost feature image F_1, align the first feature image F_1 to the second feature image F_2 and fuse the aligned F_1 with F_2 to obtain the fused feature image F_2'; let F_l = F_2'. Then align the fused feature image F_2' to the third feature image F_3 and fuse again to obtain F_3'; let F_l = F_3'. Continue in this way until F_{t-1}' is obtained, and set F_l = F_{t-1}';
secondly, for the feature images to the right of the target frame, let F_r denote the fused right-side feature image. Starting from the rightmost feature image F_k, align the last feature image F_k to the second-to-last feature image F_{k-1} and fuse the two aligned feature images to obtain F_{k-1}'; let F_r = F_{k-1}'. Then align F_{k-1}' to the third-to-last feature image F_{k-2} and fuse again to obtain F_{k-2}'; let F_r = F_{k-2}'. Continue in this way until F_{t+1}' is obtained, and set F_r = F_{t+1}';
finally, fuse the left-side feature image F_l, the target frame feature image F_t and the right-side feature image F_r to obtain the aligned and fused feature image.
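The left and right folding in step 3.3 can be expressed as one generic fold; `align`, `fuse` and `fuse3` are stand-ins for the patent's alignment, pairwise-fusion and three-way fusion operations, and the scalar stand-ins in the usage note below are purely illustrative.

```python
def fold(seq, align, fuse):
    """Progressively align the running result to the next frame and fuse:
    F_1 -> F_2 gives F_2', F_2' -> F_3 gives F_3', and so on."""
    cur = seq[0]
    for nxt in seq[1:]:
        cur = fuse(align(cur, nxt), nxt)
    return cur

def progressive_align_fuse(features, t, align, fuse, fuse3):
    """Fuse a feature sequence around target index t (0-based).

    The left side folds toward the target from the leftmost frame, the right
    side folds toward the target from the rightmost frame; either side may be
    empty, which is what allows variable-length input.
    """
    left = fold(features[:t], align, fuse) if t > 0 else None                  # F_l
    right = fold(features[t + 1:][::-1], align, fuse) if t < len(features) - 1 else None  # F_r
    parts = [p for p in (left, features[t], right) if p is not None]
    return fuse3(parts)
```

Because each `align` call only ever sees two adjacent features, the same module handles any sequence length, including target frames at either end of the video.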
The two adjacent feature images in the above process are aligned as follows. For example, the first feature image F_1 is aligned to the second feature image F_2 in this way: let both F_1 and F_2 have size W×H×C, where W is the width of the feature map, H its height and C its number of channels;
first, concatenate F_1 and F_2 along the channel dimension to obtain a W×H×2C connection matrix;
secondly, apply several convolution layers to the connection matrix for mapping and channel-number transformation, obtaining a W×H×C weight matrix;
finally, apply the weight matrix to F_1 by element-wise multiplication, completing the alignment of F_1 to F_2.
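A minimal numpy sketch of this pairwise alignment is given below; `weight_net` stands in for the patent's stack of convolution layers, and the channel-first (C, H, W) layout is an assumption for convenience.

```python
import numpy as np

def align_to(F1, F2, weight_net):
    """Align F1 to F2: concatenate along channels (-> 2C), map to a C-channel
    weight matrix with weight_net, then weight F1 element-wise."""
    cat = np.concatenate([F1, F2], axis=0)   # (2C, H, W) connection matrix
    Wm = weight_net(cat)                     # (C, H, W) weight matrix
    return Wm * F1                           # element-wise weighting of F1
```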
3.4, performing nonlinear mapping on the aligned and fused characteristic images by adopting a depth residual error module to obtain mapped characteristic images;
3.5, carrying out size amplification on the mapped characteristic image through sub-pixel convolution to obtain a characteristic image with a target size;
3.6, performing size amplification on the original target frame image through up-sampling to obtain an original image of the target size; this embodiment up-samples either by bilinear interpolation or by a 5×5 convolution layer followed by a sub-pixel convolution layer.
3.7, overlapping the characteristic image of the target size with the original image of the target size by adopting an overlapping module to obtain a reconstructed image of the target frame;
3.8, optimizing and updating parameters of the super-resolution video reconstruction network model;
for each input image sequence, steps 3.2–3.8 are repeated until the maximum number of training iterations is reached.
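Steps 3.2–3.7 can be summarized as one forward pass; the five callables below are stand-ins for the patent's modules, and their signatures are assumptions of this sketch (the scalar stand-ins in the test are illustrative only).

```python
def reconstruct_target(frames, t, extract, pa_fuse, residual, up_feat, up_img):
    """One forward pass of the super-resolution reconstruction network."""
    feats = [extract(I) for I in frames]   # 3.2 feature extraction per frame
    fused = pa_fuse(feats, t)              # 3.3 gradual alignment fusion
    mapped = residual(fused)               # 3.4 nonlinear mapping (depth residual module)
    hr_feat = up_feat(mapped)              # 3.5 sub-pixel convolution magnification
    hr_img = up_img(frames[t])             # 3.6 up-sample the original target frame
    return hr_feat + hr_img                # 3.7 superposition module (adder)
```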
Further, as shown in Fig. 5, the specific process of fusing a plurality of feature images in the above procedure is as follows:
(a) The M feature images to be fused are preliminarily fused by element-wise addition to obtain a preliminary fusion matrix U = U_1 + U_2 + ... + U_M,
where U_i represents the i-th feature image to be fused;
(b) Global average pooling is applied to the preliminary fusion matrix U to obtain the pooled result s, with s_c = (1/(W·H)) Σ_m Σ_n U_c(m, n),
where s_c denotes the c-th channel of the pooled result s; U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U; and U_c(m, n) denotes the pixel value of matrix U_c at pixel point (m, n);
(c) Two fully connected layers build a correlation model between the channels of the feature map:
z = W_2 · (δ(W_1 · s))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) A 1×1 convolution reduces each input feature matrix U_i to size W×H and learns its internal correlation in the spatial dimension:
v_i = CNN_1×1(W_3, U_i)
where CNN_1×1(·) denotes a convolution layer with a 1×1 kernel and W_3 denotes the weight matrix of the convolution layer;
(e) The spatial correlation v_i and the channel correlation z are combined: a_i = v_i · z;
(f) A sigmoid function recalibrates {a_i} to obtain the total weight vector {b_i},
where j = 1, 2, ..., M; (m, n, c) denotes the position coordinates of a pixel point; b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused; the above computation is carried out independently at each position of the feature map.
(g) The total weight vectors {b_i} are multiplied element-wise with the corresponding feature images to be fused {U_i}, and the products are summed to obtain the fused result,
where ⊙ denotes element-wise multiplication, i.e., multiplication of elements at corresponding positions.
Step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed is self-defined.
The method adopts a deep residual network, and by improving the structure of the residual module it reduces the parameter count while improving the learning capacity of the network. Increasing the number of channels in the middle layer of the residual module helps improve the reconstruction quality of the model, but directly increasing the number of channels greatly increases the amount of computation, so 1×1 convolutions are introduced to change the number of channels of the feature map. The 1×1 convolution is widely used in models such as ResNet, ResNeXt and MobileNetV2 to reduce and increase the number of channels in the feature map. Here a 1×1 convolution first reduces the number of channels, a 3×3 convolution then performs feature extraction and mapping, and a final 1×1 convolution restores the number of channels. Compared with the original residual module, the improved residual module not only reduces the amount of computation but also strengthens the modeling of inter-channel relations, further benefiting the reconstruction capability of the model.
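A quick parameter count illustrates the saving. The sketch below compares a plain residual block of two 3×3 convolutions at C channels against the improved block, using the channel schedule given later in claim 7 (1×1 to 6C, 1×1 to C/2, 3×3 at C/2, 1×1 back to C) and ignoring biases.

```python
def conv_params(cin, cout, k):
    # weight count of a k x k convolution layer, biases ignored
    return cin * cout * k * k

C = 64
# plain residual block: two 3x3 convolutions at C channels
plain = 2 * conv_params(C, C, 3)
# improved block, channel schedule taken from claim 7
improved = (conv_params(C, 6 * C, 1) + conv_params(6 * C, C // 2, 1)
            + conv_params(C // 2, C // 2, 3) + conv_params(C // 2, C, 1))
print(plain, improved)  # 73728 vs 48128 at C = 64
```

Even though the first 1×1 layer expands to 6C channels, the expensive 3×3 convolution runs at only C/2 channels, so the improved block is cheaper overall.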
The invention adopts a gradual alignment fusion mechanism that progressively aligns adjacent frames to the target frame and fuses them frame by frame; each alignment operation involves only two adjacent frames. Compared with models that align every adjacent frame to the target frame independently, this mechanism greatly improves the robustness of the reconstruction model to complex motion. In addition, some optical-flow-based methods align the original images, which is highly susceptible to noise and occlusion, whereas the gradual alignment fusion mechanism aligns feature images obtained after feature extraction, which are less sensitive to occlusion, blur and noise in the original images. The mechanism therefore not only improves alignment accuracy but also allows a larger number of adjacent frames to be aligned and fused, meaning more scene information can be exploited, which helps improve the reconstruction quality of the model.
The invention adopts a random-length training mechanism. Variable-length input requires that the super-resolution reconstruction model allow users to input video image sequences of different lengths without affecting the reconstruction quality, so that a suitable input length can be chosen according to the characteristics of the real data. When no useful complementary information exists between adjacent images, only the target-frame image is input; when adjacent frames can provide additional useful features, an appropriate input length is chosen, which is of great practical significance for applying super-resolution reconstruction. Under the random-length training mechanism, although the input length during training is fixed, the number of video frames aligned and fused by the gradual alignment fusion network on either side of the current frame is random. The network therefore learns feature-fusion mappings for different numbers of video frames, so that at test time the model is unaffected by the number of input video frames and its reconstruction quality is preserved.
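A minimal sketch of constructing one random-length training sample (the helper name and list-based frame handling are ours; x is drawn uniformly, as in claim 2):

```python
import random

def sample_sequence(frames, t, K):
    """Take x frames left of target t and K-1-x frames right, x ~ U{0..K-1}."""
    x = random.randint(0, K - 1)                 # uniformly distributed left-side length
    left = frames[max(0, t - x):t]
    right = frames[t + 1:t + 1 + (K - 1 - x)]
    return left + [frames[t]] + right            # arranged in left-to-right order
```

During training the progressive alignment-fusion network thus sees a random split of frames on either side of the target, which is what makes the trained model insensitive to the input length at test time.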
In summary, by means of the two innovations of gradual alignment fusion and random-length training, the invention not only improves the quality of video super-resolution reconstruction but also allows the model to accept image sequences of any length, including any total input-sequence length and any single-side sequence length, greatly widening the applicability of video super-resolution reconstruction.
While the invention has been described in detail in this specification with reference to the general description and the specific embodiments thereof, it will be apparent to one skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.
Claims (7)
1. The variable-length input super-resolution video reconstruction method based on deep learning is characterized by comprising the following steps of:
step 1, constructing training samples with random lengths, and acquiring a training set;
step 2, establishing a super-resolution video reconstruction network model: the device comprises a feature extractor, a gradual alignment fusion module, a depth residual error module and a superposition module which are connected in sequence;
step 3, training the super-resolution video reconstruction network model by adopting a training set to obtain a trained super-resolution video reconstruction network;
the training set is adopted to train the super-resolution video reconstruction network model, and the specific steps are as follows:
3.1, given the maximum number of training iterations, initializing the parameters of the super-resolution video reconstruction network model;
3.2, inputting an image sequence (I_1, ..., I_t, ..., I_k) and performing feature extraction on each image in the sequence to obtain the corresponding feature image sequence (F_1, ..., F_t, ..., F_k);
where t is the target frame and k is the length of the input image sequence; the input image sequence is a training sample;
3.3, adopting a gradual alignment fusion module to carry out gradual alignment feature fusion on the feature image sequence to obtain an aligned and fused feature image;
3.4, performing nonlinear mapping on the aligned and fused characteristic images by adopting a depth residual error module to obtain mapped characteristic images;
3.5, carrying out size amplification on the mapped characteristic image through sub-pixel convolution to obtain a characteristic image with a target size;
3.6, performing size amplification on the original target frame image through up-sampling to obtain an original image with a target size;
3.7, overlapping the characteristic image of the target size with the original image of the target size by adopting an overlapping module to obtain a reconstructed image of the target frame;
3.8, optimizing and updating parameters of the super-resolution video reconstruction network model;
repeating steps 3.2-3.8 for each input image sequence until the maximum number of training iterations is reached;
the gradual alignment fusion module is used for gradual alignment feature fusion of the feature image sequence, and specifically comprises the following steps:
first, for the feature image sequence to the left of the target frame: let F_l denote the left-side feature image of the target frame; starting from the leftmost feature image F_1, align the first-frame feature image F_1 to the second-frame feature image F_2, fuse the aligned first-frame feature image with the second-frame feature image to obtain the fused feature image F_2′, and let F_l = F_2′; align the fused feature image F_2′ to the third-frame feature image F_3 and fuse again to obtain F_3′, letting F_l = F_3′; and so on until F_{t-1}, whereupon F_l = F_{t-1}′;
secondly, for the feature image sequence to the right of the target frame: let F_r denote the right-side feature image of the target frame; starting from the rightmost feature image F_k, align the last-frame feature image F_k to the second-to-last-frame feature image F_{k-1}, fuse the two aligned feature images to obtain the fused feature image F_{k-1}′, and let F_r = F_{k-1}′; align the fused feature image F_{k-1}′ to the third-to-last-frame feature image F_{k-2} and fuse again to obtain F_{k-2}′, letting F_r = F_{k-2}′; and so on until F_{t+1}, whereupon F_r = F_{t+1}′;
finally, fusing the left-side feature image F_l of the target frame, the target-frame feature image F_t, and the right-side feature image F_r of the target frame to obtain the aligned and fused feature image;
step 4, inputting the video to be processed into the trained super-resolution video reconstruction network in sequence to reconstruct the video, so as to obtain a corresponding super-resolution reconstruction video;
the length of each input image sequence of the video to be processed is self-defined.
2. The variable length input super resolution video reconstruction method based on deep learning of claim 1, wherein constructing the training samples of random length comprises:
first, given an input sequence length K, K > 0; selecting a data set;
secondly, giving a target frame to be reconstructed;
finally, selecting x frames to the left of the target frame and K-1-x frames to the right of the target frame, and arranging the K frames in left-to-right order to obtain the input image sequence;
where x is an integer drawn from a uniform distribution, x = 0, 1, ..., K-1.
3. The variable length input super resolution video reconstruction method based on deep learning according to claim 1, wherein acquiring the training set comprises:
firstly, random horizontal overturning and rotation are used for each original training sample, so that a space transformation training sample is obtained;
secondly, introducing an interval variable T, T > 1, and acquiring input image sequences of the given input-sequence length with T as the sampling interval, so as to simulate a low acquisition frame rate or fast-moving targets, obtaining time-enhanced training samples;
finally, the training set is composed of the original training sample, the space transformation training sample and the time enhancement training sample.
4. The variable length input super resolution video reconstruction method based on deep learning as claimed in claim 1, wherein aligning the first-frame feature image F_1 to the second-frame feature image F_2 specifically comprises: letting the first-frame feature image F_1 and the second-frame feature image F_2 both have size W×H×C, where W is the width of the feature map, H is its height, and C is its number of channels;
first, concatenating the first-frame feature image F_1 and the second-frame feature image F_2 along the channel direction to obtain a W×H×2C concatenation matrix;
second, performing mapping and channel-number transformation on the concatenation matrix with several convolution layers to obtain a W×H×C weight matrix;
finally, applying the weight matrix to F_1 by element-wise multiplication, completing the alignment of F_1 to F_2.
5. The variable length input super resolution video reconstruction method based on deep learning as claimed in claim 1, wherein fusing a plurality of feature images specifically comprises:
(a) preliminarily fusing the M feature images to be fused by element-wise addition to obtain a preliminary fusion matrix U = U_1 + U_2 + ... + U_M,
where U_i represents the i-th feature image to be fused;
(b) applying global average pooling to the preliminary fusion matrix U to obtain the pooled result s, with s_c = (1/(W·H)) Σ_m Σ_n U_c(m, n),
where s_c denotes the c-th channel of the pooled result s; U_c denotes the feature matrix of the c-th channel of the preliminary fusion matrix U; and U_c(m, n) denotes the pixel value of matrix U_c at pixel point (m, n);
(c) building a correlation model between the channels of the feature map using two fully connected layers:
z = W_2 · (δ(W_1 · s))
where W_1 denotes the weight of the first fully connected layer, W_2 the weight of the second fully connected layer, and δ the ReLU activation function;
(d) establishing the internal correlation of each feature matrix in the spatial dimension using a 1×1 convolution layer:
v_i = CNN_1×1(W_3, U_i)
where CNN_1×1(·) denotes a convolution layer with a 1×1 kernel; W_3 denotes the weight matrix of the convolution layer;
(e) a_i = v_i · z;
(f) recalibrating {a_i} with a sigmoid function to obtain the total weight vector {b_i},
where j = 1, 2, ..., M; (m, n, c) denotes the position coordinates of a pixel point; b_{i,m,n,c} denotes the weight at pixel point (m, n, c) of the i-th feature image to be fused;
(g) multiplying the total weight vectors {b_i} element-wise with the corresponding feature images to be fused {U_i} and summing the products to obtain the fused result,
where ⊙ denotes element-wise multiplication, i.e., multiplication of elements at corresponding positions.
6. The variable length input super resolution video reconstruction method based on deep learning according to claim 1, wherein the depth residual module is formed by stacking a plurality of improved residual modules.
7. The variable length input super resolution video reconstruction method based on deep learning as claimed in claim 6, wherein the improved residual module comprises four convolution layers; with the number of input channels set to C, the first convolution layer has a 1×1 kernel and 6×C channels; the second convolution layer has a 1×1 kernel and C/2 channels; the third convolution layer has a 3×3 kernel and C/2 channels; the fourth convolution layer has a 1×1 kernel and C channels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010290657.1A CN111524068B (en) | 2020-04-14 | 2020-04-14 | Variable-length input super-resolution video reconstruction method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111524068A CN111524068A (en) | 2020-08-11 |
CN111524068B true CN111524068B (en) | 2023-06-02 |
Family
ID=71902261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010290657.1A Active CN111524068B (en) | 2020-04-14 | 2020-04-14 | Variable-length input super-resolution video reconstruction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111524068B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183353B (en) * | 2020-09-28 | 2022-09-20 | 腾讯科技(深圳)有限公司 | Image data processing method and device and related equipment |
CN112365403B (en) * | 2020-11-20 | 2022-12-27 | 山东大学 | Video super-resolution recovery method based on deep learning and adjacent frames |
CN112700392A (en) * | 2020-12-01 | 2021-04-23 | 华南理工大学 | Video super-resolution processing method, device and storage medium |
CN112580473B (en) * | 2020-12-11 | 2024-05-28 | 北京工业大学 | Video super-resolution reconstruction method integrating motion characteristics |
CN112750094B (en) * | 2020-12-30 | 2022-12-09 | 合肥工业大学 | Video processing method and system |
CN112767247A (en) * | 2021-01-13 | 2021-05-07 | 京东方科技集团股份有限公司 | Image super-resolution reconstruction method, model distillation method, device and storage medium |
CN112950470B (en) * | 2021-02-26 | 2022-07-15 | 南开大学 | Video super-resolution reconstruction method and system based on time domain feature fusion |
CN113099038B (en) * | 2021-03-08 | 2022-11-22 | 北京小米移动软件有限公司 | Image super-resolution processing method, image super-resolution processing device and storage medium |
CN112991183B (en) * | 2021-04-09 | 2023-06-20 | 华南理工大学 | Video super-resolution method based on multi-frame attention mechanism progressive fusion |
CN113052764B (en) * | 2021-04-19 | 2022-11-08 | 东南大学 | Video sequence super-resolution reconstruction method based on residual connection |
CN113507607B (en) * | 2021-06-11 | 2023-05-26 | 电子科技大学 | Compressed video multi-frame quality enhancement method without motion compensation |
CN113592719B (en) * | 2021-08-14 | 2023-11-28 | 北京达佳互联信息技术有限公司 | Training method of video super-resolution model, video processing method and corresponding equipment |
CN113888426B (en) * | 2021-09-28 | 2024-06-14 | 国网安徽省电力有限公司电力科学研究院 | Power monitoring video deblurring method based on depth separable residual error network |
CN113902623A (en) * | 2021-11-22 | 2022-01-07 | 天津大学 | Method for super-resolution of arbitrary-magnification video by introducing scale information |
CN114529456B (en) * | 2022-02-21 | 2022-10-21 | 深圳大学 | Super-resolution processing method, device, equipment and medium for video |
CN114819109B (en) * | 2022-06-22 | 2022-09-16 | 腾讯科技(深圳)有限公司 | Super-resolution processing method, device, equipment and medium for binocular image |
CN115035230B (en) * | 2022-08-12 | 2022-12-13 | 浙江天猫技术有限公司 | Video rendering processing method, device and equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108961186A (en) * | 2018-06-29 | 2018-12-07 | 赵岩 | A kind of old film reparation recasting method based on deep learning |
WO2019120110A1 (en) * | 2017-12-20 | 2019-06-27 | 华为技术有限公司 | Image reconstruction method and device |
CN110136056A (en) * | 2018-02-08 | 2019-08-16 | 华为技术有限公司 | The method and apparatus of image super-resolution rebuilding |
WO2020015167A1 (en) * | 2018-07-17 | 2020-01-23 | 西安交通大学 | Image super-resolution and non-uniform blur removal method based on fusion network |
Non-Patent Citations (2)
Title |
---|
Video super-resolution method based on a multi-scale feature residual learning convolutional neural network; Lin Qi et al.; Signal Processing (No. 01); full text *
Video super-resolution reconstruction algorithm based on a quantization error estimation model; Wang Chunmeng; Journal of Jinling Institute of Technology (No. 01); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111524068A (en) | 2020-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111524068B (en) | Variable-length input super-resolution video reconstruction method based on deep learning | |
CN110324664B (en) | Video frame supplementing method based on neural network and training method of model thereof | |
CN109671023B (en) | Face image super-resolution secondary reconstruction method | |
CN109102462B (en) | Video super-resolution reconstruction method based on deep learning | |
CN108122197B (en) | Image super-resolution reconstruction method based on deep learning | |
CN110136062B (en) | Super-resolution reconstruction method combining semantic segmentation | |
CN111028150B (en) | Rapid space-time residual attention video super-resolution reconstruction method | |
KR20190100320A (en) | Neural Network Model Training Method, Apparatus and Storage Media for Image Processing | |
CN110675321A (en) | Super-resolution image reconstruction method based on progressive depth residual error network | |
CN111835983B (en) | Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network | |
CN110349087B (en) | RGB-D image high-quality grid generation method based on adaptive convolution | |
CN114418853B (en) | Image super-resolution optimization method, medium and equipment based on similar image retrieval | |
Niu et al. | Blind motion deblurring super-resolution: When dynamic spatio-temporal learning meets static image understanding | |
CN111612703A (en) | Image blind deblurring method based on generation countermeasure network | |
CN114339030A (en) | Network live broadcast video image stabilization method based on self-adaptive separable convolution | |
CN112907448A (en) | Method, system, equipment and storage medium for super-resolution of any-ratio image | |
CN114663509A (en) | Self-supervision monocular vision odometer method guided by key point thermodynamic diagram | |
Bare et al. | Real-time video super-resolution via motion convolution kernel estimation | |
Shen et al. | Deeper super-resolution generative adversarial network with gradient penalty for sonar image enhancement | |
CN113096032B (en) | Non-uniform blurring removal method based on image region division | |
CN112396554A (en) | Image super-resolution algorithm based on generation countermeasure network | |
CN112200752B (en) | Multi-frame image deblurring system and method based on ER network | |
CN112435165B (en) | Two-stage video super-resolution reconstruction method based on generation countermeasure network | |
CN112598604A (en) | Blind face restoration method and system | |
CN117196948A (en) | Event data driving-based video super-resolution method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240102 Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province Patentee after: Dragon totem Technology (Hefei) Co.,Ltd. Address before: 710061 No. 33, South Second Ring Road, Shaanxi, Xi'an Patentee before: CHANG'AN University |