CN113902623A - Method for super-resolution of arbitrary-magnification video by introducing scale information - Google Patents

Method for super-resolution of arbitrary-magnification video by introducing scale information

Info

Publication number: CN113902623A
Application number: CN202111385618.0A
Authority: CN (China)
Prior art keywords: resolution, weight, frame, fusion, information
Legal status: Pending (the legal status listed is an assumption by Google Patents, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 万亮, 盖赫, 冯伟
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University; priority to CN202111385618.0A
Filing date: 2021-11-22
Publication date: 2022-01-07 (publication of CN113902623A)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Television Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the field of super-resolution in computer vision and aims to solve the problem of arbitrary-magnification video super-resolution, including non-integer (decimal) magnification factors. The invention comprises the following steps. In the proposed arbitrary-magnification video super-resolution method introducing scale information, adjacent frames are first aligned with the target frame at the feature level using deformable convolution; the features extracted from each element vector of the inverse position projection matrix are then fused with an attention-like mechanism; next, the offsets describing pixel motion between each adjacent frame and the target frame are fed into a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are fed into a feature weight prediction module to obtain feature weights. Finally, the two weights are multiplied element-wise to obtain the weights of the upsampling filter kernel, and the fused features are multiplied by these filter weights to realize magnification by an arbitrary factor. The method is mainly applied to arbitrary-magnification video super-resolution.

Description

Method for super-resolution of arbitrary-magnification video by introducing scale information
Technical Field
The invention belongs to the field of super-resolution in computer vision and relates to a method for realizing arbitrary magnification, and in particular to an arbitrary-magnification video super-resolution method that introduces scale information.
Background
Video super-resolution algorithm: an algorithm that enlarges the resolution of a video by a given factor. Video super-resolution usually takes 2N+1 (N = 1, 2, 3, ...) frames as input; the frame to be restored is called the target frame, and the frames providing information for it are called adjacent frames. Video super-resolution restores a high-resolution frame by exploiting the temporal information between adjacent frames and the spatial information of the target frame itself. The pipeline of a video super-resolution algorithm can be divided into four parts: alignment, fusion, reconstruction and upsampling.
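As a concrete illustration of the 2N+1 frame input just described, the following PyTorch sketch gathers the window of frames around a target frame; clamping repeated frames at the sequence boundaries is an assumption made here, since the patent does not specify boundary handling.

```python
import torch

def make_input_window(frames: torch.Tensor, t: int, n: int = 1) -> torch.Tensor:
    """Gather the 2N+1 low-resolution frames centred on target frame t.

    frames: tensor of shape (T, C, H, W) holding the low-resolution clip.
    Out-of-range indices are clamped (an assumption; the patent does not
    state how sequence ends are padded).
    """
    total = frames.shape[0]
    idx = [min(max(t + d, 0), total - 1) for d in range(-n, n + 1)]
    return frames[idx]                      # (2N+1, C, H, W)

# Example: a clip of 10 RGB frames, window around frame 4 with N = 1.
clip = torch.randn(10, 3, 64, 64)
window = make_input_window(clip, t=4, n=1)
print(window.shape)                          # torch.Size([3, 3, 64, 64])
```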
Video super-resolution differs from image super-resolution in that consecutive video frames exhibit strong temporal and spatial continuity. Because objects and scenes move between nearby frames, alignment uses multi-frame information to register misaligned pixels (or features): the difference between each adjacent frame and the target frame is estimated at the pixel or feature level, and the adjacent frame is adjusted to approximate the target frame. Fusion keeps the information in the aligned frames that helps super-resolve the target frame: it takes the target frame and the aligned adjacent frames as input and screens and retains features according to the temporal information between the adjacent frames and the target frame and the spatial information of the target frame. Reconstruction refers to operations such as receptive-field enlargement and feature extraction, whose purpose is to extract deep features of the picture and make full use of the available information before restoring the high-resolution image. Finally, upsampling maps the learned features to the high-resolution image space to obtain the final high-resolution video.
Arbitrary-scale image super-resolution algorithms: existing arbitrary-scale image super-resolution techniques fall into two categories. The first directly uses the position matrix relating the single low-resolution input frame to the high-resolution restored frame as the input of a module that predicts the filter weights for different scale factors, so that a single model realizes magnification by an arbitrary factor; however, the predicted weights are identical for all inputs under the same scale factor. The second category introduces scale-factor information in the stage preceding upsampling to assist feature learning, but it simply converts the scale factor into a vector, which cannot fully express the scale information.
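For reference, the sketch below shows the kind of per-pixel position/scale vector that such weight-prediction approaches consume. The (row offset, column offset, 1/r) encoding is an assumption chosen for illustration, not a definition taken from the cited works.

```python
import torch

def position_matrix(h_lr: int, w_lr: int, r: float) -> torch.Tensor:
    """For every pixel of the r-times enlarged output, return the fractional
    row/column offset of its back-projection onto the low-resolution grid
    plus the inverse scale 1/r.  Illustrative only."""
    h_hr, w_hr = int(h_lr * r), int(w_lr * r)
    ys = torch.arange(h_hr, dtype=torch.float32)
    xs = torch.arange(w_hr, dtype=torch.float32)
    off_y = ys / r - torch.floor(ys / r)                 # fractional part of the back-projection
    off_x = xs / r - torch.floor(xs / r)
    grid_y = off_y.view(-1, 1).expand(h_hr, w_hr)
    grid_x = off_x.view(1, -1).expand(h_hr, w_hr)
    scale = torch.full((h_hr, w_hr), 1.0 / r)
    return torch.stack([grid_y, grid_x, scale], dim=-1)  # (H_hr, W_hr, 3)

pm = position_matrix(32, 32, r=2.5)                      # decimal magnification works too
print(pm.shape)                                           # torch.Size([80, 80, 3])
```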
Video super-resolution is fundamentally different from image super-resolution in that it takes multiple frames as input and makes full use of adjacent-frame information in every stage of the algorithm to aid restoration, thereby achieving better results. Meanwhile, arbitrary-magnification super-resolution has not yet been studied in the video domain, and arbitrary-scale image super-resolution methods cannot be transferred directly to video because they ignore the characteristics of video. Video super-resolution is a popular research topic in computer vision, and arbitrary magnification is an aspect that cannot be ignored.
Deformable convolution: an ordinary convolution uses a fixed sampling grid, whereas deformable convolution adaptively adjusts its receptive field to accommodate different deformations, so its sampling shape is irregular. Offsets produced by a convolutional layer are added to the sampling locations of a standard convolution kernel, turning the conventional kernel into a deformable one. The flexible sampling of deformable convolution makes it well suited to the alignment module.
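A minimal sketch of feature-level alignment with deformable convolution, using torchvision's DeformConv2d. Predicting the offsets from the concatenated neighbour and target features follows the general idea above; the channel counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    """Align a neighbouring-frame feature map to the target frame at the
    feature level.  Offsets are predicted from the concatenated neighbour
    and target features and fed to a deformable convolution."""

    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        self.offset_conv = nn.Conv2d(2 * channels,
                                     2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels,
                                        kernel_size, padding=kernel_size // 2)

    def forward(self, neighbour_feat, target_feat):
        offset = self.offset_conv(torch.cat([neighbour_feat, target_feat], dim=1))
        return self.deform_conv(neighbour_feat, offset), offset

align = DeformableAlign()
nb, tg = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
aligned, offset = align(nb, tg)
print(aligned.shape, offset.shape)   # (1, 64, 32, 32) (1, 18, 32, 32)
```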
References
[1] Wang X., Chan K. C. K., Yu K., et al. EDVR: Video Restoration with Enhanced Deformable Convolutional Networks. CVPR Workshops, 2019.
[2] Hu X., Mu H., Zhang X., et al. Meta-SR: A Magnification-Arbitrary Network for Super-Resolution. CVPR, 2019: 1575-1584.
[3] Fu Y., Chen J., Zhang T., et al. Residual Scale Attention Network for Arbitrary Scale Image Super-Resolution. Neurocomputing, 2021: 201-211.
[4] Dai J., Qi H., Xiong Y., et al. Deformable Convolutional Networks. ICCV, 2017.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention aims to solve the problem of arbitrary-magnification video super-resolution, including decimal magnification factors. The technical scheme adopted by the invention is therefore an arbitrary-magnification video super-resolution method introducing scale information. First, adjacent frames are aligned with the target frame at the feature level using deformable convolution. Then the fusion module computes the similarity between each aligned adjacent frame and the target frame and pays more attention to the more similar parts; inter-frame temporal information is fused in this way, and intra-frame spatial information is fused with a non-local operation. Next, the features extracted from each element vector of the inverse position projection matrix are fused with an attention-like mechanism and injected into the residual learning process; by explicitly modelling different scale factors, the network is helped to adaptively adjust its feature learning. Finally, the offsets describing pixel motion between each adjacent frame and the target frame are fed into a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are fed into a feature weight prediction module to obtain feature weights. The two weights are multiplied element-wise to obtain the weights of the upsampling filter kernel, and the fused features are multiplied by these filter weights to realize magnification by an arbitrary factor.
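The skeleton below only illustrates how the four stages described above could be wired together. Every sub-module (align, fuse, reconstruct, upsample) is a stand-in supplied by the caller, not the architecture claimed by the patent; the toy wiring at the end exists only to show the data flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASVSRSkeleton(nn.Module):
    """Structural sketch of the four-stage pipeline: align -> fuse ->
    scale-aware reconstruction -> weight-predicting upsampling."""

    def __init__(self, align, fuse, reconstruct, upsample):
        super().__init__()
        self.align, self.fuse = align, fuse
        self.reconstruct, self.upsample = reconstruct, upsample

    def forward(self, frames: torch.Tensor, scale: float) -> torch.Tensor:
        # frames: (B, 2N+1, C, H, W); the middle frame is the target frame.
        t = frames.shape[1] // 2
        target = frames[:, t]
        aligned, offsets = self.align(frames, target)   # S1: feature-level alignment
        fused = self.fuse(aligned, target)              # S2: temporal + spatial fusion
        deep = self.reconstruct(fused, scale)           # S3: scale-aware reconstruction
        return self.upsample(deep, offsets, scale)      # S4: arbitrary-scale upsampling

# Toy wiring with identity-style stand-ins just to show the data flow:
align = lambda frames, tgt: (frames, torch.zeros(frames.shape[0], 1))
fuse = lambda aligned, tgt: aligned.mean(dim=1)
reconstruct = lambda feat, s: feat
upsample = lambda feat, off, s: F.interpolate(feat, scale_factor=s,
                                              mode="bilinear", align_corners=False)
net = ASVSRSkeleton(align, fuse, reconstruct, upsample)
out = net(torch.randn(1, 3, 3, 16, 16), scale=2.5)
print(out.shape)     # torch.Size([1, 3, 40, 40])
```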
The method comprises the following specific steps:
The super-resolution network consists of alignment, fusion, reconstruction and upsampling; scale-factor information is introduced in the reconstruction and upsampling stages. The network is realized according to the following steps:
S1, alignment: deformable convolution is used at the feature level, with the target frame as reference, to align all adjacent frames and obtain the aligned adjacent-frame features;
S2, fusion: the aligned adjacent frames and the target frame are taken as input and information is fused along the temporal and spatial dimensions; inter-frame information is fused in the temporal dimension through an attention mechanism, and intra-frame information is fused in the spatial dimension through a non-local operation;
S3, reconstruction: the inverse position projection matrix is used to represent the information related to the scale factor, so that the features learned under different scale factors differ;
S4, upsampling: the position weights between the low-resolution input and the high-resolution output and the feature weights of different target frames are predicted first, the two kinds of weights are combined to obtain the upsampling filter weights, and the fused features are multiplied by these weights pixel by pixel to obtain the final super-resolution result.
The reconstruction module that introduces scale-factor information is realized by the following steps:
2.1 the fused target-frame features and the inverse position projection matrix representing the scale-factor information are input to the scale-aware position feature extraction module to obtain the fusion feature of the inverse position projection matrix;
2.2 the fused target-frame features are input to a conventional residual block, consisting of two convolution layers and two activation layers, to obtain the residual fusion feature;
2.3 the feature with introduced scale information is obtained by concatenating the fusion feature of the inverse position projection matrix with the residual fusion feature and adding the result to the image feature; this is Equation (1), whose rendered form appears only as an image in the original publication. In Equation (1), the output denotes the feature fusing scale information, C() denotes a convolution operation, con() the concatenation operation, R() an activation function, E(IP) the inverse position projection matrix, and the remaining input symbol the fused feature produced by the fusion module.
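Since Equation (1) is only available as an image, the following LaTeX block gives a hypothetical reconstruction from the symbol legend above, assuming the residual branch is the two-convolution, two-activation block of step 2.2 and that a convolution projects the concatenation before the skip connection.

```latex
% Hypothetical reconstruction of Equation (1); the symbols F_SI and F_fus and
% the exact operator nesting are assumptions, only C, con, R and E(IP) are
% named in the text.
\[
F_{\mathrm{SI}}
  = C\!\Big(\mathrm{con}\big(E(IP),\;
      R\big(C\big(R\big(C(F_{\mathrm{fus}})\big)\big)\big)\big)\Big)
  + F_{\mathrm{fus}}
\tag{1}
\]
```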
The weight of the upsampling filter consists of a position weight and a feature weight:
3.1 the position weight is predicted from the positional correspondence between the multi-frame low-resolution input and the high-resolution output together with the scale factor;
3.2 the feature weight is predicted for different target frames, taking the aligned features output by the alignment module as input;
3.3 finally, the two weights are multiplied to obtain the upsampling filter weight.
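In symbols (the notation below is assumed for illustration, not taken from the patent), the combination in 3.1 to 3.3 and its use in the upsampling stage can be written as:

```latex
% W_pos: position weight, W_feat: feature weight, F_SI: reconstructed feature,
% \odot: element-wise (pixel-by-pixel) multiplication.  Symbol names are assumptions.
\[
W = W_{\mathrm{pos}} \odot W_{\mathrm{feat}}, \qquad
I^{\mathrm{HR}} = F_{\mathrm{SI}} \odot W
\]
```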
The detailed steps are as follows:
The super-resolution network comprises four stages, alignment, fusion, reconstruction and upsampling, and realizes arbitrary-magnification video super-resolution according to the following steps:
S1, the features of the adjacent frames and of the target frame are extracted first, and in the alignment stage the adjacent frames are aligned to the target frame at the feature level by deformable convolution. The adjacent frames and the target frame are input to the network and features are extracted with residual blocks; then the features of each adjacent frame are combined with those of the target frame, the offset between them is obtained by convolution, and the offset and the adjacent-frame features are input to a deformable convolution to obtain the aligned adjacent-frame features;
S2, the aligned adjacent frames output by the alignment module and the target frame are taken as input; in the fusion stage the similarity between each adjacent frame and the target frame is computed, information with higher similarity to the target frame is given more weight during fusion, and spatial information within the target frame is fused with a non-local method, as illustrated by the sketch below;
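A minimal sketch of the attention-style temporal fusion just described: each aligned neighbour is weighted by its per-pixel similarity to the target frame and the weighted frames are merged. The non-local spatial fusion is omitted, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Weight each aligned neighbouring frame by its per-pixel similarity to
    the target frame, then merge the weighted frames."""

    def __init__(self, channels: int = 64, num_frames: int = 3):
        super().__init__()
        self.embed_neighbour = nn.Conv2d(channels, channels, 3, padding=1)
        self.embed_target = nn.Conv2d(channels, channels, 3, padding=1)
        self.merge = nn.Conv2d(num_frames * channels, channels, 1)

    def forward(self, aligned: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
        # aligned: (B, T, C, H, W) aligned frame features; target_feat: (B, C, H, W)
        b, t, c, h, w = aligned.shape
        emb_nb = self.embed_neighbour(aligned.reshape(b * t, c, h, w)).view(b, t, c, h, w)
        emb_tg = self.embed_target(target_feat).unsqueeze(1)                     # (B, 1, C, H, W)
        similarity = torch.sigmoid((emb_nb * emb_tg).sum(dim=2, keepdim=True))   # (B, T, 1, H, W)
        weighted = aligned * similarity            # more similar parts receive more attention
        return self.merge(weighted.reshape(b, t * c, h, w))                      # (B, C, H, W)

fusion = TemporalAttentionFusion()
out = fusion(torch.randn(2, 3, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)    # torch.Size([2, 64, 32, 32])
```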
S3, the features of the fused spatio-temporal information are input, the scale factor is introduced and deeper information is extracted; the reconstruction module is obtained by stacking scale-aware residual blocks, whose specific workflow is as follows (a code sketch follows these three steps):
1) construct the inverse position projection matrix: for each pixel on the high-resolution frame, dividing its coordinates by the scale factor gives the corresponding pixel on the low-resolution frame, so each point on the low-resolution frame corresponds to a set of points on the high-resolution restored frame, and these sets form the inverse position projection matrix; the projection entry for a point (i′, j′) on the low-resolution frame is constructed as:
F_l(i′, j′) = { I_s(i, j) | i′·r ≤ i ≤ (i′+1)·r; j′·r ≤ j ≤ (j′+1)·r },  (2)
where F_l(i′, j′) denotes a point on the low-resolution feature frame, r is the scale factor, and the right-hand side is the set of points on the high-resolution frame determined by (i′, j′);
2) the fusion features output by the fusion module and the inverse position projection matrix are taken as input to obtain the fusion feature of the inverse position projection matrix, specifically:
first, the element at position (i′, j′) of the inverse position projection matrix is reshaped into a position offset matrix of dimension (n, 3), where n is the number of high-resolution pixel positions contained in the current element;
second, features are extracted from the vector formed by concatenating each position offset with the scale factor, giving the position offset features;
third, the input feature at the current position is combined with the position offset features, and the fusion weights are predicted with a fully connected layer and a sigmoid (logistic regression) activation layer;
fourth, all position offset features are weighted by the fusion weights to obtain the fusion feature of the inverse position projection matrix;
3) the fused features are input to a residual block consisting of two convolution layers and two activation functions to obtain new fusion features, which are concatenated with the features from step 2), and the result is finally added to the fused features to obtain the features fusing scale information;
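The sketch below strings steps 1) to 3) together for a single scale-aware residual block. The (dy, dx, 1/r, r) encoding of each entry of the inverse position projection matrix, the MLP sizes and the 1x1 fusion convolution are assumptions made so the example runs; they are not the patent's exact design.

```python
import math
import torch
import torch.nn as nn

class ScaleAwareResidualBlock(nn.Module):
    """Sketch of one scale-aware residual block: position offsets from the
    inverse position projection matrix are embedded, fused with attention-like
    weights, combined with a plain residual branch and added back."""

    def __init__(self, channels: int = 64, hidden: int = 32):
        super().__init__()
        self.offset_mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                        nn.Linear(hidden, channels))
        self.weight_fc = nn.Sequential(nn.Linear(2 * channels, 1), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.fuse_conv = nn.Conv2d(2 * channels, channels, 1)

    def inverse_projection_offsets(self, r: float) -> torch.Tensor:
        """Offsets of the HR pixels that project back onto one LR pixel
        (cf. Eq. (2)), encoded as (dy, dx, 1/r, r) rows; an assumed encoding."""
        n = math.ceil(r)
        dy, dx = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        rows = torch.stack([dy.flatten() / r, dx.flatten() / r], dim=1).float()
        extra = torch.tensor([[1.0 / r, r]]).expand(rows.shape[0], 2)
        return torch.cat([rows, extra], dim=1)              # (K, 4)

    def forward(self, feat: torch.Tensor, r: float) -> torch.Tensor:
        b, c, h, w = feat.shape
        offsets = self.inverse_projection_offsets(r).to(feat.device)   # (K, 4)
        off_feat = self.offset_mlp(offsets)                            # (K, C) position-offset features
        # Attention-like fusion weights from [input feature, offset feature] pairs.
        pix = feat.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)
        pair = torch.cat([pix.expand(-1, off_feat.shape[0], -1),
                          off_feat.unsqueeze(0).expand(b * h * w, -1, -1)], dim=2)
        weights = self.weight_fc(pair)                                 # (BHW, K, 1)
        ip_fused = (weights * off_feat).sum(dim=1)                     # (BHW, C)
        ip_fused = ip_fused.view(b, h, w, c).permute(0, 3, 1, 2)       # (B, C, H, W)
        res = self.residual(feat)                                      # plain residual branch
        out = self.fuse_conv(torch.cat([ip_fused, res], dim=1))        # splice the two branches
        return out + feat                                              # add back the input feature

block = ScaleAwareResidualBlock()
y = block(torch.randn(1, 64, 16, 16), r=2.5)
print(y.shape)      # torch.Size([1, 64, 16, 16])
```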
S4, in the upsampling stage the upsampling filter weights are predicted first, and the features obtained in S3 are multiplied by the filter pixel by pixel to obtain the high-resolution restored frame. The filter weights are obtained as follows (a code sketch follows these steps):
1) the O2P module takes the offsets between the adjacent frames and the target frame produced by the alignment module as input, predicts the correspondence between the multi-frame low-resolution input and the high-resolution output, learning the pixel-wise positional relation of the high-resolution image, and this relation is fed to the position weight prediction (P2W) module to obtain the position weights;
2) the offsets and the aligned features carry different information: the offsets describe the correspondence between pixels of two frames, whereas the aligned features contain more image information; the feature weights are therefore predicted with the P2W structure as well, taking the aligned features output by the alignment module as input;
3) the position weights and the feature weights are multiplied pixel by pixel to obtain the final upsampling filter weights.
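An illustrative, heavily simplified version of the weight-predicting upsampler: the position branch stands in for the O2P/P2W pair by computing the back-projection relation directly and mapping it to a position weight, the feature branch maps the aligned features to a feature weight, and their product modulates the fused feature pixel by pixel before a final RGB projection. All internals are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArbitraryScaleUpsampler(nn.Module):
    """Sketch of arbitrary-scale upsampling with a predicted filter weight
    (position weight x feature weight), applied pixel by pixel."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.p2w = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, channels))          # position weight per HR pixel
        self.f2w = nn.Conv2d(channels, channels, 3, padding=1)     # feature weight from aligned features
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, fused, aligned_feat, r: float):
        b, c, h, w = fused.shape
        h_hr, w_hr = int(h * r), int(w * r)
        # Per-HR-pixel relation to its LR source: fractional offsets plus 1/r.
        ys = torch.arange(h_hr, device=fused.device, dtype=torch.float32)
        xs = torch.arange(w_hr, device=fused.device, dtype=torch.float32)
        oy = (ys / r - torch.floor(ys / r)).view(-1, 1).expand(h_hr, w_hr)
        ox = (xs / r - torch.floor(xs / r)).view(1, -1).expand(h_hr, w_hr)
        rel = torch.stack([oy, ox, torch.full_like(oy, 1.0 / r)], dim=-1)   # (H_hr, W_hr, 3)
        w_pos = self.p2w(rel).permute(2, 0, 1).unsqueeze(0)                 # (1, C, H_hr, W_hr)
        w_feat = F.interpolate(self.f2w(aligned_feat), size=(h_hr, w_hr),
                               mode="bilinear", align_corners=False)        # (B, C, H_hr, W_hr)
        filter_weight = w_pos * w_feat                                      # combined filter weight
        feat_hr = F.interpolate(fused, size=(h_hr, w_hr),
                                mode="bilinear", align_corners=False)
        return self.to_rgb(feat_hr * filter_weight)                         # pixel-wise modulation, then RGB

up = ArbitraryScaleUpsampler()
hr = up(torch.randn(1, 64, 24, 24), torch.randn(1, 64, 24, 24), r=1.5)
print(hr.shape)    # torch.Size([1, 3, 36, 36])
```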
The invention has the following characteristics and beneficial effects:
1) Addressing the current lack of work on arbitrary-magnification video super-resolution, a network structure realizing arbitrary-magnification video super-resolution is proposed; magnification by any factor can be realized by training a single model.
2) By analysing the importance of the scale factor in arbitrary-magnification video super-resolution, the invention proposes a scale-aware reconstruction module. The module constructs a projection-relation matrix between the low-resolution image and its features according to the scale factor, fuses the features extracted from each element vector of the projection matrix with an attention-like mechanism, and uses these features as additional information to help the network learn.
3) To make full use of the consecutive multi-frame input of video super-resolution, the invention proposes an upsampling module that realizes arbitrary magnification; it takes the positional correspondence between the multi-frame input and the output frame together with the aligned features as input and dynamically predicts the upsampling filter weights. The filter weights differ from frame to frame, which preserves the characteristics of the result to a greater extent.
Description of the drawings:
Fig. 1 is a flow chart of the arbitrary-magnification video super-resolution method introducing scale information according to the present invention.
Fig. 2 is a schematic diagram of the overall network structure of the present invention.
Fig. 3 is an exemplary diagram of the scale-aware residual block proposed by the present invention; it is a refinement of the "scale-aware reconstruction" in Fig. 2, and several scale-aware residual blocks are stacked to form the scale-aware reconstruction module in the overall network.
Fig. 4 is a schematic diagram of obtaining the fusion feature of the inverse position projection matrix, i.e. the specific implementation of the scale-aware position feature extraction in Fig. 3.
Fig. 5 compares the results of the present invention with classical super-resolution methods.
Detailed Description
In order to enable video super-resolution to realize magnification by an arbitrary factor, the invention proposes ASVSR, an arbitrary-magnification video super-resolution method introducing scale information, which also supports decimal magnification factors. The proposed ASVSR has four stages. First, adjacent frames are aligned at the feature level using deformable convolution. Then the fusion module computes the similarity between each aligned adjacent frame and the target frame, pays more attention to the more similar parts, fuses inter-frame temporal information in this way, and fuses intra-frame spatial information with a non-local operation. Next, the features extracted from each element vector of the inverse position projection matrix are fused with an attention-like mechanism and injected into the residual learning process; explicitly modelling different scale factors helps the network adaptively adjust its feature learning. Finally, the offsets describing pixel motion between each adjacent frame and the target frame are fed into a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are fed into a feature weight prediction module to obtain feature weights. The two weights are multiplied element-wise to obtain the upsampling filter kernel weights, and the fused features are multiplied by these weights to realize magnification by an arbitrary factor.
The specific technical scheme of the invention is as follows:
An arbitrary-magnification video super-resolution method introducing scale information comprises a super-resolution network formed by alignment, fusion, reconstruction and upsampling; scale-factor information is introduced in the reconstruction and upsampling stages, and the network is realized according to the following steps:
S1, the alignment module of the network uses deformable convolution at the feature level, with the target frame as reference, to align all adjacent frames and obtain the aligned adjacent-frame features;
S2, the fusion module takes the aligned adjacent frames and the target frame as input and fuses information along the temporal and spatial dimensions: inter-frame information is fused in the temporal dimension through an attention mechanism, and intra-frame information is fused in the spatial dimension through a non-local operation;
S3, the reconstruction stage uses the inverse position projection matrix to represent the information related to the scale factor, so that the features learned under different scale factors differ;
S4, the upsampling module first predicts the position weights between the low-resolution input and the high-resolution output and the feature weights of different target frames, combines the two kinds of weights to obtain the upsampling filter weights, and multiplies the fused features by these weights pixel by pixel to obtain the final super-resolution result.
The reconstruction module that introduces scale-factor information is realized by the following steps:
2.1 the fused target-frame features and the inverse position projection matrix representing the scale-factor information are input to the scale-aware position feature extraction module to obtain the fusion feature of the inverse position projection matrix.
2.2 the fused target-frame features are input to a conventional residual block, consisting of two convolution layers and two activation layers, to obtain the residual fusion feature.
2.3 the feature with introduced scale information is obtained by concatenating the fusion feature of the inverse position projection matrix with the residual fusion feature and adding the result to the image feature; this is Equation (1), whose rendered form appears only as an image in the original publication. In Equation (1), the output denotes the feature fusing scale information, C() denotes a convolution operation, con() the concatenation operation, R() an activation function, E(IP) the inverse position projection matrix, and the remaining input symbol the fused feature produced by the fusion module.
The weight of the upsampling filter consists of a position weight and a feature weight:
3.1 the position weight is predicted from the positional correspondence between the multi-frame low-resolution input and the high-resolution output together with the scale factor;
3.2 the feature weight is predicted for different target frames, taking the aligned features output by the alignment module as input;
3.3 finally, the two weights are multiplied to obtain the upsampling filter weight.
The technical scheme of the invention is further explained below in combination with the accompanying drawings.
As shown in Fig. 1, the present invention provides an arbitrary-magnification video super-resolution method introducing scale information, comprising a super-resolution network composed of four stages: alignment, fusion, reconstruction and upsampling. The network realizes arbitrary-magnification video super-resolution according to the following steps:
S1, the features of the adjacent frames and of the target frame are extracted first, and the alignment module aligns the adjacent frames to the target frame at the feature level by deformable convolution after the adjacent frames and the target frame are input to the network. Features are extracted with residual blocks, a common feature-extraction structure that yields deep feature information; then the features of each adjacent frame are combined with those of the target frame, the offset between them is obtained by convolution, and the offset and the adjacent-frame features are input to a deformable convolution to obtain the aligned adjacent-frame features;
S2, the aligned adjacent frames output by the alignment module and the target frame are input to the fusion module, which computes the similarity between each adjacent frame and the target frame, gives more weight during fusion to information with higher similarity to the target frame, and fuses the spatial information of the target frame with a non-local method.
S3, the features of the fused spatio-temporal information are input to the reconstruction module to introduce the scale factor and extract deeper information; the module is formed by stacking the scale-aware residual blocks proposed by the invention, as shown in Fig. 3, and the specific workflow of a scale-aware residual block is as follows:
1) construct the inverse position projection matrix: for each pixel on the high-resolution frame, dividing its coordinates by the scale factor gives the corresponding pixel on the low-resolution frame, so each point on the low-resolution frame corresponds to a set of points on the high-resolution restored frame, and these sets form the inverse position projection matrix. The projection entry for a point (i′, j′) on the low-resolution frame is constructed as:
F_l(i′, j′) = { I_s(i, j) | i′·r ≤ i ≤ (i′+1)·r; j′·r ≤ j ≤ (j′+1)·r },  (2)
where F_l(i′, j′) denotes a point on the low-resolution feature frame, r is the scale factor, and the right-hand side is the set of points on the high-resolution frame determined by (i′, j′).
2) the fusion features output by the fusion module and the inverse position projection matrix are taken as input to obtain the fusion feature of the inverse position projection matrix, as shown in Fig. 4; the specific steps are:
first, the element at position (i′, j′) of the inverse position projection matrix is reshaped into a position offset matrix of dimension (n, 3), where n is the number of high-resolution pixel positions contained in the current element.
Second, features are extracted from the vector formed by concatenating each position offset with the scale factor, giving the position offset features.
Third, the input feature at the current position is combined with the position offset features, and the fusion weights are predicted with a fully connected layer and a sigmoid (logistic regression) activation layer.
Fourth, all position offset features are weighted by the fusion weights to obtain the fusion feature of the inverse position projection matrix.
3) the fused features are input to a residual block consisting of two convolution layers and two activation functions to obtain new fusion features, which are concatenated with the features from step 2), and the result is finally added to the fused features to obtain the features fusing scale information.
S4, the upsampling module first predicts the upsampling filter weights and multiplies the features obtained in S3 by the filter pixel by pixel to obtain the high-resolution restored frame, as shown in the upper half of Fig. 2. The filter weights are obtained as follows:
1) the O2P module takes the offsets between the adjacent frames and the target frame produced by the alignment module as input, predicts the correspondence between the multi-frame low-resolution input and the high-resolution output, learning the pixel-wise positional relation of the high-resolution image, and this relation is fed to the position weight prediction (P2W) module to obtain the position weights;
2) the offsets and the aligned features carry different information: the offsets describe the correspondence between pixels of two frames, whereas the aligned features contain more image information; the feature weights are therefore predicted with the P2W structure as well, taking the aligned features output by the alignment module as input;
3) the position weights and the feature weights are multiplied pixel by pixel to obtain the final upsampling filter weights.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope of the present invention shall fall within the scope of the present invention.

Claims (5)

1. An arbitrary-magnification video super-resolution method introducing scale information, characterized in that: first, adjacent frames are aligned at the feature level using deformable convolution; then the fusion module computes the similarity between each aligned adjacent frame and the target frame, pays more attention to the more similar parts, fuses inter-frame temporal information in this way, and fuses intra-frame spatial information with a non-local operation; next, the features extracted from each element vector of the inverse position projection matrix are fused with an attention-like mechanism and injected into the residual learning process, so that explicitly modelling different scale factors helps the network adaptively adjust its feature learning; then, the offsets describing pixel motion between each adjacent frame and the target frame are input to a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are input to a feature weight prediction module to obtain feature weights; finally, the two weights are multiplied element-wise to obtain the upsampling filter kernel weights, and the fused features are multiplied by these weights to realize magnification by an arbitrary factor.
2. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the specific steps are as follows: the super-resolution network consists of alignment, fusion, reconstruction and upsampling, scale-factor information is introduced in the reconstruction and upsampling stages, and the network is realized according to the following steps:
S1, alignment: deformable convolution is used at the feature level, with the target frame as reference, to align all adjacent frames and obtain the aligned adjacent-frame features;
S2, fusion: the aligned adjacent frames and the target frame are taken as input and information is fused along the temporal and spatial dimensions; inter-frame information is fused in the temporal dimension through an attention mechanism, and intra-frame information is fused in the spatial dimension through a non-local operation;
S3, reconstruction: the inverse position projection matrix is used to represent the information related to the scale factor, so that the features learned under different scale factors differ;
S4, upsampling: the position weights between the low-resolution input and the high-resolution output and the feature weights of different target frames are predicted first, the two kinds of weights are combined to obtain the upsampling filter weights, and the fused features are multiplied by these weights pixel by pixel to obtain the final super-resolution result.
3. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the reconstruction module that introduces scale-factor information is realized by the following steps:
2.1 the fused target-frame features and the inverse position projection matrix representing the scale-factor information are input to the scale-aware position feature extraction module to obtain the fusion feature of the inverse position projection matrix;
2.2 the fused target-frame features are input to a conventional residual block, consisting of two convolution layers and two activation layers, to obtain the residual fusion feature;
2.3 the feature with introduced scale information is obtained by concatenating the fusion feature of the inverse position projection matrix with the residual fusion feature and adding the result to the image feature; this is Equation (1), whose rendered form appears only as an image in the original publication. In Equation (1), the output denotes the feature fusing scale information, C() denotes a convolution operation, con() the concatenation operation, R() an activation function, E(IP) the inverse position projection matrix, and the remaining input symbol the fused feature produced by the fusion module.
4. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the upsampling filter weight consists of two parts, a position weight and a feature weight:
3.1 the position weight is predicted from the positional correspondence between the multi-frame low-resolution input and the high-resolution output together with the scale factor;
3.2 the feature weight is predicted for different target frames, taking the aligned features output by the alignment module as input;
3.3 finally, the two weights are multiplied to obtain the upsampling filter weight.
5. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the detailed steps are as follows:
the super-resolution network comprises four stages of alignment, fusion, reconstruction and upsampling, and realizes arbitrary-magnification video super-resolution according to the following steps:
S1, the features of the adjacent frames and of the target frame are extracted first, and in the alignment stage the adjacent frames are aligned to the target frame at the feature level by deformable convolution; the adjacent frames and the target frame are input to the network and features are extracted with residual blocks; then the features of each adjacent frame are combined with those of the target frame, the offset between them is obtained by convolution, and the offset and the adjacent-frame features are input to a deformable convolution to obtain the aligned adjacent-frame features;
S2, the aligned adjacent frames output by the alignment module and the target frame are taken as input; in the fusion stage the similarity between each adjacent frame and the target frame is computed, information with higher similarity to the target frame is given more weight during fusion, and spatial information within the target frame is fused with a non-local method;
S3, the features of the fused spatio-temporal information are input, the scale factor is introduced and deeper information is extracted; the reconstruction module is obtained by stacking scale-aware residual blocks, whose specific workflow is as follows:
1) construct the inverse position projection matrix: for each pixel on the high-resolution frame, dividing its coordinates by the scale factor gives the corresponding pixel on the low-resolution frame, so each point on the low-resolution frame corresponds to a set of points on the high-resolution restored frame, and these sets form the inverse position projection matrix; the projection entry for a point (i′, j′) on the low-resolution frame is constructed as:
F_l(i′, j′) = { I_s(i, j) | i′·r ≤ i ≤ (i′+1)·r; j′·r ≤ j ≤ (j′+1)·r },  (2)
where F_l(i′, j′) denotes a point on the low-resolution feature frame, r is the scale factor, and the right-hand side is the set of points on the high-resolution frame determined by (i′, j′);
2) the fusion features output by the fusion module and the inverse position projection matrix are taken as input to obtain the fusion feature of the inverse position projection matrix, specifically:
first, the element at position (i′, j′) of the inverse position projection matrix is reshaped into a position offset matrix of dimension (n, 3), where n is the number of high-resolution pixel positions contained in the current element;
second, features are extracted from the vector formed by concatenating each position offset with the scale factor, giving the position offset features;
third, the input feature at the current position is combined with the position offset features, and the fusion weights are predicted with a fully connected layer and a sigmoid (logistic regression) activation layer;
fourth, all position offset features are weighted by the fusion weights to obtain the fusion feature of the inverse position projection matrix;
3) the fused features are input to a residual block consisting of two convolution layers and two activation functions to obtain new fusion features, which are concatenated with the features from step 2), and the result is finally added to the fused features to obtain the features fusing scale information;
S4, in the upsampling stage the upsampling filter weights are predicted first, and the features obtained in S3 are multiplied by the filter pixel by pixel to obtain the high-resolution restored frame; the filter weights are obtained as follows:
1) the O2P module takes the offsets between the adjacent frames and the target frame produced by the alignment module as input, predicts the correspondence between the multi-frame low-resolution input and the high-resolution output, learning the pixel-wise positional relation of the high-resolution image, and this relation is fed to the position weight prediction (P2W) module to obtain the position weights;
2) the offsets and the aligned features carry different information: the offsets describe the correspondence between pixels of two frames, whereas the aligned features contain more image information; the feature weights are therefore predicted with the P2W structure as well, taking the aligned features output by the alignment module as input;
3) the position weights and the feature weights are multiplied pixel by pixel to obtain the final upsampling filter weights.
CN202111385618.0A 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information Pending CN113902623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111385618.0A CN113902623A (en) 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111385618.0A CN113902623A (en) 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information

Publications (1)

Publication Number Publication Date
CN113902623A true CN113902623A (en) 2022-01-07

Family

ID=79194948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111385618.0A Pending CN113902623A (en) 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information

Country Status (1)

Country Link
CN (1) CN113902623A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819109A (en) * 2022-06-22 2022-07-29 腾讯科技(深圳)有限公司 Super-resolution processing method, device, equipment and medium for binocular image
CN116012230A (en) * 2023-01-17 2023-04-25 深圳大学 Space-time video super-resolution method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190059157A (en) * 2017-11-22 2019-05-30 에스케이텔레콤 주식회사 Method and Apparatus for Improving Image Quality
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190059157A (en) * 2017-11-22 2019-05-30 에스케이텔레콤 주식회사 Method and Apparatus for Improving Image Quality
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯伟昌; 林玉池; 何冬; 宋乐; 赵美蓉: "基于FPGA的双通道实时图像处理***" (FPGA-based dual-channel real-time image processing ***), 传感技术学报 (Chinese Journal of Sensors and Actuators), no. 08, 20 August 2010 (2010-08-20) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819109A (en) * 2022-06-22 2022-07-29 腾讯科技(深圳)有限公司 Super-resolution processing method, device, equipment and medium for binocular image
CN116012230A (en) * 2023-01-17 2023-04-25 深圳大学 Space-time video super-resolution method, device, equipment and storage medium
CN116012230B (en) * 2023-01-17 2023-09-29 深圳大学 Space-time video super-resolution method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Gui et al. Featureflow: Robust video interpolation via structure-to-texture generation
JP7093886B2 (en) Image processing methods and devices, electronic devices and storage media
Liang et al. Swinir: Image restoration using swin transformer
Wang et al. Learning for video super-resolution through HR optical flow estimation
US20190228264A1 (en) Method and apparatus for training neural network model used for image processing, and storage medium
CN111652899B (en) Video target segmentation method for space-time component diagram
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
CN113902623A (en) Method for super-resolution of arbitrary-magnification video by introducing scale information
CN107633482B (en) Super-resolution reconstruction method based on sequence image
CN111696035A (en) Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm
CN112435282A (en) Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network
CN109949221B (en) Image processing method and electronic equipment
CN113947531A (en) Iterative collaborative video super-resolution reconstruction method and system
Zhou et al. Image super-resolution based on dense convolutional auto-encoder blocks
CN115359370A (en) Remote sensing image cloud detection method and device, computer device and storage medium
CN113902620A (en) Video super-resolution system and method based on deformable convolution network
CN116403152A (en) Crowd density estimation method based on spatial context learning network
Tang et al. Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction
Li et al. HoloParser: Holistic visual parsing for real-time semantic segmentation in autonomous driving
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
CN116152710A (en) Video instance segmentation method based on cross-frame instance association
Zhang et al. Image deblurring based on lightweight multi-information fusion network
CN115512393A (en) Human body posture estimation method based on improved HigherHRNet
Luo et al. Efficient lightweight network for video super-resolution
Li et al. Lightweight single image super-resolution based on multi-path progressive feature fusion and attention mechanism

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination