CN113902623A - Method for super-resolution of arbitrary-magnification video by introducing scale information - Google Patents

Method for super-resolution of arbitrary-magnification video by introducing scale information

Info

Publication number: CN113902623A
Application number: CN202111385618.0A
Authority: CN (China)
Prior art keywords: resolution, weight, frame, fusion, information
Legal status: Pending (the legal status listed is an assumption by Google Patents, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 万亮, 盖赫, 冯伟
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Application filed by Tianjin University; priority to CN202111385618.0A
Filing date: 2021-11-22
Publication date: 2022-01-07 (publication of CN113902623A)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Television Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the field of super-resolution in computer vision and aims to solve the problem of arbitrary-magnification video super-resolution, including non-integer (decimal) magnification factors. The invention comprises the following steps. In the proposed arbitrary-magnification video super-resolution method introducing scale information, adjacent frames are first aligned with the target frame at the feature level using deformable convolution; the features extracted from each element vector of the inverse position projection matrix are then fused with an attention-like mechanism; next, the offsets describing pixel motion between each adjacent frame and the target frame are fed into a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are fed into a feature weight prediction module to obtain feature weights. Finally, the two weights are multiplied element-wise to obtain the weights of the upsampling filter kernel, and the fused features are multiplied by these filter weights to realize magnification by an arbitrary factor. The method is mainly applied to arbitrary-magnification video super-resolution.

Description

Method for super-resolution of arbitrary-magnification video by introducing scale information
Technical Field
The invention belongs to the field of super-resolution in computer vision and relates to a method for realizing arbitrary magnification, and in particular to an arbitrary-magnification video super-resolution method that introduces scale information.
Background
Video super-resolution algorithm: an algorithm that enlarges the resolution of a video by a given factor. Video super-resolution usually takes 2N+1 (N = 1, 2, 3, ...) frames as input; the frame to be restored is called the target frame, and the frames providing information for it are called adjacent frames. Video super-resolution restores a high-resolution frame by exploiting the temporal information between adjacent frames and the spatial information of the target frame itself. The pipeline of a video super-resolution algorithm can be divided into four parts: alignment, fusion, reconstruction and upsampling.
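As a concrete illustration of the 2N+1 frame input just described, the following PyTorch sketch gathers the window of frames around a target frame; clamping repeated frames at the sequence boundaries is an assumption made here, since the patent does not specify boundary handling.

```python
import torch

def make_input_window(frames: torch.Tensor, t: int, n: int = 1) -> torch.Tensor:
    """Gather the 2N+1 low-resolution frames centred on target frame t.

    frames: tensor of shape (T, C, H, W) holding the low-resolution clip.
    Out-of-range indices are clamped (an assumption; the patent does not
    state how sequence ends are padded).
    """
    total = frames.shape[0]
    idx = [min(max(t + d, 0), total - 1) for d in range(-n, n + 1)]
    return frames[idx]                      # (2N+1, C, H, W)

# Example: a clip of 10 RGB frames, window around frame 4 with N = 1.
clip = torch.randn(10, 3, 64, 64)
window = make_input_window(clip, t=4, n=1)
print(window.shape)                          # torch.Size([3, 3, 64, 64])
```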
Video super-resolution differs from image super-resolution in that consecutive video frames exhibit strong temporal and spatial continuity. Because objects and scenes move between nearby frames, alignment uses multi-frame information to register misaligned pixels (or features): the difference between each adjacent frame and the target frame is estimated at the pixel or feature level, and the adjacent frame is adjusted to approximate the target frame. Fusion keeps the information in the aligned frames that helps super-resolve the target frame: it takes the target frame and the aligned adjacent frames as input and screens and retains features according to the temporal information between the adjacent frames and the target frame and the spatial information of the target frame. Reconstruction refers to operations such as receptive-field enlargement and feature extraction, whose purpose is to extract deep features of the picture and make full use of the available information before restoring the high-resolution image. Finally, upsampling maps the learned features to the high-resolution image space to obtain the final high-resolution video.
Arbitrary-scale image super-resolution algorithms: existing arbitrary-scale image super-resolution techniques fall into two categories. The first directly uses the position matrix relating the single low-resolution input frame to the high-resolution restored frame as the input of a module that predicts the filter weights for different scale factors, so that a single model realizes magnification by an arbitrary factor; however, the predicted weights are identical for all inputs under the same scale factor. The second category introduces scale-factor information in the stage preceding upsampling to assist feature learning, but it simply converts the scale factor into a vector, which cannot fully express the scale information.
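For reference, the sketch below shows the kind of per-pixel position/scale vector that such weight-prediction approaches consume. The (row offset, column offset, 1/r) encoding is an assumption chosen for illustration, not a definition taken from the cited works.

```python
import torch

def position_matrix(h_lr: int, w_lr: int, r: float) -> torch.Tensor:
    """For every pixel of the r-times enlarged output, return the fractional
    row/column offset of its back-projection onto the low-resolution grid
    plus the inverse scale 1/r.  Illustrative only."""
    h_hr, w_hr = int(h_lr * r), int(w_lr * r)
    ys = torch.arange(h_hr, dtype=torch.float32)
    xs = torch.arange(w_hr, dtype=torch.float32)
    off_y = ys / r - torch.floor(ys / r)                 # fractional part of the back-projection
    off_x = xs / r - torch.floor(xs / r)
    grid_y = off_y.view(-1, 1).expand(h_hr, w_hr)
    grid_x = off_x.view(1, -1).expand(h_hr, w_hr)
    scale = torch.full((h_hr, w_hr), 1.0 / r)
    return torch.stack([grid_y, grid_x, scale], dim=-1)  # (H_hr, W_hr, 3)

pm = position_matrix(32, 32, r=2.5)                      # decimal magnification works too
print(pm.shape)                                           # torch.Size([80, 80, 3])
```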
Video super-resolution is fundamentally different from image super-resolution in that it takes multiple frames as input and makes full use of adjacent-frame information in every stage of the algorithm to aid restoration, thereby achieving better results. Meanwhile, arbitrary-magnification super-resolution has not yet been studied in the video domain, and arbitrary-scale image super-resolution methods cannot be transferred directly to video because they ignore the characteristics of video. Video super-resolution is a popular research topic in computer vision, and arbitrary magnification is an aspect that cannot be ignored.
Deformable convolution: an ordinary convolution uses a fixed sampling grid, whereas deformable convolution adaptively adjusts its receptive field to accommodate different deformations, so its sampling shape is irregular. Offsets produced by a convolutional layer are added to the sampling locations of a standard convolution kernel, turning the conventional kernel into a deformable one. The flexible sampling of deformable convolution makes it well suited to the alignment module.
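A minimal sketch of feature-level alignment with deformable convolution, using torchvision's DeformConv2d. Predicting the offsets from the concatenated neighbour and target features follows the general idea above; the channel counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAlign(nn.Module):
    """Align a neighbouring-frame feature map to the target frame at the
    feature level.  Offsets are predicted from the concatenated neighbour
    and target features and fed to a deformable convolution."""

    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        self.offset_conv = nn.Conv2d(2 * channels,
                                     2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels,
                                        kernel_size, padding=kernel_size // 2)

    def forward(self, neighbour_feat, target_feat):
        offset = self.offset_conv(torch.cat([neighbour_feat, target_feat], dim=1))
        return self.deform_conv(neighbour_feat, offset), offset

align = DeformableAlign()
nb, tg = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
aligned, offset = align(nb, tg)
print(aligned.shape, offset.shape)   # (1, 64, 32, 32) (1, 18, 32, 32)
```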
References
[1] Wang X., Chan K. C. K., Yu K., et al. EDVR: Video Restoration with Enhanced Deformable Convolutional Networks. CVPR Workshops, 2019.
[2] Hu X., Mu H., Zhang X., et al. Meta-SR: A Magnification-Arbitrary Network for Super-Resolution. CVPR, 2019: 1575-1584.
[3] Fu Y., Chen J., Zhang T., et al. Residual Scale Attention Network for Arbitrary Scale Image Super-Resolution. Neurocomputing, 2021: 201-211.
[4] Dai J., Qi H., Xiong Y., et al. Deformable Convolutional Networks. ICCV, 2017.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention aims to solve the problem of arbitrary-magnification video super-resolution, including decimal magnification factors. The technical scheme adopted by the invention is therefore an arbitrary-magnification video super-resolution method introducing scale information. First, adjacent frames are aligned with the target frame at the feature level using deformable convolution. Then the fusion module computes the similarity between each aligned adjacent frame and the target frame and pays more attention to the more similar parts; inter-frame temporal information is fused in this way, and intra-frame spatial information is fused with a non-local operation. Next, the features extracted from each element vector of the inverse position projection matrix are fused with an attention-like mechanism and injected into the residual learning process; by explicitly modelling different scale factors, the network is helped to adaptively adjust its feature learning. Finally, the offsets describing pixel motion between each adjacent frame and the target frame are fed into a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are fed into a feature weight prediction module to obtain feature weights. The two weights are multiplied element-wise to obtain the weights of the upsampling filter kernel, and the fused features are multiplied by these filter weights to realize magnification by an arbitrary factor.
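The skeleton below only illustrates how the four stages described above could be wired together. Every sub-module (align, fuse, reconstruct, upsample) is a stand-in supplied by the caller, not the architecture claimed by the patent; the toy wiring at the end exists only to show the data flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASVSRSkeleton(nn.Module):
    """Structural sketch of the four-stage pipeline: align -> fuse ->
    scale-aware reconstruction -> weight-predicting upsampling."""

    def __init__(self, align, fuse, reconstruct, upsample):
        super().__init__()
        self.align, self.fuse = align, fuse
        self.reconstruct, self.upsample = reconstruct, upsample

    def forward(self, frames: torch.Tensor, scale: float) -> torch.Tensor:
        # frames: (B, 2N+1, C, H, W); the middle frame is the target frame.
        t = frames.shape[1] // 2
        target = frames[:, t]
        aligned, offsets = self.align(frames, target)   # S1: feature-level alignment
        fused = self.fuse(aligned, target)              # S2: temporal + spatial fusion
        deep = self.reconstruct(fused, scale)           # S3: scale-aware reconstruction
        return self.upsample(deep, offsets, scale)      # S4: arbitrary-scale upsampling

# Toy wiring with identity-style stand-ins just to show the data flow:
align = lambda frames, tgt: (frames, torch.zeros(frames.shape[0], 1))
fuse = lambda aligned, tgt: aligned.mean(dim=1)
reconstruct = lambda feat, s: feat
upsample = lambda feat, off, s: F.interpolate(feat, scale_factor=s,
                                              mode="bilinear", align_corners=False)
net = ASVSRSkeleton(align, fuse, reconstruct, upsample)
out = net(torch.randn(1, 3, 3, 16, 16), scale=2.5)
print(out.shape)     # torch.Size([1, 3, 40, 40])
```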
The method comprises the following specific steps:
The super-resolution network consists of alignment, fusion, reconstruction and upsampling; scale-factor information is introduced in the reconstruction and upsampling stages. The network is realized according to the following steps:
S1, alignment: deformable convolution is used at the feature level, with the target frame as reference, to align all adjacent frames and obtain the aligned adjacent-frame features;
S2, fusion: the aligned adjacent frames and the target frame are taken as input and information is fused along the temporal and spatial dimensions; inter-frame information is fused in the temporal dimension through an attention mechanism, and intra-frame information is fused in the spatial dimension through a non-local operation;
S3, reconstruction: the inverse position projection matrix is used to represent the information related to the scale factor, so that the features learned under different scale factors differ;
S4, upsampling: the position weights between the low-resolution input and the high-resolution output and the feature weights of different target frames are predicted first, the two kinds of weights are combined to obtain the upsampling filter weights, and the fused features are multiplied by these weights pixel by pixel to obtain the final super-resolution result.
The reconstruction module that introduces scale-factor information is realized by the following steps:
2.1 the fused target-frame features and the inverse position projection matrix representing the scale-factor information are input to the scale-aware position feature extraction module to obtain the fusion feature of the inverse position projection matrix;
2.2 the fused target-frame features are input to a conventional residual block, consisting of two convolution layers and two activation layers, to obtain the residual fusion feature;
2.3 the feature with introduced scale information is obtained by concatenating the fusion feature of the inverse position projection matrix with the residual fusion feature and adding the result to the image feature; this is Equation (1), whose rendered form appears only as an image in the original publication. In Equation (1), the output denotes the feature fusing scale information, C() denotes a convolution operation, con() the concatenation operation, R() an activation function, E(IP) the inverse position projection matrix, and the remaining input symbol the fused feature produced by the fusion module.
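Since Equation (1) is only available as an image, the following LaTeX block gives a hypothetical reconstruction from the symbol legend above, assuming the residual branch is the two-convolution, two-activation block of step 2.2 and that a convolution projects the concatenation before the skip connection.

```latex
% Hypothetical reconstruction of Equation (1); the symbols F_SI and F_fus and
% the exact operator nesting are assumptions, only C, con, R and E(IP) are
% named in the text.
\[
F_{\mathrm{SI}}
  = C\!\Big(\mathrm{con}\big(E(IP),\;
      R\big(C\big(R\big(C(F_{\mathrm{fus}})\big)\big)\big)\big)\Big)
  + F_{\mathrm{fus}}
\tag{1}
\]
```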
The weight of the upsampling filter consists of a position weight and a feature weight:
3.1 the position weight is predicted from the positional correspondence between the multi-frame low-resolution input and the high-resolution output together with the scale factor;
3.2 the feature weight is predicted for different target frames, taking the aligned features output by the alignment module as input;
3.3 finally, the two weights are multiplied to obtain the upsampling filter weight.
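In symbols (the notation below is assumed for illustration, not taken from the patent), the combination in 3.1 to 3.3 and its use in the upsampling stage can be written as:

```latex
% W_pos: position weight, W_feat: feature weight, F_SI: reconstructed feature,
% \odot: element-wise (pixel-by-pixel) multiplication.  Symbol names are assumptions.
\[
W = W_{\mathrm{pos}} \odot W_{\mathrm{feat}}, \qquad
I^{\mathrm{HR}} = F_{\mathrm{SI}} \odot W
\]
```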
The detailed steps are as follows:
The super-resolution network comprises four stages, alignment, fusion, reconstruction and upsampling, and realizes arbitrary-magnification video super-resolution according to the following steps:
S1, the features of the adjacent frames and of the target frame are extracted first, and in the alignment stage the adjacent frames are aligned to the target frame at the feature level by deformable convolution. The adjacent frames and the target frame are input to the network and features are extracted with residual blocks; then the features of each adjacent frame are combined with those of the target frame, the offset between them is obtained by convolution, and the offset and the adjacent-frame features are input to a deformable convolution to obtain the aligned adjacent-frame features;
S2, the aligned adjacent frames output by the alignment module and the target frame are taken as input; in the fusion stage the similarity between each adjacent frame and the target frame is computed, information with higher similarity to the target frame is given more weight during fusion, and spatial information within the target frame is fused with a non-local method, as illustrated by the sketch below;
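A minimal sketch of the attention-style temporal fusion just described: each aligned neighbour is weighted by its per-pixel similarity to the target frame and the weighted frames are merged. The non-local spatial fusion is omitted, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Weight each aligned neighbouring frame by its per-pixel similarity to
    the target frame, then merge the weighted frames."""

    def __init__(self, channels: int = 64, num_frames: int = 3):
        super().__init__()
        self.embed_neighbour = nn.Conv2d(channels, channels, 3, padding=1)
        self.embed_target = nn.Conv2d(channels, channels, 3, padding=1)
        self.merge = nn.Conv2d(num_frames * channels, channels, 1)

    def forward(self, aligned: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
        # aligned: (B, T, C, H, W) aligned frame features; target_feat: (B, C, H, W)
        b, t, c, h, w = aligned.shape
        emb_nb = self.embed_neighbour(aligned.reshape(b * t, c, h, w)).view(b, t, c, h, w)
        emb_tg = self.embed_target(target_feat).unsqueeze(1)                     # (B, 1, C, H, W)
        similarity = torch.sigmoid((emb_nb * emb_tg).sum(dim=2, keepdim=True))   # (B, T, 1, H, W)
        weighted = aligned * similarity            # more similar parts receive more attention
        return self.merge(weighted.reshape(b, t * c, h, w))                      # (B, C, H, W)

fusion = TemporalAttentionFusion()
out = fusion(torch.randn(2, 3, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)    # torch.Size([2, 64, 32, 32])
```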
S3, the features of the fused spatio-temporal information are input, the scale factor is introduced and deeper information is extracted; the reconstruction module is obtained by stacking scale-aware residual blocks, whose specific workflow is as follows (a code sketch follows these three steps):
1) construct the inverse position projection matrix: for each pixel on the high-resolution frame, dividing its coordinates by the scale factor gives the corresponding pixel on the low-resolution frame, so each point on the low-resolution frame corresponds to a set of points on the high-resolution restored frame, and these sets form the inverse position projection matrix; the projection entry for a point (i′, j′) on the low-resolution frame is constructed as:
F_l(i′, j′) = { I_s(i, j) | i′·r ≤ i ≤ (i′+1)·r; j′·r ≤ j ≤ (j′+1)·r },  (2)
where F_l(i′, j′) denotes a point on the low-resolution feature frame, r is the scale factor, and the right-hand side is the set of points on the high-resolution frame determined by (i′, j′);
2) the fusion features output by the fusion module and the inverse position projection matrix are taken as input to obtain the fusion feature of the inverse position projection matrix, specifically:
first, the element at position (i′, j′) of the inverse position projection matrix is reshaped into a position offset matrix of dimension (n, 3), where n is the number of high-resolution pixel positions contained in the current element;
second, features are extracted from the vector formed by concatenating each position offset with the scale factor, giving the position offset features;
third, the input feature at the current position is combined with the position offset features, and the fusion weights are predicted with a fully connected layer and a sigmoid (logistic regression) activation layer;
fourth, all position offset features are weighted by the fusion weights to obtain the fusion feature of the inverse position projection matrix;
3) the fused features are input to a residual block consisting of two convolution layers and two activation functions to obtain new fusion features, which are concatenated with the features from step 2), and the result is finally added to the fused features to obtain the features fusing scale information;
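The sketch below strings steps 1) to 3) together for a single scale-aware residual block. The (dy, dx, 1/r, r) encoding of each entry of the inverse position projection matrix, the MLP sizes and the 1x1 fusion convolution are assumptions made so the example runs; they are not the patent's exact design.

```python
import math
import torch
import torch.nn as nn

class ScaleAwareResidualBlock(nn.Module):
    """Sketch of one scale-aware residual block: position offsets from the
    inverse position projection matrix are embedded, fused with attention-like
    weights, combined with a plain residual branch and added back."""

    def __init__(self, channels: int = 64, hidden: int = 32):
        super().__init__()
        self.offset_mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                        nn.Linear(hidden, channels))
        self.weight_fc = nn.Sequential(nn.Linear(2 * channels, 1), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.fuse_conv = nn.Conv2d(2 * channels, channels, 1)

    def inverse_projection_offsets(self, r: float) -> torch.Tensor:
        """Offsets of the HR pixels that project back onto one LR pixel
        (cf. Eq. (2)), encoded as (dy, dx, 1/r, r) rows; an assumed encoding."""
        n = math.ceil(r)
        dy, dx = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        rows = torch.stack([dy.flatten() / r, dx.flatten() / r], dim=1).float()
        extra = torch.tensor([[1.0 / r, r]]).expand(rows.shape[0], 2)
        return torch.cat([rows, extra], dim=1)              # (K, 4)

    def forward(self, feat: torch.Tensor, r: float) -> torch.Tensor:
        b, c, h, w = feat.shape
        offsets = self.inverse_projection_offsets(r).to(feat.device)   # (K, 4)
        off_feat = self.offset_mlp(offsets)                            # (K, C) position-offset features
        # Attention-like fusion weights from [input feature, offset feature] pairs.
        pix = feat.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)
        pair = torch.cat([pix.expand(-1, off_feat.shape[0], -1),
                          off_feat.unsqueeze(0).expand(b * h * w, -1, -1)], dim=2)
        weights = self.weight_fc(pair)                                 # (BHW, K, 1)
        ip_fused = (weights * off_feat).sum(dim=1)                     # (BHW, C)
        ip_fused = ip_fused.view(b, h, w, c).permute(0, 3, 1, 2)       # (B, C, H, W)
        res = self.residual(feat)                                      # plain residual branch
        out = self.fuse_conv(torch.cat([ip_fused, res], dim=1))        # splice the two branches
        return out + feat                                              # add back the input feature

block = ScaleAwareResidualBlock()
y = block(torch.randn(1, 64, 16, 16), r=2.5)
print(y.shape)      # torch.Size([1, 64, 16, 16])
```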
S4, in the upsampling stage the upsampling filter weights are predicted first, and the features obtained in S3 are multiplied by the filter pixel by pixel to obtain the high-resolution restored frame. The filter weights are obtained as follows (a code sketch follows these steps):
1) the O2P module takes the offsets between the adjacent frames and the target frame produced by the alignment module as input, predicts the correspondence between the multi-frame low-resolution input and the high-resolution output, learning the pixel-wise positional relation of the high-resolution image, and this relation is fed to the position weight prediction (P2W) module to obtain the position weights;
2) the offsets and the aligned features carry different information: the offsets describe the correspondence between pixels of two frames, whereas the aligned features contain more image information; the feature weights are therefore predicted with the P2W structure as well, taking the aligned features output by the alignment module as input;
3) the position weights and the feature weights are multiplied pixel by pixel to obtain the final upsampling filter weights.
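An illustrative, heavily simplified version of the weight-predicting upsampler: the position branch stands in for the O2P/P2W pair by computing the back-projection relation directly and mapping it to a position weight, the feature branch maps the aligned features to a feature weight, and their product modulates the fused feature pixel by pixel before a final RGB projection. All internals are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArbitraryScaleUpsampler(nn.Module):
    """Sketch of arbitrary-scale upsampling with a predicted filter weight
    (position weight x feature weight), applied pixel by pixel."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.p2w = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, channels))          # position weight per HR pixel
        self.f2w = nn.Conv2d(channels, channels, 3, padding=1)     # feature weight from aligned features
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, fused, aligned_feat, r: float):
        b, c, h, w = fused.shape
        h_hr, w_hr = int(h * r), int(w * r)
        # Per-HR-pixel relation to its LR source: fractional offsets plus 1/r.
        ys = torch.arange(h_hr, device=fused.device, dtype=torch.float32)
        xs = torch.arange(w_hr, device=fused.device, dtype=torch.float32)
        oy = (ys / r - torch.floor(ys / r)).view(-1, 1).expand(h_hr, w_hr)
        ox = (xs / r - torch.floor(xs / r)).view(1, -1).expand(h_hr, w_hr)
        rel = torch.stack([oy, ox, torch.full_like(oy, 1.0 / r)], dim=-1)   # (H_hr, W_hr, 3)
        w_pos = self.p2w(rel).permute(2, 0, 1).unsqueeze(0)                 # (1, C, H_hr, W_hr)
        w_feat = F.interpolate(self.f2w(aligned_feat), size=(h_hr, w_hr),
                               mode="bilinear", align_corners=False)        # (B, C, H_hr, W_hr)
        filter_weight = w_pos * w_feat                                      # combined filter weight
        feat_hr = F.interpolate(fused, size=(h_hr, w_hr),
                                mode="bilinear", align_corners=False)
        return self.to_rgb(feat_hr * filter_weight)                         # pixel-wise modulation, then RGB

up = ArbitraryScaleUpsampler()
hr = up(torch.randn(1, 64, 24, 24), torch.randn(1, 64, 24, 24), r=1.5)
print(hr.shape)    # torch.Size([1, 3, 36, 36])
```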
The invention has the following characteristics and beneficial effects:
1) Addressing the current lack of work on arbitrary-magnification video super-resolution, a network structure realizing arbitrary-magnification video super-resolution is proposed; magnification by any factor can be realized by training a single model.
2) By analysing the importance of the scale factor in arbitrary-magnification video super-resolution, the invention proposes a scale-aware reconstruction module. The module constructs a projection-relation matrix between the low-resolution image and its features according to the scale factor, fuses the features extracted from each element vector of the projection matrix with an attention-like mechanism, and uses these features as additional information to help the network learn.
3) To make full use of the consecutive multi-frame input of video super-resolution, the invention proposes an upsampling module that realizes arbitrary magnification; it takes the positional correspondence between the multi-frame input and the output frame together with the aligned features as input and dynamically predicts the upsampling filter weights. The filter weights differ from frame to frame, which preserves the characteristics of the result to a greater extent.
Description of the drawings:
Fig. 1 is a flow chart of the arbitrary-magnification video super-resolution method introducing scale information according to the present invention.
Fig. 2 is a schematic diagram of the overall network structure of the present invention.
Fig. 3 is an exemplary diagram of the scale-aware residual block proposed by the present invention; it is a refinement of the "scale-aware reconstruction" in Fig. 2, and several scale-aware residual blocks are stacked to form the scale-aware reconstruction module in the overall network.
Fig. 4 is a schematic diagram of obtaining the fusion feature of the inverse position projection matrix, i.e. the specific implementation of the scale-aware position feature extraction in Fig. 3.
Fig. 5 compares the results of the present invention with classical super-resolution methods.
Detailed Description
In order to enable video super-resolution to realize magnification by an arbitrary factor, the invention proposes ASVSR, an arbitrary-magnification video super-resolution method introducing scale information, which also supports decimal magnification factors. The proposed ASVSR has four stages. First, adjacent frames are aligned at the feature level using deformable convolution. Then the fusion module computes the similarity between each aligned adjacent frame and the target frame, pays more attention to the more similar parts, fuses inter-frame temporal information in this way, and fuses intra-frame spatial information with a non-local operation. Next, the features extracted from each element vector of the inverse position projection matrix are fused with an attention-like mechanism and injected into the residual learning process; explicitly modelling different scale factors helps the network adaptively adjust its feature learning. Finally, the offsets describing pixel motion between each adjacent frame and the target frame are fed into a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are fed into a feature weight prediction module to obtain feature weights. The two weights are multiplied element-wise to obtain the upsampling filter kernel weights, and the fused features are multiplied by these weights to realize magnification by an arbitrary factor.
The specific technical scheme of the invention is as follows:
An arbitrary-magnification video super-resolution method introducing scale information comprises a super-resolution network formed by alignment, fusion, reconstruction and upsampling; scale-factor information is introduced in the reconstruction and upsampling stages, and the network is realized according to the following steps:
S1, the alignment module of the network uses deformable convolution at the feature level, with the target frame as reference, to align all adjacent frames and obtain the aligned adjacent-frame features;
S2, the fusion module takes the aligned adjacent frames and the target frame as input and fuses information along the temporal and spatial dimensions: inter-frame information is fused in the temporal dimension through an attention mechanism, and intra-frame information is fused in the spatial dimension through a non-local operation;
S3, the reconstruction stage uses the inverse position projection matrix to represent the information related to the scale factor, so that the features learned under different scale factors differ;
S4, the upsampling module first predicts the position weights between the low-resolution input and the high-resolution output and the feature weights of different target frames, combines the two kinds of weights to obtain the upsampling filter weights, and multiplies the fused features by these weights pixel by pixel to obtain the final super-resolution result.
The reconstruction module that introduces scale-factor information is realized by the following steps:
2.1 the fused target-frame features and the inverse position projection matrix representing the scale-factor information are input to the scale-aware position feature extraction module to obtain the fusion feature of the inverse position projection matrix.
2.2 the fused target-frame features are input to a conventional residual block, consisting of two convolution layers and two activation layers, to obtain the residual fusion feature.
2.3 the feature with introduced scale information is obtained by concatenating the fusion feature of the inverse position projection matrix with the residual fusion feature and adding the result to the image feature; this is Equation (1), whose rendered form appears only as an image in the original publication. In Equation (1), the output denotes the feature fusing scale information, C() denotes a convolution operation, con() the concatenation operation, R() an activation function, E(IP) the inverse position projection matrix, and the remaining input symbol the fused feature produced by the fusion module.
The weight of the upsampling filter consists of a position weight and a feature weight:
3.1 the position weight is predicted from the positional correspondence between the multi-frame low-resolution input and the high-resolution output together with the scale factor;
3.2 the feature weight is predicted for different target frames, taking the aligned features output by the alignment module as input;
3.3 finally, the two weights are multiplied to obtain the upsampling filter weight.
The technical scheme of the invention is further explained below in combination with the accompanying drawings.
As shown in Fig. 1, the present invention provides an arbitrary-magnification video super-resolution method introducing scale information, comprising a super-resolution network composed of four stages: alignment, fusion, reconstruction and upsampling. The network realizes arbitrary-magnification video super-resolution according to the following steps:
S1, the features of the adjacent frames and of the target frame are extracted first, and the alignment module aligns the adjacent frames to the target frame at the feature level by deformable convolution after the adjacent frames and the target frame are input to the network. Features are extracted with residual blocks, a common feature-extraction structure that yields deep feature information; then the features of each adjacent frame are combined with those of the target frame, the offset between them is obtained by convolution, and the offset and the adjacent-frame features are input to a deformable convolution to obtain the aligned adjacent-frame features;
S2, the aligned adjacent frames output by the alignment module and the target frame are input to the fusion module, which computes the similarity between each adjacent frame and the target frame, gives more weight during fusion to information with higher similarity to the target frame, and fuses the spatial information of the target frame with a non-local method.
S3, the features of the fused spatio-temporal information are input to the reconstruction module to introduce the scale factor and extract deeper information; the module is formed by stacking the scale-aware residual blocks proposed by the invention, as shown in Fig. 3, and the specific workflow of a scale-aware residual block is as follows:
1) construct the inverse position projection matrix: for each pixel on the high-resolution frame, dividing its coordinates by the scale factor gives the corresponding pixel on the low-resolution frame, so each point on the low-resolution frame corresponds to a set of points on the high-resolution restored frame, and these sets form the inverse position projection matrix. The projection entry for a point (i′, j′) on the low-resolution frame is constructed as:
F_l(i′, j′) = { I_s(i, j) | i′·r ≤ i ≤ (i′+1)·r; j′·r ≤ j ≤ (j′+1)·r },  (2)
where F_l(i′, j′) denotes a point on the low-resolution feature frame, r is the scale factor, and the right-hand side is the set of points on the high-resolution frame determined by (i′, j′).
2) the fusion features output by the fusion module and the inverse position projection matrix are taken as input to obtain the fusion feature of the inverse position projection matrix, as shown in Fig. 4; the specific steps are:
first, the element at position (i′, j′) of the inverse position projection matrix is reshaped into a position offset matrix of dimension (n, 3), where n is the number of high-resolution pixel positions contained in the current element.
Second, features are extracted from the vector formed by concatenating each position offset with the scale factor, giving the position offset features.
Third, the input feature at the current position is combined with the position offset features, and the fusion weights are predicted with a fully connected layer and a sigmoid (logistic regression) activation layer.
Fourth, all position offset features are weighted by the fusion weights to obtain the fusion feature of the inverse position projection matrix.
3) the fused features are input to a residual block consisting of two convolution layers and two activation functions to obtain new fusion features, which are concatenated with the features from step 2), and the result is finally added to the fused features to obtain the features fusing scale information.
S4, the upsampling module first predicts the upsampling filter weights and multiplies the features obtained in S3 by the filter pixel by pixel to obtain the high-resolution restored frame, as shown in the upper half of Fig. 2. The filter weights are obtained as follows:
1) the O2P module takes the offsets between the adjacent frames and the target frame produced by the alignment module as input, predicts the correspondence between the multi-frame low-resolution input and the high-resolution output, learning the pixel-wise positional relation of the high-resolution image, and this relation is fed to the position weight prediction (P2W) module to obtain the position weights;
2) the offsets and the aligned features carry different information: the offsets describe the correspondence between pixels of two frames, whereas the aligned features contain more image information; the feature weights are therefore predicted with the P2W structure as well, taking the aligned features output by the alignment module as input;
3) the position weights and the feature weights are multiplied pixel by pixel to obtain the final upsampling filter weights.
The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope of the present invention shall fall within the scope of the present invention.

Claims (5)

1. An arbitrary-magnification video super-resolution method introducing scale information, characterized in that: first, adjacent frames are aligned at the feature level using deformable convolution; then the fusion module computes the similarity between each aligned adjacent frame and the target frame, pays more attention to the more similar parts, fuses inter-frame temporal information in this way, and fuses intra-frame spatial information with a non-local operation; next, the features extracted from each element vector of the inverse position projection matrix are fused with an attention-like mechanism and injected into the residual learning process, so that explicitly modelling different scale factors helps the network adaptively adjust its feature learning; then, the offsets describing pixel motion between each adjacent frame and the target frame are input to a position weight prediction module to obtain position weights, and the aligned features output by the alignment module are input to a feature weight prediction module to obtain feature weights; finally, the two weights are multiplied element-wise to obtain the upsampling filter kernel weights, and the fused features are multiplied by these weights to realize magnification by an arbitrary factor.
2. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the specific steps are as follows: the super-resolution network consists of alignment, fusion, reconstruction and upsampling, scale-factor information is introduced in the reconstruction and upsampling stages, and the network is realized according to the following steps:
S1, alignment: deformable convolution is used at the feature level, with the target frame as reference, to align all adjacent frames and obtain the aligned adjacent-frame features;
S2, fusion: the aligned adjacent frames and the target frame are taken as input and information is fused along the temporal and spatial dimensions; inter-frame information is fused in the temporal dimension through an attention mechanism, and intra-frame information is fused in the spatial dimension through a non-local operation;
S3, reconstruction: the inverse position projection matrix is used to represent the information related to the scale factor, so that the features learned under different scale factors differ;
S4, upsampling: the position weights between the low-resolution input and the high-resolution output and the feature weights of different target frames are predicted first, the two kinds of weights are combined to obtain the upsampling filter weights, and the fused features are multiplied by these weights pixel by pixel to obtain the final super-resolution result.
3. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the reconstruction module that introduces scale-factor information is realized by the following steps:
2.1 the fused target-frame features and the inverse position projection matrix representing the scale-factor information are input to the scale-aware position feature extraction module to obtain the fusion feature of the inverse position projection matrix;
2.2 the fused target-frame features are input to a conventional residual block, consisting of two convolution layers and two activation layers, to obtain the residual fusion feature;
2.3 the feature with introduced scale information is obtained by concatenating the fusion feature of the inverse position projection matrix with the residual fusion feature and adding the result to the image feature; this is Equation (1), whose rendered form appears only as an image in the original publication. In Equation (1), the output denotes the feature fusing scale information, C() denotes a convolution operation, con() the concatenation operation, R() an activation function, E(IP) the inverse position projection matrix, and the remaining input symbol the fused feature produced by the fusion module.
4. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the upsampling filter weight consists of two parts, a position weight and a feature weight:
3.1 the position weight is predicted from the positional correspondence between the multi-frame low-resolution input and the high-resolution output together with the scale factor;
3.2 the feature weight is predicted for different target frames, taking the aligned features output by the alignment module as input;
3.3 finally, the two weights are multiplied to obtain the upsampling filter weight.
5. The arbitrary-magnification video super-resolution method introducing scale information according to claim 1, characterized in that the detailed steps are as follows:
the super-resolution network comprises four stages of alignment, fusion, reconstruction and upsampling, and realizes arbitrary-magnification video super-resolution according to the following steps:
S1, the features of the adjacent frames and of the target frame are extracted first, and in the alignment stage the adjacent frames are aligned to the target frame at the feature level by deformable convolution; the adjacent frames and the target frame are input to the network and features are extracted with residual blocks; then the features of each adjacent frame are combined with those of the target frame, the offset between them is obtained by convolution, and the offset and the adjacent-frame features are input to a deformable convolution to obtain the aligned adjacent-frame features;
S2, the aligned adjacent frames output by the alignment module and the target frame are taken as input; in the fusion stage the similarity between each adjacent frame and the target frame is computed, information with higher similarity to the target frame is given more weight during fusion, and spatial information within the target frame is fused with a non-local method;
S3, the features of the fused spatio-temporal information are input, the scale factor is introduced and deeper information is extracted; the reconstruction module is obtained by stacking scale-aware residual blocks, whose specific workflow is as follows:
1) construct the inverse position projection matrix: for each pixel on the high-resolution frame, dividing its coordinates by the scale factor gives the corresponding pixel on the low-resolution frame, so each point on the low-resolution frame corresponds to a set of points on the high-resolution restored frame, and these sets form the inverse position projection matrix; the projection entry for a point (i′, j′) on the low-resolution frame is constructed as:
F_l(i′, j′) = { I_s(i, j) | i′·r ≤ i ≤ (i′+1)·r; j′·r ≤ j ≤ (j′+1)·r },  (2)
where F_l(i′, j′) denotes a point on the low-resolution feature frame, r is the scale factor, and the right-hand side is the set of points on the high-resolution frame determined by (i′, j′);
2) the fusion features output by the fusion module and the inverse position projection matrix are taken as input to obtain the fusion feature of the inverse position projection matrix, specifically:
first, the element at position (i′, j′) of the inverse position projection matrix is reshaped into a position offset matrix of dimension (n, 3), where n is the number of high-resolution pixel positions contained in the current element;
second, features are extracted from the vector formed by concatenating each position offset with the scale factor, giving the position offset features;
third, the input feature at the current position is combined with the position offset features, and the fusion weights are predicted with a fully connected layer and a sigmoid (logistic regression) activation layer;
fourth, all position offset features are weighted by the fusion weights to obtain the fusion feature of the inverse position projection matrix;
3) the fused features are input to a residual block consisting of two convolution layers and two activation functions to obtain new fusion features, which are concatenated with the features from step 2), and the result is finally added to the fused features to obtain the features fusing scale information;
S4, in the upsampling stage the upsampling filter weights are predicted first, and the features obtained in S3 are multiplied by the filter pixel by pixel to obtain the high-resolution restored frame; the filter weights are obtained as follows:
1) the O2P module takes the offsets between the adjacent frames and the target frame produced by the alignment module as input, predicts the correspondence between the multi-frame low-resolution input and the high-resolution output, learning the pixel-wise positional relation of the high-resolution image, and this relation is fed to the position weight prediction (P2W) module to obtain the position weights;
2) the offsets and the aligned features carry different information: the offsets describe the correspondence between pixels of two frames, whereas the aligned features contain more image information; the feature weights are therefore predicted with the P2W structure as well, taking the aligned features output by the alignment module as input;
3) the position weights and the feature weights are multiplied pixel by pixel to obtain the final upsampling filter weights.
CN202111385618.0A 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information Pending CN113902623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111385618.0A CN113902623A (en) 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111385618.0A CN113902623A (en) 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information

Publications (1)

Publication Number Publication Date
CN113902623A true CN113902623A (en) 2022-01-07

Family

ID=79194948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111385618.0A Pending CN113902623A (en) 2021-11-22 2021-11-22 Method for super-resolution of arbitrary-magnification video by introducing scale information

Country Status (1)

Country Link
CN (1) CN113902623A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819109A (en) * 2022-06-22 2022-07-29 腾讯科技(深圳)有限公司 Super-resolution processing method, device, equipment and medium for binocular image
CN116012230A (en) * 2023-01-17 2023-04-25 深圳大学 Space-time video super-resolution method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190059157A (en) * 2017-11-22 2019-05-30 에스케이텔레콤 주식회사 Method and Apparatus for Improving Image Quality
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190059157A (en) * 2017-11-22 2019-05-30 에스케이텔레콤 주식회사 Method and Apparatus for Improving Image Quality
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯伟昌; 林玉池; 何冬; 宋乐; 赵美蓉: "基于FPGA的双通道实时图像处理***" (FPGA-based dual-channel real-time image processing ***), 传感技术学报 (Chinese Journal of Sensors and Actuators), no. 08, 20 August 2010 (2010-08-20) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819109A (en) * 2022-06-22 2022-07-29 腾讯科技(深圳)有限公司 Super-resolution processing method, device, equipment and medium for binocular image
CN116012230A (en) * 2023-01-17 2023-04-25 深圳大学 Space-time video super-resolution method, device, equipment and storage medium
CN116012230B (en) * 2023-01-17 2023-09-29 深圳大学 Space-time video super-resolution method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Gui et al. Featureflow: Robust video interpolation via structure-to-texture generation
JP7093886B2 (en) Image processing methods and devices, electronic devices and storage media
Liang et al. Swinir: Image restoration using swin transformer
Wang et al. Learning for video super-resolution through HR optical flow estimation
US20190228264A1 (en) Method and apparatus for training neural network model used for image processing, and storage medium
CN111652899B (en) Video target segmentation method for space-time component diagram
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
CN113902623A (en) Method for super-resolution of arbitrary-magnification video by introducing scale information
CN107633482B (en) Super-resolution reconstruction method based on sequence image
CN111696035A (en) Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm
CN112435282A (en) Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network
CN109949221B (en) Image processing method and electronic equipment
CN113947531A (en) Iterative collaborative video super-resolution reconstruction method and system
Zhou et al. Image super-resolution based on dense convolutional auto-encoder blocks
CN115359370A (en) Remote sensing image cloud detection method and device, computer device and storage medium
CN113902620A (en) Video super-resolution system and method based on deformable convolution network
CN116403152A (en) Crowd density estimation method based on spatial context learning network
Tang et al. Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction
Li et al. HoloParser: Holistic visual parsing for real-time semantic segmentation in autonomous driving
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
CN116152710A (en) Video instance segmentation method based on cross-frame instance association
Zhang et al. Image deblurring based on lightweight multi-information fusion network
CN115512393A (en) Human body posture estimation method based on improved HigherHRNet
Luo et al. Efficient lightweight network for video super-resolution
Li et al. Lightweight single image super-resolution based on multi-path progressive feature fusion and attention mechanism

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination