CN115170392A - Single-image super-resolution algorithm based on attention mechanism - Google Patents


Info

Publication number
CN115170392A
Authority
CN
China
Prior art keywords: attention, output, module, input, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210719954.2A
Other languages
Chinese (zh)
Inventor
裴文江
蔡清
夏亦犁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210719954.2A priority Critical patent/CN115170392A/en
Publication of CN115170392A publication Critical patent/CN115170392A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-image super-resolution algorithm based on an attention mechanism. It proposes a new multi-scale attention residual block (MSAB), improves the residual-in-residual network design framework, and introduces inter-layer attention. On the basis of these two innovations, a new multi-scale global attention network (MSHAN) is proposed. The MSAB has the following characteristics: (1) a channel attention mechanism and a spatial attention mechanism are introduced into an ordinary residual block using a two-branch learning strategy, one attention mechanism per branch, with the branch outputs concatenated and fused by a 1x1 convolutional layer; (2) on this basis, multi-scale convolution is introduced: 3x3 and 5x5 convolution modules extract features in a second two-branch arrangement, again concatenated and fused by a 1x1 convolutional layer. The MSHAN network of the invention achieves remarkable results under a comprehensive measure of model performance and parameter count.

Description

Single-image super-resolution algorithm based on attention mechanism
Technical Field
The invention relates to a super-resolution problem in image processing, in particular to a super-resolution method based on a convolutional neural network.
Background
With the continuous progress of deep learning, the powerful computing and representation capabilities of convolutional neural networks (CNNs) have made them stand out in computer vision, gradually replacing traditional learning methods. A CNN can not only extract shallow image features such as background and contours through its convolution kernels, but also, by stacking many convolutional layers, progressively extract higher-level features, including the high-frequency details of an image.
Super-resolution (SR) is a classic ill-posed problem in image processing: given an input low-resolution (LR) image, a suitable mapping must be designed that outputs the corresponding high-resolution (HR) image at an enlarged scale. To solve this problem, many learning-based methods have been proposed to learn the mapping between LR and HR image pairs. Learning-based SR methods fall into two main categories: traditional learning methods (such as bicubic interpolation) and deep learning methods. Although some prior knowledge of the image can be learned implicitly through probability theory, traditional learning methods cannot learn deep mapping features, and the high-resolution images they produce lack the necessary high-frequency details. Deep learning methods, by contrast, rely on the powerful computation and feature representation capabilities of convolutional neural networks to extract image features at a deeper level. Their design involves two main aspects. (1) The choice of SR framework: the four existing frameworks are pre-upsampling SR, post-upsampling SR, progressive upsampling SR, and iterative up-and-down sampling SR, of which the post-upsampling framework has become the mainstream because it reduces the number of model parameters and the computational complexity. (2) The specific network structure: the development of deep convolutional networks has produced many classic and efficient structural designs, such as residual learning, recursive learning and dense connections. Different designs affect the performance and parameter count of a model differently, and a suitable network structure must be selected through continual experimentation.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a new multi-scale global attention network (MSHAN) based on convolutional neural networks, addressing the fact that previous SR networks treat information from different convolutional layers, different channels and different spatial positions equally.
To solve this problem, the invention adopts the following technical scheme: an attention-based single-image super-resolution algorithm, shown in FIG. 1, comprising a multi-scale attention residual block (MSAB) and an improved residual-in-residual (IRIR) network design framework into which inter-layer attention is introduced. The method consists of three sequential steps: a shallow feature extraction operation, an intermediate feature mapping operation, and an upsampling operation.
1) Shallow feature extraction operation. Let $X$ and $Y_{SR}$ denote the input and the output of the whole network, respectively. For an input low-resolution picture $X$, a 3x3 convolutional layer extracts the initial shallow features:

$$F_{IFE} = S_{IFENet}(X)$$

where $S_{IFENet}$ denotes the function of the shallow feature extraction module. The extracted shallow features $F_{IFE}$ are fed into the subsequent feature mapping stage as its initial input and are also used for global feature learning.
2) Intermediate feature mapping operation. The input of the intermediate feature mapping is the feature $F_{IFE}$ obtained by the shallow feature extraction operation; the basic unit of this stage is the new multi-scale attention residual block, whose structure is shown in FIG. 2.

Let the input of the multi-scale attention residual block be $H_0$. The input first passes through two parallel modules, a 3x3 convolution module and a 5x5 convolution module, to generate the corresponding outputs:

$$H_{m3} = W_{3}^{2}\,\delta\!\left(W_{3}^{1} H_{0} + b_{3}^{1}\right) + b_{3}^{2}$$
$$H_{m5} = W_{5}^{2}\,\delta\!\left(W_{5}^{1} H_{0} + b_{5}^{1}\right) + b_{5}^{2}$$

where $W_{3}^{1}$ and $b_{3}^{1}$ denote the weights and biases of the first convolutional layer of the 3x3 module, and $W_{3}^{2}$ and $b_{3}^{2}$ those of its second convolutional layer; likewise, $W_{5}^{1}$ and $b_{5}^{1}$ denote the weights and biases of the first convolutional layer of the 5x5 module, and $W_{5}^{2}$ and $b_{5}^{2}$ those of its second convolutional layer. $\delta$ denotes the ReLU activation function, and $H_{m3}$ and $H_{m5}$ denote the outputs of the 3x3 and 5x5 modules, respectively.
After the output features $H_{m3}$ of the 3x3 module and $H_{m5}$ of the 5x5 module are obtained, they are sent to a concatenation module that fuses the features convolved at the two scales, and a 1x1 convolutional layer adjusts the dimensionality so the result can be passed to subsequent modules for further feature extraction:

$$H_{c}^{1} = W_{c}^{1}\left[H_{m3}, H_{m5}\right] + b_{c}^{1}$$

where $H_{c}^{1}$ denotes the output of the first concatenation module, $[\cdot]$ denotes the concatenation operation, and $W_{c}^{1}$ and $b_{c}^{1}$ denote the weights and biases of the 1x1 convolutional layer in the first concatenation module.
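For concreteness, this two-scale front end and the first concatenation module can be sketched in PyTorch as follows (a minimal sketch; the channel count of 64 follows the configuration given later in the detailed description, while class and variable names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Two parallel convolution branches (3x3 and 5x5, two layers each with a
    ReLU between them) fused by the first 1x1 concatenation module."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 5, padding=2))
        # 1x1 convolution restores the channel count after concatenation
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, h0):
        h_m3 = self.branch3(h0)                            # H_m3
        h_m5 = self.branch5(h0)                            # H_m5
        return self.fuse(torch.cat([h_m3, h_m5], dim=1))   # H_c^1
```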
With the output of the first intermediate concatenation module in hand, attention is applied to further weight the channels and spatial positions rich in important features. Two parallel branches are designed for this purpose: one branch generates a weight vector of size Cx1x1 through a channel attention mechanism to rescale the feature values of each channel; the other uses a spatial attention mechanism to generate a weight map of size 1xHxW to rescale the feature values at each spatial position within every channel. With these parallel branches, the network can exploit the correlations between channels and between spatial positions to extract more effective feature representations, improving performance. Define the input feature $H_{c}^{1} \in \mathbb{R}^{C \times H \times W}$, which contains $C$ feature maps, each of size HxW.
The channel attention branch proceeds as shown in the channel attention module of FIG. 2. First, a global average pooling layer produces a channel-wise statistic $\mu \in \mathbb{R}^{C\times1\times1}$. The average pooling acts on each feature channel individually, so the $c$-th element of $\mu$ can be expressed as:

$$\mu_c = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$$

where $x_c(i,j)$ denotes the pixel value of the $c$-th channel $x_c$ at position $(i,j)$. The statistic $\mu$ is then passed through two convolutional layers and their activation functions:

$$\alpha = \sigma\!\left(W_{CA}^{2}\,\delta\!\left(W_{CA}^{1}\mu + b_{CA}^{1}\right) + b_{CA}^{2}\right)$$

where $W_{CA}^{1}$ and $b_{CA}^{1}$ are the weights and biases of the first convolutional layer, which reduces the number of channels by a scaling ratio $\gamma$, and the convolutional layer with parameters $W_{CA}^{2}$ and $b_{CA}^{2}$ restores the number of channels to the original count. $\sigma$ and $\delta$ denote the sigmoid and ReLU activation functions, respectively.

The sigmoid activation $\sigma$ squashes each per-channel attention weight $\alpha$ to a value between 0 and 1, which is used to rescale the input features. After the channel attention coefficient $\alpha$ is obtained, it multiplies each element of the original input features channel-wise to give the final output of the channel attention branch:

$$H_{CA} = F_{CA}\!\left(\alpha, H_{c}^{1}\right) = \alpha \odot H_{c}^{1}$$

where $H_{CA}$ denotes the final output of the channel attention module and $F_{CA}$ denotes the per-channel multiplication of each channel feature with its corresponding channel weight.
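A minimal PyTorch sketch of this channel attention branch (the reduction ratio γ = 16 is an illustrative assumption; the patent does not fix its value):

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch: global average pooling, a channel-reducing
    1x1 conv, ReLU, a channel-restoring 1x1 conv, sigmoid, then rescaling."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # mu: C x 1 x 1
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # reduce by gamma
            nn.ReLU(inplace=True),                          # delta
            nn.Conv2d(channels // reduction, channels, 1),  # restore channels
            nn.Sigmoid())                                   # alpha in (0, 1)

    def forward(self, x):
        alpha = self.body(self.pool(x))   # per-channel weights, C x 1 x 1
        return x * alpha                  # H_CA: channel-wise rescaling
```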
The output $H_{c}^{1}$ of the first concatenation module is also fed into the parallel spatial attention branch. As shown in the spatial attention module of FIG. 2, this module has one fewer global average pooling layer than the channel attention module, because it does not need to compress global spatial information into a per-channel statistical descriptor. The rest of the process is similar to channel attention; the spatial attention mask coefficient is given by:

$$\beta = \sigma\!\left(W_{SA}^{2}\,\delta\!\left(W_{SA}^{1}H_{c}^{1} + b_{SA}^{1}\right) + b_{SA}^{2}\right)$$

where $\sigma$ and $\delta$ denote the sigmoid and ReLU activation functions, respectively. The first convolutional layer, with weights $W_{SA}^{1}$ and biases $b_{SA}^{1}$, generates per-channel feature maps, which are then merged into a single attention map by the 1x1 convolutional layer with weights $W_{SA}^{2}$ and biases $b_{SA}^{2}$. The sigmoid function $\sigma$ normalizes the map to the range 0-1, yielding the spatial attention adaptive mask $\beta$; the scaling ratio $\gamma$ of the convolutional layers controls the change in dimensionality. After the spatial attention mask coefficient is obtained, it is multiplied element-wise, at each spatial position, with the input feature $H_{c}^{1}$ to obtain the final output of the spatial attention branch:

$$H_{SA} = F_{SA}\!\left(\beta, H_{c}^{1}\right) = \beta \odot H_{c}^{1}$$

where $H_{SA}$ denotes the final output of the spatial attention module and $F_{SA}$ denotes the element-wise multiplication of each spatial-position feature with its corresponding spatial-position weight.
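A corresponding sketch of the spatial attention branch (again, the intermediate width set by γ is an assumption):

```python
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention branch: no global pooling; two 1x1 convs collapse the
    features into a single 1 x H x W mask that rescales every position."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # per-channel maps
            nn.ReLU(inplace=True),                          # delta
            nn.Conv2d(channels // reduction, 1, 1),         # single attention map
            nn.Sigmoid())                                   # beta in (0, 1)

    def forward(self, x):
        beta = self.body(x)     # 1 x H x W mask, broadcast over all channels
        return x * beta         # H_SA: position-wise rescaling
```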
After the output $H_{CA}$ of the channel attention module and the output $H_{SA}$ of the spatial attention module are obtained, they are fed as input into a second concatenation module that fuses the features of the two branches; a 1x1 convolutional layer adjusts the number of feature channels for better transmission between blocks, giving the output of the second concatenation module. A residual connection then adds the input of the MSAB block to this output to give the output of the whole MSAB:

$$H_{o} = W_{c}^{2}\left[H_{CA}, H_{SA}\right] + b_{c}^{2} + H_{0}$$

where $H_{o}$ denotes the final output of the MSAB, $[\cdot]$ denotes the concatenation operation, and $W_{c}^{2}$ and $b_{c}^{2}$ denote the weights and biases of the 1x1 convolutional layer in the second concatenation module. To make better use of features rich in low-frequency information, a short skip connection is introduced into the residual block. This skip connection not only lets the main part of the network learn residual information, but also avoids the vanishing-gradient problem during network training.
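Combining the pieces above, the whole MSAB can be sketched as follows (reusing the MultiScaleBranches, ChannelAttention and SpatialAttention classes from the previous sketches):

```python
import torch
import torch.nn as nn

class MSAB(nn.Module):
    """Multi-scale attention residual block: multi-scale front end, parallel
    channel/spatial attention, second 1x1 fusion, short skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.multi_scale = MultiScaleBranches(channels)
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # second concat module

    def forward(self, h0):
        hc1 = self.multi_scale(h0)                         # H_c^1
        fused = self.fuse(torch.cat([self.ca(hc1), self.sa(hc1)], dim=1))
        return fused + h0                                  # H_o with short skip
```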
The intermediate feature mapping operation adopts a residual-in-residual network design framework: the outer residual is formed mainly by a series of stacked attention groups together with global residual learning, while the inner residual is formed by stacking a series of attention blocks with local residual learning. This design framework avoids the problems of difficult training and stagnating performance that arise when long chains of residual blocks are simply stacked. If there are $N$ attention groups (AG) in the outer residual, and the input and output of the $n$-th attention group are denoted $AG_{n-1}$ and $AG_n$, then:

$$AG_n = H_n(AG_{n-1}) = H_n\!\left(H_{n-1}\!\left(\cdots H_1(AG_0)\cdots\right)\right)$$
where $H_n$ is the operation of the $n$-th attention group, and $AG_0$ is the input of the first attention group, i.e. the output $F_{IFE}$ of the shallow feature module. Each attention group is stacked from a series of MSAB blocks; but simply stacking residual blocks does not make effective use of the features of earlier blocks, so dense connections are introduced between the attention blocks: the input of every intermediate block is the concatenation of the outputs of all preceding attention blocks. The MSAB stage thus combines dense connections, local feature fusion and local residual learning into a continuous memory mechanism: by passing the features of previous MSABs to the current MSAB, a persistent memory mechanism is realized. Let $F_{m-1}$ and $F_m$ denote the input and output of the $m$-th MSAB, both with $G_0$ feature maps; the output of the $m$-th MSAB can then be expressed as:

$$F_m = M_m\!\left(\left[F_{m-1}, F_{m-2}, \ldots, F_1\right]\right)$$

where $M_m$ denotes the function of the $m$-th MSAB and $[F_{m-1}, F_{m-2}, \ldots, F_1]$ denotes the concatenation of the outputs of the 1st to $(m-1)$-th MSAB blocks. The output of each preceding MSAB layer has a direct connection to all subsequent layers, which not only preserves the feed-forward nature of the network but also extracts locally dense features.
Local feature fusion is used to adaptively fuse the states of all the convolutional layers in the preceding MSABs with the current MSAB. As shown by the AG in FIG. 1, the feature maps of the $(m-1)$-th MSAB are introduced directly into the $m$-th MSAB by concatenation, so reducing the number of features is essential. Inspired by MemNet, a 1x1 convolutional layer is introduced to adaptively control the output information. This operation, named local feature fusion (LFF), is formulated as:

$$F_{n,LF} = H_{LFF}^{n}\!\left(\left[AG_{n-1}, F_1, \ldots, F_M\right]\right)$$

where $F_{n,LF}$ denotes the fused output of the MSAB features, $H_{LFF}^{n}$ denotes the function of the 1x1 Conv layer in the $n$-th AG, and $[AG_{n-1}, F_1, \ldots, F_M]$ denotes the concatenation of the input of the previous AG with the $M$ MSAB outputs in the current AG. Without LFF, a very deep dense network would become difficult to train as the growth rate $G$ grows.
Local residual learning is also introduced into the AG to further improve the flow of features through the network; the final output of the $n$-th AG can be expressed as:

$$AG_n = AG_{n-1} + F_{n,LF}$$

It should be noted that local residual learning further improves the representational capability of the network and thereby yields better performance. The output $AG_N$ of the final attention group is then fed into a 3x3 convolutional layer to produce the final output $F_N$ of the entire intermediate feature mapping convolution:

$$F_N = \mathrm{Conv}(AG_N)$$

where $\mathrm{Conv}$ denotes the last convolutional layer. The resulting output $F_N$, the output of the subsequent inter-layer attention module, and the initial input features $F_{IFE}$ are sent together to the next module.
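An attention group with dense connections, LFF and local residual learning might look like the sketch below (reusing the MSAB class sketched above; the per-block 1x1 compression convolutions are an implementation assumption, since the patent specifies the concatenation but not how each block's input width is restored, and M = 12 follows the configuration given later):

```python
import torch
import torch.nn as nn

class AttentionGroup(nn.Module):
    """Attention group: M densely connected MSABs, 1x1 local feature fusion
    (LFF), and local residual learning around the whole group."""
    def __init__(self, channels=64, num_blocks=12):
        super().__init__()
        self.compress = nn.ModuleList()
        self.blocks = nn.ModuleList()
        for m in range(num_blocks):
            # each block sees the concatenation of the group input and all
            # previous block outputs, squeezed back to `channels` by a 1x1 conv
            self.compress.append(nn.Conv2d((m + 1) * channels, channels, 1))
            self.blocks.append(MSAB(channels))
        # LFF: fuse the group input and all M block outputs
        self.lff = nn.Conv2d((num_blocks + 1) * channels, channels, 1)

    def forward(self, ag_in):
        states = [ag_in]
        for squeeze, block in zip(self.compress, self.blocks):
            states.append(block(squeeze(torch.cat(states, dim=1))))
        f_lf = self.lff(torch.cat(states, dim=1))   # F_{n,LF}
        return ag_in + f_lf                         # AG_n = AG_{n-1} + F_{n,LF}
```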
After the intermediate features of the series of attention groups are extracted, the model introduces a layer attention module (LA, shown in FIG. 3), whose input is the concatenation of the features of every intermediate attention group:

$$F_{LA} = H_{LA}\!\left(\left[AG_1, AG_2, \ldots, AG_N\right]\right)$$

where $H_{LA}$ denotes the inter-layer attention function, whose input is the output of all intermediate layers, so that all feature information from the preceding layers can be fully exploited; $F_{LA}$ denotes the output of the inter-layer attention module; and $[AG_1, AG_2, \ldots, AG_N]$ denotes the concatenation of the 1st to $N$-th attention groups.
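The patent does not spell out the internals of the LA module, so the sketch below follows the HAN-style layer attention formulation (an N x N correlation matrix over the stacked group features, softmax-normalized and used to re-weight the layers); the final 1x1 reduction back to C channels is an assumption made so that F_LA can be added element-wise to F_IFE and F_N:

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """HAN-style inter-layer attention over the N attention-group outputs:
    an N x N correlation matrix re-weights the stacked layer features."""
    def __init__(self, channels=64, n_groups=6):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))                  # learnable residual scale
        self.reduce = nn.Conv2d(n_groups * channels, channels, 1)  # back to C channels

    def forward(self, ag_feats):
        # ag_feats: list of N tensors, each of shape B x C x H x W
        b, c, h, w = ag_feats[0].shape
        n = len(ag_feats)
        stack = torch.stack(ag_feats, dim=1)                 # B x N x C x H x W
        flat = stack.view(b, n, -1)                          # B x N x (C*H*W)
        attn = torch.softmax(torch.bmm(flat, flat.transpose(1, 2)), dim=-1)
        out = torch.bmm(attn, flat).view(b, n, c, h, w)      # re-weighted layers
        out = self.scale * out + stack
        return self.reduce(out.view(b, n * c, h, w))         # F_LA with C channels
```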
With the intermediate convolutional output $F_N$ and the output $F_{LA}$ of the inter-layer attention module in hand, a long skip connection (LSC) is introduced to enhance the stability of network training. The LSC also improves performance through residual learning: the initial shallow features $F_{IFE}$ are added element-wise to the two outputs, so the final output of the whole intermediate feature mapping module can be expressed as:

$$F_{MF} = F_{IFE} + F_N + F_{LA}$$

where $F_{MF}$ denotes the final output of the intermediate feature mapping step.
3) Upsampling operation. The input of the upsampling operation is the output $F_{MF}$ of the preceding intermediate feature mapping operation. Sub-pixel convolution is used as the final upsampling module: it realizes scaling by a given magnification factor through pixel rearrangement, aggregating the low-resolution feature maps while mapping the features into a high-dimensional space to reconstruct the HR image. The whole process is given by:

$$Y_{SR} = U_{\uparrow}(F_{MF}) = U_{\uparrow}\!\left(F_{IFE} + F_N + F_{LA}\right)$$

where $U_{\uparrow}$ denotes the sub-pixel convolution operation and $Y_{SR}$ is the reconstructed SR result. In addition, the long skip connection introduced to stabilize the training of the proposed deep network means that the sub-pixel upsampling block takes $F_{IFE} + F_N + F_{LA}$ as its input.
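Putting the three steps together, a top-level sketch of the network follows (reusing AttentionGroup and LayerAttention from the sketches above; N = 6 groups and 64 features follow the configuration given in the detailed description, and the sub-pixel tail uses PixelShuffle):

```python
import torch.nn as nn

class MSHAN(nn.Module):
    """Top-level assembly: shallow 3x3 feature extraction, N attention groups,
    a trailing 3x3 conv, layer attention, long skip, sub-pixel upsampling."""
    def __init__(self, channels=64, n_groups=6, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)     # S_IFENet
        self.groups = nn.ModuleList(
            [AttentionGroup(channels) for _ in range(n_groups)])
        self.tail_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.la = LayerAttention(channels, n_groups)
        self.up = nn.Sequential(                             # U: sub-pixel conv
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, x):
        f_ife = self.head(x)                 # shallow features F_IFE
        ag, ag_feats = f_ife, []
        for group in self.groups:
            ag = group(ag)
            ag_feats.append(ag)              # keep every AG output for LA
        f_n = self.tail_conv(ag)             # F_N
        f_la = self.la(ag_feats)             # F_LA
        return self.up(f_ife + f_n + f_la)   # Y_SR = U(F_MF), long skip inside
```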
Using the MSAB as the basic building block of the network allows the multi-scale convolution to exploit image features at different scales, while the channel and spatial attention mechanisms focus the network on the channels and spatial positions richest in high-frequency information.

The method adopts the IRIR network structure design, which allows many MSAB blocks to be stacked, extracts image feature representations more fully, and mitigates the vanishing- and exploding-gradient problems commonly encountered in model training.

The invention adopts convolutional layers at two scales, 3x3 and 5x5, in a two-branch network design: the input features enter a 3x3 convolution module and a 5x5 convolution module respectively, extracting image features at both scales. A channel attention mechanism exploits the correlations between different channels at the channel level, and a spatial attention mechanism exploits the correlations between different spatial positions within the channels.

According to the invention, an inter-layer attention module is introduced into the outer residual; it exploits the correlations between different layers, allowing the network to assign different attention weights to the features of different layers and automatically improving the representational power of the extracted features. A densely connected design is introduced into the inner residual: dense connections let the current MSAB take the concatenated outputs of all previous MSABs as input, improving the flow of feature information at every level and hence the feature representation capability of the network.

The image features obtained by the 3x3 and 5x5 convolutions are concatenated, and a 1x1 convolutional layer adjusts the number of channels so the resulting multi-scale features can enter the next module. The image features obtained by the channel attention module and the spatial attention module are likewise concatenated, a 1x1 convolutional layer adjusts the number of channels output by the MSAB block, and the result is added element-wise to the features input at the start of the MSAB for residual learning, avoiding the vanishing-gradient problem and improving the feature extraction capability of the network.
Beneficial effects: compared with previous well-performing SR models (EDSR, RDN, etc.), the invention improves the PSNR/SSIM indices at the x2, x3 and x4 magnification scale factors alike, as shown in Table 1. As FIGS. 4-5 show qualitatively, the reconstructed HR images are also more faithful, with clearer textures, than those of other models. FIG. 6 further shows that the invention achieves good results under a comprehensive measure of model performance and parameter count.
Drawings
FIG. 1 is a block diagram of the method.
FIG. 2 is a schematic diagram of the multi-scale attention residual block in the method.
FIG. 3 is a schematic diagram of the inter-layer attention module in the method.
FIG. 4 is a qualitative comparison of this method with other methods at x3 upscaling on the Urban100 test data set.
FIG. 5 is a qualitative comparison of this method with other methods at x4 upscaling on the Manga109 test data set.
FIG. 6 is a comprehensive comparison of performance and model parameters for this method and other methods at x4 upscaling on the Urban100 test data set.
Detailed Description
The method is a multi-scale holistic attention residual network method based on an attention mechanism. It can be applied to the single-image super-resolution problem: given a blurry low-resolution image as input, the trained network generates a high-resolution image with a clear texture structure.
The present invention was evaluated in comparative studies on the widely used data sets Set5, Set14, BSD100, Urban100 and Manga109, which contain 5, 14, 100, 100 and 109 images, respectively. Set5, Set14 and BSD100 contain natural-scene images; Urban100 consists of urban-scene images with many details in different frequency bands; and Manga109 consists of Japanese manga images with many fine structures. The proposed MSHAN model is trained on the 800 high-quality training images of DIV2K, with data augmentation consisting of random horizontal flips and random 90° rotations.
The present invention uses peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as model evaluation metrics; higher PSNR and SSIM values indicate better quality of the high-resolution pictures produced by the model. As is common in SISR, all metrics are computed on the luminance channel of the image after removing pixels near the image borders.
The low-resolution pictures are obtained from the corresponding high-resolution images by bicubic downscaling at the given scale, and all images are preprocessed by subtracting the mean RGB value of the DIV2K data set. The low-resolution training inputs are 48x48 color patches randomly cropped from the DIV2K LR images, with the minibatch size set to 16, i.e. 16 patches are trained at a time. The model is trained with the ADAM optimizer (β₁ = 0.9, β₂ = 0.99, ε = 10⁻⁸); the initial learning rate is set to 10⁻⁴ and is halved every 2×10⁵ iterations. The network model is trained and tested with the PyTorch framework on an NVIDIA GTX 1080Ti GPU. In the MSHAN network, the number of features of all convolutional layers is set to 64; the convolution kernel sizes are only 1x1, 3x3 and 5x5, with 1x1 kernels generally used after concatenation operations and 5x5 kernels appearing only in the multi-scale attention block. The number of attention groups N is set to 6 and the number of multi-scale attention blocks M per group is set to 12.
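The described optimizer and schedule can be reproduced as in the sketch below; the tiny stand-in model, the L1 loss and the random batch are illustrative assumptions, since the patent does not state the loss function or the data pipeline:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Stand-in x2 SR model; a real run would use the MSHAN assembly above.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 3 * 2 ** 2, 3, padding=1), nn.PixelShuffle(2))

optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99), eps=1e-8)
scheduler = StepLR(optimizer, step_size=200_000, gamma=0.5)  # halve lr every 2e5 iters
criterion = nn.L1Loss()  # assumed loss; the patent does not name one

# one minibatch of 16 random 48x48 LR crops and their x2 HR counterparts
lr_patch = torch.rand(16, 3, 48, 48)
hr_patch = torch.rand(16, 3, 96, 96)

loss = criterion(model(lr_patch), hr_patch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```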
Table 1 summarizes the quantitative results of the proposed MSHAN network at three upscaling factors (x2, x3, x4) on the five benchmark data sets (Set5, Set14, BSD100, Urban100, Manga109) under the bicubic degradation model. As the table shows, MSHAN exceeds EDSR by 0.1+ dB on all five benchmark data sets at every scale; although the MSHAN network is not as deep as EDSR, the improved residual-in-residual structure raises performance hierarchically, which is more conducive to improving network performance quickly. MSRN also applies multi-scale convolution blocks, but it lacks channel attention, spatial attention and inter-layer attention, so it treats every layer, every channel within a layer, and every position within a channel equally; unable to focus on the more important layers, channels and spatial positions, its performance is limited and its results differ considerably from MSHAN's. Similarly, RDN uses dense connections within residual blocks and concatenation between blocks, but for lack of channel-spatial attention and inter-layer attention its results are slightly worse than MSHAN's. On Manga109 in particular, a data set of animation images composed of fine textures, whether the network can focus on the most important fine-texture features matters greatly; at the x2, x3 and x4 scale factors, RDN falls roughly 0.1 dB short of MSHAN on Manga109. As illustrated in FIGS. 4-6, the invention achieves significant results under a comprehensive measure of model performance and parameter count compared with other classical SR models.
[Table 1: quantitative PSNR/SSIM comparison at x2, x3 and x4 upscaling on the five benchmark data sets; reproduced as images in the original document.]

Claims (9)

1. A single-image super-resolution algorithm based on an attention mechanism, characterized in that the method comprises three steps: a shallow feature extraction operation, an intermediate feature mapping operation, and an upsampling operation;
1) Shallow feature extraction operation: let $X$ and $Y_{SR}$ denote the input and the output of the whole network, respectively; for an input low-resolution picture $X$, a 3x3 convolutional layer extracts the initial shallow features, as shown in the following formula:

$$F_{IFE} = S_{IFENet}(X)$$

wherein $S_{IFENet}$ represents the function of the shallow feature extraction module; the extracted shallow features $F_{IFE}$ are sent to the subsequent feature mapping part as initial input and are also used for learning global features;
2) Intermediate feature mapping operation: the input of the intermediate feature mapping is $F_{IFE}$ obtained by the shallow feature extraction operation, and its basic unit is a multi-scale attention residual block;

let the input of the multi-scale attention residual block be $H_0$; the input first passes through two parallel modules, a 3x3 convolution module and a 5x5 convolution module, to generate the corresponding outputs, as shown in the following formulas:

$$H_{m3} = W_{3}^{2}\,\delta\!\left(W_{3}^{1} H_{0} + b_{3}^{1}\right) + b_{3}^{2}$$
$$H_{m5} = W_{5}^{2}\,\delta\!\left(W_{5}^{1} H_{0} + b_{5}^{1}\right) + b_{5}^{2}$$

wherein $W_{3}^{1}$ and $b_{3}^{1}$ represent the weights and biases of the first convolutional layer of the 3x3 module, and $W_{3}^{2}$ and $b_{3}^{2}$ those of its second convolutional layer; likewise, $W_{5}^{1}$ and $b_{5}^{1}$ represent the weights and biases of the first convolutional layer of the 5x5 module, and $W_{5}^{2}$ and $b_{5}^{2}$ those of its second convolutional layer; $\delta$ denotes the ReLU activation function, and $H_{m3}$ and $H_{m5}$ represent the outputs of the 3x3 and 5x5 modules, respectively;
after the output features $H_{m3}$ of the 3x3 module and $H_{m5}$ of the 5x5 module are obtained, they are sent to a concatenation module that fuses the features convolved at the two scales, and a 1x1 convolutional layer adjusts the dimensionality so the result can be sent to subsequent modules for further feature extraction:

$$H_{c}^{1} = W_{c}^{1}\left[H_{m3}, H_{m5}\right] + b_{c}^{1}$$

wherein $H_{c}^{1}$ represents the output of the first concatenation module, $[\cdot]$ represents the concatenation operation, and $W_{c}^{1}$ and $b_{c}^{1}$ represent the weights and biases of the 1x1 convolutional layer in the first concatenation module;
3) Upsampling operation: the input of the upsampling operation is the output $F_{MF}$ of the preceding intermediate feature mapping operation; sub-pixel convolution is then used as the final upsampling module, which realizes scaling by a given magnification factor through pixel rearrangement; the sub-pixel convolution operation aggregates the low-resolution feature maps while mapping the features into a high-dimensional space to reconstruct the HR image; the whole process is shown in the following formula:

$$Y_{SR} = U_{\uparrow}(F_{MF}) = U_{\uparrow}\!\left(F_{IFE} + F_N + F_{LA}\right)$$

wherein $U_{\uparrow}$ represents the sub-pixel convolution operation and $Y_{SR}$ is the reconstructed SR result; in addition, a long skip connection is introduced to stabilize the training of the proposed deep network, and the sub-pixel upsampling block takes $F_{IFE} + F_N + F_{LA}$ as its input.
2. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 1, characterized in that: in step 2), once the output of the first intermediate concatenation module is obtained, an attention mechanism is applied to further weight the channels and spatial positions rich in important features; two parallel branches are designed for this purpose: one branch generates a weight vector of size Cx1x1 through a channel attention mechanism to rescale the feature values of each channel, and the other branch uses a spatial attention mechanism to generate a weight map of size 1xHxW to rescale the feature values at each spatial position within every channel; with these parallel branches, the network can exploit the correlations between channels and spatial positions to extract more effective feature representations and thereby improve performance; the input feature is defined as $H_{c}^{1} \in \mathbb{R}^{C \times H \times W}$, which contains C feature maps, each of size HxW.
3. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 2, characterized in that: in step 2), the channel attention branch proceeds as follows: first, a global average pooling layer generates a channel-wise statistic $\mu \in \mathbb{R}^{C\times1\times1}$; the average pooling layer acts on each feature channel individually, so the c-th element of $\mu$ is expressed as:

$$\mu_c = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$$

wherein $x_c(i,j)$ represents the pixel value of the c-th channel $x_c$ at position (i, j); the statistic $\mu$ is then fed through two convolutional layers and their activation functions:

$$\alpha = \sigma\!\left(W_{CA}^{2}\,\delta\!\left(W_{CA}^{1}\mu + b_{CA}^{1}\right) + b_{CA}^{2}\right)$$

wherein $W_{CA}^{1}$ and $b_{CA}^{1}$ are the weights and biases of the first convolutional layer, which reduces the number of channels by a scaling ratio $\gamma$; the convolutional layer with parameters $W_{CA}^{2}$ and $b_{CA}^{2}$ restores the number of channels to the original count; $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively;

in addition, the sigmoid activation function $\sigma$ squashes each per-channel attention weight $\alpha$ to a value between 0 and 1, which is used to rescale the input features; after the channel attention coefficient $\alpha$ is obtained, it multiplies each element of the original input features channel-wise to give the final output of the channel attention branch:

$$H_{CA} = F_{CA}\!\left(\alpha, H_{c}^{1}\right) = \alpha \odot H_{c}^{1}$$

wherein $H_{CA}$ represents the final output of the channel attention module, and $F_{CA}$ represents the per-channel multiplication of each channel feature with its corresponding channel weight;

the output $H_{c}^{1}$ of the first concatenation module is also input into the parallel spatial attention branch; the spatial attention module has one fewer global average pooling layer than the channel attention module, because it does not need to compress global spatial information into a per-channel statistical descriptor; the rest of the process is similar to channel attention, and the spatial attention mask coefficient is shown in the following formula:

$$\beta = \sigma\!\left(W_{SA}^{2}\,\delta\!\left(W_{SA}^{1}H_{c}^{1} + b_{SA}^{1}\right) + b_{SA}^{2}\right)$$

wherein $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively; the first convolutional layer, with weights $W_{SA}^{1}$ and biases $b_{SA}^{1}$, generates per-channel feature maps, which are merged into a single attention map by the 1x1 convolutional layer with weights $W_{SA}^{2}$ and biases $b_{SA}^{2}$; the sigmoid function $\sigma$ normalizes the feature map to the range 0-1 to obtain the spatial attention adaptive mask $\beta$; the scaling ratio $\gamma$ of the convolutional layers controls the change in dimensionality; after the spatial attention mask coefficient is obtained, it is multiplied element-wise at each spatial position with the input feature $H_{c}^{1}$ to obtain the final output of the spatial attention branch:

$$H_{SA} = F_{SA}\!\left(\beta, H_{c}^{1}\right) = \beta \odot H_{c}^{1}$$

wherein $H_{SA}$ represents the final output of the spatial attention module, and $F_{SA}$ represents the element-wise multiplication of each spatial-position feature with its corresponding spatial-position weight.
4. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 3, characterized in that: in step 2), after the output $H_{CA}$ of the channel attention module and the output $H_{SA}$ of the spatial attention module are obtained, they are sent as input into a second concatenation module that fuses the features of the two branches; a 1x1 convolutional layer changes the number of feature channels for better transmission between blocks, giving the output of the second concatenation module; a residual connection then adds the input of the MSAB block to this output to obtain the output of the whole MSAB, as shown in the following formula:

$$H_{o} = W_{c}^{2}\left[H_{CA}, H_{SA}\right] + b_{c}^{2} + H_{0}$$

wherein $H_{o}$ represents the final output of the MSAB, $[\cdot]$ represents the concatenation operation, and $W_{c}^{2}$ and $b_{c}^{2}$ represent the weights and biases of the 1x1 convolutional layer in the second concatenation module; to make better use of features rich in low-frequency information, a short skip connection is introduced into the residual block, which not only enables the main part of the network to learn residual information but also avoids the vanishing-gradient problem during network training.
5. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 4, characterized in that: in step 2), the intermediate feature mapping operation adopts a residual-in-residual network design framework: the outer residual is mainly formed by a series of stacked attention groups together with global residual learning, and the inner residual is formed by stacking a series of attention blocks with local residual learning; if there are N attention groups (AG) in the outer residual, and the input and output of the n-th attention group are denoted $AG_{n-1}$ and $AG_n$, then:

$$AG_n = H_n(AG_{n-1}) = H_n\!\left(H_{n-1}\!\left(\cdots H_1(AG_0)\cdots\right)\right)$$

wherein $H_n$ is the operation of the n-th attention group; $AG_0$ is the input of the first attention group, i.e. the output $F_{IFE}$ of the shallow feature module; the attention groups are stacked from a series of MSAB blocks, but simply stacking residual blocks does not effectively utilize the features of previous blocks, so dense connections are introduced between the attention blocks, i.e. the input of every intermediate block is the concatenation of the outputs of the preceding attention blocks; the MSAB mainly comprises dense connections, local feature fusion and local residual learning, forming a continuous memory mechanism: by passing the features of previous MSABs to the current MSAB, a persistent memory storage mechanism is realized; let $F_{m-1}$ and $F_m$ represent the input and output of the m-th MSAB, both having $G_0$ feature maps; the output of the m-th MSAB is then represented as:

$$F_m = M_m\!\left(\left[F_{m-1}, F_{m-2}, \ldots, F_1\right]\right)$$

wherein $M_m$ represents the function of the m-th MSAB, and $[F_{m-1}, F_{m-2}, \ldots, F_1]$ represents the concatenation of the outputs of the 1st to (m-1)-th MSAB blocks; the output of each preceding MSAB layer has a direct connection to all subsequent layers, which not only preserves the feed-forward nature of the network but also extracts locally dense features.
6. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 5, characterized in that: in step 2), local feature fusion is used to adaptively fuse the states of all the convolutional layers in the preceding MSABs with the current MSAB; the feature maps of the (m-1)-th MSAB are introduced directly into the m-th MSAB by concatenation, so reducing the number of features is of great importance; on the other hand, inspired by MemNet, a 1x1 convolutional layer is introduced to adaptively control the output information; this operation, named local feature fusion (LFF), is formulated as:

$$F_{n,LF} = H_{LFF}^{n}\!\left(\left[AG_{n-1}, F_1, \ldots, F_M\right]\right)$$

wherein $F_{n,LF}$ represents the fused output of the MSAB features, $H_{LFF}^{n}$ represents the function of the 1x1 Conv layer in the n-th AG, and $[AG_{n-1}, F_1, \ldots, F_M]$ represents the concatenation of the input of the previous AG with the M MSAB outputs in the current AG; without LFF, a very deep dense network would be difficult to train as the growth rate G becomes larger.
7. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 6, characterized in that: in step 2), local residual learning is also introduced into the AG to further improve the flow of features through the network, and the final output of the n-th AG is represented as:

$$AG_n = AG_{n-1} + F_{n,LF}$$

local residual learning further improves the representational capability of the network and thereby achieves better performance; the output $AG_N$ of the final attention group is then fed into a 3x3 convolutional layer to produce the final output $F_N$ of the entire intermediate feature mapping convolution:

$$F_N = \mathrm{Conv}(AG_N)$$

wherein Conv represents the last convolutional layer; the resulting output $F_N$, the output of the subsequent attention module, and the initial input features $F_{IFE}$ are sent together to the next module.
8. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 7, characterized in that: in step 2), after the intermediate features of the series of intermediate attention groups are extracted, the model introduces a layer attention module whose input is the concatenation of the features of each intermediate attention group, as follows:

$$F_{LA} = H_{LA}\!\left(\left[AG_1, AG_2, \ldots, AG_N\right]\right)$$

wherein $H_{LA}$ represents the introduced inter-layer attention function, whose input is the output of all intermediate layers, so that all feature information from the preceding layers can be fully utilized; $F_{LA}$ represents the output of the inter-layer attention module; and $[AG_1, AG_2, \ldots, AG_N]$ represents the concatenation of the 1st to N-th attention groups.
9. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 8, characterized in that: in step 2), after the intermediate convolutional output $F_N$ and the output $F_{LA}$ of the inter-layer attention module are obtained, a long skip connection (LSC) is introduced to enhance the stability of network training; the LSC also achieves better network performance through residual learning, i.e. the initial shallow features $F_{IFE}$ are added element-wise together with them, so that the final output of the whole intermediate feature mapping module is represented as:

$$F_{MF} = F_{IFE} + F_N + F_{LA}$$

wherein $F_{MF}$ represents the final output of the intermediate feature mapping step.
CN202210719954.2A 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism Pending CN115170392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210719954.2A CN115170392A (en) 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210719954.2A CN115170392A (en) 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism

Publications (1)

Publication Number Publication Date
CN115170392A true CN115170392A (en) 2022-10-11

Family

ID=83486727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210719954.2A Pending CN115170392A (en) 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism

Country Status (1)

Country Link
CN (1) CN115170392A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594061A (en) * 2023-07-18 2023-08-15 吉林大学 Seismic data denoising method based on multi-scale U-shaped attention network
CN116594061B (en) * 2023-07-18 2023-09-22 吉林大学 Seismic data denoising method based on multi-scale U-shaped attention network
CN117522682A (en) * 2023-12-04 2024-02-06 无锡日联科技股份有限公司 Method, device, equipment and medium for reconstructing resolution of radiographic image

Similar Documents

Publication Publication Date Title
CN112330542B (en) Image reconstruction system and method based on CRCSAN network
CN110033410B (en) Image reconstruction model training method, image super-resolution reconstruction method and device
Hui et al. Fast and accurate single image super-resolution via information distillation network
CN111192200A (en) Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN112102177B (en) Image deblurring method based on compression and excitation mechanism neural network
CN115170392A (en) Single-image super-resolution algorithm based on attention mechanism
Luo et al. Lattice network for lightweight image restoration
CN111242846A (en) Fine-grained scale image super-resolution method based on non-local enhancement network
CN110060204B (en) Single image super-resolution method based on reversible network
CN111951164B (en) Image super-resolution reconstruction network structure and image reconstruction effect analysis method
CN112801904B (en) Hybrid degraded image enhancement method based on convolutional neural network
CN112862689A (en) Image super-resolution reconstruction method and system
CN112561799A (en) Infrared image super-resolution reconstruction method
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN114841856A (en) Image super-pixel reconstruction method of dense connection network based on depth residual channel space attention
CN117575915B (en) Image super-resolution reconstruction method, terminal equipment and storage medium
CN116091313A (en) Image super-resolution network model and reconstruction method
CN113298716A (en) Image super-resolution reconstruction method based on convolutional neural network
CN112819705A (en) Real image denoising method based on mesh structure and long-distance correlation
CN113962882B (en) JPEG image compression artifact eliminating method based on controllable pyramid wavelet network
CN116777745A (en) Image super-resolution reconstruction method based on sparse self-adaptive clustering
CN116485654A (en) Lightweight single-image super-resolution reconstruction method combining convolutional neural network and transducer
CN110111252A (en) Single image super-resolution method based on projection matrix
CN112070676B (en) Picture super-resolution reconstruction method of double-channel multi-perception convolutional neural network
CN113344786B (en) Video transcoding method, device, medium and equipment based on geometric generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination