CN115170392A - Single-image super-resolution algorithm based on attention mechanism - Google Patents


Info

Publication number
CN115170392A
Authority
CN
China
Prior art keywords: attention, output, module, input, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210719954.2A
Other languages
Chinese (zh)
Inventor
裴文江
蔡清
夏亦犁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210719954.2A priority Critical patent/CN115170392A/en
Publication of CN115170392A publication Critical patent/CN115170392A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-image super-resolution algorithm based on an attention mechanism. It proposes a new multi-scale attention residual block (MSAB), improves the residual-in-residual network design framework, and introduces inter-layer attention. On the basis of these two innovations, a new multi-scale global attention network (MSHAN) is proposed. The MSAB has the following characteristics: (1) a channel attention mechanism and a spatial attention mechanism are introduced into an ordinary residual block using a two-branch learning strategy, one attention mechanism per branch, with the branch outputs concatenated and fused by a 1x1 convolutional layer; (2) on this basis, multi-scale convolution is introduced: 3x3 and 5x5 convolution modules extract features in a second two-branch arrangement, again concatenated and fused by a 1x1 convolutional layer. The MSHAN network of the invention achieves remarkable results under a comprehensive measure of model performance and parameter count.

Description

Single-image super-resolution algorithm based on attention mechanism
Technical Field
The invention relates to a super-resolution problem in image processing, in particular to a super-resolution method based on a convolutional neural network.
Background
With the continuous progress of deep learning, the powerful computing and representation capabilities of convolutional neural networks (CNNs) have made them stand out in computer vision, gradually replacing traditional learning methods. A CNN can not only extract shallow image features such as background and contours through its convolution kernels, but also, by stacking many convolutional layers, progressively extract higher-level features, including the high-frequency details of an image.
Super-resolution (SR) is a classic ill-posed problem in image processing: given an input low-resolution (LR) image, a suitable mapping must be designed that outputs the corresponding high-resolution (HR) image at an enlarged scale. To solve this problem, many learning-based methods have been proposed to learn the mapping between LR and HR image pairs. Learning-based SR methods fall into two main categories: traditional learning methods (such as bicubic interpolation) and deep learning methods. Although some prior knowledge of the image can be learned implicitly through probability theory, traditional learning methods cannot learn deep mapping features, and the high-resolution images they produce lack the necessary high-frequency details. Deep learning methods, by contrast, rely on the powerful computation and feature representation capabilities of convolutional neural networks to extract image features at a deeper level. Their design involves two main aspects. (1) The choice of SR framework: the four existing frameworks are pre-upsampling SR, post-upsampling SR, progressive upsampling SR, and iterative up-and-down sampling SR, of which the post-upsampling framework has become the mainstream because it reduces the number of model parameters and the computational complexity. (2) The specific network structure: the development of deep convolutional networks has produced many classic and efficient structural designs, such as residual learning, recursive learning and dense connections. Different designs affect the performance and parameter count of a model differently, and a suitable network structure must be selected through continual experimentation.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a new multi-scale global attention network (MSHAN) based on convolutional neural networks, addressing the fact that previous SR networks treat information from different convolutional layers, different channels and different spatial positions equally.
To solve this problem, the invention adopts the following technical scheme: an attention-based single-image super-resolution algorithm, shown in FIG. 1, comprising a multi-scale attention residual block (MSAB) and an improved residual-in-residual (IRIR) network design framework into which inter-layer attention is introduced. The method consists of three sequential steps: a shallow feature extraction operation, an intermediate feature mapping operation, and an upsampling operation.
1) Shallow feature extraction operation. Let $X$ and $Y_{SR}$ denote the input and the output of the whole network, respectively. For an input low-resolution picture $X$, a 3x3 convolutional layer extracts the initial shallow features:

$$F_{IFE} = S_{IFENet}(X)$$

where $S_{IFENet}$ denotes the function of the shallow feature extraction module. The extracted shallow features $F_{IFE}$ are fed into the subsequent feature mapping stage as its initial input and are also used for global feature learning.
2) Intermediate feature mapping operation. The input of the intermediate feature mapping is the feature $F_{IFE}$ obtained by the shallow feature extraction operation; the basic unit of this stage is the new multi-scale attention residual block, whose structure is shown in FIG. 2.

Let the input of the multi-scale attention residual block be $H_0$. The input first passes through two parallel modules, a 3x3 convolution module and a 5x5 convolution module, to generate the corresponding outputs:

$$H_{m3} = W_{3}^{2}\,\delta\!\left(W_{3}^{1} H_{0} + b_{3}^{1}\right) + b_{3}^{2}$$
$$H_{m5} = W_{5}^{2}\,\delta\!\left(W_{5}^{1} H_{0} + b_{5}^{1}\right) + b_{5}^{2}$$

where $W_{3}^{1}$ and $b_{3}^{1}$ denote the weights and biases of the first convolutional layer of the 3x3 module, and $W_{3}^{2}$ and $b_{3}^{2}$ those of its second convolutional layer; likewise, $W_{5}^{1}$ and $b_{5}^{1}$ denote the weights and biases of the first convolutional layer of the 5x5 module, and $W_{5}^{2}$ and $b_{5}^{2}$ those of its second convolutional layer. $\delta$ denotes the ReLU activation function, and $H_{m3}$ and $H_{m5}$ denote the outputs of the 3x3 and 5x5 modules, respectively.
After the output features $H_{m3}$ of the 3x3 module and $H_{m5}$ of the 5x5 module are obtained, they are sent to a concatenation module that fuses the features convolved at the two scales, and a 1x1 convolutional layer adjusts the dimensionality so the result can be passed to subsequent modules for further feature extraction:

$$H_{c}^{1} = W_{c}^{1}\left[H_{m3}, H_{m5}\right] + b_{c}^{1}$$

where $H_{c}^{1}$ denotes the output of the first concatenation module, $[\cdot]$ denotes the concatenation operation, and $W_{c}^{1}$ and $b_{c}^{1}$ denote the weights and biases of the 1x1 convolutional layer in the first concatenation module.
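For concreteness, this two-scale front end and the first concatenation module can be sketched in PyTorch as follows (a minimal sketch; the channel count of 64 follows the configuration given later in the detailed description, while class and variable names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Two parallel convolution branches (3x3 and 5x5, two layers each with a
    ReLU between them) fused by the first 1x1 concatenation module."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 5, padding=2))
        # 1x1 convolution restores the channel count after concatenation
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, h0):
        h_m3 = self.branch3(h0)                            # H_m3
        h_m5 = self.branch5(h0)                            # H_m5
        return self.fuse(torch.cat([h_m3, h_m5], dim=1))   # H_c^1
```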
With the output of the first intermediate concatenation module in hand, attention is applied to further weight the channels and spatial positions rich in important features. Two parallel branches are designed for this purpose: one branch generates a weight vector of size Cx1x1 through a channel attention mechanism to rescale the feature values of each channel; the other uses a spatial attention mechanism to generate a weight map of size 1xHxW to rescale the feature values at each spatial position within every channel. With these parallel branches, the network can exploit the correlations between channels and between spatial positions to extract more effective feature representations, improving performance. Define the input feature $H_{c}^{1} \in \mathbb{R}^{C \times H \times W}$, which contains $C$ feature maps, each of size HxW.
The channel attention branch proceeds as shown in the channel attention module of FIG. 2. First, a global average pooling layer produces a channel-wise statistic $\mu \in \mathbb{R}^{C\times1\times1}$. The average pooling acts on each feature channel individually, so the $c$-th element of $\mu$ can be expressed as:

$$\mu_c = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$$

where $x_c(i,j)$ denotes the pixel value of the $c$-th channel $x_c$ at position $(i,j)$. The statistic $\mu$ is then passed through two convolutional layers and their activation functions:

$$\alpha = \sigma\!\left(W_{CA}^{2}\,\delta\!\left(W_{CA}^{1}\mu + b_{CA}^{1}\right) + b_{CA}^{2}\right)$$

where $W_{CA}^{1}$ and $b_{CA}^{1}$ are the weights and biases of the first convolutional layer, which reduces the number of channels by a scaling ratio $\gamma$, and the convolutional layer with parameters $W_{CA}^{2}$ and $b_{CA}^{2}$ restores the number of channels to the original count. $\sigma$ and $\delta$ denote the sigmoid and ReLU activation functions, respectively.

The sigmoid activation $\sigma$ squashes each per-channel attention weight $\alpha$ to a value between 0 and 1, which is used to rescale the input features. After the channel attention coefficient $\alpha$ is obtained, it multiplies each element of the original input features channel-wise to give the final output of the channel attention branch:

$$H_{CA} = F_{CA}\!\left(\alpha, H_{c}^{1}\right) = \alpha \odot H_{c}^{1}$$

where $H_{CA}$ denotes the final output of the channel attention module and $F_{CA}$ denotes the per-channel multiplication of each channel feature with its corresponding channel weight.
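A minimal PyTorch sketch of this channel attention branch (the reduction ratio γ = 16 is an illustrative assumption; the patent does not fix its value):

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention branch: global average pooling, a channel-reducing
    1x1 conv, ReLU, a channel-restoring 1x1 conv, sigmoid, then rescaling."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # mu: C x 1 x 1
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # reduce by gamma
            nn.ReLU(inplace=True),                          # delta
            nn.Conv2d(channels // reduction, channels, 1),  # restore channels
            nn.Sigmoid())                                   # alpha in (0, 1)

    def forward(self, x):
        alpha = self.body(self.pool(x))   # per-channel weights, C x 1 x 1
        return x * alpha                  # H_CA: channel-wise rescaling
```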
The output $H_{c}^{1}$ of the first concatenation module is also fed into the parallel spatial attention branch. As shown in the spatial attention module of FIG. 2, this module has one fewer global average pooling layer than the channel attention module, because it does not need to compress global spatial information into a per-channel statistical descriptor. The rest of the process is similar to channel attention; the spatial attention mask coefficient is given by:

$$\beta = \sigma\!\left(W_{SA}^{2}\,\delta\!\left(W_{SA}^{1}H_{c}^{1} + b_{SA}^{1}\right) + b_{SA}^{2}\right)$$

where $\sigma$ and $\delta$ denote the sigmoid and ReLU activation functions, respectively. The first convolutional layer, with weights $W_{SA}^{1}$ and biases $b_{SA}^{1}$, generates per-channel feature maps, which are then merged into a single attention map by the 1x1 convolutional layer with weights $W_{SA}^{2}$ and biases $b_{SA}^{2}$. The sigmoid function $\sigma$ normalizes the map to the range 0-1, yielding the spatial attention adaptive mask $\beta$; the scaling ratio $\gamma$ of the convolutional layers controls the change in dimensionality. After the spatial attention mask coefficient is obtained, it is multiplied element-wise, at each spatial position, with the input feature $H_{c}^{1}$ to obtain the final output of the spatial attention branch:

$$H_{SA} = F_{SA}\!\left(\beta, H_{c}^{1}\right) = \beta \odot H_{c}^{1}$$

where $H_{SA}$ denotes the final output of the spatial attention module and $F_{SA}$ denotes the element-wise multiplication of each spatial-position feature with its corresponding spatial-position weight.
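A corresponding sketch of the spatial attention branch (again, the intermediate width set by γ is an assumption):

```python
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention branch: no global pooling; two 1x1 convs collapse the
    features into a single 1 x H x W mask that rescales every position."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # per-channel maps
            nn.ReLU(inplace=True),                          # delta
            nn.Conv2d(channels // reduction, 1, 1),         # single attention map
            nn.Sigmoid())                                   # beta in (0, 1)

    def forward(self, x):
        beta = self.body(x)     # 1 x H x W mask, broadcast over all channels
        return x * beta         # H_SA: position-wise rescaling
```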
After the output $H_{CA}$ of the channel attention module and the output $H_{SA}$ of the spatial attention module are obtained, they are fed as input into a second concatenation module that fuses the features of the two branches; a 1x1 convolutional layer adjusts the number of feature channels for better transmission between blocks, giving the output of the second concatenation module. A residual connection then adds the input of the MSAB block to this output to give the output of the whole MSAB:

$$H_{o} = W_{c}^{2}\left[H_{CA}, H_{SA}\right] + b_{c}^{2} + H_{0}$$

where $H_{o}$ denotes the final output of the MSAB, $[\cdot]$ denotes the concatenation operation, and $W_{c}^{2}$ and $b_{c}^{2}$ denote the weights and biases of the 1x1 convolutional layer in the second concatenation module. To make better use of features rich in low-frequency information, a short skip connection is introduced into the residual block. This skip connection not only lets the main part of the network learn residual information, but also avoids the vanishing-gradient problem during network training.
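Combining the pieces above, the whole MSAB can be sketched as follows (reusing the MultiScaleBranches, ChannelAttention and SpatialAttention classes from the previous sketches):

```python
import torch
import torch.nn as nn

class MSAB(nn.Module):
    """Multi-scale attention residual block: multi-scale front end, parallel
    channel/spatial attention, second 1x1 fusion, short skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.multi_scale = MultiScaleBranches(channels)
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # second concat module

    def forward(self, h0):
        hc1 = self.multi_scale(h0)                         # H_c^1
        fused = self.fuse(torch.cat([self.ca(hc1), self.sa(hc1)], dim=1))
        return fused + h0                                  # H_o with short skip
```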
The intermediate feature mapping operation adopts a residual-in-residual network design framework: the outer residual is formed mainly by a series of stacked attention groups together with global residual learning, while the inner residual is formed by stacking a series of attention blocks with local residual learning. This design framework avoids the problems of difficult training and stagnating performance that arise when long chains of residual blocks are simply stacked. If there are $N$ attention groups (AG) in the outer residual, and the input and output of the $n$-th attention group are denoted $AG_{n-1}$ and $AG_n$, then:

$$AG_n = H_n(AG_{n-1}) = H_n\!\left(H_{n-1}\!\left(\cdots H_1(AG_0)\cdots\right)\right)$$
where $H_n$ is the operation of the $n$-th attention group, and $AG_0$ is the input of the first attention group, i.e. the output $F_{IFE}$ of the shallow feature module. Each attention group is stacked from a series of MSAB blocks; but simply stacking residual blocks does not make effective use of the features of earlier blocks, so dense connections are introduced between the attention blocks: the input of every intermediate block is the concatenation of the outputs of all preceding attention blocks. The MSAB stage thus combines dense connections, local feature fusion and local residual learning into a continuous memory mechanism: by passing the features of previous MSABs to the current MSAB, a persistent memory mechanism is realized. Let $F_{m-1}$ and $F_m$ denote the input and output of the $m$-th MSAB, both with $G_0$ feature maps; the output of the $m$-th MSAB can then be expressed as:

$$F_m = M_m\!\left(\left[F_{m-1}, F_{m-2}, \ldots, F_1\right]\right)$$

where $M_m$ denotes the function of the $m$-th MSAB and $[F_{m-1}, F_{m-2}, \ldots, F_1]$ denotes the concatenation of the outputs of the 1st to $(m-1)$-th MSAB blocks. The output of each preceding MSAB layer has a direct connection to all subsequent layers, which not only preserves the feed-forward nature of the network but also extracts locally dense features.
Local feature fusion is used to adaptively fuse the states of all the convolutional layers in the preceding MSABs with the current MSAB. As shown by the AG in FIG. 1, the feature maps of the $(m-1)$-th MSAB are introduced directly into the $m$-th MSAB by concatenation, so reducing the number of features is essential. Inspired by MemNet, a 1x1 convolutional layer is introduced to adaptively control the output information. This operation, named local feature fusion (LFF), is formulated as:

$$F_{n,LF} = H_{LFF}^{n}\!\left(\left[AG_{n-1}, F_1, \ldots, F_M\right]\right)$$

where $F_{n,LF}$ denotes the fused output of the MSAB features, $H_{LFF}^{n}$ denotes the function of the 1x1 Conv layer in the $n$-th AG, and $[AG_{n-1}, F_1, \ldots, F_M]$ denotes the concatenation of the input of the previous AG with the $M$ MSAB outputs in the current AG. Without LFF, a very deep dense network would become difficult to train as the growth rate $G$ grows.
Local residual learning is also introduced into the AG to further improve the flow of features through the network; the final output of the $n$-th AG can be expressed as:

$$AG_n = AG_{n-1} + F_{n,LF}$$

It should be noted that local residual learning further improves the representational capability of the network and thereby yields better performance. The output $AG_N$ of the final attention group is then fed into a 3x3 convolutional layer to produce the final output $F_N$ of the entire intermediate feature mapping convolution:

$$F_N = \mathrm{Conv}(AG_N)$$

where $\mathrm{Conv}$ denotes the last convolutional layer. The resulting output $F_N$, the output of the subsequent inter-layer attention module, and the initial input features $F_{IFE}$ are sent together to the next module.
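An attention group with dense connections, LFF and local residual learning might look like the sketch below (reusing the MSAB class sketched above; the per-block 1x1 compression convolutions are an implementation assumption, since the patent specifies the concatenation but not how each block's input width is restored, and M = 12 follows the configuration given later):

```python
import torch
import torch.nn as nn

class AttentionGroup(nn.Module):
    """Attention group: M densely connected MSABs, 1x1 local feature fusion
    (LFF), and local residual learning around the whole group."""
    def __init__(self, channels=64, num_blocks=12):
        super().__init__()
        self.compress = nn.ModuleList()
        self.blocks = nn.ModuleList()
        for m in range(num_blocks):
            # each block sees the concatenation of the group input and all
            # previous block outputs, squeezed back to `channels` by a 1x1 conv
            self.compress.append(nn.Conv2d((m + 1) * channels, channels, 1))
            self.blocks.append(MSAB(channels))
        # LFF: fuse the group input and all M block outputs
        self.lff = nn.Conv2d((num_blocks + 1) * channels, channels, 1)

    def forward(self, ag_in):
        states = [ag_in]
        for squeeze, block in zip(self.compress, self.blocks):
            states.append(block(squeeze(torch.cat(states, dim=1))))
        f_lf = self.lff(torch.cat(states, dim=1))   # F_{n,LF}
        return ag_in + f_lf                         # AG_n = AG_{n-1} + F_{n,LF}
```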
After the intermediate features of the series of attention groups are extracted, the model introduces a layer attention module (LA, shown in FIG. 3), whose input is the concatenation of the features of every intermediate attention group:

$$F_{LA} = H_{LA}\!\left(\left[AG_1, AG_2, \ldots, AG_N\right]\right)$$

where $H_{LA}$ denotes the inter-layer attention function, whose input is the output of all intermediate layers, so that all feature information from the preceding layers can be fully exploited; $F_{LA}$ denotes the output of the inter-layer attention module; and $[AG_1, AG_2, \ldots, AG_N]$ denotes the concatenation of the 1st to $N$-th attention groups.
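The patent does not spell out the internals of the LA module, so the sketch below follows the HAN-style layer attention formulation (an N x N correlation matrix over the stacked group features, softmax-normalized and used to re-weight the layers); the final 1x1 reduction back to C channels is an assumption made so that F_LA can be added element-wise to F_IFE and F_N:

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """HAN-style inter-layer attention over the N attention-group outputs:
    an N x N correlation matrix re-weights the stacked layer features."""
    def __init__(self, channels=64, n_groups=6):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))                  # learnable residual scale
        self.reduce = nn.Conv2d(n_groups * channels, channels, 1)  # back to C channels

    def forward(self, ag_feats):
        # ag_feats: list of N tensors, each of shape B x C x H x W
        b, c, h, w = ag_feats[0].shape
        n = len(ag_feats)
        stack = torch.stack(ag_feats, dim=1)                 # B x N x C x H x W
        flat = stack.view(b, n, -1)                          # B x N x (C*H*W)
        attn = torch.softmax(torch.bmm(flat, flat.transpose(1, 2)), dim=-1)
        out = torch.bmm(attn, flat).view(b, n, c, h, w)      # re-weighted layers
        out = self.scale * out + stack
        return self.reduce(out.view(b, n * c, h, w))         # F_LA with C channels
```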
With the intermediate convolutional output $F_N$ and the output $F_{LA}$ of the inter-layer attention module in hand, a long skip connection (LSC) is introduced to enhance the stability of network training. The LSC also improves performance through residual learning: the initial shallow features $F_{IFE}$ are added element-wise to the two outputs, so the final output of the whole intermediate feature mapping module can be expressed as:

$$F_{MF} = F_{IFE} + F_N + F_{LA}$$

where $F_{MF}$ denotes the final output of the intermediate feature mapping step.
3) Upsampling operation. The input of the upsampling operation is the output $F_{MF}$ of the preceding intermediate feature mapping operation. Sub-pixel convolution is used as the final upsampling module: it realizes scaling by a given magnification factor through pixel rearrangement, aggregating the low-resolution feature maps while mapping the features into a high-dimensional space to reconstruct the HR image. The whole process is given by:

$$Y_{SR} = U_{\uparrow}(F_{MF}) = U_{\uparrow}\!\left(F_{IFE} + F_N + F_{LA}\right)$$

where $U_{\uparrow}$ denotes the sub-pixel convolution operation and $Y_{SR}$ is the reconstructed SR result. In addition, the long skip connection introduced to stabilize the training of the proposed deep network means that the sub-pixel upsampling block takes $F_{IFE} + F_N + F_{LA}$ as its input.
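Putting the three steps together, a top-level sketch of the network follows (reusing AttentionGroup and LayerAttention from the sketches above; N = 6 groups and 64 features follow the configuration given in the detailed description, and the sub-pixel tail uses PixelShuffle):

```python
import torch.nn as nn

class MSHAN(nn.Module):
    """Top-level assembly: shallow 3x3 feature extraction, N attention groups,
    a trailing 3x3 conv, layer attention, long skip, sub-pixel upsampling."""
    def __init__(self, channels=64, n_groups=6, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)     # S_IFENet
        self.groups = nn.ModuleList(
            [AttentionGroup(channels) for _ in range(n_groups)])
        self.tail_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.la = LayerAttention(channels, n_groups)
        self.up = nn.Sequential(                             # U: sub-pixel conv
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, x):
        f_ife = self.head(x)                 # shallow features F_IFE
        ag, ag_feats = f_ife, []
        for group in self.groups:
            ag = group(ag)
            ag_feats.append(ag)              # keep every AG output for LA
        f_n = self.tail_conv(ag)             # F_N
        f_la = self.la(ag_feats)             # F_LA
        return self.up(f_ife + f_n + f_la)   # Y_SR = U(F_MF), long skip inside
```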
Using the MSAB as the basic building block of the network allows the multi-scale convolution to exploit image features at different scales, while the channel and spatial attention mechanisms focus the network on the channels and spatial positions richest in high-frequency information.

The method adopts the IRIR network structure design, which allows many MSAB blocks to be stacked, extracts image feature representations more fully, and mitigates the vanishing- and exploding-gradient problems commonly encountered in model training.

The invention adopts convolutional layers at two scales, 3x3 and 5x5, in a two-branch network design: the input features enter a 3x3 convolution module and a 5x5 convolution module respectively, extracting image features at both scales. A channel attention mechanism exploits the correlations between different channels at the channel level, and a spatial attention mechanism exploits the correlations between different spatial positions within the channels.

According to the invention, an inter-layer attention module is introduced into the outer residual; it exploits the correlations between different layers, allowing the network to assign different attention weights to the features of different layers and automatically improving the representational power of the extracted features. A densely connected design is introduced into the inner residual: dense connections let the current MSAB take the concatenated outputs of all previous MSABs as input, improving the flow of feature information at every level and hence the feature representation capability of the network.

The image features obtained by the 3x3 and 5x5 convolutions are concatenated, and a 1x1 convolutional layer adjusts the number of channels so the resulting multi-scale features can enter the next module. The image features obtained by the channel attention module and the spatial attention module are likewise concatenated, a 1x1 convolutional layer adjusts the number of channels output by the MSAB block, and the result is added element-wise to the features input at the start of the MSAB for residual learning, avoiding the vanishing-gradient problem and improving the feature extraction capability of the network.
Beneficial effects: compared with previous well-performing SR models (EDSR, RDN, etc.), the invention improves the PSNR/SSIM indices at the x2, x3 and x4 magnification scale factors alike, as shown in Table 1. As FIGS. 4-5 show qualitatively, the reconstructed HR images are also more faithful, with clearer textures, than those of other models. FIG. 6 further shows that the invention achieves good results under a comprehensive measure of model performance and parameter count.
Drawings
FIG. 1 is a block diagram of the method.
FIG. 2 is a schematic diagram of the multi-scale attention residual block in the method.
FIG. 3 is a schematic diagram of the inter-layer attention module in the method.
FIG. 4 is a qualitative comparison of this method with other methods at x3 upscaling on the Urban100 test data set.
FIG. 5 is a qualitative comparison of this method with other methods at x4 upscaling on the Manga109 test data set.
FIG. 6 is a comprehensive comparison of performance and model parameters for this method and other methods at x4 upscaling on the Urban100 test data set.
Detailed Description
The method is a multi-scale holistic attention residual network method based on an attention mechanism. It can be applied to the single-image super-resolution problem: given a blurry low-resolution image as input, the trained network generates a high-resolution image with a clear texture structure.
The present invention was evaluated in comparative studies on the widely used data sets Set5, Set14, BSD100, Urban100 and Manga109, which contain 5, 14, 100, 100 and 109 images, respectively. Set5, Set14 and BSD100 contain natural-scene images; Urban100 consists of urban-scene images with many details in different frequency bands; and Manga109 consists of Japanese manga images with many fine structures. The proposed MSHAN model is trained on the 800 high-quality training images of DIV2K, with data augmentation consisting of random horizontal flips and random 90° rotations.
The present invention uses peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as model evaluation metrics; higher PSNR and SSIM values indicate better quality of the high-resolution pictures produced by the model. As is common in SISR, all metrics are computed on the luminance channel of the image after removing pixels near the image borders.
The low-resolution pictures are obtained from the corresponding high-resolution images by bicubic downscaling at the given scale, and all images are preprocessed by subtracting the mean RGB value of the DIV2K data set. The low-resolution training inputs are 48x48 color patches randomly cropped from the DIV2K LR images, with the minibatch size set to 16, i.e. 16 patches are trained at a time. The model is trained with the ADAM optimizer (β₁ = 0.9, β₂ = 0.99, ε = 10⁻⁸); the initial learning rate is set to 10⁻⁴ and is halved every 2×10⁵ iterations. The network model is trained and tested with the PyTorch framework on an NVIDIA GTX 1080Ti GPU. In the MSHAN network, the number of features of all convolutional layers is set to 64; the convolution kernel sizes are only 1x1, 3x3 and 5x5, with 1x1 kernels generally used after concatenation operations and 5x5 kernels appearing only in the multi-scale attention block. The number of attention groups N is set to 6 and the number of multi-scale attention blocks M per group is set to 12.
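The described optimizer and schedule can be reproduced as in the sketch below; the tiny stand-in model, the L1 loss and the random batch are illustrative assumptions, since the patent does not state the loss function or the data pipeline:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Stand-in x2 SR model; a real run would use the MSHAN assembly above.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 3 * 2 ** 2, 3, padding=1), nn.PixelShuffle(2))

optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99), eps=1e-8)
scheduler = StepLR(optimizer, step_size=200_000, gamma=0.5)  # halve lr every 2e5 iters
criterion = nn.L1Loss()  # assumed loss; the patent does not name one

# one minibatch of 16 random 48x48 LR crops and their x2 HR counterparts
lr_patch = torch.rand(16, 3, 48, 48)
hr_patch = torch.rand(16, 3, 96, 96)

loss = criterion(model(lr_patch), hr_patch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```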
Table 1 summarizes the quantitative results of the proposed MSHAN network at three upscaling factors (x2, x3, x4) on the five benchmark data sets (Set5, Set14, BSD100, Urban100, Manga109) under the bicubic degradation model. As the table shows, MSHAN exceeds EDSR by 0.1+ dB on all five benchmark data sets at every scale; although the MSHAN network is not as deep as EDSR, the improved residual-in-residual structure raises performance hierarchically, which is more conducive to improving network performance quickly. MSRN also applies multi-scale convolution blocks, but it lacks channel attention, spatial attention and inter-layer attention, so it treats every layer, every channel within a layer, and every position within a channel equally; unable to focus on the more important layers, channels and spatial positions, its performance is limited and its results differ considerably from MSHAN's. Similarly, RDN uses dense connections within residual blocks and concatenation between blocks, but for lack of channel-spatial attention and inter-layer attention its results are slightly worse than MSHAN's. On Manga109 in particular, a data set of animation images composed of fine textures, whether the network can focus on the most important fine-texture features matters greatly; at the x2, x3 and x4 scale factors, RDN falls roughly 0.1 dB short of MSHAN on Manga109. As illustrated in FIGS. 4-6, the invention achieves significant results under a comprehensive measure of model performance and parameter count compared with other classical SR models.
[Table 1: quantitative PSNR/SSIM comparison at x2, x3 and x4 upscaling on the five benchmark data sets; reproduced as images in the original document.]

Claims (9)

1. A single-image super-resolution algorithm based on an attention mechanism, characterized in that the method comprises three steps: a shallow feature extraction operation, an intermediate feature mapping operation, and an upsampling operation;
1) Shallow feature extraction operation: let $X$ and $Y_{SR}$ denote the input and the output of the whole network, respectively; for an input low-resolution picture $X$, a 3x3 convolutional layer extracts the initial shallow features, as shown in the following formula:

$$F_{IFE} = S_{IFENet}(X)$$

wherein $S_{IFENet}$ represents the function of the shallow feature extraction module; the extracted shallow features $F_{IFE}$ are sent to the subsequent feature mapping part as initial input and are also used for learning global features;
2) Intermediate feature mapping operation: the input of the intermediate feature mapping is $F_{IFE}$ obtained by the shallow feature extraction operation, and its basic unit is a multi-scale attention residual block;

let the input of the multi-scale attention residual block be $H_0$; the input first passes through two parallel modules, a 3x3 convolution module and a 5x5 convolution module, to generate the corresponding outputs, as shown in the following formulas:

$$H_{m3} = W_{3}^{2}\,\delta\!\left(W_{3}^{1} H_{0} + b_{3}^{1}\right) + b_{3}^{2}$$
$$H_{m5} = W_{5}^{2}\,\delta\!\left(W_{5}^{1} H_{0} + b_{5}^{1}\right) + b_{5}^{2}$$

wherein $W_{3}^{1}$ and $b_{3}^{1}$ represent the weights and biases of the first convolutional layer of the 3x3 module, and $W_{3}^{2}$ and $b_{3}^{2}$ those of its second convolutional layer; likewise, $W_{5}^{1}$ and $b_{5}^{1}$ represent the weights and biases of the first convolutional layer of the 5x5 module, and $W_{5}^{2}$ and $b_{5}^{2}$ those of its second convolutional layer; $\delta$ denotes the ReLU activation function, and $H_{m3}$ and $H_{m5}$ represent the outputs of the 3x3 and 5x5 modules, respectively;
after the output features $H_{m3}$ of the 3x3 module and $H_{m5}$ of the 5x5 module are obtained, they are sent to a concatenation module that fuses the features convolved at the two scales, and a 1x1 convolutional layer adjusts the dimensionality so the result can be sent to subsequent modules for further feature extraction:

$$H_{c}^{1} = W_{c}^{1}\left[H_{m3}, H_{m5}\right] + b_{c}^{1}$$

wherein $H_{c}^{1}$ represents the output of the first concatenation module, $[\cdot]$ represents the concatenation operation, and $W_{c}^{1}$ and $b_{c}^{1}$ represent the weights and biases of the 1x1 convolutional layer in the first concatenation module;
3) Upsampling operation: the input of the upsampling operation is the output $F_{MF}$ of the preceding intermediate feature mapping operation; sub-pixel convolution is then used as the final upsampling module, which realizes scaling by a given magnification factor through pixel rearrangement; the sub-pixel convolution operation aggregates the low-resolution feature maps while mapping the features into a high-dimensional space to reconstruct the HR image; the whole process is shown in the following formula:

$$Y_{SR} = U_{\uparrow}(F_{MF}) = U_{\uparrow}\!\left(F_{IFE} + F_N + F_{LA}\right)$$

wherein $U_{\uparrow}$ represents the sub-pixel convolution operation and $Y_{SR}$ is the reconstructed SR result; in addition, a long skip connection is introduced to stabilize the training of the proposed deep network, and the sub-pixel upsampling block takes $F_{IFE} + F_N + F_{LA}$ as its input.
2. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 1, characterized in that: in step 2), once the output of the first intermediate concatenation module is obtained, an attention mechanism is applied to further weight the channels and spatial positions rich in important features; two parallel branches are designed for this purpose: one branch generates a weight vector of size Cx1x1 through a channel attention mechanism to rescale the feature values of each channel, and the other branch uses a spatial attention mechanism to generate a weight map of size 1xHxW to rescale the feature values at each spatial position within every channel; with these parallel branches, the network can exploit the correlations between channels and spatial positions to extract more effective feature representations and thereby improve performance; the input feature is defined as $H_{c}^{1} \in \mathbb{R}^{C \times H \times W}$, which contains C feature maps, each of size HxW.
3. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 2, characterized in that: in step 2), the channel attention branch proceeds as follows: first, a global average pooling layer generates a channel-wise statistic $\mu \in \mathbb{R}^{C\times1\times1}$; the average pooling layer acts on each feature channel individually, so the c-th element of $\mu$ is expressed as:

$$\mu_c = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$$

wherein $x_c(i,j)$ represents the pixel value of the c-th channel $x_c$ at position (i, j); the statistic $\mu$ is then fed through two convolutional layers and their activation functions:

$$\alpha = \sigma\!\left(W_{CA}^{2}\,\delta\!\left(W_{CA}^{1}\mu + b_{CA}^{1}\right) + b_{CA}^{2}\right)$$

wherein $W_{CA}^{1}$ and $b_{CA}^{1}$ are the weights and biases of the first convolutional layer, which reduces the number of channels by a scaling ratio $\gamma$; the convolutional layer with parameters $W_{CA}^{2}$ and $b_{CA}^{2}$ restores the number of channels to the original count; $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively;

in addition, the sigmoid activation function $\sigma$ squashes each per-channel attention weight $\alpha$ to a value between 0 and 1, which is used to rescale the input features; after the channel attention coefficient $\alpha$ is obtained, it multiplies each element of the original input features channel-wise to give the final output of the channel attention branch:

$$H_{CA} = F_{CA}\!\left(\alpha, H_{c}^{1}\right) = \alpha \odot H_{c}^{1}$$

wherein $H_{CA}$ represents the final output of the channel attention module, and $F_{CA}$ represents the per-channel multiplication of each channel feature with its corresponding channel weight;

the output $H_{c}^{1}$ of the first concatenation module is also input into the parallel spatial attention branch; the spatial attention module has one fewer global average pooling layer than the channel attention module, because it does not need to compress global spatial information into a per-channel statistical descriptor; the rest of the process is similar to channel attention, and the spatial attention mask coefficient is shown in the following formula:

$$\beta = \sigma\!\left(W_{SA}^{2}\,\delta\!\left(W_{SA}^{1}H_{c}^{1} + b_{SA}^{1}\right) + b_{SA}^{2}\right)$$

wherein $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively; the first convolutional layer, with weights $W_{SA}^{1}$ and biases $b_{SA}^{1}$, generates per-channel feature maps, which are merged into a single attention map by the 1x1 convolutional layer with weights $W_{SA}^{2}$ and biases $b_{SA}^{2}$; the sigmoid function $\sigma$ normalizes the feature map to the range 0-1 to obtain the spatial attention adaptive mask $\beta$; the scaling ratio $\gamma$ of the convolutional layers controls the change in dimensionality; after the spatial attention mask coefficient is obtained, it is multiplied element-wise at each spatial position with the input feature $H_{c}^{1}$ to obtain the final output of the spatial attention branch:

$$H_{SA} = F_{SA}\!\left(\beta, H_{c}^{1}\right) = \beta \odot H_{c}^{1}$$

wherein $H_{SA}$ represents the final output of the spatial attention module, and $F_{SA}$ represents the element-wise multiplication of each spatial-position feature with its corresponding spatial-position weight.
4. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 3, characterized in that: in step 2), after the output $H_{CA}$ of the channel attention module and the output $H_{SA}$ of the spatial attention module are obtained, they are sent as input into a second concatenation module that fuses the features of the two branches; a 1x1 convolutional layer changes the number of feature channels for better transmission between blocks, giving the output of the second concatenation module; a residual connection then adds the input of the MSAB block to this output to obtain the output of the whole MSAB, as shown in the following formula:

$$H_{o} = W_{c}^{2}\left[H_{CA}, H_{SA}\right] + b_{c}^{2} + H_{0}$$

wherein $H_{o}$ represents the final output of the MSAB, $[\cdot]$ represents the concatenation operation, and $W_{c}^{2}$ and $b_{c}^{2}$ represent the weights and biases of the 1x1 convolutional layer in the second concatenation module; to make better use of features rich in low-frequency information, a short skip connection is introduced into the residual block, which not only enables the main part of the network to learn residual information but also avoids the vanishing-gradient problem during network training.
5. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 4, characterized in that: in step 2), the intermediate feature mapping operation adopts a residual-in-residual network design framework: the outer residual is mainly formed by a series of stacked attention groups together with global residual learning, and the inner residual is formed by stacking a series of attention blocks with local residual learning; if there are N attention groups (AG) in the outer residual, and the input and output of the n-th attention group are denoted $AG_{n-1}$ and $AG_n$, then:

$$AG_n = H_n(AG_{n-1}) = H_n\!\left(H_{n-1}\!\left(\cdots H_1(AG_0)\cdots\right)\right)$$

wherein $H_n$ is the operation of the n-th attention group; $AG_0$ is the input of the first attention group, i.e. the output $F_{IFE}$ of the shallow feature module; the attention groups are stacked from a series of MSAB blocks, but simply stacking residual blocks does not effectively utilize the features of previous blocks, so dense connections are introduced between the attention blocks, i.e. the input of every intermediate block is the concatenation of the outputs of the preceding attention blocks; the MSAB mainly comprises dense connections, local feature fusion and local residual learning, forming a continuous memory mechanism: by passing the features of previous MSABs to the current MSAB, a persistent memory storage mechanism is realized; let $F_{m-1}$ and $F_m$ represent the input and output of the m-th MSAB, both having $G_0$ feature maps; the output of the m-th MSAB is then represented as:

$$F_m = M_m\!\left(\left[F_{m-1}, F_{m-2}, \ldots, F_1\right]\right)$$

wherein $M_m$ represents the function of the m-th MSAB, and $[F_{m-1}, F_{m-2}, \ldots, F_1]$ represents the concatenation of the outputs of the 1st to (m-1)-th MSAB blocks; the output of each preceding MSAB layer has a direct connection to all subsequent layers, which not only preserves the feed-forward nature of the network but also extracts locally dense features.
6. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 5, characterized in that: in step 2), local feature fusion is used to adaptively fuse the states of all the convolutional layers in the preceding MSABs with the current MSAB; the feature maps of the (m-1)-th MSAB are introduced directly into the m-th MSAB by concatenation, so reducing the number of features is of great importance; on the other hand, inspired by MemNet, a 1x1 convolutional layer is introduced to adaptively control the output information; this operation, named local feature fusion (LFF), is formulated as:

$$F_{n,LF} = H_{LFF}^{n}\!\left(\left[AG_{n-1}, F_1, \ldots, F_M\right]\right)$$

wherein $F_{n,LF}$ represents the fused output of the MSAB features, $H_{LFF}^{n}$ represents the function of the 1x1 Conv layer in the n-th AG, and $[AG_{n-1}, F_1, \ldots, F_M]$ represents the concatenation of the input of the previous AG with the M MSAB outputs in the current AG; without LFF, a very deep dense network would be difficult to train as the growth rate G becomes larger.
7. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 6, characterized in that: in step 2), local residual learning is also introduced into the AG to further improve the flow of features through the network, and the final output of the n-th AG is represented as:

$$AG_n = AG_{n-1} + F_{n,LF}$$

local residual learning further improves the representational capability of the network and thereby achieves better performance; the output $AG_N$ of the final attention group is then fed into a 3x3 convolutional layer to produce the final output $F_N$ of the entire intermediate feature mapping convolution:

$$F_N = \mathrm{Conv}(AG_N)$$

wherein Conv represents the last convolutional layer; the resulting output $F_N$, the output of the subsequent attention module, and the initial input features $F_{IFE}$ are sent together to the next module.
8. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 7, characterized in that: in step 2), after the intermediate features of the series of intermediate attention groups are extracted, the model introduces a layer attention module whose input is the concatenation of the features of each intermediate attention group, as follows:

$$F_{LA} = H_{LA}\!\left(\left[AG_1, AG_2, \ldots, AG_N\right]\right)$$

wherein $H_{LA}$ represents the introduced inter-layer attention function, whose input is the output of all intermediate layers, so that all feature information from the preceding layers can be fully utilized; $F_{LA}$ represents the output of the inter-layer attention module; and $[AG_1, AG_2, \ldots, AG_N]$ represents the concatenation of the 1st to N-th attention groups.
9. The single-image super-resolution algorithm based on the attention mechanism as claimed in claim 8, characterized in that: in step 2), after the intermediate convolutional output $F_N$ and the output $F_{LA}$ of the inter-layer attention module are obtained, a long skip connection (LSC) is introduced to enhance the stability of network training; the LSC also achieves better network performance through residual learning, i.e. the initial shallow features $F_{IFE}$ are added element-wise together with them, so that the final output of the whole intermediate feature mapping module is represented as:

$$F_{MF} = F_{IFE} + F_N + F_{LA}$$

wherein $F_{MF}$ represents the final output of the intermediate feature mapping step.
CN202210719954.2A 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism Pending CN115170392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210719954.2A CN115170392A (en) 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210719954.2A CN115170392A (en) 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism

Publications (1)

Publication Number Publication Date
CN115170392A true CN115170392A (en) 2022-10-11

Family

ID=83486727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210719954.2A Pending CN115170392A (en) 2022-06-23 2022-06-23 Single-image super-resolution algorithm based on attention mechanism

Country Status (1)

Country Link
CN (1) CN115170392A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594061A (en) * 2023-07-18 2023-08-15 吉林大学 Seismic data denoising method based on multi-scale U-shaped attention network
CN116594061B (en) * 2023-07-18 2023-09-22 吉林大学 Seismic data denoising method based on multi-scale U-shaped attention network
CN117522682A (en) * 2023-12-04 2024-02-06 无锡日联科技股份有限公司 Method, device, equipment and medium for reconstructing resolution of radiographic image

Similar Documents

Publication Publication Date Title
CN112330542B (en) Image reconstruction system and method based on CRCSAN network
CN110033410B (en) Image reconstruction model training method, image super-resolution reconstruction method and device
Hui et al. Fast and accurate single image super-resolution via information distillation network
CN111192200A (en) Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN112102177B (en) Image deblurring method based on compression and excitation mechanism neural network
CN115170392A (en) Single-image super-resolution algorithm based on attention mechanism
Luo et al. Lattice network for lightweight image restoration
CN111242846A (en) Fine-grained scale image super-resolution method based on non-local enhancement network
CN110060204B (en) Single image super-resolution method based on reversible network
CN111951164B (en) Image super-resolution reconstruction network structure and image reconstruction effect analysis method
CN112801904B (en) Hybrid degraded image enhancement method based on convolutional neural network
CN112862689A (en) Image super-resolution reconstruction method and system
CN112561799A (en) Infrared image super-resolution reconstruction method
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN114841856A (en) Image super-pixel reconstruction method of dense connection network based on depth residual channel space attention
CN117575915B (en) Image super-resolution reconstruction method, terminal equipment and storage medium
CN116091313A (en) Image super-resolution network model and reconstruction method
CN113298716A (en) Image super-resolution reconstruction method based on convolutional neural network
CN112819705A (en) Real image denoising method based on mesh structure and long-distance correlation
CN113962882B (en) JPEG image compression artifact eliminating method based on controllable pyramid wavelet network
CN116777745A (en) Image super-resolution reconstruction method based on sparse self-adaptive clustering
CN116485654A (en) Lightweight single-image super-resolution reconstruction method combining convolutional neural network and transducer
CN110111252A (en) Single image super-resolution method based on projection matrix
CN112070676B (en) Picture super-resolution reconstruction method of double-channel multi-perception convolutional neural network
CN113344786B (en) Video transcoding method, device, medium and equipment based on geometric generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination