CN117575915A - Image super-resolution reconstruction method, terminal equipment and storage medium - Google Patents

Image super-resolution reconstruction method, terminal equipment and storage medium

Info

Publication number
CN117575915A
CN117575915A (application number CN202410056441.7A)
Authority
CN
China
Prior art keywords
module
channel
features
image
attention
Prior art date
Legal status
Pending
Application number
CN202410056441.7A
Other languages
Chinese (zh)
Inventor
谢瀚荣
吴昌徽
黄育明
廖源
陈颖频
宋彬辉
胡浩荣
陈晶晶
张钦洪
陈星萍
李倩
张妹珠
Current Assignee
Minnan Normal University
Original Assignee
Minnan Normal University
Priority date
Filing date
Publication date
Application filed by Minnan Normal University filed Critical Minnan Normal University
Priority to CN202410056441.7A priority Critical patent/CN117575915A/en
Publication of CN117575915A publication Critical patent/CN117575915A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image super-resolution reconstruction method, terminal equipment and a storage medium, wherein the method comprises the following steps: constructing an image super-resolution reconstruction model, and training the model through a training set for super-resolution reconstruction of images. The network structure of the model sequentially comprises a shallow feature extraction module, a deep feature extraction module and an up-sampling module. The deep feature extraction module consists of a plurality of enhanced Swin Transformer modules, in which local features and global features are extracted alternately; the extracted global features are channel attention features obtained using a block sparse global perception module, window multi-scale self-attention and a low-parameter residual channel attention module. The invention improves the long-distance modeling capability of the model and enables the model to utilize local information of different levels.

Description

Image super-resolution reconstruction method, terminal equipment and storage medium
Technical Field
The present invention relates to the field of image processing, and in particular, to an image super-resolution reconstruction method, a terminal device, and a storage medium.
Background
Super-resolution reconstruction of a single image is a classical problem in the field of image processing; its main purpose is to generate images with high spatial resolution and clear details. The essence of image super-resolution is to recover the high-frequency signal lost in the low-resolution image, thereby obtaining a high-quality image. Image super-resolution reconstruction technology is widely applied in fields such as remote sensing imaging, infrared imaging and medical imaging.
Image super-resolution methods can be based on interpolation, model driving or data driving. Interpolation-based algorithms have found wide application due to their simplicity and efficiency; however, their reconstruction results suffer from jagged edges, blurring and similar artifacts, which seriously affect SR image quality. Model-driven algorithms use image prior knowledge to recover detail information, but their large computational complexity limits engineering application. With the development and maturity of parallel computing technology, data-driven algorithms have attracted attention from researchers. For example, Dong et al. proposed image super-resolution reconstruction using the SRCNN model of only three convolutional layers (C. Dong, C. Loy, K. He, et al. Learning a deep convolutional network for image super-resolution [C]. Computer Vision - ECCV 2014, Zurich, Switzerland, 2014, 184-199.). This was the first use of a convolutional neural network for image super-resolution reconstruction, and it achieved a better reconstruction effect than interpolation-based and model-driven methods. Researchers have since increased the feature expression capacity of network models by increasing network depth: Simonyan et al. proposed the VGG network of up to 19 layers (K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition [C]. International Conference on Learning Representations, San Diego, CA, USA, 2015, 1-14.), and the ResNet model proposed by He et al. reached 152 layers (K. He, X. Zhang, S. Ren, et al. Deep residual learning for image recognition [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, 770-778.); in addition, that work proposed residual learning to avoid vanishing or exploding gradients and to preserve information integrity. Ledig et al. proposed the SRResNet network based on residual modules, and on this basis proposed the SRGAN model with a generative adversarial structure and a discriminator network; the model can recover SR images with more texture details in larger-scale-factor reconstruction tasks (C. Ledig, L. Theis, F. Huszár, et al. Photo-realistic single image super-resolution using a generative adversarial network [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, 4681-4690.). Wang et al. introduced the Residual-in-Residual Dense Block (RRDB) into the SRGAN model and proposed the ESRGAN model, which learns finer edge information by increasing network depth (X. Wang, K. Yu, S. Wu, et al. ESRGAN: Enhanced super-resolution generative adversarial networks [C]. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018, 63-79.). However, the RRDB consumes a lot of memory and its concatenation operations bring a large amount of computation, which makes it difficult to apply widely in engineering. These works enlarge the spatial resolution of the image only at the last layer of the network. Unlike the previous work, Lai et al. proposed the Laplacian pyramid super-resolution network (LapSRN).
The network reconstructs the detail information of the sub-bands of the high-resolution image in a progressive manner, and this coarse-to-fine image super-resolution reconstruction method is more efficient (W.-S. Lai, J.-B. Huang, N. Ahuja, et al. Deep Laplacian pyramid networks for fast and accurate super-resolution [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, 624-632.). Most CNN-based approaches focus on well-designed architectures such as residual learning and dense connections. Although their performance is significantly improved over traditional model-based approaches, they generally suffer from two fundamental problems. First, the interaction between the image and the convolution kernel is content-independent; using the same convolution kernel to recover different image regions may not be a good choice. Second, under the principle of local processing, convolution is limited for modeling long-distance dependencies (J. Liang, J. Cao, G. Sun, et al. SwinIR: Image restoration using Swin Transformer [C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 2021, 1833-1844.); to enlarge the receptive field, the number of network layers must be increased, resulting in larger computational resource overhead.
These methods adopt a convolutional structure that only perceives local image information, and therefore cannot directly obtain features over a larger range.
Disclosure of Invention
In order to solve the problems, the invention provides an image super-resolution reconstruction method, a terminal device and a storage medium.
The specific scheme is as follows:
an image super-resolution reconstruction method, comprising: constructing an image super-resolution reconstruction model, and training the model through a training set for super-resolution reconstruction of the image;
the network structure of the model sequentially comprises a shallow layer feature extraction module, a deep layer feature extraction module and an up-sampling module; the low-resolution image is input into a shallow feature extraction module to obtain shallow features, the shallow features are input into a deep feature extraction module to obtain deep features, and the shallow features and the deep features are added and then input into an up-sampling module to obtain a super-resolution image;
the deep feature extraction module consists of a plurality of enhanced Swin Transformer modules, in which local features and global features are extracted alternately; when local features are extracted in the enhanced Swin Transformer module, two shift convolutions with a ReLU activation function between them are used, and the extracted global features are channel attention features obtained using a block sparse global perception module, window multi-scale self-attention and a low-parameter residual channel attention module.
Further, the shallow feature extraction module extracts shallow features by using 3×3 convolution.
Further, the upsampling module consists of a 3 x3 convolution and pixel shuffling.
Further, the loss function of the model uses L1 loss.
Further, the block sparse global perception module sequentially performs layer normalization, channel dimension feature mapping and GELU activation function on the input tensor to obtain an intermediate tensor; then, space feature mapping is carried out on the intermediate tensor through a multi-layer perceptron; and finally, carrying out full-connection feature mapping in the channel direction on the tensor after the space feature mapping, and carrying out residual connection on the full-connection feature mapping result and the intermediate tensor to obtain an output tensor.
Further, the window multi-scale self-attention adopts a shift window multi-scale self-attention.
Furthermore, in the low-parameter residual channel attention module, a 1×1 convolution is first used to expand the dimension of the input features; a 3×3 convolution is then used to learn the expanded features and restore them to the same dimension as the input features; finally, a channel attention module is used to select the feature channels.
Further, the low-parameter residual channel attention module is expressed as:

$$Y^{i}=W_{c1}\ast\delta\!\left(W_{e1}\ast X_{\mathrm{in}}^{i}\right)$$

$$X_{\mathrm{out}}^{i}=X_{\mathrm{in}}^{i}+\sigma\!\left(W_{e2}\ast\delta\!\left(W_{c2}\ast \mathrm{GAP}\!\left(Y^{i}\right)\right)\right)\otimes Y^{i}$$

wherein $X_{\mathrm{out}}^{i}$ represents the output features of the i-th low-parameter residual channel attention module; $X_{\mathrm{in}}^{i}$ represents the input features of the i-th low-parameter residual channel attention module; $W_{e1}$ is a 1×1 convolution kernel that expands the feature channel dimension; $W_{c1}$ is a 3×3 convolution kernel that compresses the channel dimension; $W_{e2}$ is a 1×1 convolution kernel that expands the feature channel dimension; $W_{c2}$ is a 1×1 convolution kernel that compresses the channel dimension; the subscripts c1 and c2 denote the sequence numbers of the channel-compressing convolution layers, and e1 and e2 the sequence numbers of the channel-expanding convolution layers; $\delta$ represents the ReLU activation function; $\mathrm{GAP}$ represents two-dimensional global average pooling; $Y^{i}$ represents the intermediate result and is the input of the channel attention function; $\sigma$ represents the activation function; and $\otimes$ represents channel-wise multiplication.
The image super-resolution reconstruction terminal device comprises a processor, a memory and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method according to the embodiment of the invention when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above for embodiments of the present invention.
The invention adopts the technical scheme and has the beneficial effects that:
(1) The image super-resolution reconstruction model can learn global information sparsely, and long-distance modeling capacity of the model is improved.
(2) The multi-head self-attention in the Swin Transformer module is replaced with multi-scale self-attention, so that the model can utilize local information at different levels.
(3) A low-parameter residual channel attention module LRCAB is designed for reassigning the channel weights of features, guiding the model to pay attention to the effective information.
Drawings
Fig. 1 is a schematic structural diagram of a Swin Transformer module according to the first embodiment of the present invention.
Fig. 2 is a schematic diagram showing window multi-head self-attention and shift window multi-head self-attention in this embodiment.
Fig. 3 is a schematic structural diagram of an image super-resolution reconstruction model in this embodiment.
Fig. 4 is a schematic diagram showing the channel expansion shift convolution in this embodiment.
Fig. 5 is a schematic diagram showing the channel compression shift convolution in this embodiment.
Fig. 6 is a schematic diagram of a block sparse global sensing module in this embodiment.
Fig. 7 shows a schematic diagram of the window multi-scale self-attention in this embodiment.
Fig. 8 is a schematic diagram showing the self-attention map in this embodiment.
Fig. 9 is a schematic diagram of a low-parameter residual channel attention module in this embodiment.
Fig. 10 is a diagram showing a multi-scale self-attention calculation of the shift window in this embodiment.
Fig. 11 shows a partially ascribed pictorial view in this embodiment.
Fig. 12 shows a qualitative comparison of the first picture x4 scale with the latest lightweight SR model in this example.
Fig. 13 shows a qualitative comparison of the second picture in this example on the x4 scale with the latest lightweight SR model.
Fig. 14 shows a qualitative comparison of the third picture in this example on the x4 scale with the latest lightweight SR model.
Fig. 15 shows the comparison result of the low-parameter residual channel attention module and the other types of channel attention modules in this embodiment.
Fig. 16 shows SR results and local attribution graph results for a Swin Transformer based lightweight SR network in this embodiment.
Detailed Description
For further illustration of the various embodiments, the invention is provided with the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments and together with the description, serve to explain the principles of the embodiments. With reference to these matters, one of ordinary skill in the art will understand other possible embodiments and advantages of the present invention.
The invention will now be further described with reference to the drawings and detailed description.
Embodiment one:
the embodiment of the invention provides an image super-resolution reconstruction method, and related knowledge is introduced first.
(1) Image super-resolution reconstruction
Images are often disturbed by various factors in the transmission process, so that information is lost, and the quality of the images is affected. The degradation from High-Resolution (HR) images to Low-Resolution (LR) images may occur due to blurring, noise interference, downsampling, and the like. The image Super-Resolution reconstruction technique restores a lost high-frequency signal in a low-Resolution image, thereby acquiring a Super-Resolution (SR) image.
In order to solve the problem of image degradation, scholars at home and abroad have conducted a great deal of research in the field of image super-resolution reconstruction. In recent years, image super-resolution reconstruction technology based on deep learning is rapidly developed, and the mapping from LR images to HR images is learned from a large-scale pair data set, so that good reconstruction effect is achieved. In order to improve the super-resolution reconstruction performance of the model, some reconstruction algorithms add finer neural network architecture, such as residual learning, dense connection, and the like. Other super-resolution reconstruction algorithms apply attention mechanisms in the CNN framework, and achieve good reconstruction performance.
(2) Swin Transformer module
Recently, the Transformer model from the field of natural language processing has been widely applied in computer vision. It focuses on important areas of an image through an attention mechanism, achieves better performance in image processing, and has begun to be used in image super-resolution reconstruction.
Related studies on target detection, target classification, video classification and the like show the great potential of the Transformer model in computer vision. Recent research on the Vision Transformer (ViT) demonstrates its great potential as an alternative to CNN models. The ViT model converts visual problems into sequence-to-sequence problems by dividing pictures into non-overlapping image blocks, and achieves good image classification performance using a pure Transformer structure. The Swin Transformer proposes shifted-window self-attention, realizing information interaction between adjacent windows while reducing the amount of computation.
The Vision Transformer obtains superior classification performance by stacking multiple Transformer blocks and processing non-overlapping image blocks; however, ordinary attention, whose complexity is quadratic in the input length, is difficult to adapt to visual tasks that take high-resolution images as input. Liu et al. proposed the Swin Transformer model (Z. Liu, Y. Lin, Y. Cao, et al. Swin Transformer: Hierarchical vision transformer using shifted windows [C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 2021, 10012-10022.), which reduces computational complexity and improves the local feature modeling capability of the model by calculating the self-attention of image patches within windows. However, the fixed window size constrains the ability of the Transformer to handle objects of different scales. For this reason, this embodiment introduces a Swin Transformer block with multi-scale windows to improve the multi-scale learning ability of the model.
As shown in Figs. 1 and 2, a Swin Transformer block sequentially calculates window multi-head self-attention (Window Multi-head Self Attention, W-MSA) and shifted window multi-head self-attention (Shifted Window Multi-head Self Attention, SW-MSA) to model the texture information of local image regions. A two-layer multi-layer perceptron (MLP) is then used for further feature transformation, with a GELU activation function between the two MLP layers. A layer normalization (LayerNorm, LN) layer is applied before the W/SW-MSA module and the MLP module, respectively, and a residual connection is applied after each module. Because the Swin Transformer computes attention within small windows, the model has an advantage when processing large-scale images. The SwinIR image restoration model applies the Swin Transformer to super-resolution reconstruction and achieves better performance.
Because multi-head self-attention computes self-attention directly over the whole sequence, it increases the computational complexity and training time of the model. The W-MSA proposed by the Swin Transformer limits the scope of the attention calculation to windows, giving it lower computational complexity and an advantage when dealing with large-scale data or long sequences. However, the window multi-head self-attention of the Swin Transformer makes the long-range information interaction of the feature map insufficient, resulting in a limited receptive field. Therefore, this embodiment introduces sparse perception and a multi-scale mechanism on the basis of the Swin Transformer to improve the long-range information interaction capability of the model.
Based on the above background, in order to give the constructed image super-resolution reconstruction model global perception while avoiding excessive additional parameters, this embodiment proposes an enhanced Swin Transformer network (Enhanced Swin Transformer Network, ESTN) with block sparse global perception under a multi-scale view for super-resolution. In addition, in order to analyze the influence of the proposed ESTN network on the receptive field, this embodiment also introduces the local attribution map (Local Attribution Map, LAM) to visualize the sparse global perception of the reconstruction network.
As shown in fig. 3, the image super-resolution reconstruction network ESTN proposed in this embodiment is composed of 3 parts, which are respectively: a shallow feature extraction Module (Shallow Feature Extraction Module, SFEM), a deep feature extraction Module (Deep Feature Extraction Module, DFEM) and an Upsampling Module (UM).
The shallow feature extraction module extracts shallow features by adopting 3×3 convolution.
The deep feature extraction module consists of a plurality of enhanced Swin Transformer modules (Enhanced Swin Transformer Module, ESTM), in which local features and global features are extracted alternately. First, two shift convolutions (Shift-Conv, SC) with varying numbers of feature channels, with a ReLU activation function between them, are used to extract local features, which helps the model restore fine textures. Channel attention features are then extracted using a block sparse global perception module (Block Sparse Global-Awareness Module, BSGM), window multi-scale self-attention (Window Multi-Scale Self Attention, W-MSSA) and a low-parameter residual channel attention block (Low-parameter Residual Channel Attention Block, LRCAB). The sparse global perception of the BSGM effectively enlarges the receptive field of the model, the W-MSSA mines more object information through windows of different sizes, and the LRCAB realizes efficient channel selection by changing the way the channel attention is constructed. Finally, local and global feature extraction is applied to the features again in the same alternating manner, where the multi-scale self-attention module in the global feature extraction adopts shifted-window multi-scale self-attention (Shifted Window Multi-Scale Self Attention, SW-MSSA) and realizes information interaction across windows through a shift operation on the features.
The up-sampling module consists of a 3×3 convolution and pixel shuffle (Pixel Shuffle); it enlarges the image by a specified scale factor and outputs the super-resolution image.
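By way of illustration, a minimal PyTorch sketch of such an up-sampling module may be written as follows; the module name, channel counts and scale factor are illustrative assumptions and are not values taken from this embodiment.

```python
import torch
import torch.nn as nn

class UpsampleModule(nn.Module):
    """3x3 convolution followed by pixel shuffle, enlarging the feature map by `scale`."""
    def __init__(self, channels: int, scale: int, out_channels: int = 3):
        super().__init__()
        # The convolution expands channels so that pixel shuffle can rearrange
        # them into a (scale x scale) larger spatial grid.
        self.conv = nn.Conv2d(channels, out_channels * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

# Usage: a 64-channel feature map upsampled by a factor of 4.
feat = torch.randn(1, 64, 48, 48)
sr = UpsampleModule(channels=64, scale=4)(feat)   # -> (1, 3, 192, 192)
```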
(1) Shallow and deep feature extraction
Given a low-resolution input image $I_{LR}$, shallow features are extracted using convolution kernels with a spatial resolution of 3×3, where each slice convolution operation is defined as:

$$F_{0}^{\,j}=w_{j}\ast I_{LR},\quad j=1,2,\dots,C \tag{1}$$

wherein $w_{j}$ denotes the $j$-th convolution kernel, with spatial resolution 3×3, of the convolution kernel set $W_{SF}$; $\ast$ denotes the convolution operation; $F_{0}$ is the shallow feature; $F_{0}^{\,j}$ denotes the $j$-th horizontal slice of $F_{0}$; and $C$ denotes the number of channels of the intermediate feature. To simplify the expression, subsequent convolutions are written only in a form similar to $F_{0}=W_{SF}\ast I_{LR}$ to express the relationship between the convolution kernel, the convolved tensor and the convolution result.

$$F_{D}=H_{DFEM}\!\left(F_{0}\right) \tag{2}$$

wherein $H_{DFEM}$ represents the deep feature extraction module and $F_{D}$ represents the deep features.

$$F_{i}=H_{ESTM}^{\,i}\!\left(F_{i-1}\right),\quad i=1,2,\dots,n \tag{3}$$

wherein $H_{ESTM}^{\,i}$ denotes the $i$-th ESTM module and $F_{i}$ denotes the output features of the $i$-th ESTM module.
(2) Upsampling module
The shallow feature $F_{0}$ and the deep feature $F_{D}$ are added, and the super-resolution image is then restored through a 3×3 convolution and pixel shuffle:

$$I_{SR}=H_{PS}\!\left(W_{UP}\ast\left(F_{0}+F_{D}\right)\right) \tag{4}$$

wherein $H_{PS}$ represents the pixel shuffle operation; $I_{SR}$ represents the super-resolution image enlarged by a factor of $s$; and $W_{UP}$ represents a convolution kernel with a spatial resolution of 3×3.
(3) Loss function
Adam is used as the optimizer in this embodiment to optimize the parameters of the ESTN reconstruction network by minimizing the L1 loss:

$$\mathcal{L}_{1}=\frac{1}{N}\sum_{i=1}^{N}\left\|I_{SR}^{\,i}-I_{HR}^{\,i}\right\|_{1} \tag{5}$$

wherein $I_{SR}^{\,i}$ and $I_{HR}^{\,i}$ respectively denote the $i$-th ($N$ being the batch size) super-resolution and high-resolution pictures.
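A minimal training-step sketch under these settings may look as follows; the data loader, model object and variable names are illustrative assumptions rather than parts of this embodiment.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of L1-loss training, corresponding to equation (5)."""
    l1 = nn.L1Loss()
    model.train()
    for lr_img, hr_img in loader:                 # paired LR/HR batches
        lr_img, hr_img = lr_img.to(device), hr_img.to(device)
        sr_img = model(lr_img)                    # reconstructed SR image
        loss = l1(sr_img, hr_img)                 # mean absolute error over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Example optimizer matching the settings described later in this embodiment:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-8)
```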
(4) Enhanced Swin Transformer module
Existing Swin Transformer-based image super-resolution reconstruction networks are limited by the small size of the attention window; their ability to construct long-distance dependencies is very limited, so the quality of the recovered SR image is poor. To solve this problem, this embodiment introduces the BSGM into the Swin Transformer so that the resulting network can better construct long-distance dependencies. Meanwhile, the MSA of the Swin Transformer is replaced by the MSSA to attend to multi-scale information. As shown in Fig. 3, the ESTM consists of SC, BSGM, W/SW-MSSA and LRCAB.
Stage 1: local feature extraction stage
Figs. 4 and 5 show the details of the first-stage local feature extraction in Fig. 3. As shown in Fig. 4, the features are processed by a shift convolution and a 1×1 convolution to extract local features and expand the channel dimension, respectively, as shown in equation (6):

$$X_{e}=\delta\!\left(W_{e}\ast\left(W_{s}\odot X\right)\right) \tag{6}$$

wherein $W_{s}$ represents the shift convolution kernels stacked along the channel direction in five groups, as shown in Fig. 4; $\odot$ represents the layer-by-layer (channel-wise) convolution operator; $W_{e}$ represents a 1×1 convolution kernel that expands the feature channel dimension; $\delta$ represents the ReLU activation function; and $X_{e}$ represents the features after expanding the channel dimension.

As shown in Fig. 5, the feature $X_{e}$ is shift-convolved and its channel dimension is then compressed with a 1×1 convolution kernel so that it is identical to the channel dimension of the input feature, as shown in equation (7):

$$X_{c}=W_{c}\ast\left(W_{s}\odot X_{e}\right) \tag{7}$$

wherein $W_{s}$ represents the convolution kernel that spatially shifts the feature; $W_{c}$ represents a 1×1 convolution kernel that compresses the channel dimension; and $X_{c}$ represents the features after compressing the channel dimension.

The input feature $X$ and the channel-compressed feature $X_{c}$ are connected by a residual to obtain the shifted local feature $X_{SC}$:

$$X_{SC}=X+X_{c} \tag{8}$$
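As an illustrative sketch of equations (6)-(8), the shift convolution is realised below with `torch.roll` over five channel groups, which is only one plausible realisation of the shift convolution described above; the channel counts and expansion ratio are assumptions.

```python
import torch
import torch.nn as nn

def shift_feature(x: torch.Tensor) -> torch.Tensor:
    """Split channels into five groups and shift four of them by one pixel
    (right/left/down/up); the fifth group is left unchanged."""
    g = x.shape[1] // 5
    parts = list(torch.split(x, [g, g, g, g, x.shape[1] - 4 * g], dim=1))
    parts[0] = torch.roll(parts[0], shifts=1, dims=3)    # right
    parts[1] = torch.roll(parts[1], shifts=-1, dims=3)   # left
    parts[2] = torch.roll(parts[2], shifts=1, dims=2)    # down
    parts[3] = torch.roll(parts[3], shifts=-1, dims=2)   # up
    return torch.cat(parts, dim=1)

class ShiftConvBlock(nn.Module):
    """Shift conv -> 1x1 expand -> ReLU -> shift conv -> 1x1 compress -> residual,
    mirroring equations (6)-(8)."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * expansion, kernel_size=1)
        self.compress = nn.Conv2d(channels * expansion, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.expand(shift_feature(x)))    # equation (6)
        y = self.compress(shift_feature(y))            # equation (7)
        return x + y                                   # equation (8)
```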
Stage 2: global feature extraction stage
(a) Block sparse global perception module in phase 2
In the embodiment, BSGM is adopted to construct sparse global perception for the features.
$$X_{B}=H_{BSGM}^{\,i}\!\left(X_{SC}\right) \tag{9}$$

wherein $H_{BSGM}^{\,i}$ denotes the BSGM of the second stage in the $i$-th ESTM module.

The BSGM is shown in Fig. 6. Assume the input tensor is $X$. Layer normalization, a channel-dimension feature mapping and a GELU activation are applied in sequence to obtain the intermediate tensor $Z$:

$$Z=\mathrm{GELU}\!\left(\mathrm{FC}_{c}\!\left(\mathrm{LN}\!\left(X\right)\right)\right) \tag{10}$$

wherein $\mathrm{FC}_{c}$ represents a fully connected feature mapping layer in the channel direction.

Spatial feature mapping is then performed on $Z$:

$$Z_{p}=\mathrm{Permute}\!\left(\mathrm{Partition}\!\left(Z\right)\right) \tag{11}$$

wherein $\mathrm{Partition}$ represents dividing the tensor into blocks of a specified size and $\mathrm{Permute}$ represents changing the spatial arrangement of the tensor.

$$Z_{s}=\mathrm{MLP}\!\left(Z_{p}\right) \tag{12}$$

wherein $\mathrm{MLP}$ represents the spatial mapping operation of the multi-layer perceptron.

$$Z_{r}=\mathrm{Reverse}\!\left(Z_{s}\right) \tag{13}$$

wherein $\mathrm{Reverse}$ represents restoring the tensor to its original size.

Finally, a fully connected feature mapping in the channel direction is applied to the tensor $Z_{r}$, and a residual connection with $Z$ gives the output tensor $Y$:

$$Y=\mathrm{FC}_{c}\!\left(Z_{r}\right)+Z \tag{14}$$
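A minimal sketch of such a block sparse global perception module is given below, following equations (10)-(14). The particular block rearrangement (mixing pixels that occupy the same offset in every block), the fixed block-grid size and all names are assumptions for illustration, not the exact construction of this embodiment.

```python
import torch
import torch.nn as nn

class BSGM(nn.Module):
    """Channel mapping with GELU, a spatial MLP over the block grid,
    then a channel-direction mapping with a residual to the intermediate tensor."""
    def __init__(self, dim: int, block: int = 4, grid_tokens: int = 64):
        super().__init__()
        self.block = block
        self.norm = nn.LayerNorm(dim)
        self.fc_in = nn.Linear(dim, dim)
        self.act = nn.GELU()
        # Spatial MLP over the block-grid dimension; grid_tokens must equal
        # (H // block) * (W // block) for the given input size.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(grid_tokens, grid_tokens), nn.GELU(), nn.Linear(grid_tokens, grid_tokens)
        )
        self.fc_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        b, h, w, c = x.shape
        s = self.block
        z = self.act(self.fc_in(self.norm(x)))                        # equation (10)
        # Partition into s x s blocks and move the block-grid axis last  (equation (11)).
        z_p = z.view(b, h // s, s, w // s, s, c)
        z_p = z_p.permute(0, 2, 4, 5, 1, 3).reshape(b, s, s, c, -1)
        z_s = self.spatial_mlp(z_p)                                   # equation (12)
        # Restore the original spatial arrangement                       (equation (13)).
        z_r = z_s.view(b, s, s, c, h // s, w // s).permute(0, 4, 1, 5, 2, 3)
        z_r = z_r.reshape(b, h, w, c)
        return self.fc_out(z_r) + z                                   # equation (14)

# Usage: a 32x32 feature map with 60 channels and 4x4 blocks gives an 8x8 block grid (64 tokens).
y = BSGM(dim=60, block=4, grid_tokens=64)(torch.randn(2, 32, 32, 60))
```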
(b) Window multi-scale self-attention module in stage 2
The present embodiment introduces MSSA so that the model can learn multi-scale information.
$$X_{M}=H_{W\text{-}MSSA}^{\,i}\!\left(X_{B}\right) \tag{15}$$

wherein $H_{W\text{-}MSSA}^{\,i}$ denotes the stage-2 W-MSSA module in the $i$-th ESTM module.

After the BSGM establishes sparse global perception on the features, multi-scale self-attention is calculated. As shown in Fig. 7, the tensor is first divided equally into three parts along the channel dimension. Attention matrices are then calculated using windowed self-attention modules of three different scales (W-SA$_s$, $s=0,1,2$) to process objects of different scales, where the self-attention range is marked in yellow. The self-attention matrices are obtained as shown in Fig. 8: the query matrix $Q$, key matrix $K$ and value matrix $V$ are obtained using 1×1 convolutions, respectively. To ensure that the model can handle images of different resolutions, this embodiment uses reflective padding at the image boundaries so that the image size is an integer multiple of each window size.

$$\mathrm{Attention}\!\left(Q,K,V\right)=\mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{M}}\right)V \tag{16}$$

wherein $\mathrm{SoftMax}$ is the activation function along the column direction and $M$ represents the size of the local window.
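A simplified sketch of the window multi-scale self-attention is given below: the channel dimension is split into three groups and each group is attended within windows of a different size, with reflective padding for divisibility. The 1×1 Q/K/V projections are omitted and the softmax scaling is an assumption; this is a sketch of the mechanism, not the exact implementation of this embodiment.

```python
import torch
import torch.nn.functional as F

def window_self_attention(x: torch.Tensor, win: int) -> torch.Tensor:
    """Self-attention within non-overlapping win x win windows.
    x: (B, C, H, W); reflective padding makes H and W multiples of `win`."""
    b, c, h, w = x.shape
    pad_h, pad_w = (-h) % win, (-w) % win
    x = F.pad(x, (0, pad_w, 0, pad_h), mode="reflect")
    hp, wp = x.shape[2], x.shape[3]
    # Partition into windows: (B * num_windows, win*win, C).
    xw = x.view(b, c, hp // win, win, wp // win, win)
    xw = xw.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, c)
    q, k, v = xw, xw, xw                      # 1x1-conv projections omitted for brevity
    attn = torch.softmax(q @ k.transpose(-2, -1) / (c ** 0.5), dim=-1)
    out = attn @ v
    out = out.view(b, hp // win, wp // win, win, win, c).permute(0, 5, 1, 3, 2, 4)
    out = out.reshape(b, c, hp, wp)
    return out[:, :, :h, :w]                  # remove padding

def w_mssa(x: torch.Tensor, wins=(4, 8, 16)) -> torch.Tensor:
    """Window multi-scale self-attention: split channels into three groups and
    run window attention at a different scale on each group."""
    chunks = torch.chunk(x, len(wins), dim=1)
    return torch.cat([window_self_attention(c, w) for c, w in zip(chunks, wins)], dim=1)
```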
(c) Low parameter residual channel attention module in phase 2
In recent years, attention mechanisms have been widely used because of their excellent performance. Since each channel feature contributes differently to the super-resolution reconstruction result, this embodiment introduces channel attention to select the feature channels. Meanwhile, to avoid adding excessive network parameters and computation, this embodiment reconstructs the channel attention into the LRCAB.
As shown in Fig. 9, a 1×1 convolution is used to expand the dimension of the input features. Expanding the feature dimension can produce richer features, such as texture features of different directions and different frequencies. These features are then learned with a 3×3 convolution and restored to the same dimension as the input features. Finally, a channel attention module is used to select the feature channels.
$$Y^{i}=W_{c1}\ast\delta\!\left(W_{e1}\ast X_{M}\right) \tag{17}$$

wherein $W_{e1}$ represents a 1×1 convolution kernel that expands the feature channel dimension and $W_{c1}$ represents a 3×3 convolution kernel that compresses the channel dimension.

$$X_{\mathrm{out}}=X_{M}+\sigma\!\left(W_{e2}\ast\delta\!\left(W_{c2}\ast\mathrm{GAP}\!\left(Y^{i}\right)\right)\right)\otimes Y^{i} \tag{18}$$

wherein $W_{e2}$ represents a 1×1 convolution kernel that expands the feature channel dimension; $W_{c2}$ represents a 1×1 convolution kernel that reduces the channel dimension; $\mathrm{GAP}$ represents the two-dimensional global average pooling function; $\sigma$ is the activation function; and $\otimes$ represents channel-wise multiplication.
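A minimal PyTorch sketch of such a low-parameter residual channel attention block, following equations (17)-(18), is given below. The expansion and reduction ratios and the use of a sigmoid as the activation $\sigma$ are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LRCAB(nn.Module):
    """1x1 conv expands channels, ReLU, 3x3 conv compresses them back, then an
    SE-style channel attention re-weights the result before the residual add."""
    def __init__(self, channels: int, expansion: int = 2, reduction: int = 16):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * expansion, kernel_size=1)                 # W_e1
        self.compress = nn.Conv2d(channels * expansion, channels, kernel_size=3, padding=1)    # W_c1
        self.act = nn.ReLU(inplace=True)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                          # GAP
            nn.Conv2d(channels, channels // reduction, kernel_size=1),        # W_c2
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),        # W_e2
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.compress(self.act(self.expand(x)))     # equation (17)
        return x + self.attn(y) * y                     # equation (18)
```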
Stage 3: local feature extraction stage
Equations (19)-(20) give the mathematical expressions for the third-stage local feature extraction in Fig. 3.

$$X_{e}^{\prime}=\delta\!\left(W_{e}^{\prime}\ast\left(W_{s}\odot X_{\mathrm{out}}\right)\right) \tag{19}$$

$$X_{SC}^{\prime}=X_{\mathrm{out}}+W_{c}^{\prime}\ast\left(W_{s}\odot X_{e}^{\prime}\right) \tag{20}$$

wherein $W_{e}^{\prime}$ represents a 1×1 convolution kernel that expands the number of feature channels; $W_{c}^{\prime}$ represents a 1×1 convolution kernel that compresses the number of channels; and $X_{SC}^{\prime}$ represents the local features output by the third-stage shift convolution in the ESTM.
Stage 4: global feature extraction stage
(a) Block sparse global perception module in phase 4
Equation (21) gives the mathematical expression of stage 4 of the ESTM in Fig. 3, which learns sparse global perception in a manner similar to stage 2.

$$X_{B}^{\prime}=H_{BSGM}^{\prime\,i}\!\left(X_{SC}^{\prime}\right) \tag{21}$$

wherein $H_{BSGM}^{\prime\,i}$ denotes the BSGM module in stage 4 of the $i$-th ESTM module.
(b) Shift window multi-scale self-attention module in stage 4
As shown in fig. 10, the SW-MSSA has more cyclic shift and inverse cyclic shift operations than the W-MSSA. The cyclic shift is a distance of half the current window size. The shift window multi-scale attention is calculated by equation (22).
$$X_{M}^{\prime}=H_{SW\text{-}MSSA}^{\,i}\!\left(X_{B}^{\prime}\right) \tag{22}$$

wherein $H_{SW\text{-}MSSA}^{\,i}$ denotes the SW-MSSA module in stage 4 of the $i$-th ESTM module.
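Building on the `window_self_attention` sketch above, the shifted-window variant may be illustrated as follows; shifting each channel group by half of its own window size with `torch.roll` is one interpretation of "half the current window size" and is an assumption, not the exact implementation of this embodiment.

```python
import torch

def sw_mssa(x: torch.Tensor, wins=(4, 8, 16)) -> torch.Tensor:
    """Shifted-window multi-scale self-attention: each channel group is cyclically
    shifted by half of its own window size, attended within windows, then shifted back."""
    chunks = torch.chunk(x, len(wins), dim=1)
    outs = []
    for c, w in zip(chunks, wins):
        s = w // 2
        c = torch.roll(c, shifts=(-s, -s), dims=(2, 3))           # cyclic shift
        c = window_self_attention(c, w)                           # window attention as in W-MSSA
        outs.append(torch.roll(c, shifts=(s, s), dims=(2, 3)))    # inverse cyclic shift
    return torch.cat(outs, dim=1)
```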
(c) Low parameter residual channel attention module in stage 4
Similar to phase 2, channel attention is calculated to reassign channel weights.
$$F_{i}=H_{LRCAB}^{\prime\,i}\!\left(X_{M}^{\prime}\right) \tag{23}$$

wherein $H_{LRCAB}^{\prime\,i}$ denotes the LRCAB in stage 4 of the $i$-th ESTM module.
(5) Local attribution map
To explore the global information modeling capability of the proposed BSGM module, the LAM is introduced for verification in this embodiment. The LAM back-propagates gradients along an integration path and computes the relationship between the generation of local features in the output image and the pixels of the LR image.
As shown in Fig. 11, the low-resolution image $I_{LR}$ is reconstructed into a super-resolution image $I_{SR}$ by the super-resolution reconstruction network. A region of interest is then selected in the super-resolution image $I_{SR}$, features are extracted from it, and the relationship between these features and the pixels of the low-resolution image $I_{LR}$ is analyzed. Dark pixels in the LAM result indicate a large contribution to restoring the selected area. The LAM result of the $j$-th dimension can be calculated from equation (24):

$$\mathrm{LAM}_{F,D}\!\left(\gamma\right)_{j}=\int_{0}^{1}\frac{\partial D\!\left(F\!\left(\gamma\!\left(t\right)\right)\right)}{\partial\gamma_{j}\!\left(t\right)}\,\frac{\partial\gamma_{j}\!\left(t\right)}{\partial t}\,dt \tag{24}$$

wherein $F$ and $D$ respectively represent the super-resolution network and the local feature extractor; $\gamma$ represents the smooth path function, with $\gamma(0)$ the blurred version of the input image and $\gamma(1)$ the input image without blurring; $\mathrm{LAM}_{F,D}(\gamma)_{j}$ denotes the local attribution result of the $j$-th dimension; the local attribution map is calculated for a group of images and the average is taken as the final LAM result.
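A highly simplified sketch in the spirit of equation (24) is given below. The linear blur-to-sharp path, the average-pooling blur and the region-of-interest extractor are assumptions chosen only for illustration and do not reproduce the exact LAM formulation.

```python
import torch
import torch.nn.functional as F

def lam(sr_net, extract_feature, lr_img: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """Path-integrated gradients from a blurred input towards the original LR image.
    `extract_feature` maps the SR output to a scalar summarising the selected region."""
    blurred = F.avg_pool2d(lr_img, kernel_size=9, stride=1, padding=4)  # gamma(0): blurred image
    attribution = torch.zeros_like(lr_img)
    for t in range(1, steps + 1):
        alpha = t / steps
        point = (blurred + alpha * (lr_img - blurred)).detach().requires_grad_(True)
        score = extract_feature(sr_net(point))          # scalar response of the region of interest
        grad = torch.autograd.grad(score, point)[0]     # d D(F(gamma(t))) / d gamma(t)
        attribution += grad / steps
    return attribution * (lr_img - blurred)             # multiply by the path direction

# Example ROI extractor: mean feature magnitude inside a 32x32 crop (illustrative choice).
def roi_mean(sr):  # sr: (1, 3, H, W)
    return sr[:, :, 100:132, 100:132].abs().mean()
```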
Experimental analysis
In this embodiment, two-, three- and four-fold scale single-image super-resolution experiments were performed on five datasets, and the proposed network was compared with the most advanced networks, quantitatively and qualitatively verifying the superior performance of the proposed ESTN reconstruction network. Comprehensive ablation experiments evaluate the function of each component of the proposed ESTN reconstruction network. Finally, the receptive field of the proposed ESTN network is visualized and analyzed using the local attribution map.
This embodiment trains the proposed ESTN network on the DIV2K super-resolution dataset, which contains 800 LR-HR image pairs. HR images are cropped to 256×256 and the mini-batch size is 64. Comparison with the most advanced methods uses five test datasets: Set5, Set14, BSD100, Urban100 and Manga109.
(1) Training arrangement
This embodiment trains the two-fold, three-fold and four-fold scale super-resolution reconstruction tasks separately. The ESTN reconstruction network is composed of 12 ESTMs; the window of the BSGM is set to 4×4, and the multi-scale windows of the W/SW-MSSA are set to 4×4, 8×8 and 16×16. Within one ESTM, the attention scores calculated in the W-MSSA are shared with the SW-MSSA to reduce the amount of computation. Training image pairs are generated by bicubic downsampling, and 64 image blocks of size 64×64 are randomly cropped from the LR images as a training batch for the ESTN network. This embodiment trains the network with an initial learning rate of 0.0002, halving the learning rate at epochs 250, 400, 425, 450 and 475, for a total of 500 epochs. For optimization, this embodiment uses the Adam optimizer with a weight decay of 1e-8. All experiments were performed on two NVIDIA RTX 3090 GPUs.
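By way of illustration, the optimization schedule described above may be sketched as follows; `model`, `loader` and `train_one_epoch` refer to the training-step sketch earlier and are assumptions for illustration.

```python
import torch

# Initial lr 2e-4, weight decay 1e-8, lr halved at epochs 250, 400, 425, 450 and 475,
# for a total of 500 epochs, as described above.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-8)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250, 400, 425, 450, 475], gamma=0.5)

for epoch in range(500):
    train_one_epoch(model, loader, optimizer)   # see the training-step sketch above
    scheduler.step()
```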
(2) Test setup
This embodiment focuses mainly on the lightweight property and reconstruction effect of the model; the lightweight property is measured mainly by the number of parameters (Params) and floating-point operations (FLOPs). The FLOPs are calculated at an output SR resolution of 1280×720. The reconstruction effect is evaluated with the widely used PSNR and SSIM indexes: the super-resolution image is converted from the RGB channels to YCbCr space, and the metrics are then calculated on the Y channel.
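A minimal sketch of the PSNR evaluation on the Y channel is given below; the BT.601 conversion coefficients are the conventional choice for SR evaluation and border cropping is omitted, both assumptions for illustration.

```python
import torch

def rgb_to_y(img: torch.Tensor) -> torch.Tensor:
    """Luma (Y) channel of an RGB image in [0, 1], using the BT.601 conversion
    commonly adopted for SR evaluation."""
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def psnr_y(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """PSNR computed on the Y channel, as in the test protocol described above."""
    mse = torch.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```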
(3) Experimental results
The embodiment compares the ESTN reconstruction network with 7 most advanced single image super-resolution lightweight SR models: SRCNN, CARN, IMDN, LAPAR-A, ESRT, ELAN-light and SwinIR-light.
(a) Quantitative comparison
As shown in Table 1 (quantitative comparison (average PSNR/SSIM) of lightweight image super-resolution methods on the benchmark datasets), the ESTN reconstruction network proposed in this embodiment obtains the most advanced performance indexes for the three scale super-resolution reconstructions on the five test sets. In the ×4 scale super-resolution results, the performance gains are larger on the Urban100 and Manga109 datasets, which are more difficult to reconstruct. In the ×4 scale super-resolution reconstruction results on the Manga109 dataset, the PSNR of our ESTN reconstruction network is improved by 0.21 dB over ELAN-light and SwinIR-light, and good performance improvements are also achieved on the Set5, Set14, BSD100 and Urban100 datasets. Compared with SwinIR-light, the ESTN super-resolution reconstruction network proposed in this embodiment achieves better performance with fewer parameters.
TABLE 1
(b) Qualitative comparison
As shown in Figs. 12, 13 and 14, this example performs a qualitative comparison of ×4 SR results on three images, img044, img078 and img092, from Urban100.
As shown in Fig. 12, the magnified regions of the SR images produced by the CNN-based CARN, IMDN and LAPAR-A models are very blurred and visually poor. The magnified regions of the SR images produced by the Transformer-based ESRT, ELAN-light and SwinIR-light can restore the texture of the image but still show edge blurring. The ESTN of this embodiment can restore SR images with sharp edges. As shown in Fig. 13, only the ESTN of this embodiment restores the correct texture in the magnified region of the SR image; both the CNN-based CARN, IMDN and LAPAR-A models and the Transformer-based ESRT, ELAN-light and SwinIR-light restore erroneous textures. In Fig. 14, only the ESTN of this embodiment recovers image textures in several different directions in the magnified region of the SR image: the texture directions restored by the CNN-based CARN, IMDN and LAPAR-A are completely blurred, and the magnified regions from the Transformer-based ESRT, ELAN-light and SwinIR-light cannot recover textures in different directions simultaneously.
In summary, the CNN-based SR models have the worst reconstruction effect, which may be due to their smaller receptive field. Algorithms based on the Transformer model improve SR image recovery over the CNN-based ones, but there is still room for improvement. On this basis, this embodiment starts from improving the receptive field of the model and introduces sparse global perception, so that the model has a global receptive field and the textures in the magnified areas of the restored SR images are more accurate.
Through the qualitative and quantitative analysis, the method of the embodiment is superior to other advanced methods, the reconstructed SR image is closer to the HR image, and the effectiveness of sparse global perception modeling is proved.
Ablation experiments
To better understand how the ESTN works, this example conducted comprehensive ablation experiments to evaluate the effect of each component of the ESTN, as well as the experimental results of different designs of the low-parameter channel attention module. The models in the ablation experiments are trained with a batch size of 4; the remaining parameters are consistent with the experimental settings above.
(1) Component parts of ESTN
The total number of floating-point operations (FLOPs) and the parameter count (Params) are the reference indexes for measuring a lightweight network. Therefore, to demonstrate the efficiency of the proposed improvement strategies intuitively, this embodiment performs ablation experiments with and without the BSGM and the low-parameter residual channel attention module LRCAB. As shown in Table 2 (module ablation experiments at ×4 resolution on the Manga109 dataset), the network equipped with the BSGM module improves PSNR by 0.12 dB over the ELAN-light network, while the parameters increase by only 114K and the FLOPs by 17G. The network with the LRCAB module improves PSNR by 0.09 dB over the ELAN-light network with BSGM, while the FLOPs increase by only 4G and the parameters by 168K.
TABLE 2
(2) Low parameter channel attention ablation experiment
To demonstrate the efficiency of the LRCAB, four channel attention designs are compared in this example, as shown in Fig. 15. As shown in Table 3 (channel attention module ablation experiments at ×4 resolution on the Manga109 dataset), first, the three channel attention modules in Fig. 15 that expand and then compress the feature dimension before calculating channel attention all improve the PSNR index compared with the residual channel attention block (Residual Channel Attention Block, RCAB). Second, using two 1×1 convolutions in the second channel attention module of Fig. 15 to expand the feature dimension before calculating channel attention provides a more limited improvement. The third channel attention module in Fig. 15 uses two 3×3 convolutions to expand and compress the channel number before calculating channel attention; although its performance is the best, it adds 173K parameters compared with the original RCAB. The LRCAB proposed in this embodiment, shown as the fourth channel attention module in Fig. 15, expands the channel number with a 1×1 convolution and compresses it with a 3×3 convolution before calculating channel attention, which is more efficient: compared with the original RCAB, the number of parameters and the FLOPs increase by only 20K and 4G, respectively.
TABLE 3 Table 3
(3) Actual receptive field analysis
As shown in Fig. 16, this example shows the SR and LAM results of the Transformer-based models. In the LAM results, dark pixels indicate a large influence on the recovery of the selected region of interest. From the LAM results of the Swin Transformer-based ELAN-light and SwinIR-light, it can be seen that the dark pixels are distributed mainly around the selected area, indicating limited global perception. In the LAM result of the ESTN model proposed in this embodiment, dark pixels are sparsely distributed over the whole image in addition to the dense dark pixels around the selected region of interest. This shows that the proposed model can use information from the entire input LR image to recover the selected region of interest, which is advantageous for reconstructing accurate, sharp textures.
The embodiment of the invention provides the ESTN with block sparse global information perception under a multi-scale view. First, the ESTM is designed to realize global information perception and local multi-scale information perception with only a small increase in parameters. On this basis, a low-parameter residual channel attention module is designed to redistribute the channel weights of the features. Then, the sparse global perception effect of the proposed ESTM module is visualized using the local attribution map. Finally, experiments show that the ESTN network achieves good performance improvements on the public image super-resolution datasets Set5, Set14, BSD100, Urban100 and Manga109.
The embodiment of the invention has the following technical effects:
(1) Swin Transformer-based image super-resolution models have a small receptive field. This embodiment overcomes this drawback by introducing sparse global perception and multi-scale self-attention information, obtaining a better image super-resolution reconstruction effect with only a small increase in parameters. Furthermore, replacing the MSA in the Swin Transformer module with the MSSA enables the model to utilize local information at different levels.
(2) Feature channels are selected without introducing an excessive number of parameters into the model. The LRCAB adds only a small number of parameters while achieving better super-resolution reconstruction performance.
(3) Since the interpretability of current neural networks is limited, the LAM is introduced to visualize the attribution results for a selected region of interest, in order to demonstrate the advantage of the proposed BSGM in enlarging the receptive field of the network. Experiments show that the BSGM can construct sparse global perception, so that the ESTN reconstruction network can effectively utilize the global information of the LR image for super-resolution reconstruction.
Embodiment two:
the invention also provides an image super-resolution reconstruction terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the steps in the method embodiment of the first embodiment of the invention are realized when the processor executes the computer program.
Further, as an executable scheme, the image super-resolution reconstruction terminal device may be a computing device such as a desktop computer, a notebook computer, a palm computer, and a cloud server. The image super-resolution reconstruction terminal device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the above-mentioned constituent structure of the image super-resolution reconstruction terminal device is merely an example of the image super-resolution reconstruction terminal device, and does not constitute limitation of the image super-resolution reconstruction terminal device, and may include more or less components than the above, or may combine some components, or different components, for example, the image super-resolution reconstruction terminal device may further include an input/output device, a network access device, a bus, and the like, which is not limited by the embodiment of the present invention.
Further, as an executable scheme, the processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, etc. The general processor may be a microprocessor or the processor may be any conventional processor, etc., and the processor is a control center of the image super-resolution reconstruction terminal device, and connects various parts of the whole image super-resolution reconstruction terminal device by using various interfaces and lines.
The memory may be used to store the computer program and/or the module, and the processor may implement various functions of the image super-resolution reconstruction terminal device by running or executing the computer program and/or the module stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the cellular phone, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart Media Card (SMC), secure Digital (SD) Card, flash Card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
The present invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements the steps of the above-described method of an embodiment of the present invention.
The modules/units integrated in the image super-resolution reconstruction terminal device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a software distribution medium, and so forth.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. An image super-resolution reconstruction method, which is characterized by comprising the following steps: constructing an image super-resolution reconstruction model, and training the model through a training set for super-resolution reconstruction of the image;
the network structure of the model sequentially comprises a shallow layer feature extraction module, a deep layer feature extraction module and an up-sampling module; the low-resolution image is input into a shallow feature extraction module to obtain shallow features, the shallow features are input into a deep feature extraction module to obtain deep features, and the shallow features and the deep features are added and then input into an up-sampling module to obtain a super-resolution image;
the deep feature extraction module consists of a plurality of enhanced Swin Transformer modules, in which local features and global features are extracted alternately; when local features are extracted in the enhanced Swin Transformer module, two shift convolutions with a ReLU activation function between them are used, and the extracted global features are channel attention features obtained using a block sparse global perception module, window multi-scale self-attention and a low-parameter residual channel attention module; the low-parameter residual channel attention module is expressed as:

$$Y^{i}=W_{c1}\ast\delta\!\left(W_{e1}\ast X_{\mathrm{in}}^{i}\right)$$

$$X_{\mathrm{out}}^{i}=X_{\mathrm{in}}^{i}+\sigma\!\left(W_{e2}\ast\delta\!\left(W_{c2}\ast\mathrm{GAP}\!\left(Y^{i}\right)\right)\right)\otimes Y^{i}$$

wherein $X_{\mathrm{out}}^{i}$ represents the output features of the i-th low-parameter residual channel attention module; $X_{\mathrm{in}}^{i}$ represents the input features of the i-th low-parameter residual channel attention module; $W_{e1}$ is a 1×1 convolution kernel that expands the feature channel dimension; $W_{c1}$ is a 3×3 convolution kernel that compresses the channel dimension; $W_{e2}$ is a 1×1 convolution kernel that expands the feature channel dimension; $W_{c2}$ is a 1×1 convolution kernel that compresses the channel dimension; the subscripts c1 and c2 denote the sequence numbers of the channel-compressing convolution layers, and e1 and e2 the sequence numbers of the channel-expanding convolution layers; $\delta$ represents the ReLU activation function; $\mathrm{GAP}$ represents two-dimensional global average pooling; $Y^{i}$ represents the intermediate result and is the input of the channel attention function; $\sigma$ represents the activation function; and $\otimes$ represents channel-wise multiplication.
2. The image super-resolution reconstruction method according to claim 1, wherein: the shallow feature extraction module extracts shallow features by adopting 3×3 convolution.
3. The image super-resolution reconstruction method according to claim 1, wherein: the up-sampling module consists of a 3 x3 convolution and pixel shuffling.
4. The image super-resolution reconstruction method according to claim 1, wherein: the loss function of the model uses the L1 loss.
5. The image super-resolution reconstruction method according to claim 1, wherein: sequentially carrying out layer normalization, channel dimension feature mapping and GELU activation functions on the input tensor in the block sparse global perception module to obtain an intermediate tensor; then, space feature mapping is carried out on the intermediate tensor through a multi-layer perceptron; and finally, carrying out full-connection feature mapping in the channel direction on the tensor after the space feature mapping, and carrying out residual connection on the full-connection feature mapping result and the intermediate tensor to obtain an output tensor.
6. The image super-resolution reconstruction method according to claim 1, wherein: the window multi-scale self-attention adopts a shift window multi-scale self-attention.
7. The image super-resolution reconstruction method according to claim 1, wherein: in the low-parameter residual error channel attention module, firstly, 1X 1 convolution is adopted to expand the dimension of the input characteristics; then, 3X 3 convolution is adopted to learn the feature after dimension expansion and restore the feature to the same dimension as the input feature; finally, a channel attention module is used for selecting the channel of the feature.
8. An image super-resolution reconstruction terminal device, which is characterized in that: comprising a processor, a memory and a computer program stored in the memory and running on the processor, the processor implementing the steps of the method according to any one of claims 1 to 7 when the computer program is executed.
9. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
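For reference, the following minimal PyTorch sketch shows one way the low-parameter residual channel attention module recited in claims 1 and 7 could be realised: a 1×1 expansion, a ReLU, a 3×3 convolution that restores the input width, an SE-style channel attention branch over globally pooled features, and a residual connection. The class name, the expansion ratio, the reduction factor and the sigmoid gate are assumptions of this sketch, not features fixed by the claims.

import torch
import torch.nn as nn

class LowParamResidualChannelAttention(nn.Module):
    # Sketch of claims 1 and 7: 1x1 expand (W_e1) -> ReLU -> 3x3 restore (W_c1),
    # then channel attention built from global average pooling, a 1x1 compression
    # (W_c2), a ReLU, a 1x1 expansion (W_e2) and an assumed sigmoid gate; the
    # reweighted features are added back to the input (residual connection).
    def __init__(self, channels: int, expand_ratio: int = 2, reduction: int = 16):
        super().__init__()
        hidden = channels * expand_ratio
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)              # W_e1
        self.restore = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)  # W_c1
        self.relu = nn.ReLU(inplace=True)                                     # delta
        self.pool = nn.AdaptiveAvgPool2d(1)                                   # H_GAP
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),        # W_c2
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),        # W_e2
            nn.Sigmoid(),                                                     # sigma (assumed gate)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.restore(self.relu(self.expand(x)))   # intermediate features F_i
        w = self.gate(self.pool(f))                   # per-channel weights
        return x + f * w                              # channel-wise reweighting + residual

# Example: a 64-channel feature map keeps its shape through the module.
y = LowParamResidualChannelAttention(64)(torch.randn(1, 64, 48, 48))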
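The shallow feature extraction, up-sampling and training loss of claims 2 to 4 can be sketched as follows; the channel width, the ×4 magnification factor and the random tensors are purely illustrative, and the deep feature extraction stage is omitted.

import torch
import torch.nn as nn

scale = 4                                             # example magnification factor
shallow = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # claim 2: shallow features via a 3x3 convolution
upsample = nn.Sequential(                             # claim 3: 3x3 convolution followed by pixel shuffling
    nn.Conv2d(64, 3 * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),
)
l1_loss = nn.L1Loss()                                 # claim 4: L1 loss between reconstruction and target

lr = torch.randn(1, 3, 48, 48)                        # toy low-resolution input
hr = torch.randn(1, 3, 48 * scale, 48 * scale)        # toy high-resolution target
sr = upsample(shallow(lr))                            # deep feature extraction omitted in this sketch
loss = l1_loss(sr, hr)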
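A minimal sketch of the block sparse global perception module of claim 5, assuming the input is a batch of block tokens of shape (batch, tokens, channels) obtained by flattening the spatial positions of a block; the class name, the fixed token count and the hidden ratio of the spatial MLP are assumptions of the sketch.

import torch
import torch.nn as nn

class BlockSparseGlobalPerception(nn.Module):
    # Sketch of claim 5: LayerNorm -> channel-dimension mapping -> GELU gives an
    # intermediate tensor; an MLP then mixes the spatial (token) dimension; a final
    # fully connected mapping along the channel direction is residually added to
    # the intermediate tensor.
    def __init__(self, channels: int, num_tokens: int, hidden_ratio: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.channel_map = nn.Linear(channels, channels)
        self.act = nn.GELU()
        self.spatial_mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens * hidden_ratio),
            nn.GELU(),
            nn.Linear(num_tokens * hidden_ratio, num_tokens),
        )
        self.channel_fc = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, channels)
        mid = self.act(self.channel_map(self.norm(x)))                   # intermediate tensor
        spatial = self.spatial_mlp(mid.transpose(1, 2)).transpose(1, 2)  # spatial feature mapping
        return mid + self.channel_fc(spatial)                            # channel FC + residual

# Example: 64 positions per block, 96 channels.
out = BlockSparseGlobalPerception(channels=96, num_tokens=64)(torch.randn(2, 64, 96))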
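Finally, an illustrative sketch of window multi-scale self-attention with a shifted-window variant (claims 1 and 6): self-attention is computed inside non-overlapping windows of several sizes and the results are averaged, with a cyclic roll providing the shift. The window sizes, head count, averaging of scales and the omission of attention masks and relative position bias are all simplifying assumptions; the spatial dimensions are assumed divisible by every window size.

import torch
import torch.nn as nn

def window_partition(x, w):
    # (B, C, H, W) -> (B * num_windows, w * w, C): one token sequence per window
    b, c, h, width = x.shape
    x = x.view(b, c, h // w, w, width // w, w)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, c)

def window_reverse(tokens, w, b, c, h, width):
    # inverse of window_partition
    x = tokens.view(b, h // w, width // w, w, w, c)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, width)

class WindowMultiScaleSelfAttention(nn.Module):
    def __init__(self, channels: int, window_sizes=(4, 8), num_heads: int = 4, shift: bool = True):
        super().__init__()
        self.window_sizes = window_sizes
        self.shift = shift
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(channels, num_heads, batch_first=True)
            for _ in window_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, width = x.shape
        out = torch.zeros_like(x)
        for w, attn in zip(self.window_sizes, self.attn):
            s = w // 2 if self.shift else 0                        # shifted-window offset
            shifted = torch.roll(x, shifts=(-s, -s), dims=(2, 3))
            tokens = window_partition(shifted, w)
            mixed, _ = attn(tokens, tokens, tokens)                # self-attention within each window
            merged = window_reverse(mixed, w, b, c, h, width)
            out = out + torch.roll(merged, shifts=(s, s), dims=(2, 3))
        return out / len(self.window_sizes)                        # average over window scales

# Example: a 64-channel feature map whose sides are divisible by 4 and 8.
y = WindowMultiScaleSelfAttention(64)(torch.randn(1, 64, 48, 48))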
CN202410056441.7A 2024-01-16 2024-01-16 Image super-resolution reconstruction method, terminal equipment and storage medium Pending CN117575915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410056441.7A CN117575915A (en) 2024-01-16 2024-01-16 Image super-resolution reconstruction method, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410056441.7A CN117575915A (en) 2024-01-16 2024-01-16 Image super-resolution reconstruction method, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117575915A 2024-02-20

Family

ID=89892123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410056441.7A Pending CN117575915A (en) 2024-01-16 2024-01-16 Image super-resolution reconstruction method, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117575915A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833246A (en) * 2020-06-02 2020-10-27 天津大学 Single-frame image super-resolution method based on attention cascade network
CN113065450A (en) * 2021-03-29 2021-07-02 重庆邮电大学 Human body action recognition method based on separable three-dimensional residual error attention network
CN114387161A (en) * 2020-10-16 2022-04-22 四川大学 Video super-resolution based on enhanced deep feature extraction and residual up-down sampling block
CN114627002A (en) * 2022-02-07 2022-06-14 华南理工大学 Image defogging method based on self-adaptive feature fusion
CN114862670A (en) * 2022-03-24 2022-08-05 天津大学 Super-resolution reconstruction device for Micro-CT (Micro-computed tomography) image of rat ankle bone fracture
US20220261959A1 (en) * 2021-02-08 2022-08-18 Nanjing University Of Posts And Telecommunications Method of reconstruction of super-resolution of video frame
CN115222601A (en) * 2022-08-06 2022-10-21 福州大学 Image super-resolution reconstruction model and method based on residual mixed attention network
CN116071243A (en) * 2023-03-27 2023-05-05 江西师范大学 Infrared image super-resolution reconstruction method based on edge enhancement
US20230252605A1 (en) * 2022-02-10 2023-08-10 Lemon Inc. Method and system for a high-frequency attention network for efficient single image super-resolution
CN116777745A (en) * 2023-06-19 2023-09-19 青岛大学 Image super-resolution reconstruction method based on sparse self-adaptive clustering
CN117237197A (en) * 2023-11-08 2023-12-15 华侨大学 Image super-resolution method and device based on cross attention mechanism and Swin-transducer
CN117252761A (en) * 2023-09-08 2023-12-19 武汉大学 Cross-sensor remote sensing image super-resolution enhancement method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893413A (en) * 2024-03-15 2024-04-16 博创联动科技股份有限公司 Vehicle-mounted terminal man-machine interaction method based on image enhancement
CN117893413B (en) * 2024-03-15 2024-06-11 博创联动科技股份有限公司 Vehicle-mounted terminal man-machine interaction method based on image enhancement

Similar Documents

Publication Publication Date Title
CN110033410B (en) Image reconstruction model training method, image super-resolution reconstruction method and device
CN110570353B (en) Super-resolution reconstruction method for generating single image of countermeasure network by dense connection
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN108898560B (en) Core CT image super-resolution reconstruction method based on three-dimensional convolutional neural network
CN110599401A (en) Remote sensing image super-resolution reconstruction method, processing device and readable storage medium
CN111105352B (en) Super-resolution image reconstruction method, system, computer equipment and storage medium
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN110232653A (en) The quick light-duty intensive residual error network of super-resolution rebuilding
CN110060204B (en) Single image super-resolution method based on reversible network
CN111626927B (en) Binocular image super-resolution method, system and device adopting parallax constraint
Cao et al. New architecture of deep recursive convolution networks for super-resolution
Sun et al. Multiscale generative adversarial network for real‐world super‐resolution
CN117575915A (en) Image super-resolution reconstruction method, terminal equipment and storage medium
Liu et al. Research on super-resolution reconstruction of remote sensing images: A comprehensive review
Shi et al. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution
Tang et al. Deep residual networks with a fully connected reconstruction layer for single image super-resolution
Yang et al. An image super-resolution network based on multi-scale convolution fusion
Li Image super-resolution using attention based densenet with residual deconvolution
CN116188272B (en) Two-stage depth network image super-resolution reconstruction method suitable for multiple fuzzy cores
Rashid et al. Single MR image super-resolution using generative adversarial network
Cheng et al. Adaptive feature denoising based deep convolutional network for single image super-resolution
Du et al. X-ray image super-resolution reconstruction based on a multiple distillation feedback network
CN116128722A (en) Image super-resolution reconstruction method and system based on frequency domain-texture feature fusion
CN113191947B (en) Image super-resolution method and system
Liu et al. A convolutional neural network for image super-resolution using internal dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination