CN111754403B - Image super-resolution reconstruction method based on residual learning - Google Patents
- Publication number: CN111754403B
- Application number: CN202010542676.9A (CN202010542676A)
- Authority
- CN
- China
- Prior art keywords: image, resolution, layer, reconstruction, convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses an image super-resolution reconstruction method based on residual learning, comprising the following steps: acquiring a public image data set for training the system and processing it into a training set and a test set; acquiring a low-resolution image to be processed; building and fusing the modules of the system's core neural network to form a deep convolutional network based on residual learning; calculating the loss between the network prediction and the label and adjusting the network parameters according to the loss; training the deep residual network and feeding it the low-resolution image to be processed; and outputting the reconstructed high-resolution image and evaluating the reconstruction accuracy, completing the super-resolution reconstruction. The method strengthens the extraction of image features, improves the generalization and efficiency of the image reconstruction system, and makes the image reconstruction more accurate.
Description
Technical Field
The invention relates to an image processing method, in particular to an image super-resolution reconstruction method based on residual learning, and belongs to the field of deep learning.
Background
Image super-resolution reconstruction is a classic problem in computer vision that aims to convert one or more low-resolution images into a high-resolution image by algorithmic means. Resolution is a criterion for evaluating image quality: the higher the resolution, the higher the pixel density, the richer the detail information, and the better the picture quality. In practice, owing to objective factors such as acquisition equipment, noise, and shooting environment, captured pictures often fail to meet requirements, so improving image resolution and thereby image quality has long been a topic of wide interest. The most direct way to improve resolution is to improve the optical hardware of the acquisition system, but this route is constrained by the difficulty of improving the manufacturing process and by high manufacturing cost, so realizing super-resolution reconstruction from the software and algorithm side has become a hot research topic in image processing, computer vision, and related fields.
At present, image super-resolution reconstruction algorithms fall into three main classes: interpolation-based, reconstruction-based, and learning-based methods.
Interpolation-based super-resolution was among the first widely studied reconstruction techniques; it estimates the gray value of each pixel to be inserted from its neighboring pixels. Interpolation is a non-iterative spatial-domain algorithm with high computational efficiency and is the simplest super-resolution approach, but because the interpolation model is overly simple, the reconstructed image may suffer blocking and blurring artifacts.
Reconstruction-based super-resolution methods model the relationship between the low-resolution and high-resolution images through an image degradation model and, under certain constraint conditions, solve for the optimal estimate of the high-resolution image using optimization algorithms. Their reconstruction quality depends heavily on the choice of priors, and the generated image may lack important detail or be overly smooth.
Learning-based super-resolution algorithms learn the mapping between low-resolution and high-resolution images from paired examples and then use this mapping to reconstruct the high-resolution image. The mainstream approaches are sparse-coding methods and deep-learning-based super-resolution. With the emergence of massive data and the development of artificial intelligence, deep-learning-based methods have become the main research trend. In 2014, Dong et al. first brought convolutional networks to image super-resolution with SRCNN: it takes a bicubically interpolated low-resolution image as input and uses three convolutional layers to map it nonlinearly from the low-resolution space to the high-resolution space, covering feature extraction and representation, nonlinear mapping, and image reconstruction in a single end-to-end network. Its reconstruction quality clearly surpassed traditional super-resolution algorithms and demonstrated the strength of deep learning in this field. Dong et al. then proposed FSRCNN as an improvement on SRCNN, training directly on the low-resolution image and enlarging it to the high-resolution size with a deconvolution layer at the end of the network. In 2016, Kim et al. proposed VDSR, a 20-layer deep convolutional neural network that extracts deeper image features and introduces a residual model to accelerate convergence and improve the generalization ability of the model.
Deep-learning-based super-resolution uses a convolutional neural network framework to learn hierarchical feature representations automatically from image data sets and achieves good results. However, because such networks have many parameters and converge slowly during training, designing a network with high reconstruction quality and fast training convergence is the crux of the problem.
Disclosure of Invention
The purpose of the invention is as follows: most existing networks suffer from weak learning ability, long training times, and reconstructed image quality that leaves room for improvement. To address this, the invention provides an image super-resolution method based on residual learning that takes the original low-resolution image as the model input and cascades a deep convolutional network of small convolution kernels, enlarging the receptive field and making full use of context information. By introducing residual blocks, it learns the high-frequency information of the image, reduces computational complexity, speeds up the convergence of network training, and achieves high-accuracy real-time reconstruction of the image.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
an image super-resolution reconstruction method based on residual learning comprises the following steps:
Step 1), obtain a public data set for training the system.
The data sets include DIV2K, Set5, and Set14, with DIV2K as the training data Set and Set5 and Set14 as the testing data Set.
Step 2), acquire the low-resolution image to be processed.
Step 3), construct a neural network unit based on the residual structure to form an image super-resolution network based on residual learning.
The neural network unit comprises a feature extraction layer of a low-resolution image, a nonlinear mapping layer for detail recovery and an image reconstruction layer.
The feature extraction layer comprises a 3 x 3 convolutional layer, the number of channels is 64, and the feature extraction layer is used for extracting shallow features of the image.
The nonlinear mapping layer consists of two residual modules, which take the extracted shallow features as input and learn the high-frequency information of the image; they are denoted the first residual convolution module and the second residual convolution module. The first residual convolution module feeds a feature map of shape (n, c, x, y) through 8 convolution layers in sequence, where n is the batch size, c the number of channels of the feature map, and x and y its spatial dimensions, and then adds the output to the original feature map; all 8 convolution layers are 3 × 3 convolutions with 64 channels. The second residual module takes the (n, c, x, y) feature map output by the first module as input, feeds it through 8 convolution layers in the same way, and adds the output to its input feature map.
The convolution operation of the feature extraction layer and the nonlinear mapping layer is expressed as:

f_l = φ(W_l * f_{l-1} + b_l)

where f_l is the output of the current layer, W_l the filter weights of each layer, b_l the bias vector of each layer, f_{l-1} the output of the previous layer, * the convolution operation of the current layer, and φ a nonlinear activation function, chosen here as the Rectified Linear Unit (ReLU).
The image reconstruction layer consists of a 3 × 3 deconvolution layer. The feature map is fed into the deconvolution layer so that the image, reduced by a factor of 2, 3, or 4, is up-sampled to the target high-resolution size. The relationship between the input image size and the target size is:

W_2 = S(W_1 - 1) + F - 2P
H_2 = S(H_1 - 1) + F - 2P

where W_1 and H_1 are the width and height of the input image, W_2 and H_2 the width and height of the output image, S the stride of the convolution kernel, F the kernel size, and P the zero padding.
Step 4), calculate the loss between the network prediction and the label and adjust the network parameters according to the loss. The loss is computed with a mean square error (MSE) loss function:

Loss = (1 / (H * W)) * Σ_{i=1}^{H} Σ_{j=1}^{W} (X(i, j) - Y(i, j))^2

where H and W are the height and width of the image, and X(i, j) and Y(i, j) are the corresponding pixels of the reconstructed image and the original image.

Computing the loss with the mean square error measures the deviation between the observed value and the true value, directly reflects the relationship between the reconstructed image and the ground truth, and gives an immediate view of the model's predictive quality.
During training, the reconstructed image is compared with the corresponding high-resolution image in the training set and a convergence judgment is made. If convergence is reached, the reconstructed image is output; otherwise the parameters are updated and training continues until convergence.
Step 5), input the data set into the neural network unit for training to obtain the trained neural network unit, and input the low-resolution image to be processed obtained in step 2) into the trained neural network unit.
Step 6), output the reconstructed high-resolution image and evaluate the image reconstruction accuracy.
The difference between the generated image and the original high-resolution image is measured with two criteria: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
The peak signal-to-noise ratio PSNR measures image reconstruction quality from the error between corresponding pixels:

PSNR = 10 * log10((2^n - 1)^2 / MSE)

where n is the number of bits per pixel and MSE is the mean square error. A larger PSNR value indicates less distortion.
The structural similarity SSIM measures the similarity of images jointly from three angles — luminance, contrast, and structure — where x and y denote the original high-resolution image and the restored high-resolution image, respectively. The calculation is:

SSIM(x, y) = l(x, y) * c(x, y) * s(x, y)

l(x, y) = (2μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1)
c(x, y) = (2σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)
s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3)

where l(x, y) is the luminance comparison, c(x, y) the contrast comparison, and s(x, y) the structure comparison; μ_x and μ_y are the pixel means of the two images, σ_x and σ_y their standard deviations, σ_xy the covariance of the pixel blocks of the two images, and C_1, C_2, and C_3 are constants. SSIM takes values in [0, 1]; the closer the result is to 1, the smaller the distortion.
Preferably: each convolutional layer consists of 64 channels, and the convolutional kernel size is 3 × 3.
Preferably: zero padding of the feature image is required prior to each convolution operation.
Preferably: the number of bits per pixel, n, is 8.
Compared with the prior art, the invention has the following beneficial effects:
the invention increases the extraction capability of the image characteristics, reduces the calculation complexity of a training system, and improves the accuracy of edge prediction, so that the image reconstruction precision is increased.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a diagram comparing the structure of the conventional residual block and the residual block of the present invention.
Fig. 3 is a diagram of a super-resolution network structure.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.
An image super-resolution reconstruction method based on residual learning is disclosed, as shown in fig. 1, and comprises the following steps:
Step 1), obtain a public data set for training the system. The data sets are DIV2K, Set5, and Set14. DIV2K contains 800 training images, 100 validation images, and 100 test images; the 800 DIV2K training images are used for training, and Set5 and Set14 serve as test sets. Each data instance pairs a high-resolution image with a low-resolution image down-sampled by a factor of 2, 3, or 4.
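As a sketch of how the low-resolution training inputs in step 1) can be produced, the snippet below down-samples a high-resolution array by a factor of 2, 3, or 4. Block-averaging is used here purely as a simple stand-in; the `downsample` helper and its averaging scheme are illustrative assumptions, not the patent's (typically bicubic) down-sampling.

```python
import numpy as np

def downsample(hr: np.ndarray, scale: int) -> np.ndarray:
    """Produce a low-resolution training input by block-averaging.

    Illustrative stand-in for the 2x/3x/4x down-sampling that pairs
    each high-resolution image with its low-resolution counterpart.
    """
    # Crop so both dimensions divide evenly by the scale factor.
    h = hr.shape[0] // scale * scale
    w = hr.shape[1] // scale * scale
    hr = hr[:h, :w]
    # Average each scale x scale block into one output pixel.
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

hr = np.arange(16, dtype=float).reshape(4, 4)
lr = downsample(hr, 2)   # 4x4 image -> 2x2 low-resolution image
```

Each output pixel is the mean of one 2 × 2 block of the input, so the first output pixel here averages 0, 1, 4, and 5.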
And 2) acquiring a low-resolution image to be processed.
Step 3), a neural network unit is constructed on the basis of a residual error structure to form an image super-resolution network based on residual error learning, as shown in FIG. 3;
the neural network unit comprises a feature extraction layer of a low-resolution image, a non-linear mapping layer based on a residual block for detail recovery and an image reconstruction layer.
The feature extraction layer comprises a 3 x 3 convolutional layer, the number of channels is 64, and the feature extraction layer is used for extracting shallow features of the image.
The nonlinear mapping layer is composed of two residual error modules, and is marked as a first residual error convolution module and a second residual error convolution module.
Batch normalization in the conventional residual block normalizes the output features of the convolution layers, which loses part of the information; moreover, the batch normalization parameters are of the same order as the convolution parameters and consume a large amount of memory. The conventional residual structure is shown in the left part of fig. 2. The invention therefore uses an optimized residual module that, compared with the traditional one, removes the batch normalization layers, reducing network parameters and improving network performance; this structure is shown in the right part of fig. 2.
The extracted shallow features are taken as input to learn the high-frequency information of the image. The first residual convolution module feeds a feature map of shape (n, c, x, y) through 8 convolution layers in sequence, where n is the batch size, c the number of channels, and x and y the spatial size of the feature map, and then adds the output to the features produced by the feature extraction layer; all 8 convolution layers are 3 × 3 convolutions with 64 channels. The second residual module takes the (n, c, x, y) feature map output by the first module as input, feeds it through 8 convolution layers, and likewise adds the output to the extracted image features. In addition, the feature map is zero-padded before every convolution so that convolution does not change the image size, which improves the accuracy of edge prediction.
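A minimal NumPy sketch of one such residual module follows: 8 stacked zero-padded 3 × 3 convolutions with ReLU, then a skip connection adding the module's input. The channel count (64 in the patent) is set by the weight shapes; the small 2-channel, 4 × 4 demo sizes and the placement of the final activation are illustrative assumptions, not details fixed by the patent text.

```python
import numpy as np

def conv3x3(x, w, b):
    """Zero-padded 3x3 convolution; x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out, h, wd = w.shape[0], x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # pad so output size == input size
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * w[o]) + b[o]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, weights, biases):
    """One residual module: 8 conv+ReLU layers, then add the input (skip)."""
    f = x
    for w, b in zip(weights, biases):
        f = relu(conv3x3(f, w, b))
    return f + x   # skip connection: the stack only has to learn the residual

# Demo with 2 channels on a 4x4 feature map (64 channels in the patent).
rng = np.random.default_rng(0)
c, n_layers = 2, 8
ws = [rng.normal(scale=0.1, size=(c, c, 3, 3)) for _ in range(n_layers)]
bs = [np.zeros(c) for _ in range(n_layers)]
x = rng.normal(size=(c, 4, 4))
y = residual_block(x, ws, bs)   # same shape as x, thanks to zero padding
```

With all-zero weights the convolution stack outputs zero and the block reduces to the identity, which is exactly the property that makes residual learning easy to optimize.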
The convolution operation of the feature extraction layer and the nonlinear mapping layer can be simply expressed as:

f_l = φ(W_l * f_{l-1} + b_l)

where W_l is the filter weight of each layer, b_l the bias vector of each layer, f_{l-1} the output of the previous layer, * the convolution operation of the current layer, φ a nonlinear activation function — the invention uses the Rectified Linear Unit (ReLU) — and f_l the output of the current layer. Because a pooling layer would down-sample the convolved result and lose detail information, while the goal of a super-resolution algorithm is to add detail to the image, the invention uses no pooling layers.
The image reconstruction layer consists of a 3 × 3 deconvolution layer. The deconvolution layer learns the up-sampling kernel for the image features and can be viewed as the inverse of convolution: the convolution kernel is transposed and convolved with the convolved result. The feature map is fed into the deconvolution layer so that the image, reduced by a factor of 2, 3, or 4, is up-sampled to the target high-resolution size. The relationship between the input image size and the target size can be expressed as:

W_2 = S(W_1 - 1) + F - 2P
H_2 = S(H_1 - 1) + F - 2P

where W_1 and H_1 are the width and height of the input image, W_2 and H_2 the width and height of the output image, S the stride of the convolution kernel, F the kernel size, and P the zero padding.
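The size relation above can be checked numerically. The stride S = 2 and padding P = 1 in the example below are illustrative values chosen for a roughly 2× up-scaling, not parameters stated in the patent text.

```python
def deconv_output_size(w_in: int, s: int, f: int, p: int) -> int:
    """Transposed-convolution output size: W2 = S*(W1 - 1) + F - 2P.

    The same relation holds for the height H2.
    """
    return s * (w_in - 1) + f - 2 * p

# A 3x3 deconvolution (F = 3) with stride S = 2 and padding P = 1 maps a
# 24-pixel-wide input to 2*(24 - 1) + 3 - 2 = 47 pixels; frameworks
# typically add an output-padding term to reach exactly 48.
out = deconv_output_size(24, 2, 3, 1)
```

With S = 1, F = 3, P = 1 the formula gives W_2 = W_1, matching the size-preserving zero-padded convolutions used elsewhere in the network.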
Step 4), calculate the loss between the network prediction and the label and adjust the network parameters according to the loss. The loss is computed with the MSE loss function:

Loss = (1 / (H * W)) * Σ_{i=1}^{H} Σ_{j=1}^{W} (X(i, j) - Y(i, j))^2

where H and W are the height and width of the image, and X(i, j) and Y(i, j) are the corresponding pixels of the reconstructed image and the original image. During training, the reconstructed image is compared with the corresponding high-resolution image in the training set and a convergence judgment is made. If convergence is reached, the reconstructed image is output; otherwise the error is back-propagated from the output toward the input and the parameters are updated for further training and learning until convergence.
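The loss formula above translates directly into code; this is a minimal sketch of the per-image MSE between a reconstruction X and its reference Y.

```python
import numpy as np

def mse_loss(x: np.ndarray, y: np.ndarray) -> float:
    """Mean square error: Loss = (1/(H*W)) * sum_{i,j} (X(i,j) - Y(i,j))**2."""
    return float(np.mean((x - y) ** 2))

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # reconstructed image
y = np.array([[1.0, 2.0], [3.0, 2.0]])   # reference image
```

Only one of the four pixels differs (by 2), so the loss here is 2² / 4 = 1.0; identical images give a loss of 0.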
The network parameters are adjusted according to the loss; the MSE is minimized with the Adam optimization method (a stochastic-gradient-descent variant). The initial learning rate is 10^-4, the learning-rate decay period is 200 steps, and training runs for 4000 iteration steps.
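The stated schedule — initial rate 10⁻⁴, decay period 200 steps — can be sketched as a step-decay function. The decay factor `gamma = 0.5` is an assumption; the patent gives only the initial rate and the period, not the factor.

```python
def learning_rate(step: int, base: float = 1e-4,
                  decay_every: int = 200, gamma: float = 0.5) -> float:
    """Stepwise learning-rate decay: multiply by `gamma` every
    `decay_every` steps. `base` and `decay_every` follow the text;
    `gamma` is an assumed value."""
    return base * gamma ** (step // decay_every)
```

During the 4000-step training run, the rate would thus halve 20 times under this assumed factor, e.g. 10⁻⁴ for steps 0–199, then 5 × 10⁻⁵, and so on.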
Each convolution layer in the neural network unit has 64 channels with a 3 × 3 kernel. In addition, the feature map is zero-padded before every convolution so that the image size is unchanged after convolution, which improves the accuracy of edge prediction.
Step 5), input the data set into the neural network unit for training to obtain the trained neural network unit, and input the low-resolution image to be processed obtained in step 2) into the trained neural network unit.
Step 6), output the reconstructed high-resolution image and evaluate the image reconstruction accuracy.
The reconstructed image is evaluated with two criteria, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM), which measure the difference between the generated image and the original high-resolution image.
PSNR measures image reconstruction quality from the error between corresponding pixels, as in equation (4):

PSNR = 10 * log10((2^n - 1)^2 / MSE)    (4)

where n is the number of bits per pixel, typically 8, and MSE is the mean square error. PSNR is measured in dB; larger values indicate less distortion.
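Equation (4) can be implemented directly; this sketch returns infinity for identical images, where the MSE is zero and the formula is undefined.

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, n_bits: int = 8) -> float:
    """PSNR = 10 * log10((2**n - 1)**2 / MSE), in dB."""
    mse = np.mean((x.astype(float) - y.astype(float)) ** 2)
    if mse == 0:
        return float("inf")          # identical images: no distortion
    peak = (2 ** n_bits - 1) ** 2    # squared peak value, e.g. 255**2
    return float(10 * np.log10(peak / mse))
```

At the other extreme, an error of the full dynamic range at every pixel makes the MSE equal to the squared peak, giving 0 dB.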
SSIM measures the similarity of the images jointly from three aspects: luminance, contrast, and structure. Let x and y denote the original high-resolution image and the restored high-resolution image, respectively; the calculation follows equations (5) to (8):

SSIM(x, y) = l(x, y) * c(x, y) * s(x, y)    (5)
l(x, y) = (2μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1)    (6)
c(x, y) = (2σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)    (7)
s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3)    (8)

where l(x, y) is the luminance comparison, c(x, y) the contrast comparison, and s(x, y) the structure comparison; μ_x and μ_y are the pixel means of the two images, σ_x and σ_y their standard deviations, and σ_xy the covariance of the pixel blocks of the two images; C_1, C_2, and C_3 are constants used to avoid systematic errors when a denominator approaches 0. SSIM takes values in [0, 1]; the closer the result is to 1, the smaller the distortion.
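Equations (5)–(8) can be sketched as a global (whole-image) SSIM. The patent says only that C_1, C_2, and C_3 are constants; the commonly used values C_1 = (0.01·L)², C_2 = (0.03·L)², C_3 = C_2 / 2 (with L = 2ⁿ − 1) adopted below are assumptions.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, n_bits: int = 8) -> float:
    """Global SSIM(x, y) = l(x, y) * c(x, y) * s(x, y).

    Computed over the whole image with assumed constants
    C1 = (0.01*L)**2, C2 = (0.03*L)**2, C3 = C2/2, L = 2**n - 1.
    """
    L = 2 ** n_bits - 1
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    c3 = c2 / 2
    mx, my = x.mean(), y.mean()                 # luminance means
    sx, sy = x.std(), y.std()                   # contrast (std dev)
    sxy = ((x - mx) * (y - my)).mean()          # structure (covariance)
    l_ = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    c_ = (2 * sx * sy + c2) / (sx ** 2 + sy ** 2 + c2)
    s_ = (sxy + c3) / (sx * sy + c3)
    return float(l_ * c_ * s_)
```

With the assumed C_3 = C_2 / 2, the product collapses to the familiar two-term SSIM form; identical images score exactly 1, and any luminance, contrast, or structural mismatch pulls the score below 1.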
Aiming at the weak learning ability, long training times, and improvable reconstruction quality of most current networks, the invention provides an image super-resolution method based on residual learning. Verification shows that the system performs excellently on image reconstruction accuracy and achieves better super-resolution reconstruction performance.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (5)
1. An image super-resolution reconstruction method based on residual learning is characterized by comprising the following steps:
step 1), acquiring a public data set for training a system;
step 2), acquiring a low-resolution image to be processed;
step 3), constructing a neural network unit on the basis of a residual error structure to form an image super-resolution network based on residual error learning;
the neural network unit comprises a feature extraction layer of a low-resolution image, a nonlinear mapping layer for detail recovery and an image reconstruction layer;
the feature extraction layer comprises a 3 x 3 convolution layer, the number of channels is 64, and the feature extraction layer is used for extracting shallow features of the image;
the nonlinear mapping layer consists of two residual modules, which take the extracted shallow features as input to learn high-frequency information of the image and are denoted the first residual convolution module and the second residual convolution module; the first residual convolution module feeds a feature map of shape (n, c, x, y) through 8 convolution layers in sequence, where n is the batch size, c the number of channels of the feature map, and x and y its spatial dimensions, and adds the output to the original feature map, all 8 convolution layers being 3 × 3 convolutions with 64 channels; the second residual module takes the (n, c, x, y) feature map output by the first residual module as input, feeds it through 8 convolution layers, and adds the output to the original feature map;
the convolution operation of the feature extraction layer and the nonlinear mapping layer is expressed as:

f_l = φ(W_l * f_{l-1} + b_l)

where f_l is the output of the current layer, W_l the filter weight of each layer, b_l the bias vector of each layer, f_{l-1} the output of the previous layer, * the convolution operation of the current layer, and φ a nonlinear activation function, chosen as the rectified linear unit ReLU;
the image reconstruction layer consists of a 3 × 3 deconvolution layer; the feature map is input into the deconvolution layer so that the image, reduced by a factor of 2, 3, or 4, is up-sampled to the target high-resolution size, the relationship between the input image size and the target size being:

W_2 = S(W_1 - 1) + F - 2P
H_2 = S(H_1 - 1) + F - 2P

where W_1 and H_1 are the width and height of the input image, W_2 and H_2 the width and height of the output image, S the stride of the convolution kernel, F the kernel size, and P the zero padding;
step 4), calculating the loss between the network prediction and the label and adjusting the network parameters according to the loss, the loss being computed with a mean square error loss function:

Loss = (1 / (H * W)) * Σ_{i=1}^{H} Σ_{j=1}^{W} (X(i, j) - Y(i, j))^2

where H and W are the height and width of the image, and X(i, j) and Y(i, j) are the corresponding pixels of the reconstructed image and the original image;
during training, the reconstructed image is compared with the corresponding high-resolution image in the training set and convergence is judged; if converged, the reconstructed image is output; otherwise the parameters are updated and training continues until convergence;
step 5), inputting the data set into a neural network unit for training to obtain a trained neural network unit, and inputting the low-resolution image to be processed obtained in the step 2) into the trained neural network unit;
step 6), outputting the reconstructed high-resolution image and evaluating the image reconstruction precision;
measuring the difference between the generated image and the original high-resolution image with two criteria: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM);
the peak signal-to-noise ratio PSNR measures image reconstruction quality by calculating the error between corresponding pixels, and is computed as:

PSNR = 10 * log10( (2^n - 1)^2 / MSE )

wherein n represents the number of bits per pixel and MSE is the mean square error; the larger the PSNR value, the smaller the distortion;
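The PSNR criterion, with the standard peak value (2^n - 1)^2 for n-bit pixels, can be sketched as:

```python
import numpy as np

def psnr(X, Y, n_bits=8):
    """Peak signal-to-noise ratio: 10 * log10((2^n - 1)^2 / MSE)."""
    mse = np.mean((np.asarray(X, dtype=float) - np.asarray(Y, dtype=float)) ** 2)
    peak = (2 ** n_bits - 1) ** 2
    return float(10.0 * np.log10(peak / mse))
```

With 8 bits per pixel (claim 4), an MSE of 1 gives 20 * log10(255), roughly 48.13 dB.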
the structural similarity SSIM comprehensively measures the similarity of images from three aspects: brightness, contrast and structure; x and y respectively represent the original high-resolution image and the restored high-resolution image, and SSIM is calculated as:
SSIM(x,y)=l(x,y)*c(x,y)*s(x,y)
in the above calculation formula, l(x, y) represents the luminance comparison, c(x, y) represents the contrast comparison, and s(x, y) represents the structure comparison, wherein μ_x and μ_y respectively represent the pixel means of the two images, δ_x and δ_y represent their standard deviations, δ_xy represents the covariance of the pixel blocks of the two images, and C_1, C_2 and C_3 are constants; the SSIM value range is [0, 1], and a result closer to 1 indicates less distortion.
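A global (single-window) SSIM can be sketched as below. This uses the common special case C3 = C2 / 2, which collapses the product l * c * s into a two-term formula, and the conventional 8-bit constants C1 = (0.01 * 255)^2 and C2 = (0.03 * 255)^2; both choices are assumptions, since the patent only states that C_1, C_2 and C_3 are constants.

```python
import numpy as np

def ssim(x, y, C1=6.5025, C2=58.5225):
    """Global SSIM over two image blocks, special case C3 = C2 / 2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()            # pixel means
    var_x, var_y = x.var(), y.var()            # variances (std dev squared)
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # covariance of the two blocks
    return float((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
                 / ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)))
```

Identical images score exactly 1; a negated image scores far lower, reflecting the lost structure.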
2. The image super-resolution reconstruction method based on residual learning of claim 1, wherein: each convolutional layer consists of 64 channels, and the convolutional kernel size is 3 × 3.
3. The image super-resolution reconstruction method based on residual learning of claim 1, wherein: zero padding of the feature image is required prior to each convolution operation.
4. The image super-resolution reconstruction method based on residual learning of claim 1, wherein: the number of bits per pixel, n, is 8.
5. The image super-resolution reconstruction method based on residual learning of claim 1, wherein: the data sets include DIV2K, Set5, and Set14, with DIV2K as the training data Set and Set5 and Set14 as the testing data Set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010542676.9A CN111754403B (en) | 2020-06-15 | 2020-06-15 | Image super-resolution reconstruction method based on residual learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111754403A CN111754403A (en) | 2020-10-09 |
CN111754403B true CN111754403B (en) | 2022-08-12 |
Family
ID=72676098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010542676.9A Active CN111754403B (en) | 2020-06-15 | 2020-06-15 | Image super-resolution reconstruction method based on residual learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111754403B (en) |
Families Citing this family (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112261415B (en) * | 2020-10-23 | 2022-04-08 | 青海民族大学 | Image compression coding method based on overfitting convolution self-coding network |
CN112288630A (en) * | 2020-10-27 | 2021-01-29 | 武汉大学 | Super-resolution image reconstruction method and system based on improved wide-depth neural network |
CN112184559B (en) * | 2020-11-09 | 2022-07-22 | 西北工业大学 | Super-resolution image abnormal target detection method and system of photoelectric navigation system |
CN112288658B (en) * | 2020-11-23 | 2023-11-28 | 杭州师范大学 | Underwater image enhancement method based on multi-residual joint learning |
CN112348745B (en) * | 2020-11-26 | 2022-10-14 | 河海大学 | Video super-resolution reconstruction method based on residual convolutional network |
CN112581366B (en) * | 2020-11-30 | 2022-05-20 | 黑龙江大学 | Portable image super-resolution system and system construction method |
CN112508792A (en) * | 2020-12-22 | 2021-03-16 | 北京航空航天大学杭州创新研究院 | Single-image super-resolution method and system of deep neural network integration model based on online knowledge migration |
CN112907441B (en) * | 2020-12-29 | 2023-05-30 | 中央财经大学 | Space downscaling method based on super-resolution of ground water satellite image |
CN112734643A (en) * | 2021-01-15 | 2021-04-30 | 重庆邮电大学 | Lightweight image super-resolution reconstruction method based on cascade network |
CN112785501B (en) * | 2021-01-20 | 2023-09-01 | 北京百度网讯科技有限公司 | Text image processing method, text image processing device, text image processing equipment and storage medium |
CN112767251B (en) * | 2021-01-20 | 2023-04-07 | 重庆邮电大学 | Image super-resolution method based on multi-scale detail feature fusion neural network |
CN112907444B (en) * | 2021-02-07 | 2024-03-22 | 中国科学院沈阳自动化研究所 | Terahertz image super-resolution reconstruction method based on complex domain zero sample learning |
CN112819823B (en) * | 2021-03-02 | 2023-10-20 | 中山大学 | Round hole detection method, system and device for furniture plate |
CN112927136B (en) * | 2021-03-05 | 2022-05-10 | 江苏实达迪美数据处理有限公司 | Image reduction method and system based on convolutional neural network domain adaptation |
CN112926598B (en) * | 2021-03-08 | 2021-12-07 | 南京信息工程大学 | Image copy detection method based on residual error domain deep learning characteristics |
CN112862690B (en) * | 2021-03-09 | 2022-08-30 | 湖北工业大学 | Transformers-based low-resolution image super-resolution method and system |
CN113096207B (en) * | 2021-03-16 | 2023-01-13 | 天津大学 | Rapid magnetic resonance imaging method and system based on deep learning and edge assistance |
CN112837224A (en) * | 2021-03-30 | 2021-05-25 | 哈尔滨理工大学 | Super-resolution image reconstruction method based on convolutional neural network |
CN113077382B (en) * | 2021-04-27 | 2024-01-12 | 东南大学 | Face beautifying image restoration method based on BEMD and deep learning |
CN113298714B (en) * | 2021-05-24 | 2024-04-26 | 西北工业大学 | Image cross-scale super-resolution method based on deep learning |
CN113222825B (en) * | 2021-06-03 | 2023-04-18 | 北京理工大学 | Infrared image super-resolution reconstruction method based on visible light image training and application |
CN113269818B (en) * | 2021-06-09 | 2023-07-25 | 河北工业大学 | Deep learning-based seismic data texture feature reconstruction method |
CN113538231B (en) * | 2021-06-17 | 2024-04-02 | 杭州电子科技大学 | Single image super-resolution reconstruction system and method based on pixel distribution estimation |
CN113538307B (en) * | 2021-06-21 | 2023-06-20 | 陕西师范大学 | Synthetic aperture imaging method based on multi-view super-resolution depth network |
CN113506215B (en) * | 2021-06-22 | 2023-07-04 | 中国公路工程咨询集团有限公司 | Super-resolution image reconstruction method and device based on wide activation and electronic equipment |
CN113495291A (en) * | 2021-07-27 | 2021-10-12 | 成都爱为贝思科技有限公司 | Method for intelligently and quantitatively evaluating amplitude preservation of forward gather based on deep learning |
CN113674185B (en) * | 2021-07-29 | 2023-12-08 | 昆明理工大学 | Weighted average image generation method based on fusion of multiple image generation technologies |
CN113610912B (en) * | 2021-08-13 | 2024-02-02 | 中国矿业大学 | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction |
CN113947642B (en) * | 2021-10-18 | 2024-06-04 | 北京航空航天大学 | X-space magnetic particle imaging deconvolution method |
CN114049254B (en) * | 2021-10-29 | 2022-11-29 | 华南农业大学 | Low-pixel ox-head image reconstruction and identification method, system, equipment and storage medium |
CN114022360B (en) * | 2021-11-05 | 2024-05-03 | 长春理工大学 | Rendered image super-resolution system based on deep learning |
WO2023082162A1 (en) * | 2021-11-12 | 2023-05-19 | 华为技术有限公司 | Image processing method and apparatus |
CN114092330B (en) * | 2021-11-19 | 2024-04-30 | 长春理工大学 | Light-weight multi-scale infrared image super-resolution reconstruction method |
CN114463176B (en) * | 2022-01-25 | 2024-03-01 | 河南大学 | Image super-resolution reconstruction method based on improved ESRGAN |
WO2023155032A1 (en) * | 2022-02-15 | 2023-08-24 | 华为技术有限公司 | Image processing method and image processing apparatus |
CN114972040B (en) * | 2022-07-15 | 2023-04-18 | 南京林业大学 | Speckle image super-resolution reconstruction method for laminated veneer lumber |
CN115564808B (en) * | 2022-09-01 | 2023-08-25 | 宁波大学 | Multi-resolution hyperspectral/SAR image registration method based on public space-spectrum subspace |
CN115272084B (en) * | 2022-09-27 | 2022-12-16 | 成都信息工程大学 | High-resolution image reconstruction method and device |
CN115760655B (en) * | 2022-11-07 | 2023-06-23 | 贵州大学 | Interference array imaging method based on big data technology |
CN115994893B (en) * | 2022-12-05 | 2023-08-15 | 中国科学院合肥物质科学研究院 | Sealing element weak and small leakage target detection method based on low-resolution infrared image |
CN115619646B (en) * | 2022-12-09 | 2023-04-18 | 浙江大学 | Deep learning optical illumination super-resolution imaging method for sub-fifty nano-structure |
CN116402682B (en) * | 2023-03-29 | 2024-02-09 | 辽宁工业大学 | Image reconstruction method and system based on differential value dense residual super-resolution |
CN116664677B (en) * | 2023-05-24 | 2024-06-14 | 南通大学 | Sight estimation method based on super-resolution reconstruction |
CN117132468B (en) * | 2023-07-11 | 2024-05-24 | 汕头大学 | Curvelet coefficient prediction-based super-resolution reconstruction method for precise measurement image |
CN116934618B (en) * | 2023-07-13 | 2024-06-11 | 江南大学 | Image halftone method, system and medium based on improved residual error network |
CN117111013B (en) * | 2023-08-22 | 2024-04-30 | 南京慧尔视智能科技有限公司 | Radar target tracking track starting method, device, equipment and medium |
CN117291802A (en) * | 2023-09-27 | 2023-12-26 | 华北电力大学(保定) | Image super-resolution reconstruction method and system based on composite network structure |
CN117611954B (en) * | 2024-01-19 | 2024-04-12 | 湖北大学 | Method, device and storage device for evaluating effectiveness of infrared video image |
CN117575916B (en) * | 2024-01-19 | 2024-04-30 | 青岛漫斯特数字科技有限公司 | Image quality optimization method, system, equipment and medium based on deep learning |
CN117788292B (en) * | 2024-01-24 | 2024-06-11 | 合肥工业大学 | Sub-pixel displacement-based super-resolution reconstruction system for sequence image |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675321A (en) * | 2019-09-26 | 2020-01-10 | 兰州理工大学 | Super-resolution image reconstruction method based on progressive depth residual error network |
Also Published As
Publication number | Publication date |
---|---|
CN111754403A (en) | 2020-10-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111754403B (en) | Image super-resolution reconstruction method based on residual learning | |
CN109118432B (en) | Image super-resolution reconstruction method based on rapid cyclic convolution network | |
CN110675321A (en) | Super-resolution image reconstruction method based on progressive depth residual error network | |
CN110136062B (en) | Super-resolution reconstruction method combining semantic segmentation | |
CN111861884B (en) | Satellite cloud image super-resolution reconstruction method based on deep learning | |
CN107784628B (en) | Super-resolution implementation method based on reconstruction optimization and deep neural network | |
CN109949224B (en) | Deep learning-based cascade super-resolution reconstruction method and device | |
CN109214989A (en) | Single image super resolution ratio reconstruction method based on Orientation Features prediction priori | |
CN111861886B (en) | Image super-resolution reconstruction method based on multi-scale feedback network | |
CN111369466B (en) | Image distortion correction enhancement method of convolutional neural network based on deformable convolution | |
CN112801904B (en) | Hybrid degraded image enhancement method based on convolutional neural network | |
CN112819705B (en) | Real image denoising method based on mesh structure and long-distance correlation | |
CN113538234A (en) | Remote sensing image super-resolution reconstruction method based on lightweight generation model | |
CN112699844A (en) | Image super-resolution method based on multi-scale residual error level dense connection network | |
CN109064407A (en) | Intensive connection network image super-resolution method based on multi-layer perception (MLP) layer | |
CN115880158A (en) | Blind image super-resolution reconstruction method and system based on variational self-coding | |
CN110288529B (en) | Single image super-resolution reconstruction method based on recursive local synthesis network | |
CN113487482B (en) | Self-adaptive super-resolution method based on meta-shift learning | |
CN113160057B (en) | RPGAN image super-resolution reconstruction method based on generation countermeasure network | |
CN117132472B (en) | Forward-backward separable self-attention-based image super-resolution reconstruction method | |
CN112184552B (en) | Sub-pixel convolution image super-resolution method based on high-frequency feature learning | |
CN115511705A (en) | Image super-resolution reconstruction method based on deformable residual convolution neural network | |
CN111008930B (en) | Fabric image super-resolution reconstruction method | |
CN115293983A (en) | Self-adaptive image super-resolution restoration method fusing multi-level complementary features | |
CN113298719B (en) | Feature separation learning-based super-resolution reconstruction method for low-resolution fuzzy face image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||