Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an image super-resolution reconstruction method based on an attention mechanism and a two-channel network, which comprises the following steps: acquiring an image to be detected in real time, and preprocessing the image to be detected; inputting the preprocessed image into a trained image super-resolution reconstruction model to obtain a high-definition reconstructed image; evaluating the reconstructed image by adopting the peak signal-to-noise ratio and the structural similarity, and marking the high-definition reconstructed image according to the evaluation result; the image super-resolution reconstruction model is based on a convolutional neural network;
the process of training the image super-resolution reconstruction model comprises the following steps:
s1: obtaining an original high-definition picture data set, and downscaling the pictures in the data set by adopting a bicubic interpolation degradation model;
s2: preprocessing the downscaled data set to obtain a training data set;
s3: inputting each image in the training data set into the shallow feature channel and the deep feature channel of the image super-resolution reconstruction model respectively for feature extraction;
s4: extracting initial features of the input image by adopting the first convolution layer; inputting the initial features into the information cascade module, and aggregating the hierarchical feature information of the convolutional layers;
s5: inputting the hierarchical feature information aggregated by the information cascade module into the improved residual module to obtain channel-wise correlation and global spatial dependency information;
s6: adopting non-local dilated convolution to carry out global feature extraction on the dependency information to obtain the final deep feature map;
s7: extracting initial features of the input image by adopting the second convolution layer; inputting the initial features into the improved VGG network, and extracting shallow features of the image to obtain a shallow feature map;
s8: fusing the deep feature map and the shallow feature map, and upsampling the fused feature map to obtain a high-definition reconstructed image;
s9: constraining the difference between the high-definition reconstructed image and the original high-definition image by using a loss function, and continuously adjusting the parameters of the model until the model converges, completing the training of the model.
Preferably, the images in the data set are downscaled by factors of 2, 3, 4 and 8 using the bicubic interpolation degradation model.
Preferably, the formula of the bicubic interpolation degradation model is as follows:
I_LR = H_dn I_HR + n
Preferably, the preprocessing of the downscaled data set comprises enhancing the images, including translating them and flipping them in the horizontal and vertical directions; the enhanced images are divided into small image blocks, and the divided blocks are collected to obtain the training data set.
Preferably, the information cascade module comprises the feature aggregation structure stacked 10 times; the feature aggregation structure comprises at least three convolutional neural network layers, a feature channel merging layer, a channel attention layer and a channel number transformation layer, wherein the convolutional layers are connected in sequence, the outputs of all convolutional layers except the last branch to the feature channel merging layer, and the feature channel merging layer, the channel attention layer and the channel number transformation layer are connected in sequence to form the information cascade module; the process by which the module processes image data comprises: firstly, extracting feature information of the input image in sequence with each convolutional layer; then merging the feature information extracted by each convolutional layer on the feature channel merging layer; distinguishing the importance of the merged information with the channel attention mechanism; finally reducing the number of channels back to the number of input channels; and repeating these steps 10 times to obtain the aggregated hierarchical feature information of the convolutional layers.
Preferably, the improved residual module comprises a residual network structure, a channel attention mechanism layer and a spatial attention mechanism layer, wherein the residual network structure comprises a convolutional layer, a nonlinear activation layer and a second convolutional layer; the process by which the module processes image data comprises: inputting the hierarchical feature information into the residual network structure to extract feature information, acquiring the channel-wise correlation of the extracted feature information with the channel attention mechanism, passing it downwards, and acquiring the dependency over the global space with the spatial attention mechanism.
Preferably, the non-local dilated convolution block comprises four parallel dilated convolution layers with dilation rates of 1, 2, 4 and 6 and three ordinary convolutional neural network layers; the process by which the module processes image data comprises: firstly, extracting feature information from the dependency information output by the improved residual module with the four dilated convolutions of different dilation rates and with two ordinary convolutional layers respectively; then fusing the feature information obtained by the four dilated convolutions on the feature channel, and fusing the feature information extracted by the ordinary convolutional layers element-wise according to the values of the pixel matrices; finally, adding the two kinds of fused feature information to obtain the global feature information.
Preferably, the improved VGG network structure comprises 10 ordinary convolutional layers and 3 pooling layers, with the pooling layers embedded among the convolutional layers; the process by which the module processes image data comprises: firstly extracting 64-channel feature information using 2 convolutional layers and one pooling layer, then extracting 128-channel feature information using 2 convolutional layers and one pooling layer, then extracting 512-channel feature information using 3 convolutional layers and one pooling layer, and finally restoring the 512-channel information to 64 channels using 3 convolutional layers; the pooling layers use padding to keep the feature dimensions unchanged.
Preferably, the loss function expression of the image super-resolution reconstruction model is as follows:

L(θ) = (1/N) Σ_{i=1}^{N} ‖C_HR(I_LR^i) − I_HR^i‖_1
Preferably, the formulas for evaluating the reconstructed image by the peak signal-to-noise ratio and the structural similarity are as follows:

PSNR = 10 · log10(MAX² / MSE)

SSIM(X, Y) = ((2 μ_X μ_Y + C1)(2 σ_XY + C2)) / ((μ_X² + μ_Y² + C1)(σ_X² + σ_Y² + C2))
The invention has the following advantages:
1. The invention uses a two-channel network: one channel uses an improved residual structure to extract valuable high-frequency features, namely high-level features, while the other uses an improved VGG (the parameters of the VGG convolutional and pooling layers are fine-tuned so that the input and output image sizes are consistent, and the final fully connected layer is discarded) to extract rich low-frequency features; the two kinds of features are finally fused.
2. The invention uses dense connections at specific positions of the model (the two information cascade module groups at the head and the tail), aggregating the information of each convolutional layer so that the convolutional-layer information is fully utilized, and finally uses a channel attention mechanism to compute channel weights for the merged information rather than simply reducing the number of channels.
3. The invention uses a spatial attention mechanism, added after the existing channel attention mechanism, so that global information is extracted more fully and the features are utilized more comprehensively. Meanwhile, before upsampling, non-local dilated convolution performs one pass of global dependency feature extraction on the preceding results, so that the output is more tightly related and the feature information is richer.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An image super-resolution reconstruction method based on an attention mechanism and a two-channel network comprises the following steps: acquiring an image to be detected in real time, and preprocessing the image to be detected; inputting the preprocessed image into a trained image super-resolution reconstruction model to obtain a high-definition reconstructed image; evaluating the reconstructed image by adopting the peak signal-to-noise ratio and the structural similarity, and marking the high-definition reconstructed image according to the evaluation result. The image super-resolution reconstruction model is based on a convolutional neural network.
The structure of the image super-resolution reconstruction model is shown in fig. 1 and comprises a deep feature channel, a shallow feature channel, an upsampling layer and a third convolution layer. The deep feature channel comprises a first convolution layer, an information cascade module, an improved residual module and a non-local dilated convolution block; the input image is processed by the first convolution layer and then passes through the information cascade module, the improved residual module and the non-local dilated convolution block in sequence to obtain the deep feature map. The shallow feature channel comprises a second convolution layer and an improved VGG network; the input image is processed by the second convolution layer and then by the improved VGG network to obtain the shallow feature map. The deep feature map and the shallow feature map are fused, the fused image is upsampled by the upsampling layer, and the upsampled image is convolved by the third convolution layer to obtain the high-definition reconstructed image.
Optionally, the deep feature channel includes n information cascade modules and m improved residual modules, wherein the information cascade modules are connected in series to form an information cascade module group, and the improved residual modules are connected in series to form an improved residual module group.
Preferably, the deep feature channel comprises 2n information cascade modules, wherein n information cascade modules are connected in series to form a first information cascade module group, and the remaining n are connected in series to form a second information cascade module group; the first and second information cascade module groups are arranged at the input end and the output end of the improved residual module group respectively.
The process of training the image super-resolution reconstruction model comprises the following steps:
s1: obtaining an original high-definition picture data set, and downscaling the pictures in the data set by adopting a bicubic interpolation degradation model;
s2: preprocessing the downscaled data set to obtain a training data set;
s3: inputting each image in the training data set into the shallow feature channel and the deep feature channel of the image super-resolution reconstruction model respectively for feature extraction;
s4: extracting initial features of the input image by adopting the first convolution layer; inputting the initial features into the information cascade module, and aggregating the hierarchical feature information of the convolutional layers;
s5: inputting the hierarchical feature information aggregated by the information cascade module into the improved residual module to obtain channel-wise correlation and global spatial dependency information;
s6: adopting non-local dilated convolution to carry out global feature extraction on the dependency information to obtain the final deep feature map;
s7: extracting initial features of the input image by adopting the second convolution layer; inputting the initial features into the improved VGG network, and extracting shallow features of the image to obtain a shallow feature map;
s8: fusing the deep feature map and the shallow feature map, and upsampling the fused feature map to obtain a high-definition reconstructed image;
s9: constraining the difference between the high-definition reconstructed image and the original high-definition image by using a loss function, and continuously adjusting the parameters of the model until the model converges, completing the training of the model.
The DIV2K data set is adopted: eight hundred high-definition (HR) pictures, together with the corresponding low-resolution (LR) pictures produced by the degradation model (bicubic interpolation degradation), are used as the training set, and five pictures are used as the validation set. Five data sets, Set5, Set14, Urban100, Manga109 and BSD100, are used as test sets; these test sets are characterized by very rich texture information, most of which is lost in the degraded low-resolution pictures, which severely tests the accuracy of image super-resolution reconstruction. The evaluation indices are the conventional PSNR and SSIM, where PSNR denotes the peak signal-to-noise ratio and SSIM denotes the structural similarity.
One forward and one backward pass of all the data in the training set through the neural network is called an epoch; the parameters of the model are updated in each epoch, and the maximum number of epochs is set to 1000. The learning rate is updated every 200 epochs, and the model and parameters that achieve the best results on the test data set are saved over the 1000 training epochs.
The images in the original high-definition data set are downscaled by factors of 2, 3, 4 and 8 using the bicubic interpolation degradation model. The formula of the degradation model is:
I_LR = H_dn I_HR + n
where I_LR denotes the low-resolution image, H_dn denotes the degradation model, I_HR denotes the original high-resolution image, and n denotes additive noise.
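For illustration only, this degradation step can be sketched in a few lines of PyTorch; the function name make_lr, the use of torch.nn.functional.interpolate, and the noise level sigma are assumptions for the sketch, not part of the patent:

```python
import torch
import torch.nn.functional as F

def make_lr(hr: torch.Tensor, scale: int, sigma: float = 0.0) -> torch.Tensor:
    """Bicubic degradation I_LR = H_dn(I_HR) + n for a batch of HR images (N, C, H, W)."""
    lr = F.interpolate(hr, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
    if sigma > 0:  # optional additive noise term n
        lr = lr + sigma * torch.randn_like(lr)
    return lr.clamp(0.0, 1.0)
```

For example, make_lr(hr_batch, scale=4) would produce the x4 degraded input described above.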
The preprocessing of the downscaled data set comprises enhancing the images, including translating them and flipping them in the horizontal and vertical directions; the enhanced images are divided into small image blocks, and the divided blocks are collected to obtain the training data set.
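A minimal sketch of this augmentation and patch-extraction step, again assuming PyTorch tensors; the patch size of 48 and the 0.5 flip probability are illustrative assumptions:

```python
import random
import torch

def augment_and_crop(hr: torch.Tensor, lr: torch.Tensor, scale: int, patch: int = 48):
    """Random flip/translation augmentation plus paired LR/HR patch cropping.

    lr is (C, h, w); hr is (C, h*scale, w*scale); patch is the LR patch size.
    """
    _, h, w = lr.shape
    x = random.randint(0, w - patch)          # random translation of the crop window
    y = random.randint(0, h - patch)
    lr_p = lr[:, y:y + patch, x:x + patch]
    hr_p = hr[:, y * scale:(y + patch) * scale, x * scale:(x + patch) * scale]
    if random.random() < 0.5:                 # horizontal flip
        lr_p, hr_p = lr_p.flip(-1), hr_p.flip(-1)
    if random.random() < 0.5:                 # vertical flip
        lr_p, hr_p = lr_p.flip(-2), hr_p.flip(-2)
    return lr_p, hr_p
```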
As shown in fig. 2, the information cascade module consists of the following structure stacked 10 times: three convolutional neural network layers, a feature channel merging layer, a channel attention layer and a channel number transformation layer in sequence. The module processes image data as follows: each convolutional layer extracts feature information of the input in sequence; the feature information extracted by each convolutional layer is merged on the feature channel merging layer; the channel attention mechanism distinguishes the importance of the merged information; and the number of channels is finally reduced back to the number of input channels. These steps are repeated 10 times to obtain the aggregated hierarchical feature information of the convolutional layers.
The information cascade module aggregates image information so that the information of each convolutional layer is fully retained. When the image first enters the convolutional neural network, low-frequency information is abundant; as the network deepens, however, it attends to increasingly abstract features, and much edge texture and smoothness information is gradually lost. The information cascade module therefore captures more of this low-frequency information and fuses it into the model.
F_IC = H_IC(I_LR)
where I_LR denotes the low-resolution input image, H_IC denotes the convolution operations of the cascade module, and F_IC denotes the result of the convolution calculation.
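The cascade module can be sketched as follows, assuming PyTorch; the 3x3 kernel size, the base channel count of 64, and the squeeze-style gate inside ChannelAttention are assumptions where the text leaves details open:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Minimal channel attention gate: global average pool -> 1x1 conv (reduce)
    -> ReLU -> 1x1 conv (restore) -> sigmoid, multiplied back onto the input."""
    def __init__(self, ch: int, r: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class FeatureAggregation(nn.Module):
    """One feature aggregation structure: three chained 3x3 convolutions whose
    outputs are merged on the channel axis, weighted by channel attention, then
    reduced back to the input channel count by a 1x1 convolution."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3))
        self.att = ChannelAttention(3 * ch)
        self.reduce = nn.Conv2d(3 * ch, ch, 1)   # channel number transformation layer

    def forward(self, x):
        feats, out = [], x
        for conv in self.convs:                  # extract features layer by layer
            out = conv(out)
            feats.append(out)
        merged = torch.cat(feats, dim=1)         # feature channel merging layer
        return self.reduce(self.att(merged))

class InformationCascade(nn.Module):
    """The information cascade module: the aggregation structure stacked 10 times."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(*[FeatureAggregation(ch) for _ in range(10)])

    def forward(self, x):
        return self.body(x)
```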
As shown in fig. 3, the improved residual module comprises a residual network structure, a channel attention mechanism layer and a spatial attention mechanism layer, wherein the residual network structure consists of a convolutional layer, a nonlinear activation layer and a second convolutional layer. The module processes image data as follows: the hierarchical feature information is input into the residual network structure to extract feature information; the channel attention mechanism acquires the channel-wise correlation of the extracted feature information and passes it downwards; and the spatial attention mechanism acquires the dependency over the global space.
The output of the cascade module serves as the input of the improved residual module. A channel attention mechanism and a spatial attention mechanism are connected after each ResNet block, capturing channel-wise correlation and global spatial dependency information and integrating them into the convolutional neural network; this enriches the feature information and stabilizes the training of the deep network.
F_RBC = H_RBC(F_IC)
F_CA = H_CA(F_RBC)
F_SA = H_SA(F_CA)
where H_RBC denotes the convolution operation of the residual block structure with the input skip connection, i.e. the input information is fused with the output of the residual block. F_RBC denotes the feature information output by the residual block, which can be written as [f_1, f_2, f_3, …, f_n], where each f_i denotes the channel feature computed by one convolution kernel. The channel attention mechanism is then applied to each channel feature, and the weight of each channel is multiplied with the original input data; H_CA denotes the convolution operation of channel attention and F_CA denotes the feature information after channel attention. The spatial attention mechanism then computes global dependency information on the output feature information and fuses it with the original input data; H_SA denotes the convolution operation of the spatial attention mechanism and F_SA denotes the feature information after spatial attention.
As shown in fig. 4, the channel attention structure consists of a global average pooling layer, a 1x1 convolutional layer, a nonlinear activation layer and a 1x1 convolutional layer in sequence. It processes image data as follows: global average pooling first produces a weight descriptor for each channel; a 1x1 convolutional layer reduces the number of channels; the nonlinear activation layer introduces nonlinearity; another 1x1 convolutional layer restores the number of channels; and the result is finally multiplied back onto the original input feature information to obtain the correlation over the feature channels. The spatial attention structure consists of a 1x1 convolutional layer, a softmax activation layer, a 1x1 convolutional layer and a nonlinear activation layer in sequence. It processes image data as follows: a 1x1 convolutional layer first converts the input feature information of size CxHxW into a global feature map of size HWx1x1; the softmax function applies a normalization constraint to the global feature map; the map is multiplied back onto the original input information; and finally a 1x1 convolutional layer and a nonlinear activation layer yield the dependency information over the global space.
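Under the same PyTorch assumptions, the two attention structures and the improved residual block that wires them together might look as follows; the sigmoid gate, the reduction ratio r and the ReLU choices are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling -> 1x1 conv (reduce) -> ReLU -> 1x1 conv (restore),
    multiplied back onto the input; the sigmoid gate is an assumption."""
    def __init__(self, ch: int = 64, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # per-channel weight descriptor
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))

class SpatialAttention(nn.Module):
    """1x1 conv -> softmax-normalised HWx1x1 map -> multiply back -> 1x1 conv + ReLU."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, 1)             # collapse channels to one map
        self.out = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        n, c, h, w = x.shape
        attn = torch.softmax(self.score(x).view(n, 1, h * w), dim=-1).view(n, 1, h, w)
        return self.out(x * attn)

class ImprovedResidualBlock(nn.Module):
    """conv -> ReLU -> conv residual body followed by channel then spatial attention,
    with the block input added back (the skip connection described above)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.ca = ChannelAttention(ch)
        self.sa = SpatialAttention(ch)

    def forward(self, x):
        return x + self.sa(self.ca(self.body(x)))    # F_SA = H_SA(H_CA(H_RBC(x)))
```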
As shown in fig. 5, the non-local dilated convolution block consists of four parallel dilated convolution layers with dilation rates of 1, 2, 4 and 6 respectively and three ordinary convolutional neural network layers. The module processes image data as follows: feature information is extracted simultaneously with the four dilated convolutions of different dilation rates and with two ordinary convolutional layers; the feature information obtained by the four dilated convolutions is fused on the feature channel, while the feature information extracted by the ordinary convolutional layers is fused element-wise according to the values of the pixel matrices; finally the two kinds of fused feature information are added to obtain the global feature information.
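A sketch of this block under the same PyTorch assumptions; the 1x1 fusion convolution that restores the channel count after concatenation is an assumption needed to make the two branches addable:

```python
import torch
import torch.nn as nn

class NonLocalDilatedBlock(nn.Module):
    """Four parallel 3x3 convs with dilation rates 1, 2, 4, 6 (padding chosen to
    keep spatial size), fused on the channel axis, plus two ordinary convs whose
    outputs are fused element-wise; the two fused results are added."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.dilated = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 4, 6))
        self.fuse = nn.Conv2d(4 * ch, ch, 1)          # merge the concatenated branches
        self.plain1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.plain2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        dilated = self.fuse(torch.cat([conv(x) for conv in self.dilated], dim=1))
        plain = self.plain1(x) + self.plain2(x)       # element-wise (pixel-matrix) fusion
        return dilated + plain                        # global feature information
```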
The improved VGG network structure comprises 10 ordinary convolutional layers and 3 pooling layers, with the pooling layers embedded among the convolutional layers. The module processes image data as follows: 64-channel feature information is first extracted using 2 convolutional layers and one pooling layer; 128-channel feature information is then extracted using 2 convolutional layers and one pooling layer; 512-channel feature information is then extracted using 3 convolutional layers and one pooling layer; and the 512-channel information is finally restored to 64 channels using 3 convolutional layers. The pooling layers use padding to keep the feature dimensions unchanged.
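A sketch of this structure, assuming max pooling (as in the original VGG) and 3x3 kernels; stride-1 pooling with padding keeps the feature dimensions unchanged, as required above:

```python
import torch.nn as nn

def conv_relu(cin: int, cout: int):
    return [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True)]

class ImprovedVGG(nn.Module):
    """10 convolutional layers with 3 interleaved pooling layers and no fully
    connected layers; channel widths follow the text (64 -> 128 -> 512 -> 64)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        pool = lambda: nn.MaxPool2d(3, stride=1, padding=1)   # dimension-preserving
        self.body = nn.Sequential(
            *conv_relu(ch, 64), *conv_relu(64, 64), pool(),
            *conv_relu(64, 128), *conv_relu(128, 128), pool(),
            *conv_relu(128, 512), *conv_relu(512, 512), *conv_relu(512, 512), pool(),
            *conv_relu(512, 512), *conv_relu(512, 512),
            nn.Conv2d(512, ch, 3, padding=1))                 # restore to 64 channels

    def forward(self, x):
        return self.body(x)
```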
Non-local dilated convolution is used to extract global features, and the extracted feature information is upsampled to the required output size. By setting the dilation rate, dilated convolution enlarges the receptive field without increasing the number of parameters; embedding it into the non-local convolution markedly reduces the amount of computation while obtaining global information at different scales, making feature extraction more comprehensive.
F_NLHC = H_NLHC(F_SA)
where H_NLHC denotes the convolution operation of the non-local dilated convolution and F_NLHC denotes the feature information obtained after the non-local dilated convolution. The final feature information is upsampled and output as the corresponding high-definition reconstructed image; the upsampling formula is:
F_Up = H_Up(F_NLHC)
where H_Up denotes the convolution operation of the upsampling and F_Up denotes the upsampled output features.
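The upsampling step might be realized with sub-pixel (PixelShuffle) convolution, a common choice that the patent does not mandate; this sketch also folds in the third convolution layer that produces the final reconstruction:

```python
import torch.nn as nn

class Upsampler(nn.Module):
    """Sub-pixel upsampling followed by the final 3x3 convolution that maps
    features to the RGB reconstruction; the sub-pixel choice is an assumption."""
    def __init__(self, ch: int = 64, scale: int = 2, out_ch: int = 3):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(ch, ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))                       # rearranges channels into space
        self.final = nn.Conv2d(ch, out_ch, 3, padding=1)  # third convolution layer

    def forward(self, x):
        return self.final(self.up(x))
```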
The loss function expression of the image super-resolution reconstruction model is as follows:

L(θ) = (1/N) Σ_{i=1}^{N} ‖C_HR(I_LR^i) − I_HR^i‖_1

where θ denotes the parameters of the model, C_HR denotes the super-resolution calculation equation, I_LR^i and I_HR^i denote the i-th low-resolution image and the corresponding i-th high-resolution image respectively, N denotes the number of images in the data set, HR denotes high resolution, and LR denotes low resolution.
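A sketch of the corresponding training step, assuming the L1 loss reconstructed above, an Adam optimizer (an assumption), and the learning-rate update every 200 epochs described earlier; model, train_loader and the hyper-parameters are placeholders:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 1000, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # learning rate updated every 200 epochs, as described above
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    l1 = nn.L1Loss()                          # constrains |C_HR(I_LR) - I_HR|
    for epoch in range(epochs):
        for lr_img, hr_img in train_loader:   # paired LR/HR patches
            optimizer.zero_grad()
            loss = l1(model(lr_img), hr_img)
            loss.backward()
            optimizer.step()
        scheduler.step()
```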
The expression of the super-resolution calculation equation is as follows:
C_HR = F_Up(F_NLHC(F_SA(F_CA(F_RBC(F_IC(I_LR))))))

where F_Up denotes the upsampled output information, F_NLHC denotes the information extracted by the non-local dilated convolution, F_SA denotes the information extracted by the spatial attention mechanism, F_CA denotes the information extracted by the channel attention mechanism, F_RBC denotes the information extracted by the residual blocks, and F_IC denotes the information output by the cascade modules.
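Tying the sketches together, the full composition C_HR might be assembled as below; the module counts n and m, the additive fusion of the two channels, and the reuse of the classes sketched earlier (InformationCascade, ImprovedResidualBlock, NonLocalDilatedBlock, ImprovedVGG, Upsampler) are all assumptions:

```python
import torch.nn as nn
# InformationCascade, ImprovedResidualBlock, NonLocalDilatedBlock,
# ImprovedVGG and Upsampler are assumed defined as in the sketches above.

class SuperResolutionModel(nn.Module):
    """Two-channel assembly: a deep channel (first conv -> cascade module groups
    -> improved residual blocks -> non-local dilated block) and a shallow channel
    (second conv -> improved VGG), fused before upsampling."""
    def __init__(self, ch: int = 64, scale: int = 2, n: int = 2, m: int = 8):
        super().__init__()
        self.head_deep = nn.Conv2d(3, ch, 3, padding=1)      # first convolution layer
        self.cascade_in = nn.Sequential(*[InformationCascade(ch) for _ in range(n)])
        self.residuals = nn.Sequential(*[ImprovedResidualBlock(ch) for _ in range(m)])
        self.cascade_out = nn.Sequential(*[InformationCascade(ch) for _ in range(n)])
        self.non_local = NonLocalDilatedBlock(ch)
        self.head_shallow = nn.Conv2d(3, ch, 3, padding=1)   # second convolution layer
        self.vgg = ImprovedVGG(ch)
        self.upsampler = Upsampler(ch, scale)

    def forward(self, x):
        deep = self.non_local(self.cascade_out(self.residuals(
            self.cascade_in(self.head_deep(x)))))            # deep feature map
        shallow = self.vgg(self.head_shallow(x))             # shallow feature map
        return self.upsampler(deep + shallow)                # fuse, upsample, reconstruct
```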
Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as the result evaluation indices:

PSNR = 10 · log10(MAX² / MSE)

SSIM(X, Y) = ((2 μ_X μ_Y + C1)(2 σ_XY + C2)) / ((μ_X² + μ_Y² + C1)(σ_X² + σ_Y² + C2))

where MSE denotes the mean square error, MAX denotes the maximum pixel value, μ_X and μ_Y denote the pixel means of image X and image Y, σ_X and σ_Y denote the standard deviations of the pixels of image X and image Y, σ_XY denotes the covariance of image X and image Y, and C1 and C2 are small stabilizing constants.
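These two indices can be computed directly from their formulas; the SSIM sketch below uses a single global window for brevity, whereas standard SSIM averages the statistic over a sliding Gaussian window:

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = torch.mean((x - y) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()

def ssim_global(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> float:
    """Single-window SSIM over the whole image, using the formula above."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2   # standard constants
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).item()
```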
The above-mentioned embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and should not be construed as limiting it; any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.