CN114897718B - Low-light image enhancement method capable of balancing context information and space detail simultaneously - Google Patents
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
Abstract
The invention discloses a low-light image enhancement method capable of balancing context information and spatial details simultaneously, and relates to the technical field of image processing. The invention uses a CIRNet sub-network and a SIRNet sub-network to recover the context information and spatial details of the image, and uses a context-space feature fusion module to fuse the two kinds of information, comprising the following steps: S1, constructing a paired data set, wherein the data set comprises low-light images and normal-light images, and each low-light image I_low corresponds to a normal-light image I_ref of the same scene; S2, inputting the low-light image I_low into the network; S3, extracting shallow features of the low-light image I_low. Through the new context-space feature fusion module, the invention fuses the multi-scale semantic features learned by the encoder-decoder network with the point-to-point relationship between the input and output images learned by the full-resolution network, so that the restoration quality of the model is significantly improved.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a low-light image enhancement method capable of balancing context information and spatial details simultaneously.
Background
In daily life, image quality is directly or indirectly affected by external factors, and low illumination is one of the many sources of degradation. Images captured in poorly lit environments often suffer from low brightness and severe noise, which greatly reduces their visual quality and can hinder subsequent high-level vision tasks such as image recognition, image segmentation and image classification.
To enhance images captured in low-light environments, conventional image enhancement algorithms have been used to recover the content and details of low-light images, but these methods still have problems. For example, gray-level transformation methods directly stretch the lower gray values of the image through a mapping function to increase brightness; this enhances the overall contrast of a low-light image, but it ignores the uneven gray-level distribution of low-light images, so the restored image tends to lose details;
histogram equalization counts the number of pixels at each gray value and adjusts the dynamic range of the image by equalizing this distribution, but it ignores noise and easily produces overexposure. Traditional methods based on Retinex theory convolve the original image with a Gaussian kernel to estimate the illumination; they can enhance brightness while preserving edges, but halo artifacts may appear in regions with large brightness differences;
in recent years, with the rapid development of deep learning in image processing, many deep-learning-based low-light image enhancement algorithms have emerged. Methods using multi-branch networks and multi-scale input images are popular and can effectively increase the brightness of low-light images. However, most networks use only single-branch or multi-branch encoder-decoder networks, or only original-resolution networks, to recover the context information and spatial details of the image; the balance between the two is not considered, which limits the enhancement effect. Moreover, these methods supervise each sub-network separately, so the contents of the sub-networks may not match well when aggregated, further widening the gap and causing image halos, image artifacts and even color shifts.
Disclosure of Invention
The present invention aims to provide a low-light image enhancement method capable of balancing both context information and spatial details, so as to solve the problems raised in the background art above.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a low-light image enhancement method capable of balancing context information and space details simultaneously.
The method utilizes CIRNet sub-network and SIRNet sub-network to recover the context information and space detail of the image, and utilizes a context-space feature fusion module to fuse the two parts of information, and comprises the following steps:
s1, constructing a paired data set, wherein the data set comprises a low-illumination image and a normal-illumination image, and each low-illumination imageImage I low Normal illumination image I corresponding to the same scene ref ;
S2, inputting a low-light image I low Into the network;
s3, extracting a low-light image I low Is a shallow feature of (2);
s4. The CIRNet branch skillfully extracts the context semantic information by utilizing a multi-scale feature learning mode, and the encoding and decoding network comprises 3 encoders (EB 1 ,EB 2 And EB (electron beam) 3 ) And 3 decoders (DB 1 ,DB 2 And DB 3 ). Each coding module takes the characteristics of different scales as input, and each decoding module outputs the restored images of different scales;
s5, the SIRNet branch uses a full resolution sub-network to reserve point-to-point position information from an input image to an output image;
s6, a context-space feature fusion module CSFF fuses output features of the two branches;
s7, outputting the final enhanced low-light image.
Further, in S3, the shallow feature extraction module uses a convolution module and a channel attention module, and the formula is as follows:
f_shallow = f_CAB(f_Conv(I_low)).
Further, in S4, the CIRNet branch adopts a UNet-like network to recover the context information of the image, and each encoder and decoder consists of two channel attention modules;
the encoding stage consists of three encoders, wherein the first encoder takes the extracted shallow features as input, while each of the other two encoders takes the output of the previous encoder together with the pixel-by-pixel downsampled features from the Focus module as input;
the multi-scale shallow features and the Focus-sampled features are fused by a feature attention module and then input into the encoder;
the feature attention module takes two features as input; after compression of redundant information, the features are fed into a convolution module to obtain refined features, which are fused with the original input features to obtain the features finally input into the encoder, with the formula:
f_concat = [f_Focus, f_shallow]
wherein each decoder first outputs an enhanced multi-scale restored image through a supervised attention module, and simultaneously passes the useful information in the current features to the next decoder.
Further, in S5, the SIRNet branch consists of three original-resolution units, each composed of one channel attention module;
all features in the network have the same size as the input image, so as to restore the fine point-to-point positional relationship from the input image to the output image.
Further, in S6, the preliminary fused features pass through a 1×1 convolution layer, a ReLU activation layer and a 3×3 convolution layer to obtain refined fusion features; these features then pass through a designed attention module for extracting context information and an attention module for extracting spatial position details, respectively, each followed by a sigmoid activation function, to obtain recalibration weights, which are multiplied pixel-wise into the original features to refine the context information and the spatial features of the original input features respectively.
Further, the context-space feature fusion module mainly comprises:
fusing the initial features carrying different semantic content by pixel-wise addition;
weighting the fused features through a context information attention module and a spatial detail attention module;
enhancing the initial input features using the weighting information.
Further, the context information attention module processes the fused features through a global average pooling layer and a global maximum pooling layer respectively, and obtains weights that attend to the context information through a convolution layer and an activation layer.
Further, the spatial detail attention module feeds the fused features through a convolution layer, an activation layer and another convolution layer in sequence to obtain weights that attend to spatial detail.
Further, the method also comprises a low-light enhancement network, whose loss function includes:
L_SSIM = 1 − SSIM(I_enhanced, I_ref)
the invention has the following beneficial effects:
the invention enhances the point-to-point relationship between the input image and the output image by utilizing the multi-scale image semantic features learned by the encoding and decoding network and the full-resolution network through the context-space feature fusion module, thereby obviously improving the recovery effect of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an overall network framework of the present invention;
FIG. 2 is a schematic diagram of pixel-by-pixel downsampling in accordance with the present invention;
FIG. 3 is a block diagram of a downsampling feature fusion module employed in the present invention;
FIG. 4 is a block diagram of a context-space feature fusion module of the present invention;
FIG. 5 is a graph comparing images of the present invention with other algorithms after recovery.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The context information and spatial details of the image are restored by the two branches of a dual-branch network, and the design of the two branches affects the final restoration result;
effective fusion of the information from the two branches assists the overall restoration of the image. The context information of an image contains the rich semantic content of the original scene, while the spatial details preserve the point-to-point relationship between the input and output images. When enhancing a low-light image, the content hidden in darkness must be restored so that the semantic content of the image is complete, the information at each pixel must be recovered, and noise must be reduced. Most existing deep learning models use only a single-branch or multi-branch encoder-decoder network that focuses on restoring the context information, or only a full-resolution network that emphasizes restoring the spatial details; the balance between the two is not fully considered, so the enhanced images suffer from amplified noise, low overall and local brightness, severe color distortion and similar problems;
specifically, please refer to fig. 1, which is a schematic diagram of an overall network framework of the present invention;
wherein SIRNet is the spatial detail restoration sub-network and CIRNet is the context information restoration sub-network. CIRNet adopts a common UNet-like structure, which can effectively restore the semantic content of the image. To strengthen the network's ability to extract features, each encoder takes an image of the corresponding scale as input; the input of every encoder except the first fuses the output of the previous encoder with shallow features of the input image at the corresponding scale. Since the low-resolution inputs are produced by bilinear downsampling, which easily loses details, pixel-by-pixel downsampled features are used to compensate the shallow features of the low-resolution images. SIRNet adopts an original-resolution network, i.e. all its features have the same size as the input image. Balancing the two branches improves the network's ability to enhance low-light images;
to this end, the invention adopts two strategies: 1. the information extracted by CIRNet flows to SIRNet; 2. a new context-space feature fusion module is designed;
the whole flow of the invention is as follows:
1. Construct a paired data set; the data set consists of low-light images and normal-light images, where each low-light image I_low corresponds to a normal-light image I_ref of the same scene;
2. Take the normal-light image as the ground truth and input the constructed low-light image I_low into the network;
3. Extract shallow features of the input image with a 3×3 convolution layer and a channel attention module; the shallow features of the low-light image I_low are shared by both the SIRNet and CIRNet branches:
f_shallow = f_CAB(f_Conv(I_low))
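The patent does not disclose the internal layout of the channel attention module, so the shallow feature step f_shallow = f_CAB(f_Conv(I_low)) can only be sketched. The following PyTorch sketch assumes an SE-style channel attention gate; the `ChannelAttention` structure, the reduction ratio and the 64-channel width are illustrative assumptions, not the patented design:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """One plausible reading of f_CAB: squeeze (global average pooling),
    a two-layer bottleneck, and a sigmoid gate that rescales each channel."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)          # per-channel reweighting

class ShallowFeatureExtractor(nn.Module):
    """f_shallow = f_CAB(f_Conv(I_low)): a 3x3 convolution then channel attention."""
    def __init__(self, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)
        self.cab = ChannelAttention(feat_ch)

    def forward(self, i_low):
        return self.cab(self.conv(i_low))

x = torch.rand(1, 3, 64, 64)             # a dummy low-light image I_low
f_shallow = ShallowFeatureExtractor()(x)
print(f_shallow.shape)                   # torch.Size([1, 64, 64, 64])
```

The spatial size is unchanged, matching the requirement that both branches consume the same shallow features.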
4. Input the extracted shallow features into the CIRNet branch to extract context features and restore the context information. The CIRNet branch adopts an encoder-decoder network, which effectively extracts the contextual semantic information of the image by downsampling the features in the encoding stage and upsampling them in the decoding stage;
each encoder and each decoder consists of two channel attention modules, which focus on the useful information in each feature channel. The encoding stage consists of three encoders, wherein the first encoder takes the extracted shallow features as input, while each of the other two encoders takes the output of the previous encoder together with the pixel-by-pixel downsampled features from the Focus module as input. Since ordinary bilinear downsampling loses some image details, the Focus downsampling features are used to compensate for this information loss. As shown in fig. 2, the Focus module takes the three-channel RGB image as input and obtains a twelve-channel feature map by sampling pixels one by one; this feature map contains all pixels of the original RGB input, so no information is lost. The multi-scale shallow features and the Focus-sampled features are then fused through a feature attention module and input into the encoder. As shown in fig. 3, the feature attention module takes two features as input, feeds the features, after compression of redundant information, into a 3×3 convolution to obtain refined features, and fuses these with the original input features to obtain the features finally input into the encoder:
f_concat = [f_Focus, f_shallow]
wherein each decoder first outputs an enhanced multi-scale restored image through a supervised attention module, and simultaneously passes the useful information in the current features to the next decoder;
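The Focus pixel-by-pixel downsampling described above is a space-to-depth rearrangement: a 3-channel image becomes a 12-channel map at half resolution, with every original pixel preserved. A minimal NumPy sketch (the order in which the four sub-grids are stacked is an assumption; the patent only fixes the 3-to-12-channel shape and the losslessness):

```python
import numpy as np

def focus_downsample(img: np.ndarray) -> np.ndarray:
    """Pixel-by-pixel (space-to-depth) downsampling: a (3, H, W) RGB image
    becomes a (12, H/2, W/2) feature map that keeps every original pixel."""
    c, h, w = img.shape
    assert h % 2 == 0 and w % 2 == 0, "H and W must be even"
    # Take the four interleaved sub-grids and stack them along the channel axis.
    return np.concatenate(
        [img[:, 0::2, 0::2], img[:, 1::2, 0::2],
         img[:, 0::2, 1::2], img[:, 1::2, 1::2]],
        axis=0,
    )

rgb = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)
out = focus_downsample(rgb)
print(out.shape)   # (12, 2, 2) — 3 channels become 12, spatial size halved
```

Because the four sub-grids partition the pixel lattice, the output is a pure rearrangement: sorting the values of input and output gives identical multisets, which is exactly the "no information is lost" property claimed for the Focus module.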
the sirnet branch consists of three original resolution units, each original resolution unit further consisting of 1 channel attention module; all features in the sub-network are the same as the input image in size so as to restore the fine position relationship from the input image to the point to point of the output image; in order to balance the feature extraction capacity of two branch networks, the features extracted by an encoder and a decoder in CIRNet are fused and then transmitted into each original resolution unit of the branch for fusion;
6. Input the final features extracted by CIRNet and SIRNet into the context-space feature fusion module so that the two sets of features are fully fused, improving network performance.
The context-space feature fusion module is shown in fig. 4. The output features of the two branches first pass through a 1×1 convolution layer, a ReLU activation layer and a 3×3 convolution layer to obtain refined fusion features. The refined features then pass through a designed attention module for extracting context information and an attention module for extracting spatial position details, respectively, each followed by a sigmoid activation function, to obtain recalibration weights. Finally, the original features are weighted by these weights to obtain refined context information and spatial features, and the output features are obtained by fusing the two through the feature attention module;
the context information attention module processes the fused features through a global average pooling layer and a global maximum pooling layer respectively, and obtains weights that attend to the context information through a convolution layer and an activation layer;
the spatial detail attention module feeds the fused features through a convolution layer, an activation layer and another convolution layer in sequence to obtain weights that attend to spatial detail;
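The CSFF data flow described above can be sketched in PyTorch as follows. This is a hedged reading, not the patented implementation: how the average-pooled and max-pooled statistics are combined, the channel width, and how the two reweighted features are merged at the end are all assumptions made only to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class CSFF(nn.Module):
    """Sketch of the context-space feature fusion module: add the two branch
    features, refine with 1x1 conv -> ReLU -> 3x3 conv, derive a per-channel
    context weight (global avg + max pooling -> conv -> sigmoid) and a spatial
    weight map (conv -> ReLU -> conv -> sigmoid), then reweight the inputs."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.ctx_conv = nn.Conv2d(ch, ch, 1)        # context attention branch
        self.spa = nn.Sequential(                   # spatial detail branch
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, f_cir, f_sir):
        fused = self.refine(f_cir + f_sir)          # pixel-wise addition, then refinement
        # Context weight from pooled global statistics (summed here by assumption).
        pooled = (fused.mean(dim=(2, 3), keepdim=True)
                  + fused.amax(dim=(2, 3), keepdim=True))
        w_ctx = torch.sigmoid(self.ctx_conv(pooled))
        w_spa = torch.sigmoid(self.spa(fused))
        # Reweight the original branch features and merge.
        return f_cir * w_ctx + f_sir * w_spa

f1, f2 = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
print(CSFF()(f1, f2).shape)   # torch.Size([1, 64, 32, 32])
```

The key design point survives the simplifications: one gate attends to channel-wise (contextual) statistics, the other to per-pixel (spatial) detail, and both act multiplicatively on the original branch features.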
7. The output features pass through a channel attention module and a 3×3 convolution to obtain the final enhanced image;
8. The low-light enhancement network is trained with a combination of a pixel-level restoration loss and a structural similarity loss, the latter being:
L_SSIM = 1 − SSIM(I_enhanced, I_ref)
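The combined objective can be sketched as follows. The patent states only L_SSIM = 1 − SSIM(I_enhanced, I_ref); the choice of L1 as the pixel-level term, the weight `lam`, and the simplified single-window SSIM (the standard metric is computed over local Gaussian windows) are assumptions made to keep the sketch short:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Simplified global SSIM over the whole image (single window)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def total_loss(i_enh: np.ndarray, i_ref: np.ndarray, lam: float = 1.0) -> float:
    """L = L_pixel + lam * L_SSIM, with an assumed L1 pixel term."""
    l_pixel = np.abs(i_enh - i_ref).mean()
    l_ssim = 1.0 - ssim_global(i_enh, i_ref)
    return l_pixel + lam * l_ssim

img = np.random.rand(3, 16, 16)
print(total_loss(img, img))   # 0.0 — identical images give zero loss
```

Identical images yield SSIM = 1 and an L1 term of 0, so the loss vanishes exactly when I_enhanced matches I_ref.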
in one embodiment:
the invention is realized on Nvidia Tesla T4 GPU and Intel (R) Xeon (R) Silver 42142.20GHz CPU by using a PyTorch framework; to converge the model, 1000 epochs were trained on the LOL dataset and 100 epochs were trained on the MIT-Adobe FiveK dataset; wherein, 8 images are randomly sampled by each epoch, and then cut into patches with 256X 256 resolutions; vertical and horizontal flipping for data enhancement; initial learning rate was 2×10 using Adam optimizer -4 Using cosine annealingStrategy of attenuation to 1X 10 -6 ;
The invention is compared qualitatively and quantitatively with numerous existing low-light image enhancement algorithms on low-light datasets such as LOL, MIT-Adobe FiveK, LIME, MEF, NPE and synthetic data, evaluated with the measurement indices commonly used in image processing (PSNR, SSIM, LPIPS) as well as perceptual quality, and achieves good results;
the LOL dataset is a common real-world dataset in the field of low-light image enhancement, containing 500 pairs of low-light/normal-light images of indoor and outdoor scenes. On this representative dataset, compared with advanced algorithms in the field, the invention obtains the best peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS) values (larger PSNR is better, SSIM closer to 1 is better, smaller LPIPS is better). As shown in Table 1 below, our method achieves the best performance on PSNR, SSIM and LPIPS: PSNR = 24.64 dB and SSIM = 0.867, exceeding MIRNet by 0.5 dB in PSNR and 0.035 in SSIM, with LPIPS better than MIRNet by 0.022;
TABLE 1 partial quantitative index results on LOL dataset
The MIT-Adobe FiveK dataset contains 5000 captured images, with versions color-retouched by 5 experts usable as label images. Like other low-light image enhancement algorithms trained on this dataset, the invention adopts the retouching results of expert C as the ground truth, using the first 4500 pairs as the training set and the remaining 500 pairs as the test set. Compared with existing advanced algorithms, the invention obtains the best PSNR and the second-best SSIM, with peak signal-to-noise ratio and structural similarity reaching 25.85 dB (exceeding MIRNet by 2.12 dB) and 0.918 (only 0.007 lower than MIRNet) respectively;
TABLE 2 quantitative indicator results on MIT-Adobe FiveK dataset
Referring to fig. 5, it can be seen that the algorithm achieves better visual results in several qualitative comparisons. Images enhanced by most algorithms still contain significant noise, e.g. RetinexNet and MIRNet; some algorithms over-smooth local regions of the image, e.g. KinD and KinD++; others recover insufficient brightness, e.g. ZeroDCE; in addition, the images restored by DLN and EnlightenGAN show artifacts in dark areas. In contrast, the algorithm of the invention effectively removes noise in the dark while raising the overall brightness of the image, and restores its color saturation and contrast;
compared with a CIRNet network with only a single branch, the proposed dual-branch network achieves better effect on quantitative indexes; for example, on the LOL dataset, without psnr=23.87 dB, ssim=0.862, increasing SIRNet can restore spatial detail of the image more effectively, improving by 0.77dB, 0.005 on the index, respectively, as shown in table 3;
table 3 results of whether or not there is spatial information recovery sub-network
A new feature fusion module and a strategy for compensating the information loss caused by bilinear downsampling are proposed, as shown in Table 4. To verify the effectiveness of the feature fusion module and the compensation strategy, quantitative ablation experiments are conducted on the LOL dataset. The common feature fusion methods Add and Cat both perform worse than the proposed fusion method, while adopting the Focus compensation strategy further improves network performance;
table 4 quantitative results of ablation experiments for various components of the network
| CSFF | Cat | Add | Focus | PSNR | SSIM |
| --- | --- | --- | --- | --- | --- |
| √ | | | | 24.10 | 0.860 |
| | √ | | √ | 23.63 | 0.861 |
| | | √ | √ | 23.90 | 0.866 |
| √ | | | √ | 24.64 | 0.867 |
Table 5 compares the quantitative indices and parameter counts of the invention and the baseline model; the proposed model is significantly superior to the baseline model in both respects;
TABLE 5 Restore Net quantitative index and parameter results
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.
Claims (8)
1. A low-light image enhancement method capable of balancing context information and space detail simultaneously is characterized in that the context information and the space detail of an image are restored by using a CIRNet sub-network and a SIRNet sub-network, and the two parts of information are fused by using a context-space feature fusion module, comprising the following steps:
s1, constructing a paired data set, wherein the data set comprises a low-light image and a normal-light image, and each low-light image I low Normal illumination image I corresponding to the same scene ref ;
S2, inputting a low-light image I low Into the network;
s3, extracting a low-light image I low Is a shallow feature of (2);
S4, the CIRNet branch extracts contextual semantic information through multi-scale feature learning; its encoder-decoder network comprises encoding modules (encoders) and decoding modules (decoders), each encoder taking features of a different scale as input, and each decoder outputting a restored image at a different scale;
the CIRNet branch adopts a UNet-like network to recover the contextual information of the image, each encoder and decoder consisting of two channel attention modules;
the encoding stage consists of three encoders: the first encoder takes the extracted shallow features as input, while each of the other two encoders takes as input both the output of the preceding encoder and the features produced by the Focus module's pixel-wise downsampling;
the multi-scale shallow features and the Focus-sampled features are fused by a feature attention module before being fed into the encoder;
the feature attention module takes the two features as input; after useless information is compressed, the features are fed into a convolution module to obtain refined features, and the refined features are fused with the original input features to obtain the features finally fed into the encoder, as given by the formula:
f_concat = [f_Focus, f_shallow]
wherein each decoder outputs an enhanced multi-scale restored image through the preceding supervised attention module, while passing the useful information in the current features on to the next decoder, as given by the formula:
S5, the SIRNet branch uses a full-resolution sub-network to preserve point-to-point positional information from the input image to the output image;
S6, a context-space feature fusion module (CSFF) fuses the output features of the two branches;
S7, outputting the final enhanced image.
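As an illustration of the Focus-style pixel-wise downsampling used in step S4, the following NumPy sketch implements the usual space-to-depth operation; the function name `focus_downsample` and the 2× factor are assumptions, since the claim does not specify the implementation:

```python
import numpy as np

def focus_downsample(x: np.ndarray) -> np.ndarray:
    """Space-to-depth: take every other pixel and stack the four
    phase-shifted copies along the channel axis.

    x: feature map of shape (C, H, W) with even H and W.
    Returns a tensor of shape (4*C, H/2, W/2) — no information is lost,
    resolution is traded for channels.
    """
    return np.concatenate([
        x[:, 0::2, 0::2],  # even rows, even cols
        x[:, 1::2, 0::2],  # odd rows, even cols
        x[:, 0::2, 1::2],  # even rows, odd cols
        x[:, 1::2, 1::2],  # odd rows, odd cols
    ], axis=0)

feat = np.random.rand(3, 8, 8)
out = focus_downsample(feat)
print(out.shape)  # (12, 4, 4)
```

Because every input pixel survives in some output channel, this downsampling is consistent with the claim's goal of feeding lossless multi-scale features to the deeper encoders.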
2. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 1, wherein: in S3, the shallow feature extraction module uses a convolution module and a channel attention module, as given by the formula:
f_shallow = f_CAB(f_Conv(I_low)).
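A minimal NumPy sketch of the shallow extraction f_shallow = f_CAB(f_Conv(I_low)). The 1×1 convolution, the squeeze-and-excitation form of the channel attention block, and all weight shapes are assumptions; the claim fixes only the module order:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style channel attention: global average
    pool -> two projections with ReLU -> sigmoid -> rescale channels."""
    s = x.mean(axis=(1, 2))                     # squeeze: (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))   # excitation: (C,) in (0, 1)
    return x * a[:, None, None]                 # reweight each channel

rng = np.random.default_rng(0)
I_low = rng.random((3, 16, 16))             # toy low-light image I_low
w_conv = rng.standard_normal((8, 3)) * 0.1  # assumed 1x1 conv weights
w1 = rng.standard_normal((4, 8)) * 0.1      # assumed squeeze weights
w2 = rng.standard_normal((8, 4)) * 0.1      # assumed excitation weights
f_shallow = channel_attention(conv1x1(I_low, w_conv), w1, w2)
print(f_shallow.shape)  # (8, 16, 16)
```

Since the attention weights lie in (0, 1), the block can only attenuate channels of the convolved features, never amplify them.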
3. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 1, wherein: in S5, the SIRNet branch consists of three original-resolution units, each original-resolution unit consisting of one channel attention module;
all features in this branch have the same size as the input image, so as to preserve the fine point-to-point positional correspondence between the input image and the output image, as given by the formula:
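The full-resolution constraint of claim 3 amounts to using only 'same'-padded convolutions, so every feature map keeps the input's H×W. A toy single-channel NumPy sketch (the box-filter kernel is chosen only for illustration):

```python
import numpy as np

def conv3x3_same(x, k):
    """3x3 correlation with zero padding so H and W are unchanged.
    x: (H, W) single-channel map, k: (3, 3) kernel."""
    H, W = x.shape
    xp = np.pad(x, 1)                     # zero-pad one pixel on each side
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i+3, j:j+3] * k)
    return out

x = np.random.rand(16, 16)
k = np.full((3, 3), 1 / 9.0)              # stand-in 3x3 kernel
y = conv3x3_same(x, k)
print(y.shape)  # (16, 16) — spatial size preserved
```

Because no downsampling ever occurs in this branch, output pixel (i, j) is computed from a small neighborhood around input pixel (i, j), which is exactly the point-to-point correspondence the claim describes.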
4. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 1, wherein: in S6, the preliminary fusion features are passed through a 1×1 convolution layer, a ReLU activation layer, and a 3×3 convolution layer to obtain refined fusion features; these features are then fed respectively into an attention module designed to extract contextual information and an attention module designed to extract spatial position details, each followed by a sigmoid activation function, to obtain recalibration features, which are weighted onto the original features by pixel-wise multiplication so as to refine, respectively, the contextual information and the spatial features of the original input features.
5. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 4, wherein the context-space feature fusion module comprises:
fusing the initial features carrying different semantic content using pixel-wise addition;
weighting the fused features by a contextual information attention module and a spatial detail attention module;
enhancing the initial input features using the weighting information.
6. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 5, wherein: the context information attention module processes the fused features through a global average pooling layer and a global max pooling layer respectively, and obtains weights that attend to contextual information through a convolution layer and an activation layer.
7. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 5, wherein: the spatial detail attention module feeds the fused features sequentially through a convolution layer, an activation layer, and a convolution layer to obtain weights that attend to spatial detail.
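The fusion-and-weighting pipeline of claims 5-7 can be sketched compactly in NumPy. The 1×1 projections, the way the average- and max-pooled vectors are combined, and which branch receives which weight are assumptions; the claims fix only the pooling layers, the layer order, and the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_attention(f, w):
    """Claim 6: global average + max pooling, a shared projection,
    then sigmoid — one weight per channel."""
    avg = f.mean(axis=(1, 2))            # (C,)
    mx = f.max(axis=(1, 2))              # (C,)
    return sigmoid(w @ avg + w @ mx)     # (C,) in (0, 1)

def spatial_attention(f, w1, w2):
    """Claim 7: conv -> activation -> conv across channels, then sigmoid —
    one weight per spatial position (1x1 convs as a simplification)."""
    h = np.maximum(np.einsum('oc,chw->ohw', w1, f), 0.0)
    return sigmoid(np.einsum('oc,chw->ohw', w2, h))  # (1, H, W) in (0, 1)

rng = np.random.default_rng(1)
C, H, W = 8, 16, 16
f_cir = rng.random((C, H, W))   # CIRNet branch output (toy)
f_sir = rng.random((C, H, W))   # SIRNet branch output (toy)
fused = f_cir + f_sir           # claim 5, step 1: pixel-wise addition

w_c = rng.standard_normal((C, C)) * 0.1   # assumed projection weights
w_s1 = rng.standard_normal((4, C)) * 0.1
w_s2 = rng.standard_normal((1, 4)) * 0.1

a_c = context_attention(fused, w_c)          # claim 5, step 2
a_s = spatial_attention(fused, w_s1, w_s2)

# claim 5, step 3: enhance the initial inputs with the weights
f_cir_ref = f_cir * a_c[:, None, None]       # channel-wise weighting
f_sir_ref = f_sir * a_s                      # position-wise weighting
print(f_cir_ref.shape, f_sir_ref.shape)
```

The key design point the sketch preserves is that both attention maps are computed from the *fused* features, so each branch is recalibrated using evidence from the other branch.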
8. A low-light image enhancement method capable of balancing both contextual information and spatial detail as recited in claim 1, wherein: the method further comprises a low-light enhancement network whose loss function is:
L_SSIM = 1 - SSIM(I_enhanced, I_ref)
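The loss L_SSIM = 1 - SSIM(I_enhanced, I_ref) from claim 8 can be sketched as follows; a single global window is used instead of the standard sliding Gaussian window, purely to keep the example short:

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """SSIM computed over the whole image as one window (the usual
    formulation averages SSIM over local Gaussian windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def ssim_loss(enhanced, ref):
    # L_SSIM = 1 - SSIM(I_enhanced, I_ref)
    return 1.0 - ssim_global(enhanced, ref)

img = np.random.rand(32, 32)
print(ssim_loss(img, img))  # identical images: SSIM = 1, loss = 0.0
```

Minimizing this loss drives the enhanced image toward the normal-light reference in luminance, contrast, and structure simultaneously, which is why SSIM-based losses are common in low-light enhancement.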
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210472649.8A CN114897718B (en) | 2022-04-29 | 2022-04-29 | Low-light image enhancement method capable of balancing context information and space detail simultaneously |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210472649.8A CN114897718B (en) | 2022-04-29 | 2022-04-29 | Low-light image enhancement method capable of balancing context information and space detail simultaneously |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114897718A CN114897718A (en) | 2022-08-12 |
CN114897718B true CN114897718B (en) | 2023-09-19 |
Family
ID=82719192
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210472649.8A Active CN114897718B (en) | 2022-04-29 | 2022-04-29 | Low-light image enhancement method capable of balancing context information and space detail simultaneously |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114897718B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112889069A (en) * | 2018-11-08 | 2021-06-01 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer readable medium for improving low-light image quality
CN112884668A (en) * | 2021-02-22 | 2021-06-01 | Dalian University of Technology | Lightweight low-light image enhancement method based on multiple scales
CN113168684A (en) * | 2018-11-26 | 2021-07-23 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer readable medium for improving quality of low brightness image
CN113808032A (en) * | 2021-08-04 | 2021-12-17 | Beijing Jiaotong University | Multi-stage progressive image denoising algorithm
CN113902915A (en) * | 2021-10-12 | 2022-01-07 | Jiangsu University | Semantic segmentation method and system based on low-illumination complex road scene
CN114066747A (en) * | 2021-10-19 | 2022-02-18 | Chongqing University of Technology | Low-illumination image enhancement method based on illumination and reflection complementarity
CN114266707A (en) * | 2021-11-24 | 2022-04-01 | Chongqing University of Technology | Low-light image enhancement method combining attention mechanism and Retinex model
WO2022073452A1 (en) * | 2020-10-07 | 2022-04-14 | Wuhan University | Hyperspectral remote sensing image classification method based on self-attention context network
2022-04-29: CN CN202210472649.8A patent/CN114897718B/en, status: Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112889069A (en) * | 2018-11-08 | 2021-06-01 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer readable medium for improving low-light image quality
CN113168684A (en) * | 2018-11-26 | 2021-07-23 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer readable medium for improving quality of low brightness image
WO2022073452A1 (en) * | 2020-10-07 | 2022-04-14 | Wuhan University | Hyperspectral remote sensing image classification method based on self-attention context network
CN112884668A (en) * | 2021-02-22 | 2021-06-01 | Dalian University of Technology | Lightweight low-light image enhancement method based on multiple scales
CN113808032A (en) * | 2021-08-04 | 2021-12-17 | Beijing Jiaotong University | Multi-stage progressive image denoising algorithm
CN113902915A (en) * | 2021-10-12 | 2022-01-07 | Jiangsu University | Semantic segmentation method and system based on low-illumination complex road scene
CN114066747A (en) * | 2021-10-19 | 2022-02-18 | Chongqing University of Technology | Low-illumination image enhancement method based on illumination and reflection complementarity
CN114266707A (en) * | 2021-11-24 | 2022-04-01 | Chongqing University of Technology | Low-light image enhancement method combining attention mechanism and Retinex model
Non-Patent Citations (2)
Title |
---|
Syed Waqas Zamir et al. Multi-Stage Progressive Image Restoration. arXiv, 2021, pp. 1-11. *
Zhao Rui et al. Safety Helmet Detection Algorithm Based on Improved YOLOv5s. Journal of Beijing University of Aeronautics and Astronautics, 2021, pp. 1-17. *
Also Published As
Publication number | Publication date |
---|---|
CN114897718A (en) | 2022-08-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109754377B (en) | Multi-exposure image fusion method | |
CN114066747B (en) | Low-illumination image enhancement method based on illumination and reflection complementarity | |
CN113450290B (en) | Low-illumination image enhancement method and system based on image inpainting technology | |
CN113284064B (en) | Cross-scale context low-illumination image enhancement method based on attention mechanism | |
CN112465727A (en) | Low-illumination image enhancement method without normal illumination reference based on HSV color space and Retinex theory | |
CN112348747A (en) | Image enhancement method, device and storage medium | |
CN116797488A (en) | Low-illumination image enhancement method based on feature fusion and attention embedding | |
CN117011194B (en) | Low-light image enhancement method based on multi-scale dual-channel attention network | |
CN116152120A (en) | Low-light image enhancement method and device integrating high-low frequency characteristic information | |
CN116363036B (en) | Infrared and visible light image fusion method based on visual enhancement | |
CN115063318A (en) | Adaptive frequency-resolved low-illumination image enhancement method and related equipment | |
CN116596792B (en) | Inland river foggy scene recovery method, system and equipment for intelligent ship | |
CN114219722A (en) | Low-illumination image enhancement method by utilizing time-frequency domain hierarchical processing | |
CN115457249A (en) | Method and system for fusing and matching infrared image and visible light image | |
Ke et al. | Edllie-net: Enhanced deep convolutional networks for low-light image enhancement | |
CN114862707A (en) | Multi-scale feature recovery image enhancement method and device and storage medium | |
CN113379861B (en) | Color low-light-level image reconstruction method based on color recovery block | |
CN115035011A (en) | Low-illumination image enhancement method for self-adaptive RetinexNet under fusion strategy | |
CN117611467A (en) | Low-light image enhancement method capable of balancing details and brightness of different areas simultaneously | |
CN117422653A (en) | Low-light image enhancement method based on weight sharing and iterative data optimization | |
CN114897718B (en) | Low-light image enhancement method capable of balancing context information and space detail simultaneously | |
CN116579940A (en) | Real-time low-illumination image enhancement method based on convolutional neural network | |
Hua et al. | Iterative residual network for image dehazing | |
CN116523794A (en) | Low-light image enhancement method based on convolutional neural network | |
CN115760640A (en) | Coal mine low-illumination image enhancement method based on noise-containing Retinex model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||