CN111429433A - Multi-exposure image fusion method based on attention generative adversarial network - Google Patents

Multi-exposure image fusion method based on attention generative adversarial network

Info

Publication number
CN111429433A
CN111429433A (application CN202010219045.3A)
Authority
CN
China
Prior art keywords
network
attention
channel
feature
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010219045.3A
Other languages
Chinese (zh)
Inventor
李晓光
吴超玮
黄江鲁
卓力
李嘉锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010219045.3A priority Critical patent/CN111429433A/en
Publication of CN111429433A publication Critical patent/CN111429433A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-exposure image fusion method based on an attention generative adversarial network. The idea of the attention mechanism closely matches the detail-weighting problem in multi-exposure fusion: channel attention can adaptively select the weights of the individual input images, and spatial attention can adaptively select the weights of different spatial positions. The technology has broad application prospects in many multimedia vision fields. The algorithm designs a new attention generative adversarial network for the multi-exposure image fusion task; by introducing a visual attention mechanism into the generation network, it helps the network adaptively learn the weights of different input images and different spatial positions so as to achieve a better fusion effect.

Description

Multi-exposure image fusion method based on attention generative adversarial network
Technical Field
The invention belongs to the field of digital image/video signal processing, and particularly relates to a multi-exposure image fusion method based on an attention generative adversarial network.
Background
With the development of computer and multimedia technologies, various multimedia applications have created a broad demand for high-quality images. High-quality images provide rich information and a realistic visual experience. However, during image acquisition, factors such as the acquisition equipment, the acquisition environment and noise often mean that the image presented on the display terminal is of low quality. How to reconstruct a high-quality image from a low-quality one has therefore long been a challenge in the field of image processing.
From bright sunlight to dim starlight, the illumination intensity of natural scenes spans a very large dynamic range, with brightness contrasts that can exceed 14 orders of magnitude. A common digital camera, however, captures only 8 bits per color channel. This brightness resolution does not match the dynamic range of natural scene brightness: the limited dynamic range restricts a digital image's ability to render high-contrast natural scenes, producing over-exposed bright areas or under-exposed dark areas. Enhancing the dynamic range of image brightness can therefore effectively improve an image's ability to express high-contrast scenes and improve its visual quality.
Various multimedia applications place extensive demands on high dynamic range images and video: online video companies want to improve the subjective quality of their content by increasing its dynamic range, and mobile phone manufacturers promote the high dynamic range shooting capability of their cameras as a selling point. High dynamic range imaging therefore has wide application demand and important commercial value in the field of visual media.
For the problem of enhancing image dynamic range, multi-exposure fusion methods generate a detail-enhanced high dynamic range image by fusing the detail information of differently exposed images. How to select detail information from differently exposed images is itself a challenging problem.
By rapidly scanning the global image, the human visual system locates a target area that needs to be focused on (often referred to as the focus) and then devotes more attention to that area to obtain more detailed information about it. This is the human ability to quickly screen out high-value information from a large amount of information using limited resources, and it greatly improves the efficiency and accuracy of visual information processing.
Inspired by human visual attention, the concept of the attention mechanism has been introduced into deep learning. In recent years, deep-learning-based methods have had great success in many computer vision tasks and in some low-level image processing problems, and attention mechanisms in particular have proven effective in a variety of applications. We observe that the attention mechanism is well suited to the multi-exposure fusion problem, since it can be used to adaptively select weights.
The invention provides a multi-exposure image fusion method based on an attention generative adversarial network. The idea of the attention mechanism closely matches the detail-weighting problem in multi-exposure fusion: channel attention can adaptively select the weights of the individual input images, and spatial attention can adaptively select the weights of different spatial positions. The technology has broad application prospects in many multimedia vision fields.
Disclosure of Invention
The invention aims to overcome a defect of traditional multi-exposure image fusion methods, namely their reliance on hand-crafted calculation rules for defining the fusion weights. Addressing the problem of enhancing image dynamic range through multi-exposure fusion, it provides a multi-exposure fusion method based on an attention generative adversarial network.
The invention is realized by adopting the following technical means:
A multi-exposure image fusion method based on an attention generative adversarial network. First, several images with different exposures are fed into a generation network equipped with an attention mechanism to obtain a multi-exposure fused image; then the fused image and the target ground-truth image are sent to a discrimination network for judgment, and in the mutual game between the generation network and the discrimination network, a multi-exposure fusion generation network with enhanced detail and dynamic range is obtained by training. The overall network of the method, shown in FIG. 1, is divided into two parts: the generation network and the discrimination network; the structure of the generation network is shown in FIG. 2.
The invention introduces a visual attention mechanism into the designed generation network to adaptively select weights, and extracts image detail information using residual blocks.
The invention is realized by adopting the following technical means: a multi-exposure image fusion method based on an attention generative adversarial network, comprising three parts, namely construction of the attention-based generative adversarial network structure, adversarial training of the multi-exposure image fusion generation network and the discrimination network, and multi-exposure image fusion testing.
The first part builds the attention-based generative adversarial network: the overall network is composed of a generation network and a discrimination network, and the attention mechanism is introduced into the generation network. The network construction specifically comprises the following steps:
1) Generation network construction
The generation network structure, shown in FIG. 2, is formed by combining a feature extraction mechanism with an attention mechanism. The feature extraction part consists of a 3 × 3 convolution with 32 output channels and a PReLU activation operation, followed by 5 residual block modules with 32 input and output channels, followed by another 3 × 3 convolution with 32 output channels and a PReLU activation operation; the resulting feature maps are added to the corresponding positions of the feature maps produced by the first convolution-and-activation layer. This completes the feature extraction operation for one image and yields its 32 feature maps. The same feature extraction is performed on each of the N images in a training pair, giving 32 feature maps per image, and these are concatenated to obtain N × 32 feature maps.
Each residual block operation comprises, in order, one 3 × 3 convolution layer, a batch normalization operation and a PReLU activation, followed by another 3 × 3 convolution layer and batch normalization operation; finally, the resulting feature map is added to the corresponding positions of the block's input feature map to obtain the residual block output.
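For illustration only, a minimal PyTorch sketch of this feature extraction stage follows; it is not the patented implementation, and details the text leaves open (padding, chosen here to preserve spatial size) are assumptions:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # 3x3 conv -> batch norm -> PReLU -> 3x3 conv -> batch norm, then addition
    # of the block input at corresponding positions.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + x

class FeatureExtractor(nn.Module):
    # 3x3 conv (32 output channels) + PReLU, 5 residual blocks, another
    # 3x3 conv + PReLU, then addition of the first-layer feature maps.
    def __init__(self, in_channels: int = 3, features: int = 32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, features, kernel_size=3, padding=1),
            nn.PReLU(),
        )
        self.blocks = nn.Sequential(*[ResidualBlock(features) for _ in range(5)])
        self.tail = nn.Sequential(
            nn.Conv2d(features, features, kernel_size=3, padding=1),
            nn.PReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h0 = self.head(x)  # 32 feature maps from the first conv + PReLU
        return self.tail(self.blocks(h0)) + h0

Running this extractor on each of the N input images and concatenating the outputs along the channel axis (torch.cat) gives the N × 32 feature maps described above.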
The attention module in the invention is designed as a cascaded hybrid attention module: a channel attention operation is first performed on the input feature maps, and each channel's feature map is multiplied channel by channel by its channel attention weight to complete the channel attention operation; a spatial attention operation is then performed on the channel-adjusted feature maps, computing a weight for each spatial position and multiplying it element by element with the feature maps to complete the spatial attention operation. Performing channel attention and spatial attention in sequence completes one hybrid attention operation.
The channel attention operation extracts the attention parameters through two pooling operations over the channel planes. The global average and the global maximum of each channel of the input feature maps are computed, giving two feature vectors whose length equals the number of channels; the two vectors are each passed through a weight-shared multilayer perceptron, summed, and then passed through a sigmoid activation to obtain the channel attention result, i.e., a weight for each feature map. Multiplying each channel by its channel attention weight yields the feature maps after channel attention adjustment.
The spatial attention operation applies Average pooling and Max pooling to the feature maps of all channels at each spatial position and concatenates the results along the channel dimension, giving 2 weight matrices of the same spatial size as the input feature maps. A 7 × 7 convolution is then applied to obtain a spatial attention weight matrix of the same spatial size as the input feature maps, i.e., a weight for each spatial position. Element-wise multiplication of the channel-adjusted feature maps with the spatial attention weights completes the hybrid attention operation.
After the attention operation, a convolution operation of 3 × 3 is performed, and the output fusion result is obtained through the tanh activation function.
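A hedged PyTorch sketch of this cascaded hybrid attention module follows; the reduction ratio of the shared multilayer perceptron and the convolution paddings are assumptions, since the text does not specify them:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Global average and max pooling per channel, a shared-weight multilayer
    # perceptron applied to both pooled vectors, summation, then sigmoid,
    # giving one weight per channel (reduction ratio 16 is an assumption).
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return x * w.view(b, c, 1, 1)  # channel-by-channel multiplication

class SpatialAttention(nn.Module):
    # Per-position average and max over the channel dimension, concatenation,
    # a 7x7 convolution, then sigmoid, giving one weight per spatial position.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))  # element-by-element multiplication

class MixedAttention(nn.Module):
    # Cascaded hybrid attention: channel attention first, then spatial attention.
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))

The fusion head that follows the attention operation would then be the 3 × 3 convolution and tanh described above, e.g. nn.Sequential(nn.Conv2d(3 * 32, 3, 3, padding=1), nn.Tanh()).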
2) Discrimination network construction
The discrimination network follows the generation network, receiving the output of the generation network and the ground truth corresponding to the network input image as its two inputs, and judges which of the two is real. Its structural parameters are listed in Table 1. The discrimination network comprises 10 convolution layers, each with filters of size 3 × 3; the number of filters grows continuously from 64 to 1024, doubling every 2 layers. In convolution layers 2 to 8, each layer comprises 1 convolution operation, 1 batch normalization and 1 LeakyReLU activation; only the 1st convolution layer has no batch normalization. The resulting 512 feature maps then pass, in order, through an average pooling operation, a convolution operation, a LeakyReLU activation and another convolution operation, and the discrimination result is finally output through a sigmoid activation.
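The same structure could be sketched in PyTorch as follows; the stride pattern (alternating 1 and 2, inferred from the detailed description below), the LeakyReLU slope, the use of global average pooling and the 1 × 1 kernels of the last two convolutions are assumptions rather than values fixed by the text:

import torch.nn as nn

def build_discriminator(in_channels: int = 3) -> nn.Sequential:
    layers = []
    channels = [64, 64, 128, 128, 256, 256, 512, 512]   # doubles every 2 layers
    prev = in_channels
    for i, ch in enumerate(channels):
        stride = 1 if i % 2 == 0 else 2                 # assumed alternation
        layers.append(nn.Conv2d(prev, ch, kernel_size=3, stride=stride, padding=1))
        if i > 0:                                       # no BN in the 1st conv layer
            layers.append(nn.BatchNorm2d(ch))
        layers.append(nn.LeakyReLU(0.2, inplace=True))  # slope 0.2 assumed
        prev = ch
    layers += [
        nn.AdaptiveAvgPool2d(1),                        # average pooling
        nn.Conv2d(512, 1024, kernel_size=1),            # 9th convolution
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(1024, 1, kernel_size=1),              # 10th convolution
        nn.Sigmoid(),                                   # real/fake score
    ]
    return nn.Sequential(*layers)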
The second part is the adversarial training of the multi-exposure image fusion generation network and the discrimination network.
Training data is prepared first by processing a public multi-exposure image data set. The original data set contains 589 samples; each sample comprises 2-7 low dynamic range images with different exposures and a corresponding ground-truth image, with image sizes around 3000 × 5000 pixels. From this data set, 440 image pairs are selected as the training set: each sample contributes 3 low dynamic range images with different exposures and the corresponding ground-truth image as a training sample pair. The images are downsampled and then segmented, each pair of images being divided into 6 blocks, finally yielding 2640 pairs of image blocks as the training data set.
The specific adversarial training method alternates between the generation network and the discrimination network: the generation network is first trained for one step using the generation loss, with back-propagation; then the discrimination network is trained for one step using the discrimination loss, with back-propagation; and the two keep alternating in this way. The overall loss function is shown in equation (1):
$\min_G \max_D f(G, D)$,    (1)
so as to achieve Nash equilibrium and complete training.
The loss function of the generation network designed by the invention consists of four parts: image loss ($l_{mse}$), perceptual loss ($l_{pe}$), adversarial loss ($l_{ad}$) and TV loss ($l_{tv}$).
Adding these 4 losses in certain proportions gives the generation network loss; the specific loss function is shown in equation (2):
$l_{mef} = \alpha l_{mse} + \beta l_{pe} + \gamma l_{ad} + \lambda l_{tv}$,    (2)

Finally, there is the multi-exposure fusion testing part.
The test data set consists of the 97 pairs left over after the training portion is selected; they are downsampled, not cropped, and fed into the test program to generate multi-exposure fused images. The test program applies the result of the adversarial training of the multi-exposure fusion generation network and the discrimination network: the parameters obtained from adversarial training are loaded into the test program, multi-exposure image fusion is performed to generate the fused images, and evaluation uses the subjective visual effect together with the objective peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indices.
The algorithm designs a new attention generative adversarial network for the multi-exposure image fusion task; by introducing a visual attention mechanism into the generation network, it helps the network adaptively learn the weights of different input images and different spatial positions so as to achieve a better fusion effect.
description of the drawings:
FIG. 1, a network architecture diagram;
FIG. 2, generation network structure diagram;
FIG. 3, attention module structure diagram;
FIG. 4, a comparison of subjective results between the method of the present invention and existing methods, wherein the top row shows, from left to right, the low-, medium- and high-exposure inputs and the label image, and the bottom row shows, from left to right, the results of Li, Ma, Kou and of the method of the present invention
The specific implementation is as follows:
The embodiments of the invention are described below in conjunction with the accompanying drawings:
First, an attention-based generative adversarial network is built, comprising a generation network and a discrimination network, with the attention mechanism introduced into the generation network. Second, the multi-exposure fusion generation network and the discrimination network undergo adversarial training: the two networks are trained alternately against each other on the training sample set to obtain the network parameters of the generation network. Finally, in the testing stage, i.e. the multi-exposure fusion stage, 3 differently exposed images are used as input, and multi-exposure image fusion is realized through the trained generation network. The specific process is described below.
(1) Network construction
a) Generation network
The generation network is divided into 2 stages: the first half is a feature extraction and connection network, and the second half is an attention operation and fusion network.
In the feature extraction stage, a 3 × 3 convolution is applied first, followed by a PReLU activation operation and then 5 residual block operations. Each residual block operation comprises 1 3 × 3 convolution, 1 batch normalization operation, 1 PReLU activation operation, 1 further 3 × 3 convolution and 1 further batch normalization operation; the feature maps initially input to the residual block are then added to the feature maps of the last batch normalization operation to give the residual block output. After the 5 residual block operations, the resulting 32 feature maps pass through another 3 × 3 convolution and 1 PReLU activation operation and are added to the feature maps obtained from the first convolution operation. This completes feature extraction for each input image, yielding 32 feature maps.
In the connection stage, the feature maps obtained in the feature extraction stage from the 3 differently exposed input images are concatenated, giving 3 × 32 feature maps.
The attention operation stage is a hybrid attention operation, i.e. a channel attention operation followed by a spatial attention operation. The channel attention operation pools the input feature maps channel by channel with Average pooling and Max pooling operations, passes the 2 resulting feature vectors through a multilayer perceptron and adds them, and finally obtains the channel attention result through a sigmoid operation; multiplying this result with the input feature maps completes the channel attention operation and gives 3 × 32 feature maps. The feature maps after channel attention are then pooled spatially: Average pooling and Max pooling compress the feature maps along the channel dimension, and a 7 × 7 convolution then yields a spatial attention weight for each spatial position; multiplying this result with the input feature maps completes the spatial attention operation. Completing these two operations completes the attention operation, giving 3 × 32 feature maps.
In the fusion stage, the 3 × 32 feature maps that have undergone the attention operations pass through another 3 × 3 convolution, and the output is finally activated with the tanh function.
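Under the assumptions already noted, the whole generation network could be assembled from the FeatureExtractor and MixedAttention sketches given earlier; whether the extraction weights are shared across the 3 inputs is not stated in the text, and sharing is assumed here:

import torch
import torch.nn as nn

class MEFGenerator(nn.Module):
    # Feature extraction per exposure, channel-wise concatenation, hybrid
    # attention, then the 3x3-conv + tanh fusion head.
    def __init__(self, n_inputs: int = 3, features: int = 32):
        super().__init__()
        self.extract = FeatureExtractor(in_channels=3, features=features)
        self.attend = MixedAttention(n_inputs * features)
        self.fuse = nn.Sequential(
            nn.Conv2d(n_inputs * features, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, exposures):  # list of N tensors, each of shape Bx3xHxW
        feats = torch.cat([self.extract(x) for x in exposures], dim=1)  # Bx(N*32)xHxW
        return self.fuse(self.attend(feats))  # Bx3xHxW fused image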
b) Discrimination network
The discrimination network is connected behind the generation network and receives as its two inputs the multi-exposure fusion result of the generation network and the ground truth corresponding to the generation network's input, each of size 400 × 400 × 3 pixels; it is used to judge whether the generated image and the ground truth belong to the same category.
the method comprises the steps of judging that a network sequentially conducts convolution with convolution kernel of 3 × 3 step size 1, then L eakyRe L U operation to obtain 64 feature maps, then completing 1 convolution with convolution kernel of 3L step size 2, 1 batch normalization operation, 1L eakyRe L U operation to obtain 64 feature maps, then completing 1 convolution with convolution kernel of 3L step size 1, 1 batch normalization operation, 1L eakyRe L U operation, obtaining 128 feature maps, completing 1 convolution with convolution kernel of 3L step size 2, 1 batch normalization operation, 1L eakyRe L U operation, obtaining 128 feature maps, completing 1 convolution with convolution kernel of 3L step size 1, 1 batch normalization operation, 1 367 eakyRe L U operation, L operation of L step size 1, L normalization operation, L operation of L is achieved, L, the initial convolution with convolution operation of 3 step size 2, L is achieved by L, the initial convolution operation of L operation, L operation of L is achieved by L, and the initial convolution with a L operation of L is achieved by L operation, and the initial normalization operation of L is achieved by L a specific convolution with a step size 1 operation of L, and L a L is achieved by L a normalization operation of L operation.
(2) Adversarial training
a) Training data set preparation
A public multi-exposure image data set is adopted and preprocessed to obtain the training data set and the test data set. The original data set contains 589 samples; each sample comprises 2-7 low dynamic range images with different exposures and a corresponding ground-truth image, with image sizes around 3000 × 5000 pixels. Based on this data set, 440 sample pairs are selected to form the training data set: each sample contributes 3 low dynamic range images with different exposures and the corresponding ground-truth image as a training sample pair. The spatial resolution of the samples is then uniformly reduced to 1200 × 800 pixels, which retains more detail and preserves the contrast of the input images to the greatest extent, and each image is correspondingly divided into image blocks of 400 × 400 pixels, so that the training data set contains 2640 pairs of image blocks.
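A minimal preparation sketch under stated assumptions (bicubic resampling; non-overlapping tiling, which splits a 1200 × 800 image into exactly six 400 × 400 blocks):

from PIL import Image

def prepare_sample(exposure_paths, gt_path, block=400):
    # Downsample every image of the sample pair to 1200x800, then tile each
    # into six aligned 400x400 blocks (3 columns by 2 rows).
    images = [Image.open(p).resize((1200, 800), Image.BICUBIC)
              for p in list(exposure_paths) + [gt_path]]
    blocks = []
    for top in range(0, 800, block):
        for left in range(0, 1200, block):
            box = (left, top, left + block, top + block)
            blocks.append([im.crop(box) for im in images])  # 3 exposures + ground truth
    return blocks  # 6 aligned block pairs per sample

At 440 samples and 6 blocks each, this reproduces the 2640 training pairs stated above.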
b) Loss function
During the training process of the network, the total loss function is shown in formula (1):
$\min_G \max_D f(G, D)$,    (1)

The generation network and the discrimination network are trained alternately so as to reach Nash equilibrium.
The definition of the loss function is crucial to the generation network. The loss function of the generation network designed by the invention consists of four parts: image loss ($l_{mse}$), perceptual loss ($l_{pe}$), adversarial loss ($l_{ad}$) and TV loss ($l_{tv}$).
Adding the 4 losses according to a certain proportion generates the network loss, and the specific loss function is shown as formula (2):
$l_{mef} = \alpha l_{mse} + \beta l_{pe} + \gamma l_{ad} + \lambda l_{tv}$,    (2)

where $\alpha = 1$, $\beta = 6 \times 10^{-3}$, $\gamma = 10^{-3}$ and $\lambda = 2 \times 10^{-8}$.
Specifically, $l_{mse}$ computes the mean square loss between the fusion result generated by the generation network and the ground truth, while $l_{pe}$ is a perceptual loss that computes the mean square loss between the feature maps obtained by passing the generation network's fusion result and the ground truth through a pre-trained VGG network, as shown in equations (3) and (4):
$l_{mse} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( GT_{x,y} - F_{i,x,y} \right)^2$,    (3)

$l_{pe} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( Vgg(GT)_{x,y} - Vgg(F_i)_{x,y} \right)^2$,    (4)

where W and H denote the width and height of the input image respectively, $F_i$ refers to the fusion result generated by the generation network, GT refers to the ground truth corresponding to the input, and Vgg corresponds to the operation of the pre-trained VGG network; the invention uses the output of the first 30 layers of the pre-trained VGG network for this computation.
$l_{tv}$ usually takes a very small weight (on the order of $10^{-8}$ or less) and is used together with the other losses to suppress noise in the generation process; $l_{tv}$ is shown in equation (5):
$l_{tv} = \int_{D_u} \left| \nabla u \right| \, dx \, dy$,    (5)

wherein

$\left| \nabla u \right| = \sqrt{u_x^2 + u_y^2}$,

u refers to the image being computed and $D_u$ refers to the support domain of the image.
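For illustration, the combined generation loss could be sketched as follows; the choice of VGG-19 (the text says only "VGG"), the non-saturating form of the adversarial term and the discrete approximation of the TV integral are assumptions:

import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# First 30 layers of a pretrained VGG, frozen (VGG-19 is an assumption).
vgg_features = vgg19(weights="IMAGENET1K_V1").features[:30].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def tv_loss(u: torch.Tensor) -> torch.Tensor:
    # Discrete total variation: summed gradient magnitudes over the image domain.
    dh = (u[..., 1:, :] - u[..., :-1, :]).abs().sum()
    dw = (u[..., :, 1:] - u[..., :, :-1]).abs().sum()
    return dh + dw

def generator_loss(fused, gt, d_fused,
                   alpha=1.0, beta=6e-3, gamma=1e-3, lam=2e-8):
    l_mse = F.mse_loss(fused, gt)                             # eq. (3)
    l_pe = F.mse_loss(vgg_features(fused), vgg_features(gt))  # eq. (4)
    l_ad = -torch.log(d_fused + 1e-8).mean()                  # adversarial term (assumed form)
    return alpha * l_mse + beta * l_pe + gamma * l_ad + lam * tv_loss(fused)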
c) Adversarial training process
The specific training process alternates the generation network and the discrimination network: the generation network is trained for 1 step and back-propagated, then the discrimination network is updated for 1 step and back-propagated. During training, the batch size is set to 1, and convergence is reached after about 100 rounds of training.
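A sketch of this alternating loop, reusing generator_loss from the sketch above; the Adam optimizer and the binary cross-entropy discriminator loss are assumptions, since the text specifies only the alternation, the batch size of 1 and roughly 100 rounds:

import torch
import torch.nn as nn

def adversarial_train(generator, discriminator, loader, epochs=100, device="cpu"):
    g_opt = torch.optim.Adam(generator.parameters())
    d_opt = torch.optim.Adam(discriminator.parameters())
    bce = nn.BCELoss()
    generator.to(device)
    discriminator.to(device)
    for _ in range(epochs):                      # roughly 100 rounds to converge
        for exposures, gt in loader:             # batch size 1
            exposures = [x.to(device) for x in exposures]
            gt = gt.to(device)
            # 1 generator update with the generation loss, then back-propagation.
            fused = generator(exposures)
            g_loss = generator_loss(fused, gt, discriminator(fused))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
            # 1 discriminator update with the discrimination loss, then back-propagation.
            d_real = discriminator(gt)
            d_fake = discriminator(fused.detach())
            d_loss = bce(d_real, torch.ones_like(d_real)) + \
                     bce(d_fake, torch.zeros_like(d_fake))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()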
(3) Multi-exposure fusion test
The test data set applied by the test program consists of the 97 pairs of data remaining in the original data set after the training data set is removed, downsampled to size 400 × 400 × 3 without cropping; each pair contains 3 differently exposed images as input and 1 labelled ground-truth image for comparative evaluation against the generated multi-exposure fused image.
The multi-exposure fusion test extracts the generation network part of the generative adversarial network, loads into it the network parameters obtained in the last round of the adversarial training stage, and feeds the test data set into the network to obtain the multi-exposure fused images. Specifically, 3 differently exposed images of size 400 × 400 × 3 are input into the generation network, and through feature extraction, connection, attention operations and fusion, 1 fused image of size 400 × 400 × 3 with sharpened detail and enhanced contrast is obtained.
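A test-time sketch; the checkpoint filename and the packing of the three exposures into a list are illustrative assumptions:

import torch

def fuse_exposures(generator, low, mid, high, weights="mef_generator.pth"):
    # Load the generation network parameters saved after the last round of
    # adversarial training (the path is hypothetical), then fuse the exposures.
    generator.load_state_dict(torch.load(weights, map_location="cpu"))
    generator.eval()
    with torch.no_grad():
        # low/mid/high: 1x3x400x400 tensors; returns a 1x3x400x400 fused image.
        return generator([low, mid, high])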
To verify the effectiveness of the invention, the subjective visual effect and objective numerical indices are used to evaluate the fusion results. The subjective visual comparison between the method of the invention and other existing methods is shown in FIG. 4, while the objective evaluation adopts two common image quality indices, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM); the results are shown in Table 2. Subjectively, the results of the method show clearer detail and stronger contrast; objectively, its index values are higher; the method therefore outperforms the existing methods both subjectively and objectively.
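The two objective indices can be computed, for example, with scikit-image (a sketch, assuming fused and gt are H × W × 3 arrays scaled to [0, 1]):

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(fused, gt):
    psnr = peak_signal_noise_ratio(gt, fused, data_range=1.0)   # in dB
    ssim = structural_similarity(gt, fused, data_range=1.0, channel_axis=-1)
    return psnr, ssim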
Table 1 Network parameters of the discrimination network
[The parameter table is provided as an image in the original publication; the layer-by-layer structure is described in the discrimination network sections above.]
TABLE 2 Objective comparison between the method of the present invention and existing methods

Method           Average PSNR (dB)   Average SSIM
Li               15.9827             0.5234
Ma               15.9915             0.5350
Kou              16.4513             0.5469
The invention    17.6994             0.5489

Claims (1)

1. A multi-exposure image fusion method based on an attention generative adversarial network, characterized in that: the method comprises three parts, namely construction of a generative adversarial network structure based on an attention mechanism, adversarial training of the multi-exposure image fusion generation network and the discrimination network, and multi-exposure image fusion testing;
firstly, the first part builds the attention-based generative adversarial network: the overall network is composed of a generation network and a discrimination network, and the attention mechanism is introduced into the generation network; the network construction specifically comprises the following steps:
1) Generation network construction
the generation network structure is formed by combining a feature extraction mechanism with an attention mechanism; the feature extraction part consists of a 3 × 3 convolution with 32 output channels and a PReLU activation operation, followed by 5 residual block modules with 32 input and output channels, followed by another 3 × 3 convolution with 32 output channels and a PReLU activation operation; the resulting feature maps are added to the corresponding positions of the feature maps produced by the first convolution-and-activation layer, i.e. the feature extraction operation is completed for one image, giving its 32 feature maps; simultaneously, the same feature extraction is performed on each of the N input images in a training pair, giving 32 feature maps per image, and these are concatenated to obtain N × 32 feature maps;
each residual block operation comprises, in order, one 3 × 3 convolution layer, a batch normalization operation and a PReLU activation, followed by another 3 × 3 convolution layer and batch normalization operation; finally, the resulting feature map is added to the corresponding positions of the input feature map to obtain the result of one residual block;
the attention module is designed as a cascaded hybrid attention module: a channel attention operation is first performed on the input feature maps, and each channel's feature map is multiplied channel by channel by its channel attention weight to complete the channel attention operation; a spatial attention operation is then performed on the channel-adjusted feature maps, computing a weight for each spatial position and multiplying it element by element with the feature maps to complete the spatial attention operation; performing channel attention and spatial attention in sequence completes the hybrid attention operation;
wherein the channel attention operation extracts the attention parameters by performing two pooling operations over the channel planes: the global average and the global maximum of each channel of the input feature maps are computed, giving two feature vectors whose length equals the number of channels; the two vectors are each passed through a weight-shared multilayer perceptron, summed, and then passed through a sigmoid activation to obtain the channel attention result, i.e. a weight for each feature map; multiplying each channel by its channel attention weight gives the feature maps after channel attention adjustment;
the spatial attention operation computes the average and the maximum of all channel feature maps at each spatial position and concatenates them along the channel dimension, giving 2 weight matrices consistent in size with the input feature maps; a 7 × 7 convolution operation is then performed to obtain a spatial attention weight matrix consistent in size with the input feature maps, i.e. a weight for each spatial position;
after the attention operation, a 3 × 3 convolution operation is performed, and the output fusion result is obtained through a tanh activation function;
2) Discrimination network construction
the discrimination network is connected to the generation network, receives the result of the generation network and the ground truth corresponding to the network input image, and is used to judge which of the two input images is real; the discrimination network comprises 10 convolution layers, each with filters of size 3 × 3; the number of filters grows continuously from 64 to 1024, doubling every 2 layers;
the second part is the adversarial training of the multi-exposure image fusion generation network and the discrimination network;
firstly, training data is prepared: the images are downsampled and then segmented, each pair of images being divided into 6 blocks;
the adversarial training method alternately trains the generation network and the discrimination network: the generation network is first trained for one step using the generation loss, with back-propagation; then the discrimination network is trained for one step using the discrimination loss, with back-propagation; the two keep alternating in this way; the overall loss function is shown in equation (1):
$\min_G \max_D f(G, D)$,    (1)
so as to achieve Nash equilibrium and complete training;
the loss function of the designed generation network consists of four parts, respectively image loss (l)mse) Loss of perception (l)pe) To combat the loss (l)ad) And TV loss (l)tv);
Adding the 4 losses according to a certain proportion generates the network loss, and the specific loss function is shown as formula (2):
lmef=αlmse+βlpe+γlad+ltv, (2)
the tested data set is selected by using the data left after the training part is selected, down-sampling processing is carried out, cutting is not carried out, a test program is input, and a multi-exposure fusion image is generated; and the test program generates the result of the network and the discrimination network countertraining by fusing the second part of the multi-exposure images, and inputs the parameters of the generated network obtained by the countertraining into the test program for multi-exposure image fusion to generate the multi-exposure fused image.
CN202010219045.3A 2020-03-25 2020-03-25 Multi-exposure image fusion method based on attention generative adversarial network Pending CN111429433A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010219045.3A CN111429433A (en) 2020-03-25 2020-03-25 Multi-exposure image fusion method based on attention generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010219045.3A CN111429433A (en) 2020-03-25 2020-03-25 Multi-exposure image fusion method based on attention generation countermeasure network

Publications (1)

Publication Number Publication Date
CN111429433A true CN111429433A (en) 2020-07-17

Family

ID=71548624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010219045.3A Pending CN111429433A (en) 2020-03-25 2020-03-25 Multi-exposure image fusion method based on attention generation countermeasure network

Country Status (1)

Country Link
CN (1) CN111429433A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899845A (en) * 2015-05-10 2015-09-09 北京工业大学 Method for fusing multiple exposure images based on 1 alphabeta space scene migration
CN110225260A (en) * 2019-05-24 2019-09-10 宁波大学 A kind of three-dimensional high dynamic range imaging method based on generation confrontation network
CN110555458A (en) * 2019-07-24 2019-12-10 中北大学 Multi-band image feature level fusion method for generating countermeasure network based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴超玮 (Wu Chaowei): "Research on image/video dynamic range enhancement technology based on attention mechanism", Wanfang master's degree thesis, 31 May 2021 (2021-05-31) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132790A (en) * 2020-09-02 2020-12-25 西安国际医学中心有限公司 DAC-GAN model construction method and application in mammary gland MR image
CN112132790B (en) * 2020-09-02 2024-05-14 西安国际医学中心有限公司 DAC-GAN model construction method and application thereof in mammary gland MR image
CN112488971A (en) * 2020-11-23 2021-03-12 石家庄铁路职业技术学院 Medical image fusion method for generating countermeasure network based on spatial attention mechanism and depth convolution
CN112580782B (en) * 2020-12-14 2024-02-09 华东理工大学 Channel-enhanced dual-attention generation countermeasure network and image generation method
CN112580782A (en) * 2020-12-14 2021-03-30 华东理工大学 Channel enhancement-based double-attention generation countermeasure network and image generation method
CN112766279A (en) * 2020-12-31 2021-05-07 中国船舶重工集团公司第七0九研究所 Image feature extraction method based on combined attention mechanism
CN112766087A (en) * 2021-01-04 2021-05-07 武汉大学 Optical remote sensing image ship detection method based on knowledge distillation
CN112927172A (en) * 2021-05-10 2021-06-08 北京市商汤科技开发有限公司 Training method and device of image processing network, electronic equipment and storage medium
CN113379655A (en) * 2021-05-18 2021-09-10 电子科技大学 Image synthesis method for generating antagonistic network based on dynamic self-attention
CN113379655B (en) * 2021-05-18 2022-07-29 电子科技大学 Image synthesis method for generating antagonistic network based on dynamic self-attention
CN113642452B (en) * 2021-08-10 2023-11-21 汇纳科技股份有限公司 Human body image quality evaluation method, device, system and storage medium
CN113642452A (en) * 2021-08-10 2021-11-12 汇纳科技股份有限公司 Human body image quality evaluation method, device, system and storage medium
CN113888443A (en) * 2021-10-21 2022-01-04 福州大学 Sing concert shooting method based on adaptive layer instance normalization GAN
CN113888443B (en) * 2021-10-21 2024-08-02 福州大学 Concert shooting method based on adaptive layer instance normalization GAN
CN114900619A (en) * 2022-05-06 2022-08-12 北京航空航天大学 Self-adaptive exposure driving camera shooting underwater image processing system

Similar Documents

Publication Publication Date Title
CN111429433A (en) Multi-exposure image fusion method based on attention generative adversarial network
Liang et al. Cameranet: A two-stage framework for effective camera isp learning
CN110210608B (en) Low-illumination image enhancement method based on attention mechanism and multi-level feature fusion
CN109447907B (en) Single image enhancement method based on full convolution neural network
CN112233038A (en) True image denoising method based on multi-scale fusion and edge enhancement
CN110458765B (en) Image quality enhancement method based on perception preserving convolution network
CN111292264A (en) Image high dynamic range reconstruction method based on deep learning
CN110225260B (en) Three-dimensional high dynamic range imaging method based on generation countermeasure network
CN110148088B (en) Image processing method, image rain removing method, device, terminal and medium
CN111986084A (en) Multi-camera low-illumination image quality enhancement method based on multi-task fusion
CN115223004A (en) Method for generating confrontation network image enhancement based on improved multi-scale fusion
CN113284061B (en) Underwater image enhancement method based on gradient network
CN112508812A (en) Image color cast correction method, model training method, device and equipment
CN113284070A (en) Non-uniform fog image defogging algorithm based on attention transfer mechanism
CN114648508A (en) Multi-exposure image fusion method based on multi-dimensional collaborative refined network
Ouyang et al. Neural camera simulators
Ye et al. Progressive and selective fusion network for high dynamic range imaging
CN116468625A (en) Single image defogging method and system based on pyramid efficient channel attention mechanism
Hu et al. Hierarchical discrepancy learning for image restoration quality assessment
Sun et al. Mipi 2023 challenge on rgbw remosaic: Methods and results
Liang et al. Method for reconstructing a high dynamic range image based on a single-shot filtered low dynamic range image
CN116245968A (en) Method for generating HDR image based on LDR image of transducer
CN116433516A (en) Low-illumination image denoising and enhancing method based on attention mechanism
CN115661012A (en) Multi-exposure image fusion system based on global-local aggregation learning
CN115601792A (en) Cow face image enhancement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination