CN115170915A - Infrared and visible light image fusion method based on end-to-end attention network

Info

Publication number
CN115170915A
Authority
CN
China
Prior art keywords
image
infrared
visible light
fusion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210954041.9A
Other languages
Chinese (zh)
Inventor
江旻珊
朱永飞
常敏
张学典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210954041.9A priority Critical patent/CN115170915A/en
Publication of CN115170915A publication Critical patent/CN115170915A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an infrared and visible light image fusion method based on an end-to-end attention network, which comprises: preprocessing the infrared and visible light images; constructing an end-to-end attention network, wherein the attention network comprises a self-coding network and a channel-space dual attention fusion layer, and the self-coding network comprises an encoder and a decoder connected by skip connections; and fusing the preprocessed infrared and visible light images based on the attention network. The invention overcomes the drawback of traditional fusion methods that extract features from different source images with the same hand-crafted procedure, reduces the limitations of manually designed fusion strategies, and finally generates a single fused image containing the feature information of multiple source images.

Description

Infrared and visible light image fusion method based on end-to-end attention network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an infrared and visible light image fusion method based on an end-to-end attention network.
Background
Image fusion is an image processing technology for enhancing image information. Owing to theoretical and hardware limitations, a single sensor cannot effectively and comprehensively describe scene information under a specific shooting setting; for example, a visible light image contains more detailed texture information, whereas an infrared image contains more amplitude (intensity) information. Image fusion therefore combines the complementary information between different source images of the same scene to generate a single image with richer information, which can be applied in many fields such as photographic visualization, target tracking, medical diagnosis and remote sensing monitoring.
In general, image fusion algorithms can be classified into two types: traditional methods and deep learning methods. Early image fusion methods all generate an activity level map by means of mathematical transformations and design fusion rules in the spatial domain or the transform domain. Representative traditional image fusion methods include multi-scale transform based methods, sparse representation based methods, subspace based methods, saliency based methods and total variation based methods. On one hand, these traditional fusion algorithms adopt the same feature extraction and reconstruction procedure for different source images and do not consider the feature differences between them, so the fusion effect is poor; on the other hand, their fusion rules are manually designed and too simple, which severely limits the fusion performance and restricts the algorithms to specific fusion tasks. In recent years, with the continuous development of deep learning, its application to image fusion has become increasingly widespread. At present, fusion algorithms based on deep learning mainly comprise three types: methods based on generative adversarial networks (GANs), methods based on auto-encoders (AEs), and methods based on conventional convolutional neural networks (CNNs). Moreover, with a deep learning method the loss function can be designed freely, and the network parameters are updated by back-propagation of gradients to obtain a more reasonable feature fusion strategy, realizing adaptive feature fusion. Thanks to these advantages, deep learning has driven great progress in image fusion, achieving performance far exceeding that of conventional methods.
Although deep learning has achieved satisfactory results in the image fusion field, some disadvantages remain: (1) these deep learning network architectures do not fully consider the intermediate feature layers and design the loss function only from the final fused image and the source images; (2) most fusion algorithms only use the deep learning model in the feature extraction and feature reconstruction stages, while feature fusion still relies on traditional rules such as feature map addition, maximum or mean; and (3) some deep learning models use a two-stage training method, which is time-consuming and difficult to train.
Disclosure of Invention
In order to solve the technical problem, the invention provides an infrared and visible light image fusion method based on an end-to-end attention network, so as to improve the image fusion effect.
In order to achieve the above object, the present invention provides an infrared and visible light image fusion method based on an end-to-end attention network, comprising:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
Optionally, the preprocessing of the infrared image and the visible light image comprises: converting the infrared image and the visible light image into grayscale images and performing center cropping.
Optionally, the encoder in the self-encoding network is configured to extract multi-scale deep semantic features of the preprocessed images and to output an infrared feature map and a visible light feature map; and the decoder in the self-coding network is used for reconstructing the infrared feature map and the visible light feature map into the final fused image.
Optionally, the encoder comprises a plurality of max-pooling down-sampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder is followed by a BatchNorm layer and a ReLU activation function;
the decoder comprises a plurality of up-sampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm layer and a ReLU activation function.
Optionally, adding the skip connections to the self-coding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connection path, forming the dense block with different numbers of convolution blocks in different connection paths, and setting the output channels of the convolution blocks based on a fifth preset number.
Optionally, the channel-space dual attention fusion layer comprises: a channel attention module and a spatial attention module;
in the channel-space dual attention fusion layer, the infrared feature map and the visible light feature map are concatenated in the channel dimension; the concatenated image is input to the spatial attention module and the channel attention module respectively to obtain a spatial weight map and a channel weight map; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image. On the one hand, this intermediate fusion layer produces the intermediate fused image; on the other hand, it makes the network focus on the regions that deserve more attention. The intermediate fused image is then fed into the decoder of the self-coding network to obtain the final fused image.
Optionally, constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
Optionally, the SSIM structural similarity metric function includes: a brightness function, a contrast function and a structure comparison function;
the brightness function is:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

wherein \mu_x and \mu_y respectively represent the average luminance of the two images, \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, N is the number of pixels of the picture, x_i is the pixel value, x and y represent the two different images, and C_1 = (k_1 \cdot L)^2 prevents the denominator from being 0, with k_1 = 0.01 and L = 255;
the contrast function is:

c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}

wherein \sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2} and \sigma_y are respectively the standard deviations of the two images, and C_2 = (k_2 \cdot L)^2 prevents the denominator from being 0, with k_2 = 0.03 and L = 255;
the structural comparison function is:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}

wherein \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y) represents the covariance of the two pictures, and C_3 = C_2/2 prevents the denominator from being 0;

the SSIM structure similarity measurement function is as follows:

SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)
Optionally, the gradient loss is:

L_{grad} = \left\| \nabla I_f - \nabla V \right\|_1

wherein V is the visible light source image, I_f is the final fused image, \nabla is the gradient operator, and \|\cdot\|_1 represents the L1 norm;

the L2 regularization is:

L_{int} = \left\| I_f - X \right\|_2

wherein X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and \|\cdot\|_2 represents the L2 norm;
the target feature enhancement loss function is:

L_{fea} = \sum_{m=1}^{M} w_e^m \left\| \Phi_f^m - \left( w_{vi} \Phi_{vi}^m + w_{ir} \Phi_{ir}^m \right) \right\|_F^2

where M denotes the number of scales at which fusion is performed, w_e^m is the weight at the m-th scale, \Phi_f^m is the fusion result of the m-th layer feature map, \Phi_{vi}^m and \Phi_{ir}^m are respectively the visible light feature layer and the infrared feature layer of the m-th layer, w_{vi} is the weight of the visible light feature layer, w_{ir} is the weight of the infrared feature layer, and \|\cdot\|_F is the Frobenius norm.
Optionally, the final loss function is:

L_{total} = \alpha_1 L_{SSIM} + \alpha_2 L_{grad} + \alpha_3 L_{int} + \lambda L_{fea}

wherein I is the infrared source image, V is the visible light source image, I_f is the final fused image, \|\cdot\|_1 denotes the L1 norm, and \alpha_1, \alpha_2 and \alpha_3 are respectively the weights of the loss terms.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention uses an encoding-decoding network structure: the multi-scale deep features of the input picture are fully extracted in the encoding stage and effectively reconstructed in the decoding stage. Skip connections are further introduced into the self-coding network, which effectively alleviates gradient vanishing and reuses the multi-scale feature layers, thereby enhancing the network's ability to extract and reconstruct features. Meanwhile, considering that features at different scales carry different semantic information and are not suitable for direct connection, different numbers of convolution blocks are used on different connection paths to balance this difference.
2. The invention uses a channel-space dual attention fusion structure which, unlike traditional manually designed fusion strategies, can effectively preserve the amplitude information of the infrared image and the detailed texture information of the visible light image.
3. The invention designs a new loss function that introduces an SSIM structural similarity term, L2 regularization, a gradient operator and a target feature enhancement loss, which can effectively extract the salient and detailed features of the source images. An end-to-end network structure is adopted, abandoning the manually designed intermediate fusion rules and the two-stage training strategy, so that training is faster and the fusion result is more effective.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application; the description of the exemplary embodiments of the application is intended to illustrate the application and is not intended to limit it. In the drawings:
fig. 1 is a schematic flow chart of an infrared and visible light image fusion method based on an end-to-end attention network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a spatial-channel dual attention fusion layer structure according to an embodiment of the present invention;
FIG. 4 is an infrared image of an embodiment of the present invention;
FIG. 5 is a visible light image according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than here.
Examples
As shown in fig. 1, the present embodiment provides an infrared and visible light image fusion method based on an end-to-end attention network, including:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
Further, preprocessing the infrared image and the visible light image includes: converting the infrared and visible light images into grayscale images and performing center cropping.
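As an illustration, the following is a minimal preprocessing sketch using torchvision transforms; the 128 × 128 crop size is taken from the training setup described later, and the exact pipeline is an assumption rather than the patent's own code.

from torchvision import transforms

# Convert each source image to a single-channel grayscale tensor and
# center-crop it to 128 x 128 pixels (crop size taken from the training setup below).
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
])
# Usage: tensor = preprocess(pil_image)  # pil_image is a PIL.Image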
Further, the self-encoding network comprises: an encoder and a decoder;
the encoder is used for extracting the multi-scale deep semantic features of the preprocessed image and outputting an infrared feature map and a visible light feature map; and the decoder is used for reconstructing a final fusion image according to the infrared characteristic diagram and the visible light characteristic diagram.
Furthermore, the encoder is composed of a plurality of max-pooling down-sampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder is followed by a BatchNorm layer and a ReLU activation function;
the decoder is composed of a plurality of up-sampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm layer and a ReLU activation function.
Further, adding the skip connections to the self-coding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connection path, forming the dense block with different numbers of convolution blocks in different connection paths, and setting the output channels of the convolution blocks based on a fifth preset number.
Further, the channel-space dual attention fusion layer includes: a channel attention module and a spatial attention module;
in the channel-space dual attention fusion layer, the infrared feature map and the visible light feature map are concatenated in the channel dimension; the concatenated image is input to the spatial attention module and the channel attention module respectively to obtain a spatial weight map and a channel weight map; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image. On the one hand, this intermediate fusion layer produces the intermediate fused image; on the other hand, it makes the network focus on the regions that deserve more attention. The intermediate fused image is then fed into the decoder of the self-coding network to obtain the final fused image.
Further, constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
This embodiment provides an infrared and visible light image fusion method that fuses the complementary and beneficial information of images of different modalities so as to describe the imaging scene more comprehensively. The method comprises: (1) constructing an encoder-decoder auto-encoding network to extract the deep semantic information of the input images and reconstruct the fused image; (2) adding skip connections in the self-coding network and introducing dense blocks in the skip connections to reduce the difference in richness of semantic information between the connected layers; (3) constructing a channel-space dual attention fusion layer to further retain the texture information of the visible light image and the amplitude information of the infrared image; and (4) designing a suitable loss function and selecting related data sets to train and test the performance of the fusion network. The invention overcomes the drawback of traditional fusion methods that extract features from different source images with the same procedure, reduces the limitations of manually designed fusion strategies, and finally generates a single fused image containing the feature information of multiple source images. The specific implementation steps are as follows:
step 1, constructing a self-encoder network of an encoder and a decoder, wherein the encoder is used for extracting deep features of an input image, and the decoder network is used for reconstructing the extracted deep features into a final fusion image.
The encoder network consists of 3 max-pooling down-sampling layers and 9 common convolution blocks, and the decoder network consists of 3 up-sampling layers and 7 common convolution blocks, connected layer by layer as shown in fig. 2. The first convolution block in the self-encoding network uses a 1 × 1 convolution kernel with reflection padding to prevent edge artifacts in the fused image; its number of input channels is set to 1 and its number of output channels to 16. The other common convolution blocks all use 3 × 3 convolution kernels with a stride of 1 and zero padding, so the image resolution is not changed. The numbers of input channels of the encoder blocks are respectively set to 16, 64, 64, 128, 128, 256, 256, 256, 256, and the numbers of output channels to 64, 64, 128, 128, 256, 256, 256. Three max-pooling down-sampling layers with a stride of 2 are set in the encoder stage. The decoder stage up-samples the feature maps by a factor of two using bilinear interpolation; the numbers of input channels of the convolution blocks in the decoding stage are respectively 512, 256, 256, 128, 128, 64, 64, and the numbers of output channels are respectively 256, 128, 128, 64, 64. Each convolution block is followed by a BatchNorm layer and a ReLU activation function.
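For illustration, a minimal PyTorch sketch of the convolution block and the encoder stages described above follows. The kernel sizes, BatchNorm/ReLU ordering, reflection padding of the first block and the max-pooling follow the text; the grouping of blocks into stages and the class names (ConvBlock, Encoder) are assumptions, not the patent's exact layout.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Convolution followed by BatchNorm and ReLU; resolution-preserving by default.
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, padding_mode='zeros'):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride=1,
                      padding=padding, padding_mode=padding_mode),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    # Three max-pooling stages; the multi-scale feature maps are kept for the skip connections.
    def __init__(self):
        super().__init__()
        # First block: 1 x 1 kernel with reflection padding, 1 -> 16 channels.
        self.stem = ConvBlock(1, 16, kernel_size=1, padding=0, padding_mode='reflect')
        self.stage1 = nn.Sequential(ConvBlock(16, 64), ConvBlock(64, 64))
        self.stage2 = nn.Sequential(ConvBlock(64, 128), ConvBlock(128, 128))
        self.stage3 = nn.Sequential(ConvBlock(128, 256), ConvBlock(256, 256))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        f1 = self.stage1(self.stem(x))      # full resolution, 64 channels
        f2 = self.stage2(self.pool(f1))     # 1/2 resolution, 128 channels
        f3 = self.stage3(self.pool(f2))     # 1/4 resolution, 256 channels
        f4 = self.pool(f3)                  # 1/8 resolution bottleneck features
        return f1, f2, f3, f4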
Step 2: add skip connections to the self-coding network to alleviate the gradient vanishing problem of the deep neural network and to compensate for the information loss caused by the down-sampling and up-sampling processes. Meanwhile, considering the difference in semantic information between the connected layers, direct connection is not suitable, so different numbers of convolution blocks are used on different connection paths to realize the skip connections.
As shown in fig. 2, the input of the first max-pooling layer and the output of the third up-sampling layer are connected in the channel dimension through a skip connection that uses 4 convolution blocks; the input channels of these convolution blocks are set to 64, 64, 128 and 192, and the output channels are all set to 64. Specifically, denoting the four convolution blocks as A1, A2, A3 and A4, the output of A1 is used as the input of A2, the channel concatenation of A1 and A2 is the input of A3, and the channel concatenation of A1, A2 and A3 is the input of A4 (see the sketch after this paragraph). Each convolution block uses a 3 × 3 convolution kernel with zero padding set to 1, so the image resolution is not changed, and is followed by BatchNorm and a ReLU activation function. The input of the second max-pooling layer is connected to the output of the second up-sampling layer in the same way, but to balance the difference in semantic information between deep and shallow layers, three convolution blocks are used; the input of the third max-pooling layer is connected to the output of the first up-sampling layer with 2 convolution blocks in the same way.
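A sketch of one such skip-connection path realized as a dense block of four convolution blocks (A1-A4), reusing the ConvBlock defined above; the channel numbers follow the text and the class name is illustrative.

class DenseSkipBlock(nn.Module):
    # Dense block on the first skip-connection path: input channels 64, 64, 128, 192;
    # all output channels 64, following the A1-A4 connection pattern described above.
    def __init__(self):
        super().__init__()
        self.a1 = ConvBlock(64, 64)
        self.a2 = ConvBlock(64, 64)
        self.a3 = ConvBlock(128, 64)
        self.a4 = ConvBlock(192, 64)

    def forward(self, x):
        y1 = self.a1(x)                                   # A1
        y2 = self.a2(y1)                                  # A2 takes the output of A1
        y3 = self.a3(torch.cat([y1, y2], dim=1))          # A3 takes A1 and A2 concatenated
        y4 = self.a4(torch.cat([y1, y2, y3], dim=1))      # A4 takes A1, A2 and A3 concatenated
        return y4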
Step 3: construct a channel-space dual attention fusion layer (as shown in fig. 3) to extract the amplitude information of the infrared image and the detailed texture information of the visible light image respectively. The feature information of the two source images extracted by the encoder is concatenated in the channel dimension and fed into the channel attention layer and the spatial attention layer respectively to obtain the corresponding feature maps, which are then added to obtain the intermediate fused feature layer.
Calculating the channel attention weights: the concatenated image (S1) is fed into a global average pooling layer and then successively through two fully connected layers; the first fully connected layer reduces the number of channels to 1/4 of the number of S1 channels and is followed by an H-swish activation function, and the second fully connected layer restores the number of channels to that of S1 and is followed by a sigmoid activation function, yielding weights in the range 0-1. Finally, the obtained weights are multiplied with the visible light feature layer to further retain the detailed information of the visible light image. Calculating the spatial attention weights: the concatenated image is fed into an average pooling layer and a max pooling layer respectively, which perform average and max sampling along the channel dimension without changing the image resolution; the two output feature layers are concatenated in the channel dimension and fed into a 7 × 7 convolution layer with zero padding set to 3 (the image resolution is not changed), followed by a sigmoid activation function, yielding a weight distribution map in the range 0-1, which is multiplied with the infrared feature layer to further retain the amplitude information of the infrared source image. Finally, the feature maps obtained from the two attention branches are added to obtain the intermediate fused image.
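A hedged PyTorch sketch of this channel-space dual attention fusion layer follows. The data flow mirrors the description above (channel attention re-weights the visible feature layer, spatial attention re-weights the infrared feature layer, and the two results are added); the module name and the way the 2C-channel weight vector is matched to the C-channel visible feature map are assumptions, since the text leaves that step implicit.

class DualAttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c_cat = 2 * channels  # channels of the concatenated feature map S1
        # Channel attention: global average pooling + two fully connected layers
        # (realized as 1 x 1 convolutions), H-swish after the first and sigmoid after the second.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_cat, c_cat // 4, kernel_size=1),
            nn.Hardswish(),
            nn.Conv2d(c_cat // 4, c_cat, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: channel-wise max/average maps + 7 x 7 convolution + sigmoid.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_vi, f_ir):
        s1 = torch.cat([f_vi, f_ir], dim=1)               # concatenated feature map S1
        cw = self.channel_att(s1)                          # (B, 2C, 1, 1) channel weights
        cw = cw[:, : f_vi.shape[1]]                        # assumption: first C weights re-weight f_vi
        f_vi_att = f_vi * cw                               # keep visible-light detail information
        max_map, _ = torch.max(s1, dim=1, keepdim=True)
        avg_map = torch.mean(s1, dim=1, keepdim=True)
        sw = self.spatial_att(torch.cat([max_map, avg_map], dim=1))  # (B, 1, H, W) spatial weights
        f_ir_att = f_ir * sw                               # keep infrared amplitude information
        return f_vi_att + f_ir_att                         # intermediate fused feature map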
Step 4: design the loss function. An SSIM structural similarity measurement term is added; in order to further retain the detailed texture information of the visible light image, a gradient operator is introduced; L2 regularization is introduced; and finally a target feature enhancement loss function is designed.
The SSIM structural similarity function reflects the human visual judgment of the similarity of two images. It consists of three parts: the luminance, contrast and structure comparison functions of the images. The luminance similarity is:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

where \mu_x and \mu_y respectively represent the average luminance of the two images, \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, N is the number of pixels of the picture, x_i is the pixel value, x and y represent the two different images, and C_1 = (k_1 \cdot L)^2 prevents the denominator from being 0, with k_1 = 0.01 and L = 255. Next is the picture contrast, which indicates the intensity of the image brightness variation; the contrast similarity function is set as:
c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}

where \sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2} and \sigma_y are respectively the standard deviations of the two images, and C_2 = (k_2 \cdot L)^2 prevents the denominator from being 0, with k_2 = 0.03 and L = 255;
the structure comparison function is:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}

where the covariance is \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y), finally giving

SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)

where C_1, C_2 and C_3 all prevent the denominator from being 0 and C_3 = C_2/2. The value range of SSIM is -1 to 1, and a larger SSIM value indicates higher picture similarity, so the final SSIM measurement loss is L_{SSIM} = 1 - SSIM.
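A minimal sketch of this SSIM loss computed from the global statistics defined above (practical SSIM implementations usually use a sliding window; the whole-image version below simply mirrors the formulas in the text and assumes PyTorch tensors as inputs).

def ssim_loss(x, y, k1=0.01, k2=0.03, L=255.0):
    # L_SSIM = 1 - SSIM, with SSIM built from the luminance, contrast and structure terms.
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x = x.std(unbiased=True)
    sigma_y = y.std(unbiased=True)
    sigma_xy = ((x - mu_x) * (y - mu_y)).sum() / (x.numel() - 1)
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    c = (2 * sigma_x * sigma_y + c2) / (sigma_x ** 2 + sigma_y ** 2 + c2)
    s = (sigma_xy + c3) / (sigma_x * sigma_y + c3)
    return 1.0 - l * c * s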
The above-mentioned gradient loss is

L_{grad} = \left\| \nabla I_f - \nabla V \right\|_1

where V represents the visible light source image, I_f represents the final fused image, \nabla is the gradient operator, and \|\cdot\|_1 represents the L1 norm. Because the visible image has abundant texture information, the reconstruction of the visible image is regularized through this gradient penalty to ensure texture consistency.
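For illustration, a sketch of this gradient penalty using simple finite differences; the patent does not specify the concrete gradient operator, so the operator and the averaging over pixels are assumptions.

def gradient(img):
    # Horizontal and vertical finite-difference gradients of a (B, 1, H, W) tensor.
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def gradient_loss(fused, visible):
    # L1 distance between the gradients of the fused image and the visible image.
    f_dx, f_dy = gradient(fused)
    v_dx, v_dy = gradient(visible)
    return (f_dx - v_dx).abs().mean() + (f_dy - v_dy).abs().mean()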
The L2 regularization is set to

L_{int} = \left\| I_f - X \right\|_2

and mainly measures the intensity consistency between the source images and the fused image, where X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and \|\cdot\|_2 represents the L2 norm.
The target feature enhancement loss function is set to

L_{fea} = \sum_{m=1}^{M} w_e^m \left\| \Phi_f^m - \left( w_{vi} \Phi_{vi}^m + w_{ir} \Phi_{ir}^m \right) \right\|_F^2

where w_{vi} is the weight of the visible light feature layer, w_{ir} is the weight of the infrared feature layer, and \|\cdot\|_F is the Frobenius norm. Since infrared images have more prominent target features than visible images, this loss function is designed to constrain the deep features of the fused image so that the salient features are preserved. M is set to 4, representing the fusion processes at different scales, and w_e^m represents the weight at each scale; owing to the difference in amplitude at different scales, w_e is set to [1, 10, 100, 1000]. \Phi_f^m represents the fusion result of the m-th layer feature map, and \Phi_{vi}^m and \Phi_{ir}^m represent the visible light and infrared feature layers of the m-th layer respectively. Since this loss mainly preserves the salient features of the infrared image, w_{vi} is set smaller than w_{ir}, to 3 and 6 respectively.
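A sketch of this target feature enhancement loss over M = 4 scales; phi_f, phi_vi and phi_ir are assumed to be lists of per-scale feature maps produced by the network, and the squared element sum below realizes the squared Frobenius norm.

def feature_enhancement_loss(phi_f, phi_vi, phi_ir,
                             w_e=(1.0, 10.0, 100.0, 1000.0), w_vi=3.0, w_ir=6.0):
    loss = 0.0
    for m, w in enumerate(w_e):
        target = w_vi * phi_vi[m] + w_ir * phi_ir[m]         # weighted source feature layers
        loss = loss + w * (phi_f[m] - target).pow(2).sum()   # squared Frobenius norm at scale m
    return loss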
The above losses are combined by weighting. The final loss function is

L_{total} = \alpha_1 L_{SSIM} + \alpha_2 L_{grad} + \alpha_3 L_{int} + \lambda L_{fea}

where \alpha_1, \alpha_2 and \alpha_3 are respectively the weight ratios of the loss terms; \alpha_1, \alpha_2 and \alpha_3 are set to 2, 2 and 10, \lambda is set to 5, w_e is set to [1, 10, 100, 1000], and w_{vi} and w_{ir} are set to 3 and 6 respectively.
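Putting the terms together, a hedged sketch of the final weighted loss with the values given above; how the SSIM and intensity terms are split between the infrared and visible source images is an assumption, since the text only lists the weights.

def total_loss(fused, ir, vi, phi_f, phi_vi, phi_ir,
               a1=2.0, a2=2.0, a3=10.0, lam=5.0):
    # Structural similarity to both sources (assumed equal split).
    l_ssim = 0.5 * (ssim_loss(fused, ir) + ssim_loss(fused, vi))
    # Texture consistency with the visible image.
    l_grad = gradient_loss(fused, vi)
    # Intensity consistency with both sources (assumed equal split).
    l_int = 0.5 * ((fused - ir).pow(2).mean() + (fused - vi).pow(2).mean())
    # Target feature enhancement over the multi-scale features.
    l_fea = feature_enhancement_loss(phi_f, phi_vi, phi_ir)
    return a1 * l_ssim + a2 * l_grad + a3 * l_int + lam * l_fea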
Step 5: select the data sets. Experiments were performed on three data sets: TNO, NIR and FLIR.
180 pairs of infrared and visible light images are randomly selected from the FLIR data set as training samples, such as shown in fig. 4 and fig. 5. Prior to training, all images are converted to grayscale and center-cropped to 128 × 128 pixels. The pairs of infrared and visible light images are then fed into the network for training; the loss is calculated according to the loss function, and the network parameters are updated by back-propagation of gradients. The number of training epochs is set to 120, the Adam optimizer is adopted, the learning rate is set to 10^{-3}, and a MultiStepLR learning rate schedule is adopted that reduces the learning rate by a factor of 10 every 40 epochs. After training is completed, the remaining FLIR data, the TNO data set (40 pairs) and the NIR data set are used to verify the fusion effect of the model.
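A minimal training-loop sketch with the hyper-parameters listed above (Adam, learning rate 10^-3, 120 epochs, MultiStepLR dropping the rate by a factor of 10 every 40 epochs); FusionNet and loader are placeholders for the attention fusion network and the FLIR data loader, and the network is assumed to also return the per-scale features needed by the feature enhancement loss.

import torch

model = FusionNet().cuda()                    # placeholder for the attention fusion network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 80], gamma=0.1)

for epoch in range(120):
    for ir, vi in loader:                     # batches of 128 x 128 grayscale image pairs
        ir, vi = ir.cuda(), vi.cuda()
        fused, phi_f, phi_vi, phi_ir = model(ir, vi)   # assumed to return per-scale features too
        loss = total_loss(fused, ir, vi, phi_f, phi_vi, phi_ir)
        optimizer.zero_grad()
        loss.backward()                       # back-propagate gradients
        optimizer.step()
    scheduler.step()                          # reduce the learning rate at epochs 40 and 80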
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An infrared and visible light image fusion method based on an end-to-end attention network is characterized by comprising the following steps:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
2. The end-to-end attention network-based infrared and visible image fusion method according to claim 1, wherein the preprocessing of the infrared image and the visible image comprises: converting the infrared image and the visible light image into grayscale images and performing center cropping.
3. The end-to-end attention network based infrared and visible image fusion method of claim 1,
the encoder in the self-encoding network is used for extracting the multi-scale deep semantic features of the preprocessed images and outputting an infrared feature map and a visible light feature map; and the decoder in the self-coding network is used for reconstructing the infrared feature map and the visible light feature map into the final fused image.
4. The end-to-end attention network-based infrared and visible light image fusion method according to claim 3,
the encoder comprises a plurality of max-pooling down-sampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder is followed by a BatchNorm layer and a ReLU activation function;
the decoder comprises a plurality of up-sampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm layer and a ReLU activation function.
5. The end-to-end attention network-based infrared and visible light image fusion method of claim 4, wherein adding the skip connections to the self-coding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connection path, forming the dense block with different numbers of convolution blocks in different connection paths, and setting the output channels of the convolution blocks based on a fifth preset number.
6. The end-to-end attention network based infrared and visible image fusion method of claim 1, wherein the channel-space dual attention fusion layer comprises: a channel attention module and a spatial attention module;
in the channel-space dual attention fusion layer, the infrared feature map and the visible light feature map are concatenated in the channel dimension; the concatenated image is input to the spatial attention module and the channel attention module respectively to obtain a spatial weight map and a channel weight map; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image.
7. The end-to-end attention network-based infrared and visible light image fusion method according to claim 1, wherein constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
8. The end-to-end attention network-based infrared and visible light image fusion method of claim 7, wherein the SSIM structural similarity metric function comprises: a brightness function, a contrast function and a structure comparison function;
the brightness function is:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

wherein \mu_x and \mu_y respectively represent the average luminance of the two images, \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, N is the number of pixels of the picture, x_i is the pixel value, x and y represent the two different images, and C_1 = (k_1 \cdot L)^2 prevents the denominator from being 0, with k_1 = 0.01 and L = 255;

the contrast function is:

c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}

wherein \sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2} and \sigma_y are respectively the standard deviations of the two images, and C_2 = (k_2 \cdot L)^2 prevents the denominator from being 0, with k_2 = 0.03 and L = 255;

the structure comparison function is:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}

wherein \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y) represents the covariance of the two pictures, and C_3 = C_2/2 prevents the denominator from being 0;

the SSIM structure similarity measurement function is:

SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)
9. The end-to-end attention network-based infrared and visible light image fusion method according to claim 8, wherein the gradient loss is:

L_{grad} = \left\| \nabla I_f - \nabla V \right\|_1

wherein V is the visible light source image, I_f is the final fused image, \nabla is the gradient operator, and \|\cdot\|_1 represents the L1 norm;

the L2 regularization is:

L_{int} = \left\| I_f - X \right\|_2

wherein X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and \|\cdot\|_2 represents the L2 norm;

the target feature enhancement loss function is:

L_{fea} = \sum_{m=1}^{M} w_e^m \left\| \Phi_f^m - \left( w_{vi} \Phi_{vi}^m + w_{ir} \Phi_{ir}^m \right) \right\|_F^2

where M denotes the number of scales at which fusion is performed, w_e^m is the weight at the m-th scale, \Phi_f^m is the fusion result of the m-th layer feature map, \Phi_{vi}^m and \Phi_{ir}^m are respectively the visible light feature layer and the infrared feature layer of the m-th layer, w_{vi} is the weight of the visible light feature layer, w_{ir} is the weight of the infrared feature layer, and \|\cdot\|_F is the Frobenius norm.
10. The end-to-end attention network-based infrared and visible light image fusion method of claim 9, wherein the loss function is:
L_{total} = \alpha_1 L_{SSIM} + \alpha_2 L_{grad} + \alpha_3 L_{int} + \lambda L_{fea}

wherein I is the infrared source image, V is the visible light source image, I_f is the final fused image, \|\cdot\|_1 denotes the L1 norm, and \alpha_1, \alpha_2 and \alpha_3 are respectively the weights of the loss functions.
CN202210954041.9A 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network Pending CN115170915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210954041.9A CN115170915A (en) 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210954041.9A CN115170915A (en) 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network

Publications (1)

Publication Number Publication Date
CN115170915A true CN115170915A (en) 2022-10-11

Family

ID=83479073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210954041.9A Pending CN115170915A (en) 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network

Country Status (1)

Country Link
CN (1) CN115170915A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631428A (en) * 2022-11-01 2023-01-20 西南交通大学 Unsupervised image fusion method and system based on structural texture decomposition
CN115631428B (en) * 2022-11-01 2023-08-11 西南交通大学 Unsupervised image fusion method and system based on structural texture decomposition
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN117036893A (en) * 2023-10-08 2023-11-10 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117036893B (en) * 2023-10-08 2023-12-15 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117115065A (en) * 2023-10-25 2023-11-24 宁波纬诚科技股份有限公司 Fusion method of visible light and infrared image based on focusing loss function constraint
CN117115065B (en) * 2023-10-25 2024-01-23 宁波纬诚科技股份有限公司 Fusion method of visible light and infrared image based on focusing loss function constraint

Similar Documents

Publication Publication Date Title
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN109447907B (en) Single image enhancement method based on full convolution neural network
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
CN110717857A (en) Super-resolution image reconstruction method and device
CN110599401A (en) Remote sensing image super-resolution reconstruction method, processing device and readable storage medium
CN111986084A (en) Multi-camera low-illumination image quality enhancement method based on multi-task fusion
CN112819910A (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN112465718A (en) Two-stage image restoration method based on generation of countermeasure network
CN112614061A (en) Low-illumination image brightness enhancement and super-resolution method based on double-channel coder-decoder
CN114782298B (en) Infrared and visible light image fusion method with regional attention
CN114219719A (en) CNN medical CT image denoising method based on dual attention and multi-scale features
CN110599585A (en) Single-image human body three-dimensional reconstruction method and device based on deep learning
CN113724134A (en) Aerial image blind super-resolution reconstruction method based on residual distillation network
Liu et al. Learning noise-decoupled affine models for extreme low-light image enhancement
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN116934592A (en) Image stitching method, system, equipment and medium based on deep learning
CN115641391A (en) Infrared image colorizing method based on dense residual error and double-flow attention
CN115272072A (en) Underwater image super-resolution method based on multi-feature image fusion
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
CN117408924A (en) Low-light image enhancement method based on multiple semantic feature fusion network
Kumar et al. Underwater image enhancement using deep learning
CN114862699A (en) Face repairing method, device and storage medium based on generation countermeasure network
CN113870162A (en) Low-light image enhancement method integrating illumination and reflection
CN112150566A (en) Dense residual error network image compressed sensing reconstruction method based on feature fusion
CN117710216B (en) Image super-resolution reconstruction method based on variation self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination