CN115170915A - Infrared and visible light image fusion method based on end-to-end attention network

Info

Publication number
CN115170915A
Authority
CN
China
Prior art keywords
image
infrared
visible light
fusion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210954041.9A
Other languages
Chinese (zh)
Inventor
江旻珊
朱永飞
常敏
张学典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210954041.9A priority Critical patent/CN115170915A/en
Publication of CN115170915A publication Critical patent/CN115170915A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an infrared and visible light image fusion method based on an end-to-end attention network, which comprises: preprocessing the infrared and visible light images; constructing an end-to-end attention network, wherein the attention network comprises a self-coding network and a channel-space dual attention fusion layer, and the self-coding network comprises an encoder and a decoder connected by skip connections; and fusing the preprocessed infrared and visible light images based on the attention network. The invention overcomes the drawback of traditional fusion methods that extract features from different source images with the same hand-crafted procedure, reduces the limitations of manually designed fusion strategies, and finally generates a single fused image containing the feature information of multiple source images.

Description

Infrared and visible light image fusion method based on end-to-end attention network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an infrared and visible light image fusion method based on an end-to-end attention network.
Background
Image fusion is an image processing technology for enhancing image information. Owing to theoretical and hardware limitations, a single sensor cannot effectively and comprehensively describe scene information under a specific shooting setting; for example, a visible light image contains more detailed texture information, whereas an infrared image contains more amplitude (intensity) information. Image fusion therefore combines the complementary information between different source images of the same scene to generate a single image with richer information, which can be applied in many fields such as photographic visualization, target tracking, medical diagnosis and remote sensing monitoring.
In general, image fusion algorithms can be classified into two types: traditional methods and deep learning methods. Early image fusion methods all generate an activity level map by means of mathematical transformations and design fusion rules in the spatial domain or the transform domain. Representative traditional image fusion methods include multi-scale transform based methods, sparse representation based methods, subspace based methods, saliency based methods and total variation based methods. On one hand, these traditional fusion algorithms adopt the same feature extraction and reconstruction procedure for different source images and do not consider the feature differences between them, so the fusion effect is poor; on the other hand, their fusion rules are manually designed and too simple, which severely limits the fusion performance and restricts the algorithms to specific fusion tasks. In recent years, with the continuous development of deep learning, its application to image fusion has become increasingly widespread. At present, fusion algorithms based on deep learning mainly comprise three types: methods based on generative adversarial networks (GANs), methods based on auto-encoders (AEs), and methods based on conventional convolutional neural networks (CNNs). Moreover, with a deep learning method the loss function can be designed freely, and the network parameters are updated by back-propagation of gradients to obtain a more reasonable feature fusion strategy, realizing adaptive feature fusion. Thanks to these advantages, deep learning has driven great progress in image fusion, achieving performance far exceeding that of conventional methods.
Although deep learning has achieved satisfactory results in the image fusion field, some disadvantages remain: (1) these deep learning network architectures do not fully consider the intermediate feature layers and design the loss function only from the final fused image and the source images; (2) most fusion algorithms only use the deep learning model in the feature extraction and feature reconstruction stages, while feature fusion still relies on traditional rules such as feature map addition, maximum or mean; and (3) some deep learning models use a two-stage training method, which is time-consuming and difficult to train.
Disclosure of Invention
In order to solve the technical problem, the invention provides an infrared and visible light image fusion method based on an end-to-end attention network, so as to improve the image fusion effect.
In order to achieve the above object, the present invention provides an infrared and visible light image fusion method based on an end-to-end attention network, comprising:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
Optionally, the preprocessing of the infrared image and the visible light image comprises: converting the infrared image and the visible light image into grayscale images and performing center cropping.
Optionally, the encoder in the self-encoding network is configured to extract multi-scale deep semantic features of the preprocessed images and to output an infrared feature map and a visible light feature map; and the decoder in the self-coding network is used for reconstructing the infrared feature map and the visible light feature map into the final fused image.
Optionally, the encoder comprises a plurality of max-pooling down-sampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder is followed by a BatchNorm layer and a ReLU activation function;
the decoder comprises a plurality of up-sampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm layer and a ReLU activation function.
Optionally, adding the skip connections to the self-coding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connection path, forming the dense block with different numbers of convolution blocks in different connection paths, and setting the output channels of the convolution blocks based on a fifth preset number.
Optionally, the channel-space dual attention fusion layer comprises: a channel attention module and a spatial attention module;
in the channel-space dual attention fusion layer, the infrared feature map and the visible light feature map are concatenated in the channel dimension; the concatenated image is input to the spatial attention module and the channel attention module respectively to obtain a spatial weight map and a channel weight map; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image. On the one hand, this intermediate fusion layer produces the intermediate fused image; on the other hand, it makes the network focus on the regions that deserve more attention. The intermediate fused image is then fed into the decoder of the self-coding network to obtain the final fused image.
Optionally, constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
Optionally, the SSIM structural similarity metric function includes: a brightness function, a contrast function and a structure comparison function;
the brightness function is:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

wherein \mu_x and \mu_y respectively represent the average luminance of the two images, \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, N is the number of pixels of the picture, x_i is the pixel value, x and y represent the two different images, and C_1 = (k_1 \cdot L)^2 prevents the denominator from being 0, with k_1 = 0.01 and L = 255;
the contrast function is:

c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}

wherein \sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2} and \sigma_y are respectively the standard deviations of the two images, and C_2 = (k_2 \cdot L)^2 prevents the denominator from being 0, with k_2 = 0.03 and L = 255;
the structural comparison function is:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}

wherein \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y) represents the covariance of the two pictures, and C_3 = C_2/2 prevents the denominator from being 0;

the SSIM structure similarity measurement function is as follows:

SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)
Optionally, the gradient loss is:

L_{grad} = \left\| \nabla I_f - \nabla V \right\|_1

wherein V is the visible light source image, I_f is the final fused image, \nabla is the gradient operator, and \|\cdot\|_1 represents the L1 norm;

the L2 regularization is:

L_{int} = \left\| I_f - X \right\|_2

wherein X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and \|\cdot\|_2 represents the L2 norm;
the target feature enhancement loss function is:

L_{fea} = \sum_{m=1}^{M} w_e^m \left\| \Phi_f^m - \left( w_{vi} \Phi_{vi}^m + w_{ir} \Phi_{ir}^m \right) \right\|_F^2

where M denotes the number of scales at which fusion is performed, w_e^m is the weight at the m-th scale, \Phi_f^m is the fusion result of the m-th layer feature map, \Phi_{vi}^m and \Phi_{ir}^m are respectively the visible light feature layer and the infrared feature layer of the m-th layer, w_{vi} is the weight of the visible light feature layer, w_{ir} is the weight of the infrared feature layer, and \|\cdot\|_F is the Frobenius norm.
Optionally, the final loss function is:

L_{total} = \alpha_1 L_{SSIM} + \alpha_2 L_{grad} + \alpha_3 L_{int} + \lambda L_{fea}

wherein I is the infrared source image, V is the visible light source image, I_f is the final fused image, \|\cdot\|_1 denotes the L1 norm, and \alpha_1, \alpha_2 and \alpha_3 are respectively the weights of the loss terms.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention uses an encoding-decoding network structure: the multi-scale deep features of the input picture are fully extracted in the encoding stage and effectively reconstructed in the decoding stage. Skip connections are further introduced into the self-coding network, which effectively alleviates gradient vanishing and reuses the multi-scale feature layers, thereby enhancing the network's ability to extract and reconstruct features. Meanwhile, considering that features at different scales carry different semantic information and are not suitable for direct connection, different numbers of convolution blocks are used on different connection paths to balance this difference.
2. The invention uses a channel-space dual attention fusion structure which, unlike traditional manually designed fusion strategies, can effectively preserve the amplitude information of the infrared image and the detailed texture information of the visible light image.
3. The invention designs a new loss function that introduces an SSIM structural similarity term, L2 regularization, a gradient operator and a target feature enhancement loss, which can effectively extract the salient and detailed features of the source images. An end-to-end network structure is adopted, abandoning the manually designed intermediate fusion rules and the two-stage training strategy, so that training is faster and the fusion result is more effective.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application; the description of the exemplary embodiments of the application is intended to illustrate the application and is not intended to limit it. In the drawings:
fig. 1 is a schematic flow chart of an infrared and visible light image fusion method based on an end-to-end attention network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a spatial-channel dual attention fusion layer structure according to an embodiment of the present invention;
FIG. 4 is an infrared image of an embodiment of the present invention;
FIG. 5 is a visible light image according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than here.
Examples
As shown in fig. 1, the present embodiment provides an infrared and visible light image fusion method based on an end-to-end attention network, including:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
Further, preprocessing the infrared image and the visible light image includes: converting the infrared and visible light images into grayscale images and performing center cropping.
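As an illustration, the following is a minimal preprocessing sketch using torchvision transforms; the 128 × 128 crop size is taken from the training setup described later, and the exact pipeline is an assumption rather than the patent's own code.

from torchvision import transforms

# Convert each source image to a single-channel grayscale tensor and
# center-crop it to 128 x 128 pixels (crop size taken from the training setup below).
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
])
# Usage: tensor = preprocess(pil_image)  # pil_image is a PIL.Image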
Further, the self-encoding network comprises: an encoder and a decoder;
the encoder is used for extracting the multi-scale deep semantic features of the preprocessed image and outputting an infrared feature map and a visible light feature map; and the decoder is used for reconstructing a final fusion image according to the infrared characteristic diagram and the visible light characteristic diagram.
Furthermore, the encoder is composed of a plurality of max-pooling down-sampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder is followed by a BatchNorm layer and a ReLU activation function;
the decoder is composed of a plurality of up-sampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm layer and a ReLU activation function.
Further, adding the skip connections to the self-coding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connection path, forming the dense block with different numbers of convolution blocks in different connection paths, and setting the output channels of the convolution blocks based on a fifth preset number.
Further, the channel-space dual attention fusion layer includes: a channel attention module and a spatial attention module;
in the channel-space dual attention fusion layer, the infrared feature map and the visible light feature map are concatenated in the channel dimension; the concatenated image is input to the spatial attention module and the channel attention module respectively to obtain a spatial weight map and a channel weight map; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image. On the one hand, this intermediate fusion layer produces the intermediate fused image; on the other hand, it makes the network focus on the regions that deserve more attention. The intermediate fused image is then fed into the decoder of the self-coding network to obtain the final fused image.
Further, constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
This embodiment provides an infrared and visible light image fusion method that fuses the complementary and beneficial information of images of different modalities so as to describe the imaging scene more comprehensively. The method comprises: (1) constructing an encoder-decoder auto-encoding network to extract the deep semantic information of the input images and reconstruct the fused image; (2) adding skip connections in the self-coding network and introducing dense blocks in the skip connections to reduce the difference in richness of semantic information between the connected layers; (3) constructing a channel-space dual attention fusion layer to further retain the texture information of the visible light image and the amplitude information of the infrared image; and (4) designing a suitable loss function and selecting related data sets to train and test the performance of the fusion network. The invention overcomes the drawback of traditional fusion methods that extract features from different source images with the same procedure, reduces the limitations of manually designed fusion strategies, and finally generates a single fused image containing the feature information of multiple source images. The specific implementation steps are as follows:
step 1, constructing a self-encoder network of an encoder and a decoder, wherein the encoder is used for extracting deep features of an input image, and the decoder network is used for reconstructing the extracted deep features into a final fusion image.
The encoder network consists of 3 max-pooling down-sampling layers and 9 common convolution blocks, and the decoder network consists of 3 up-sampling layers and 7 common convolution blocks, connected layer by layer as shown in fig. 2. The first convolution block in the self-encoding network uses a 1 × 1 convolution kernel with reflection padding to prevent edge artifacts in the fused image; its number of input channels is set to 1 and its number of output channels to 16. The other common convolution blocks all use 3 × 3 convolution kernels with a stride of 1 and zero padding, so the image resolution is not changed. The numbers of input channels of the encoder blocks are respectively set to 16, 64, 64, 128, 128, 256, 256, 256, 256, and the numbers of output channels to 64, 64, 128, 128, 256, 256, 256. Three max-pooling down-sampling layers with a stride of 2 are set in the encoder stage. The decoder stage up-samples the feature maps by a factor of two using bilinear interpolation; the numbers of input channels of the convolution blocks in the decoding stage are respectively 512, 256, 256, 128, 128, 64, 64, and the numbers of output channels are respectively 256, 128, 128, 64, 64. Each convolution block is followed by a BatchNorm layer and a ReLU activation function.
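For illustration, a minimal PyTorch sketch of the convolution block and the encoder stages described above follows. The kernel sizes, BatchNorm/ReLU ordering, reflection padding of the first block and the max-pooling follow the text; the grouping of blocks into stages and the class names (ConvBlock, Encoder) are assumptions, not the patent's exact layout.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Convolution followed by BatchNorm and ReLU; resolution-preserving by default.
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, padding_mode='zeros'):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride=1,
                      padding=padding, padding_mode=padding_mode),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    # Three max-pooling stages; the multi-scale feature maps are kept for the skip connections.
    def __init__(self):
        super().__init__()
        # First block: 1 x 1 kernel with reflection padding, 1 -> 16 channels.
        self.stem = ConvBlock(1, 16, kernel_size=1, padding=0, padding_mode='reflect')
        self.stage1 = nn.Sequential(ConvBlock(16, 64), ConvBlock(64, 64))
        self.stage2 = nn.Sequential(ConvBlock(64, 128), ConvBlock(128, 128))
        self.stage3 = nn.Sequential(ConvBlock(128, 256), ConvBlock(256, 256))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        f1 = self.stage1(self.stem(x))      # full resolution, 64 channels
        f2 = self.stage2(self.pool(f1))     # 1/2 resolution, 128 channels
        f3 = self.stage3(self.pool(f2))     # 1/4 resolution, 256 channels
        f4 = self.pool(f3)                  # 1/8 resolution bottleneck features
        return f1, f2, f3, f4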
Step 2: add skip connections to the self-coding network to alleviate the gradient vanishing problem of the deep neural network and to compensate for the information loss caused by the down-sampling and up-sampling processes. Meanwhile, considering the difference in semantic information between the connected layers, direct connection is not suitable, so different numbers of convolution blocks are used on different connection paths to realize the skip connections.
As shown in fig. 2, the input of the first max-pooling layer and the output of the third up-sampling layer are connected in the channel dimension through a skip connection that uses 4 convolution blocks; the input channels of these convolution blocks are set to 64, 64, 128 and 192, and the output channels are all set to 64. Specifically, denoting the four convolution blocks as A1, A2, A3 and A4, the output of A1 is used as the input of A2, the channel concatenation of A1 and A2 is the input of A3, and the channel concatenation of A1, A2 and A3 is the input of A4 (see the sketch after this paragraph). Each convolution block uses a 3 × 3 convolution kernel with zero padding set to 1, so the image resolution is not changed, and is followed by BatchNorm and a ReLU activation function. The input of the second max-pooling layer is connected to the output of the second up-sampling layer in the same way, but to balance the difference in semantic information between deep and shallow layers, three convolution blocks are used; the input of the third max-pooling layer is connected to the output of the first up-sampling layer with 2 convolution blocks in the same way.
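A sketch of one such skip-connection path realized as a dense block of four convolution blocks (A1-A4), reusing the ConvBlock defined above; the channel numbers follow the text and the class name is illustrative.

class DenseSkipBlock(nn.Module):
    # Dense block on the first skip-connection path: input channels 64, 64, 128, 192;
    # all output channels 64, following the A1-A4 connection pattern described above.
    def __init__(self):
        super().__init__()
        self.a1 = ConvBlock(64, 64)
        self.a2 = ConvBlock(64, 64)
        self.a3 = ConvBlock(128, 64)
        self.a4 = ConvBlock(192, 64)

    def forward(self, x):
        y1 = self.a1(x)                                   # A1
        y2 = self.a2(y1)                                  # A2 takes the output of A1
        y3 = self.a3(torch.cat([y1, y2], dim=1))          # A3 takes A1 and A2 concatenated
        y4 = self.a4(torch.cat([y1, y2, y3], dim=1))      # A4 takes A1, A2 and A3 concatenated
        return y4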
Step 3: construct a channel-space dual attention fusion layer (as shown in fig. 3) to extract the amplitude information of the infrared image and the detailed texture information of the visible light image respectively. The feature information of the two source images extracted by the encoder is concatenated in the channel dimension and fed into the channel attention layer and the spatial attention layer respectively to obtain the corresponding feature maps, which are then added to obtain the intermediate fused feature layer.
Calculating the channel attention weights: the concatenated image (S1) is fed into a global average pooling layer and then successively through two fully connected layers; the first fully connected layer reduces the number of channels to 1/4 of the number of S1 channels and is followed by an H-swish activation function, and the second fully connected layer restores the number of channels to that of S1 and is followed by a sigmoid activation function, yielding weights in the range 0-1. Finally, the obtained weights are multiplied with the visible light feature layer to further retain the detailed information of the visible light image. Calculating the spatial attention weights: the concatenated image is fed into an average pooling layer and a max pooling layer respectively, which perform average and max sampling along the channel dimension without changing the image resolution; the two output feature layers are concatenated in the channel dimension and fed into a 7 × 7 convolution layer with zero padding set to 3 (the image resolution is not changed), followed by a sigmoid activation function, yielding a weight distribution map in the range 0-1, which is multiplied with the infrared feature layer to further retain the amplitude information of the infrared source image. Finally, the feature maps obtained from the two attention branches are added to obtain the intermediate fused image.
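A hedged PyTorch sketch of this channel-space dual attention fusion layer follows. The data flow mirrors the description above (channel attention re-weights the visible feature layer, spatial attention re-weights the infrared feature layer, and the two results are added); the module name and the way the 2C-channel weight vector is matched to the C-channel visible feature map are assumptions, since the text leaves that step implicit.

class DualAttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c_cat = 2 * channels  # channels of the concatenated feature map S1
        # Channel attention: global average pooling + two fully connected layers
        # (realized as 1 x 1 convolutions), H-swish after the first and sigmoid after the second.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_cat, c_cat // 4, kernel_size=1),
            nn.Hardswish(),
            nn.Conv2d(c_cat // 4, c_cat, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: channel-wise max/average maps + 7 x 7 convolution + sigmoid.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_vi, f_ir):
        s1 = torch.cat([f_vi, f_ir], dim=1)               # concatenated feature map S1
        cw = self.channel_att(s1)                          # (B, 2C, 1, 1) channel weights
        cw = cw[:, : f_vi.shape[1]]                        # assumption: first C weights re-weight f_vi
        f_vi_att = f_vi * cw                               # keep visible-light detail information
        max_map, _ = torch.max(s1, dim=1, keepdim=True)
        avg_map = torch.mean(s1, dim=1, keepdim=True)
        sw = self.spatial_att(torch.cat([max_map, avg_map], dim=1))  # (B, 1, H, W) spatial weights
        f_ir_att = f_ir * sw                               # keep infrared amplitude information
        return f_vi_att + f_ir_att                         # intermediate fused feature map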
Step 4: design the loss function. An SSIM structural similarity measurement term is added; in order to further retain the detailed texture information of the visible light image, a gradient operator is introduced; L2 regularization is introduced; and finally a target feature enhancement loss function is designed.
The SSIM structural similarity function reflects the human visual judgment of the similarity of two images. It consists of three parts: the luminance, contrast and structure comparison functions of the images. The luminance similarity is:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

where \mu_x and \mu_y respectively represent the average luminance of the two images, \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, N is the number of pixels of the picture, x_i is the pixel value, x and y represent the two different images, and C_1 = (k_1 \cdot L)^2 prevents the denominator from being 0, with k_1 = 0.01 and L = 255. Next is the picture contrast, which indicates the intensity of the image brightness variation; the contrast similarity function is set as:
c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}

where \sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2} and \sigma_y are respectively the standard deviations of the two images, and C_2 = (k_2 \cdot L)^2 prevents the denominator from being 0, with k_2 = 0.03 and L = 255;
the structure comparison function is:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}

where the covariance is \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y), finally giving

SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)

where C_1, C_2 and C_3 all prevent the denominator from being 0 and C_3 = C_2/2. The value range of SSIM is -1 to 1, and a larger SSIM value indicates higher picture similarity, so the final SSIM measurement loss is L_{SSIM} = 1 - SSIM.
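A minimal sketch of this SSIM loss computed from the global statistics defined above (practical SSIM implementations usually use a sliding window; the whole-image version below simply mirrors the formulas in the text and assumes PyTorch tensors as inputs).

def ssim_loss(x, y, k1=0.01, k2=0.03, L=255.0):
    # L_SSIM = 1 - SSIM, with SSIM built from the luminance, contrast and structure terms.
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x = x.std(unbiased=True)
    sigma_y = y.std(unbiased=True)
    sigma_xy = ((x - mu_x) * (y - mu_y)).sum() / (x.numel() - 1)
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    c = (2 * sigma_x * sigma_y + c2) / (sigma_x ** 2 + sigma_y ** 2 + c2)
    s = (sigma_xy + c3) / (sigma_x * sigma_y + c3)
    return 1.0 - l * c * s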
The above-mentioned gradient loss is

L_{grad} = \left\| \nabla I_f - \nabla V \right\|_1

where V represents the visible light source image, I_f represents the final fused image, \nabla is the gradient operator, and \|\cdot\|_1 represents the L1 norm. Because the visible image has abundant texture information, the reconstruction of the visible image is regularized through this gradient penalty to ensure texture consistency.
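For illustration, a sketch of this gradient penalty using simple finite differences; the patent does not specify the concrete gradient operator, so the operator and the averaging over pixels are assumptions.

def gradient(img):
    # Horizontal and vertical finite-difference gradients of a (B, 1, H, W) tensor.
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def gradient_loss(fused, visible):
    # L1 distance between the gradients of the fused image and the visible image.
    f_dx, f_dy = gradient(fused)
    v_dx, v_dy = gradient(visible)
    return (f_dx - v_dx).abs().mean() + (f_dy - v_dy).abs().mean()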
The L2 regularization is set to

L_{int} = \left\| I_f - X \right\|_2

and mainly measures the intensity consistency between the source images and the fused image, where X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and \|\cdot\|_2 represents the L2 norm.
The target feature enhancement loss function is set to

L_{fea} = \sum_{m=1}^{M} w_e^m \left\| \Phi_f^m - \left( w_{vi} \Phi_{vi}^m + w_{ir} \Phi_{ir}^m \right) \right\|_F^2

where w_{vi} is the weight of the visible light feature layer, w_{ir} is the weight of the infrared feature layer, and \|\cdot\|_F is the Frobenius norm. Since infrared images have more prominent target features than visible images, this loss function is designed to constrain the deep features of the fused image so that the salient features are preserved. M is set to 4, representing the fusion processes at different scales, and w_e^m represents the weight at each scale; owing to the difference in amplitude at different scales, w_e is set to [1, 10, 100, 1000]. \Phi_f^m represents the fusion result of the m-th layer feature map, and \Phi_{vi}^m and \Phi_{ir}^m represent the visible light and infrared feature layers of the m-th layer respectively. Since this loss mainly preserves the salient features of the infrared image, w_{vi} is set smaller than w_{ir}, to 3 and 6 respectively.
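A sketch of this target feature enhancement loss over M = 4 scales; phi_f, phi_vi and phi_ir are assumed to be lists of per-scale feature maps produced by the network, and the squared element sum below realizes the squared Frobenius norm.

def feature_enhancement_loss(phi_f, phi_vi, phi_ir,
                             w_e=(1.0, 10.0, 100.0, 1000.0), w_vi=3.0, w_ir=6.0):
    loss = 0.0
    for m, w in enumerate(w_e):
        target = w_vi * phi_vi[m] + w_ir * phi_ir[m]         # weighted source feature layers
        loss = loss + w * (phi_f[m] - target).pow(2).sum()   # squared Frobenius norm at scale m
    return loss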
The above losses are combined by weighting. The final loss function is

L_{total} = \alpha_1 L_{SSIM} + \alpha_2 L_{grad} + \alpha_3 L_{int} + \lambda L_{fea}

where \alpha_1, \alpha_2 and \alpha_3 are respectively the weight ratios of the loss terms; \alpha_1, \alpha_2 and \alpha_3 are set to 2, 2 and 10, \lambda is set to 5, w_e is set to [1, 10, 100, 1000], and w_{vi} and w_{ir} are set to 3 and 6 respectively.
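Putting the terms together, a hedged sketch of the final weighted loss with the values given above; how the SSIM and intensity terms are split between the infrared and visible source images is an assumption, since the text only lists the weights.

def total_loss(fused, ir, vi, phi_f, phi_vi, phi_ir,
               a1=2.0, a2=2.0, a3=10.0, lam=5.0):
    # Structural similarity to both sources (assumed equal split).
    l_ssim = 0.5 * (ssim_loss(fused, ir) + ssim_loss(fused, vi))
    # Texture consistency with the visible image.
    l_grad = gradient_loss(fused, vi)
    # Intensity consistency with both sources (assumed equal split).
    l_int = 0.5 * ((fused - ir).pow(2).mean() + (fused - vi).pow(2).mean())
    # Target feature enhancement over the multi-scale features.
    l_fea = feature_enhancement_loss(phi_f, phi_vi, phi_ir)
    return a1 * l_ssim + a2 * l_grad + a3 * l_int + lam * l_fea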
Step 5: select the data sets. Experiments were performed on three data sets: TNO, NIR and FLIR.
180 pairs of infrared and visible light images are randomly selected from the FLIR data set as training samples, such as shown in fig. 4 and fig. 5. Prior to training, all images are converted to grayscale and center-cropped to 128 × 128 pixels. The pairs of infrared and visible light images are then fed into the network for training; the loss is calculated according to the loss function, and the network parameters are updated by back-propagation of gradients. The number of training epochs is set to 120, the Adam optimizer is adopted, the learning rate is set to 10^{-3}, and a MultiStepLR learning rate schedule is adopted that reduces the learning rate by a factor of 10 every 40 epochs. After training is completed, the remaining FLIR data, the TNO data set (40 pairs) and the NIR data set are used to verify the fusion effect of the model.
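A minimal training-loop sketch with the hyper-parameters listed above (Adam, learning rate 10^-3, 120 epochs, MultiStepLR dropping the rate by a factor of 10 every 40 epochs); FusionNet and loader are placeholders for the attention fusion network and the FLIR data loader, and the network is assumed to also return the per-scale features needed by the feature enhancement loss.

import torch

model = FusionNet().cuda()                    # placeholder for the attention fusion network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 80], gamma=0.1)

for epoch in range(120):
    for ir, vi in loader:                     # batches of 128 x 128 grayscale image pairs
        ir, vi = ir.cuda(), vi.cuda()
        fused, phi_f, phi_vi, phi_ir = model(ir, vi)   # assumed to return per-scale features too
        loss = total_loss(fused, ir, vi, phi_f, phi_vi, phi_ir)
        optimizer.zero_grad()
        loss.backward()                       # back-propagate gradients
        optimizer.step()
    scheduler.step()                          # reduce the learning rate at epochs 40 and 80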
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An infrared and visible light image fusion method based on an end-to-end attention network is characterized by comprising the following steps:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
2. The end-to-end attention network-based infrared and visible image fusion method according to claim 1, wherein the preprocessing of the infrared image and the visible image comprises: converting the infrared image and the visible light image into grayscale images and performing center cropping.
3. The end-to-end attention network based infrared and visible image fusion method of claim 1,
the encoder in the self-encoding network is used for extracting the multi-scale deep semantic features of the preprocessed images and outputting an infrared feature map and a visible light feature map; and the decoder in the self-coding network is used for reconstructing the infrared feature map and the visible light feature map into the final fused image.
4. The end-to-end attention network-based infrared and visible light image fusion method according to claim 3,
the encoder comprises a plurality of max-pooling down-sampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder is followed by a BatchNorm layer and a ReLU activation function;
the decoder comprises a plurality of up-sampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm layer and a ReLU activation function.
5. The end-to-end attention network-based infrared and visible light image fusion method of claim 4, wherein adding the skip connections to the self-coding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connection path, forming the dense block with different numbers of convolution blocks in different connection paths, and setting the output channels of the convolution blocks based on a fifth preset number.
6. The end-to-end attention network based infrared and visible image fusion method of claim 1, wherein the channel-space dual attention fusion layer comprises: a channel attention module and a spatial attention module;
in the channel-space dual attention fusion layer, the infrared feature map and the visible light feature map are concatenated in the channel dimension; the concatenated image is input to the spatial attention module and the channel attention module respectively to obtain a spatial weight map and a channel weight map; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image.
7. The end-to-end attention network-based infrared and visible light image fusion method according to claim 1, wherein constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
8. The end-to-end attention network-based infrared and visible light image fusion method of claim 7, wherein the SSIM structural similarity metric function comprises: a brightness function, a contrast function and a structure comparison function;
the brightness function is:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

wherein \mu_x and \mu_y respectively represent the average luminance of the two images, \mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, N is the number of pixels of the picture, x_i is the pixel value, x and y represent the two different images, and C_1 = (k_1 \cdot L)^2 prevents the denominator from being 0, with k_1 = 0.01 and L = 255;

the contrast function is:

c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}

wherein \sigma_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2} and \sigma_y are respectively the standard deviations of the two images, and C_2 = (k_2 \cdot L)^2 prevents the denominator from being 0, with k_2 = 0.03 and L = 255;

the structure comparison function is:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}

wherein \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y) represents the covariance of the two pictures, and C_3 = C_2/2 prevents the denominator from being 0;

the SSIM structure similarity measurement function is:

SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y)
9. The end-to-end attention network-based infrared and visible light image fusion method according to claim 8, wherein the gradient loss is:

L_{grad} = \left\| \nabla I_f - \nabla V \right\|_1

wherein V is the visible light source image, I_f is the final fused image, \nabla is the gradient operator, and \|\cdot\|_1 represents the L1 norm;

the L2 regularization is:

L_{int} = \left\| I_f - X \right\|_2

wherein X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and \|\cdot\|_2 represents the L2 norm;

the target feature enhancement loss function is:

L_{fea} = \sum_{m=1}^{M} w_e^m \left\| \Phi_f^m - \left( w_{vi} \Phi_{vi}^m + w_{ir} \Phi_{ir}^m \right) \right\|_F^2

where M denotes the number of scales at which fusion is performed, w_e^m is the weight at the m-th scale, \Phi_f^m is the fusion result of the m-th layer feature map, \Phi_{vi}^m and \Phi_{ir}^m are respectively the visible light feature layer and the infrared feature layer of the m-th layer, w_{vi} is the weight of the visible light feature layer, w_{ir} is the weight of the infrared feature layer, and \|\cdot\|_F is the Frobenius norm.
10. The end-to-end attention network-based infrared and visible light image fusion method of claim 9, wherein the loss function is:
L_{total} = \alpha_1 L_{SSIM} + \alpha_2 L_{grad} + \alpha_3 L_{int} + \lambda L_{fea}

wherein I is the infrared source image, V is the visible light source image, I_f is the final fused image, \|\cdot\|_1 denotes the L1 norm, and \alpha_1, \alpha_2 and \alpha_3 are respectively the weights of the loss functions.
CN202210954041.9A 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network Pending CN115170915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210954041.9A CN115170915A (en) 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210954041.9A CN115170915A (en) 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network

Publications (1)

Publication Number Publication Date
CN115170915A true CN115170915A (en) 2022-10-11

Family

ID=83479073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210954041.9A Pending CN115170915A (en) 2022-08-10 2022-08-10 Infrared and visible light image fusion method based on end-to-end attention network

Country Status (1)

Country Link
CN (1) CN115170915A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631428A (en) * 2022-11-01 2023-01-20 西南交通大学 Unsupervised image fusion method and system based on structural texture decomposition
CN115631428B (en) * 2022-11-01 2023-08-11 西南交通大学 Unsupervised image fusion method and system based on structural texture decomposition
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN117036893A (en) * 2023-10-08 2023-11-10 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117036893B (en) * 2023-10-08 2023-12-15 南京航空航天大学 Image fusion method based on local cross-stage and rapid downsampling
CN117115065A (en) * 2023-10-25 2023-11-24 宁波纬诚科技股份有限公司 Fusion method of visible light and infrared image based on focusing loss function constraint
CN117115065B (en) * 2023-10-25 2024-01-23 宁波纬诚科技股份有限公司 Fusion method of visible light and infrared image based on focusing loss function constraint

Similar Documents

Publication Publication Date Title
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN109447907B (en) Single image enhancement method based on full convolution neural network
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
CN110717857A (en) Super-resolution image reconstruction method and device
CN110599401A (en) Remote sensing image super-resolution reconstruction method, processing device and readable storage medium
CN111986084A (en) Multi-camera low-illumination image quality enhancement method based on multi-task fusion
CN112819910A (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN112465718A (en) Two-stage image restoration method based on generation of countermeasure network
CN112614061A (en) Low-illumination image brightness enhancement and super-resolution method based on double-channel coder-decoder
CN114782298B (en) Infrared and visible light image fusion method with regional attention
CN114219719A (en) CNN medical CT image denoising method based on dual attention and multi-scale features
CN110599585A (en) Single-image human body three-dimensional reconstruction method and device based on deep learning
CN113724134A (en) Aerial image blind super-resolution reconstruction method based on residual distillation network
Liu et al. Learning noise-decoupled affine models for extreme low-light image enhancement
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN116934592A (en) Image stitching method, system, equipment and medium based on deep learning
CN115641391A (en) Infrared image colorizing method based on dense residual error and double-flow attention
CN115272072A (en) Underwater image super-resolution method based on multi-feature image fusion
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
CN117408924A (en) Low-light image enhancement method based on multiple semantic feature fusion network
Kumar et al. Underwater image enhancement using deep learning
CN114862699A (en) Face repairing method, device and storage medium based on generation countermeasure network
CN113870162A (en) Low-light image enhancement method integrating illumination and reflection
CN112150566A (en) Dense residual error network image compressed sensing reconstruction method based on feature fusion
CN117710216B (en) Image super-resolution reconstruction method based on variation self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination