CN115170915A - Infrared and visible light image fusion method based on end-to-end attention network - Google Patents
Infrared and visible light image fusion method based on end-to-end attention network
- Publication number
- CN115170915A (application CN202210954041.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- infrared
- visible light
- fusion
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides an infrared and visible light image fusion method based on an end-to-end attention network, comprising: preprocessing the infrared and visible light images; constructing an end-to-end attention network, wherein the attention network comprises a self-encoding network and a channel-spatial dual-attention fusion layer, the self-encoding network comprising an encoder and a decoder connected by skip connections; and fusing the preprocessed infrared and visible light images with the attention network. The invention overcomes the shortcoming of traditional fusion methods, which extract features from different source images with one and the same procedure, reduces the limitations of manually designed fusion strategies, and finally generates a single fused image containing the feature information of multiple source images.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an infrared and visible light image fusion method based on an end-to-end attention network.
Background
Image fusion is an image processing technique for enhancing image information. Owing to theoretical and hardware limitations, a single sensor cannot effectively and comprehensively describe scene information under a given shooting setting: for example, a visible light image contains more detailed texture information, while an infrared image contains more amplitude information. Image fusion therefore combines the complementary information of different source images of the same scene to generate a single, more informative image, which can be applied in many fields such as photographic visualization, target tracking, medical diagnosis and remote sensing monitoring.
In general, image fusion algorithms can be classified into two types: traditional methods and deep-learning methods. Early image fusion methods generated activity-level maps by mathematical transformations and designed fusion rules in the spatial and transform domains; representative traditional image fusion methods include multi-scale-transform-based, sparse-representation-based, subspace-based, saliency-based and total-variation-based methods. On the one hand, these traditional fusion algorithms apply the same feature extraction and reconstruction to different source images, ignoring the feature differences between them, which leads to a poor fusion effect; on the other hand, their fusion rules are designed manually and are too simple, severely limiting fusion performance and restricting each algorithm to a specific fusion task. In recent years, with the continuous development of deep learning, its application to image fusion has become ever wider; current deep-learning fusion algorithms mainly comprise three types, namely generative adversarial network (GAN)-based, autoencoder (AE)-based, and conventional convolutional neural network (CNN)-based methods. Moreover, with a deep-learning method the loss function can be designed freely and the network parameters updated by backward gradient propagation, yielding a more reasonable, adaptive feature fusion strategy. Thanks to these advantages, deep learning has driven great progress in image fusion, achieving performance far exceeding that of conventional methods.
Although deep learning has achieved satisfactory results in the image fusion field, some shortcomings remain: (1) existing deep-learning network architectures do not fully consider the intermediate feature layers, designing the loss function only from the final fused image and the source images; (2) most fusion algorithms use the deep-learning model only in the feature extraction and feature reconstruction stages, while feature fusion still relies on traditional rules such as feature-map addition, maximum or mean; and (3) some deep-learning models adopt a two-stage training method, which is time-consuming and difficult to train.
Disclosure of Invention
In order to solve the technical problem, the invention provides an infrared and visible light image fusion method based on an end-to-end attention network, so as to improve the image fusion effect.
In order to achieve the above object, the present invention provides an infrared and visible light image fusion method based on an end-to-end attention network, comprising:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-encoding network and a channel-spatial dual-attention fusion layer, the self-encoding network comprising an encoder and a decoder connected by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
Optionally, preprocessing the infrared image and the visible light image comprises: converting the infrared image and the visible light image into gray-scale images and performing center cropping.
Optionally, an encoder in the self-encoding network is configured to extract a multi-scale deep semantic feature of the preprocessed image, and output an infrared feature map and a visible light feature map; and the decoder in the self-coding network is used for reconstructing the infrared characteristic diagram and the visible light characteristic diagram into a final fusion image.
Optionally, the encoder includes a plurality of max-pooling downsampling layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder includes a BatchNorm regularization function and a ReLU activation function;
the decoder comprises a plurality of upsampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and each convolution block of the decoder contains a BatchNorm regularization function and a ReLU activation function.
Optionally, adding the skip connections to the self-encoding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of an upsampling layer in the decoder, adding a dense block in each connecting path, forming the dense block from a different number of convolution blocks in different connecting paths, and setting the output channels of the convolution blocks based on a fifth preset number.
Optionally, the channel-space dual attention fusion layer comprises: a channel attention module and a spatial attention module;
in the channel-spatial dual-attention fusion layer, the infrared feature map and the visible light feature map are concatenated along the channel dimension; the concatenated image is fed into the spatial attention module and the channel attention module to obtain a spatial weight map and a channel weight map, respectively; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image. This intermediate fusion layer both produces the intermediate fused image and lets the network focus on the regions that require more attention. The intermediate fused image is then fed into the decoder of the self-encoding network to obtain the final fused image.
Optionally, constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structural similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally computing a weighted sum of the losses.
Optionally, the SSIM structural similarity metric function includes: a brightness function, a contrast function and a structure comparison function;
the brightness function is:
l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1)
wherein μ_x and μ_y respectively represent the average luminance of the two images, μ_x = (1/N)·Σ_{i=1}^{N} X_i, N is the number of pixels of the picture, X_i is the value of the i-th pixel, x and y represent two different images, and C_1 = (k_1·L)^2 prevents the denominator from being 0, with k_1 taken as 0.01 and L as 255;
the contrast function is:
c(x, y) = (2·σ_x·σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)
wherein σ_x and σ_y respectively represent the standard deviations of the two images, and C_2 = (k_2·L)^2 prevents the denominator from being 0, with k_2 taken as 0.03 and L as 255;
the structural comparison function is:
s(x, y) = (σ_xy + C_3) / (σ_x·σ_y + C_3)
wherein σ_xy represents the covariance of the two pictures, and C_3 = C_2/2 prevents the denominator from being 0;
The SSIM structural similarity measurement function is:
SSIM(x, y) = l(x, y) · c(x, y) · s(x, y);
the gradient loss is:
L_grad = ||∇F − ∇V||_1
wherein V is the visible light source image, F is the final fused image, ∇ is the gradient operator, and ||·||_1 represents the L1 norm;
the L2 regularization is:wherein, X is an unknown number of the setting, representing a visible light gray scale image and an infrared gray scale image, | | | calving 2 Represents the L2 norm;
the target feature enhancement loss function is:
L_fea = Σ_{m=1}^{M} w_e^m · ||F_f^m − (w_vi·Φ_vi^m + w_ir·Φ_ir^m)||_F^2
wherein M denotes the number of fusion scales, w_e^m is the weight at the m-th scale, F_f^m is the fusion result of the m-th-layer feature map, Φ_vi^m and Φ_ir^m are respectively the visible light feature layer and the infrared feature layer of the m-th layer, w_vi is the weight of the visible light feature layer, w_ir is the weight of the infrared feature layer, and ||·||_F is the Frobenius norm.
Optionally, the final loss function is:
L_total = α_1·L_SSIM + α_2·L_grad + α_3·L_2 + L_fea
wherein I is the infrared source image, V is the visible light source image, F is the final fused image, L_grad = ||∇F − ∇V||_1 is the gradient loss, and α_1, α_2 and α_3 are respectively the weights of the loss functions (L_fea carries its own per-scale weights w_e).
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention uses an encoding-decoding network structure: the multi-scale deep features of the input picture are fully extracted in the encoding stage and effectively reconstructed in the decoding stage. Skip connections are further introduced into the self-encoding network, which effectively mitigates gradient vanishing and reuses the multi-scale feature layers, thereby strengthening the network's ability to extract and reconstruct features. Meanwhile, since features at different scales carry different semantic information and are not suited to direct connection, different numbers of convolution blocks are used between different connection layers to eliminate and balance this difference.
2. The invention uses a channel-spatial dual-attention neural network fusion structure which, unlike traditional manually designed fusion strategies, effectively preserves the amplitude information of the infrared image and the detailed texture information of the visible light image.
3. The invention designs a brand-new loss function that introduces an SSIM structural similarity function, L2 regularization, a gradient operator and a target feature enhancement loss function, effectively extracting the salient and detailed features of the source images. It adopts an end-to-end network structure, discarding the manually designed fusion strategy and the two-stage training strategy, making training faster and the fusion result more effective.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments of the application are intended to be illustrative of the application and are not intended to limit the application. In the drawings:
fig. 1 is a schematic flow chart of an infrared and visible light image fusion method based on an end-to-end attention network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a spatial-channel dual attention fusion layer structure according to an embodiment of the present invention;
FIG. 4 is an infrared image of an embodiment of the present invention;
FIG. 5 is a visible light image according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than here.
Examples
As shown in fig. 1, the present embodiment provides an infrared and visible light image fusion method based on an end-to-end attention network, including:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network including an encoder-decoder joined in a hopping connection;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
Further, preprocessing the infrared image and the visible light image includes: converting the infrared and visible light images into gray-scale images and performing center cropping.
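As an illustrative sketch of this preprocessing step (the BT.601 luma weights and the 256 × 256 crop size are assumptions; the patent specifies neither):

```python
import numpy as np

def to_gray(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB array to a gray-scale map (BT.601 weights, assumed)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    """Cut a size x size patch out of the center of the image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

rgb = np.random.rand(480, 640, 3)       # stand-in for a visible light frame
patch = center_crop(to_gray(rgb), 256)  # gray-scale conversion, then center crop
```

The same two steps would be applied to both the infrared and the visible light source images before they enter the network.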
Further, the self-encoding network comprises: an encoder and a decoder;
the encoder is used for extracting the multi-scale deep semantic features of the preprocessed image and outputting an infrared feature map and a visible light feature map; and the decoder is used for reconstructing a final fusion image according to the infrared characteristic diagram and the visible light characteristic diagram.
Furthermore, the encoder is composed of a plurality of max-pooling downsampling layers and a plurality of convolution blocks; the number of input channels of the encoder is set based on a first preset number, the number of output channels based on a second preset number, and each convolution block of the encoder contains a BatchNorm regularization function and a ReLU activation function;
the decoder is composed of a plurality of upsampling layers and a plurality of convolution blocks; the number of input channels of the decoder is set based on a third preset number, the number of output channels based on a fourth preset number, and each convolution block of the decoder is followed by a BatchNorm regularization function and a ReLU activation function.
Further, adding the skip connections to the self-encoding network comprises:
connecting the input of each max-pooling layer in the encoder with the output of an upsampling layer in the decoder, adding a dense block in each connecting path, forming the dense block from a different number of convolution blocks in different connecting paths, and setting the output channels of the convolution blocks based on a fifth preset number.
Further, the channel-space dual attention fusion layer includes: a channel attention module and a spatial attention module;
in the channel-spatial dual-attention fusion layer, the infrared feature map and the visible light feature map are concatenated along the channel dimension; the concatenated image is fed into the spatial attention module and the channel attention module to obtain a spatial weight map and a channel weight map, respectively; the weight maps are multiplied with the infrared feature map and the visible light feature map, and the spatial and channel attention fusion features are then added to obtain an intermediate fused image. This intermediate fusion layer both produces the intermediate fused image and lets the network focus on the regions that require more attention. The intermediate fused image is then fed into the decoder of the self-encoding network to obtain the final fused image.
Further, constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structural similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally computing a weighted sum of the losses.
This embodiment provides an infrared and visible light image fusion method that fuses the complementary, beneficial information of images of different modalities in order to describe the imaging scene more comprehensively. The method comprises: (1) constructing an encoder-decoder self-encoding network to extract the deep semantic information of the input images and reconstruct the fused image; (2) adding skip connections to the self-encoding network, with dense blocks introduced in the skip connections to reduce the difference in semantic richness between connected layers; (3) constructing a channel-spatial dual-attention fusion layer to further retain the texture information of the visible light image and the amplitude information of the infrared image; (4) designing a suitable loss function and selecting relevant data sets to train and test the performance of the fusion network. The invention overcomes the shortcoming of traditional fusion methods, which extract features from different source images with one and the same procedure, reduces the limitations of manually designed fusion strategies, and finally generates a single fused image containing the feature information of multiple source images. The specific implementation steps are as follows:
Step 1: The encoder network is composed of 3 max-pooling downsampling layers and 9 common convolution blocks, and the decoder network of 3 upsampling layers and 7 common convolution blocks, connected layer by layer, as shown in fig. 2. The first convolution block in the self-encoding network uses a 1 × 1 convolution kernel with reflection padding (ReflectionPad) to prevent edge artifacts in the fused image; its number of input channels is set to 1 and its number of output channels to 16. All other common convolution blocks use 3 × 3 convolution kernels with stride 1 and zero padding, leaving the image resolution unchanged. The numbers of input channels of the encoder are set to 16, 64, 64, 128, 128, 256, 256, 256, 256 respectively, and the numbers of output channels to 64, 64, 128, 128, 256, 256, 256. Three max-pooling downsampling layers with stride 2 are set in the encoder stage, and the decoder stage doubles the size of the feature map by bilinear interpolation. The numbers of input channels of the convolution blocks in the decoding stage are 512, 256, 256, 128, 128, 64, 64 respectively, and the numbers of output channels 256, 128, 128, 64, 64. Each convolution block is followed by a BatchNorm regularization function and a ReLU activation function.
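To illustrate only the resolution bookkeeping of the encoder stage (not the actual learned convolutions), a minimal stride-2 max-pooling sketch:

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Stride-2 max pooling over a (C, H, W) feature map (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

feat = np.random.rand(16, 256, 256)  # 16 channels after the first 1x1 block
for _ in range(3):                   # the encoder applies three max-pooling layers
    feat = max_pool_2x2(feat)
# spatial resolution shrinks 256 -> 128 -> 64 -> 32; the channel growth
# (16 -> ... -> 256) is done by the convolution blocks, omitted here
```

The decoder mirrors this with three bilinear upsamplings, restoring the original resolution.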
Step 2: Skip connections are added to the self-encoding network, which mitigates the gradient-vanishing problem of the deep neural network and further compensates for the information loss caused by the downsampling and upsampling processes. Meanwhile, considering the difference in semantic information between connected layers, direct connection is not suitable, so different numbers of convolution blocks are used on different connection layers to realize the skip connections.
As shown in fig. 2, the input of the first max-pooling layer and the output of the third upsampling layer are connected along the channel dimension through a skip connection. Specifically, 4 convolution blocks are adopted; the input channels of the blocks are set to 64, 64, 128 and 192, and the output channels are all set to 64. The connection mode is as follows: denoting the four convolution blocks A1, A2, A3 and A4, the output of A1 serves as the input of A2, the channel concatenation of A1 and A2 as the input of A3, and the channel concatenation of A1, A2 and A3 as the input of A4. Each convolution block uses a 3 × 3 convolution kernel with zero padding (padding set to 1), leaving the image resolution unchanged, and is followed by BatchNorm regularization and a ReLU activation function. The input of the second max-pooling layer is connected to the output of the second upsampling layer in the same way, but, to balance the difference in semantic information between deep and shallow layers, three convolution blocks are used; the input of the third max-pooling layer is connected to the output of the first upsampling layer, for which 2 convolution blocks are selected in the same way.
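The input-channel counts 64, 64, 128, 192 of A1–A4 follow directly from this DenseNet-style concatenation; a small sanity check (the 64-channel growth rate is taken from the description above):

```python
GROWTH = 64  # every block Ak in the dense skip path outputs 64 channels

def dense_block_inputs(n_blocks):
    """Input-channel count of each block in the skip-connection dense block.

    A1 takes the 64-channel skip feature; A2 takes the output of A1;
    Ak (k >= 3) takes the channel concatenation of the outputs of A1..A(k-1).
    """
    chans = [GROWTH]  # A1
    for k in range(2, n_blocks + 1):
        chans.append(GROWTH * (k - 1))
    return chans

print(dense_block_inputs(4))  # the four-block path on the first skip connection
```

The same rule gives the channel counts for the three-block and two-block paths on the other two skip connections.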
Step 3: A channel-spatial dual-attention fusion layer is constructed (as shown in fig. 3) to extract the amplitude and detailed texture information of the infrared and visible light images respectively. The feature information of the two source images extracted by the encoder is concatenated along the channel dimension and fed into the channel and spatial attention layers respectively to obtain the corresponding feature maps, which are then added to obtain the intermediate feature-layer fused image.
Calculating the channel attention weights: the concatenated image (S1) is fed into a global average pooling layer and then successively through two fully connected layers. After the first fully connected layer, an H-swish activation function is applied and the number of output channels is 1/4 of the number of channels of S1; after the second layer, a sigmoid activation function yields weights in the range 0–1 with a channel count equal to that of S1. Finally, the obtained weights are multiplied with the visible light feature layer to further retain the detailed information of the visible light image. Calculating the spatial attention weights: the concatenated image is fed into an average pooling layer and a max pooling layer respectively, performing max and average sampling along the channel dimension without changing the image resolution; the two output feature layers are concatenated along the channel dimension and fed into a 7 × 7 convolution layer with zero padding (padding set to 3), leaving the image resolution unchanged, followed by a sigmoid activation function to obtain a weight distribution map in the range 0–1, which is multiplied with the infrared feature layer to further retain the amplitude information of the infrared source image. Finally, the feature maps obtained from the two attention structures are added to obtain the intermediate fused image.
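A rough numpy sketch of the two weight computations. The learned parameters are random stand-ins, ReLU replaces H-swish, and a 1 × 1 channel mix replaces the 7 × 7 convolution to keep the sketch short — all three are simplifying assumptions, not the patent's exact layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_weights(s1, w1, w2):
    """Channel attention: global average pool, bottleneck to C/4, back to C, sigmoid."""
    v = s1.mean(axis=(1, 2))     # global average pooling -> (C,)
    h = np.maximum(w1 @ v, 0.0)  # first FC layer (ReLU stands in for H-swish)
    return sigmoid(w2 @ h)       # per-channel weights in (0, 1)

def spatial_weights(s1, mix):
    """Spatial attention: channel-wise max and mean maps, mixed and squashed."""
    m = np.stack([s1.max(axis=0), s1.mean(axis=0)])       # (2, H, W)
    return sigmoid((mix[:, None, None] * m).sum(axis=0))  # (H, W) weights

C, H, W = 8, 4, 4
s1 = np.random.rand(C, H, W)  # stand-in for the concatenated IR+VIS features
cw = channel_weights(s1, np.random.randn(C // 4, C), np.random.randn(C, C // 4))
sw = spatial_weights(s1, np.array([0.5, 0.5]))
```

`cw` would be broadcast over the visible light feature layer and `sw` over the infrared feature layer before the two attended maps are added.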
Step 4: Designing the loss function: an SSIM structural similarity measurement function is added; to further retain the detailed texture information of the visible light image, a gradient operator is introduced; L2 regularization is introduced; and finally a target feature enhancement loss function is designed.
The SSIM structural similarity function reflects well how human vision judges the similarity of two images. It consists of three components: the brightness, contrast and structure comparison functions. The brightness similarity is:
l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1)
where μ_x and μ_y respectively represent the average luminance of the two images, μ_x = (1/N)·Σ_{i=1}^{N} X_i, N is the number of pixels of the picture, X_i is the value of the i-th pixel, x and y represent two different images, and C_1 = (k_1·L)^2 prevents the denominator from being 0, with k_1 taken as 0.01 and L as 255. Next is the picture contrast, which indicates the intensity of brightness variation in the image; the contrast similarity function is set as:
c(x, y) = (2·σ_x·σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)
wherein σ_x and σ_y respectively represent the standard deviations of the two images, and C_2 = (k_2·L)^2 prevents the denominator from being 0, with k_2 taken as 0.03 and L as 255;
the structure comparison function is:
wherein the covarianceTo obtain finallyWherein C is 1 、C 2 、C 3 All for preventing the generation of a denominator of 0, C 3 =C 2 /2. The value range of SSIM is-1 to 1, and the larger the SSIM value is, the higher the picture similarity is, so that the final loss function of SSIM measurement is L SSIM =1-SSIM。
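A compact NumPy sketch of the SSIM terms above (computed from global statistics over the whole image, rather than the sliding-window form often used in practice; function names are our own):

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray,
                k1: float = 0.01, k2: float = 0.03, L: float = 255.0) -> float:
    """SSIM = l(x,y) * c(x,y) * s(x,y), with the constants defined above."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sig_x, sig_y = x.std(), y.std()
    sig_xy = ((x - mu_x) * (y - mu_y)).mean()       # covariance
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)   # luminance
    c = (2 * sig_x * sig_y + c2) / (sig_x ** 2 + sig_y ** 2 + c2)  # contrast
    s = (sig_xy + c3) / (sig_x * sig_y + c3)        # structure
    return l * c * s

def ssim_loss(x: np.ndarray, y: np.ndarray) -> float:
    """L_SSIM = 1 - SSIM, so identical images give zero loss."""
    return 1.0 - ssim_global(x, y)
```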
The above-mentioned gradient loss is L_grad = ||∇F − ∇V||_1, where V represents the visible light source image, F represents the final fused image, ∇ is the gradient operator, and ||·||_1 represents the L1 norm. Because the visible image has abundant texture information, the reconstruction of the visible image is regularized through this gradient penalty to ensure texture consistency.
The L2 regularization term is set to L_int = ||F − X||_2, which mainly measures the intensity consistency between a source image and the fused image, where X is a placeholder representing either the visible light grayscale image or the infrared grayscale image, and ||·||_2 represents the L2 norm.
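The gradient and intensity terms above can be sketched in NumPy as follows (forward differences stand in for the patent's unspecified gradient operator; function names are our own):

```python
import numpy as np

def gradient_loss(fused: np.ndarray, visible: np.ndarray) -> float:
    """L1 distance between image gradients: L_grad = ||grad(F) - grad(V)||_1,
    using simple forward differences along each axis."""
    dfx, dfy = np.diff(fused, axis=1), np.diff(fused, axis=0)
    dvx, dvy = np.diff(visible, axis=1), np.diff(visible, axis=0)
    return float(np.abs(dfx - dvx).sum() + np.abs(dfy - dvy).sum())

def intensity_loss(fused: np.ndarray, source: np.ndarray) -> float:
    """L2 intensity consistency between the fused image and a source
    (visible or infrared grayscale) image: L_int = ||F - X||_2."""
    return float(np.linalg.norm(fused - source))
```

A constant brightness offset changes the intensity term but not the gradient term, which is why both are needed: one anchors amplitudes, the other anchors texture.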
The target feature enhancement loss function is set as

L_feat = Σ_{m=1}^{M} w_e^m · ||Φ_F^m − (w_vi·Φ_V^m + w_ir·Φ_I^m)||_F²

where w_vi is the weight of the visible light feature layer, w_ir is the weight of the infrared feature layer, and ||·||_F is the Frobenius norm. Since infrared images have more prominent target features than visible images, this loss function is designed to constrain the depth features of the fused image so that the salient features are preserved. M is set to 4, corresponding to the fusion process at the four different scales, and w_e represents the weights at the different scales; because the amplitudes differ across scales, w_e is set to [1, 10, 100, 1000]. Φ_F^m represents the fusion result of the feature map at the m-th layer, and Φ_V^m and Φ_I^m represent the visible and infrared feature layers of the m-th layer, respectively. Since this loss function mainly preserves the salient features of the infrared image, w_vi is set smaller than w_ir, at 3 and 6 respectively.
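A NumPy sketch of the multi-scale feature enhancement loss in the assumed form above (feature maps are passed as lists of 2-D arrays, one per scale; names and the exact combination of terms are our reconstruction from the surviving symbol definitions):

```python
import numpy as np

def feature_enhance_loss(fused_feats, vis_feats, ir_feats,
                         w_e=(1.0, 10.0, 100.0, 1000.0),
                         w_vi=3.0, w_ir=6.0) -> float:
    """For each of the M scales, the squared Frobenius distance between the
    fused feature map and the weighted sum of visible and infrared feature
    maps, scaled by the per-scale weight w_e[m]."""
    total = 0.0
    for w, f, v, i in zip(w_e, fused_feats, vis_feats, ir_feats):
        target = w_vi * v + w_ir * i
        total += w * np.linalg.norm(f - target, ord='fro') ** 2
    return float(total)
```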
Each of the above loss functions is weighted; the final loss function is

L_total = α_1·L_SSIM + α_2·L_grad + α_3·L_int + λ·L_feat

where α_1, α_2 and α_3 are the weight ratios of the corresponding loss functions and are set to 2, 2 and 10 respectively; λ is set to 5; w_e is set to [1, 10, 100, 1000]; and w_vi and w_ir are set to 3 and 6 respectively.
Step 5: selecting data sets: experiments are performed on three data sets, namely TNO, NIR and FLIR.
180 pairs of images, each pair consisting of one infrared and one visible image, are randomly selected from the FLIR data set as training samples, as shown for example in figs. 4 and 5. Before training, all images are converted to grayscale and center-cropped to 128×128 pixels. The pairs of infrared and visible light images are then fed into the network for training; the loss is calculated according to the loss function above, and the network parameters are updated by gradient back-propagation. Training runs for 120 epochs with the Adam optimizer at a learning rate of 10^-3, using the MultiStepLR learning-rate adjustment strategy, which reduces the learning rate by a factor of 10 every 40 epochs. After training is completed, the remaining FLIR data, the TNO (40) data set and the NIR data set are used to verify the fusion effect of the model.
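The optimizer and schedule described above can be sketched in PyTorch as follows (a trivial convolution stands in for the fusion network; the data loading and loss computation inside the loop are elided):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the fusion network. The optimizer and schedule
# mirror the description: Adam, learning rate 1e-3, 120 epochs, learning
# rate divided by 10 every 40 epochs via MultiStepLR.
model = nn.Conv2d(2, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 80], gamma=0.1)

for epoch in range(120):
    # ... iterate over infrared/visible pairs, compute the total loss,
    #     call loss.backward() and optimizer.step() here ...
    scheduler.step()  # advance the learning-rate schedule once per epoch

# after both milestones the learning rate is 1e-3 * 0.1 * 0.1, i.e. about 1e-5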
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. An infrared and visible light image fusion method based on an end-to-end attention network is characterized by comprising the following steps:
preprocessing the infrared image and the visible light image;
constructing an end-to-end attention network; wherein the attention network comprises: a self-coding network and a channel-space dual attention fusion layer, the self-coding network including an encoder and a decoder joined by skip connections;
and fusing the preprocessed infrared image and the preprocessed visible light image based on the attention network.
2. The end-to-end attention network-based infrared and visible image fusion method according to claim 1, wherein the preprocessing the infrared image and the visible image comprises: and converting the infrared image and the visible light image into a gray scale image, and performing center cutting.
3. The end-to-end attention network based infrared and visible image fusion method of claim 1,
the encoder in the self-encoding network is used for extracting the multi-scale deep semantic features of the preprocessed image and outputting an infrared feature map and a visible light feature map; and the decoder in the self-coding network is used for reconstructing the infrared characteristic diagram and the visible light characteristic diagram into a final fusion image.
4. The end-to-end attention network-based infrared and visible light image fusion method according to claim 3,
the encoder comprises a plurality of maximum pooled downsampled layers and a plurality of convolution blocks, the number of input channels of the encoder is set based on a first preset number, the number of output channels of the encoder is set based on a second preset number, and each convolution block of the encoder comprises a Batchnorm regularization function and a RELU activation function;
the decoder comprises a plurality of upsampling layers and a plurality of convolution blocks, the number of input channels of the decoder is set based on a third preset number, the number of output channels of the decoder is set based on a fourth preset number, and a Batchnorm regularization function and a RELU activation function are contained behind each convolution block of the decoder.
5. The end-to-end attention network-based infrared and visible light image fusion method of claim 4, wherein adding the skip connections to the self-coding network comprises:
connecting the input of each maximum pooling layer in the encoder with the output of the corresponding up-sampling layer in the decoder, adding a dense block in each connecting path, forming the dense blocks from different numbers of convolution blocks in different connecting paths, and setting the output channels of the convolution blocks based on a fifth preset number.
6. The end-to-end attention network based infrared and visible image fusion method of claim 1, wherein the channel-space dual attention fusion layer comprises: a channel attention module and a spatial attention module;
in the channel-space double-attention fusion layer, the infrared feature map and the visible light feature map are connected in channel dimension, the spliced images are respectively input to the space attention module and the channel attention module to obtain a space weight map and a channel weight map, the space weight map and the channel weight map are multiplied by the infrared feature map and the visible light feature map, and then the space and channel attention fusion features are added to obtain an intermediate fusion image.
7. The end-to-end attention network-based infrared and visible light image fusion method according to claim 1, wherein constructing the attention network further comprises: setting a loss function;
setting the loss function includes: adding an SSIM structure similarity measurement function, introducing a gradient operator, introducing L2 regularization, designing a target feature enhancement loss function, and finally performing weighted calculation on each loss.
8. The end-to-end attention network-based infrared and visible light image fusion method of claim 7, wherein the SSIM structural similarity metric function comprises: a brightness function, a contrast function and a structure comparison function;
the brightness function is:

l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x² + μ_y² + C_1)

wherein μ_x and μ_y respectively represent the average luminance of the two images, μ_x = (1/N)·Σ x_i, N is the number of pixels of the picture, x_i is a pixel value, x and y represent two different images, and C_1 prevents the denominator from being 0, C_1 = (k_1·L)², k_1 being 0.01 and L being 255;
the contrast function is:

c(x, y) = (2·σ_x·σ_y + C_2) / (σ_x² + σ_y² + C_2)

wherein σ_x and σ_y respectively represent the standard deviations of the two images, and C_2 prevents the denominator from being 0, C_2 = (k_2·L)², k_2 being 0.03 and L being 255;
the structure comparison function is:

s(x, y) = (σ_xy + C_3) / (σ_x·σ_y + C_3)

wherein σ_xy represents the covariance of the two images, and C_3 prevents the denominator from being 0, C_3 = C_2/2;
the SSIM structural similarity metric function is: SSIM(x, y) = l(x, y)·c(x, y)·s(x, y).
9. The end-to-end attention network-based infrared and visible light image fusion method according to claim 8, wherein the gradient loss is: L_grad = ||∇F − ∇V||_1,
wherein V is the visible light source image, F is the final fused image, ∇ is the gradient operator, and ||·||_1 represents the L1 norm;
the L2 regularization is: L_int = ||F − X||_2, wherein X is a placeholder representing the visible light grayscale image or the infrared grayscale image, and ||·||_2 represents the L2 norm;
the target feature enhancement loss function is:

L_feat = Σ_{m=1}^{M} w_e^m · ||Φ_F^m − (w_vi·Φ_V^m + w_ir·Φ_I^m)||_F²

wherein M denotes the number of scales in the fusion process, w_e denotes the weights at the different scales, Φ_F^m is the fusion result of the m-th layer feature map, Φ_V^m and Φ_I^m are respectively the visible light feature layer and the infrared feature layer of the m-th layer, w_vi is the weight of the visible light feature layer, w_ir is the weight of the infrared feature layer, and ||·||_F is the Frobenius norm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210954041.9A CN115170915A (en) | 2022-08-10 | 2022-08-10 | Infrared and visible light image fusion method based on end-to-end attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115170915A true CN115170915A (en) | 2022-10-11 |
Family
ID=83479073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210954041.9A Pending CN115170915A (en) | 2022-08-10 | 2022-08-10 | Infrared and visible light image fusion method based on end-to-end attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115170915A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115631428A (en) * | 2022-11-01 | 2023-01-20 | 西南交通大学 | Unsupervised image fusion method and system based on structural texture decomposition |
CN116664462A (en) * | 2023-05-19 | 2023-08-29 | 兰州交通大学 | Infrared and visible light image fusion method based on MS-DSC and I_CBAM |
CN117036893A (en) * | 2023-10-08 | 2023-11-10 | 南京航空航天大学 | Image fusion method based on local cross-stage and rapid downsampling |
CN117115065A (en) * | 2023-10-25 | 2023-11-24 | 宁波纬诚科技股份有限公司 | Fusion method of visible light and infrared image based on focusing loss function constraint |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||