CN114639002A - Infrared and visible light image fusion method based on multi-mode characteristics - Google Patents

Infrared and visible light image fusion method based on multi-mode characteristics

Info

Publication number
CN114639002A
CN114639002A (application CN202210244332.9A)
Authority
CN
China
Prior art keywords
fusion
image
infrared
feature
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210244332.9A
Other languages
Chinese (zh)
Inventor
刘向增
高豪杰
苗启广
宋建锋
纪建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210244332.9A priority Critical patent/CN114639002A/en
Publication of CN114639002A publication Critical patent/CN114639002A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared and visible light image fusion method based on multi-mode features, which comprises the following steps: 1. constructing an encoder-decoder network for extracting multi-mode features; 2. measuring the multi-mode features with entropy, gradient and saliency, and designing an adaptive loss function; 3. constructing a fusion weight learning model embedded with a Transformer fusion strategy; 4. taking the saliency map of the infrared image as a label and adding it for region selection when optimizing the fusion network; 5. cascading the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network and training it. The method uses a Transformer to capture the global relevance of multi-scale features, takes both local and global information into account, and improves the overall visual effect of the fused image; with a multi-mode adaptive fusion strategy it retains the multi-mode feature information of the images and improves the quality of the fused image.

Description

Infrared and visible light image fusion method based on multi-mode characteristics
Technical Field
The application belongs to the field of image fusion, and particularly relates to an infrared and visible light image fusion method based on multi-mode characteristics.
Background
Image fusion refers to combining images obtained by different types of sensors to generate a robust, information-rich image that facilitates subsequent processing or decision making. Complex applications require comprehensive information about a scene in order to understand it fully. A single-mode sensor can perceive only one kind of scene information about a target and cannot perceive the target at multiple granularities. Fusion techniques therefore play an increasingly important role in modern applications and computer vision.
Due to the limitations of physical sensors, the scene information captured by infrared and visible light images differs greatly. A visible light image captures reflected light; such images generally have high spatial resolution, rich color and texture detail, and high contrast, and are well suited to human visual perception, but they are easily affected by illumination: in poor lighting conditions, such as bad weather or at night, their quality drops sharply. An infrared image captures thermal radiation; an image describing the thermal radiation of objects resists interference such as severe weather and insufficient light, but it generally has low spatial resolution and lacks texture and color information. Infrared and visible light image fusion combines an infrared image and a visible light image of the same scene, exploiting the complementarity of the two to generate a fused image that is robust and information-rich. Infrared and visible light image fusion technology is widely applied in fields such as target detection, image enhancement, video surveillance and remote sensing.
Infrared and visible light image fusion methods fall mainly into traditional methods and deep learning methods. Traditional image fusion methods mainly use multi-scale transforms (MST), sparse representation (SR), saliency-based methods, hybrid models, and other approaches. These methods achieve good fusion performance, but problems such as hand-crafted features and high computational complexity remain. Among deep learning based methods, models such as FusionGAN, AttentionFGAN and NestFuse improve on the shortcomings of the traditional methods, but they still have certain limitations. First, deep learning networks usually extract feature maps directly from the preceding convolutional layers and ignore global information, leading to low-quality fusion results. Second, in encoder-decoder models with a fusion strategy, an overly simple fusion strategy can leave image edges unclear. Finally, the design of the loss function also affects network convergence: an improper loss function not only slows convergence but can also cause artifacts and blurred boundaries in the fusion result.
Disclosure of Invention
The invention aims to provide an infrared and visible light image fusion method based on multi-mode features, which addresses the lack of global information in the fusion process, fuses the multi-mode features adaptively, incorporates saliency information, and finally achieves good fusion.
To achieve this task, the invention adopts the following technical scheme:
An infrared and visible light image fusion method based on multi-mode features, comprising the following steps:
Step 1, constructing a feature extraction and image reconstruction network, and, guided by a loss function, optimally generating a multi-mode feature encoder-decoder network based on a multi-scale convolution network;
Step 2, extracting infrared and visible light multi-mode features through the encoder-decoder network, measuring the multi-mode features with entropy, gradient and saliency, and designing a multi-mode adaptive loss;
Step 3, constructing a fusion weight learning model embedded with a Transformer fusion strategy, and assigning the weights of the fusion model;
Step 4, acquiring a saliency map of the infrared image as a label, and adding the saliency label as region selection for optimizing the fusion network;
Step 5, cascading the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network, and training it with the saliency label and the multi-mode loss.
According to the invention, the encoder in step 1 is structured as follows: it contains one 1 × 1 convolutional layer and four encoding convolution modules ECB10, ECB20, ECB30 and ECB40, each of which contains two 3 × 3 convolutional layers and one max-pooling layer.
The decoder in step 1 is structured as follows: it contains one 1 × 1 convolutional layer and six decoding convolution modules DCB30, DCB20, DCB21, DCB10, DCB11 and DCB12, each of which contains two 3 × 3 convolutional layers.
Specifically, the decoder in step 1 is connected as follows. Transverse dense skip connections are used in the first and second scales, in a channel-concatenation manner: the final fused feature of the second scale is skip-connected to the input of DCB21, the final fused feature of the first scale is skip-connected to the inputs of DCB11 and DCB12, and the output of DCB10 is skip-connected to the input of DCB12. Through these transverse dense skip connections, the deep features of all intermediate layers are used for feature reconstruction, improving the reconstruction capability for multi-scale depth features. In the decoding sub-network, longitudinal dense connections are established across all scales: the final fused feature of the fourth scale is connected to the input of DCB30, the final fused feature of the third scale to the input of DCB20, the final fused feature of the second scale to the input of DCB10, the output of DCB30 to the input of DCB21, the output of DCB20 to the input of DCB11, and the output of DCB21 to the input of DCB12. Through these longitudinal dense up-sampling connections, the features of all scales are used for feature reconstruction, further improving the reconstruction capability for multi-scale depth features.
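For illustration, the ECB and DCB building blocks described above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the patented implementation: the channel widths, the ReLU activations and the simple chaining of scales are assumptions not specified in the text, and the dense skip connections of the decoder are omitted for brevity.

    import torch
    import torch.nn as nn

    class ECB(nn.Module):
        """Encoding convolution block: two 3x3 convolutions plus one max-pooling layer."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
            self.pool = nn.MaxPool2d(2)
        def forward(self, x):
            feat = self.body(x)
            return feat, self.pool(feat)   # feature of this scale, pooled input for the next scale

    class DCB(nn.Module):
        """Decoding convolution block: two 3x3 convolutions."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        def forward(self, x):
            return self.body(x)

    class Encoder(nn.Module):
        """1x1 convolution followed by four ECBs (ECB10/20/30/40), yielding four scales."""
        def __init__(self, in_ch=1, base=16):
            super().__init__()
            self.stem = nn.Conv2d(in_ch, base, 1)
            self.ecbs = nn.ModuleList(ECB(base * 2 ** i, base * 2 ** (i + 1)) for i in range(4))
        def forward(self, x):
            x = self.stem(x)
            feats = []
            for ecb in self.ecbs:
                f, x = ecb(x)
                feats.append(f)            # multi-scale features, one per scale
            return feats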
Further, the loss function L_ED of the encoder-decoder network measures the pixel consistency and structural similarity between the input image and the output image, as shown in equation (1):

L_ED = L_p + β·L_ssim    (1)

where L_p is the pixel consistency loss and L_ssim is the structural similarity loss.

The pixel consistency loss L_p is shown in equation (2). [Equation (2) appears only as an image in the source document.]

The structural similarity loss L_ssim is shown in equation (3):

L_ssim = 1 - ssim(O, I)    (3)
wherein, O is a network output image, and I is an input image.
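A minimal sketch of this loss in PyTorch is given below, assuming that the pixel consistency loss of equation (2), which is only an image in the source, takes the form of a mean-squared pixel difference, and that ssim() comes from an existing implementation such as the third-party pytorch-msssim package.

    import torch.nn.functional as F
    from pytorch_msssim import ssim   # any SSIM implementation would do

    def encoder_decoder_loss(output, target, beta=1.0):
        l_p = F.mse_loss(output, target)                     # pixel consistency, assumed form of eq. (2)
        l_ssim = 1.0 - ssim(output, target, data_range=1.0)  # structural similarity loss, eq. (3)
        return l_p + beta * l_ssim                           # L_ED of eq. (1)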
Further, measuring the multi-mode features with entropy, gradient and saliency in step 2 comprises the following steps (a short sketch of these measurements is given after the list):
Step 2.1, calculate the entropy of the features output by the encoder and compare the entropy values of the features at all scales; the feature with the highest entropy contains the most content and detail and is classified as the content feature;
Step 2.2, calculate the gradient of the encoder input image with a Sobel operator, down-sample the gradient, subtract it from each feature and compute the average; the feature with the smallest average contains the most structural information such as contours and edges and is classified as the edge-structure feature;
Step 2.3, compute a saliency map of the encoder input image with a saliency extraction algorithm, down-sample the saliency map, subtract it from each feature and compute the average; the feature with the smallest average best separates the foreground target from the background and is classified as the patch feature.
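The three measurements can be sketched in Python/NumPy as follows. The histogram bin count, the use of a mean absolute difference, and the bilinear resizing are illustrative assumptions; the text only names entropy, the Sobel gradient and saliency as the measures.

    import numpy as np
    from scipy import ndimage

    def feature_entropy(feat, bins=256):
        """Shannon entropy of one feature map (step 2.1)."""
        hist, _ = np.histogram(feat, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def sobel_gradient(img):
        """Gradient magnitude of the encoder input image (step 2.2)."""
        gx = ndimage.sobel(img, axis=1)
        gy = ndimage.sobel(img, axis=0)
        return np.hypot(gx, gy)

    def mean_abs_diff(reference, feat):
        """Down-sample `reference` (gradient or saliency map) to the feature's size
        and return the average absolute difference (steps 2.2 and 2.3)."""
        zoom = (feat.shape[0] / reference.shape[0], feat.shape[1] / reference.shape[1])
        ref_small = ndimage.zoom(reference, zoom, order=1)
        return np.mean(np.abs(ref_small - feat))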
further, the multi-mode adaptive loss function in step 2 includes a content loss, a correlation loss and a class saliency loss, as shown in formula (4):
L_fea = L_con + λ·L_corr + ρ·L_sal-l    (4)

where L_con is the content loss, L_corr is the correlation loss, L_sal-l is the class-saliency loss, and λ and ρ are hyper-parameters that balance the weights of the three losses.

The content loss L_con enhances the fusion of content features, as shown in equation (5). [Equation (5) appears only as an image in the source document.] Here w_ir and w_vi are adaptive weights; w_ir is defined by a formula that likewise appears only as an image, and w_vi = 1 - w_ir.

The correlation loss L_corr enhances the fusion of edge-structure features, as shown in equation (6). [Equation (6) appears only as an image in the source document.] Here cov(·) is the covariance function and σ is the standard deviation function.

The class-saliency loss L_sal-l enhances the fusion of patch features, as shown in equation (7). [Equation (7) appears only as an image in the source document.] Here Φ_ir is the infrared feature, Φ_vi is the visible light feature, Φ_f is the fused infrared-and-visible feature, and M_ir and M_vi are masks that remove noise from the features, as defined in equations (8) and (9), which also appear only as images in the source document,
where θ is a constant.
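A rough sketch of this adaptive loss is given below. Because equations (5) and (7)-(9) appear only as images in the source, the content term, the threshold masks and the weighting scheme are assumptions; only the use of covariance and standard deviation in the correlation loss is stated explicitly, and corr() below is the usual covariance divided by the product of standard deviations. All function names are illustrative.

    import torch

    def corr(a, b, eps=1e-8):
        """Correlation coefficient cov(a, b) / (sigma_a * sigma_b)."""
        a, b = a.flatten(), b.flatten()
        a, b = a - a.mean(), b - b.mean()
        return (a * b).mean() / (a.std() * b.std() + eps)

    def multimode_adaptive_loss(phi_ir, phi_vi, phi_f, lam=1.0, rho=1.0, theta=0.5):
        # Content term (assumed form): weighted distance of the fused feature to both sources.
        w_ir = phi_ir.abs().mean() / (phi_ir.abs().mean() + phi_vi.abs().mean())
        w_vi = 1.0 - w_ir
        l_con = w_ir * (phi_f - phi_ir).pow(2).mean() + w_vi * (phi_f - phi_vi).pow(2).mean()
        # Correlation term: keep the fused feature correlated with both source features.
        l_corr = (1.0 - corr(phi_f, phi_ir)) + (1.0 - corr(phi_f, phi_vi))
        # Class-saliency term (assumed form): masks suppress low-activation, noisy responses.
        m_ir = (phi_ir.abs() > theta).float()
        m_vi = (phi_vi.abs() > theta).float()
        l_sal_l = (m_ir * (phi_f - phi_ir)).pow(2).mean() + (m_vi * (phi_f - phi_vi)).pow(2).mean()
        return l_con + lam * l_corr + rho * l_sal_l   # equation (4)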
Further, the fusion network in step 3 is structured as follows: it comprises four Transformer modules, each consisting of two 1 × 1 convolutional layers and one Focal Transformer module; the first convolutional layer adjusts the feature channels, the Focal Transformer module combines local and global information to fuse the features, and the second convolutional layer increases the nonlinearity of the network.
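One such fusion module can be sketched as follows. A standard multi-head self-attention layer stands in for the Focal Transformer module, which in the actual design combines local (focal) and global attention windows; the embedding dimension, head count and class name are assumptions.

    import torch
    import torch.nn as nn

    class FusionBlock(nn.Module):
        """1x1 conv -> attention over all spatial positions -> 1x1 conv."""
        def __init__(self, in_ch, dim=64, heads=4):
            super().__init__()
            self.reduce = nn.Conv2d(in_ch, dim, 1)                 # 1st conv: adjust feature channels
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.expand = nn.Conv2d(dim, in_ch, 1)                 # 2nd conv: restore channels
            self.act = nn.GELU()                                   # extra nonlinearity
        def forward(self, x):                                      # x: concatenated IR + visible features
            b, _, h, w = x.shape
            t = self.reduce(x)
            tokens = t.flatten(2).transpose(1, 2)                  # (B, H*W, dim)
            fused, _ = self.attn(tokens, tokens, tokens)           # global self-attention (stand-in)
            fused = fused.transpose(1, 2).reshape(b, -1, h, w)
            return self.act(self.expand(fused + t))                # residual connection + nonlinearity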
Preferably, adding the saliency label in step 4 comprises the following steps (a sketch of the saliency computation is given after the steps):
Step 4.1, detect the input infrared image with the LC saliency extraction algorithm to obtain a saliency map M_sal;
Step 4.2, normalize the saliency map to obtain the normalized saliency map (its symbol appears only as an image in the source document);
Step 4.3, for the normalized saliency map, design the saliency loss, as shown in equation (10). [Equation (10) appears only as an image in the source document.] Here F is the fused image, and I_ir and I_vi are the infrared and visible light images, respectively.
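The LC (luminance contrast) saliency computation of steps 4.1 and 4.2 can be sketched as follows, assuming an 8-bit grayscale infrared image: each pixel's saliency is its summed absolute intensity difference to all other pixels, computed efficiently through the global intensity histogram and then normalized to [0, 1].

    import numpy as np

    def lc_saliency(gray):
        """LC saliency map of an 8-bit grayscale image, normalized to [0, 1]."""
        gray = gray.astype(np.int64)
        hist = np.bincount(gray.ravel(), minlength=256)
        levels = np.arange(256)
        # dist[v] = sum over all pixels p of |v - I(p)|, computed via the histogram
        dist = np.array([(np.abs(levels - v) * hist).sum() for v in range(256)])
        sal = dist[gray].astype(np.float64)
        sal -= sal.min()
        if sal.max() > 0:
            sal /= sal.max()               # normalized saliency map of step 4.2
        return sal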
Furthermore, in step 5 the encoder, the fusion module and the decoder are cascaded into a complete image fusion network as follows:
The trained encoder E extracts the multi-mode features Φ_ir and Φ_vi of the infrared and visible light images; the infrared feature Φ_ir and the visible light feature Φ_vi are concatenated along the channel dimension and fed into the feature fusion network F, which generates the fused feature Φ_f; the trained decoder D then decodes Φ_f to generate the fused image I_f. The entire fusion process can be formulated as equation (11):

I_f = D(F(E(I_ir), E(I_vi)))    (11)

where I_ir and I_vi denote the infrared image and the visible light image, respectively; E(·) denotes the encoder function, F(·) the feature fusion network function, and D(·) the decoder function.
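An end-to-end inference sketch of equation (11) follows. Here encoder, fusion_net and decoder are assumed to be the trained modules of steps 1-3, the encoder is assumed to return a list of per-scale features, and the fusion network is applied to each scale after channel concatenation; only the call order is taken from the text.

    import torch

    @torch.no_grad()
    def fuse(encoder, fusion_net, decoder, img_ir, img_vi):
        phi_ir = encoder(img_ir)                                 # multi-scale infrared features
        phi_vi = encoder(img_vi)                                 # multi-scale visible light features
        fused = [fusion_net(torch.cat([fi, fv], dim=1))          # channel-wise concatenation per scale
                 for fi, fv in zip(phi_ir, phi_vi)]
        return decoder(fused)                                    # fused image I_f = D(F(E(I_ir), E(I_vi)))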
Further, the training process of the fusion model in step 5 is as follows: taking the infrared image, the visible light image and the saliency map of the infrared image as input, the total loss function is shown in equation (12):

L = L_ssim + α·L_fea + β·L_sal    (12)

where L_ssim is the structural similarity loss, calculated as L_ssim = 1 - ssim(I_f, I_vi); L_fea is the multi-mode adaptive loss; L_sal is the saliency loss; and α and β are hyper-parameters that balance the weights of the three loss terms.
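A hedged sketch of one way to assemble the total loss of equation (12) is shown below. multimode_adaptive_loss() refers to the earlier sketch, ssim() to the pytorch-msssim package, and the form of the saliency loss L_sal, whose equation (10) is only an image in the source, is an assumption (a saliency-weighted pixel loss that pulls salient regions toward the infrared image and the rest toward the visible image).

    from pytorch_msssim import ssim

    def total_loss(i_f, i_ir, i_vi, phi_ir, phi_vi, phi_f, sal_map, alpha=1.0, beta=1.0):
        l_ssim = 1.0 - ssim(i_f, i_vi, data_range=1.0)            # structural term against the visible image
        l_fea = multimode_adaptive_loss(phi_ir, phi_vi, phi_f)    # equation (4), sketched earlier
        l_sal = (sal_map * (i_f - i_ir).pow(2)                    # assumed form of the saliency loss
                 + (1.0 - sal_map) * (i_f - i_vi).pow(2)).mean()
        return l_ssim + alpha * l_fea + beta * l_sal              # L of equation (12)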
Compared with the prior art, the infrared and visible light image fusion method based on multi-mode features brings the following technical innovations:
(1) a new infrared and visible light image fusion network is provided, in which a Focal Transformer module fuses the extracted image features, taking both local and global information into account and giving better fusion performance;
(2) for the multi-mode features extracted by the encoder, a multi-mode adaptive loss is designed to optimize the model, strengthening information transfer during fusion and effectively avoiding the weakening of different feature information seen in existing fusion methods;
(3) saliency information is added to the fusion, and the optimized model adaptively increases the weights of the thermal target in the infrared image and of texture details in the visible light image, finally producing a fusion result with a good visual effect.
Drawings
FIG. 1 is a block diagram of a multi-mode feature encoder-decoder network;
FIG. 2 is a diagram of a network architecture during encoder-decoder training;
FIG. 3 shows the structure of the convolution block ECB in the encoder;
FIG. 4 shows the structure of the convolution block DCB in the decoder;
FIG. 5 shows the results of the first embodiment, in which (a) is the infrared image to be fused; (b) is the visible light image to be fused; (c) is the fused image based on the Laplacian Pyramid (LP); (d) is the fused image based on the discrete wavelet transform (DWT); (e) is the fused image based on the curvelet transform (CVT); (f) is the fused image of FusionGAN; (g) is the fused image of DenseFuse; (h) is the fused image of the method of the invention.
FIG. 6 shows the results of the second embodiment, in which (a) is the infrared image to be fused; (b) is the visible light image to be fused; (c) is the fused image based on the Laplacian Pyramid (LP); (d) is the fused image based on the discrete wavelet transform (DWT); (e) is the fused image based on the curvelet transform (CVT); (f) is the fused image of FusionGAN; (g) is the fused image of DenseFuse; (h) is the fused image of the method of the invention.
The invention is described in further detail below with reference to the figures and examples.
Detailed Description
The embodiment provides an infrared and visible light image fusion method based on multi-mode features, which comprises the following steps:
Step 1, construct a feature extraction and image reconstruction network, and, guided by a loss function, optimally generate a multi-mode feature encoder-decoder network based on a multi-scale convolution network; the structure of the encoder-decoder network is shown in FIG. 2. The encoder contains one 1 × 1 convolutional layer and four encoding convolution modules ECB10, ECB20, ECB30 and ECB40, each of which contains two 3 × 3 convolutional layers and one max-pooling layer; the ECB structure is shown in FIG. 3. The decoder contains one 1 × 1 convolutional layer and six decoding convolution modules DCB30, DCB20, DCB21, DCB10, DCB11 and DCB12, each of which contains two 3 × 3 convolutional layers; the DCB structure is shown in FIG. 4. The decoder is connected as follows. Transverse dense skip connections are used in the first and second scales, in a channel-concatenation manner: the final fused feature of the second scale is skip-connected to the input of DCB21, the final fused feature of the first scale is skip-connected to the inputs of DCB11 and DCB12, and the output of DCB10 is skip-connected to the input of DCB12. Through these transverse dense skip connections, the deep features of all intermediate layers are used for feature reconstruction, improving the reconstruction capability for multi-scale depth features. In the decoding sub-network, longitudinal dense connections are established across all scales: the final fused feature of the fourth scale is connected to the input of DCB30, the final fused feature of the third scale to the input of DCB20, the final fused feature of the second scale to the input of DCB10, the output of DCB30 to the input of DCB21, the output of DCB20 to the input of DCB11, and the output of DCB21 to the input of DCB12. Through these longitudinal dense up-sampling connections, the features of all scales are used for feature reconstruction, further improving the reconstruction capability for multi-scale depth features.
The loss function L_ED of the encoder-decoder network measures the pixel consistency and structural similarity between the input image and the output image, as shown in equation (1):

L_ED = L_p + β·L_ssim    (1)

where L_p is the pixel consistency loss and L_ssim is the structural similarity loss.

The pixel consistency loss L_p is shown in equation (2). [Equation (2) appears only as an image in the source document.]

The structural similarity loss L_ssim is shown in equation (3):

L_ssim = 1 - ssim(O, I)    (3)
wherein, O is a network output image, and I is an input image.
In the encoder-decoder training phase, the four features Φ1, Φ2, Φ3 and Φ4 output by the encoder are fed directly into the decoder. The network is trained on the MS-COCO dataset: 80,000 images are selected, converted to grayscale, and resized to 256 × 256 as the network input, with L_ED as the loss function. The network parameters are frozen after training is complete.
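A minimal data-pipeline sketch for this training phase is shown below: images are converted to grayscale and resized to 256 × 256. The directory path, batch size, JPEG-only globbing and class name are hypothetical; only the dataset, the grayscale conversion and the 256 × 256 size come from the text.

    import glob
    from PIL import Image
    from torch.utils.data import Dataset, DataLoader
    from torchvision import transforms

    class GrayImageFolder(Dataset):
        """Loads every image under `root`, converts it to grayscale, resizes it to 256x256."""
        def __init__(self, root):
            self.paths = sorted(glob.glob(f"{root}/*.jpg"))
            self.tf = transforms.Compose([
                transforms.Grayscale(num_output_channels=1),
                transforms.Resize((256, 256)),
                transforms.ToTensor()])
        def __len__(self):
            return len(self.paths)
        def __getitem__(self, idx):
            return self.tf(Image.open(self.paths[idx]).convert("RGB"))

    loader = DataLoader(GrayImageFolder("path/to/mscoco/train2017"),   # hypothetical path
                        batch_size=16, shuffle=True)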
Step 2, extract the infrared and visible light multi-mode features through the encoder-decoder network, measure the multi-mode features with entropy, gradient and saliency, and design the multi-mode adaptive loss. The multi-mode features are measured as follows:
Step 2.1, calculate the entropy of the features output by the encoder and compare the entropy values of the features at all scales; the feature with the highest entropy contains the most content and detail and is classified as the content feature;
Step 2.2, calculate the gradient of the encoder input image with a Sobel operator, down-sample the gradient, subtract it from each feature and compute the average; the feature with the smallest average contains the most structural information such as contours and edges and is classified as the edge-structure feature;
Step 2.3, compute a saliency map of the encoder input image with a saliency extraction algorithm, down-sample the saliency map, subtract it from each feature and compute the average; the feature with the smallest average best separates the foreground target from the background and is classified as the patch feature.
the multi-mode adaptive loss function comprises content loss, correlation loss and class saliency loss, and is shown as a formula (4):
Lfea=Lcon+λLcorr+ρLsil-l (4)
wherein L isconIs content loss, LcorrIs the correlation loss, Lsil-lIs class significance loss, and lambda and rho are hyper-parameters and are used for balancing the weight of the three loss;
content lossLconEnhancing the fusion of features as shown in equation (5):
Figure BDA0003544410180000101
wherein wirAnd wviIn order to adapt the weights adaptively to each other,
Figure BDA0003544410180000102
wvi=1-wir
correlation lossLcorrEnhancing the fusion of the structural features of the edge, as shown in formula (6):
Figure BDA0003544410180000103
wherein cov (·) is a covariance function, and σ is a standard deviation function.
Class significance lossLsal-lEnhancing the fusion of plaque features, as shown in equation (7):
Figure BDA0003544410180000104
wherein phiirCharacteristic of infrared rays, [ phi ]viCharacteristic of visible light,. phifFor fused infrared and visible features, MirAnd MviTo remove Mask of noise in the feature, as shown in equations (8) and (9):
Figure BDA0003544410180000111
Figure BDA0003544410180000112
where θ is a constant.
Step 3, construct a fusion weight learning model embedded with a Transformer fusion strategy, and assign the weights of the fusion model.
The fusion network is structured as follows: it comprises four Transformer modules, each consisting of two 1 × 1 convolutional layers and one Focal Transformer module; the first convolutional layer adjusts the feature channels, the Focal Transformer module combines local and global information to fuse the features, and the second convolutional layer increases the nonlinearity of the network.
Step 4, acquire a saliency map of the infrared image as a label; adding the saliency label as region selection for optimizing the fusion network comprises the following steps:
Step 4.1, detect the input infrared image with the LC saliency extraction algorithm to obtain a saliency map M_sal;
Step 4.2, normalize the saliency map to obtain the normalized saliency map (its symbol appears only as an image in the source document);
Step 4.3, for the normalized saliency map, design the saliency loss, as shown in equation (10). [Equation (10) appears only as an image in the source document.] Here F is the fused image, and I_ir and I_vi are the infrared and visible light images, respectively.
Step 5, cascade the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network, and train it with the saliency label and the multi-mode loss. The trained encoder E extracts the multi-mode features Φ_ir and Φ_vi of the infrared and visible light images; the infrared feature Φ_ir and the visible light feature Φ_vi are concatenated along the channel dimension and fed into the feature fusion network F, which generates the fused feature Φ_f; the trained decoder D then decodes Φ_f to generate the fused image I_f. The entire fusion process can be formulated as equation (11):

I_f = D(F(E(I_ir), E(I_vi)))    (11)

where I_ir and I_vi denote the infrared image and the visible light image, respectively; E(·) denotes the encoder function, F(·) the feature fusion network function, and D(·) the decoder function.
Further, the training process of the fusion model in step 5 is as follows: taking the infrared image, the visible light image and the saliency map of the infrared image as input, the total loss function is shown in equation (12):

L = L_ssim + α·L_fea + β·L_sal    (12)

where L_ssim is the structural similarity loss, calculated as L_ssim = 1 - ssim(I_f, I_vi); L_fea is the multi-mode adaptive loss; L_sal is the saliency loss; and α and β are hyper-parameters that balance the weights of the three loss terms.
In the fusion network training phase, the network is trained on the FLIR dataset: 12,000 images are selected, converted to grayscale, and resized to 256 × 256 as the network input, with the total loss of equation (12) as the loss function.
One specific embodiment is:
two sets of examples were chosen for the TNO dataset. The first set of data includes objects such as human bodies, bushes, houses, doors and windows, and the intensity difference of the same object in infrared and visible light is large. The second group of data comprises objects such as vehicles, houses, clouds and the like, and texture details in the infrared image are richer.
First group of embodiments
As shown in FIG. 5, the input images of the first group of embodiments contain human bodies, grass, houses, doors and windows, among others; the infrared image has higher salient contrast, and the visible light image has richer texture details. Analysis of FIG. 5(c) through FIG. 5(h) shows the following. The LP method produces unnatural background brightness. Both the DWT and CVT methods suffer from serious artifacts and poor fusion. In the FusionGAN result the human target is blurred and the texture details of the bushes are also lost. The DenseFuse method fuses well but is slightly dim overall. Compared with the other methods, the infrared and visible light image fusion method based on multi-mode features highlights the target information of the infrared image while preserving more texture and detail information from the visible light image, without introducing artifacts. Because the multi-mode-feature-based fusion method has a dedicated adaptive fusion strategy for each kind of feature, more information from the infrared and visible light images is retained in the fused image.
Second group of embodiments
As shown in FIG. 6, the second set of input images contains objects such as vehicles, houses and clouds, and the texture details in the infrared image are richer. Analysis of FIG. 6(c) through FIG. 6(h) shows the following. All of the fusion algorithms fuse the infrared and visible light images reasonably well overall. The image obtained by the LP method has low brightness and is dull overall. Both the DWT and CVT methods produce fused images with artifacts. The background of FusionGAN is more blurred than the source image, and most of the background texture information is lost. DenseFuse fuses well but is brighter than the source image. The proposed method fuses the feature information of the infrared and visible light images well, fully retaining visible light textures such as vehicle windows and infrared textures such as clouds and floor tiles in the fused image, with a good visual effect.
To further verify the feasibility and effectiveness of the invention, 21 pairs of infrared and visible light images were selected for fusion tests and quantitative comparison with the five other methods. Quantitative evaluation means objectively assessing fusion performance with statistical indexes; this embodiment uses six evaluation indexes widely used in the image fusion field: standard deviation (SD), entropy (EN), spatial frequency (SF), mutual information (MI), structural similarity (SSIM) and root mean square error (RMSE). SD reflects the contrast of the fused image; a large SD indicates good contrast. EN measures the amount of information in the fused image; the larger the EN, the more information the fused image contains. SF measures the overall detail richness of the fused image; the larger the SF, the richer the texture. MI measures how much information from the source images is contained in the fused image; a larger MI means the fused image contains more source information. SSIM expresses the structural correlation between the fused image and the source images; the larger the SSIM, the more similar they are and the smaller the distortion. RMSE measures the error between the fused image and the source images; a smaller RMSE indicates better fusion performance, meaning the fused image is closer to the source images and the fusion error is smallest.
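For reference, these indexes can be sketched in NumPy as follows. The histogram bin counts are assumptions, and SSIM is omitted here because it is normally taken from an existing implementation (for example, scikit-image's structural_similarity).

    import numpy as np

    def std_dev(img):                          # SD: contrast of the fused image
        return img.std()

    def entropy(img, bins=256):                # EN: amount of information
        hist, _ = np.histogram(img, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def spatial_frequency(img):                # SF: overall detail richness
        rf = np.diff(img, axis=1) ** 2
        cf = np.diff(img, axis=0) ** 2
        return np.sqrt(rf.mean() + cf.mean())

    def mutual_information(a, b, bins=256):    # MI: information shared with a source image
        joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
        pxy = joint / joint.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        nz = pxy > 0
        return np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz]))

    def rmse(fused, source):                   # RMSE: error with respect to a source image
        return np.sqrt(np.mean((fused - source) ** 2))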
Table 1 shows the objective evaluation indexes of the experimental results for the 21 selected pairs of infrared and visible light images under the different fusion methods, where bold and underlined values denote the best and second-best values of each index, respectively. The data in Table 1 show that the infrared and visible light image fusion method based on multi-mode features proposed in this embodiment achieves the best scores on the SD, EN, MI and RMSE indexes, and is second only to DWT and DenseFuse on SF and SSIM, respectively. The results show that, during fusion, the proposed method transfers the most information from the source images to the fused image and preserves edges better. The fused image has the highest contrast, contains the most information, retains more of the global structure and edge features of the source images, and has a better visual effect.
Table 1: Objective evaluation indexes of the infrared and visible light image fusion results
[Table 1 appears only as an image in the source document.]

Claims (10)

1. An infrared and visible light image fusion method based on multi-mode features, characterized by comprising the following steps:
step 1, constructing a feature extraction and image reconstruction network, and, guided by a loss function, optimally generating a multi-mode feature encoder-decoder network based on a multi-scale convolution network;
step 2, extracting infrared and visible light multi-mode features through the encoder-decoder network, measuring the multi-mode features with entropy, gradient and saliency, and designing a multi-mode adaptive loss;
step 3, constructing a fusion weight learning model embedded with a Transformer fusion strategy, and assigning the weights of the fusion model;
step 4, acquiring a saliency map of the infrared image as a label, and adding the saliency label as region selection for optimizing the fusion network;
step 5, cascading the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network, and training it with the saliency label and the multi-mode loss.
2. The method of claim 1, wherein the structure of the encoder in step 1 comprises one 1 × 1 convolutional layer and four encoding convolution modules ECB10, ECB20, ECB30 and ECB40, each encoding convolution module containing two 3 × 3 convolutional layers and one max-pooling layer;
the structure of the decoder in step 1 comprises one 1 × 1 convolutional layer and six decoding convolution modules DCB30, DCB20, DCB21, DCB10, DCB11 and DCB12, each decoding convolution module containing two 3 × 3 convolutional layers.
3. The method of claim 1, wherein the decoder network in step 1 is connected as follows: transverse dense skip connections are used in the first and second scales, in a channel-concatenation manner; the final fused feature of the second scale is skip-connected to the input of DCB21, the final fused feature of the first scale is skip-connected to the inputs of DCB11 and DCB12, and the output of DCB10 is skip-connected to the input of DCB12; through these transverse dense skip connections, the deep features of all intermediate layers are used for feature reconstruction, improving the reconstruction capability for multi-scale depth features; in the decoding sub-network, longitudinal dense connections are established across all scales: the final fused feature of the fourth scale is connected to the input of DCB30, the final fused feature of the third scale to the input of DCB20, the final fused feature of the second scale to the input of DCB10, the output of DCB30 to the input of DCB21, the output of DCB20 to the input of DCB11, and the output of DCB21 to the input of DCB12; through these longitudinal dense up-sampling connections, the features of all scales are used for feature reconstruction, further improving the reconstruction capability for multi-scale depth features.
4. The method of claim 1, wherein the loss function L_ED of the encoder-decoder network measures the pixel consistency and structural similarity between the input image and the output image, as shown in equation (1):

L_ED = L_p + β·L_ssim    (1)

where L_p is the pixel consistency loss and L_ssim is the structural similarity loss;

the pixel consistency loss L_p is shown in equation (2) [equation (2) appears only as an image in the source document];

the structural similarity loss L_ssim is shown in equation (3):

L_ssim = 1 - ssim(O, I)    (3)
wherein O is a network output image and I is an input image.
5. The method of claim 1, wherein measuring the multi-modal features with entropy, gradient and saliency in step 2 comprises the following steps:
step 2.1, calculating the entropy of the features output by the encoder and comparing the entropy values of the features at all scales, wherein the feature with the highest entropy contains the most content and detail and is classified as the content feature;
step 2.2, calculating the gradient of the encoder input image with a Sobel operator, down-sampling the gradient, subtracting it from each feature and computing the average, wherein the feature with the smallest average contains the most structural information such as contours and edges and is classified as the edge-structure feature;
step 2.3, computing a saliency map of the encoder input image with a saliency extraction algorithm, down-sampling the saliency map, subtracting it from each feature and computing the average, wherein the feature with the smallest average best separates the foreground target from the background, so as to obtain the patch feature.
6. The method of claim 1, wherein the multi-mode adaptive loss function in step 2 comprises a content loss, a correlation loss and a class-saliency loss, as shown in equation (4):

L_fea = L_con + λ·L_corr + ρ·L_sal-l    (4)

where L_con is the content loss, L_corr is the correlation loss, L_sal-l is the class-saliency loss, and λ and ρ are hyper-parameters that balance the weights of the three losses;

the content loss L_con enhances the fusion of content features, as shown in equation (5) [equation (5) appears only as an image in the source document], where w_ir and w_vi are adaptive weights, w_ir being defined by a formula that likewise appears only as an image, and w_vi = 1 - w_ir;

the correlation loss L_corr enhances the fusion of edge-structure features, as shown in equation (6) [equation (6) appears only as an image in the source document], where cov(·) is the covariance function and σ is the standard deviation function;

the class-saliency loss L_sal-l enhances the fusion of patch features, as shown in equation (7) [equation (7) appears only as an image in the source document], where Φ_ir is the infrared feature, Φ_vi is the visible light feature, Φ_f is the fused infrared-and-visible feature, and M_ir and M_vi are masks that remove noise from the features, as defined in equations (8) and (9), which also appear only as images in the source document,
where θ is a constant.
7. The method of claim 1, wherein the fusion network of step 3 is structured as follows: it comprises four Transformer modules, each consisting of two 1 × 1 convolutional layers and one Focal Transformer module; the first convolutional layer adjusts the feature channels, the Focal Transformer module combines local and global information to fuse the features, and the second convolutional layer increases the nonlinearity of the network.
8. The method of claim 1, wherein adding the saliency label in step 4 comprises the following steps:
step 4.1, detecting the input infrared image with the LC saliency extraction algorithm to obtain a saliency map M_sal;
step 4.2, normalizing the saliency map to obtain the normalized saliency map (its symbol appears only as an image in the source document);
step 4.3, for the normalized saliency map, designing the saliency loss, as shown in equation (10) [equation (10) appears only as an image in the source document], where F is the fused image, and I_ir and I_vi are the infrared and visible light images, respectively.
9. The method of claim 1, wherein cascading the encoder, the fusion module and the decoder into a complete image fusion network in step 5 is expressed as follows:
the trained encoder E extracts the multi-mode features Φ_ir and Φ_vi of the infrared and visible light images; the infrared feature Φ_ir and the visible light feature Φ_vi are concatenated along the channel dimension and fed into the feature fusion network F, which generates the fused feature Φ_f; the trained decoder D then decodes Φ_f to generate the fused image I_f; the entire fusion process can be formulated as equation (11):

I_f = D(F(E(I_ir), E(I_vi)))    (11)

where I_ir and I_vi denote the infrared image and the visible light image, respectively; E(·) denotes the encoder function, F(·) the feature fusion network function, and D(·) the decoder function.
10. The method of claim 1, wherein the training process of the fusion model in step 5 is: taking the infrared image, the visible light image and the saliency map of the infrared image as input, the total loss function is shown in equation (12):

L = L_ssim + α·L_fea + β·L_sal    (12)

where L_ssim is the structural similarity loss, calculated as L_ssim = 1 - ssim(I_f, I_vi); L_fea is the multi-mode adaptive loss; L_sal is the saliency loss; and α and β are hyper-parameters that balance the weights of the three losses.
CN202210244332.9A 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics Pending CN114639002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210244332.9A CN114639002A (en) 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210244332.9A CN114639002A (en) 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics

Publications (1)

Publication Number Publication Date
CN114639002A true CN114639002A (en) 2022-06-17

Family

ID=81947742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210244332.9A Pending CN114639002A (en) 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics

Country Status (1)

Country Link
CN (1) CN114639002A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205179A (en) * 2022-07-15 2022-10-18 小米汽车科技有限公司 Image fusion method and device, vehicle and storage medium
CN116823686A (en) * 2023-04-28 2023-09-29 长春理工大学重庆研究院 Night infrared and visible light image fusion method based on image enhancement
CN116823686B (en) * 2023-04-28 2024-03-08 长春理工大学重庆研究院 Night infrared and visible light image fusion method based on image enhancement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination