CN114639002A - Infrared and visible light image fusion method based on multi-mode characteristics - Google Patents

Infrared and visible light image fusion method based on multi-mode characteristics

Info

Publication number
CN114639002A
CN114639002A (application CN202210244332.9A)
Authority
CN
China
Prior art keywords
fusion
image
infrared
feature
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210244332.9A
Other languages
Chinese (zh)
Inventor
刘向增
高豪杰
苗启广
宋建锋
纪建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210244332.9A priority Critical patent/CN114639002A/en
Publication of CN114639002A publication Critical patent/CN114639002A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared and visible light image fusion method based on multi-mode features, which comprises the following steps: 1. constructing an encoder-decoder network for extracting multi-mode features; 2. measuring the multi-mode features with entropy, gradient and saliency, and designing an adaptive loss function; 3. constructing a fusion weight learning model embedded with a Transformer fusion strategy; 4. taking the saliency map of the infrared image as a label and adding it for region selection when optimizing the fusion network; 5. cascading the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network and training it. The method uses a Transformer to capture the global relevance of multi-scale features, takes both local and global information into account, and improves the overall visual effect of the fused image; with a multi-mode adaptive fusion strategy it retains the multi-mode feature information of the images and improves the quality of the fused image.

Description

Infrared and visible light image fusion method based on multi-mode characteristics
Technical Field
The application belongs to the field of image fusion, and particularly relates to an infrared and visible light image fusion method based on multi-mode characteristics.
Background
Image fusion refers to combining images obtained by different types of sensors to generate a robust, information-rich image that facilitates subsequent processing or decision making. Complex applications require comprehensive information about a scene in order to understand it fully. A single-mode sensor can perceive only one kind of scene information about a target and cannot perceive the target at multiple granularities. Fusion techniques therefore play an increasingly important role in modern applications and computer vision.
Due to the limitations of physical sensors, the scene information captured by infrared and visible light images differs greatly. A visible light image captures reflected light; such images generally have high spatial resolution, rich color and texture detail, and high contrast, and are well suited to human visual perception, but they are easily affected by illumination: in poor lighting conditions, such as bad weather or at night, their quality drops sharply. An infrared image captures thermal radiation; an image describing the thermal radiation of objects resists interference such as severe weather and insufficient light, but it generally has low spatial resolution and lacks texture and color information. Infrared and visible light image fusion combines an infrared image and a visible light image of the same scene, exploiting the complementarity of the two to generate a fused image that is robust and information-rich. Infrared and visible light image fusion technology is widely applied in fields such as target detection, image enhancement, video surveillance and remote sensing.
Infrared and visible light image fusion methods fall mainly into traditional methods and deep learning methods. Traditional image fusion methods mainly use multi-scale transforms (MST), sparse representation (SR), saliency-based methods, hybrid models, and other approaches. These methods achieve good fusion performance, but problems such as hand-crafted features and high computational complexity remain. Among deep learning based methods, models such as FusionGAN, AttentionFGAN and NestFuse improve on the shortcomings of the traditional methods, but they still have certain limitations. First, deep learning networks usually extract feature maps directly from the preceding convolutional layers and ignore global information, leading to low-quality fusion results. Second, in encoder-decoder models with a fusion strategy, an overly simple fusion strategy can leave image edges unclear. Finally, the design of the loss function also affects network convergence: an improper loss function not only slows convergence but can also cause artifacts and blurred boundaries in the fusion result.
Disclosure of Invention
The invention aims to provide an infrared and visible light image fusion method based on multi-mode features, which addresses the lack of global information in the fusion process, fuses the multi-mode features adaptively, incorporates saliency information, and finally achieves good fusion.
To achieve this task, the invention adopts the following technical scheme:
An infrared and visible light image fusion method based on multi-mode features, comprising the following steps:
Step 1, constructing a feature extraction and image reconstruction network, and, guided by a loss function, optimally generating a multi-mode feature encoder-decoder network based on a multi-scale convolution network;
Step 2, extracting infrared and visible light multi-mode features through the encoder-decoder network, measuring the multi-mode features with entropy, gradient and saliency, and designing a multi-mode adaptive loss;
Step 3, constructing a fusion weight learning model embedded with a Transformer fusion strategy, and assigning the weights of the fusion model;
Step 4, acquiring a saliency map of the infrared image as a label, and adding the saliency label as region selection for optimizing the fusion network;
Step 5, cascading the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network, and training it with the saliency label and the multi-mode loss.
According to the invention, the encoder in step 1 is structured as follows: it contains one 1 × 1 convolutional layer and four encoding convolution modules ECB10, ECB20, ECB30 and ECB40, each of which contains two 3 × 3 convolutional layers and one max-pooling layer.
The decoder in step 1 is structured as follows: it contains one 1 × 1 convolutional layer and six decoding convolution modules DCB30, DCB20, DCB21, DCB10, DCB11 and DCB12, each of which contains two 3 × 3 convolutional layers.
Specifically, the decoder in step 1 is connected as follows. Transverse dense skip connections are used in the first and second scales, in a channel-concatenation manner: the final fused feature of the second scale is skip-connected to the input of DCB21, the final fused feature of the first scale is skip-connected to the inputs of DCB11 and DCB12, and the output of DCB10 is skip-connected to the input of DCB12. Through these transverse dense skip connections, the deep features of all intermediate layers are used for feature reconstruction, improving the reconstruction capability for multi-scale depth features. In the decoding sub-network, longitudinal dense connections are established across all scales: the final fused feature of the fourth scale is connected to the input of DCB30, the final fused feature of the third scale to the input of DCB20, the final fused feature of the second scale to the input of DCB10, the output of DCB30 to the input of DCB21, the output of DCB20 to the input of DCB11, and the output of DCB21 to the input of DCB12. Through these longitudinal dense up-sampling connections, the features of all scales are used for feature reconstruction, further improving the reconstruction capability for multi-scale depth features.
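For illustration, the ECB and DCB building blocks described above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the patented implementation: the channel widths, the ReLU activations and the simple chaining of scales are assumptions not specified in the text, and the dense skip connections of the decoder are omitted for brevity.

    import torch
    import torch.nn as nn

    class ECB(nn.Module):
        """Encoding convolution block: two 3x3 convolutions plus one max-pooling layer."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
            self.pool = nn.MaxPool2d(2)
        def forward(self, x):
            feat = self.body(x)
            return feat, self.pool(feat)   # feature of this scale, pooled input for the next scale

    class DCB(nn.Module):
        """Decoding convolution block: two 3x3 convolutions."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        def forward(self, x):
            return self.body(x)

    class Encoder(nn.Module):
        """1x1 convolution followed by four ECBs (ECB10/20/30/40), yielding four scales."""
        def __init__(self, in_ch=1, base=16):
            super().__init__()
            self.stem = nn.Conv2d(in_ch, base, 1)
            self.ecbs = nn.ModuleList(ECB(base * 2 ** i, base * 2 ** (i + 1)) for i in range(4))
        def forward(self, x):
            x = self.stem(x)
            feats = []
            for ecb in self.ecbs:
                f, x = ecb(x)
                feats.append(f)            # multi-scale features, one per scale
            return feats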
Further, the loss function L_ED of the encoder-decoder network measures the pixel consistency and structural similarity between the input image and the output image, as shown in equation (1):

L_ED = L_p + β·L_ssim    (1)

where L_p is the pixel consistency loss and L_ssim is the structural similarity loss.

The pixel consistency loss L_p is shown in equation (2). [Equation (2) appears only as an image in the source document.]

The structural similarity loss L_ssim is shown in equation (3):

L_ssim = 1 - ssim(O, I)    (3)
wherein, O is a network output image, and I is an input image.
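A minimal sketch of this loss in PyTorch is given below, assuming that the pixel consistency loss of equation (2), which is only an image in the source, takes the form of a mean-squared pixel difference, and that ssim() comes from an existing implementation such as the third-party pytorch-msssim package.

    import torch.nn.functional as F
    from pytorch_msssim import ssim   # any SSIM implementation would do

    def encoder_decoder_loss(output, target, beta=1.0):
        l_p = F.mse_loss(output, target)                     # pixel consistency, assumed form of eq. (2)
        l_ssim = 1.0 - ssim(output, target, data_range=1.0)  # structural similarity loss, eq. (3)
        return l_p + beta * l_ssim                           # L_ED of eq. (1)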
Further, measuring the multi-mode features with entropy, gradient and saliency in step 2 comprises the following steps (a short sketch of these measurements is given after the list):
Step 2.1, calculate the entropy of the features output by the encoder and compare the entropy values of the features at all scales; the feature with the highest entropy contains the most content and detail and is classified as the content feature;
Step 2.2, calculate the gradient of the encoder input image with a Sobel operator, down-sample the gradient, subtract it from each feature and compute the average; the feature with the smallest average contains the most structural information such as contours and edges and is classified as the edge-structure feature;
Step 2.3, compute a saliency map of the encoder input image with a saliency extraction algorithm, down-sample the saliency map, subtract it from each feature and compute the average; the feature with the smallest average best separates the foreground target from the background and is classified as the patch feature.
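The three measurements can be sketched in Python/NumPy as follows. The histogram bin count, the use of a mean absolute difference, and the bilinear resizing are illustrative assumptions; the text only names entropy, the Sobel gradient and saliency as the measures.

    import numpy as np
    from scipy import ndimage

    def feature_entropy(feat, bins=256):
        """Shannon entropy of one feature map (step 2.1)."""
        hist, _ = np.histogram(feat, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def sobel_gradient(img):
        """Gradient magnitude of the encoder input image (step 2.2)."""
        gx = ndimage.sobel(img, axis=1)
        gy = ndimage.sobel(img, axis=0)
        return np.hypot(gx, gy)

    def mean_abs_diff(reference, feat):
        """Down-sample `reference` (gradient or saliency map) to the feature's size
        and return the average absolute difference (steps 2.2 and 2.3)."""
        zoom = (feat.shape[0] / reference.shape[0], feat.shape[1] / reference.shape[1])
        ref_small = ndimage.zoom(reference, zoom, order=1)
        return np.mean(np.abs(ref_small - feat))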
further, the multi-mode adaptive loss function in step 2 includes a content loss, a correlation loss and a class saliency loss, as shown in formula (4):
L_fea = L_con + λ·L_corr + ρ·L_sal-l    (4)

where L_con is the content loss, L_corr is the correlation loss, L_sal-l is the class-saliency loss, and λ and ρ are hyper-parameters that balance the weights of the three losses.

The content loss L_con enhances the fusion of content features, as shown in equation (5). [Equation (5) appears only as an image in the source document.] Here w_ir and w_vi are adaptive weights; w_ir is defined by a formula that likewise appears only as an image, and w_vi = 1 - w_ir.

The correlation loss L_corr enhances the fusion of edge-structure features, as shown in equation (6). [Equation (6) appears only as an image in the source document.] Here cov(·) is the covariance function and σ is the standard deviation function.

The class-saliency loss L_sal-l enhances the fusion of patch features, as shown in equation (7). [Equation (7) appears only as an image in the source document.] Here Φ_ir is the infrared feature, Φ_vi is the visible light feature, Φ_f is the fused infrared-and-visible feature, and M_ir and M_vi are masks that remove noise from the features, as defined in equations (8) and (9), which also appear only as images in the source document,
where θ is a constant.
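A rough sketch of this adaptive loss is given below. Because equations (5) and (7)-(9) appear only as images in the source, the content term, the threshold masks and the weighting scheme are assumptions; only the use of covariance and standard deviation in the correlation loss is stated explicitly, and corr() below is the usual covariance divided by the product of standard deviations. All function names are illustrative.

    import torch

    def corr(a, b, eps=1e-8):
        """Correlation coefficient cov(a, b) / (sigma_a * sigma_b)."""
        a, b = a.flatten(), b.flatten()
        a, b = a - a.mean(), b - b.mean()
        return (a * b).mean() / (a.std() * b.std() + eps)

    def multimode_adaptive_loss(phi_ir, phi_vi, phi_f, lam=1.0, rho=1.0, theta=0.5):
        # Content term (assumed form): weighted distance of the fused feature to both sources.
        w_ir = phi_ir.abs().mean() / (phi_ir.abs().mean() + phi_vi.abs().mean())
        w_vi = 1.0 - w_ir
        l_con = w_ir * (phi_f - phi_ir).pow(2).mean() + w_vi * (phi_f - phi_vi).pow(2).mean()
        # Correlation term: keep the fused feature correlated with both source features.
        l_corr = (1.0 - corr(phi_f, phi_ir)) + (1.0 - corr(phi_f, phi_vi))
        # Class-saliency term (assumed form): masks suppress low-activation, noisy responses.
        m_ir = (phi_ir.abs() > theta).float()
        m_vi = (phi_vi.abs() > theta).float()
        l_sal_l = (m_ir * (phi_f - phi_ir)).pow(2).mean() + (m_vi * (phi_f - phi_vi)).pow(2).mean()
        return l_con + lam * l_corr + rho * l_sal_l   # equation (4)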
Further, the fusion network in step 3 is structured as follows: it comprises four Transformer modules, each consisting of two 1 × 1 convolutional layers and one Focal Transformer module; the first convolutional layer adjusts the feature channels, the Focal Transformer module combines local and global information to fuse the features, and the second convolutional layer increases the nonlinearity of the network.
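One such fusion module can be sketched as follows. A standard multi-head self-attention layer stands in for the Focal Transformer module, which in the actual design combines local (focal) and global attention windows; the embedding dimension, head count and class name are assumptions.

    import torch
    import torch.nn as nn

    class FusionBlock(nn.Module):
        """1x1 conv -> attention over all spatial positions -> 1x1 conv."""
        def __init__(self, in_ch, dim=64, heads=4):
            super().__init__()
            self.reduce = nn.Conv2d(in_ch, dim, 1)                 # 1st conv: adjust feature channels
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.expand = nn.Conv2d(dim, in_ch, 1)                 # 2nd conv: restore channels
            self.act = nn.GELU()                                   # extra nonlinearity
        def forward(self, x):                                      # x: concatenated IR + visible features
            b, _, h, w = x.shape
            t = self.reduce(x)
            tokens = t.flatten(2).transpose(1, 2)                  # (B, H*W, dim)
            fused, _ = self.attn(tokens, tokens, tokens)           # global self-attention (stand-in)
            fused = fused.transpose(1, 2).reshape(b, -1, h, w)
            return self.act(self.expand(fused + t))                # residual connection + nonlinearity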
Preferably, adding the saliency label in step 4 comprises the following steps (a sketch of the saliency computation is given after the steps):
Step 4.1, detect the input infrared image with the LC saliency extraction algorithm to obtain a saliency map M_sal;
Step 4.2, normalize the saliency map to obtain the normalized saliency map (its symbol appears only as an image in the source document);
Step 4.3, for the normalized saliency map, design the saliency loss, as shown in equation (10). [Equation (10) appears only as an image in the source document.] Here F is the fused image, and I_ir and I_vi are the infrared and visible light images, respectively.
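The LC (luminance contrast) saliency computation of steps 4.1 and 4.2 can be sketched as follows, assuming an 8-bit grayscale infrared image: each pixel's saliency is its summed absolute intensity difference to all other pixels, computed efficiently through the global intensity histogram and then normalized to [0, 1].

    import numpy as np

    def lc_saliency(gray):
        """LC saliency map of an 8-bit grayscale image, normalized to [0, 1]."""
        gray = gray.astype(np.int64)
        hist = np.bincount(gray.ravel(), minlength=256)
        levels = np.arange(256)
        # dist[v] = sum over all pixels p of |v - I(p)|, computed via the histogram
        dist = np.array([(np.abs(levels - v) * hist).sum() for v in range(256)])
        sal = dist[gray].astype(np.float64)
        sal -= sal.min()
        if sal.max() > 0:
            sal /= sal.max()               # normalized saliency map of step 4.2
        return sal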
Furthermore, in step 5 the encoder, the fusion module and the decoder are cascaded into a complete image fusion network as follows:
The trained encoder E extracts the multi-mode features Φ_ir and Φ_vi of the infrared and visible light images; the infrared feature Φ_ir and the visible light feature Φ_vi are concatenated along the channel dimension and fed into the feature fusion network F, which generates the fused feature Φ_f; the trained decoder D then decodes Φ_f to generate the fused image I_f. The entire fusion process can be formulated as equation (11):

I_f = D(F(E(I_ir), E(I_vi)))    (11)

where I_ir and I_vi denote the infrared image and the visible light image, respectively; E(·) denotes the encoder function, F(·) the feature fusion network function, and D(·) the decoder function.
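An end-to-end inference sketch of equation (11) follows. Here encoder, fusion_net and decoder are assumed to be the trained modules of steps 1-3, the encoder is assumed to return a list of per-scale features, and the fusion network is applied to each scale after channel concatenation; only the call order is taken from the text.

    import torch

    @torch.no_grad()
    def fuse(encoder, fusion_net, decoder, img_ir, img_vi):
        phi_ir = encoder(img_ir)                                 # multi-scale infrared features
        phi_vi = encoder(img_vi)                                 # multi-scale visible light features
        fused = [fusion_net(torch.cat([fi, fv], dim=1))          # channel-wise concatenation per scale
                 for fi, fv in zip(phi_ir, phi_vi)]
        return decoder(fused)                                    # fused image I_f = D(F(E(I_ir), E(I_vi)))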
Further, the training process of the fusion model in step 5 is as follows: taking the infrared image, the visible light image and the saliency map of the infrared image as input, the total loss function is shown in equation (12):

L = L_ssim + α·L_fea + β·L_sal    (12)

where L_ssim is the structural similarity loss, calculated as L_ssim = 1 - ssim(I_f, I_vi); L_fea is the multi-mode adaptive loss; L_sal is the saliency loss; and α and β are hyper-parameters that balance the weights of the three loss terms.
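A hedged sketch of one way to assemble the total loss of equation (12) is shown below. multimode_adaptive_loss() refers to the earlier sketch, ssim() to the pytorch-msssim package, and the form of the saliency loss L_sal, whose equation (10) is only an image in the source, is an assumption (a saliency-weighted pixel loss that pulls salient regions toward the infrared image and the rest toward the visible image).

    from pytorch_msssim import ssim

    def total_loss(i_f, i_ir, i_vi, phi_ir, phi_vi, phi_f, sal_map, alpha=1.0, beta=1.0):
        l_ssim = 1.0 - ssim(i_f, i_vi, data_range=1.0)            # structural term against the visible image
        l_fea = multimode_adaptive_loss(phi_ir, phi_vi, phi_f)    # equation (4), sketched earlier
        l_sal = (sal_map * (i_f - i_ir).pow(2)                    # assumed form of the saliency loss
                 + (1.0 - sal_map) * (i_f - i_vi).pow(2)).mean()
        return l_ssim + alpha * l_fea + beta * l_sal              # L of equation (12)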
Compared with the prior art, the infrared and visible light image fusion method based on multi-mode features brings the following technical innovations:
(1) a new infrared and visible light image fusion network is provided, in which a Focal Transformer module fuses the extracted image features, taking both local and global information into account and giving better fusion performance;
(2) for the multi-mode features extracted by the encoder, a multi-mode adaptive loss is designed to optimize the model, strengthening information transfer during fusion and effectively avoiding the weakening of different feature information seen in existing fusion methods;
(3) saliency information is added to the fusion, and the optimized model adaptively increases the weights of the thermal target in the infrared image and of texture details in the visible light image, finally producing a fusion result with a good visual effect.
Drawings
FIG. 1 is a block diagram of a multi-mode feature encoder-decoder network;
FIG. 2 is a diagram of a network architecture during encoder-decoder training;
FIG. 3 shows the structure of the convolution block ECB in the encoder;
FIG. 4 shows the structure of the convolution block DCB in the decoder;
FIG. 5 shows the results of the first embodiment, in which (a) is the infrared image to be fused; (b) is the visible light image to be fused; (c) is the fused image based on the Laplacian Pyramid (LP); (d) is the fused image based on the discrete wavelet transform (DWT); (e) is the fused image based on the curvelet transform (CVT); (f) is the fused image of FusionGAN; (g) is the fused image of DenseFuse; (h) is the fused image of the method of the invention.
FIG. 6 shows the results of the second embodiment, in which (a) is the infrared image to be fused; (b) is the visible light image to be fused; (c) is the fused image based on the Laplacian Pyramid (LP); (d) is the fused image based on the discrete wavelet transform (DWT); (e) is the fused image based on the curvelet transform (CVT); (f) is the fused image of FusionGAN; (g) is the fused image of DenseFuse; (h) is the fused image of the method of the invention.
The invention is described in further detail below with reference to the figures and examples.
Detailed Description
The embodiment provides an infrared and visible light image fusion method based on multi-mode features, which comprises the following steps:
Step 1, construct a feature extraction and image reconstruction network, and, guided by a loss function, optimally generate a multi-mode feature encoder-decoder network based on a multi-scale convolution network; the structure of the encoder-decoder network is shown in FIG. 2. The encoder contains one 1 × 1 convolutional layer and four encoding convolution modules ECB10, ECB20, ECB30 and ECB40, each of which contains two 3 × 3 convolutional layers and one max-pooling layer; the ECB structure is shown in FIG. 3. The decoder contains one 1 × 1 convolutional layer and six decoding convolution modules DCB30, DCB20, DCB21, DCB10, DCB11 and DCB12, each of which contains two 3 × 3 convolutional layers; the DCB structure is shown in FIG. 4. The decoder is connected as follows. Transverse dense skip connections are used in the first and second scales, in a channel-concatenation manner: the final fused feature of the second scale is skip-connected to the input of DCB21, the final fused feature of the first scale is skip-connected to the inputs of DCB11 and DCB12, and the output of DCB10 is skip-connected to the input of DCB12. Through these transverse dense skip connections, the deep features of all intermediate layers are used for feature reconstruction, improving the reconstruction capability for multi-scale depth features. In the decoding sub-network, longitudinal dense connections are established across all scales: the final fused feature of the fourth scale is connected to the input of DCB30, the final fused feature of the third scale to the input of DCB20, the final fused feature of the second scale to the input of DCB10, the output of DCB30 to the input of DCB21, the output of DCB20 to the input of DCB11, and the output of DCB21 to the input of DCB12. Through these longitudinal dense up-sampling connections, the features of all scales are used for feature reconstruction, further improving the reconstruction capability for multi-scale depth features.
The loss function L_ED of the encoder-decoder network measures the pixel consistency and structural similarity between the input image and the output image, as shown in equation (1):

L_ED = L_p + β·L_ssim    (1)

where L_p is the pixel consistency loss and L_ssim is the structural similarity loss.

The pixel consistency loss L_p is shown in equation (2). [Equation (2) appears only as an image in the source document.]

The structural similarity loss L_ssim is shown in equation (3):

L_ssim = 1 - ssim(O, I)    (3)
wherein, O is a network output image, and I is an input image.
In the encoder-decoder training phase, the four features Φ1, Φ2, Φ3 and Φ4 output by the encoder are fed directly into the decoder. The network is trained on the MS-COCO dataset: 80,000 images are selected, converted to grayscale, and resized to 256 × 256 as the network input, with L_ED as the loss function. The network parameters are frozen after training is complete.
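A minimal data-pipeline sketch for this training phase is shown below: images are converted to grayscale and resized to 256 × 256. The directory path, batch size, JPEG-only globbing and class name are hypothetical; only the dataset, the grayscale conversion and the 256 × 256 size come from the text.

    import glob
    from PIL import Image
    from torch.utils.data import Dataset, DataLoader
    from torchvision import transforms

    class GrayImageFolder(Dataset):
        """Loads every image under `root`, converts it to grayscale, resizes it to 256x256."""
        def __init__(self, root):
            self.paths = sorted(glob.glob(f"{root}/*.jpg"))
            self.tf = transforms.Compose([
                transforms.Grayscale(num_output_channels=1),
                transforms.Resize((256, 256)),
                transforms.ToTensor()])
        def __len__(self):
            return len(self.paths)
        def __getitem__(self, idx):
            return self.tf(Image.open(self.paths[idx]).convert("RGB"))

    loader = DataLoader(GrayImageFolder("path/to/mscoco/train2017"),   # hypothetical path
                        batch_size=16, shuffle=True)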
Step 2, extract the infrared and visible light multi-mode features through the encoder-decoder network, measure the multi-mode features with entropy, gradient and saliency, and design the multi-mode adaptive loss. The multi-mode features are measured as follows:
Step 2.1, calculate the entropy of the features output by the encoder and compare the entropy values of the features at all scales; the feature with the highest entropy contains the most content and detail and is classified as the content feature;
Step 2.2, calculate the gradient of the encoder input image with a Sobel operator, down-sample the gradient, subtract it from each feature and compute the average; the feature with the smallest average contains the most structural information such as contours and edges and is classified as the edge-structure feature;
Step 2.3, compute a saliency map of the encoder input image with a saliency extraction algorithm, down-sample the saliency map, subtract it from each feature and compute the average; the feature with the smallest average best separates the foreground target from the background and is classified as the patch feature.
the multi-mode adaptive loss function comprises content loss, correlation loss and class saliency loss, and is shown as a formula (4):
Lfea=Lcon+λLcorr+ρLsil-l (4)
wherein L isconIs content loss, LcorrIs the correlation loss, Lsil-lIs class significance loss, and lambda and rho are hyper-parameters and are used for balancing the weight of the three loss;
content lossLconEnhancing the fusion of features as shown in equation (5):
Figure BDA0003544410180000101
wherein wirAnd wviIn order to adapt the weights adaptively to each other,
Figure BDA0003544410180000102
wvi=1-wir
correlation lossLcorrEnhancing the fusion of the structural features of the edge, as shown in formula (6):
Figure BDA0003544410180000103
wherein cov (·) is a covariance function, and σ is a standard deviation function.
Class significance lossLsal-lEnhancing the fusion of plaque features, as shown in equation (7):
Figure BDA0003544410180000104
wherein phiirCharacteristic of infrared rays, [ phi ]viCharacteristic of visible light,. phifFor fused infrared and visible features, MirAnd MviTo remove Mask of noise in the feature, as shown in equations (8) and (9):
Figure BDA0003544410180000111
Figure BDA0003544410180000112
where θ is a constant.
Step 3, construct a fusion weight learning model embedded with a Transformer fusion strategy, and assign the weights of the fusion model.
The fusion network is structured as follows: it comprises four Transformer modules, each consisting of two 1 × 1 convolutional layers and one Focal Transformer module; the first convolutional layer adjusts the feature channels, the Focal Transformer module combines local and global information to fuse the features, and the second convolutional layer increases the nonlinearity of the network.
Step 4, acquire a saliency map of the infrared image as a label; adding the saliency label as region selection for optimizing the fusion network comprises the following steps:
Step 4.1, detect the input infrared image with the LC saliency extraction algorithm to obtain a saliency map M_sal;
Step 4.2, normalize the saliency map to obtain the normalized saliency map (its symbol appears only as an image in the source document);
Step 4.3, for the normalized saliency map, design the saliency loss, as shown in equation (10). [Equation (10) appears only as an image in the source document.] Here F is the fused image, and I_ir and I_vi are the infrared and visible light images, respectively.
Step 5, cascade the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network, and train it with the saliency label and the multi-mode loss. The trained encoder E extracts the multi-mode features Φ_ir and Φ_vi of the infrared and visible light images; the infrared feature Φ_ir and the visible light feature Φ_vi are concatenated along the channel dimension and fed into the feature fusion network F, which generates the fused feature Φ_f; the trained decoder D then decodes Φ_f to generate the fused image I_f. The entire fusion process can be formulated as equation (11):

I_f = D(F(E(I_ir), E(I_vi)))    (11)

where I_ir and I_vi denote the infrared image and the visible light image, respectively; E(·) denotes the encoder function, F(·) the feature fusion network function, and D(·) the decoder function.
Further, the training process of the fusion model in step 5 is as follows: taking the infrared image, the visible light image and the saliency map of the infrared image as input, the total loss function is shown in equation (12):

L = L_ssim + α·L_fea + β·L_sal    (12)

where L_ssim is the structural similarity loss, calculated as L_ssim = 1 - ssim(I_f, I_vi); L_fea is the multi-mode adaptive loss; L_sal is the saliency loss; and α and β are hyper-parameters that balance the weights of the three loss terms.
In the fusion network training phase, the network is trained on the FLIR dataset: 12,000 images are selected, converted to grayscale, and resized to 256 × 256 as the network input, with the total loss of equation (12) as the loss function.
One specific embodiment is:
two sets of examples were chosen for the TNO dataset. The first set of data includes objects such as human bodies, bushes, houses, doors and windows, and the intensity difference of the same object in infrared and visible light is large. The second group of data comprises objects such as vehicles, houses, clouds and the like, and texture details in the infrared image are richer.
First group of embodiments
As shown in FIG. 5, the input images of the first group of embodiments contain human bodies, grass, houses, doors and windows, among others; the infrared image has higher salient contrast, and the visible light image has richer texture details. Analysis of FIG. 5(c) through FIG. 5(h) shows the following. The LP method produces unnatural background brightness. Both the DWT and CVT methods suffer from serious artifacts and poor fusion. In the FusionGAN result the human target is blurred and the texture details of the bushes are also lost. The DenseFuse method fuses well but is slightly dim overall. Compared with the other methods, the infrared and visible light image fusion method based on multi-mode features highlights the target information of the infrared image while preserving more texture and detail information from the visible light image, without introducing artifacts. Because the multi-mode-feature-based fusion method has a dedicated adaptive fusion strategy for each kind of feature, more information from the infrared and visible light images is retained in the fused image.
Second group of embodiments
As shown in FIG. 6, the second set of input images contains objects such as vehicles, houses and clouds, and the texture details in the infrared image are richer. Analysis of FIG. 6(c) through FIG. 6(h) shows the following. All of the fusion algorithms fuse the infrared and visible light images reasonably well overall. The image obtained by the LP method has low brightness and is dull overall. Both the DWT and CVT methods produce fused images with artifacts. The background of FusionGAN is more blurred than the source image, and most of the background texture information is lost. DenseFuse fuses well but is brighter than the source image. The proposed method fuses the feature information of the infrared and visible light images well, fully retaining visible light textures such as vehicle windows and infrared textures such as clouds and floor tiles in the fused image, with a good visual effect.
To further verify the feasibility and effectiveness of the invention, 21 pairs of infrared and visible light images were selected for fusion tests and quantitative comparison with the five other methods. Quantitative evaluation means objectively assessing fusion performance with statistical indexes; this embodiment uses six evaluation indexes widely used in the image fusion field: standard deviation (SD), entropy (EN), spatial frequency (SF), mutual information (MI), structural similarity (SSIM) and root mean square error (RMSE). SD reflects the contrast of the fused image; a large SD indicates good contrast. EN measures the amount of information in the fused image; the larger the EN, the more information the fused image contains. SF measures the overall detail richness of the fused image; the larger the SF, the richer the texture. MI measures how much information from the source images is contained in the fused image; a larger MI means the fused image contains more source information. SSIM expresses the structural correlation between the fused image and the source images; the larger the SSIM, the more similar they are and the smaller the distortion. RMSE measures the error between the fused image and the source images; a smaller RMSE indicates better fusion performance, meaning the fused image is closer to the source images and the fusion error is smallest.
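For reference, these indexes can be sketched in NumPy as follows. The histogram bin counts are assumptions, and SSIM is omitted here because it is normally taken from an existing implementation (for example, scikit-image's structural_similarity).

    import numpy as np

    def std_dev(img):                          # SD: contrast of the fused image
        return img.std()

    def entropy(img, bins=256):                # EN: amount of information
        hist, _ = np.histogram(img, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def spatial_frequency(img):                # SF: overall detail richness
        rf = np.diff(img, axis=1) ** 2
        cf = np.diff(img, axis=0) ** 2
        return np.sqrt(rf.mean() + cf.mean())

    def mutual_information(a, b, bins=256):    # MI: information shared with a source image
        joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
        pxy = joint / joint.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        nz = pxy > 0
        return np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz]))

    def rmse(fused, source):                   # RMSE: error with respect to a source image
        return np.sqrt(np.mean((fused - source) ** 2))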
Table 1 shows the objective evaluation indexes of the experimental results for the 21 selected pairs of infrared and visible light images under the different fusion methods, where bold and underlined values denote the best and second-best values of each index, respectively. The data in Table 1 show that the infrared and visible light image fusion method based on multi-mode features proposed in this embodiment achieves the best scores on the SD, EN, MI and RMSE indexes, and is second only to DWT and DenseFuse on SF and SSIM, respectively. The results show that, during fusion, the proposed method transfers the most information from the source images to the fused image and preserves edges better. The fused image has the highest contrast, contains the most information, retains more of the global structure and edge features of the source images, and has a better visual effect.
Table 1: Objective evaluation indexes of the infrared and visible light image fusion results
[Table 1 appears only as an image in the source document.]

Claims (10)

1. An infrared and visible light image fusion method based on multi-mode features, characterized by comprising the following steps:
step 1, constructing a feature extraction and image reconstruction network, and, guided by a loss function, optimally generating a multi-mode feature encoder-decoder network based on a multi-scale convolution network;
step 2, extracting infrared and visible light multi-mode features through the encoder-decoder network, measuring the multi-mode features with entropy, gradient and saliency, and designing a multi-mode adaptive loss;
step 3, constructing a fusion weight learning model embedded with a Transformer fusion strategy, and assigning the weights of the fusion model;
step 4, acquiring a saliency map of the infrared image as a label, and adding the saliency label as region selection for optimizing the fusion network;
step 5, cascading the fusion weight learning model embedded with the Transformer fusion strategy with the encoder-decoder to construct an infrared and visible light image fusion network, and training it with the saliency label and the multi-mode loss.
2. The method of claim 1, wherein the structure of the encoder in step 1 comprises one 1 × 1 convolutional layer and four encoding convolution modules ECB10, ECB20, ECB30 and ECB40, each encoding convolution module containing two 3 × 3 convolutional layers and one max-pooling layer;
the structure of the decoder in step 1 comprises one 1 × 1 convolutional layer and six decoding convolution modules DCB30, DCB20, DCB21, DCB10, DCB11 and DCB12, each decoding convolution module containing two 3 × 3 convolutional layers.
3. The method of claim 1, wherein the decoder network in step 1 is connected as follows: transverse dense skip connections are used in the first and second scales, in a channel-concatenation manner; the final fused feature of the second scale is skip-connected to the input of DCB21, the final fused feature of the first scale is skip-connected to the inputs of DCB11 and DCB12, and the output of DCB10 is skip-connected to the input of DCB12; through these transverse dense skip connections, the deep features of all intermediate layers are used for feature reconstruction, improving the reconstruction capability for multi-scale depth features; in the decoding sub-network, longitudinal dense connections are established across all scales: the final fused feature of the fourth scale is connected to the input of DCB30, the final fused feature of the third scale to the input of DCB20, the final fused feature of the second scale to the input of DCB10, the output of DCB30 to the input of DCB21, the output of DCB20 to the input of DCB11, and the output of DCB21 to the input of DCB12; through these longitudinal dense up-sampling connections, the features of all scales are used for feature reconstruction, further improving the reconstruction capability for multi-scale depth features.
4. The method of claim 1, wherein the loss function L_ED of the encoder-decoder network measures the pixel consistency and structural similarity between the input image and the output image, as shown in equation (1):

L_ED = L_p + β·L_ssim    (1)

where L_p is the pixel consistency loss and L_ssim is the structural similarity loss;

the pixel consistency loss L_p is shown in equation (2) [equation (2) appears only as an image in the source document];

the structural similarity loss L_ssim is shown in equation (3):

L_ssim = 1 - ssim(O, I)    (3)
wherein O is a network output image and I is an input image.
5. The method of claim 1, wherein measuring the multi-modal features with entropy, gradient and saliency in step 2 comprises the following steps:
step 2.1, calculating the entropy of the features output by the encoder and comparing the entropy values of the features at all scales, wherein the feature with the highest entropy contains the most content and detail and is classified as the content feature;
step 2.2, calculating the gradient of the encoder input image with a Sobel operator, down-sampling the gradient, subtracting it from each feature and computing the average, wherein the feature with the smallest average contains the most structural information such as contours and edges and is classified as the edge-structure feature;
step 2.3, computing a saliency map of the encoder input image with a saliency extraction algorithm, down-sampling the saliency map, subtracting it from each feature and computing the average, wherein the feature with the smallest average best separates the foreground target from the background, so as to obtain the patch feature.
6. The method of claim 1, wherein the multi-mode adaptive loss function in step 2 comprises a content loss, a correlation loss and a class-saliency loss, as shown in equation (4):

L_fea = L_con + λ·L_corr + ρ·L_sal-l    (4)

where L_con is the content loss, L_corr is the correlation loss, L_sal-l is the class-saliency loss, and λ and ρ are hyper-parameters that balance the weights of the three losses;

the content loss L_con enhances the fusion of content features, as shown in equation (5) [equation (5) appears only as an image in the source document], where w_ir and w_vi are adaptive weights, w_ir being defined by a formula that likewise appears only as an image, and w_vi = 1 - w_ir;

the correlation loss L_corr enhances the fusion of edge-structure features, as shown in equation (6) [equation (6) appears only as an image in the source document], where cov(·) is the covariance function and σ is the standard deviation function;

the class-saliency loss L_sal-l enhances the fusion of patch features, as shown in equation (7) [equation (7) appears only as an image in the source document], where Φ_ir is the infrared feature, Φ_vi is the visible light feature, Φ_f is the fused infrared-and-visible feature, and M_ir and M_vi are masks that remove noise from the features, as defined in equations (8) and (9), which also appear only as images in the source document,
where θ is a constant.
7. The method of claim 1, wherein the fusion network of step 3 is structured as follows: it comprises four Transformer modules, each consisting of two 1 × 1 convolutional layers and one Focal Transformer module; the first convolutional layer adjusts the feature channels, the Focal Transformer module combines local and global information to fuse the features, and the second convolutional layer increases the nonlinearity of the network.
8. The method of claim 1, wherein adding the saliency label in step 4 comprises the following steps:
step 4.1, detecting the input infrared image with the LC saliency extraction algorithm to obtain a saliency map M_sal;
step 4.2, normalizing the saliency map to obtain the normalized saliency map (its symbol appears only as an image in the source document);
step 4.3, for the normalized saliency map, designing the saliency loss, as shown in equation (10) [equation (10) appears only as an image in the source document], where F is the fused image, and I_ir and I_vi are the infrared and visible light images, respectively.
9. The method of claim 1, wherein cascading the encoder, the fusion module and the decoder into a complete image fusion network in step 5 is expressed as follows:
the trained encoder E extracts the multi-mode features Φ_ir and Φ_vi of the infrared and visible light images; the infrared feature Φ_ir and the visible light feature Φ_vi are concatenated along the channel dimension and fed into the feature fusion network F, which generates the fused feature Φ_f; the trained decoder D then decodes Φ_f to generate the fused image I_f; the entire fusion process can be formulated as equation (11):

I_f = D(F(E(I_ir), E(I_vi)))    (11)

where I_ir and I_vi denote the infrared image and the visible light image, respectively; E(·) denotes the encoder function, F(·) the feature fusion network function, and D(·) the decoder function.
10. The method of claim 1, wherein the training process of the fusion model in step 5 is: taking the infrared image, the visible light image and the saliency map of the infrared image as input, the total loss function is shown in equation (12):

L = L_ssim + α·L_fea + β·L_sal    (12)

where L_ssim is the structural similarity loss, calculated as L_ssim = 1 - ssim(I_f, I_vi); L_fea is the multi-mode adaptive loss; L_sal is the saliency loss; and α and β are hyper-parameters that balance the weights of the three losses.
CN202210244332.9A 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics Pending CN114639002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210244332.9A CN114639002A (en) 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210244332.9A CN114639002A (en) 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics

Publications (1)

Publication Number Publication Date
CN114639002A true CN114639002A (en) 2022-06-17

Family

ID=81947742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210244332.9A Pending CN114639002A (en) 2022-03-14 2022-03-14 Infrared and visible light image fusion method based on multi-mode characteristics

Country Status (1)

Country Link
CN (1) CN114639002A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205179A (en) * 2022-07-15 2022-10-18 小米汽车科技有限公司 Image fusion method and device, vehicle and storage medium
CN116823686A (en) * 2023-04-28 2023-09-29 长春理工大学重庆研究院 Night infrared and visible light image fusion method based on image enhancement
CN116823686B (en) * 2023-04-28 2024-03-08 长春理工大学重庆研究院 Night infrared and visible light image fusion method based on image enhancement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination