CN112561846A - Method and device for training image fusion model and electronic equipment - Google Patents


Info

Publication number
CN112561846A
Authority
CN
China
Prior art keywords
image
loss
feature
fusion
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011548642.7A
Other languages
Chinese (zh)
Inventor
龙勇志 (Long Yongzhi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202011548642.7A
Publication of CN112561846A
Legal status: Withdrawn

Classifications

    • G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06F18/253: Fusion techniques of extracted features
    • G06T2207/10048: Infrared image
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20221: Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method for training an image fusion model and electronic equipment, and belongs to the field of image processing. The method comprises the following steps: acquiring a training data set, wherein the training data set comprises an image pair; inputting the image pair into an initial image fusion model, wherein the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network; processing the image pair sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain a fused image of the image pair; and calculating a difference value between the fused image and the image pair according to a preset loss function, and updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain a trained image fusion model, wherein the loss function is used for calculating a structural loss value and a content loss value. The method and the device can improve the objective authenticity of the fused image.

Description

Method and device for training image fusion model and electronic equipment
Technical Field
The application belongs to the technical field of image processing, and particularly relates to a method and a device for training an image fusion model and electronic equipment.
Background
With the rapid development of sensor imaging technology, single-sensor imaging can hardly meet daily application requirements, and multi-sensor imaging has driven technical innovation. Image fusion comprehensively processes the image information detected by a plurality of sensors, so as to achieve a more comprehensive and reliable description of the detected scene.
Infrared and visible light images are the image types most widely used in the field of image processing. An infrared image can efficiently capture scene heat radiation and identify highlighted targets in a scene, while a visible light image has high resolution and can present detailed texture information of the scene, so the image information of the two is highly complementary. Therefore, fusing an infrared image and a visible light image can produce a fused image rich in scene information that clearly and accurately describes both the scene background and the target.
At present, the fusion of an infrared image and a visible light image generally uses a fusion algorithm based on multi-scale decomposition, such as a wavelet-transform fusion algorithm: the features of the images are extracted and decomposed, the features are then classified, and different feature fusion strategies are formulated for different features and scene category features, so that multiple groups of fused features are obtained for inverse transformation and transformed back into the fused image.
However, classical image fusion algorithms based on multi-scale decomposition usually use a fixed transformation and a fixed feature extraction hierarchy to decompose and reconstruct image features. This not only greatly limits image feature extraction and decomposition, but also requires artificially formulated feature fusion rules in the fusion algorithm, so artificial visual artifacts are easily introduced into the fusion result, the objective authenticity of the image information is damaged, and the quality of the fused image is affected.
Disclosure of Invention
The embodiments of the application aim to provide a method for training an image fusion model that can solve the problem in the prior art that fusing an infrared image and a visible light image requires artificially formulated feature fusion rules, which easily introduce artificial visual artifacts into the fusion result, damage the objective authenticity of the image information, and degrade the effect of the fused image.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a method for training an image fusion model, where the method includes:
acquiring a training data set, wherein the training data set comprises an image pair, and the image pair comprises an infrared image and a visible light image in the same scene;
inputting the image pair into an initial image fusion model, wherein the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network;
obtaining a fused image of the image pair by processing the image pair sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network;
calculating a difference value between the fused image and the image pair according to a preset loss function, and updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image fusion model, wherein the preset loss function is used for calculating a structural loss value and a content loss value.
In a second aspect, an embodiment of the present application provides an apparatus for training an image fusion model, where the apparatus includes:
the data acquisition module is used for acquiring a training data set, wherein the training data set comprises an image pair, and the image pair comprises an infrared image and a visible light image in the same scene;
the data input module is used for inputting the image pair into an initial image fusion model, and the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network;
the data processing module is used for obtaining a fused image of the image pair by processing the image pair sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network;
and the parameter adjusting module is used for calculating a difference value between the fused image and the image pair according to a preset loss function, updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, and obtaining the trained image fusion model, wherein the preset loss function is used for calculating a structural loss value and a content loss value.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the application, the fusion of the infrared image and the visible light image is realized based on a neural network model. By simulating the structure of human visual neurons, the neural network model can decompose and extract richer image features of appropriate types, which can improve the accuracy of feature extraction.
In addition, the neural network is forced to fit the most appropriate image feature fusion rule and strategy by setting the preset loss function, so that the fusion performance is improved on the premise of not adding artificial fusion rules and strategies, and the image information is efficiently fused. The preset loss function can be used for calculating a structural loss value and a content loss value, and the fused image can embody detailed content information in the infrared image and the visible light image on the basis of retaining the structural information of the infrared image and the visible light image in the fused image as much as possible.
Therefore, the image fusion model obtained by training of the application is used for fusing the infrared image and the visible light image, so that the situation that artificial visual artifacts are introduced into the fused image can be avoided, the objective authenticity of the fused image is improved, structural features and content features of the fused image can be better reserved, and the effect of the fused image is further improved.
Moreover, the trained image fusion model is an end-to-end model, after the trained image fusion model is obtained, the image pair to be fused is input into the trained image fusion model, namely, the fused image can be output through the trained image fusion model, and convenience of image fusion operation can be improved.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of a method of training an image fusion model of the present application;
FIG. 2 is a schematic illustration of an infrared image and a visible light image of a scene;
FIG. 3 is a schematic diagram of a network structure of the present application comprising at least two layers of aggregated residual dense blocks;
FIG. 4 is a model structure diagram of an image fusion model of the present application;
FIG. 5 is a schematic flow chart of an image pair processing through network modules of an image fusion model according to the present application;
FIG. 6 is a schematic diagram of a pixel level penalty function calculation process according to the present application;
FIG. 7 is a schematic flow chart illustrating the calculation of a feature level loss function according to the present application;
FIG. 8 is a schematic structural diagram of an embodiment of an apparatus for training an image fusion model according to the present application;
FIG. 9 is a schematic structural diagram of an electronic device of the present application;
fig. 10 is a hardware structure diagram of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the terms "first," "second," and the like are generally used herein in a generic sense and do not limit the number of terms, e.g., the first term can be one or more than one. In addition, "and/or" in the specification and claims means at least one of connected objects, a character "/" generally means that a preceding and succeeding related objects are in an "or" relationship.
The method for training the image fusion model provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings by specific embodiments and application scenarios thereof.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a method for training an image fusion model of the present application is shown, including the steps of:
101, acquiring a training data set, wherein the training data set comprises an image pair, and the image pair comprises an infrared image and a visible light image in the same scene;
102, inputting the image pair into an initial image fusion model, wherein the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network;
103, obtaining a fused image of the image pair by processing the image pair sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network;
and 104, calculating a difference value between the fused image and the image pair according to a preset loss function, and updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image fusion model, wherein the preset loss function is used for calculating a structural loss value and a content loss value.
The application provides a method for training an image fusion model, wherein the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network. The method and the device realize the fusion of the infrared image and the visible light image based on a neural network model; by simulating the structure of human visual neurons, the neural network model can decompose and extract richer image features of appropriate types, so that the accuracy of feature extraction can be improved.
The image fusion model can be obtained by carrying out unsupervised or supervised training on the existing neural network according to a large amount of training data and a machine learning method. It should be noted that, the embodiment of the present application does not limit the model structure and the training method of the image fusion model. The image fusion model may be a deep neural network model in which a plurality of neural networks are fused. The neural network includes, but is not limited to, at least one or a combination, superposition, nesting of at least two of the following: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) Network, RNN (Simple Recurrent Neural Network), attention Neural Network, and the like.
In an embodiment of the present application, the training data set collected is composed of a large number of image pairs, each image pair containing an infrared image and a visible light image of the same scene. Referring to fig. 2, a schematic diagram of an infrared image and a visible light image of a scene is shown. As shown in fig. 2, a and b are an infrared image and a visible light image of a certain scene, respectively, and the a and b may form an image pair.
And sequentially inputting each image pair in the training data set into an initial image fusion model, and sequentially processing the image pairs through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain a fusion image of the image pairs.
The shallow feature extraction network is used for extracting a basic feature map of the image shallow layer. The deep layer feature extraction network is used for further extracting multi-level depth feature information from the basic shallow layer feature map extracted by the shallow layer feature extraction network. The global feature fusion network is used for further integrating the extracted image feature information, so that the global feature maps extracted by the fusion network are integrated, and the information content of the network multi-scale feature maps is deepened. The feature reconstruction network is used for reconstructing the fused image features to obtain a fused image of the image pair.
And finally, calculating a difference value between the fused image and the image pair according to a preset loss function, and updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image fusion model.
The structural loss value is used for representing the structural similarity of the two images, and the structural loss value is used for better keeping structural information in the infrared image and the visible light image in the fused image. The content loss value is used for representing the content similarity of the two images, and the content loss value is used for effectively calculating the pixel gray information between the image pairs, so that errors are reduced as much as possible, and the fused image can store the detailed content information in the infrared image and the visible light image.
The neural network is forced to fit the most appropriate image feature fusion rule and strategy by setting the preset loss function, so that the fusion performance is improved and the image information is efficiently fused on the premise of not adding artificial fusion rules and strategies. The preset loss function can be used for calculating a structural loss value and a content loss value, and the fused image can embody detailed content information in the infrared image and the visible light image on the basis of retaining the structural information of the infrared image and the visible light image in the fused image as much as possible.
Therefore, the image fusion model obtained by training of the application is used for fusing the infrared image and the visible light image, so that the situation that artificial visual artifacts are introduced into the fused image can be avoided, the objective authenticity of the fused image is improved, structural features and content features of the fused image can be better reserved, and the effect of the fused image is further improved.
In addition, the trained image fusion model is an end-to-end model, after the trained image fusion model is obtained, the image pair to be fused is input into the trained image fusion model, namely, the fused image can be output through the trained image fusion model, and convenience of image fusion operation can be improved.
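To make the training flow above concrete, the following is a minimal PyTorch-style sketch of the parameter-update loop; it is not part of the original disclosure, the optimizer choice, batch size, learning rate and stopping threshold are assumptions, and loss_fn stands for the preset loss function described later.

```python
# Minimal illustrative sketch of the training loop described above (all hyperparameters
# and the optimizer choice are assumptions, not values from the patent).
import torch
from torch.utils.data import DataLoader

def train_fusion_model(model, dataset, loss_fn, threshold=1e-3, lr=1e-2, max_epochs=100):
    """Update model parameters until the computed difference value falls below `threshold`."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for ir, vis in loader:                  # each sample is an infrared / visible-light pair
            fused = model(ir, vis)              # forward pass through the four sub-networks
            loss = loss_fn(fused, ir, vis)      # structural loss value + content loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < threshold:             # stop once the difference value is small enough
            break
    return model
```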
In an optional embodiment of the present application, the step 103 of obtaining the fused image of the image pair by processing the image pair sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network includes:
step S11, carrying out channel cascade processing on the image pair to obtain a cascade image;
step S12, inputting the cascade image into the shallow feature extraction network to extract a shallow feature map;
step S13, inputting the shallow feature map into the deep feature extraction network to extract depth feature information;
step S14, inputting the shallow feature map and the depth feature information into the global feature fusion network to integrate the shallow feature map and the depth feature information to obtain an integrated image;
and step S15, inputting the integrated image into the feature reconstruction network to perform feature reconstruction on the integrated image to obtain a fused image.
It should be noted that, before inputting the image pair into the initial image fusion model, the image pair may be precisely aligned, and then the aligned image pair is input into the initial image fusion model, so as to further improve the fusion effect.
After an image pair is input into the initial image fusion model, the input infrared image and visible light image are connected through a channel concatenation operation to obtain a cascade image, which can be understood as the channel-concatenated image Iv,r of the infrared image and the visible light image. In the embodiments of the application, the infrared image is denoted Ir and the visible light image Iv; the two-channel concatenation of Ir and Iv is expressed mathematically in formula (1-1):
Iv,r=Concat(Iv,Ir) (1-1)
where Concat is the image channel concatenation function. The concatenated image Iv,r is then input into the shallow feature extraction network (denoted BFEnet in this application) to extract basic shallow features, obtaining the shallow feature map of Iv,r.
Further, the shallow feature extraction network may be a convolutional neural network. In one example, the shallow feature extraction network may comprise two convolutional layers, each using a convolution kernel of 3 × 3 pixels, so that a basic shallow feature map is extracted from the concatenated image Iv,r; the operation of the convolutional layers is represented by formula (1-2):
(Formula (1-2), which gives the shallow feature map extracted by the i-th convolutional layer of the BFEnet module, appears as an image in the original document.)
It is understood that the shallow feature extraction network described above includes two convolutional layers, and the size of the convolutional kernel used in each convolutional layer is 3 × 3 pixels, which is only an application example of the embodiment of the present application. The number of convolutional layers included in the shallow feature extraction network and the size of a convolutional kernel used by each convolutional layer are not limited in the embodiments of the present application.
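As an illustrative sketch of the channel concatenation in formula (1-1) and a two-layer 3 × 3 shallow feature extraction network, the following PyTorch-style code may help; the channel widths, the activation function and the returned pair of feature maps are assumptions, not details taken from the disclosure.

```python
# Sketch of BFEnet: channel concatenation followed by two 3x3 convolutional layers.
import torch
import torch.nn as nn

class BFEnet(nn.Module):
    def __init__(self, in_channels=2, features=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, features, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(features, features, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, i_r, i_v):
        i_vr = torch.cat([i_v, i_r], dim=1)  # I_{v,r} = Concat(I_v, I_r), formula (1-1)
        f1 = self.act(self.conv1(i_vr))      # basic shallow feature map of the first layer
        f2 = self.act(self.conv2(f1))
        return f1, f2                        # f1 is reused later by global residual learning
```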
The shallow feature maps are then propagated into the deep feature extraction network (denoted RXDBs in this application) to further extract multi-level depth feature information from the output of BFEnet. The module structure of the deep feature extraction network is tailored to infrared and visible light images and further improves feature flow inside the module, so that the shallow and deep features inside the module become components of the module output and the whole network module has better image feature extraction capability.
Then, the extracted feature map is input into a global feature fusion network (the global feature fusion network is denoted as GFF in the present application) to perform global feature fusion, and the extracted image feature information is further integrated.
In an optional embodiment of the present application, the deep feature extraction network includes at least two layers of aggregated residual dense blocks for extracting depth feature information, and the inputting of the shallow feature map and the depth feature information into the global feature fusion network to integrate the shallow feature map and the depth feature information to obtain an integrated image includes:
step S21, performing feature fusion on the depth feature information extracted by each layer of aggregated residual dense blocks to obtain global features;
and step S22, performing feature fusion on the global features and the shallow feature map extracted by the shallow feature extraction network to obtain an integrated image.
In the embodiments of the present application, the deep feature extraction network RXDBs includes at least two layers of aggregated residual dense blocks RXDB for extracting depth feature information, and the global feature fusion network GFF includes a dense feature fusion module (denoted DFF in this application) and a global residual learning module (denoted GRL in this application). In the embodiments of the application, the depth feature information extracted by each layer of aggregated residual dense blocks is input into the dense feature fusion module DFF for feature fusion to obtain global features; and the global features and the shallow feature map extracted by the shallow feature extraction network are input into the global residual learning module GRL for feature fusion to obtain the integrated image.
Referring to fig. 3, a schematic diagram of a network structure of the present application including at least two layers of aggregated residual dense blocks is shown; the calculation process of this structure is represented by formula (1-3):
(Formula (1-3), which gives the image features extracted by the i-th aggregated residual dense block, appears as an image in the original document.)
The dense feature fusion module (DFF) derives global features by integrating the features of all the aggregated residual dense blocks. Further, in order to avoid model overfitting and improve computational efficiency, feature-map dimensionality reduction is performed inside the DFF module; the specific calculation is shown in formula (1-4):
(Formula (1-4), which integrates the feature maps extracted by the six layers of aggregated residual dense blocks into the image features obtained by the DFF in the GFF module, appears as an image in the original document.)
The global residual learning module GRL makes full use of all the preceding features, including the basic shallow feature map obtained by the first convolutional layer of BFEnet, so that the global feature maps extracted by the fusion network are integrated and the information content of the network's multi-scale feature maps is deepened, yielding the integrated image. The calculation of the global residual learning module GRL is shown in formula (1-5):
(Formula (1-5), which gives the image features obtained by the GRL in the GFF module, appears as an image in the original document.)
It should be noted that the deep feature extraction network includes 6 layers of aggregated residual dense blocks, which is only an application example of the embodiment of the present application, and the embodiment of the present application does not limit the number of the aggregated residual dense blocks included in the deep feature extraction network.
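Since the internal structure of the aggregated residual dense block is not reproduced here (formulas (1-3) to (1-5) are images in the source), the following is only a simplified stand-in that illustrates the idea of dense connections with local residual learning, a 1 × 1 dense feature fusion (DFF) and global residual learning (GRL); the block layout, channel counts and growth rate are assumptions.

```python
# Simplified, illustrative stand-in for the RXDB blocks and the GFF module (DFF + GRL).
import torch
import torch.nn as nn

class SimpleRXDB(nn.Module):
    """Dense connections inside the block plus a local residual connection."""
    def __init__(self, channels=64, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        c = channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(c, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            c += growth
        self.local_fuse = nn.Conv2d(c, channels, kernel_size=1)   # local feature fusion

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.local_fuse(torch.cat(feats, dim=1))       # local residual learning

class GlobalFeatureFusion(nn.Module):
    """DFF: 1x1 convolution over the concatenated RXDB outputs; GRL: add the shallow feature map."""
    def __init__(self, channels=64, num_blocks=6):
        super().__init__()
        self.dff = nn.Conv2d(channels * num_blocks, channels, kernel_size=1)

    def forward(self, rxdb_outputs, shallow_f1):
        global_feat = self.dff(torch.cat(rxdb_outputs, dim=1))    # dense feature fusion with dimension reduction
        return global_feat + shallow_f1                           # global residual learning
```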
And finally, inputting the integrated image into a feature reconstruction network (the feature reconstruction network is denoted as Rbnet in the application) so as to perform feature reconstruction on the integrated image to obtain a fused image.
The feature reconstruction network may be a convolutional neural network. In one example, the feature reconstruction network contains 3 convolutional layers, the first two convolutional layers having a convolution kernel size of 3 × 3 pixels, and the last convolutional layer using a convolution kernel size of 1 × 1 pixel, in order to reconstruct local features of the fused image. The calculation formula of the feature reconstruction network is shown as 1-6:
(Formula (1-6) appears as an image in the original document; in it, If denotes the fused image.)
It is understood that the above feature reconstruction network includes 3 convolution layers, and the sizes of the convolution kernels of the first two layers are 3 × 3 pixels, and the size of the convolution kernel of the last layer is 1 × 1 pixel, which is only an application example of the embodiment of the present application. The number of convolutional layers included in the feature reconstruction network and the size of a convolutional kernel used by each convolutional layer are not limited in the embodiments of the present application.
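A minimal sketch of the three-layer feature reconstruction network (two 3 × 3 convolutions followed by a 1 × 1 convolution) could look as follows; the channel widths, the activations and the single-channel output are assumptions.

```python
# Sketch of the feature reconstruction network (RBnet) described above.
import torch.nn as nn

class RBnet(nn.Module):
    def __init__(self, in_channels=64, mid_channels=32, out_channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1),  # 1x1 kernel reconstructs local detail
        )

    def forward(self, integrated_features):
        return self.body(integrated_features)  # I_f, the fused image
```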
Referring to fig. 4, a model structure diagram of an image fusion model of the present application is shown. And referring to fig. 5, a schematic flow chart of image pair processing through each network module of the image fusion model according to the present application is shown.
The image fusion model processes image pairs sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain the fused image of each image pair. In this process, the intermediate layers of the image fusion model can reuse image features, increasing the reusability and flow of the features, which promotes efficient fusion of image features while further reducing the size and computational complexity of the network model.
In an optional embodiment of the present application, the preset loss function includes a structural loss function and a content loss function, and the calculating a difference value between the fused image and the image pair according to the preset loss function includes:
step S31, performing weighted average calculation on a first structural loss and a second structural loss through the structural loss function to obtain a structural loss value of the fused image, where the first structural loss is a structural loss between the infrared image and the fused image, and the second structural loss is a structural loss between the visible light image and the fused image;
step S32, performing weighted average calculation on a first content loss and a second content loss through the content loss function to obtain a content loss value of the fused image, where the first content loss is a content loss between the infrared image and the fused image, and the second content loss is a content loss between the visible light image and the fused image;
and step S33, carrying out weighted calculation on the structure loss value and the content loss value of the fused image to obtain a difference value between the fused image and the image pair.
It should be noted that, in the embodiment of the present application, the order of step S31 and step S32 is not limited.
In the embodiment of the present application, the preset loss function is used for calculating a structural loss value and a content loss value. The structural loss value is used to better retain structural information in the infrared image and the visible image in the fused image. The content loss value is used to store detailed content information in the infrared image and the visible light image in the fused image.
An efficient network loss function can quickly reduce the gap between the neural network results and the expected real targets. Aiming at the characteristics of the fusion task of the infrared image and the visible light image, the preset loss function is designed as shown in formulas 1-7:
L=αLssim+βLcontent (1-7)
where α and β are trainable parameters, Lssim is the structural loss function for calculating the structural loss value, and Lcontent is the content loss function for calculating the content loss value. Lssim and Lcontent are shown in formulas (1-8) and (1-9), respectively:
Lssim=1-SSIM(A,B) (1-8)
Lcontent=||A-B||2 (1-9)
SSIM(A, B) represents the structural similarity (SSIM) between image A and image B. ||·||2 denotes the L2 norm operation, used to calculate the Euclidean distance between image A and image B.
In practical applications, the brightness information of an object is related to the illumination and the reflection coefficient; the structure and the illumination of objects in a scene are independent of each other, and the reflection coefficient is related to the object itself. Therefore, the SSIM definition can reflect the structural information in an image by separating out the influence of illumination on the object. The luminance comparison usually takes the average gray level of the two images as its estimate; the average gray level and standard deviation of an image are given in formulas (1-10) and (1-11), respectively.
μx = (1/N)·Σ xi (1-10)
σx = sqrt((1/(N-1))·Σ (xi - μx)²) (1-11)
In the above formulas, xi represents the gray value at the corresponding position of the image, N represents the number of pixels in the image, μx is the calculated average gray value of the image, and σx is the calculated standard deviation of the image.
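As a minimal sketch of the structural-loss primitive in formulas (1-8), (1-10) and (1-11), the code below computes a single-window (global) SSIM from the mean, standard deviation and covariance of two images; a production implementation would normally use a sliding-window SSIM, and the stabilizing constants c1 and c2 are assumptions.

```python
# Global (single-window) SSIM and the corresponding structural loss, for illustration only.
import torch

def global_ssim(a: torch.Tensor, b: torch.Tensor, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    mu_a, mu_b = a.mean(), b.mean()            # average gray values, formula (1-10)
    sigma_a, sigma_b = a.std(), b.std()        # standard deviations, formula (1-11)
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (sigma_a ** 2 + sigma_b ** 2 + c2))

def ssim_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 - global_ssim(a, b)             # L_ssim = 1 - SSIM(A, B), formula (1-8)
```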
Therefore, the background information of abundant texture details is kept, the scene highlight target in the infrared image is smoothly introduced, and a smoother and natural fusion image can be generated.
According to the embodiment of the application, the structure loss between the infrared image and the fused image and the structure loss between the visible light image and the fused image are subjected to weighted average calculation through the structure loss function, and the structure loss value of the fused image is obtained.
And carrying out weighted average calculation on the content loss between the infrared image and the fused image and the content loss between the visible light image and the fused image through a content loss function to obtain a content loss value of the fused image.
And performing weighted calculation on the structural loss value and the content loss value of the fused image to obtain a difference value between the fused image and the image pair, and performing iterative updating on network parameters of the image fusion model according to the difference value until the calculated difference value is smaller than a preset threshold value to obtain the trained image fusion model.
Further, in the embodiment of the present application, two different calculation strategies are designed for a preset loss function, which are a pixel-level loss function and a feature-level loss function respectively. The following description will be made separately.
In an optional embodiment of the present application, the calculating a weighted average of the first structural loss and the second structural loss by the structural loss function to obtain a structural loss value of the fused image includes:
carrying out weighted average calculation on the first structural loss and the second structural loss of the pixel level through the structural loss function to obtain a structural loss value of the pixel level;
the obtaining a content loss value of the fused image by performing weighted average calculation on the first content loss and the second content loss through the content loss function includes:
performing weighted average calculation on the first content loss and the second content loss of the pixel level through the content loss function to obtain a content loss value of the pixel level;
the weighting calculation of the structure loss value and the content loss value of the fused image to obtain the difference value between the fused image and the image pair includes:
and performing weighted calculation on the structure loss value of the pixel level and the content loss value of the pixel level to obtain a pixel-level difference value between the fusion image and the image pair.
The pixel-level loss function LPixel-wise is a weighted combination of Lssim and Lcontent applied directly to the input images (the infrared image and the visible light image) and the output image (the fused image). Its calculation is shown in formula (1-12):
LPixel-wise=αLssim+βLcontent (1-12)
Lssim in (1-12) is used to calculate the structural similarity of the input images and the output image at the pixel level, i.e. the pixel-level structural loss value of the fused image, which is obtained as a weighted average of the pixel-level first structural loss Lssim-r and the pixel-level second structural loss Lssim-v. The pixel-level first structural loss Lssim-r is calculated as shown in formula (1-13), the pixel-level second structural loss Lssim-v as shown in formula (1-14), and the pixel-level structural loss of the fused image as shown in formula (1-15):
Lssim-r=1-SSIM(Ir,If) (1-13)
Lssim-v=1-SSIM(Iv,If) (1-14)
Lssim=λLssim-r+(1-λ)Lssim-v (1-15)
Similarly, Lcontent in (1-12) is used to calculate the content similarity of the input images and the output image at the pixel level, i.e. the pixel-level content loss value of the fused image, which is obtained as a weighted average of the pixel-level first content loss Lcontent-r and the pixel-level second content loss Lcontent-v. The pixel-level first content loss Lcontent-r is calculated as shown in formula (1-16), the pixel-level second content loss Lcontent-v as shown in formula (1-17), and the pixel-level content loss of the fused image as shown in formula (1-18):
Lcontent-r=||If-Ir||2 (1-16)
Lcontent-v=||If-Iv||2 (1-17)
Lcontent=δLcontent-r+(1-δ)Lcontent-v (1-18)
referring to fig. 6, a schematic diagram of a calculation flow of a pixel level penalty function of the present application is shown.
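The pixel-level strategy of formulas (1-12) to (1-18) can be sketched directly on top of the ssim_loss helper shown earlier; α, β, λ and δ are treated here as assumed hyperparameters rather than the trained values of the patent.

```python
# Sketch of the pixel-level loss (reuses ssim_loss from the earlier sketch).
import torch

def pixel_level_loss(fused, ir, vis, alpha=1.0, beta=1.0, lam=0.5, delta=0.5):
    l_ssim_r = ssim_loss(ir, fused)                               # formula (1-13)
    l_ssim_v = ssim_loss(vis, fused)                              # formula (1-14)
    l_ssim = lam * l_ssim_r + (1 - lam) * l_ssim_v                # formula (1-15)

    l_content_r = torch.norm(fused - ir, p=2)                     # formula (1-16)
    l_content_v = torch.norm(fused - vis, p=2)                    # formula (1-17)
    l_content = delta * l_content_r + (1 - delta) * l_content_v   # formula (1-18)

    return alpha * l_ssim + beta * l_content                      # formula (1-12)
```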
The pixel-level calculation strategy provides a relatively rough estimation of the network loss function, and in order to obtain more detailed loss information, the feature-level loss function calculation strategy is preferably adopted in network training.
In an optional embodiment of the present application, before calculating the difference value between the fused image and the image pair according to the preset loss function, the method further includes:
step S41, extracting depth feature maps from the infrared image, the visible light image and the fusion image respectively by using a preset neural network;
step S42, calculating a first structural loss of a characteristic level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculating a second structural loss of the characteristic level based on the depth feature map of the visible light image and the depth feature map of the fused image;
step S43, calculating a first content loss of a characteristic level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculating a second content loss of the characteristic level based on the depth feature map of the visible light image and the depth feature map of the fused image;
the obtaining a structural loss value of the fused image by performing weighted average calculation on the first structural loss and the second structural loss through the structural loss function includes:
carrying out weighted average calculation on the first structural loss of the characteristic level and the second structural loss of the characteristic level through the structural loss function to obtain a structural loss value of the characteristic level;
the obtaining a content loss value of the fused image by performing weighted average calculation on the first content loss and the second content loss through the content loss function includes:
carrying out weighted average calculation on the first content loss of the characteristic level and the second content loss of the characteristic level through the content loss function to obtain a content loss value of the characteristic level;
the weighting calculation of the structure loss value and the content loss value of the fused image to obtain the difference value between the fused image and the image pair includes:
and performing weighted calculation on the structure loss value of the characteristic level and the content loss value of the characteristic level to obtain a difference value of the characteristic level between the fusion image and the image pair.
The feature level calculation strategy utilizes a preset neural network to perform feature extraction on the image pair needing loss calculation to obtain two corresponding sets of depth feature maps, and the loss function is evaluated from a brand new angle, namely, the loss function calculation is performed in the calculated feature maps.
The preset neural network is used for extracting the depth feature map of the image, and the preset network can be a pre-trained neural network. In one example, the predetermined network may be a VGG19 network. The VGG network is proposed by Oxford's Visual Geometry Group, and has two structures, namely VGG16 and VGG19, which are different in depth. It can be understood that the embodiment of the present application does not limit the kind of the preset neural network.
The feature-level loss function LFeature-wise is calculated as shown in formula (1-19):
LFeature-wise=αLssim+βLcontent (1-19)
Lssim in (1-19) is used to calculate the structural similarity of the input images and the output image at the feature level, i.e. the feature-level structural loss value of the fused image, which is obtained as a weighted average of the feature-level first structural loss Lssim-r and the feature-level second structural loss Lssim-v. The feature-level first structural loss Lssim-r is calculated as shown in formula (1-20), the feature-level second structural loss Lssim-v as shown in formula (1-21), and the feature-level structural loss of the fused image as shown in formula (1-22):
(Formulas (1-20) and (1-21), which give the feature-level first structural loss Lssim-r and the feature-level second structural loss Lssim-v, appear as images in the original document.)
Lssim=λLssim-r+(1-λ)Lssim-v (1-22)
where the feature maps of the 'ReLU3_2' layer of the VGG19 network are indexed by M ∈ {1, 2, ...}, and the quantities entering formulas (1-20) to (1-24) are the depth feature maps extracted by the VGG19 network from the infrared image, the visible light image and the fused image, respectively.
Similarly, Lcontent in (1-19) is used to calculate the content similarity of the input images and the output image at the feature level, i.e. the feature-level content loss value of the fused image, which is obtained as a weighted average of the feature-level first content loss Lcontent-r and the feature-level second content loss Lcontent-v. The feature-level first content loss Lcontent-r is calculated as shown in formula (1-23), the feature-level second content loss Lcontent-v as shown in formula (1-24), and the feature-level content loss of the fused image as shown in formula (1-25):
(Formulas (1-23) and (1-24), which give the feature-level first content loss Lcontent-r and the feature-level second content loss Lcontent-v, appear as images in the original document.)
Lcontent=δLcontent-r+(1-δ)Lcontent-v (1-25)
referring to fig. 7, a schematic diagram of a calculation flow of a feature level loss function of the present application is shown.
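A sketch of the feature-level strategy of formulas (1-19) to (1-25) is given below: the images are passed through a fixed pre-trained VGG19 truncated at its ReLU3_2 layer and the same structural and content terms are computed on the resulting feature maps. The slice index used for ReLU3_2, the replication of the single gray channel to three channels, the torchvision weights argument and the reuse of the earlier ssim_loss helper are all assumptions.

```python
# Sketch of the feature-level loss using a fixed VGG19 as the feature extractor.
import torch
import torchvision

# VGG19 truncated just after ReLU3_2 (slice index assumed from torchvision's layer ordering).
vgg_relu3_2 = torchvision.models.vgg19(weights="DEFAULT").features[:14].eval()
for p in vgg_relu3_2.parameters():
    p.requires_grad_(False)                      # the loss network stays fixed during training

def vgg_features(img: torch.Tensor) -> torch.Tensor:
    return vgg_relu3_2(img.repeat(1, 3, 1, 1))   # replicate the single gray channel to 3 channels

def feature_level_loss(fused, ir, vis, alpha=1.0, beta=1.0, lam=0.5, delta=0.5):
    f_f, f_r, f_v = vgg_features(fused), vgg_features(ir), vgg_features(vis)
    l_ssim = lam * ssim_loss(f_r, f_f) + (1 - lam) * ssim_loss(f_v, f_f)                       # (1-20) to (1-22)
    l_content = delta * torch.norm(f_f - f_r, p=2) + (1 - delta) * torch.norm(f_f - f_v, p=2)  # (1-23) to (1-25)
    return alpha * l_ssim + beta * l_content                                                   # (1-19)
```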
In one application example of the application, the convolutional-layer parameter settings of the modules of the network framework used in training are shown in Table 1, and the convolutional-layer parameter settings inside the aggregated residual dense block RXDB are shown in Table 2. In order to reduce the computational cost, the embodiments of the application may normalize the pixel values of each pair of training image blocks and use a piecewise-constant decaying learning rate: when the number of network training iterations reaches [5000, 10000, 14000, 18000, 20000, 24000], the learning rate is set to the corresponding value in [0.01, 0.007, 0.005, 0.0025, 0.001, 0.0001, 0.00005].
Table 1 (convolutional-layer parameter settings of the network framework modules) and Table 2 (convolutional-layer parameter settings inside the aggregated residual dense block RXDB) appear as images in the original document.
It is to be understood that the above parameter setting is only an application example of the present application, and the present application does not limit the parameter setting of the image fusion model in practical application.
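The piecewise-constant learning-rate schedule described above can be expressed as a simple lookup, for example; the pairing of the final rate with iterations beyond the last listed milestone follows the lists as printed and is otherwise an assumption.

```python
# Iteration milestones and learning rates as listed in the text above.
MILESTONES = [5000, 10000, 14000, 18000, 20000, 24000]
RATES = [0.01, 0.007, 0.005, 0.0025, 0.001, 0.0001, 0.00005]

def learning_rate(iteration: int) -> float:
    """Return the piecewise-constant learning rate for the given training iteration."""
    for milestone, rate in zip(MILESTONES, RATES):
        if iteration < milestone:
            return rate
    return RATES[-1]
```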
In an optional embodiment of the present application, after obtaining the trained image fusion model, the method further includes:
step S51, inputting an image pair to be fused into the trained image fusion model, wherein the image pair to be fused comprises an infrared image and a visible light image in the same scene;
and step S52, outputting a fused image through the trained image fusion model.
After the training of the image fusion model is completed, image fusion can be performed with the trained image fusion model. In the embodiments of the application, the image fusion model is an end-to-end model: the infrared image and the visible light image of the same scene to be fused are input into the trained image fusion model as an image pair, and the fused image is output directly.
Further, before the image pair to be fused is input into the trained image fusion model, the image pair can be accurately aligned, and then the aligned image pair is input into the trained image fusion model, so that the fusion effect is improved.
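For inference with the trained end-to-end model, a minimal sketch is shown below; alignment of the image pair is assumed to have been done beforehand, and the model interface follows the training sketch above rather than any interface defined in the disclosure.

```python
# Sketch of inference with the trained image fusion model.
import torch

def fuse(model: torch.nn.Module, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
    model.eval()
    with torch.no_grad():        # no gradients are needed when only fusing images
        return model(ir, vis)    # the end-to-end model outputs the fused image directly
```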
In summary, the present application provides a method for training an image fusion model. The trained image fusion model can effectively fuse an infrared image and a visible light image; without introducing artificial fusion rules, it effectively improves image feature analysis and extraction and further enriches the fused features, so that the fused image has better imaging performance: the scene-highlight target of the infrared image is smoothly introduced while background information with rich texture details is retained, generating a smooth and natural fused image. Meanwhile, artificial visual artifacts can be avoided, the objective authenticity of the fused image is improved, and the effect of the fused image is enhanced.
It should be noted that, in the method for training an image fusion model provided in the embodiment of the present application, the execution subject may be an apparatus for training an image fusion model, or a control module in the apparatus for training an image fusion model, which is used for executing the method for training an image fusion model. In the embodiment of the present application, a method for executing a training image fusion model by using an apparatus for training an image fusion model is taken as an example, and the apparatus for training an image fusion model provided in the embodiment of the present application is described.
Referring to fig. 8, a schematic structural diagram of an embodiment of an apparatus for training an image fusion model according to the present application is shown, the apparatus including:
a data obtaining module 801, configured to obtain a training data set, where the training data set includes an image pair, and the image pair includes an infrared image and a visible light image in the same scene;
a data input module 802, configured to input the image pair into an initial image fusion model, where the image fusion model is a deep neural network model including a shallow feature extraction network, a deep feature extraction network, a global feature fusion network, and a feature reconstruction network;
a data processing module 803, configured to obtain a fused image of the image pair through sequential processing of the shallow feature extraction network, the deep feature extraction network, the global feature fusion network, and the feature reconstruction network;
a parameter adjusting module 804, configured to calculate a difference value between the fused image and the image pair according to a preset loss function, and update a network parameter of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold, so as to obtain a trained image fusion model, where the preset loss function is used to calculate a structural loss value and a content loss value.
Optionally, the data processing module 803 includes:
the cascade processing submodule is used for carrying out channel cascade processing on the image pair to obtain a cascade image;
a shallow extraction submodule, configured to input the cascade image into the shallow feature extraction network to extract a shallow feature map;
a deep extraction sub-module for inputting the shallow feature map into the deep feature extraction network to extract deep feature information;
the feature fusion submodule is used for inputting the shallow feature map and the depth feature information into the global feature fusion network to integrate the shallow feature map and the depth feature information to obtain an integrated image;
and the characteristic reconstruction submodule is used for inputting the integrated image into the characteristic reconstruction network so as to carry out characteristic reconstruction on the integrated image and obtain a fused image.
Optionally, the preset loss function includes a structure loss function and a content loss function, and the parameter adjusting module includes:
a structural loss value calculation operator module, configured to perform weighted average calculation on a first structural loss and a second structural loss through the structural loss function to obtain a structural loss value of the fused image, where the first structural loss is a structural loss between the infrared image and the fused image, and the second structural loss is a structural loss between the visible light image and the fused image;
a content loss value calculation operator module, configured to perform weighted average calculation on a first content loss and a second content loss through the content loss function to obtain a content loss value of the fused image, where the first content loss is a content loss between the infrared image and the fused image, and the second content loss is a content loss between the visible light image and the fused image;
and the difference value calculation submodule is used for performing weighted calculation on the structure loss value and the content loss value of the fused image to obtain a difference value between the fused image and the image pair.
Optionally, the structural loss value operator module is specifically configured to perform weighted average calculation on the first structural loss and the second structural loss at the pixel level through the structural loss function to obtain a structural loss value at the pixel level;
the content loss value calculation operator module is specifically configured to perform weighted average calculation on the first content loss and the second content loss at the pixel level through the content loss function to obtain a content loss value at the pixel level;
the difference value calculating submodule is specifically configured to perform weighted calculation on the structure loss value at the pixel level and the content loss value at the pixel level to obtain a difference value at the pixel level between the fused image and the image pair.
Optionally, the apparatus further comprises:
the feature extraction module is used for respectively extracting a depth feature map from the infrared image, the visible light image and the fusion image by utilizing a preset neural network;
a first calculation module, configured to calculate a first structural loss at a feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculate a second structural loss at a feature level based on the depth feature map of the visible light image and the depth feature map of the fused image;
a second calculation module, configured to calculate a first content loss at a feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculate a second content loss at a feature level based on the depth feature map of the visible light image and the depth feature map of the fused image;
the structural loss value calculation operator module is specifically configured to perform weighted average calculation on the first structural loss of the feature level and the second structural loss of the feature level through the structural loss function to obtain a structural loss value of the feature level;
the content loss value operator module is specifically configured to perform weighted average calculation on the first content loss of the feature level and the second content loss of the feature level through the content loss function to obtain a content loss value of the feature level;
the difference value calculating submodule is specifically configured to perform weighted calculation on the structure loss value of the feature level and the content loss value of the feature level to obtain a difference value of the feature level between the fusion image and the image pair.
Optionally, the deep feature extraction network includes at least two layers of aggregated residual dense blocks for extracting depth feature information, and the feature fusion sub-module includes:
the first fusion unit is used for performing feature fusion on the depth feature information extracted by each layer of aggregated residual dense blocks to obtain global features;
and the second fusion unit is used for performing feature fusion on the global features and the shallow feature map extracted by the shallow feature extraction network to obtain an integrated image.
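A minimal sketch of these two fusion units follows, assuming the per-block depth features are concatenated and squeezed by a 1x1 convolution into a global feature that is then fused with the shallow feature map; the channel widths and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    def __init__(self, num_blocks=3, channels=64):
        super().__init__()
        # First fusion unit: merge the depth features from every block into a global feature.
        self.fuse_depth = nn.Conv2d(num_blocks * channels, channels, kernel_size=1)
        # Second fusion unit: merge the global feature with the shallow feature map.
        self.fuse_all = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, shallow_feat, block_feats):
        global_feat = self.fuse_depth(torch.cat(block_feats, dim=1))
        return self.fuse_all(torch.cat([shallow_feat, global_feat], dim=1))
```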
Optionally, the apparatus further comprises:
the image input module is used for inputting an image pair to be fused into the trained image fusion model, wherein the image pair to be fused comprises an infrared image and a visible light image in the same scene;
and the image fusion module is used for outputting a fusion image through the trained image fusion model.
The present application provides an apparatus for training an image fusion model. An image fusion model trained by the apparatus can effectively fuse an infrared image and a visible light image. Without introducing hand-crafted fusion rules, it improves the analysis and extraction of image features and enriches the fused features, so that the fused image has better imaging performance: while retaining background information with abundant texture detail, it smoothly introduces the highlighted targets of the scene from the infrared image, producing a smooth and natural fused image. Meanwhile, artificial visual artifacts are avoided, the objective authenticity of the fused image is improved, and the overall fusion effect is enhanced.
The device for training the image fusion model in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The device for training the image fusion model in the embodiment of the present application may be a device having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.
The device for training the image fusion model provided in the embodiment of the present application can implement each process implemented by the method embodiment of fig. 1, and details are not repeated here to avoid repetition.
Optionally, as shown in fig. 9, an electronic device 900 is further provided in this embodiment of the present application, and includes a processor 901, a memory 902, and a program or an instruction stored in the memory 902 and executable on the processor 901, where the program or the instruction is executed by the processor 901 to implement each process of the above-mentioned method for training an image fusion model, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 10 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to: a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may further comprise a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 1010 through a power management system, so as to manage charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 10 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than those shown, combine some components, or use a different arrangement of components, which is not described again here.
The processor 1010 is configured to acquire a training data set, where the training data set includes an image pair, and the image pair includes an infrared image and a visible light image in the same scene; inputting the image pair into an initial image fusion model, wherein the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network; sequentially processing the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain a fusion image of the image pair; calculating a difference value between the fused image and the image pair according to a preset loss function, and updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image fusion model, wherein the preset loss function is used for calculating a structural loss value and a content loss value.
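A schematic rendering of this training procedure is given below with deliberately tiny stand-ins: a single 2-to-1-channel convolution plays the fusion network, random tensors play the registered image pairs, and a weighted L1 term plays the preset loss; the threshold, learning rate and epoch cap are likewise assumed values, not parameters of this disclosure.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(2, 1, kernel_size=3, padding=1)          # placeholder fusion network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
pairs = [(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)) for _ in range(8)]
threshold = 1e-3                                                  # preset threshold (assumed)
diff = float("inf")

for epoch in range(100):                     # epoch cap so the sketch always terminates
    for ir, vis in pairs:
        fused = model(torch.cat([ir, vis], dim=1))                # channel cascade, forward pass
        # Stand-in for the preset loss: weighted content terms against both source images.
        loss = 0.5 * F.l1_loss(fused, ir) + 0.5 * F.l1_loss(fused, vis)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        diff = loss.item()                   # difference value between fused image and pair
    if diff < threshold:                     # stop updating once below the preset threshold
        break
```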
The method and the device realize the fusion of the infrared image and the visible light image based on a neural network model. By simulating the structure of human visual neurons, the neural network model can decompose and extract richer image features of more appropriate types, so that the accuracy of feature extraction is improved.
In addition, by setting the preset loss function, the neural network is forced to fit the most appropriate image feature fusion rules and strategies, so that fusion performance is improved without adding hand-crafted fusion rules and strategies, and image information is fused efficiently. The preset loss function is used for calculating a structural loss value and a content loss value, so that the fused image embodies the detailed content information of the infrared image and the visible light image while retaining their structural information as much as possible.
Therefore, fusing the infrared image and the visible light image with the image fusion model trained by the present application avoids introducing artificial visual artifacts into the fused image, improves the objective authenticity of the fused image, better preserves its structural and content features, and further improves the fusion effect.
Optionally, the processor 1010 is further configured to perform channel cascade processing on the image pair to obtain a cascade image; input the cascade image into the shallow feature extraction network to extract a shallow feature map; input the shallow feature map into the deep feature extraction network to extract depth feature information; input the shallow feature map and the depth feature information into the global feature fusion network to integrate the shallow feature map and the depth feature information to obtain an integrated image; and input the integrated image into the feature reconstruction network to perform feature reconstruction on the integrated image to obtain a fused image.
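The four-stage pipeline just described can be condensed into the following sketch; the block counts, channel widths and the internals of the aggregated residual dense blocks are not fixed by this description, so the layers below are simplified stand-ins.

```python
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, channels=64, num_deep_blocks=3):
        super().__init__()
        # Shallow feature extraction on the channel-cascaded infrared/visible pair.
        self.shallow = nn.Conv2d(2, channels, kernel_size=3, padding=1)
        # Deep feature extraction (simplified stand-in for aggregated residual dense blocks).
        self.deep = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(num_deep_blocks)])
        # Global feature fusion of all deep outputs plus the shallow feature map.
        self.global_fusion = nn.Conv2d((num_deep_blocks + 1) * channels, channels, kernel_size=1)
        # Feature reconstruction back to a single fused image.
        self.reconstruct = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, ir, vis):
        cascade = torch.cat([ir, vis], dim=1)          # channel cascade processing
        shallow = self.shallow(cascade)                # shallow feature map
        feats, x = [], shallow
        for block in self.deep:                        # depth feature information
            x = block(x)
            feats.append(x)
        integrated = self.global_fusion(torch.cat([shallow] + feats, dim=1))
        return self.reconstruct(integrated)            # fused image
```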
The image fusion model processes the image pair sequentially through the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain the fused image of the image pair. In this process, the intermediate layers of the image fusion model can reuse image features, increasing the reusability and fluidity of the features, which promotes efficient fusion of image features while further reducing the size and computational complexity of the network model.
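One common way to realize this kind of feature reuse is a residual dense block in which every convolution sees the concatenation of all earlier outputs; the block below is such a generic construction, offered as an assumption for illustration rather than the exact aggregated residual dense block of this application.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        # Each layer consumes the original input plus every earlier layer's output.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True))
            for i in range(num_layers)])
        # Local 1x1 fusion back to the block's channel width.
        self.local_fusion = nn.Conv2d(channels + num_layers * growth, channels, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))        # reuse all earlier features
        return x + self.local_fusion(torch.cat(feats, dim=1))   # residual connection
```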
Optionally, the preset loss function includes a structural loss function and a content loss function, and when calculating the difference value between the fused image and the image pair according to the preset loss function, the processor 1010 is further configured to perform weighted average calculation on a first structural loss and a second structural loss through the structural loss function to obtain a structural loss value of the fused image, where the first structural loss is the structural loss between the infrared image and the fused image, and the second structural loss is the structural loss between the visible light image and the fused image; perform weighted average calculation on a first content loss and a second content loss through the content loss function to obtain a content loss value of the fused image, where the first content loss is the content loss between the infrared image and the fused image, and the second content loss is the content loss between the visible light image and the fused image; and perform weighted calculation on the structural loss value and the content loss value of the fused image to obtain a difference value between the fused image and the image pair.
Optionally, the processor 1010 is further configured to perform weighted average calculation on the first structural loss and the second structural loss at the pixel level through the structural loss function to obtain a structural loss value at the pixel level; perform weighted average calculation on the first content loss and the second content loss at the pixel level through the content loss function to obtain a content loss value at the pixel level; and perform weighted calculation on the structural loss value at the pixel level and the content loss value at the pixel level to obtain a pixel-level difference value between the fused image and the image pair.
The embodiment of the application designs two different calculation strategies for the preset loss function, namely a pixel-level loss function and a feature-level loss function, which are suitable for different requirements.
Optionally, the processor 1010 is further configured to extract a depth feature map from each of the infrared image, the visible light image and the fused image by using a preset neural network; calculate a first structural loss at the feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculate a second structural loss at the feature level based on the depth feature map of the visible light image and the depth feature map of the fused image; calculate a first content loss at the feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculate a second content loss at the feature level based on the depth feature map of the visible light image and the depth feature map of the fused image; perform weighted average calculation on the first structural loss at the feature level and the second structural loss at the feature level through the structural loss function to obtain a structural loss value at the feature level; perform weighted average calculation on the first content loss at the feature level and the second content loss at the feature level through the content loss function to obtain a content loss value at the feature level; and perform weighted calculation on the structural loss value at the feature level and the content loss value at the feature level to obtain a feature-level difference value between the fused image and the image pair.
The pixel-level calculation strategy provides a relatively rough estimate of the network loss. To obtain more detailed loss information, the feature-level loss function calculation strategy is preferably adopted in network training so as to improve the precision of the fused image.
Optionally, the deep feature extraction network includes at least two layers of aggregated residual dense blocks for extracting depth feature information, and the processor 1010 is further configured to perform feature fusion on the depth feature information extracted by each layer of aggregated residual dense blocks to obtain a global feature; and perform feature fusion on the global feature and the shallow feature map extracted by the shallow feature extraction network to obtain an integrated image.
Optionally, the processor 1010 is further configured to input an image pair to be fused into the trained image fusion model, where the image pair to be fused includes an infrared image and a visible light image in the same scene; and output a fused image through the trained image fusion model.
The trained image fusion model is an end-to-end model. After the trained image fusion model is obtained, the image pair to be fused is input into the trained image fusion model, and the fused image can then be output by the trained image fusion model, which improves the convenience of the image fusion operation.
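End-to-end use once training has converged might look like the following; here a trivial convolution stands in for the trained fusion model and random tensors stand in for a registered infrared/visible pair, purely to keep the example self-contained.

```python
import torch
import torch.nn as nn

trained = nn.Conv2d(2, 1, kernel_size=3, padding=1)   # placeholder for the trained model
trained.eval()
ir = torch.rand(1, 1, 256, 256)                        # infrared image of the scene
vis = torch.rand(1, 1, 256, 256)                       # visible light image, same scene
with torch.no_grad():
    fused = trained(torch.cat([ir, vis], dim=1))       # fused image in a single call
```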
It should be understood that in the embodiment of the present application, the input Unit 1004 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042, and the Graphics Processing Unit 1041 processes image data of a still picture or a video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 1009 may be used to store software programs as well as various data, including but not limited to application programs and operating systems. Processor 1010 may integrate an application processor that handles primarily operating systems, user interfaces, applications, etc. and a modem processor that handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1010.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above method for training an image fusion model, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the method embodiment for training an image fusion model, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (15)

1. A method of training an image fusion model, the method comprising:
acquiring a training data set, wherein the training data set comprises an image pair, and the image pair comprises an infrared image and a visible light image in the same scene;
inputting the image pair into an initial image fusion model, wherein the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network;
sequentially processing the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain a fusion image of the image pair;
calculating a difference value between the fused image and the image pair according to a preset loss function, and updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image fusion model, wherein the preset loss function is used for calculating a structural loss value and a content loss value.
2. The method of claim 1, wherein the obtaining of the fused image of the image pair through sequential processing of the shallow feature extraction network, the deep feature extraction network, the global feature fusion network, and the feature reconstruction network comprises:
carrying out channel cascade processing on the image pair to obtain a cascade image;
inputting the cascade image into the shallow feature extraction network to extract a shallow feature map;
inputting the shallow feature map into the deep feature extraction network to extract depth feature information;
inputting the shallow feature map and the depth feature information into the global feature fusion network to integrate the shallow feature map and the depth feature information to obtain an integrated image;
and inputting the integrated image into the feature reconstruction network to perform feature reconstruction on the integrated image to obtain a fused image.
3. The method of claim 1, wherein the predetermined loss function comprises a structural loss function and a content loss function, and the calculating the difference value between the fused image and the image pair according to the predetermined loss function comprises:
performing weighted average calculation on a first structural loss and a second structural loss through the structural loss function to obtain a structural loss value of the fused image, wherein the first structural loss is structural loss between the infrared image and the fused image, and the second structural loss is structural loss between the visible light image and the fused image;
performing weighted average calculation on a first content loss and a second content loss through the content loss function to obtain a content loss value of the fused image, wherein the first content loss is the content loss between the infrared image and the fused image, and the second content loss is the content loss between the visible light image and the fused image;
and carrying out weighted calculation on the structure loss value and the content loss value of the fused image to obtain a difference value between the fused image and the image pair.
4. The method of claim 3, wherein the calculating a weighted average of the first structural loss and the second structural loss by the structural loss function to obtain a structural loss value of the fused image comprises:
carrying out weighted average calculation on the first structural loss and the second structural loss of the pixel level through the structural loss function to obtain a structural loss value of the pixel level;
the obtaining a content loss value of the fused image by performing weighted average calculation on the first content loss and the second content loss through the content loss function includes:
performing weighted average calculation on the first content loss and the second content loss of the pixel level through the content loss function to obtain a content loss value of the pixel level;
the weighting calculation of the structure loss value and the content loss value of the fused image to obtain the difference value between the fused image and the image pair includes:
and performing weighted calculation on the structure loss value of the pixel level and the content loss value of the pixel level to obtain a pixel-level difference value between the fusion image and the image pair.
5. The method of claim 3, wherein before calculating the disparity value between the fused image and the image pair according to a preset loss function, further comprising:
extracting a depth feature map from the infrared image, the visible light image and the fusion image respectively by using a preset neural network;
calculating a first structural loss at a feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculating a second structural loss at a feature level based on the depth feature map of the visible light image and the depth feature map of the fused image;
calculating a first content loss at a feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculating a second content loss at a feature level based on the depth feature map of the visible light image and the depth feature map of the fused image;
the obtaining a structural loss value of the fused image by performing weighted average calculation on the first structural loss and the second structural loss through the structural loss function includes:
carrying out weighted average calculation on the first structural loss at the feature level and the second structural loss at the feature level through the structural loss function to obtain a structural loss value at the feature level;
the obtaining a content loss value of the fused image by performing weighted average calculation on the first content loss and the second content loss through the content loss function includes:
carrying out weighted average calculation on the first content loss at the feature level and the second content loss at the feature level through the content loss function to obtain a content loss value at the feature level;
the weighting calculation of the structure loss value and the content loss value of the fused image to obtain the difference value between the fused image and the image pair includes:
and performing weighted calculation on the structure loss value at the feature level and the content loss value at the feature level to obtain a difference value at the feature level between the fusion image and the image pair.
6. The method of claim 2, wherein the deep feature extraction network comprises at least two layers of aggregated residual dense blocks for extracting depth feature information, and the inputting the shallow feature map and the depth feature information into the global feature fusion network for integrating the shallow feature map and the depth feature information to obtain an integrated image comprises:
performing feature fusion on the depth feature information extracted by each layer of aggregated residual dense blocks to obtain global features;
and carrying out feature fusion on the global features and the shallow feature map extracted by the shallow feature extraction network to obtain an integrated image.
7. The method of claim 1, wherein after obtaining the trained image fusion model, further comprising:
inputting an image pair to be fused into the trained image fusion model, wherein the image pair to be fused comprises an infrared image and a visible light image in the same scene;
and outputting a fused image through the trained image fusion model.
8. An apparatus for training an image fusion model, the apparatus comprising:
the data acquisition module is used for acquiring a training data set, wherein the training data set comprises an image pair, and the image pair comprises an infrared image and a visible light image in the same scene;
the data input module is used for inputting the image pair into an initial image fusion model, and the image fusion model is a deep neural network model comprising a shallow feature extraction network, a deep feature extraction network, a global feature fusion network and a feature reconstruction network;
the data processing module is used for sequentially processing the shallow feature extraction network, the deep feature extraction network, the global feature fusion network and the feature reconstruction network to obtain a fusion image of the image pair;
and the parameter adjusting module is used for calculating a difference value between the fused image and the image pair according to a preset loss function, updating the network parameters of the image fusion model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, and obtaining the trained image fusion model, wherein the preset loss function is used for calculating a structural loss value and a content loss value.
9. The apparatus of claim 8, wherein the data processing module comprises:
the cascade processing submodule is used for carrying out channel cascade processing on the image pair to obtain a cascade image;
a shallow extraction submodule, configured to input the cascade image into the shallow feature extraction network to extract a shallow feature map;
a deep extraction sub-module for inputting the shallow feature map into the deep feature extraction network to extract deep feature information;
the feature fusion submodule is used for inputting the shallow feature map and the depth feature information into the global feature fusion network to integrate the shallow feature map and the depth feature information to obtain an integrated image;
and the characteristic reconstruction submodule is used for inputting the integrated image into the characteristic reconstruction network so as to carry out characteristic reconstruction on the integrated image and obtain a fused image.
10. The apparatus of claim 8, wherein the predetermined loss function comprises a structure loss function and a content loss function, and the parameter adjustment module comprises:
a structural loss value calculation submodule, configured to perform weighted average calculation on a first structural loss and a second structural loss through the structural loss function to obtain a structural loss value of the fused image, where the first structural loss is the structural loss between the infrared image and the fused image, and the second structural loss is the structural loss between the visible light image and the fused image;
a content loss value calculation submodule, configured to perform weighted average calculation on a first content loss and a second content loss through the content loss function to obtain a content loss value of the fused image, where the first content loss is the content loss between the infrared image and the fused image, and the second content loss is the content loss between the visible light image and the fused image;
and the difference value calculation submodule is used for performing weighted calculation on the structure loss value and the content loss value of the fused image to obtain a difference value between the fused image and the image pair.
11. The apparatus according to claim 10, wherein the structural loss value calculation submodule is specifically configured to perform a weighted average calculation on the first structural loss and the second structural loss at the pixel level through the structural loss function to obtain a structural loss value at the pixel level;
the content loss value calculation submodule is specifically configured to perform weighted average calculation on the first content loss and the second content loss at the pixel level through the content loss function to obtain a content loss value at the pixel level;
the difference value calculating submodule is specifically configured to perform weighted calculation on the structure loss value at the pixel level and the content loss value at the pixel level to obtain a difference value at the pixel level between the fused image and the image pair.
12. The apparatus of claim 11, further comprising:
the feature extraction module is used for respectively extracting a depth feature map from the infrared image, the visible light image and the fusion image by utilizing a preset neural network;
a first calculation module, configured to calculate a first structural loss at a feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculate a second structural loss at a feature level based on the depth feature map of the visible light image and the depth feature map of the fused image;
a second calculation module, configured to calculate a first content loss at a feature level based on the depth feature map of the infrared image and the depth feature map of the fused image, and calculate a second content loss at a feature level based on the depth feature map of the visible light image and the depth feature map of the fused image;
the structural loss value calculation submodule is specifically configured to perform weighted average calculation on the first structural loss of the feature level and the second structural loss of the feature level through the structural loss function to obtain a structural loss value of the feature level;
the content loss value calculation submodule is specifically configured to perform weighted average calculation on the first content loss of the feature level and the second content loss of the feature level through the content loss function to obtain a content loss value of the feature level;
the difference value calculating submodule is specifically configured to perform weighted calculation on the structure loss value of the feature level and the content loss value of the feature level to obtain a difference value of the feature level between the fusion image and the image pair.
13. The apparatus of claim 9, wherein the deep feature extraction network comprises at least two layers of aggregated residual dense blocks for extracting depth feature information, and wherein the feature fusion sub-module comprises:
the first fusion unit is used for performing feature fusion on the depth feature information extracted by each layer of aggregated residual dense blocks to obtain global features;
and the second fusion unit is used for carrying out feature fusion on the global features and the shallow feature map extracted by the shallow feature extraction network to obtain an integrated image.
14. The apparatus of claim 8, further comprising:
the image input module is used for inputting an image pair to be fused into the trained image fusion model, wherein the image pair to be fused comprises an infrared image and a visible light image in the same scene;
and the image fusion module is used for outputting a fusion image through the trained image fusion model.
15. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the method of training an image fusion model according to any one of claims 1 to 7.
CN202011548642.7A 2020-12-23 2020-12-23 Method and device for training image fusion model and electronic equipment Withdrawn CN112561846A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011548642.7A CN112561846A (en) 2020-12-23 2020-12-23 Method and device for training image fusion model and electronic equipment

Publications (1)

Publication Number Publication Date
CN112561846A true CN112561846A (en) 2021-03-26

Family

ID=75033191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011548642.7A Withdrawn CN112561846A (en) 2020-12-23 2020-12-23 Method and device for training image fusion model and electronic equipment

Country Status (1)

Country Link
CN (1) CN112561846A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919887A (en) * 2019-02-25 2019-06-21 中国人民解放军陆军工程大学 A kind of unsupervised image interfusion method based on deep learning
CN111709902A (en) * 2020-05-21 2020-09-25 江南大学 Infrared and visible light image fusion method based on self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龙勇志: "红外与可见光图像配准与融合算法研究", 中国优秀硕士学位论文全文数据库信息科技辑, no. 07, pages 59 - 78 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112439A (en) * 2021-04-14 2021-07-13 展讯半导体(南京)有限公司 Image fusion method, training method, device and equipment of image fusion model
CN113255761A (en) * 2021-05-21 2021-08-13 深圳共形咨询企业(有限合伙) Feedback neural network system, training method and device thereof, and computer equipment
CN113379661A (en) * 2021-06-15 2021-09-10 中国工程物理研究院流体物理研究所 Infrared and visible light image fused double-branch convolution neural network and fusion method
CN113379661B (en) * 2021-06-15 2023-03-07 中国工程物理研究院流体物理研究所 Double-branch convolution neural network device for fusing infrared and visible light images
CN114821682A (en) * 2022-06-30 2022-07-29 广州脉泽科技有限公司 Multi-sample mixed palm vein identification method based on deep learning algorithm
CN115631428A (en) * 2022-11-01 2023-01-20 西南交通大学 Unsupervised image fusion method and system based on structural texture decomposition
CN115631428B (en) * 2022-11-01 2023-08-11 西南交通大学 Unsupervised image fusion method and system based on structural texture decomposition
CN117152218A (en) * 2023-08-08 2023-12-01 正泰集团研发中心(上海)有限公司 Image registration method, image registration device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
CN112561846A (en) Method and device for training image fusion model and electronic equipment
CN111667399B (en) Training method of style migration model, video style migration method and device
CN112750140B (en) Information mining-based disguised target image segmentation method
CN112597941B (en) Face recognition method and device and electronic equipment
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
Zhang et al. Saliency detection based on self-adaptive multiple feature fusion for remote sensing images
CN111340077B (en) Attention mechanism-based disparity map acquisition method and device
US11935213B2 (en) Laparoscopic image smoke removal method based on generative adversarial network
Liu et al. Image de-hazing from the perspective of noise filtering
Xiao et al. Single image dehazing based on learning of haze layers
CN114463677B (en) Safety helmet wearing detection method based on global attention
Sun et al. Underwater image enhancement with reinforcement learning
CN112561973A (en) Method and device for training image registration model and electronic equipment
CN111047543A (en) Image enhancement method, device and storage medium
US20220157046A1 (en) Image Classification Method And Apparatus
Su et al. Prior guided conditional generative adversarial network for single image dehazing
Li et al. Single image dehazing with an independent detail-recovery network
Ke et al. Haze removal from a single remote sensing image based on a fully convolutional neural network
CN103839244A (en) Real-time image fusion method and device
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Wang et al. INSPIRATION: A reinforcement learning-based human visual perception-driven image enhancement paradigm for underwater scenes
Shao et al. Joint facial action unit recognition and self-supervised optical flow estimation
Campana et al. Variable-hyperparameter visual transformer for efficient image inpainting
Neelima et al. Optimal clustering based outlier detection and cluster center initialization algorithm for effective tone mapping

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210326)