CN117808691A - Image fusion method based on difference significance aggregation and joint gradient constraint

Info

Publication number
CN117808691A
Authority
CN
China
Prior art keywords
image
fusion
gradient
map
saliency
Prior art date
Legal status
Pending
Application number
CN202311705681.7A
Other languages
Chinese (zh)
Inventor
李璇
王杰
陈荣富
冯昭明
张国敏
丁一凡
程莉
Current Assignee
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date
Filing date
Publication date
Application filed by Wuhan Institute of Technology filed Critical Wuhan Institute of Technology
Priority to CN202311705681.7A
Publication of CN117808691A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image fusion method based on difference saliency aggregation and joint gradient constraint, which comprises the following steps: the infrared and visible light source images are input into a fusion network and integrated through a region aggregation strategy to obtain a difference joint saliency map; the difference joint saliency map and the two source images are input together into a feature fusion sub-network, and the features are reconstructed through convolution to obtain a primary fusion image; a joint gradient map containing the complementary texture information of the source images is constructed by a two-channel gradient aggregation module in the fusion network; in the generator of the fusion network, the content loss between the primary fusion image and the infrared and visible light source images is calculated; in the discriminator of the fusion network, the adversarial loss between the joint gradient map and the gradient map of the primary fusion image is calculated; the content loss and the adversarial loss are used together to train the fusion network and generate the final fusion image. The invention realizes the fusion of infrared and visible light images.

Description

Image fusion method based on difference significance aggregation and joint gradient constraint
Technical Field
The invention belongs to the field of image processing, and particularly relates to an image fusion method based on difference saliency aggregation and joint gradient constraint.
Background
The image fusion technology synthesizes the data acquired by a plurality of sensors or shooting conditions in the same scene to form an image with information of different descriptions on the same scene, so that the computer can conveniently further recognize and process the image. In the task of fusing infrared and visible light images, problems such as distortion and blurring are inevitably caused in the final obtained image due to the influence of hardware conditions of imaging equipment, interference in imaging and transmission processes, natural environment and other factors. The image fusion technology can well solve the defect of insufficient single image information, improves the richness of image content, enables the expression of pictures to be more abundant, reduces the redundancy of information, and achieves better visual effect.
The purpose of image fusion is to obtain a higher visual effect image. However, due to technical conditions and environmental factors, background high-brightness information such as smoke, strong light, and night strong light like fog mist is generally retained in the visible light image. Once there is a situation in the visible image where the object is blocked by the background highlighting information, the texture information and the integrity of the infrared intensity features of the foreground object of the fused image are inevitably affected. Meanwhile, it is difficult to preserve meaningful background highlighting information while displaying the complete foreground object in the fused image. If the fusion process cannot retain meaningful image content, redundant information in the fused image can impair the expression of effective information.
Disclosure of Invention
The invention aims to provide an image fusion method based on difference significance aggregation and joint gradient constraint, which realizes infrared and visible light image fusion.
In order to solve the technical problems, the technical scheme of the invention is as follows: an image fusion method based on difference saliency aggregation and joint gradient constraint comprises the following steps:
S1, inputting the infrared and visible light source images into a fusion network; the fusion network comprises a generator, a two-channel gradient aggregation module and a discriminator, and the generator comprises a saliency difference perception aggregation sub-network and a feature fusion sub-network; the two source images are first input into the saliency difference perception aggregation sub-network, which comprises a multi-scale stereoscopic attention module and a region aggregation strategy; the attention information of the salient regions of the infrared and visible light source images is extracted by the multi-scale stereoscopic attention module, and a difference joint saliency map is obtained by integration through the region aggregation strategy;
S2, inputting the generated difference joint saliency map and the two source images into the feature fusion sub-network; the feature fusion sub-network comprises chain-structured gradient residual modules with skip connections; the sequentially connected gradient residual modules extract features from the difference joint saliency map and the infrared and visible light source images, and the features are then reconstructed through convolution to obtain a primary fusion image;
S3, constructing a joint gradient map containing the complementary texture information of the source images through the two-channel gradient aggregation module in the fusion network; inputting the joint gradient map and the gradient map of the primary fusion image into the discriminator of the fusion network to enhance the texture details of the primary fusion image;
S4, in the generator of the fusion network, respectively calculating the content loss between the primary fusion image and the infrared and visible light source images; in the discriminator of the fusion network, calculating the adversarial loss between the joint gradient map and the gradient map of the primary fusion image; the content loss and the adversarial loss are used together to train the fusion network, so that the image fusion effect is optimized; when the number of training rounds reaches the preset number, the network training is completed and the final fusion image is generated.
The step S1 specifically comprises the following steps:
S11, respectively inputting the infrared image and the visible light image into the saliency difference perception aggregation sub-network; the spatial and channel attention information of the salient regions of the two source images at different scales is perceived by the multi-scale stereoscopic attention module to obtain a saliency feature map for each source image;
S12, integrating the saliency feature maps through the region aggregation strategy to obtain the difference joint saliency map I_mask.
The step S2 is specifically as follows:
S21, in the feature fusion sub-network, multiplying the difference joint saliency map I_mask element by element with the infrared and visible light images respectively to obtain a salient target region map I_t and a salient background region map I_d; for I_t and the infrared image, and for I_d and the visible light image, extracting shallow features on two parallel feature extraction branches using a 3x3 convolution layer; in each parallel feature extraction branch, concatenating the extracted shallow features along the channel dimension and further extracting deep features with the chain-structured gradient residual modules with skip connections in the feature fusion sub-network;
S22, the concatenated shallow features are further processed by the sequentially connected gradient residual modules to extract deep features, and adjacent gradient residual modules are linked by skip connections to avoid the loss of context information; in each gradient residual module, the main stream extracts deep features through two densely connected 3x3 convolutions, while the residual stream performs a gradient operation with the Sobel operator to retain the fine-grained features of the source images; the deep features from the main stream and the fine-grained features from the residual stream are concatenated along the channel dimension to integrate the deep and fine-grained information; the features extracted by the gradient residual modules are then reconstructed through four 3x3 convolution layers, where the final layer uses a Tanh activation function to obtain the primary fusion image and all other layers use BN normalization and LReLU activation functions;
the step S3 is specifically as follows:
s31, constructing a joint gradient map through a two-channel gradient aggregation module: firstly, respectively calculating gradients of infrared and visible light channels through Sobel operators; then integrating the gradient information of the infrared and visible light channels through a double-channel aggregation strategy to obtain a combined gradient map I containing source image complementarity texture information grad This process is expressed as:
wherein,representing gradient operation, in particular calculating gradient by a sobel operator; abs (-) represents absolute value operations and max (-) represents the maximum selection policy at pixel level;
s32, inputting the combined gradient map with the complementary texture characteristics and the gradient map of the primary fusion image into a discriminator. The discriminator is a four-layer network structure, the first three layers use a convolution kernel of 3x3, the last layer is a full-connection layer, and a sigmoid function is used for outputting discrimination probability; the discriminator is used for calculating the similarity degree of texture information between the gradient map of the combined gradient map and the gradient map of the primary fusion image, so as to further describe texture details of foreground objects and background semantic information in the primary fusion image.
The step S4 specifically comprises the following steps:
S41, respectively calculating, in the generator, the content loss between the primary fusion image and the two source images; the content loss includes at least a pixel intensity loss, a structural similarity loss and the adversarial loss of the generator;
S42, in the discriminator, calculating the adversarial loss between the joint gradient map and the gradient map of the primary fusion image; this loss and the content loss are used together to train the network and optimize the image fusion effect; the total number of training rounds is preset, and when the number of training rounds reaches the preset total, the network training is completed and the final fusion image is generated.
The working mechanism of the multi-scale stereoscopic attention module in step S11 is as follows:
In this module, the input image first passes through a 3x3 depthwise separable convolution to obtain the general features F_0 of each channel; F_0 is then divided into features of different scales by 3x3 depthwise separable convolutions with different dilation rates; the attention weights of the different branches of the multi-scale features in the spatial and channel dimensions are calculated through the stereoscopic attention, whose calculation formula is expressed as:
v = c ⊗ s
where v denotes the stereoscopic attention weight, c and s denote the channel attention and the spatial attention respectively, and ⊗ denotes element-by-element multiplication;
the stereoscopic attention weights are processed by a softmax function to obtain the final stereoscopic attention weights, which are used to integrate the features of different scales; the fused feature F is passed through a 1x1 convolution to adjust the number of channels and then summed with the input image to obtain the saliency feature map of the corresponding source image.
The step S12 is specifically as follows:
The features of the saliency feature maps corresponding to the infrared and visible light images are integrated through the region aggregation strategy: all salient regions of the source images are integrated by pixel-level maximum selection:
I_joint(i, j) = max(I_ir_mask(i, j), I_vi_mask(i, j))
where I_joint(i, j) denotes the pixel value at position (i, j) of the integrated saliency map, and I_ir_mask(i, j) and I_vi_mask(i, j) denote the pixel values at position (i, j) of the saliency maps of the infrared and visible light images respectively;
the integrated saliency feature map is binarized through the OTSU threshold segmentation method to obtain the difference joint saliency map, expressed as:
I_mask = Threshold_OTSU(I_joint)
where Threshold_OTSU(·) denotes the OTSU threshold segmentation method.
The step S41 specifically comprises the following steps:
The content loss of the generator includes the pixel intensity loss, the similarity loss and the adversarial loss of the generator, denoted L_int, L_sim and L_adv respectively; wherein,
the pixel intensity loss L_int is defined as:
L_int = (1/(H·W)) · ||I_f - max(I_ir, I_vi)||_1
where H and W are the height and width of the image respectively, I_f is the primary fusion image, ||·||_1 denotes the L1 norm, and max(·) denotes the pixel-level maximum selection; the intensity loss constrains the pixel intensity distribution of the fusion image by integrating the pixel intensity distributions of the infrared and visible light images through the maximum selection strategy;
the similarity loss L_sim is defined in terms of SSIM(I_f, I_ir) and SSIM(I_f, I_vi), the structural similarity measures between the fusion image I_f and the infrared image I_ir and the visible light image I_vi respectively, and keeps the overall structure of the fusion image consistent with the source images;
the adversarial loss L_adv of the generator is defined in terms of the discriminator outputs, where c is the probability label with value 1, D(·) denotes the output of the discriminator, n denotes the n-th image and N denotes the total number of images;
the generator content loss L_G constructed for the image fusion task is defined as:
L_G = λ_1·L_int + λ_2·L_sim + λ_3·L_adv
where λ_1, λ_2 and λ_3 are the weights controlling each loss term;
the step S42 specifically includes:
s42, calculating the similarity between the gradient map of the fusion image and the combined gradient map through the antagonism loss function of the discriminator so as to achieve the purpose of reducing the difference of texture features between the primary fusion image and the combined gradient map; antagonistic loss function of arbiterThe definition is as follows:
wherein a is a label of a gradient map of the fusion image, and the value is 0; b is a label of the joint gradient map, and the value is 1; d (-) represents the output result of the discriminator; n represents the nth image and N represents the total number of images.
And presetting the total training round number as E, and when the training round number reaches E, finishing the network training to generate a final fusion image.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a novel generative adversarial network based on difference saliency aggregation and joint gradient constraint to realize the fusion of infrared and visible light images. The method coordinates well the relationship between the complete representation of the target and the retention of meaningful highlight information. The generated fusion image has rich semantic information, which helps to meet the requirements of high-level vision tasks.
(2) The invention designs a saliency difference perception aggregation sub-network, in which a multi-scale stereoscopic attention mechanism is used to perceive the spatial location information and channel attention information of the salient regions of source images of different modalities, and the region aggregation strategy integrates the differences between the salient regions of the different modalities to construct the difference joint saliency map. The difference joint saliency map effectively alleviates the difficulty of retaining target features blocked by background highlight information.
(3) The invention designs chain-structured gradient residual modules with skip connections, in which the gradient residual modules are connected sequentially to enhance the extraction of image target features and fine-grained information, and skip connections are added between adjacent gradient residual modules to avoid the loss of context information. In each gradient residual module, the main stream uses dense connections to enhance the reuse of network features, and the residual stream uses a gradient operator to improve the description of fine-grained information.
(4) The two-channel gradient aggregation module strengthens the connection between the gradient information of the infrared image and that of the visible light image, and generates a joint gradient map containing the complementary texture information of the source images. Under the adversarial constraint of the joint gradient map, the fusion image can effectively display rich texture details covering both foreground targets and background semantic information.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-scale stereoscopic attention module structure according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a converged network architecture in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a gradient residual module structure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a network structure of a arbiter according to an embodiment of the present invention;
fig. 7 is a fusion effect diagram of an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Referring to fig. 1 to 7, the technical scheme of the invention is as follows:
an image fusion method based on difference significance aggregation and joint gradient constraint comprises the following steps:
1. The infrared and visible light source images are input into the constructed fusion network. The fusion network mainly comprises a generator, a two-channel gradient aggregation module and a discriminator, and the generator consists of a saliency difference perception aggregation sub-network and a feature fusion sub-network. The two source images are first input into the saliency difference perception aggregation sub-network, which comprises a multi-scale stereoscopic attention module and a region aggregation strategy; the attention information of the salient regions of each image is extracted by the multi-scale stereoscopic attention module, and the difference joint saliency map is obtained by integration through the region aggregation strategy;
(1) The infrared and visible light images are respectively input into the saliency difference perception aggregation sub-network. First, the multi-scale stereoscopic attention module is used to perceive the channel and spatial position information of the salient regions of the two source images and obtain the saliency feature map of the corresponding source image:
In the multi-scale stereoscopic attention module, the input image is first passed through a 3x3 depthwise separable convolution (DSConv3x3) to obtain the general features F_0 of each channel. On this basis, F_0 is divided into features of different scales by DSConv3x3 with different dilation rates. To make full use of the correlation of the multi-scale features in the spatial and channel dimensions, the stereoscopic attention is used to calculate the attention weights of the different branches. The calculation formula of the stereoscopic attention is:
v = c ⊗ s
where v denotes the stereoscopic attention weight, c and s denote the channel attention and the spatial attention respectively, and ⊗ denotes element-by-element multiplication. The final stereoscopic attention weights are obtained through a softmax function and used to integrate the features of different scales. The fused feature F is passed through a 1x1 convolution to adjust the number of channels and then summed with the image features to obtain the final output result.
(2) The saliency feature maps are integrated through a simple and effective region aggregation strategy to obtain the difference joint saliency map I_mask. The process is as follows:
The saliency feature maps corresponding to the infrared and visible light images are integrated by region aggregation. First, all salient regions of the source images are integrated by pixel-level maximum selection:
I_joint(i, j) = max(I_ir_mask(i, j), I_vi_mask(i, j))
where I_joint(i, j) denotes the pixel value at position (i, j) of the integrated saliency map, and I_ir_mask(i, j) and I_vi_mask(i, j) are the pixel values at position (i, j) of the saliency maps of the infrared and visible light images respectively. Then the integrated saliency feature map is binarized by OTSU threshold segmentation to obtain the final difference joint saliency map:
I_mask = Threshold_OTSU(I_joint)
where Threshold_OTSU(·) denotes the OTSU threshold segmentation method.
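A minimal NumPy/OpenCV sketch of this aggregation step is given below, assuming the two saliency feature maps are single-channel float arrays in [0, 1]; the scaling to 8-bit before cv2.threshold is an implementation detail, not part of the described method.

```python
# Sketch of pixel-level maximum selection followed by OTSU binarization.
import cv2
import numpy as np


def difference_joint_saliency(ir_mask: np.ndarray, vi_mask: np.ndarray) -> np.ndarray:
    # pixel-level maximum selection: I_joint(i, j) = max(I_ir_mask(i, j), I_vi_mask(i, j))
    joint = np.maximum(ir_mask, vi_mask)
    joint_u8 = np.clip(joint * 255.0, 0, 255).astype(np.uint8)
    # OTSU threshold segmentation binarizes the integrated map into I_mask
    _, mask = cv2.threshold(joint_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return (mask > 0).astype(np.float32)    # binary difference joint saliency map
```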
2. The generated difference joint saliency map and the two source images are input into the feature fusion sub-network. The feature fusion sub-network comprises chain-structured gradient residual modules with skip connections: the sequentially connected gradient residual modules extract features from the difference joint saliency map and the infrared and visible light source images, adjacent modules are linked by skip connections to avoid the loss of context information, and the features are reconstructed through convolution to obtain the primary fusion image. The process is as follows:
In the feature fusion sub-network, the difference joint saliency map I_mask is multiplied element by element with the infrared and visible light images respectively to obtain the salient target region map I_t and the salient background region map I_d; for I_t and the infrared image, and for I_d and the visible light image, shallow features are extracted on two parallel feature extraction branches using a 3x3 convolution layer; in each parallel branch the extracted shallow features are concatenated along the channel dimension, and deep features are further extracted by the chain-structured gradient residual modules with skip connections in the feature fusion sub-network.
The concatenated shallow features are further processed by the sequentially connected gradient residual modules, and adjacent gradient residual modules are linked by skip connections to avoid losing context information. In each gradient residual module, the main stream extracts deep features through two densely connected 3x3 convolutions, while the residual stream performs a gradient operation with the Sobel operator to retain the fine-grained features of the source images; the deep features from the main stream and the fine-grained features from the residual stream are concatenated along the channel dimension to integrate the deep and fine-grained information. The features extracted by the gradient residual modules are then reconstructed through four 3x3 convolution layers, where the final layer uses a Tanh activation function to obtain the primary fusion image and all other layers use BN normalization and LReLU activation functions.
3. A joint gradient map containing the complementary texture information of the source images is constructed by the two-channel gradient aggregation module in the fusion network; the joint gradient map and the gradient map of the primary fusion image are input into the discriminator of the fusion network to enhance the texture details of the primary fusion image. The specific steps are as follows:
The joint gradient map is constructed by the two-channel gradient aggregation module: first, the gradients of the infrared and visible light channels are calculated separately with the Sobel operator; then the gradient information of the two channels is integrated through the two-channel aggregation strategy to obtain the joint gradient map I_grad containing the complementary texture information of the source images. This process is expressed as:
I_grad = max(abs(∇I_ir), abs(∇I_vi))
where ∇ denotes the gradient operation, calculated with the Sobel operator, abs(·) denotes the absolute value operation and max(·) denotes the pixel-level maximum selection strategy.
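A minimal sketch of this two-channel aggregation, assuming single-channel float inputs and a 3x3 Sobel kernel, is:

```python
# Sketch of the two-channel gradient aggregation: I_grad = max(abs(∇I_ir), abs(∇I_vi)).
import cv2
import numpy as np


def joint_gradient_map(ir: np.ndarray, vi: np.ndarray) -> np.ndarray:
    def sobel_mag(img):
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
        return np.abs(gx) + np.abs(gy)      # abs(∇I): absolute Sobel response
    # pixel-level maximum keeps the stronger texture from either modality
    return np.maximum(sobel_mag(ir), sobel_mag(vi))
```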
the combined gradient map with the complementary texture features and the gradient map of the preliminary fusion image are input together into a arbiter. The discriminator is a four-layer network structure, the first three layers use a convolution kernel of 3x3, the last layer is a full-connection layer, and a sigmoid function is used for outputting discrimination probability; the discriminator is used for calculating the similarity degree of texture information between the gradient map of the combined gradient map and the gradient map of the primary fusion image, so as to further describe texture details of foreground objects and background semantic information in the primary fusion image.
4. In the generator of the fusion network, the content loss between the generated primary fusion image and the two source images is calculated; in the discriminator, the adversarial loss between the joint gradient map and the gradient map of the primary fusion image is calculated; the two losses are used together to train the fusion network and optimize the image fusion effect. When the number of training rounds reaches the preset number, the network training is completed and the final fusion image is generated. The specific steps are as follows:
(1) The content loss of the generator includes the pixel intensity loss, the similarity loss and the adversarial loss of the generator, denoted L_int, L_sim and L_adv. The pixel intensity loss L_int is defined as:
L_int = (1/(H·W)) · ||I_f - max(I_ir, I_vi)||_1
where H and W are the height and width of the image respectively, I_f is the primary fusion image, ||·||_1 denotes the L1 norm, and max(·) denotes the pixel-level maximum selection. The intensity loss constrains the pixel intensity distribution of the fusion image by integrating the pixel intensity distributions of the infrared and visible light images through the maximum selection strategy.
The role of the similarity loss is to keep the overall structure of the fusion image consistent with the source images. The similarity loss L_sim is defined in terms of SSIM(I_1, I_2), the structural similarity measure between two images I_1 and I_2, evaluated between the fusion image I_f and each of the infrared image I_ir and the visible light image I_vi.
The objective of the adversarial loss of the generator is to make the fusion image retain more of the texture information of the source images. The adversarial loss L_adv of the generator is defined in terms of the discriminator outputs, where c is the probability label with value 1, D(·) denotes the output of the discriminator, n denotes the n-th image and N denotes the total number of images.
The generator content loss L_G constructed for the image fusion task is defined as:
L_G = λ_1·L_int + λ_2·L_sim + λ_3·L_adv
where λ_1, λ_2 and λ_3 are the weights controlling each loss term.
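As a sketch only, the following Python code assembles a generator content loss of this form. The exact similarity and adversarial terms are assumptions (a symmetric 1 - SSIM term against each source image and a least-squares penalty toward the label c = 1), as the text above fixes only the intensity term and the weighted sum; the ssim argument stands for any SSIM implementation (e.g. pytorch_msssim.ssim) passed in by the caller, and the default weights are placeholders.

```python
# Sketch of L_G = λ1·L_int + λ2·L_sim + λ3·L_adv under the stated assumptions.
import torch


def generator_content_loss(fused, ir, vi, d_out, ssim, lambdas=(10.0, 1.0, 0.1)):
    l1, l2, l3 = lambdas                                   # λ1, λ2, λ3 (assumed values)
    # L_int: L1 distance to the pixel-wise maximum of the two source images
    l_int = torch.mean(torch.abs(fused - torch.maximum(ir, vi)))
    # L_sim: structural similarity against both source images (assumed symmetric form)
    l_sim = (1 - ssim(fused, ir)) + (1 - ssim(fused, vi))
    # L_adv: push the discriminator output on the fused gradient map toward c = 1
    c = torch.ones_like(d_out)
    l_adv = torch.mean((d_out - c) ** 2)
    return l1 * l_int + l2 * l_sim + l3 * l_adv
```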
(2) The similarity between the gradient map of the fusion image and the joint gradient map is calculated through the adversarial loss function of the discriminator, so as to reduce the difference in texture features between the primary fusion image and the joint gradient map. The adversarial loss function L_D_adv of the discriminator is defined in terms of the discriminator outputs, where a is the label of the gradient map of the fusion image with value 0, b is the label of the joint gradient map with value 1, D(·) denotes the output of the discriminator, n denotes the n-th image and N denotes the total number of images.
The total number of training rounds is preset to E; when the number of training rounds reaches E, the network training is completed and the final fusion image is generated.
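A corresponding sketch of the discriminator loss, again assuming a least-squares form around the stated labels a = 0 and b = 1, is:

```python
# Sketch of L_D_adv with labels a = 0 (fused-image gradient map) and b = 1 (joint gradient map).
import torch


def discriminator_adv_loss(d_fused_grad, d_joint_grad, a=0.0, b=1.0):
    # push D toward a on fused-image gradient maps and toward b on joint gradient maps
    loss_fake = torch.mean((d_fused_grad - a) ** 2)
    loss_real = torch.mean((d_joint_grad - b) ** 2)
    return loss_fake + loss_real
```

In such a setup the discriminator and the generator would be updated alternately with their respective losses for the preset number of training rounds E.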
Fig. 7 shows the fusion results of the invention on the test images. The results show that the invention can well preserve targets blocked by highlight information while retaining the meaningful highlight information in the image, thereby enriching the semantic information of the fusion image.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. An image fusion method based on difference significance aggregation and joint gradient constraint is characterized by comprising the following steps:
S1, inputting the infrared and visible light source images into a fusion network; the fusion network comprises a generator, a two-channel gradient aggregation module and a discriminator, and the generator comprises a saliency difference perception aggregation sub-network and a feature fusion sub-network; the two source images are first input into the saliency difference perception aggregation sub-network, which comprises a multi-scale stereoscopic attention module and a region aggregation strategy; the attention information of the salient regions of the infrared and visible light source images is extracted by the multi-scale stereoscopic attention module, and a difference joint saliency map is obtained by integration through the region aggregation strategy;
S2, inputting the generated difference joint saliency map and the two source images into the feature fusion sub-network; the feature fusion sub-network comprises chain-structured gradient residual modules with skip connections; the sequentially connected gradient residual modules extract features from the difference joint saliency map and the infrared and visible light source images, and the features are then reconstructed through convolution to obtain a primary fusion image;
S3, constructing a joint gradient map containing the complementary texture information of the source images through the two-channel gradient aggregation module in the fusion network; inputting the joint gradient map and the gradient map of the primary fusion image into the discriminator of the fusion network to enhance the texture details of the primary fusion image;
S4, in the generator of the fusion network, respectively calculating the content loss between the primary fusion image and the infrared and visible light source images; in the discriminator of the fusion network, calculating the adversarial loss between the joint gradient map and the gradient map of the primary fusion image; the content loss and the adversarial loss are used together to train the fusion network, so that the image fusion effect is optimized; when the number of training rounds reaches the preset number, the network training is completed and the final fusion image is generated.
2. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 1, wherein S1 specifically is:
S11, respectively inputting the infrared image and the visible light image into the saliency difference perception aggregation sub-network; the spatial and channel attention information of the salient regions of the two source images at different scales is perceived by the multi-scale stereoscopic attention module to obtain a saliency feature map for each source image;
S12, integrating the saliency feature maps through the region aggregation strategy to obtain the difference joint saliency map I_mask.
3. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 2, wherein S2 is specifically:
S21, in the feature fusion sub-network, multiplying the difference joint saliency map I_mask element by element with the infrared and visible light images respectively to obtain a salient target region map I_t and a salient background region map I_d; for I_t and the infrared image, and for I_d and the visible light image, extracting shallow features on two parallel feature extraction branches using a 3x3 convolution layer; in each parallel feature extraction branch, concatenating the extracted shallow features along the channel dimension and further extracting deep features with the chain-structured gradient residual modules with skip connections in the feature fusion sub-network;
S22, the concatenated shallow features are further processed by the sequentially connected gradient residual modules to extract deep features, and adjacent gradient residual modules are linked by skip connections to avoid the loss of context information; in each gradient residual module, the main stream extracts deep features through two densely connected 3x3 convolutions, while the residual stream performs a gradient operation with the Sobel operator to retain the fine-grained features of the source images; the deep features from the main stream and the fine-grained features from the residual stream are concatenated along the channel dimension to integrate the deep and fine-grained information; the features extracted by the gradient residual modules are then reconstructed through four 3x3 convolution layers, where the final layer uses a Tanh activation function to obtain the primary fusion image and all other layers use BN normalization and LReLU activation functions.
4. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 3, wherein S3 specifically is:
S31, constructing the joint gradient map through the two-channel gradient aggregation module: first, the gradients of the infrared and visible light channels are calculated separately with the Sobel operator; then the gradient information of the two channels is integrated through the two-channel aggregation strategy to obtain the joint gradient map I_grad containing the complementary texture information of the source images, and this process is expressed as:
I_grad = max(abs(∇I_ir), abs(∇I_vi))
where ∇ denotes the gradient operation, calculated with the Sobel operator, abs(·) denotes the absolute value operation and max(·) denotes the pixel-level maximum selection strategy;
S32, inputting the joint gradient map with the complementary texture features and the gradient map of the primary fusion image into the discriminator; the discriminator is a four-layer network in which the first three layers use 3x3 convolution kernels and the last layer is a fully connected layer whose output passes through a sigmoid function to give the discrimination probability; the discriminator measures the similarity of texture information between the joint gradient map and the gradient map of the primary fusion image, so as to further refine the texture details of the foreground targets and the background semantic information in the primary fusion image.
5. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 4, wherein S4 is specifically:
S41, respectively calculating, in the generator, the content loss between the primary fusion image and the two source images; the content loss includes at least a pixel intensity loss, a structural similarity loss and the adversarial loss of the generator;
S42, in the discriminator, calculating the adversarial loss between the joint gradient map and the gradient map of the primary fusion image; this loss and the content loss are used together to train the network and optimize the image fusion effect; the total number of training rounds is preset, and when the number of training rounds reaches the preset total, the network training is completed and the final fusion image is generated.
6. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 2, wherein the working mechanism of the multi-scale stereoscopic attention module in S11 is as follows:
in this module, the input image first passes through a 3x3 depthwise separable convolution to obtain the general features F_0 of each channel; F_0 is then divided into features of different scales by 3x3 depthwise separable convolutions with different dilation rates; the attention weights of the different branches of the multi-scale features in the spatial and channel dimensions are calculated through the stereoscopic attention, whose calculation formula is expressed as:
v = c ⊗ s
where v denotes the stereoscopic attention weight, c and s denote the channel attention and the spatial attention respectively, and ⊗ denotes element-by-element multiplication;
the stereoscopic attention weights are processed by a softmax function to obtain the final stereoscopic attention weights, which are used to integrate the features of different scales; the fused feature F is passed through a 1x1 convolution to adjust the number of channels and then summed with the input image to obtain the saliency feature map of the corresponding source image.
7. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 2, wherein S12 is specifically:
the features of the saliency feature maps corresponding to the infrared and visible light images are integrated through the region aggregation strategy: all salient regions of the source images are integrated by pixel-level maximum selection:
I_joint(i, j) = max(I_ir_mask(i, j), I_vi_mask(i, j))
where I_joint(i, j) denotes the pixel value at position (i, j) of the integrated saliency feature map, and I_ir_mask(i, j) and I_vi_mask(i, j) denote the pixel values at position (i, j) of the saliency maps of the infrared and visible light images respectively;
the integrated saliency feature map is binarized through the OTSU threshold segmentation method to obtain the difference joint saliency map, expressed as:
I_mask = Threshold_OTSU(I_joint)
where Threshold_OTSU(·) denotes the OTSU threshold segmentation method.
8. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 5, wherein S41 specifically is:
the content loss of the generator includes the pixel intensity loss, the similarity loss and the adversarial loss of the generator, denoted L_int, L_sim and L_adv respectively; wherein,
the pixel intensity loss L_int is defined as:
L_int = (1/(H·W)) · ||I_f - max(I_ir, I_vi)||_1
where H and W are the height and width of the image respectively, I_f is the primary fusion image, ||·||_1 denotes the L1 norm, and max(·) denotes the pixel-level maximum selection; the intensity loss constrains the pixel intensity distribution of the fusion image by integrating the pixel intensity distributions of the infrared and visible light images through the maximum selection strategy;
the similarity loss L_sim is defined in terms of SSIM(I_f, I_ir) and SSIM(I_f, I_vi), the structural similarity measures between the fusion image I_f and the infrared image I_ir and the visible light image I_vi respectively;
the adversarial loss L_adv of the generator is defined in terms of the discriminator outputs, where c is the probability label with value 1, D(·) denotes the output of the discriminator, n denotes the n-th image and N denotes the total number of images;
the generator content loss L_G constructed for the image fusion task is defined as:
L_G = λ_1·L_int + λ_2·L_sim + λ_3·L_adv
where λ_1, λ_2 and λ_3 are the weights controlling each loss term.
9. The image fusion method based on difference saliency aggregation and joint gradient constraint according to claim 5, wherein S42 specifically is:
calculating the similarity between the gradient map of the fusion image and the joint gradient map through the adversarial loss function of the discriminator, so as to reduce the difference in texture features between the primary fusion image and the joint gradient map; the adversarial loss function L_D_adv of the discriminator is defined in terms of the discriminator outputs, where a is the label of the gradient map of the fusion image with value 0, b is the label of the joint gradient map with value 1, D(·) denotes the output of the discriminator, n denotes the n-th image and N denotes the total number of images;
the total number of training rounds is preset to E, and when the number of training rounds reaches E, the network training is completed and the final fusion image is generated.
CN202311705681.7A 2023-12-12 2023-12-12 Image fusion method based on difference significance aggregation and joint gradient constraint Pending CN117808691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311705681.7A CN117808691A (en) 2023-12-12 2023-12-12 Image fusion method based on difference significance aggregation and joint gradient constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311705681.7A CN117808691A (en) 2023-12-12 2023-12-12 Image fusion method based on difference significance aggregation and joint gradient constraint

Publications (1)

Publication Number Publication Date
CN117808691A true CN117808691A (en) 2024-04-02

Family

ID=90428928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311705681.7A Pending CN117808691A (en) 2023-12-12 2023-12-12 Image fusion method based on difference significance aggregation and joint gradient constraint

Country Status (1)

Country Link
CN (1) CN117808691A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097363A (en) * 2024-04-28 2024-05-28 南昌大学 Face image generation and recognition method and system based on near infrared imaging


Similar Documents

Publication Publication Date Title
US11562498B2 (en) Systems and methods for hybrid depth regularization
Srinivasan et al. Aperture supervision for monocular depth estimation
DE102019130889A1 (en) ESTIMATE THE DEPTH OF A VIDEO DATA STREAM TAKEN BY A MONOCULAR RGB CAMERA
Liu et al. Image de-hazing from the perspective of noise filtering
Xiao et al. Single image dehazing based on learning of haze layers
CN117808691A (en) Image fusion method based on difference significance aggregation and joint gradient constraint
Goncalves et al. Deepdive: An end-to-end dehazing method using deep learning
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
Li et al. Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module
Zhuang et al. A dense stereo matching method based on optimized direction-information images for the real underwater measurement environment
Cho et al. Event-image fusion stereo using cross-modality feature propagation
CN114372931A (en) Target object blurring method and device, storage medium and electronic equipment
Wang et al. Agcyclegan: Attention-guided cyclegan for single underwater image restoration
CN113763300A (en) Multi-focus image fusion method combining depth context and convolution condition random field
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN116958393A (en) Incremental image rendering method and device
CN116883303A (en) Infrared and visible light image fusion method based on characteristic difference compensation and fusion
Li et al. Single image depth estimation using edge extraction network and dark channel prior
Kumar et al. Underwater image enhancement using deep learning
Mathew et al. Monocular depth estimation with SPN loss
Ivanecký Depth estimation by convolutional neural networks
Haji-Esmaeili et al. Large-scale monocular depth estimation in the wild
Kim et al. Bidirectional Deep Residual learning for Haze Removal.
Honnutagi et al. Underwater video enhancement using manta ray foraging lion optimization-based fusion convolutional neural network
Kim et al. Real-time human segmentation from RGB-D video sequence based on adaptive geodesic distance computation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination