CN116993639A - Visible light and infrared image fusion method based on structural re-parameterization

Visible light and infrared image fusion method based on structural re-parameterization

Info

Publication number
CN116993639A
Authority
CN
China
Prior art keywords
fusion
image
layer
structural
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310932335.6A
Other languages
Chinese (zh)
Inventor
蒋汶臻
胡荣林
王林涛
李文超
王佳雯
马甲林
李翔
邵鹤帅
张海艳
何艳婷
冯万利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202310932335.6A
Publication of CN116993639A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visible light and infrared image fusion method based on structural re-parameterization, which comprises the following steps: in the encoder, features of the visible light and infrared images are extracted with a deep convolutional neural network model built from stacked RepVGG blocks, convolution layers and a dropout layer, and the features are divided by channel count into low-level and high-level features; the low-level and high-level features are then input into a decoder, where a feature fusion module merges them into new depth features, producing the fused image. By applying RepVGG blocks, a dropout layer and structural re-parameterization to the visible light and infrared image fusion task, the invention extracts image features effectively, mitigates overfitting, and at the same time improves the model's inference speed and memory utilization.

Description

Visible light and infrared image fusion method based on structural re-parameterization
Technical Field
The invention relates to the field of visible light and infrared image fusion in computer vision, in particular to a visible light and infrared image fusion method based on structural re-parameterization.
Background
In the field of image fusion, visible light and infrared image fusion is an important technology that combines information from visible light and infrared sensors to produce a composite image with more comprehensive and richer information. Visible light and infrared images capture information in different wavebands, have complementary characteristics, and together can provide a more comprehensive visual perception capability.
Existing visible light and infrared image fusion methods are mainly based on pixel-level or region-level fusion strategies, such as weighted averaging, multi-scale decomposition and wavelet transforms. However, these methods often fail to fully exploit the structural information of the image, leading to artifacts, distortion and incomplete information in the fusion result. In addition, they adapt poorly to different scenes and lighting conditions, and cannot achieve precise control and adjustment of the image content.
Ding X et al., in RepVGG: Making VGG-style ConvNets Great Again ([C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 13733-13742), propose the RepVGG network, whose training-time model has a multi-branch topology; training and inference structures are decoupled by a structure re-parameterization technique, so that the inference-time body consists only of stacked 3×3 convolutions and ReLU. A multi-branch model is trained, equivalently converted into a single-path model, and the single-path model is finally deployed. The method thus enjoys both the advantage of multi-branch training (high performance) and the benefits of single-path inference (high speed, memory savings). The network achieves high accuracy and fast inference on image classification tasks; however, when introduced into the image fusion field, its many RepVGG blocks, pooling layers and fully connected layers not only incur a large computational cost but also distort the fused image, and, lacking a dropout layer, it also risks overfitting.
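For context, the branch-fusion step of structure re-parameterization can be sketched as follows. This is a minimal PyTorch illustration of the conversion described in the RepVGG paper, not the patent's implementation; all function and variable names are illustrative assumptions. It folds each batch-normalization layer into its preceding (bias-free) convolution and merges the parallel 3×3, 1×1 and identity branches into a single 3×3 kernel.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv_weight, bn):
    # Fold a BatchNorm layer into the preceding (bias-free) convolution:
    # y = gamma * (W*x - mean) / sqrt(var + eps) + beta
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                          # per-channel scale
    fused_w = conv_weight * scale.reshape(-1, 1, 1, 1)
    fused_b = bn.bias - bn.running_mean * scale
    return fused_w, fused_b

def reparameterize(conv3, bn3, conv1, bn1, bn_id, channels):
    # Merge the three training-time branches (3x3 conv, 1x1 conv, identity,
    # each followed by BN) into the weight and bias of a single 3x3 conv.
    # Assumes stride 1, groups 1 and equal input/output channels.
    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    w1, b1 = fuse_conv_bn(conv1.weight, bn1)
    w1 = nn.functional.pad(w1, [1, 1, 1, 1])         # embed 1x1 kernel in 3x3
    w_id = torch.zeros(channels, channels, 3, 3)     # identity as a 3x3 conv
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    w_id, b_id = fuse_conv_bn(w_id, bn_id)
    return w3 + w1 + w_id, b3 + b1 + b_id
```

At deployment, the returned weight and bias initialize a plain nn.Conv2d(channels, channels, 3, padding=1) that replaces the whole block; stacking such layers yields the branch-free inference topology.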
Li H et al., in LRRNet: A Novel Representation Learning Guided Fusion Network for Infrared and Visible Images ([J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023), propose a representation-learning-guided fusion network (LRRNet) that uses a learned image decomposition model (learnable low-rank representation, LLRR) for the infrared-visible (IR-VI) image fusion task. To train this network, a new detail-to-semantic information loss function is proposed, containing loss terms at four levels: pixel level, shallow feature level, intermediate feature level and deep feature level. The fusion performance of this network (including metrics such as the improved fusion artifact measure and multi-scale structural similarity) is superior to most existing fusion methods, with fewer parameters and shorter training and inference times. However, the LLRR blocks in LRRNet extract the deep features of visible light and infrared images insufficiently, so evaluation metrics of the fused image such as entropy and mutual information are not ideal.
Disclosure of Invention
The invention aims to: provide a visible light and infrared image fusion method based on structural re-parameterization that improves inference speed and memory utilization.
The technical scheme is as follows: the invention discloses a visible light and infrared image fusion method based on structural re-parameterization, which comprises the following steps:
(1) The input visible light image passes through several RepVGG blocks and one convolution layer to extract low-level features L_x and high-level features S_x; L_x and S_x are then input into separate convolution layers to obtain the two tensors C_11 and C_12;
(1.1) Read the input visible light image as a gray image and scale it to a uniform size; take the RepVGG convolutional neural network architecture as the base convolutional neural network, remove its final pooling layer and fully connected layer, and add one convolution layer.
(1.2) Input the scaled gray image into the modified RepVGG convolutional neural network architecture; after convolution, batch normalization and rectified linear unit (ReLU) operations, output the feature map C_1.
(1.3) Slice the first 128 channels of feature map C_1 as L_x and the last 128 channels as S_x.
(1.4) Input the two sub-tensors L_x and S_x into separate convolution layers to obtain C_11 and C_12 in turn.
(2) The input infrared image passes through several RepVGG blocks and one convolution layer to extract low-level features L_y and high-level features S_y; L_y and S_y are then input into separate convolution layers to obtain the two tensors C_21 and C_22;
(2.1) Read the input infrared image as a gray image and scale it to a uniform size; take the RepVGG convolutional neural network architecture as the base convolutional neural network, remove its final pooling layer and fully connected layer, and add one convolution layer.
(2.2) Input the scaled gray image into the modified RepVGG convolutional neural network architecture; after convolution, batch normalization and ReLU operations, output the feature map C_2.
(2.3) Slice the first 128 channels of feature map C_2 as L_y and the last 128 channels as S_y.
(2.4) Input the two sub-tensors L_y and S_y into separate convolution layers to obtain C_21 and C_22 in turn.
(3) Concatenate the two tensors C_11 and C_21 along dimension 1 to obtain C_3, and concatenate the two tensors C_12 and C_22 along dimension 1 to obtain C_4;
(4) Input the two tensors C_3 and C_4 into separate convolution layers to obtain a low tensor and a high tensor;
(5) Perform element-wise addition on the low and high tensors to generate the fused image.
A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the visible light and infrared image fusion method based on structural re-parameterization described above.
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
1. The invention is reasonably designed: the low-level and high-level features of the image are extracted with a deep convolutional neural network model; the convolution layers and dropout layer in the decoder then reduce the risk of overfitting and enhance generalization; fusing the low-level and high-level features yields a stronger feature representation;
2. During inference, the decoupling technique of structural re-parameterization converts each RepVGG block into stacked plain-topology (branch-free) convolution layers, improving inference speed and memory utilization.
Drawings
FIG. 1 is a diagram of a visible and infrared image fusion network framework based on structural reparameterization;
FIG. 2 is a schematic diagram of the operation of the RepVGG block;
FIG. 3 is a schematic diagram of the stacked convolution operation after decoupling.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, in the visible light and infrared image fusion method based on structural re-parameterization, a deep convolutional neural network model extracts the low-level and high-level features of the image in the encoder; in the decoder, the low-level and high-level features are combined to generate the fused image. In the encoder, to capture the source-image features effectively while accelerating model inference and improving memory utilization, RepVGG blocks are used: the image passes through a 3×3 convolution with a parallel-connected 1×1 convolution, followed by ReLU; finally, a convolution layer and a dropout layer are applied, and the feature map is divided into low-level and high-level features for fusion by the decoder. The network outputs a fused image with the same resolution as the source images; VGG-19 serves as the loss network, and a detail-to-semantic information loss function based on multi-level features evaluates network performance for training. The method specifically comprises the following steps:
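The training-time RepVGG block described above can be sketched as follows; this is a minimal PyTorch illustration (cf. FIG. 2) assuming stride 1 and equal input/output channels, not the patent's exact configuration.

```python
import torch.nn as nn

class RepVGGBlock(nn.Module):
    # Training-time RepVGG block: parallel 3x3, 1x1 and identity branches,
    # each batch-normalized, are summed and passed through ReLU. At
    # inference the three branches are fused into a single 3x3 convolution.
    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.identity = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.branch3(x) + self.branch1(x) + self.identity(x))
```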
Step S1: in the encoder, extract the basic depth features of the image with a deep learning convolutional neural network model, and divide the features into low-level and high-level features according to the number of channels. The specific implementation is as follows (an illustrative encoder sketch is given after step S1.4):
Step S1.1: read the input image as a gray image and scale it to a uniform size; take the RepVGG convolutional neural network architecture as the pre-trained base convolutional neural network, remove its final pooling layer and fully connected layer, and add one convolution layer;
Step S1.2: input the scaled gray images into the modified RepVGG convolutional neural network architecture; after a series of convolution, batch normalization and ReLU operations, output the feature maps C_1 and C_2;
Step S1.3: for feature maps C_1 and C_2, slice the first 128 channels as L and the last 128 channels as S;
Step S1.4: process the input visible light image and the corresponding infrared image through steps S1.1, S1.2 and S1.3 respectively to obtain the four sub-tensors L_x, S_x, L_y and S_y;
Step S2: input the acquired low-level and high-level features into the decoder, where a feature fusion module merges them into new depth features and generates the fused image. The decoder consists of a series of convolution layers and a dropout layer; passing the convolution outputs through the dropout layer reduces the model's overfitting risk and enhances its generalization. The specific implementation is as follows (an illustrative decoder sketch is given after step S2.4):
Step S2.1: input the four sub-tensors L_x, S_x, L_y and S_y into separate convolution layers to obtain C_11, C_12, C_21 and C_22 in turn;
Step S2.2: concatenate the two tensors C_11 and C_21 along dimension 1 to obtain C_3, and concatenate the two tensors C_12 and C_22 along dimension 1 to obtain C_4;
Step S2.3: input the two tensors C_3 and C_4 into separate convolution layers to obtain the low and high tensors;
Step S2.4: perform element-wise addition on the low and high tensors to generate the fused image.
Step S3: use the VGG-19 network as the loss network and evaluate network performance with a detail-to-semantic information loss function based on multi-level features. The specific implementation is as follows (a sketch of the loss computation is given after step S3.5):
Step S3.1: use VGG-19 trained on ImageNet as the loss network, selecting 4 convolution blocks to extract features;
Step S3.2: the total loss has the form L_total = γ_1·L_pixel + γ_2·L_shallow + L_middle + γ_4·L_deep, where γ_1, γ_2 and γ_4 are the weights of the corresponding loss terms; L_pixel denotes the pixel-level loss, and L_shallow, L_middle and L_deep denote the shallow, intermediate and deep feature losses respectively, with the features extracted by the pre-trained network;
Step S3.3: normalize and scale the fused image output in step S2.4, map its values into the range 0-255, and compute the loss value L_pixel with the mean-squared-error loss function (mse_loss);
Step S3.4: input the fused image from step S2.4, the visible light image and the infrared image into the VGG-19 network to output I_f, I_vis and I_ir respectively;
Step S3.5: combine I_f, I_vis and I_ir to compute the loss values L_shallow, L_middle and L_deep with the mean-squared-error loss function (mse_loss); then sum these terms with L_pixel according to the formula in step S3.2 to obtain L_total, and update the weights with the back-propagation algorithm.
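Steps S3.1-S3.5 can be illustrated with the following hedged PyTorch sketch. The VGG-19 tap indices, the pairing of the fused image with both source images in each MSE term, and the mapping of the four taps onto the shallow/middle/deep levels are illustrative assumptions, not the patent's exact choices.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 loss network; the tap indices (after four convolution
# blocks) are illustrative choices.
_vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_TAPS = (3, 8, 17, 26)

def vgg_feats(img):
    feats, x = [], img.repeat(1, 3, 1, 1)       # gray -> 3 channels for VGG
    for i, layer in enumerate(_vgg):
        x = layer(x)
        if i in _TAPS:
            feats.append(x)
    return feats

def total_loss(fused, vis, ir, g1=1.0, g2=1.0, g4=1.0):
    # Each level compares the fused image against both source images.
    l_pixel = F.mse_loss(fused, vis) + F.mse_loss(fused, ir)
    ff, fv, fi = vgg_feats(fused), vgg_feats(vis), vgg_feats(ir)
    lvl = [F.mse_loss(a, b) + F.mse_loss(a, c)
           for a, b, c in zip(ff, fv, fi)]
    # Assumed mapping of the four taps onto shallow/middle/deep levels.
    l_shallow, l_middle, l_deep = lvl[0], 0.5 * (lvl[1] + lvl[2]), lvl[3]
    return g1 * l_pixel + g2 * l_shallow + l_middle + g4 * l_deep
```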
The effect of the invention is demonstrated by the following experiments conducted according to the method described above.
Test environment: Python 3.9; PyTorch framework; Windows 11 system; NVIDIA RTX 3070 GPU.
Test data: the selected training dataset is KAIST, an image dataset for visible light and infrared image fusion. The selected test datasets are TNO and VOT2020-RGBT, from the public multi-modal datasets. 21 pairs of IR-VI images are selected from TNO for testing, and 40 pairs of images are selected from VOT2020-RGBT and TNO to construct a new test dataset. These images are of arbitrary size and are converted to grayscale.
Test metrics: the invention selects 6 quality metrics to evaluate fusion performance objectively: entropy (En); standard deviation (SD); mutual information (MI); the improved fusion artifact measure (Nabf); the sum of the correlations of differences (SCD); and multi-scale structural similarity (MS-SSIM). For all metrics except Nabf, larger values indicate better fusion performance (for Nabf, smaller is better). The computation of En and MI is sketched below.
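For reference, entropy (En) and mutual information (MI) are commonly computed from image histograms as in the following sketch; these are the standard definitions, not formulas specific to the patent.

```python
import numpy as np

def entropy(img: np.ndarray) -> float:
    # Shannon entropy of an 8-bit grayscale image (larger = more information).
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(a: np.ndarray, b: np.ndarray) -> float:
    # MI between two 8-bit images via their joint histogram.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=256)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz])).sum())
```

For fusion evaluation, MI is usually reported as MI(fused, visible) + MI(fused, infrared).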
The test results are shown in tables 1 and 2.
Table 1 shows the averages of the 6 quality metrics for the invention (RepVGGfuse) and other algorithms on the fused images of the 21 pairs of infrared and visible images.
Table 1 average comparison of quality indicators
Table 2 shows the averages of the 6 quality metrics for the invention (RepVGGfuse) and other algorithms on the fused images of the 40 pairs of infrared and visible images.
Table 2 average comparison of quality indicators
As can be seen from the comparison of the data, the fusion performance of the method is superior to that of most of the existing fusion methods.

Claims (10)

1. A visible light and infrared image fusion method based on structural re-parameterization, characterized by comprising the following steps:
(1) passing the input visible light image through several RepVGG blocks and one convolution layer to extract low-level features L_x and high-level features S_x, then inputting L_x and S_x into separate convolution layers to obtain the two tensors C_11 and C_12;
(2) passing the input infrared image through several RepVGG blocks and one convolution layer to extract low-level features L_y and high-level features S_y, then inputting L_y and S_y into separate convolution layers to obtain the two tensors C_21 and C_22;
(3) concatenating the two tensors C_11 and C_21 along dimension 1 to obtain C_3, and concatenating the two tensors C_12 and C_22 along dimension 1 to obtain C_4;
(4) inputting the two tensors C_3 and C_4 into separate convolution layers to obtain a low tensor and a high tensor;
(5) performing element-wise addition on the low and high tensors to generate the fused image.
2. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (1) comprises the following step:
(1.1) reading the input visible light image as a gray image, scaling it to a uniform size, taking the RepVGG convolutional neural network architecture as the base convolutional neural network, removing its final pooling layer and fully connected layer, and adding one convolution layer.
3. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (1) comprises the following step:
(1.2) inputting the scaled gray image into the modified RepVGG convolutional neural network architecture and, after convolution, batch normalization and rectified linear unit operations, outputting the feature map C_1.
4. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (1) comprises the following step:
(1.3) slicing the first 128 channels of feature map C_1 as L_x and the last 128 channels as S_x.
5. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (1) comprises the following step:
(1.4) inputting the two sub-tensors L_x and S_x into separate convolution layers to obtain C_11 and C_12 in turn.
6. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (2) comprises the following step:
(2.1) reading the input infrared image as a gray image, scaling it to a uniform size, taking the RepVGG convolutional neural network architecture as the base convolutional neural network, removing its final pooling layer and fully connected layer, and adding one convolution layer.
7. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (2) comprises the following step:
(2.2) inputting the scaled gray image into the modified RepVGG convolutional neural network architecture and, after convolution, batch normalization and rectified linear unit operations, outputting the feature map C_2.
8. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (2) comprises the following step:
(2.3) slicing the first 128 channels of feature map C_2 as L_y and the last 128 channels as S_y.
9. The visible light and infrared image fusion method based on structural re-parameterization according to claim 1, characterized in that step (2) comprises the following step:
(2.4) inputting the two sub-tensors L_y and S_y into separate convolution layers to obtain C_21 and C_22 in turn.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the visible light and infrared image fusion method based on structural re-parameterization according to any one of claims 1-8.
CN202310932335.6A 2023-07-27 2023-07-27 Visible light and infrared image fusion method based on structural re-parameterization Pending CN116993639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310932335.6A CN116993639A (en) 2023-07-27 2023-07-27 Visible light and infrared image fusion method based on structural re-parameterization

Publications (1)

Publication Number Publication Date
CN116993639A true CN116993639A (en) 2023-11-03

Family

ID=88525970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310932335.6A Pending CN116993639A (en) 2023-07-27 2023-07-27 Visible light and infrared image fusion method based on structural re-parameterization

Country Status (1)

Country Link
CN (1) CN116993639A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688936A (en) * 2024-02-04 2024-03-12 江西农业大学 Low-rank multi-mode fusion emotion analysis method for graphic fusion
CN117688936B (en) * 2024-02-04 2024-04-19 江西农业大学 Low-rank multi-mode fusion emotion analysis method for graphic fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination