CN113487530A - Infrared and visible light fusion imaging method based on deep learning - Google Patents

Infrared and visible light fusion imaging method based on deep learning

Info

Publication number
CN113487530A
CN113487530A (application CN202110878885.5A)
Authority
CN
China
Prior art keywords
image
layer
pair
feature
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110878885.5A
Other languages
Chinese (zh)
Other versions
CN113487530B (en)
Inventor
程良伦
李卓
吴衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110878885.5A priority Critical patent/CN113487530B/en
Publication of CN113487530A publication Critical patent/CN113487530A/en
Application granted granted Critical
Publication of CN113487530B publication Critical patent/CN113487530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an infrared and visible light fusion imaging method based on deep learning, which comprises the following steps: simultaneously acquiring an infrared image and a visible light image of the same size for a target to form a target object image pair; segmenting the target object image pair into sub-image pairs, preprocessing the sub-images and adding a noise map; and inputting the sub-image pair with the added noise map, as an input image pair, into a trained image fusion model to obtain a fused image. The image fusion model comprises a shallow feature extraction unit, an encoder, a fusion module and a decoder. The method takes the common fact that source images contain noise as a premise of practical imaging applications and performs image denoising during fusion, effectively preventing noise in the source image pair from being introduced into the fused image. It is highly beneficial to applied research on infrared and visible light fusion and deep learning technology, and is expected to find wide application in fields such as medical imaging and night surveillance.

Description

Infrared and visible light fusion imaging method based on deep learning
Technical Field
The application relates to the field of image fusion, in particular to an infrared and visible light fusion imaging method based on deep learning.
Background
Infrared thermal imaging takes the target's thermal radiation as input; it can detect a target even under insufficient illumination and distinguishes the target from the background by differences in radiation. Visible light sensors image the target from its reflected light, providing images with high resolution and sharp texture details. An infrared and visible light imaging system can therefore simultaneously reflect different attributes of the same target object and provide scene information from different aspects, and it has wide application in fields such as the military, video surveillance, driver assistance and forest fire prevention.
In recent years, with the widespread use of infrared and visible light imaging technology, the information utilization of multi-modal sensors and the reliable operating time of imaging systems have been further improved, but some problems have also been exposed: infrared thermal images have poor imaging quality, low contrast and severe noise interference; visible light images are susceptible to insufficient lighting, fog and other inclement weather; and most imaging systems generate noisy images. Image fusion methods based on infrared and visible light imaging systems are therefore extremely important in the field of image information fusion, and the development of advanced image fusion algorithms is very helpful for the application and development of information fusion technology.
Disclosure of Invention
The application aims to provide an infrared and visible light fusion imaging method based on deep learning, in order to solve the problems that, in infrared and visible light imaging systems, infrared imaging quality is poor, noise interference is severe, and visible light images are susceptible to bad weather.
In order to realize this task, the application adopts the following technical solution:
In a first aspect, the application provides an infrared and visible light fusion imaging method based on deep learning, which comprises the following steps:
simultaneously acquiring an infrared image and a visible light image of the same size for a target to form a target object image pair; segmenting the target object image pair into sub-image pairs, preprocessing the sub-images and adding a noise map; and inputting the sub-image pair with the added noise map, as an input image pair, into a trained image fusion model to obtain a fused image;
the image fusion model comprises a shallow layer feature extraction unit, an encoder, a fusion module and a decoder, wherein:
the shallow feature extraction unit is used for performing shallow feature extraction on the input image pair;
the encoder is of a two-layer network structure comprising an upper network and a lower network; the upper network comprises a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, with a convolutional layer arranged at the end for extracting features and recombining channels; the lower network comprises a non-local enhancement module and a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, followed by a second-order information attention module and a convolutional layer;
the fusion module is used for generating a fusion feature map by combining a space attention mechanism and a channel attention mechanism for the feature map pair output by the upper network and the feature map pair output by the lower network;
the decoder comprises a plurality of upsampling layers, convolution blocks, convolutional layers and linear rectification layers, wherein each convolution block includes two convolutional layers with convolution kernels of different sizes.
Further, the sub-image pair segmentation, preprocessing and noise map addition for the sub-images in the target object image pair includes:
down-sampling the target object image pair to divide into sub-image pairs; extracting image blocks and recombining pixels of each sub-image in the sub-image pair to obtain a preprocessed sub-image pair; and constructing a noise map in a random sampling mode, and adding the noise map into the preprocessed sub-image pair as an additional channel.
Further, the processing procedure of the shallow feature extraction unit includes:
inputting pairs of images
Figure BDA0003191239180000021
Is input to the shallow feature extraction unit in an image tensor format; the shallow feature extraction unit comprises a convolution layer and a linear rectification layer ReLU, and the input image pair
Figure BDA0003191239180000022
Completing shallow feature extraction after the convolution layer and the linear rectification layer to obtain a shallow feature map pair
Figure BDA0003191239180000023
Further, the continuous stacking and hopping connection of the multilayer convolutional layer and the linear rectifying layer in the upper network includes:
each convolution layer and a linear rectifying layer ReLU form a feature extraction unit, and the feature extraction units are 4 in total; the input of the first feature extraction unit is a shallow feature map pair, the input of the second feature extraction unit is the output of the first feature extraction unit and the shallow feature map pair, the input of the third feature extraction unit is the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, and the input of the fourth feature extraction unit is the output of the third feature extraction unit, the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, so that continuous stacking and skip connection is formed.
Further, the non-local enhancement module includes an image partition layer and four convolution layers, and the shallow feature map pairs in the lower network of the encoder
Figure BDA0003191239180000031
Firstly, dividing image blocks into layers by an image, respectively generating m multiplied by m image blocks with the same size, and performing non-local feature enhancement on each divided feature image block, wherein a mathematical model of the non-local feature enhancement is as follows:
Figure BDA0003191239180000032
wherein i is a feature position index of the shallow feature map to be calculated, and N is the number of position indexes of the shallow feature map; j is the index of all possible positions in the feature map,
Figure BDA0003191239180000033
the ith position representing the t-th feature tile,
Figure BDA0003191239180000034
indicating after enhancement
Figure BDA0003191239180000035
Figure BDA0003191239180000036
Respectively representing the convolution processing in the non-local enhancement module,
Figure BDA0003191239180000037
Wψ,Wωand W andρweights learned for the four convolutional layers in the non-local enhancement module;
each enhanced feature tile
Figure BDA0003191239180000038
Finally, the feature images are combined into a feature image tensor to generate an enhanced feature image pair
Figure BDA0003191239180000039
After the continuous stacking and jump connection of the convolution layer and the linear rectification layer ReLU in the lower network, the information enters a second-order information attention module.
Further, the second-order information attention module comprises a normalization layer, a pooling layer, a convolution layer, a linear rectification layer ReLU, a convolution layer and a gating layer Sigmoid which are sequentially connected;
enhancing pairs of feature images
Figure BDA00031912391800000310
Is introduced into the second-order information attention module, and takes into account the characteristics of the second-order statistic channelAnd (4) self-adaptively learning the dependency relationship between the features and readjusting the channel.
Further, in the mathematical model of the fusion module, Sa(·) and Ca(·) denote the implicit functions of spatial attention and channel attention, respectively; they are applied to the feature map pair output by the upper network and the feature map pair output by the lower network to generate the fused feature maps.
Further, there are 5 convolution blocks, denoted CD1 to CD5; each convolution block comprises one 3 × 3 convolutional layer and one 1 × 1 convolutional layer;
the 5 convolution blocks CD are linked by upsampling and skip connections, wherein CD1, CD2 and CD3 are connected in sequence, the input of CD1 is simultaneously overlaid onto the inputs of CD2 and CD3, and the output of CD1 is overlaid onto the input of CD3; CD4 and CD5 are connected in sequence, and the input of CD4 is overlaid, after an upsampling layer, onto the input of CD1 on the one hand and onto the input of CD5 on the other; the output of CD4 is overlaid onto the input of CD2 after an upsampling layer, the output of CD5 is overlaid onto the input of CD3 after an upsampling layer, the output of CD3 is passed, after an upsampling layer, through two feature extraction units each consisting of a 3 × 3 convolutional layer and a linear rectification layer ReLU, and the output fused image is finally obtained through one 3 × 3 convolutional layer.
In a second aspect, the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the deep learning based infrared and visible light fusion imaging method of the first aspect when executing the computer program.
In a third aspect, the present application provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the steps of the aforementioned deep learning-based infrared and visible light fusion imaging method of the first aspect.
Compared with the prior art, the method has the following technical characteristics:
in the method, a network structure is designed into a double-layer structure to extract depth characteristic mapping of infrared and visible images; the fusion strategy combines spatial and channel attention mechanisms to generate a richer fusion feature map; the nested connection architecture is applied to the fusion network of the application to avoid semantic defects between the encoder and the decoder; the noise estimation image is added into the deep learning fusion network to realize the denoising function, so that the model is denoised in the process of fusing the images. Compared with the prior art, the method and the device consider the common fact that the source image contains noise, take the source image as the premise of practical imaging application, carry out image denoising in the process of fusing the images, and effectively avoid the introduction of noise points in the source image pair into the fused images. The method is very beneficial to the application research of the infrared and visible light fusion and deep learning technology, and is expected to be widely applied in the fields of medical imaging, night monitoring and the like.
Drawings
FIG. 1 is a schematic flow chart illustrating the execution of a method according to an embodiment of the present application;
FIG. 2 is the image resulting from the preprocessing in the example, processing a 1 × 320 × 240 infrared and visible light image pair into a 5 × 160 × 120 image pair, with a noise map M added as an additional channel to the input;
FIG. 3 is a diagram illustrating an embodiment of a deep neural network architecture, the final result of which is a fused image;
FIG. 4 is a schematic diagram of the network encoder in an embodiment, which performs feature extraction on the infrared and visible light images to generate feature map pairs;
FIG. 5 is a schematic diagram of an exemplary fusion strategy, which is formed by combining spatial and channel attention mechanisms;
FIG. 6 is a schematic diagram of the network decoder in an embodiment;
fig. 7 is a schematic diagram of the convolution block CD in the embodiment, which is composed of one 3 × 3 convolutional layer and one 1 × 1 convolutional layer.
Detailed Description
The method improves the imaging quality, effectively removes noise interference in the image, enables the target in the image to be prominent and clear in texture, facilitates more accurate target identification, and is more beneficial to all-weather work of an imaging system.
As shown in fig. 1, the present application captures images of the detected target object with an infrared and visible light imaging system to generate an infrared and visible light image pair, and processes this target object image pair with a deep learning image fusion method. The fusion process includes the following steps:
s1, an infrared image and a visible light image of the same size are simultaneously acquired for the target to form a target image pair, and the target image pair is down-sampled to be divided into sub-image pairs.
In the present application, the target image pair (Iir, Ivis) of size nch × h × w is down-sampled and divided to form sub-image pairs of size 4nch × h/2 × w/2, where Iir is the infrared image, Ivis is the visible light image, nch is the number of channels, h is the height and w is the width.
And S2, extracting image blocks and recombining pixels of each sub-image in the sub-image pair to obtain a preprocessed sub-image pair.
In this step, 2 × 2 image blocks are extracted from each sub-image (infrared image and visible light image) and their pixels are rearranged into different channels of the output image, giving the preprocessed sub-image pair. In the corresponding mathematical model, c is the image channel index, x is the pixel abscissa and y is the pixel ordinate, with 0 ≤ c ≤ 4nch, 0 ≤ x ≤ h and 0 ≤ y ≤ w. The subsequent processing of the present application is carried out at this reduced scale.
And S3, constructing a noise map in a random sampling mode, and adding the noise map as an additional channel into the preprocessed sub-image pairs.
The noise map M is constructed by randomly sampling a sample of size nch × h × w from a uniform distribution over the noise standard deviation interval [σ1, σ2). The noise map M is added as an extra channel to the preprocessed input images; it controls the trade-off between noise reduction and detail retention. For example, in the present embodiment a uniform distribution over [0, 75) is employed.
Through the preprocessing of these steps, the target image pair yields infrared and visible sub-image pairs of size (4nch + nch) × h/2 × w/2; a schematic diagram is shown in fig. 2.
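For illustration, the preprocessing of steps S1 to S3 can be sketched as follows. The sketch uses PyTorch (the patent does not name a framework), treats the 2 × 2 block extraction and pixel recombination as a space-to-depth (pixel-unshuffle) rearrangement, and adds one noise channel per source channel; these choices are assumptions of the sketch, since the exact pixel re-indexing formula is not reproduced here.

```python
import torch
import torch.nn.functional as F

def preprocess_pair(ir, vis, sigma_min=0.0, sigma_max=75.0):
    """Split a (n_ch, h, w) image pair into (4*n_ch, h/2, w/2) sub-images
    and append a randomly sampled noise map as an extra channel.
    The space-to-depth rearrangement is an assumption; the patent's exact
    pixel re-indexing is not reproduced here."""
    def space_to_depth(img):
        # img: (n_ch, h, w) -> (4*n_ch, h/2, w/2)
        return F.pixel_unshuffle(img.unsqueeze(0), downscale_factor=2).squeeze(0)

    ir_sub, vis_sub = space_to_depth(ir), space_to_depth(vis)

    # Noise map M: one channel per source channel, sampled from U[sigma_min, sigma_max)
    n_ch, h2, w2 = ir.shape[0], ir_sub.shape[1], ir_sub.shape[2]
    noise_map = torch.empty(n_ch, h2, w2).uniform_(sigma_min, sigma_max)

    ir_in = torch.cat([ir_sub, noise_map], dim=0)   # (4*n_ch + n_ch, h/2, w/2)
    vis_in = torch.cat([vis_sub, noise_map], dim=0)
    return ir_in, vis_in

# Example: a 1-channel 320x240 pair becomes a 5x160x120 pair, as in fig. 2.
ir = torch.rand(1, 320, 240)
vis = torch.rand(1, 320, 240)
ir_in, vis_in = preprocess_pair(ir, vis)
print(ir_in.shape, vis_in.shape)  # torch.Size([5, 160, 120]) twice
```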
S4, the pairs of sub-images to which the noise images are added are input as input image pairs to the trained image fusion model, and a fused image is obtained.
The sub-image pairs with the added noise map serve as the input of the image fusion model shown in fig. 3, which finally outputs the fused image. Its mathematical model can be expressed as If = F(·), where If is the fused image and F(·) is an implicit function representing the image fusion model applied to the input sub-image pair; in this embodiment, the model is a convolutional neural network model.
Referring to fig. 3, the image fusion model proposed in the present application comprises, from left to right: a shallow feature extraction unit, an encoder, a fusion module and a decoder. These modules are introduced as follows:
1. shallow feature extraction unit
The input image pair is fed, in image tensor format, to the shallow feature extraction unit of the image fusion model. The unit comprises a 3 × 3 convolutional layer and a linear rectification layer ReLU; the input image pair passes through the convolutional layer and the linear rectification layer to complete shallow feature extraction and obtain the shallow feature map pair.
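A minimal sketch of the shallow feature extraction unit (one 3 × 3 convolution followed by a ReLU), assuming PyTorch; the channel counts are illustrative, with 5 input channels matching the preprocessed sub-images plus the noise channel.

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """3x3 convolution + ReLU applied to each image of the input pair.
    in_channels = 5 matches the preprocessed sub-images with the noise
    channel; out_channels is an assumed value."""
    def __init__(self, in_channels=5, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

# Usage: shallow feature maps for the infrared and visible inputs.
extractor = ShallowFeatureExtractor()
phi_ir = extractor(torch.rand(1, 5, 160, 120))
phi_vis = extractor(torch.rand(1, 5, 160, 120))
```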
2. Encoder
The present application provides a modular encoder structure for extracting the depth feature maps of the input image pair. As shown in fig. 4, the encoder is a two-layer network structure comprising an upper network and a lower network; the shallow feature map pair is transmitted to the upper network and the lower network of the two-layer network respectively, generating 4 deep feature maps; wherein:
2.1 upper network
From left to right, the upper network comprises a continuous stacking and skip connection of four 3 × 3 convolutional layers with linear rectification layers ReLU, and one 3 × 3 convolutional layer arranged at the end for extracting features and recombining channels. Continuous stacking and skip connection means that each 3 × 3 convolutional layer and a ReLU layer form a feature extraction unit, four units in total: the input of the first unit is the shallow feature map pair; the input of the second unit is the output of the first unit together with the shallow feature map pair; the input of the third unit is the outputs of the second and first units together with the shallow feature map pair; and the input of the fourth unit is the outputs of the third, second and first units together with the shallow feature map pair.
In the encoder, the shallow feature map pair is transmitted to the upper network for deep feature extraction and, after the last convolutional layer of the upper network, yields the feature map pair output by the upper branch.
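A sketch of the upper branch, assuming that the stacking and skip connections concatenate feature maps along the channel dimension (dense-block style) and that the final convolution sees the concatenation of all features; channel widths are illustrative.

```python
import torch
import torch.nn as nn

class UpperEncoder(nn.Module):
    """Densely connected stack of four 3x3 conv + ReLU units followed by a
    final 3x3 convolution, as a sketch of the encoder's upper branch.
    Skip connections are assumed to concatenate feature maps along the
    channel dimension; channel widths are illustrative."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.units = nn.ModuleList()
        in_ch = channels
        for _ in range(4):
            self.units.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth  # each later unit also sees all previous outputs
        self.final = nn.Conv2d(in_ch, channels, kernel_size=3, padding=1)

    def forward(self, x):
        feats = [x]  # x is the shallow feature map
        for unit in self.units:
            feats.append(unit(torch.cat(feats, dim=1)))
        return self.final(torch.cat(feats, dim=1))
```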
2.2 lower layer network
From left to right, the lower network comprises a non-local enhancement module, a continuous stacking and skip connection of four 3 × 3 convolutional layers with linear rectification layers ReLU, followed by a second-order information attention module and one 3 × 3 convolutional layer. There are likewise 4 feature extraction units in the lower network: the input of the first feature extraction unit is the shallow feature map pair after it has passed through the non-local enhancement module, and the output of the four feature extraction units serves as the input of the second-order information attention module. The continuous stacking and skip connection of these four feature extraction units is similar to that of the upper network and is not described in detail here.
(a) Non-local enhancement module
In this application, the non-local enhancement module includes an image partition layer and four 1 × 1 convolutional layers. In the lower network of the encoder, the shallow feature map pair first passes through the image partition layer, which divides each feature map into m × m feature tiles of equal size; the tile dimensions are h1 = h/m and w1 = w/m, where h1 and w1 are the height and width of a divided tile and h and w are the height and width of the feature map. Non-local feature enhancement is then applied to each divided feature tile: for each position i of a tile, where i is the feature position index to be calculated, N = h × w/k² is the number of position indexes of the shallow feature map and j indexes all possible positions in the feature map, the enhanced feature at position i of the t-th tile is computed by aggregating the features at all positions j; three of the 1 × 1 convolutional layers produce the aggregation weights and aggregated values, and the fourth projects the aggregated result, with Wφ, Wψ, Wω and Wρ denoting the weights learned by the four 1 × 1 convolutional layers of the non-local enhancement module.
The enhanced feature tiles are finally recombined into a feature tensor to generate the enhanced feature image pair, which, after the continuous stacking and skip connection of the four 3 × 3 convolutional layers and ReLU layers in the lower network, enters the second-order information attention module.
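The exact non-local formula appears in the patent as an equation image; the sketch below therefore uses the standard embedded-Gaussian non-local block, which uses four 1 × 1 convolutions corresponding to the weights Wφ, Wψ, Wω and Wρ, and should be read as an assumed formulation (including the residual connection) applied to one feature tile.

```python
import torch
import torch.nn as nn

class NonLocalEnhancement(nn.Module):
    """Standard embedded-Gaussian non-local block applied to one feature
    tile: four 1x1 convolutions (phi, psi, omega, rho) plus a residual
    connection. Assumed formulation; the inter-channel width is illustrative."""
    def __init__(self, channels=64):
        super().__init__()
        inter = channels // 2
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.psi = nn.Conv2d(channels, inter, kernel_size=1)
        self.omega = nn.Conv2d(channels, inter, kernel_size=1)
        self.rho = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, tile):
        b, c, h, w = tile.shape
        q = self.phi(tile).flatten(2).transpose(1, 2)    # (b, n, inter)
        k = self.psi(tile).flatten(2)                    # (b, inter, n)
        v = self.omega(tile).flatten(2).transpose(1, 2)  # (b, n, inter)
        attn = torch.softmax(q @ k, dim=-1)              # weights over all positions j
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return tile + self.rho(y)                        # residual enhancement
```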
(b) Second-order information attention module
The second-order information attention module comprises a normalization layer, a pooling layer, a 3 x 3 convolution layer, a linear rectification layer ReLU, a 3 x 3 convolution layer and a gating layer Sigmoid which are sequentially connected.
After being processed by the four feature extraction units, the enhanced feature image pair is transmitted to the second-order information attention module, which recalibrates the channels by using second-order channel statistics to adaptively learn the dependencies between features. Its mathematical model can be expressed as follows:
fsola = Channel(Cov(fR))
where fR denotes the enhanced feature image, Cov(·) denotes covariance normalization, Channel(·) denotes channel attention, and fsola is the feature map enhanced by channel information. The enhanced feature image pair, processed by the feature extraction units and enhanced with second-order information by the second-order information attention module, yields a deep feature map pair, which is then transmitted into the final 3 × 3 convolutional layer to complete the channel adjustment and generate the feature map pair output by the lower network.
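A simplified sketch of a second-order channel attention step: the channel covariance of the feature map stands in for the covariance normalization Cov(·), and a small conv-ReLU-conv-Sigmoid gate stands in for Channel(·). The reduction ratio, the 1 × 1 kernels and the pooling of the covariance matrix are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    """Simplified second-order channel attention: a channel covariance
    matrix summarises second-order statistics, is pooled to one descriptor
    per channel and passed through conv-ReLU-conv-Sigmoid to rescale the
    channels (f_sola = Channel(Cov(f_R)))."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        feat = x.flatten(2)                              # (b, c, h*w)
        feat = feat - feat.mean(dim=2, keepdim=True)     # zero-mean per channel
        cov = feat @ feat.transpose(1, 2) / (h * w - 1)  # (b, c, c) channel covariance
        desc = cov.mean(dim=2).view(b, c, 1, 1)          # second-order channel descriptor
        return x * self.fc(desc)                         # channel recalibration
```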
3. Fusion module
In the fusion module, the feature map pair output by the upper network and the feature map pair output by the lower network are fused by combining a spatial attention mechanism and a channel attention mechanism to generate the fused feature maps, as shown in fig. 5. The spatial attention mechanism is used to fuse the multi-scale depth features of the image pair and, considering that the deep features are three-dimensional tensors, the channel attention mechanism is used for the channel information calculation. In the mathematical model of the fusion strategy, Sa(·) and Ca(·) denote the implicit functions of spatial attention and channel attention, respectively; applying them to the upper-network feature map pair and the lower-network feature map pair yields the two fused feature maps that are passed to the decoder.
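The following sketch shows one common realization of attention-based fusion consistent with the description: per-pixel weights from the l1-norm of each source for spatial attention, and per-channel weights from pooled activations for channel attention. How the patent combines Sa(·) and Ca(·) exactly is not reproduced here, so the split between the two branches and the tensor sizes are assumptions.

```python
import torch

def spatial_attention_fuse(f_ir, f_vis, eps=1e-8):
    """Fuse a feature-map pair with spatial attention: per-pixel weights are
    derived from the l1-norm activity of each source."""
    a_ir = f_ir.abs().sum(dim=1, keepdim=True)
    a_vis = f_vis.abs().sum(dim=1, keepdim=True)
    w_ir = a_ir / (a_ir + a_vis + eps)
    return w_ir * f_ir + (1.0 - w_ir) * f_vis

def channel_attention_fuse(f_ir, f_vis, eps=1e-8):
    """Fuse a feature-map pair with channel attention: per-channel weights
    come from global pooling of the absolute activations of each source."""
    p_ir = f_ir.abs().mean(dim=(2, 3), keepdim=True)
    p_vis = f_vis.abs().mean(dim=(2, 3), keepdim=True)
    w_ir = p_ir / (p_ir + p_vis + eps)
    return w_ir * f_ir + (1.0 - w_ir) * f_vis

# Upper-branch pair fused with spatial attention, lower-branch pair with
# channel attention; the assignment of mechanisms to branches is an assumption.
f_up = spatial_attention_fuse(torch.rand(1, 64, 160, 120), torch.rand(1, 64, 160, 120))
f_low = channel_attention_fuse(torch.rand(1, 64, 80, 60), torch.rand(1, 64, 80, 60))
```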
4. Decoder
As shown in fig. 6, the decoder includes several upsampling layers, 5 convolution blocks CD, three 3 × 3 convolutional layers and two linear rectification layers ReLU; each convolution block CD contains one 3 × 3 convolutional layer and one 1 × 1 convolutional layer, as shown in fig. 7.
The 5 convolution blocks, denoted CD1, CD2, CD3, CD4 and CD5, are linked by upsampling and skip connections: CD1, CD2 and CD3 are connected in sequence, the input of CD1 is simultaneously overlaid onto the inputs of CD2 and CD3, and the output of CD1 is also overlaid onto the input of CD3; CD4 and CD5 are connected in sequence, and the input of CD4 is overlaid, after an upsampling layer, onto the input of CD1 on the one hand and onto the input of CD5 on the other; the output of CD4 is overlaid onto the input of CD2 after an upsampling layer, and the output of CD5 is overlaid onto the input of CD3 after an upsampling layer; the output of CD3, after an upsampling layer, is passed through two feature extraction units each consisting of a 3 × 3 convolutional layer and a ReLU layer, and the output fused image is finally obtained through one 3 × 3 convolutional layer. This way of linking the convolution blocks CD allows the network model to avoid semantic loss between the encoder and the decoder.
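A sketch of the convolution block CD of fig. 7 (one 3 × 3 convolution followed by a 1 × 1 convolution); channel widths are assumed, and the "overlay" inputs collected by each block are assumed here to be channel concatenations.

```python
import torch
import torch.nn as nn

class ConvBlockCD(nn.Module):
    """Convolution block CD from fig. 7: one 3x3 convolutional layer
    followed by one 1x1 convolutional layer; channel widths are assumed."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.conv1(self.conv3(x))

# In the decoder, each CD block receives an overlay of earlier feature maps,
# e.g. the input of CD2 collects the input and output of CD1 plus the
# upsampled output of CD4, following the connection pattern described above.
cd1 = ConvBlockCD(in_channels=128, out_channels=64)
out = cd1(torch.rand(1, 128, 160, 120))
```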
In the present application, the fused feature map produced from the upper network is first overlaid with the upsampled fused feature map produced from the lower network to generate a combined feature map. The combined feature map is transmitted to CD1 for feature extraction and is simultaneously passed, through skip connections, as an overlay input to CD2 and CD3 to provide richer fusion information. The combined feature map and the lower-network fused feature map pass through CD1 and CD4 respectively to generate two feature maps; the output of CD4, after an upsampling layer, is overlaid with the output of CD1 and with the combined feature map to form the input of CD2, while the output of CD1 is also overlaid into the input of CD3 through a skip connection. The lower-network fused feature map and the output of CD4 are overlaid as the input of CD5, which generates a deep feature map. The output of CD2 and the upsampled output of CD5 are then overlaid, together with the skip-connected feature maps, as the input of CD3 to generate the deep decoded features. After upsampling, these features are fed into two feature extraction units each consisting of a 3 × 3 convolutional layer and a linear rectification layer ReLU, and the reconstruction of the fused image is finally completed through one 3 × 3 convolutional layer, yielding the fused image F'.
In the deep neural network training process, the Adam optimizer is adopted to optimize the loss function L(Θ), which is defined as:
L(Θ) = LMSE + λLSSIM
where LMSE is the mean square error term: it measures the error between each high-definition reference image and the output of the model for the corresponding input image carrying additive white Gaussian noise with standard deviation σ = 5 (an image in the input image pair), averaged over the N input images used for training, with F(·) the latent function representing the processing of the proposed image fusion model. LSSIM is a cost function for image similarity built on the image similarity function SSIM(·), which is computed from the means and variances of images j and k, their covariance σjk, and a constant C. Θ denotes the deep learning network parameters and λ is the weight that controls the similarity cost term. After 2000 training iterations, the optimized parameters Θ' are obtained.
The embodiment of the application further provides a terminal device, which can be a computer or a server; comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above deep learning based infrared and visible light fusion imaging method when executing the computer program, for example, the aforementioned S1 to S4.
The computer program may also be partitioned into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of a computer program in a terminal device.
Implementations of the present application provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described deep learning-based infrared and visible light fusion imaging method, e.g., S1-S4.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in the relevant jurisdictions; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. An infrared and visible light fusion imaging method based on deep learning is characterized by comprising the following steps:
simultaneously acquiring an infrared image and a visible light image of the same size for a target to form a target object image pair; segmenting the target object image pair into sub-image pairs, preprocessing the sub-images and adding a noise map; and inputting the sub-image pair with the added noise map, as an input image pair, into a trained image fusion model to obtain a fused image;
the image fusion model comprises a shallow layer feature extraction unit, an encoder, a fusion module and a decoder, wherein:
the shallow feature extraction unit is used for performing shallow feature extraction on the input image pair;
the encoder is of a two-layer network structure comprising an upper network and a lower network; the upper network comprises a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, with a convolutional layer arranged at the end for extracting features and recombining channels; the lower network comprises a non-local enhancement module and a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, followed by a second-order information attention module and a convolutional layer;
the fusion module is used for generating a fusion feature map by combining a space attention mechanism and a channel attention mechanism for the feature map pair output by the upper network and the feature map pair output by the lower network;
the decoder comprises a plurality of upsampling layers, convolution blocks, convolutional layers and linear rectification layers, wherein each convolution block includes two convolutional layers with convolution kernels of different sizes.
2. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the sub-image pair segmentation, pre-processing and noise map addition of the sub-images in the target object image pair comprises:
down-sampling the target object image pair to divide into sub-image pairs; extracting image blocks and recombining pixels of each sub-image in the sub-image pair to obtain a preprocessed sub-image pair; and constructing a noise map in a random sampling mode, and adding the noise map into the preprocessed sub-image pair as an additional channel.
3. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the processing procedure of the shallow feature extraction unit comprises:
the input image pair is fed to the shallow feature extraction unit in image tensor format; the shallow feature extraction unit comprises a convolutional layer and a linear rectification layer ReLU, and the input image pair passes through the convolutional layer and the linear rectification layer to complete shallow feature extraction and obtain a shallow feature map pair.
4. The deep learning based infrared and visible light fusion imaging method of claim 1, wherein the continuous stacking and jumping connection of the multilayer convolutional layer and the linear rectifying layer in the upper network comprises:
each convolution layer and a linear rectifying layer ReLU form a feature extraction unit, and the feature extraction units are 4 in total; the input of the first feature extraction unit is a shallow feature map pair, the input of the second feature extraction unit is the output of the first feature extraction unit and the shallow feature map pair, the input of the third feature extraction unit is the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, and the input of the fourth feature extraction unit is the output of the third feature extraction unit, the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, so that continuous stacking and skip connection is formed.
5. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the non-local enhancement module comprises an image partition layer and four convolutional layers; in the lower network of the encoder, the shallow feature map pair first passes through the image partition layer, which divides each feature map into m × m feature tiles of equal size, and non-local feature enhancement is applied to each divided feature tile: for each position i of a tile, where i is the feature position index of the shallow feature map to be calculated, N is the number of position indexes of the shallow feature map and j indexes all possible positions in the feature map, the enhanced feature at position i of the t-th tile is computed by aggregating the features at all positions j, the aggregation and its projection being carried out by the convolutions of the non-local enhancement module with weights Wφ, Wψ, Wω and Wρ learned by the four convolutional layers;
each enhanced feature tile is finally recombined into a feature tensor to generate an enhanced feature image pair, which, after the continuous stacking and skip connection of the convolutional layers and linear rectification layers ReLU in the lower network, enters the second-order information attention module.
6. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the second-order information attention module comprises a normalization layer, a pooling layer, a convolution layer, a linear rectification layer ReLU, a convolution layer, a gating layer Sigmoid, which are connected in sequence;
the enhanced feature image pair is transmitted into the second-order information attention module, which recalibrates the channels by using second-order channel statistics to adaptively learn the dependencies between features.
7. The infrared and visible light fusion imaging method based on deep learning of claim 1, characterized in that, in the mathematical model of the fusion module, Sa(·) and Ca(·) denote the implicit functions of spatial attention and channel attention, respectively, which are applied to the feature map pair output by the upper network and the feature map pair output by the lower network to generate the fused feature maps.
8. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein there are 5 convolution blocks, denoted CD1 to CD5; each convolution block comprises one 3 × 3 convolutional layer and one 1 × 1 convolutional layer;
the 5 convolution blocks CD are linked by upsampling and skip connections, wherein CD1, CD2 and CD3 are connected in sequence, the input of CD1 is simultaneously overlaid onto the inputs of CD2 and CD3, and the output of CD1 is overlaid onto the input of CD3; CD4 and CD5 are connected in sequence, and the input of CD4 is overlaid, after an upsampling layer, onto the input of CD1 on the one hand and onto the input of CD5 on the other; the output of CD4 is overlaid onto the input of CD2 after an upsampling layer, the output of CD5 is overlaid onto the input of CD3 after an upsampling layer, the output of CD3 is passed, after an upsampling layer, through two feature extraction units each consisting of a 3 × 3 convolutional layer and a linear rectification layer ReLU, and the output fused image is finally obtained through one 3 × 3 convolutional layer.
9. A terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that the processor, when executing the computer program, implements the steps of the deep learning based infrared and visible light fusion imaging method according to any one of claims 1 to 8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method for deep learning based infrared and visible light fusion imaging according to any one of claims 1 to 8.
CN202110878885.5A 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning Active CN113487530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110878885.5A CN113487530B (en) 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110878885.5A CN113487530B (en) 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning

Publications (2)

Publication Number Publication Date
CN113487530A true CN113487530A (en) 2021-10-08
CN113487530B CN113487530B (en) 2023-06-16

Family

ID=77945059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110878885.5A Active CN113487530B (en) 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning

Country Status (1)

Country Link
CN (1) CN113487530B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723412A (en) * 2023-08-10 2023-09-08 四川玉米星球科技有限公司 Method for homogenizing background light and shadow in photo and text shooting and scanning system
CN116824462A (en) * 2023-08-30 2023-09-29 贵州省林业科学研究院 Forest intelligent fireproof method based on video satellite

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160093034A1 (en) * 2014-04-07 2016-03-31 Steven D. BECK Contrast Based Image Fusion
WO2020243967A1 (en) * 2019-06-06 2020-12-10 深圳市汇顶科技股份有限公司 Face recognition method and apparatus, and electronic device
CN113033630A (en) * 2021-03-09 2021-06-25 太原科技大学 Infrared and visible light image deep learning fusion method based on double non-local attention models
CN113034408A (en) * 2021-04-30 2021-06-25 广东工业大学 Infrared thermal imaging deep learning image denoising method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160093034A1 (en) * 2014-04-07 2016-03-31 Steven D. BECK Contrast Based Image Fusion
WO2020243967A1 (en) * 2019-06-06 2020-12-10 深圳市汇顶科技股份有限公司 Face recognition method and apparatus, and electronic device
CN113033630A (en) * 2021-03-09 2021-06-25 太原科技大学 Infrared and visible light image deep learning fusion method based on double non-local attention models
CN113034408A (en) * 2021-04-30 2021-06-25 广东工业大学 Infrared thermal imaging deep learning image denoising method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程永翔; 刘坤; 贺钰博: "Image fusion based on convolutional neural network and visual saliency" (基于卷积神经网络与视觉显著性的图像融合), 计算机应用与软件 (Computer Applications and Software), no. 03, pages 231-236

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723412A (en) * 2023-08-10 2023-09-08 四川玉米星球科技有限公司 Method for homogenizing background light and shadow in photo and text shooting and scanning system
CN116723412B (en) * 2023-08-10 2023-11-10 四川玉米星球科技有限公司 Method for homogenizing background light and shadow in photo and text shooting and scanning system
CN116824462A (en) * 2023-08-30 2023-09-29 贵州省林业科学研究院 Forest intelligent fireproof method based on video satellite
CN116824462B (en) * 2023-08-30 2023-11-07 贵州省林业科学研究院 Forest intelligent fireproof method based on video satellite

Also Published As

Publication number Publication date
CN113487530B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN112347859A (en) Optical remote sensing image saliency target detection method
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN113487530A (en) Infrared and visible light fusion imaging method based on deep learning
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
US20240161304A1 (en) Systems and methods for processing images
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN116596792B (en) Inland river foggy scene recovery method, system and equipment for intelligent ship
CN112085717B (en) Video prediction method and system for laparoscopic surgery
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
CN114511798B (en) Driver distraction detection method and device based on transformer
CN115861650A (en) Shadow detection method and device based on attention mechanism and federal learning
CN112926552B (en) Remote sensing image vehicle target recognition model and method based on deep neural network
CN110555877B (en) Image processing method, device and equipment and readable medium
CN117036436A (en) Monocular depth estimation method and system based on double encoder-decoder
CN116823908A (en) Monocular image depth estimation method based on multi-scale feature correlation enhancement
Chen et al. Exploring efficient and effective generative adversarial network for thermal infrared image colorization
CN117576483B (en) Multisource data fusion ground object classification method based on multiscale convolution self-encoder
CN114842012B (en) Medical image small target detection method and device based on position awareness U-shaped network
CN116883232A (en) Image processing method, image processing apparatus, electronic device, storage medium, and program product
Kong et al. Color subspace exploring for natural image matting
CN117788515A (en) Single-target tracking method combining attention mechanism and weighted response
Vasiljevic Neural Camera Models

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant