CN113487530A - Infrared and visible light fusion imaging method based on deep learning - Google Patents

Infrared and visible light fusion imaging method based on deep learning

Info

Publication number
CN113487530A
CN113487530A (application CN202110878885.5A)
Authority
CN
China
Prior art keywords
image
layer
pair
feature
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110878885.5A
Other languages
Chinese (zh)
Other versions
CN113487530B (en)
Inventor
程良伦
李卓
吴衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110878885.5A priority Critical patent/CN113487530B/en
Publication of CN113487530A publication Critical patent/CN113487530A/en
Application granted granted Critical
Publication of CN113487530B publication Critical patent/CN113487530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an infrared and visible light fusion imaging method based on deep learning, which comprises the following steps: simultaneously acquiring an infrared image and a visible light image of the same size for a target to form a target object image pair; segmenting the target object image pair into sub-image pairs, preprocessing the sub-images and adding a noise map; and inputting the sub-image pair with the added noise map, as an input image pair, into a trained image fusion model to obtain a fused image. The image fusion model comprises a shallow feature extraction unit, an encoder, a fusion module and a decoder. The method takes the common fact that source images contain noise as a premise of practical imaging applications and performs image denoising during fusion, effectively preventing noise in the source image pair from being introduced into the fused image. It is highly beneficial to applied research on infrared and visible light fusion and deep learning technology, and is expected to find wide application in fields such as medical imaging and night surveillance.

Description

Infrared and visible light fusion imaging method based on deep learning
Technical Field
The application relates to the field of image fusion, in particular to an infrared and visible light fusion imaging method based on deep learning.
Background
Infrared thermal imaging takes the target's thermal radiation as input; it can detect a target even under insufficient illumination and distinguishes the target from the background by differences in radiation. Visible light sensors image the target from its reflected light, providing images with high resolution and sharp texture details. An infrared and visible light imaging system can therefore simultaneously reflect different attributes of the same target object and provide scene information from different aspects, and it has wide application in fields such as the military, video surveillance, driver assistance and forest fire prevention.
In recent years, with the widespread use of infrared and visible light imaging technology, the information utilization of multi-modal sensors and the reliable operating time of imaging systems have been further improved, but some problems have also been exposed: infrared thermal images have poor imaging quality, low contrast and severe noise interference; visible light images are susceptible to insufficient lighting, fog and other inclement weather; and most imaging systems generate noisy images. Image fusion methods based on infrared and visible light imaging systems are therefore extremely important in the field of image information fusion, and the development of advanced image fusion algorithms is very helpful for the application and development of information fusion technology.
Disclosure of Invention
The application aims to provide an infrared and visible light fusion imaging method based on deep learning, in order to solve the problems that, in infrared and visible light imaging systems, infrared imaging quality is poor, noise interference is severe, and visible light images are susceptible to bad weather.
In order to realize this task, the application adopts the following technical solution:
In a first aspect, the application provides an infrared and visible light fusion imaging method based on deep learning, which comprises the following steps:
simultaneously acquiring an infrared image and a visible light image of the same size for a target to form a target object image pair; segmenting the target object image pair into sub-image pairs, preprocessing the sub-images and adding a noise map; and inputting the sub-image pair with the added noise map, as an input image pair, into a trained image fusion model to obtain a fused image;
the image fusion model comprises a shallow layer feature extraction unit, an encoder, a fusion module and a decoder, wherein:
the shallow feature extraction unit is used for performing shallow feature extraction on the input image pair;
the encoder is of a two-layer network structure comprising an upper network and a lower network; the upper network comprises a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, with a convolutional layer arranged at the end for extracting features and recombining channels; the lower network comprises a non-local enhancement module and a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, followed by a second-order information attention module and a convolutional layer;
the fusion module is used for generating a fusion feature map by combining a space attention mechanism and a channel attention mechanism for the feature map pair output by the upper network and the feature map pair output by the lower network;
the decoder comprises a plurality of upsampling layers, convolution blocks, convolutional layers and linear rectification layers, wherein each convolution block includes two convolutional layers with convolution kernels of different sizes.
Further, the sub-image pair segmentation, preprocessing and noise map addition for the sub-images in the target object image pair includes:
down-sampling the target object image pair to divide into sub-image pairs; extracting image blocks and recombining pixels of each sub-image in the sub-image pair to obtain a preprocessed sub-image pair; and constructing a noise map in a random sampling mode, and adding the noise map into the preprocessed sub-image pair as an additional channel.
Further, the processing procedure of the shallow feature extraction unit includes:
inputting pairs of images
Figure BDA0003191239180000021
Is input to the shallow feature extraction unit in an image tensor format; the shallow feature extraction unit comprises a convolution layer and a linear rectification layer ReLU, and the input image pair
Figure BDA0003191239180000022
Completing shallow feature extraction after the convolution layer and the linear rectification layer to obtain a shallow feature map pair
Figure BDA0003191239180000023
Further, the continuous stacking and hopping connection of the multilayer convolutional layer and the linear rectifying layer in the upper network includes:
each convolution layer and a linear rectifying layer ReLU form a feature extraction unit, and the feature extraction units are 4 in total; the input of the first feature extraction unit is a shallow feature map pair, the input of the second feature extraction unit is the output of the first feature extraction unit and the shallow feature map pair, the input of the third feature extraction unit is the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, and the input of the fourth feature extraction unit is the output of the third feature extraction unit, the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, so that continuous stacking and skip connection is formed.
Further, the non-local enhancement module includes an image partition layer and four convolution layers, and the shallow feature map pairs in the lower network of the encoder
Figure BDA0003191239180000031
Firstly, dividing image blocks into layers by an image, respectively generating m multiplied by m image blocks with the same size, and performing non-local feature enhancement on each divided feature image block, wherein a mathematical model of the non-local feature enhancement is as follows:
Figure BDA0003191239180000032
wherein i is a feature position index of the shallow feature map to be calculated, and N is the number of position indexes of the shallow feature map; j is the index of all possible positions in the feature map,
Figure BDA0003191239180000033
the ith position representing the t-th feature tile,
Figure BDA0003191239180000034
indicating after enhancement
Figure BDA0003191239180000035
Figure BDA0003191239180000036
Respectively representing the convolution processing in the non-local enhancement module,
Figure BDA0003191239180000037
Wψ,Wωand W andρweights learned for the four convolutional layers in the non-local enhancement module;
each enhanced feature tile
Figure BDA0003191239180000038
Finally, the feature images are combined into a feature image tensor to generate an enhanced feature image pair
Figure BDA0003191239180000039
After the continuous stacking and jump connection of the convolution layer and the linear rectification layer ReLU in the lower network, the information enters a second-order information attention module.
Further, the second-order information attention module comprises a normalization layer, a pooling layer, a convolution layer, a linear rectification layer ReLU, a convolution layer and a gating layer Sigmoid which are sequentially connected;
enhancing pairs of feature images
Figure BDA00031912391800000310
Is introduced into the second-order information attention module, and takes into account the characteristics of the second-order statistic channelAnd (4) self-adaptively learning the dependency relationship between the features and readjusting the channel.
Further, in the mathematical model of the fusion module, Sa(·) and Ca(·) denote the implicit functions of spatial attention and channel attention, respectively; they are applied to the feature map pair output by the upper network and the feature map pair output by the lower network to generate the fused feature maps.
Further, there are 5 convolution blocks, denoted CD1 to CD5; each convolution block comprises one 3 × 3 convolutional layer and one 1 × 1 convolutional layer;
the 5 convolution blocks CD are linked by upsampling and skip connections, wherein CD1, CD2 and CD3 are connected in sequence, the input of CD1 is simultaneously overlaid onto the inputs of CD2 and CD3, and the output of CD1 is overlaid onto the input of CD3; CD4 and CD5 are connected in sequence, and the input of CD4 is overlaid, after an upsampling layer, onto the input of CD1 on the one hand and onto the input of CD5 on the other; the output of CD4 is overlaid onto the input of CD2 after an upsampling layer, the output of CD5 is overlaid onto the input of CD3 after an upsampling layer, the output of CD3 is passed, after an upsampling layer, through two feature extraction units each consisting of a 3 × 3 convolutional layer and a linear rectification layer ReLU, and the output fused image is finally obtained through one 3 × 3 convolutional layer.
In a second aspect, the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the deep learning based infrared and visible light fusion imaging method of the first aspect when executing the computer program.
In a third aspect, the present application provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the steps of the aforementioned deep learning-based infrared and visible light fusion imaging method of the first aspect.
Compared with the prior art, the method has the following technical characteristics:
in the method, a network structure is designed into a double-layer structure to extract depth characteristic mapping of infrared and visible images; the fusion strategy combines spatial and channel attention mechanisms to generate a richer fusion feature map; the nested connection architecture is applied to the fusion network of the application to avoid semantic defects between the encoder and the decoder; the noise estimation image is added into the deep learning fusion network to realize the denoising function, so that the model is denoised in the process of fusing the images. Compared with the prior art, the method and the device consider the common fact that the source image contains noise, take the source image as the premise of practical imaging application, carry out image denoising in the process of fusing the images, and effectively avoid the introduction of noise points in the source image pair into the fused images. The method is very beneficial to the application research of the infrared and visible light fusion and deep learning technology, and is expected to be widely applied in the fields of medical imaging, night monitoring and the like.
Drawings
FIG. 1 is a schematic flow chart illustrating the execution of a method according to an embodiment of the present application;
FIG. 2 is the image resulting from the preprocessing in the example, processing a 1 × 320 × 240 infrared and visible light image pair into a 5 × 160 × 120 image pair, with a noise map M added as an additional channel to the input;
FIG. 3 is a diagram illustrating an embodiment of a deep neural network architecture, the final result of which is a fused image;
FIG. 4 is a schematic diagram of the network encoder in an embodiment, which performs feature extraction on the infrared and visible light images to generate feature map pairs;
FIG. 5 is a schematic diagram of an exemplary fusion strategy, which is formed by combining spatial and channel attention mechanisms;
FIG. 6 is a schematic diagram of the network decoder in an embodiment;
fig. 7 is a schematic diagram of the convolution block CD in the embodiment, which is composed of one 3 × 3 convolutional layer and one 1 × 1 convolutional layer.
Detailed Description
The method improves the imaging quality, effectively removes noise interference in the image, enables the target in the image to be prominent and clear in texture, facilitates more accurate target identification, and is more beneficial to all-weather work of an imaging system.
As shown in fig. 1, the present application captures images of the detected target object with an infrared and visible light imaging system to generate an infrared and visible light image pair, and processes this target object image pair with a deep learning image fusion method. The fusion process includes the following steps:
s1, an infrared image and a visible light image of the same size are simultaneously acquired for the target to form a target image pair, and the target image pair is down-sampled to be divided into sub-image pairs.
In the present application, the target image pair (Iir, Ivis) of size nch × h × w is down-sampled and divided to form sub-image pairs of size 4nch × h/2 × w/2, where Iir is the infrared image, Ivis is the visible light image, nch is the number of channels, h is the height and w is the width.
And S2, extracting image blocks and recombining pixels of each sub-image in the sub-image pair to obtain a preprocessed sub-image pair.
In this step, 2 × 2 image blocks are extracted from each sub-image (infrared image and visible light image) and their pixels are rearranged into different channels of the output image, giving the preprocessed sub-image pair. In the corresponding mathematical model, c is the image channel index, x is the pixel abscissa and y is the pixel ordinate, with 0 ≤ c ≤ 4nch, 0 ≤ x ≤ h and 0 ≤ y ≤ w. The subsequent processing of the present application is carried out at this reduced scale.
And S3, constructing a noise map in a random sampling mode, and adding the noise map as an additional channel into the preprocessed sub-image pairs.
The noise map M is constructed by randomly sampling a sample of size nch × h × w from a uniform distribution over the noise standard deviation interval [σ1, σ2). The noise map M is added as an extra channel to the preprocessed input images; it controls the trade-off between noise reduction and detail retention. For example, in the present embodiment a uniform distribution over [0, 75) is employed.
Through the preprocessing of these steps, the target image pair yields infrared and visible sub-image pairs of size (4nch + nch) × h/2 × w/2; a schematic diagram is shown in fig. 2.
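For illustration, the preprocessing of steps S1 to S3 can be sketched as follows. The sketch uses PyTorch (the patent does not name a framework), treats the 2 × 2 block extraction and pixel recombination as a space-to-depth (pixel-unshuffle) rearrangement, and adds one noise channel per source channel; these choices are assumptions of the sketch, since the exact pixel re-indexing formula is not reproduced here.

```python
import torch
import torch.nn.functional as F

def preprocess_pair(ir, vis, sigma_min=0.0, sigma_max=75.0):
    """Split a (n_ch, h, w) image pair into (4*n_ch, h/2, w/2) sub-images
    and append a randomly sampled noise map as an extra channel.
    The space-to-depth rearrangement is an assumption; the patent's exact
    pixel re-indexing is not reproduced here."""
    def space_to_depth(img):
        # img: (n_ch, h, w) -> (4*n_ch, h/2, w/2)
        return F.pixel_unshuffle(img.unsqueeze(0), downscale_factor=2).squeeze(0)

    ir_sub, vis_sub = space_to_depth(ir), space_to_depth(vis)

    # Noise map M: one channel per source channel, sampled from U[sigma_min, sigma_max)
    n_ch, h2, w2 = ir.shape[0], ir_sub.shape[1], ir_sub.shape[2]
    noise_map = torch.empty(n_ch, h2, w2).uniform_(sigma_min, sigma_max)

    ir_in = torch.cat([ir_sub, noise_map], dim=0)   # (4*n_ch + n_ch, h/2, w/2)
    vis_in = torch.cat([vis_sub, noise_map], dim=0)
    return ir_in, vis_in

# Example: a 1-channel 320x240 pair becomes a 5x160x120 pair, as in fig. 2.
ir = torch.rand(1, 320, 240)
vis = torch.rand(1, 320, 240)
ir_in, vis_in = preprocess_pair(ir, vis)
print(ir_in.shape, vis_in.shape)  # torch.Size([5, 160, 120]) twice
```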
S4, the pairs of sub-images to which the noise images are added are input as input image pairs to the trained image fusion model, and a fused image is obtained.
The sub-image pairs with the added noise map serve as the input of the image fusion model shown in fig. 3, which finally outputs the fused image. Its mathematical model can be expressed as If = F(·), where If is the fused image and F(·) is an implicit function representing the image fusion model applied to the input sub-image pair; in this embodiment, the model is a convolutional neural network model.
Referring to fig. 3, the image fusion model proposed in the present application comprises, from left to right: a shallow feature extraction unit, an encoder, a fusion module and a decoder. These modules are introduced as follows:
1. shallow feature extraction unit
The input image pair is fed, in image tensor format, to the shallow feature extraction unit of the image fusion model. The unit comprises a 3 × 3 convolutional layer and a linear rectification layer ReLU; the input image pair passes through the convolutional layer and the linear rectification layer to complete shallow feature extraction and obtain the shallow feature map pair.
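A minimal sketch of the shallow feature extraction unit (one 3 × 3 convolution followed by a ReLU), assuming PyTorch; the channel counts are illustrative, with 5 input channels matching the preprocessed sub-images plus the noise channel.

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """3x3 convolution + ReLU applied to each image of the input pair.
    in_channels = 5 matches the preprocessed sub-images with the noise
    channel; out_channels is an assumed value."""
    def __init__(self, in_channels=5, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

# Usage: shallow feature maps for the infrared and visible inputs.
extractor = ShallowFeatureExtractor()
phi_ir = extractor(torch.rand(1, 5, 160, 120))
phi_vis = extractor(torch.rand(1, 5, 160, 120))
```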
2. Encoder
The present application provides a modular encoder structure for extracting the depth feature maps of the input image pair. As shown in fig. 4, the encoder is a two-layer network structure comprising an upper network and a lower network; the shallow feature map pair is transmitted to the upper network and the lower network of the two-layer network respectively, generating 4 deep feature maps; wherein:
2.1 upper network
From left to right, the upper network comprises a continuous stacking and skip connection of four 3 × 3 convolutional layers with linear rectification layers ReLU, and one 3 × 3 convolutional layer arranged at the end for extracting features and recombining channels. Continuous stacking and skip connection means that each 3 × 3 convolutional layer and a ReLU layer form a feature extraction unit, four units in total: the input of the first unit is the shallow feature map pair; the input of the second unit is the output of the first unit together with the shallow feature map pair; the input of the third unit is the outputs of the second and first units together with the shallow feature map pair; and the input of the fourth unit is the outputs of the third, second and first units together with the shallow feature map pair.
In the encoder, the shallow feature map pair is transmitted to the upper network for deep feature extraction and, after the last convolutional layer of the upper network, yields the feature map pair output by the upper branch.
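A sketch of the upper branch, assuming that the stacking and skip connections concatenate feature maps along the channel dimension (dense-block style) and that the final convolution sees the concatenation of all features; channel widths are illustrative.

```python
import torch
import torch.nn as nn

class UpperEncoder(nn.Module):
    """Densely connected stack of four 3x3 conv + ReLU units followed by a
    final 3x3 convolution, as a sketch of the encoder's upper branch.
    Skip connections are assumed to concatenate feature maps along the
    channel dimension; channel widths are illustrative."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.units = nn.ModuleList()
        in_ch = channels
        for _ in range(4):
            self.units.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth  # each later unit also sees all previous outputs
        self.final = nn.Conv2d(in_ch, channels, kernel_size=3, padding=1)

    def forward(self, x):
        feats = [x]  # x is the shallow feature map
        for unit in self.units:
            feats.append(unit(torch.cat(feats, dim=1)))
        return self.final(torch.cat(feats, dim=1))
```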
2.2 lower layer network
From left to right, the lower network comprises a non-local enhancement module, a continuous stacking and skip connection of four 3 × 3 convolutional layers with linear rectification layers ReLU, followed by a second-order information attention module and one 3 × 3 convolutional layer. There are likewise 4 feature extraction units in the lower network: the input of the first feature extraction unit is the shallow feature map pair after it has passed through the non-local enhancement module, and the output of the four feature extraction units serves as the input of the second-order information attention module. The continuous stacking and skip connection of these four feature extraction units is similar to that of the upper network and is not described in detail here.
(a) Non-local enhancement module
In this application, the non-local enhancement module includes an image partition layer and four 1 × 1 convolutional layers. In the lower network of the encoder, the shallow feature map pair first passes through the image partition layer, which divides each feature map into m × m feature tiles of equal size; the tile dimensions are h1 = h/m and w1 = w/m, where h1 and w1 are the height and width of a divided tile and h and w are the height and width of the feature map. Non-local feature enhancement is then applied to each divided feature tile: for each position i of a tile, where i is the feature position index to be calculated, N = h × w/k² is the number of position indexes of the shallow feature map and j indexes all possible positions in the feature map, the enhanced feature at position i of the t-th tile is computed by aggregating the features at all positions j; three of the 1 × 1 convolutional layers produce the aggregation weights and aggregated values, and the fourth projects the aggregated result, with Wφ, Wψ, Wω and Wρ denoting the weights learned by the four 1 × 1 convolutional layers of the non-local enhancement module.
The enhanced feature tiles are finally recombined into a feature tensor to generate the enhanced feature image pair, which, after the continuous stacking and skip connection of the four 3 × 3 convolutional layers and ReLU layers in the lower network, enters the second-order information attention module.
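The exact non-local formula appears in the patent as an equation image; the sketch below therefore uses the standard embedded-Gaussian non-local block, which uses four 1 × 1 convolutions corresponding to the weights Wφ, Wψ, Wω and Wρ, and should be read as an assumed formulation (including the residual connection) applied to one feature tile.

```python
import torch
import torch.nn as nn

class NonLocalEnhancement(nn.Module):
    """Standard embedded-Gaussian non-local block applied to one feature
    tile: four 1x1 convolutions (phi, psi, omega, rho) plus a residual
    connection. Assumed formulation; the inter-channel width is illustrative."""
    def __init__(self, channels=64):
        super().__init__()
        inter = channels // 2
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.psi = nn.Conv2d(channels, inter, kernel_size=1)
        self.omega = nn.Conv2d(channels, inter, kernel_size=1)
        self.rho = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, tile):
        b, c, h, w = tile.shape
        q = self.phi(tile).flatten(2).transpose(1, 2)    # (b, n, inter)
        k = self.psi(tile).flatten(2)                    # (b, inter, n)
        v = self.omega(tile).flatten(2).transpose(1, 2)  # (b, n, inter)
        attn = torch.softmax(q @ k, dim=-1)              # weights over all positions j
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return tile + self.rho(y)                        # residual enhancement
```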
(b) Second-order information attention module
The second-order information attention module comprises a normalization layer, a pooling layer, a 3 x 3 convolution layer, a linear rectification layer ReLU, a 3 x 3 convolution layer and a gating layer Sigmoid which are sequentially connected.
After being processed by the four feature extraction units, the enhanced feature image pair is transmitted to the second-order information attention module, which recalibrates the channels by using second-order channel statistics to adaptively learn the dependencies between features. Its mathematical model can be expressed as follows:
fsola = Channel(Cov(fR))
where fR denotes the enhanced feature image, Cov(·) denotes covariance normalization, Channel(·) denotes channel attention, and fsola is the feature map enhanced by channel information. The enhanced feature image pair, processed by the feature extraction units and enhanced with second-order information by the second-order information attention module, yields a deep feature map pair, which is then transmitted into the final 3 × 3 convolutional layer to complete the channel adjustment and generate the feature map pair output by the lower network.
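A simplified sketch of a second-order channel attention step: the channel covariance of the feature map stands in for the covariance normalization Cov(·), and a small conv-ReLU-conv-Sigmoid gate stands in for Channel(·). The reduction ratio, the 1 × 1 kernels and the pooling of the covariance matrix are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    """Simplified second-order channel attention: a channel covariance
    matrix summarises second-order statistics, is pooled to one descriptor
    per channel and passed through conv-ReLU-conv-Sigmoid to rescale the
    channels (f_sola = Channel(Cov(f_R)))."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        feat = x.flatten(2)                              # (b, c, h*w)
        feat = feat - feat.mean(dim=2, keepdim=True)     # zero-mean per channel
        cov = feat @ feat.transpose(1, 2) / (h * w - 1)  # (b, c, c) channel covariance
        desc = cov.mean(dim=2).view(b, c, 1, 1)          # second-order channel descriptor
        return x * self.fc(desc)                         # channel recalibration
```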
3. Fusion module
In the fusion module, the feature map pair output by the upper network and the feature map pair output by the lower network are fused by combining a spatial attention mechanism and a channel attention mechanism to generate the fused feature maps, as shown in fig. 5. The spatial attention mechanism is used to fuse the multi-scale depth features of the image pair and, considering that the deep features are three-dimensional tensors, the channel attention mechanism is used for the channel information calculation. In the mathematical model of the fusion strategy, Sa(·) and Ca(·) denote the implicit functions of spatial attention and channel attention, respectively; applying them to the upper-network feature map pair and the lower-network feature map pair yields the two fused feature maps that are passed to the decoder.
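The following sketch shows one common realization of attention-based fusion consistent with the description: per-pixel weights from the l1-norm of each source for spatial attention, and per-channel weights from pooled activations for channel attention. How the patent combines Sa(·) and Ca(·) exactly is not reproduced here, so the split between the two branches and the tensor sizes are assumptions.

```python
import torch

def spatial_attention_fuse(f_ir, f_vis, eps=1e-8):
    """Fuse a feature-map pair with spatial attention: per-pixel weights are
    derived from the l1-norm activity of each source."""
    a_ir = f_ir.abs().sum(dim=1, keepdim=True)
    a_vis = f_vis.abs().sum(dim=1, keepdim=True)
    w_ir = a_ir / (a_ir + a_vis + eps)
    return w_ir * f_ir + (1.0 - w_ir) * f_vis

def channel_attention_fuse(f_ir, f_vis, eps=1e-8):
    """Fuse a feature-map pair with channel attention: per-channel weights
    come from global pooling of the absolute activations of each source."""
    p_ir = f_ir.abs().mean(dim=(2, 3), keepdim=True)
    p_vis = f_vis.abs().mean(dim=(2, 3), keepdim=True)
    w_ir = p_ir / (p_ir + p_vis + eps)
    return w_ir * f_ir + (1.0 - w_ir) * f_vis

# Upper-branch pair fused with spatial attention, lower-branch pair with
# channel attention; the assignment of mechanisms to branches is an assumption.
f_up = spatial_attention_fuse(torch.rand(1, 64, 160, 120), torch.rand(1, 64, 160, 120))
f_low = channel_attention_fuse(torch.rand(1, 64, 80, 60), torch.rand(1, 64, 80, 60))
```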
4. Decoder
As shown in fig. 6, the decoder includes several upsampling layers, 5 convolution blocks CD, three 3 × 3 convolutional layers and two linear rectification layers ReLU; each convolution block CD contains one 3 × 3 convolutional layer and one 1 × 1 convolutional layer, as shown in fig. 7.
The 5 convolution blocks, denoted CD1, CD2, CD3, CD4 and CD5, are linked by upsampling and skip connections: CD1, CD2 and CD3 are connected in sequence, the input of CD1 is simultaneously overlaid onto the inputs of CD2 and CD3, and the output of CD1 is also overlaid onto the input of CD3; CD4 and CD5 are connected in sequence, and the input of CD4 is overlaid, after an upsampling layer, onto the input of CD1 on the one hand and onto the input of CD5 on the other; the output of CD4 is overlaid onto the input of CD2 after an upsampling layer, and the output of CD5 is overlaid onto the input of CD3 after an upsampling layer; the output of CD3, after an upsampling layer, is passed through two feature extraction units each consisting of a 3 × 3 convolutional layer and a ReLU layer, and the output fused image is finally obtained through one 3 × 3 convolutional layer. This way of linking the convolution blocks CD allows the network model to avoid semantic loss between the encoder and the decoder.
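A sketch of the convolution block CD of fig. 7 (one 3 × 3 convolution followed by a 1 × 1 convolution); channel widths are assumed, and the "overlay" inputs collected by each block are assumed here to be channel concatenations.

```python
import torch
import torch.nn as nn

class ConvBlockCD(nn.Module):
    """Convolution block CD from fig. 7: one 3x3 convolutional layer
    followed by one 1x1 convolutional layer; channel widths are assumed."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.conv1(self.conv3(x))

# In the decoder, each CD block receives an overlay of earlier feature maps,
# e.g. the input of CD2 collects the input and output of CD1 plus the
# upsampled output of CD4, following the connection pattern described above.
cd1 = ConvBlockCD(in_channels=128, out_channels=64)
out = cd1(torch.rand(1, 128, 160, 120))
```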
In the present application, the fused feature map produced from the upper network is first overlaid with the upsampled fused feature map produced from the lower network to generate a combined feature map. The combined feature map is transmitted to CD1 for feature extraction and is simultaneously passed, through skip connections, as an overlay input to CD2 and CD3 to provide richer fusion information. The combined feature map and the lower-network fused feature map pass through CD1 and CD4 respectively to generate two feature maps; the output of CD4, after an upsampling layer, is overlaid with the output of CD1 and with the combined feature map to form the input of CD2, while the output of CD1 is also overlaid into the input of CD3 through a skip connection. The lower-network fused feature map and the output of CD4 are overlaid as the input of CD5, which generates a deep feature map. The output of CD2 and the upsampled output of CD5 are then overlaid, together with the skip-connected feature maps, as the input of CD3 to generate the deep decoded features. After upsampling, these features are fed into two feature extraction units each consisting of a 3 × 3 convolutional layer and a linear rectification layer ReLU, and the reconstruction of the fused image is finally completed through one 3 × 3 convolutional layer, yielding the fused image F'.
In the deep neural network training process, the Adam optimizer is adopted to optimize the loss function L(Θ), which is defined as:
L(Θ) = LMSE + λLSSIM
where LMSE is the mean square error term: it measures the error between each high-definition reference image and the output of the model for the corresponding input image carrying additive white Gaussian noise with standard deviation σ = 5 (an image in the input image pair), averaged over the N input images used for training, with F(·) the latent function representing the processing of the proposed image fusion model. LSSIM is a cost function for image similarity built on the image similarity function SSIM(·), which is computed from the means and variances of images j and k, their covariance σjk, and a constant C. Θ denotes the deep learning network parameters and λ is the weight that controls the similarity cost term. After 2000 training iterations, the optimized parameters Θ' are obtained.
The embodiment of the application further provides a terminal device, which can be a computer or a server; comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above deep learning based infrared and visible light fusion imaging method when executing the computer program, for example, the aforementioned S1 to S4.
The computer program may also be partitioned into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of a computer program in a terminal device.
Implementations of the present application provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described deep learning-based infrared and visible light fusion imaging method, e.g., S1-S4.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in the relevant jurisdictions; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. An infrared and visible light fusion imaging method based on deep learning is characterized by comprising the following steps:
simultaneously acquiring an infrared image and a visible light image of the same size for a target to form a target object image pair; segmenting the target object image pair into sub-image pairs, preprocessing the sub-images and adding a noise map; and inputting the sub-image pair with the added noise map, as an input image pair, into a trained image fusion model to obtain a fused image;
the image fusion model comprises a shallow layer feature extraction unit, an encoder, a fusion module and a decoder, wherein:
the shallow feature extraction unit is used for performing shallow feature extraction on the input image pair;
the encoder is of a two-layer network structure comprising an upper network and a lower network; the upper network comprises a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, with a convolutional layer arranged at the end for extracting features and recombining channels; the lower network comprises a non-local enhancement module and a continuous stacking and skip connection of multiple convolutional layers and linear rectification layers, followed by a second-order information attention module and a convolutional layer;
the fusion module is used for generating a fusion feature map by combining a space attention mechanism and a channel attention mechanism for the feature map pair output by the upper network and the feature map pair output by the lower network;
the decoder comprises a plurality of upsampling layers, convolution blocks, convolutional layers and linear rectification layers, wherein each convolution block includes two convolutional layers with convolution kernels of different sizes.
2. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the sub-image pair segmentation, pre-processing and noise map addition of the sub-images in the target object image pair comprises:
down-sampling the target object image pair to divide into sub-image pairs; extracting image blocks and recombining pixels of each sub-image in the sub-image pair to obtain a preprocessed sub-image pair; and constructing a noise map in a random sampling mode, and adding the noise map into the preprocessed sub-image pair as an additional channel.
3. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the processing procedure of the shallow feature extraction unit comprises:
the input image pair is fed to the shallow feature extraction unit in image tensor format; the shallow feature extraction unit comprises a convolutional layer and a linear rectification layer ReLU, and the input image pair passes through the convolutional layer and the linear rectification layer to complete shallow feature extraction and obtain a shallow feature map pair.
4. The deep learning based infrared and visible light fusion imaging method of claim 1, wherein the continuous stacking and jumping connection of the multilayer convolutional layer and the linear rectifying layer in the upper network comprises:
each convolution layer and a linear rectifying layer ReLU form a feature extraction unit, and the feature extraction units are 4 in total; the input of the first feature extraction unit is a shallow feature map pair, the input of the second feature extraction unit is the output of the first feature extraction unit and the shallow feature map pair, the input of the third feature extraction unit is the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, and the input of the fourth feature extraction unit is the output of the third feature extraction unit, the output of the second feature extraction unit, the output of the first feature extraction unit and the shallow feature map pair, so that continuous stacking and skip connection is formed.
5. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the non-local enhancement module comprises an image partition layer and four convolutional layers; in the lower network of the encoder, the shallow feature map pair first passes through the image partition layer, which divides each feature map into m × m feature tiles of equal size, and non-local feature enhancement is applied to each divided feature tile: for each position i of a tile, where i is the feature position index of the shallow feature map to be calculated, N is the number of position indexes of the shallow feature map and j indexes all possible positions in the feature map, the enhanced feature at position i of the t-th tile is computed by aggregating the features at all positions j, the aggregation and its projection being carried out by the convolutions of the non-local enhancement module with weights Wφ, Wψ, Wω and Wρ learned by the four convolutional layers;
each enhanced feature tile is finally recombined into a feature tensor to generate an enhanced feature image pair, which, after the continuous stacking and skip connection of the convolutional layers and linear rectification layers ReLU in the lower network, enters the second-order information attention module.
6. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein the second-order information attention module comprises a normalization layer, a pooling layer, a convolution layer, a linear rectification layer ReLU, a convolution layer, a gating layer Sigmoid, which are connected in sequence;
the enhanced feature image pair is transmitted into the second-order information attention module, which recalibrates the channels by using second-order channel statistics to adaptively learn the dependencies between features.
7. The infrared and visible light fusion imaging method based on deep learning of claim 1, characterized in that, in the mathematical model of the fusion module, Sa(·) and Ca(·) denote the implicit functions of spatial attention and channel attention, respectively, which are applied to the feature map pair output by the upper network and the feature map pair output by the lower network to generate the fused feature maps.
8. The infrared and visible light fusion imaging method based on deep learning of claim 1, wherein there are 5 convolution blocks, denoted CD1 to CD5; each convolution block comprises one 3 × 3 convolutional layer and one 1 × 1 convolutional layer;
the 5 convolution blocks CD are linked by upsampling and skip connections, wherein CD1, CD2 and CD3 are connected in sequence, the input of CD1 is simultaneously overlaid onto the inputs of CD2 and CD3, and the output of CD1 is overlaid onto the input of CD3; CD4 and CD5 are connected in sequence, and the input of CD4 is overlaid, after an upsampling layer, onto the input of CD1 on the one hand and onto the input of CD5 on the other; the output of CD4 is overlaid onto the input of CD2 after an upsampling layer, the output of CD5 is overlaid onto the input of CD3 after an upsampling layer, the output of CD3 is passed, after an upsampling layer, through two feature extraction units each consisting of a 3 × 3 convolutional layer and a linear rectification layer ReLU, and the output fused image is finally obtained through one 3 × 3 convolutional layer.
9. A terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that the processor, when executing the computer program, implements the steps of the deep learning based infrared and visible light fusion imaging method according to any one of claims 1 to 8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method for deep learning based infrared and visible light fusion imaging according to any one of claims 1 to 8.
CN202110878885.5A 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning Active CN113487530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110878885.5A CN113487530B (en) 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110878885.5A CN113487530B (en) 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning

Publications (2)

Publication Number Publication Date
CN113487530A true CN113487530A (en) 2021-10-08
CN113487530B CN113487530B (en) 2023-06-16

Family

ID=77945059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110878885.5A Active CN113487530B (en) 2021-08-02 2021-08-02 Infrared and visible light fusion imaging method based on deep learning

Country Status (1)

Country Link
CN (1) CN113487530B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723412A (en) * 2023-08-10 2023-09-08 四川玉米星球科技有限公司 Method for homogenizing background light and shadow in photo and text shooting and scanning system
CN116824462A (en) * 2023-08-30 2023-09-29 贵州省林业科学研究院 Forest intelligent fireproof method based on video satellite

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160093034A1 (en) * 2014-04-07 2016-03-31 Steven D. BECK Contrast Based Image Fusion
WO2020243967A1 (en) * 2019-06-06 2020-12-10 深圳市汇顶科技股份有限公司 Face recognition method and apparatus, and electronic device
CN113033630A (en) * 2021-03-09 2021-06-25 太原科技大学 Infrared and visible light image deep learning fusion method based on double non-local attention models
CN113034408A (en) * 2021-04-30 2021-06-25 广东工业大学 Infrared thermal imaging deep learning image denoising method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160093034A1 (en) * 2014-04-07 2016-03-31 Steven D. BECK Contrast Based Image Fusion
WO2020243967A1 (en) * 2019-06-06 2020-12-10 深圳市汇顶科技股份有限公司 Face recognition method and apparatus, and electronic device
CN113033630A (en) * 2021-03-09 2021-06-25 太原科技大学 Infrared and visible light image deep learning fusion method based on double non-local attention models
CN113034408A (en) * 2021-04-30 2021-06-25 广东工业大学 Infrared thermal imaging deep learning image denoising method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程永翔; 刘坤; 贺钰博: "Image fusion based on convolutional neural network and visual saliency" (基于卷积神经网络与视觉显著性的图像融合), 计算机应用与软件 (Computer Applications and Software), no. 03, pages 231-236

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723412A (en) * 2023-08-10 2023-09-08 四川玉米星球科技有限公司 Method for homogenizing background light and shadow in photo and text shooting and scanning system
CN116723412B (en) * 2023-08-10 2023-11-10 四川玉米星球科技有限公司 Method for homogenizing background light and shadow in photo and text shooting and scanning system
CN116824462A (en) * 2023-08-30 2023-09-29 贵州省林业科学研究院 Forest intelligent fireproof method based on video satellite
CN116824462B (en) * 2023-08-30 2023-11-07 贵州省林业科学研究院 Forest intelligent fireproof method based on video satellite

Also Published As

Publication number Publication date
CN113487530B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN112347859A (en) Optical remote sensing image saliency target detection method
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN113487530A (en) Infrared and visible light fusion imaging method based on deep learning
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
US20240161304A1 (en) Systems and methods for processing images
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN116596792B (en) Inland river foggy scene recovery method, system and equipment for intelligent ship
CN112085717B (en) Video prediction method and system for laparoscopic surgery
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
CN114511798B (en) Driver distraction detection method and device based on transformer
CN115861650A (en) Shadow detection method and device based on attention mechanism and federal learning
CN112926552B (en) Remote sensing image vehicle target recognition model and method based on deep neural network
CN110555877B (en) Image processing method, device and equipment and readable medium
CN117036436A (en) Monocular depth estimation method and system based on double encoder-decoder
CN116823908A (en) Monocular image depth estimation method based on multi-scale feature correlation enhancement
Chen et al. Exploring efficient and effective generative adversarial network for thermal infrared image colorization
CN117576483B (en) Multisource data fusion ground object classification method based on multiscale convolution self-encoder
CN114842012B (en) Medical image small target detection method and device based on position awareness U-shaped network
CN116883232A (en) Image processing method, image processing apparatus, electronic device, storage medium, and program product
Kong et al. Color subspace exploring for natural image matting
CN117788515A (en) Single-target tracking method combining attention mechanism and weighted response
Vasiljevic Neural Camera Models

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant