CN117058507A - Fourier convolution-based visible light and infrared image multi-scale feature fusion method - Google Patents

Fourier convolution-based visible light and infrared image multi-scale feature fusion method Download PDF

Info

Publication number
CN117058507A
Authority
CN
China
Prior art keywords
convolution
fusion
feature map
feature
infrared image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311037544.0A
Other languages
Chinese (zh)
Other versions
CN117058507B (en)
Inventor
程文明
陈国强
魏振兴
张国财
唐长华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Aerospace Runbo Measurement And Control Technology Co ltd
Original Assignee
Zhejiang Aerospace Runbo Measurement And Control Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Aerospace Runbo Measurement And Control Technology Co ltd filed Critical Zhejiang Aerospace Runbo Measurement And Control Technology Co ltd
Priority to CN202311037544.0A priority Critical patent/CN117058507B/en
Publication of CN117058507A publication Critical patent/CN117058507A/en
Application granted granted Critical
Publication of CN117058507B publication Critical patent/CN117058507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a Fourier convolution-based visible light and infrared image multi-scale feature fusion method, which comprises the following steps: A. acquiring an RGB image and an infrared image to be fused; B. extracting deep semantic information from the RGB image and the infrared image through a multi-scale feature extractor to obtain an RGB image and an infrared image with deep semantic information; C. performing multi-source information fusion processing on the RGB image and the infrared image with deep semantic information through a fast Fourier convolution module to obtain a multi-source information fusion feature map; D. fusing the features of different layers in the multi-source information fusion feature map using a multi-scale feature fusion module to obtain a multi-scale feature fusion feature map; E. processing the multi-scale feature fusion feature map with a covariance pooling module using global covariance pooling to obtain a comprehensive fusion feature map. The application effectively fuses the infrared image and the RGB image together, yielding more comprehensive and more accurate image data.

Description

Fourier convolution-based visible light and infrared image multi-scale feature fusion method
Technical Field
The application relates to an image feature extraction and processing method, in particular to a Fourier convolution-based multi-scale feature fusion method for visible light and infrared images.
Background
In the military and security fields, infrared or visible light detection images are commonly used for object detection and identification. Infrared detection images capture infrared radiation outside the visible spectrum, which cannot be perceived by the human eye. Infrared radiation penetrates certain substances and environments well and can pass through smoke, haze, cloud layers and other visual obstructions, so infrared images can still provide effective image information under severe weather conditions, which benefits observation and monitoring in complex environments. However, infrared images have limitations, such as relatively low resolution and imaging quality that is affected by environmental factors. Visible light detection images have weaker penetrability but higher resolution and good imaging quality. Combining the infrared image with the RGB image can overcome their respective limitations and improve the comprehensiveness and usability of the imagery. Therefore, there is a need for a method that effectively fuses infrared and RGB images together to obtain more comprehensive and accurate image data.
Disclosure of Invention
The application aims to provide a Fourier convolution-based multi-scale feature fusion method for visible light and infrared images. The application effectively fuses the infrared image and the RGB image together, yielding more comprehensive and more accurate image data.
The technical scheme of the application is as follows: the Fourier convolution-based visible light and infrared image multi-scale feature fusion method comprises the following steps:
A. acquiring an RGB image and an infrared image to be fused;
B. deep semantic information in the RGB image and the infrared image is extracted through a multi-scale feature extractor, and the RGB image and the infrared image with the deep semantic information are obtained;
C. carrying out multi-source information fusion processing on the RGB image and the infrared image with deep semantic information through a fast Fourier convolution module to obtain a multi-source information fusion feature map;
D. fusing the features of different layers in the multi-source information fusion feature map by using a multi-scale feature fusion module to obtain a multi-scale feature fusion feature map;
E. and the covariance pooling module processes the multi-scale feature fusion feature map by adopting a global covariance pooling mode to obtain a comprehensive fusion feature map.
In the aforementioned Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the specific process of multi-source information fusion by the fast Fourier convolution module is as follows:

C1, representing the RGB image with deep semantic information as $X_r \in \mathbb{R}^{b_r \times r \times c}$, where $b_r$ denotes the band and $r \times c$ denotes the pixel height and width; representing the infrared image with deep semantic information as $X_i \in \mathbb{R}^{b_i \times r \times c}$, where $b_i$ denotes the band and $r \times c$ denotes the pixel height and width;

C2, explicitly decomposing $X_r$ and $X_i$ along the channel dimension with the fast Fourier convolution module, yielding a feature map $Y_l^{H \to H}$ for the mapping of high-frequency branch H to high-frequency branch H, a feature map $Y_l^{H \to L}$ for the mapping of high-frequency branch H to low-frequency branch L, a feature map $Y_h^{L \to H}$ for the mapping of low-frequency branch L to high-frequency branch H, and a feature map $Y_h^{L \to L}$ for the mapping of low-frequency branch L to low-frequency branch L;

C3, concatenating $Y_l^{H \to H}$ with $Y_h^{L \to H}$, and $Y_l^{H \to L}$ with $Y_h^{L \to L}$, to obtain two concatenated feature maps $X \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ and $C$ denote the spatial resolution and the number of channels, respectively;

C4, splitting the concatenated feature map X along the feature-channel dimension with the fast Fourier convolution module, i.e., $X = \{X^l, X^g\}$, where the local part $X^l \in \mathbb{R}^{H \times W \times (1-\alpha_{in})C}$ is used for learning from local neighborhoods, the global part $X^g \in \mathbb{R}^{H \times W \times \alpha_{in} C}$ is used for capturing long-range context, and $\alpha_{in} \in [0,1]$ denotes the fraction of feature channels assigned to the global part;

C5, using $Y \in \mathbb{R}^{H \times W \times C}$ as the output tensor, letting $Y = \{Y^l, Y^g\}$, and updating it with equation (1):

$Y^l = Y^{l \to l} + Y^{g \to l} = f_l(X^l) + f_{g \to l}(X^g)$
$Y^g = Y^{g \to g} + Y^{l \to g} = f_g(X^g) + f_{l \to g}(X^l)$    (1)

C6, applying a 3×3 convolution to $Y^l$ and $Y^g$, then fusing the two to obtain the output tensor Y, i.e., the multi-source information fusion feature map.
in the method for fusing the multi-scale features of the visible light and the infrared image based on Fourier convolution, the specific processing procedure of the multi-scale feature fusion module is as follows: sequentially carrying out bottleneck processing on the multisource information fusion feature map through a plurality of bottleneck blocks which are connected in series to obtain a multiscale feature fusion feature map;
the convolution window moving stride of the bottleneck block comprises two modes, namely 1 and 2; when the convolution window moving step of the bottleneck block is 1, firstly carrying out feature extraction by using 1X 1 convolution processing on the bottleneck block through depth convolution, and finally carrying out point convolution processing; when the convolution window moving step of the bottleneck block is 2, the bottleneck block firstly uses 1×1 convolution processing, then uses multi-scale convolution to extract features, and finally carries out point convolution processing.
In the aforementioned Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the specific processing procedure of the multi-scale convolution is as follows: the input feature map is divided equally into s groups by channel; features are then extracted from the first group's input feature map using a 3×3 convolution; the extracted feature output of the first group is sent to the second group and added to the input of the second group, and the result of the addition is sent to the second group's 3×3 convolution; this is repeated until the final group of feature maps has been processed; finally, all extracted feature outputs are concatenated by channel and a 1×1 point convolution performs information fusion.
In the aforementioned Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the specific processing procedure of the covariance pooling module is as follows:

E1, first converting the multi-scale feature fusion feature map of size h×w×d into a feature map of size n×d, where n = h×w; h and w denote the height and width of the feature map, respectively, and d denotes the size of the third (channel) dimension of the feature map;

E2, computing the covariance matrix $\Sigma = X^{T} \bar{I} X$, where $\bar{I} = \frac{1}{n}\left(I - \frac{1}{n}\mathbf{1}\right)$, I is the n×n identity matrix, $\mathbf{1}$ is the n×n matrix whose elements are all 1, and X denotes the original input feature map fed to the covariance pooling module;

E3, pre-normalizing the covariance matrix Σ by the formula $A = \frac{1}{\operatorname{tr}(\Sigma)}\Sigma$;

E4, performing iterative processing with the Newton-Schulz iteration formula;

E5, performing post-compensation processing and splicing processing in sequence.
In the aforementioned Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the Newton-Schulz iteration formula is

$Y_k = \frac{1}{2} Y_{k-1}\left(3I - Z_{k-1} Y_{k-1}\right), \qquad Z_k = \frac{1}{2}\left(3I - Z_{k-1} Y_{k-1}\right) Z_{k-1}$

where I denotes the identity matrix; $Y_{k-1}$ denotes the result obtained after k−1 iterations starting from the matrix A, and likewise $Y_k$ the result after k iterations; $Z_{k-1}$ denotes the result obtained after k−1 iterations starting from the identity matrix I, and likewise $Z_k$ the result after k iterations.
In the aforementioned Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the calculation formula of the post-compensation processing is: $C = \left(\operatorname{tr}(\Sigma)\right)^{1/2} Y_N$, where tr(Σ) is the trace of the covariance matrix and $Y_N$ is the result obtained after N iterations.
In the aforementioned Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the specific process of the splicing processing is to splice the upper triangular part of the symmetric matrix obtained by the post-compensation processing into a d(d−1)/2-dimensional vector, thereby obtaining the comprehensive fusion feature map.
Compared with the prior art, the application sequentially performs, on the RGB image and the infrared image, feature extraction by the multi-scale feature extractor, multi-source information fusion by the fast Fourier convolution module, fusion of different layer features by the multi-scale feature fusion module, and global covariance pooling by the covariance pooling module. The RGB image and the infrared image are thereby effectively fused together, capturing thermal and color information simultaneously, achieving efficient feature fusion and more comprehensive target analysis and feature extraction, and yielding more comprehensive and more accurate image data that provides powerful support for analysis and research in many fields. For example, in the military and security fields, combining infrared images with RGB images enables more accurate target detection and identification and improves night vision and target tracking capabilities.
Specifically, deep semantic information in an image is extracted through a multi-scale feature extractor; the fast Fourier convolution module performs multi-source information fusion and retains discrimination information; fusing the features of different layers in the feature map by utilizing a multi-scale feature fusion module; global covariance pooling replaces global average pooling, and high-order information is extracted from RGB images and infrared images to obtain richer depth feature statistical information.
In summary, the application effectively fuses the infrared image and the RGB image together to obtain more comprehensive and more accurate image data.
Extensive experiments on a baseline dataset showed that, compared with using only infrared images or only RGB images, the classification accuracy obtained by fusion increased by 2.036% and 1.926%, respectively.
Drawings
FIG. 1 is a schematic flow chart of the present application;
FIG. 2 is a schematic diagram of the overall structure of the present application;
FIG. 3 is a schematic diagram of a fast Fourier convolution module according to the present application;
FIG. 4 is a schematic diagram of the structure of a fast Fourier convolution layer in the fast Fourier convolution module of the present application, wherein (a) is the overall diagram of the Fourier convolution module and (b) is the specific structure of the Spectral Transformer branch in (a);
FIG. 5 is a schematic diagram of a bottleneck block according to the present application;
FIG. 6 is a schematic diagram of a multi-scale convolution in a bottleneck block according to an embodiment of the present application;
fig. 7 is a schematic flow chart of a covariance pooling module according to an embodiment of the application.
Detailed Description
The application is further illustrated by the following figures and examples, which are not intended to be limiting.
Example 1. A Fourier convolution-based visible light and infrared image multi-scale feature fusion method comprises the following steps:
A. acquiring an RGB image and an infrared image to be fused;
B. deep semantic information in the RGB image and the infrared image is extracted through a multi-scale feature extractor, and the RGB image and the infrared image with the deep semantic information are obtained;
C. carrying out multi-source information fusion processing on the RGB image and the infrared image with deep semantic information through a fast Fourier convolution module to obtain a multi-source information fusion feature map;
D. fusing the features of different layers in the multi-source information fusion feature map by using a multi-scale feature fusion module to obtain a multi-scale feature fusion feature map;
E. and the covariance pooling module processes the multi-scale feature fusion feature map by adopting a global covariance pooling mode to obtain a comprehensive fusion feature map.
The specific process of multi-source information fusion by the fast Fourier convolution module is as follows:

C1, representing the RGB image with deep semantic information as $X_r \in \mathbb{R}^{b_r \times r \times c}$, where $b_r$ denotes the band and $r \times c$ denotes the pixel height and width; representing the infrared image with deep semantic information as $X_i \in \mathbb{R}^{b_i \times r \times c}$, where $b_i$ denotes the band and $r \times c$ denotes the pixel height and width;

C2, explicitly decomposing $X_r$ and $X_i$ along the channel dimension with the fast Fourier convolution module, yielding a feature map $Y_l^{H \to H}$ for the mapping of high-frequency branch H to high-frequency branch H, a feature map $Y_l^{H \to L}$ for the mapping of high-frequency branch H to low-frequency branch L, a feature map $Y_h^{L \to H}$ for the mapping of low-frequency branch L to high-frequency branch H, and a feature map $Y_h^{L \to L}$ for the mapping of low-frequency branch L to low-frequency branch L;

C3, concatenating $Y_l^{H \to H}$ with $Y_h^{L \to H}$, and $Y_l^{H \to L}$ with $Y_h^{L \to L}$, to obtain two concatenated feature maps $X \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ and $C$ denote the spatial resolution and the number of channels, respectively;

C4, splitting the concatenated feature map X along the feature-channel dimension with the fast Fourier convolution module, i.e., $X = \{X^l, X^g\}$, where the local part $X^l \in \mathbb{R}^{H \times W \times (1-\alpha_{in})C}$ is used for learning from local neighborhoods, the global part $X^g \in \mathbb{R}^{H \times W \times \alpha_{in} C}$ is used for capturing long-range context, and $\alpha_{in} \in [0,1]$ denotes the fraction of feature channels assigned to the global part;

C5, using $Y \in \mathbb{R}^{H \times W \times C}$ as the output tensor, letting $Y = \{Y^l, Y^g\}$, and updating it with equation (1):

$Y^l = Y^{l \to l} + Y^{g \to l} = f_l(X^l) + f_{g \to l}(X^g)$
$Y^g = Y^{g \to g} + Y^{l \to g} = f_g(X^g) + f_{l \to g}(X^l)$    (1)

C6, applying a 3×3 convolution to $Y^l$ and $Y^g$, then fusing the two to obtain the output tensor Y, i.e., the multi-source information fusion feature map.
The specific processing procedure of the multi-scale feature fusion module is as follows: the multi-source information fusion feature map is passed sequentially through a plurality of bottleneck blocks connected in series to obtain the multi-scale feature fusion feature map;
the convolution window moving stride of the bottleneck block has two modes, 1 and 2; when the stride of the bottleneck block is 1, the bottleneck block first applies a 1×1 convolution, then performs feature extraction by depthwise convolution, and finally applies a point convolution; when the stride of the bottleneck block is 2, the bottleneck block first applies a 1×1 convolution, then extracts features with a multi-scale convolution, and finally applies a point convolution.
The specific processing procedure of the multi-scale convolution is as follows: the input feature map is divided equally into s groups by channel; features are then extracted from the first group's input feature map using a 3×3 convolution; the extracted feature output of the first group is sent to the second group and added to the input of the second group, and the result of the addition is sent to the second group's 3×3 convolution; this is repeated until the final group of feature maps has been processed; finally, all extracted feature outputs are concatenated by channel and a 1×1 point convolution performs information fusion.
The specific processing procedure of the covariance pooling module is as follows:

E1, first converting the multi-scale feature fusion feature map of size h×w×d into a feature map of size n×d, where n = h×w; h and w denote the height and width of the feature map, respectively, and d denotes the size of the third (channel) dimension of the feature map;

E2, computing the covariance matrix $\Sigma = X^{T} \bar{I} X$, where $\bar{I} = \frac{1}{n}\left(I - \frac{1}{n}\mathbf{1}\right)$, I is the n×n identity matrix, $\mathbf{1}$ is the n×n matrix whose elements are all 1, and X denotes the original input feature map fed to the covariance pooling module;

E3, pre-normalizing the covariance matrix Σ by the formula $A = \frac{1}{\operatorname{tr}(\Sigma)}\Sigma$;

E4, performing iterative processing with the Newton-Schulz iteration formula;

E5, performing post-compensation processing and splicing processing in sequence.
The Newton-Schulz iteration formula is

$Y_k = \frac{1}{2} Y_{k-1}\left(3I - Z_{k-1} Y_{k-1}\right), \qquad Z_k = \frac{1}{2}\left(3I - Z_{k-1} Y_{k-1}\right) Z_{k-1}$

where I denotes the identity matrix; $Y_{k-1}$ denotes the result obtained after k−1 iterations starting from the matrix A, and likewise $Y_k$ the result after k iterations; $Z_{k-1}$ denotes the result obtained after k−1 iterations starting from the identity matrix I, and likewise $Z_k$ the result after k iterations.
The calculation formula of the post-compensation processing is: $C = \left(\operatorname{tr}(\Sigma)\right)^{1/2} Y_N$, where tr(Σ) is the trace of the covariance matrix and $Y_N$ is the result obtained after N iterations.
The specific process of the splicing processing is to splice the upper triangular part of the symmetric matrix obtained by the post-compensation processing into a d(d−1)/2-dimensional vector, obtaining the comprehensive fusion feature map.
Example 2. Based on the Fourier convolution-based visible light and infrared image multi-scale feature fusion method, the framework of the application is designed to perform pixel-level classification by fusing multi-source remote sensing images, as shown in Fig. 2. It mainly consists of two parts: 1) multi-source frequency decomposition and fusion based on the fast Fourier convolution module (FFCN) (first part); 2) feature extraction by the multi-scale layer feature fusion module and the covariance pooling module (GCP module) (second part).
A multi-scale feature fusion covariance network based on the fast Fourier convolution module (F²MCN) is constructed, focusing on efficient feature fusion and comprehensive feature extraction. First, the FFCN uses fast Fourier convolution layers to fuse multi-source information while retaining discriminative information. Then, a multi-scale feature fusion (MF²) module fuses the features of different layers in the F²MCN. Finally, global covariance pooling (GCP) replaces global average pooling (GAP), extracting high-order information from the RGB and infrared images to obtain richer deep feature statistics.
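As an illustration only, the following minimal PyTorch sketch shows how steps A-E could be wired together; the four injected module classes are hypothetical placeholders for the extractor, FFCN, MF² and GCP components described above, not the patented implementation itself.

import torch
import torch.nn as nn

class F2MCN(nn.Module):
    """Skeleton of the F2MCN pipeline; the component modules are injected."""
    def __init__(self, extractor, ffc_fusion, msf_fusion, gcp_pool):
        super().__init__()
        self.extractor = extractor    # step B: multi-scale feature extractor
        self.ffc_fusion = ffc_fusion  # step C: fast Fourier convolution module
        self.msf_fusion = msf_fusion  # step D: multi-scale feature fusion module
        self.gcp_pool = gcp_pool      # step E: global covariance pooling

    def forward(self, rgb, infrared):        # step A: the two inputs to fuse
        x_r = self.extractor(rgb)            # deep semantic RGB features
        x_i = self.extractor(infrared)       # deep semantic infrared features
        fused = self.ffc_fusion(x_r, x_i)    # multi-source information fusion feature map
        multiscale = self.msf_fusion(fused)  # multi-scale feature fusion feature map
        return self.gcp_pool(multiscale)     # comprehensive fusion feature map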
Fig. 3 shows the fast Fourier convolution (FFC Conv) layer in the lower half of Fig. 2; this more efficient convolution layer has previously been used for visible image classification. Simple feature splicing or superposition operations very easily produce redundant information. Fusing visible light and infrared image information with classical feature extraction and fusion methods reduces part of this redundancy, but redundancy still remains in the low-frequency part. The present application first uses a Fourier convolution layer to decompose the input image into a multi-resolution representation, which makes it easier to reduce spatial redundancy.
In this step, the visible light image (RGB image) is represented as $X_r \in \mathbb{R}^{b_r \times r \times c}$, where $b_r$ denotes the band and $r \times c$ denotes the pixel height and width. The infrared image is represented as $X_i \in \mathbb{R}^{b_i \times r \times c}$, where $b_i$ denotes the band and $r \times c$ denotes the pixel height and width.
The fast Fourier convolution (FFC) explicitly decomposes $X_r$ and $X_i$ along the channel dimension, where $Y_l^{H \to H}$ denotes the feature map for the mapping of the high-frequency branch (H) to the high-frequency branch (H), $Y_l^{H \to L}$ the feature map for the mapping of the high-frequency branch (H) to the low-frequency branch (L), $Y_h^{L \to H}$ the feature map for the mapping of the low-frequency branch (L) to the high-frequency branch (H), and $Y_h^{L \to L}$ the feature map for the mapping of the low-frequency branch (L) to the low-frequency branch (L).
The FFC architecture is shown in Fig. 4(a), and Fig. 4(b) is a block diagram of the Spectral Transformer. Conceptually, the FFC consists of two interconnected paths: a spatial (or local) path that applies an ordinary convolution to a portion of the input feature channels, and a spectral (or global) path that operates in the spectral domain. Each path captures complementary information with a different receptive field, and information is exchanged between the paths internally.
Formally, let $X \in \mathbb{R}^{H \times W \times C}$ be the input feature map of the FFC, where $H \times W$ and $C$ denote the spatial resolution and the number of channels, respectively. At the FFC entrance, X is first split along the feature-channel dimension, i.e., $X = \{X^l, X^g\}$. The local part $X^l \in \mathbb{R}^{H \times W \times (1-\alpha_{in})C}$ learns from local neighborhoods; the global part $X^g \in \mathbb{R}^{H \times W \times \alpha_{in} C}$ is intended to capture long-range context. $\alpha_{in} \in [0,1]$ denotes the fraction of feature channels assigned to the global part. To simplify the network, the output is assumed to have the same size as the input. Use $Y \in \mathbb{R}^{H \times W \times C}$ as the output tensor. Similarly, let $Y = \{Y^l, Y^g\}$ be the local-global partition, where the global proportion of the output tensor is controlled by the hyperparameter $\alpha_{out} \in [0,1]$. The update process inside the FFC can be described by the following formula:

$Y^l = Y^{l \to l} + Y^{g \to l} = f_l(X^l) + f_{g \to l}(X^g)$
$Y^g = Y^{g \to g} + Y^{l \to g} = f_g(X^g) + f_{l \to g}(X^l)$    (1)
The component $Y^{l \to l}$ is intended to capture small-scale information using conventional convolution. Likewise, the other two components ($Y^{g \to l}$ and $Y^{l \to g}$) are obtained by inter-path conversion and are also implemented with conventional convolutions, so as to make full use of the multi-scale receptive field. The main complexity lies in the computation of $Y^{g \to g}$. For clarity of description, we call $f_g$ the spectral transformer, as shown in Fig. 4(b).
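A compact sketch of an FFC layer in this spirit is given below, assuming PyTorch. The local-global split, the three conventional-convolution components of equation (1), and a Fourier unit for $f_g$ follow the published FFC design; the 1×1 spectral convolution, the normalization/activation placement, and the split ratio alpha are illustrative assumptions rather than the patented layer.

import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    # Spectral path used inside f_g: real FFT -> 1x1 conv on stacked
    # real/imaginary channels -> inverse real FFT.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels * 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")          # (b, c, h, w//2 + 1)
        spec = torch.cat([spec.real, spec.imag], dim=1)  # stack re/im as channels
        spec = self.relu(self.bn(self.conv(spec)))       # pointwise conv in spectrum
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

class FFC(nn.Module):
    def __init__(self, channels, alpha=0.5):
        super().__init__()
        cg = int(channels * alpha)                    # global channels (alpha share)
        cl = channels - cg                            # local channels
        self.cl = cl
        self.f_l = nn.Conv2d(cl, cl, 3, padding=1)    # Y^{l->l}: conventional conv
        self.f_l2g = nn.Conv2d(cl, cg, 3, padding=1)  # Y^{l->g}: inter-path conv
        self.f_g2l = nn.Conv2d(cg, cl, 3, padding=1)  # Y^{g->l}: inter-path conv
        self.f_g = FourierUnit(cg)                    # Y^{g->g}: spectral transformer

    def forward(self, x):
        xl, xg = x[:, :self.cl], x[:, self.cl:]       # split X = {X^l, X^g}
        yl = self.f_l(xl) + self.f_g2l(xg)            # equation (1), local output
        yg = self.f_g(xg) + self.f_l2g(xl)            # equation (1), global output
        return torch.cat([yl, yg], dim=1)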
The Bottleneck blocks in Fig. 3 use stride in two modes, 1 and 2. When the Bottleneck stride is set to 1, the Bottleneck has the structure shown in the left half of Fig. 5: the dimension is first raised using a 1×1 convolution, feature extraction is performed by a depthwise convolution (DW Conv), and finally a point convolution is applied. When the stride is set to 2, the Bottleneck has the structure shown in the right half of Fig. 5: the dimension is raised using a 1×1 convolution, features are then extracted with a multi-scale convolution (MS Conv, whose detailed structure is shown in Fig. 6), and finally a point convolution is applied.
Fig. 6 shows the MS Conv structure with split s = 4; the input feature map is divided equally into s groups by channel. Features are extracted from the input feature map of the first group using a 3×3 convolution. The output of the first group is then sent to the second group and added to the second group's input; the result of the addition is sent to the second group's 3×3 convolution. This process is repeated until the final group of feature maps has been processed. Finally, all outputs are concatenated by channel and a 1×1 point convolution performs information fusion.
Compared with the bottleneck block of ResNet, the first 1×1 convolution of the bottleneck block of the present application raises the dimension of the input feature map, which provides sufficient channels for MS Conv to perform multi-scale feature extraction. Since in MS Conv the output of the previous group is added to the input of the current group, the feature maps must be of the same size, so MS Conv is used only when the bottleneck stride is 1.
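The following sketch shows the group-wise MS Conv wiring described above, assuming PyTorch; the group count s = 4, the equal channel split, and the absence of normalization layers are illustrative simplifications.

import torch
import torch.nn as nn

class MSConv(nn.Module):
    def __init__(self, channels, s=4):
        super().__init__()
        assert channels % s == 0, "channels must divide evenly into s groups"
        self.s = s
        g = channels // s
        # one 3x3 convolution per channel group
        self.convs = nn.ModuleList(
            nn.Conv2d(g, g, kernel_size=3, padding=1) for _ in range(s)
        )
        self.point = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 fusion

    def forward(self, x):
        groups = torch.chunk(x, self.s, dim=1)  # split the input by channel
        outs, prev = [], None
        for i, g_in in enumerate(groups):
            # add the previous group's extracted features to this group's input
            inp = g_in if prev is None else g_in + prev
            prev = self.convs[i](inp)
            outs.append(prev)
        # concatenate all extracted outputs by channel, then fuse with 1x1 conv
        return self.point(torch.cat(outs, dim=1))

# e.g. y = MSConv(64, s=4)(torch.randn(1, 64, 32, 32))  # y: (1, 64, 32, 32)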
The architecture of GCP is shown in Fig. 7. The feature map of size h×w×d output by the multi-scale feature fusion is converted into a feature map of size n×d, where n = h×w. First, the covariance matrix is computed as $\Sigma = X^{T} \bar{I} X$, where $\bar{I} = \frac{1}{n}\left(I - \frac{1}{n}\mathbf{1}\right)$, I is the n×n identity matrix, and $\mathbf{1}$ is the n×n matrix whose elements are all 1.
Then, in a pre-normalization step, the covariance matrix Σ is divided by its trace, $A = \frac{1}{\operatorname{tr}(\Sigma)}\Sigma$, where tr(·) denotes the trace of a matrix. This is done so that the subsequent Newton-Schulz iteration converges. The iteration formula is:

$Y_k = \frac{1}{2} Y_{k-1}\left(3I - Z_{k-1} Y_{k-1}\right), \qquad Z_k = \frac{1}{2}\left(3I - Z_{k-1} Y_{k-1}\right) Z_{k-1}$

where I denotes the identity matrix; $Y_{k-1}$ denotes the result obtained after k−1 iterations starting from the matrix A, and likewise $Y_k$ the result after k iterations; $Z_{k-1}$ denotes the result obtained after k−1 iterations starting from the identity matrix I, and likewise $Z_k$ the result after k iterations.
in the post-compensation, the result Y obtained after N iterations N Multiplying the square root of the covariance matrix trace, c= (tr (Σ)) 1/2 Y N To eliminate the adverse effects of pre-normalization. And finally, splicing the upper triangular matrix of the symmetrical matrix C obtained by post-compensation into a d (d-1)/2-dimensional vector, and transmitting the d (d-1)/2-dimensional vector to the FC layer.
C=(tr(∑)) 1/2 Y N The function of this formula is to eliminate the adverse effects of pre-normalization.
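A minimal sketch of the GCP forward pass (steps E1-E5) follows, assuming PyTorch; the iteration count N = 5 is an illustrative assumption, and the strict upper triangle of length d(d−1)/2 follows the wording of the description.

import torch

def gcp_pool(feat, num_iter=5):
    # feat: (h, w, d) feature map -> d(d-1)/2-dimensional covariance feature vector
    h, w, d = feat.shape
    n = h * w
    x = feat.reshape(n, d)                               # E1: reshape to (n, d)
    # E2: Sigma = X^T I_bar X, with I_bar = (1/n)(I - (1/n) * all-ones matrix)
    i_bar = (torch.eye(n) - torch.ones(n, n) / n) / n
    sigma = x.T @ i_bar @ x                              # (d, d) covariance
    # E3: pre-normalize by the trace so the Newton-Schulz iteration converges
    tr = sigma.diagonal().sum()
    y, z = sigma / tr, torch.eye(d)                      # Y_0 = A, Z_0 = I
    # E4: coupled Newton-Schulz iteration for the matrix square root
    for _ in range(num_iter):
        t = 0.5 * (3.0 * torch.eye(d) - z @ y)
        y, z = y @ t, t @ z
    # E5: post-compensation, then splice the strict upper triangle
    c = torch.sqrt(tr) * y                               # C = tr(Sigma)^(1/2) * Y_N
    iu = torch.triu_indices(d, d, offset=1)
    return c[iu[0], iu[1]]                               # length d(d-1)/2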
The FC layer, also known as the fully connected layer, is a common layer type in deep learning neural networks. In an FC layer, each neuron is connected to all neurons of the previous layer, forming a fully connected structure; each neuron of the FC layer therefore has a weighted connection to every input neuron of the previous layer. The main function of the fully connected layer is to map the feature representation of the previous layer to the final output space. It can learn complex nonlinear relations among the input features, applying linear combination through the weight parameters followed by an activation function to generate the output. In deep learning, the fully connected layer is typically used for the final classification or regression task.
The application can also perform back propagation through the final result (the parameters in the GCP module), which facilitates learning and training of the Fourier convolution-based visible light and infrared image multi-scale feature fusion model. In back propagation, the partial derivatives of the loss function l with respect to the inputs of the covariance layers are obtained from the gradients associated with the network structure via the matrix back-propagation algorithm: the chain rule for general matrix functions is established by a first-order Taylor approximation, and the corresponding gradients $\partial l / \partial Y_{k-1}$ and $\partial l / \partial Z_{k-1}$ are then computed. From the chain rule of matrix back propagation and the Newton-Schulz iteration, applying a series of operations for k = N, …, 2 yields the gradients at each iteration step. In the pre-normalization step, $\partial l / \partial \Sigma$ is obtained by combining the gradient of the loss function l with respect to Σ with the gradient back-propagated through the post-compensation layer. Finally, the gradient of the loss function l with respect to the input matrix X can be derived.
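For concreteness, the first-order matrix chain rule underlying this matrix back-propagation can be stated as follows (a standard identity given here for orientation; the layer-specific gradient expressions are those of the patent's own formulas):

$\mathrm{d}l = \operatorname{tr}\!\left(\left(\frac{\partial l}{\partial Y}\right)^{T}\mathrm{d}Y\right) = \operatorname{tr}\!\left(\left(\frac{\partial l}{\partial X}\right)^{T}\mathrm{d}X\right), \qquad Y = f(X)$

so expressing dY in terms of dX through the first-order Taylor expansion of f yields $\partial l/\partial X$ from $\partial l/\partial Y$, layer by layer: from the FC layer back through the post-compensation, the Newton-Schulz iterations (k = N, …, 2), the pre-normalization, and the covariance computation.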
the parameters in the GCP module may be updated by back propagation formulas. GCP retains semantic information better than GAP. Most importantly, the GCP module is more suitable for GPU parallel operation.

Claims (8)

1. A Fourier convolution-based visible light and infrared image multi-scale feature fusion method, characterized by comprising the following steps:
A. acquiring an RGB image and an infrared image to be fused;
B. deep semantic information in the RGB image and the infrared image is extracted through a multi-scale feature extractor, and the RGB image and the infrared image with the deep semantic information are obtained;
C. carrying out multi-source information fusion processing on the RGB image and the infrared image with deep semantic information through a fast Fourier convolution module to obtain a multi-source information fusion feature map;
D. fusing the features of different layers in the multi-source information fusion feature map by using a multi-scale feature fusion module to obtain a multi-scale feature fusion feature map;
E. and the covariance pooling module processes the multi-scale feature fusion feature map by adopting a global covariance pooling mode to obtain a comprehensive fusion feature map.
2. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 1, wherein the specific process of multi-source information fusion by the fast Fourier convolution module is as follows:

C1, representing the RGB image with deep semantic information as $X_r \in \mathbb{R}^{b_r \times r \times c}$, where $b_r$ denotes the band and $r \times c$ denotes the pixel height and width; representing the infrared image with deep semantic information as $X_i \in \mathbb{R}^{b_i \times r \times c}$, where $b_i$ denotes the band and $r \times c$ denotes the pixel height and width;

C2, explicitly decomposing $X_r$ and $X_i$ along the channel dimension with the fast Fourier convolution module, yielding a feature map $Y_l^{H \to H}$ for the mapping of high-frequency branch H to high-frequency branch H, a feature map $Y_l^{H \to L}$ for the mapping of high-frequency branch H to low-frequency branch L, a feature map $Y_h^{L \to H}$ for the mapping of low-frequency branch L to high-frequency branch H, and a feature map $Y_h^{L \to L}$ for the mapping of low-frequency branch L to low-frequency branch L;

C3, concatenating $Y_l^{H \to H}$ with $Y_h^{L \to H}$, and $Y_l^{H \to L}$ with $Y_h^{L \to L}$, to obtain two concatenated feature maps $X \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ and $C$ denote the spatial resolution and the number of channels, respectively;

C4, splitting the concatenated feature map X along the feature-channel dimension with the fast Fourier convolution module, i.e., $X = \{X^l, X^g\}$, where the local part $X^l \in \mathbb{R}^{H \times W \times (1-\alpha_{in})C}$ is used for learning from local neighborhoods, the global part $X^g \in \mathbb{R}^{H \times W \times \alpha_{in} C}$ is used for capturing long-range context, and $\alpha_{in} \in [0,1]$ denotes the fraction of feature channels assigned to the global part;

C5, using $Y \in \mathbb{R}^{H \times W \times C}$ as the output tensor, letting $Y = \{Y^l, Y^g\}$, and updating it with equation (1):

$Y^l = Y^{l \to l} + Y^{g \to l} = f_l(X^l) + f_{g \to l}(X^g)$
$Y^g = Y^{g \to g} + Y^{l \to g} = f_g(X^g) + f_{l \to g}(X^l)$    (1)

C6, applying a 3×3 convolution to $Y^l$ and $Y^g$, then fusing the two to obtain the output tensor Y, i.e., the multi-source information fusion feature map.
3. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 1, wherein the specific processing procedure of the multi-scale feature fusion module is as follows: the multi-source information fusion feature map is passed sequentially through a plurality of bottleneck blocks connected in series to obtain the multi-scale feature fusion feature map;
the convolution window moving stride of the bottleneck block has two modes, 1 and 2; when the stride of the bottleneck block is 1, the bottleneck block first applies a 1×1 convolution, then performs feature extraction by depthwise convolution, and finally applies a point convolution; when the stride of the bottleneck block is 2, the bottleneck block first applies a 1×1 convolution, then extracts features with a multi-scale convolution, and finally applies a point convolution.
4. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 3, wherein the specific processing procedure of the multi-scale convolution is as follows: the input feature map is divided equally into s groups by channel; features are then extracted from the first group's input feature map using a 3×3 convolution; the extracted feature output of the first group is sent to the second group and added to the input of the second group, and the result of the addition is sent to the second group's 3×3 convolution; this is repeated until the final group of feature maps has been processed; finally, all extracted feature outputs are concatenated by channel and a 1×1 point convolution performs information fusion.
5. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 1, wherein the specific processing procedure of the covariance pooling module is as follows:

E1, first converting the multi-scale feature fusion feature map of size h×w×d into a feature map of size n×d, where n = h×w; h and w denote the height and width of the feature map, respectively, and d denotes the size of the third (channel) dimension of the feature map;

E2, computing the covariance matrix $\Sigma = X^{T} \bar{I} X$, where $\bar{I} = \frac{1}{n}\left(I - \frac{1}{n}\mathbf{1}\right)$, I is the n×n identity matrix, $\mathbf{1}$ is the n×n matrix whose elements are all 1, and X denotes the original input feature map fed to the covariance pooling module;

E3, pre-normalizing the covariance matrix Σ by the formula $A = \frac{1}{\operatorname{tr}(\Sigma)}\Sigma$;

E4, performing iterative processing with the Newton-Schulz iteration formula;

E5, performing post-compensation processing and splicing processing in sequence.
6. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 5, wherein the Newton-Schulz iteration formula is

$Y_k = \frac{1}{2} Y_{k-1}\left(3I - Z_{k-1} Y_{k-1}\right), \qquad Z_k = \frac{1}{2}\left(3I - Z_{k-1} Y_{k-1}\right) Z_{k-1}$

where I denotes the identity matrix; $Y_{k-1}$ denotes the result obtained after k−1 iterations starting from the matrix A, and likewise $Y_k$ the result after k iterations; $Z_{k-1}$ denotes the result obtained after k−1 iterations starting from the identity matrix I, and likewise $Z_k$ the result after k iterations.
7. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 5, wherein the calculation formula of the post-compensation processing is: $C = \left(\operatorname{tr}(\Sigma)\right)^{1/2} Y_N$, where tr(Σ) is the trace of the covariance matrix and $Y_N$ is the result obtained after N iterations.
8. The Fourier convolution-based visible light and infrared image multi-scale feature fusion method according to claim 5, wherein the specific process of the splicing processing is to splice the upper triangular part of the symmetric matrix obtained by the post-compensation processing into a d(d−1)/2-dimensional vector, thereby obtaining the comprehensive fusion feature map.
CN202311037544.0A 2023-08-17 2023-08-17 Fourier convolution-based visible light and infrared image multi-scale feature fusion method Active CN117058507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311037544.0A CN117058507B (en) 2023-08-17 2023-08-17 Fourier convolution-based visible light and infrared image multi-scale feature fusion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311037544.0A CN117058507B (en) 2023-08-17 2023-08-17 Fourier convolution-based visible light and infrared image multi-scale feature fusion method

Publications (2)

Publication Number Publication Date
CN117058507A 2023-11-14
CN117058507B (en) 2024-03-19

Family

ID=88658487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311037544.0A Active CN117058507B (en) 2023-08-17 2023-08-17 Fourier convolution-based visible light and infrared image multi-scale feature fusion method

Country Status (1)

Country Link
CN (1) CN117058507B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401373A (en) * 2020-03-04 2020-07-10 武汉大学 Efficient semantic segmentation method based on packet asymmetric convolution
CN111738314A (en) * 2020-06-09 2020-10-02 南通大学 Deep learning method of multi-modal image visibility detection model based on shallow fusion
CN111899206A (en) * 2020-08-11 2020-11-06 四川警察学院 Medical brain image fusion method based on convolutional dictionary learning
CN111899209A (en) * 2020-08-11 2020-11-06 四川警察学院 Visible light infrared image fusion method based on convolution matching pursuit dictionary learning
CN111899207A (en) * 2020-08-11 2020-11-06 四川警察学院 Visible light and infrared image fusion method based on local processing convolution dictionary learning
CN112801040A (en) * 2021-03-08 2021-05-14 重庆邮电大学 Lightweight unconstrained facial expression recognition method and system embedded with high-order information
WO2021120404A1 (en) * 2019-12-17 2021-06-24 大连理工大学 Infrared and visible light fusing method
CN113159067A (en) * 2021-04-13 2021-07-23 北京工商大学 Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114005046A (en) * 2021-11-04 2022-02-01 长安大学 Remote sensing scene classification method based on Gabor filter and covariance pooling
CN114445430A (en) * 2022-04-08 2022-05-06 暨南大学 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN115019132A (en) * 2022-06-14 2022-09-06 哈尔滨工程大学 Multi-target identification method for complex background ship
CN115100301A (en) * 2022-07-19 2022-09-23 重庆七腾科技有限公司 Image compression sensing method and system based on fast Fourier convolution and convolution filtering flow
CN115688040A (en) * 2022-11-08 2023-02-03 西安交通大学 Mechanical equipment fault diagnosis method, device, equipment and readable storage medium
CN116310688A (en) * 2023-03-16 2023-06-23 城云科技(中国)有限公司 Target detection model based on cascade fusion, and construction method, device and application thereof
CN116486288A (en) * 2023-04-23 2023-07-25 东南大学 Aerial target counting and detecting method based on lightweight density estimation network
CN116486251A (en) * 2023-03-01 2023-07-25 中国矿业大学 Hyperspectral image classification method based on multi-mode fusion

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220044374A1 (en) * 2019-12-17 2022-02-10 Dalian University Of Technology Infrared and visible light fusion method
WO2021120404A1 (en) * 2019-12-17 2021-06-24 大连理工大学 Infrared and visible light fusing method
CN111401373A (en) * 2020-03-04 2020-07-10 武汉大学 Efficient semantic segmentation method based on packet asymmetric convolution
CN111738314A (en) * 2020-06-09 2020-10-02 南通大学 Deep learning method of multi-modal image visibility detection model based on shallow fusion
CN111899206A (en) * 2020-08-11 2020-11-06 四川警察学院 Medical brain image fusion method based on convolutional dictionary learning
CN111899209A (en) * 2020-08-11 2020-11-06 四川警察学院 Visible light infrared image fusion method based on convolution matching pursuit dictionary learning
CN111899207A (en) * 2020-08-11 2020-11-06 四川警察学院 Visible light and infrared image fusion method based on local processing convolution dictionary learning
CN112801040A (en) * 2021-03-08 2021-05-14 重庆邮电大学 Lightweight unconstrained facial expression recognition method and system embedded with high-order information
CN113159067A (en) * 2021-04-13 2021-07-23 北京工商大学 Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114005046A (en) * 2021-11-04 2022-02-01 长安大学 Remote sensing scene classification method based on Gabor filter and covariance pooling
CN114445430A (en) * 2022-04-08 2022-05-06 暨南大学 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN115019132A (en) * 2022-06-14 2022-09-06 哈尔滨工程大学 Multi-target identification method for complex background ship
CN115100301A (en) * 2022-07-19 2022-09-23 重庆七腾科技有限公司 Image compression sensing method and system based on fast Fourier convolution and convolution filtering flow
CN115688040A (en) * 2022-11-08 2023-02-03 西安交通大学 Mechanical equipment fault diagnosis method, device, equipment and readable storage medium
CN116486251A (en) * 2023-03-01 2023-07-25 中国矿业大学 Hyperspectral image classification method based on multi-mode fusion
CN116310688A (en) * 2023-03-16 2023-06-23 城云科技(中国)有限公司 Target detection model based on cascade fusion, and construction method, device and application thereof
CN116486288A (en) * 2023-04-23 2023-07-25 东南大学 Aerial target counting and detecting method based on lightweight density estimation network

Also Published As

Publication number Publication date
CN117058507B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN109741256B (en) Image super-resolution reconstruction method based on sparse representation and deep learning
CN111709902B (en) Infrared and visible light image fusion method based on self-attention mechanism
Lin et al. Hyperspectral image denoising via matrix factorization and deep prior regularization
CN112233026A (en) SAR image denoising method based on multi-scale residual attention network
CN107730482B (en) Sparse fusion method based on regional energy and variance
Panigrahy et al. Parameter adaptive unit-linking dual-channel PCNN based infrared and visible image fusion
CN114862731B (en) Multi-hyperspectral image fusion method guided by low-rank priori and spatial spectrum information
CN113887645B (en) Remote sensing image fusion classification method based on joint attention twin network
CN112381144B (en) Heterogeneous deep network method for non-European and Euclidean domain space spectrum feature learning
de Souza Brito et al. Combining max-pooling and wavelet pooling strategies for semantic image segmentation
CN112967210A (en) Unmanned aerial vehicle image denoising method based on full convolution twin network
Feng et al. Fully convolutional network-based infrared and visible image fusion
CN114972748A (en) Infrared semantic segmentation method capable of explaining edge attention and gray level quantization network
CN115984323A (en) Two-stage fusion RGBT tracking algorithm based on space-frequency domain equalization
CN113762277B (en) Multiband infrared image fusion method based on Cascade-GAN
CN112418203B (en) Robustness RGB-T tracking method based on bilinear convergence four-stream network
Huang et al. MAGAN: Multiattention generative adversarial network for infrared and visible image fusion
CN117853596A (en) Unmanned aerial vehicle remote sensing mapping method and system
CN113421198A (en) Hyperspectral image denoising method based on subspace non-local low-rank tensor decomposition
CN117058507B (en) Fourier convolution-based visible light and infrared image multi-scale feature fusion method
CN110648332B (en) Image discriminable area extraction method based on multi-branch convolutional neural network feature orthogonality
Sun et al. IMGAN: Infrared and visible image fusion using a novel intensity masking generative adversarial network
Salem et al. Image fusion models and techniques at pixel level
CN116051444A (en) Effective infrared and visible light image self-adaptive fusion method
CN112990230B (en) Spectral image compression reconstruction method based on two-stage grouping attention residual error mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant