CN115797684A

CN115797684A - Infrared small target detection method and system based on context information

Info

Publication number: CN115797684A
Application number: CN202211461433.8A
Authority: CN
Inventors: 付莹; 李峻宇; 郑德智; 宋韬; 林德福
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-03-14

Abstract

The invention relates to an infrared small target detection method and system based on context information, and belongs to the technical field of computer vision. First, a small target detection network is trained on an infrared small target data set using a classification loss function, a confidence loss function and a position loss function. The trained network is then used to extract features from the infrared image to obtain a feature result. Finally, the extracted features are further fused, and infrared small targets are detected on the fused features to obtain the final target detection result. The invention also provides an infrared small target detection system based on context information. The method does not depend on additional processing modules such as infrared image denoising or enhancement, the training process is carried out end to end, and the method is simple to implement, high in performance and strong in robustness. It adds very little computational overhead, which favors low-latency, high-speed infrared small target detection, and it effectively improves the detection rate and reduces the miss rate.

Description

Infrared small target detection method and system based on context information

Technical Field

The invention relates to a method and a device for detecting a small target in an infrared image, in particular to a method and a device for detecting the small infrared target based on context information, and belongs to the technical field of computer vision processing.

Background

Compared with visible-light images, infrared images are not affected by extreme weather and environments and can be captured without external illumination. Infrared imaging has strong detection capability and a long working distance, and is of great research and application significance in civil and military fields, from infrared monitoring systems to infrared guidance systems. However, infrared images also suffer from low resolution, blurred imaging and a low signal-to-noise ratio, so small objects in them are easily submerged by noise. Efficient detection of small targets in infrared images is therefore a challenging task that has received extensive attention from the signal processing and computer vision communities.

Small target detection is a technique for detecting small targets in an image; it determines the category and location of small targets in natural-light images. At present, the main small target detection methods are based on deep learning and deep convolutional neural networks, and the technique is widely applied in fields such as surveillance and security, autonomous driving, and remote sensing satellites. Under the definition of the COCO data set, targets smaller than 32 × 32 pixels are called small targets. A small target occupies few pixels, and its detection performance differs greatly from that of a large target. In complex scenes, for example when targets occlude each other, are occluded by the background, or are densely packed, small targets are affected more severely than large ones, which further increases the difficulty of detecting them.

Context information, also called feature context information, reflects the fact that an object usually appears together with a corresponding environment: besides the features of the object itself, there is a close connection between the object and its surroundings. Because infrared images are seriously affected by noise and clutter, the small targets in them are highly susceptible to interference. Detecting a target using both its own features and other related information in the image can therefore effectively improve the detection result and reduce the probability of missing small targets.

Disclosure of Invention

The invention aims to overcome the shortcomings of existing infrared small target detection technology. Existing methods do not make full use of the local context information around a target or the global context information of the whole image, adapt poorly to the feature variations of small targets of different categories, and do not fuse shallow and deep features properly. To solve these technical problems, the invention provides an infrared small target detection method and system based on context information, which effectively improves infrared small target detection performance and works well in practical applications.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme.

A method for detecting an infrared small target based on context information comprises the following steps:

Step 1: Acquire and process an infrared small target data set.

Step 2: Train the small target detection network using the classification loss function, the confidence loss function and the position loss function.

Step 3: Use the trained small target detection network to extract features from the infrared image to obtain a feature result. The method does not require the infrared image to be preprocessed.

Step 4: Further fuse the extracted features, and detect infrared small targets on the fused features to obtain the final target detection result.

In order to achieve the purpose of the invention, the invention further provides an infrared small target detection system based on context information, which comprises an image processing module, a target information learning module, a feature extraction module, a feature fusion module and a target detection module.

Advantageous effects

Compared with the prior art, the method and the system have the following advantages:

1. The method does not depend on additional processing modules such as infrared image denoising or enhancement, the training process is carried out end to end, and the method is simple to implement, high in performance and strong in robustness.

2. The method has extremely low additional computational cost, which favors low-latency, high-speed infrared small target detection; it effectively improves the detection rate of small targets, reduces the miss rate, and improves the precision for targets of other scales.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the feature extraction method according to the method of the present invention.

FIG. 3 is a schematic diagram of the framework and the internal details of feature fusion according to the method of the present invention.

FIG. 4 is a flow chart of the system of the present invention.

Detailed Description

For a better understanding of the objects and advantages of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings.

As shown in fig. 1, a method for detecting a small infrared target based on context information includes the following steps:

Step 1: Acquire an infrared small target data set and perform data enhancement processing.

High-quality data sets are indispensable for deep-learning-based infrared small target detection methods to perform well. However, existing infrared small target data sets are few in number and small in scale, and their proportion of small targets is unsatisfactory. The invention therefore first expands the infrared small target data set through data enhancement, which improves the robustness of the whole detection method.

Specifically, in an image containing infrared small targets, small targets that do not overlap with other targets are located, copied, and pasted at random to other positions in the image. A pasted copy must not occlude other targets and must keep a distance from them.

Further, other data enhancement operations (such as rotation and translation, scaling and cropping, and mosaic enhancement) can be superimposed on the copy-and-paste of small targets, mosaic enhancement being preferred.
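The copy-and-paste rule described above can be sketched as follows. This is a minimal illustration only, assuming the image is a 2D list of intensities and each target box is an integer tuple (x1, y1, x2, y2) with exclusive right and bottom edges; the function names, the `margin` parameter and the retry count are hypothetical and do not come from the patent.

```python
import random

def boxes_too_close(a, b, margin=0):
    """True if box a, grown by `margin` pixels, intersects box b."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 + margin <= bx1 or bx2 + margin <= ax1 or
                ay2 + margin <= by1 or by2 + margin <= ay1)

def copy_paste_small_target(image, boxes, src_index, margin=2, tries=50, rng=None):
    """Copy the target in boxes[src_index] to a random free location.

    Returns the new box, or None if the source overlaps another target or
    no free location is found. `image` and `boxes` are modified in place.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = boxes[src_index]
    w, h = x2 - x1, y2 - y1
    # Only copy targets that do not overlap any other target.
    others = [b for i, b in enumerate(boxes) if i != src_index]
    if any(boxes_too_close(boxes[src_index], b) for b in others):
        return None
    patch = [row[x1:x2] for row in image[y1:y2]]
    for _ in range(tries):
        nx, ny = rng.randint(0, width - w), rng.randint(0, height - h)
        candidate = (nx, ny, nx + w, ny + h)
        # The pasted copy must keep a margin from every existing target.
        if any(boxes_too_close(candidate, b, margin) for b in boxes):
            continue
        for dy in range(h):
            image[ny + dy][nx:nx + w] = patch[dy]
        boxes.append(candidate)
        return candidate
    return None
```

Rejection sampling is used here purely for simplicity; other placement strategies would satisfy the same two constraints (no occlusion, minimum distance).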

Step 2: Train the small target detection network using the classification loss function, the confidence loss function and the position loss function.

Specifically, let the total loss function L (x, x') be expressed as:

where x and x' represent the predicted and true values, respectively; α_box, α_obj and α_cls are the weights of the three loss functions; L_CIoU, L_obj and L_cls are the position loss function, the confidence loss function and the classification loss function of the target detection task, respectively; K, S² and B denote the number of output feature maps, the number of grids, and the number of anchors (i.e. positions) on each grid; I_kij indicates whether the j-th anchor box of the i-th grid of the k-th output feature map is a positive sample, taking the value 1 if so and 0 otherwise; and α_k is the weight used to balance the output features at different scales.
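A plausible form of the total loss, consistent with the symbol definitions given here and with the loss commonly used in anchor-based single-stage detectors, is sketched below. It is an assumption rather than the patent's own expression: in particular, the summation limits and the placement of the indicator I_kij and of the scale weight α_k are guesses.

```latex
L(x, x') = \sum_{k=1}^{K} \alpha_k \sum_{i=1}^{S^2} \sum_{j=1}^{B}
  \Big[ \alpha_{box}\, I_{kij}\, L_{CIoU}
      + \alpha_{obj}\, L_{obj}
      + \alpha_{cls}\, I_{kij}\, L_{cls} \Big]
```

In this reading, the position and classification terms are counted only for positive anchors, while the confidence term is evaluated for every anchor.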

Step 3: Use the trained small target detection network to extract features from the infrared image to obtain a feature result. The infrared image does not need to be preprocessed.

Common infrared small target detection methods mainly extract the features of the object itself and lack the ability to acquire and analyze context information. Small targets, however, lack texture information; when context information is not supplemented and enhanced, the complex, low-contrast background of an infrared image easily submerges a small target, and the varying shapes of small targets make detection harder still.

For this reason, the method first performs feature extraction on the image. The extraction process takes full account of the different context requirements of features with different shapes and, through dynamic context information extraction (shown in fig. 2), establishes long-range dependencies among the information in the features. The position encoding added at the input supplies the position information of the features and remedies the insufficient feature information of distant infrared targets. The input deep features are divided into blocks and flattened into a sequence, position information is introduced, and the sequence is sent to a multi-head attention mechanism for weighted summation. A residual connection is then used to optimize the result and accelerate convergence; the result passes through two fully connected layers, followed by another residual connection. A deformable convolution layer follows, in which a bias (offset) term is added during convolution, so that objects of different shapes and markedly different sizes at the same position can still be represented well.

Specifically, the process of dynamic context information extraction is as follows:

For an input feature F of size C × H × W, where C is the number of channels, H the height and W the width, and a given block size P, the C × H × W feature is divided into N blocks of size P × P × C.

After the N blocks are obtained, they are linearly transformed into N feature vectors, and a flag bit vector x_p is added at the start position of the sequence. Position information is then added:

F₁ = E + F₀

where F₀ denotes the output vector sequence, obtained by splicing the flag bit vector x_p with the N linearly transformed blocks, the N-th block being multiplied by the weight parameter W_N, and Concat[ ] is the splicing operation.

The F₀ finally obtained is the output result of block embedding. After blocking, the embedded features F₀ still lack the relative position information between blocks, so the position encoding information E is added to F₀ to obtain F₁, the result after adding the position information.
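The block embedding step can be illustrated with a small pure-Python sketch. It assumes that H and W are divisible by P (so N = (H / P) · (W / P)), that one projection matrix is shared by all blocks (the text writes W_N, so sharing is an assumption), and that the flag vector and position encoding are supplied by the caller; all function names are hypothetical.

```python
def split_into_blocks(feature, p):
    """Split a C x H x W feature (nested lists) into N flattened P x P x C blocks."""
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    blocks = []
    for by in range(0, h, p):
        for bx in range(0, w, p):
            blocks.append([feature[ch][y][x]
                           for ch in range(c)
                           for y in range(by, by + p)
                           for x in range(bx, bx + p)])
    return blocks  # N = (H / P) * (W / P) vectors of length P * P * C

def matvec(vec, weight):
    """Multiply a row vector of length n by an n x d weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]

def embed_blocks(feature, p, weight, flag, pos):
    """Return F1 = E + F0, where F0 = Concat[flag; projected blocks]."""
    f0 = [flag] + [matvec(b, weight) for b in split_into_blocks(feature, p)]
    return [[a + e for a, e in zip(token, enc)] for token, enc in zip(f0, pos)]
```

For a 1 × 4 × 4 feature and P = 2, this yields N = 4 blocks plus the flag vector, so F₁ has five rows.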

The position-embedded F₁ is multiplied by three different parameter matrices and mapped into a query matrix, a key matrix and a value matrix. Attention processing yields multiple attention results, which represent different context information in the image. These attention results are spliced and normalized to obtain the final context information summary:

Attention(Q, K, V) = Softmax(QKᵀ / √d_k) V

head_i = Attention(F₁W_q, F₁W_k, F₁W_v)

F_M = Concat[head_1; head_2; ...; head_h] W_M

where Attention() denotes the attention mechanism operation; Q, K and V denote the query matrix, the key matrix and the value matrix, respectively; T denotes the transposition operation; √d_k is the scaling factor; F₁ is the result after adding the position information; W_q, W_k, W_v and W_M are learnable parameter matrices; Softmax denotes the softmax operation; head_i is the output of the i-th of the h attention results; F_M is the multi-head attention output feature; and Concat denotes the splicing operation.
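The multi-head attention step can be sketched in pure Python as below. The sketch assumes standard scaled dot-product attention, Softmax(QKᵀ / √d_k) V, with d_k taken as the key dimension, and gives each head its own (W_q, W_k, W_v) triple; these details and all function names are assumptions for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(f1, heads, w_m):
    """heads: list of (W_q, W_k, W_v) triples, one per head; w_m: output projection."""
    outputs = [attention(matmul(f1, wq), matmul(f1, wk), matmul(f1, wv))
               for wq, wk, wv in heads]
    # Concatenate the head outputs token by token, then project with W_M.
    concat = [sum((head[t] for head in outputs), []) for t in range(len(f1))]
    return matmul(concat, w_m)
```

Each row of the softmax weights sums to 1, so every output token is a weighted average of the value vectors, which is the "weighted summation" the description refers to.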

The feedforward neural network comprises two fully connected layers. After residual normalization, the multi-head attention output feature F_M is mapped to a high-dimensional space by the first fully connected layer and to a low-dimensional space by the second fully connected layer, further retaining useful information. The process is as follows:

F₂ = F_M[0] + F₁

X = F₂W_fc1W_fc2 + F₁

where F₂ denotes the result after the residual connection, F_M[0] is the flag bit vector, X is the output result, and W_fc1 and W_fc2 are the weights of the two fully connected layers.

After the context information is processed, the output result X passes through a deformable convolution, which dynamically adjusts the effective information and links different small targets to their context information:

Y(p₀) = Σ_{p_n ∈ R} W(p_n) · X(p₀ + p_n + Δp_n)

where Y(p₀) denotes the output result of the deformable convolution; X and Y are the input feature map and the output feature map, respectively; p₀ denotes a position in the output feature; p_n denotes an adjacent position, and R denotes the range of adjacent positions; the function W() gives the weight of p_n; and Δp_n is an offset value learned from the input features by a parallel convolution.
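The deformable convolution at one output position can be sketched as follows, assuming the usual formulation Y(p₀) = Σ W(p_n) · X(p₀ + p_n + Δp_n) over a 3 × 3 neighbourhood, a single channel, and bilinear interpolation for fractional sampling positions. In a real network the offsets would come from a parallel convolution; here they are passed in directly, and all names are hypothetical.

```python
import math

def bilinear(x, py, px):
    """Sample 2D map x at fractional (py, px); positions outside the map read as 0."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(py - yy)) * (1 - abs(px - xx)) * x[yy][xx]
    return value

def deformable_conv_at(x, p0, weights, offsets):
    """Y(p0) = sum_n W(p_n) * X(p0 + p_n + delta_p_n) over a 3 x 3 grid R.

    weights: 9 kernel weights in row-major order.
    offsets: 9 (dy, dx) pairs, one learned offset per kernel position.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    total = 0.0
    for (gy, gx), w, (oy, ox) in zip(grid, weights, offsets):
        total += w * bilinear(x, p0[0] + gy + oy, p0[1] + gx + ox)
    return total
```

With all offsets set to zero the function reduces to an ordinary 3 × 3 convolution at p₀, which makes the role of the learned offsets easy to see.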

Under the influence of noise, the feature representations of different small targets in an infrared image differ greatly, which tests the feature fusion capability of the model. Common target detection models either do not analyze the position features and semantic features of an object in the spatial dimension, or fuse only the salient features of the image through plain upsampling, convolution and feature concatenation while ignoring small target information, which leads to low final detection accuracy for small targets.

Therefore, in the method, after the image is subjected to feature extraction, the extracted features are subjected to feature fusion, and target detection is performed on the fused features. In the process of feature fusion, channel and spatial information in multiple features are aggregated by using multiple information fusion layers.

The aggregated features greatly improve the representation of the position information and semantic information of the object. A new feature scale is also added and fused during feature fusion to supplement the deep small target features and enrich the detail features of small targets.

To increase the spatio-temporal information of the object as much as possible during feature fusion, more target features are retained. As shown in fig. 3, the multi-information fusion layer performs an information fusion operation at each feature scale.

The multi-information fusion module (MFM) fuses information from different layers through several residual structures; its structure is shown in fig. 3 (c). Its first part is an IC layer, shown in fig. 3 (b), which refines the feature information. Global pooling and max pooling are then performed separately over each channel, and the pooled information is processed by a fully connected layer with shared weights. After multiplication and addition, the result is normalized by a softmax function to obtain the extracted channel information, which is multiplied by the input information to enhance the channel information.
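The channel-enhancement step can be sketched roughly as below. The sketch reads "global pooling" as global average pooling and models the shared fully connected layer as a single C × C matrix applied to both pooled vectors before they are added; both choices, the omission of the IC layer, and the function names are assumptions, not details confirmed by the text.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def channel_weights(feature, shared_fc):
    """feature: C x H x W nested lists; shared_fc: C x C weight matrix."""
    flat = [[v for row in ch for v in row] for ch in feature]
    avg = [sum(ch) / len(ch) for ch in flat]  # global (average) pooling per channel
    mx = [max(ch) for ch in flat]             # max pooling per channel

    def fc(vec):
        # The same weights are applied to both pooled vectors.
        return [sum(w * v for w, v in zip(row, vec)) for row in shared_fc]

    return softmax([a + b for a, b in zip(fc(avg), fc(mx))])

def enhance_channels(feature, shared_fc):
    """Multiply every channel of the input by its softmax-normalized weight."""
    weights = channel_weights(feature, shared_fc)
    return [[[w * v for v in row] for row in ch] for w, ch in zip(weights, feature)]
```

Because the weights come from a softmax, they sum to 1 across channels, so channels with stronger pooled responses are emphasized relative to the others.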

After the channel information has been extracted and enhanced, global pooling and max pooling are performed at each position of the image. The two results are added, a 7 × 7 convolution is applied to the combined feature, and the result is normalized by a softmax function to enhance the position information. Finally, the channel and spatial information are further integrated by a 1 × 1 convolution.

Deep features contain rich semantic information, most of it target semantic information, but after multiple downsampling operations the features related to small targets are easily covered by noise and are hard to localize; shallow features, in contrast, carry rich texture information and position information for small targets. Therefore, to use shallow features effectively to enhance the detail information and supplement the position information of small targets, an additional feature scale dedicated to small objects is added, together with a detection head that outputs its detection result. The structure names are shown in fig. 3 (a): the outputs of the dynamic context information extraction module and of the three subsequent multi-information fusion modules (MFM) are T5, T4, T3 and T2, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the original image, respectively. The features of the same size that are connected to T5, T4 and T3 are denoted R4, R3 and R2.
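For a concrete sense of these scales, the short sketch below computes the four output sizes for a given input, assuming integer division of the height and width by strides 32, 16, 8 and 4. The function name and the example input size are illustrative only.

```python
def feature_map_sizes(height, width):
    """Return the (height, width) of T5, T4, T3 and T2 for a given input size."""
    strides = {"T5": 32, "T4": 16, "T3": 8, "T2": 4}
    return {name: (height // s, width // s) for name, s in strides.items()}

# Hypothetical 640 x 512 infrared frame.
sizes = feature_map_sizes(640, 512)
```

For this input, T5 is 20 × 16 while T2 is 160 × 128, so a target spanning about 8 pixels covers roughly two cells on T2 but only a fraction of one cell on T5, which is why the extra T2 scale matters for small targets.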

When the feature map has been processed to the T3 layer, the method continues to upsample the features, adds a T2 layer after upsampling, and connects the T2 layer with the features of the same size from the second layer of the backbone network. This improves the representation of small target details and passes shallow detail information forward. Connecting a small target detection head after the T2 layer reduces the feature coupling between small targets and other targets on the same layer, which lowers the miss rate for small targets, raises the probability of detecting them, and alleviates the poor precision caused by an overly large scale. To match the channels of the network, an R2 layer is added after the T2 layer and connected with the T3 layer features of the same dimension.

In order to achieve the purpose of the present invention, the present invention further provides an end-to-end infrared small target detection system based on context information, as shown in fig. 4, which includes an infrared image processing module 10, a target information learning module 20, a feature extraction module 30, and a feature fusion and target detection module 40.

The infrared image processing module 10 is configured to process an infrared image dataset used for training a small target detection model. The module can increase the number of small targets, enrich the change scenes of the data set and enhance the robustness of the model.

The small target information learning module 20 guides the small target detection model to learn robust image features. The module trains the model on the infrared small target data set and outputs the trained small target detection model.

The image feature extraction module 30 uses the dynamic context information extraction module to extract the information surrounding the target and the globally relevant information in the image features, and adapts to the contour changes of different small targets. It extracts stable, clean small target features from the infrared image to enable accurate infrared small target detection.

The feature fusion and target detection module 40 fuses the extracted features, and identifies and extracts the category, position, size and shape of the targets of interest from the fused image features to obtain the final infrared small target detection result.

The connection relationship among the modules is as follows:

The output end of the infrared image processing module 10 is connected with the input end of the small target information learning module 20.

An output of the small target information learning module 20 is connected to an input of the image feature extraction module 30.

An output of the image feature extraction module 30 is connected to an input of a feature fusion and target detection module 40.

Claims

1. A small infrared target detection method based on context information is characterized by comprising the following steps:

step 1: acquiring an infrared small target data set, and performing data enhancement processing;

step 2: training the small target detection network by using a classification loss function, a confidence coefficient loss function and a position loss function;

step 3: using the trained small target detection network to extract the features of the infrared image to obtain a feature result;

firstly, extracting the characteristics of an image; in the extraction process, remote dependence between all information is established on the characteristics through dynamic context information extraction; the input deep layer features are partitioned and flattened into sequences, and position information is introduced and then sent to a multi-head attention mechanism for weighted summation;

then, adopting residual connection to optimize the result and accelerate convergence, and performing residual connection again through two full-connection layers; a layer of deformable convolution is carried out subsequently, and a bias term is added while the convolution is carried out;

2. The infrared small object detection method based on the context information as claimed in claim 1, characterized in that, when the data enhancement processing is performed in step 1, in the image containing the infrared small object, the small object which is not overlapped with other objects is found out and is randomly copied and pasted to other positions of the image; wherein, the copied small target does not block other targets and keeps a distance with other targets.

3. The method as claimed in claim 2, wherein other data enhancement operations including rotation and translation, scaling and cropping, and mosaic enhancement are further superimposed on the copy and paste small target.

4. The method for detecting infrared small targets based on context information according to claim 1, wherein in step 2, the total loss function L (x, x') is expressed as:

wherein x and x' represent the predicted and true values, respectively; α_box, α_obj and α_cls represent the weights of the three loss functions; L_CIoU, L_obj and L_cls represent the position loss function, the confidence loss function and the classification loss function of the target detection task, respectively; K, S² and B denote the number of output feature maps, the number of grids, and the number of positions (anchors) on each grid; I_kij indicates whether the j-th anchor box of the i-th grid of the k-th output feature map is a positive sample, being 1 if so and 0 otherwise; α_k is the weight used to balance the output features at different scales.

5. The method according to claim 1, wherein in step 3, the dynamic context information extraction process comprises:

for an input feature F, the feature size is C × H × W, wherein C represents the number of channels, H represents the height, and W represents the width; given the block size P, dividing C × H × W into N blocks of size P × P × C;

after obtaining the N blocks, linearly transforming them into N feature vectors, and adding a flag bit vector x_p at the start position of the sequence;

F₁ = E + F₀

wherein F₀ represents the output vector sequence, obtained by splicing the flag bit vector x_p with the N linearly transformed blocks, the N-th block being multiplied by the weight parameter W_N; Concat[ ] is the splicing operation; the F₀ finally obtained is the output result of block embedding;

adding the position encoding information E to F₀ to obtain F₁, F₁ representing the result after adding the position information;

multiplying the position-embedded F₁ by three different parameter matrices respectively, and mapping it into a query matrix, a key matrix and a value matrix; after the attention mechanism processing, obtaining a plurality of attention results which represent different context information in the image; and splicing and normalizing the attention results to obtain a final context information summary result:

Attention(Q, K, V) = Softmax(QKᵀ / √d_k) V

head_i = Attention(F₁W_q, F₁W_k, F₁W_v)

F_M = Concat[head_1; head_2; ...; head_h] W_M

wherein Attention() represents the attention mechanism operation; Q, K and V represent the query matrix, the key matrix and the value matrix, respectively; T represents the transposition operation; √d_k represents the scaling factor; F₁ represents the result after adding the position information; W_q, W_k, W_v and W_M are learnable parameter matrices; Softmax represents the softmax operation; head_i represents the output of the i-th of the h attention results; F_M represents the multi-head attention output feature; Concat represents the splicing operation;

the feedforward neural network comprises two fully connected layers; after residual normalization, the multi-head attention output feature F_M is mapped to a high-dimensional space by the first fully connected layer and to a low-dimensional space by the second fully connected layer, further retaining useful information; the process is as follows:

F₂ = F_M[0] + F₁

X = F₂W_fc1W_fc2 + F₁

wherein F₂ represents the result after the residual connection, F_M[0] is the flag bit vector, X is the output result, and W_fc1 and W_fc2 are the weights of the two fully connected layers;

after the context information is processed, the output result X dynamically adjusts the effective information through deformable convolution, linking the relations between different small targets and the context information:

Y(p₀) = Σ_{p_n ∈ R} W(p_n) · X(p₀ + p_n + Δp_n)

wherein Y(p₀) represents the output result of the deformable convolution; X and Y are the input feature map and the output feature map, respectively; p₀ represents a position in the output feature; p_n represents an adjacent position, and R represents the range of adjacent positions; the function W() represents the weight of p_n; Δp_n is an offset value learned from the input features by a parallel convolution.

6. The method for detecting the small infrared target based on the context information as claimed in claim 1, wherein in the step 4, in the process of feature fusion, the channel and space information in the multiple features are aggregated by using a multi-information fusion layer, and the multi-information fusion layer performs information fusion operation in each feature scale;

the multi-information fusion module fuses information of different layers through a plurality of residual error structures, and comprises three parts, wherein the first part is an IC layer and is responsible for refining characteristic information, then global pooling and maximum pooling are respectively carried out from a channel layer, the information is sorted through a full connection layer sharing weight, and after multiplication and addition, the extracted channel information is obtained through softmax function normalization, and is multiplied by input information;

after extracting and enhancing the channel information, continuously performing global pooling and maximum pooling on each position of the image, and after adding, adopting convolution superposition characteristics and normalizing through a softmax function to achieve the effect of enhancing the position information; and finally, integrating channel and spatial information through convolution.

7. The method according to claim 6, wherein a feature scale is added for specially focusing on small objects; adding a detection head to output a detection result;

the outputs of the dynamic context information extraction module and the three subsequent multi-information fusion modules are T5, T4, T3 and T2, the sizes of the outputs are respectively 1/32, 1/16, 1/8 and 1/4 of the original image, and the features connected with T5, T4 and T3 and having the same size are marked as R4, R3 and R2;

when the feature graph is processed to a T3 layer, continuously up-sampling the features, adding a T2 layer after up-sampling, and simultaneously connecting the T2 layer with the features with the same size as the second layer of the backbone network; and a small target detection head is connected behind the T2 layer, and an R2 layer is added behind the T2 layer and is connected with the T3 layer with the same dimensionality.

8. An infrared small target detection system based on context information for realizing the method of claim 1, which is characterized by comprising an infrared image processing module (10), a target information learning module (20), a feature extraction module (30), a feature fusion and target detection module (40);

the infrared image processing module (10) is used for processing an infrared image data set used for training a small target detection model;

the small target information learning module (20) is used for guiding the small target detection model to learn robust image features; the module trains the model on the infrared small target data set and outputs a trained small target detection model;

the image feature extraction module (30) is used for extracting target surrounding information and global relevant information in the image features by utilizing the dynamic context information extraction module and adapting to the contour change of different small targets; extracting stable and clean small target features from the infrared image;

the characteristic fusion and target detection module (40) is used for fusing the extracted characteristics, identifying and extracting the size, the shape and the category position of the interested target from the fused image characteristics, and obtaining a final infrared small target detection result;

the connection relationship among the modules is as follows:

the output end of the infrared image processing module (10) is connected with the input end of the small target information learning module (20);

the output end of the small target information learning module (20) is connected with the input end of the image feature extraction module (30);

the output end of the image characteristic extraction module (30) is connected with the input end of the characteristic fusion and target detection module (40).