CN114863301A - Small target detection method for aerial image of unmanned aerial vehicle - Google Patents

Small target detection method for aerial image of unmanned aerial vehicle

Info

Publication number
CN114863301A
CN114863301A
Authority
CN
China
Prior art keywords
feature
small target
detection
training
unmanned aerial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210488938.7A
Other languages
Chinese (zh)
Inventor
张红英
张奇
罗向东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology
Priority to CN202210488938.7A
Publication of CN114863301A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/17 Terrestrial scenes taken from planes or by drones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a small target detection method for unmanned aerial vehicle (UAV) aerial images. First, to address the small size of the targets, the detection head sizes are changed so that more original small target information is obtained. Second, to accurately detect dense, contiguous small targets in complex scenes, a residual convolution module fused with a CBAM attention mechanism is adopted in the feature extraction stage, increasing the weight of densely connected regions of interest in the training pictures. Then, to alleviate background clutter and pixel blurring, a jump multi-scale feature enhancement module is introduced: two same-scale residual paths and three cross-scale residual paths are added to the top-down feature fusion, shallow and deep feature information are fully fused, and detection heads of different scales obtain sufficient image semantic and spatial information. The invention fuses the multi-scale features of the image through multi-directional jump residual connection paths, achieving excellent small target detection performance and wide applicability.

Description

Small target detection method for aerial image of unmanned aerial vehicle
Technical Field
The invention relates to an image processing technology, in particular to a small target detection method for aerial images of unmanned aerial vehicles.
Background
Target detection combines the two tasks of target localization and target classification, and it underpins higher-level visual tasks such as segmentation, scene understanding, target tracking, image description and event detection. Early target detection algorithms obtained predicted object positions with sliding windows; with the rise of deep learning, algorithms that extract regions of interest from generated region proposals appeared, and the anchor-based RPN quickly replaced earlier region proposal generation. Anchors have therefore been widely used in target detection applications. However, the fixed sizes of anchors cannot flexibly adapt to predicting targets of different scales, which limits the generalization ability of the model; dense anchors also aggravate the imbalance between positive and negative samples and significantly increase computation and memory usage. Anchor-free target detection methods have consequently developed.
Unmanned aerial vehicles are low in cost, small in size, high in resolution, flexible and easy to operate, and can conveniently capture aerial images in many environments, so they are widely used in industries where manual operation is difficult, such as traffic monitoring, electric power inspection, military reconnaissance and environmental supervision; research on UAV target detection is therefore of great significance. Images shot by UAVs over long distances have highly complex backgrounds, small target sizes and blurred appearances, which pose great challenges to the UAV small target detection task. Small targets are ubiquitous in aerial images, whose natural scenes usually suffer from severe illumination change, target occlusion, dense target packing and target scale variation; these factors degrade small-target features even more severely and further increase the difficulty of detection. Network training and target prediction in small target detection depend mainly on feature extraction and feature fusion, and good extraction and fusion methods supply the key high-frequency information for small target detection. Because a small target occupies few pixels in the original image, carries limited information, lacks appearance cues such as texture, shape and color, and has its feature information dispersed or even lost in deep feature maps through downsampling, rich network context semantics, position information and feature representation have become the research focus of the UAV small target detection task.
Disclosure of Invention
The invention aims to solve the low accuracy, high false detection rate and high missed detection rate of small target detection in UAV aerial images: given aerial images captured under different illumination conditions, climates and scenes, a detection model for small target detection is obtained through deep learning network training.
To this end, the invention provides a small target detection method for UAV aerial images based on the YOLOX network. The method adopts feature fusion over multi-scale jump residual paths together with a residual structure embedded with an attention mechanism to achieve better small target detection, and comprises five parts: preprocessing the data set, extracting features from the UAV small target data set, fusing the features from different stages, performing classification and regression prediction at different scales on the fused feature maps, and training and testing the network to obtain an optimal model for UAV small target detection.
The first part comprises two steps:
step 1, downloading a public unmanned aerial vehicle small target data set, selecting images with complex natural scenes, varied angles and severe illumination change as the test set, and performing no image enhancement on the test set;
step 2, resizing the pictures of the public data set to 640 × 640 pixels, dividing each picture in the training set into four parts and flipping them, placing the flipped pictures back at the corresponding division positions, and finally applying color gamut transformation and similar operations to the pictures, enhancing the training sample set and obtaining the final training samples;
the second part comprises two steps:
step 3, inputting the training samples obtained in step 2 into a weight-shared convolutional network, transforming the number of channels through a convolutional residual structure, and preliminarily obtaining feature maps of the aerial image from the RGB space;
step 4, performing multi-scale feature extraction on the feature map obtained in the step 3 to obtain feature maps feat0, feat1, feat2 and feat3 with channel numbers of 128, 256, 512 and 1024 respectively;
the third part comprises four steps:
step 5, performing multi-scale jump residual processing on the feature maps feat0, feat1, feat2 and feat3 obtained in step 4: performing the concatenate operation 3 times along a bottom-up path, adding another bottom-up path, and jump-connecting the deepest feature feat3 across multiple scales to the shallowest feature feat0 to obtain fused feature maps P0, P1 and P2;
step 6, fusing P0, P1 and P2 obtained in step 5 again along a top-down path to obtain feature maps P3 and P4 in which different scales are fused with each other;
step 7, on the basis of steps 5 and 6, adding two additional same-scale residual paths during the top-down feature fusion, and jump-fusing P3 and P4 obtained in step 6 with feat1 and feat2 from feature extraction to obtain feature maps P5 and P6 that preserve more detail;
step 8, adding two bottom-up paths, fusing the feature map feat3 obtained in step 4 with the feature map P6 obtained in step 7 again to obtain P7, and fusing P0 with P5 again to obtain P8;
the fourth section comprises three steps:
step 9, adding a 160 × 160 × 128 detection head Head0 and discarding the 20 × 20 × 1024 detection head Head3 of the original YOLOX network; the feature map P2 obtained in step 5 passes through a feature extraction and enhancement module fused with a CBAM attention mechanism to obtain a refined feature map, which is input into Head0 to complete target prediction for small target detection;
step 10, passing the feature maps P7 and P8 obtained in step 8 each through its own input feature extraction module, and inputting the extracted feature information into a detection head Head1 of size 80 × 80 × 256 and a detection head Head2 of size 40 × 40 × 512 respectively to complete target prediction for small target detection;
step 11, completing the classification and regression tasks of small target detection with the two convolution branches of each of the detection heads Head0, Head1 and Head2 from steps 9 and 10;
the fifth part comprises two steps:
step 12, tuning the network structure hyper-parameters of steps 3 to 10 to obtain the final training model;
and step 13, inputting the test set from step 1 into the training model from step 12 to obtain the test results of unmanned aerial vehicle small target detection.
The invention provides a small target detection method for UAV aerial images. First, to address the facts that small targets occupy few pixels and that boundaries between targets are unclear, the method fuses into the feature extraction stage a feature refinement module containing n CBAM attention mechanisms, which searches for regions of interest in dense object scenes and increases attention on densely packed objects. Second, to alleviate the problems caused by background clutter and pixel blurring, a multi-scale jump-connection feature enhancement module is adopted: by fully fusing shallow and deep feature information and jump-connecting the original information with the features after the first fusion, detection heads of different scales obtain sufficient image semantic and spatial information. Finally, to address the small size and large scale span of targets in UAV images, a 160 × 160 detection head suited to small target sizes is added in place of the 20 × 20 × 1024 detection head of the original YOLOX network; changing the detection head size increases the amount of original-image information onto which the feature maps are mapped.
Drawings
FIG. 1 is a network overall framework diagram of the present invention;
FIG. 2 is a diagram of a feature fusion network framework of the present invention;
FIG. 3 is a frame diagram of a feature extraction architecture of the present invention;
FIG. 4 is a frame diagram of the detection head structure of the present invention;
FIG. 5 shows the test set results output by the present invention.
Detailed Description
To better understand the present invention, the small target detection method for UAV aerial images is described in more detail below with reference to specific embodiments. Detailed descriptions of known prior art that might obscure the subject matter of the invention are omitted.
Step 1, downloading the public unmanned aerial vehicle small target data set VisDrone, selecting images with complex natural scenes, varied angles and severe illumination change as the test set, resizing all test images uniformly to 640 × 640, and performing no image enhancement on the test set;
step 2, resizing the pictures of the training data set to 640 × 640 pixels and applying mosaic data enhancement for the first 90% of the total training rounds: dividing each picture in the training set into four parts and flipping them, placing the flipped pictures back at the corresponding division positions, and finally applying color gamut transformation and similar operations to the pictures, enhancing the training sample set and obtaining the final training samples;
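The four-way split-flip-and-recompose operation described in step 2 is essentially a mosaic-style composition. Below is a minimal sketch of the idea in Python with OpenCV and NumPy; it is an illustration under assumptions rather than the patent's own code, and the names `mosaic_four` and `hsv_jitter` are invented for this sketch (remapping of the ground-truth boxes is omitted):

```python
import cv2
import numpy as np

def mosaic_four(images, out_size=640):
    """Compose four pictures into one mosaic: each picture is resized to a
    quadrant, flipped, and placed back at its division position."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, offsets):
        patch = cv2.flip(cv2.resize(img, (half, half)), 1)  # horizontal flip
        canvas[y:y + half, x:x + half] = patch
    return canvas

def hsv_jitter(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Color-gamut transformation: random multiplicative gains in HSV space."""
    gains = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * gains[0]) % 180   # hue wraps at 180 in OpenCV
    hsv[..., 1:] = np.clip(hsv[..., 1:] * gains[1:], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```

In a real pipeline the four pictures' bounding boxes would be scaled, flipped and translated with the same transform before training.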
fig. 1 is a specific network model diagram of a small target detection method for an aerial image of an unmanned aerial vehicle according to the present invention, and in the present embodiment, the method is performed according to the following steps:
step 3, concentrating the width and height information of the training picture into the channels through a Focus structure, expanding the number of channels from the original 3 to 64, then performing a 1 × 1 convolution to further expand the channel information, and using the SiLU activation function to increase the nonlinearity of the network model;
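The Focus slicing of step 3 takes every second pixel in each spatial direction, turning spatial resolution into channels before the convolution. A minimal PyTorch sketch, where the 3-to-64 channel expansion follows the text and the batch normalization is an assumption:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-interleaved sub-images (3 -> 12
    channels, half resolution), then expand to 64 channels with a 1 x 1
    convolution followed by SiLU."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(sliced)))
```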
step 4, extracting image information features with the four residual modules Resblock1, Resblock2, Resblock3 and Resblock4 shown in fig. 1, and outputting the 4 intermediate feature maps feat0, feat1, feat2 and feat3 with 128, 256, 512 and 1024 channels respectively;
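A compact stand-in for step 4's four-stage extraction, assuming each Resblock halves the resolution while widening the channels as stated; the internal block design shown here is a simplification, not the patent's exact structure:

```python
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Resblock(nn.Module):
    """Stride-2 downsampling followed by a residual refinement."""
    def __init__(self, c1, c2):
        super().__init__()
        self.down = ConvBNSiLU(c1, c2, k=3, s=2)
        self.body = nn.Sequential(ConvBNSiLU(c2, c2, k=1), ConvBNSiLU(c2, c2, k=3))

    def forward(self, x):
        x = self.down(x)
        return x + self.body(x)

# Focus stem output: 64 channels at 320 x 320 for a 640 x 640 input.
stages = nn.ModuleList([Resblock(64, 128), Resblock(128, 256),
                        Resblock(256, 512), Resblock(512, 1024)])

def extract_features(x):
    feats = []                 # feat0..feat3: 128, 256, 512, 1024 channels
    for stage in stages:
        x = stage(x)
        feats.append(x)
    return feats
```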
step 5, fusing features of different scales through multi-scale jump residual paths to obtain fused feature maps P0, P1 and P2, implemented as follows:
step 5-1, the multi-scale jump residual feature fusion structure is shown in fig. 2, with a feature fusion module added to the original feature fusion structure: feat1 is passed through a 1 × 1 convolution that changes its channels from 256 to 128, upsampled by 2× nearest-neighbour interpolation so that the feature size grows from 80 × 80 to 160 × 160, and concatenated with feat0 generated in the first stage of feature extraction to obtain P2_1; features are fused in this way 3 times;
step 5-2, a jump residual path across multiple scales is added: feat3, generated in the last stage of feature extraction, is upsampled 3 times by 2× nearest-neighbour interpolation, transforming the 20 × 20 feat3 into a 160 × 160 feature map; a convolution with kernel size 1 and stride 1 then changes the number of channels to 128; finally the result is fused with P2_1 from step 5-1 by concatenation to obtain P2;
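Steps 5-1 and 5-2 combine 1 × 1 channel reduction, 2× nearest-neighbour upsampling and concatenation. A sketch of the two jump paths in PyTorch, where the tensor shapes follow the text and everything else is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv_feat1 = nn.Conv2d(256, 128, kernel_size=1)    # channel change for feat1
conv_feat3 = nn.Conv2d(1024, 128, kernel_size=1)   # channel change for feat3

def fuse_step5_1(feat0, feat1):
    """Same-scale jump: feat1 (256 ch, 80 x 80) -> 128 ch, 160 x 160,
    concatenated with feat0 (128 ch, 160 x 160) to give P2_1."""
    x = conv_feat1(feat1)
    x = F.interpolate(x, scale_factor=2, mode='nearest')
    return torch.cat([x, feat0], dim=1)

def fuse_step5_2(p2_1, feat3):
    """Cross-multi-scale jump: feat3 (1024 ch, 20 x 20) is upsampled three
    times by 2x nearest-neighbour to 160 x 160, reduced to 128 channels,
    and concatenated with P2_1 to give P2."""
    x = feat3
    for _ in range(3):
        x = F.interpolate(x, scale_factor=2, mode='nearest')
    x = conv_feat3(x)
    return torch.cat([x, p2_1], dim=1)
```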
step 6, fusing P0, P1 and P2 obtained in step 5 again along a top-down path to obtain feature maps P3 and P4 in which different scales are fused with each other;
step 7, on the basis of steps 5 and 6, adding two additional same-scale residual paths during the top-down feature fusion, and jump-fusing P3 and P4 obtained in step 6 with feat1 and feat2 from feature extraction to obtain feature maps P5 and P6;
step 8, adding two bottom-up paths, fusing feat3 from step 4 with P6 from step 7 again, and fusing P0 obtained in step 5 with P5 from step 7 again, to obtain P7 and P8 respectively, as follows:
Specifically, feat3 is upsampled by 2× nearest-neighbour interpolation to obtain a 40 × 40 feature map, then passed through a 1 × 1 convolution, normalization and the SiLU activation function to obtain a feature map of unchanged size with 512 channels, which is fused with P6 from step 7 again; similarly, the feature map obtained after P0 from step 5 passes through a feature extraction module is upsampled by 2× nearest-neighbour interpolation, a 1 × 1 convolution adjusts the number of channels to 256, and the result is fused with P5 from step 7 by concatenation;
step 9, as shown in fig. 4, adding the detection head Head0 and discarding the detection head Head3; the feature map P2 obtained in step 5 is passed through a feature extraction enhancement module fused with the CBAM attention mechanism and input into Head0 to complete target prediction for small target detection, implemented as follows:
step 9-1, the feat0 features are introduced into the feature fusion module newly added to the feature fusion structure; a 160 × 160 × 128 detection head Head0 is generated from feat0, and the 20 × 20 × 1024 detection head Head3 generated from feat3 is discarded.
Step 9-2, in the feature extraction enhancement module shown in fig. 3, the input features pass through two 1 × 1 convolutions and are divided into two branches; one branch passes through n stacked Bottleneck residual structures: one branch of each Bottleneck performs feature extraction through a 1 × 1 convolution and a 3 × 3 convolution, with a CBAM attention mechanism added between the two convolutions, the other branch of the Bottleneck performs no operation, and the two Bottleneck branches are fused by add, so the CBAM is repeated n times along with the n stacked Bottlenecks; the other branch is a residual edge branch that performs only one 1 × 1 convolution; finally the two branches are connected by concatenation for output;
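A hedged sketch of step 9-2's module: a standard CBAM (channel attention followed by spatial attention) is inserted between the 1 × 1 and 3 × 3 convolutions of each Bottleneck, a residual edge branch carries a single 1 × 1 convolution, and the two branches are concatenated. Class names, the reduction ratio and the 7 × 7 spatial kernel are assumptions:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, ch, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # channel attention from average- and max-pooled global descriptors
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        # spatial attention from per-pixel average and max over channels
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

class Bottleneck(nn.Module):
    """1 x 1 then 3 x 3 convolution with CBAM between them; the other
    branch does nothing and the two are fused by elementwise add."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 1, bias=False)
        self.cbam = CBAM(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)

    def forward(self, x):
        return x + self.conv2(self.cbam(self.conv1(x)))

class FeatureRefine(nn.Module):
    """Two 1 x 1 branches: one stacks n CBAM-Bottlenecks, the other is a
    plain residual edge; their outputs are concatenated."""
    def __init__(self, ch, n=1):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(ch, ch // 2, 1),
                                     *[Bottleneck(ch // 2) for _ in range(n)])
        self.branch2 = nn.Conv2d(ch, ch // 2, 1)

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```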
step 10, passing the feature maps P7 and P8 obtained in step 8 each through its own input feature extraction module, and inputting the extracted feature information into a detection head Head1 of size 80 × 80 × 256 and a detection head Head2 of size 40 × 40 × 512 respectively to complete target prediction for small target detection;
step 11, completing the classification and regression tasks of small target detection with the two convolution branches of each of the detection heads Head0, Head1 and Head2 from steps 9 and 10;
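The two convolution branches of step 11 correspond to an anchor-free decoupled head in the YOLOX style: one branch predicts per-pixel class scores, the other box offsets plus objectness. A minimal sketch, with channel widths per the text; the branch depth and the 10 VisDrone classes are illustrative assumptions:

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Anchor-free prediction head: shared stem, then separate
    classification and regression/objectness branches."""
    def __init__(self, in_ch, num_classes, width=128):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, width, 1), nn.SiLU())
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, num_classes, 1))        # class scores
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
        self.box = nn.Conv2d(width, 4, 1)            # x, y, w, h offsets
        self.obj = nn.Conv2d(width, 1, 1)            # objectness

    def forward(self, x):
        x = self.stem(x)
        reg = self.reg_branch(x)
        return self.cls_branch(x), self.box(reg), self.obj(reg)

# one head per fused scale: Head0 on 160 x 160 x 128, Head1 on 80 x 80 x 256,
# Head2 on 40 x 40 x 512
heads = nn.ModuleList(DecoupledHead(c, num_classes=10) for c in (128, 256, 512))
```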
step 12, tuning the network structure hyper-parameters of steps 3 to 10 and setting the network model parameters: the number of epochs is set to 130 and transfer training is adopted; for the first 50 epochs the backbone network is frozen, the learning rate is set to 0.001, a pre-training model is loaded, and the batch size is set to 4; for the following 80 epochs the backbone network is unfrozen, the learning rate is set to 0.0001 and the batch size to 2; the learning rate is multiplied by 0.92 after each epoch, and the final training model is obtained after training;
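The two-phase transfer-training schedule of step 12 can be sketched as below; `model.backbone`, `train_one_epoch` and the data loaders are placeholders assumed for illustration, and the optimizer choice is not specified by the patent:

```python
import torch

def fit(model, loader_bs4, loader_bs2, train_one_epoch):
    # phase 1: freeze the backbone and train the rest for 50 epochs
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                           lr=1e-3)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.92)
    for _ in range(50):
        train_one_epoch(model, loader_bs4, opt)   # batch size 4
        sched.step()                              # lr *= 0.92 each epoch

    # phase 2: unfreeze the backbone and train 80 more epochs at a lower lr
    for p in model.backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.92)
    for _ in range(80):
        train_one_epoch(model, loader_bs2, opt)   # batch size 2
        sched.step()
```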
and step 13, inputting the test set from step 1 into the training model from step 12 to obtain the test results of UAV small target detection. FIG. 5 shows the test set results output by the present invention.
According to the characteristics of small targets in UAV aerial images and deep-learning-based target detection methods, the invention provides an anchor-free small target detection method with multi-scale feature fusion. Meanwhile, the detection head sizes of the network are improved: a 160 × 160 × 128 detection head is added, making the network better suited to small-size target detection, and the 20 × 20 × 1024 detection head of the original YOLOX network, which has a larger number of parameters, is discarded, reducing the parameter count of the network. In the feature extraction module, a CBAM attention mechanism is adopted to find densely packed target regions, increase attention on hard samples, raise the weight of important information, and reduce the large amount of computation and feature redundancy the network would otherwise spend searching for regions of interest. The method is algorithmically simple, highly operable and widely applicable.
While the invention has been described with reference to illustrative embodiments, it is to be understood that the invention is not limited thereto and is intended to cover various changes and modifications that are obvious to those skilled in the art and fall within the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A small target detection method for unmanned aerial vehicle aerial images, characterized in that a residual structure fused with an attention mechanism is introduced for feature extraction, multi-directional multi-scale residual jump paths are used for feature fusion, and predictions are made by multiple detection heads at different scales in an anchor-free manner; the method comprises five parts: preprocessing the data set, extracting features from the unmanned aerial vehicle small target data set, fusing features from different stages, performing classification and regression prediction at different scales on the fused feature maps, and training and testing the network;
the first part comprises two steps:
step 1, downloading the public unmanned aerial vehicle small target data set VisDrone, selecting images with complex natural scenes, varied angles and severe illumination change as the test set, resizing all test images uniformly to 640 × 640, and performing no image enhancement on the test set;
step 2, resizing the pictures of the training data set to 640 × 640 pixels and applying mosaic data enhancement for the first 90% of the total training rounds: dividing each picture in the training set into four parts and flipping them, placing the flipped pictures back at the corresponding division positions, and finally applying color gamut transformation and similar operations to the pictures, enhancing the training sample set and obtaining the final training samples;
the second part comprises two steps:
step 3, concentrating the width and height information of the training picture into the channels through a Focus structure, expanding the number of channels from the original 3 to 64, then performing a 1 × 1 convolution to further expand the channel information, and using the SiLU activation function to increase the nonlinearity of the network model;
step 4, extracting image information features with the four residual modules Resblock1, Resblock2, Resblock3 and Resblock4 shown in fig. 1, and outputting the 4 intermediate feature maps feat0, feat1, feat2 and feat3 with 128, 256, 512 and 1024 channels respectively;
the third part comprises four steps:
step 5, performing multi-scale jump residual processing on the feature maps feat0, feat1, feat2 and feat3 obtained in step 4: performing the concatenate operation 3 times along a bottom-up path, adding another bottom-up path, and jump-connecting the deepest feature map feat3 across multiple scales to the shallowest feature map feat0 to obtain fused feature maps P0, P1 and P2;
Step 6, fusing P0, P1 and P2 obtained in step 5 again along a top-down path to obtain feature maps P3 and P4 in which different scales are fused with each other;
step 7, on the basis of steps 5 and 6, adding two additional same-scale residual paths during the top-down feature fusion, and jump-fusing P3 and P4 obtained in step 6 with feat1 and feat2 from feature extraction to obtain feature maps P5 and P6;
step 8, adding two bottom-up paths, fusing feat3 from step 4 with P6 from step 7 again to obtain feature map P7, and fusing P0 obtained in step 5 with P5 from step 7 again to obtain P8, implemented as follows:
performing 2× nearest-neighbour interpolation upsampling on feat3 to obtain a 40 × 40 feature map, then applying a 1 × 1 convolution, normalization and the SiLU activation function to obtain a feature map of unchanged size with 512 channels, and fusing it with P6 from step 7 again; similarly, the feature map obtained after P0 from step 5 passes through a feature extraction module is upsampled by 2× nearest-neighbour interpolation, a 1 × 1 convolution adjusts the number of channels to 256, and the result is fused with P5 from step 7 by concatenation;
the fourth part comprises three steps:
step 9, adding the detection head Head0 and discarding the detection head Head3 of the original YOLOX network; the feature map P2 obtained in step 5 is passed through a feature extraction enhancement module fused with the CBAM attention mechanism and input into Head0 to complete target prediction for small target detection, implemented as follows:
step 9-1, the feat0 features are introduced into the feature fusion module newly added to the feature fusion structure; a 160 × 160 × 128 detection head Head0 is generated from feat0, and the 20 × 20 × 1024 detection head Head3 generated from feat3 is discarded;
step 9-2, in the feature extraction enhancement module shown in fig. 3, the input features pass through two 1 × 1 convolutions and are divided into two branches; one branch passes through n stacked Bottleneck residual structures: one branch of each Bottleneck performs feature extraction through a 1 × 1 convolution and a 3 × 3 convolution, with a CBAM attention mechanism added between the two convolutions, the other branch of the Bottleneck performs no operation, and the two Bottleneck branches are fused by add, so the CBAM is repeated n times along with the n stacked Bottlenecks; the other branch is a residual edge branch that performs only one 1 × 1 convolution; finally the two branches are connected by concatenation for output;
step 10, passing the feature maps P7 and P8 obtained in step 8 each through its own input feature extraction module, and inputting the extracted feature information into a detection head Head1 of size 80 × 80 × 256 and a detection head Head2 of size 40 × 40 × 512 respectively to complete target prediction for small target detection;
step 11, completing the classification and regression tasks of small target detection with the two convolution branches of each of the detection heads Head0, Head1 and Head2 from steps 9 and 10;
the fifth part comprises two steps:
step 12, tuning the network structure hyper-parameters of steps 3 to 10 and setting the network model parameters: the number of epochs is set to 130 and transfer training is adopted; for the first 50 epochs the backbone network is frozen, the learning rate is set to 0.001, a pre-training model is loaded, and the batch size is set to 4; for the following 80 epochs the backbone network is unfrozen, the learning rate is set to 0.0001 and the batch size to 2; the learning rate is multiplied by 0.92 after each epoch, and the final training model is obtained after training;
and step 13, inputting the test set from step 1 into the training model from step 12 to obtain the test results of unmanned aerial vehicle small target detection.
2. The small target detection method for unmanned aerial vehicle aerial images according to claim 1, characterized in that feature fusion over multiple reverse recursive paths is added in steps 5 and 7, adding fine-grained deep feature information to the shallow feature layers.
3. The small target detection method for unmanned aerial vehicle aerial images according to claim 1, characterized in that two bottom-up paths are added in step 8, and the features after the first fusion undergo further feature fusion through channel number conversion, enriching the diversity of context semantic information.
4. The small target detection method for unmanned aerial vehicle aerial images according to claim 1, characterized in that step 9 uses a CBAM attention mechanism to focus attention on features beneficial to sample classification, reducing feature redundancy.
5. The small target detection method for unmanned aerial vehicle aerial images according to claim 1, characterized in that step 9 improves the detection head sizes by adding the detection head Head0 and discarding the detection head Head3 of the original YOLOX network.
CN202210488938.7A 2022-05-07 2022-05-07 Small target detection method for aerial image of unmanned aerial vehicle Pending CN114863301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210488938.7A CN114863301A (en) 2022-05-07 2022-05-07 Small target detection method for aerial image of unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210488938.7A CN114863301A (en) 2022-05-07 2022-05-07 Small target detection method for aerial image of unmanned aerial vehicle

Publications (1)

Publication Number Publication Date
CN114863301A true CN114863301A (en) 2022-08-05

Family

ID=82635792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210488938.7A Pending CN114863301A (en) 2022-05-07 2022-05-07 Small target detection method for aerial image of unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN114863301A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035354A (en) * 2022-08-12 2022-09-09 江西省水利科学院 Reservoir water surface floater target detection method based on improved YOLOX
CN115035354B (en) * 2022-08-12 2022-11-08 江西省水利科学院 Reservoir water surface floater target detection method based on improved YOLOX
CN115272814A (en) * 2022-09-28 2022-11-01 南昌工学院 Long-distance space self-adaptive multi-scale small target detection method
CN115375677A (en) * 2022-10-24 2022-11-22 山东省计算中心(国家超级计算济南中心) Wine bottle defect detection method and system based on multi-path and multi-scale feature fusion
CN116958774A (en) * 2023-09-21 2023-10-27 北京航空航天大学合肥创新研究院 Target detection method based on self-adaptive spatial feature fusion
CN116958774B (en) * 2023-09-21 2023-12-01 北京航空航天大学合肥创新研究院 Target detection method based on self-adaptive spatial feature fusion
CN118155106A (en) * 2024-05-13 2024-06-07 齐鲁空天信息研究院 Unmanned aerial vehicle pedestrian detection method, system, equipment and medium for mountain rescue


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination