CN117197687A - Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets - Google Patents
- Publication number
- CN117197687A (application number CN202310235724.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- loss
- detection
- small target
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application provides a detection method for dense small targets in unmanned aerial vehicle (UAV) aerial photography. First, a lightweight backbone network, CSPDarknet-tiny, is constructed on the basis of the YOLOv7 model; the downsampling ratio is reduced so that more semantic information and detail features are retained. Second, a multi-head self-attention mechanism (MHSA) is introduced into the neck network, which effectively alleviates the interference of irrelevant information from complex backgrounds, helps the network focus on extracting small-target feature information, and improves small-target detection accuracy. Finally, to address the sensitivity of the IoU Loss to positional differences of small targets, the NWD Loss is introduced; combining the IoU Loss and the NWD Loss at a fixed weight ratio significantly improves small-target detection accuracy. The application addresses the problem of small-target detection under UAV aerial photography conditions, improves the accuracy of small-target detection, reduces the miss rate, ensures excellent detection performance on small targets, and has wide applicability.
Description
Technical field:
The application relates to image processing technology, and in particular to a detection method for dense small targets in unmanned aerial vehicle aerial photography. It combines multi-scale feature fusion with a similarity metric based on the Wasserstein distance, introduces a multi-head self-attention mechanism (MHSA), and constructs a robust small-target detection network.
Technical background:
Unmanned aerial vehicles offer low operating cost, high maneuverability, portability, multiple viewing angles and small size, and can compensate for the limitations of remote-sensing satellite information acquisition. With the gradual opening of low-altitude airspace and the continuous development of UAV technology, they have become a research hotspot for experts and scholars at home and abroad.
Object detection distinguishes objects of interest from the background in images and video, identifying both the category and the position of each object. Early detection methods relied on hand-crafted features, which cannot capture abstract semantic features well and can only recognize a single specified category, resulting in low recognition efficiency and poor detection performance. Because aerial images contain more complex scenes and targets than everyday images, traditional detection methods are at an even greater disadvantage and cannot meet the requirements of aerial-image target detection. Moreover, aerial imagery often involves huge data volumes and requires real-time detection, which places stricter demands on detection methods. In recent years, with the rapid development of deep learning, convolutional neural networks have greatly improved detection performance over traditional methods. These algorithms fall into two categories: single-stage and two-stage. Two-stage algorithms follow the basic idea of region-based detection and split detection into two steps: candidate regions that may contain targets are first generated by methods such as selective search, edge detection or a region proposal network to extract features; a convolutional neural network then performs classification and position regression on the candidate boxes. Existing two-stage algorithms, such as R-CNN, Fast R-CNN, Mask R-CNN and SPP-Net, generally achieve low false-detection and miss rates and good detection results, but require repeated detection and classification and are therefore slow.
Unlike two-stage detectors, a single-stage detector obtains detection boxes directly without generating candidate regions in advance, so single-stage algorithms such as SSD and the YOLO series are generally fast but somewhat less accurate.
Current target detection algorithms face the following problems on UAV aerial images: 1) target scales vary greatly, placing high demands on the algorithm's feature fusion; 2) targets are small, densely distributed and set against complex backgrounds, and there is a contradiction between small-target feature extraction and downsampling, which increases detection difficulty; 3) YOLO-based models have large parameter counts and complex computation.
Summary of the application:
The application aims to solve the low detection accuracy and high miss rate caused by the large number of targets and the high proportion of small targets in UAV aerial images. Given aerial images taken at different times and under different weather and illumination conditions, an algorithm network model is designed and trained with a deep neural network to perform target detection, thereby addressing small-target detection under UAV aerial photography conditions, improving small-target detection accuracy and reducing the small-target miss rate.
To achieve the above objective, the application provides a UAV aerial-photography target detection model based on the YOLOv7 network. The method uses YOLOv7 as the backbone, reduces the downsampling ratio, introduces a multi-head self-attention mechanism (MHSA) so that the model focuses more on target feature information, and introduces a Normalized Wasserstein Distance (NWD) loss term when computing the regression loss to compensate for weaknesses in small-target detection. The method comprises three parts: the first preprocesses the data set, the second constructs the improved YOLOv7 network, and the third trains and tests the network and outputs the best detection results on the aerial data set.
The first part comprises three steps:
step 1: dividing a training set, a verification set and a test set by adopting an unmanned aerial vehicle aerial photography public data set VisDrone;
step 2: resize the obtained data set pictures to 640×640 pixels; apply Mosaic data enhancement, performing random flipping, scaling, color-gamut transformation and other operations on each training picture, and stitch four pictures together to obtain the final data set;
step 3: for the data set obtained in step 2, perform K-means++ clustering on its bounding boxes to obtain new anchor sizes; compare the results with the originally set anchors, compute the matching accuracy, and select the optimal anchor size setting;
the second part comprises three steps:
step 4: and establishing a lightweight trunk feature extraction network CSPDarknet-tiny. The downsampling multiplying power is reduced on the backbone network of the original YOLOv7, the downsampling is reduced from 32 times to 16 times, and the output characteristic diagrams comprise 160×160×256 characteristic diagrams map1, 80×80×512 characteristic diagrams map2 and 40×40×512 characteristic diagrams map3;
step 5: process the feature map map3 obtained in step 4 with SPPCSPC to obtain a 40×40×256 feature map P1;
step 6: and establishing a feature fusion network. In the feature extraction network of the neck, a path fusion network of the YOLOv7 is reserved, different feature layers and detection layers are fused, the FPN up-sampling conveys semantic features, and the PAN down-sampling conveys positioning features, and the implementation is as follows:
(1) P1 obtained in step 5 is passed into the deep feature extraction module C3MS, which introduces the multi-head self-attention mechanism MHSA on the basis of C3 and effectively enhances the network's feature extraction capability, yielding feature map P2;
(2) map1, map2 and map3 obtained in step 4 are fused through top-down and bottom-up paths, outputting the final feature maps P3, P4 and P5;
the third part comprises four steps:
step 7: the feature maps P3, P4 and P5 output by step 6 have their channel numbers adjusted by RepConv; three 1×1 convolution layers predict objectness, class and bbox; the detection heads used are head0 (40×40×512), head1 (80×80×256) and head2 (160×160×128);
step 8: adjust the network hyperparameters and set the model parameters: the number of training epochs is set to 200, momentum = 0.937, and the initial learning rate lr = 0.01;
step 9: training an aerial target detection model by using a training set to obtain a prediction result of a target in each sample, wherein the prediction result comprises a target prediction boundary frame and a center point position of the prediction boundary frame;
step 10: calculate the total loss from the difference between the sample prediction results obtained in step 9 and the labels, and update the network model parameters based on the total loss to obtain the final trained model;
When computing the regression loss within the total loss, the NWD Loss is introduced and combined with the IoU Loss at a fixed weight ratio; the regression loss function is:
Loss_box = λ₁ × (1.0 − IoU) + λ₂ × (1.0 − NWD(N_a, N_b))
where λ₁ and λ₂ are both set to 0.5. This fully compensates for the IoU's weakness on small targets, retains the original model's detection accuracy on large and medium targets, and markedly improves the model's small-target detection capability;
step 11: input the test set from step 2 into the trained model from step 10 to obtain the test results for UAV small-target detection.
The main improvements of the application to the YOLOv7 network model are as follows:
(1) A lightweight backbone network CSPDarknet-tiny is constructed, reducing the downsampling ratio from the original 32× to 16× and retaining more semantic information and detail features. The output feature maps change from the original 80×80×512, 40×40×1024 and 20×20×1024 to 160×160×256, 80×80×512 and 40×40×1024. This significantly reduces the model's parameter count, effectively alleviates the information loss that targets in UAV aerial images suffer during downsampling at excessive ratios, and improves small-target detection accuracy;
(2) Because aerial images have complex backgrounds and small targets, target feature extraction is difficult: when the backbone extracts feature information, a large amount of irrelevant information interferes and strongly affects the detection results. The application therefore introduces the multi-head self-attention mechanism MHSA, which can capture image target information over a wide range, and combines C3 with multi-head attention to design the deep feature extraction module C3MS. This effectively alleviates the interference of irrelevant information in aerial images on feature extraction and enhances the model's ability to extract small-target feature information;
(3) Targets in aerial images span a wide range of scales, and small targets are numerous. The loss function of the initial YOLOv7 model uses CIoU, but the IoU is highly sensitive to positional differences between targets of different scales, so the model performs poorly on aerial-image data sets. The application introduces the NWD Loss to compensate for this deficiency: the NWD is insensitive to target scale change and well suited to comparing similarity between small targets. However, using the NWD Loss alone harms detection of large and medium targets, so the application combines the NWD Loss and the IoU Loss with specific weights, both compensating for the IoU's weakness on small targets and retaining the IoU's advantage on large and medium targets.
Drawings
FIG. 1 is a general flow chart of the present application;
FIG. 2 is a diagram of the overall network framework of the present application;
FIG. 3 is a diagram of a feature fusion network framework of the present application;
FIG. 4 is a frame diagram of the MHSA structure of the present application;
FIG. 5 is a graph showing test set results output by the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the application clearer, the technical solutions of the embodiments are described completely below; detailed descriptions of the prior art that would obscure the subject matter of the application are omitted. It should be understood that the specific embodiments described herein are for illustration only and do not limit the scope of the application. The overall flow chart of an embodiment is shown in fig. 1, and the application is further described below with reference to the accompanying drawings.
Step 1: acquire the UAV aerial image data set VisDrone, which contains 10209 images (6471 for training, 548 for validation and 3190 for testing) and 10 target categories (pedestrian, people, bicycle, car, van, truck, tricycle, awning tricycle, bus and motor); the training pictures are uniformly resized to 640×640;
step 2: apply Mosaic data enhancement to the data set: randomly select four pictures, and to each apply a random flip (left-right flipping of the original picture), random zoom (scaling the original picture) and color-gamut change (altering the brightness, saturation and hue of the original picture); then crop the corresponding regions of the four pictures and stitch them together, placing them in the upper-left, lower-left, upper-right and lower-right areas respectively, to obtain pictures with richer backgrounds and more information, thereby enhancing the training samples;
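The Mosaic stitching step above can be sketched as follows. This is a minimal numpy-only illustration: the random flip, zoom and color-gamut operations are omitted, the `mosaic` function name and fixed quadrant layout are illustrative assumptions, and a real implementation would also remap the bounding-box labels into the stitched image.

```python
import numpy as np

def mosaic(imgs, out_size=640):
    """Stitch four images into one mosaic sample, one image per quadrant."""
    assert len(imgs) == 4
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # quadrant origins: top-left, top-right, bottom-left, bottom-right
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, corners):
        h, w = img.shape[:2]
        # nearest-neighbour resize of img to (half, half) via index arrays
        ys = np.arange(half) * h // half
        xs = np.arange(half) * w // half
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```

Calling `mosaic` with four 100×100 test images yields a single 640×640 training picture containing all four.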
step 3: for the data set obtained in step 2, use the K-means++ clustering algorithm to cluster the widths and heights of all target bounding boxes in the training set into a new anchor combination; the prior-box values are continuously updated through back propagation so as to better fit the data set, finally yielding the optimal anchor combinations [3,4,4,9,8,7], [8,14,16,9,14,18] and [31,17,25,33,58,42];
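A minimal sketch of the anchor clustering in step 3: k-means++ seeding followed by plain Lloyd iterations on (width, height) pairs. The Euclidean distance used here is a simplifying assumption; YOLO-style implementations typically cluster with a 1 − IoU distance and then refine the anchors during training, as the text describes.

```python
import numpy as np

def kmeans_pp_anchors(wh, k, iters=50, seed=0):
    """Cluster (width, height) box pairs into k anchors, sorted by area."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.integers(len(wh))][None]          # first seed at random
    while len(centers) < k:                            # k-means++ seeding:
        d2 = ((wh[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers,                  # sample far points
                             wh[rng.choice(len(wh), p=d2 / d2.sum())]])
    for _ in range(iters):                             # Lloyd iterations
        assign = ((wh[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([wh[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return centers[np.argsort(centers.prod(1))]
```

On VisDrone one would pass all ground-truth box sizes and k = 9 to obtain three triples of anchors, one per detection head.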
Fig. 2 shows the improved network model based on YOLOv7. A 640×640 picture is input; the backbone feature extraction network outputs three feature maps of different scales to the neck network; the neck outputs feature maps of corresponding scales through the path fusion network; and the detection heads output prediction results through Rep and CBS. In this embodiment, the steps are as follows:
step 4: establish a lightweight backbone feature extraction network, CSPDarknet-tiny, as shown in fig. 3. A 640×640×3 picture is input; three CBS convolution layers output a 2× downsampled 320×320×64 feature map, and a CBS convolution with kernel size 3 and stride 2 then yields a 4× downsampled feature map, where CBS consists of Conv (convolution) + BN (batch normalization) + SiLU (sigmoid linear unit). The 4× downsampled feature map is then processed by an ELAN module, which enhances the network's learning capability without damaging the original gradient path, outputting the 160×160×256 feature map map1. map1 is downsampled again; YOLOv7 implements this with two branches: one performs spatial downsampling via MP (max pooling) in parallel with a 1×1 convolutional channel compression, and the other downsamples via a 1×1 convolutional channel compression followed by a 3×3 convolution with stride 2. The two branches are merged to obtain an 80×80×256 feature map, and further ELAN and MP operations yield the 80×80×512 feature map map2 and the 40×40×512 feature map map3;
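The effect of lowering the total downsampling ratio can be checked with simple arithmetic: at 640×640 input, a 32× backbone bottoms out at 20×20, while the 16× CSPDarknet-tiny described above keeps a 40×40 deepest map. A tiny helper (illustrative only, not part of the patent):

```python
def feature_map_sizes(input_size, strides):
    """Spatial side length after each successive stride-s downsampling stage."""
    sizes, s = [], input_size
    for stride in strides:
        s //= stride
        sizes.append(s)
    return sizes

# 16x total downsampling (four stride-2 stages): deepest map is 40x40
sixteen_x = feature_map_sizes(640, [2, 2, 2, 2])
# 32x total downsampling (five stride-2 stages): deepest map would be 20x20
thirty_two_x = feature_map_sizes(640, [2, 2, 2, 2, 2])
```

This is why the 16× variant retains more spatial detail for small targets at its deepest level.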
step 5: the feature map map3 obtained in step 4 is processed by SPPCSPC, which continues the idea of SPP to enlarge the receptive field: different receptive fields are obtained through max pooling, and the feature map is split into two parts, one undergoing conventional convolution and the other the SPP operation, improving accuracy while increasing speed, and yielding the 40×40×256 feature map P1;
step 6: the application establishes a feature fusion network that extends the YOLO path fusion network PAFPN and consists of a top-down path and a bottom-up path, implemented as follows:
(1) P1 obtained in step 5 is passed into the deep feature extraction module C3MS, which introduces the multi-head self-attention mechanism MHSA on the basis of C3 and effectively enhances the network's feature extraction capability. MHSA (Multi-Head Self-Attention) consists of multiple self-attention modules that capture global information in different subspaces; the resulting information is concatenated into a new feature map, with the structure shown in fig. 4, yielding P2;
(2) map1, map2 and map3 obtained in step 4 are fused through top-down and bottom-up paths to obtain the final feature maps P3, P4 and P5;
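The multi-head self-attention computation used inside C3MS can be sketched in numpy as below. This shows only the attention arithmetic over a flattened H×W feature map treated as n = H·W tokens of dimension d; the convolutional parts of C3, positional encodings and the learned projections of the actual module are not reproduced, and the function and argument names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, wq, wk, wv, wo, heads):
    """Multi-head self-attention: x is (n, d) tokens, wq/wk/wv/wo are (d, d).

    Each head attends within its own d/heads-dimensional subspace; the head
    outputs are concatenated and mixed by the output projection wo.
    """
    n, d = x.shape
    hd = d // heads
    q, k, v = x @ wq, x @ wk, x @ wv
    out = np.empty((n, d))
    for h in range(heads):
        s = slice(h * hd, (h + 1) * hd)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(hd))  # (n, n) weights
        out[:, s] = att @ v[:, s]
    return out @ wo
```

With zero query/key projections the attention becomes uniform, so every output token is the mean of the value tokens, which is a handy sanity check.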
step 7: the feature maps P3, P4 and P5 output by step 6 have their channel numbers adjusted by RepConv; three 1×1 convolution layers predict objectness, class and bbox; the detection heads used are head0 (40×40×512), head1 (80×80×256) and head2 (160×160×128);
step 8: adjust the network hyperparameters and set the model parameters: the number of training epochs is set to 200, momentum = 0.937, and the initial learning rate lr = 0.01;
step 9: train the aerial target detection model with the training set to obtain a prediction result for the targets in each sample, comprising the predicted bounding box and the position of its center point;
step 10: calculate the total loss from the difference between the sample prediction results obtained in step 9 and the labels, and update the network model parameters based on the total loss to obtain the final trained model;
The total loss used by the original YOLOv7 network comprises a confidence loss (Loss_obj), a regression loss (Loss_box) and a classification loss (Loss_cls); the total loss function is:
Loss = λ₁ × Loss_obj + λ₂ × Loss_box + λ₃ × Loss_cls
where λ₁, λ₂ and λ₃ are the weights of the individual losses within the total loss, all set to 1 in this application; the target confidence loss and the classification loss use BCEWithLogitsLoss (binary cross-entropy with logits), and the regression loss uses CIoU Loss;
When computing the regression loss, a novel small-target evaluation metric based on the Wasserstein distance, the Normalized Wasserstein Distance (NWD), is introduced. It measures the similarity between targets through Gaussian distributions; because it compares distributions, it can measure similarity whether or not the detected boxes overlap, is insensitive to target scale, and is better suited to measuring similarity between small targets. The NWD is formulated as follows:
NWD(N_a, N_b) = exp(−√(W₂²(N_a, N_b)) / C)
where C is a constant closely related to the data set, W₂²(N_a, N_b) is the squared second-order Wasserstein distance between N_a and N_b, and N_a and N_b are the Gaussian distributions modeled from A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b);
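Under the Gaussian modeling described above, the NWD has a simple closed form for axis-aligned boxes, sketched below. The constant C is data-set dependent; the value 12.8 used here is only an illustrative assumption.

```python
import math

def nwd(a, b, C=12.8):
    """Normalized Wasserstein Distance between boxes a, b = (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian with mean (cx, cy) and axis
    scales w/2, h/2; the squared 2nd-order Wasserstein distance between
    two such Gaussians reduces to the squared Euclidean distance below.
    """
    w2 = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
          + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2) / C)
```

Identical boxes give NWD = 1, and the value decays smoothly with center offset even when the boxes no longer overlap, which is what makes it suitable for small targets.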
However, completely replacing the IoU Loss with the NWD Loss does not improve results: although detection accuracy on small and tiny targets rises, performance on large and medium targets falls. The application therefore retains the IoU Loss and combines it with the NWD Loss at a fixed weight ratio; the complete regression loss function is:
Loss_box = λ₁ × (1.0 − IoU) + λ₂ × (1.0 − NWD(N_a, N_b))
In the application, λ₁ and λ₂ are both set to 0.5, which fully compensates for the IoU's weakness on small targets, retains the original model's detection accuracy on large and medium targets, and markedly improves the model's small-target detection capability;
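The combined regression loss can be sketched as a self-contained plain-Python function. Plain IoU stands in for CIoU for brevity, boxes are (cx, cy, w, h), the NWD constant C = 12.8 is an illustrative assumption, and λ₁ = λ₂ = 0.5 follows the text.

```python
import math

def iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nwd(a, b, C=12.8):
    """Closed-form NWD under the Gaussian box modeling of the text."""
    w2 = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
          + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2) / C)

def box_loss(pred, target, lam1=0.5, lam2=0.5):
    """Loss_box = lam1 * (1 - IoU) + lam2 * (1 - NWD)."""
    return lam1 * (1.0 - iou(pred, target)) + lam2 * (1.0 - nwd(pred, target))
```

Note that for two disjoint boxes the IoU term saturates at 1 while the NWD term still shrinks as the boxes approach, so the combined loss keeps a useful gradient signal for small, non-overlapping predictions.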
step 11: input the test set from step 2 into the trained model from step 10 to obtain the test results for UAV small-target detection, as shown in fig. 5.
For UAV aerial images processed by deep learning, which feature large variation in target scale, many small targets and large model parameter counts, the application proposes a lightweight detection method with multi-scale feature fusion. First, the K-means++ clustering algorithm yields an optimal anchor combination for the VisDrone UAV aerial image data set. In addition, on the backbone of the original YOLOv7 model, the downsampling ratio is reduced from 32× to 16×, which reduces the target information lost by small targets during downsampling, retains more semantic information and target features, and lowers the parameter count; experiments show the parameter count drops from 71.1M to 17.8M while small-target detection accuracy improves. To counter the interference of complex aerial-image backgrounds, the application introduces the multi-head self-attention mechanism MHSA, fuses it with C3 to design the deep feature extraction module C3MS, effectively alleviating the interference of irrelevant information from complex backgrounds, helping the network focus on small-target feature extraction and improving small-target detection accuracy.
In addition, because the IoU loss function adopted by the original model is very sensitive to positional differences between targets of different scales, the NWD Loss is introduced to compensate for this deficiency, and the NWD Loss and the IoU Loss are combined at a fixed weight ratio, retaining superior detection of large and medium targets while improving small-target detection. On the prediction side, the original 20×20×1024 detection head is discarded and a 160×160×256 detection head is added, which benefits the detection of very small targets in UAV aerial images.
While the foregoing describes illustrative embodiments of the application, the application is not limited to the scope of those embodiments; various changes apparent to those skilled in the art can be made within the spirit and scope of the application as defined by the appended claims, all of which are intended to be protected.
Claims (4)
1. A detection method for dense small targets in unmanned aerial vehicle aerial photography, characterized in that it is improved on the basis of YOLOv7 and, combining the target characteristics of UAV aerial images, establishes a lightweight backbone feature extraction network, fuses the multi-head self-attention mechanism MHSA and introduces the NWD Loss; the method comprises three parts: preprocessing the data set, constructing the improved YOLOv7 network, and training and testing the network:
the first part comprises three steps:
step 1: dividing a training set, a verification set and a test set by adopting an unmanned aerial vehicle aerial photography public data set VisDrone;
step 2: resize the obtained data set pictures to 640×640 pixels; apply Mosaic data enhancement, performing random flipping, scaling and color-gamut transformation on each training picture, and stitch four pictures together to obtain the final data set;
step 3: for the data set obtained in step 2, perform K-means++ clustering on its bounding boxes to obtain new anchor sizes; compare the results with the originally set anchors, compute the matching accuracy, and select the optimal anchor size setting;
the second part comprises three steps:
step 4: and establishing a lightweight trunk feature extraction network CSPDarknet-tiny. The downsampling multiplying power is reduced on the backbone network of the original YOLOv7, the downsampling is reduced from 32 times to 16 times, and the output characteristic diagrams comprise 160×160×256 characteristic diagrams map1, 80×80×512 characteristic diagrams map2 and 40×40×512 characteristic diagrams map3;
step 5: process the feature map map3 obtained in step 4 with SPPCSPC to obtain a 40×40×256 feature map P1;
step 6: and establishing a feature fusion network. In the feature extraction network of the neck, a path fusion network of the YOLOv7 is reserved, different feature layers and detection layers are fused, the FPN up-sampling conveys semantic features, and the PAN down-sampling conveys positioning features, and the implementation is as follows:
P1 obtained in step 5 is passed into the deep feature extraction module C3MS, which introduces the multi-head self-attention mechanism MHSA on the basis of C3 and effectively enhances the network's feature extraction capability, yielding feature map P2;
map1, map2 and map3 obtained in step 4 are fused through top-down and bottom-up paths, outputting the final feature maps P3, P4 and P5;
the third part comprises four steps:
step 7: the feature maps P3, P4 and P5 output by step 6 have their channel numbers adjusted by RepConv; three 1×1 convolution layers predict objectness, class and bbox; the detection heads used are head0 (40×40×512), head1 (80×80×256) and head2 (160×160×128);
step 8: adjust the network structure hyper-parameters and set the network model parameters: the number of training epochs is set to 200, the momentum to momentum = 0.937, and the initial learning rate to lr = 0.01;
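The hyper-parameters of step 8 map directly onto a PyTorch SGD configuration (the model here is a placeholder for the detector; weight decay is not specified in the patent):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the detection network
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
epochs = 200

print(opt.param_groups[0]["lr"], opt.param_groups[0]["momentum"], epochs)
```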
step 9: training an aerial target detection model by using a training set to obtain a prediction result of a target in each sample, wherein the prediction result comprises a target prediction boundary frame and a center point position of the prediction boundary frame;
step 10: calculate the total loss from the difference between the sample prediction results obtained in step 9 and the labels, and update the network model parameters based on the total loss to obtain the final trained model;
when the regression loss within the total loss is calculated, NWD Loss is introduced and combined with IoU Loss in a weighted proportion; the regression loss function is:

Loss_reg = α × Loss_IoU + β × Loss_NWD

where α and β are both taken as 0.5. This fully compensates for the shortcomings of the IoU metric in small target detection, retains the detection accuracy of the original model on large and medium targets, and significantly improves the detection capability of the model on small targets;
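A numeric sketch of the combined regression loss with both weights taken as 0.5 (the NWD term follows Wang et al.'s Normalized Gaussian Wasserstein Distance; the constant C is an assumed data-set-dependent scale, set here to 12.8):

```python
import math

def iou(b1, b2):
    # boxes as (x1, y1, x2, y2)
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def nwd(b1, b2, C=12.8):
    # boxes as (cx, cy, w, h), modeled as 2-D Gaussians; W2 is the
    # Wasserstein distance between the two Gaussians
    w2 = ((b1[0] - b2[0]) ** 2 + (b1[1] - b2[1]) ** 2
          + ((b1[2] - b2[2]) / 2) ** 2 + ((b1[3] - b2[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2) / C)

pred_cxcywh, gt_cxcywh = (10, 10, 4, 4), (11, 11, 4, 4)  # two small boxes
pred_xyxy, gt_xyxy = (8, 8, 12, 12), (9, 9, 13, 13)      # same boxes, corner form
loss = 0.5 * (1 - iou(pred_xyxy, gt_xyxy)) + 0.5 * (1 - nwd(pred_cxcywh, gt_cxcywh))
print(round(loss, 4))
```

Unlike IoU, the NWD term still gives a smooth, non-zero similarity for small boxes that barely overlap, which is the property the combination exploits.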
step 11: input the test set from step 2 into the trained model from step 10 to obtain the test results of unmanned aerial vehicle small target detection.
2. The unmanned aerial vehicle aerial photography-oriented detection method for dense small targets according to claim 1, characterized in that the lightweight backbone feature extraction network CSPDarknet-tiny established in step 4 reduces the downsampling rate of the original YOLOv7 backbone, so that more feature information is retained and the number of parameters is reduced while the detection accuracy for small targets is improved.
3. The unmanned aerial vehicle aerial photography-oriented detection method for dense small targets according to claim 1, characterized in that the deep feature extraction module C3MS in step 6, designed by introducing the multi-head self-attention mechanism MHSA on the basis of C3, reduces the interference of complex background noise in aerial images, making the backbone network focus more on extracting the feature information of small targets while ignoring irrelevant information.
4. The unmanned aerial vehicle aerial photography-oriented detection method for dense small targets according to claim 1, characterized in that step 10 introduces NWD Loss to compensate for the sensitivity of IoU Loss to the position deviations of targets of different scales, and the combination in a certain weight proportion significantly improves the detection accuracy for small targets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310235724.3A CN117197687A (en) | 2023-03-13 | 2023-03-13 | Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117197687A true CN117197687A (en) | 2023-12-08 |
Family
ID=88993052
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310235724.3A Pending CN117197687A (en) | 2023-03-13 | 2023-03-13 | Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117197687A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830813A (en) * | 2024-01-21 | 2024-04-05 | 昆明理工大学 | Small celestial body surface rock detection method |
CN117830813B (en) * | 2024-01-21 | 2024-06-11 | 昆明理工大学 | Small celestial body surface rock detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |