CN118015490A - Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment - Google Patents

Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment

Info

Publication number
CN118015490A
CN118015490A (application CN202410036593.0A)
Authority
CN
China
Prior art keywords: network, unmanned aerial vehicle, small target, YOLOv5
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410036593.0A
Other languages
Chinese (zh)
Inventor
吴国辉
王一同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Feihang Communication Equipment Co ltd
Original Assignee
Jiangxi Feihang Communication Equipment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Feihang Communication Equipment Co ltd filed Critical Jiangxi Feihang Communication Equipment Co ltd
Priority to CN202410036593.0A priority Critical patent/CN118015490A/en
Publication of CN118015490A publication Critical patent/CN118015490A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system and electronic equipment for detecting small targets in unmanned aerial vehicle aerial images, and relates to the technical field of target detection. The method comprises the following steps: acquiring an unmanned aerial vehicle aerial image to be identified in complex weather; and inputting the image into a small target recognition network to obtain the types of the small targets in the image, the types including vehicles and pedestrians. The small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head. The method improves the detection precision of small targets in unmanned aerial vehicle aerial images in complex weather.

Description

Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment
Technical Field
The invention relates to the technical field of target detection, in particular to a method, a system and electronic equipment for detecting small targets in unmanned aerial vehicle aerial images.
Background
In recent years, unmanned aerial vehicles (Unmanned Aerial Vehicles, UAVs) have been increasingly used in fields such as agricultural rice-spike detection, motion detection and city inspection owing to their portability and rapid deployment. However, factors such as long shooting distances and changeable illumination conditions mean that targets in unmanned aerial vehicle aerial images tend to be small, vary greatly in size, are densely distributed and are easily occluded, which increases the complexity of the background. In addition, practical unmanned aerial vehicle tasks often encounter complex weather conditions, foggy days in particular, which seriously degrade the quality of the images the unmanned aerial vehicle acquires and thus reduce target detection performance. Therefore, improving the detection of small targets in unmanned aerial vehicle aerial photography under complex weather has become a challenging research direction in the field of target detection.
Conventional unmanned aerial vehicle detection methods typically employ target detection algorithms based on handcrafted features; for example, some researchers combined the support vector machine (Support Vector Machine, SVM) with the histogram of oriented gradients (Histogram of Oriented Gradients, HOG) for small target detection of unmanned aerial vehicles. In practical application scenarios, however, such handcrafted-feature algorithms suffer from low stability and demanding requirements on the detection environment: when the illumination, pose or weather changes, the detection precision drops markedly. With the rise of deep learning, deep-learning-based target detection for unmanned aerial vehicles has become a research hotspot. Compared with detection based on handcrafted features, it offers a wider application range, more convenient design and simpler dataset construction. Deep-learning-based target detection methods fall mainly into two types. The first is the two-stage target detection algorithm, represented by Fast R-CNN and Faster R-CNN, which regresses target regions from generated candidate boxes; it achieves high detection precision and a low missed-detection rate, but its detection speed is low and its computational cost is large, making it difficult to apply to real-time detection. The second is the single-stage target detection algorithm, represented by YOLO (You Only Look Once) and the Single Shot MultiBox Detector (SSD), which predicts target positions and classes directly; it is fast and computationally light, but its detection capability for small targets is poor.
Aiming at this problem, many scholars have proposed deep-learning-based small target detection algorithms for unmanned aerial vehicles. Kisantal et al. augment each image by copying and pasting small objects multiple times so that the detector gives small objects more weight, improving the model's detection of them; however, although this data augmentation improves small target detection to a certain extent, it merely raises the proportion of small targets and lacks any fusion of semantic information. Lin et al. proposed the feature pyramid network (Feature Pyramid Network, FPN), which improves the network's representation by fusing shallow feature map information with deep feature map information, but this adds many parameters and slows inference. Liu et al. proposed the path aggregation network (Path Aggregation Network, PAN), which, building on FPN, transfers shallow information to the deep layers bottom-up more efficiently through upsampling and fusion, increasing inference speed and improving the detection of small objects. However, because the information in feature maps of different levels differs greatly, directly adding them or splicing them along the channel dimension during fusion easily produces redundant and noisy information, and small targets across levels are easily ignored. Wen et al. introduced the coordinate attention (Coordinate Attention, CA) mechanism into the module to strengthen the focus on small objects. Yang et al. employed a query mechanism to accelerate the inference of feature-pyramid-based object detectors. In summary, although existing small target detection algorithms improve detection performance to a certain extent, some shortcomings remain: 1) small targets are located inaccurately in complex weather, since they carry little feature information and are susceptible to pixel-level errors; 2) small target categories are confused in complex weather, since they resemble other object categories in the surrounding environment, which easily leads to misclassification; 3) the samples are unbalanced, since small objects in complex weather typically occupy a small proportion of the image, biasing algorithms toward larger, easier-to-identify objects.
Disclosure of Invention
The invention aims to provide a method, a system and electronic equipment for detecting small targets in unmanned aerial vehicle aerial images that improve the detection precision of small targets in such images in complex weather.
In order to achieve the above object, the present invention provides the following solutions:
An unmanned aerial vehicle aerial image small target detection method comprises the following steps:
acquiring an aerial image of the unmanned aerial vehicle to be identified in complex weather;
inputting the unmanned aerial vehicle aerial image to be identified in the complex weather into a small target recognition network to obtain the types of the small targets in the unmanned aerial vehicle aerial image to be identified in the complex weather; the types include: vehicles and pedestrians; the small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head.
Optionally, the improved YOLOv5 network includes: a backbone network, a neck network and a detection network;
the backbone network is used for extracting the characteristics of the unmanned aerial vehicle aerial image to be identified in the complex weather to obtain a plurality of characteristic images to be fused;
the neck network is used for fusing the feature images to be fused to obtain a plurality of fused feature images;
The detection network is used for carrying out small target identification based on each fused characteristic diagram.
Optionally, the backbone network includes: 5 CBS modules, 4 C3 modules, and 1 SPPF module.
Optionally, the neck network employs a cascade of a weighted cross-layer feature pyramid network and PANet.
Optionally, the neck network comprises: 2 first feature extraction structures, 1 second feature extraction structure, and 3 third feature extraction structures;
the first feature extraction structure includes: a CBS module, an upsampling module, a splicing module and a C3 module; the second feature extraction structure includes: a CBS module, an upsampling module, a splicing module and an RFFEB module; the third feature extraction structure includes: a CBS module, a splicing module and an RFFEB module;
the RFFEB module includes: five 1×1 convolutional layers, two 3×3 convolutional layers, one 5×5 convolutional layer, one 7×7 convolutional layer, one 3×3 convolutional layer with dilation rate 3, one 3×3 convolutional layer with dilation rate 5, and one 3×3 convolutional layer with dilation rate 7.
Optionally, the detection network includes: 4 CBS modules and 4 variable detection heads.
An unmanned aerial vehicle aerial image small target detection system, comprising:
The image acquisition module is used for acquiring an aerial image of the unmanned aerial vehicle to be identified in complex weather;
The small target detection module is used for inputting the unmanned aerial vehicle aerial image to be identified in the complex weather into a small target recognition network to obtain the types of the small targets in the unmanned aerial vehicle aerial image to be identified in the complex weather; the types include: vehicles and pedestrians; the small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head.
An electronic device comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the unmanned aerial vehicle aerial image small target detection method.
Optionally, the memory is a readable storage medium.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a method, a system and electronic equipment for detecting small targets in unmanned aerial vehicle aerial images. First, an unmanned aerial vehicle aerial image to be identified in complex weather is acquired; then, the image is input into a small target recognition network to obtain the types of the small targets in the image, the types including vehicles and pedestrians. The small target recognition network is obtained by training the improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by improving the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head. By using this improved YOLOv5 network to detect small targets in unmanned aerial vehicle aerial images, the invention improves the detection precision of small targets in such images in complex weather.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the unmanned aerial vehicle aerial image small target detection method provided in embodiment 1 of the present invention;
FIG. 2 is a diagram of the YOLOv5 network architecture;
FIG. 3 is a schematic diagram of the C3_N module structure;
FIG. 4 is a schematic diagram of the CBS module structure;
FIG. 5 is a schematic diagram of the SPPF module structure;
FIG. 6 is a schematic diagram of the improved YOLOv5 network architecture;
FIG. 7 is a schematic diagram of FPN + PANet;
FIG. 8 is a schematic diagram of BiFPN + PANet;
FIG. 9 is a schematic diagram of the CBAM module structure;
FIG. 10 is a schematic diagram of the RFFEB module structure;
FIG. 11 is a schematic diagram of the variable detection head structure;
FIG. 12 is a graph of the detection results of YOLOv5 in the densely distributed case;
FIG. 13 is a graph of the detection results of DIT-YOLOv5 in the densely distributed case;
FIG. 14 is a graph of the detection results of YOLOv5 against a complex background;
FIG. 15 is a graph of the detection results of DIT-YOLOv5 against a complex background;
FIG. 16 is a graph of the detection results of YOLOv5 with a dark background and low night light intensity;
FIG. 17 is a graph of the detection results of DIT-YOLOv5 with a dark background and low night light intensity;
FIG. 18 is a graph of the detection results of YOLOv5 for very small targets;
FIG. 19 is a graph of the detection results of DIT-YOLOv5 for very small targets;
FIG. 20 is a graph of the detection results of YOLOv5 for small targets in haze weather;
FIG. 21 is a graph of the detection results of DIT-YOLOv5 for small targets in haze weather;
FIG. 22 is a graph of the detection results of YOLOv5 for small targets on a road in thick fog;
FIG. 23 is a graph of the detection results of DIT-YOLOv5 for small targets on a road in thick fog;
FIG. 24 is a graph of the detection results of YOLOv5 for small targets on rural roads in mist;
FIG. 25 is a graph of the detection results of DIT-YOLOv5 for small targets on rural roads in mist;
FIG. 26 is a graph of the detection results of YOLOv5 in a scene with many vehicles;
FIG. 27 is a graph of the detection results of DIT-YOLOv5 in a scene with many vehicles.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a method, a system and electronic equipment for detecting small targets in aerial images of unmanned aerial vehicles, and aims to improve the detection precision of the small targets in the aerial images of the unmanned aerial vehicles in complex weather.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Fig. 1 is a schematic flow chart of a method for detecting a small target of an aerial image of an unmanned aerial vehicle according to embodiment 1 of the present invention. As shown in fig. 1, the method for detecting a small target of an aerial image of an unmanned aerial vehicle in this embodiment includes:
step 101: and acquiring an aerial image of the unmanned aerial vehicle to be identified in complex weather.
Step 102: inputting the unmanned aerial vehicle aerial image to be identified in the complex weather into a small target recognition network to obtain the types of the small targets in the image.
Wherein the types include: vehicles and pedestrians; the small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head.
Specifically, YOLO is a target detection model proposed by Joseph Redmon, Ali Farhadi et al. in 2015 and is one of the most popular single-stage target detection algorithms at present. Thanks to the modifications and innovations of many researchers, the YOLO series has developed continuously and is constantly updated. Different versions of the YOLO algorithm have their respective advantages and disadvantages; each version incorporates new techniques, enhances various functions and introduces different auxiliary technologies, such as data augmentation and loss functions, according to different application requirements. At present the YOLO algorithm has evolved to its eighth generation. In view of the requirements of unmanned aerial vehicle application scenarios and the maturity of the algorithm, the invention selects YOLOv5 as the version to improve. YOLOv5 is divided, from small to large, into the S, M, L and X specifications; the network structures of the four specifications are identical and differ only in network width and depth. The unmanned aerial vehicle aerial image target detection task involves complex backgrounds and many small target samples, which requires a certain network depth to extract feature information; at the same time, considering the payload of the unmanned aerial vehicle, a certain degree of network lightness is needed to balance detection speed and precision. The invention therefore selects the YOLOv5 algorithm of specification S, which has the highest detection speed and the fewest parameters, as the baseline method.
The overall YOLOv5 framework, shown in fig. 2, is divided into three sections: the backbone network (Backbone), the neck network (Neck) and the detection network (Head). The Backbone is responsible for extracting features from the input image and generating feature maps. As shown in fig. 3, the C3_N module is used for feature extraction and adopts a residual structure, so the size of the feature map is not changed. As shown in fig. 4, the CBS module encapsulates the combined operations of convolution, batch normalization and an activation function. As shown in fig. 5, the SPPF module implements adaptive-size output, converting a feature map of any size into a feature vector of fixed size. The Neck is responsible for fusing the feature maps generated by the backbone network to enhance overall network performance; Upsample in the Neck is an upsampling operation and Concat is a splicing operation. To enhance the feature extraction capability, the Neck adopts an FPN + PANet structure: the FPN transmits and fuses high-level feature information to the lower levels by top-down upsampling to improve the target detection effect, and the PANet enhances feature fusion on the basis of FPN by bottom-up downsampling. The Head performs the final regression prediction: downsampling by factors of 8, 16 and 32 generates the three detection scales P3, P4 and P5, used to identify small, medium and large targets respectively.
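For reference, a minimal PyTorch sketch of the CBS block just described (convolution, batch normalization, activation) is given below; the SiLU activation and the parameter names follow the common YOLOv5 convention and are assumptions, not details taken from the patent.

```python
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU: a sketch of the basic block described above,
    assuming the standard YOLOv5 convention (names are illustrative)."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```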
As an alternative embodiment, the improved YOLOv5 network includes: a backbone network, a neck network and a detection network.
The backbone network is used for extracting the characteristics of the unmanned aerial vehicle aerial image to be identified in complex weather, and a plurality of characteristic images to be fused are obtained.
The neck network is used for fusing the feature images to be fused to obtain a plurality of fused feature images.
The detection network is used for carrying out small target identification based on the feature graphs after fusion.
As an alternative embodiment, the backbone network comprises: 5 CBS modules, 4 C3 modules, and 1 SPPF module.
As an alternative embodiment, the neck network employs a cascade of a weighted cross-layer feature pyramid network and PANet.
As an alternative embodiment, the neck network comprises: 2 first feature extraction structures, 1 second feature extraction structure, and 3 third feature extraction structures.
The first feature extraction structure includes: a CBS module, an upsampling module, a splicing module and a C3 module; the second feature extraction structure includes: a CBS module, an upsampling module, a splicing module and an RFFEB module; the third feature extraction structure includes: a CBS module, a splicing module and an RFFEB module.
The RFFEB module includes: five 1×1 convolutional layers, two 3×3 convolutional layers, one 5×5 convolutional layer, one 7×7 convolutional layer, one 3×3 convolutional layer with dilation rate 3, one 3×3 convolutional layer with dilation rate 5, and one 3×3 convolutional layer with dilation rate 7.
As an alternative embodiment, the detecting network comprises: 4 CBS modules and 4 variable detection heads.
Specifically, as shown in figs. 6-11, the structure of the improved YOLOv5 network (Diminutive Target-YOLOv5, DIT-YOLOv5) is as follows:
The framework of the DIT-YOLOv5 algorithm is shown in fig. 6. The Neck part adopts a weighted cross-layer feature pyramid network (Bi-directional Feature Pyramid Network, BiFPN) structure, which effectively fuses features of different layers through bidirectional cross-scale connections and fast normalization. In addition, an RFFEB module incorporating an attention mechanism is introduced to improve the detection accuracy of small targets. One reason small object detection is poor in YOLOv5 is the small size of small object samples: by default, the YOLOv5 algorithm takes a 640 × 640 input and applies 8×, 16× and 32× downsampling, and the resulting deep feature maps make it difficult for the network to learn the feature information of small objects effectively. The downsampling strategy therefore needs to be adjusted to better suit small object datasets in complex weather. To solve this problem, DIT-YOLOv5 introduces a 4× downsampled small target detection layer to improve the recognition and detection of small targets, addressing the network's tendency to miss small targets in complex weather: the Head part adds a feature layer P2 obtained by 4× downsampling, and the regression detection head of the YOLOv5 network is replaced by the variable detection head VA-Head proposed by the invention, improving the network's localization of dense small targets in complex weather.
(1) A backbone network.
The backbone network comprises: 5 CBS modules, 4 C3 modules, and 1 SPPF module.
(2) A neck network.
One of the difficulties in small target detection in unmanned aerial vehicle aerial photography in complex weather is how to effectively represent and process multi-scale feature fusion. A common approach is first to scale all features to the same resolution and then add them. However, input features at different scales contribute differently to the output features. Moreover, during model training, as the number of downsampling steps applied to deep feature maps increases, the receptive fields grow large and overlap increasingly, so the extracted feature information lacks fine granularity and some spatial position information possessed only by shallow feature maps is ignored, which affects the localization and detection accuracy of small targets.
The key to solving this problem is to fuse the information of the shallow and deep feature maps effectively, generating final feature maps rich in both spatial position information and semantic information. YOLOv5 employs a cascade of FPN and PANet (as shown in fig. 7): the FPN communicates and merges feature maps of different levels by establishing context connections, while PANet further performs multi-scale context aggregation, selectively aggregating features of different levels through an adaptive feature aggregation module. However, during up-/down-sampling and tensor splicing, the weight distribution of the fused features may be inconsistent, giving a poor detection effect on small targets in complex weather. The improved YOLOv5 network therefore uses a weighted cross-layer feature pyramid network (BiFPN) in place of the FPN of the YOLOv5 network, giving the structure shown in fig. 8, and introduces learnable weights to distinguish the importance of different input features. This strengthens the influence of small target features on the feature fusion network in complex weather and improves the detection performance for small targets in complex weather.
BiFPN integrates bidirectional cross-scale connections with fast normalized fusion. In the weighted fusion scheme adopted, feature layers of different resolutions are fused by adding a weight to each input and letting the network adjust the fusion weights of the different inputs. The fast normalized feature fusion formula is:

$$O=\sum_{i}\frac{w_i}{\varepsilon+\sum_{j}w_j}\cdot I_i \tag{1}$$

where $O$ is the normalized fused feature; $\varepsilon$ is a small added value that keeps the result numerically stable; $w_i$ and $w_j$ are the learnable weights of the input features of the $i$-th and $j$-th layers, kept non-negative by a ReLU activation; $I_i$ is the input feature of the $i$-th layer.

In fast normalized fusion, feature fusion between the upper and lower layers is realized as follows (taking a level $i$ that has both top-down and bottom-up connections as an example):

$$P_i^{td}=\mathrm{Conv}\!\left(\frac{w_1\cdot P_i^{in}+w_2\cdot \mathrm{Resize}\!\left(P_{i+1}^{td}\right)}{w_1+w_2+\varepsilon}\right) \tag{2}$$

$$P_i^{out}=\mathrm{Conv}\!\left(\frac{w_1'\cdot P_i^{in}+w_2'\cdot P_i^{td}+w_3'\cdot \mathrm{Resize}\!\left(P_{i-1}^{out}\right)}{w_1'+w_2'+w_3'+\varepsilon}\right) \tag{3}$$

where $P_i^{td}$ is the intermediate feature of the top-down path at the $i$-th layer; $\mathrm{Conv}$ is a convolution operation; $w_1$, $w_2$ and $w_1'$, $w_2'$, $w_3'$ are learnable weights related to the input features, the primed weights being those updated after the previous layer is computed; $P_i^{in}$ is the input feature of the $i$-th layer; $\mathrm{Resize}$ is an up- or down-sampling operation; $P_{i+1}^{td}$ is the intermediate feature of the top-down path at layer $i+1$; $P_i^{out}$ is the output feature of the bottom-up path at the $i$-th layer; $P_{i-1}^{out}$ is the output feature of the bottom-up path at layer $i-1$.
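To make equation (1) concrete, the following is a minimal PyTorch sketch of fast normalized fusion; the class name, the ε value and the assumption that the inputs have already been resized to a common scale are illustrative choices, not the patent's implementation.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Weighted fusion of same-resolution feature maps, in the BiFPN style
    of Eq. (1); a sketch, assuming inputs share shape and channel count."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable w_i
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (self.eps + w.sum())      # fast normalization
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage: fuse two feature maps already resized to one scale; in the full
# BiFPN the fused result then passes through a conv block, as in Eqs. (2)-(3).
fuse = FastNormalizedFusion(num_inputs=2)
p_in = torch.randn(1, 256, 40, 40)
p_td = torch.randn(1, 256, 40, 40)
out = fuse([p_in, p_td])
```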
Compared with the node connections of the original YOLOv5 cascade of feature pyramid network and path aggregation network, BiFPN has the following advantages. (1) It reduces computational complexity and improves feature transfer efficiency: because nodes without feature fusion contribute very little to the feature network's transfer computation, the intermediate nodes of P2 and P5 can be removed, forming a small simplified bidirectional network and making transfer more efficient. (2) It realizes high-dimensional feature fusion: unlike the path aggregation network, which has only one top-down and one bottom-up feature path, the weighted bidirectional feature pyramid network BiFPN treats each bidirectional path as a feature network layer and repeats that layer several times, enabling deeper feature fusion and improving the network's expressive and feature extraction capability. (3) It effectively fuses features of different resolutions and enhances the sensitivity of the output features to small target detection: BiFPN cross-connects features in both directions and, through the normalization operation, lets each input matter differently to the detection network, thereby increasing the weight of small targets.
In common unmanned aerial vehicle aerial scenes, the scale of objects in the image varies. As the network structure deepens and convolutions accumulate, small targets in complex weather tend to lose a large amount of key feature information, making them difficult to detect and identify in high-level feature maps. It is therefore important to acquire feature information at various scales to improve the reliability of small target detection in complex weather. Although receptive field blocks (RFBs) can capture image features at various scales and obtain different receptive field sizes, the extracted feature information is broad and lacks attention to key details, so detection performance on small targets is not ideal.
To address this problem, the invention introduces a receptive field feature extraction module (Receptive Fields Feature Extraction Block, RFFEB) that integrates both channel and spatial attention mechanisms. The RFFEB module accounts for the feature variation between channels with different receptive fields to enhance the expression of feature information. The Convolutional Block Attention Module (CBAM) is an attention module that improves convolutional neural networks by combining channel and spatial attention, helping to capture the importance of each channel in the input feature map and to identify the importance of different positions on the feature map. As shown in fig. 9, introducing the CBAM module not only lets the RFB cover a larger area to obtain rich feature information, but also applies an attention mechanism to extract the key features from that information, enhancing the model's ability to detect multi-scale, dense small objects against a complex background.
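As a reading aid, here is a compact PyTorch sketch of the CBAM idea (channel attention followed by spatial attention); the reduction ratio r = 16 and the layer names are assumed defaults in the spirit of the published CBAM design, not values from the patent.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention then spatial attention; a sketch of the standard
    CBAM design with an assumed reduction ratio r = 16."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(),
            nn.Conv2d(channels // r, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        avg = self.mlp(x.mean((2, 3), keepdim=True))
        mx = self.mlp(x.amax((2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(s))
```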
As shown in fig. 10, the RFFEB structure consists of five parallel branches. The first branch consists of 1×1 and 3×3 convolutional layers and extracts information from the input feature map without dilation. The central three branches use dilation rates of 3, 5 and 7 respectively, each integrated with a CBAM module to collect comprehensive characterization information while emphasizing important details. The last branch contains only one 1×1 convolutional layer, to reduce the number of channels. The feature maps extracted by the first four branches are concatenated and added to the original input feature information carried by the fifth branch, forming a residual structure.
The RFFEB structure is implemented as follows:

$$X_1=\mathrm{Conv}_{3\times 3}\!\left(\mathrm{Conv}_{1\times 1}(X)\right) \tag{4}$$

$$X_2=\mathrm{DConv}_{3\times 3}^{r=3}\!\left(\mathrm{Conv}_{3\times 3}\!\left(\mathrm{Conv}_{1\times 1}(\mathrm{CBAM}(X))\right)\right) \tag{5}$$

$$X_3=\mathrm{DConv}_{3\times 3}^{r=5}\!\left(\mathrm{Conv}_{5\times 5}\!\left(\mathrm{Conv}_{1\times 1}(\mathrm{CBAM}(X))\right)\right) \tag{6}$$

$$X_4=\mathrm{DConv}_{3\times 3}^{r=7}\!\left(\mathrm{Conv}_{7\times 7}\!\left(\mathrm{Conv}_{1\times 1}(\mathrm{CBAM}(X))\right)\right) \tag{7}$$

$$X_5=\mathrm{Conv}_{1\times 1}(X) \tag{8}$$

where $X$ is the input feature; $X_1$ is the output of the branch consisting of 1×1 and 3×3 convolutional layers; $X_2$ is the output of the branch consisting of a CBAM block, 1×1 and 3×3 convolutional layers, and a 3×3 convolutional layer with dilation rate 3; $X_3$ is the output of the branch consisting of a CBAM block, 1×1 and 5×5 convolutional layers, and a 3×3 convolutional layer with dilation rate 5; $X_4$ is the output of the branch consisting of a CBAM block, 1×1 and 7×7 convolutional layers, and a 3×3 convolutional layer with dilation rate 7; $X_5$ is the output of the branch consisting of a single 1×1 convolutional layer; $\mathrm{DConv}^{r}$ denotes a dilated convolution with dilation rate $r$. The outputs $X_1$–$X_4$ are concatenated and added to $X_5$ to form the residual output.
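A minimal PyTorch sketch of the five-branch structure of equations (4)-(8) follows; the channel widths, the ReLU activations and the `attn` placeholder (standing in for the CBAM sketch above) are assumptions for illustration, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k=1, d=1):
    # conv + ReLU helper; padding chosen so the spatial size is preserved
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=d * (k - 1) // 2, dilation=d),
        nn.ReLU())

class RFFEB(nn.Module):
    """Five parallel branches with growing receptive fields (Eqs. 4-8).
    `attn` stands in for CBAM (see the sketch above); channel widths are
    illustrative, with c assumed divisible by 4."""
    def __init__(self, c: int, attn=nn.Identity):
        super().__init__()
        b = c // 4
        self.branch1 = nn.Sequential(conv(c, b), conv(b, b, 3))
        self.branch2 = nn.Sequential(attn(c), conv(c, b), conv(b, b, 3),
                                     conv(b, b, 3, d=3))
        self.branch3 = nn.Sequential(attn(c), conv(c, b), conv(b, b, 5),
                                     conv(b, b, 3, d=5))
        self.branch4 = nn.Sequential(attn(c), conv(c, b), conv(b, b, 7),
                                     conv(b, b, 3, d=7))
        self.branch5 = conv(c, c)  # 1x1 shortcut branch (Eq. 8)

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x),
                       self.branch3(x), self.branch4(x)], dim=1)
        return y + self.branch5(x)  # concat of X1..X4 plus X5, residual style

x = torch.randn(1, 64, 40, 40)
print(RFFEB(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```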
(3) A detection network.
The complexity of localization and classification in target detection arises essentially from the contradiction between translation invariance and scale invariance in convolutional neural networks. Real-world images typically contain multiple objects of different scales and sizes, which can appear in distinct shapes and positions under different viewpoints. To address this, the head of the object detector should have a certain spatial perceptual capability.
For the detection network, a new detection head named the variable head (VA-Head) is proposed. As shown in fig. 11, the detection head coherently combines multiple self-attention mechanisms, adapting the importance of feature levels (scale awareness) and of spatial positions (spatial awareness) to the input data.
For a given feature pyramid expressed as a four-dimensional tensor $F\in\mathbb{R}^{L\times H\times W\times C}$, where $L$ denotes the number of pyramid levels and $H$, $W$ and $C$ denote the height, width and number of channels of the features respectively, the self-attention expression is:

$$W(F)=\pi(F)\cdot F \tag{9}$$

where $W(\cdot)$ is the self-attention function and $\pi(\cdot)$ is the attention function.
The attention function is implemented through fully connected layers, but learning it directly over all dimensions wastes a great deal of computational resources. The invention therefore decomposes the attention function into three successive attentions, each focusing on a single perspective. This decomposition helps model the relationships between features of different levels and the scale differences of objects, improving the representation learning across levels and the scale awareness of target detection. Furthermore, the various geometric transformations of differently shaped objects relate to features at different spatial positions, so improving the representation learning of $F$ at different spatial positions also enhances the spatial awareness of object detection.
The three attentions are applied successively:

$$W(F)=\pi_C\!\left(\pi_S\!\left(\pi_L(F)\cdot F\right)\cdot F\right)\cdot F \tag{10}$$

where $\pi_C(\cdot)$ is the task-aware attention function; $\pi_L(\cdot)$ is the scale-aware attention function; $\pi_S(\cdot)$ is the spatial-aware attention function.
Sparse spatial attention is obtained through deformable convolution learning, sampling spatial positions with additional self-learned offsets. This not only applies attention to each spatial location but also adaptively aggregates multiple feature layers to learn more discriminative representations.
$$\pi_S(F)\cdot F=\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K}w_{l,k}\cdot F\!\left(l;\,p_k+\Delta p_k;\,c\right)\cdot \Delta m_k \tag{11}$$

where $L$ is the number of pyramid levels; $K$ is the number of sparse sampling positions; $w_{l,k}$ is the weight of the value at the $k$-th sparse sampling position of the $l$-th pyramid level; $F(\cdot)$ applies a different deformable convolution to the feature map of each level and unifies the feature maps of the different levels to the feature map dimension of the intermediate level; $\Delta p_k$ is the self-learned spatial offset of the $k$-th sparse sampling position; $c$ indexes the feature channel; $\Delta m_k$ is a self-learned importance scalar at position $p_k$.
Features are dynamically fused according to the semantic importance of different scales:

$$\pi_L(F)\cdot F=\sigma\!\left(f\!\left(\frac{1}{HW}\sum_{H,W}F\right)\right)\cdot F \tag{12}$$

where $f(\cdot)$ is a linear function approximated by a 1×1 convolutional layer and $\sigma(\cdot)$ is a sigmoid function.
The task-aware attention uses the dynamic ReLU activation function DyReLU-B: global average pooling is first performed over the spatial dimensions to reduce dimensionality, the result is then processed by two fully connected layers and a normalization layer, and finally the output is normalized to [-1, 1] using a shifted sigmoid function.
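The following PyTorch sketch illustrates two of the three decomposed attentions: the scale-aware gate of equation (12) and a DyReLU-B style task-aware step. The spatial-aware deformable sampling of equation (11) is omitted for brevity, and all names and layer sizes are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class ScaleAwareAttention(nn.Module):
    """Sketch of pi_L (Eq. 12): a per-level gate computed from globally
    pooled features through a 1x1 convolution approximating f(.)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):
        # feats: list of pyramid-level feature maps with equal channel count
        gates = [torch.sigmoid(self.fc(f.mean((2, 3), keepdim=True)))
                 for f in feats]
        return [g * f for g, f in zip(gates, feats)]

class TaskAwareDyReLU(nn.Module):
    """Sketch of the DyReLU-B style task-aware step: global average pooling,
    two fully connected layers, then max(a1*x + b1, a2*x + b2) per channel."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, 4 * channels))

    def forward(self, x):
        n, c, _, _ = x.shape
        theta = self.fc(x.mean((2, 3)))           # pool spatial dims -> (N, 4C)
        theta = 2.0 * torch.sigmoid(theta) - 1.0  # normalize to [-1, 1]
        a1, b1, a2, b2 = [t.view(n, c, 1, 1) for t in theta.chunk(4, dim=1)]
        return torch.max((a1 + 1.0) * x + b1, a2 * x + b2)

# Usage on a toy 3-level pyramid:
feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
feats = ScaleAwareAttention(64)(feats)
out = TaskAwareDyReLU(64)(feats[0])
```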
Experiments and analyses were also performed to verify the methods in the examples, as follows.
1. Data set and experimental environment
The invention selects the VisDrone2021 dataset and a self-built Foggy dataset for the experiments. The VisDrone2021 dataset, collected by the AISKYEYE team at Tianjin University, covers images taken at different places and heights using different models of unmanned aerial vehicle under different scenes, weather and lighting conditions. It includes 8599 pictures in total, divided into a training set (6471), a validation set (548) and a test set (1580). The Foggy dataset, collected by our team, covers more complex weather conditions, with images shot at different places using unmanned aerial vehicles of the same model. It includes 2066 pictures in total, divided into a training set (1275), a validation set (124) and a test set (667).
Cars and pedestrians account for most of the targets, while tricycles, buses and the like account for a very small proportion, and most targets are small; the target classes are therefore unevenly distributed, which poses a significant challenge for the experiments.
The experimental environment is based on a Windows system; model training and inference are carried out on an RTX 3080Ti GPU, the deep learning framework is PyTorch 1.12, and the CUDA version is 11.7. The total number of training rounds is set to 300 epochs with a learning rate of 0.01. The optimizer is stochastic gradient descent (Stochastic Gradient Descent, SGD) with momentum; the momentum parameter is set to 0.937 and the weight decay coefficient to 0.0005.
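The optimizer setup described above can be written as follows (a sketch; the `model` variable is a stand-in module, not the actual DIT-YOLOv5 network):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for the DIT-YOLOv5 network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,              # initial learning rate
                            momentum=0.937,       # momentum parameter
                            weight_decay=0.0005)  # weight decay coefficient
EPOCHS = 300  # total training rounds
```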
2. Experimental evaluation index
Recall (R), mean average precision (mAP), precision (P) and frames per second (FPS) are used as the evaluation indexes of the experiments.
The calculation expressions for P and R are:

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN} \tag{13}$$

where $TP$ is the number of real samples predicted and detected correctly; $FP$ is the number of false positive samples wrongly predicted as positive; $FN$ is the number of false negative samples missed in detection.
The mAP is calculated as follows:
$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_i \tag{14}$$

where $i$ is the serial number of the category; $AP_i$ is the average precision of the $i$-th category; $N$ is the total number of categories.
The average accuracy (Average Precision, AP) is considered as the area under a P-R curve of a certain class, and the formula is as follows:
$$AP=\int_{0}^{1}P(R)\,\mathrm{d}R \tag{15}$$
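A small Python sketch of how equations (13)-(15) are evaluated in practice; the counts and the trapezoidal integration of the P-R curve are illustrative conventions, not the patent's evaluation code.

```python
def precision_recall(tp: int, fp: int, fn: int):
    p = tp / (tp + fp) if tp + fp else 0.0  # Eq. (13): P = TP / (TP + FP)
    r = tp / (tp + fn) if tp + fn else 0.0  # Eq. (13): R = TP / (TP + FN)
    return p, r

def average_precision(precisions, recalls):
    """Eq. (15): area under the P-R curve, integrated trapezoidally."""
    return sum((recalls[i] - recalls[i - 1])
               * (precisions[i] + precisions[i - 1]) / 2
               for i in range(1, len(recalls)))

def mean_average_precision(aps):
    return sum(aps) / len(aps)  # Eq. (14): mean of the per-class APs

print(precision_recall(tp=80, fp=10, fn=20))  # (0.888..., 0.8)
```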
3. Ablation experiment and result analysis
To verify the effect of each module on model performance, a series of ablation experiments was performed on the VisDrone2021 dataset based on the original YOLOv5 algorithm. In these experiments the improved modules and methods proposed by the invention were added in turn to evaluate their impact on model performance. To ensure the accuracy of the experiments, no pre-trained model was used; the batch size was set to 32 and training ran for 300 epochs. First, a new detection branch P2 was added to the original YOLOv5; then the feature pyramid structure in the Neck was replaced by the BiFPN structure; next the attention-equipped feature fusion module RFFEB was added; and finally the coupled prediction head was replaced by the VA-Head detection head. To ensure fair comparison, the experiments added only the corresponding modules one by one, without changing the optimization method or hyperparameters. The experimental results are shown in Table 1.
According to the ablation results, the model improvements of the invention markedly raise small target detection accuracy. First, the comparison shows that the new detection branch P2 brings the most obvious improvement in the mAP50 index, reaching 4.5 percentage points. This indicates that the new detection branch is very effective for small target detection: setting the anchor boxes of the new P2 branch to small target sizes greatly reduces the missed detections caused by anchors that are set too large. Second, the RFFEB structure raises the mAP50 index by 2.7 percentage points, showing that a feature fusion module with an attention mechanism increases the attention paid to small targets and thus improves the detection effect. In addition, the VA-Head detection head improves mAP50 by 1.6 percentage points, indicating that introducing attention mechanisms in the three dimensions of scale, space and task improves detection accuracy. Finally, the BiFPN structure improves mAP50 by 0.6 percentage points, showing that combining multi-level information fusion with shallow shape and size information benefits small target detection. The table also shows that the detection of small targets such as awning-tricycles, bicycles and vans improves clearly after each model modification, indicating that every added module has a positive influence on small target detection.
Table 1 Results of the ablation experiments on the VisDrone2021 dataset
4. Comparative experiments
To verify the performance of the DIT-YOLOv5 algorithm, the experiment compares it with mainstream target detection algorithms. The results, shown in Table 2, cover the classical two-stage detector Light-RCNN and popular branches of single-stage detection, namely the QueryDet and TPH-YOLOv5 algorithms. At AP50, DIT-YOLOv5 is 30.68, 8.84 and 24.51 percentage points higher than RetinaNet, YOLOX and Cascade R-CNN respectively, and its detection precision is similar to that of the TPH-YOLOv5 algorithm. At AP75, the mAP of DIT-YOLOv5 reaches 39.95%, which is 1.26 and 8.54 percentage points higher than TPH-YOLOv5 and YOLOX respectively.
Table 2 Results of the comparative experiments between DIT-YOLOv5 and mainstream methods
The experimental results show that DIT-YOLOv5 outperforms the comparison algorithms at both AP thresholds, with a particularly clear improvement at AP50, demonstrating its strength in small target detection. Overall, DIT-YOLOv5 shows definite advantages over other advanced algorithms and is well suited to unmanned aerial vehicle aerial image target detection tasks.
5. Detection effect analysis
To verify the detection effect of the DIT-YOLOv5 algorithm in actual scenes, the experiment examined its detection capability in special scenarios: unmanned aerial vehicle aerial images were selected from the VisDrone2021 dataset under conditions of dense distribution, complex background, extremely small targets and dark background, and tested with YOLOv5 and DIT-YOLOv5. The test results are shown in figs. 12-19.
As can be seen from figs. 12 and 13, under dense distribution YOLOv5 fails to identify the pedestrians among the distant dense vehicles, whereas DIT-YOLOv5 identifies them accurately. As can be seen from figs. 14 and 15, against a complex background YOLOv5 is misled by the complex background information and wrongly recognizes the eaves as a car, whereas DIT-YOLOv5 correctly recognizes the objects in the picture without such interference. As can be seen from figs. 16 and 17, with a dark background and low night illumination YOLOv5 fails to identify the pedestrian beside the red vehicle, while DIT-YOLOv5 misses nothing. Finally, as can be seen from figs. 18 and 19, for extremely small targets DIT-YOLOv5 learns the features more fully and, unlike YOLOv5, does not misdetect the black object on the roof as a motorcycle.
To further verify the detection effect of the DIT-YOLOv5 algorithm in complex weather scenes, the experiment examined its detection capability under different weather conditions, selecting unmanned aerial vehicle aerial images in complex weather from the Foggy test set, including a T-junction in haze, a road in dense fog, a rural road in mist and a scene with many vehicles. Tests were performed with YOLOv5 and DIT-YOLOv5 respectively, and the results are compared in figs. 20-27.
As can be seen from figs. 20 and 21, at the centre of the T-junction where the haze is thicker, YOLOv5 fails to recognize the vehicle with the smaller outline in the distance, whereas DIT-YOLOv5 recognizes it accurately. As can be seen from figs. 22 and 23, on the road in dense fog YOLOv5, disturbed by complex background information, wrongly recognizes the antenna on a car as a pedestrian and misses the pedestrian deep in the fog, whereas DIT-YOLOv5 correctly recognizes the targets in the picture without such interference. As can be seen from figs. 24 and 25, on the rural road in mist YOLOv5 wrongly recognizes small objects on both sides of the road as a car and a motorcycle, while DIT-YOLOv5 shows no such recognition errors. Finally, as can be seen from figs. 26 and 27, in the foggy scene with many targets to detect, DIT-YOLOv5 learns the features more fully and identifies almost all the targets, whereas YOLOv5 fails to identify the truck on the road.
These comparisons show that, through multi-level feature fusion, DIT-YOLOv5 obtains more complete small target feature information and extracts from the large amount of multi-scale information the information useful for target localization and classification; compared with YOLOv5, it reduces missed detections and false alarms and has a stronger ability to identify small targets in unmanned aerial vehicle images against complex backgrounds.
Finally, the experiment tested DIT-YOLOv5 on unmanned aerial vehicle aerial images in multiple complex weather scenarios from the public VisDrone test set and the self-built Foggy test set. The test sets include daytime and night scenes, dense fog and mist scenes, and so on; for densely distributed small objects such as cars and pedestrians, DIT-YOLOv5 locates the targets accurately.
Against complex backgrounds the method excludes the influence of interfering objects such as trees and buildings and classifies and locates targets accurately. In general, the DIT-YOLOv5 algorithm shows an excellent detection effect in actual scenes under different illumination conditions, backgrounds and target distributions, and meets the requirements of unmanned aerial vehicle aerial image target detection tasks.
Example 2
The unmanned aerial vehicle aerial image small target detection system in this embodiment includes:
The image acquisition module is used for acquiring the unmanned aerial vehicle aerial image to be identified in complex weather.
The small target detection module is used for inputting the unmanned aerial vehicle aerial image to be identified in the complex weather into the small target recognition network to obtain the types of the small targets in the image; the types include: vehicles and pedestrians; the small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head.
Example 3
An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the unmanned aerial vehicle aerial image small target detection method of embodiment 1.
As an alternative embodiment, the memory is a readable storage medium.
In the present specification, each embodiment is described in a progressive manner, each focusing on its differences from the other embodiments; for the identical and similar parts, the embodiments may refer to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention are described herein with reference to specific examples; the description is intended only to help in understanding the method of the present invention and its core ideas. Meanwhile, a person of ordinary skill in the art may, in light of these teachings, make modifications to the specific embodiments and the scope of application. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (9)

1. The unmanned aerial vehicle aerial image small target detection method is characterized by comprising the following steps of:
acquiring an aerial image of the unmanned aerial vehicle to be identified in complex weather;
inputting the unmanned aerial vehicle aerial image to be identified in the complex weather into a small target recognition network to obtain the types of the small targets in the unmanned aerial vehicle aerial image to be identified in the complex weather; the types include: vehicles and pedestrians; the small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head.
2. The unmanned aerial vehicle aerial image small target detection method of claim 1, wherein the improved YOLOv5 network comprises: a backbone network, a neck network and a detection network;
the backbone network is used for extracting the characteristics of the unmanned aerial vehicle aerial image to be identified in the complex weather to obtain a plurality of characteristic images to be fused;
the neck network is used for fusing the feature images to be fused to obtain a plurality of fused feature images;
The detection network is used for carrying out small target identification based on each fused characteristic diagram.
3. The unmanned aerial vehicle aerial image small target detection method of claim 2, wherein the backbone network comprises: 5 CBS modules, 4 C3 modules, and 1 SPPF module.
4. The unmanned aerial vehicle aerial image small target detection method of claim 2, wherein the neck network employs a cascade of a weighted cross-layer feature pyramid network and PANet.
5. The unmanned aerial vehicle aerial image small target detection method of claim 2, wherein the neck network comprises: 2 first feature extraction structures, 1 second feature extraction structure, and 3 third feature extraction structures;
the first feature extraction structure includes: a CBS module, an upsampling module, a splicing module and a C3 module; the second feature extraction structure includes: a CBS module, an upsampling module, a splicing module and an RFFEB module; the third feature extraction structure includes: a CBS module, a splicing module and an RFFEB module;
the RFFEB module includes: five 1×1 convolutional layers, two 3×3 convolutional layers, one 5×5 convolutional layer, one 7×7 convolutional layer, one 3×3 convolutional layer with dilation rate 3, one 3×3 convolutional layer with dilation rate 5, and one 3×3 convolutional layer with dilation rate 7.
6. The unmanned aerial vehicle aerial image small target detection method according to claim 2, wherein the detection network comprises: 4 CBS modules and 4 variable detection heads.
7. An unmanned aerial vehicle aerial image small target detection system, the system comprising:
The image acquisition module is used for acquiring an aerial image of the unmanned aerial vehicle to be identified in complex weather;
The small target detection module is used for inputting the unmanned aerial vehicle aerial image to be identified in the complex weather into a small target recognition network to obtain the types of the small targets in the unmanned aerial vehicle aerial image to be identified in the complex weather; the types include: vehicles and pedestrians; the small target recognition network is obtained by training an improved YOLOv5 network with a plurality of unmanned aerial vehicle aerial images in complex weather and their corresponding small target types, and the improved YOLOv5 network is obtained by modifying the YOLOv5 network with an attention mechanism, a weighted cross-layer feature pyramid network and a variable detection head.
8. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the unmanned aerial vehicle aerial image small target detection method of any of claims 1 to 6.
9. The electronic device of claim 8, wherein the memory is a readable storage medium.
CN202410036593.0A 2024-01-10 2024-01-10 Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment Pending CN118015490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410036593.0A CN118015490A (en) 2024-01-10 2024-01-10 Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410036593.0A CN118015490A (en) 2024-01-10 2024-01-10 Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment

Publications (1)

Publication Number Publication Date
CN118015490A (en) 2024-05-10

Family

ID=90942084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410036593.0A Pending CN118015490A (en) 2024-01-10 2024-01-10 Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment

Country Status (1)

Country Link
CN (1) CN118015490A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155105A (en) * 2024-05-13 2024-06-07 齐鲁空天信息研究院 Unmanned aerial vehicle mountain area rescue method, unmanned aerial vehicle mountain area rescue system, unmanned aerial vehicle mountain area rescue medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN110837778B (en) Traffic police command gesture recognition method based on skeleton joint point sequence
CN111695448B (en) Roadside vehicle identification method based on visual sensor
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN110532961B (en) Semantic traffic light detection method based on multi-scale attention mechanism network model
CN111461221B (en) Multi-source sensor fusion target detection method and system for automatic driving
CN115346177A (en) Novel system and method for detecting target under road side view angle
CN113095152B (en) Regression-based lane line detection method and system
CN114049572A (en) Detection method for identifying small target
CN118015490A (en) Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment
CN113313094B (en) Vehicle-mounted image target detection method and system based on convolutional neural network
CN117496384B (en) Unmanned aerial vehicle image object detection method
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
CN115937736A (en) Small target detection method based on attention and context awareness
CN113361528B (en) Multi-scale target detection method and system
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN114550023A (en) Traffic target static information extraction device
CN113869239A (en) Traffic signal lamp countdown identification system and construction method and application method thereof
Xie et al. Semantic Segmentation Algorithm for Night Traffic Scene Based on Visible and Infrared Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination