CN115810157A - Unmanned aerial vehicle target detection method based on lightweight feature fusion

Info

Publication number
CN115810157A
CN115810157A (application CN202211633735.9A)
Authority
CN
China
Prior art keywords
feature, layer, feature layer, unmanned aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211633735.9A
Other languages
Chinese (zh)
Inventor
周鹏 (Zhou Peng)
曹杰 (Cao Jie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202211633735.9A
Publication of CN115810157A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an unmanned aerial vehicle target detection method based on lightweight feature fusion, which can be used for target recognition in images captured by unmanned aerial vehicle (UAV) aerial photography. To address the problems that current target detection networks have low detection precision for small targets, large parameter counts and difficulty in achieving real-time detection, the neck feature extraction module of YOLOv4-tiny is redesigned using Depthwise Separable Convolution (DSC) and Coordinate Attention (CA); an SPPF feature extraction module is then used to enlarge the receptive field of the extracted features while effectively keeping the model computationally lightweight; finally, a Decoupled Head detection head module gives the finally extracted feature information stronger spatial association, so that the background and the target to be detected can be distinguished more easily, thereby constructing a brand-new lightweight target detection network.

Description

Unmanned aerial vehicle target detection method based on lightweight feature fusion
Technical Field
The invention relates to the technical field of target detection, in particular to an unmanned aerial vehicle target detection method based on lightweight feature fusion.
Background
With the continuous development of deep learning, target detection has become one of the most popular research directions. Target detection algorithms can be applied to traffic, industrial inspection, face recognition, military surveillance and other scenarios of great practical significance. Common target detection algorithms are generally divided into two-stage and single-stage algorithms. Two-stage algorithms, mainly represented by Fast R-CNN and related detectors, achieve high average detection precision with low false-detection and missed-detection rates, but their large parameter counts and heavy computation make real-time operation difficult. Single-stage algorithms, represented by the YOLO series and SSD, fuse feature extraction and prediction-box localization directly; they offer high detection speed, low model complexity and easy deployment, and can meet real-time detection requirements.
Compared with existing general target detection, unmanned aerial vehicle target detection overcomes defects such as a small field of view, frequent lateral occlusion of detected targets and difficulty in changing viewpoints; it offers laterally unobstructed high-altitude views, a wide monitoring range, rapid viewpoint changes and other advantages, can be applied to target detection in military and various hazardous scenarios, and makes the application of target detection algorithms safer. Target detection based on UAV aerial images has therefore become a research hotspot.
However, UAV aerial images also present a series of problems, such as small target size, complex backgrounds and blurred appearance, which reduce detection accuracy. The hardest of these is the low precision of small-target detection. Network training and target prediction for small targets depend mainly on feature extraction and feature fusion: because small targets occupy few pixels in the original image, they carry limited information and lack appearance cues such as texture, shape and color, and the down-sampling introduced by deep convolutions disperses or even eliminates their feature information in deep feature maps. Enriching the network's context semantic information, position information and feature representation has therefore become the research focus of UAV small-target detection. Meanwhile, the memory and computing resources of a UAV are limited, so a detection algorithm for UAV aerial images must control the growth of model parameters and keep the model reasonably lightweight in order to maintain real-time detection speed.
Current deep-learning UAV target detection techniques combine UAV technology with the characteristics of aerial images to improve classic deep-learning detection algorithms. For applications with high speed requirements, most algorithms improve the YOLO series in combination with the characteristics of UAV detection. To cope with the limited computing power of UAVs, Zhang et al. pruned the original network on the basis of YOLOv3 to obtain Slim YOLOv3, with essentially unchanged precision but greatly reduced parameters, memory footprint and inference time, making the improved model easier to deploy on a UAV; Zhu et al. proposed TPH-YOLOv5, which replaces the original YOLO detection heads with Transformer prediction heads on the basis of YOLOv5 and increases their number from three to four, and additionally integrates a Convolutional Block Attention Module (CBAM) attention mechanism in the feature fusion stage, greatly increasing detection precision.
However, most UAV aerial target detection algorithms focus mainly on improving either detection precision or detection speed, and do not balance the two well. Designing an algorithm that effectively balances UAV target detection precision and detection speed would accelerate the application of UAVs in the target detection field and is therefore an important research topic.
In general, UAV images are characterized by high scene complexity, because the shooting environment is affected by factors such as atmospheric temperature and lighting conditions, and the foreground objects in the images vary in type, shape and size, which makes target recognition difficult; accurately detecting objects of interest in UAV images is therefore a challenging task. Because the limited computing resources and capability of UAV on-board equipment conflict with the requirement of real-time detection, balancing detection precision against detection speed has always been a challenge for researchers and developers, and the core research problem is how to balance the precision and speed of a deep-learning target detection model under hardware constraints. At present, however, most UAV aerial target detection algorithms aim mainly at improving either detection precision or detection speed, and do not balance the two well.
Disclosure of Invention
The present invention has been made to solve the above problems in the prior art. The invention provides an unmanned aerial vehicle target detection method based on lightweight feature fusion, which can be used for target recognition in images captured by UAV aerial photography. To address the problems that current target detection networks have low detection precision for small targets, large parameter counts and difficulty in achieving real-time detection, the neck feature extraction module of YOLOv4-tiny is first redesigned using Depthwise Separable Convolution (DSC) and Coordinate Attention (CA); an SPPF feature extraction module is then used to enlarge the receptive field of the extracted features while effectively keeping the model computationally lightweight; finally, a Decoupled Head detection head module gives the finally extracted feature information stronger spatial association, so that the background and the target to be detected can be distinguished more easily, thereby constructing a brand-new lightweight target detection network.
The invention specifically adopts the following technical scheme:
an unmanned aerial vehicle target detection method based on lightweight feature fusion, the method comprising:
step 1, a training data set is obtained, wherein the training data set comprises a small target public data set for aerial photography of an unmanned aerial vehicle;
step 2, preprocessing the picture size of the training data set and applying mosaic data enhancement: a plurality of pictures in the training data set are randomly split and flipped, placed at the corresponding split positions, and finally subjected to color-gamut transformation and affine transformation to obtain training samples;
step 3, based on the training sample, adjusting the size of a candidate frame of network training, and calculating the optimal number of candidate frames required by prediction of each feature extraction layer;
step 4, taking the training sample as an input image, extracting semantic feature information of the input image through the 3 × 3 convolution module in a CBL module and raising the channel dimension of the extracted features to 64, normalizing the extracted features with the Batch Normalization layer in the CBL module, and finally enhancing the nonlinearity of the network model with the LeakyReLU activation function in the CBL module;
step 5, extracting image information features using three CSP residual modules CSPBlock1, CSPBlock2 and CSPBlock3, where the intermediate modules CSPBlock2 and CSPBlock3 each output a feature layer, namely the second feature layer and the third feature layer;
step 6, based on an attention mechanism, respectively performing attention weight distribution on the second characteristic layer and the third characteristic layer output in the step 5 to update the second characteristic layer and the third characteristic layer;
step 7, carrying out receptive field information fusion at different scales on the third feature layer with a feature fusion module, and updating the third feature layer;
step 8, performing feature re-extraction on the extracted feature layer information to generate a first feature layer, and updating a second feature layer and a third feature layer;
step 9, performing final prediction analysis on the first feature layer, the second feature layer and the third feature layer;
and step 10, debugging the network structure hyper-parameters from the step 4 to the step 9, and setting network model parameters to train the network model to obtain a final training model.
Further, the preprocessing the picture size of the training data set includes:
the picture size of the training data set is adjusted to 640 x 640 pixels.
Further, in step 7, the feature fusion module is adopted to perform the receptive field information fusion of different scales on the third feature layer, and update the third feature layer, which specifically includes:
step 7-1, performing feature re-extraction on the feature information of the third feature layer obtained in step 6 with ThreeConv, which consists of three convolutions: a 1 × 1 convolution kernel first reduces the dimensionality of the third feature layer, a 3 × 3 depthwise separable convolution then extracts features, and finally a 1 × 1 convolution adjusts the dimension of the output feature layer to update the third feature layer;
step 7-2, feature re-extraction based on a cascaded feature extraction module SPPF: the SPPF module is formed by cascading three 5 × 5 maximum pooling layers, producing receptive field information equivalent to maximum pooling of sizes 5 × 5, 9 × 9 and 13 × 13; the updated third feature layer is re-extracted through the SPPF module, and the outputs of the pooling layers are Concat-spliced with the input feature information of the SPPF module to further update the third feature layer;
and 7-3, performing feature re-extraction on the updated third feature layer by using ThreeConv, and finally adjusting the dimension of the output feature layer by using a convolution of 1 × 1 in ThreeConv to further update the third feature layer.
Further, in step 8, feature re-extraction is performed on the extracted feature layer information to generate a first feature layer, and the second feature layer and the third feature layer are updated; the method specifically comprises the following steps:
step 8-1, adding a down-sampling feature extraction layer in a network backbone extraction module, and fusing space and position information based on an attention mechanism to generate a first feature layer;
step 8-2, carrying out deep layer feature fusion again on the first feature layer, the second feature layer and the third feature layer through a bottom-up path so as to update the second feature layer and the third feature layer;
and 8-3, adding a top-down path, fusing the first characteristic layer and the second characteristic layer again, and simultaneously not fusing the second characteristic layer and the third characteristic layer to update the first characteristic layer.
Further, in step 8-2, performing deep feature fusion again on the first feature layer, the second feature layer, and the third feature layer through a bottom-up path to update the second feature layer and the third feature layer, specifically including:
carrying out up-sampling on the third feature layer by 2-times nearest-neighbour interpolation to obtain a 40 × 40 feature map, then applying a 1 × 1 convolution, normalization and a LeakyReLU activation function to obtain a feature layer with 512 channels and unchanged size, applying the CA attention mechanism to it, and Concat-fusing it with the second feature layer to update the third feature layer;
the second feature layer is up-sampled by twice the adjacent interpolation, then convolution with 1 × 1 is performed to adjust the number of channels to 258, the CA attention mechanism is used for the second feature layer, and finally the second feature layer is fused with the first feature layer by Concat to update the second feature layer.
Further, in step 8-3, a top-down path is added to merge the first feature layer with the second feature layer again, and the second feature layer and the third feature layer are not used for merging at the same time, so as to update the first feature layer, which specifically includes:
performing down-sampling convolution with a stride of 2 on the first feature layer to obtain a 40 × 40 feature map, then applying a 1 × 1 convolution, normalization and a LeakyReLU activation function to obtain a feature map with 256 channels and unchanged size, and finally fusing it with the second feature layer by Concat to update the first feature layer.
Further, the step 9 of performing final prediction analysis on the first feature layer, the second feature layer, and the third feature layer specifically includes:
extracting classification information from the first, second and third feature layers through two 1 × 1 convolutions followed by 3 × 3 convolutions in the first branch of the Decoupled Head module, extracting position information with a 3 × 3 convolution following the two 1 × 1 convolutions of the second branch, and extracting confidence information with the 3 × 3 convolution of a further branch after the second branch, thereby completing the classification and regression tasks of small-target detection respectively.
Further, the step 10 of debugging the network structure hyperparameters from the step 4 to the step 9 and setting network model parameters to train the network model to obtain a final training model specifically includes:
setting the number of network training epochs to 300 and adopting transfer-learning training: the backbone network is frozen for the first 60 epochs with the learning rate set to 0.001, a pre-training model is loaded and the batch size is set to 16; the backbone network is unfrozen for the last 240 epochs with the learning rate set to 0.0001 and the batch size set to 8; the learning rate decays to 0.937 of its previous value after each epoch, and the final training model is obtained after 300 epochs of training.
Further, the method further comprises:
and 11, selecting partial data from the training data set in the step 1 as a test set, and inputting the test set into the training model in the step 10 to obtain a test result of the unmanned aerial vehicle small target detection.
Further, images with complex natural scenes, various angles and numerous small targets are selected from the training data set in the step 1 to serve as a test set.
The invention has at least the following technical effects:
(1) The reference model of YOLOv4-UAV is the YOLOv4-tiny network; compared with other one-stage target detection networks, YOLOv4-tiny achieves the best detection speed while keeping good detection precision, and can well meet the real-time requirement of UAV target detection.
(2) After features are extracted by the trunk feature extraction module (backbone), the YOLOv4-UAV algorithm re-extracts them with the SPPF feature fusion module at the 32× down-sampling level, which effectively enlarges the receptive field of the features, alleviates the multi-scale problem of the target to a certain extent, and reasonably reduces the computation of the model while keeping its detection precision unchanged.
(3) A new neck feature fusion module, CPAN, is proposed; its main body consists of a bottom-up up-sampling splicing path and a top-down down-sampling splicing path. By using an 8× down-sampled feature detection layer, Coordinate Attention and depthwise separable convolution, it can effectively address the low precision of small-target detection in UAV aerial images, the difficulty the detection network has in distinguishing background from foreground, the low confidence of detected targets, and the huge computational load that makes real-time detection difficult.
(4) The Decoupled Head detection head module effectively addresses the lack of spatial understanding between the target classification task and the localization task, so that the finally extracted feature information has good spatial association; the background and the target to be detected can be distinguished more easily, the position of the whole object can be regressed robustly, and complete targets can be well distinguished from partial targets.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 shows a network block diagram of a YOLOv4-UAV and a CPAN therein according to an embodiment of the invention;
FIG. 2 illustrates a block diagram of an SPPF according to an embodiment of the present invention;
FIG. 3 illustrates a block diagram of a CA attention mechanism, according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a Decoupled Head module according to an embodiment of the present invention;
FIG. 5 shows a graph of detection results of a DOTA dataset according to an embodiment of the invention;
FIG. 6 shows a detection result display diagram of the VisDrone data set according to the embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. The order in which the steps are described herein is exemplary; where no dependency exists between steps, it should not be construed as a limitation, and one skilled in the art will appreciate that the order may be adjusted without destroying the logical relationship between the steps or rendering the overall process impractical.
The embodiment of the invention provides an unmanned aerial vehicle target detection method based on lightweight feature fusion. The reference model adopted by the method, YOLOv4-tiny, is a one-stage target detection method that performs adaptive scaling, data enhancement and the like on an image and divides the image proportionally into several grids of equal size; if the center of a target falls within a grid, that grid is responsible for predicting the position and category of the target. The method has the advantages of a simple model structure, high detection speed and the like.
The YOLOv4-UAV network is mainly divided into three modules: a network trunk feature extraction module (backbone), a neck feature fusion module (neck) and a detection head module (yolo head). The backbone uses CSPDarknet53-tiny as its main body, which simplifies the network structure while retaining the excellent detection precision of the CSPDarknet53 module and effectively increases the detection speed of the model. After features are extracted by the backbone, the YOLOv4-UAV algorithm re-extracts them with the SPPF feature fusion module at the 32× down-sampling level, which alleviates the multi-scale problem of the target to a certain extent. Because the SPPF module is equivalent to extracting features with maximum pooling layers of different sizes, it can effectively enlarge the receptive field of the features. The internal structure of the SPPF module is designed around the receptive field: three cascaded 5 × 5 pooling layers produce maximum pooling equivalent to 5 × 5, 9 × 9 and 13 × 13 for feature re-extraction, which reasonably reduces the computation of the model while keeping its detection precision unchanged. To address the numerous small targets, complex backgrounds and difficulty of real-time detection in UAV images, a new neck feature fusion module, CPAN, is proposed, whose main body consists of a bottom-up up-sampling splicing path and a top-down down-sampling splicing path. In addition to the 32× and 16× down-sampled output feature layers of the YOLOv4-tiny network, the CPAN module also uses a feature layer output at 8× down-sampling; because lower feature layers generally have smaller receptive fields and are suited to predicting smaller targets, this enhances the extraction of semantic information of small targets in the image. The CPAN module is further constructed with Coordinate Attention, which weights and fuses spatial information and channel features, taking both channel information and position information into account so that the model can better locate objects of interest against the complex background of UAV aerial images. Finally, to address the difficulty of real-time detection, traditional standard convolutions are replaced with Depthwise Separable Convolutions (DSC), effectively reducing the computational parameters.
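To make the parameter saving concrete, the following is a minimal PyTorch sketch of a depthwise separable convolution of the kind used to replace standard 3 × 3 convolutions; the BatchNorm/LeakyReLU placement, the negative slope and the example channel widths are illustrative assumptions rather than values taken from the patent figures.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # 3x3 depthwise convolution (one filter per input channel) followed by a
    # 1x1 pointwise convolution that mixes channels; together they approximate
    # a standard 3x3 convolution with far fewer parameters.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Rough weight count for a 256 -> 256 channel layer:
#   standard 3x3 convolution:        256 * 256 * 9 = 589,824 weights
#   depthwise separable (3x3 + 1x1): 256 * 9 + 256 * 256 = 67,840 weights
```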
Since UAV remote-sensing data mostly contain small targets, while high-level semantic features have large receptive fields and are better suited to detecting large objects, the final 16× to 32× down-sampling splicing layer is pruned, so that the parameter count of the constructed CPAN module grows only to a limited extent and the network remains lightweight. Finally, the Decoupled Head detection head module is used to give the finally extracted feature information stronger spatial association, forming a brand-new YOLOv4-UAV network.
Specifically, the unmanned aerial vehicle target detection method based on lightweight feature fusion comprises the following steps:
step 1, downloading unmanned aerial vehicle aerial small target public data sets VisDrone and DOTA1.0, and selecting images with complex natural scenes, various angles and numerous small targets as a test set.
Step 2, preprocessing the picture size of the training data set by resizing pictures to 640 × 640 pixels, and applying mosaic data enhancement to the data set during the first 80% of the total training epochs: four pictures from the training set are randomly split and flipped, placed at the corresponding split positions, and finally subjected to color-gamut transformation, affine transformation and other operations to obtain the final training samples, which effectively increases the number of training samples.
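A minimal sketch of the four-image mosaic step is shown below, assuming NumPy image arrays already resized to at least 640 × 640; the gray-fill value, split-point range and flip probability are illustrative assumptions, and the color-gamut transformation, affine transformation and bounding-box remapping mentioned above are omitted for brevity.

```python
import random
import numpy as np

def mosaic4(images, size=640):
    # Paste four training images around a random split point on a size x size
    # canvas, randomly flipping each patch horizontally.
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    cx = random.randint(size // 4, 3 * size // 4)   # random split point
    cy = random.randint(size // 4, 3 * size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, size, cy),
               (0, cy, cx, size), (cx, cy, size, size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        patch = img[:y2 - y1, :x2 - x1]
        if random.random() < 0.5:                   # random horizontal flip
            patch = patch[:, ::-1]
        canvas[y1:y2, x1:x2] = patch
    return canvas
```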
Step 3, using the newly generated training data set, adjusting the sizes of the candidate boxes (anchors) for network training with the K-means algorithm, and calculating the optimal number of candidate boxes required for prediction at each feature extraction layer with the K-means++ algorithm.
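Step 3 only names the K-means and K-means++ algorithms; the sketch below shows the usual IoU-based K-means clustering of ground-truth box widths and heights used to fit anchors to a new dataset, with the initialization, default anchor count and iteration limit chosen arbitrarily.

```python
import numpy as np

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    # Cluster ground-truth (width, height) pairs with K-means using 1 - IoU as
    # the distance, returning k anchor sizes sorted by area; the value of k is
    # normally matched to the number of prediction layers times anchors per layer.
    def iou(wh, centers):
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 centers[None, :, 0] * centers[None, :, 1] - inter)
        return inter / union

    rng = np.random.default_rng(seed)
    centers = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou(boxes_wh, centers), axis=1)
        new_centers = np.array([boxes_wh[assign == i].mean(axis=0)
                                if np.any(assign == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]
```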
Fig. 1 is a specific network model diagram of the unmanned aerial vehicle target detection method based on lightweight feature fusion. In this embodiment, based on this network model and on steps 1 to 3, the method proceeds with the following steps:
and 4, preliminarily extracting the feature information of the image through two CBL modules, respectively extracting the semantic feature information of the input image through a 3 multiplied by 3 convolution (Conv) module in the CBL modules, and finally increasing the channel information of the extracted features to 64 dimensions. Then, carrying out Normalization processing on the extracted features by using a Batch Normalization (BN) layer in the CBL module; finally, the LeakyRelu activation function in the CBL module is used for enhancing the nonlinear factors of the network model.
Step 5, extracting image information features with the three CSP residual modules CSPBlock1, CSPBlock2 and CSPBlock3 in fig. 1, where the intermediate modules CSPBlock2 and CSPBlock3 output two feature layers, feat1 and feat2, with 256 and 256 channels respectively.
Step 6, adopting the Coordinate Attention (CA) mechanism shown in fig. 3 to perform attention weight distribution on the two feature layers feat1 and feat2 output in step 5, generating new feat1 and feat2; spatial information and channel features are weighted and fused, taking both channel information and position information into account.
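For reference, a commonly used form of the Coordinate Attention block of fig. 3 is sketched below in PyTorch; the reduction ratio and the choice of ReLU/sigmoid activations are assumptions, and the exact widths used by the network are not specified here.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    # Coordinate Attention: pool along height and width separately, mix the two
    # directional descriptors through a shared 1x1 convolution, then produce
    # per-axis attention maps that reweight the input feature map.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.size()
        x_h = self.pool_h(x)                          # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # height attention
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # width attention
        return x * a_h * a_w
```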
Step 7, carrying out receptive field information fusion at different scales on the feature layer feat2 with the feature fusion module to obtain a new fused feature layer feat2, implemented as follows:
Step 7-1, performing feature re-extraction on the feat2 feature information obtained in step 6 with ThreeConv, which consists of three convolutions: a 1 × 1 convolution kernel first reduces the dimensionality of the feature layer, a 3 × 3 depthwise separable convolution then extracts features, and finally a 1 × 1 convolution adjusts the dimension of the output feature layer to generate a new feature layer feat2;
Step 7-2, using the cascaded feature extraction module SPPF (Spatial Pyramid Pooling - Fast) shown in fig. 2: the SPPF module is formed by cascading three 5 × 5 maximum pooling layers, producing receptive field information equivalent to maximum pooling of sizes 5 × 5, 9 × 9 and 13 × 13. The feature layer feat2 is re-extracted through the SPPF module, and the outputs of the pooling layers are Concat-spliced with the input feature information of the SPPF module to generate a new feature layer feat2 with 1024 channels (a code sketch of ThreeConv and SPPF follows step 7-3);
Step 7-3, performing feature re-extraction on the feat2 feature layer by using ThreeConv, and finally adjusting the dimension of the output feature layer by using the 1 × 1 convolution in ThreeConv to generate a new feature layer feat2.
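The sketch below combines the ThreeConv block of step 7-1 with the SPPF block of step 7-2, assuming PyTorch; the hidden channel width and the BatchNorm/LeakyReLU wrapping of each convolution are assumptions.

```python
import torch
import torch.nn as nn

def cbl(in_ch, out_ch, k=1, groups=1):
    # Convolution + BatchNorm + LeakyReLU helper.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, 1, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True))

def dsc(in_ch, out_ch):
    # 3x3 depthwise separable convolution: depthwise 3x3 then pointwise 1x1.
    return nn.Sequential(cbl(in_ch, in_ch, 3, groups=in_ch), cbl(in_ch, out_ch, 1))

class ThreeConv(nn.Module):
    # Step 7-1: 1x1 dimensionality reduction, 3x3 depthwise separable feature
    # extraction, 1x1 adjustment of the output dimension.
    def __init__(self, in_ch, out_ch, hidden=256):
        super().__init__()
        self.block = nn.Sequential(cbl(in_ch, hidden, 1),
                                   dsc(hidden, hidden),
                                   cbl(hidden, out_ch, 1))

    def forward(self, x):
        return self.block(x)

class SPPF(nn.Module):
    # Step 7-2: three cascaded 5x5 max-pooling layers whose outputs correspond
    # to 5x5, 9x9 and 13x13 receptive fields; the pooled maps are concatenated
    # with the input (quadrupling the channels) before the next ThreeConv (7-3).
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return torch.cat([x, p1, p2, p3], dim=1)
```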
step 8, using the CPAN module in fig. 1 to re-extract the features of the extracted feature layer information, which is specifically implemented as follows:
Step 8-1, adding an 8× down-sampling feature extraction layer feat0 in the network backbone extraction module, and fusing spatial and positional information with the Coordinate Attention (CA) mechanism shown in fig. 3 to generate a new feature layer feat0;
and 8-2, performing deep layer feature fusion on the feature layers feat0, feat1 and feat2 obtained in the step again through a bottom-up path to obtain feature layers feat0, feat1 and feat2 which are mutually fused in different scales, and specifically implementing the following steps:
feat2 is up-sampled by 2-times nearest-neighbour interpolation to obtain a 40 × 40 feature map, then a 1 × 1 convolution, normalization and a LeakyReLU activation function produce a feature layer feat2 with 512 channels and unchanged size; the CA attention mechanism is applied to this feature layer, which is then Concat-fused with feat1. Similarly, feat1 is up-sampled by 2-times nearest-neighbour interpolation, a 1 × 1 convolution adjusts the channel number to 258, the CA attention mechanism is applied, and the result is finally Concat-fused with the feature layer feat0;
and 8-3, adding a top-down path, fusing feat0 and feat1 again, and simultaneously not fusing feat1 and feat2 to obtain a new feature layer feat1, wherein the specific implementation is as follows:
feat0 is down-sampled by a convolution with a stride of 2 to obtain a 40 × 40 feature map, then a 1 × 1 convolution, normalization and a LeakyReLU activation function produce a feature map with 256 channels and unchanged size, which is finally Concat-fused with feat1 (a sketch of these fusion patterns follows).
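The two fusion patterns of steps 8-2 and 8-3 can be summarized as follows, assuming PyTorch; here conv1x1, ca and downsample_conv stand for the 1 × 1 CBL, Coordinate Attention and stride-2 convolution modules described above and are passed in as arguments.

```python
import torch
import torch.nn.functional as F

def fuse_bottom_up(deep, shallow, conv1x1, ca):
    # Step 8-2 pattern: 2x nearest-neighbour up-sampling of the deeper feature
    # map, 1x1 convolution (with BN + LeakyReLU) to adjust channels, CA
    # attention, then Concat with the shallower feature map.
    up = F.interpolate(deep, scale_factor=2, mode="nearest")
    return torch.cat([ca(conv1x1(up)), shallow], dim=1)

def fuse_top_down(shallow, deep, downsample_conv):
    # Step 8-3 pattern: stride-2 down-sampling convolution of the shallower
    # feature map followed by Concat with the deeper feature map.
    return torch.cat([downsample_conv(shallow), deep], dim=1)
```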
Step 9, using the Decoupled Head shown in fig. 4 to perform the final prediction analysis on the final feat0, feat1 and feat2 obtained above, implemented as follows:
classification information is extracted from the feature layers feat0, feat1 and feat2 through two 1 × 1 convolutions followed by 3 × 3 convolutions in the first branch of the Decoupled Head module; position information is extracted with a 3 × 3 convolution following the two 1 × 1 convolutions of the second branch; confidence information is extracted with the 3 × 3 convolution of a further branch after the second branch; the classification and regression tasks of small-target detection are thus completed respectively.
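A sketch of a decoupled head of the kind shown in fig. 4 is given below, assuming PyTorch; where the textual description of the branches is ambiguous, the layout follows the commonly used YOLOX-style decoupled head, and the stem width and activation are assumptions.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    # Separate branches for classification, box regression (position) and
    # objectness (confidence) applied to one feature layer.
    def __init__(self, in_ch, num_classes, num_anchors=3, hidden=256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, hidden, 1), nn.LeakyReLU(0.1))
        self.cls_feat = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1),
                                      nn.LeakyReLU(0.1))
        self.cls_pred = nn.Conv2d(hidden, num_anchors * num_classes, 1)
        self.reg_feat = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1),
                                      nn.LeakyReLU(0.1))
        self.reg_pred = nn.Conv2d(hidden, num_anchors * 4, 1)    # box offsets
        self.obj_pred = nn.Conv2d(hidden, num_anchors, 1)        # confidence

    def forward(self, x):
        x = self.stem(x)
        cls_out = self.cls_pred(self.cls_feat(x))
        reg_feat = self.reg_feat(x)
        return cls_out, self.reg_pred(reg_feat), self.obj_pred(reg_feat)
```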
Step 10, debugging the network structure hyper-parameters of steps 4 to 9 and setting the network model parameters: the number of training epochs is set to 300 and transfer-learning training is adopted; the backbone network is frozen for the first 60 epochs with the learning rate set to 0.001, a pre-training model is loaded, and the batch size is set to 16; the backbone network is unfrozen for the last 240 epochs with the learning rate set to 0.0001 and the batch size set to 8; the learning rate decays to 0.937 of its previous value after each epoch, and the final training model is obtained after 300 epochs of training.
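A sketch of the two-phase transfer-learning schedule of step 10 is given below, assuming a PyTorch model; the optimizer choice and the helper names model.backbone, make_loader and compute_loss are assumptions for illustration.

```python
import torch

def train_two_phase(model, make_loader, compute_loss, device="cuda"):
    # Phase 1: 60 epochs, backbone frozen, lr 0.001, batch size 16.
    # Phase 2: 240 epochs, backbone unfrozen, lr 0.0001, batch size 8.
    # After every epoch the learning rate decays to 0.937 of its previous value.
    phases = [(60, 1e-3, 16, True), (240, 1e-4, 8, False)]
    model.to(device)
    for epochs, lr, batch_size, freeze in phases:
        for p in model.backbone.parameters():
            p.requires_grad = not freeze
        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=lr)
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.937)
        loader = make_loader(batch_size)
        for _ in range(epochs):
            for images, targets in loader:
                loss = compute_loss(model(images.to(device)), targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            scheduler.step()
```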
and 11, inputting the test set in the step 1 into the training model in the step 10 to obtain a test result of the unmanned aerial vehicle small target detection.
The final test results are shown in fig. 5 and fig. 6, which respectively show the detection results of different data sources, and prove that the method can accurately realize target detection.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (10)

1. An unmanned aerial vehicle target detection method based on lightweight feature fusion is characterized in that: the method comprises the following steps:
step 1, a training data set is obtained, wherein the training data set comprises a small target public data set for aerial photography of an unmanned aerial vehicle;
step 2, preprocessing the picture size of the training data set and applying mosaic data enhancement: a plurality of pictures in the training data set are randomly split and flipped, placed at the corresponding split positions, and finally subjected to color-gamut transformation and affine transformation to obtain training samples;
step 3, based on the training sample, adjusting the size of a candidate frame of network training, and calculating the optimal number of candidate frames required by prediction of each feature extraction layer;
step 4, taking the training sample as an input image, extracting semantic feature information of the input image through the 3 × 3 convolution module in a CBL module and raising the channel dimension of the extracted features to 64, normalizing the extracted features with the Batch Normalization layer in the CBL module, and finally enhancing the nonlinearity of the network model with the LeakyReLU activation function in the CBL module;
step 5, extracting image information features using three CSP residual modules CSPBlock1, CSPBlock2 and CSPBlock3, where the intermediate modules CSPBlock2 and CSPBlock3 each output a feature layer, namely the second feature layer and the third feature layer;
step 6, based on an attention mechanism, respectively carrying out attention weight distribution on the second characteristic layer and the third characteristic layer output in the step 5 to update the second characteristic layer and the third characteristic layer;
step 7, carrying out receptive field information fusion at different scales on the third feature layer with a feature fusion module, and updating the third feature layer;
step 8, performing feature re-extraction on the extracted feature layer information to generate a first feature layer, and updating a second feature layer and a third feature layer;
step 9, performing final prediction analysis on the first feature layer, the second feature layer and the third feature layer;
and step 10, debugging the network structure hyperparameters from the step 4 to the step 9, and setting network model parameters to train the network model to obtain a final training model.
2. The unmanned aerial vehicle target detection method based on lightweight feature fusion as claimed in claim 1, wherein the preprocessing of the picture size of the training data set comprises:
the picture size of the training data set is adjusted to 640 x 640 pixels.
3. The unmanned aerial vehicle target detection method based on lightweight feature fusion as claimed in claim 1, wherein in step 7, a feature fusion module is used to perform receptive field information fusion of different scales on the third feature layer, and update the third feature layer specifically comprises:
step 7-1, performing feature re-extraction on the feature information of the third feature layer obtained in step 6 with ThreeConv, which consists of three convolutions: a 1 × 1 convolution kernel first reduces the dimensionality of the third feature layer, a 3 × 3 depthwise separable convolution then extracts features, and finally a 1 × 1 convolution adjusts the dimension of the output feature layer to update the third feature layer;
step 7-2, feature re-extraction based on a cascaded feature extraction module SPPF: the SPPF module is formed by cascading three 5 × 5 maximum pooling layers, producing receptive field information equivalent to maximum pooling of sizes 5 × 5, 9 × 9 and 13 × 13; the updated third feature layer is re-extracted through the SPPF module, and the outputs of the pooling layers are Concat-spliced with the input feature information of the SPPF module to further update the third feature layer;
and 7-3, performing feature re-extraction on the updated third feature layer by using ThreeConv, and finally adjusting the dimension of the output feature layer by using a convolution of 1 × 1 in ThreeConv to further update the third feature layer.
4. The unmanned aerial vehicle target detection method based on lightweight feature fusion of claim 1, wherein in the step 8, feature re-extraction is performed on the extracted feature layer information, a first feature layer is generated, and a second feature layer and a third feature layer are updated; the method specifically comprises the following steps:
step 8-1, adding a down-sampling feature extraction layer in a network backbone extraction module, and fusing space and position information based on an attention mechanism to generate a first feature layer;
8-2, carrying out deep feature fusion again on the first feature layer, the second feature layer and the third feature layer through a bottom-up path to update the second feature layer and the third feature layer;
and 8-3, adding a top-down path, fusing the first characteristic layer and the second characteristic layer again, and simultaneously not fusing the second characteristic layer and the third characteristic layer to update the first characteristic layer.
5. The unmanned aerial vehicle target detection method based on lightweight feature fusion as claimed in claim 4, wherein in step 8-2, the first feature layer, the second feature layer and the third feature layer are subjected to deep feature fusion again through a bottom-up path to update the second feature layer and the third feature layer, and specifically comprises:
carrying out up-sampling on the third feature layer by 2-times nearest-neighbour interpolation to obtain a 40 × 40 feature map, then applying a 1 × 1 convolution, normalization and a LeakyReLU activation function to obtain a feature layer with 512 channels and unchanged size, applying the CA attention mechanism to it, and Concat-fusing it with the second feature layer to update the third feature layer;
and up-sampling the second feature layer by 2-times nearest-neighbour interpolation, applying a 1 × 1 convolution to adjust the channel number to 258, applying the CA attention mechanism to the result, and finally Concat-fusing it with the first feature layer to update the second feature layer.
6. The unmanned aerial vehicle target detection method based on lightweight feature fusion of claim 4, wherein in step 8-3, a top-down path is added, and the first feature layer and the second feature layer are fused again without using the second feature layer and the third feature layer for fusion, so as to update the first feature layer, specifically comprising:
performing down-sampling convolution with a stride of 2 on the first feature layer to obtain a 40 × 40 feature map, then applying a 1 × 1 convolution, normalization and a LeakyReLU activation function to obtain a feature map with 256 channels and unchanged size, and finally fusing it with the second feature layer by Concat to update the first feature layer.
7. The unmanned aerial vehicle target detection method based on lightweight feature fusion as claimed in claim 1, wherein step 9, performing final prediction analysis on the first feature layer, the second feature layer, and the third feature layer specifically includes:
extracting classification information from the first, second and third feature layers through two 1 × 1 convolutions followed by 3 × 3 convolutions in the first branch of the Decoupled Head module, extracting position information with a 3 × 3 convolution following the two 1 × 1 convolutions of the second branch, and extracting confidence information with the 3 × 3 convolution of a further branch after the second branch, thereby completing the classification and regression tasks of small-target detection respectively.
8. The unmanned aerial vehicle target detection method based on lightweight feature fusion as claimed in claim 1, wherein in the step 10, the network structure hyper-parameters from the step 4 to the step 9 are debugged, and network model parameters are set to train the network model, so as to obtain a final training model, specifically comprising:
setting the number of network training epochs to 300 and adopting transfer-learning training: the backbone network is frozen for the first 60 epochs with the learning rate set to 0.001, a pre-training model is loaded and the batch size is set to 16; the backbone network is unfrozen for the last 240 epochs with the learning rate set to 0.0001 and the batch size set to 8; the learning rate decays to 0.937 of its previous value after each epoch, and the final training model is obtained after 300 epochs of training.
9. The unmanned aerial vehicle target detection method based on lightweight feature fusion as claimed in claim 1, wherein the method further comprises:
and 11, selecting partial data from the training data set in the step 1 as a test set, and inputting the test set into the training model in the step 10 to obtain a test result of the unmanned aerial vehicle small target detection.
10. The unmanned aerial vehicle target detection method based on lightweight feature fusion of claim 9, wherein images with complex natural scenes, diverse angles and numerous small targets are selected from the training data set in step 1 as a test set.
CN202211633735.9A 2022-12-19 2022-12-19 Unmanned aerial vehicle target detection method based on lightweight feature fusion Pending CN115810157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211633735.9A CN115810157A (en) 2022-12-19 2022-12-19 Unmanned aerial vehicle target detection method based on lightweight feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211633735.9A CN115810157A (en) 2022-12-19 2022-12-19 Unmanned aerial vehicle target detection method based on lightweight feature fusion

Publications (1)

Publication Number Publication Date
CN115810157A true CN115810157A (en) 2023-03-17

Family

ID=85486168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211633735.9A Pending CN115810157A (en) 2022-12-19 2022-12-19 Unmanned aerial vehicle target detection method based on lightweight feature fusion

Country Status (1)

Country Link
CN (1) CN115810157A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612493A (en) * 2023-04-28 2023-08-18 深圳先进技术研究院 Pedestrian geographic track extraction method and device
CN116502810A (en) * 2023-06-28 2023-07-28 威胜信息技术股份有限公司 Standardized production monitoring method based on image recognition
CN116502810B (en) * 2023-06-28 2023-11-03 威胜信息技术股份有限公司 Standardized production monitoring method based on image recognition
CN116994116A (en) * 2023-08-04 2023-11-03 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5
CN116994116B (en) * 2023-08-04 2024-04-16 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5
CN117333808A (en) * 2023-09-13 2024-01-02 汕头市澄海区建筑设计院 Building fire disaster identification method and system for lightweight deployment
CN117333808B (en) * 2023-09-13 2024-04-30 汕头市澄海区建筑设计院 Building fire disaster identification method and system for lightweight deployment

Similar Documents

Publication Publication Date Title
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN115810157A (en) Unmanned aerial vehicle target detection method based on lightweight feature fusion
CN109472298A (en) Depth binary feature pyramid for the detection of small scaled target enhances network
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
Sun et al. Unmanned surface vessel visual object detection under all-weather conditions with optimized feature fusion network in YOLOv4
Hu et al. LCDNet: Light-weighted cloud detection network for high-resolution remote sensing images
CN113822951A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112633257A (en) Potato disease identification method based on improved convolutional neural network
Zhang et al. FFCA-YOLO for small object detection in remote sensing images
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN115719445A (en) Seafood identification method based on deep learning and raspberry type 4B module
Li et al. Remote sensing object detection based on strong feature extraction and prescreening network
Liu et al. Helmet wearing detection based on YOLOv4-MT
CN114067126A (en) Infrared image target detection method
Li et al. Object detection for uav images based on improved yolov6
Dong et al. Visual Detection Algorithm for Enhanced Environmental Perception of Unmanned Surface Vehicles in Complex Marine Environments
Wang et al. Summary of object detection based on convolutional neural network
Lei et al. Fisheye image object detection based on an improved yolov3 algorithm
CN116385876A (en) Optical remote sensing image ground object detection method based on YOLOX
Zhou et al. Ship target detection in optical remote sensing images based on multiscale feature enhancement
Yang et al. An improved object detection of image based on multi-task learning
Sun et al. Intelligent Site Detection Based on Improved YOLO Algorithm
Yuan et al. A novel dense generative net based on satellite remote sensing images for vehicle classification under foggy weather conditions
Zhang et al. Traffic sign detection algorithm based on YOLOv5 combined with BIFPN and attention mechanism
Huang et al. Insulator defect detection algorithm based on improved YOLOv5s

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination