CN116469020A - Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance - Google Patents

Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance

Info

Publication number
CN116469020A
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
detection method
gaussian
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310402925.8A
Other languages
Chinese (zh)
Inventor
Li Hongguang
Meng Lingjie
Yang Lichun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202310402925.8A
Publication of CN116469020A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/17 Terrestrial scenes taken from planes or by drones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance, relating to the technical field of aerial image processing. The method combines low-layer and high-layer feature fusion with a scale-insensitive metric, and comprises the following steps: S1: establishing an unmanned aerial vehicle image target dataset and preprocessing the image data; S2: slicing the input image and splicing the slicing results; S3: enriching the receptive field of the feature map by fusing multi-scale pooling information; S4: introducing an NWD metric based on the Gaussian Wasserstein distance; S5: for unmanned aerial vehicle images containing small targets in the test set, performing target prediction with the trained, improved feature extraction network. The method improves small target detection precision, improves deep detection algorithms designed for conventional-scale targets, realizes effective detection of targets with limited pixels, and achieves higher accuracy and recall.

Description

Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance
Technical Field
The invention relates to the technical field of aviation image processing, in particular to an unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distances.
Background
A limited-pixel target in an unmanned aerial vehicle image is a target that occupies only a few pixels in the image. Under long-distance imaging conditions, especially when a medium-to-high-altitude unmanned aerial vehicle observes the ground at a long-range oblique viewing angle, a ground target occupies very few pixels in the image. Effectively analyzing and processing unmanned aerial vehicle image data with a computer, identifying targets of different categories, and marking their positions is one of the basic problems in computer vision tasks. It is widely applied in fields such as the military, agriculture and forestry, maritime affairs, disaster prevention and relief, and city planning, which place ever higher demands on target detection in unmanned aerial vehicle images.
Detecting small targets against a complex background is an important research direction in the field of image analysis and processing. Compared with images of natural scenes, unmanned aerial vehicle images have high background complexity, small target sizes, and weak features because of the long imaging distance. Because the imaging environment is complex and highly variable (weather, platform speed, altitude, and stability), the images also suffer from low resolution, low color saturation, and environmental noise distortion, all of which increase the difficulty of target detection.
Existing target detection algorithms fall into two main classes: algorithms based on traditional image processing and algorithms based on deep learning. Target detection methods based on traditional image processing are mostly applied to infrared dim small target detection; they introduce a visual attention mechanism and exploit the differences between the target, the background, and noise to selectively find a target region of interest. However, hand-designed features lack representativeness, are easily disturbed by complex backgrounds, and cannot be directly applied to unmanned aerial vehicle image target detection tasks. Target detection algorithms based on deep neural networks perform well on conventional datasets but have lower detection precision for small targets: a convolutional neural network is generally built from stacked convolution and pooling layers, so as the network deepens, the feature map size gradually shrinks and the information of the target to be detected is further reduced, making it difficult to detect.
Therefore, it is necessary to provide an unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance to solve the above problems.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance, which improves small target detection precision, improves deep detection algorithms designed for conventional-scale targets, realizes effective detection of targets with limited pixels, and achieves higher accuracy and recall.
In order to achieve the above purpose, the invention provides an unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance, comprising the following steps:
S1: establishing an unmanned aerial vehicle image target dataset and preprocessing the image data;
S2: slicing the input image and splicing the slicing results;
S3: enriching the receptive field of the feature map by fusing multi-scale pooling information;
S4: introducing an NWD metric based on the Gaussian Wasserstein distance;
S5: for unmanned aerial vehicle images containing small targets in the test set, performing target prediction with the trained, improved feature extraction network.
Preferably, in step S1, the original images are cut into uniform 800×800-pixel crops, the target categories are determined according to the frequency and size with which targets appear in the images, images are selected according to the proportion of the target in the image, samples of X categories are taken as the training set, and samples of the remaining categories are taken as the test set.
Preferably, in step S2, the slicing operation sets up a Focus structure that performs downsampling, splitting the high-resolution image into several low-resolution images while retaining the feature information of small targets.
Preferably, in step S3, an SPP module is introduced before the last convolution layer of the backbone network to fuse feature information of different scales.
Preferably, in step S4, the NWD metric is designed by modeling the bounding box as a two-dimensional Gaussian distribution; for a horizontal bounding box, its inscribed ellipse equation is expressed as:

$$\frac{(x-\mu_x)^2}{\sigma_x^2}+\frac{(y-\mu_y)^2}{\sigma_y^2}=1$$

where $(\mu_x,\mu_y)$ is the center coordinate of the ellipse, and $\sigma_x$ and $\sigma_y$ represent the semi-axis lengths along the x and y axes respectively, with $\mu_x=c_x$, $\mu_y=c_y$, $\sigma_x=w/2$, $\sigma_y=h/2$.
Preferably, in step S4, the probability density function of the two-dimensional Gaussian distribution is expressed as:

$$f(\mathbf{x}\mid\boldsymbol{\mu},\Sigma)=\frac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{2\pi\lvert\Sigma\rvert^{1/2}}$$

where $\mathbf{x}$, $\boldsymbol{\mu}$ and $\Sigma$ represent the coordinates, mean vector and covariance matrix of the Gaussian distribution, respectively.
Preferably, in step S4, when $(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})=1$, the horizontal bounding box $R=(c_x,c_y,w,h)$ is modeled as a two-dimensional Gaussian distribution $N(\boldsymbol{\mu},\Sigma)$, where:

$$\boldsymbol{\mu}=\begin{bmatrix}c_x\\ c_y\end{bmatrix},\qquad \Sigma=\begin{bmatrix}\frac{w^2}{4}&0\\ 0&\frac{h^2}{4}\end{bmatrix}$$

The similarity between two bounding boxes is thereby converted into a distance between two Gaussian distributions. For two-dimensional Gaussian distributions $\mu_1=N(m_1,\Sigma_1)$ and $\mu_2=N(m_2,\Sigma_2)$, the second-order Wasserstein distance between $\mu_1$ and $\mu_2$ is abbreviated as:

$$W_2^2(\mu_1,\mu_2)=\lVert m_1-m_2\rVert_2^2+\left\lVert \Sigma_1^{1/2}-\Sigma_2^{1/2}\right\rVert_F^2$$

where $\lVert\cdot\rVert_F$ represents the Frobenius norm;

for the Gaussian distributions $N_a$ and $N_b$ modeled from bounding boxes $A=(cx_a,cy_a,w_a,h_a)$ and $B=(cx_b,cy_b,w_b,h_b)$, this further simplifies to:

$$W_2^2(N_a,N_b)=\left\lVert\left[cx_a,\ cy_a,\ \tfrac{w_a}{2},\ \tfrac{h_a}{2}\right]^{\mathsf T}-\left[cx_b,\ cy_b,\ \tfrac{w_b}{2},\ \tfrac{h_b}{2}\right]^{\mathsf T}\right\rVert_2^2$$

Its exponential form is used for normalization as the similarity measure between the two bounding boxes:

$$NWD(N_a,N_b)=\exp\!\left(-\frac{\sqrt{W_2^2(N_a,N_b)}}{C}\right)$$

where C is the average absolute size of the targets in the dataset; as the target size decreases, the drop in an IoU-based index caused by the same position offset grows, whereas the NWD metric remains insensitive to scale.
Preferably, in step S4, the loss function is a weighted combination of the target confidence loss, the classification loss and the bounding box regression loss, where the target confidence loss and the classification loss use binary cross entropy, and the bounding box regression loss is expressed as a normalized weighted sum of the CIoU loss and the NWD loss between the predicted bounding box and the ground-truth bounding box. The loss function is expressed as:

$$Loss=\lambda_1 L_{cls}+\lambda_2 L_{obj}+\lambda_3\left[\alpha L_{CIoU}+(1-\alpha)L_{NWD}\right]$$

$$L_{NWD}=1-NWD(N_p,N_g)$$

where $NWD(N_p,N_g)$ represents the exponentially normalized Wasserstein distance between the predicted box and the ground-truth box.
Preferably, in step S5, algorithm performance is evaluated using AP50, AP75 and mAP as the model evaluation indexes, the effect of the improved feature extraction network is tested on the test dataset, and the influence of the introduced NWD metric on model performance is analyzed.
Therefore, the unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance has the following beneficial effects:
(1) The invention uses a multi-scale feature extraction module that adopts a bidirectional feature pyramid network (BiFPN) to fuse low-level and high-level features bidirectionally in the Neck network, enriching the expression of limited-pixel target feature information.
(2) The invention improves the recall rate of detection by fusing the space-time information of multi-frame images.
(3) The invention ensures that the detection result has reliability by extracting and combining various image visual characteristics.
(4) The invention adopts a scale-insensitive normalized Gaussian Wasserstein distance metric in the non-maximum suppression stage and the bounding box regression loss to evaluate the similarity between the predicted box and the ground-truth box, improving small target detection precision.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of an unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distances;
FIG. 2 is a schematic diagram of a Focus architecture employed in the present invention;
FIG. 3 is a block diagram of an SPP module employed in the present invention;
FIG. 4 is a schematic diagram of position offset curves under the NWD metric based on the Gaussian Wasserstein distance employed in the present invention.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The invention adopts the unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance, combining low-layer and high-layer feature fusion with a scale-insensitive metric, and comprises the following steps: S1: establishing an unmanned aerial vehicle image target dataset and preprocessing the image data; S2: slicing the input image and splicing the slicing results; S3: enriching the receptive field of the feature map by fusing multi-scale pooling information; S4: introducing an NWD metric based on the Gaussian Wasserstein distance; S5: for unmanned aerial vehicle images containing small targets in the test set, performing target prediction with the trained, improved feature extraction network.
In step S1, the original images are cut into uniform 800×800-pixel crops, the target categories are determined according to the frequency and size with which targets appear in the images, images are selected according to the proportion of the target in the image, samples of X categories are taken as the training set, and samples of the remaining categories are taken as the test set.
The AI-TOD dataset, assembled from several large-scale public aerial remote sensing image datasets such as DIOR, DOTA, xView and VisDrone, is used as the basis for the unmanned aerial vehicle limited-pixel small-target dataset; the target categories are determined mainly as airplanes, ships, vehicles, people, etc., according to the frequency and size with which the targets appear in the images.
The original images are cut, with overlap, into uniform 800×800-pixel crops, and images whose targets are no larger than 64 pixels are selected according to the proportion of the target in the image. The dataset contains 28,036 images and 700,621 target instances; the average target size is 12.8 pixels with a variance of 5.9 pixels, far smaller than in other remote sensing datasets. Samples of X categories are taken as the training set and samples of the remaining categories as the test set.
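For illustration, a minimal sketch of this overlapped cropping step is given below; the 100-pixel overlap and the border handling are assumptions, since the text only states that crops are taken in an overlapping manner:

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 800, overlap: int = 100):
    """Cut an H x W x C image into overlapping tile x tile crops.
    The 100 px overlap is an assumption; real pipelines usually pad or
    shift the last row/column of tiles so the image borders are covered."""
    h, w = img.shape[:2]
    stride = tile - overlap
    crops = []
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            crops.append(((x, y), img[y:y + tile, x:x + tile]))
    return crops
```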
In step S2, the slicing operation sets up a Focus structure that performs downsampling, splitting the high-resolution image into several low-resolution images while retaining the feature information of small targets. Focus is a special downsampling method; the specific processing is shown in FIG. 2: values are taken at intervals of one pixel and recombined into low-resolution images, so the number of channels becomes 4 times the original. By splitting the high-resolution image into several low-resolution images, the width and height information is concentrated uniformly into the channel dimension, which reduces the computation, avoids the information loss caused by downsampling, retains more feature information of small targets, and improves network training and inference speed.
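As a minimal, illustrative sketch only (not the patent's verbatim implementation), such a Focus slicing structure could be written in PyTorch as follows; the convolution kernel size, normalization and activation are assumptions:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into 4 pixel-interleaved sub-images, concatenate
    them on the channel dimension (C -> 4C, H x W -> H/2 x W/2), then
    fuse with a convolution."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # take every other pixel in both spatial dimensions -> 4 slices
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)
```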
In step S3, an SPP module is introduced before the last convolution layer of the backbone network to fuse feature information of different scales. The SPP module structure is shown in FIG. 3: the input features first pass through a 1×1 convolution layer, then through three max-pooling windows of different scales (5×5, 7×7 and 13×13) in parallel; the pooled features of the three scales are concatenated with the input features and passed through another 1×1 convolution layer to finally obtain a fixed-size feature vector. The SPP layer enriches the receptive field of the feature map by fusing multi-scale pooling information, enhancing its feature expression capability.
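A minimal PyTorch sketch of such an SPP module is shown below, using the 5×5, 7×7 and 13×13 pooling windows stated above; the channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: a 1x1 conv, three parallel max-pool
    branches, concatenation with the input branch, then a final 1x1 conv."""
    def __init__(self, in_ch: int, out_ch: int, kernels=(5, 7, 13)):
        super().__init__()
        hidden = in_ch // 2
        self.cv1 = nn.Conv2d(in_ch, hidden, 1, 1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        self.cv2 = nn.Conv2d(hidden * (len(kernels) + 1), out_ch, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        # stride-1 pooling with same-padding keeps spatial size constant
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```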
In step S4, ioU, which represents the degree of overlap between the prediction frame and the real frame, is widely used in the target detection frame based on the anchor frame, for example, the Non-maximum suppression (Non-MaximumSuppression, NMS) stage filters the prediction frame with higher overlap rate by using IoU index, and replaces L2 loss with index based on IoU in the loss function as the regression loss of the boundary frame, but the evaluation index based on IoU is very sensitive to small target position offset, and small position offset can cause IoU to drop rapidly, thereby affecting the performance of the detector based on the anchor frame. The similarity between the two bounding boxes is calculated using a normalized gaussian wasperstein distance. The NWD metric design process is to model the bounding box as a two-dimensional gaussian distribution, and for a horizontal bounding box, its inscribed ellipse equation is expressed as:
wherein (mu) xy ) Is the center coordinate of ellipse, sigma x Sum sigma y Respectively represent the half-axis length along the x and y axes, mu x =c x ,μ y =c y ,σ x =w/2,σ y =h/2。
In step S4, the probability density function of the two-dimensional Gaussian distribution is expressed as:

$$f(\mathbf{x}\mid\boldsymbol{\mu},\Sigma)=\frac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{2\pi\lvert\Sigma\rvert^{1/2}}$$

where $\mathbf{x}$, $\boldsymbol{\mu}$ and $\Sigma$ represent the coordinates, mean vector and covariance matrix of the Gaussian distribution, respectively.
In step S4, when $(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})=1$, the horizontal bounding box $R=(c_x,c_y,w,h)$ can be modeled as a two-dimensional Gaussian distribution $N(\boldsymbol{\mu},\Sigma)$, where:

$$\boldsymbol{\mu}=\begin{bmatrix}c_x\\ c_y\end{bmatrix},\qquad \Sigma=\begin{bmatrix}\frac{w^2}{4}&0\\ 0&\frac{h^2}{4}\end{bmatrix}$$

The similarity between two bounding boxes is thereby converted into a distance between two Gaussian distributions. For two-dimensional Gaussian distributions $\mu_1=N(m_1,\Sigma_1)$ and $\mu_2=N(m_2,\Sigma_2)$, the second-order Wasserstein distance between $\mu_1$ and $\mu_2$ is abbreviated as:

$$W_2^2(\mu_1,\mu_2)=\lVert m_1-m_2\rVert_2^2+\left\lVert \Sigma_1^{1/2}-\Sigma_2^{1/2}\right\rVert_F^2$$

where $\lVert\cdot\rVert_F$ denotes the Frobenius norm.

For the Gaussian distributions $N_a$ and $N_b$ modeled from bounding boxes $A=(cx_a,cy_a,w_a,h_a)$ and $B=(cx_b,cy_b,w_b,h_b)$, this further simplifies to:

$$W_2^2(N_a,N_b)=\left\lVert\left[cx_a,\ cy_a,\ \tfrac{w_a}{2},\ \tfrac{h_a}{2}\right]^{\mathsf T}-\left[cx_b,\ cy_b,\ \tfrac{w_b}{2},\ \tfrac{h_b}{2}\right]^{\mathsf T}\right\rVert_2^2$$

Its exponential form is used for normalization as the similarity measure between the two bounding boxes:

$$NWD(N_a,N_b)=\exp\!\left(-\frac{\sqrt{W_2^2(N_a,N_b)}}{C}\right)$$

where C is the average absolute size of the targets in the dataset; as the target size decreases, the drop in an IoU-based index caused by the same position offset grows. As shown in FIG. 4, the four curves corresponding to NWD coincide exactly, showing insensitivity to box scale variation; the NWD curve is also smoother and less sensitive to offset, and even when bounding box A contains bounding box B or the two boxes do not intersect at all, the NWD index can still reflect their similarity, giving stronger robustness.
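Putting the derivation together, a minimal sketch of the NWD computation for two boxes in (cx, cy, w, h) form might look as follows; defaulting C to 12.8, the dataset's average target size quoted earlier, is an assumption:

```python
import math

def nwd(box_a, box_b, C: float = 12.8) -> float:
    """Normalized Gaussian Wasserstein distance between two horizontal
    boxes in (cx, cy, w, h) form, following the formulas above."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # squared second-order Wasserstein distance between the two Gaussians
    w2 = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
          + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2) / C)
```

For example, nwd((10, 10, 8, 8), (11, 10, 8, 8)) stays close to 1 for a one-pixel offset of an 8-pixel box, while IoU would already drop noticeably.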
In step S4, the loss function is a weighted combination of the target confidence loss, the classification loss and the bounding box regression loss, where the target confidence loss and the classification loss use binary cross entropy, and the bounding box regression loss is expressed as a normalized weighted sum of the CIoU loss and the NWD loss between the predicted bounding box and the ground-truth bounding box. The loss function is expressed as:

$$Loss=\lambda_1 L_{cls}+\lambda_2 L_{obj}+\lambda_3\left[\alpha L_{CIoU}+(1-\alpha)L_{NWD}\right]$$

$$L_{NWD}=1-NWD(N_p,N_g)$$

where $NWD(N_p,N_g)$ represents the exponentially normalized Wasserstein distance between the predicted box and the ground-truth box.
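A minimal sketch of how these terms could be combined is given below; the lambda weights are illustrative placeholders, and alpha = 0.65 only mirrors the NWD loss weight of 0.35 reported in the experiments later:

```python
def detection_loss(l_cls, l_obj, l_ciou, l_nwd,
                   lambdas=(0.5, 1.0, 0.05), alpha=0.65):
    """Total loss per the formula above: weighted confidence,
    classification and box-regression terms, the last mixing the CIoU
    and NWD losses, with l_nwd = 1 - NWD(N_p, N_g)."""
    l1, l2, l3 = lambdas
    return l1 * l_cls + l2 * l_obj + l3 * (alpha * l_ciou + (1.0 - alpha) * l_nwd)
```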
In step S5, algorithm performance is evaluated using AP50, AP75 and mAP as the model evaluation indexes, the effect of the improved feature extraction network is tested on the test dataset, and the influence of the introduced NWD metric on model performance is analyzed.
Algorithm performance is evaluated using AP50, AP75 and mAP as the model evaluation indexes. The average precision (AP) for each category is the area under its P-R curve, and mAP is the mean of the average precision over all categories. In the COCO dataset, mAP denotes the index obtained by computing AP at ten IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averaging them, while AP50 and AP75 denote the average precision of each class computed with IoU thresholds of 0.5 and 0.75, respectively.
The algorithm is implemented on the deep learning framework PyTorch, with hardware configured as CPU: Intel Xeon, 24 cores, 1.9 GHz, 64 GB RAM; GPU: GeForce RTX 3080 Ti. Parameters are initialized with the official YOLOv5 pre-trained model and fine-tuned on the remote sensing image target detection dataset. The initial learning rate is set to 0.01, a Warmup strategy is adopted before training, and the learning rate is dynamically decayed with a cosine annealing schedule. Each model is trained for 1000 epochs; to prevent overfitting, training is stopped early when the index on the validation set has not improved for 100 epochs. The batch_size is set to 128 for training and 1 for testing.
A multi-scale training mode is adopted, and the K-means algorithm automatically clusters the ground-truth bounding box labels of the dataset to generate new optimal anchor box sizes, adapting to targets of different scales in different datasets.
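A minimal sketch of such anchor clustering on ground-truth (w, h) labels is given below; it uses plain Euclidean k-means, whereas YOLO-style pipelines often use an IoU-based distance, so that choice is an assumption:

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100) -> np.ndarray:
    """Cluster ground-truth box (w, h) pairs into k anchor sizes."""
    wh = wh.astype(np.float64)
    rng = np.random.default_rng(0)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each box to the nearest anchor center
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sort anchors by area
```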
The effect of the improved feature extraction network is tested on the test dataset. A BiFPN structure connects the multi-layer feature maps bidirectionally and fuses them with dynamic weights according to feature importance, improving the feature expression capability of the network, and an additional detection head is added for small targets. The experimental results are shown in Table 1: on the AI-TOD dataset, the mAP value increased by 0.3%, APm by 2.2% and AP75 by 1.0%.
Table 1: Performance comparison of the improved network structure
The influence of the introduced NWD metric on model performance is analyzed. Replacing IoU with the scale-insensitive NWD metric in the NMS stage effectively avoids the growth of redundant detection boxes that are kept because their IoU with the highest-scoring predicted box falls below the threshold, which would otherwise inflate the false positive rate. For the bounding box regression loss function, introducing the NWD loss helps alleviate the sensitivity of the CIoU loss to small target position deviations, so the network can learn and optimize better for small targets. The experimental results are shown in Table 2.
Table 2: Influence of the NWD metric on detection performance
Introducing the NWD metric at the NMS stage yields an mAP of 16.2%, 1.2% higher than with the IoU metric used by YOLOv5. The NWD loss complements the CIoU loss in a normalized weighted manner; when the NWD loss weight is set to 0.35, mAP increases by 1.6% compared with the CIoU loss alone. The experimental results show that introducing the NWD metric in both the NMS stage and the bounding box regression loss yields a definite improvement in small target detection performance.
Performance is compared with other classical and advanced target detection methods on the AI-TOD dataset, using the officially provided COCO API interface adapted for AI-TOD to ensure the objectivity and credibility of the model performance comparison; the comparison results are shown in Table 3.
Table 3(1): Performance comparison of different algorithms on the AI-TOD dataset
Table 3(2): Performance comparison of different algorithms on the AI-TOD dataset (continued)
The multi-class mean average precision (mAP) of the proposed method reaches 17.8%, with AP50 and AP75 of 41.4% and 12.4%, respectively. Compared with the baseline YOLOv5, mAP improves by 3.0%, AP50 by 4.6% and AP75 by 3.3%, and all three indexes are higher than those of classical anchor-based and anchor-free target detection algorithms. Compared with classical multi-stage target detection methods such as Faster R-CNN and Cascade R-CNN, single-stage methods such as YOLOv3, SSD and RetinaNet have lower mAP values and poorer detection performance on small targets. The anchor-free detector CenterNet avoids the poor robustness of discrete-size anchor boxes to multi-scale targets, and the multi-center-point anchor-free detector further improves extremely-small-target detection through its multi-center-point and offset-target design, achieving the highest APvt index of 6.1%; the method disclosed herein, however, accounts for target instances of all scales across the whole dataset and shows outstanding performance advantages overall. Compared with the advanced DetectoRS algorithm, the APt index improves by 4.9% and APvt by 3.4%, a clear performance gain in detecting very small targets. The comparison experiments show that the proposed method performs better than some current methods on the remote sensing image small-target detection task, proving its effectiveness.
Therefore, the unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance combines low-layer and high-layer feature fusion with a scale-insensitive metric to improve the accuracy of detecting limited-pixel small targets in unmanned aerial vehicle images.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the invention and not for limiting it. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the invention.

Claims (9)

1. An unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance, characterized by comprising the following steps:
S1: establishing an unmanned aerial vehicle image target dataset and preprocessing the image data;
S2: slicing the input image and splicing the slicing results;
S3: enriching the receptive field of the feature map by fusing multi-scale pooling information;
S4: introducing an NWD metric based on the Gaussian Wasserstein distance;
S5: for unmanned aerial vehicle images containing small targets in the test set, performing target prediction with the trained, improved feature extraction network.
2. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 1, characterized in that: in step S1, the original images are cut into uniform 800×800-pixel crops, the target categories are determined according to the frequency and size with which targets appear in the images, images are selected according to the proportion of the target in the image, samples of X categories are taken as the training set, and samples of the remaining categories are taken as the test set.
3. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 1, characterized in that: in step S2, the slicing operation sets up a Focus structure that performs downsampling, splitting the high-resolution image into several low-resolution images while retaining the feature information of small targets.
4. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 1, characterized in that: in step S3, an SPP module is introduced before the last convolution layer of the backbone network to fuse feature information of different scales.
5. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 1, characterized in that: in step S4, the NWD metric design procedure is:
modeling the bounding box as a two-dimensional Gaussian distribution; for a horizontal bounding box, its inscribed ellipse equation is expressed as:

$$\frac{(x-\mu_x)^2}{\sigma_x^2}+\frac{(y-\mu_y)^2}{\sigma_y^2}=1$$

where $(\mu_x,\mu_y)$ is the center coordinate of the ellipse, and $\sigma_x$ and $\sigma_y$ represent the semi-axis lengths along the x and y axes respectively, with $\mu_x=c_x$, $\mu_y=c_y$, $\sigma_x=w/2$, $\sigma_y=h/2$.
6. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 5, characterized in that: in step S4, the probability density function of the two-dimensional Gaussian distribution is expressed as:

$$f(\mathbf{x}\mid\boldsymbol{\mu},\Sigma)=\frac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{2\pi\lvert\Sigma\rvert^{1/2}}$$

where $\mathbf{x}$, $\boldsymbol{\mu}$ and $\Sigma$ represent the coordinates, mean vector and covariance matrix of the Gaussian distribution, respectively.
7. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 6, characterized in that: in step S4, when $(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})=1$, the horizontal bounding box $R=(c_x,c_y,w,h)$ is modeled as a two-dimensional Gaussian distribution $N(\boldsymbol{\mu},\Sigma)$, where:

$$\boldsymbol{\mu}=\begin{bmatrix}c_x\\ c_y\end{bmatrix},\qquad \Sigma=\begin{bmatrix}\frac{w^2}{4}&0\\ 0&\frac{h^2}{4}\end{bmatrix}$$

the similarity between two bounding boxes is converted into a distance between two Gaussian distributions; for two-dimensional Gaussian distributions $\mu_1=N(m_1,\Sigma_1)$ and $\mu_2=N(m_2,\Sigma_2)$, the second-order Wasserstein distance between $\mu_1$ and $\mu_2$ is abbreviated as:

$$W_2^2(\mu_1,\mu_2)=\lVert m_1-m_2\rVert_2^2+\left\lVert \Sigma_1^{1/2}-\Sigma_2^{1/2}\right\rVert_F^2$$

where $\lVert\cdot\rVert_F$ represents the Frobenius norm;

for the Gaussian distributions $N_a$ and $N_b$ modeled from bounding boxes $A=(cx_a,cy_a,w_a,h_a)$ and $B=(cx_b,cy_b,w_b,h_b)$, this further simplifies to:

$$W_2^2(N_a,N_b)=\left\lVert\left[cx_a,\ cy_a,\ \tfrac{w_a}{2},\ \tfrac{h_a}{2}\right]^{\mathsf T}-\left[cx_b,\ cy_b,\ \tfrac{w_b}{2},\ \tfrac{h_b}{2}\right]^{\mathsf T}\right\rVert_2^2$$

its exponential form is used for normalization as the similarity measure between the two bounding boxes:

$$NWD(N_a,N_b)=\exp\!\left(-\frac{\sqrt{W_2^2(N_a,N_b)}}{C}\right)$$

where C is the average absolute size of the targets in the dataset.
8. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 7, characterized in that: in step S4, the loss function is a weighted combination of the target confidence loss, the classification loss and the bounding box regression loss, where the target confidence loss and the classification loss use binary cross entropy, and the bounding box regression loss is expressed as a normalized weighted sum of the CIoU loss and the NWD loss between the predicted bounding box and the ground-truth bounding box; the loss function is expressed as:

$$Loss=\lambda_1 L_{cls}+\lambda_2 L_{obj}+\lambda_3\left[\alpha L_{CIoU}+(1-\alpha)L_{NWD}\right]$$

$$L_{NWD}=1-NWD(N_p,N_g)$$

where $NWD(N_p,N_g)$ represents the exponentially normalized Wasserstein distance between the predicted box and the ground-truth box.
9. The unmanned aerial vehicle image target detection method based on multi-scale and Gaussian Wasserstein distance according to claim 1, characterized in that: in step S5, algorithm performance is evaluated using AP50, AP75 and mAP as the model evaluation indexes, the effect of the improved feature extraction network is tested on the test dataset, and the influence of the introduced NWD metric on model performance is analyzed.
CN202310402925.8A 2023-04-17 2023-04-17 Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance Pending CN116469020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310402925.8A CN116469020A (en) 2023-04-17 2023-04-17 Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310402925.8A CN116469020A (en) 2023-04-17 2023-04-17 Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance

Publications (1)

Publication Number Publication Date
CN116469020A true CN116469020A (en) 2023-07-21

Family

ID=87183772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310402925.8A Pending CN116469020A (en) 2023-04-17 2023-04-17 Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance

Country Status (1)

Country Link
CN (1) CN116469020A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775622A (en) * 2023-08-24 2023-09-19 中建五局第三建设有限公司 Method, device, equipment and storage medium for generating structural data
CN116775622B (en) * 2023-08-24 2023-11-07 中建五局第三建设有限公司 Method, device, equipment and storage medium for generating structural data
CN117333512A (en) * 2023-10-17 2024-01-02 大连理工大学 Aerial small target tracking method based on detection frame tracking

Similar Documents

Publication Publication Date Title
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN111723748A (en) Infrared remote sensing image ship detection method
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
CN108108657A (en) A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN112836713A (en) Image anchor-frame-free detection-based mesoscale convection system identification and tracking method
CN116469020A (en) Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN109919223B (en) Target detection method and device based on deep neural network
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN106408030A (en) SAR image classification method based on middle lamella semantic attribute and convolution neural network
CN112580480B (en) Hyperspectral remote sensing image classification method and device
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114049572A (en) Detection method for identifying small target
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN116416503A (en) Small sample target detection method, system and medium based on multi-mode fusion
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
Khoshboresh-Masouleh et al. Robust building footprint extraction from big multi-sensor data using deep competition network
CN109284752A (en) A kind of rapid detection method of vehicle
CN117218545A (en) LBP feature and improved Yolov 5-based radar image detection method
CN110334703B (en) Ship detection and identification method in day and night image
CN111582057B (en) Face verification method based on local receptive field
Yin et al. M2F2-RCNN: Multi-functional faster RCNN based on multi-scale feature fusion for region search in remote sensing images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination