CN118298319A - Remote sensing image small target detection method and device, electronic equipment and storage medium


Info

Publication number: CN118298319A (application CN202410285158.1A)
Authority: CN (China)
Prior art keywords: feature, map, remote sensing, sensing image, scale
Prior art date
Legal status: Pending
Application number: CN202410285158.1A
Other languages: Chinese (zh)
Inventors: Wang Li (王莉), Wu Xin (吴鑫), Fei Aiguo (费爱国), Xu Lianming (徐连明), Xiong Zhiyu (熊智宇)
Current and Original Assignee: Beijing University of Posts and Telecommunications
Application filed by Beijing University of Posts and Telecommunications
Priority: CN202410285158.1A
Publication: CN118298319A (pending)


Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a remote sensing image small target detection method and device, electronic equipment, and a storage medium. The method comprises: extracting features with a multi-scale feature extraction backbone network to obtain a plurality of common feature maps and a high-resolution feature map of the remote sensing image; extracting cross-scale semantic information of each feature layer in a cascaded manner to obtain a cross-scale semantic information feature representation vector of the high-resolution feature map; generating a density map characterizing foreground probability based on that representation vector; and performing weighted fusion of the density map with the plurality of common feature maps and the high-resolution feature map, then outputting the small target detection result of the remote sensing image. Because the density map is fused with the feature maps rather than merely used to crop images, the method alleviates the problem of blurred target feature representation, particularly for small targets, in remote sensing image target detection.

Description

Remote sensing image small target detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of target detection technologies, and in particular, to a method and apparatus for detecting a small target in a remote sensing image, an electronic device, and a storage medium.
Background
Object detection is a basic task in the field of computer vision, aimed at locating bounding boxes of objects of interest in images and videos and discriminating their categories. Unlike general object detection, the data for remote sensing image target detection is usually acquired by aerial or satellite platforms, and the images have different resolutions and specific sensor properties. Aerial remote sensing images contain numerous small targets, and the target areas are dense and difficult for a detector to distinguish. In the related art, target detection on aerial remote sensing images is performed with deep learning detection models, but because a convolutional neural network is adopted as the backbone and applies downsampling, the features of small-scale targets on the feature map vanish or become hard to distinguish from the background. For the detection head of the detection model, small targets offer few effective image features, so the head is easily affected by image noise; moreover, the model is harder to converge because small perturbations of a bounding box cause large fluctuations in the evaluation metrics. Small targets are also numerous and densely distributed, so the small target features identified by the model are blurred.
Disclosure of Invention
The invention provides a remote sensing image small target detection method and device, electronic equipment, and a storage medium, to overcome the defects of traditional model-based detection methods, namely that the detector struggles to distinguish small targets and that identified small targets are easily affected by image noise.
The invention provides a method for detecting a small target of a remote sensing image, which comprises the following steps:
extracting a backbone network based on the multi-scale features, and acquiring a plurality of common feature images and high-resolution feature images of the remote sensing image;
Dividing the plurality of common feature images into a plurality of feature layers, and extracting cross-scale semantic information of each feature layer in a cascading way to obtain a cross-scale semantic information feature representation vector of the high-resolution feature image;
Generating a density map representing foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map;
And carrying out weighted fusion on the density map, a plurality of common feature maps and high-resolution feature maps, enhancing target characterization of predicted target positions on the common feature maps and the high-resolution feature maps, and outputting a small target detection result of the remote sensing image, wherein the predicted target positions are areas with foreground probability values higher than a preset value in the density map.
According to the remote sensing image small target detection method provided by the invention, the backbone network of the multi-scale feature extractor is a ResNet50-FPN network, the ResNet50-FPN network comprising a ResNet network and an FPN network; extracting features with the multi-scale feature extraction backbone network to obtain a plurality of common feature maps and a high-resolution feature map of the remote sensing image comprises the following steps:
Extracting multi-scale features of the remote sensing image based on the ResNet network;
And carrying out feature processing on the multi-scale features based on the FPN network to output a plurality of common feature graphs and high-resolution feature graphs.
According to the method for detecting the small target of the remote sensing image, the plurality of common feature images are divided into a plurality of feature layers, the cross-scale semantic information of each feature layer is extracted in a cascading way, and the cross-scale semantic information feature representation vector of the high-resolution feature image is obtained, and the method comprises the following steps:
Extracting cross-scale semantic information from each feature map by using dilated (atrous) convolution kernels with different dilation rates, and converting the information into representation vectors with the same spatial size as the original map but a channel number far smaller than the original map's, to obtain a multi-scale representation vector corresponding to each layer;
and carrying out cascade updating on the representation vector of each scale of the current layer by using the multi-scale representation vector of the previous layer, and obtaining the trans-scale semantic information feature representation vector of the high-resolution feature map.
According to the method for detecting the small target of the remote sensing image, provided by the invention, the density map for representing the foreground probability is generated based on the trans-scale semantic information feature representation vector of the high-resolution feature map, and the method comprises the following steps:
And carrying out channel compression on the cross-scale semantic information feature expression vector of the high-resolution feature map, and judging the foreground and background probabilities by using a classification convolution to obtain a density map representing the foreground probability, wherein the density map representing the foreground probability is the same as the high-resolution feature map in size and contains cross-scale information.
According to the remote sensing image small target detection method provided by the invention, the value of the dilation rate is determined according to the target size distribution of the dataset.
According to the method for detecting the small target of the remote sensing image, which is provided by the invention, the density map is subjected to weighted fusion with a plurality of common feature maps and high-resolution feature maps, and the method comprises the following steps:
scaling the density map characterizing the foreground probability through quadratic interpolation to obtain density maps with the same sizes as the feature maps of all levels;
and carrying out weighted fusion on the density map and the plurality of common feature maps and the high-resolution feature map by using fusion factors.
According to the remote sensing image small target detection method provided by the invention, the density map characterizing the foreground probability, generated based on the cross-scale semantic information feature representation vector of the high-resolution feature map, is output by a trained cascade density map generating module;
the remote sensing image small target detection result is output by a trained classification head and a trained regression head;
The classification head, the regression head and the cascade density map generation module are trained in a combined mode, and a loss function in the combined training is the sum of a classification loss function, a regression loss function and a density map loss function;
The classification loss function is a focus loss function, the regression loss function is a smooth L1 loss function, and the density map loss function is a mean square error loss function.
The invention also provides a device for detecting the small target of the remote sensing image, which comprises the following steps:
the acquisition module is used for extracting a backbone network based on the multi-scale characteristics and acquiring a plurality of common characteristic images and high-resolution characteristic images of the remote sensing image;
The extraction module is used for dividing the plurality of common feature images into a plurality of feature layers, extracting cross-scale semantic information of each feature layer in a cascading way, and obtaining a cross-scale semantic information feature representation vector of the high-resolution feature image;
the generation module is used for generating a density map representing the foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map;
And the output module is used for carrying out weighted fusion on the density map, a plurality of common feature maps and high-resolution feature maps, enhancing target characterization of predicted target positions on the common feature maps and the high-resolution feature maps, and outputting a small target detection result of the remote sensing image, wherein the predicted target positions are areas with the foreground probability values higher than a preset value in the density map.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the remote sensing image small target detection method is realized by the processor when the processor executes the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the remote sensing image small target detection method of any one of the above.
According to the method, device, electronic equipment, and storage medium for detecting small targets in remote sensing images provided by the invention, a plurality of common feature maps and a high-resolution feature map of the remote sensing image are obtained through a multi-scale feature extraction backbone network; the plurality of common feature maps are divided into a plurality of feature layers, and cross-scale semantic information of each feature layer is extracted in a cascaded manner to obtain a cross-scale semantic information feature representation vector of the high-resolution feature map; a density map characterizing foreground probability is generated based on this representation vector; and the density map is weighted-fused with the plurality of common feature maps and the high-resolution feature map, enhancing the target characterization at predicted target positions on these feature maps, and the small target detection result of the remote sensing image is output, wherein the predicted target positions are areas of the density map whose foreground probability value exceeds a preset value. The target area position information of the density map thus alleviates the detector's target localization problem; meanwhile, because the density map is fused with the feature maps rather than merely used to crop images, the problem of blurred target feature representation, particularly for small targets, in remote sensing image target detection is also alleviated.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a remote sensing image small target detection method provided by the invention;
FIG. 2 is a schematic diagram of a remote sensing image small target detection method according to the present invention;
FIG. 3 is a schematic diagram of a cascade density map generating module according to the present invention;
FIG. 4 is a schematic diagram of visualization results on the VisDrone dataset provided by the present invention;
FIG. 5 is a schematic diagram of the visualized coincidence between high-probability areas of the density map and annotation boxes;
FIG. 6 is a schematic structural diagram of a remote sensing image small target detection device provided by the invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a remote sensing image small target detection method provided by an embodiment of the present invention, and as shown in fig. 1, the remote sensing image small target detection method provided by the embodiment of the present invention includes:
Step 101, extracting a backbone network based on multi-scale features, and acquiring a plurality of common feature images and high-resolution feature images of a remote sensing image;
step 102, dividing a plurality of common feature graphs into a plurality of feature layers, and cascading and extracting cross-scale semantic information of each feature layer to obtain a cross-scale semantic information feature representation vector of a high-resolution feature graph;
step 103, generating a density map representing foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map;
And 104, carrying out weighted fusion of the density map with the plurality of common feature maps and the high-resolution feature map, enhancing the target characterization of predicted target positions on these feature maps, and outputting the small target detection result of the remote sensing image, wherein the predicted target positions are areas of the density map whose foreground probability values are higher than a preset value.
According to the remote sensing image small target detection method provided by the invention, a plurality of common feature maps and a high-resolution feature map of the remote sensing image are obtained through a multi-scale feature extraction backbone network; the plurality of common feature maps are divided into a plurality of feature layers, and cross-scale semantic information of each feature layer is extracted in a cascaded manner to obtain a cross-scale semantic information feature representation vector of the high-resolution feature map; a density map characterizing foreground probability is generated based on this representation vector; and the density map is weighted-fused with the plurality of common feature maps and the high-resolution feature map, enhancing the target characterization at predicted target positions, where the predicted target positions are areas of the density map whose foreground probability value exceeds a preset value, and the small target detection result of the remote sensing image is output. The target area position information of the density map thus alleviates the detector's target localization problem; meanwhile, because the density map is fused with the feature maps rather than merely used to crop images, the problem of blurred target feature representation, particularly for small targets, in remote sensing image target detection is also alleviated.
The architecture of the remote sensing image small target detection method provided by the embodiment of the invention is shown in fig. 2. The common feature maps and the high-resolution feature map of the image are obtained from a ResNet-FPN backbone network; however, directly detecting on the high-resolution map does not increase the feature information of targets in the image. Therefore, a Cascade Density Generating Module (CDGM) is added between the generic target detection model's backbone network (Backbone) and head network (Head). The cascade density map generating module extracts cross-scale semantic information across the layers of the feature pyramid in a cascaded manner to generate a density map characterizing foreground probability. The density map itself represents the probability that an area is target foreground; it guides the detection head to distinguish targets from background areas and mitigates the difficulty the detector has in localizing targets. Meanwhile, the density map generated by the CDGM is fused with the high-resolution feature map, overcoming the limitation of conventional density-map-assisted detection methods, which cannot improve the characterization of blurred small targets.
Based on any of the above embodiments, the backbone network of the multi-scale feature extractor is a ResNet50-FPN network comprising a ResNet network and an FPN network; using the multi-scale feature extraction backbone network to obtain a plurality of common feature maps and a high-resolution feature map of the remote sensing image includes:
extracting multi-scale features of the remote sensing image based on ResNet network;
and carrying out feature processing on the multi-scale features based on the FPN network to output a plurality of common feature images and high-resolution feature images.
In the embodiment of the invention, a RetinaNet model is used as the baseline to be improved. The model comprises two parts: a backbone network with an FPN that outputs a multi-scale feature map, and two detection heads for classification and regression. In the original implementation, the backbone network outputs three feature maps P_3, P_4, P_5. When the size of the input image is H × W, the FPN outputs features P_l of size H_l × W_l × C, where l denotes the level of the feature pyramid. In a typical feature pyramid, (H_{l+1}, W_{l+1}) equals (H_l/2, W_l/2); that is, the feature map of level l has height and width (H/2^l, W/2^l). C denotes the number of channels and is determined by the convolutional layer. The detection head consists of four 3×3 convolution layers, and one additional 3×3 convolution layer is used to output the target detection result. For parameter efficiency, the same detection head parameters are shared for detection on the different feature layers.
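The shared detection head just described can be sketched in PyTorch as follows; the channel count (256) and the number of output channels (9) are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

# Hedged sketch of the shared detection head: four 3x3 convolution layers
# plus one additional 3x3 convolution that emits the per-location
# predictions. Channel counts are assumptions for illustration.
def make_shared_head(channels: int = 256, out_channels: int = 9) -> nn.Sequential:
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

head = make_shared_head()
# The very same module (same parameters) runs on every feature level.
outputs = [head(torch.randn(1, 256, s, s)) for s in (64, 32, 16)]
```

Sharing one head across levels keeps the parameter count independent of the number of pyramid levels, which is the efficiency argument made above.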
The embodiment of the invention adopts ResNet-FPN as the backbone network. As for the feature maps to be detected output by the backbone network, besides the P_3, P_4, P_5 feature maps already present in the original network, a high-resolution feature map P_2 is added, a method widely used in the field of remote sensing image target detection, to improve the resolution of small targets during detection.
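To make the pyramid geometry concrete, here is a toy PyTorch backbone (purely illustrative, not the patented ResNet50-FPN) that emits P_2 through P_5, where level l has spatial size (H/2^l, W/2^l) and a fixed channel count C:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: a toy backbone emitting an FPN-style pyramid P2-P5.
# Each stage halves the spatial size; laterals and a top-down pathway
# merge coarse semantics into finer levels, as in a standard FPN.
class ToyFPNBackbone(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, stride=2, padding=1)  # stride 2
        self.stages = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(4)
        )  # strides 4, 8, 16, 32 -> C2..C5
        self.lateral = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(4))

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # C2..C5
        # Top-down pathway: upsample the coarser map and add the lateral.
        p = self.lateral[-1](feats[-1])
        pyramid = [p]
        for c, lat in zip(reversed(feats[:-1]), reversed(self.lateral[:-1])):
            p = lat(c) + F.interpolate(p, size=c.shape[-2:], mode="nearest")
            pyramid.insert(0, p)
        return pyramid  # [P2, P3, P4, P5]

x = torch.randn(1, 3, 256, 256)
p2, p3, p4, p5 = ToyFPNBackbone()(x)
```

For a 256×256 input, P_2 is 64×64 (stride 4) and P_5 is 8×8 (stride 32), matching the (H/2^l, W/2^l) relation stated above.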
Based on any of the above embodiments, dividing a plurality of common feature graphs into a plurality of feature layers, extracting cross-scale semantic information of each feature layer in a cascade manner, and obtaining a cross-scale semantic information feature representation vector of the high-resolution feature graph, including:
Step 201, extracting cross-scale semantic information from each feature map by using dilated convolution kernels with different dilation rates, and converting it into representation vectors with the same spatial size as the original map but far fewer channels, to obtain a multi-scale representation vector corresponding to each layer;
Step 202, cascade updating the representing vector of each scale of the current layer by using the multi-scale representing vector of the previous layer to obtain the cross-scale semantic information feature representing vector of the high-resolution feature map.
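Step 201 can be sketched as a bank of dilated convolutions; the dilation rates and channel counts below are illustrative assumptions (the patent states the rates depend on the dataset's target size distribution):

```python
import torch
import torch.nn as nn

# Hedged sketch of the multi-scale extraction in step 201: dilated 3x3
# convolutions with different dilation rates turn one feature map into
# several same-size representations whose channel count is much smaller
# than the input's. Rates (1, 2, 4, 8) and 32 channels are assumptions.
class DilatedPyramid(nn.Module):
    def __init__(self, in_ch: int = 256, rep_ch: int = 32, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, rep_ch, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, feat):
        return [branch(feat) for branch in self.branches]

feat = torch.randn(1, 256, 32, 32)
reps = DilatedPyramid()(feat)
```

Each branch sees a different effective receptive field while producing an output of the same height and width, which is what allows the per-scale vectors to be updated and concatenated in step 202.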
Based on any of the above embodiments, generating a density map characterizing a foreground probability based on the cross-scale semantic information feature representation vector of the high resolution feature map includes:
And carrying out channel compression on the cross-scale semantic information feature expression vector of the high-resolution feature map, and judging the foreground and background probabilities by using a classification convolution to obtain a density map representing the foreground probability, wherein the density map representing the foreground probability is the same as the high-resolution feature map in size and contains cross-scale information.
In the embodiment of the invention, the generation process of the density map is cascaded: the long-range semantic information of higher layers is stronger and is successively merged into lower layers, finally yielding a density map containing comprehensive semantic information.
In the embodiment of the invention, the value of the dilation rate is determined according to the target size distribution of the dataset.
Dilated convolutions with various dilation rates have larger receptive fields than ordinary convolution, so information at larger scales can be captured.
The remote sensing image has many small targets with blurred features that are easily confused with the background, and conventional object detection models have difficulty noticing them. In this regard, the embodiment of the invention provides a cascade density map generating module, which redefines the convolution receptive field around the small target and focuses on the small target's local information. The processing procedure of the cascade density map generating module is as follows. First, on the feature pyramid output by the model backbone network, i.e., the feature maps of multiple levels and multiple scales, the cascade density map generating module extracts cross-scale information with multi-scale dilated convolutions on each feature map; the dilation rate of each convolution kernel equals its edge-padding in pixels, so cross-scale feature information is extracted into multi-scale sub-feature maps of unchanged spatial size. As shown in fig. 3, for each feature layer input P_l, the CDGM module uses a Pyramid Pool Module (PPM) to extract cross-scale semantic information. The PPM module is a group of dilated convolution kernels with different dilation rates k_p; each kernel extracts information of the feature map at a different scale and converts it into a representation vector with the same spatial size as the original map but far fewer channels. The specific values of the dilation rates k_p are determined according to the target size distribution of the dataset. Subsequently, the representation vector of each scale (denoted F_p^l) is updated with the upper level's multi-scale representation (denoted M^{l+1}), with an update of the form:

F_p^l ← F_p^l + σ(Conv_cas(M^{l+1})) ⊙ M^{l+1}

where σ denotes the sigmoid function and Conv_cas is a set of convolutions with 1×1 pooling that computes the weight with which the upper-level representation M^{l+1} is fused into F_p^l. Then, the representation vectors F_p^l from the same feature map are concatenated along the channel dimension to obtain the multi-scale representation M^l corresponding to this layer's feature map. This multi-scale representation is used to update the lower layer's F_p^{l-1}; that is, the scale updates incorporating upper-level feature map information are cascaded. This operation is repeated sequentially on P_4, P_3, P_2, yielding the cross-scale semantic information feature representation M^2 on P_2. Finally, M^2 is channel-compressed, and a classification convolution judges the foreground and background probabilities; the resulting foreground probability map is the density map S, which has the same size as the original P_2 feature layer, contains cross-scale information, and has shape (H_2, W_2, 1). In fusion, the density map is scaled via a quadratic interpolation algorithm to the same size as the feature maps of all levels (matching each feature map P_l's width and height), and channel-by-channel weighted summation fuses the cross-scale density map with the high-resolution feature map and all other feature maps using the following formula:
P_l ← α × P_l + β × S
where α and β are the fusion factors.
By fusing the density map generated by the cascade density map generating module with the feature map of each layer, the high-probability areas of the density map enhance the representation of targets in the corresponding areas of the feature map, alleviating the blurred features of targets, particularly small targets, and improving the information representation at the targets' corresponding positions on the feature map.
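The cascaded update and the density head can be sketched in PyTorch as follows; the exact gating form, the module sizes, and the bilinear upsampling of the upper-level representation are assumptions for illustration, not the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the cascaded update: the upper level's multi-scale
# representation M_{l+1} is upsampled to the current level, a
# sigmoid-gated 1x1 convolution (standing in for Conv_cas) weighs it,
# and the gated result is added to each per-scale representation F_p^l.
# Updated representations are concatenated into M_l; at the finest level,
# a small head compresses channels and emits a one-channel density map S.
class CascadeStep(nn.Module):
    def __init__(self, rep_ch: int = 32, n_scales: int = 4):
        super().__init__()
        self.conv_cas = nn.Conv2d(n_scales * rep_ch, n_scales * rep_ch, 1)

    def forward(self, reps, m_upper=None):
        # reps: list of per-scale representations F_p^l (same H, W)
        if m_upper is not None:
            m_up = F.interpolate(m_upper, size=reps[0].shape[-2:],
                                 mode="bilinear", align_corners=False)
            gate = torch.sigmoid(self.conv_cas(m_up))      # fusion weight
            chunks = torch.chunk(gate * m_up, len(reps), dim=1)
            reps = [r + c for r, c in zip(reps, chunks)]   # cascade update
        return torch.cat(reps, dim=1)                      # M_l

density_head = nn.Sequential(nn.Conv2d(128, 32, 1), nn.ReLU(),
                             nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

step = CascadeStep()
m4 = step([torch.randn(1, 32, 16, 16) for _ in range(4)])       # top level
m2 = step([torch.randn(1, 32, 64, 64) for _ in range(4)], m4)   # cascaded
density = density_head(m2)  # S: one channel, same size as the P2 level
```

The sigmoid keeps every density value in (0, 1), so S can be read directly as a per-pixel foreground probability.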
Based on any of the above embodiments, performing weighted fusion on the density map and the plurality of normal feature maps and the high-resolution feature map, including:
step 301, scaling the density map representing the foreground probability through a quadratic interpolation calculation to obtain a density map with the same size as the feature map of all the levels;
and 302, carrying out weighted fusion on the density map and the plurality of common feature maps and the high-resolution feature map by using fusion factors.
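Steps 301 and 302 can be sketched as follows; bilinear resizing stands in for the quadratic interpolation, and the fusion factor values are placeholders:

```python
import torch
import torch.nn.functional as F

# Sketch of the fusion: the density map S is resized to each pyramid
# level and fused as P_l <- alpha * P_l + beta * S. The one-channel S
# broadcasts across all feature channels. alpha/beta values are assumed.
def fuse_density(pyramid, density, alpha: float = 1.0, beta: float = 0.5):
    fused = []
    for p in pyramid:
        s = F.interpolate(density, size=p.shape[-2:], mode="bilinear",
                          align_corners=False)
        fused.append(alpha * p + beta * s)  # broadcast over channels
    return fused

pyramid = [torch.randn(1, 256, 64 // 2**i, 64 // 2**i) for i in range(3)]
density = torch.rand(1, 1, 64, 64)
fused = fuse_density(pyramid, density)
```

Because only a resize and a weighted sum are involved, the fusion adds no learned parameters beyond the factors α and β themselves.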
Based on any embodiment, generating a density map representing foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map through output of a trained cascade density map generating module;
The remote sensing image small target detection result is output through a trained classification head and a trained regression head;
The classification head, the regression head and the cascade density map generation module are trained in a combined mode, and a loss function in the combined training is the sum of a classification loss function, a regression loss function and a density map loss function;
the classification loss function is a focus loss function, the regression loss function is a smooth L1 loss function, and the density map loss function is a mean square error loss function.
In the embodiment of the invention, the cascade density map generating module is not trained alone; it is trained jointly with the classification head and the regression head. The classification head loss Loss_cls uses the Focal Loss from RetinaNet, and the regression head loss Loss_reg is a smooth L1 loss. Besides the original Loss_cls and Loss_reg, the total loss of the model adds a loss term for the density map, Loss_den, expressed as:
Loss_all = Loss_cls + Loss_reg + Loss_den
Loss_den uses the mean square error (MSE) as the loss calculation:
Loss_den = MSE(S_gt, S)
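The joint loss Loss_all = Loss_cls + Loss_reg + Loss_den can be sketched with plain NumPy stand-ins for the three terms; the function names and the default hyper-parameters (alpha=0.25, gamma=2.0, beta=1.0, the usual focal-loss and smooth-L1 defaults) are illustrative assumptions, not values from the patent.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Binary focal loss on predicted foreground probabilities p
    # for labels y in {0, 1}.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)
    a = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-a * (1 - pt) ** gamma * np.log(pt))

def smooth_l1(x, t, beta=1.0):
    # Quadratic below beta, linear above: the smooth L1 loss.
    d = np.abs(x - t)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def mse(s_gt, s):
    return np.mean((s_gt - s) ** 2)

def total_loss(cls_p, cls_y, reg_x, reg_t, den_gt, den):
    # Loss_all = Loss_cls + Loss_reg + Loss_den; one backward pass
    # over this sum trains the heads and the CDGM jointly.
    return focal_loss(cls_p, cls_y) + smooth_l1(reg_x, reg_t) + mse(den_gt, den)
```

In training the sum would be built from tensors so a single backward pass updates all three branches, as the text describes.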
S_gt is the target annotation data of the original image, generated by the following algorithm. First, a batch of all-zero density maps S_gt is initialized with the same spatial size as the target density map and a single channel. Then, every ground-truth annotation box of each image is traversed. For each target box, the center point and the extents along the width and height directions are computed. Next, an elliptical probability region corresponding to the target box is drawn on S_gt: the pixel at the center of the target box has a foreground probability of 1 on S_gt, the probability decreases linearly while diffusing outward, and the probability outside the box is 0. Finally, the S_gt probability map is normalized. During back-propagation, the loss term Loss_den of the cascade density map generation module is summed with Loss_cls of the classification head and Loss_reg of the regression head to obtain the total loss, which is back-propagated directly, avoiding the separate training stage required by traditional density-map-assisted detection.
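The S_gt construction algorithm above can be sketched as follows. The linear elliptical decay matches the description; the box format (cx, cy, bw, bh) and the max-merge for overlapping boxes are illustrative assumptions.

```python
import numpy as np

def gt_density_map(boxes, h, w):
    """Build the target density map S_gt for one image.

    boxes: list of (cx, cy, bw, bh) ground-truth boxes in map
    coordinates (a hypothetical format). Probability is 1 at each
    box centre, decays linearly towards the box border, and is 0
    outside the box.
    """
    s_gt = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for cx, cy, bw, bh in boxes:
        # Normalised elliptical distance from the box centre:
        # 0 at the centre, 1 on the inscribed ellipse border.
        r = np.sqrt(((xs - cx) / (bw / 2)) ** 2 + ((ys - cy) / (bh / 2)) ** 2)
        prob = np.clip(1.0 - r, 0.0, 1.0)
        s_gt = np.maximum(s_gt, prob)  # overlapping boxes keep the max
    m = s_gt.max()
    return s_gt / m if m > 0 else s_gt  # final normalisation
```

The resulting map is what the MSE term compares against the CDGM output S.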
According to the remote sensing image small target detection method provided by the embodiment of the invention, the target representation on the feature map is enhanced by the high-resolution feature map and the density map generation module, effectively improving detection accuracy in complex remote sensing environments. A new cascade density map generation module is introduced, which performs density map generation on the feature pyramid in a cascaded manner to acquire cross-scale semantic information, so that a density map with accurate target region information is generated to address the difficulty a detector has in locating targets on remote sensing images. Unlike methods in which the density map assists object detection merely by cropping the image, the density map is fused with the feature map. The foreground information of the density map enhances the feature information of targets, particularly blurred small targets, on the feature map, addressing the problem that targets on remote sensing images are affected by noise, have missing features, and are hard for the detector to find.
In some embodiments of the present invention, the effectiveness of the proposed method relative to mainstream remote sensing image target detection methods is verified by quantitative and qualitative results on the aerial remote sensing target detection dataset VisDrone.
VisDrone is a dataset dedicated to detection in images captured by unmanned aerial vehicles, covering several tasks such as object detection and object tracking. The object detection dataset comprises 10 categories (pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, motor), with 6,471 training images, 548 validation images and 3,190 test images, 10,209 images in total. Small objects follow the MS COCO definition: objects whose annotation box area is smaller than 32×32 pixels. In the VisDrone dataset, small targets with area below 32×32 account for over 60% of all targets and dominate the overall target scale distribution.
In the experiment, each image of the VisDrone training set was divided into 4 non-overlapping sub-images, which were trained on independently. In training, the model backbone uses a ResNet network with ImageNet pre-trained weights, using the Detectron2 default data augmentation. The optimizer is Stochastic Gradient Descent (SGD), with image batch size 8, initial learning rate 0.01, and 50,000 iterations, decaying the learning rate to 1/10 of its value at iterations 30,000 and 40,000. Since over 60% of the targets in VisDrone are smaller than 32×32 pixels, at the level of feature map P2 the feature area corresponding to a small object shrinks to 8×8. Therefore, the dilation rate range k_p in the CDGM module is set to k_p ∈ {1, 2, 3, 4}, focusing only on smaller objects and their surrounding information and avoiding the introduction of extra noise. The experimental environment is Ubuntu 18.04 with 2 NVIDIA RTX 3090 GPUs and CUDA version 12.0. The method of the embodiment of the invention is implemented with the PyTorch and Detectron2 tool libraries, and the SAHI tool library is used for sliced super-resolution inference. The three models used in the experiments were trained with exactly the same training configuration.
The detection results were evaluated with the official VisDrone evaluation tool; the evaluation indices are the mainstream mean Average Precision (mAP) and AP50 from the object detection field. The mAP is computed as follows. First, under a specified IoU threshold, the precision P and recall R are calculated:
P = TP / (TP + FP),  R = TP / (TP + FN)
where TP is the number of targets the model predicts correctly, FP is the number of mispredicted targets, and FN is the number of targets the model fails to detect. The AP under this threshold is the area under the precision-recall curve, and the mean average precision is then computed as:
mAP = (1/N) Σ_i AP_i
where AP_i denotes the AP value computed at IoU threshold i; a prediction is marked correct when IoU, the ratio of the overlap area between the predicted box and the ground-truth box to their union area, exceeds i. For example, AP50 denotes the AP at a threshold of 0.50. VisDrone uses the same evaluation protocol as MS COCO: IoU takes 10 evenly spaced values from 0.5 to 0.95, over which AP and mAP are computed.
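The precision, recall, IoU and COCO-style mAP computations described above can be sketched as follows; the function names and the (x1, y1, x2, y2) box format are illustrative conventions, not from the patent.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection / union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def coco_map(ap_fn):
    # mAP averages AP over IoU thresholds 0.50, 0.55, ..., 0.95,
    # matching the MS COCO protocol used by VisDrone.
    thresholds = np.arange(0.5, 1.0, 0.05)
    return float(np.mean([ap_fn(t) for t in thresholds]))
```

Here ap_fn stands for whatever routine computes AP at one threshold; the sketch only shows how the 10 thresholds are averaged.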
In order to systematically verify the contribution of each component of the proposed method to model performance, a series of detailed ablation experiments was performed on the VisDrone dataset. Each experiment used exactly the same configuration as the baseline model to ensure comparability of the results. Detection strategies were added step by step, and the mean average precision (mAP) of the model was recorded to evaluate the performance change. The ablation results are shown in Table 1.
Table 1. Results of ablation experiments on the VisDrone dataset
Model mAP AP50
RetinaNet 31.03 53.60
RetinaNet+P2 32.49(+1.46) 56.03
RetinaNet+P2+CDGM 32.84(+0.35) 56.83
RetinaNet+P2+CDGM+SAHI 34.49(+1.65) 61.15
The impact of each component includes:
Effect of the high-resolution feature map P2: first, adding the high-resolution feature map P2 lifted the baseline model's mAP from 31.03 to 32.49. This result matches the corresponding experiment in QueryDet and shows that a high-resolution feature map effectively improves model performance in the remote sensing target detection task. The embodiment of the invention thereby markedly improves the model's ability to detect small targets, demonstrating the value of the high-resolution feature map in remote sensing image detection.
Effect of the cascade density map generation module: after introducing the density map module CDGM, the model's mAP rose further to 32.84. Applied on top of the high-resolution feature map, CDGM enlarges the mAP gain by a further 24% (0.35 on top of 1.46). This improvement illustrates the density map module's role in strengthening target feature representation on the feature map, especially for small and blurred objects.
Effect of the sliced super-resolution inference strategy SAHI: adding the sliced super-resolution inference strategy SAHI further lifts the detection mAP to 34.49. This strategy detects magnified local regions of the image and fuses the results with the original detections, increasing diversity at inference time so that the model inspects the same region more comprehensively, further improving its detection of varied targets.
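A minimal sketch of SAHI-style sliced inference, implemented manually rather than with the SAHI library; detect_fn, the tile size and the overlap ratio are hypothetical, and a real pipeline would apply NMS to merge the pooled detections.

```python
import numpy as np

def sliced_inference(image, detect_fn, tile=512, overlap=0.25):
    """Run detect_fn on the full image and on overlapping tiles,
    then pool all detections in image coordinates.

    detect_fn(patch) -> list of (x1, y1, x2, y2, score) boxes in
    patch coordinates (a hypothetical interface).
    """
    h, w = image.shape[:2]
    step = int(tile * (1 - overlap))
    boxes = list(detect_fn(image))  # original full-image pass
    for y0 in range(0, max(h - tile, 0) + 1, step):
        for x0 in range(0, max(w - tile, 0) + 1, step):
            patch = image[y0:y0 + tile, x0:x0 + tile]
            # Shift tile detections back to image coordinates.
            for x1, y1, x2, y2, s in detect_fn(patch):
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, s))
    # A real pipeline would run NMS here to merge duplicates.
    return boxes
```

Because each tile is fed to the detector at full input resolution, small objects occupy more pixels than in the downscaled full image, which is the "super-resolution" effect the text refers to.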
One advantage of the density map is its explicit semantic information: high-probability regions correspond to foreground target regions of the input picture, so the density map's effect can be verified through its association with the target regions. The density map is therefore visualized to inspect this effect. The model visualization results are shown in Fig. 4. From left to right: the visualization of the feature pyramid P2 layer after channel-mean normalization, the visualization of the probability values of the density map generated by CDGM on the P2 feature map, and the visualization after channel-mean normalization of the P2 layer fused with the density map. The fused visualization shows that environmental noise on the original feature map, such as the building windows on the right of the image, is suppressed, while the target-bearing region on the left of the image remains bright on the feature map, and the region of small targets at the upper left, far from the camera, emerges from the background, reflecting the density map's enhancement of the feature map.
Next, the high-probability region of the density map and the ground-truth target boxes are visualized on the same map to check their degree of overlap. The visualization shows that the high-probability region of the density map coincides strongly with the target positions, so the density map carries accurate target location information. Moreover, as shown in Fig. 5, even for occluded targets (the VisDrone dataset separately annotates targets with occlusion rate below 50% and above 50%) and for smaller targets, the density map can still mark their locations. These results show that the density map generated by the proposed CDGM module extracts target position information from the feature map, enhancing small-target representation on the feature map and guiding the model toward regions that may contain small targets.
The method provided by the embodiment of the invention is compared with other state-of-the-art small target detection methods. Table 2 gives the experimental results of various recent detection methods on the VisDrone dataset, where FPN-ClusDet denotes the performance reported in the paper and RetinaNet-Ours denotes the performance obtained in our repeat experiments. Overall, the method of this embodiment outperforms ClusDet, DMNet and CZDet, which use density maps only for cropping images. Compared with the original RetinaNet model, the mAP of the method improves by 3.46 percentage points. QueryDet is also built on RetinaNet, adding a high-resolution feature map P2 and adopting a sparse-inference detection head to accelerate detection. Even compared with QueryDet, which likewise uses a high-resolution feature map, the mAP of the embodiment still improves by 2.0 percentage points.
Table 2. Mean average precision (mAP) and AP50 on the VisDrone object detection dataset
More specifically, the mAP of all 10 categories on the VisDrone validation set is examined. On each target class of the validation set, the proposed method improves mAP over the baseline RetinaNet by 1 to 9 percentage points: 3.52 points on pedestrian, 3.58 on people, 3.97 on bicycle, 1.48 on car, 3.09 on van, 2.73 on truck, 4.69 on tricycle, 3.65 on awning-tricycle, 9.45 on bus, and 3.62 on motor. Overall, the proposed method brings clear and comprehensive improvements across all validation-set categories. By per-class mAP, the largest single-class gain, 9.45 points, is on bus, while the gain on car is relatively small at 1.48 points. This is because car samples are the most abundant, diverse and distinctive in the dataset, so performance there is more likely driven by model learning. In the other categories, samples are fewer and learning is harder, so the improvement in detection performance is attributable more to good model design, verifying that the method's benefit for hard-to-detect targets is general. On the other hand, even compared with the advanced small target detector QueryDet, the method still obtains a gain of up to 4.83 percentage points (on the bus class) and 0.38 points on the car class, fully illustrating its effectiveness in improving target representation in aerial remote sensing tasks.
The test results were evaluated using the evaluation tool provided by the VisDrone officials; the evaluation results are shown in Table 3.
Table 3. Per-class accuracy on the 10 categories of the VisDrone object detection dataset
Category(s) RetinaNet QueryDet Ours
pedestrian 31.18 33.32 34.70
people 21.45 23.07 25.03
bicycle 17.44 18.04 21.41
car 59.44 60.54 60.92
van 37.12 38.03 40.21
truck 31.18 29.32 30.78
tricycle 21.45 23.50 25.87
awning-tricycle 11.06 12.20 14.71
bus 44.73 49.35 54.18
motor 29.02 29.87 32.64
Aiming at the problems of blurred target features and difficult localization in remote sensing image target detection, the embodiment of the invention proposes a remote sensing image target detection network based on density-map feature enhancement. By introducing the CDGM module, a density map with accurate target region information is obtained, addressing the difficulty a detector has in locating targets on remote sensing images; meanwhile, the density map generated by CDGM is fused with the feature map, enhancing the target representation at positions on the feature map corresponding to high-probability regions of the density map, which addresses the blurring of small-target features caused by noise in traditional density-map-assisted detection. Compared with other existing schemes, the embodiment fuses the density map with the feature map to assist both target localization and target feature enhancement, effectively improving the detection accuracy of the target detection model in complex remote sensing scenes.
The remote sensing image small target detection device provided by the invention is described below, and the remote sensing image small target detection device described below and the remote sensing image small target detection method described above can be correspondingly referred to each other.
Fig. 6 is a schematic diagram of a remote sensing image small target detection device provided by an embodiment of the present invention, and as shown in fig. 6, the remote sensing image small target detection device provided by the embodiment of the present invention includes:
the acquiring module 601 is configured to acquire a plurality of common feature graphs and a high-resolution feature graph of a remote sensing image based on a multi-scale feature extraction backbone network;
the extracting module 602 is configured to divide the plurality of common feature graphs into a plurality of feature layers, extract cross-scale semantic information of each feature layer in a cascade manner, and obtain a cross-scale semantic information feature representation vector of the high-resolution feature graph;
A generating module 603, configured to generate a density map characterizing a foreground probability based on the cross-scale semantic information feature representation vector of the high resolution feature map;
And the output module 604 is configured to perform weighted fusion on the density map and a plurality of common feature maps and high-resolution feature maps, enhance target characterization of predicted target positions on the plurality of common feature maps and the high-resolution feature maps, and output a small target detection result of the remote sensing image, where the predicted target positions are regions in the density map where the foreground probability value is higher than a preset value.
According to the remote sensing image small target detection device provided by the embodiment of the invention, a plurality of common feature maps and a high-resolution feature map of the remote sensing image are obtained based on the multi-scale feature extraction backbone network; the plurality of common feature maps are divided into a plurality of feature layers, and cross-scale semantic information of each feature layer is extracted in a cascaded manner to obtain the cross-scale semantic information feature representation vector of the high-resolution feature map; a density map characterizing the foreground probability is generated based on the cross-scale semantic information feature representation vector of the high-resolution feature map; the density map is weightedly fused with the plurality of common feature maps and the high-resolution feature map, enhancing the target representation at predicted target positions on those maps, and the small target detection result of the remote sensing image is output, where the predicted target positions are regions of the density map whose foreground probability exceeds a preset value. The target region position information of the density map addresses the detector's target localization problem; meanwhile, the density map is fused with the feature map rather than merely used to crop images, addressing the problem of target feature representation, particularly the blurring of small-target features, in remote sensing image target detection.
Fig. 7 illustrates a physical schematic diagram of an electronic device; as shown in fig. 7, the electronic device may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720 and memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a remote sensing image small target detection method comprising: acquiring a plurality of common feature maps and a high-resolution feature map of a remote sensing image based on a multi-scale feature extraction backbone network; dividing the plurality of common feature maps into a plurality of feature layers, and extracting cross-scale semantic information of each feature layer in a cascaded manner to obtain a cross-scale semantic information feature representation vector of the high-resolution feature map; generating a density map representing foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map; and carrying out weighted fusion of the density map with the plurality of common feature maps and the high-resolution feature map, enhancing target characterization of the predicted target positions on those maps, and outputting the small target detection result of the remote sensing image.
In another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the remote sensing image small target detection method provided by the above methods, the method comprising: extracting a backbone network based on the multi-scale features, and acquiring a plurality of common feature images and high-resolution feature images of the remote sensing image; dividing a plurality of common feature graphs into a plurality of feature layers, and cascading and extracting cross-scale semantic information of each feature layer to obtain a cross-scale semantic information feature representation vector of a high-resolution feature graph; generating a density map representing foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map; and carrying out weighted fusion on the density map, the plurality of common feature maps and the high-resolution feature map, enhancing target characterization of the predicted target position on the plurality of common feature maps and the high-resolution feature map, and outputting a small target detection result of the remote sensing image.
The apparatus embodiments described above are merely illustrative, wherein elements illustrated as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the various embodiments or methods of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The method for detecting the small target of the remote sensing image is characterized by comprising the following steps of:
acquiring a plurality of common feature maps and a high-resolution feature map of the remote sensing image based on a multi-scale feature extraction backbone network;
Dividing the plurality of common feature images into a plurality of feature layers, and extracting cross-scale semantic information of each feature layer in a cascading way to obtain a cross-scale semantic information feature representation vector of the high-resolution feature image;
Generating a density map representing foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map;
And carrying out weighted fusion on the density map, a plurality of common feature maps and high-resolution feature maps, enhancing target characterization of predicted target positions on the common feature maps and the high-resolution feature maps, and outputting a small target detection result of the remote sensing image, wherein the predicted target positions are areas with foreground probability values higher than a preset value in the density map.
2. The method for detecting a small target in a remote sensing image according to claim 1, wherein the multi-scale feature extraction backbone network is a ResNet50-FPN network, the ResNet50-FPN network comprises a ResNet network and an FPN network, and the acquiring a plurality of common feature maps and a high-resolution feature map of the remote sensing image based on the multi-scale feature extraction backbone network comprises:
Extracting multi-scale features of the remote sensing image based on the ResNet network;
And carrying out feature processing on the multi-scale features based on the FPN network to output a plurality of common feature graphs and high-resolution feature graphs.
3. The method for detecting a small target in a remote sensing image according to claim 2, wherein the steps of dividing the plurality of common feature images into a plurality of feature layers, extracting cross-scale semantic information of each feature layer in a cascading manner, and obtaining a cross-scale semantic information feature representation vector of the high-resolution feature image comprise the following steps:
extracting cross-scale semantic information from each feature map using dilated convolution kernels with different dilation rates, and converting it into a representation vector with the same size as the original feature map but fewer channels, to obtain a multi-scale representation vector corresponding to each layer;
and carrying out cascade updating on the representation vector of each scale of the current layer by using the multi-scale representation vector of the previous layer, and obtaining the trans-scale semantic information feature representation vector of the high-resolution feature map.
4. A method of detecting small objects in a remote sensing image according to claim 3, wherein said generating a density map characterizing foreground probability based on cross-scale semantic information feature representation vectors of said high resolution feature map comprises:
And carrying out channel compression on the cross-scale semantic information feature expression vector of the high-resolution feature map, and judging the foreground and background probabilities by using a classification convolution to obtain a density map representing the foreground probability, wherein the density map representing the foreground probability is the same as the high-resolution feature map in size and contains cross-scale information.
5. A method for detecting a small target in a remote sensing image according to claim 3, wherein the value of the dilation rate is determined according to the target size distribution of the dataset.
6. The method for detecting a small target in a remote sensing image according to claim 1, wherein the weighted fusion of the density map with a plurality of normal feature maps and high-resolution feature maps comprises:
scaling the density map representing the foreground probability through quadratic interpolation to obtain density maps with the same size as the feature maps at all levels;
and carrying out weighted fusion on the density map and the plurality of common feature maps and the high-resolution feature map by using fusion factors.
7. The method for detecting the small target of the remote sensing image according to claim 1, wherein the density map representing the foreground probability based on the cross-scale semantic information feature representation vector generation of the high-resolution feature map is obtained by outputting a trained cascade density map generating module;
the output remote sensing image small target detection result outputs the remote sensing image small target detection result through a trained classification head and a trained regression head;
The classification head, the regression head and the cascade density map generation module are trained in a combined mode, and a loss function in the combined training is the sum of a classification loss function, a regression loss function and a density map loss function;
The classification loss function is a focus loss function, the regression loss function is a smooth L1 loss function, and the density map loss function is a mean square error loss function.
8. A remote sensing image small target detection device, characterized by comprising:
the acquisition module is used for extracting a backbone network based on the multi-scale characteristics and acquiring a plurality of common characteristic images and high-resolution characteristic images of the remote sensing image;
The extraction module is used for dividing the plurality of common feature images into a plurality of feature layers, extracting cross-scale semantic information of each feature layer in a cascading way, and obtaining a cross-scale semantic information feature representation vector of the high-resolution feature image;
the generation module is used for generating a density map representing the foreground probability based on the cross-scale semantic information feature representation vector of the high-resolution feature map;
And the output module is used for carrying out weighted fusion on the density map, a plurality of common feature maps and high-resolution feature maps, enhancing target characterization of predicted target positions on the common feature maps and the high-resolution feature maps, and outputting a small target detection result of the remote sensing image, wherein the predicted target positions are areas with the foreground probability values higher than a preset value in the density map.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the remote sensing image small object detection method of any one of claims 1 to 7 when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the remote sensing image small target detection method according to any one of claims 1 to 7.
CN202410285158.1A 2024-03-13 2024-03-13 Remote sensing image small target detection method and device, electronic equipment and storage medium Pending CN118298319A (en)


Publications (1)

Publication Number Publication Date
CN118298319A true CN118298319A (en) 2024-07-05


CN115311550B (en) Remote sensing image semantic change detection method and device, electronic equipment and storage medium
CN111652350A (en) Neural network visual interpretation method and weak supervision object positioning method
CN114821356B (en) Optical remote sensing target detection method for accurate positioning
CN113705489B (en) Remote sensing image fine-granularity airplane identification method based on priori regional knowledge guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination