CN116824413A - Aerial image target detection method based on multi-scale cavity convolution

Aerial image target detection method based on multi-scale cavity convolution

Info

Publication number: CN116824413A
Application number: CN202310925097.6A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 王丽 (Wang Li)
Current Assignee: Jiangsu University of Science and Technology
Original Assignee: Jiangsu University of Science and Technology
Prior art keywords: training, attention mechanism, convolution, loss, aerial image
Priority date / filing date: 2023-07-25
Publication date: 2023-09-29
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application filed by Jiangsu University of Science and Technology

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an aerial image target detection method based on multi-scale cavity convolution, which comprises the following steps. Step one: acquire an aerial remote sensing image of the target to be detected and crop the acquired high-resolution image. Step two: based on the YOLOv5 target detection model, enhance the feature extraction capability of the backbone feature extraction network by using structural re-parameterization convolution. Step three: design a new mixed domain attention mechanism EDM, add multi-scale cavity convolution in its spatial attention mechanism, and embed the module into the feature fusion network. Step four: introduce the SIOU regression loss function to accelerate convergence. Step five: train the model. Step six: test the test pictures with the trained model. The method can detect various targets in aerial scenes, improves target recognition precision, reduces the missed-detection rate, and can be effectively applied to aerial image target detection.

Description

Aerial image target detection method based on multi-scale cavity convolution
Technical Field
The invention belongs to the field of computer vision target detection, and particularly relates to an aerial image target detection method based on multi-scale cavity (dilated) convolution, which can be used for detecting targets in aerial images.
Background
With the development of aerial photography devices and technologies, platforms such as unmanned aerial vehicles have been deployed in many fields; they can photograph targets accurately without interfering with residents' lives and provide strong support for tasks such as aerial surveying, search and rescue, forest fire early warning, and border patrol. A great deal of research has been carried out on aerial image target detection, but factors such as complex target backgrounds, numerous small targets, dense distribution, and mutual occlusion between targets make it difficult for detection performance to meet practical requirements. Research on aerial image target detection methods is therefore needed to improve detection performance.
For target detection in aerial images, the methods mainly adopted fall into two types: traditional target detection methods and deep-learning-based target detection methods. Traditional target detection algorithms adopt a sliding-window approach; the design and selection of features depend heavily on manual work, which limits accuracy, objectivity, and robustness and consumes considerable time. Deep-learning-based target detection methods are divided into single-stage and two-stage approaches; a two-stage detection algorithm consists of a region proposal stage and a detection stage but is slow.
Single-stage algorithms convert target recognition into regression and classification, using a single convolutional network to predict the bounding boxes and class probabilities of objects; they balance detection speed and accuracy, which makes YOLO-series algorithms easier to deploy in aerial target detection scenarios. In deep-learning-based aerial image detection, first, the ordinary convolutions of the backbone feature extraction network cannot extract sufficiently rich feature information, so feature extraction is inadequate; second, detection is unfriendly to smaller targets (below 32×32 pixels), which can be missed; finally, because backgrounds are complex and targets are densely distributed in aerial images, a large amount of irrelevant background interferes with feature extraction and degrades the detection performance of the model.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide an aerial image target detection algorithm based on multi-scale cavity convolution that can extract richer feature information, effectively improve the recognition accuracy of small targets in aerial images, accelerate model convergence, and make localization more accurate.
The aim of the invention is realized by the following technical scheme:
an aerial image target detection method based on multi-scale cavity convolution comprises the following steps:
step one, acquiring aerial images and cropping pictures of larger resolution;
step two, based on the YOLOv5 target detection model, enhancing the feature extraction capability of the backbone network by using structural re-parameterization convolution in the backbone feature extraction network;
step three, designing a new mixed domain attention mechanism EDM, adding multi-scale cavity convolution in a spatial attention mechanism, and embedding the module into a feature fusion network;
step four, introducing a SIOU regression loss function to improve the positioning accuracy of the target frame and accelerate the convergence rate of the model;
step five, model training;
step six, testing the test sample by using the model after training, and outputting the position and the category of the target;
the first step specifically includes two steps:
in step 1-1, the pictures in the aerial image dataset DOTA come from aerial images of different resolutions captured by aircraft and remote sensing platforms; picture sizes vary from about 800×800 to 4000×4000 pixels, and the larger pictures are selected for cropping. Each such picture is cut into four pictures, generating images of size 1024×1024; images smaller than 1024×1024 are padded with black, and corresponding xml files are generated for the cropped pictures;
step 1-2, dividing the processed DOTA dataset into a training set, a validation set, and a test set in a 7:1:2 ratio;
in the second step, a RepVGG structural re-parameterization block is used to replace the 3×3 convolutions in the conventional network. The block is divided into a training phase and an inference phase: the training phase uses a multi-branch structure, and the inference phase re-parameterizes the multi-branch structure into a single-path structure, saving memory and increasing network inference speed. A large-scale target detection layer is constructed and fused with the high-resolution feature map to retain the detail information of the shallow feature map; the number of output detection layers is increased from the original 3 to 4, and the added detection layer generates smaller anchor boxes to detect smaller targets.
The third step specifically includes two steps:
step 3-1: the mixed domain attention mechanism EDM comprises a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism CAM helps the network strengthen the screening of important feature information and reduce the loss of feature detail; the spatial attention mechanism SAM selectively focuses on the feature information of important regions, reducing the influence of the background on the detection result. The output of the channel attention mechanism is the input of the spatial attention mechanism.
Step 3-2: on the basis of the structure in step 3-1, a multi-branch cavity convolution module is added to the spatial attention mechanism. The module consists of three dilated convolutions with dilation rates of 1, 2, and 3 and a kernel size of 3×3, forming a parallel multi-branch structure; each branch acquires a different receptive field, enhancing the multi-scale feature extraction capability of the network.
In the fourth step, the SIOU loss function is used to replace the GIOU regression loss in the original model, introducing the vector angle between the real box and the predicted box. It consists of four partial losses: the angle loss, the distance loss, the shape loss, and the IOU loss. The SIOU loss is calculated as:

L_SIOU = 1 − IoU + (Δ + Ω) / 2,

where Δ represents the distance loss, Ω represents the shape loss, σ is the distance between the center points of the real and predicted boxes, and θ controls the degree of attention given to the shape.

In the fifth step, model training includes the following steps:
step 5-1: setting the size and training parameters of a training picture;
step 5-2: loading the pre-training data;
step 5-3: training is carried out;
in the setting of the training picture size and parameters in step 5-1 described above, the input resolution of the pictures is set to 1024×1024, the initial learning rate to 0.01, the cyclic learning rate to 0.02, the learning-rate momentum to 0.937, the weight decay coefficient to 0.0005, and the number of training epochs to 300.
In the training process of step 5-3, the processed DOTA dataset pictures are put into the network for training according to the above parameter settings, using the Mosaic data enhancement method built into YOLOv5. The model with the minimum loss is saved after each training epoch, and the best training model is finally obtained.
The invention has the beneficial effects that: compared with the prior art, the invention has the following advantages.
First, the average precision is improved: the backbone network uses the structural re-parameterization block; during training the model improves the feature representation capability of the network with multi-branch convolution, and during inference the multi-branch structure is re-parameterized into a single-branch structure, improving both the feature extraction capability of the network and its inference efficiency.
Secondly, small-target missed detection is reduced: the invention constructs a large-scale target detection layer and fuses it with the high-resolution feature map, addressing the missed detection of small targets caused by the large number of small targets, the low resolution of feature maps, and the little extractable feature information in aerial scenes. The mixed domain attention mechanism is embedded into the feature fusion network so that the position and classification information of targets is fully attended to, the interference of complex backgrounds on feature extraction is reduced, and important feature information is emphasized. Multi-scale cavity convolution is introduced into the spatial attention mechanism; convolutions with different dilation rates have receptive fields of different sizes, which enhances the multi-scale feature extraction capability of the network and gives the high-resolution feature map a larger receptive field, further improving small-target detection.
Thirdly, model convergence is accelerated and target localization is more accurate: the invention uses the SIOU loss function, which considers the direction mismatch between the real target box and the predicted box; the predicted box can move to the nearest axis more quickly, preventing it from wandering during training, and the training speed and inference accuracy of the model are improved.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a diagram showing an improved object detection structure.
Fig. 3 is a structural diagram of RepVGG.
Fig. 4 is a mixed domain attention mechanism diagram.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to fig. 1-4.
The aerial image target detection algorithm based on multi-scale cavity convolution, as shown in Fig. 1, comprises the following steps:
step one, acquiring aerial images and segmenting images of larger resolution: the DOTA dataset contains 2806 aerial images from different sensors and platforms, each with a resolution within 4000×4000 pixels, divided into 15 categories: aircraft, ship, storage tank, baseball field, tennis court, basketball court, track field, harbour, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer field, and swimming pool.
A picture with larger resolution is cropped: each picture is cut into four 1024×1024 pictures, images smaller than 1024×1024 are padded with black, and corresponding xml files are generated for the cropped pictures; a minimal sketch of this cropping is given below. The processed dataset is divided into a training set, a validation set, and a test set in a 7:1:2 ratio.
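As a reference for this preprocessing, the following is a minimal Python sketch of the four-way cropping with black padding; the use of OpenCV, the function name, and the midline split strategy are illustrative assumptions, and the generation of the corresponding xml annotation files is omitted.

```python
# Minimal sketch of step one, assuming OpenCV; splitting at the image
# midlines is an assumption (the text only states "cut into four"),
# and halves larger than 1024 are truncated here for simplicity.
import cv2
import numpy as np

TILE = 1024

def crop_into_four(image_path: str):
    """Cut one large aerial picture into four TILE x TILE tiles,
    padding undersized tiles with black (zeros)."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    tiles = []
    for y0, y1 in [(0, h // 2), (h // 2, h)]:
        for x0, x1 in [(0, w // 2), (w // 2, w)]:
            part = img[y0:y1, x0:x1]
            canvas = np.zeros((TILE, TILE, 3), dtype=img.dtype)
            ph, pw = min(part.shape[0], TILE), min(part.shape[1], TILE)
            canvas[:ph, :pw] = part[:ph, :pw]   # black padding elsewhere
            tiles.append(canvas)
    return tiles
```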
And secondly, based on the YOLOv5 target detection model, a structural re-parameterization block is selected to replace part of the 3×3 convolutions in the original backbone network, a small-target detection layer is constructed and fused with the high-resolution feature map to retain shallow semantic information, and the number of output detection layers is increased to 4.
The training-time RepVGG block is a multi-branch structure: the input feature is split into three parallel branches consisting of a 3×3 convolution, a 1×1 convolution, and an Identity mapping. The stride of both convolutions is set to 1, each branch is immediately followed by BN normalization, and the three processed feature maps are added together to fuse them, with the sizes of the input and output feature maps unchanged. The re-parameterized structure of the inference phase converts the multi-branch structure into a single-path structure: the 3×3 convolution of the main branch is first fused with its BN layer; the 1×1 convolution of the second branch is fused with its BN layer and then converted into a 3×3 convolution; and the Identity branch, which contains only a BN operation, is converted into a 3×3 convolution whose number of channels equals that of the input feature map. In the fusion of a 3×3 convolution with a BN layer, the output of the convolution layer is taken as the input of the BN layer, and the channels of the feature map are fused individually.
In the fusion of the 1×1 convolution with its BN layer, zeros are padded around the original 1×1 kernel weights to convert it into a 3×3 convolution layer, with padding set to 1 so that the input and output sizes of the feature map are unchanged; fusion then follows the 3×3 convolution + BN procedure. To convert the BN-only branch into a 3×3 convolution, the original feature map is identity-mapped so that input and output are unchanged, and the BN layer is fused as above. The 3×3 convolutions on the three branches are then fused into one 3×3 convolution layer by adding the parameters of the three convolution layers. At this point, the inference phase has converted the block from a three-branch structure into a single-branch structure containing only one 3×3 convolution layer. For the ith channel of the feature map, the conversion formula and the weight and bias of the ith fused convolution kernel are as follows:
BN(M)_i = γ_i · (M_i − μ_i) / σ_i + β_i,
W'_i = (γ_i / σ_i) · W_i,  b'_i = β_i − (γ_i · μ_i) / σ_i,
where M is the feature map input to the BN layer; μ, σ, γ, and β are parameters of the BN layer; μ and σ are obtained by statistics during the training process, and γ and β are obtained through training.
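A minimal PyTorch sketch of this conv + BN fusion and of the 1×1-to-3×3 kernel padding follows; the function names are illustrative assumptions, and the Identity-branch conversion is omitted for brevity.

```python
# Minimal sketch of RepVGG-style re-parameterization pieces, assuming
# PyTorch: fold BN into the preceding conv (W'_i = gamma_i/sigma_i * W_i,
# b'_i = beta_i - gamma_i * mu_i / sigma_i), and zero-pad 1x1 kernels
# to 3x3 so the three branch kernels can simply be summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    sigma = torch.sqrt(bn.running_var + bn.eps)      # per-channel std
    scale = bn.weight / sigma                        # gamma_i / sigma_i
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else 0.0
    fused.bias.data = bn.bias.data + (conv_bias - bn.running_mean) * scale
    return fused

def pad_1x1_to_3x3(kernel_1x1: torch.Tensor) -> torch.Tensor:
    """Zero-pad a (out, in, 1, 1) kernel to (out, in, 3, 3)."""
    return F.pad(kernel_1x1, [1, 1, 1, 1])
```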
The small-target detection layer adds a feature fusion path starting from layer 2 of the backbone feature network and fuses the 160×160×64 feature map using one newly added up-sampling and one down-sampling operation in the feature fusion network; the output detection layers become 4, and the number of prediction boxes increases from 9 to 12. The 3 added prediction boxes target small objects, effectively improving the detection of small targets and alleviating missed detections.
Step three, designing a new mixed domain attention mechanism EDM: the module is formed by connecting a channel attention mechanism CAM and a spatial attention mechanism SAM in series, with multi-branch cavity convolution introduced into the spatial attention mechanism, and the attention mechanism is embedded into the feature fusion network. The module effectively focuses on important information, suppresses the interference of complex backgrounds on feature extraction, enhances the multi-scale feature extraction capability of the network, and improves the detection of small targets.
The channel attention mechanism CAM generates a channel attention map by exploiting the inter-channel relationships of the features, improving the target classification capability of the model. The spatial dimension of the input feature map is compressed to compute channel attention efficiently. The input feature map first undergoes max pooling (MaxPool) and average pooling (AvgPool) separately, which greatly improves the representation capability of the network and produces average-pooled and max-pooled features. The two feature weight vectors are then forwarded to a multi-layer perceptron (MLP) with a hidden layer, producing the channel attention map. The mapped weights are added and passed through a Sigmoid activation function to output a feature vector, which is multiplied with the input feature map to generate a new feature map; this new feature map is the input of the spatial attention mechanism. The channel attention is calculated as:
M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c))),
where F represents the input feature map, M_C(F) the channel attention output feature map, and σ the Sigmoid activation function; W_0 ∈ R^(C/r×C) and W_1 ∈ R^(C×C/r) are the MLP weights, shared for both inputs.
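The formula corresponds to the following minimal PyTorch sketch; the reduction ratio r = 16 is an illustrative assumption, as the text does not specify it.

```python
# Minimal CBAM-style channel attention matching the formula above,
# assuming PyTorch; r = 16 is an assumed reduction ratio.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP: W0 reduces C -> C/r, W1 restores C/r -> C.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))            # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))             # MaxPool branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # M_C(F)
        return x * w                                  # reweight channels
```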
The spatial attention mechanism SAM aims to improve the feature expression of key regions: the spatial information in the picture is transformed into another space through a spatial transformation module while key information is retained, and a weight is generated for each important position and output. Although max pooling and average pooling quickly summarize feature information, they reduce the resolution of the feature map and cause a loss of feature information. Smaller targets require high-resolution, large-receptive-field feature maps, whereas the resolution of feature maps gradually decreases during convolution. Therefore, multi-scale cavity convolution replaces the average pooling and max pooling operations used in the spatial attention mechanism, so that the module enhances multi-scale feature extraction without changing the resolution of the feature map, expands the receptive field of the feature map, and further helps the network suppress irrelevant background regions.
To compute the spatial attention, the output feature map of the channel attention is first taken as the input of the spatial attention and reduced by a 1×1 convolution so that the number of channels is halved; second, three parallel branches of 3×3 mixed dilated convolutions with dilation rates of 1, 2, and 3 form a multi-branch structure whose effective receptive fields are 3×3, 5×5, and 7×7, helping the network extract deeper features at different scales; then the three channel-halved features are added; finally, a 1×1 convolution reduces the features to a 1×H×W spatial attention map, with a batch normalization layer applied at the end of the spatial branch. The spatial attention is calculated as:
F_conv1 = f_1^(1×1)(F),
F_conv2 = f_2^(1×1)(f_(d=1)^(3×3)(F_conv1) + f_(d=2)^(3×3)(F_conv1) + f_(d=3)^(3×3)(F_conv1)),
M_S(F) = BN(F_conv2),
where F represents the input feature map; f_1^(1×1) and f_2^(1×1) denote the two 1×1 convolutions; F_conv1 ∈ R^(H×W×C/2) and F_conv2 ∈ R^(H×W×1) are the feature maps after convolution; f_(d=k)^(3×3) denotes a 3×3 dilated convolution with dilation rate k; and M_S(F) is the spatial attention map produced by the spatial branch.
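A minimal PyTorch sketch of this multi-branch dilated spatial attention follows; the sigmoid gating applied to the map is an assumption made by analogy with the channel branch, since the text specifies only the final BN layer.

```python
# Minimal multi-scale dilated-convolution spatial attention, assuming
# PyTorch; layout follows the text (1x1 reduce, three dilated 3x3
# branches, sum, 1x1 + BN). The sigmoid gate is an assumption.
import torch
import torch.nn as nn

class DilatedSpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2                     # channel halving
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        # Dilation d with padding d keeps H x W unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(mid, mid, 3, padding=d, dilation=d) for d in (1, 2, 3)
        ])
        self.project = nn.Sequential(
            nn.Conv2d(mid, 1, kernel_size=1),   # 1 x H x W attention map
            nn.BatchNorm2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.reduce(x)
        f = sum(branch(f) for branch in self.branches)
        attn = torch.sigmoid(self.project(f))
        return x * attn
```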
Step four, introducing the SIOU regression loss function: the total loss of the YOLOv5 model consists of classification loss and regression loss, and the specific improvement is to replace the regression loss in the original model with the SIOU loss function, locating target boxes more accurately and accelerating model convergence. The SIOU loss function consists of 4 parts: angle loss (Angle cost), distance loss (Distance cost), shape loss (Shape cost), and IoU loss (IoU cost). The loss function considers the vector angle between the real box and the predicted box and redefines the penalty metric, preventing the predicted box from wandering during training, which would slow convergence and degrade results; the training speed and inference accuracy of the network are improved, and the target box is located more accurately. The angle loss is calculated as follows:
Λ = 1 − 2·sin²(arcsin(c_h / σ) − π/4),
σ = sqrt((b_cx^gt − b_cx)² + (b_cy^gt − b_cy)²),
c_h = max(b_cy^gt, b_cy) − min(b_cy^gt, b_cy),
where sin(α) = c_h / σ is the ratio of the opposite side to the hypotenuse of the right triangle formed by the two box centers, σ is the distance between the center points of the real box and the predicted box, c_h is the height difference between the two center points, (b_cx^gt, b_cy^gt) are the coordinates of the real-box center, and (b_cx, b_cy) are the coordinates of the predicted-box center. The distance loss is calculated as follows:
Δ = Σ_(t=x,y) (1 − e^(−γ·ρ_t)),  ρ_x = ((b_cx^gt − b_cx) / c_w)²,  ρ_y = ((b_cy^gt − b_cy) / c_h)²,  γ = 2 − Λ,
where c_w and c_h are the width and height of the minimum bounding rectangle of the real and predicted boxes. As α → 0 the contribution of the distance loss is greatly reduced, whereas the closer α is to π/4, the greater its contribution; γ gives the distance term priority over time as the angle increases. The shape loss is calculated as follows:
Ω = Σ_(t=w,h) (1 − e^(−ω_t))^θ,  ω_w = |w − w^gt| / max(w, w^gt),  ω_h = |h − h^gt| / max(h, h^gt),
where w, h and w^gt, h^gt are the width and height of the predicted and real boxes, respectively, and θ controls the degree of attention given to the shape loss. To avoid paying excessive attention to the shape and restricting the movement of the prediction box, θ must be bounded: if set to 1, the shape would be optimized immediately, harming its free movement. A genetic algorithm computed a value of θ close to 4, so the range of this parameter is defined between 2 and 6. The final loss is defined as follows:
L_SIOU = 1 − IoU + (Δ + Ω) / 2,
where Δ represents the distance loss, Ω represents the shape loss, σ is the distance between the center points of the real and predicted boxes, and θ controls the degree of attention given to the shape.
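Assembled from the formulas above, a minimal PyTorch sketch of the SIOU box loss could look as follows; the (x1, y1, x2, y2) box encoding and θ = 4 are illustrative assumptions.

```python
# Minimal SIOU loss sketch, assuming PyTorch and (N, 4) x1y1x2y2 boxes.
import math
import torch

def siou_loss(pred: torch.Tensor, gt: torch.Tensor,
              theta: float = 4.0, eps: float = 1e-7) -> torch.Tensor:
    """SIOU = 1 - IoU + (Delta + Omega) / 2."""
    px, py = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    gx, gy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    gw, gh = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]

    # IoU term
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(min=0)
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Angle cost: Lambda = 1 - 2 sin^2(arcsin(c_h / sigma) - pi/4)
    sigma = torch.sqrt((gx - px) ** 2 + (gy - py) ** 2) + eps
    c_h = torch.max(gy, py) - torch.min(gy, py)
    lam = 1 - 2 * torch.sin(torch.arcsin((c_h / sigma).clamp(0, 1)) - math.pi / 4) ** 2

    # Distance cost over the minimum enclosing box, gamma = 2 - Lambda
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0]) + eps
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1]) + eps
    gamma = 2 - lam
    delta = (1 - torch.exp(-gamma * ((gx - px) / cw) ** 2)) \
          + (1 - torch.exp(-gamma * ((gy - py) / ch) ** 2))

    # Shape cost
    om_w = (pw - gw).abs() / (torch.max(pw, gw) + eps)
    om_h = (ph - gh).abs() / (torch.max(ph, gh) + eps)
    omega = (1 - torch.exp(-om_w)) ** theta + (1 - torch.exp(-om_h)) ** theta

    return 1 - iou + (delta + omega) / 2
```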
Fifth, model training: the invention is built on a PyTorch-based deep learning target detection framework, and a single GPU card, an NVIDIA GeForce RTX 2080 Ti, is used for training and testing.
The training picture size and training parameters are set as follows: the input resolution of the pictures is 1024×1024, and batch training is adopted with the batch size set to 64. A decaying learning rate schedule is used: the initial learning rate is set to 0.01, the cyclic learning rate to 0.02, the learning-rate momentum to 0.937, and the weight decay coefficient to 0.0005; the number of training epochs is set to 300. The target boxes in the training set are clustered using the K-means algorithm to generate 12 prior boxes of different sizes, which are sorted by size and used to initialize the parameters of the four detection heads.
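The settings above can be collected into a configuration dictionary; the key names loosely follow YOLOv5's hyp.yaml conventions (lr0, lrf, momentum, weight_decay) and are an assumption rather than values quoted from an actual YOLOv5 file.

```python
# Illustrative training configuration matching the parameters in the
# text; key names are assumed, loosely following YOLOv5 conventions.
train_cfg = {
    "img_size": 1024,        # input resolution 1024 x 1024
    "batch_size": 64,        # batch training
    "epochs": 300,           # number of training epochs
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.02,             # cyclic / final learning-rate factor
    "momentum": 0.937,       # learning-rate momentum
    "weight_decay": 0.0005,  # weight decay coefficient
    "mosaic": 1.0,           # YOLOv5 Mosaic data enhancement enabled
    "anchors_per_head": 3,   # 4 heads x 3 anchors = 12 K-means priors
}
```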
In the training process, the Mosaic data enhancement method built into YOLOv5 is used, and the model with the minimum loss is saved after each training epoch. The trained models are evaluated with metrics commonly used in target detection, such as average precision (AP), mean average precision (mAP), frame rate (FPS), precision (Precision), and recall (Recall).
Step six: testing the test set with the trained model and outputting the category and position information of the targets in the aerial images. The test results show that the model can identify targets of different categories, including smaller targets, under complex conditions, and can detect multiple similar objects in an image.
With the improved YOLOv5 target detection model designed by the invention, given an aerial image, the system can detect targets of various categories in the image through the trained model.

Claims (8)

1. An aerial image target detection method based on multi-scale cavity convolution is characterized by comprising the following steps:
step one, acquiring aerial images, and cutting pictures with larger resolution;
step two, based on the YOLOv5 target detection model, constructing a large-scale target detection layer and carrying out feature fusion by using structural re-parameterization convolution in the backbone feature extraction network;
step three, designing a new mixed domain attention mechanism EDM, adding multi-scale cavity convolution in the spatial attention mechanism, and embedding the module into the feature fusion network;
step four, introducing a SIOU regression loss function to improve the positioning accuracy of the target frame and accelerate the convergence rate of the model;
step five, model training;
and step six, testing the test sample by using the model after training, and outputting the position and the category of the target.
2. The method for detecting an aerial image target based on multi-scale hole convolution according to claim 1, wherein in the first step, the acquisition and processing of the data set comprises the following two steps:
in step 1-1, the pictures in the aerial image dataset DOTA come from aerial images of different resolutions captured by aircraft and remote sensing platforms; picture sizes vary from about 800×800 to 4000×4000 pixels, and the larger pictures are selected for cropping. Each such picture is cut into four pictures, generating images of size 1024×1024; images smaller than 1024×1024 are padded with black, and corresponding xml files are generated for the cropped pictures;
step 1-2 divides the processed DOTA data set into a training set, a verification set and a test set according to the proportion of 7:1:2 respectively.
3. The method for detecting an aerial image target based on multi-scale hole convolution according to claim 1, wherein in step two, a RepVGG structural re-parameterization block is used to replace the 3×3 convolutions in the conventional network, the block being divided into a training phase and an inference phase: the training phase uses a multi-branch structure, and the inference phase re-parameterizes the multi-branch structure into a single-path structure, saving memory and increasing network inference speed; and a large-scale target detection layer is constructed and fused with the high-resolution feature map to retain the detail information of the shallow feature map, the number of output detection layers is increased from the original 3 to 4, and the added detection layer generates smaller anchor boxes to detect smaller targets.
4. The method for detecting the target of the aerial image based on the multi-scale hole convolution according to claim 1, wherein in the third step, a new mixed domain attention mechanism is designed, the multi-scale hole convolution is added in the spatial attention mechanism, and the module is embedded into the feature fusion network, comprising:
step 3-1: the mixed domain attention mechanism EDM comprises a channel attention mechanism and a spatial attention mechanism; the channel attention mechanism CAM helps the network strengthen the screening of important feature information and reduce the loss of feature detail, and the spatial attention mechanism SAM selectively focuses on the feature information of important regions, reducing the influence of the background on the detection result, wherein the output of the channel attention mechanism is the input of the spatial attention mechanism;
Step 3-2: on the basis of the structure in step 3-1, a multi-branch cavity convolution module is added to the spatial attention mechanism; the module consists of three dilated convolutions with dilation rates of 1, 2, and 3 and a kernel size of 3×3, forming a parallel multi-branch structure in which each branch acquires a different receptive field, enhancing the multi-scale feature extraction capability of the network.
5. The method for detecting the target of the aerial image based on the multi-scale hole convolution according to claim 1, wherein in the fourth step, the SIOU loss function is used to replace the GIOU regression loss in the original model, introducing the vector angle between the real box and the predicted box, and comprising four partial losses: the angle loss, the distance loss, the shape loss, and the IOU loss; the SIOU loss is calculated as:
L_SIOU = 1 − IoU + (Δ + Ω) / 2,
where Δ represents the distance loss, Ω represents the shape loss, σ is the distance between the center points of the real and predicted boxes, and θ controls the degree of attention given to the shape.
6. The method for detecting an aerial image target based on multi-scale hole convolution according to claim 1, wherein in the fifth step, the step of model training comprises the steps of:
step 5-1: setting the size and training parameters of a training picture;
step 5-2: loading the pre-training data;
step 5-3: training is performed.
7. The aerial image target detection method based on multi-scale hole convolution according to claim 6, wherein step 5-1 sets the training picture size and training parameters, specifically: the input resolution of the pictures is 1024×1024, the initial learning rate is set to 0.01, the cyclic learning rate to 0.02, the learning-rate momentum to 0.937, the weight decay coefficient to 0.0005, and the number of training epochs to 300.
8. The aerial image target detection method based on multi-scale hole convolution according to claim 6, wherein the training in step 5-3 specifically comprises putting the processed DOTA dataset pictures into the network for training according to the parameter settings, using the Mosaic data enhancement method built into YOLOv5 during training; the model with the minimum loss is saved after each training epoch, and the best training model is finally obtained.
CN202310925097.6A (priority date 2023-07-25, filing date 2023-07-25): Aerial image target detection method based on multi-scale cavity convolution, published as CN116824413A (en), Pending

Priority Applications (1)

CN202310925097.6A, priority date 2023-07-25, filing date 2023-07-25: Aerial image target detection method based on multi-scale cavity convolution


Publications (1)

CN116824413A (en), published 2023-09-29

Family ID: 88120379

Family Applications (1)

CN202310925097.6A (pending as CN116824413A), priority date 2023-07-25, filing date 2023-07-25: Aerial image target detection method based on multi-scale cavity convolution

Country Status (1)

CN: CN116824413A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557922A (en) * 2023-10-19 2024-02-13 河北翔拓航空科技有限公司 Unmanned aerial vehicle aerial photographing target detection method for improving YOLOv8
CN117557922B (en) * 2023-10-19 2024-06-11 河北翔拓航空科技有限公司 Unmanned aerial vehicle aerial photographing target detection method with improved YOLOv8
CN117611877A (en) * 2023-10-30 2024-02-27 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117611877B (en) * 2023-10-30 2024-05-14 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117854111A (en) * 2024-01-15 2024-04-09 江南大学 Improved YOLOv4 plasmodium detection method based on enhanced feature fusion
CN117911908A (en) * 2024-03-20 2024-04-19 湖北经济学院 Enhancement processing method and system for aerial image of unmanned aerial vehicle
CN117911908B (en) * 2024-03-20 2024-05-28 湖北经济学院 Enhancement processing method and system for aerial image of unmanned aerial vehicle
CN118038292A (en) * 2024-04-11 2024-05-14 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Object recognition device and method for satellite on-orbit calculation


Legal Events

PB01: Publication