CN111462050B - YOLOv3 improved minimum remote sensing image target detection method and device and storage medium - Google Patents


Info

Publication number
CN111462050B
CN111462050B (application CN202010172524.4A)
Authority
CN
China
Prior art keywords
layer
network
yolov3
feature
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010172524.4A
Other languages
Chinese (zh)
Other versions
CN111462050A (en)
Inventor
张孙杰
陈磊
肖寒臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202010172524.4A priority Critical patent/CN111462050B/en
Publication of CN111462050A publication Critical patent/CN111462050A/en
Application granted granted Critical
Publication of CN111462050B publication Critical patent/CN111462050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06T: Image data processing or generation, in general
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10032: Satellite or aerial image; remote sensing
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T 2207/20081: Training; learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of target detection, and in particular to a method, a device and a storage medium for detecting very small targets in remote sensing images with an improved YOLOv3. To improve the quality of low-resolution features, an additional bottom-up path with lateral connections is added to the FPN module, building a feature pyramid network that combines top-down and bottom-up paths; the bidirectionally combined pyramid feature layers are fused and applied to target detection in remote sensing images. A 1×1 convolution reduces the dimensionality of the network model and raises its detection speed. Finally, the improved network is compared quantitatively and qualitatively with the state-of-the-art YOLOv3 network on the VEDAI and NWPU VHR remote sensing vehicle datasets. The results show that the detection performance of the improved network is markedly better than that of the original network while the detection speed remains almost unchanged, solving the prior-art problems of low detection rate, high false alarm rate and low detection speed for very small targets in remote sensing images.

Description

YOLOv3 improved minimum remote sensing image target detection method and device and storage medium
Technical Field
The invention relates to the technical field of target detection, and in particular to a method, a device and a storage medium for detecting very small targets in remote sensing images with an improved YOLOv3.
Background
In recent years, target detection has become a research hotspot in computer vision and is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection and aerospace. The task is to determine the category of each target in an image and give its precise position.
In the past few years, researchers have found that deep convolutional neural networks can greatly improve the performance of image classification and target detection: they not only extract high-level, strongly semantic features of a target from an image, but also integrate feature extraction, feature selection and feature classification into a single model that is optimized as a whole through end-to-end training, enhancing the separability of the features. Current convolutional neural networks for target detection largely fall into two categories. (1) Two-stage detection algorithms first generate a series of candidate regions and then classify them precisely; typical algorithms include Fast R-CNN and Mask R-CNN, but these methods are not suitable for real-time detection. (2) Single-stage detectors, in contrast, predict targets directly from the original image; typical algorithms include the three versions of YOLO, SSD and RetinaNet. Single-stage detection algorithms thus combine high detection accuracy with high detection speed. Current target detection research for remote sensing images is based on deep learning and builds on YOLOv3, a detection algorithm that balances real-time performance and high accuracy. However, because targets in remote sensing images appear at low resolution with inconspicuous features, existing detection techniques for very small remote sensing targets still suffer from a low detection rate and a high false alarm rate.
Disclosure of Invention
To address the shortcomings of the prior art, the invention discloses a method, a device and a storage medium for detecting very small targets in remote sensing images with an improved YOLOv3, solving the prior-art problems of low detection rate, high false alarm rate and low detection speed for such targets.
The invention is realized by the following technical scheme:
In a first aspect, the invention discloses a method for detecting very small targets in remote sensing images with an improved YOLOv3, comprising the following steps:
S1, acquiring target data and fusing convolutional layer feature outputs of the YOLOv3 network to form a pyramid feature layer;
S2, combining convolutional layer feature outputs of the shallow YOLOv3 layers to form a pyramid feature layer;
S3, fusing the bidirectionally combined pyramid feature layers;
S4, changing a downsampling layer of the YOLOv3 network into 3×3 convolutional layers;
S5, reducing the dimensionality of the network model with a 1×1 convolution and outputting the data.
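Step S5's 1×1 dimensionality reduction can be sketched in PyTorch (the framework used in the experiments below); the channel widths and feature-map size here are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

# Illustrative (assumed) channel widths: reduce a 512-channel feature map
# to 256 channels with a 1x1 convolution. The spatial size is unchanged.
reduce = nn.Conv2d(512, 256, kernel_size=1, bias=False)

x = torch.randn(1, 512, 52, 52)   # batch, channels, height, width
y = reduce(x)
print(tuple(y.shape))             # channels halved, 52x52 preserved

# A 1x1 kernel also uses 9x fewer weights than a 3x3 with the same channels,
# which is what lowers the model dimensionality and speeds up detection.
w1 = sum(p.numel() for p in reduce.parameters())
w3 = sum(p.numel() for p in nn.Conv2d(512, 256, 3, bias=False).parameters())
print(w1, w3)
```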
Furthermore, in S1, the convolutional layer feature output of the last layer of the YOLOv3 network is fused with the convolutional layer feature output of the adjacent shallower layer to form a top-down pyramid feature layer.
Furthermore, in S2, the convolutional layer feature output of the shallow YOLOv3 layer is combined with the convolutional layer feature output of the adjacent deeper layer to form a bottom-up pyramid feature layer.
Furthermore, in S4, the first downsampling layer of the YOLOv3 network is changed into two 3×3 convolutional layers, so that the network retains more positional feature information of small targets in its initial stage.
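The S4 modification can be sketched as follows; the channel widths and input size are assumptions for illustration, not values from the patent. Replacing a stride-2 downsampling layer with two stride-1 3×3 convolutions keeps the full spatial resolution, so small-target positions survive the front of the network:

```python
import torch
import torch.nn as nn

# Original-style first downsampling layer: a stride-2 3x3 convolution halves
# the spatial resolution immediately.
downsample = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1, bias=False)

# Modified front end: two stride-1 3x3 convolutions keep the full resolution.
front = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
)

x = torch.randn(1, 32, 416, 416)
out_down = downsample(x)
out_front = front(x)
print(tuple(out_down.shape))    # resolution halved to 208x208
print(tuple(out_front.shape))   # full 416x416 preserved
```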
Still further, the method adds an additional bottom-up path with lateral connections to the FPN module.
In a second aspect, the invention discloses a device for detecting very small targets in remote sensing images with an improved YOLOv3, comprising an FPN module, a memory, a processor and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, it implements the detection method of the first aspect.
Furthermore, an additional bottom-up path with lateral connections is added to the FPN module.
In a third aspect, the invention discloses a storage medium on which a computer program is stored; when executed by a processor, the program implements the detection method of the first aspect.
The beneficial effects of the invention are as follows:
the invention provides a method for improving the performance of low-resolution features by adding additional bottom-up and transverse connection paths on an FPN module, constructing a top-down and bottom-up feature pyramid network, fusing a bidirectional combined pyramid feature layer, and applying the bidirectional combined pyramid feature layer to the target detection of a remote sensing image, wherein the main improvement on the network is as follows: first, the first downsampling layer is changed into two convolution layers of 3x3, which is beneficial to the first downsampling layer of the network layer to keep more small target position information. And then, changing three different scale outputs of YOLOv3 into a feature map output aiming at the small target, and then fusing the features of the high and low levels of the bidirectional pyramid to be favorable for detecting the small target. In addition, the dimensionality of a network model is reduced by adopting 1 x 1 convolution, and the detection speed of the network is improved. And finally, carrying out quantitative and qualitative comparative analysis on VEDAI and NWPU VHR remote sensing vehicle data sets and the most advanced YOLOv3 network. The result shows that the improved network detection performance is obviously improved compared with the original network, and the detection speed of the network is almost unchanged.
Drawings
To illustrate the embodiments of the invention or the prior-art solutions more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of the improved YOLOv3 network architecture according to an embodiment of the invention;
FIG. 2 compares detection results of the improved YOLOv3 and the original network on the VEDAI dataset, according to an embodiment of the invention;
FIG. 3 compares detection results of the improved YOLOv3 and the original network on the NWPU VHR dataset, according to an embodiment of the invention;
FIG. 4 is a diagram of the conventional YOLOv3 network structure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the possible embodiments. All other embodiments obtained by those skilled in the art without inventive effort, based on the embodiments of the invention, fall within its scope of protection.
Example 1
This embodiment discloses a method for detecting very small targets in remote sensing images with an improved YOLOv3, comprising the following steps:
S1, acquiring target data and fusing convolutional layer feature outputs of the YOLOv3 network to form a pyramid feature layer;
S2, combining convolutional layer feature outputs of the shallow YOLOv3 layers to form a pyramid feature layer;
S3, fusing the bidirectionally combined pyramid feature layers;
S4, changing a downsampling layer of the YOLOv3 network into 3×3 convolutional layers;
S5, reducing the dimensionality of the network model with a 1×1 convolution and outputting the data.
In S1, the convolutional layer feature output of the last layer of the YOLOv3 network is fused with the convolutional layer feature output of the adjacent shallower layer to form a top-down pyramid feature layer.
In S2, the convolutional layer feature output of the shallow YOLOv3 layer is combined with the convolutional layer feature output of the adjacent deeper layer to form a bottom-up pyramid feature layer.
In S4, the first downsampling layer of the YOLOv3 network is changed into two 3×3 convolutional layers, so that the network retains more positional feature information of small targets in its initial stage.
This embodiment proposes a bidirectional pyramid feature fusion architecture. Based on the single-stage detector YOLOv3, the method adds an additional bottom-up path with lateral connections to the FPN module to improve the quality of low-resolution features, improving the YOLOv3 network architecture. First, the feature output of the last layer of the network is fused with the convolutional layer feature output of the adjacent shallower layer to form a top-down pyramid feature layer. The convolutional layer feature output of the shallow layer is likewise combined with the convolutional layer feature output of the adjacent deeper layer to form a bottom-up pyramid feature layer, and the bidirectionally combined pyramid features are merged. This combines the global semantic information of deep features with the detailed local texture information of shallow convolutions. The first downsampling layer of the original network is also changed into two 3×3 convolutional layers, so that the network can retain more positional feature information of small targets in its initial stage. Finally, to raise the detection speed of the network, a 1×1 convolution reduces the dimensionality of the network model. The method is compared with the current state-of-the-art target detection network YOLOv3 on the public remote sensing datasets VEDAI-10 and NWPU VHR.
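The bidirectional fusion described above can be sketched roughly in PyTorch. This is a hypothetical minimal implementation: the module name, channel widths and the choice of element-wise addition as the fusion operation are assumptions for illustration, not details given in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiPyramidFusion(nn.Module):
    """Hypothetical sketch: c3/c4/c5 are backbone outputs at 8x/16x/32x
    downsampling; channel widths are illustrative, not from the patent."""

    def __init__(self, in_ch=(128, 256, 512), out_ch=128):
        super().__init__()
        # 1x1 lateral connections bring every level to a common width
        self.lat = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_ch])
        # stride-2 3x3 convolution used on the extra bottom-up path
        self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        # top-down path: upsample deep features, fuse with the adjacent shallower layer
        p5 = self.lat[2](c5)
        p4 = self.lat[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lat[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # extra bottom-up path: downsample shallow features, fuse with the deeper layer
        n4 = p4 + self.down(p3)
        n5 = p5 + self.down(n4)
        return p3, n4, n5

m = BiPyramidFusion()
c3 = torch.randn(1, 128, 52, 52)   # 8x map of a 416x416 input
c4 = torch.randn(1, 256, 26, 26)   # 16x map
c5 = torch.randn(1, 512, 13, 13)   # 32x map
p3, n4, n5 = m(c3, c4, c5)
print(tuple(p3.shape), tuple(n4.shape), tuple(n5.shape))
```

The deepest and shallowest outputs keep their resolutions, so every level ends up carrying both deep semantics and shallow positional detail.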
Example 2
This embodiment discloses a device for detecting very small targets in remote sensing images with an improved YOLOv3, comprising an FPN module, a memory, a processor and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, the detection method of Example 1 is carried out.
An additional bottom-up path with lateral connections is added to the FPN module.
This embodiment also discloses a storage medium on which a computer program is stored; when the computer program is executed by a processor, the detection method of Example 1 is carried out.
Example 3
In this embodiment: image classification predicts the class of a single target in an image, while target detection faces its own challenge of predicting multiple target classes and their positions in a single image. To address this, pyramid feature methods represent target features at multiple scales; the feature pyramid is one of the representative architectures that generate pyramid feature representations for target detection. A feature pyramid is built by connecting adjacent feature layers of a backbone network through top-down and lateral connections, producing features that are both high-resolution and strongly semantic. YOLOv3 adopts this feature pyramid idea, as shown in FIG. 4. It is based on the Darknet-53 backbone, with 75 convolutional layers and 5 residual blocks in total, performs 5 downsampling operations, and predicts targets from the 8×, 16× and 32× downsampled feature maps output by the backbone. Large targets are predicted from the strong semantic information of the deep layers, which are fused, through top-down and lateral connections, with shallow features containing more spatial position information that are used for predicting small targets. The model can therefore handle the detection of large, medium and small targets.
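The three prediction scales mentioned above follow directly from the downsampling factors; for a 416 pixel × 416 pixel input (one of the sizes used in the experiments below), the grid sizes work out as:

```python
# Feature-map sizes for YOLOv3's three prediction scales on a 416x416 input:
# the 8x map detects small targets, 16x medium, 32x large.
input_size = 416
grids = {stride: input_size // stride for stride in (8, 16, 32)}
for stride, cells in grids.items():
    print(f"{stride}x downsampling -> {cells}x{cells} feature map")
```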
The invention improves the quality of low-resolution features by adding an additional bottom-up path with lateral connections to the FPN module, building a feature pyramid network that combines top-down and bottom-up paths, fusing the bidirectionally combined pyramid feature layers, and applying them to target detection in remote sensing images. The main improvements to the network are: the first downsampling layer is changed into two 3×3 convolutional layers so that the front of the network retains more positional information about small targets; the three different-scale outputs of YOLOv3 are changed into a single feature map output aimed at small targets, after which the high-level and low-level features of the bidirectional pyramid are fused, benefiting small target detection; and a 1×1 convolution reduces the dimensionality of the network model, raising its detection speed. Finally, the improved network is compared quantitatively and qualitatively with the state-of-the-art YOLOv3 network on the VEDAI and NWPU VHR remote sensing vehicle datasets. The results show that detection performance improves markedly over the original network while detection speed remains almost unchanged.
Example 4
This embodiment trains and evaluates the recognition performance of the network on the public remote sensing datasets VEDAI and NWPU VHR-10. The VEDAI dataset contains visible and infrared images with various vehicles against complex backgrounds; besides being small, the vehicles show many variations, such as multiple orientations, lighting and shadow changes, and occlusion, which are typical of surveillance and reconnaissance scenes. To make target detection more challenging, the images of both datasets are scaled or cropped to 512 pixel × 512 pixel and 416 pixel × 416 pixel. The VEDAI dataset contains nine classes of vehicle targets: car, truck, pickup, tractor, camping car, boat, van, other and plane. On average, each image contains 5.5 vehicle targets, accounting for approximately 0.7% of the total pixels in the image.
The target size in this dataset lies between 11.5 pixel × 11.5 pixel and 24.1 pixel × 24.1 pixel. The NWPU VHR-10 dataset is a public aerial image dataset for ten-class geospatial object detection, with the categories airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge and vehicle. Since most target classes in this dataset are large in size, for example tennis court, basketball court and ground track field, the vehicle class is selected as the target for single-class detection. Each image contains on average 11 targets, about twice as many as in VEDAI, with target sizes between 19 pixel × 19 pixel and 94 pixel × 94 pixel. YOLOv3 predicts targets at three scales (large, medium and small) and improves the recognition rate of small targets in remote sensing images, so comparative experiments are carried out here against the YOLOv3 detection algorithm.
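The roughly 0.7% pixel fraction quoted for VEDAI is consistent with the other numbers given above, as a quick back-of-envelope check shows (taking the midpoint of the quoted 11.5 to 24.1 pixel size range as a rough average side length, which is an assumption for illustration):

```python
# Rough sanity check: 5.5 vehicles per 512x512 image, average side taken as
# the midpoint of the quoted 11.5-24.1 pixel range.
mean_side = (11.5 + 24.1) / 2
pixel_fraction = 5.5 * mean_side ** 2 / 512 ** 2
print(f"{100 * pixel_fraction:.2f}% of image pixels")   # close to the quoted 0.7%
```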
To verify the recognition performance of the improved network, it is compared with the YOLOv3 network model. The experiments use images of two sizes, 512 pixel × 512 pixel and 416 pixel × 416 pixel, as input; training and testing run on a 1080 GPU workstation with a 2.10 GHz CPU and 16 GB of memory. The deep learning framework is PyTorch, and the datasets are augmented and expanded with methods such as image rotation and cropping. On the VEDAI dataset, the batch size is 8, the initial learning rate is 0.001 and the momentum coefficient is 0.9; the learning rate drops to 0.0001 and 0.00001 at iterations 45000 and 60000, respectively. On the NWPU VHR dataset, the batch size is 8, the initial learning rate is 0.001 and the momentum coefficient is 0.9; the learning rate drops to 0.0001 and 0.00001 at iterations 30000 and 45000, respectively.
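The learning-rate schedule described above is piecewise constant with decade drops at fixed iterations; a small helper function (hypothetical, for illustration only) makes the VEDAI schedule explicit:

```python
def lr_at(step, base=0.001, milestones=(45000, 60000), gamma=0.1):
    """Learning rate after `step` iterations: the base rate is multiplied by
    `gamma` at each milestone, matching the VEDAI schedule described above."""
    return base * gamma ** sum(step >= m for m in milestones)

for step in (0, 44999, 45000, 60000):
    print(step, lr_at(step))
```

For the NWPU VHR schedule the milestones would instead be (30000, 45000).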
Example 5
This embodiment analyses the results. Table 1 shows the detection results of the model on the VEDAI dataset. On images with a resolution of 416 pixel × 416 pixel, detection accuracy and speed are 47.7% and 42.3 FPS respectively; accuracy is 5% higher than the original YOLOv3 model while detection speed is almost unchanged. On images with a resolution of 512 pixel × 512 pixel, detection accuracy and speed are 66.8% and 38.4 FPS respectively; compared with the original YOLOv3 model, detection accuracy improves by 12.8% while detection speed remains almost unchanged.
Table 1. Comparison of detection performance on the VEDAI dataset
(The table is provided as an image in the original patent document.)
Experiment 2:
table 2 shows the model detection results on the single-class data set VEDAI of the NWPU VHR, the detection precision and the detection speed are respectively 75.5% and 19.2F/S on an image with the resolution of 416pixel × 416pixel, the detection precision is improved by 1.1% compared with the original YOLOv3 model, and the detection speed is almost kept unchanged. On an image with the resolution of 512 pixels multiplied by 512 pixels, the detection precision and the speed are respectively 88.3 percent and 40.0F/S, compared with the original YOLOv3 model, the detection precision is improved by 3.2 percent, and the detection speed is almost kept unchanged.
Table 2. Comparison of detection performance on the NWPU VHR dataset
(The table is provided as an image in the original patent document.)
Fig. 2 and Fig. 3 show detection results for 512 pixel × 512 pixel inputs. In each figure, the first row is the original image, the second row is the detection result of the original network, and the third row is the detection result of the improved YOLOv3. Reading column by column: experiment group (a) shows that the improved network resolves the missed detections of the original network, group (b) shows that it resolves the false detections of the original network, and group (c) shows that it localizes the prediction boxes more accurately. The improved network thus raises the recognition rate of small targets in remote sensing images.
Example 6
In this embodiment, YOLOv3 detects small targets using the 8× downsampled feature map output by the backbone network. When the resolution of a target in a remote sensing image falls below 8 pixel × 8 pixel, the target's features are almost absent from that output feature map. To let the network obtain the position information of more small targets, as shown in fig. 3, the first downsampling layer of the network is changed into two 3×3 convolutional layers, retaining more positional feature information of small targets. In addition, the convolutional layer feature output of the last layer of the network is fused with the convolutional layer feature output of the adjacent shallower layer to form a top-down pyramid feature layer, and the convolutional layer feature output of the shallow layer is fused with the convolutional layer feature output of the adjacent deeper layer to form a bottom-up pyramid feature layer; the bidirectionally combined pyramid features are then merged. This exploits the global semantic information of deep features while adding the detailed local texture information of shallow convolutions, so that the intermediate layers can detect small targets more accurately. Finally, a 1×1 convolution reduces the dimensionality of the network model and raises its detection speed.
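The motivation for the front-end change can be made concrete with a little arithmetic: on the 8× downsampled map, a target of side s pixels covers (s/8)² grid cells, so targets at or below 8 pixel × 8 pixel occupy at most about one cell:

```python
# Footprint, in grid cells of the 8x-downsampled map, of targets at the sizes
# discussed above (6 px is an assumed sub-8-px example; 11.5 and 24.1 px are
# the VEDAI extremes quoted earlier).
stride = 8
footprints = {side: (side / stride) ** 2 for side in (6, 8, 11.5, 24.1)}
for side, cells in footprints.items():
    print(f"{side} px target -> {cells:.2f} cells")
```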
In conclusion, the invention solves the prior-art problems of low detection rate, high false alarm rate and low detection speed for very small targets in remote sensing images.
The above examples only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments may still be modified, and some technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the invention.

Claims (3)

1. A method for detecting very small targets in remote sensing images with an improved YOLOv3, characterized by comprising the following steps:
S1, acquiring target data and fusing convolutional layer feature outputs of the YOLOv3 network to form a pyramid feature layer;
S2, combining convolutional layer feature outputs of the shallow YOLOv3 layers to form a pyramid feature layer;
S3, fusing the bidirectionally combined pyramid feature layers;
S4, changing a downsampling layer of the YOLOv3 network into 3×3 convolutional layers;
S5, reducing the dimensionality of the network model with a 1×1 convolution and outputting the data;
wherein in S1, the convolutional layer feature output of the last layer of the YOLOv3 network is fused with the convolutional layer feature output of the adjacent shallower layer to form a top-down pyramid feature layer;
in S2, the convolutional layer feature output of the shallow YOLOv3 layer is combined with the convolutional layer feature output of the adjacent deeper layer to form a bottom-up pyramid feature layer;
in S4, the first downsampling layer of the YOLOv3 network is changed into two 3×3 convolutional layers, so that the network retains more positional feature information of small targets in its initial stage;
and the method adds an additional bottom-up path with lateral connections to the FPN module.
2. A device for detecting very small targets in remote sensing images with an improved YOLOv3, comprising an FPN module, a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that when the computer program is executed by the processor, the detection method of claim 1 is carried out.
3. A storage medium on which a computer program is stored, characterized in that when the computer program is executed by a processor, the detection method of claim 1 is carried out.
CN202010172524.4A 2020-03-12 2020-03-12 YOLOv3 improved minimum remote sensing image target detection method and device and storage medium Active CN111462050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010172524.4A CN111462050B (en) 2020-03-12 2020-03-12 YOLOv3 improved minimum remote sensing image target detection method and device and storage medium


Publications (2)

Publication Number Publication Date
CN111462050A CN111462050A (en) 2020-07-28
CN111462050B true CN111462050B (en) 2022-10-11

Family

ID=71678268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010172524.4A Active CN111462050B (en) 2020-03-12 2020-03-12 YOLOv3 improved minimum remote sensing image target detection method and device and storage medium

Country Status (1)

Country Link
CN (1) CN111462050B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464717B (en) * 2020-10-23 2024-04-02 西安电子科技大学 Remote sensing image target detection method, system, electronic equipment and storage medium
CN112308154A (en) * 2020-11-03 2021-02-02 湖南师范大学 Yolov 3-tiny-based aerial photography vehicle detection method
CN112949692A (en) * 2021-02-03 2021-06-11 歌尔股份有限公司 Target detection method and device
CN112906814B (en) * 2021-03-10 2024-05-28 无锡禹空间智能科技有限公司 Target detection method and system based on NAS network
CN115019243A (en) * 2022-04-21 2022-09-06 山东大学 Monitoring floater lightweight target detection method and system based on improved YOLOv3

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846446A (en) * 2018-07-04 2018-11-20 Academy of Broadcasting Science, SAPPRFT Object detection method based on multi-path dense feature fusion in a fully convolutional network
CN109034210A (en) * 2018-07-04 2018-12-18 Academy of Broadcasting Science, SAPPRFT Object detection method based on hyper-feature fusion and a multi-scale pyramid network
CN109522966A (en) * 2018-11-28 2019-03-26 Sun Yat-sen University Object detection method based on densely connected convolutional neural networks
CN110084124A (en) * 2019-03-28 2019-08-02 Peking University Feature-enhanced object detection method based on a feature pyramid network
CN110766098A (en) * 2019-11-07 2020-02-07 China University of Petroleum (East China) Small-target detection method for traffic scenes based on improved YOLOv3
CN110765951A (en) * 2019-10-24 2020-02-07 Xidian University Remote sensing image aircraft target detection method based on a bounding-box correction algorithm
CN110852330A (en) * 2019-10-23 2020-02-28 Tianjin University Single-stage behavior recognition method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846446A (en) * 2018-07-04 2018-11-20 Academy of Broadcasting Science, SAPPRFT Object detection method based on multi-path dense feature fusion in a fully convolutional network
CN109034210A (en) * 2018-07-04 2018-12-18 Academy of Broadcasting Science, SAPPRFT Object detection method based on hyper-feature fusion and a multi-scale pyramid network
CN109522966A (en) * 2018-11-28 2019-03-26 Sun Yat-sen University Object detection method based on densely connected convolutional neural networks
CN110084124A (en) * 2019-03-28 2019-08-02 Peking University Feature-enhanced object detection method based on a feature pyramid network
CN110852330A (en) * 2019-10-23 2020-02-28 Tianjin University Single-stage behavior recognition method
CN110765951A (en) * 2019-10-24 2020-02-07 Xidian University Remote sensing image aircraft target detection method based on a bounding-box correction algorithm
CN110766098A (en) * 2019-11-07 2020-02-07 China University of Petroleum (East China) Small-target detection method for traffic scenes based on improved YOLOv3

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Aerial Target Detection Based on the Improved YOLOv3 Algorithm; Lecheng Ouyang et al.; The 2019 6th International Conference on Systems and Informatics (ICSAI 2019); 2019-11-04; entire document *
Safety helmet wearing detection method based on improved YOLO v3; Shi Hui et al.; Computer Engineering and Applications; 2019-12-31; entire document *

Also Published As

Publication number Publication date
CN111462050A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111462050B (en) YOLOv3 improved minimum remote sensing image target detection method and device and storage medium
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN110298262B (en) Object identification method and device
CN114202672A (en) Small target detection method based on attention mechanism
CN111461106B (en) Object detection method and device based on reconfigurable network
CN112766188B (en) Small target pedestrian detection method based on improved YOLO algorithm
Xu et al. Fast vehicle and pedestrian detection using improved Mask R-CNN
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN111985374B (en) Face positioning method and device, electronic equipment and storage medium
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN111353544B (en) Target detection method based on improved mixed-pooling YOLOv3
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
CN115019201B (en) Weak and small target detection method based on feature refinement depth network
CN112686274A (en) Target object detection method and device
CN115527096A (en) Small target detection method based on improved YOLOv5
CN111881984A (en) Target detection method and device based on deep learning
CN113269038B (en) Multi-scale-based pedestrian detection method
CN113052071B (en) Method and system for rapidly detecting distraction behavior of driver of hazardous chemical substance transport vehicle
CN113902994A (en) Target detection method, system and device based on unmanned aerial vehicle and NVIDIA development board
CN113361528A (en) Multi-scale target detection method and system
CN116863227A (en) Hazardous chemical vehicle detection method based on improved YOLOv5
CN115761552B (en) Target detection method, device and medium for unmanned aerial vehicle carrying platform
CN117197687A (en) Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets
CN115861922A (en) Sparse smoke and fire detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant