CN112101153A - Remote sensing target detection method based on receptive field module and multiple characteristic pyramid - Google Patents

Remote sensing target detection method based on receptive field module and multiple characteristic pyramid

Info

Publication number
CN112101153A
Authority
CN
China
Prior art keywords
feature
layer
convolution
receptive field
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010906088.9A
Other languages
Chinese (zh)
Other versions
CN112101153B (en)
Inventor
赵丹培
刘子铭
姜志国
史振威
张浩鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010906088.9A priority Critical patent/CN112101153B/en
Publication of CN112101153A publication Critical patent/CN112101153A/en
Application granted granted Critical
Publication of CN112101153B publication Critical patent/CN112101153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing target detection method based on a receptive field module and a multi-feature pyramid, comprising the following steps: extracting features from a visible-light remote sensing image with a VGG network to obtain feature maps of different sizes; cascade-fusing the feature maps of different sizes with a fusion module to obtain a multi-scale cascade feature map; convolving the multi-scale cascade feature map sequentially with a strided-convolution feature pyramid to obtain optimized feature maps; feeding the first optimized feature layer into a receptive field module and extracting information to obtain a receptive field layer; feeding the receptive field layer together with the second, third, fourth and fifth optimized feature layers into an anchor optimization module for binary anchor classification; and feeding the optimized anchors into a target detection module for target classification to obtain the multi-scale targets. The invention can detect targets of different sizes simultaneously and improves detection accuracy and speed.

Description

Remote sensing target detection method based on receptive field module and multi-feature pyramid
Technical Field
The invention relates to the technical field of digital image processing, and in particular to a remote sensing target detection method based on a receptive field module and a multi-feature pyramid.
Background
Target detection in optical remote sensing images faces many challenges, such as the large number of object instances per image, the large image extent, and complex background textures; at the same time, optical remote sensing imagery has high but still under-exploited value in both civil and military applications. In recent years, as the resolution of satellite optical imagery has increased, finer objects can be identified in remote sensing images. Although high-resolution remote sensing images provide more detailed information about targets, the image background also becomes more complex, which makes target detection more difficult. Detecting targets of different kinds at multiple scales becomes especially hard; moreover, remote sensing targets are mostly scattered, with no fixed positions or spacing, which has become a major problem in the field of target detection.
Patent CN111126359A proposes a small-target detection method for high-definition images based on an autoencoder and the YOLO algorithm, mainly addressing the fact that prior methods cannot achieve both accuracy and speed for small-target detection in high-definition images. Although the detection speed and accuracy of the network are improved, its training must be carried out in two separate stages, which is cumbersome, and it is not suited to detecting targets of mixed sizes.
Patent CN110378242A proposes a remote sensing target detection method with a dual attention mechanism. Because it is a two-stage detection algorithm, its detection speed cannot be greatly improved, and its structure is not designed for targets at multiple scales.
Therefore, how to provide an efficient multi-scale target detection method is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a remote sensing target detection method based on a receptive field module and a multi-feature pyramid, which can detect targets of different sizes simultaneously and improves detection accuracy and speed.
In order to achieve the purpose, the invention adopts the following technical scheme:
the remote sensing target detection method based on the receptive field module and the multi-feature pyramid comprises the following steps:
step 1: extracting features from the input visible-light remote sensing image with the multi-feature pyramid, which specifically comprises:
step 11: extracting features from the visible-light remote sensing image with a VGG network to obtain feature maps of different sizes;
step 12: cascade-fusing the feature maps of different sizes with a fusion module to obtain a multi-scale cascade feature map;
step 13: sequentially convolving the multi-scale cascade feature map with a strided-convolution feature pyramid to obtain optimized feature maps;
step 2: performing target identification and classification on the optimized feature maps through a detection network to obtain multi-scale targets, which specifically comprises:
step 21: feeding the first optimized feature layer into a receptive field module and extracting information to obtain a receptive field layer;
step 22: feeding the receptive field layer together with the second, third, fourth and fifth optimized feature layers into an anchor optimization module for binary anchor classification;
step 23: feeding the optimized anchors into a target detection module and classifying the targets to obtain the multi-scale targets.
Further, the feature maps output by step 11 include Conv3_3, Conv4_3, Conv5_3 and Conv7.
Further, step 12 specifically comprises:
upsampling Conv4_3, Conv5_3 and Conv7 to the size of Conv3_3, i.e. a spatial size of 80 × 80;
cascade-fusing Conv3_3 with the upsampled Conv4_3, Conv5_3 and Conv7 to obtain the 1024-channel multi-scale cascade feature map.
Further, step 13 specifically comprises: sequentially applying to the multi-scale cascade feature map one 3 × 3 × 512 convolution with stride 1 and four with stride 2 to obtain the optimized feature maps.
Further, step 21 specifically comprises:
the first branch passes through a 1 × 1 convolution and then a 3 × 3 convolution with dilation rate 1;
the second branch passes through a 1 × 1 convolution, a 1 × 3 convolution and then a 3 × 3 convolution with dilation rate 3;
the third branch passes through a 1 × 1 convolution, a 3 × 1 convolution and then a 3 × 3 convolution with dilation rate 3;
the fourth branch passes through a 1 × 1 convolution, a 3 × 3 convolution and then a 3 × 3 convolution with dilation rate 5;
the fifth branch is a shortcut connection;
the outputs of the first four branches are concatenated, passed through a 1 × 1 convolution, and summed with the shortcut output to obtain the receptive field layer.
Further, the loss function of the detection network is:

$$\mathcal{L}=\frac{1}{N_{a}}\Big(\sum_{i}L_{b}\big(p_{i},[l_{i}^{*}\ge 1]\big)+\sum_{i}[l_{i}^{*}\ge 1]\,L_{r}\big(x_{i},g_{i}^{*}\big)\Big)+\frac{1}{N_{o}}\Big(\sum_{i}L_{m}\big(c_{i},l_{i}^{*}\big)+\sum_{i}[l_{i}^{*}\ge 1]\,L_{r}\big(t_{i},g_{i}^{*}\big)\Big)$$

where $l_{i}^{*}$ is the ground-truth class label of anchor $i$ and $g_{i}^{*}$ is the position and size of the ground-truth box of anchor $i$; $p_{i}$ and $x_{i}$ are, respectively, the predicted confidence that anchor $i$ is an object and the refined coordinates of anchor $i$ in the anchor optimization module; $c_{i}$ and $t_{i}$ are the predicted object class and the final bounding-box coordinates in the target detection module; $N_{a}$ and $N_{o}$ are the numbers of anchors detected as positive in the anchor optimization module and the target detection module, respectively; $L_{r}$ is the regression loss, $L_{b}$ the binary classification loss and $L_{m}$ the multi-class classification loss; the indicator $[l_{i}^{*}\ge 1]$ outputs 1 when the target is true and 0 when it is false.
Compared with the prior art, the invention discloses a remote sensing target detection method based on a receptive field module and a multi-feature pyramid, i.e. a new network model built on multi-scale feature fusion and a receptive-field mechanism. Mining the receptive-field information on the feature map yields richer information, and multi-scale detection outputs are produced from the feature maps by the anchor optimization module and the target detection module, which improves detection speed and efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in their description are briefly introduced below. It is obvious that the drawings described below are only embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a remote sensing target detection method based on a receptive field module and a multi-feature pyramid according to the present invention.
Fig. 2 is a schematic diagram of the multi-feature pyramid structure.
Fig. 3 is a schematic diagram of the receptive field module.
Fig. 4 shows experimental detection results of the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing target detection method based on a receptive field module and a multi-feature pyramid, which comprises the following steps:
Step 1: extract features from the input visible-light remote sensing image with a VGG network; the specific implementation is as follows:
The visible-light remote sensing image is fed into the VGG feature extraction network, where successive local convolution and pooling operations produce deep features of the input image; a feature map is obtained after each convolution or pooling. Four layers of the VGG backbone, Conv3_3, Conv4_3, Conv5_3 and Conv7, are selected as the output feature layers; each pyramid feature map contains different semantic and spatial information, which completes the bottom-up process. The sizes of Conv3_3, Conv4_3, Conv5_3 and Conv7 are 80 × 80, 40 × 40, 20 × 20 and 10 × 10, respectively.
To avoid the slow convergence caused by training the network from scratch and to speed up training, the feature extraction network is pre-trained on the ILSVRC CLS-LOC dataset.
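For concreteness, the bottom-up (pooling) feature extraction can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the exact network of the patent: the torchvision layer indices, the construction of Conv7 as an extra pooled 3 × 3 convolution, and the 320 × 320 input size (chosen so that the 80/40/20/10 feature-map sizes quoted above come out) are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGBackbone(nn.Module):
    """Taps the VGG-16 feature maps after conv3_3, conv4_3 and conv5_3 and adds
    an extra stride-2 block standing in for Conv7.  In practice the VGG weights
    would be the ILSVRC CLS-LOC pre-trained ones mentioned above."""
    def __init__(self):
        super().__init__()
        feats = vgg16().features            # plain VGG-16; load pre-trained weights as needed
        self.stage3 = feats[:16]            # ... relu(conv3_3)  -> 80 x 80, 256 channels
        self.stage4 = feats[16:23]          # ... relu(conv4_3)  -> 40 x 40, 512 channels
        self.stage5 = feats[23:30]          # ... relu(conv5_3)  -> 20 x 20, 512 channels
        self.conv7 = nn.Sequential(         # hypothetical extra block -> 10 x 10, 1024 channels
            nn.MaxPool2d(2),
            nn.Conv2d(512, 1024, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        c3 = self.stage3(x)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c3, c4, c5, self.conv7(c5)

# a 320 x 320 input reproduces the 80/40/20/10 feature-map sizes quoted above
maps = VGGBackbone()(torch.randn(1, 3, 320, 320))
print([m.shape[-1] for m in maps])          # [80, 40, 20, 10]
```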
Step 2: the fusion module combines the four feature maps extracted from the VGG backbone network; the specific implementation is as follows:
The feature maps are cascade-fused by the fusion module: the base feature-map size is set to that of Conv3_3, and the three layers Conv4_3, Conv5_3 and Conv7 are upsampled to 80 × 80 so that the spatial dimensions of all feature maps are consistent. The original channel counts are preserved and the maps are concatenated into one multi-scale cascade feature map containing all semantic features from low level to high level. This completes the fusion of the multi-layer features at the bottom layer.
This differs from the traditional feature pyramid network, which fuses the layers by directly adding the features of each layer one by one, a time-consuming and laborious scheme.
In addition, based on statistics of the dataset, the present invention selects Conv3_3 instead of the Conv4_3 used by common networks. Remote sensing targets are generally small, and for cars and ships in particular some targets are smaller than 20 × 20 pixels; with Conv4_3, the receptive field of each element in the feature map is large, so semantic information is prominent but detail features are weak.
Step 3: second-stage feature extraction is performed on the multi-scale cascade feature map with the strided-convolution feature pyramid; the specific implementation is as follows:
The multi-scale cascade feature map is convolved sequentially with one 3 × 3 × 512 convolution of stride 1 and four of stride 2, completing the second bottom-up feature extraction.
In this second bottom-up stage a new pyramid network is reconstructed. As shown in Fig. 2, applying batch normalization to the input of every convolutional layer speeds up model training and reduces uncertainty-induced variation. Moreover, because the pyramid levels are produced by convolution, with the downscaling of the feature map determined by the convolution stride, the detection precision of small targets is improved, regional information is enriched, and the features become more diverse.
The multi-feature pyramid structure thus balances the respective strengths and weaknesses of pooling-based and convolution-based feature extraction. The feature pyramid consists of three parts: bottom-up feature extraction, fusion of the multi-layer features at the bottom layer, and a second bottom-up feature extraction.
Step 4: receptive field information is extracted from the combined feature map; the specific implementation is as follows:
The first feature layer of the optimized feature maps is fed into the receptive field module and processed by five branches in total:
the first branch passes through a 1 × 1 convolution and then a 3 × 3 convolution with dilation rate 1;
the second branch passes through a 1 × 1 convolution, a 1 × 3 convolution and then a 3 × 3 convolution with dilation rate 3;
the third branch passes through a 1 × 1 convolution, a 3 × 1 convolution and then a 3 × 3 convolution with dilation rate 3;
the fourth branch passes through a 1 × 1 convolution, a 3 × 3 convolution and then a 3 × 3 convolution with dilation rate 5;
the fifth branch is a shortcut connection;
the outputs of the first four branches are concatenated, passed through a 1 × 1 convolution, and summed with the shortcut output to obtain the receptive field layer.
The invention places the receptive field module on the cascade-fused feature map so that each element of the feature map and the features of its surrounding receptive field can be extracted, building the target together with the context around it and thereby improving the detection precision of small targets.
Step 5: binary anchor classification is performed with the anchor optimization module; the specific implementation is as follows:
In the anchor optimization stage, the anchors and bounding boxes are refined once: bounding boxes and class predictions are generated from the receptive field layer and the last four layers of the strided-convolution feature pyramid feed-forward network. The receptive field layer and these four layers have sizes {80 × 80, 40 × 40, 20 × 20, 10 × 10, 5 × 5}.
An anchor is assigned to every feature point of every feature layer, and a bounding box is placed around each anchor. The features covered by each bounding box are compared with the ground-truth boxes; whether a box contains a target is judged from its intersection-over-union (IoU) with the target, and each box is assigned a label according to this IoU value. A bounding box assigned the value 0 contains no target and is an easy negative.
This stage removes the easily detected poor boxes while keeping the boxes that contain targets and the boxes that are hard to judge. It prevents the ratio of negative to positive samples from becoming so large that it harms classification accuracy, and at the same time lets the network learn more readily from hard samples, increasing its capacity to handle them. It is therefore a two-branch path: one branch regresses the bounding boxes to learn and correct the offsets between the boxes and the target boxes, and the other performs binary classification to separate correct boxes from wrong ones. Multi-layer features are used here so that targets are obtained at multiple scales.
The anchor optimization module thus identifies and discards wrong anchor boxes, reducing the amount of computation required by the subsequent classifier.
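A rough sketch of this filtering step is given below, assuming one binary score pair and one box-offset vector per anchor; the 0.99 negative-confidence threshold and the standard (dx, dy, dw, dh) box encoding are assumptions for illustration, not values taken from the patent.

```python
import torch

def filter_refined_anchors(arm_cls_logits, arm_box_deltas, anchors, neg_thresh=0.99):
    """Refine anchors and drop those the binary branch marks as confident background.

    arm_cls_logits: (N, 2) binary scores per anchor (background, object)
    arm_box_deltas: (N, 4) predicted offsets (dx, dy, dw, dh) per anchor
    anchors:        (N, 4) anchor boxes as (cx, cy, w, h)
    """
    probs = torch.softmax(arm_cls_logits, dim=1)
    keep = probs[:, 0] < neg_thresh               # drop anchors that are confidently background

    # decode the refined anchors from the regression branch
    dx, dy, dw, dh = arm_box_deltas.unbind(dim=1)
    cx = anchors[:, 0] + dx * anchors[:, 2]
    cy = anchors[:, 1] + dy * anchors[:, 3]
    w = anchors[:, 2] * torch.exp(dw)
    h = anchors[:, 3] * torch.exp(dh)
    refined = torch.stack([cx, cy, w, h], dim=1)
    return refined, keep                           # only anchors with keep=True reach the detection module

# usage with random tensors for 100 anchors
refined, keep = filter_refined_anchors(torch.randn(100, 2), torch.randn(100, 4) * 0.1,
                                        torch.rand(100, 4) * 80)
print(refined.shape, int(keep.sum()))
```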
Step 6: the anchors are classified into target categories by the target detection module; the specific implementation is as follows:
After the anchor optimization module removes the bounding boxes that contain no target from the last four layers of the strided-convolution feature pyramid and from the receptive field layer, the layers are fed in turn through connecting blocks into the target detection module for target classification. Specifically:
The topmost feature map of the strided-convolution feature pyramid, after removal of the boxes containing no target, is sent into a connecting block. The connecting block has two outputs: the first is the feature map, containing the bounding boxes, required by the target detection module; the second is the input feature map after upsampling, which is additively fused with the feature map fed into the connecting block of the previous (shallower) layer. The purpose is to add higher-level features to the previous layer's feature map so that it inherits larger-scale context information.
The penultimate feature map of the strided-convolution feature pyramid, after removal of the boxes containing no target, is fed into its connecting block and additively fused with the upsampled map from the layer above; the fused map again has two outputs, one sent to the target detection module and the other upsampled and additively fused with the map entering the connecting block of the layer below. This is repeated until the receptive field layer, after removal of the boxes containing no target, is sent into its connecting block and directly fused additively with the upsampled map from the layer above, yielding the multi-scale targets.
In this way the relationship between the anchor optimization module and the target detection module is constructed. In the target detection module, a second regression is performed on the coordinates of the refined bounding boxes passed on by the anchor optimization module, and the target category is judged with a softmax multi-class loss function, so that more accurate bounding boxes and category identification results are obtained.
The two detection modules are linked by the conversion connecting blocks, which convert the regression feature maps containing bounding-box information into the forms required by the top-down detection module while providing context features through upsampling; this improves the detection of large targets and further increases detection accuracy.
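A minimal sketch of such a connecting block and of the top-down pass is given below; the 3 × 3 convolution layout, the 256-channel width and the bilinear upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConnectingBlock(nn.Module):
    """Converts one pyramid level into the form used by the target detection
    module and, when a higher-level fused map is available, upsamples it and
    adds it in so that larger-scale context is inherited."""
    def __init__(self, in_channels=512, out_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, level_feature, higher_fused=None):
        out = self.reduce(level_feature)
        if higher_fused is not None:
            out = out + F.interpolate(higher_fused, size=out.shape[-2:],
                                      mode='bilinear', align_corners=False)
        return self.smooth(out)   # sent to the target detection module and to the next lower level

# top-down pass over pyramid levels ordered from shallow (80x80) to deep (5x5)
levels = [torch.randn(1, 512, s, s) for s in (80, 40, 20, 10, 5)]
blocks = nn.ModuleList(ConnectingBlock() for _ in levels)
fused = None
detection_inputs = []
for feat, block in zip(reversed(levels), reversed(list(blocks))):
    fused = block(feat, fused)
    detection_inputs.append(fused)
print([f.shape[-1] for f in detection_inputs])   # [5, 10, 20, 40, 80]
```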
The loss function of the network comprises the loss of the bottom-up feature pyramid structure (the anchor optimization module) and the loss of the top-down structure (the target detection module). For the anchor optimization module, the refined anchors whose negative confidence is smaller than a threshold are passed to the target detection module for further prediction of the object class and more accurate regression coordinates. The loss function of the network is defined as:

$$\mathcal{L} = \mathcal{L}_{anchor} + \mathcal{L}_{det}$$

where the loss function of the anchor optimization module is

$$\mathcal{L}_{anchor} = \frac{1}{N_{a}}\Big(\sum_{i} L_{b}\big(p_{i}, [l_{i}^{*}\ge 1]\big) + \sum_{i} [l_{i}^{*}\ge 1]\, L_{r}\big(x_{i}, g_{i}^{*}\big)\Big)$$

and the loss function of the target detection module is

$$\mathcal{L}_{det} = \frac{1}{N_{o}}\Big(\sum_{i} L_{m}\big(c_{i}, l_{i}^{*}\big) + \sum_{i} [l_{i}^{*}\ge 1]\, L_{r}\big(t_{i}, g_{i}^{*}\big)\Big)$$

Here $i$ indexes all anchors in each batch; $l_{i}^{*}$ is the ground-truth class label of anchor $i$ and $g_{i}^{*}$ the position and size of its ground-truth box; $p_{i}$ and $x_{i}$ are, respectively, the predicted confidence that anchor $i$ is an object and the refined coordinates of anchor $i$ in the anchor optimization module; $c_{i}$ and $t_{i}$ are the predicted object class and the final bounding-box coordinates in the target detection module; $N_{a}$ and $N_{o}$ are the numbers of anchors detected as positive by the anchor optimization module and the target detection module, respectively. $L_{r}$ is the regression loss, the binary classification loss $L_{b}$ is the two-class (target versus non-target) cross-entropy log loss, and the multi-class loss $L_{m}$ is the softmax loss over the multi-class confidences. The indicator $[l_{i}^{*}\ge 1]$ outputs 1 when the target is true and 0 when it is false. If a denominator $N_{a}$ or $N_{o}$ is 0, the corresponding numerator is set to 0.
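The following sketch shows how such a two-part loss could be computed for one batch of matched anchors; using cross-entropy for the classification terms, smooth-L1 for the regression term, and the same positive set for both normalizers are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def detection_loss(arm_cls, arm_loc, odm_cls, odm_loc, gt_labels, gt_boxes):
    """Anchor-optimization loss (binary + regression) plus target-detection loss
    (softmax multi-class + regression).

    arm_cls: (N, 2)   binary object/background logits per anchor
    arm_loc: (N, 4)   refined anchor coordinates x_i
    odm_cls: (N, C)   multi-class logits c_i
    odm_loc: (N, 4)   final bounding-box coordinates t_i
    gt_labels: (N,)   ground-truth class label l_i* per anchor (0 = background), int64
    gt_boxes:  (N, 4) ground-truth box g_i* matched to each anchor
    """
    positive = gt_labels > 0                          # the indicator [l_i* >= 1]
    n_a = n_o = positive.sum().clamp(min=1).float()   # guard against a zero denominator

    # anchor optimization module: binary classification + box regression on positives
    l_b = F.cross_entropy(arm_cls, positive.long(), reduction='sum')
    l_r_arm = F.smooth_l1_loss(arm_loc[positive], gt_boxes[positive], reduction='sum')

    # target detection module: multi-class classification + box regression on positives
    l_m = F.cross_entropy(odm_cls, gt_labels, reduction='sum')
    l_r_odm = F.smooth_l1_loss(odm_loc[positive], gt_boxes[positive], reduction='sum')

    return (l_b + l_r_arm) / n_a + (l_m + l_r_odm) / n_o
```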
In the top-down target detection module, the size and position of the target boxes are further refined and the category of each detection box is predicted. Multi-class target identification and regression are performed on different feature levels, so multi-scale targets can be detected accurately at the same time. Because the bottom-up pyramid feature network has already refined the target boxes once, the top-down stage starts from optimized regression boxes. Moreover, the shallowest feature map of the top-down network also fuses the information contained in the deep feature maps, further strengthening the target features; being able to repeatedly emphasize the features of small targets is very beneficial for small-target detection.
Fig. 4 shows the detection results of the method under several complex conditions. The first is the dense presence of many small targets, such as tanks and ships: even when targets are scattered into the corners of the image they are found, because the anchor boxes are distributed over the whole image, and the smaller targets are captured more accurately because the feature pyramid is built on a larger-scale feature map. The second is the simultaneous presence of large and small targets that occlude each other, such as ships and ports or vehicles and bridges: the method distinguishes and detects targets of different sizes well, because pooling and convolution features are combined so that both large and small targets are taken into account. The third is targets with complex texture features, such as golf courses and airports: thanks to the receptive field module, the information of the target itself and of its surrounding background can be extracted, which greatly improves detection accuracy.
The invention has the following advantages:
1. The multi-feature pyramid comprises a pooling feature pyramid (the VGG network) and a strided-convolution feature pyramid, which respectively highlight the global and local information of the remote sensing image and of the target.
2. The cascade feature layer combines the local and global features of the visible-light remote sensing image, giving better results on dispersed targets and small targets.
3. The model breaks through the precision limitations of traditional one-stage detection networks and achieves better results than two-stage remote sensing target detection networks.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A remote sensing target detection method based on a receptive field module and a multi-feature pyramid, characterized by comprising the following steps:
step 1: extracting features from the input visible-light remote sensing image with the multi-feature pyramid, which specifically comprises:
step 11: extracting features from the visible-light remote sensing image with a VGG network to obtain feature maps of different sizes;
step 12: cascade-fusing the feature maps of different sizes with a fusion module to obtain a multi-scale cascade feature map;
step 13: sequentially convolving the multi-scale cascade feature map with a strided-convolution feature pyramid to obtain optimized feature maps;
step 2: performing target identification and classification on the optimized feature maps through a detection network to obtain multi-scale targets, which specifically comprises:
step 21: feeding the first optimized feature layer into a receptive field module and extracting information to obtain a receptive field layer;
step 22: feeding the receptive field layer together with the second, third, fourth and fifth optimized feature layers into an anchor optimization module for binary anchor classification;
step 23: feeding the optimized anchors into a target detection module and classifying the targets to obtain the multi-scale targets.
2. The remote sensing target detection method based on the receptive field module and the multi-feature pyramid as claimed in claim 1, wherein the feature maps output in step 11 include Conv3_3, Conv4_3, Conv5_3 and Conv7.
3. The remote sensing target detection method based on the receptive field module and the multi-feature pyramid as claimed in claim 2, wherein step 12 specifically comprises:
upsampling Conv4_3, Conv5_3 and Conv7 to the size of Conv3_3, i.e. a spatial size of 80 × 80;
cascade-fusing Conv3_3 with the upsampled Conv4_3, Conv5_3 and Conv7 to obtain the 1024-channel multi-scale cascade feature map.
4. The remote sensing target detection method based on the receptive field module and the multi-feature pyramid as claimed in claim 3, wherein step 13 specifically comprises: sequentially applying to the multi-scale cascade feature map one 3 × 3 × 512 convolution with stride 1 and four with stride 2 to obtain the optimized feature maps.
5. The remote sensing target detection method based on the receptive field module and the multi-feature pyramid as claimed in claim 4, wherein step 21 specifically comprises:
the first branch passes through a 1 × 1 convolution and then a 3 × 3 convolution with dilation rate 1;
the second branch passes through a 1 × 1 convolution, a 1 × 3 convolution and then a 3 × 3 convolution with dilation rate 3;
the third branch passes through a 1 × 1 convolution, a 3 × 1 convolution and then a 3 × 3 convolution with dilation rate 3;
the fourth branch passes through a 1 × 1 convolution, a 3 × 3 convolution and then a 3 × 3 convolution with dilation rate 5;
the fifth branch is a shortcut connection;
the outputs of the first four branches are concatenated, passed through a 1 × 1 convolution, and summed with the shortcut output to obtain the receptive field layer.
6. The remote sensing target detection method based on the receptive field module and the multi-feature pyramid as claimed in claim 5, wherein the loss function of the detection network is:

$$\mathcal{L}=\frac{1}{N_{a}}\Big(\sum_{i}L_{b}\big(p_{i},[l_{i}^{*}\ge 1]\big)+\sum_{i}[l_{i}^{*}\ge 1]\,L_{r}\big(x_{i},g_{i}^{*}\big)\Big)+\frac{1}{N_{o}}\Big(\sum_{i}L_{m}\big(c_{i},l_{i}^{*}\big)+\sum_{i}[l_{i}^{*}\ge 1]\,L_{r}\big(t_{i},g_{i}^{*}\big)\Big)$$

where $l_{i}^{*}$ is the ground-truth class label of anchor $i$ and $g_{i}^{*}$ is the position and size of the ground-truth box of anchor $i$; $p_{i}$ and $x_{i}$ are, respectively, the predicted confidence that anchor $i$ is an object and the refined coordinates of anchor $i$ in the anchor optimization module; $c_{i}$ and $t_{i}$ are the predicted object class and the final bounding-box coordinates in the target detection module; $N_{a}$ and $N_{o}$ are the numbers of anchors detected as positive in the anchor optimization module and the target detection module, respectively; $L_{r}$ is the regression loss, $L_{b}$ the binary classification loss and $L_{m}$ the multi-class classification loss; the indicator $[l_{i}^{*}\ge 1]$ outputs 1 when the target is true and 0 when it is false.
CN202010906088.9A 2020-09-01 2020-09-01 Remote sensing target detection method based on receptive field module and multiple characteristic pyramids Active CN112101153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010906088.9A CN112101153B (en) 2020-09-01 2020-09-01 Remote sensing target detection method based on receptive field module and multiple characteristic pyramids

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010906088.9A CN112101153B (en) 2020-09-01 2020-09-01 Remote sensing target detection method based on receptive field module and multiple characteristic pyramids

Publications (2)

Publication Number Publication Date
CN112101153A true CN112101153A (en) 2020-12-18
CN112101153B CN112101153B (en) 2022-08-26

Family

ID=73757412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010906088.9A Active CN112101153B (en) 2020-09-01 2020-09-01 Remote sensing target detection method based on receptive field module and multiple characteristic pyramids

Country Status (1)

Country Link
CN (1) CN112101153B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580585A (en) * 2020-12-28 2021-03-30 深圳职业技术学院 Excavator target detection method and device based on stacked dense network
CN113111718A (en) * 2021-03-16 2021-07-13 苏州海宸威视智能科技有限公司 Fine-grained weak-feature target emergence detection method based on multi-mode remote sensing image
CN113436148A (en) * 2021-06-02 2021-09-24 范加利 Method and system for detecting critical points of ship-borne airplane wheel contour based on deep learning
CN113837080A (en) * 2021-09-24 2021-12-24 江西理工大学 Small target detection method based on information enhancement and receptive field enhancement
CN115527123A (en) * 2022-10-21 2022-12-27 河北省科学院地理科学研究所 Land cover remote sensing monitoring method based on multi-source feature fusion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711295A (en) * 2018-12-14 2019-05-03 北京航空航天大学 A kind of remote sensing image offshore Ship Detection
CN110796037A (en) * 2019-10-15 2020-02-14 武汉大学 Satellite-borne optical remote sensing image ship target detection method based on lightweight receptive field pyramid
CN111079604A (en) * 2019-12-06 2020-04-28 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) Method for quickly detecting tiny target facing large-scale remote sensing image
US20200175352A1 (en) * 2017-03-14 2020-06-04 University Of Manitoba Structure defect detection using machine learning algorithms

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200175352A1 (en) * 2017-03-14 2020-06-04 University Of Manitoba Structure defect detection using machine learning algorithms
CN109711295A (en) * 2018-12-14 2019-05-03 北京航空航天大学 A kind of remote sensing image offshore Ship Detection
CN110796037A (en) * 2019-10-15 2020-02-14 武汉大学 Satellite-borne optical remote sensing image ship target detection method based on lightweight receptive field pyramid
CN111079604A (en) * 2019-12-06 2020-04-28 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) Method for quickly detecting tiny target facing large-scale remote sensing image

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUAN YAO et al.: "On-Board Ship Detection in Micro-Nano Satellite Based on Deep Learning and COTS Component", Remote Sensing *
周慧 et al.: "Multi-target ship detection in SAR images using an improved feature pyramid model", Telecommunication Engineering *
张云飞 et al.: "A highly robust real-time single-target ship tracking method based on a Siamese network", Ship Science and Technology *
王言鹏 et al.: "Single shot multibox detector algorithm for inland river ship target detection", Journal of Harbin Engineering University *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580585A (en) * 2020-12-28 2021-03-30 深圳职业技术学院 Excavator target detection method and device based on stacked dense network
CN113111718A (en) * 2021-03-16 2021-07-13 苏州海宸威视智能科技有限公司 Fine-grained weak-feature target emergence detection method based on multi-mode remote sensing image
CN113436148A (en) * 2021-06-02 2021-09-24 范加利 Method and system for detecting critical points of ship-borne airplane wheel contour based on deep learning
CN113436148B (en) * 2021-06-02 2022-07-12 中国人民解放军海军航空大学青岛校区 Method and system for detecting critical points of ship-borne airplane wheel contour based on deep learning
CN113837080A (en) * 2021-09-24 2021-12-24 江西理工大学 Small target detection method based on information enhancement and receptive field enhancement
CN113837080B (en) * 2021-09-24 2023-07-25 江西理工大学 Small target detection method based on information enhancement and receptive field enhancement
CN115527123A (en) * 2022-10-21 2022-12-27 河北省科学院地理科学研究所 Land cover remote sensing monitoring method based on multi-source feature fusion

Also Published As

Publication number Publication date
CN112101153B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN112101153B (en) Remote sensing target detection method based on receptive field module and multiple characteristic pyramids
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN112766188B (en) Small target pedestrian detection method based on improved YOLO algorithm
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN112070713A (en) Multi-scale target detection method introducing attention mechanism
CN113076842A (en) Method for improving identification precision of traffic sign in extreme weather and environment
CN111126359A (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN113971764B (en) Remote sensing image small target detection method based on improvement YOLOv3
CN113642390A (en) Street view image semantic segmentation method based on local attention network
CN114140683A (en) Aerial image target detection method, equipment and medium
CN113313706B (en) Power equipment defect image detection method based on detection reference point offset analysis
CN113591617B (en) Deep learning-based water surface small target detection and classification method
CN107038442A (en) A kind of car plate detection and global recognition method based on deep learning
CN113468978A (en) Fine-grained vehicle body color classification method, device and equipment based on deep learning
CN112766409A (en) Feature fusion method for remote sensing image target detection
CN115631427A (en) Multi-scene ship detection and segmentation method based on mixed attention
CN115861756A (en) Earth background small target identification method based on cascade combination network
Du et al. Improved detection method for traffic signs in real scenes applied in intelligent and connected vehicles
CN115527096A (en) Small target detection method based on improved YOLOv5
CN115330703A (en) Remote sensing image cloud and cloud shadow detection method based on context information fusion
Guo et al. D3-Net: Integrated multi-task convolutional neural network for water surface deblurring, dehazing and object detection
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
Li et al. Road extraction from high spatial resolution remote sensing image based on multi-task key point constraints

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant