CN112270279B - Multi-dimensional-based remote sensing image micro-target detection method

Info

Publication number
CN112270279B
CN112270279B
Authority
CN
China
Prior art keywords
network
feature
resolution
scale
layer
Legal status
Active
Application number
CN202011204146.XA
Other languages
Chinese (zh)
Other versions
CN112270279A (en)
Inventor
李嫄源
张源川
朱智勤
李鹏华
冒睿睿
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202011204146.XA
Publication of CN112270279A
Application granted
Publication of CN112270279B

Classifications

    • G06V 20/13 Satellite images (Scenes; Scene-specific elements; Terrestrial scenes)
    • G06F 18/253 Fusion techniques of extracted features (Pattern recognition; Analysing)
    • G06N 3/045 Combinations of networks (Neural networks; Architecture)
    • G06T 7/40 Analysis of texture (Image analysis)
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06V 2201/07 Target detection


Abstract

The invention relates to a multi-dimensional remote sensing image micro-target detection method, belonging to the field of target detection. A multi-depth image pyramid network performs multi-depth, multi-resolution feature extraction: a shallow stream extracts high-resolution image features and a deep stream extracts low-resolution image features. To relieve the information imbalance among multi-scale and multi-depth features, a multi-scale feature pyramid network is also provided, which propagates the semantic information and feature texture information of tiny targets from high layers to low layers and from deep streams to shallow streams, improving the recognition of tiny targets. Two simple fully connected layers are then added for the classification and regression tasks; the whole model is called the multi-depth, multi-scale, multi-resolution micro-target detection network M3DNet. This multi-depth, multi-scale, multi-resolution method can effectively detect tiny targets in remote sensing images.

Description

Multi-dimensional-based remote sensing image micro-target detection method
Technical Field
The invention belongs to the field of target detection, and relates to a multi-dimensional remote sensing image micro target detection method.
Background
Object detection has long been an important research direction in computer vision and pattern recognition, and tiny object detection is especially challenging because small objects carry little detailed information and may even vanish in a deep network. Many methods exist for detecting and locating targets in images captured by satellites or drones, but detection remains poor on noisy, low-resolution images, especially for small targets; even on high-resolution images, small objects are detected less reliably than large ones. In general, small objects in an image refer to objects smaller than 32×32 pixels, while tiny objects refer to objects smaller than 15×15 pixels.

Feeding a high-resolution remote sensing image into a network can alleviate this problem and improve the recognition of tiny targets to some extent. However, simply increasing the resolution creates new problems: it exacerbates large variations in scale and incurs prohibitive computational cost and memory consumption. Directly feeding a low-resolution remote sensing image into the network is also problematic: information about tiny targets is severely lacking, and their detail is progressively lost during the network's down-sampling. Remote sensing images additionally suffer from complex backgrounds and multiple resolutions, so handling feature information across different scales is itself a challenging problem.

To alleviate the above problems without introducing new ones, the present invention proposes a Multi-Depth, Multi-Scale, Multi-Resolution micro-target Detection Network (M3DNet), which mainly comprises two modules: a multi-depth image pyramid network and a multi-scale feature pyramid network. The multi-depth image pyramid network is mainly composed of backbone networks of different depths: a deep convolutional neural network processes the low-resolution remote sensing image to extract more semantic information about tiny targets, while a shallow convolutional neural network processes the high-resolution remote sensing image, retaining more feature texture information while reducing computation. The multi-scale feature pyramid network is composed of several conventional feature pyramid networks; it aligns and fuses the multi-scale feature maps generated by the multi-depth image pyramid network, reduces the information imbalance among multi-scale and multi-depth features, and fuses the different features extracted by the shallow and deep backbones to improve the recognition of tiny targets while maintaining performance on medium and large objects. Finally, the fused output of the multi-scale feature pyramid passes through a simple classification and regression network to obtain the detection results.
The invention discloses a remote sensing image micro-target detection algorithm combining multiple depths, multiple scales and multiple resolutions. It balances the instability of tiny targets caused by image information at different depths and feature information at different scales, preserves the feature texture information and semantic information of tiny targets well, and detects tiny targets in remote sensing images with good accuracy and robustness.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting a micro-target in a remote sensing image based on multiple dimensions.
In order to achieve the purpose, the invention provides the following technical scheme:
a multi-dimensional based remote sensing image micro-target detection method comprises the following steps:
1) the M3DNet network extracts the feature information of tiny targets using multiple streams: first, N streams of different depths are selected to construct a multi-depth image pyramid network, and remote sensing images of different resolutions are fed into the different streams of the network;
2) the shallow stream takes the high-resolution image as input and focuses on the texture information of tiny targets; the deep stream takes the low-resolution image as input and extracts deep semantic information of tiny targets; images of different resolutions are produced using a scaling factor β;
3) a multi-scale feature pyramid is constructed according to the number of streams, and the features of the corresponding N streams are fused across scales to relieve the imbalance among multi-scale and multi-level features; channel dimensions are matched between feature layers with a [1×1, 1] convolution network, and spatial sizes are matched between features of different sizes with a bilinear interpolation algorithm;
4) the multi-scale feature pyramid network propagates the semantic information and feature texture information of tiny targets from high layers to low layers and from deep streams to shallow streams; the matched and fused multi-scale features pass through a [1×1, 1] convolution to obtain the output F′i of the multi-scale feature pyramid, where the value range of i is determined by the number of selected feature layers;
5) the value of k is computed with formula (5) and used to select the feature layer; classification and regression tasks are completed by two fully connected layers; the whole network is trained end-to-end, with the weights updated by gradient descent until the network converges.
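For concreteness, steps 1) to 5) can be read as the forward pass sketched below; this is a minimal PyTorch-style outline rather than the patented implementation, and the class name, the global pooling before the fully connected heads, and the clamping of k are illustrative assumptions (the stream and fusion modules are supplied externally):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class M3DNetOutline(nn.Module):
    # Illustrative skeleton of steps 1)-5); components are passed in.
    def __init__(self, streams, fusion, cls_head, reg_head, beta=0.5):
        super().__init__()
        self.streams = nn.ModuleList(streams)  # step 1): N backbones of different depths
        self.fusion = fusion                   # steps 3)-4): multi-scale feature pyramid
        self.cls_head = cls_head               # step 5): fully connected classification head
        self.reg_head = reg_head               # step 5): fully connected regression head
        self.beta = beta                       # step 2): resolution scaling factor

    def forward(self, image):
        # Steps 1)-2): build the image pyramid and feed each level to its stream.
        feats = []
        for i, stream in enumerate(self.streams):
            x = image if i == 0 else F.interpolate(
                image, scale_factor=self.beta ** i,
                mode='bilinear', align_corners=False)
            feats.append(stream(x))            # O_i = C_i(I_i): multi-layer features
        # Steps 3)-4): align and fuse the multi-scale, multi-depth features.
        pyramid = self.fusion(feats)           # O' = {F'_0, ..., F'_{M-1}}
        # Step 5): pick one layer by formula (5), then run the two FC heads.
        M = len(pyramid)
        h, w = image.shape[-2], image.shape[-1]
        k = math.floor(M - 2 + math.log2(math.sqrt(w * h) / 224))
        k = max(0, min(k, M - 1))              # clamp k into {0, ..., M-1}
        pooled = pyramid[k].mean(dim=(2, 3))   # global average pooling (assumption)
        return self.cls_head(pooled), self.reg_head(pooled)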
Optionally, in 1), the specific steps of the multi-depth image pyramid network are as follows:
11) a remote sensing image is given, denoted I, with resolution R;
12) the high-resolution image I0, of resolution R, is fed into a shallow convolutional neural network denoted C0;
13) the medium-resolution image I1, of resolution βR, is fed into a middle-depth convolutional neural network denoted C1;
14) the low-resolution image I2, of resolution β²R, is fed into a deep convolutional neural network denoted C2; where β is a hyper-parameter for adjusting resolution, β ∈ (0, 1), with β = 0.5 by default. The multi-depth image pyramid network thus constructs N streams Ci; this network improves the feature extraction capability for tiny targets, so that the extracted features contain both strong semantic information and strong feature texture information of tiny targets, improving the detection of tiny targets.
Optionally, in 2), for an M3DNet network with N = 3 and ResNet backbones, C0, C1 and C2 are respectively ResNet18, ResNet34 and ResNet50, or combinations such as ResNet18, ResNet34 and ResNet101, or ResNet34, ResNet50 and ResNet101.
Optionally, in 3), let {Ii | i = 0, 1, ..., N−1} denote the input multi-resolution remote sensing images and {Oi | i = 0, 1, ..., N−1} denote the output features of the N corresponding streams of the multi-depth image pyramid network; the output feature Oi of each stream comprises multi-layer features {Fi,j}, where i is the multi-scale index (the index over streams), i ∈ {0, 1, ..., N−1}, and j is the multi-layer index of the backbone network, j ∈ {0, 1, ..., M−1}, indexing the feature layers at different down-sampling rates in the backbone; if the down-sampling rate of a ResNet backbone is 32, there are 5 feature layers at different down-sampling rates, namely those corresponding to rates 2, 4, 8, 16 and 32, and if four of these feature layers are fed into the multi-scale feature pyramid network, then M = 4;
with N = 3 and M = 4, Oi is expressed as the following equation:
Oi = Ci(Ii) = {Fi,0, Fi,1, Fi,2, Fi,3} (1)
where i ∈ {0, 1, 2}.
Optionally, in 4), a [1×1, 1] convolution network performs channel-dimension matching, up-sampling uses a bilinear interpolation algorithm, and feature maps of the same size are combined by element-wise addition; the highest-resolution feature F0,0 retains the strong texture features of tiny targets while combining the strong semantic features of the multi-scale streams, as expressed by formula (2):
F′0,0 = F0,0 + Σ(i,j)≠(0,0) Up^(i+j)(Conv(Fi,j)) (2)
where Fi,j denotes the j-th layer feature of the i-th stream, Up(·) denotes bilinear-interpolation up-sampling at rate 2, with Up^(i+j) applying it i+j times, and Conv(·) denotes the [1×1, 1] convolution operation; the output feature set of the multi-scale feature pyramid network is O′, defined by the formula:
O′={F′0,F′1,F′2,...,F′i,...} (3)
where F′i is defined as follows:
F′i=Conv(F0,i) (4)
where F0,i is the i-th output feature layer of O0, i.e., the output feature layer of the highest-resolution stream.
Optionally, in step 5), the final output O′ of the multi-scale feature pyramid network comprises M feature layers of different resolutions, and the most suitable feature layer is selected through formula (5):
k = ⌊k0 + log2(√(wh)/224)⌋ (5)
where w and h respectively denote the width and height of the remote sensing image I; k0 is a hyper-parameter with default value M−2; and k is the index of the selected feature layer, k ∈ {0, 1, ..., M−1};
the selected feature layer then passes through a simple classification and regression network, i.e., one fully connected layer for the classification task and one for the regression task; the whole network is trained end-to-end, with the network weights updated continuously until the model converges.
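As a hedged sketch of this step: formula (5) mirrors the standard FPN level-assignment rule (the constant 224, the ImageNet pretraining size, is taken from that convention and is an assumption here), and the two fully connected layers can be as simple as the following, with head sizes chosen purely for illustration:

import math
import torch.nn as nn

def select_layer(w, h, M, k0=None):
    # Formula (5): choose the feature layer index from the image size.
    k0 = M - 2 if k0 is None else k0                  # default value from the text
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(0, min(k, M - 1))                      # k in {0, 1, ..., M-1}

num_classes, channels = 20, 256                        # illustrative sizes
cls_head = nn.Linear(channels, num_classes)            # classification task
reg_head = nn.Linear(channels, 4)                      # regression task (box offsets)

print(select_layer(w=1024, h=1024, M=4))               # -> 3 for a 1024x1024 image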
The invention has the following beneficial effects: it provides a method combining multiple depths, multiple scales and multiple resolutions, in which a shallow convolutional neural network extracts high-resolution image features and a deep convolutional neural network extracts low-resolution image features; this reduces computation and memory consumption while improving the extraction of tiny targets in remote sensing images without increasing the parameter count. Meanwhile, a multi-scale feature pyramid structure fuses the features, so that feature information propagates from high layers to low layers (multi-scale) and from deep streams to shallow streams (multi-depth and multi-resolution), giving tiny targets strong feature texture information while retaining strong semantic information and improving their detection results.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of a multi-depth image pyramid network structure;
FIG. 2 is a diagram of a multi-scale feature pyramid network structure;
fig. 3 is an overall structure diagram of a network model.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustrative purposes only and are not intended to limit the invention; to better illustrate the embodiments, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; and certain well-known structures and their descriptions may be omitted from the drawings, as will be understood by those skilled in the art.
The same or similar reference numerals in the drawings of the embodiments correspond to the same or similar components. In the description of the invention, terms indicating orientation or position such as "upper", "lower", "left", "right", "front" and "rear" are based on the orientations shown in the drawings; they are used only for convenience and simplicity of description, do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and are therefore not to be construed as limiting the invention; the specific meaning of these terms can be understood by those skilled in the art according to the circumstances.
Referring to fig. 1 to 3, the present invention includes the following steps:
1) the multi-depth image pyramid network is mainly composed of N backbone networks of different depths and processes an image pyramid, i.e., a multi-resolution remote sensing image; each backbone network is called a "stream". An M3DNet network can use multiple streams (backbone networks) to build the multi-depth image pyramid network. To better explain the feasibility of the method of the invention, an M3DNet network using 3 streams is taken as an example. The specific steps of the multi-depth image pyramid network are as follows:
a) a remote sensing image is given, denoted I, with resolution R;
b) the high-resolution image I0 (resolution R) is fed into a shallow convolutional neural network (denoted C0);
c) the medium-resolution image I1 (resolution βR) is fed into a middle-depth convolutional neural network (denoted C1);
d) the low-resolution image I2 (resolution β²R) is fed into a deep convolutional neural network (denoted C2), where β is a hyper-parameter used to adjust the resolution, β ∈ (0, 1), and β = 0.5 by default. Overall, the multi-depth image pyramid network constructs N streams Ci, i = 0, 1, ..., N−1. This network greatly improves the feature extraction capability for tiny targets, so that the extracted features contain not only strong semantic information of tiny targets but also strong feature texture information, improving the detection of tiny targets.
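A minimal sketch of steps a) to d), assuming bilinear resizing and the default β = 0.5 (the function and variable names are illustrative):

import torch
import torch.nn.functional as F

def build_image_pyramid(image, beta=0.5, n_streams=3):
    # Returns [I_0, I_1, I_2] at resolutions R, beta*R and beta^2*R.
    assert 0.0 < beta < 1.0, "beta is a resolution hyper-parameter in (0, 1)"
    pyramid = [image]  # I_0 at full resolution R, for the shallow stream C_0
    for i in range(1, n_streams):
        pyramid.append(F.interpolate(image, scale_factor=beta ** i,
                                     mode='bilinear', align_corners=False))
    return pyramid

# Example: a 1024x1024 remote sensing image yields 1024/512/256 inputs.
levels = build_image_pyramid(torch.randn(1, 3, 1024, 1024))
print([tuple(t.shape[-2:]) for t in levels])  # [(1024, 1024), (512, 512), (256, 256)]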
2) With the development of deep learning, many mature and efficient backbone networks are available, such as DarkNet53 used by the one-stage YOLO series and ResNet and ResNeXt used by Faster R-CNN and Mask R-CNN. Taking the ResNet backbone as an example, the family includes networks of different depths such as ResNet18, ResNet34, ResNet50 and ResNet101, all of which fit the micro-target detection algorithm of the invention well. As described above, for an M3DNet network with N = 3 and ResNet backbones, C0, C1 and C2 can be ResNet18, ResNet34 and ResNet50, or combinations such as ResNet18, ResNet34 and ResNet101, or ResNet34, ResNet50 and ResNet101; backbone networks such as ResNeXt, DenseNet and DarkNet53 can be used in the same way.
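Assuming torchvision's stock ResNets stand in for the streams (the patent does not prescribe a library, and both create_feature_extractor and the choice of layer1 to layer4 as the four retained feature layers are assumptions), the N = 3 combination C0 = ResNet18, C1 = ResNet34, C2 = ResNet50 can be assembled as follows:

import torch
from torchvision.models import resnet18, resnet34, resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Four feature layers per stream at down-sampling rates 4, 8, 16 and 32
# (torchvision's layer1..layer4), i.e. M = 4 as in the text.
RETURN_NODES = {'layer1': 'F0', 'layer2': 'F1', 'layer3': 'F2', 'layer4': 'F3'}

def make_streams():
    # C_0 shallow, C_1 middle, C_2 deep: one valid depth combination.
    backbones = [resnet18(weights=None), resnet34(weights=None), resnet50(weights=None)]
    return [create_feature_extractor(b, return_nodes=RETURN_NODES) for b in backbones]

streams = make_streams()
out = streams[0](torch.randn(1, 3, 1024, 1024))
print({name: tuple(t.shape) for name, t in out.items()})
# F0..F3 at strides 4, 8, 16, 32 -> spatial sizes 256, 128, 64, 32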
3) Let {Ii | i = 0, 1, ..., N−1} denote the input multi-resolution remote sensing images and {Oi | i = 0, 1, ..., N−1} denote the output features of the N corresponding streams of the multi-depth image pyramid network. The output feature Oi of each stream comprises multi-layer features {Fi,j}, where i is the multi-scale index (the index over streams), i ∈ {0, 1, ..., N−1}, and j is the multi-layer index of the backbone network, j ∈ {0, 1, ..., M−1}, indexing the feature layers at different down-sampling rates. If the down-sampling rate of the ResNet backbone is 32, there are 5 feature layers at different down-sampling rates (corresponding to rates 2, 4, 8, 16 and 32); if four of these feature layers are fed into the multi-scale feature pyramid network, then M = 4, and other cases are analogous. Taking N = 3 and M = 4 as an example, Oi can be expressed as the following equation:
Oi = Ci(Ii) = {Fi,0, Fi,1, Fi,2, Fi,3} (1)
where i ∈ {0, 1, 2}.
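Reusing the assumed build_image_pyramid and make_streams helpers from the earlier sketches, computing Oi = Ci(Ii) as in formula (1) also exposes the spatial alignment the later fusion relies on: with β = 0.5, Fi,j has the same spatial size as F0,i+j:

import torch

streams = make_streams()                                     # C_0, C_1, C_2
images = build_image_pyramid(torch.randn(1, 3, 1024, 1024))  # I_0, I_1, I_2
O = [stream(img) for stream, img in zip(streams, images)]    # O_i = C_i(I_i)

for i, feats in enumerate(O):
    print('stream', i, {name: tuple(t.shape[-2:]) for name, t in feats.items()})
# stream 0: F0 (256, 256)  F1 (128, 128)  F2 (64, 64)  F3 (32, 32)
# stream 1: F0 (128, 128)  F1 (64, 64)    F2 (32, 32)  F3 (16, 16)
# stream 2: F0 (64, 64)    F1 (32, 32)    F2 (16, 16)  F3 (8, 8)
# Equal-sized maps (e.g. F_{1,0} and F_{0,1}) can be added element-wise once a
# 1x1 convolution matches their channel dimensions.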
4) In different streams and different feature layers, the semantic information and feature texture information of tiny targets vary in strength; to make the feature information of tiny targets richer and more comprehensive, a feature pyramid structure is generally used. Feature Pyramid Networks (FPN) are a key component of most target detection algorithms, combining low-resolution, semantically strong feature information with high-resolution, texture-rich feature information via a top-down path and lateral connections. In the M3DNet network, multi-scale (different-resolution) and multi-level (different-layer) features are generated by the multi-depth image pyramid network; to relieve the imbalance among these features, the invention further provides the multi-scale feature pyramid network. Unlike a conventional FPN, in the multi-scale feature pyramid network the semantic information and feature texture information propagate not only from higher layers to lower layers, but also from deep streams (low resolution, deep convolutional neural networks) to shallow streams (high resolution, shallow convolutional neural networks).
Because the pixels of a tiny target are so few, its detail information is gradually lost as the down-sampling rate increases: a deep network often obtains strong semantic information at the cost of texture information, while a shallow network obtains strong texture information but weaker semantic information. The multi-scale feature pyramid network fuses multi-scale and multi-level features well, so that the fused features contain both strong semantic information and strong texture information, improving the recognition of tiny targets. As in a conventional FPN, a [1×1, 1] convolution network performs channel-dimension matching, up-sampling uses a bilinear interpolation algorithm, and feature maps of the same size are combined by element-wise addition. After this processing, the highest-resolution feature F0,0 not only retains the strong texture features of tiny targets but also combines the strong semantic features of the multi-scale streams; this process can be expressed by formula (2):
F′0,0 = F0,0 + Σ(i,j)≠(0,0) Up^(i+j)(Conv(Fi,j)) (2)
where Fi,j denotes the j-th layer feature of the i-th stream, Up(·) denotes bilinear-interpolation up-sampling at rate 2, with Up^(i+j) applying it i+j times so that every term matches the size of F0,0, and Conv(·) denotes the [1×1, 1] convolution operation. Finally, the output feature set of the multi-scale feature pyramid network is O′, which is defined as follows:
O′={F′0,F′1,F′2,...,F′i,...} (3)
where F′i is defined as follows:
F′i=Conv(F0,i) (4)
where F0,i is the i-th output feature layer of O0, i.e., the output feature layer of the highest-resolution stream.
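A hedged sketch of one way to realize formulas (2) to (4): every Fi,j is channel-matched by a [1×1, 1] convolution, deep-stream maps are added into the equal-sized layer of the shallowest stream, and a top-down pass with rate-2 bilinear up-sampling propagates semantics to higher resolutions. The channel width, the exact fusion order and the final 1×1 output convolutions are assumptions consistent with, but not dictated by, the text:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeaturePyramid(nn.Module):
    # Fuses multi-depth features {F_{i,j}} into O' = {F'_0, ..., F'_{M-1}}.
    def __init__(self, in_channels, out_channels=256):
        # in_channels[i][j] is the channel width of F_{i,j} (stream i, layer j).
        super().__init__()
        self.lateral = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(c, out_channels, kernel_size=1) for c in stream])
            for stream in in_channels])
        self.out_conv = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=1)
             for _ in in_channels[0]])

    def forward(self, O):
        # O[i] is the list [F_{i,0}, ..., F_{i,M-1}] from stream i.
        M = len(O[0])
        lat = [[self.lateral[i][j](f) for j, f in enumerate(stream)]
               for i, stream in enumerate(O)]
        fused = list(lat[0])  # start from the highest-resolution stream
        # Deep stream -> shallow stream: with beta = 0.5, F_{i,j} (i > 0) has
        # the same spatial size as F_{0,i+j}, so it is added element-wise there.
        for i in range(1, len(lat)):
            for j in range(M):
                if i + j < M:
                    fused[i + j] = fused[i + j] + lat[i][j]
        # High layer -> low layer: classic top-down path with rate-2 up-sampling.
        for j in range(M - 2, -1, -1):
            fused[j] = fused[j] + F.interpolate(
                fused[j + 1], scale_factor=2, mode='bilinear', align_corners=False)
        # Formula (4): F'_j = Conv(fused layer j of the highest-resolution stream).
        return [conv(f) for conv, f in zip(self.out_conv, fused)]

# Channel widths matching the ResNet18/34/50 combination sketched earlier:
fpn = MultiScaleFeaturePyramid([[64, 128, 256, 512],
                                [64, 128, 256, 512],
                                [256, 512, 1024, 2048]])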
5) Through the above steps, the final output O′ of the multi-scale feature pyramid network is obtained; O′ comprises M feature layers of different resolutions, and the most suitable feature layer is selected through formula (5):
k = ⌊k0 + log2(√(wh)/224)⌋ (5)
where w and h respectively denote the width and height of the remote sensing image I; k0 is a hyper-parameter with default value M−2; and k is the index of the selected feature layer, k ∈ {0, 1, ..., M−1}.
The selected feature layer passes through a simple classification and regression network, i.e., one fully connected layer for the classification task and one for the regression task. The whole network is trained end-to-end, with the network weights updated continuously until the model converges.
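The end-to-end training described here amounts to an ordinary gradient-descent loop; the sketch below assumes cross-entropy for classification, smooth-L1 for box regression and SGD, none of which are specified by the patent:

import torch
import torch.nn as nn

def train(model, loader, epochs=12, lr=1e-3):
    # model returns (class_logits, box_preds) for a batch of remote sensing images.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    cls_loss = nn.CrossEntropyLoss()
    reg_loss = nn.SmoothL1Loss()
    for _ in range(epochs):
        for images, labels, boxes in loader:
            logits, preds = model(images)
            loss = cls_loss(logits, labels) + reg_loss(preds, boxes)
            opt.zero_grad()
            loss.backward()   # end-to-end: gradients reach every stream
            opt.step()        # weights updated by gradient descent
    return model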
The specific implementation details of each part of the invention are as follows:
1. the multi-depth streams extract high-level semantic and low-level texture features of the remote sensing image; after the multi-depth image pyramid network, strong semantic information and strong texture information of tiny targets are extracted;
2. the multi-scale feature pyramid structure fuses features of different scales and different levels, propagating semantic information from high layers to low layers and from deep streams to shallow streams; this further enriches the feature information of tiny targets, relieves the imbalance among features, and improves the recognition of tiny targets;
3. after the feature set O′ is obtained, a feature layer is selected using formula (5), and the detection result is obtained through a simple classification and regression structure; meanwhile, the whole network is trained end-to-end, with the weight parameters updated continuously until the network converges.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (4)

1. A multi-dimensional based remote sensing image micro-target detection method, characterized in that the method comprises the following steps:
1) the M3DNet network extracts the feature information of tiny targets using multiple streams: first, N streams of different depths are selected to construct a multi-depth image pyramid network, and remote sensing images of different resolutions are fed into the different streams of the network;
2) the shallow stream takes the high-resolution image as input and focuses on the texture information of tiny targets; the deep stream takes the low-resolution image as input and extracts deep semantic information of tiny targets; images of different resolutions are produced using a scaling factor β;
3) a multi-scale feature pyramid is constructed according to the number of streams, and the features of the corresponding N streams are fused across scales to relieve the imbalance among multi-scale and multi-level features; channel dimensions are matched between feature layers with a [1×1, 1] convolution network, and spatial sizes are matched between features of different sizes with a bilinear interpolation algorithm;
4) the multi-scale feature pyramid network propagates the semantic information and feature texture information of tiny targets from high layers to low layers and from deep streams to shallow streams; the matched and fused multi-scale features pass through a [1×1, 1] convolution to obtain the output F′i of the multi-scale feature pyramid, where the value range of i is determined by the number of selected feature layers;
5) the value of k is computed with a formula and used to select the feature layer; classification and regression tasks are completed by two fully connected layers; the whole network is trained end-to-end, with the weights updated by gradient descent until the network converges;
in the step 1), the specific steps of the multi-depth image pyramid network are as follows:
11) a remote sensing image is given, denoted I, with resolution R;
12) the high-resolution image I0, of resolution R, is fed into a shallow convolutional neural network denoted C0;
13) the medium-resolution image I1, of resolution βR, is fed into a middle-depth convolutional neural network denoted C1;
14) the low-resolution image I2, of resolution β²R, is fed into a deep convolutional neural network denoted C2; where β is a hyper-parameter used to adjust the resolution, β ∈ (0, 1); the multi-depth image pyramid network constructs N streams Ci; this network improves the feature extraction capability for tiny targets, so that the extracted features contain both strong semantic information and strong feature texture information of tiny targets, improving the detection of tiny targets;
in said 4), a [1×1, 1] convolution network performs channel-dimension matching, up-sampling uses a bilinear interpolation algorithm, and feature maps of the same size are combined by element-wise addition; the highest-resolution feature F0,0 retains the strong texture features of tiny targets while combining the strong semantic features of the multi-scale streams, as expressed by formula (2):
F′0,0 = F0,0 + Σ(i,j)≠(0,0) Up^(i+j)(Conv(Fi,j)) (2)
where i is the multi-scale index, i.e., the index over streams, j is the multi-layer index of the backbone network, Fi,j denotes the j-th layer feature of the i-th stream, Up(·) denotes bilinear-interpolation up-sampling at rate 2, with Up^(i+j) applying it i+j times, and Conv(·) denotes the [1×1, 1] convolution operation; the output feature set of the multi-scale feature pyramid network is O′, defined as follows:
O′ = {F′0, F′1, F′2, ..., F′i, ...} (3)
where F′i is defined as follows:
F′i = Conv(F0,i) (4)
where F0,i is the i-th output feature layer of O0, i.e., the output feature layer of the highest-resolution stream.
2. The multi-dimensional based remote sensing image micro-target detection method according to claim 1, characterized in that: for an M3DNet network with N = 3 and ResNet backbones in said 2), C0, C1 and C2 are respectively ResNet18, ResNet34 and ResNet50, or a combination such as ResNet18, ResNet34 and ResNet101, or ResNet34, ResNet50 and ResNet101.
3. The multi-dimensional based remote sensing image micro-target detection method according to claim 2, characterized in that: in said 3), let {Ii | i = 0, 1, ..., N−1} denote the input multi-resolution remote sensing images and {Oi | i = 0, 1, ..., N−1} denote the output features of the N corresponding streams of the multi-depth image pyramid network; the output feature Oi of each stream comprises multi-layer features {Fi,j}, where i is the multi-scale index, i.e., the index over streams, i ∈ {0, 1, ..., N−1}, j is the multi-layer index of the backbone network, j ∈ {0, 1, ..., M−1}, and M is the number of feature layers of different resolutions, j indexing the feature layers at different down-sampling rates in the backbone; the down-sampling rate of the ResNet backbone is 32, so there are 5 feature layers at different down-sampling rates, namely those corresponding to rates 2, 4, 8, 16 and 32; four of these feature layers are fed into the multi-scale feature pyramid network, so M = 4;
with N = 3 and M = 4, Oi is expressed as the following equation:
Oi = Ci(Ii) = {Fi,0, Fi,1, Fi,2, Fi,3} (1)
where i ∈ {0, 1, 2}.
4. The multi-dimensional based remote sensing image micro-target detection method according to claim 1, characterized in that: in said 5), the final output O′ of the multi-scale feature pyramid network comprises M feature layers of different resolutions, and the feature layer is selected according to formula (5):
k = ⌊k0 + log2(√(wh)/224)⌋ (5)
where w and h respectively denote the width and height of the remote sensing image I; k0 is a hyper-parameter equal to M−2; and k is the index of the selected feature layer, k ∈ {0, 1, ..., M−1};
the selected feature layer passes through a simple classification and regression network, i.e., one fully connected layer for the classification task and one for the regression task; the whole network is trained end-to-end, with the network weights updated continuously until the model converges.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011204146.XA 2020-11-02 2020-11-02 Multi-dimensional-based remote sensing image micro-target detection method

Publications (2)

Publication Number Publication Date
CN112270279A CN112270279A (en) 2021-01-26
CN112270279B (en) 2022-04-12

Family

ID=74345857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011204146.XA Multi-dimensional-based remote sensing image micro-target detection method 2020-11-02 2020-11-02

Country Status (1)

Country Link
CN (1) CN112270279B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599832A (en) * 2016-12-09 2017-04-26 重庆邮电大学 Method for detecting and recognizing various types of obstacles based on convolution neural network
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111460980A (en) * 2020-03-30 2020-07-28 西安工程大学 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Multi-Scale Wavelet 3D-CNN for Hyperspectral Image Super-Resolution; Jingxiang Yang et al.; Remote Sensing; 2019-06-30; vol. 11, no. 13; pp. 1-22 *
Infrared small target detection with complex background based on image layer and confidence analysis; Li, H. et al.; AOPC 2015: Image Processing and Analysis; 2015-12-31; pp. 1-6 *
Small object detection algorithm based on multi-scale fusion SSD (基于多尺度融合SSD的小目标检测算法); Zhao Yanan et al.; Computer Engineering (计算机工程); 2020-01-31; vol. 46, no. 1; pp. 247-254 *
Research and implementation of a deep-learning-based object detection algorithm for remote sensing images (基于深度学习的遥感图像目标检测算法的研究与实现); Wu Jiaxiang; China Master's Theses Full-text Database, Engineering Science and Technology; 2020-01-15; pp. C028-253 *

Also Published As

Publication number Publication date
CN112270279A (en) 2021-01-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant