CN116977872A - CNN+Transformer remote sensing image detection method - Google Patents

CNN+Transformer remote sensing image detection method

Info

Publication number
CN116977872A
Authority
CN
China
Prior art keywords
layer
cnn
remote sensing
input
cnns
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310885532.7A
Other languages
Chinese (zh)
Inventor
杨海光
黄钰林
刘泽林
裴季方
唐雪
霍伟博
张寅
杨建宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310885532.7A
Publication of CN116977872A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a CNN+Transformer remote sensing image target detection method, applied to the field of target detection, which addresses the low detection accuracy of existing target detection algorithms based on convolutional neural networks. A parallel CNN-and-Transformer target detection network is first constructed, and the interaction of global and local information is used to improve multi-scale target detection. Multi-scale information is fused in the network through top-down and bottom-up paths, and feature selection is finally completed with a coordinate attention (CA) mechanism that assigns an adaptive weight to each channel, so that multi-scale targets in remote sensing images are accurately detected and located. The method combines the advantages of the CNN and the Transformer, locates target regions accurately, maintains a high detection rate after training, and achieves accurate detection of remote sensing image targets.

Description

CNN+Transformer remote sensing image detection method
Technical Field
The invention belongs to the field of target detection, and particularly relates to a remote sensing image multi-scale target detection technology.
Background
Remote sensing technology plays a vital role in many fields, such as marine monitoring, geological surveying, military target detection and natural disaster monitoring. Target detection in remote sensing images is an important direction in the development of remote sensing technology; its main task is to locate target objects in a remote sensing image and identify their categories. Because remote sensing images have complex backgrounds, targets of widely varying sizes and targets occupying few pixels, traditional target detection methods that slide a window over the original image to extract candidate boxes tend to be time-consuming and inaccurate. With the rapid development of deep learning and of the convolutional neural network (CNN), end-to-end models became available for computer vision; since CNNs learn features automatically and achieve high detection accuracy, CNN-based target detection algorithms have gradually replaced traditional ones and become the current mainstream.
Methods such as context information, attention mechanisms and multi-scale feature fusion have since been used to address the multi-scale problem of remote sensing images, and the Transformer architecture from natural language processing has been introduced into target detection to push the field further. DETR, proposed in 2020, was the earliest model to bring the Transformer into target detection and achieved accuracy comparable to Faster R-CNN: features are first extracted by a CNN, the output is flattened into a sequence of vectors, and the Transformer processes this sequence to produce the final predictions.
The literature "Xiao T, singh M, mintun E, et al early Convolutions Help Transformers See Better,2021, 14881," proposes a new hybrid architecture named CMT for visual recognition and other downstream visual tasks to address the limitations of using convertors in a rough manner in the computer vision field. The proposed CMT takes advantage of both CNN and Transformers to capture local and global information, facilitating the representation capabilities of the network. The weight sharing method used by the model may result in some information being lost because they are shared in multiple locations of the model.
The literature "Li J, xia X, li W, et al Next-ViT: next Generation Vision Transformer for EfficientDeployment in Realistic Industrial Scenarios,2022, 4698" proposes a Next-ViT model, proposes a learning paradigm of two information NCB, NTB to capture local representation and global information, and finally explores a hybrid paradigm thereof to obtain an efficient deployment architecture model of CNN and transformers. Due to limitations in its applicability and versatility, the Next-ViT model may not have better performance on some datasets than other advanced visual models. The model is shown in the paper to be advantageous in CPU-based deployments, but for GPU-based or other accelerator deployments, further evaluation of whether its performance is competitive is still required. The model is mostly used in the field of image classification, and the use in the field of target detection is still in the development stage.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a remote sensing image multi-scale target detection method with a parallel CNN+Transformer structure.
The technical scheme adopted by the invention is: a remote sensing image multi-scale target detection method with a parallel CNN+Transformer structure, comprising the following specific steps:
s1, constructing a convolution layer structure of a plurality of layers of CNNs;
s2, constructing a multi-layer transducer structure;
s3, setting up an information exchange path for the structures of the steps S1 and S2 to obtain a preliminary parallel backbone network;
s4, constructing a feature pyramid structure for the preliminary parallel backbone network obtained in the step S3 to form a neck network;
s5, connecting the network in the step S4 with the CA attention module to construct a deep neural network;
s6, acquiring target detection data of the remote sensing image, and generating a sample training set, a verification set and a test set according to the acquired target detection data set of the remote sensing image;
s7, training the deep neural network constructed in the step S5 according to the sample training set and the verification set in the step S6;
s8, inputting the test set into the deep neural network trained in the step S7, and accordingly performing target detection on the remote sensing image.
The invention has the following beneficial effects: the method first constructs a parallel CNN-and-Transformer target detection network and uses the interaction of global and local information to improve multi-scale target detection; multi-scale information is fused in the network through top-down and bottom-up paths, and feature selection is finally completed with the coordinate attention (CA) mechanism, which assigns an adaptive weight to each channel, achieving accurate detection and localisation of multi-scale targets in remote sensing images. A sample training set, verification set and test set can further be generated from a remote sensing image target detection dataset, and the remote sensing image target detection task is completed with the method. The method combines the advantages of the CNN and the Transformer, locates target regions accurately, maintains a high detection rate after training on the samples, and achieves accurate detection of multi-scale targets in remote sensing images.
Drawings
Fig. 1 is a flowchart of the remote sensing image multi-scale target detection method with a parallel CNN+Transformer structure of the present invention.
Fig. 2 is a schematic diagram of an information exchange path according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a CA attention mechanism according to an embodiment of the present invention.
Fig. 4 is a schematic SPP structure diagram of a neural network according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a multi-scale target detection result of a remote sensing image according to an embodiment of the present invention;
fig. 5 (a) shows the real frames of the original image, fig. 5 (b) the detection result of Faster-RCNN-FPN, fig. 5 (c) the detection result of YOLOv3, and fig. 5 (d) the detection result of the present invention.
Detailed Description
The present invention is further explained below with reference to the drawings, so that those skilled in the art can understand its technical content.
As shown in fig. 1, the flow chart of the remote sensing image multi-scale target detection method with a parallel CNN+Transformer structure of the present invention specifically comprises the following steps:
s1, constructing a convolution layer structure of a plurality of layers of CNNs;
s2, constructing a multi-layer transducer structure;
s3, constructing an information exchange path for the structures of the steps S1 and S2 to obtain a preliminary parallel backbone network;
s4, constructing a feature pyramid structure for the preliminary parallel backbone network obtained in the step S3 to form a neck network;
s5, connecting the network in the step S4 with the CA attention module to construct a deep neural network;
s6, reasonably generating a sample training set, a verification set and a test set by using a remote sensing image target detection data set in the embodiment;
s7, training the deep neural network.
In this embodiment, in step S1, five convolution layers are constructed with residual connections; the first convolution layer is mainly used for channel adjustment, and the remaining layers extract local information.
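As an illustration only, the following is a minimal PyTorch sketch of one such residual convolution layer, assuming the 1×1, 3×3, 1×1 bottleneck layout of claim 2; the class name, the BatchNorm/ReLU placement and the channel reduction are assumptions, not the patented implementation.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Residual bottleneck block: 1x1 -> 3x3 -> 1x1 convolutions (claim 2 layout)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        mid = max(1, out_ch // 2)  # assumed bottleneck width
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection on the skip path when the shape changes
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))  # residual connection
```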
In this embodiment, in step S2, a four-layer Transformer structure composed of LayerNorm (layer normalization), MHSA (multi-head self-attention) and MLP (multi-layer perceptron) is constructed for extracting global information.
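A minimal sketch of one such Transformer layer, assuming a standard pre-norm arrangement with residual connections; torch.nn.MultiheadAttention stands in for the MHSA described above, and the head count and MLP expansion ratio are assumed values.

```python
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Pre-norm Transformer layer: LayerNorm -> MHSA -> LayerNorm -> MLP."""
    def __init__(self, dim, heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (B, N, C) token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around MHSA
        x = x + self.mlp(self.norm2(x))                    # residual around MLP
        return x
```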
In this embodiment, in step S3, dimension transformation is performed by interpolation and downsampling to realize the information exchange, as shown in fig. 2, where Maxpooling denotes maximum pooling, Interpolate denotes the nearest-neighbour interpolation operation, and Reshape denotes the dimension adjustment operation.
Since the calculation characteristics of the CNN and the Transformer differ (the CNN feature map is three-dimensional while the Transformer feature map is two-dimensional), the feature maps must be deformed during information interaction. When the local information of the CNN is fused into the Transformer branch, the feature map first undergoes a maximum pooling operation:

$$\mathrm{output}(m,n)=\max_{0\le p<k,\;0\le q<k}\ \mathrm{input}(m\cdot s+p,\ n\cdot s+q)\qquad(1)$$

where input is the input feature map, output is the pooled result, $(m,n)$ is a position in the output feature map, $(p,q)$ is a position within the pooling window of size $k$, and $s$ is the pooling stride.
The max pooling operation takes the maximum value within each pooling window of the input feature map as output, so each element of the output feature map corresponds to the maximum of one pooling window in the original input. Typically the pooling window size and stride are fixed, and may be adjusted according to the needs of the network.
Downsampling is thus completed by maximum pooling; the channel and height dimensions of each pixel are then kept for the reshaping operation while the width dimension is erased. When the global information of the Transformer is fused into the CNN branch, the inverse operations are performed: the feature map is first reshaped to reconstruct the width dimension, and nearest-neighbour interpolation then upsamples it, completing the information interaction. The nearest-neighbour interpolation formula is:
$$x_{out}(m,n)=x_{in}\big(\mathrm{round}(m/\mathrm{scale}),\ \mathrm{round}(n/\mathrm{scale})\big)\qquad(2)$$

where $x_{in}$ and $x_{out}$ denote the input and output pixel values respectively, round is a rounding function, and scale is the scaling factor between input and output.
This information interaction ensures that global information is always present in the CNN branch, which effectively improves the expressiveness of target features. For example, a ship is more likely to appear at sea, so fusing the scene information together with the ship's own characteristic features aids detection; the same principle applies to airplanes at airports and tanks at factories. A sketch of the two directions of this exchange is given below.
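The following sketch assumes tokens are obtained by flattening the pooled H×W grid into a sequence; the function names and the pooling/upsampling factors are illustrative.

```python
import torch.nn.functional as F

def cnn_to_transformer(feat, pool=2):
    """CNN -> Transformer: max-pool the (B, C, H, W) map (eq. (1)), then flatten to tokens."""
    x = F.max_pool2d(feat, kernel_size=pool, stride=pool)
    b, c, h, w = x.shape
    return x.flatten(2).transpose(1, 2), (h, w)  # (B, H*W, C) sequence plus grid shape

def transformer_to_cnn(tokens, hw, scale=2):
    """Transformer -> CNN: reshape tokens back to a map, then nearest-neighbour upsample (eq. (2))."""
    b, n, c = tokens.shape
    h, w = hw
    x = tokens.transpose(1, 2).reshape(b, c, h, w)  # rebuild the spatial dimensions
    return F.interpolate(x, scale_factor=scale, mode="nearest")
```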
In this embodiment, in step S4, features (P2, P3, P4, P5) of different levels are extracted from the backbone network obtained in step S3; since P5 is the deepest feature, P5 is defined as the top and P2 as the bottom. The top-down path transfers information by interpolation, propagating deep semantic information to the shallow layers for fusion (element-wise addition at each position). The bottom-up path transfers shallow positional information to the deep network to improve localisation accuracy, using 3×3 convolutions to downsample the shallow feature maps so that they match and fuse, yielding the {C2, C3, C4, C5} feature maps.
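The sketch below shows one consistent reading of this neck (top-down interpolation-and-add, then bottom-up 3×3 stride-2 convolution-and-add); feature-map sizes are assumed to halve between levels, and the module layout is illustrative rather than the patented one.

```python
import torch.nn as nn
import torch.nn.functional as F

class Neck(nn.Module):
    """Top-down then bottom-up fusion over {P2..P5}; assumes all maps share `ch`
    channels and that each level is exactly half the size of the one above it."""
    def __init__(self, ch):
        super().__init__()
        # 3x3 stride-2 convs for the bottom-up downsampling path
        self.down = nn.ModuleList(nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3))

    def forward(self, p2, p3, p4, p5):
        # top-down: deep semantics flow to shallow maps (element-wise addition)
        p4 = p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest")
        p3 = p3 + F.interpolate(p4, size=p3.shape[-2:], mode="nearest")
        p2 = p2 + F.interpolate(p3, size=p2.shape[-2:], mode="nearest")
        # bottom-up: shallow localisation cues flow back to the deep maps
        c2 = p2
        c3 = self.down[0](c2) + p3
        c4 = self.down[1](c3) + p4
        c5 = self.down[2](c4) + p5
        return c2, c3, c4, c5
```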
In this embodiment, in step S5, feature selection on the feature maps of S4 is implemented by the CA attention mechanism (shown in fig. 3). In fig. 3, Input and Output denote the input and output feature maps respectively; H, W, C and r denote the height, width and channel dimensions of the feature map and the reduction ratio; Residual denotes the residual structure; Avg Pool denotes average pooling; Concat+Conv2d denotes concatenation followed by a two-dimensional convolution; BatchNorm+Non-linear denotes channel normalisation followed by a non-linearity; Sigmoid denotes the activation function; Re-Weight denotes the re-weighting.
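A compact sketch of the CA mechanism of fig. 3, following the published coordinate attention design (directional average pooling, Concat+Conv2d, BatchNorm plus non-linearity, Sigmoid, Re-Weight); the default reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention: pool along H and W separately, mix with a shared
    1x1 conv, then re-weight the input per channel and position."""
    def __init__(self, ch, reduction=32):
        super().__init__()
        mid = max(8, ch // reduction)
        self.conv1 = nn.Conv2d(ch, mid, 1)      # Concat + Conv2d
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)        # Non-linear
        self.conv_h = nn.Conv2d(mid, ch, 1)
        self.conv_w = nn.Conv2d(mid, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = x.mean(dim=3, keepdim=True)                  # Avg Pool over W -> (B, C, H, 1)
        xw = x.mean(dim=2, keepdim=True).transpose(2, 3)  # Avg Pool over H -> (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = y.split([h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                  # (B, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.transpose(2, 3)))  # (B, C, 1, W)
        return x * ah * aw                                   # Re-Weight
```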
As shown in fig. 4, in the specific network structure of this embodiment, a convolution layer is denoted "CNN block" and drawn as a step module containing three convolution operations, each annotated with its kernel size and channel count. A Transformer layer is denoted "Transformer block" and drawn as a trans module containing LayerNorm, MHSA and MLP. The Coordinate Attention mechanism is denoted CA in fig. 4, and feature map denotes a feature map. {P2, P3, P4, P5} and {C2, C3, C4, C5} denote feature maps of different scales. Finally, three feature maps of different scales are obtained after adjustment by a 1×1 convolution layer; classifier and regressor denote the classification and regression heads respectively.
In this embodiment, in step S6, the sample training set, verification set and test set are generated from a remote sensing image target detection dataset. The LEVIR remote sensing image dataset is used, which contains aircraft, ship and tank targets at different scales; the pictures are divided into training, verification and test sets in the ratio 6:2:2.
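A minimal sketch of this 6:2:2 split; the helper name and fixed seed are illustrative assumptions.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle image ids and split them 6:2:2 into train/verification/test sets."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train, n_val = int(0.6 * len(items)), int(0.2 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```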
In this embodiment, in step S7, the training set obtained in step S6 is input into the deep neural network constructed in step S5 for forward propagation, the cost function value is calculated, the network parameters are updated with a gradient-descent-based backward propagation algorithm, and forward and backward propagation are iterated until the cost function converges. The specific steps are as follows:
s71, forward propagation is performed, and,
to be used forThe nth feature map of the mth layer (m is larger than or equal to 2) is shown, and if the mth layer is a convolution layer:
wherein, the liquid crystal display device comprises a liquid crystal display device,a convolution kernel representing a connection between the s-th input feature map and the n-th output feature map,/->Representing the bias term, σ (·) representing the nonlinear activation function, the symbol "×" representing the convolution operation;
If the $m$-th layer is a maximum pooling layer:

$$f_n^{(m)}(x',y')=\max_{0\le u<r_1,\;0\le v<r_2} f_n^{(m-1)}\big(x'\cdot r_1+u,\ y'\cdot r_2+v\big)\qquad(4)$$

where $f_n^{(m)}(x',y')$ denotes the value at position $(x',y')$ in the $n$-th feature map of the $m$-th layer ($m\ge 2$), $r_1$, $r_2$ denote the size of the pooling window, and $u$, $v$ denote preset variables within the pooling window;
If the $m$-th layer is an average pooling layer:

$$f_n^{(m)}(x',y')=\frac{1}{r_1 r_2}\sum_{u=0}^{r_1-1}\sum_{v=0}^{r_2-1} f_n^{(m-1)}\big(x'\cdot r_1+u,\ y'\cdot r_2+v\big)\qquad(5)$$

where $r_1$, $r_2$ denote the size of the pooling window;
If the $m$-th layer is a fully connected layer:

$$f^{(m)}=\sigma\big(w^{(m)}f^{(m-1)}+b^{(m)}\big)\qquad(7)$$

where $f^{(m)}$ denotes the feature map of the $m$-th layer, $w^{(m)}$ denotes the weight of the $m$-th layer, and $b^{(m)}$ is the bias term of this layer.
S72, calculating the cost function value.
The detection loss is taken as the cost function and is calculated as:

$$\begin{aligned}
L={}&\lambda_{coord}\sum_{i=0}^{M^2}\sum_{j=0}^{A}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big]\\
&+\lambda_{coord}\sum_{i=0}^{M^2}\sum_{j=0}^{A}\mathbb{1}_{ij}^{obj}\Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2+\big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big]\\
&+\lambda_{obj}\sum_{i=0}^{M^2}\sum_{j=0}^{A}\mathbb{1}_{ij}^{obj}\big(C_i-\hat{C}_i\big)^2
+\lambda_{noobj}\sum_{i=0}^{M^2}\sum_{j=0}^{A}\mathbb{1}_{ij}^{noobj}\big(C_i-\hat{C}_i\big)^2
\end{aligned}\qquad(8)$$

The first two terms of equation (8) are the regression bounding-box loss; the third and fourth terms are the confidence loss, computed over positive samples (cell grids containing an object) and negative samples (cell grids containing no object, i.e. background). The final feature map is divided into $M\times M$ grids, each of which predicts $A$ bounding boxes. $(x_i,y_i)$ denotes the coordinates of the centre of the object bounding box within the $i$-th cell grid, and $(w_i,h_i)$ denotes its width and height, so the true position is $(x_i,y_i,w_i,h_i)$ and $(\hat{x}_i,\hat{y}_i,\hat{w}_i,\hat{h}_i)$ denotes the predicted position. $C_i$ indicates whether there is a target in the grid cell (1 if so, 0 if not), and $\hat{C}_i$ denotes the prediction in the grid cell (1 if a target is predicted, 0 otherwise). $\mathbb{1}_i^{obj}$ marks a cell grid $i$ containing an object, $\mathbb{1}_{ij}^{obj}$ marks the $j$-th bounding box in grid cell $i$ responsible for predicting the object, $\mathbb{1}_{ij}^{noobj}$ marks the $j$-th bounding box in grid cell $i$ responsible for predicting the absence of objects, and $\lambda_{coord}$, $\lambda_{obj}$, $\lambda_{noobj}$ denote the weights of the respective losses.
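For illustration, a simplified sketch of the structure of equation (8), assuming predictions and targets are packed as (x, y, w, h, C) tensors over the grid and that the loss weights default to the common YOLO values; this is not the patented training code.

```python
import torch

def detection_loss(pred, target, obj_mask, noobj_mask,
                   lam_coord=5.0, lam_obj=1.0, lam_noobj=0.5):
    """Equation (8) sketch: box-regression terms plus positive/negative confidence terms.
    pred/target: (..., 5) tensors holding (x, y, w, h, C); masks select responsible boxes."""
    sq = lambda a, b: (a - b) ** 2
    # centre terms, then square-rooted width/height terms
    loc = sq(pred[..., 0], target[..., 0]) + sq(pred[..., 1], target[..., 1])
    size = (sq(pred[..., 2].clamp(min=0).sqrt(), target[..., 2].sqrt())
            + sq(pred[..., 3].clamp(min=0).sqrt(), target[..., 3].sqrt()))
    conf = sq(pred[..., 4], target[..., 4])
    return (lam_coord * ((loc + size) * obj_mask).sum()
            + lam_obj * (conf * obj_mask).sum()
            + lam_noobj * (conf * noobj_mask).sum())
```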
S73, updating the network parameters with a gradient-descent-based backward propagation algorithm:

$$w \leftarrow w-\alpha\frac{\partial L}{\partial w},\qquad b \leftarrow b-\alpha\frac{\partial L}{\partial b}\qquad(9)$$

where $\alpha$ is the learning rate, $L$ denotes the loss function, and $w$ and $b$ denote the sets of weights and bias terms in the network respectively.
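A direct sketch of this update rule, assuming gradients have already been populated by a backward pass; in practice a torch.optim optimizer plays the same role.

```python
import torch

def sgd_step(params, alpha):
    """Plain gradient descent: w <- w - alpha * dL/dw for every parameter."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.sub_(alpha * p.grad)
                p.grad.zero_()

# usage: loss.backward(); sgd_step(model.parameters(), alpha=0.01)
```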
This example further includes step S8: testing the detection performance of the network trained in step S7. The pictures of the test set are input into the trained network for forward propagation, and the target detection and localisation results are described with the MAP evaluation index. $MAP_{0.5}$ and $MAP_{0.75}$ denote the MAP values at thresholds of 0.5 and 0.75 respectively; $MAP_{0.5:0.95}$ denotes the average MAP over thresholds from 0.5 to 0.95; $MAP_S$, $MAP_M$ and $MAP_L$ denote the MAP values for small, medium and large targets respectively.
The verification set is used to check the effect during network training: each round of training is followed by one round of verification.
Table 1 gives the number of pictures used for training, verifying and testing the network in this embodiment, and table 2 gives the detection results obtained, compared with several mainstream target detection networks. Fig. 5 is a schematic diagram of the multi-scale target detection results of remote sensing images in this example: fig. 5 (a) shows the real frames of the original image, fig. 5 (b) the detection result of Faster-RCNN-FPN, fig. 5 (c) the detection result of YOLOv3, and fig. 5 (d) the detection result of this example. The numerical results show that the method of the invention accomplishes the target detection task with a $MAP_{0.5}$ of 87.2%. The result pictures show that, compared with the other network models in table 2, the method detects well and locates accurately.
Table 1 Number of pictures used for network training, verification and testing in this embodiment

Category    Training set    Verification set    Test set
Pictures    2275            758                 758
Table 2 Detection results obtained in this example

Method            MAP_{0.5:0.95}   MAP_{0.5}   MAP_{0.75}   MAP_S   MAP_M   MAP_L
Faster-RCNN       47.7%            77.5%       54.1%        12.7%   54.1%   63.8%
Faster-RCNN-FPN   56.0%            85.9%       64.8%        30.1%   58.6%   70.0%
YOLOv3            52.4%            72.6%       61.2%        25.4%   55.2%   67.5%
YOLOv5s           47.3%            68.4%       55.1%        18.2%   48.7%   66.0%
YOLOv5m           49.7%            69.1%       57.9%        19.1%   51.7%   67.2%
Ours              57.1%            87.2%       66.5%        32.4%   60.1%   71.8%
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and that the scope of the invention is not limited to such specific statements and embodiments. Various modifications and variations of the present invention will be apparent to those skilled in the art; any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (7)

1. A remote sensing image multi-scale target detection method with a parallel CNN+Transformer structure, characterized by comprising the following specific steps:
s1, constructing a convolution layer structure of a plurality of layers of CNNs;
s2, constructing a multi-layer transducer structure;
s3, setting up an information exchange path for the structures of the steps S1 and S2 to obtain a preliminary parallel backbone network;
s4, constructing a feature pyramid structure for the preliminary parallel backbone network obtained in the step S3 to form a neck network;
s5, connecting the network in the step S4 with the CA attention module to construct a deep neural network;
s6, acquiring target detection data of the remote sensing image, and generating a sample training set, a verification set and a test set according to the acquired target detection data set of the remote sensing image;
s7, training the deep neural network constructed in the step S5 according to the sample training set and the verification set in the step S6;
s8, inputting the test set into the deep neural network trained in the step S7, and accordingly performing target detection on the remote sensing image.
2. The method for detecting a multi-scale target of a remote sensing image in a parallel CNN+Transformer structure according to claim 1, wherein the convolution layer structure of the multi-layer CNN in step S1 comprises six CNN layers; the first CNN layer is used for channel adjustment and the remaining CNN layers are used for extracting local information, each layer specifically comprising: a first 1×1 convolution layer, a 3×3 convolution layer, and a second 1×1 convolution layer.
3. The method for multi-scale target detection of remote sensing images in a parallel CNN+Transformer structure according to claim 2, wherein the multi-layer Transformer structure comprises four Transformer layers, each layer comprising: a first layer normalization, multi-head self-attention, a second layer normalization, and a multi-layer perceptron.
4. The method for multi-scale target detection of remote sensing images in a parallel CNN+Transformer structure according to claim 3, wherein the preliminary parallel backbone network in step S3 is specifically:
the input image adjusted by the first layer CNN channel is input into a first layer transducer and a second layer CNN;
the global information extracted by the first layer of transformers is input into the second layer of CNNs after up-sampling, local features extracted by the second layer of CNNs are used as the input of the third CNNs, and the local features extracted by the second layer of CNNs are input into the second layer of transformers after down-sampling;
the global information extracted by the second layer of the Transformer is input into the third layer of CNN after up-sampling, the local feature extracted by the third layer of CNN is used as the input of the fourth CNN, and the local feature extracted by the third layer of CNN is input into the third layer of the Transformer after down-sampling;
the global information extracted by the third layer of transformers is input into a fourth layer of CNNs after up-sampling, local features extracted by the fourth layer of CNNs are used as the input of a fifth CNN, and the local features extracted by the fourth layer of CNNs are input into the fourth layer of transformers after down-sampling;
the global information extracted by the fourth layer of transformers is input into a fifth layer CNN after being up-sampled, and local features taken by the fifth layer CNN are used as the input of a sixth CNN.
5. The method for multi-scale target detection of remote sensing images in a parallel CNN+Transformer structure according to claim 4, wherein the neck network in step S4 is specifically: the local feature extracted by the sixth CNN layer in the preliminary parallel backbone network is marked P5; the feature obtained by upsampling P5 and adding the local feature extracted by the fifth CNN layer is marked P4; the feature obtained by upsampling P4 and adding the local feature extracted by the fourth CNN layer is marked P3; the feature obtained by upsampling P3 and adding the local feature extracted by the third CNN layer is marked P2;
the feature obtained by downsampling P2 is marked C2, and the feature obtained by downsampling C2 and adding P3 is marked C3; the feature obtained by downsampling C3 and adding P4 is marked C4; the feature obtained by downsampling C4 and adding P5 is marked C5.
6. The method for detecting multi-scale targets of remote sensing images with a parallel CNN+Transformer structure according to claim 5, wherein connecting the network of step S4 with the CA attention module in step S5 specifically comprises: connecting C3, C4 and C5 each with a CA attention module.
7. The method for multi-scale object detection of remote sensing images in a parallel CNN+Transformer architecture according to claim 6, wherein the deep neural network further comprises a 1×1 convolutional layer connected after each CA attention module.
CN202310885532.7A 2023-07-19 2023-07-19 CNN+Transformer remote sensing image detection method Pending CN116977872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310885532.7A CN116977872A (en) 2023-07-19 2023-07-19 CNN+Transformer remote sensing image detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310885532.7A CN116977872A (en) 2023-07-19 2023-07-19 CNN+Transformer remote sensing image detection method

Publications (1)

Publication Number Publication Date
CN116977872A true CN116977872A (en) 2023-10-31

Family

ID=88472375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310885532.7A Pending CN116977872A (en) 2023-07-19 2023-07-19 CNN+Transformer remote sensing image detection method

Country Status (1)

Country Link
CN (1) CN116977872A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893877A (en) * 2024-01-16 2024-04-16 应急管理部国家自然灾害防治研究院 Method and system for detecting quasi-periodic radiation phenomenon of Zhangheng first satellite based on DETR
CN117994506A (en) * 2024-04-07 2024-05-07 厦门大学 Remote sensing image saliency target detection method based on dynamic knowledge integration


Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110598029A (en) Fine-grained image classification method based on attention transfer mechanism
CN110633661A (en) Semantic segmentation fused remote sensing image target detection method
CN116977872A (en) CNN+ transducer remote sensing image detection method
CN111652321A (en) Offshore ship detection method based on improved YOLOV3 algorithm
CN111985376A (en) Remote sensing image ship contour extraction method based on deep learning
CN111507222B (en) Three-dimensional object detection frame based on multisource data knowledge migration
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
US11763471B1 (en) Method for large scene elastic semantic representation and self-supervised light field reconstruction
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN116596916A (en) Training of defect detection model and defect detection method and device
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
CN113205103A (en) Lightweight tattoo detection method
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN113128564B (en) Typical target detection method and system based on deep learning under complex background
CN113435461B (en) Point cloud local feature extraction method, device, equipment and storage medium
CN112802202A (en) Image processing method, image processing device, electronic equipment and computer storage medium
Chen et al. Ship detection with optical image based on attention and loss improved YOLO
CN112669452B (en) Object positioning method based on convolutional neural network multi-branch structure
CN114187506A (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN116843832A (en) Single-view three-dimensional object reconstruction method, device, equipment and storage medium
CN116977265A (en) Training method and device for defect detection model, computer equipment and storage medium
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination