CN114926747A - Remote sensing image directional target detection method based on multi-feature aggregation and interaction - Google Patents

Remote sensing image directional target detection method based on multi-feature aggregation and interaction

Info

Publication number
CN114926747A
CN114926747A
Authority
CN
China
Prior art keywords
feature
module
attention
scale
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210609304.2A
Other languages
Chinese (zh)
Inventor
丁宁
陆贵荣
王进
张燕新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou University
Original Assignee
Changzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou University filed Critical Changzhou University
Priority to CN202210609304.2A priority Critical patent/CN114926747A/en
Publication of CN114926747A publication Critical patent/CN114926747A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image directional target detection method based on multi-feature aggregation and interaction, which comprises the following steps: a high-resolution remote sensing image is acquired, cropped into small images, and fed into the network. After feature extraction, multi-feature aggregation and interaction are performed. First, local features and global information are obtained by a shallow feature fusion module and a self-attention module. Then, a multi-scale attention feature map is obtained by a bidirectional fusion feature pyramid network and a convolution attention mechanism module. Finally, the feature map containing the important information collected by the cross-scale fusion module is sent to the detection head to obtain the target category and the directional bounding box. The detection results of the small images are mapped back to the original image and invalid detections are filtered. The method can address the problems that the target orientation in remote sensing images is arbitrary and varies over a wide range, that targets carry little information, and that they are easily disturbed by complex backgrounds, and it performs excellently in remote sensing image directional target detection.

Description

Remote sensing image directional target detection method based on multi-feature aggregation and interaction
Technical Field
The invention belongs to the technical field of computer vision based on deep learning, and particularly relates to a remote sensing image directional target detection method based on multi-feature aggregation and interaction.
Background
Remote sensing images can collect a large amount of ground-object information efficiently and economically. Automatic interpretation of remote sensing images allows important information to be screened in a short time. Among these analyses, multi-class object detection plays an important role: it quickly localizes and classifies various objects in an image (such as ships, airplanes, vehicles and buildings) and is important in practical applications such as geological exploration, infrastructure inspection, urban planning, national defense construction and maritime search and rescue. In recent years, the task of target detection in remote sensing images has attracted increasing research interest.
Unlike natural images, which are typically taken from a horizontal viewpoint, remote sensing images are typically taken from overhead, which means that objects tend to be distributed in arbitrary directions. If horizontal-box target detection is used, the detected horizontal bounding box does not fit the actual target closely, so directional (rotated-box) target detection algorithms are the preferred way to capture remote sensing targets. At present, rotated-box target detection algorithms are mainly anchor-based detectors: anchor boxes are densely distributed on the feature map, and the parameter offsets between the target box and the anchor boxes are then regressed. This makes the network easy to converge during training; however, the anchor parameters must be preset manually, need to be tuned for different scenes, and rely on a certain amount of prior knowledge. Although anchors can be determined automatically by means such as K-means clustering, a few targets with rare aspect ratios are difficult to match to any anchor, causing missed detections. In addition, an image may contain only a small number of targets while a large number of anchors must be preset, which easily causes an extreme imbalance between positive and negative anchors and leads to poor training results.
Keypoint-based directional detectors are another rotated-box target detection approach. They directly predict the corner points or center point of the target without introducing anchors, then regress the width w and height h of the bounding box, and finally perform rotated-box regression by adding an additional angle parameter θ. Keypoint-based detectors usually need to jointly learn w and h for every arbitrarily oriented object in a single Cartesian coordinate system, which makes them difficult to train. The BBAVectors algorithm provides a solution that obtains the rotated target bounding box by learning box boundary-aware vectors, and its well-designed multi-head detection network gives the model good detection speed and accuracy. However, the potential of the algorithm has not been fully exploited: a simple local feature fusion network cannot achieve good interaction and aggregation between shallow positional information and deep semantic information. Because remote sensing images cover large areas, they usually contain a great deal of complex ground-object information; various kinds of targets are mixed into this complex information, which is very likely to severely interfere with classification and regression.
Disclosure of Invention
Aiming at the above problems of the prior art, the invention provides a remote sensing image directional target detection method based on multi-feature aggregation and interaction. The disclosed method is a recognition system that combines a convolutional neural network with a Transformer.
The technical scheme is as follows:
a method for detecting a directional target of a remote sensing image comprises the following steps:
acquiring a high-resolution remote sensing image;
cutting the remote sensing image into a plurality of small images with preset sizes;
inputting the cropped small images into a directional target detection model to obtain the output directional target detection results; the directional target detection model comprises a feature extraction module, a multi-feature aggregation and interaction part (comprising a shallow feature fusion module, a self-attention module, a bidirectional interactive feature pyramid network module, a convolution attention mechanism module and a cross-scale fusion module) and a detection head;
the feature extraction module configured to: carrying out feature extraction on the cropped small image to obtain five layers of feature maps with different scales, wherein, ordered from the largest scale to the smallest, the feature maps are: feature map 1, feature map 2, feature map 3, feature map 4, feature map 5;
the shallow feature fusion module configured to: inputting the feature map 1 into a shallow feature fusion module for processing to obtain an output shallow feature 1, and fusing the shallow feature 1 with the feature map 2 to obtain a shallow feature fusion map;
the self-attention module configured to: inputting the feature map 5 into the self-attention module to obtain an output attention feature map;
the bidirectional interactive feature pyramid network module configured to: inputting the shallow feature fusion graph, the feature graph 3, the feature graph 4 and the attention feature graph into a bidirectional interactive feature pyramid network to carry out multi-scale feature interaction to obtain a multi-scale interactive feature graph;
the convolution attention mechanism module configured to: inputting the multi-scale interactive feature map into a convolution attention mechanism module to extract an attention area to obtain a multi-scale attention feature map;
the cross-scale fusion module configured to: inputting the multi-scale attention feature map into a cross-scale fusion module for cross-scale interaction to obtain a cross-scale fusion feature map;
the detection head configured to: inputting the feature map subjected to cross-scale fusion into a detection head for directional detection to obtain an output target category and a directional bounding box, namely a detection result of a small map; and after all the cut small images are detected, mapping the detection results of the small images back to the original remote sensing image, and keeping the targets with the same positions, the same category and the highest confidence coefficient as final detection results through non-maximum suppression.
In some embodiments, cropping the remote sensing image into a plurality of small images of preset size comprises:
and cutting the remote sensing image with an overlapping rate, and controlling the scale of the cut image through the overlapping scale coefficient to realize multi-scale cutting.
In some embodiments, the feature extraction module is implemented by a ResNet network, and five layers of feature maps with different scales are obtained through the ResNet network.
Further, the feature extraction module employs a ResNet101, ResNet18, ResNet34, or ResNet50 network as the feature extraction network.
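The following PyTorch-style sketch illustrates one way such a five-level feature extractor could be organized; the exact tap points and the class and attribute names (ResNetBackbone, stem, etc.) are illustrative assumptions, not details taken from the patent.

```python
import torch.nn as nn
from torchvision.models import resnet101

class ResNetBackbone(nn.Module):
    """Sketch: extract five feature maps of decreasing resolution from a
    torchvision ResNet-101. Tapping the stem and layer1-layer4 is an assumption."""
    def __init__(self):
        super().__init__()
        net = resnet101()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)   # stride 2 (feature map 1)
        self.pool = net.maxpool
        self.layers = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        feats.append(x)              # feature map 1 (largest scale)
        x = self.pool(x)
        for layer in self.layers:    # feature maps 2-5, progressively smaller
            x = layer(x)
            feats.append(x)
        return feats
```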
In some embodiments, the shallow feature fusion module fuses the two largest-scale feature maps. The shallow feature fusion first acts on the shallowest feature: the receptive field is enlarged by convolution kernels of different sizes connected in parallel, features are extracted at different scales through spatial pyramid pooling and then aggregated, and the aggregated feature map is fused with the next-level feature map to obtain the shallow fusion feature map.
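A minimal PyTorch sketch of this shallow feature fusion idea follows; the module name, channel handling and the assumption that the output channel count matches feature map 2 are illustrative choices, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowFeatureFusion(nn.Module):
    """Sketch: parallel convolutions enlarge the receptive field, spatial pyramid
    pooling aggregates multi-scale context, and the result is fused with the
    next-level feature map. Channel sizes are illustrative assumptions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv5 = nn.Conv2d(c_in, c_out, kernel_size=5, padding=2)
        self.conv7 = nn.Conv2d(c_in, c_out, kernel_size=7, padding=3)
        # spatial pyramid pooling branches plus an identity branch
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 9, 13))
        self.reduce = nn.Conv2d(c_out * 4, c_out, kernel_size=1)

    def forward(self, feat1, feat2):
        x = self.conv5(feat1) + self.conv7(feat1)                 # fuse parallel convs
        spp = torch.cat([x] + [p(x) for p in self.pools], dim=1)  # aggregate SPP outputs
        x = self.reduce(spp)
        # bring to feat2's resolution and fuse (assumes feat2 also has c_out channels)
        x = F.interpolate(x, size=feat2.shape[-2:], mode="bilinear", align_corners=False)
        return x + feat2
```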
In some embodiments, the self-attention module comprises multi-head self-attention and fully-connected layers, each connected with a residual block; the self-attention module processes the deepest feature map to construct the global relationship of the feature map.
In some embodiments, the specific processing steps of the bidirectional interactive feature pyramid network module comprise: from top to bottom, the deep features are fused layer by layer with the upper-level information down to the output of the shallow feature fusion module; then, from bottom to top, the shallowest-level information is passed back to the deep features; and a shortcut connection is established between the input and output of the same level.
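A rough sketch of such a two-way pass with same-level shortcuts is given below; the module name, the 3 × 3 smoothing convolutions and the assumption that all levels share one channel count are illustrative, not taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalInteractiveFPN(nn.Module):
    """Sketch: a top-down pass fuses deep semantics into shallower levels, a
    bottom-up pass sends shallow details back, and a shortcut links the input
    and output of each level. All levels are assumed to share `channels`."""
    def __init__(self, channels, num_levels=4):
        super().__init__()
        self.smooth_td = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                                       for _ in range(num_levels))
        self.smooth_bu = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                                       for _ in range(num_levels))

    def forward(self, feats):                      # feats[0] is the shallowest level
        td = [None] * len(feats)
        td[-1] = feats[-1]
        for i in range(len(feats) - 2, -1, -1):    # top-down fusion
            up = F.interpolate(td[i + 1], size=feats[i].shape[-2:], mode="nearest")
            td[i] = self.smooth_td[i](feats[i] + up)
        out = [None] * len(feats)
        out[0] = td[0]
        for i in range(1, len(feats)):             # bottom-up pass with same-level shortcut
            down = F.max_pool2d(out[i - 1], kernel_size=2)
            down = F.interpolate(down, size=td[i].shape[-2:], mode="nearest")
            out[i] = self.smooth_bu[i](td[i] + down + feats[i])
        return out
```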
In some embodiments, the detection head is divided into four parts, respectively: Heatmap, responsible for classification and center-point localization; Offset, for calculating the offset loss; Box Param, for capturing the directional regression box parameters; and Orientation, for handling the angle problem.
Further, the heatmap value is used as the detection confidence score, and NMS is performed on the heatmap through 3 × 3 max pooling to obtain the K center points with the largest values.
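A hedged sketch of this peak-extraction step (heatmap value as confidence, 3 × 3 max pooling as NMS, top-K selection) could look like the following; the function name and the default K are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_centers(heatmap, k=100):
    """Keep only local maxima of the heatmap via 3x3 max pooling (a simple NMS),
    then take the K highest responses as candidate center points.
    heatmap: (B, C, H, W) tensor of per-class center confidences."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()            # suppress non-maxima
    b, c, h, w = peaks.shape
    scores, idx = torch.topk(peaks.view(b, -1), k)           # heatmap value = confidence
    cls = torch.div(idx, h * w, rounding_mode="floor")
    rem = idx % (h * w)
    ys = torch.div(rem, w, rounding_mode="floor")
    xs = rem % w
    return scores, cls, ys, xs
```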
In some embodiments, the data set of the directed target detection model is divided into a training set, a validation set, and a test set; the training set and the verification set comprise images and corresponding label files; the label file contains four corner point coordinates and categories corresponding to each object of the image.
The detection process of the method comprises the following steps: inputting a high-resolution remote sensing image, cropping the image, extracting features, fusing the features, detecting with the directional detection head, stitching the images, and outputting the detection results.
The input high-resolution remote sensing images have a large size span; for example, the DOTA dataset contains 2806 remote sensing images whose sizes range from 400 × 400 to 12000 × 5000. If a high-resolution image is directly resized to a preset small scale (e.g. 608 × 608) without cropping and sent to the network for detection, most small and medium targets are missed. The invention therefore crops the high-resolution remote sensing image with overlap (see the detailed description) and then feeds the cropped images into the network for training and testing.
The ResNet101 network adopted for feature extraction is widely used as a backbone for mainstream object detection. The invention uses ResNet101 to extract features and outputs five layers of feature maps with different scales.
The present invention focuses on improving feature fusion. By designing multi-feature aggregation and interaction modules, the shallow positional information and deep semantic information of the feature maps are fully fused. The overall design is as follows:

A shallow feature fusion module fuses shallow information so that the network learns more features. In general, most feature fusion schemes build a feature pyramid model with more than three levels; fusing shallower features means that more shallow information can be retained and the probability of capturing important information is higher. This noticeably improves detection results, but the computational cost also increases greatly. To keep the computation as small as possible while retaining more shallow features and improving target detection accuracy, a shallow feature fusion method is adopted so that the network can learn more features. This provides more complete local information and alleviates the problems of information imbalance and feature misalignment.

A self-attention module processes the deep feature map output by the feature extraction module. The feature map at this position has the smallest size and rich semantic information. The multi-head self-attention module can capture global relationships and rich context information, which is essential for understanding the image as a whole. On the other hand, the computational cost of multi-head self-attention grows sharply with the size of the input feature map, so the computation needed on a low-resolution feature map is small. Adding the multi-head self-attention module at this position therefore builds a global model quickly at low computational cost.

A bidirectional interactive feature pyramid network is designed to perform multi-scale feature interaction on the feature maps output by the feature extraction module, the shallow feature fusion module and the multi-head self-attention module. From top to bottom, the deep features are first fused layer by layer with the upper-level information down to the output of the shallow feature fusion module; then, from bottom to top, the shallowest-level information is passed back to the deep features; and, to better retain the transmitted information, a shortcut connection is established between the input and output of the same level. The generated feature maps contain more distinctive features and improve the ability to describe various objects.

A simple yet powerful lightweight convolution attention mechanism module is added to each level of the fused feature pyramid network to find attention regions in scenes with dense targets. The feature maps output by the fused feature pyramid network are rich in information, but also contain many confusing regions; the convolution attention mechanism module further extracts the attention regions, helping the network to filter complex background information, highlight important target objects and generate the attention feature maps.

Finally, the cross-scale fusion module performs cross-scale interaction on the multi-scale attention feature maps to obtain more distinctive and balanced features.
The directional detection head adopts BBAVectors and is divided into four parts: Heatmap, Offset, Box Param and Orientation. Heatmap is responsible for locating the center point of the box and for classification; it detects the center point of the directional target, and the heatmap value is taken as the target detection confidence. Offset is the offset loss of the target, used to handle the precision loss between the down-sampled feature map and the feature map mapped back to the original scale. To capture the bounding box of a directional target, the conventional approach is to detect the width w, height h, angle θ and center point of the target; however, jointly learning so many parameters is very difficult for the network. Therefore, four box boundary-aware vectors t, r, b and l are set up, one in each quadrant (upper left, upper right, lower right and lower left, respectively) of a Cartesian coordinate system, and the rotated box is described by these four vectors belonging to different quadrants. Because all arbitrarily oriented targets share the same coordinate system, information transfer is relatively easy and the learning difficulty of the model is reduced. Thus, the parameters used by Box Param for the bounding box are {r, t, l, b, w, h}, where w and h are the width and height of the target bounding box, respectively. Because of the division into four quadrants, when an object in the image is exactly enclosed by a horizontal box, the box boundary-aware vectors fall on the coordinate axes, so the detection head cannot predict the object directly. Orientation solves this problem well: it classifies targets into two categories, HBB and RBB, where RBB denotes all directional bounding boxes other than horizontal ones. The Orientation category is triggered as follows:
Orientation = RBB, if IoU(OBB, HBB) < α
Orientation = HBB, if IoU(OBB, HBB) ≥ α
IoU is the intersection over union of the oriented bounding box (OBB) and the horizontal bounding box (HBB). α is the coefficient that decides whether RBB or HBB is used. Thus, when a horizontal-box target is encountered, it can be processed in the same way as a horizontal box (HBB).
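As an illustration only, the Orientation label of a ground-truth box could be derived as sketched below; the helper name, the use of shapely for polygon IoU and the concrete threshold value 0.95 for α are assumptions.

```python
from shapely.geometry import Polygon

def orientation_label(obb_corners, alpha=0.95):
    """Sketch: decide the Orientation class of a ground-truth box.
    obb_corners: list of four (x, y) corner points of the oriented box.
    alpha: assumed value of the coefficient described above."""
    obb = Polygon(obb_corners)
    minx, miny, maxx, maxy = obb.bounds                      # enclosing horizontal box
    hbb = Polygon([(minx, miny), (maxx, miny), (maxx, maxy), (minx, maxy)])
    iou = obb.intersection(hbb).area / obb.union(hbb).area   # IoU(OBB, HBB)
    return "HBB" if iou >= alpha else "RBB"
```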
After detection is finished, all sub-images are mapped back to their corresponding positions in the original image. Targets in the overlap regions may be detected repeatedly, so non-maximum suppression is required: among targets at the same position and of the same category, only the one with the highest confidence is retained.
The method can address the problems that target orientations in remote sensing images are arbitrary and vary over a wide range, that targets carry little information, and that they are easily interfered with by complex backgrounds. The method performs excellently in remote sensing image directional target detection.
Drawings
FIG. 1 is a diagram of a directional target detection model of the present invention;
FIG. 2 is a schematic diagram of a shallow feature fusion module;
FIG. 3 is a two-way interactive feature pyramid network diagram;
FIG. 4 is a comparison graph of input and output of various modules;
FIG. 5 is a graph showing the results of the test.
Detailed Description
The model and operation flow of the present invention will be further explained with reference to the drawings and the embodiment examples.
A method for detecting a directional target of a remote sensing image comprises the following steps:
acquiring a high-resolution remote sensing image;
cutting the remote sensing image into a plurality of small images with preset sizes;
inputting the cropped small images into a directional target detection model to obtain the output directional target detection results; the directional target detection model comprises a feature extraction module, a multi-feature aggregation and interaction part (comprising a shallow feature fusion module, a self-attention module, a bidirectional interactive feature pyramid network module, a convolution attention mechanism module and a cross-scale fusion module) and a detection head;
the feature extraction module configured to: carrying out feature extraction on the cropped small image to obtain five layers of feature maps with different scales, wherein, ordered from the largest scale to the smallest, the feature maps are: feature map 1, feature map 2, feature map 3, feature map 4, feature map 5;
the shallow feature fusion module configured to: inputting the feature map 1 into a shallow feature fusion module for processing to obtain an output shallow feature 1, and fusing the shallow feature 1 with the feature map 2 to obtain a shallow feature fusion map;
the self-attention module configured to: inputting the feature map 5 into the self-attention module to obtain an output attention feature map;
the two-way interactive feature pyramid network module configured to: inputting the shallow feature fusion graph, the feature graph 3, the feature graph 4 and the attention feature graph into a bidirectional interactive feature pyramid network to carry out multi-scale feature interaction to obtain a multi-scale interactive feature graph;
the convolution attention mechanism module configured to: inputting the multi-scale interactive feature map into a convolution attention mechanism module to extract an attention area to obtain a multi-scale attention feature map;
the cross-scale fusion module is configured to: inputting the multi-scale attention feature map into a cross-scale fusion module for cross-scale interaction to obtain a cross-scale fusion feature map;
the detection head configured to: inputting the feature map subjected to cross-scale fusion into a detection head for directional detection to obtain an output target class and a directional bounding box, namely a detection result of a small graph; and after all the cut small images are detected, mapping the detection results of the small images back to the original remote sensing image, and keeping the target with the same position, the same category and the highest confidence coefficient as a final detection result through non-maximum value inhibition.
In an example of the present invention, the dataset is divided into a training set, a validation set and a test set. The training set and the validation set contain images and corresponding label files. The label file corresponding to each image records the corner coordinates and the category of each object in the image. The corner coordinates are the x- and y-coordinates of each corner of the directional target box, i.e. each target contains 8 coordinate values and one category label. This information is used to train the network on the training set for multiple rounds to obtain a weight file. Evaluation metrics are obtained by testing on the validation set and comparing with the ground-truth labels. The mean average precision is adopted as the evaluation metric; the higher the mean average precision, the better the model performance. The weight file that performs best on the validation set is selected and tested on the test set.
Before the high-resolution remote sensing images are input into the network for training, overlapped cropping is performed. Specifically, the original images of the dataset are cropped into small images of size 600 × 600 with an overlap of 100. The scale of the cropped images is controlled by coefficients of 1 and 0.5, producing the training, validation and test sets. The training and validation sets are fed into the network for training to obtain the weight file, and detection results are obtained on the test set. When the images of the training and validation sets are cropped, the label files update the coordinate positions of the targets. Each label file also records the relative position of the corresponding cropped image within the original high-resolution image, so that the detection results can conveniently be mapped back to the corresponding region of the original image.
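A simple sketch of this overlapped, multi-scale cropping step is shown below; the function signature, the handling of border patches and the use of OpenCV are assumptions made for illustration.

```python
import cv2

def crop_with_overlap(image, win=600, overlap=100, scales=(1.0, 0.5)):
    """Sketch: for each scale factor the image is resized, then tiled into
    win x win patches with the given overlap; the top-left offset and scale of
    every patch are recorded so detections can be mapped back to the original
    image. `image` is an OpenCV-style (H, W, C) array."""
    patches = []
    stride = win - overlap
    for s in scales:
        h, w = image.shape[:2]
        resized = cv2.resize(image, (int(w * s), int(h * s)))
        rh, rw = resized.shape[:2]
        ys = list(range(0, max(rh - win, 0) + 1, stride))
        xs = list(range(0, max(rw - win, 0) + 1, stride))
        if ys[-1] + win < rh:            # make sure the bottom border is covered
            ys.append(rh - win)
        if xs[-1] + win < rw:            # make sure the right border is covered
            xs.append(rw - win)
        for y in ys:
            for x in xs:
                patch = resized[y:y + win, x:x + win]
                patches.append((patch, (x, y, s)))   # offset + scale for mapping back
    return patches
```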
Fig. 1 is a diagram of the directional target detection model. In the training and testing stages, the input image of width and height w × h is resized to 608 × 608 and fed into the network. The network uses stochastic gradient descent (SGD) to optimize the loss function, with an initial learning rate of 1.25 × 10^-4. The backbone of the network is ResNet101, and feature extraction yields five feature maps of different sizes. The largest feature map, feature map 1, is sent to the shallow feature fusion module for processing, as shown in fig. 2: it is fed into a parallel convolution operation with kernel sizes of 5 × 5 and 7 × 7, the results of the parallel branches are fused, and the fused result is then fed into four parallel operations, namely max pooling with window sizes of 5, 9 and 13 and a residual branch without any operation. The result of the shallow feature fusion is fused with feature map 2 to obtain the shallow feature fusion map shown in fig. 3. The smallest feature map, feature map 5, is fed into the multi-head self-attention module to model the global spatial relationships of the image. When the data x are input, M weighted feature matrices are obtained through M different self-attention structures; these feature matrices are then concatenated into a matrix Z, and finally the matrix Z passes through a fully connected layer to obtain the output self-attention feature map.
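The multi-head self-attention step described here could be sketched as follows with PyTorch's nn.MultiheadAttention (which concatenates the M heads internally); the normalization layers, head count and the assumption that the channel count is divisible by the number of heads are illustrative choices, not details from the patent.

```python
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Sketch: multi-head self-attention over the flattened deepest feature map
    builds the global relationship; the heads are concatenated and passed
    through a fully-connected layer, each stage wrapped with a residual."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads)
        self.fc = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(inplace=True),
                                nn.Linear(channels, channels))
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):                        # x: (B, C, H, W), deepest feature map
        b, c, h, w = x.shape
        tokens = x.flatten(2).permute(2, 0, 1)   # (H*W, B, C) token sequence
        z, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm1(tokens + z)          # residual around self-attention
        tokens = self.norm2(tokens + self.fc(tokens))   # residual around the FC layer
        return tokens.permute(1, 2, 0).reshape(b, c, h, w)
```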
The multi-scale feature maps consisting of the shallow feature fusion map, feature map 3, feature map 4 and the attention feature map are fed into the bidirectional interactive feature pyramid network, as shown in fig. 3. The generated multi-scale interactive feature maps are each sent to a convolution attention mechanism module: for an input feature map P_i, the channel attention and spatial attention modules are applied in turn to obtain weighted attention maps along the channel and spatial dimensions, respectively, and each attention map is multiplied with the features to achieve adaptive optimization. The processing of the convolution attention mechanism module is as follows:

P_i' = M_c(P_i) ⊗ P_i

A_i = M_s(P_i') ⊗ P_i'

Here P_i is the feature map output by the i-th level of the fused feature pyramid network, which is also the input feature map of the convolution attention mechanism module at the same level; M_c is the channel attention estimation; M_s is the spatial attention estimation; ⊗ denotes element-wise multiplication; P_i' is the channel-refined feature map; and A_i is the i-th feature map output by the convolution attention mechanism module. The convolution attention mechanism module further extracts attention regions, helping the network to filter complex background information and highlight important target objects.
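A compact sketch of a convolutional attention module implementing the two formulas above in the CBAM style is given below; the reduction ratio r and the 7 × 7 spatial kernel are assumed values, not parameters stated in the patent.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Sketch: channel attention M_c is estimated from pooled channel statistics,
    spatial attention M_s from pooled spatial maps, and each attention map is
    multiplied onto the features."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(channels, channels // r, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels // r, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, p):
        # channel attention M_c(P_i), applied as P_i' = M_c(P_i) * P_i
        mc = torch.sigmoid(self.mlp(p.mean((2, 3), keepdim=True)) +
                           self.mlp(p.amax((2, 3), keepdim=True)))
        x = mc * p
        # spatial attention M_s(P_i'), applied as A_i = M_s(P_i') * P_i'
        ms = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return ms * x
```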
After the convolution attention mechanism modules, the four feature maps of different scales contain a large amount of important information, but this information is not balanced. The cross-scale fusion scheme yields more distinctive and balanced features: A_3, the smallest feature map, is upsampled by a factor of eight to the size of A_0; by analogy, A_2 is upsampled by a factor of four and A_1 by a factor of two. Finally, all the upsampled results are fused with the A_0 feature map to generate the cross-scale fused feature map P. The processing of the cross-scale fusion module is as follows:

P = Concat(A_0, Upsample(A_1), Upsample(A_2), Upsample(A_3))

where A_i is the i-th feature map output by the convolution attention mechanism modules, Upsample denotes upsampling with a factor determined by 2^i, and Concat concatenates the multi-scale upsampled results.
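A minimal sketch of this cross-scale fusion (upsample each A_i to the resolution of A_0, then concatenate) might be the following; a trailing 1 × 1 convolution for channel reduction is omitted here as an assumption.

```python
import torch
import torch.nn.functional as F

def cross_scale_fusion(attention_maps):
    """Sketch: attention_maps = [A_0, A_1, A_2, A_3], each half the resolution of
    the previous one. Every A_i is upsampled (effectively by 2**i) to A_0's size
    and all levels are concatenated along the channel dimension."""
    base_size = attention_maps[0].shape[-2:]
    upsampled = [attention_maps[0]]
    for a in attention_maps[1:]:
        upsampled.append(F.interpolate(a, size=base_size, mode="bilinear",
                                       align_corners=False))
    return torch.cat(upsampled, dim=1)
```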
Fig. 4 shows, for an input image from the DOTA dataset (targets marked with red boxes), the visualized input and output feature maps of the shallow feature fusion module, the self-attention module, the bidirectional interactive feature pyramid, the convolution attention mechanism module and the cross-scale fusion module. For each module, the input and output features are separated by a dotted line: above the dotted line is the input feature map and below it is the output feature map. These feature maps reflect the confidence scores of objects and background; the higher the confidence, the brighter the corresponding location of the feature map. As the changes in the feature maps show, the shallow feature fusion module successfully filters out background information such as lawns and lane markings while retaining potentially important information. The self-attention module establishes the global relationship, and the information in the feature map becomes more abstract. The bidirectional interactive feature pyramid further filters noise; most bright spots in the image are located at the positions of real targets, although the spots are still very small. After the convolution attention mechanism module, the attention regions are further extracted, the regions corresponding to important information gradually enlarge, and the positions of essentially all real targets are accurately covered. Finally, after cross-scale fusion, all bright-spot positions in the image correspond to real targets. Fig. 5 contains the detection result for this image.
The feature map P is sent to the detection head for directional detection. The detection head has four parts: Heatmap locates the center point of the target and computes the confidence of the target category; Offset compensates for the precision loss of downsampling and mapping; Box Param sets up four box boundary-aware vectors in the four quadrants of a Cartesian coordinate system, which facilitates information transfer and reduces the learning difficulty of the model; and Orientation handles horizontal boxes and oriented boxes separately.
After all the cut small images are detected, the detection results are mapped back to the original image, and the objects with the same positions, the same category and the highest confidence coefficient are reserved through non-maximum value suppression and serve as final detection results.
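The merging of mapped-back patch detections could be sketched as below; note that axis-aligned IoU is used here as a simplification, whereas the method itself works with oriented boxes, and the detection tuple layout is an assumption.

```python
def merge_patch_detections(dets, iou_thr=0.5):
    """Sketch of the final merging step. Each detection is
    (x1, y1, x2, y2, score, cls), already mapped back to original-image
    coordinates via its patch offset and scale. Duplicates from overlapping
    patches are removed per class, keeping the highest-confidence box."""
    kept = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):   # highest score first
        duplicate = False
        for k in kept:
            if d[5] != k[5]:
                continue                                       # only same-class boxes compete
            ix1, iy1 = max(d[0], k[0]), max(d[1], k[1])
            ix2, iy2 = min(d[2], k[2]), min(d[3], k[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            area_d = (d[2] - d[0]) * (d[3] - d[1])
            area_k = (k[2] - k[0]) * (k[3] - k[1])
            if inter / (area_d + area_k - inter + 1e-9) > iou_thr:
                duplicate = True
                break
        if not duplicate:
            kept.append(d)
    return kept
```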
To verify the validity of the embodiments of the application, the invention takes the DOTA and HRSC2016 datasets as examples and conducts comparative experiments.
The experiments are based on the PyTorch 1.8.1 deep learning framework; the operating system is Red Hat 8.4.1-1. The hardware configuration is an Intel(R) Xeon(R) Gold 5218R CPU and two NVIDIA GeForce RTX 3090 GPUs (24 GB of video memory each).
The evaluation metrics commonly used in the field of target detection are adopted: average precision (AP) for a single-class target and mean average precision (mAP) over all classes. Higher AP and mAP values represent better model performance.
The DOTA dataset contains many targets, and small targets are dense. The invention is an end-to-end single-stage algorithm. Without test-time augmentation (TTA) or other preprocessing and data augmentation schemes, the method achieves 75.19% mAP on the evaluation results of the official DOTA server. While maintaining high accuracy, the detection speed is also fast: on an RTX 3090 graphics card, the inference speed of the algorithm reaches 23.22 fps, which is about twice that of ROI Trans. The experimental results verify the efficiency and accuracy of the algorithm.
Table 1 shows the comparison between the method of this embodiment and other methods on the DOTA test set.
The HRSC2016 dataset is a remote sensing dataset collected specifically for ship detection, containing 1061 ship images with resolutions ranging from 400 × 400 to 1500 × 900.
Table 2 compares the method of this embodiment with other methods on the HRSC2016 test set.
Method            mAP (%)
BL2               69.6
R2PN              79.6
RRD               84.3
ROI Trans         86.2
Gliding Vertex    88.20
BBAVectors        88.39
Ours              89.50
Table 3 shows the ablation experiments of the method of this embodiment on the HRSC2016 dataset; the evaluation metric is mAP at an intersection-over-union (IoU) threshold of 0.5.
Experiment   Bidirectional interactive feature pyramid   Convolution attention + self-attention   Shallow feature fusion   mAP0.5
A            -                                           -                                        -                        88.39
B            ✓                                           -                                        -                        89.17
C            ✓                                           ✓                                        -                        89.38
D            ✓                                           ✓                                        ✓                        89.50
After the bidirectional interactive feature pyramid is added, the mAP of the model improves by 0.78%; convolution attention together with self-attention improves the mAP by a further 0.21%; and shallow feature fusion improves the mAP by another 0.12%.
Because the proposed method uses the same detection head as the BBAVectors algorithm, its mAP is further compared with BBAVectors on the same detection results at different IoU thresholds, as shown in the following table:
Method        mAP0.5   mAP0.6   mAP0.7   mAP0.8   mAP0.9
BBAVectors    88.39    78.87    64.62    32.61    9.09
Ours          89.50    88.55    75.76    39.92    9.09
Those skilled in the art can clearly understand the technical process of the invention from the description of the above implementation. The implementation relies on software and a necessary general-purpose hardware platform. The invention does not use preprocessing or post-processing tricks; if preprocessing schemes such as mosaic or Mixup data augmentation, or a TTA post-processing scheme, were adopted, the detection results of the invention could be further improved. As for the hardware platform, an RTX 3090 was used in the experiments of this implementation; using Quadro RTX 6000 GPUs or a better graphics card could further improve the detection results.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention, and such modifications and adaptations are intended to be within the scope of the invention.

Claims (10)

1. A method for detecting a directional target of a remote sensing image is characterized by comprising the following steps:
acquiring a high-resolution remote sensing image;
cutting the remote sensing image into a plurality of small images with preset sizes;
inputting the cut small graph into a directional target detection model to obtain an output directional target detection result; the directional target detection model comprises a feature extraction module, a shallow layer feature fusion module, a self-attention module, a bidirectional interactive feature pyramid network module, a convolution attention mechanism module, a cross-scale fusion module and a detection head;
the feature extraction module configured to: carrying out feature extraction on the cropped small image to obtain five layers of feature maps with different scales, wherein, ordered from the largest scale to the smallest, the feature maps are: feature map 1, feature map 2, feature map 3, feature map 4, feature map 5;
the shallow feature fusion module configured to: inputting the feature map 1 into a shallow feature fusion module for processing to obtain an output shallow feature 1, and fusing the shallow feature 1 with the feature map 2 to obtain a shallow feature fusion map;
the self-attention module configured to: inputting the feature map 5 into the self-attention module to obtain an output attention feature map;
the bidirectional interactive feature pyramid network module configured to: inputting the shallow feature fusion graph, the feature graph 3, the feature graph 4 and the attention feature graph into a bidirectional interactive feature pyramid network to carry out multi-scale feature interaction to obtain a multi-scale interactive feature graph;
the convolution attention mechanism module configured to: inputting the multi-scale interactive feature map into a convolution attention mechanism module to extract an attention area to obtain a multi-scale attention feature map;
the cross-scale fusion module configured to: inputting the multi-scale attention feature map into a cross-scale fusion module for cross-scale interaction to obtain a cross-scale fusion feature map;
the detection head configured to: inputting the feature map subjected to cross-scale fusion into a detection head for directional detection to obtain an output target class and a directional bounding box, namely a detection result of a small graph; and after all the cut small images are detected, mapping the detection results of the small images back to the original remote sensing image, and keeping the target with the same position, the same category and the highest confidence coefficient as a final detection result through non-maximum value inhibition.
2. The method of claim 1, wherein cropping the remote sensing image into a plurality of small images of preset size comprises:
and cutting the remote sensing image with an overlapping rate, and controlling the scale of the cut image through the overlapping scale coefficient to realize multi-scale cutting.
3. The method of claim 1, wherein the feature extraction module is implemented by a ResNet network, and five layers of feature maps with different scales are obtained through the ResNet network.
4. The method of claim 3, wherein the feature extraction module employs a ResNet101, ResNet18, ResNet34, or ResNet50 network as the feature extraction network.
5. The method according to claim 1, wherein the shallow feature fusion module fuses the two largest-scale feature maps; the shallow feature fusion first acts on the shallowest feature, the receptive field is enlarged by convolution kernels of different sizes connected in parallel, features are extracted at different scales through spatial pyramid pooling and then aggregated, and the aggregated feature map is fused with the next-level feature map to obtain the shallow fusion feature map.
6. The method of claim 1, wherein the self-attention module comprises multi-head self-attention and fully-connected layers, each connected with a residual block; the self-attention module processes the deepest feature map to construct the global relationship of the feature map.
7. The method of claim 1, wherein the specific processing steps of the bidirectional interactive feature pyramid network module comprise: from top to bottom, the deep layer characteristics output by the shallow layer characteristic fusion module are fused with the upper layer information layer by layer, and then from bottom to top, the information of the shallowest layer is transmitted back to the characteristics of the deep layer; the input and output of the same layer establish a short-circuit connection.
8. The method according to claim 1, wherein the detection head is divided into four parts, respectively: Heatmap, responsible for classification and center-point localization; Offset, for calculating the offset loss; Box Param, for capturing the directional regression box parameters; and Orientation, for handling the angle problem.
9. The method of claim 8, wherein the heatmap value is used as the detection confidence score, and NMS is performed on the heatmap through 3 × 3 max pooling to obtain the K center points with the largest values.
10. The method of claim 1, wherein the data set of the directed target detection model is divided into a training set, a validation set, and a test set; the training set and the verification set comprise images and corresponding label files; the label file contains four corner point coordinates and categories corresponding to each object of the image.
CN202210609304.2A 2022-05-31 2022-05-31 Remote sensing image directional target detection method based on multi-feature aggregation and interaction Pending CN114926747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609304.2A CN114926747A (en) 2022-05-31 2022-05-31 Remote sensing image directional target detection method based on multi-feature aggregation and interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210609304.2A CN114926747A (en) 2022-05-31 2022-05-31 Remote sensing image directional target detection method based on multi-feature aggregation and interaction

Publications (1)

Publication Number Publication Date
CN114926747A true CN114926747A (en) 2022-08-19

Family

ID=82811898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210609304.2A Pending CN114926747A (en) 2022-05-31 2022-05-31 Remote sensing image directional target detection method based on multi-feature aggregation and interaction

Country Status (1)

Country Link
CN (1) CN114926747A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129129A (en) * 2022-10-09 2023-05-16 南京恩博科技有限公司 Character interaction detection model and detection method
CN116129129B (en) * 2022-10-09 2023-11-03 南京恩博科技有限公司 Character interaction detection model and detection method
CN115457027A (en) * 2022-10-12 2022-12-09 广东电网有限责任公司 Method, device, equipment and medium for detecting connecting part of power line
CN115641584A (en) * 2022-12-26 2023-01-24 武汉深图智航科技有限公司 Foggy day image identification method and device
CN115641584B (en) * 2022-12-26 2023-04-14 武汉深图智航科技有限公司 Foggy day image identification method and device
CN116721351A (en) * 2023-07-06 2023-09-08 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel
CN117593516A (en) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 Target detection method, device, equipment and storage medium
CN117593516B (en) * 2024-01-18 2024-03-22 苏州元脑智能科技有限公司 Target detection method, device, equipment and storage medium
CN117671509A (en) * 2024-02-02 2024-03-08 武汉卓目科技有限公司 Remote sensing target detection method and device, electronic equipment and storage medium
CN117671509B (en) * 2024-02-02 2024-05-24 武汉卓目科技有限公司 Remote sensing target detection method and device, electronic equipment and storage medium


Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination