CN115223056A - Multi-scale feature enhancement-based optical remote sensing image ship target detection method - Google Patents

Multi-scale feature enhancement-based optical remote sensing image ship target detection method

Info

Publication number
CN115223056A
Authority
CN
China
Prior art keywords
target detection
ship target
optical remote sensing image
Prior art date
Legal status
Pending
Application number
CN202210848740.5A
Other languages
Chinese (zh)
Inventor
周黎鸣
李亚辉
饶晓晗
杨文成
左宪禹
乔保军
葛强
刘扬
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202210848740.5A priority Critical patent/CN115223056A/en
Publication of CN115223056A publication Critical patent/CN115223056A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses a multi-scale feature enhancement-based optical remote sensing image ship target detection method, which comprises the following steps: constructing a multi-scale feature enhancement-based optical remote sensing image ship target detection network; the ship target detection network comprises a CSP Darknet53 with a hybrid inverted residual block for feature extraction, a PANet with a multi-size feature enhancement function for feature fusion and a YOLO Head network part for ship target detection; carrying out ship target detection network training based on the optical remote sensing image ship target detection data to obtain an optical remote sensing image ship target detection model based on multi-scale feature enhancement; and inputting the optical remote sensing image into the obtained ship target detection model, and carrying out ship target detection on the optical remote sensing image based on the ship target detection model. Compared with the baseline network, the method of the invention achieves better results and meets the requirement of real-time detection.

Description

Multi-scale feature enhancement-based optical remote sensing image ship target detection method
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a multi-scale feature enhancement-based optical remote sensing image ship target detection method.
Background
The target detection plays an important role in the military and civil fields and has wide application scenes. The ship target detection is an important technology for ocean detection and monitoring, and has important significance in the aspects of military reconnaissance, marine transportation safety and the like.
With the development of aerospace technology, optical remote sensing image data is increasing day by day. Meanwhile, satellite remote sensing is not limited by airspace, and the earth surface can be observed continuously. Optical remote sensing images provide intuitive information such as geometric shape, texture and color, which is convenient for detection. However, ship targets in optical remote sensing images often lie against complex backgrounds and are very susceptible to weather and illumination, so imaging quality is poor. Therefore, multi-scale ship target detection in optical remote sensing images is a significant and challenging task.
Most traditional target detection methods are based on sliding windows and hand-crafted feature extraction. Although they have achieved good results, they still have a series of defects: first, their running cost and time complexity are high; second, manually designed features have poor robustness.
With the development of deep learning, target detection algorithms based on deep learning have gradually replaced traditional detection methods. However, due to the imaging quality of optical remote sensing images and the scale and background characteristics of ship targets, accuracy is low when natural image target detection algorithms are applied to optical remote sensing image ship target detection. Therefore, deep learning-based multi-scale ship target detection in optical remote sensing images still has great room for improvement and research significance.
Disclosure of Invention
The invention provides an optical remote sensing image ship target detection method based on multi-scale feature enhancement, aiming at the problem of low precision when a natural image target detection algorithm is applied to optical remote sensing image ship target detection.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-scale feature enhancement-based optical remote sensing image ship target detection method comprises the following steps:
step 1: constructing a multi-scale feature enhancement-based optical remote sensing image ship target detection network; the ship target detection network comprises a CSP Darknet53 with a mixed inverted residual block for feature extraction, a PANet with a multi-size feature enhancement function for feature fusion and a YOLO Head network part for ship target detection;
step 2: carrying out ship target detection network training based on the optical remote sensing image ship target detection data to obtain an optical remote sensing image ship target detection model based on multi-scale feature enhancement;
step 3: inputting the optical remote sensing image into the obtained ship target detection model, and carrying out ship target detection on the optical remote sensing image based on the ship target detection model.
Further, the step 1 comprises:
a hybrid inverted residual block is used instead of the residual block in the fifth CSP module of CSP Darknet 53.
Further, in the hybrid inverted residual block, an input feature map is subjected to 1 × 1 convolution to increase dimension, then feature extraction is performed by using hybrid convolution, the extracted features are subjected to 1 × 1 convolution to reduce dimension, and finally residual connection is performed with the input feature map.
Further, the hybrid convolution first divides the input channels into different groups, each group corresponding to a depth separable convolution of a different kernel size, and then fuses the outputs of the convolutions.
Further, the step 1 further comprises:
the structure of the PANet network is improved:
introducing a multi-branch structure for the low-level feature map, namely the feature map output by the third CSP module in the CSP Darknet53, wherein each branch receives the same input, feature extraction is performed using 3 × 3 convolution and hole convolution with a dilation rate of 1, and the feature information extracted by the branches is fused and then added to the input feature map;
for the middle-level feature map, namely the feature map output by the fourth CSP module in the CSP Darknet53, features are extracted using a combination of standard convolution and hole convolution; on the basis of keeping the multi-branch structure, the hole convolution is used to enlarge the receptive field and capture feature information over a wider range.
Further, an SPP module is further provided between the fifth CSP module of the CSP Darknet53 and the PANet for adjusting the size of the input optical remote sensing image.
Further, in the YOLO Head network part, when predicting, the feature map is divided into a plurality of grids, each grid comprises a plurality of prediction boxes, and the final result is obtained by calculating the intersection over union (IOU) and then filtering with non-maximum suppression (NMS).
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a multi-scale ship target detection method used in an optical remote sensing image, aiming at the problem that the existing target detection method has poor multi-scale ship target detection precision in the optical remote sensing image. Firstly, the invention provides a mixed convolution suitable for optical remote sensing image ship target detection. Meanwhile, a hybrid inverse residual block is proposed based on the hybrid convolution for replacing a common residual block in a deep network. The inverted residual block comprises a wider network structure and multi-core mixed convolution, and the feature extraction capability and multi-scale feature information of the network are enhanced. Secondly, two multi-scale feature enhancement methods are respectively provided for feature maps with different scales, and the two multi-scale feature enhancement methods respectively act on the middle and low-layer feature maps to enhance the receptive field of the feature maps and the feature description of the ship target.
Experimental results on the LEVIR-ship data set show that the precision of the method reaches 79.55%, an improvement of 3.25% over YOLO v4. Extended experiments on the NWPU VHR-10 data set show that the method reaches 90.72% precision on ship class targets, an improvement of 3.56% over YOLO v4.
Drawings
FIG. 1 is a basic flowchart of a multi-scale feature enhancement-based optical remote sensing image ship target detection method according to an embodiment of the present invention;
FIG. 2 is a diagram of an overall network architecture according to an embodiment of the present invention; wherein MSFE represents multi-scale feature enhancement;
FIG. 3 is a block diagram of an SPP module and a CSP module according to an embodiment of the present invention;
FIG. 4 is a block diagram of hybrid inverse residual block according to an embodiment of the present invention; wherein MC is mixed convolution;
FIG. 5 is a diagram of a hybrid convolution architecture according to an embodiment of the present invention;
FIG. 6 is a diagram of a structure of a PANET network with multi-scale feature enhancement according to an embodiment of the present invention;
FIG. 7 is a diagram of an SFEM network structure according to an embodiment of the present invention;
FIG. 8 is a diagram of an MFEM network structure according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a CIOU partial parameter according to an embodiment of the present invention;
FIG. 10 is a graph of target location distribution and scale distribution in a LEVIR-ship dataset, in accordance with an embodiment of the present invention;
FIG. 11 is a diagram of a target location distribution and a scale distribution in a NWPU VHR-10 dataset according to an embodiment of the present invention;
FIG. 12 is a graph comparing the plots for the loss function for YOLO v4 and the method of the present invention;
FIG. 13 is a comparison of the test results of the method of the present invention and the YOLO v4 algorithm;
FIG. 14 is a comparison of the loss curves of the method of the present invention and the YOLO v4 algorithm on the NWPU VHR-10 dataset;
FIG. 15 is a comparison of partial detection results on a NWPU VHR-10 dataset for the method of the present invention and the YOLO v4 algorithm.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
as shown in fig. 1, a multi-scale feature enhancement-based optical remote sensing image ship target detection method includes:
step S101: constructing a multi-scale feature enhancement-based optical remote sensing image ship target detection network; the ship target detection network comprises a CSP Darknet53 with a mixed inverted residual block for feature extraction, a PANet with a multi-size feature enhancement function for feature fusion and a YOLO Head network part for ship target detection;
step S102: carrying out ship target detection network training based on the optical remote sensing image ship target detection data to obtain an optical remote sensing image ship target detection model based on multi-scale feature enhancement;
step S103: and inputting the optical remote sensing image into the obtained ship target detection model, and carrying out ship target detection on the optical remote sensing image based on the ship target detection model.
Specifically, the overall network structure of the multi-scale feature enhancement-based optical remote sensing image ship target detection model constructed by the invention is shown in fig. 2. The network mainly comprises four parts: the CSP Darknet53 backbone network for feature extraction, the PANet (Path Aggregation Network) neck network for feature fusion, MSFE (comprising SFEM and MFEM) for multi-scale feature enhancement, and the Head part for detection. The backbone network comprises five CSP modules; the first four CSP modules contain 1, 2, 8 and 8 residual blocks respectively, and the fifth CSP module contains 4 hybrid inverted residual blocks. The 52 × 52 and 26 × 26 feature maps extracted by the third and fourth CSP modules are used directly as inputs of the PANet, while the 13 × 13 feature map extracted by the fifth CSP module first passes through the SPP module and is then input into the PANet for feature fusion. The PANet fuses feature maps of different scales through UpSample and DownSample. After PANet fusion, the 52 × 52 and 26 × 26 feature maps pass through the multi-scale feature enhancement modules SFEM and MFEM and are then sent, together with the 13 × 13 feature map, to the detection head.
Fig. 3 shows the structures of the SPP and CSP modules. In the SPP module, the input feature map first passes through three maximum pooling layers of different sizes (5, 9, 13), and the output of each pooling layer is then fused with the input feature map. The SPP module is introduced to cope with varying input image sizes; it mainly increases the receptive range of the backbone features and separates contextual features. In the CSP structure, the input feature map is fed into two branches of different depths; the branch containing Res × N is responsible for feature extraction and is then merged directly with the other branch. The CSP module improves the accuracy of the model while keeping it lightweight.
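For illustration, the SPP module as described can be sketched in PyTorch as follows (a minimal sketch: the surrounding 1 × 1 convolutions that YOLO v4 uses to manage the channel count after concatenation are omitted, and module naming is ours, not the patent's):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: three parallel max-pool branches
    (kernel sizes 5, 9, 13, stride 1) whose outputs are concatenated
    with the input feature map."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # every branch preserves the spatial size, so a channel-wise
        # concatenation with the input is valid
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```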
The amount of feature information is closely related to the accuracy of target detection, and the C5 feature map of the CSP Darknet53 backbone network contains a large amount of contextual feature information. Therefore, the feature information of C5 is enriched by improving the feature extraction capability of the fifth CSP module, so as to improve the accuracy of multi-scale ship target detection. Many researchers improve the feature extraction capability of a network by increasing its depth, but greater depth is accompanied by an explosion of computation. To avoid extra computation, we instead increase the width of the network. On this basis, the present embodiment proposes a hybrid inverted residual block, MIRes, and uses MIRes instead of the normal residual block in the fifth CSP module. The structure of MIRes is shown in fig. 4: the input feature map first passes through a 1 × 1 convolution to increase its dimension, features are then extracted using hybrid convolution, the extracted features pass through a 1 × 1 convolution to reduce the dimension, and finally a residual connection is made with the input feature map. MIRes adopts an inverted residual structure; compared with a common residual block, the number of network channels is expanded by a factor of 6, and hybrid convolution is used for multi-scale feature extraction.
The inverted residual structure gives the feature extraction part wider channels, so richer feature information can be extracted. Meanwhile, the introduction of depthwise separable convolution makes the module lighter. Although depthwise separable convolution tends to reduce accuracy, the wider structure compensates for this drawback.
Table 1 shows the network architecture and parameters of the hybrid inverse residual block. Wherein 13 × 13 × 512 represents the length and width of the feature map and the number of channels.
Table 1 network architecture and parameters of hybrid inverted residual block
Unlike conventional convolution, hybrid convolution first divides the input channels into different groups, each group corresponding to a depthwise separable convolution with a different kernel size, and then fuses the outputs of the convolutions. Fig. 5 shows the hybrid convolution of the present invention. Assuming the input feature map has 8 channels, the channels are divided into 4 groups in exponential form, with 4, 2, 1 and 1 channels per group; different groups then correspond to different convolution kernels. The kernel sizes used in the invention are 1, 3, 5 and 7. By introducing the exponential channel-division rule and the 1 × 1 convolution, the proposed hybrid convolution further reduces computation and parameters while retaining more detailed feature information.
For the same input, we can use a simple formula to demonstrate the difference between the output feature maps of the hybrid convolution and the depth separable convolution. For ease of presentation, the input and output feature map sizes h and w are taken to be equal. Assume a depth separable convolution W^{(k,k,c,m)}, where k × k is the convolution kernel size, c is the number of input channels and m is the channel multiplier, and let Y^{(h,w,c·m)} be the output tensor; each output feature map can then be expressed by Equation 1:

Y_{x,y,z} = Σ_{i,j} X_{x+i,y+j,z} · W_{i,j,z}    (1)

Unlike the depth separable convolution, the hybrid convolution divides the input channels into g groups ⟨X̂^1, …, X̂^g⟩, with the total number of input channels equal to the number of output channels. Similarly, the convolution kernels are divided into g groups ⟨Ŵ^1, …, Ŵ^g⟩. The output of the t-th group can then be expressed as Equation 2, and the total output of the hybrid convolution as Equation 3:

Ŷ^t_{x,y,z_t} = Σ_{i,j} X̂^t_{x+i,y+j,z_t} · Ŵ^t_{i,j,z_t}    (2)

Y_{x,y,z_0} = Concat(Ŷ^1_{x,y,z_1}, …, Ŷ^g_{x,y,z_g})    (3)

In Equation 3, z_0 = z_1 + … + z_g = m·c.
Different from natural images, optical remote sensing images often cover a large field of view, while most ship targets occupy only a few foreground pixels; large-kernel convolution easily loses target pixels and brings a large amount of computation. Therefore, the kernel sizes of the hybrid convolution are improved and the channel division mode is changed. Fig. 5 shows the proposed hybrid convolution structure. We first remove the 9 × 9 convolution to avoid the information loss it causes during convolution. Second, a 1 × 1 convolution is added, which further reduces computation and parameters while retaining more detailed feature information. Finally, equal channel division is abandoned in favor of exponential division of the channels over the convolutions of different kernel sizes; the latter realizes multi-scale feature extraction while retaining more low-dimensional feature information, and compared with the equal division used for natural images, exponential division performs better on remote sensing images. The exponential channel division is shown in Equation 4, where i is the number of convolution kernels and C_x is the number of channels of the x-th convolution:

C_x = C / 2^x (x < i),    C_i = C / 2^(i-1)    (4)
From the parameters in Table 1 and Equation 3 (the 13 × 13 × 512 input expanded 6× to 3072 channels), the output feature map of the proposed hybrid convolution can be represented as:

Y_{x,y,3072} = Concat(Ŷ^1_{x,y,1536}, Ŷ^2_{x,y,768}, Ŷ^3_{x,y,384}, Ŷ^4_{x,y,384})    (5)
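A minimal PyTorch sketch of the hybrid convolution of Equations 2-4 follows. The channel multiplier is fixed to m = 1 and the names are illustrative, so this is one reading of the equations rather than the patent's reference implementation:

```python
import torch
import torch.nn as nn

def exponential_split(channels, groups):
    """Equation 4: C_x = C / 2^x, with the last group absorbing the
    remainder (e.g. 8 channels, 4 groups -> [4, 2, 1, 1])."""
    splits = [channels // (2 ** (x + 1)) for x in range(groups - 1)]
    splits.append(channels - sum(splits))
    return splits

class MixedConv(nn.Module):
    """Hybrid convolution: depthwise convolutions with kernel sizes
    1, 3, 5, 7 over exponentially divided channel groups, concatenated."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.splits = exponential_split(channels, len(kernel_sizes))
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # depthwise, Equation 2
            for c, k in zip(self.splits, kernel_sizes)
        )

    def forward(self, x):
        groups = torch.split(x, self.splits, dim=1)
        # Equation 3: concatenate the per-group outputs
        return torch.cat([conv(g) for conv, g in zip(self.convs, groups)], dim=1)
```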
in general. The method has the advantages that the method has a wider network structure and smaller calculated amount, and mixed convolution of deep separable convolution and multi-kernel feature extraction is realized, so that the MIRes can enhance the multi-scale feature extraction capability of the network under the condition of not increasing the calculated amount, and the context feature information of C5 is enriched, thereby improving the precision of the multi-scale ship target detection of the optical remote sensing image.
The PANet is the mainstream scheme for multi-scale ship target detection. It is based on the observation that shallow feature maps contain higher resolution and more detailed information, while deep feature maps have larger receptive fields and more semantic information; the high-, middle- and low-level feature maps are used to detect large, medium and small targets, respectively. MIRes extracts rich semantic information in the high-level features, and this information is fused into the middle-level feature map through the PANet. However, for multi-scale ship target detection in optical remote sensing images, this simple fusion strategy cannot adapt to the scale variation of ship targets: the low-level feature map still lacks sufficient feature information and its receptive field is insufficient. Therefore, we improve the accuracy of multi-scale ship target detection by enhancing the feature information of the middle- and low-level feature maps.
Therefore, based on the above analysis, we propose the SFEM and MFEM feature enhancement modules. Fig. 6 shows the PANet with multi-scale feature enhancement. MSFE denotes a multi-scale feature enhancement module; SFEM and MFEM act on the 52 × 52 and 26 × 26 scale feature maps, i.e., the low- and middle-level feature maps, respectively.
For the low-level feature map, SFEM introduces a multi-branch structure and extracts features using ordinary convolution and hole (dilated) convolution with a dilation rate of 1. For the middle-level feature map, feature enhancement is performed using MFEM, which includes hole convolution. At the same time, with the parameter count of the network in mind, and unlike other deep feature enhancement modules, we abandon a deeper network structure and use a wider network for multi-scale feature extraction. Finally, since the input and output of the SFEM and MFEM modules do not involve a channel transformation, we use only 3 × 3 convolutions in each branch.
The low-level feature map is mainly responsible for detecting small-scale ship targets. For small targets, an excessively large receptive field introduces a large amount of background, which is unfavorable for detection. Therefore, we use only ordinary 3 × 3 convolution and 3 × 3 hole convolution with a dilation rate of 1. The network structure of SFEM is shown in fig. 7: the input feature map undergoes feature extraction through a four-branch structure, each branch receiving the same input, and the feature information extracted by the branches is fused and then added to the input feature map. SFEM enhances the shallow position information while reducing the introduction of noise, effectively improving the precision of ship detection.
The middle-level feature map requires a suitable receptive field for detecting medium-scale ship targets. Therefore, we use a combination of standard convolution and hole convolution to extract features. Its structure is similar to SFEM: on the basis of keeping the multi-branch structure, hole convolution is used to enlarge the receptive field and capture feature information over a wider range. The network structure of MFEM is shown in fig. 8. MFEM preserves the network structure of SFEM while introducing hole convolution with larger dilation rates, which enlarges the receptive field of the active area while preserving the feature map size.
In general, the SFEM and the MFEM fully consider the characteristics of feature maps with different scales, can well enhance the feature description of a multi-scale ship target and improve the precision of ship detection.
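The two enhancement modules can be sketched as follows (a minimal sketch: the patent does not specify how the branch outputs are fused before the residual addition, nor the exact dilation rates of MFEM, so the 3 × 3 fusion convolution and the rates (1, 2, 3, 5) are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SFEM(nn.Module):
    """Shallow feature enhancement: four parallel 3x3 branches (plain 3x3
    convolution / 3x3 hole convolution with dilation rate 1), whose fused
    output is added back to the input; the channel count is unchanged."""
    def __init__(self, channels, dilations=(1, 1, 1, 1)):
        super().__init__()
        self.branches = nn.ModuleList(
            # padding = dilation keeps the spatial size for a 3x3 kernel
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        # fuse the concatenated branches back to the input channel count
        self.fuse = nn.Conv2d(len(dilations) * channels, channels, 3, padding=1)

    def forward(self, x):
        fused = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return x + fused

class MFEM(SFEM):
    """Middle-level feature enhancement: same multi-branch layout, but with
    larger dilation rates to enlarge the receptive field."""
    def __init__(self, channels):
        super().__init__(channels, dilations=(1, 2, 3, 5))
```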
The feature maps after feature enhancement are predicted in the YOLO Head network part (the detection head). Specifically, the three prediction scales of this embodiment are 13 × 13, 26 × 26 and 52 × 52. In prediction, the feature map is divided into s × s grids, each grid containing multiple prediction boxes; these, however, do not yet correspond to the final prediction boxes of the image. The final result is obtained by computing the intersection over union (IOU) and then filtering with non-maximum suppression (NMS).
The IOU (intersection over union) loss function is typically used to compute the bounding box regression loss. In the present invention the loss function is CIOU, which fully considers three important factors of bounding box regression: overlap area, center point distance and aspect ratio, as shown in fig. 9. The dotted rectangle at the upper left is the ground-truth box, with center point b^{gt}; the dotted rectangle at the lower right is the prediction box, with center point b. ρ is the Euclidean distance between the two center points, and c is the diagonal length of the smallest box enclosing the prediction box and the ground-truth box. The CIOU loss can be expressed by the following formula:
L_{CIOU} = 1 − IOU + ρ²(b, b^{gt}) / c² + α·ν    (6)
In the above formula, α is a positive trade-off parameter and ν measures the consistency of the aspect ratio; ν and α can be expressed by Equations 7 and 8:
ν = (4 / π²) · (arctan(w^{gt} / h^{gt}) − arctan(w / h))²    (7)

α = ν / ((1 − IOU) + ν)    (8)
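Under the definitions of Equations 6-8, the CIOU loss can be sketched as below (a minimal sketch assuming boxes in (cx, cy, w, h) format; anchor decoding and target masking used in actual training are omitted):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIOU loss of Equations 6-8 for box tensors of shape (..., 4)."""
    px1 = pred[..., 0] - pred[..., 2] / 2
    py1 = pred[..., 1] - pred[..., 3] / 2
    px2 = pred[..., 0] + pred[..., 2] / 2
    py2 = pred[..., 1] + pred[..., 3] / 2
    tx1 = target[..., 0] - target[..., 2] / 2
    ty1 = target[..., 1] - target[..., 3] / 2
    tx2 = target[..., 0] + target[..., 2] / 2
    ty2 = target[..., 1] + target[..., 3] / 2

    # overlap area and IOU
    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter
    iou = inter / (union + eps)

    # rho^2: squared center distance; c^2: squared diagonal of the
    # smallest box enclosing prediction and ground truth
    rho2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    c2 = (torch.max(px2, tx2) - torch.min(px1, tx1)) ** 2 + \
         (torch.max(py2, ty2) - torch.min(py1, ty1)) ** 2 + eps

    # aspect-ratio consistency (Equation 7) and its weight (Equation 8)
    v = (4 / math.pi ** 2) * (
        torch.atan(target[..., 2] / (target[..., 3] + eps))
        - torch.atan(pred[..., 2] / (pred[..., 3] + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```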
The NMS algorithm is shown in Table 2. Its inputs are the prediction box set B, the confidence set C and the threshold T; its output is the final prediction box set F. First, set B is sorted in descending order of confidence. Second, the prediction box M with the highest score is selected, added to set F and deleted from set B. Then the prediction boxes in B whose overlap with M has an intersection over union greater than T are deleted. Finally, these operations are repeated until B is empty.
Table 2 non-maxima suppression algorithm
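Since Table 2 is reproduced as an image in the original publication, a minimal PyTorch sketch of the described procedure is given below; boxes are assumed to be in (x1, y1, x2, y2) format, and torchvision's box_iou provides the overlap computation:

```python
import torch
from torchvision.ops import box_iou

def nms(boxes, scores, threshold):
    """Greedy NMS as described in Table 2.

    boxes: (N, 4) tensor in xyxy format (set B)
    scores: (N,) confidence tensor (set C)
    threshold: IOU threshold T
    Returns the indices of the kept boxes (set F).
    """
    order = scores.argsort(descending=True)  # sort B by confidence
    keep = []
    while order.numel() > 0:
        m = order[0]                         # highest-scoring remaining box M
        keep.append(m.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        ious = box_iou(boxes[m].unsqueeze(0), boxes[rest]).squeeze(0)
        order = rest[ious <= threshold]      # drop boxes overlapping M above T
    return keep
```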
To validate the method proposed by the present invention, we performed a series of experimental comparisons.
(1) Data set
The LEVIR data set contains a total of 21952 pictures with a resolution of 600 × 800 pixels in three categories: aircraft, ships and storage tanks. We extracted the ship class and deleted the pictures that do not contain a target to form the LEVIR-ship dataset. LEVIR-ship has 1494 pictures and 3025 ship targets in total, each picture containing at least one ship target. Following the original division, the training set comprises 876 pictures with 1790 ship targets, and the test set comprises 618 pictures with 1235 ship targets. The target position and scale distributions of the LEVIR-ship dataset are shown in FIG. 10; as can be clearly seen, both distributions are uniform.
The NWPU VHR-10 data set contains a total of 800 high-resolution images and ten types of objects. We deleted 150 background images and retained the 650 images containing targets for training and testing. The training set and test set were randomly divided in a 5:5 ratio. FIG. 11 shows the target position and scale distributions of the NWPU VHR-10 dataset: most targets are clustered in the central region of the picture, and small-scale targets are few, most being medium-scale targets.
(2) Details of the implementation
Before training, pictures were resized to 416 × 416 and then input into the network. We iterate 4000 times in total. The training batch size is set to 64 and the initial learning rate to 0.0001; when the iterations reach 3200 and 3600, the learning rate is reduced by a factor of ten each time. Training was performed on an RTX3060 GPU and accelerated using CUDA 11.3 and Cudnn 8.05.
To speed up the convergence of the network, we use the Kmeans++ clustering algorithm to obtain the prior box sizes, shown in Table 3. Three prior boxes are set for each prediction scale, suitable for detecting large, medium and small targets respectively.
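A minimal sketch of the prior box computation, assuming clustering is performed directly on (width, height) pairs with scikit-learn's Kmeans++ initialization; YOLO-style pipelines often use a 1 − IOU distance instead, which the patent does not specify:

```python
import numpy as np
from sklearn.cluster import KMeans

def prior_boxes(wh, n_anchors=9):
    """Cluster ground-truth box sizes into anchor (prior box) sizes.

    wh: (N, 2) array of ground-truth (width, height) pairs
    Returns n_anchors cluster centers sorted from small to large,
    to be assigned three per prediction scale.
    """
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort by box area
```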
TABLE 3 LEVIR-ship dataset and NWPU VHR-10 dataset Prior Box size
(3) Evaluation index
To evaluate the accuracy of the method of the invention, we use the common evaluation metrics mAP and FPS. mAP is a common evaluation criterion for target detection, determined by P (precision) and R (recall). The calculation formulas for P and R are as follows:
P = TP / (TP + FP)    (9)

R = TP / (TP + FN)    (10)
In Equations 9 and 10, TP denotes true positives, i.e., the number of correctly detected ships; FP denotes false positives, i.e., the number of falsely detected ships; FN denotes false negatives, i.e., the number of missed ships. The calculation of mAP from P and R can be expressed as follows:
mAP = (1 / N_{cls}) · Σ_{i=1}^{N_{cls}} ∫₀¹ P_i(R_i) dR_i    (11)
wherein N is cls Representing the number of categories in the data set. P is i Accuracy, R, representing the ith class i Representing the recall of the ith category.
FPS (Frames Per Second) indicates how many pictures can be inferred in one second and is mainly used to evaluate the inference speed of the algorithm. The FPS formula is as follows:
FPS = frameNum / elapsedTime    (12)
in the above formula, frameNum is the number of pictures, and elapsedTime is the inference time. The larger the FPS is, the faster the model reasoning speed is.
(4) Comparative experiment
In this section the comparison experiments are divided into two parts. In the first part, we compare the method of the invention with other methods and with the baseline network and analyze its advantages. In the second part, we examine the feasibility of the improvements, mainly including the feasibility of the hybrid convolution improvement, the comparison of residual blocks, and the comparison of the feature enhancement modules (SFEM and MFEM) with RFB and EIRM.
4.1 comparison with other methods
Following the experimental settings of section (2), we performed comparative experiments on the LEVIR-ship data set; the results are shown in Table 4. The compared methods, listed from top to bottom, are from [Zou and Shi, Random Access Memories: A New Paradigm for Target Detection in High Resolution Aerial Remote Sensing Images. IEEE Transactions on Image Processing 2017], two methods from [Dong, Xu, Zhao, Jiao and An, Sig-NMS-Based Faster R-CNN Combining Transfer Learning for Small Target Detection in VHR Optical Remote Sensing Imagery. IEEE Transactions on Geoscience and Remote Sensing 2019], and [Zhou, Li, Rao, et al., Feature Enhancement-Based Ship Target Detection Method in Optical Remote Sensing Images. Electronics 2022]. First, the improvements of the method of the invention achieve better results than the baseline network: although the detection speed (FPS) decreases slightly, the accuracy of multi-scale ship detection improves by 3.25%. Second, the method achieves the highest mAP and FPS in comparison with the two-stage algorithms. In general, the precision of the algorithm is greatly improved over the baseline network, its speed remains far superior to that of the two-stage algorithms, and it meets the requirement of real-time detection (FPS ≥ 30).
TABLE 4 LEVIR-SHIP data set comparative experiment results
FIG. 12 compares the loss function curves of YOLO v4 (left) and the method of the invention (right); the loss value curves are shown below and the mAP curves above. The loss curve represents the difference between the predicted and actual results. As the loss and mAP curves show, the method of the invention converges faster than YOLO v4, with a lower loss value and higher precision; in detection precision, the model of the invention is clearly superior to the YOLO v4 network.
FIG. 13 compares the detection results of the method of the invention and the YOLO v4 algorithm, where white oval boxes mark missed detections and gray boxes mark false detections. To make the results more representative, we selected seven groups of pictures for comparison: a1-a7 are the original images, b1-b7 the YOLO v4 detection results, and c1-c7 the detection results of the method of the invention. In FIG. 13, pictures a1-a3 contain ship targets of small, medium and large scale respectively; the ship targets in a4 and a5 have long wakes against complex wave backgrounds; the ship target in a6 is occluded by shadow; and the ship target in a7 is affected by strong light with a strong light-dark contrast. From the characteristics of these images and the detection results b1-b7, YOLO v4 still has a series of problems in multi-scale ship detection in optical remote sensing images: its feature extraction capability and feature map receptive field are clearly insufficient, and multi-scale ships are easily missed or falsely detected. The method of the invention adopts MIRes to enhance the multi-scale feature extraction capability of the backbone network and enlarge the feature receptive field, while the SFEM and MFEM feature enhancement modules enhance the middle- and low-level feature maps and strengthen the feature description of small and medium ship targets. As c1-c7 show, the method of the invention achieves higher precision in multi-scale ship detection and detects large, medium and small ship targets well. Meanwhile, under wakes, wave backgrounds, illumination and shadow, it suppresses the interference of background factors and accurately detects ship targets of all scales.
4.2 baseline network and comparison of proposed methods
To compare the necessity of the large-kernel convolution and the 1 × 1 convolution in the hybrid convolution, and the influence of the channel allocation mode on precision, we performed ablation experiments with different convolution kernels and allocation modes. As shown in Table 5, 1357exp denotes convolution kernels of 1, 3, 5, 7 with channels divided in exponential form, while 1357 denotes the same kernels with channels divided equally. Two further evaluation metrics are added: Parameter denotes the size of the model, i.e., the number of parameters, and BFLOPS denotes the computation of the model. According to the results, replacing the large-kernel convolution with a 1 × 1 convolution brings a small precision improvement while also slightly reducing the parameter count and computation. Between exponential and equal division, exponential division achieves higher precision. For multi-scale ship detection in optical remote sensing images, given the complexity of the images and the multi-scale characteristics of ship targets, the 1 × 1 convolution is probably more efficient than the large-kernel convolution. These comparisons show that the improvements to the hybrid convolution allow the invention to achieve excellent results in multi-scale ship detection in optical remote sensing images.
TABLE 5 convolution kernel size and assignment Compare for Mixed convolution
To verify the superiority of the proposed MIRes, we compared the effect of various residual blocks on detection accuracy; Table 6 shows the results. Clearly, IRes [Sandler, Howard, Zhu, Zhmoginov and Chen, MobileNetV2: Inverted residuals and linear bottlenecks. 2018, 4510-4520], with its better feature extraction capability, improves accuracy compared with Res (the residual block in YOLO v4). However, the MIRes proposed by the invention achieves a better result still, with an almost negligible increase in parameters and computation. Compared with Res and IRes, the introduction of the hybrid convolution gives MIRes a stronger multi-scale feature extraction capability, so it achieves better detection accuracy in multi-scale ship target detection in optical remote sensing images.
Table 6 residual block comparison
Table 7 compares the multi-scale feature enhancement module of the invention with other feature enhancement modules; FEM denotes a feature enhancement module. For fairness, RFB [Liu and Huang, Receptive field block net for accurate and fast object detection. 2018, 385-400], EIRM [Zhou, Li, Rao, et al., Feature Enhancement-Based Ship Target Detection Method in Optical Remote Sensing Images. Electronics 2022, 11, 634] and the multi-scale feature enhancement module of the invention are added at the same locations on the improved backbone network. Since the SFEM and MFEM of the invention act on the low- and middle-level feature maps respectively, we likewise add two RFBs and two EIRMs to the middle- and low-level feature maps. S-RFB denotes adding an RFB module on the low-level feature map branch, and M-RFB on the middle-level feature map; S-EIRM and M-EIRM are defined analogously for EIRM. From the results, on the low-level feature map the large-dilation hole convolution of RFB may introduce a large amount of background, causing a decrease in accuracy, and the scaling strategy of EIRM is also less accurate. SFEM combines only hole convolution with a dilation rate of 1 and ordinary convolution, suppressing the interference of background noise while enhancing the feature description, thereby improving the accuracy of ship detection. On the middle-level feature map, MFEM also achieves better results than RFB and EIRM. Overall, the multi-scale feature enhancement module of the invention achieves higher accuracy than EIRM and RFB: the strategy of adopting different enhancement modules for feature maps of different scales works better than a fixed feature enhancement strategy.
TABLE 7 feature enhancement Module contrast
4.3 ablation experiment
We performed ablation experiments on the LEVIR-ship dataset to verify the validity of each module; the results are shown in Table 8. As is apparent from Table 8, both MIRes and the multi-scale feature enhancement modules SFEM and MFEM effectively improve the detection accuracy of the baseline network YOLO v4, showing that the proposed improvements effectively raise multi-scale ship detection precision in optical remote sensing images. MIRes greatly enhances the multi-scale feature extraction capability of the backbone network using the improved hybrid convolution and a wider network structure. MFEM and SFEM act on the middle- and low-level feature maps to further enhance the feature description of ship targets, improving the precision of ship detection. With SFEM, MFEM and MIRes added together, the method finally reaches 79.55% mAP, a large improvement over the baseline network.
Table 8 ablation experiment
4.4 extended experiments
To further verify the effectiveness of the method of the invention and its scalability to other categories, we performed extension experiments on the NWPU VHR-10 dataset. The results are shown in Table 9; the compared methods include [Tian, Shen, Chen and He, FCOS: Fully convolutional one-stage object detection. 2019, 9627-9636], [Zhu, Xia, Zhao, et al., Spatial hierarchy perception and hard samples metric learning for high-resolution remote sensing image object detection. Applied Intelligence 2022], [Zhang and Shen, Multi-Stage Feature Enhancement Pyramid Network for Detecting Objects in Optical Remote Sensing Images. Remote Sensing 2022], [Liu, Yang and Hu, Multiscale Object Detection in Remote Sensing Images Combined with Multi-Receptive-Field Features and Relation-Connected Attention. Remote Sensing 2022] and [Wang, Sun, Diao and Fu, FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing]. AI, SH, ST, BD, TC, BC, GTF, HB, BR and VH denote airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge and vehicle, respectively. It can be seen that the method of the invention reaches the highest level both in overall mAP and in the ship category. Meanwhile, when extended to other categories such as BD, TC, BC, GTF, HB and BR, the method also improves over YOLO v4.
Table 9 NWPU VHR-10 data set extension experiment results.
FIG. 14 compares the loss curves of the method of the invention and the YOLO v4 algorithm on the NWPU VHR-10 dataset. Although the loss curve of the method of the invention fluctuates more, its detection precision improves steadily as the loss value converges. The YOLO v4 algorithm converges quickly with a smooth curve, but its detection accuracy is lower.
FIG. 15 compares partial detection results of the method of the invention and the YOLO v4 algorithm on the NWPU VHR-10 dataset. Pictures containing different targets that are difficult to detect were selected to show the extendability of the method to other categories. The detection results comprise seven groups of pictures: a1-a7 are the original pictures, b1-b7 the detection results of YOLO v4, and c1-c7 the detection results of the method of the invention. Some of the targets in a1 and a2 are occluded by shadow, some targets in a3 and a4 are truncated, and the targets in a5 and a6 are similar in color to the background. According to the detection results b1-b7 and c1-c7, YOLO v4 easily misses or falsely detects targets that are occluded by shadow, truncated, or similar to the background in color and texture. The method of the invention can extract more multi-scale feature information, while the multi-scale feature enhancement modules further strengthen the feature description and receptive field of the target and suppress the interference of background information; it achieves good detection results for multi-scale targets under complex backgrounds.
In summary, to improve the accuracy of multi-scale ship target detection in optical remote sensing images, a ship target detection method based on multi-scale feature enhancement is proposed on the basis of the one-stage algorithm YOLO v4. First, to improve the multi-scale feature extraction capability of the backbone network, the hybrid convolution is improved, a hybrid inverted residual block is proposed based on the improved hybrid convolution, and this block replaces the common residual block in the deep CSP module; its wider network structure and multi-kernel hybrid convolution greatly enhance the feature extraction capability of the backbone network and the receptive field of the feature map. Second, the SFEM and MFEM feature enhancement modules are proposed, acting on the low- and middle-level feature maps respectively to enhance their receptive fields and feature information. Experiments on the LEVIR-ship dataset show that, compared with the baseline network, the method of the invention achieves better results and meets the requirement of real-time detection. Meanwhile, compared with current excellent feature enhancement modules, the multi-scale feature enhancement of the invention achieves the best results. Finally, an extension experiment on the NWPU VHR-10 dataset shows that the method achieves good results on several categories and has a certain extendability.
The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (7)

1. A multi-scale feature enhancement-based optical remote sensing image ship target detection method is characterized by comprising the following steps:
step 1: constructing a multi-scale feature enhancement-based optical remote sensing image ship target detection network; the ship target detection network comprises a CSP Darknet53 with a mixed inverted residual block for feature extraction, a PANet with a multi-size feature enhancement function for feature fusion and a YOLO Head network part for ship target detection;
step 2: carrying out ship target detection network training based on the optical remote sensing image ship target detection data to obtain an optical remote sensing image ship target detection model based on multi-scale feature enhancement;
step 3: inputting the optical remote sensing image into the obtained ship target detection model, and carrying out ship target detection on the optical remote sensing image based on the ship target detection model.
2. The multi-scale feature enhancement-based optical remote sensing image ship target detection method according to claim 1, wherein the step 1 comprises:
a hybrid inverted residual block is used instead of the residual block in the fifth CSP module of CSP Darknet 53.
3. The optical remote sensing image ship target detection method based on multi-scale feature enhancement as claimed in claim 2, wherein in the hybrid inverted residual block, an input feature map is subjected to 1 × 1 convolution for dimension increase, then hybrid convolution is used for feature extraction, the extracted features are subjected to 1 × 1 convolution for dimension reduction, and finally residual connection is performed with the input feature map.
4. The multi-scale feature enhancement-based optical remote sensing image ship target detection method according to claim 3, wherein the hybrid convolution first divides input channels into different groups, each group corresponding to a depth separable convolution with different kernel sizes, and then fuses outputs of the convolutions.
5. The multi-scale feature enhancement-based optical remote sensing image ship target detection method according to claim 1, wherein the step 1 further comprises:
the structure of the PANET network is improved:
introducing a multi-branch structure for the low-level feature map, namely the feature map output by the third CSP module in the CSP Darknet53, wherein each branch receives the same input, feature extraction is performed using 3 × 3 convolution and hole convolution with a dilation rate of 1, and the feature information extracted by the branches is fused and then added to the input feature map;
for the middle layer feature map, namely the feature map output by the fourth CSP module in the CSP Darknet53, the feature is extracted by combining standard convolution and hole convolution, and on the basis of keeping a multi-branch structure, the hole convolution is used for improving the receptive field and capturing feature information in a wider range.
6. The multi-scale feature enhancement-based optical remote sensing image ship target detection method according to claim 2, wherein an SPP module is further arranged between a fifth CSP module of the CSP Darknet53 and the PANet for adjusting the size of the input optical remote sensing image.
7. The multi-scale feature enhancement-based optical remote sensing image ship target detection method according to claim 1, wherein in the prediction of the YOLO Head network part, the feature map is divided into a plurality of grids, each grid comprises a plurality of prediction boxes, and the final result is obtained by calculating the intersection over union (IOU) and then filtering with non-maximum suppression (NMS).
CN202210848740.5A 2022-07-19 2022-07-19 Multi-scale feature enhancement-based optical remote sensing image ship target detection method Pending CN115223056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210848740.5A CN115223056A (en) 2022-07-19 2022-07-19 Multi-scale feature enhancement-based optical remote sensing image ship target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210848740.5A CN115223056A (en) 2022-07-19 2022-07-19 Multi-scale feature enhancement-based optical remote sensing image ship target detection method

Publications (1)

Publication Number Publication Date
CN115223056A true CN115223056A (en) 2022-10-21

Family

ID=83612636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210848740.5A Pending CN115223056A (en) 2022-07-19 2022-07-19 Multi-scale feature enhancement-based optical remote sensing image ship target detection method

Country Status (1)

Country Link
CN (1) CN115223056A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966009A (en) * 2023-01-03 2023-04-14 迪泰(浙江)通信技术有限公司 Intelligent ship detection system and method
CN117351354A (en) * 2023-10-18 2024-01-05 耕宇牧星(北京)空间科技有限公司 Lightweight remote sensing image target detection method based on improved MobileViT
CN117351354B (en) * 2023-10-18 2024-04-16 耕宇牧星(北京)空间科技有限公司 Lightweight remote sensing image target detection method based on improved MobileViT

Similar Documents

Publication Publication Date Title
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN109740460B (en) Optical remote sensing image ship detection method based on depth residual error dense network
CN115223056A (en) Multi-scale feature enhancement-based optical remote sensing image ship target detection method
CN114565860B (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN110866926B (en) Infrared remote sensing image rapid and fine sea-land segmentation method
CN110544269A (en) twin network infrared target tracking method based on characteristic pyramid
CN111914924A (en) Rapid ship target detection method, storage medium and computing device
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN114742799B (en) Industrial scene unknown type defect segmentation method based on self-supervision heterogeneous network
CN111079739A (en) Multi-scale attention feature detection method
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN112418165B (en) Small-size target detection method and device based on improved cascade neural network
CN111798469A (en) Digital image small data set semantic segmentation method based on deep convolutional neural network
CN113486819A (en) Ship target detection method based on YOLOv4 algorithm
CN110008900A (en) A kind of visible remote sensing image candidate target extracting method by region to target
CN112270259A (en) SAR image ship target rapid detection method based on lightweight convolutional neural network
CN111723747A (en) Lightweight high-efficiency target detection method applied to embedded platform
CN110503049B (en) Satellite video vehicle number estimation method based on generation countermeasure network
CN115631427A (en) Multi-scene ship detection and segmentation method based on mixed attention
CN114781514A (en) Floater target detection method and system integrating attention mechanism
CN112215199A (en) SAR image ship detection method based on multi-receptive-field and dense feature aggregation network
CN111539434B (en) Infrared weak and small target detection method based on similarity
CN113688830A (en) Deep learning target detection method based on central point regression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination