CN116758437A - SAR image target detection method and device using an intersection-over-union (IoU)-focal loss function - Google Patents

SAR image target detection method and device using an intersection-over-union (IoU)-focal loss function

Info

Publication number
CN116758437A
Authority
CN
China
Prior art keywords
semantic feature; attention; feature map; feature; target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310807759.XA
Other languages
Chinese (zh)
Inventor
李刚
则正华
韩江鸿
王学谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority claimed from application CN202310807759.XA
Publication of CN116758437A
Legal status: Pending


Classifications

    • G06V 20/13 Satellite images
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Activation functions
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)

Abstract

An embodiment of the present application provides a SAR image target detection method and device using an intersection-over-union (IoU)-focal loss function. The method comprises: performing feature extraction on a SAR image to be detected using a pre-trained image processing network to obtain a multi-level semantic feature map; fusing the first N levels of the multi-level semantic feature map to obtain a shallow-level semantic feature map, and fusing the last M levels to obtain a deep-level semantic feature map; and performing large-size target detection based on the shallow-level semantic feature map and small-size target detection based on the deep-level semantic feature map to obtain a target detection result. The method alleviates the anchor-box imbalance problem in SAR image target detection and improves detection accuracy, while using semantic features of different levels to detect targets of different sizes, which improves the detection of small-size targets.

Description

SAR image target detection method and device using an intersection-over-union (IoU)-focal loss function
Technical Field
The present application relates to the technical field of SAR image target detection, and in particular to a SAR image target detection method and device using an intersection-over-union (IoU)-focal loss function.
Background
Synthetic aperture radar (SAR) is widely used in maritime surveillance, and in ship target detection in particular, because it operates in all weather conditions and around the clock. However, SAR ship detection performance degrades because simple and difficult samples are unevenly distributed in SAR ship detection data sets.
Traditional SAR image target detection methods require preprocessing of the SAR image, such as background-noise filtering, sea-clutter modeling and feature extraction, and therefore cannot meet real-time requirements. Anchor-based deep learning detectors avoid such preprocessing, but among the anchor boxes they generate there are a large number of low-quality anchor boxes that contain no target and only a few high-quality anchor boxes that do, causing an extreme imbalance between high-quality and low-quality anchor boxes and degrading detection accuracy. There is therefore a need for a SAR image target detection method that offers both high real-time performance and high detection accuracy.
Disclosure of Invention
In view of the above, embodiments of the present application provide a SAR image target detection method and device using an intersection-over-union (IoU)-focal loss function, so as to overcome, or at least partially solve, the above problem.
In a first aspect, an embodiment of the present application discloses a SAR image target detection method using an IoU-focal loss function, the method comprising:
performing feature extraction on a SAR image to be detected using a pre-trained image processing network to obtain a multi-level semantic feature map;
fusing the first N levels of the multi-level semantic feature map to obtain a shallow-level semantic feature map, and fusing the last M levels of the multi-level semantic feature map to obtain a deep-level semantic feature map;
performing large-size target detection based on the shallow-level semantic feature map, and performing small-size target detection based on the deep-level semantic feature map, to obtain a target detection result;
wherein the image processing network is trained based on an IoU-focal loss function, and the IoU-focal loss function weights anchor boxes of different quality using a first adjustment factor.
Optionally, the image processing network is obtained by embedding L attention modules in the backbone network of a YOLOv4-Tiny network, with two adjacent attention modules connected through a convolution layer; and performing feature extraction on the SAR image to be detected using the pre-trained image processing network to obtain the multi-level semantic feature map comprises:
performing feature extraction on the SAR image to be detected using a convolution layer to obtain a basic feature map;
sequentially extracting attention information from input feature maps using the 1st to L-th attention modules to obtain, respectively, the shallow-to-deep semantic feature maps output by the 1st to L-th attention modules, wherein the input feature map of the 1st attention module is the basic feature map, and the input feature map of each of the 2nd to L-th attention modules is the feature map obtained by convolving the feature map output by the preceding attention module.
Optionally, fusing the first N levels of the multi-level semantic feature map to obtain the shallow-level semantic feature map comprises:
determining a first target size;
processing the semantic feature maps of the 1st to N-th levels into a shallow-level semantic feature map of the first target size through upsampling and convolution.
Optionally, fusing the last M levels of the multi-level semantic feature map comprises:
determining a second target size;
processing the semantic feature maps of the M-th to L-th levels into a deep-level semantic feature map of the second target size through downsampling and convolution.
Optionally, each attention module processes its input feature map according to the following steps:
extracting channel attention information from the input feature map to obtain a channel attention feature map;
extracting spatial attention information from the channel attention feature map to obtain a spatial attention feature map;
convolving the channel attention feature map and the spatial attention feature map, and fusing the result with the input feature map to obtain the output feature map of the attention module.
Optionally, extracting channel attention information from the input feature map to obtain the channel attention feature map comprises:
performing global max pooling and global average pooling on the input feature map, respectively, to obtain two different spatial semantic descriptors;
inputting the two different spatial semantic descriptors into a two-layer shared network to extract multi-layer attention information, obtaining the channel attention feature map.
Optionally, extracting spatial attention information from the channel attention feature map to obtain the spatial attention feature map comprises:
performing global max pooling and global average pooling on the channel attention feature map, respectively, to obtain two different two-dimensional channel feature maps;
applying a convolution operation and an activation function to the two different two-dimensional channel feature maps to obtain the spatial attention feature map.
Optionally, the IoU-focal loss function is obtained by embedding a conventional intersection-over-union loss function into a focal loss function, and comprises:
an embedded first adjustment factor, which assigns a weight greater than 1 to high-quality anchor boxes and a weight approaching 1 to low-quality anchor boxes, wherein a high-quality anchor box is an anchor box containing a detection target, a low-quality anchor box is an anchor box containing no detection target, and anchor-box quality is affected by the weights of simple and difficult samples among the training samples;
an embedded second adjustment factor, which constrains the contributions of difficult and simple samples to the loss function, wherein a difficult sample is a sample not calibrated by an anchor box containing a target and a simple sample is a sample calibrated by an anchor box containing a target;
and an embedded proportion adjustment parameter, which adjusts the proportion of positive and negative samples among the training samples, wherein a positive sample is a sample containing a target and a negative sample is a sample containing only background.
Optionally, the IoU-focal loss function IoU-FL is expressed as:
IoU-FL = IoU·(1 + IoU)^β·log(IoU) + C_i·log(C_i) + α·(1 − p_t)^β·log(p_t),
where (1 + IoU)^β is the first adjustment factor; (1 − p_t)^β is the second adjustment factor; α is the proportion adjustment parameter; IoU is the intersection-over-union loss term; β is a weight adjustment parameter that adjusts the weights of simple and difficult samples among the training samples; C_i is the confidence that the i-th grid cell contains a target when the input picture is divided into S × S cells in the image processing network; and p_t is the probability, in regression, that a simple or difficult sample is judged as the true value.
In a second aspect, an embodiment of the present application discloses a SAR image target detection device using an IoU-focal loss function, the device comprising:
a feature extraction module, configured to perform feature extraction on a SAR image to be detected using a pre-trained image processing network to obtain a multi-level semantic feature map;
a feature fusion module, configured to fuse the first N levels of the multi-level semantic feature map to obtain a shallow-level semantic feature map, and to fuse the last M levels of the multi-level semantic feature map to obtain a deep-level semantic feature map;
a target detection module, configured to perform large-size target detection based on the shallow-level semantic feature map and small-size target detection based on the deep-level semantic feature map to obtain a target detection result;
wherein the image processing network is trained based on an IoU-focal loss function, and the IoU-focal loss function weights anchor boxes of different quality using a first adjustment factor.
Embodiments of the present application have the following advantages:
In an embodiment of the present application, feature extraction is performed on the SAR image to be detected using a pre-trained image processing network to obtain a multi-level semantic feature map; the first N levels of the multi-level semantic feature map are fused to obtain a shallow-level semantic feature map, and the last M levels are fused to obtain a deep-level semantic feature map; large-size target detection is then performed based on the shallow-level semantic feature map and small-size target detection based on the deep-level semantic feature map, to obtain a target detection result.
The image processing network is trained based on an IoU-focal loss function that weights anchor boxes of different quality using a first adjustment factor, which alleviates the anchor-box imbalance problem in SAR image target detection and improves detection accuracy. In addition, the method extracts features directly from the SAR image without preprocessing, so the real-time performance of target detection is preserved. Moreover, because small-size targets are affected by the background and are therefore hard to detect, they are detected using deeper-level semantic features, which improves the robustness of detection for targets of different sizes and further improves the target detection accuracy of SAR images.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the steps of a SAR image target detection method using an IoU-focal loss function according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a feature extraction process of an attention module according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an image processing network according to an embodiment of the present application;
FIG. 4 shows original SAR images in different environments provided by an embodiment of the present application;
FIG. 5 compares the detection results of the YOLOv4 method and the SAR image target detection method using an IoU-focal loss function provided by an embodiment of the present application in a densely distributed ship environment;
FIG. 6 compares the detection results of the YOLOv4 method and the SAR image target detection method using an IoU-focal loss function provided by an embodiment of the present application in a multi-scale ship environment;
FIG. 7 compares the detection results of the YOLOv4 method and the SAR image target detection method using an IoU-focal loss function provided by an embodiment of the present application in a complex-background ship environment;
FIG. 8 compares the training convergence rates based on a conventional IoU loss function and on the IoU-focal loss function according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a SAR image target detection device using an IoU-focal loss function according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application may be more readily apparent, embodiments of the application are described in further detail below with reference to the appended drawings. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Referring to FIG. 1, FIG. 1 shows a flowchart of the steps of the SAR image target detection method using an IoU-focal loss function according to an embodiment of the present application. As shown in FIG. 1, the method specifically includes steps S110 to S130:
Step S110: performing feature extraction on the SAR image to be detected using a pre-trained image processing network to obtain a multi-level semantic feature map.
In the embodiment of the present application, the SAR image to be detected is a sea-surface ship image acquired with a synthetic aperture radar and contains ships of different sizes (i.e., the detection targets). The multi-level semantic feature map refers to semantic feature maps of different sizes from shallow to deep levels; as feature extraction proceeds, the deeper-level semantic feature maps contain richer semantic information. The image processing network is a network capable of multi-level semantic feature extraction. In the embodiment of the present application, it is considered that targets of different sizes are not equally difficult to detect: small-size targets in particular are often mixed with background noise and background buildings and are therefore poorly detected, so deeper and richer semantic features are needed to detect them. Therefore, in order to detect targets of multiple sizes and improve the robustness of detection for targets of different sizes, the image processing network is used to extract multi-level semantic feature maps from the SAR image to be detected, so that target detection can be performed based on the semantic feature maps of multiple levels in subsequent steps.
In an alternative embodiment, in order to implement multi-level semantic feature extraction, the image processing network is obtained by embedding L attention modules in the backbone network of a YOLOv4-Tiny network, with two adjacent attention modules connected through a convolution layer.
The value of L is not fixed and is determined by the actual target detection requirement. When the targets to be detected are small, detection requires deeper-level semantic features and L is set to a larger value; when the targets to be detected are large, detection can be achieved with shallow-level semantic features and L is set to a smaller value.
Further, performing feature extraction on the SAR image to be detected using the pre-trained image processing network to obtain the multi-level semantic feature map comprises steps S110-1 and S110-2:
Step S110-1: performing feature extraction on the SAR image to be detected using a convolution layer to obtain a basic feature map.
Specifically, feature extraction is performed on the SAR image to be detected using convolution layers of different sizes to obtain the basic feature map X ∈ R^(H×W×C) of the SAR image, where H, W and C denote the height, width and number of channels of the basic feature map, respectively.
Step S110-2: sequentially extracting attention information from input feature maps using the 1st to L-th attention modules to obtain, respectively, the shallow-to-deep semantic feature maps output by the 1st to L-th attention modules, wherein the input feature map of the 1st attention module is the basic feature map, and the input feature map of each of the 2nd to L-th attention modules is the feature map obtained by convolving the feature map output by the preceding attention module.
That is, the 1st attention module extracts attention information from the basic feature map of step S110-1 to obtain the semantic feature map output by the 1st attention module; this semantic feature map is convolved and the result is input to the 2nd attention module, which extracts attention information to obtain its output semantic feature map; each subsequent attention module extracts attention information in the same way, finally yielding the semantic feature map output by every attention module.
In an alternative embodiment, as shown in FIG. 2, each attention module processes its input feature map according to the following steps A1 to A3:
Step A1: extracting channel attention information from the input feature map to obtain a channel attention feature map.
In the embodiment of the present application, the attention module first extracts channel attention information from the input feature map, further extracting and adaptively reweighting the input feature along the channel dimension so that the input feature is better exploited.
Specifically, extracting channel attention information from the input feature map to obtain the channel attention feature map comprises: performing global max pooling and global average pooling on the input feature map, respectively, to obtain two different spatial semantic descriptors; and inputting the two different spatial semantic descriptors into a two-layer shared network to extract multi-layer attention information, obtaining the channel attention feature map.
Illustratively, the channel attention feature map may be expressed as:
M_c = σ(MLP(avgPool(X_in)) + MLP(maxPool(X_in))),
where M_c ∈ R^(C×1×1) denotes the channel attention feature map, avgPool denotes the average pooling operation, maxPool denotes the max pooling operation, MLP denotes the multi-layer attention information extraction, σ denotes the activation function, and X_in ∈ R^(H×W×C) denotes the input feature map; the input feature map of the 1st attention module is the basic feature map X ∈ R^(H×W×C).
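As an illustration of this channel attention computation, a minimal PyTorch sketch follows; the class name, the reduction ratio of 16 and the use of 1×1 convolutions to implement the two-layer shared network are assumptions for illustration, not details taken from the patent:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average- and max-pooled descriptors passed through a shared two-layer network."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # global average pooling -> (B, C, 1, 1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # global max pooling     -> (B, C, 1, 1)
        # two-layer shared network, implemented here with 1x1 convolutions (an assumption)
        self.shared_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        # two different spatial semantic descriptors
        avg_desc = self.shared_mlp(self.avg_pool(x_in))
        max_desc = self.shared_mlp(self.max_pool(x_in))
        # M_c = sigma(MLP(avgPool(X_in)) + MLP(maxPool(X_in)))
        return torch.sigmoid(avg_desc + max_desc)
```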
Step A2: extracting spatial attention information from the channel attention feature map to obtain a spatial attention feature map.
In the embodiment of the present application, after the channel attention feature map is obtained, spatial attention information is extracted so that the channel attention feature map is further extracted and adaptively reweighted along the spatial dimension, and the input feature is better exploited.
Specifically, extracting spatial attention information from the channel attention feature map to obtain the spatial attention feature map comprises: performing global max pooling and global average pooling on the channel attention feature map, respectively, to obtain two different two-dimensional channel feature maps; and applying a convolution operation and an activation function to the two different two-dimensional channel feature maps to obtain the spatial attention feature map.
Here, applying the convolution operation and the activation function means processing the two different two-dimensional channel feature maps with a 7×7 convolution operator f^(7×7) and a σ activation function to obtain the spatial attention feature map M_s ∈ R^(H×W).
For example, the spatial attention feature map may be expressed as:
M_s = σ(f^(7×7)([avgPool(M_c); maxPool(M_c)])).
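A corresponding sketch of the spatial attention computation follows; channel-wise mean and max maps play the role of the global average and max pooling, the 7×7 kernel follows the formula above, and the input is assumed to be the channel-refined feature map (the class name and padding choice are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average and max maps concatenated and passed through a 7x7 convolution."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # two different two-dimensional channel feature maps, each of shape (B, 1, H, W)
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        # M_s = sigma(f^(7x7)([avgPool(.); maxPool(.)]))
        return torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
```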
step A3: and convolving the channel attention feature map and the space attention feature map, and then fusing the convolved channel attention feature map and the space attention feature map with the input feature map to obtain an output feature map of the attention module.
Illustratively, the channel attention profile and the spatial attention profile are convolved to obtain an initial output profile X s The initial output profile is expressed as:
further, the initial output feature map is added to the input feature map to obtain an output feature map of the attention module, and the output feature map X of the attention module out Expressed as:
X out =X s +X in ,
in the embodiment of the application, in order to improve the robustness of target detection of different sizes, L attention modules comprising spatial attention and channel attention are embedded into a Yolov4-Tiny backbone network to extract semantic feature graphs of different levels, so that targets of different sizes can be detected based on the semantic feature graphs of different levels in a subsequent step.
Step S120: fusing the first N levels of the multi-level semantic feature map to obtain a shallow-level semantic feature map, and fusing the last M levels of the multi-level semantic feature map to obtain a deep-level semantic feature map.
In the embodiment of the present application, it is considered that an ordinary detection network cannot detect targets of multiple sizes at the same time: if detection of large-size targets is ensured, detection of small-size targets is neglected, and if detection of small-size targets is ensured, detection of large-size targets cannot be achieved. Therefore, in order to detect large-size and small-size targets in the SAR image simultaneously, large-size targets are detected using the shallow-level semantic feature map, and small-size targets are detected using the deep-level semantic feature map, which carries richer semantic information. A large-size target is a target whose object occupies more pixels in the SAR image to be detected than a number threshold, and a small-size target is a target whose object occupies no more pixels than that threshold.
In order to obtain shallow-level semantic information that accurately characterizes large-size targets, the semantic feature maps of the first N levels output by the image processing network are fused to obtain the shallow-level semantic feature map. The value of N is determined by the actual detection situation, for example the required detection accuracy and the target sizes. Since the semantic feature maps of the different levels have different sizes, they can be processed by upsampling and convolution in order to merge them into one shallow-level semantic feature map.
Specifically, fusing the first N levels of the multi-level semantic feature map to obtain the shallow-level semantic feature map comprises: determining a first target size; and processing the semantic feature maps of the 1st to N-th levels into a shallow-level semantic feature map of the first target size through upsampling and convolution. The first target size is the size of the shallow-level semantic feature map. For example, if the first target size is 208×208 and the three levels of semantic feature maps output by the image processing network have sizes 104×104, 52×52 and 26×26, the first two levels need to be fused into a 208×208 shallow-level semantic feature map: the 104×104 and 52×52 feature maps are upsampled and convolved, respectively, to obtain the 208×208 shallow-level semantic features, as sketched below.
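A minimal sketch of this shallow fusion step, using the 208×208 example above; the nearest-neighbour interpolation mode, the output channel count and the class name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowFusion(nn.Module):
    """Upsample the first N semantic levels to the first target size (208x208 here) and fuse by convolution."""
    def __init__(self, ch_104: int, ch_52: int, out_ch: int = 64):
        super().__init__()
        self.fuse = nn.Conv2d(ch_104 + ch_52, out_ch, kernel_size=3, padding=1)

    def forward(self, feat_104: torch.Tensor, feat_52: torch.Tensor) -> torch.Tensor:
        up_a = F.interpolate(feat_104, scale_factor=2, mode='nearest')  # 104x104 -> 208x208
        up_b = F.interpolate(feat_52, scale_factor=4, mode='nearest')   # 52x52   -> 208x208
        return self.fuse(torch.cat([up_a, up_b], dim=1))                # shallow-level semantic feature map
```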
Likewise, in order to obtain deep-level semantic information that accurately characterizes small-size targets, the semantic feature maps of the last M levels output by the image processing network are fused to obtain the deep-level semantic feature map. The value of M is determined by the actual detection situation; for example, M may be 1 or larger. Since the semantic feature maps of the different levels have different sizes, they can be processed by downsampling and convolution in order to merge them into one deep-level semantic feature map.
Specifically, fusing the last M levels of the multi-level semantic feature map comprises: determining a second target size; and processing the semantic feature maps of the M-th to L-th levels into a deep-level semantic feature map of the second target size through downsampling and convolution. The second target size is the size of the deep-level semantic feature map. For example, if the second target size is 13×13 and the three levels of semantic feature maps output by the image processing network have sizes 104×104, 52×52 and 26×26, the last two levels need to be fused into a 13×13 deep-level semantic feature map: the 52×52 and 26×26 feature maps are downsampled and convolved, respectively, to obtain the 13×13 deep-level semantic features, as sketched below.
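A corresponding sketch of the deep fusion step for the 13×13 example; the use of strided convolutions for downsampling, the channel counts and the class name are assumptions:

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Downsample the last M semantic levels to the second target size (13x13 here) and fuse by convolution."""
    def __init__(self, ch_52: int, ch_26: int, out_ch: int = 128):
        super().__init__()
        self.down_52 = nn.Conv2d(ch_52, ch_52, kernel_size=3, stride=4, padding=1)  # 52x52 -> 13x13
        self.down_26 = nn.Conv2d(ch_26, ch_26, kernel_size=3, stride=2, padding=1)  # 26x26 -> 13x13
        self.fuse = nn.Conv2d(ch_52 + ch_26, out_ch, kernel_size=3, padding=1)

    def forward(self, feat_52: torch.Tensor, feat_26: torch.Tensor) -> torch.Tensor:
        d_a = self.down_52(feat_52)
        d_b = self.down_26(feat_26)
        return self.fuse(torch.cat([d_a, d_b], dim=1))  # deep-level semantic feature map
```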
Step S130: performing large-size target detection based on the shallow-level semantic feature map, and performing small-size target detection based on the deep-level semantic feature map, to obtain a target detection result.
In the embodiment of the present application, the shallow-level semantic feature map is used for large-size target detection and the deep-level semantic feature map is used for small-size target detection, and the detection results of the two are superimposed to obtain the final detection result, i.e., the target detection result. Performing target detection with the shallow-level and deep-level semantic feature maps simultaneously allows targets of different sizes to be attended to at the same time, which improves the robustness of detection for targets of different sizes and thereby the accuracy of target detection.
The image processing network is trained based on an IoU-focal loss function, and the IoU-focal loss function weights anchor boxes of different quality using a first adjustment factor. That is, during target detection a larger weight is given to anchor boxes containing a target and a smaller weight to anchor boxes containing no target, which alleviates the anchor-box imbalance problem in ship detection regression and improves detection accuracy.
In an alternative embodiment, to address the imbalance between the large number of anchor boxes containing no target and the small number of anchor boxes containing targets generated during target detection, the image processing network is trained with an IoU-focal loss function. Specifically, the IoU-focal loss function is obtained by embedding a conventional intersection-over-union loss function into a focal loss function, and comprises:
an embedded first adjustment factor, which assigns a weight greater than 1 to high-quality anchor boxes and a weight approaching 1 to low-quality anchor boxes, wherein a high-quality anchor box is an anchor box containing a detection target, a low-quality anchor box is an anchor box containing no detection target, and anchor-box quality is affected by the weights of simple and difficult samples among the training samples;
an embedded second adjustment factor, which constrains the contributions of difficult and simple samples to the loss function, wherein a difficult sample is a sample not calibrated by an anchor box containing a target and a simple sample is a sample calibrated by an anchor box containing a target;
and an embedded proportion adjustment parameter, which adjusts the proportion of positive and negative samples among the training samples, wherein a positive sample is a sample containing a target and a negative sample is a sample containing only background.
Illustratively, the IoU-focal loss function IoU-FL is expressed as:
IoU-FL = IoU·(1 + IoU)^β·log(IoU) + C_i·log(C_i) + α·(1 − p_t)^β·log(p_t),
where (1 + IoU)^β is the first adjustment factor; (1 − p_t)^β is the second adjustment factor; α is the proportion adjustment parameter; IoU is the intersection-over-union loss term; β is a weight adjustment parameter that adjusts the weights of simple and difficult samples among the training samples; C_i is the confidence that the i-th grid cell contains a target when the input picture is divided into S × S cells in the image processing network; and p_t is the probability, in regression, that a simple or difficult sample is judged as the true value.
Specifically, in the first part of the IoU-focal loss function, the conventional IoU loss term is multiplied by the first adjustment factor (1 + IoU)^β, which automatically assigns a larger weight to high-quality anchor boxes containing targets while suppressing low-quality anchor boxes that contain no target. For example, when a high-quality anchor box containing a target appears, the first adjustment factor (1 + IoU)^β makes the contribution of that anchor box to the loss function greater than 1; conversely, when a low-quality anchor box containing no target appears, its contribution to the loss function approaches 1 under the action of the first adjustment factor. IoU-FL therefore alleviates the anchor-box imbalance problem in target detection. In addition, because the first adjustment factor (1 + IoU)^β allows the network to be optimized continuously during training, the training method based on the IoU-focal loss function (IoU-FL) also converges faster than the training method based on the conventional IoU loss function.
In the third part of the IoU-focal loss function, the proportion of positive and negative samples during training is adjusted by the proportion adjustment parameter α, and the second adjustment factor (1 − p_t)^β increases the contribution of difficult samples to the loss function while suppressing the contribution of simple samples. For example, when a difficult sample is detected, p_t is small, so the adjustment factor is close to 1; conversely, when a simple sample is detected, p_t is large, so the adjustment factor is close to 0. The loss function IoU-FL provided by the embodiment of the present application therefore alleviates both the anchor-box imbalance and the slow training convergence caused by the imbalance between simple and difficult samples.
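The following is a hedged PyTorch sketch of how the IoU-FL expression above might be evaluated for a batch of matched predictions. The function name, the α = 0.25 and β = 2 defaults, the clamping for numerical stability, the mean reduction over anchors, and the overall negation (so that the expression is minimized) are assumptions not stated in the patent; `iou`, `conf` and `p_t` are assumed to be tensors already produced by the detector's anchor-matching step:

```python
import torch

def iou_focal_loss(iou: torch.Tensor, conf: torch.Tensor, p_t: torch.Tensor,
                   alpha: float = 0.25, beta: float = 2.0, eps: float = 1e-7) -> torch.Tensor:
    """IoU-FL = IoU*(1+IoU)^beta*log(IoU) + C_i*log(C_i) + alpha*(1-p_t)^beta*log(p_t),
    evaluated per matched anchor box / grid cell and averaged."""
    iou = iou.clamp(min=eps, max=1.0)
    conf = conf.clamp(min=eps, max=1.0)
    p_t = p_t.clamp(min=eps, max=1.0)
    box_term = iou * (1.0 + iou) ** beta * torch.log(iou)      # first adjustment factor weights the IoU term
    conf_term = conf * torch.log(conf)                         # confidence term C_i*log(C_i)
    cls_term = alpha * (1.0 - p_t) ** beta * torch.log(p_t)    # second adjustment factor and proportion parameter
    return -(box_term + conf_term + cls_term).mean()           # negated so the quantity is minimized (assumption)
```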
In addition, in the embodiment of the present application, the training samples of the image processing network are an SAR ship image data set comprising SAR images in jpg format and target annotation information in xml format, the annotation information including the target class, the upper-left corner coordinates (X_min, Y_max) and lower-right corner coordinates (X_max, Y_min) of the rectangular label box, and the picture size. When training the image processing network, the training samples are input into the image processing network for processing, the loss value is calculated using the IoU-focal loss function, the parameters of the image processing network are updated by back-propagation with stochastic gradient descent (SGD), and training is iterated until the network loss function converges.
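Under these training settings, a minimal training-loop sketch could look as follows; the `train_loader`, the `match_and_score` helper, the momentum value and the initial learning rate are illustrative placeholders rather than components defined by the patent, the cosine decay and the 120-epoch / 0.0001 minimum learning-rate settings follow the experimental description later in this section, and `iou_focal_loss` refers to the sketch above:

```python
import torch.optim as optim

def train(network, train_loader, epochs: int = 120, lr: float = 1e-3, min_lr: float = 1e-4):
    """SGD training with back-propagation of the IoU-FL loss, iterated until convergence."""
    optimizer = optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    # cosine learning-rate decay down to the minimum learning rate used in the experiments
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=min_lr)
    for epoch in range(epochs):
        for images, targets in train_loader:                   # SAR slices and their xml-derived box labels
            iou, conf, p_t = network.match_and_score(images, targets)  # placeholder anchor-matching step
            loss = iou_focal_loss(iou, conf, p_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()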
FIG. 3 is a schematic structural diagram of the image processing network according to an embodiment of the present application. The image processing network comprises a backbone network, a feature fusion network and a target detection network. The SAR image to be detected is first input into the backbone network for feature extraction, yielding the multi-level semantic feature map; then, in the feature fusion network, the first N levels of the multi-level semantic feature map are fused into the shallow-level semantic feature map and the last M levels into the deep-level semantic feature map; finally, the target detection network performs large-size target detection based on the shallow-level semantic feature map and small-size target detection based on the deep-level semantic feature map to obtain the target detection result.
Further, to verify the target detection performance of the image processing network, its performance is evaluated using indices such as precision (P), recall (R) and average precision (AP). Specifically, precision (P) and recall (R) are expressed as:
P = TP / (TP + FP), R = TP / (TP + FN),
where TP denotes the number of correctly detected targets, FP denotes the number of false detections (background detected as targets), FN denotes the number of targets missed (detected as background), and AP is obtained by integrating the P-R curve with a lower limit of 0 and an upper limit of 1.
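For reference, a small helper along these lines could compute the indices; the discrete integration of the P-R curve below is a simple approximation, since the exact integration scheme is not specified here:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r

def average_precision(recalls, precisions):
    """AP as a discrete integral of the P-R curve over recall in [0, 1] (recalls sorted ascending)."""
    return sum((recalls[i] - recalls[i - 1]) * precisions[i] for i in range(1, len(recalls)))
```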
The embodiment of the present application is tested on the public multi-source, multi-scale SAR ship slice data set SSDD (SAR ship detection dataset). The data set comprises 40,000 SAR ship detection slices drawn from the Chinese Gaofen-3 satellite and European Space Agency Sentinel-1 satellite data, covering common ship targets such as tankers, large container ships and bulk carriers. The data set is divided into training and test sets at a ratio of 8:2. The image processing network is trained with stochastic gradient descent, the minimum learning rate of the model is 0.0001, and the learning rate follows a cosine decay schedule. After 120 training iterations, tests are performed in three different environments (densely distributed ships, complex-background ships and multi-scale ships), achieving a detection accuracy of 95.2%; the detection results are shown in FIG. 4 to FIG. 7 and Table 1.
Table 1 comparison of the method provided by the example of the present application with other existing methods
Model Input size AP
SSD-300 300×300 0.883
SSD-512 512×512 0.894
Faster R-CNN 600×800 0.883
RetinaNet 800×800 0.914
Modified SSD-300 300×300 0.883
Modified SSD-512 512×512 0.891
YOLOv5s+CBAM+BiFPN 256×256 0.928
The method provided by the embodiment of the application 416×416 0.952
The test results in Table 1 show that the detection accuracy of the method provided by the embodiment of the present application is at least 5% higher than that of one-stage methods such as SSD and YOLOv5, and 6% higher than that of the two-stage method Faster R-CNN. The method provided by the embodiment of the present application therefore has clear advantages in both detection accuracy and efficiency.
FIG. 4 illustrates original SAR images in densely distributed ship, multi-scale ship and complex-background ship environments. FIG. 5 compares the detection results of the YOLOv4 method and the method provided by the embodiment of the present application in a densely distributed ship environment, where solid rectangular boxes denote detected targets and dotted boxes denote undetected targets; as can be seen from FIG. 5, owing to the extraction of semantic features at different levels by the attention modules, the detection accuracy of the method provided by the embodiment of the present application is higher than that of YOLOv4 in this environment. FIG. 6 compares the detection results of the YOLOv4 method and the method provided by the embodiment of the present application in a multi-scale ship environment, with the same box conventions; as can be seen from FIG. 6, the method provided by the embodiment of the present application detects small-size targets well, detects large-size targets well, and achieves higher detection accuracy for large-size targets than YOLOv4. FIG. 7 compares the detection results of the YOLOv4 method and the method provided by the embodiment of the present application in a complex-background ship environment, again with the same box conventions; as can be seen from FIG. 7, compared with YOLOv4, the method provided by the embodiment of the present application can detect small-size targets against a complex background.
FIG. 8 compares the convergence rates of training the image processing network with the IoU-focal loss function and with a conventional IoU loss function. It can be seen that the IoU-FL method converges faster than the conventional IoU method: the IoU method converges after about 20 iterations, while the IoU-FL method begins to converge after 5 iterations.
In the embodiment of the present application, feature extraction is performed on the SAR image to be detected using a pre-trained image processing network to obtain a multi-level semantic feature map; the first N levels of the multi-level semantic feature map are fused to obtain a shallow-level semantic feature map, and the last M levels are fused to obtain a deep-level semantic feature map; large-size target detection is then performed based on the shallow-level semantic feature map and small-size target detection based on the deep-level semantic feature map, to obtain a target detection result.
The image processing network is trained based on an IoU-focal loss function that weights anchor boxes of different quality using a first adjustment factor, which alleviates the anchor-box imbalance problem in SAR image target detection and improves detection accuracy. In addition, the method extracts features directly from the SAR image without preprocessing, so the real-time performance of target detection is preserved. Moreover, because small-size targets are affected by the background and are therefore hard to detect, they are detected using deeper-level semantic features, which further improves the target detection accuracy of SAR images.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of the SAR image target detection device using an IoU-focal loss function according to an embodiment of the present application. As shown in FIG. 9, the device comprises:
the feature extraction module 910 is configured to perform feature extraction on the SAR image to be detected by using a pre-trained image processing network, so as to obtain a multi-level semantic feature map;
the feature fusion module 920 is configured to fuse the first N levels of semantic feature graphs in the multi-level semantic feature graphs to obtain a shallow level semantic feature graph, and fuse the last M levels of semantic feature graphs in the multi-level semantic feature graph to obtain a deep level semantic feature graph;
the target detection module 930 is configured to perform large-size target detection based on the shallow level semantic feature map, and perform small-size target detection based on the deep level semantic feature map, so as to obtain a target detection result;
the image processing network is trained based on an cross-ratio-focus loss function, and the cross-ratio-focus loss function weights anchor frames with different qualities by using a first adjusting factor.
In an alternative embodiment, the image processing network is obtained by embedding L attention modules in the backbone network of a YOLOv4-Tiny network, with two adjacent attention modules connected through a convolution layer; the feature extraction module 910 comprises:
a basic feature module, configured to perform feature extraction on the SAR image to be detected using a convolution layer to obtain a basic feature map;
a semantic feature module, configured to sequentially extract attention information from input feature maps using the 1st to L-th attention modules to obtain, respectively, the shallow-to-deep semantic feature maps output by the 1st to L-th attention modules, wherein the input feature map of the 1st attention module is the basic feature map, and the input feature map of each of the 2nd to L-th attention modules is the feature map obtained by convolving the feature map output by the preceding attention module.
In an alternative embodiment, the feature fusion module 920 includes:
a first size module for determining a first target size;
and the first fusion module is used for processing the semantic feature graphs of the 1 st to N th levels into shallow semantic feature graphs with the first target size through up-sampling and convolution processing.
In an alternative embodiment, the feature fusion module 920 includes:
a second size module for determining a second target size;
and the second fusion module is used for processing the semantic feature graphs of the Mth to L-th layers into deep-level semantic feature graphs with the second target size through downsampling and convolution processing.
In an alternative embodiment, the semantic feature module includes:
the channel attention module is used for extracting channel attention information from the input feature map to obtain a channel attention feature map;
the spatial attention module is used for extracting the spatial attention information of the channel attention feature map to obtain the spatial attention feature map;
and the attention fusion module is used for fusing the channel attention characteristic diagram and the space attention characteristic diagram after convolving the channel attention characteristic diagram and the space attention characteristic diagram with the input characteristic diagram to obtain an output characteristic diagram of the attention module.
In an alternative embodiment, the channel attention module includes:
the first channel attention sub-module is used for respectively carrying out global maximum pooling and global average pooling on the input feature map to obtain two different space semantic descriptors;
and the second channel attention sub-module is used for inputting the two different spatial semantic descriptors into a two-layer shared network to extract multi-layer attention information so as to obtain a channel attention characteristic diagram.
In an alternative embodiment, the spatial attention module includes:
the first space attention sub-module is used for carrying out global maximum pooling and global average pooling on the channel attention feature images respectively to obtain two different two-dimensional channel feature images;
And the second spatial attention sub-module is used for carrying out convolution operation and activation function on the two different two-dimensional channel feature graphs to obtain a spatial attention feature graph.
In an alternative embodiment, the IoU-focal loss function is obtained by embedding a conventional intersection-over-union loss function into a focal loss function, and comprises:
an embedded first adjustment factor, which assigns a weight greater than 1 to high-quality anchor boxes and a weight approaching 1 to low-quality anchor boxes, wherein a high-quality anchor box is an anchor box containing a detection target, a low-quality anchor box is an anchor box containing no detection target, and anchor-box quality is affected by the weights of simple and difficult samples among the training samples;
an embedded second adjustment factor, which constrains the contributions of difficult and simple samples to the loss function, wherein a difficult sample is a sample not calibrated by an anchor box containing a target and a simple sample is a sample calibrated by an anchor box containing a target;
and an embedded proportion adjustment parameter, which adjusts the proportion of positive and negative samples among the training samples, wherein a positive sample is a sample containing a target and a negative sample is a sample containing only background.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
The SAR image target detection method and apparatus based on the intersection-over-union-focal loss function provided by the present application have been described in detail above, and specific examples have been used to illustrate the principles and embodiments of the present application; the description of these examples is only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A SAR image target detection method based on an intersection-over-union-focal (IoU-focal) loss function, comprising:
performing feature extraction on the SAR image to be detected by utilizing a pre-trained image processing network to obtain a multi-level semantic feature map;
fusing the first N levels of semantic feature maps in the multi-level semantic feature map to obtain a shallow-level semantic feature map, and fusing the last M levels of semantic feature maps in the multi-level semantic feature map to obtain a deep-level semantic feature map;
performing large-size target detection based on the shallow-level semantic feature map, and performing small-size target detection based on the deep-level semantic feature map to obtain a target detection result;
wherein the image processing network is trained based on the IoU-focal loss function, and the IoU-focal loss function weights anchor boxes of different qualities by using a first adjustment factor.
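As an illustration of the overall flow in claim 1, the following PyTorch-style sketch wires a backbone, two fusion modules, and two detection heads together; the module names, the choice of passing N and M as arguments, and the tensor handling are assumptions made for readability, not details taken from the patent.

```python
import torch.nn as nn

class TwoBranchSARDetector(nn.Module):
    """Illustrative pipeline: multi-level features -> shallow/deep fusion -> two detection heads."""

    def __init__(self, backbone, shallow_fusion, deep_fusion, head_large, head_small):
        super().__init__()
        self.backbone = backbone              # returns a list of L semantic feature maps
        self.shallow_fusion = shallow_fusion  # fuses the first N levels (claim 3)
        self.deep_fusion = deep_fusion        # fuses the last M levels (claim 4)
        self.head_large = head_large          # large-size targets on the shallow-level map
        self.head_small = head_small          # small-size targets on the deep-level map

    def forward(self, sar_image, n_shallow, m_deep):
        feats = self.backbone(sar_image)                    # multi-level semantic feature maps
        shallow_map = self.shallow_fusion(feats[:n_shallow])
        deep_map = self.deep_fusion(feats[-m_deep:])
        return self.head_large(shallow_map), self.head_small(deep_map)
```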
2. The method according to claim 1, wherein the image processing network is obtained by embedding L attention modules in a backbone network of a YOLOv4-Tiny network, two adjacent attention modules being connected through a convolution layer; and wherein performing feature extraction on the SAR image to be detected by using the pre-trained image processing network to obtain the multi-level semantic feature map comprises:
performing feature extraction on the SAR image to be detected by using a convolution layer to obtain a basic feature map;
and sequentially extracting attention information from the input feature maps by using the 1st to the L-th attention modules to obtain the shallow-to-deep semantic feature maps output by the 1st to the L-th attention modules respectively, wherein the input feature map of the 1st attention module is the basic feature map, and the input feature map of each of the 2nd to the L-th attention modules is the feature map obtained by performing convolution processing on the feature map output by the preceding attention module.
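A minimal sketch of the backbone structure described in claim 2: a stem convolution produces the basic feature map, and L attention modules, joined by strided convolutions, emit shallow-to-deep semantic feature maps. The channel widths, strides, and single-channel SAR input are assumptions; only the alternation of convolution layers and attention modules follows the claim.

```python
import torch.nn as nn

class AttentionBackbone(nn.Module):
    """Backbone sketch: a stem convolution, then L attention modules joined by strided convolutions."""

    def __init__(self, attention_factory, num_levels=3, base_channels=32):
        super().__init__()
        # Basic feature map from an ordinary convolution layer (single-channel SAR input assumed).
        self.stem = nn.Conv2d(1, base_channels, kernel_size=3, stride=2, padding=1)
        attns, convs = [], []
        ch = base_channels
        for i in range(num_levels):
            attns.append(attention_factory(ch))  # attention_factory builds an attention block for `ch` channels
            if i + 1 < num_levels:
                convs.append(nn.Conv2d(ch, ch * 2, kernel_size=3, stride=2, padding=1))
                ch *= 2
        self.attns = nn.ModuleList(attns)
        self.convs = nn.ModuleList(convs)

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for i, attn in enumerate(self.attns):
            x = attn(x)                # semantic feature map of the (i+1)-th level, shallow to deep
            feats.append(x)
            if i < len(self.convs):
                x = self.convs[i](x)   # convolution layer between adjacent attention modules
        return feats
```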
3. The method of claim 2, wherein fusing the first N levels of semantic feature maps in the multi-level semantic feature map to obtain the shallow-level semantic feature map comprises:
determining a first target size;
and processing the 1st to N-th level semantic feature maps into shallow-level semantic feature maps of the first target size through up-sampling and convolution processing.
4. The method of claim 2, wherein fusing the last M levels of semantic feature maps in the multi-level semantic feature map comprises:
determining a second target size;
and processing the M-th to L-th level semantic feature maps into deep-level semantic feature maps of the second target size through down-sampling and convolution processing.
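The following sketch covers the fusion steps of claims 3 and 4 in one module: each selected level is resized to a common target size and passed through a convolution before the results are merged. The 1×1 convolutions, nearest-neighbour interpolation, and element-wise summation are assumptions; the claims specify only up-/down-sampling followed by convolution processing.

```python
import torch.nn as nn
import torch.nn.functional as F

class LevelFusion(nn.Module):
    """Resize each selected level to a common target size, convolve, and merge."""

    def __init__(self, in_channels_list, out_channels, target_size):
        super().__init__()
        self.target_size = target_size  # (height, width) of the fused feature map
        self.convs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list
        )

    def forward(self, feature_maps):
        fused = 0
        for conv, fmap in zip(self.convs, feature_maps):
            # Up-sampling for the shallow branch (claim 3) or down-sampling for the deep
            # branch (claim 4), depending on how target_size compares to the input size.
            fmap = F.interpolate(fmap, size=self.target_size, mode="nearest")
            fused = fused + conv(fmap)
        return fused
```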
5. The method of claim 2, wherein each attention module processes its input feature map according to the following steps:
extracting channel attention information from the input feature map to obtain a channel attention feature map;
extracting the spatial attention information from the channel attention feature map to obtain a spatial attention feature map;
and convolving the channel attention feature map and the spatial attention feature map, and then fusing the convolved channel attention feature map and the convolved spatial attention feature map with the input feature map to obtain an output feature map of the attention module.
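A sketch of the per-module processing of claim 5. The channel and spatial attention blocks are injected (for example, the sketches after claims 6 and 7 below); the 3×3 convolutions and the residual addition used to fuse with the input feature map are assumptions, since the claim does not name the convolution sizes or the fusion operation.

```python
import torch.nn as nn

class AttentionModule(nn.Module):
    """Claim-5 style block: channel attention, then spatial attention, then fusion with the input."""

    def __init__(self, channels, channel_attention, spatial_attention):
        super().__init__()
        self.channel_attention = channel_attention   # e.g. the claim-6 sketch below
        self.spatial_attention = spatial_attention   # e.g. the claim-7 sketch below
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_s = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        ca_map = self.channel_attention(x)        # channel attention feature map
        sa_map = self.spatial_attention(ca_map)   # spatial attention feature map
        # Convolve both attention maps, then fuse with the input feature map (residual sum assumed).
        return x + self.conv_c(ca_map) + self.conv_s(sa_map)
```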
6. The method of claim 5, wherein extracting the channel attention information from the input feature map to obtain the channel attention feature map comprises:
respectively performing global maximum pooling and global average pooling on the input feature map to obtain two different spatial semantic descriptors;
and inputting the two different spatial semantic descriptors into a two-layer shared network to extract multi-layer attention information, so as to obtain the channel attention feature map.
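A sketch of the channel attention step of claim 6. The reduction ratio of the two-layer shared network and the sigmoid gating are common CBAM-style assumptions, and returning the channel-reweighted feature map as the "channel attention feature map" is an interpretation of the claim wording.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global max/average pooling descriptors -> two-layer shared network -> channel reweighting."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        max_desc = torch.amax(x, dim=(2, 3))   # descriptor from global maximum pooling
        avg_desc = torch.mean(x, dim=(2, 3))   # descriptor from global average pooling
        weights = torch.sigmoid(self.shared_mlp(max_desc) + self.shared_mlp(avg_desc))
        return x * weights.view(b, c, 1, 1)    # channel attention feature map
```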
7. The method of claim 5, wherein extracting the spatial attention information from the channel attention feature map to obtain the spatial attention feature map comprises:
respectively carrying out global maximum pooling and global average pooling on the channel attention feature map to obtain two different two-dimensional channel feature maps;
and applying a convolution operation and an activation function to the two different two-dimensional channel feature maps to obtain the spatial attention feature map.
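A sketch of the spatial attention step of claim 7. The 7×7 convolution kernel and the sigmoid activation are assumptions; the channel-wise maximum and average pooling that produce the two two-dimensional maps follow the claim.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise max/average maps -> convolution + activation -> spatial reweighting."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, ca_map):
        max_map = torch.amax(ca_map, dim=1, keepdim=True)   # two-dimensional channel maximum map
        avg_map = torch.mean(ca_map, dim=1, keepdim=True)    # two-dimensional channel average map
        weights = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return ca_map * weights                               # spatial attention feature map
```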
8. The method of claim 1, wherein the IoU-focal loss function is obtained by embedding a conventional intersection-over-union (IoU) loss function into a focal loss function, comprising:
embedding a first adjustment factor, wherein the first adjustment factor assigns a weight greater than 1 to a high-quality anchor box and assigns a weight approaching 1 to a low-quality anchor box, the high-quality anchor box being an anchor box that contains a detection target, the low-quality anchor box being an anchor box that does not contain the detection target, and the quality of the anchor box being influenced by the weights of simple samples and difficult samples in the training samples;
embedding a second adjustment factor, wherein the second adjustment factor constrains the contributions of difficult samples and simple samples to the loss function, a difficult sample being a sample whose anchor box calibration does not contain the target, and a simple sample being a sample whose anchor box calibration contains the target;
and embedding a proportion adjustment parameter, wherein the proportion adjustment parameter adjusts the proportion of positive samples to negative samples in the training samples, a positive sample being a sample that contains a target and a negative sample being a sample that contains only background.
9. The method of claim 8, wherein the IoU-focal loss function IoU-FL is expressed as:
IoU-FL = IoU·(1+IoU)^β·log(IoU) + C_i·log(C_i) + α·(1-p_t)^β·log(p_t),
wherein (1+IoU)^β is the first adjustment factor; (1-p_t)^β is the second adjustment factor; α is the proportion adjustment parameter; IoU is the intersection-over-union loss function; β is a weight adjustment parameter that adjusts the weights of simple samples and difficult samples in the training samples; C_i is the confidence that the i-th cell contains a target when the input picture is divided into S×S cells in the image processing network; and p_t is the probability that a simple sample or a difficult sample is judged as the true value in regression.
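A sketch of the IoU-FL expression of claim 9, evaluated term by term per anchor box. The α and β defaults are common focal-loss choices rather than values from the patent, the sign convention follows the formula as printed, and how IoU, C_i, and p_t are obtained from the network outputs is outside the scope of this sketch.

```python
import torch

def iou_focal_loss(iou, confidence, p_t, alpha=0.25, beta=2.0, eps=1e-7):
    """Evaluate the printed IoU-FL expression element-wise on tensors of equal shape.

    iou:        intersection over union of each anchor box with its matched ground truth
    confidence: C_i, the confidence that the corresponding grid cell contains a target
    p_t:        probability of judging the sample as the true value in regression
    """
    iou = iou.clamp(min=eps)
    p_t = p_t.clamp(min=eps)
    confidence = confidence.clamp(min=eps)
    first_factor = (1.0 + iou) ** beta    # first adjustment factor: weights high-quality anchor boxes above 1
    second_factor = (1.0 - p_t) ** beta   # second adjustment factor: constrains easy-sample contributions
    return (iou * first_factor * torch.log(iou)
            + confidence * torch.log(confidence)
            + alpha * second_factor * torch.log(p_t))
```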
10. A SAR image target detection apparatus based on an intersection-over-union-focal (IoU-focal) loss function, comprising:
a feature extraction module, configured to perform feature extraction on an SAR image to be detected by using a pre-trained image processing network to obtain a multi-level semantic feature map;
a feature fusion module, configured to fuse the first N levels of semantic feature maps in the multi-level semantic feature map to obtain a shallow-level semantic feature map, and to fuse the last M levels of semantic feature maps in the multi-level semantic feature map to obtain a deep-level semantic feature map;
a target detection module, configured to perform large-size target detection based on the shallow-level semantic feature map and to perform small-size target detection based on the deep-level semantic feature map to obtain a target detection result;
wherein the image processing network is trained based on the IoU-focal loss function, and the IoU-focal loss function weights anchor boxes of different qualities by using a first adjustment factor.
CN202310807759.XA 2023-07-03 2023-07-03 SAR image target detection method and device for cross ratio-focus loss function Pending CN116758437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310807759.XA CN116758437A (en) 2023-07-03 2023-07-03 SAR image target detection method and device for cross ratio-focus loss function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310807759.XA CN116758437A (en) 2023-07-03 2023-07-03 SAR image target detection method and device for cross ratio-focus loss function

Publications (1)

Publication Number Publication Date
CN116758437A true CN116758437A (en) 2023-09-15

Family

ID=87956995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310807759.XA Pending CN116758437A (en) 2023-07-03 2023-07-03 SAR image target detection method and device for cross ratio-focus loss function

Country Status (1)

Country Link
CN (1) CN116758437A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036985A (en) * 2023-10-09 2023-11-10 武汉工程大学 Small target detection method and device for video satellite image
CN117036985B (en) * 2023-10-09 2024-02-06 武汉工程大学 Small target detection method and device for video satellite image
CN117593516A (en) * 2024-01-18 2024-02-23 苏州元脑智能科技有限公司 Target detection method, device, equipment and storage medium
CN117593516B (en) * 2024-01-18 2024-03-22 苏州元脑智能科技有限公司 Target detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN116758437A (en) SAR image target detection method and device for cross ratio-focus loss function
CN109271856B (en) Optical remote sensing image target detection method based on expansion residual convolution
CN110796048B (en) Ship target real-time detection method based on deep neural network
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
KR102263397B1 (en) Method for acquiring sample images for inspecting label among auto-labeled images to be used for learning of neural network and sample image acquiring device using the same
CN111160406A (en) Training method of image classification model, and image classification method and device
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN116258707A (en) PCB surface defect detection method based on improved YOLOv5 algorithm
CN111091095A (en) Method for detecting ship target in remote sensing image
CN111967464B (en) Weak supervision target positioning method based on deep learning
CN116485709A (en) Bridge concrete crack detection method based on YOLOv5 improved algorithm
WO2018000252A1 (en) Oceanic background modelling and restraining method and system for high-resolution remote sensing oceanic image
CN115019187B (en) Detection method, device, equipment and medium for SAR image ship target
CN109376580A (en) A kind of electric tower component identification method based on deep learning
CN109165603B (en) Ship detection method and device
CN116563726A (en) Remote sensing image ship target detection method based on convolutional neural network
CN111783819A (en) Improved target detection method based on region-of-interest training on small-scale data set
CN111539456B (en) Target identification method and device
CN113780277A (en) Training method and device of target detection model, electronic equipment and storage medium
CN112560614A (en) Remote sensing image target detection method and system based on candidate frame feature correction
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN117011274A (en) Automatic glass bottle detection system and method thereof
CN113610178A (en) Inland ship target detection method and device based on video monitoring image
CN114241402A (en) Sea surface oil spill detection method and device, electronic equipment and storage medium
CN112348750A (en) SAR image change detection method based on threshold fusion and neighborhood voting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination