CN113313128B - SAR image target detection method based on improved YOLOv3 network - Google Patents

SAR image target detection method based on improved YOLOv3 network

Info

Publication number: CN113313128B (application CN202110613778.XA; earlier publication CN113313128A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: SAR image, frame, network, target, box
Legal status: Active (granted)
Inventors: 蒋忠进, 王强, 曾祥书
Original and current assignee: Southeast University
Application CN202110613778.XA filed by Southeast University; granted and published as CN113313128B.

Classifications

  • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
  • G06F18/23213 — Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. k-means clustering
  • G06N3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
  • G06V10/44 — Image or video recognition; local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
  • G06V2201/07 — Indexing scheme relating to image or video recognition or understanding; target detection


Abstract

The invention discloses a SAR image target detection method based on an improved YOLOv3 network, which comprises the following steps: input a SAR image training data set into a Darknet53 network and perform feature extraction to obtain a basic feature map of each SAR image; corresponding to three grid divisions of the SAR image, perform feature extraction and feature fusion on the basic feature map at large, medium and small scales to obtain a multi-scale feature map of the SAR image; input the multi-scale feature map into a prediction network, and adjust and optimize the candidate-box parameters; substitute the optimized candidate-box parameters and the label-box parameters into a loss function, and calculate the loss value of the current network; based on the obtained loss value, update the network parameters through back propagation, and train repeatedly until the network parameters converge; input a SAR image test data set, run the network test, and output detection boxes matched one-to-one with targets, together with various detection indices.

Description

SAR image target detection method based on improved YOLOv3 network
Technical Field
The invention relates to a SAR image target detection method based on an improved YOLOv3 network, and belongs to the technical field of deep learning and machine vision.
Background
In recent years, artificial intelligence has developed rapidly and has been widely applied in fields such as the military, geophysical prospecting, medicine and urban planning, with good results. In particular, deep learning frameworks with the Convolutional Neural Network (CNN) at their core have shown strong capability in image processing and machine vision. CNNs excel not only at optical image processing but also at automatic SAR (Synthetic Aperture Radar) image interpretation, where they can perform target detection and recognition efficiently and accurately.
At present, target detection methods based on deep learning fall mainly into two types: two-stage detection models and one-stage detection models. Two-stage algorithms, chiefly Fast R-CNN, Faster R-CNN and the like, offer high precision but low detection speed, which makes real-time target detection difficult. One-stage detection is based on parameter regression and merges candidate-box generation with classification and regression into a single step; it mainly comprises the YOLO (You Only Look Once) series. The YOLO algorithm was proposed by Redmon et al. in 2015, and the YOLOv2 algorithm followed a year later, greatly reducing computational complexity and increasing detection speed. In April 2018 the authors introduced the YOLOv3 algorithm, again with significant improvements in accuracy and speed. However, when a conventional YOLOv3 network is used directly for target detection in SAR images, frequent false detections and missed detections occur. How to improve the YOLOv3 network so that it is better suited to SAR image target detection, thereby improving precision and recall, is a problem facing researchers.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide a SAR image target detection method based on an improved YOLOv3 network. To better optimize the candidate-box parameters, a k-means clustering method is adopted to generate 9 groups of prior anchor boxes, which serve as initial values of the candidate-box sizes. To better describe the goodness of fit between two bounding boxes, r_GIOU is introduced in place of the intersection-over-union r_IOU; the box regression loss and confidence loss are recalculated and the overall loss function is optimized, so that the network parameters are updated more effectively through back propagation.
To achieve this purpose, the invention adopts the following technical scheme:
A SAR image target detection method based on an improved YOLOv3 network comprises the following steps:
Step 1: prepare a SAR image training data set in which every known target carries label-box parameters, comprising: box center coordinates, box width and height, object category and confidence; input the SAR image training data set into a Darknet53 network and perform feature extraction on each SAR image to obtain its basic feature map;
Step 2: corresponding to the three grid divisions of the SAR image at large, medium and small sizes, perform feature extraction and feature fusion on the basic feature map at the three scales to obtain a multi-scale feature map of the SAR image; each SAR image grid cell is given candidate boxes of three sizes, each candidate box having the following parameters: box center coordinates, box width and height, object category and confidence;
Step 3: input the multi-scale feature map into a prediction network, and adjust and optimize the candidate-box parameters;
Step 4: substitute the optimized candidate-box parameters and the label-box parameters into a loss function, and calculate the loss value of the current network;
Step 5: based on the obtained loss value, update the network parameters by back propagation, and train repeatedly until the network parameters converge;
Step 6: input a SAR image test data set, run the network test, and output detection boxes matched one-to-one with targets, together with various detection indices.
In step 3, the candidate-box parameters must be initialized before they are optimized; the initial values of the box width and height parameters are computed as follows:
(1) Align the centers of all label boxes in the training data set, and list the size data of all label boxes;
(2) Cluster the label-box sizes with the k-means clustering method, using the distance metric d computed as:

d(A, B) = 1 − r_IOU(A, B)

r_IOU(A, B) = |A ∩ B| / |A ∪ B|

where A and B are two different bounding boxes, r_IOU denotes the intersection-over-union operation, r_IOU(A, B) is the intersection-over-union between bounding boxes A and B, ∩ denotes the intersection operation, and ∪ denotes the union operation;
(3) Extract the 9 most densely populated sizes to obtain 9 groups of prior anchor boxes as initial values of the candidate-box sizes; grids exist at three scales (large, medium and small), each scale of grid is given candidate boxes of 3 sizes, so there are candidate boxes of 9 sizes in total.
In step 4, the loss function is expressed as follows:

l_total = l_box + l_cls + l_obj

where l_box denotes the box regression loss, l_cls the classification loss, and l_obj the confidence loss.

The box regression loss l_box is expressed as follows:

l_box = λ_coord · Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} I_{i,j}^{obj} · (2 − ŵ_i · ĥ_i) · (1 − r_GIOU(B_{i,j}, B̂_i))

where λ_coord is a weight factor; each image is divided into S × S grids, and each grid has J candidate boxes; ŵ_i and ĥ_i are respectively the width and height of the label box corresponding to the i-th grid; I_{i,j}^{obj} indicates whether the j-th candidate box of the i-th grid contains a target, taking the value 1 if it does and 0 otherwise; r_GIOU(B_{i,j}, B̂_i) denotes the r_GIOU value between B_{i,j} and B̂_i, where B_{i,j} is the j-th candidate box of the i-th grid and B̂_i is the label box corresponding to the i-th grid.
The classification loss l_cls is expressed as follows:

l_cls = Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} I_{i,j}^{obj} · f_etrp(p_{i,j}, p̂_i)

where p_{i,j} is the probability that the j-th candidate box of the i-th grid is predicted to contain a target, p̂_i is the probability that the label box corresponding to the i-th grid contains a target, and f_etrp() denotes the binary cross-entropy function.
The confidence loss l_obj is expressed as follows:

l_obj = Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} [ I_{i,j}^{obj} · f_etrp(C_{i,j}, Ĉ_i) + λ_noobj · I_{i,j}^{noobj} · f_etrp(C_{i,j}, Ĉ_i) ]

where λ_noobj is a weight factor; I_{i,j}^{noobj} indicates whether the j-th candidate box of the i-th grid contains no target, taking the value 1 if it contains no target and 0 otherwise; C_{i,j} is the confidence score of the j-th candidate box of the i-th grid, and Ĉ_i is the confidence score of the label box corresponding to the i-th grid.
The above r_GIOU is calculated as follows:

r_GIOU(A, B) = r_IOU(A, B) − |D \ (A ∪ B)| / |D|

where D is the smallest closed convex region containing bounding box A and bounding box B, i.e. the smallest box that can enclose them both; D \ (A ∪ B) denotes the part of D that remains after removing the union of A and B.
the above-mentioned binary cross entropy function f etrp () Is shown below:
Figure BDA00030971534000000313
Where u is the candidate frame parameter,
Figure BDA00030971534000000314
is the label box parameter.
The above confidence score C_{i,j} is calculated as follows:

C_{i,j} = p_{i,j} × r_GIOU(B_{i,j}, B̂_i)

where p_{i,j} is the probability that the j-th candidate box of the i-th grid is predicted to contain a target, and r_GIOU(B_{i,j}, B̂_i) is the r_GIOU value between B_{i,j} and B̂_i, where B_{i,j} denotes the j-th candidate box of the i-th grid and B̂_i the label box corresponding to the i-th grid.
The improved YOLOv3 network of the invention uses the k-means clustering method, with 1 − r_IOU as the distance measure, to cluster the label-box sizes in the SAR image training data set, obtaining 9 groups of prior anchor boxes as initial values for the candidate-box size optimization.
The improved YOLOv3 network of the invention improves the box regression loss function and the confidence loss function, thereby optimizing the overall loss function.
The improved YOLOv3 network of the invention introduces an r_GIOU-based similarity measure to update the calculation of the confidence score, which better captures the goodness of fit between two bounding boxes.
Beneficial effects: compared with the conventional YOLOv3 network, the improved YOLOv3 network provided by the invention converges faster in training. In terms of SAR image target detection indices, it produces fewer false detections and missed detections, and is better suited to target detection when the image background is complex and target sizes vary.
Drawings
FIG. 1 is a block diagram of an improved YOLOv3 network architecture;
FIG. 2 is a comparison of false alarms in ship target detection in a harbor scene; (a) detection result of the conventional YOLOv3 network; (b) detection result of the improved YOLOv3 network;
FIG. 3 is a comparison of missed detections in ship target detection in a harbor scene; (a) detection result of the conventional YOLOv3 network; (b) detection result of the improved YOLOv3 network;
FIG. 4 is a comparison of the loss function curves of the conventional YOLOv3 network and the improved YOLOv3 network.
Detailed Description
The invention will be further elucidated below with reference to the drawings and specific embodiments. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope. After reading this disclosure, various equivalent modifications made by those skilled in the art all fall within the scope defined by the appended claims of this application.
The invention discloses a SAR image target detection method based on an improved YOLOv3 network, whose structure is shown in FIG. 1. The SAR image training data set is input into a Darknet53 network, which performs feature extraction on each SAR image to obtain its basic feature map. Feature extraction and feature fusion are applied to the basic feature map at three scales (large, medium and small) to obtain a multi-scale feature map. The multi-scale feature map is input into a prediction network, and the candidate-box parameters are adjusted and optimized. The optimized candidate-box parameters and the label-box parameters are substituted into a loss function to compute the loss value of the current network. Based on the obtained loss value, the network parameters are updated through back propagation, and training is repeated until the network parameters converge. Finally, a SAR image test data set is input for the network test, which outputs detection boxes matched one-to-one with targets, together with various detection indices.
The following specific embodiment takes SAR image ship target detection as a concrete example; the specific steps are as follows:
Step 1: prepare a SAR image training data set, resizing all images to 416 × 416; attach label-box parameters to each known target in the training data set, comprising: box center coordinates, box width and height, object category and confidence; input the SAR image training data set into a Darknet53 network and perform feature extraction on each SAR image to obtain its basic feature map.
Step 2: corresponding to the three grid divisions of the SAR image (13 × 13, 26 × 26 and 52 × 52), perform feature extraction and feature fusion on the basic feature map at the large, medium and small scales to obtain a multi-scale feature map of the SAR image; each SAR image grid cell is given candidate boxes of three sizes, each candidate box having the following parameters: box center coordinates, box width and height, object category and confidence.
Step 3: input the multi-scale feature map into the prediction network, and adjust and optimize the candidate-box parameters.
Before the candidate-box parameters are optimized they must be initialized; the initial values of the box width and height are computed as follows (a minimal clustering sketch is given after this list):
1) Align the centers of all label boxes in the training data set, and list the size data of all label boxes.
2) Cluster the label-box sizes with the k-means clustering method, using the distance metric d computed as:

d(A, B) = 1 − r_IOU(A, B)

r_IOU(A, B) = |A ∩ B| / |A ∪ B|

where A and B are two different bounding boxes, r_IOU denotes the intersection-over-union operation, r_IOU(A, B) is the intersection-over-union between bounding boxes A and B, ∩ denotes the intersection operation, and ∪ denotes the union operation.
3) Extract the 9 most densely populated sizes to obtain 9 groups of prior anchor boxes serving as initial values of the candidate-box sizes. Grids exist at three scales (large, medium and small), each scale of grid needs candidate boxes of 3 sizes, so there are 9 candidate-box sizes in total.
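To make steps 1)–3) concrete, the following is a minimal NumPy sketch of k-means over center-aligned label-box sizes with d = 1 − r_IOU as the distance. It is only an illustration of the clustering technique, not the patent's implementation: the function names are invented for this sketch, and the box sizes are random stand-ins for a real list of training label boxes.

```python
import numpy as np

def iou_centered(wh, centers):
    """IoU between center-aligned boxes given only (width, height).

    wh:      (N, 2) label-box sizes
    centers: (k, 2) current cluster centers
    returns: (N, k) IoU matrix
    """
    inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
             np.minimum(wh[:, None, 1], centers[None, :, 1]))
    union = ((wh[:, 0] * wh[:, 1])[:, None] +
             (centers[:, 0] * centers[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """k-means with distance d(A, B) = 1 - IoU(A, B), as described above."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_centered(wh, centers), axis=1)  # min distance = max IoU
        new_centers = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sort by area

# Stand-in label-box sizes (pixels); a real run would list every training label box.
wh = np.random.default_rng(1).uniform(8, 200, size=(500, 2))
anchors = kmeans_anchors(wh)
# Smallest 3 anchors go to the 52x52 grid, middle 3 to 26x26, largest 3 to 13x13.
print(anchors.reshape(3, 3, 2))
```

Because the boxes are center-aligned, the IoU depends only on widths and heights. The 9 resulting anchors, sorted by area, would then be split three per scale, smallest to the finest grid and largest to the coarsest.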
Step 4: substitute the optimized candidate-box parameters and the label-box parameters into the loss function, and calculate the loss value of the current network.
The loss function used is expressed as follows:

l_total = l_box + l_cls + l_obj

where l_box denotes the box regression loss, l_cls the classification loss, and l_obj the confidence loss.

The box regression loss l_box is expressed as follows:

l_box = λ_coord · Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} I_{i,j}^{obj} · (2 − ŵ_i · ĥ_i) · (1 − r_GIOU(B_{i,j}, B̂_i))

where λ_coord is a weight factor; each image is divided into S × S grids, and each grid has J candidate boxes; ŵ_i and ĥ_i are respectively the width and height of the label box corresponding to the i-th grid; I_{i,j}^{obj} indicates whether the j-th candidate box of the i-th grid contains a target, taking the value 1 if it does and 0 otherwise; r_GIOU(B_{i,j}, B̂_i) denotes the r_GIOU value between B_{i,j} and B̂_i, where B_{i,j} is the j-th candidate box of the i-th grid and B̂_i is the label box corresponding to the i-th grid.
The classification loss l_cls is expressed as follows:

l_cls = Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} I_{i,j}^{obj} · f_etrp(p_{i,j}, p̂_i)

where p_{i,j} is the probability that the j-th candidate box of the i-th grid is predicted to contain a target, p̂_i is the probability that the label box corresponding to the i-th grid contains a target, and f_etrp() denotes the binary cross-entropy function.
The confidence loss l_obj is expressed as follows:

l_obj = Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} [ I_{i,j}^{obj} · f_etrp(C_{i,j}, Ĉ_i) + λ_noobj · I_{i,j}^{noobj} · f_etrp(C_{i,j}, Ĉ_i) ]

where λ_noobj is a weight factor; I_{i,j}^{noobj} indicates whether the j-th candidate box of the i-th grid contains no target, taking the value 1 if it contains no target and 0 otherwise; C_{i,j} is the confidence score of the j-th candidate box of the i-th grid, and Ĉ_i is the confidence score of the label box corresponding to the i-th grid.
The above r_GIOU is calculated as follows:

r_GIOU(A, B) = r_IOU(A, B) − |D \ (A ∪ B)| / |D|

where D is the smallest closed convex region containing bounding box A and bounding box B, i.e. the smallest box that can enclose them both; D \ (A ∪ B) denotes the part of D that remains after removing the union of A and B. A sketch of this computation for axis-aligned boxes follows.
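For axis-aligned boxes, the smallest closed convex region D is simply the smallest enclosing rectangle, so r_GIOU can be computed directly. Below is a minimal pure-Python sketch; the (x1, y1, x2, y2) box format and function name are assumptions of this illustration, not mandated by the patent.

```python
def r_iou_giou(a, b):
    """IoU and GIoU of two axis-aligned boxes a, b = (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # D: smallest axis-aligned box enclosing both a and b
    dx1, dy1 = min(a[0], b[0]), min(a[1], b[1])
    dx2, dy2 = max(a[2], b[2]), max(a[3], b[3])
    area_d = (dx2 - dx1) * (dy2 - dy1)
    giou = iou - (area_d - union) / area_d   # r_GIOU = r_IOU - |D \ (A∪B)| / |D|
    return iou, giou

# Disjoint boxes still get a meaningful (negative) GIoU, unlike IoU, which is 0:
print(r_iou_giou((0, 0, 2, 2), (3, 3, 5, 5)))   # -> (0.0, -0.68)
```

Two disjoint boxes always have r_IOU = 0 regardless of how far apart they are, whereas r_GIOU keeps decreasing toward −1 with distance, which is what makes it a better goodness-of-fit measure for both the loss and the confidence score.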
The above binary cross-entropy function f_etrp() is expressed as follows:

f_etrp(u, û) = −[û · ln(u) + (1 − û) · ln(1 − u)]

where u is the candidate-box parameter and û is the corresponding label-box parameter.
The above confidence score C_{i,j} is calculated as follows:

C_{i,j} = p_{i,j} × r_GIOU(B_{i,j}, B̂_i)

where the symbols are as defined above. A worked numerical illustration of the loss terms and this score follows.
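As a worked illustration of the three loss terms and the confidence score for a single grid cell, consider the sketch below. It is an assumption-laden toy: the weight factors λ_coord = 5 and λ_noobj = 0.5 are conventional YOLO values not specified in the patent, the coordinates are normalized, and giou stands in for the r_GIOU value computed as in the previous sketch.

```python
import math

def f_etrp(u, u_hat, eps=1e-9):
    """Binary cross entropy: f_etrp(u, u_hat) = -[u_hat*ln(u) + (1-u_hat)*ln(1-u)]."""
    u = min(max(u, eps), 1.0 - eps)           # clamp to avoid log(0)
    return -(u_hat * math.log(u) + (1.0 - u_hat) * math.log(1.0 - u))

# One (i, j) pair where the candidate box contains a target (I_obj = 1, I_noobj = 0).
w_hat, h_hat = 0.30, 0.36       # normalized width/height of the label box B_hat_i
giou = 0.78                     # r_GIOU(B_ij, B_hat_i), e.g. from r_iou_giou() above
p_ij = 0.90                     # predicted probability that B_ij contains a target
c_ij = p_ij * giou              # confidence score C_ij = p_ij * r_GIOU(B_ij, B_hat_i)

lambda_coord, lambda_noobj = 5.0, 0.5   # assumed weight factors, not from the patent
i_obj, i_noobj = 1, 0

l_box = lambda_coord * i_obj * (2 - w_hat * h_hat) * (1 - giou)
l_cls = i_obj * f_etrp(p_ij, 1.0)                    # label box contains a target
l_obj = i_obj * f_etrp(c_ij, 1.0) + lambda_noobj * i_noobj * f_etrp(c_ij, 0.0)
print(f"l_box={l_box:.4f}  l_cls={l_cls:.4f}  l_obj={l_obj:.4f}")
```

Summing such contributions over all S² grid cells and J candidate boxes gives l_total = l_box + l_cls + l_obj for the image.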
Step 5: update the network parameters by back propagation based on the loss value, and train repeatedly until the network parameters converge.
Step 6: prepare a SAR image test data set, resizing all images to 416 × 416, for testing the improved YOLOv3 network. Input the test data set into the network and process it until candidate boxes with optimized parameters are obtained. Using the non-maximum suppression (NMS) algorithm, delete candidate boxes that belong to the same target class but have high overlap and lower confidence, obtaining detection boxes matched one-to-one with the targets. Finally, compute detection indices such as recall and precision, and output the detection results.
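A minimal pure-Python sketch of this class-wise NMS step is given below. The 0.5 IoU threshold matches the setting used in this embodiment; the box format and names are this sketch's own, not the patent's code.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes a, b = (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression within one target class.

    Keeps the highest-scoring boxes and drops any box whose IoU with an
    already-kept box exceeds iou_thresh; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Three overlapping ship detections; the weaker duplicate is suppressed.
boxes = [(10, 10, 50, 40), (12, 11, 52, 42), (200, 80, 240, 110)]
scores = [0.92, 0.85, 0.77]
print(nms(boxes, scores))   # -> [0, 2]
```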
The performance evaluation indices include the precision r_P, the recall r_R, and their harmonic mean F_1, calculated respectively as:

r_P = N_TP / (N_TP + N_FP)

r_R = N_TP / (N_TP + N_FN)

F_1 = 2 · r_P · r_R / (r_P + r_R)

where N_TP is the number of correctly detected targets, N_FP the number of falsely detected targets, and N_FN the number of missed targets.
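These three indices reduce to simple counts; a minimal sketch (the counts below are illustrative, not results from the patent):

```python
def detection_metrics(n_tp, n_fp, n_fn):
    """Precision r_P, recall r_R and harmonic mean F1 from detection counts."""
    r_p = n_tp / (n_tp + n_fp)          # r_P = N_TP / (N_TP + N_FP)
    r_r = n_tp / (n_tp + n_fn)          # r_R = N_TP / (N_TP + N_FN)
    f1 = 2 * r_p * r_r / (r_p + r_r)    # F1 = 2 r_P r_R / (r_P + r_R)
    return r_p, r_r, f1

# Illustrative counts: 90 ships found, 5 false alarms, 8 ships missed.
print(detection_metrics(n_tp=90, n_fp=5, n_fn=8))
```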
Embodiment:
In this embodiment, the high-resolution SAR ship target detection data set AIR-SARShip-2.0 is used to verify the improved YOLOv3 network proposed by the invention. The data set, released by the Journal of Radars in 2020, comprises 300 SAR images in total, with image resolutions of 1 m and 3 m. The images are about 1000 × 1000 pixels, in single-channel Tiff format with 8/16-bit depth, and each annotation file provides the length and width of the corresponding image, the class of each labeled target, and the position of its labeled rectangular box. The SAR images in the data set cover different scenes such as harbors, island reefs and open sea, and include different ship targets such as transport ships, oil tankers and fishing vessels; each image contains a varying number of ship targets.
In this embodiment, the data set is first expanded using methods such as flipping, translation and brightness adjustment, yielding 1500 SAR images in total, and a corresponding label file is created for each SAR image. 1200 randomly drawn SAR images are used as the training data set, and the remaining 300 as the test data set.
The network parameters of the improved YOLOv3 network are then set. The model is initialized with COCO pretrained weights; the moving-average decay rate is set to 0.9995, the initial learning rate to 1e-4 and the final learning rate to 1e-6; during upsampling, feature maps are enlarged by a factor of two using nearest-neighbor interpolation. The batch size is set to 6 and training is performed in two stages: the first stage trains part of the network for 20 epochs, and the second stage trains the whole network for 30 epochs. The IoU threshold is set to 0.5, and the number of classes is 1, i.e. the ship class. A schematic sketch of this schedule follows.
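The hyperparameter settings above can be laid out schematically as below. This is a hedged sketch, not the patent's training code: the decay from 1e-4 to 1e-6 is assumed linear over the 50 total epochs (the patent does not state the schedule shape), and train_one_epoch is a hypothetical stub.

```python
def lr_at(epoch, total_epochs=50, lr_init=1e-4, lr_final=1e-6):
    """Assumed linear interpolation between the stated initial and final rates."""
    t = epoch / (total_epochs - 1)
    return lr_init + t * (lr_final - lr_init)

def train_one_epoch(stage, epoch, lr):
    # Hypothetical stub standing in for one epoch of forward/backward passes.
    print(f"stage={stage:6s} epoch={epoch:2d} lr={lr:.2e} batch_size=6 ema_decay=0.9995")

# Two-stage schedule from the embodiment: 20 epochs of local (partial-network)
# training followed by 30 epochs of global (whole-network) training.
for epoch in range(20):
    train_one_epoch("local", epoch, lr_at(epoch))
for epoch in range(20, 50):
    train_one_epoch("global", epoch, lr_at(epoch))
```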
The experimental results show that the precision of the improved YOLOv3 network is 1.4 percentage points higher than that of the conventional YOLOv3 network, indicating that the improved network reduces the false detection rate. FIG. 2 shows the ship target detection results of the two networks against a complex harbor background, where (a) is the detection result of the conventional YOLOv3 network and (b) is that of the improved YOLOv3 network. It can be seen that, in the same scene, the conventional YOLOv3 network produces two false alarms while the improved YOLOv3 network produces none.
In addition, the recall of the improved YOLOv3 network is 1.17 percentage points higher than that of the conventional YOLOv3 network, indicating a lower missed-detection rate and a stronger ability to detect small ship targets that are easily missed. As shown in FIG. 3, panel (a) shows the detection result of the conventional YOLOv3 network and panel (b) that of the improved YOLOv3 network; the circle in panel (a) marks a ship target that was not detected, which is detected in panel (b).
FIG. 4 shows the loss function curves obtained when training and testing with the conventional YOLOv3 network and the improved YOLOv3 network. It can be seen that the improved YOLOv3 network converges faster than the conventional one, and its loss function settles at a lower value after stabilizing.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these shall also be regarded as falling within the scope of the invention.

Claims (4)

1. A SAR image target detection method based on an improved YOLOv3 network, characterized in that the method comprises the following steps:
Step 1: prepare a SAR image training data set in which every known target carries label-box parameters, comprising: box center coordinates, box width and height, object category and confidence; input the SAR image training data set into a Darknet53 network and perform feature extraction on each SAR image to obtain its basic feature map;
Step 2: corresponding to the three grid divisions of the SAR image at large, medium and small sizes, perform feature extraction and feature fusion on the basic feature map at the three scales to obtain a multi-scale feature map of the SAR image; each SAR image grid cell is given candidate boxes of three sizes, each candidate box having the following parameters: box center coordinates, box width and height, object category and confidence;
Step 3: input the multi-scale feature map into a prediction network, and adjust and optimize the candidate-box parameters;
Step 4: substitute the optimized candidate-box parameters and the label-box parameters into a loss function, and calculate the loss value of the current network;
Step 5: based on the obtained loss value, update the network parameters by back propagation, and train repeatedly until the network parameters converge;
Step 6: input a SAR image test data set, run the network test, and output detection boxes matched one-to-one with targets, together with various detection indices.
2. The SAR image target detection method based on the improved YOLOv3 network according to claim 1, characterized in that: in step 3, the candidate-box parameters must be initialized before they are optimized; the initial values of the box width and height parameters are computed as follows:
(1) Align the centers of all label boxes in the training data set, and list the size data of all label boxes;
(2) Cluster the label-box sizes with the k-means clustering method, using the distance metric d computed as:

d(A, B) = 1 − r_IOU(A, B)

r_IOU(A, B) = |A ∩ B| / |A ∪ B|

where A and B are two different bounding boxes, r_IOU denotes the intersection-over-union operation, r_IOU(A, B) is the intersection-over-union between bounding boxes A and B, ∩ denotes the intersection operation, and ∪ denotes the union operation;
(3) Extract the 9 most densely populated sizes to obtain 9 groups of prior anchor boxes as initial values of the candidate-box sizes; grids exist at three scales (large, medium and small), each scale of grid is given candidate boxes of 3 sizes, so there are candidate boxes of 9 sizes in total.
3. The SAR image target detection method based on the improved YOLOv3 network according to claim 1, characterized in that: in step 4, the loss function is expressed as follows:

l_total = l_box + l_cls + l_obj

where l_box denotes the box regression loss, l_cls the classification loss, and l_obj the confidence loss;

the box regression loss l_box is expressed as follows:

l_box = λ_coord · Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} I_{i,j}^{obj} · (2 − ŵ_i · ĥ_i) · (1 − r_GIOU(B_{i,j}, B̂_i))

where λ_coord is a weight factor; each image is divided into S × S grids, and each grid has J candidate boxes; ŵ_i and ĥ_i are respectively the width and height of the label box corresponding to the i-th grid; I_{i,j}^{obj} indicates whether the j-th candidate box of the i-th grid contains a target, taking the value 1 if it does and 0 otherwise; r_GIOU(B_{i,j}, B̂_i) denotes the r_GIOU value between B_{i,j} and B̂_i, where B_{i,j} is the j-th candidate box of the i-th grid and B̂_i is the label box corresponding to the i-th grid;

the classification loss l_cls is expressed as follows:

l_cls = Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} I_{i,j}^{obj} · f_etrp(p_{i,j}, p̂_i)

where p_{i,j} is the probability that the j-th candidate box of the i-th grid is predicted to contain a target, p̂_i is the probability that the label box corresponding to the i-th grid contains a target, and f_etrp() denotes the binary cross-entropy function;

the confidence loss l_obj is expressed as follows:

l_obj = Σ_{i=0}^{S²−1} Σ_{j=0}^{J−1} [ I_{i,j}^{obj} · f_etrp(C_{i,j}, Ĉ_i) + λ_noobj · I_{i,j}^{noobj} · f_etrp(C_{i,j}, Ĉ_i) ]

where λ_noobj is a weight factor; I_{i,j}^{noobj} indicates whether the j-th candidate box of the i-th grid contains no target, taking the value 1 if it contains no target and 0 otherwise; C_{i,j} is the confidence score of the j-th candidate box of the i-th grid, and Ĉ_i is the confidence score of the label box corresponding to the i-th grid;

the above r_GIOU is calculated as follows:

r_GIOU(A, B) = r_IOU(A, B) − |D \ (A ∪ B)| / |D|

where D is the smallest closed convex region containing bounding box A and bounding box B, i.e. the smallest box that can enclose them both, and D \ (A ∪ B) denotes the part of D that remains after removing the union of A and B;

the above binary cross-entropy function f_etrp() is expressed as follows:

f_etrp(u, û) = −[û · ln(u) + (1 − û) · ln(1 − u)]

where u is the candidate-box parameter and û is the corresponding label-box parameter.
4. The SAR image target detection method based on the improved YOLOv3 network according to claim 3, characterized in that: the confidence score C_{i,j} is calculated as follows:

C_{i,j} = p_{i,j} × r_GIOU(B_{i,j}, B̂_i)

where p_{i,j} is the probability that the j-th candidate box of the i-th grid is predicted to contain a target; r_GIOU(B_{i,j}, B̂_i) denotes the r_GIOU value between B_{i,j} and B̂_i, where B_{i,j} is the j-th candidate box of the i-th grid, and B̂_i is the label box corresponding to the i-th grid.
CN202110613778.XA (priority 2021-06-02, filed 2021-06-02) — SAR image target detection method based on improved YOLOv3 network — Active — granted as CN113313128B (en)

Priority Applications / Applications Claiming Priority (1)

CN202110613778.XA — priority 2021-06-02, filed 2021-06-02 — SAR image target detection method based on improved YOLOv3 network

Publications (2)

CN113313128A — published 2021-08-27
CN113313128B — granted 2022-10-28

Family

ID: 77377202

Family Applications (1)

CN202110613778.XA — SAR image target detection method based on improved YOLOv3 network — priority 2021-06-02, filed 2021-06-02 — status: Active

Country Status (1)

CN — CN113313128B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party

CN114092739B * — priority 2021-11-02, granted 2023-06-30 — 北京百度网讯科技有限公司 — Image processing method, apparatus, device, storage medium, and program product
CN113903009B * — priority 2021-12-10, granted 2022-07-05 — 华东交通大学 — Railway foreign matter detection method and system based on improved YOLOv3 network

Patent Citations (2)

* Cited by examiner, † Cited by third party

CN111401148A * — priority 2020-02-27, published 2020-07-10 — 江苏大学 — Road multi-target detection method based on improved multilevel YOLOv3
CN111723748A * — priority 2020-06-22, published 2020-09-29 — 电子科技大学 — Infrared remote sensing image ship detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

张筱晗 et al., "双向特征融合的数据自适应SAR图像舰船目标检测模型" [Data-adaptive SAR image ship target detection model with bidirectional feature fusion], 《中国图象图形学报》 [Journal of Image and Graphics], No. 09, 2020-09-16, full text *
刘洁瑜 et al., "基于RetinaNet的SAR图像舰船目标检测" [SAR image ship target detection based on RetinaNet], 《湖南大学学报(自然科学版)》 [Journal of Hunan University (Natural Sciences)], No. 02, 2020-02-25, full text *

Also Published As

CN113313128A — published 2021-08-27

Similar Documents

Publication Publication Date Title
US11429818B2 (en) Method, system and device for multi-label object detection based on an object detection network
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN109871902B (en) SAR small sample identification method based on super-resolution countermeasure generation cascade network
Mahmoud et al. Object detection using adaptive mask RCNN in optical remote sensing images
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN113313128B (en) SAR image target detection method based on improved YOLOv3 network
CN109033944B (en) Method and system for classifying all-sky aurora images and positioning key local structure
CN110991257B (en) Polarized SAR oil spill detection method based on feature fusion and SVM
CN112507896B (en) Method for detecting cherry fruits by adopting improved YOLO-V4 model
CN111931953A (en) Multi-scale characteristic depth forest identification method for waste mobile phones
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN115019039A (en) Example segmentation method and system combining self-supervision and global information enhancement
CN116071389A (en) Front background matching-based boundary frame weak supervision image segmentation method
Zhang et al. Nearshore vessel detection based on Scene-mask R-CNN in remote sensing image
CN113657414B (en) Object identification method
Li et al. Insect detection and counting based on YOLOv3 model
Chen et al. Ship detection with optical image based on attention and loss improved YOLO
Yang et al. SAR image target detection and recognition based on deep network
Zhang et al. Traffic Sign Detection and Recognition Based on Deep Learning.
Li et al. Exploring label probability sequence to robustly learn deep convolutional neural networks for road extraction with noisy datasets
CN117218545A (en) LBP feature and improved Yolov 5-based radar image detection method
CN117173547A (en) Underwater target detection method based on improved YOLOv6 algorithm
CN116385876A (en) Optical remote sensing image ground object detection method based on YOLOX
Wang et al. FPA-DNN: a forward propagation acceleration based deep neural network for ship detection
CN114283323A (en) Marine target recognition system based on image deep learning

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant