CN113469073B - SAR image ship detection method and system based on lightweight deep learning - Google Patents


Info

Publication number
CN113469073B
CN113469073B (application CN202110765081.4A)
Authority
CN
China
Prior art keywords
model
yolov5s
training
module
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110765081.4A
Other languages
Chinese (zh)
Other versions
CN113469073A (en)
Inventor
陈潇钰
侯彪
焦李成
张丹
马文萍
马晶晶
王爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110765081.4A priority Critical patent/CN113469073B/en
Publication of CN113469073A publication Critical patent/CN113469073A/en
Application granted granted Critical
Publication of CN113469073B publication Critical patent/CN113469073B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a SAR image ship detection method and system based on lightweight deep learning. A large-size SAR image is preprocessed and training samples are selected; the Ghost module and GhostBottleneck are introduced to upgrade YOLOv5s, yielding a preliminary lightweight YOLOv5s model; on this basis, the model is further lightened using the traditional model-lightweighting algorithms of network pruning and knowledge distillation; the lightweight YOLOv5s model is accelerated with the TensorRT inference optimizer and deployed on the NVIDIA Jetson TX2; the large-size SAR image to be detected is diced, and the sub-images are fed into the model in sequence to complete detection; the detection results are then stitched together, and non-maximum suppression (NMS) is used to screen the prediction boxes on the final large-size SAR image. On the premise of an acceptable accuracy loss, the method compresses the model's parameter count and floating-point operations and improves detection speed.

Description

SAR image ship detection method and system based on lightweight deep learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a SAR image ship detection method and system based on lightweight deep learning.
Background
Since AlexNet came to prominence in 2012, deep convolutional neural networks have risen rapidly in the field of computing. Deeper models usually have stronger nonlinear expressive power and can accomplish more complex transformations, and therefore fit more complex features. Under this assumption, deep convolutional neural networks have developed toward deeper and wider architectures. Although they show excellent performance on various tasks, the volume of the network models grows accordingly, which conflicts with the hardware conditions of today's mobile-terminal embedded devices, so many research results of deep convolutional neural networks remain on the shelf and cannot be deployed in practice. Mobile-terminal devices have developed as quickly as deep neural networks, yet such devices usually have no high-performance graphics processing unit (GPU) computing clusters and rely only on a central processing unit (CPU) to complete computing tasks; they cannot provide the storage space and computing power that large convolutional neural networks, which extract highly expressive deep features, require. This seriously hinders the development and application of deep convolutional networks on portable devices, and has strongly motivated academic and industrial researchers to study network-model lightweighting algorithms, so that artificial intelligence research can be industrialized and the image-processing performance and efficiency of portable devices can be improved.
Existing network lightweighting methods fall into two main categories: model compression and compact model design. Model compression targets the structure and parameters of a neural network, reducing the model's demands on storage devices and computing resources so as to meet the memory and compute constraints of portable mobile devices. It aims at the redundant parts of the network structure and network weights, sacrificing accuracy to a certain extent in exchange for a less redundant, faster and more streamlined model. Algorithms proposed so far include network pruning, model quantization, binarization methods, low-rank decomposition and knowledge distillation. Because the redundancy of each layer of a deep neural network differs, traditional model compression algorithms are often fitted to a specific model; manually exploring a compression algorithm adapted to the per-layer redundancy of every model is time-consuming and laborious, which has promoted the development of automated machine learning (AutoML): locally optimal network hyper-parameters and architectures are learned and explored automatically, avoiding human interference, so that the compression algorithm generalizes across models. Based on AutoML, Xi'an Jiaotong University and a Google research team proposed the automatic model compression algorithm AMC, which introduces reinforcement learning into model compression and, compared with traditional rule-based compression strategies, achieves a higher compression ratio while maintaining network performance. A series of compact models such as Xception, MobileNetV1, MobileNetV2, MobileNetV3, ShuffleNet and ShuffleNetV2 have also been proposed in recent years. These network models typically reduce convolution-kernel redundancy, compress the number of channels, and replace traditional convolutions with efficient convolution modules. Using several small convolution kernels in a convolution layer reduces kernel redundancy and effectively reduces network parameters. The Fire module proposed in SqueezeNet consists of a squeeze layer and an expand layer; reducing the number of 1×1 convolution kernels in the squeeze layer reduces the number of input channels of the 3×3 convolution kernels. MobileNetV1 uses depthwise separable convolution to decompose an ordinary convolution into a depthwise convolution and a pointwise convolution; ShuffleNet further proposes the channel shuffle operation and grouped pointwise convolution, rearranging features so that feature information flows across channel groups; MobileNetV2 proposes the inverted residual block; MobileNetV3 uses neural architecture search (NAS), introduces the SE (squeeze-and-excitation) module, and selects the H-swish activation function to further compress the model.
These excellent lightweight network models achieve good compression and acceleration with little loss of accuracy; the depthwise separable convolution at the heart of several of them is sketched below.
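For concreteness, a depthwise separable convolution of the kind used by MobileNetV1 can be sketched in PyTorch as follows. This is an illustrative sketch of the general technique, not part of the claimed method; the channel sizes in the comment and the ReLU activation are arbitrary choices:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A standard 3x3 conv factored into a depthwise conv plus a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch), no channel mixing.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        # Pointwise: a 1x1 convolution that mixes channel information.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# For in_ch=128, out_ch=256: a plain 3x3 conv has 128*256*9 = 294,912 weights,
# while the factored form has 128*9 + 128*256 = 33,920, roughly an 8.7x reduction.
```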
Object detection, also called object-class detection or object classification detection, returns the class and position information of the objects of interest in an image. It has been a research hotspot in computer vision and digital image processing for the last twenty years. Before AlexNet in 2012, well-known target detection methods based on traditional handcrafted features included V-J detection, HOG detection, and DPM detection combined with bounding-box regression. After 2012, with the rise of convolutional neural networks and the exponential growth of GPU performance, deep learning developed explosively and target detection entered its deep-learning period. Deep-learning-based detection algorithms can be divided into single-stage (one-stage) and two-stage algorithms, according to whether a pre-selection box must be generated. Representative single-stage networks are the YOLO series, SSD and RetinaNet; they are characterized by lower detection accuracy but high detection speed. Typical two-stage networks are R-CNN, SPP-Net, Fast R-CNN and Faster R-CNN; unlike single-stage algorithms, two-stage detection has high accuracy but high time cost. Even the best current detection algorithms can hardly match human vision, and target detection still faces many challenges. Regarding the demand for high accuracy: objects of the same class vary in texture, color and material; object instances vary in pose and deformation; and changes in the sampling environment together with image noise all affect an algorithm's robustness to intra-class variation, while inter-class distinguishability is generally determined by the similarity between classes and the diversity of the classes. Regarding the demand for efficiency in time and memory: the richness of natural categories, the dual nature of the detection task, which comprises localization and classification, and the ever-growing volume of image data place higher requirements on current detection algorithms and remain active areas of research.
Big-data-based target detection in high-resolution images has long been a hot research direction in remote sensing image processing. Traditional detection and recognition methods cannot adapt to the massive data of remote sensing images: a large number of image features must be designed manually, which brings extremely high time cost and demands deep professional knowledge and understanding of the data from researchers, and finding an efficient classifier that fully understands the data is like fishing a needle out of the sea. In contrast, deep learning's powerful high-level (more abstract and semantically meaningful) feature representation and learning ability can provide an efficient framework for target extraction in images. Related research includes vehicle detection, ship detection, crop detection, and detection of buildings and other ground features.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a SAR image ship detection method based on lightweight deep learning, which deploys a lightweight target detection network on the embedded device NVIDIA Jetson TX2 to realize ship detection in large-size SAR images. The target detection network YOLOv5 is taken as the baseline network, and the baseline network is lightened by combining traditional model compression algorithms with the Ghost lightweight module.
The invention adopts the following technical scheme:
a SAR image ship detection method based on lightweight deep learning comprises the following steps:
S1, preprocessing a large-size SAR image and selecting sub-images containing target information as training samples;
S2, introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model, obtaining a preliminary lightweight YOLOv5s model, and training it with the training samples selected in step S1;
S3, distilling the YOLOv5s model trained in step S2, then performing sparsity training and pruning, and fine-tuning the pruned YOLOv5s model;
S4, using the TensorRT inference optimizer to accelerate inference of the YOLOv5s model fine-tuned in step S3, and deploying it on the NVIDIA Jetson TX2;
S5, dicing the SAR image to be detected and sending the sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 in step S4 for detection, obtaining the corresponding sub-image detection results;
and S6, stitching the sub-image detection results obtained in step S5, screening prediction boxes on the final large-size SAR image with non-maximum suppression (NMS), drawing the retained prediction boxes on the original large-size image according to their values and labeling the categories, realizing SAR image ship detection.
Specifically, step S1 includes:
S101, dicing 5 single-channel TIF images (Img10K) and 31 single-channel TIFF images (AIR-SARShip-1.0) with a 50% overlap ratio, obtaining sub-images of the large-size remote sensing images;
S102, enlarging 1000 8-bit JPG images (SAR-train-int);
S103, unifying the Img10K and AIR-SARShip-1.0 images from step S101 and the SAR-train-int images from step S102 into 8-bit single-channel TIF format, obtaining a data set of 2551 pictures, of which 2351 are divided into training samples and 200 into validation samples;
S104, applying the Mosaic data enhancement algorithm to the training samples of step S103, stitching every four pictures via random scaling, random cropping and random arrangement, as sketched below.
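A minimal sketch of the four-picture Mosaic stitching of step S104 might look as follows. This is an illustrative simplification: the real YOLOv5 Mosaic also remaps the label boxes and applies random scaling, both omitted here, and it assumes single-channel source images at least as large as the output canvas:

```python
import random
import numpy as np

def mosaic4(imgs, out_size=1000):
    """Stitch four single-channel images around a random center (labels omitted)."""
    assert len(imgs) == 4
    s = out_size
    canvas = np.zeros((s, s), dtype=imgs[0].dtype)
    # Random split point; each image fills one quadrant around it.
    cx = random.randint(s // 4, 3 * s // 4)
    cy = random.randint(s // 4, 3 * s // 4)
    quads = [(0, 0, cx, cy), (cx, 0, s, cy), (0, cy, cx, s), (cx, cy, s, s)]
    for img, (x1, y1, x2, y2) in zip(imgs, quads):
        h, w = y2 - y1, x2 - x1
        # Random crop of each source image to fill its quadrant.
        ys = random.randint(0, img.shape[0] - h)
        xs = random.randint(0, img.shape[1] - w)
        canvas[y1:y2, x1:x2] = img[ys:ys + h, xs:xs + w]
    return canvas
```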
Specifically, step S2 includes:
S201, replacing the convolution modules and bottleneck modules in the YOLOv5s backbone network with the Ghost module and GhostBottleneck, thereby upgrading the YOLOv5s model;
S202, adjusting the depth multiplier to 0.15 and the width multiplier to 0.35, reducing the number of network layers to 212 and obtaining the preliminary lightweight YOLOv5s model.
Specifically, step S3 includes:
S301, using YOLOv5m as the teacher model, using L2 loss as the distillation basis function, setting the distillation balance coefficient in the loss to 1, and performing distillation training for 100 epochs (a loss sketch follows these steps);
S302, after normal training yields an over-parameterized model, setting the sparsity parameter to 6e-4 and applying L1 regularization to the gamma parameters of the BN layers during sparsity training, generating a sparse weight matrix as the criterion for evaluating neuron contribution; a threshold is determined from the 30% sparsity rate, channels below the threshold are cut off together with the layers that depend on them, and if all channels in a layer would be removed, the largest channel is retained;
and S303, after the pruning of step S302 is completed, training the model obtained in step S302 for 50 epochs, learning the final weights of the sparse connections through fine-tuning.
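A minimal sketch of the L2 distillation of step S301, under the assumption that the student and teacher outputs are compared scale by scale; detection_loss is a stand-in for the ordinary YOLOv5 training loss, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, imgs, targets, detection_loss, alpha=1.0):
    """Task loss plus alpha times the L2 imitation of the teacher (alpha=1 per S301)."""
    with torch.no_grad():
        t_out = teacher(imgs)                  # teacher (YOLOv5m) soft predictions
    s_out = student(imgs)                      # student (lightweight YOLOv5s) predictions
    task = detection_loss(s_out, targets)      # ordinary detection loss
    # L2 distillation term summed over the matching output scales.
    dist = sum(F.mse_loss(s, t) for s, t in zip(s_out, t_out))
    return task + alpha * dist
```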
Specifically, in step S4, deployment with the TensorRT inference optimizer includes a Build phase and a Deployment phase, specifically:
S401, in the Build phase, training the model with PyTorch to obtain a .pt file, converting the .pt file into an ONNX model, loading the ONNX model in TensorRT and converting it into a TensorRT engine; the TensorRT engine is then serialized to disk or memory, and is called the plan file;
S402, in the Deployment phase, deploying the lightweight YOLOv5 model: the plan file obtained in step S401 is first deserialized, a runtime engine is created, and the forward inference process is completed.
Specifically, step S5 includes:
S501, before a sub-image of the picture to be detected is sent into the trained lightweight YOLOv5s model for detection, performing adaptive image scaling if the sub-image does not meet the model's requirement on picture size, then sending it into the feature extraction network to obtain a feature map of size S×S, dividing the input image into S×S cells;
S502, predicting B bounding boxes for each grid cell using logistic regression; if the center of a predicted bounding box lies in a grid cell, the B bounding boxes of that grid cell classify the target and predict its box, yielding each cell's prediction result for its B bounding boxes; the output comprises the position information of the bounding boxes, a confidence indicating whether the cell contains a target, and probability information for the C categories; each bounding box predicts t_x, t_y, t_w, t_h, t_o, where t_x and t_y are the offsets of the bounding-box center coordinates relative to the current grid cell, logistic activation is applied to t_x and t_y to limit their values to 0-1, t_w and t_h are the scalings of the bounding-box width and height, and t_o is the confidence;
S503, fusing the detection results of the three scales using the feature pyramid network, which upsamples to convey strong semantic features from top to bottom, and the path aggregation network, which downsamples to convey strong localization features from bottom to top; for a picture input size of 960×960, the output feature maps are 120×120, 60×60 and 30×30, corresponding to 8×, 16× and 32× downsampling respectively.
Further, in step S502, the center-point coordinates b_x, b_y and the width and height b_w, b_h of the predicted bounding box in the whole feature map are obtained from the 5 values predicted for each bounding box as follows:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w·e^(t_w)
b_h = p_h·e^(t_h)
where the σ function is the logistic activation, c_x and c_y are the distances of the current grid cell from the top-left corner of the feature map, and p_w and p_h are the width and height of the prior box.
Further, the coordinate offsets and the confidence are limited to 0-1; Pr(object) is 1 when a ground-truth box falls within the grid cell and 0 otherwise, and the probability Pr(class_i | object) that the grid cell belongs to class i given that it contains a target satisfies
Pr(class_i | object) × Pr(object) × IOU(truth, pred) = Pr(class_i) × IOU(truth, pred)
where IOU(truth, pred) is the intersection-over-union of the ground-truth box and the predicted box, and Pr(class_i) is the probability of the corresponding class of the target within the cell.
Specifically, step S6 includes:
S601, calculating the position information of a target on the large image from its position information on the sub-image and the relative position of the sub-image on the large image;
S602, for each category, setting the NMS threshold to 0.65, selecting the bounding box with the highest confidence, filtering out all bounding boxes whose DIoU with it exceeds the NMS threshold, and, after the prediction boxes are screened, drawing boxes according to the retained predictions, completing ship detection in the large-size SAR image.
The invention further provides a SAR image ship detection system based on lightweight deep learning, comprising:
a data module for preprocessing the large-size SAR image and selecting sub-images containing target information as training samples;
a processing module for introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model, obtaining a preliminary lightweight YOLOv5s model, and training it with the training samples selected by the data module;
a fine-tuning module for distilling the YOLOv5s model trained by the processing module, then performing sparsity training and pruning, and fine-tuning the pruned YOLOv5s model;
an inference module for accelerating inference of the fine-tuned YOLOv5s model with the TensorRT inference optimizer and deploying it on the NVIDIA Jetson TX2;
a detection module for dicing the SAR image to be detected and sending the sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 by the inference module, obtaining the corresponding sub-image detection results;
and a removal module for stitching the sub-image detection results obtained by the detection module, screening prediction boxes on the final large-size SAR image with non-maximum suppression (NMS), drawing the retained prediction boxes on the original large-size image according to their values and labeling the categories, realizing ship detection in large-size SAR images.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a SAR image ship detection method based on lightweight deep learning, which adopts the technical means of network pruning, knowledge distillation and Ghost algorithm, and aims at the characteristic of large remote sensing image size, excessive information loss can be caused by direct scaling input into a network, and the loss of network information can be avoided by adopting a mode of cutting pictures into blocks with a certain degree of coincidence, and the size of the pictures can be ensured to be matched with the size of the network input; combining traditional model compression algorithm network pruning and knowledge distillation with an artificially designed lightweight model Ghost, and upgrading a target detection network YOLOv5; the parameter quantity and floating point operation quantity of the model are reduced to a great extent, and the reasoning speed is improved.
Furthermore, the data sets with different size and format are pertinently diced with a certain overlap ratio, the sizes and formats of the pictures are unified, and the sizes and formats of training samples of the input model are ensured to be consistent.
Further, the convolution module and the bottleneck module in the YOLOv5s model are optimized and upgraded by using a lightweight model Ghost, the width multiplier is adjusted to 0.15, the depth multiplier is adjusted to 0.35, and the network layer number is reduced to 212 layers, so that the parameters and floating point operation amount of the model are reduced.
Furthermore, in order to further compress the model, a traditional model compression algorithm is introduced, network pruning and knowledge distillation are performed, knowledge distillation teaches the superior performance of the large model YOLOv5m to the lightweight YOLOv5s, the model performance is improved to a certain extent, the network pruning shears off neurons which are relatively unimportant by measuring the importance of the neurons, and model parameters and floating point operation amount are further reduced.
Further, the meaning of the lightweight model is to implement deployment of the deep learning model on the embedded device, and step S4 deploys the lightweight YOLOv5S on the NVIDIA Jetson TX2 by using a TensorRT reasoning optimizer.
Further, the detection process of the light-weight YOLOv5S on the large-size SAR image is described by step S5, and finally, a result diagram marked with a prediction frame and category information is obtained.
Further, in step S502, a calculation scheme of the center point coordinates and the length and width of the prediction boundary box in the whole feature map is described.
Further, the probability Pr (class) that the grid cell belongs to a certain class under the condition of containing the target i I object) is the result of the network output
Further, step S6 restores the sub-picture result of the picture to be detected to the original size picture, and filters out the prediction frame with higher repeatability by NMS to obtain the final detection result.
In summary, the invention provides a complete model light-weight process, and finally obtains a light-weight YOLOv5s model, and deploys the model on the embedded equipment NVIDIA Jetson TX2 to complete the ship-to-ship task of the large-size SAR image.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is an illustration of the Ghost module;
FIG. 3 is an illustration of GhostBottleneck;
FIG. 4 is a schematic diagram of a key part of the complex model's detection result on a complex scene, with confidence hidden;
FIG. 5 is a schematic diagram of a key part of the complex model's detection result on a complex scene, with confidence shown;
FIG. 6 is a schematic diagram of a key part of the complex model's detection result on a complex scene, with confidence shown;
FIG. 7 is a schematic diagram of a key part of the simple model's detection result on a simple scene, with confidence hidden;
FIG. 8 is a schematic diagram of a key part of the simple model's detection result on a simple scene, with confidence shown.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it will be understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Various structural schematic diagrams according to the disclosed embodiments of the present invention are shown in the accompanying drawings. The figures are not drawn to scale, wherein certain details are exaggerated for clarity of presentation and may have been omitted. The shapes of the various regions, layers and their relative sizes, positional relationships shown in the drawings are merely exemplary, may in practice deviate due to manufacturing tolerances or technical limitations, and one skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions as actually required.
The invention provides a SAR image ship detection method based on lightweight deep learning, oriented to the embedded device NVIDIA Jetson TX2. It relates to model compression: a traditional model compression algorithm and a manually designed lightweight model are used to compress and optimize the target detection network, and the method can be applied to the detection of specific targets in large-size synthetic aperture radar images; on the premise of an acceptable accuracy loss, the model's parameter count and floating-point operations are compressed and the detection speed is improved.
Referring to FIG. 1, the SAR image ship detection method based on lightweight deep learning of the invention comprises the following steps:
S1, preprocessing the large-size SAR image and selecting sub-images containing target information as training samples, on which the lightweight YOLOv5s model of step S2 is trained for 500 epochs;
S101, dicing 5 16-bit single-channel TIF images of 10000×10000 pixels (Img10K) and 31 16-bit single-channel TIFF images of 3000×3000 pixels (AIR-SARShip-1.0) with a 50% overlap ratio (sketched after these steps); the dicing yields sub-images of the large-size remote sensing images, which are input to the network as training samples;
S102, enlarging 1000 800×800-pixel 8-bit JPG images (SAR-train-int) to 1000×1000;
S103, unifying the Img10K, AIR-SARShip-1.0, AIR-SARShip-2.0 and SAR-train-int image formats into 8-bit single-channel TIF; the finally established data set contains 2551 pictures, divided into 2351 training samples and 200 validation samples;
S104, stitching every four pictures via random scaling, random cropping and random arrangement using the Mosaic data enhancement algorithm, increasing the number of small-target samples so that the training data distribution becomes more uniform.
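A minimal sketch of the 50% overlapped dicing of step S101 (also reused for inference in step S5); the tile size and the border clamping are illustrative assumptions:

```python
import numpy as np

def dice_with_overlap(img: np.ndarray, tile: int = 1000, overlap: float = 0.5):
    """Yield (sub_image, x0, y0) tiles with the given overlap; borders are clamped.

    The (x0, y0) offsets are kept so that detections can later be mapped back
    onto the large image when the sub-image results are stitched (step S6).
    """
    step = int(tile * (1.0 - overlap))                 # 500-pixel stride at 50% overlap
    h, w = img.shape[:2]
    for y0 in range(0, max(h - tile, 0) + step, step):
        for x0 in range(0, max(w - tile, 0) + step, step):
            yc = min(y0, max(h - tile, 0))             # clamp the last row/column
            xc = min(x0, max(w - tile, 0))
            yield img[yc:yc + tile, xc:xc + tile], xc, yc
```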
S2, introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model, completing the preliminary lightweighting of the YOLOv5s model, and training it for 500 epochs with the training samples selected in step S1;
The idea of the Ghost module, which replaces standard convolution, is to use cheap linear transformations of a few intrinsic feature maps to generate "ghost" feature maps as output. It exploits the similarity between pairs of redundant feature maps: from a small number of intrinsic feature maps, a large number of similar redundant feature maps can be obtained by simple linear transformations, achieving the goal of compressing the convolution parameter count and operations. The Ghost module decomposes a standard convolution into two parts: the first generates a small number of intrinsic feature maps with a small number of standard convolutions, and the second generates a large number of "ghost" (i.e., redundant) feature maps at extremely low cost by applying simple linear operations to these intrinsic feature maps (see the sketch after these steps).
S201, the specific operation of upgrading YOLOv5s with the Ghost module and GhostBottleneck is to replace the convolution modules and bottleneck modules in the YOLOv5s backbone network with the Ghost module and GhostBottleneck respectively, as shown in FIG. 3.
S202, because the Ghost module brings a large increase in network depth, this increase is counteracted by changing the depth multiplier: the depth multiplier is adjusted to 0.15 and the width multiplier to 0.35, reducing the number of network layers to 212 and obtaining the preliminary lightweight YOLOv5s model.
Two multipliers are adjusted, the depth multiplier and the width multiplier, to reduce the number of network layers; this process is called preliminary lightweighting.
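The Ghost module described above can be sketched in PyTorch roughly as follows. This is a simplified version of the structure from the GhostNet paper; the ratio, kernel size and SiLU activation are typical defaults rather than the exact configuration of this patent:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Part of the output channels come from a real convolution; the rest are
    cheap 'ghost' feature maps produced by a depthwise linear transform."""
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2, dw_k: int = 3):
        super().__init__()
        init_ch = out_ch // ratio            # intrinsic feature maps
        ghost_ch = out_ch - init_ch          # cheap ghost feature maps
        self.primary = nn.Sequential(        # part 1: a small standard convolution
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.SiLU())
        self.cheap = nn.Sequential(          # part 2: cheap depthwise "linear" operation
            nn.Conv2d(init_ch, ghost_ch, dw_k, padding=dw_k // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)                            # intrinsic maps
        return torch.cat([y, self.cheap(y)], dim=1)    # intrinsic + ghost maps
```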
S3, based on the YOLOv5s model obtained in step S2, further lightening the preliminary lightweight YOLOv5s model using the traditional model-lightweighting algorithms of network pruning and knowledge distillation;
The preliminary lightweight YOLOv5s model of step S2 is distilled, then sparsity-trained and pruned, and the pruned model is fine-tuned to restore accuracy.
S301, using YOLOv5m as the teacher model (T-model), using L2 loss as the distillation basis function, setting the distillation balance coefficient in the loss to 1, and performing distillation training for 100 epochs;
S302, after normal training yields an over-parameterized model, setting the sparsity parameter to 6e-4 and applying L1 regularization to the gamma parameters of the BN layers during sparsity training to generate a sparse weight matrix. This is used as the criterion for evaluating the size of each neuron's contribution, and a threshold is determined from the 30% sparsity rate (see the sketch after these steps). Channels below the threshold are cut off together with the layers that depend on them; if all channels in a layer would be removed, the largest channel is retained to preserve the network structure;
Step S301 performs distillation optimization on the preliminary lightweight model, and step S302 further prunes the distilled model to obtain the pruned model.
S303, after pruning is completed, training the pruned model obtained in step S302 for 50 epochs; to ensure that model accuracy does not drop sharply, the final weights of the sparse connections are learned through fine-tuning.
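A minimal sketch of the BN-gamma sparsity penalty and the 30% threshold selection of steps S302/S303; the subgradient update and the global threshold follow the standard network-slimming recipe, and details such as dependency-aware channel removal and the keep-one-channel rule are only noted in comments:

```python
import torch
import torch.nn as nn

SPARSITY = 6e-4  # sparsity parameter from step S302

def add_bn_l1_subgradient(model: nn.Module):
    """Call after loss.backward(): adds the L1 subgradient on every BN gamma."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            m.weight.grad.add_(SPARSITY * torch.sign(m.weight.detach()))

def global_gamma_threshold(model: nn.Module, prune_ratio: float = 0.3) -> float:
    """Threshold = the prune_ratio quantile of all |gamma| values in the network."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    return torch.quantile(gammas, prune_ratio).item()

# Channels whose |gamma| falls below the threshold are removed together with the
# layers that depend on them, keeping at least the largest channel per layer;
# the pruned model is then fine-tuned for 50 epochs (step S303).
```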
S4, using the TensorRT inference optimizer to accelerate inference of the YOLOv5s model obtained in step S3 and deploying it on the NVIDIA Jetson TX2; deployment with the TensorRT inference optimizer comprises a Build phase and a Deployment phase;
S401, in the Build phase, training the model with PyTorch to obtain a .pt file, converting the .pt file into an ONNX model, loading the ONNX model in TensorRT and converting it into a TensorRT engine, then serializing the engine to disk or memory; this is called the plan file;
S402, in the Deployment phase, the lightweight YOLOv5 model is deployed and the forward inference process is completed: the plan file obtained in the Build phase is first deserialized, a runtime engine is created, and inference is performed (both phases are sketched below).
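The two phases of step S4 might look roughly as follows with the TensorRT 7.x Python bindings available on the TX2 (JetPack 4.4.1). The file names and the checkpoint layout are illustrative assumptions, and error handling is omitted:

```python
import torch
import tensorrt as trt

# --- Build phase: .pt -> .onnx -> serialized TensorRT engine (the "plan file") ---
ckpt = torch.load('ghost_yolov5s.pt', map_location='cpu')   # hypothetical checkpoint
model = ckpt['model'].float().eval()
dummy = torch.zeros(1, 1, 960, 960)                         # single-channel 960x960 input
torch.onnx.export(model, dummy, 'ghost_yolov5s.onnx', opset_version=11)

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('ghost_yolov5s.onnx', 'rb') as f:
    parser.parse(f.read())
config = builder.create_builder_config()
config.max_workspace_size = 1 << 28                         # 256 MB workspace
engine = builder.build_engine(network, config)              # TensorRT 7.x API
with open('ghost_yolov5s.plan', 'wb') as f:
    f.write(engine.serialize())                             # serialize the plan file

# --- Deployment phase: deserialize the plan and create the runtime engine ---
runtime = trt.Runtime(logger)
with open('ghost_yolov5s.plan', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()                 # forward inference context
```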
S5, after dicing the large-size SAR image to be detected, sending the sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 in step S4 to complete detection;
Similar to the generation of training samples, the large-size SAR image is diced into 1000×1000 sub-images with a 50% overlap ratio, and the sub-images are sent to the model in sequence for detection.
S501, before a sub-image of the picture to be detected is sent into the trained lightweight YOLOv5s model for detection, performing adaptive image scaling if it does not meet the model's requirement on picture size, then sending it into the feature extraction network to obtain a feature map of size S×S, dividing the input image into S×S cells;
S502, predicting B bounding boxes for each grid cell using logistic regression; if the center of a predicted bounding box lies in a grid cell, the B bounding boxes of that grid cell classify the target and predict its box, yielding each grid cell's prediction result for its B bounding boxes;
The position information of the bounding boxes, a confidence indicating whether the cell contains a target, and probability information for the C categories are output. Each bounding box predicts 5 values: t_x, t_y, t_w, t_h, t_o. Here t_x and t_y are the offsets of the bounding-box center coordinates relative to the current grid cell. To ensure that the bounding-box center is constrained within the current grid cell, logistic activation is applied to t_x and t_y for normalization, limiting their values to 0-1 and making model training more stable; t_w and t_h are the scalings of the bounding-box width and height, and t_o is the confidence, calculated in the way mentioned in R-CNN.
From the 5 values predicted for each bounding box, the center-point coordinates b_x, b_y and the width and height b_w, b_h of the predicted bounding box in the whole feature map can then be obtained according to the following formulas (a decoding sketch follows step S503):
b_x = σ(t_x) + c_x (1)
b_y = σ(t_y) + c_y (2)
b_w = p_w·e^(t_w) (3)
b_h = p_h·e^(t_h) (4)
where c_x and c_y are the distances of the current grid cell from the top-left corner of the feature map, and p_w and p_h are the width and height of the prior box. The σ function is the logistic activation, limiting the coordinate offsets and the confidence to 0-1. When a ground-truth box falls within a grid cell, Pr(object) of that cell is 1; otherwise Pr(object) is 0.
The probability Pr(class_i | object) that a certain grid cell belongs to class i given that it contains a target is expressed as:
Pr(class_i | object) × Pr(object) × IOU(truth, pred) = Pr(class_i) × IOU(truth, pred)
where IOU(truth, pred) is the intersection-over-union of the ground-truth box and the predicted box, and Pr(class_i) is the probability of the corresponding class of the target within the cell.
S503, the feature pyramid network (FPN) upsamples to convey strong semantic features from top to bottom, and the path aggregation network (PAN) downsamples to convey strong localization features from bottom to top, fusing the detection results of the three scales respectively.
For a picture input size of 960×960, the output feature maps are 120×120, 60×60 and 30×30, corresponding to 8×, 16× and 32× downsampling respectively.
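The decoding of formulas (1)-(4) can be sketched as follows for one output scale; the tensor layout and variable names are illustrative assumptions, and anchors are assumed to be given in pixels:

```python
import torch

def decode(t: torch.Tensor, anchors: torch.Tensor, stride: int) -> torch.Tensor:
    """Decode raw predictions t of shape (N, A, S, S, 5+C) into image-space boxes.

    t[..., 0:2] are t_x, t_y; t[..., 2:4] are t_w, t_h; t[..., 4] is t_o.
    """
    n, a, s, _, _ = t.shape
    gy, gx = torch.meshgrid(torch.arange(s), torch.arange(s))  # cell offsets c_y, c_x
    bx = (torch.sigmoid(t[..., 0]) + gx) * stride              # b_x = sigma(t_x) + c_x
    by = (torch.sigmoid(t[..., 1]) + gy) * stride              # b_y = sigma(t_y) + c_y
    pw = anchors[:, 0].view(1, a, 1, 1)                        # prior box width p_w
    ph = anchors[:, 1].view(1, a, 1, 1)                        # prior box height p_h
    bw = pw * torch.exp(t[..., 2])                             # b_w = p_w * e^{t_w}
    bh = ph * torch.exp(t[..., 3])                             # b_h = p_h * e^{t_h}
    conf = torch.sigmoid(t[..., 4])                            # objectness confidence
    return torch.stack([bx, by, bw, bh, conf], dim=-1)
```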
And S6, stitching the sub-image detection results obtained in step S5, screening prediction boxes on the final large-size SAR image with non-maximum suppression (NMS), drawing the retained prediction boxes on the original large-size image according to their values and labeling the categories, completing target detection on the large-size SAR image.
S601, the stitching process is the inverse of the dicing process: the position information of a target on the large image is calculated from its position on the sub-image and the relative position of the sub-image on the large image;
S602, for each category, setting the NMS threshold to 0.65, selecting the bounding box with the highest confidence, and filtering out all bounding boxes whose DIoU with it exceeds the NMS threshold, removing boxes with a high repetition rate; after NMS screens the prediction boxes, boxes are drawn according to the retained predictions, completing ship detection on large-size SAR images (sketched below).
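A minimal sketch of the sub-image-to-large-image mapping of S601 and a greedy DIoU-based NMS for S602 (threshold 0.65); this is a simplified per-category routine with illustrative names, not the exact implementation:

```python
import torch

def to_global(boxes: torch.Tensor, x0: int, y0: int) -> torch.Tensor:
    """Shift (x1, y1, x2, y2) boxes by the sub-image's offset on the large image."""
    return boxes + torch.tensor([x0, y0, x0, y0], dtype=boxes.dtype)

def diou(box: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """DIoU between one box and many, all in (x1, y1, x2, y2) format."""
    x1 = torch.max(box[0], boxes[:, 0]); y1 = torch.max(box[1], boxes[:, 1])
    x2 = torch.min(box[2], boxes[:, 2]); y2 = torch.min(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    iou = inter / (area(box) + area(boxes) - inter + 1e-9)
    # DIoU penalty: squared center distance over squared enclosing-box diagonal.
    c2 = ((box[:2] + box[2:]) / 2 - (boxes[:, :2] + boxes[:, 2:]) / 2).pow(2).sum(1)
    ex1 = torch.min(box[0], boxes[:, 0]); ey1 = torch.min(box[1], boxes[:, 1])
    ex2 = torch.max(box[2], boxes[:, 2]); ey2 = torch.max(box[3], boxes[:, 3])
    d2 = (ex2 - ex1).pow(2) + (ey2 - ey1).pow(2) + 1e-9
    return iou - c2 / d2

def diou_nms(boxes: torch.Tensor, scores: torch.Tensor, thresh: float = 0.65):
    """Greedy NMS: keep the highest-confidence box, drop boxes with DIoU > thresh."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(int(i))
        if order.numel() == 1:
            break
        rest = order[1:]
        order = rest[diou(boxes[i], boxes[rest]) <= thresh]
    return keep
```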
In still another embodiment of the present invention, a lightweight deep learning-based SAR image ship detection system is provided, which can be used to implement the above lightweight deep learning-based SAR image ship detection method, and specifically, the lightweight deep learning-based SAR image ship detection system includes a data module, a processing module, a fine tuning module, an inference module, a detection module, and a removal module.
The data module is used for preprocessing the large-size SAR image and selecting sub-images containing target information as training samples;
the processing module is used for introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model, obtaining a preliminary lightweight YOLOv5s model, and training it with the training samples selected by the data module;
the fine-tuning module is used for distilling the YOLOv5s model trained by the processing module, then performing sparsity training and pruning, and fine-tuning the pruned YOLOv5s model;
the inference module is used for accelerating inference of the fine-tuned YOLOv5s model with the TensorRT inference optimizer and deploying it on the NVIDIA Jetson TX2;
the detection module is used for dicing the SAR image to be detected and sending the sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 by the inference module, obtaining the corresponding sub-image detection results;
and the removal module is used for stitching the sub-image detection results obtained by the detection module, screening prediction boxes on the final large-size SAR image with non-maximum suppression (NMS), and drawing the retained prediction boxes on the original large-size image according to their values, realizing SAR image ship detection.
In yet another embodiment of the present invention, a terminal device is provided, comprising a processor and a memory, the memory storing a computer program comprising program instructions, and the processor executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing and control core of the terminal, adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor of the embodiment of the invention can be used to run the SAR image ship detection method based on lightweight deep learning, comprising the following steps:
Preprocessing a large-size SAR image and selecting sub-images containing target information as training samples; introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model, obtaining a preliminary lightweight YOLOv5s model, and training it with the training samples; distilling the trained YOLOv5s model, then performing sparsity training and pruning, and fine-tuning the pruned YOLOv5s model; accelerating inference of the fine-tuned YOLOv5s model with the TensorRT inference optimizer and deploying it on the NVIDIA Jetson TX2; dicing the SAR image to be detected and sending the sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 for detection, obtaining the corresponding sub-image detection results; stitching the obtained sub-image detection results, screening prediction boxes on the final large-size SAR image with non-maximum suppression (NMS), and drawing the retained prediction boxes on the original large-size image according to their values, realizing SAR image ship detection.
In a further embodiment of the present invention, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a terminal device, for storing programs and data. It will be appreciated that the computer readable storage medium herein may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the above-described embodiments with respect to a lightweight deep learning-based SAR image ship detection method; one or more instructions in a computer-readable storage medium are loaded by a processor and perform the steps of:
preprocessing a large-size SAR image and selecting sub-images containing target information as training samples; introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model, obtaining a preliminary lightweight YOLOv5s model, and training it with the training samples; distilling the trained YOLOv5s model, then performing sparsity training and pruning, and fine-tuning the pruned YOLOv5s model; accelerating inference of the fine-tuned YOLOv5s model with the TensorRT inference optimizer and deploying it on the NVIDIA Jetson TX2; dicing the SAR image to be detected and sending the sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 for detection, obtaining the corresponding sub-image detection results; stitching the obtained sub-image detection results, screening prediction boxes on the final large-size SAR image with non-maximum suppression (NMS), and drawing the retained prediction boxes on the original large-size image according to their values, realizing SAR image ship detection.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The effect of the invention can be further illustrated by the following experiments:
1. Experimental environment
The training host environment is as follows: Ubuntu 18.04, Intel(R) Xeon(R) Gold 5118 CPU, GeForce RTX 2080 Ti GPU, Python 3.8.5, CUDA 10.0.130, CuDNN 7.0.
Jetson TX2 inference environment: Ubuntu 18.04; CPU: HMP Dual Denver 2 (2 MB L2) + Quad ARM A57 (2 MB L2); GPU: NVIDIA Pascal with 256 CUDA cores; Python 3.8.5, CUDA 10.2.89, CuDNN 8.0.0.180, TensorRT 7.1.3.0, JetPack 4.4.1.
2. Experimental details
(1) Taking YOLOv5s as the baseline model, verifying the effectiveness of the Ghost module for model lightweighting on the GPU and CPU of the training host respectively, and recording the model's parameter count, floating-point operations, average precision AP50 and AP50:95, precision P, recall R, and the inference time and total processing time for one 1000×1000 picture. The Ghost module, illustrated in FIG. 2, compresses both the parameter count and the floating-point operations.
(2) Distilling and pruning the preliminary lightweight model, testing it on the NVIDIA Jetson TX2, and recording the inference time and total processing time for one 1000×1000 picture, verifying the model's acceleration of inference time.
3. Simulation experiment results
Experimental results show that the Ghost module substantially compresses the parameter count and floating-point operations of the baseline model YOLOv5s. The distillation and pruning strategies used can further compress the model and greatly increase inference speed.
TABLE 1 Comparison of performance of Mobile and Ghost modules in YOLOv5

| Model | All | L | Weights(M) | FLOPs(G) | AP50 | AP50:95 | P | R | Infer(ms) | Total(ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s | \ | 224 | 6.72 | 16.3 | 61.0% | 33.5% | 84.5% | 56.6% | 8 | 54 |
| YOLOv5m | \ | 308 | 20.06 | 50.3 | 64.3% | 33.5% | 77.9% | 60.1% | 13 | 61 |
| GhostYOLOv5s | × | 326 | 4.44 | 9.7 | 62.8% | 33.7% | 78.5% | 59% | 11 | 60 |
| GhostYOLOv5s | √ | 362 | 2.69 | 6.5 | 60.5% | 32.9% | 80.3% | 58.4% | 12 | 60.9 |
The performance comparison of the models is shown in Table 1. All experiments in the tables were completed on the laboratory server. The image sizes are adapted to the network input, all 960×960. "All" indicates whether all convolution modules and bottleneck blocks in the network are replaced, and L is the number of network layers. It can be seen that Ghost significantly compresses the model's parameter count and floating-point operations: it reduces the parameters of YOLOv5s from 6.72M to 4.44M and the floating-point operations from 16.3G to 9.7G, while GhostYOLOv5s maintains accuracy and is even slightly higher than YOLOv5s on AP50 and AP50:95. Its inference speed on the GPU, however, is not as fast as that of YOLOv5s, despite the smaller parameter count and floating-point operations. Considering that the GPU computing bottleneck lies in memory-access bandwidth, only the backbone network was replaced in order to reduce the number of network layers and improve inference speed. Model inference speed was also tested on the CPU; the results are shown in Table 2.
Table 2 Model CPU inference time comparison

| Model | Weights(M) | FLOPs(G) | AP50 | AP50:95 | P | R | Infer(ms) | Total(ms) |
|---|---|---|---|---|---|---|---|---|
| YOLOv5s | 6.72 | 16.3 | 61.0% | 33.5% | 84.5% | 56.6% | 510 | 546 |
| GhostYOLOv5s | 4.44 | 9.7 | 62.8% | 33.7% | 78.5% | 59% | 440 | 499 |
The inference speed of GhostYOLOv5s on the CPU is significantly faster than that of YOLOv5s, and its total processing time is also shorter, indicating that Ghost is effective for lightweighting the network model and demonstrating the excellent performance of the Ghost module for network compression.
TABLE 3 Influence of depth and width multipliers on inference time

| Model | Depth | Width | Weights(M) | FLOPs(G) | AP50 | AP50:95 | Infer(ms) | Total(ms) |
|---|---|---|---|---|---|---|---|---|
| GhostYOLOv5s | 0.33 | 0.50 | 4.44 | 9.7 | 62.8% | 33.7% | 11 | 60 |
| GhostYOLOv5s | 0.15 | 0.35 | 2.22 | 5.1 | 63.0% | 34.8% | 12 | 60.9 |
YOLOv5 implements four models of different sizes by adjusting the width multiplier and depth multiplier. Since the Ghost module brings a substantial increase in network depth, this increase is counteracted by changing the depth multiplier: the depth multiplier is adjusted to 0.15 and the width multiplier to 0.35, reducing the number of network layers to 212. The inference speed and total processing speed tested on the server GPU improve to a certain extent, while the accuracy of the model is not reduced, showing that controlling model width and depth through the width multiplier and depth multiplier is effective.
TABLE 4 TensorRT inference performance comparison

| Model | Pruning | Distillation | Weights(M) | FLOPs(G) | AP50 | AP50:95 | Infer(ms) | Total(ms) |
|---|---|---|---|---|---|---|---|---|
| YOLOv5s | | | 6.72 | 16.3 | 61.1% | 33.5% | 70.38 | 121.82 |
| GhostYOLOv5s | | | 2.22 | 5.1 | 63.0% | 34.8% | 61 | 109.77 |
| GhostYOLOv5s | √ | | 1.62 | 3.0 | 61.6% | 32.2% | 40.5 | 90.7 |
| GhostYOLOv5s | | √ | 2.22 | 5.1 | 63.0% | 32.3% | 59.4 | 108.66 |
| GhostYOLOv5s | √ | √ | 0.89 | 1.8 | 57.3% | 27.7% | 30.2 | 84.5 |
After distillation, pruning and fine-tuning, the parameter count and floating-point operations of the Ghost-lightweighted YOLOv5s are greatly reduced. A loss of accuracy is unavoidable but remains within an acceptable range. The time to infer one 1000×1000 picture on the TX2 is only 30.2 ms, and the total time including reading the picture and post-processing is only 84.5 ms, sufficient to demonstrate the superiority of GhostYOLOv5s. Analysis of Table 4 shows that distillation does little to improve the accuracy of the model, but with the same pruning strategy, distilling before pruning achieves further lightweighting. Conceivably, distillation makes the model's weight distribution denser, making important weights more important and unimportant weights less important, so a sparser matrix can be obtained during sparsity training.
FIG. 4 shows a key part of the detection result on a 10000×10000 test image; confidence information is hidden so as not to occlude small targets. The experiments finally yield two models: a complex model with pruning only, and a simple model with distillation plus pruning. The two models differ greatly in parameter count and floating-point operations and suit scenes of different complexity. The single-picture inference time of the complex GhostYOLOv5s model on the TX2 is 40 ms; including the model-loading, picture-reading and post-processing stages, the total single-picture processing time is about 92 ms. The inference time of the simple GhostYOLOv5s model on the TX2 is 30.2 ms, with a total single-picture processing time of about 84.5 ms including model loading, picture reading and post-processing.
TABLE 5 Comparison of detection effects of different models in complex scenes

| Model | Weights(M) | FLOPs(G) | AP50 | AP50:95 | Miss | Fake | F1 | Infer(s) | Total(s) |
|---|---|---|---|---|---|---|---|---|---|
| Complex | 1.62 | 3.0 | 69% | 42.3% | 35.6% | 13.5% | 0.738 | 14.44 | 32.2 |
| Simple | 0.89 | 1.8 | 56.1% | 27.3% | 49.5% | 22.9% | 0.61 | 10.9 | 30.5 |
Table 6 Comparison of detection effects of different models in simple scenes

| Model | Weights(M) | FLOPs(G) | AP50 | AP50:95 | Miss | Fake | F1 | Infer(s) | Total(s) |
|---|---|---|---|---|---|---|---|---|---|
| Complex | 1.62 | 3.0 | 81.6% | 50% | 27.9% | 13% | 0.789 | 14.44 | 32.2 |
| Simple | 0.89 | 1.8 | 55.1% | 24.6% | 46.3% | 36.7% | 0.581 | 10.9 | 30.5 |
FIGS. 5 and 6 show the complex model's detection results with confidence displayed. The scene is a river channel with shoals, where detection is difficult, yet the detection confidence remains at a high level.
The complex model is mainly intended for complex scenes, and the simple model for simple scenes. FIGS. 7 and 8 illustrate the detection effect of the simple model on a simple scene; the simple model shows superior performance on open sea areas. Selecting an appropriate model for images of different complexity optimizes inference speed.
In summary, the SAR image ship detection method based on lightweight deep learning deploys the resulting lightweight YOLOv5s model on the embedded device NVIDIA Jetson TX2 to complete the ship detection task on large-size SAR images, and ships can be effectively detected in both simple and complex SAR scenes.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. The SAR image ship detection method based on lightweight deep learning is characterized by comprising the following steps of:
S1, preprocessing the large-size SAR image and selecting sub-images containing target information as training samples;
S2, introducing a Ghost module and GhostBottleneck to upgrade the YOLOv5s model to obtain a preliminary lightweight YOLOv5s model, and training the YOLOv5s model with the training samples selected in step S1, wherein the steps are as follows:
S201, replacing the convolution modules and bottleneck modules in the YOLOv5s model backbone network with the Ghost module and GhostBottleneck, thereby upgrading the YOLOv5s model;
S202, adjusting the width multiplier to 0.15 and the depth multiplier to 0.35, reducing the number of network layers to 212 to obtain the preliminary lightweight YOLOv5s model;
S3, distilling the YOLOv5s model obtained after training in step S2, then performing sparsification training and pruning, and performing fine-tuning training on the pruned YOLOv5s model, specifically comprising:
S301, using YOLOv5m as the teacher model and L2 loss as the distillation basis function, setting the distillation balance coefficient in the loss to 1, and performing distillation training for 100 epochs (see the distillation sketch following this claim);
S302, after obtaining an over-parameterized model through normal training, setting the sparsity parameter to 6e-4 and applying L1 regularization to the gamma parameters of the BN layers during sparse training, so that a sparse weight matrix is generated as the standard for evaluating neuron contributions; determining a threshold according to a 30% sparsity rate, cutting off channels below the threshold together with the dependent layers of the corresponding layers, and retaining the largest channel if all channels of a layer would otherwise be removed (see the pruning sketch following this claim);
S303, after the pruning processing of step S302 is completed, continuing to train the model obtained in step S302 for 50 epochs, learning the final weights of the sparse connections through fine-tuning training;
S4, using the TensorRT inference optimizer to accelerate inference of the YOLOv5s model fine-tuned in step S3, and deploying the model on NVIDIA Jetson TX2;
S5, cutting the SAR image to be detected and sequentially sending the sub-images to the YOLOv5s model deployed on NVIDIA Jetson TX2 in step S4 for detection, obtaining corresponding sub-image detection results, specifically comprising:
S501, before a sub-image of the picture to be detected is sent into the trained lightweight YOLOv5s model for detection, performing adaptive picture scaling if the sub-image does not meet the model's requirement on picture size; sending the sub-image into the feature extraction network to obtain a feature map of size S×S, dividing the input image into S×S cells;
S502, predicting B bounding boxes for each grid using logistic regression; if the center of a predicted bounding box falls in a grid cell, the B bounding boxes of that grid cell classify the target and predict its frame, giving each grid's prediction results for its B bounding boxes; outputting the position information of the bounding boxes, the confidence indicating whether the grid contains a target, and the probability information of the C categories; each bounding box predicts t_x, t_y, t_w, t_h and t_o, where t_x and t_y are the offsets of the bounding-box center coordinates relative to the current grid cell, normalized to within 0–1 by a logistic activation, t_w and t_h are the scalings of the bounding-box width and height, and t_o is the confidence;
S503, fusing the detection results of the three scales by combining a feature pyramid network, which passes strong semantic features top-down through upsampling, with a path aggregation network, which passes strong localization features bottom-up through downsampling; for a picture input size of 960×960, the output feature maps are 120×120, 60×60 and 30×30, corresponding to 8×, 16× and 32× downsampling respectively;
and S6, splicing the sub-image detection results obtained in step S5, using NMS (non-maximum suppression) on the final large-size SAR image to screen the prediction frames, drawing the screened prediction frames on the original large-size image according to their values and marking the categories, thereby realizing ship detection of the SAR image.
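For step S301, a minimal PyTorch sketch of the distillation objective, assuming student and teacher outputs have matching shapes; the detection-loss term is a placeholder:

    import torch.nn.functional as F

    def distillation_loss(student_out, teacher_out, det_loss, dist_coeff=1.0):
        # L2 distance to the YOLOv5m teacher, weighted by the balance coefficient (1 here);
        # detach() stops gradients from flowing into the teacher.
        l2 = F.mse_loss(student_out, teacher_out.detach())
        return det_loss + dist_coeff * l2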
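For step S302, a Network-Slimming-style sketch of the BN-gamma sparsity penalty and the 30% global threshold (PyTorch; channel surgery and layer-dependency handling are omitted):

    import torch
    import torch.nn as nn

    def add_bn_l1_subgradient(model: nn.Module, s: float = 6e-4) -> None:
        # Called after loss.backward(): adds the subgradient of s * |gamma|
        # to every BN scale factor, driving unimportant channels toward zero.
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.weight.grad.add_(s * torch.sign(m.weight.data))

    def global_prune_threshold(model: nn.Module, prune_rate: float = 0.3) -> torch.Tensor:
        # Channels whose |gamma| falls below this quantile are candidates for removal
        # (the method keeps at least the largest channel of any layer).
        gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
        return torch.quantile(gammas, prune_rate)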
2. The method according to claim 1, wherein step S1 is specifically:
S101, cutting the 5 single-channel TIF images of Img-10K and the 31 single-channel TIFF images of AIR-SARShip-1.0 into blocks with a 50% overlap rate to obtain sub-images of the large-size remote sensing images (see the tiling sketch following this claim);
S102, augmenting the 1000 8-bit JPG images of SAR-train-int;
S103, unifying the Img-10K and AIR-SARShip-1.0 images obtained in step S101 and the SAR-train-int images obtained in step S102 into the 8-bit single-channel TIF format, obtaining a data set of 2551 pictures, of which 2351 are divided into training samples and 200 into verification samples;
S104, applying the Mosaic data enhancement algorithm to the training samples of step S103, splicing every four pictures of the training samples through random scaling, random cropping and random arrangement.
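For step S101, a minimal sketch of 50%-overlap tiling (Python/NumPy; the tile size is an assumed parameter, and edge remainders are not specially padded):

    import numpy as np

    def tile_image(img: np.ndarray, tile: int = 1000, overlap: float = 0.5):
        # Yield (x, y, crop) sub-images covering the large image;
        # a 50% overlap means the stride is half the tile size.
        h, w = img.shape[:2]
        step = int(tile * (1 - overlap))
        for y in range(0, max(h - tile, 0) + 1, step):
            for x in range(0, max(w - tile, 0) + 1, step):
                yield x, y, img[y:y + tile, x:x + tile]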
3. The method according to claim 1, wherein in step S4, deployment through the TensorRT inference optimizer comprises a Build phase and a Deployment phase, specifically:
S401, in the Build phase, taking the .pt file obtained from PyTorch training, converting it into an ONNX model, loading the ONNX model in TensorRT and converting it into a TensorRT model; the TensorRT model is then serialized to disk or memory, where it is called a plan file;
S402, carrying out lightweight YOLOv5 model deployment in the Deployment phase: first deserializing the plan file obtained in step S401, creating the runtime engine, and completing the forward inference process.
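A hedged sketch of the two phases (Python, assuming the TensorRT 8.x API; file names, input size, checkpoint layout and opset are illustrative):

    import torch
    import tensorrt as trt

    # Build phase: .pt -> .onnx -> serialized plan file
    ckpt = torch.load("ghost_yolov5s.pt", map_location="cpu")  # YOLOv5-style checkpoint assumed
    model = ckpt["model"].float().eval()
    torch.onnx.export(model, torch.zeros(1, 3, 960, 960), "ghost_yolov5s.onnx", opset_version=12)

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open("ghost_yolov5s.onnx", "rb") as f:
        parser.parse(f.read())
    config = builder.create_builder_config()
    with open("ghost_yolov5s.plan", "wb") as f:
        f.write(builder.build_serialized_network(network, config))

    # Deployment phase: deserialize the plan and create the runtime engine
    runtime = trt.Runtime(logger)
    with open("ghost_yolov5s.plan", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()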
4. The method according to claim 1, wherein in step S502, from the 5 values predicted for each bounding box, the center point coordinates b_x, b_y and the width and height b_w, b_h of the predicted bounding box on the whole feature map are obtained as:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

wherein σ is the logistic activation function, c_x and c_y are the distances of the current grid cell from the upper-left corner of the feature map, and p_w and p_h are the width and height of the prior frame.
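A short PyTorch sketch of this decoding, directly transcribing the four formulas above:

    import torch

    def decode_box(t, cell_xy, prior_wh):
        # t = (t_x, t_y, t_w, t_h) raw predictions; cell_xy = (c_x, c_y); prior_wh = (p_w, p_h)
        tx, ty, tw, th = t
        cx, cy = cell_xy
        pw, ph = prior_wh
        bx = torch.sigmoid(tx) + cx   # b_x = sigma(t_x) + c_x
        by = torch.sigmoid(ty) + cy   # b_y = sigma(t_y) + c_y
        bw = pw * torch.exp(tw)       # b_w = p_w * e^(t_w)
        bh = ph * torch.exp(th)       # b_h = p_h * e^(t_h)
        return bx, by, bw, bh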
5. The method of claim 4, wherein the coordinate offsets and confidence are limited to within 0–1; Pr(object) is 1 when the real frame falls within the grid cell and 0 otherwise; the probability Pr(class_i | object) that the grid cell belongs to a certain class under the condition that it contains a target gives the class-specific confidence

Pr(class_i | object) × Pr(object) × IOU(truth, pred) = Pr(class_i) × IOU(truth, pred)

wherein IOU(truth, pred) is the intersection-over-union of the real frame and the predicted frame, and Pr(class_i) is the probability that the target within a cell belongs to the corresponding class.
6. The method according to claim 1, wherein step S6 is specifically:
S601, calculating the position information of a target on the large image from its position information on the sub-image and the relative position of that sub-image on the large image;
S602, setting the NMS threshold to 0.65; for each category, selecting the bounding box with the highest confidence, filtering out all other bounding boxes whose DIOU values with it exceed the NMS threshold, and, after the prediction boxes are screened, drawing boxes on the picture according to the retained prediction boxes, completing ship detection of the large-size SAR image.
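A sketch of steps S601–S602 (Python; torchvision's standard IoU-based NMS stands in for the DIOU variant, which torchvision does not provide, and a per-class loop is assumed upstream):

    import torch
    from torchvision.ops import nms

    def merge_and_screen(sub_boxes, sub_scores, offsets, nms_thres=0.65):
        # Shift each sub-image's boxes (x1, y1, x2, y2) into large-image coordinates,
        # then suppress overlapping predictions on the stitched result.
        shifted = [b + torch.tensor([ox, oy, ox, oy], dtype=b.dtype)
                   for b, (ox, oy) in zip(sub_boxes, offsets)]
        boxes, scores = torch.cat(shifted), torch.cat(sub_scores)
        keep = nms(boxes, scores, nms_thres)
        return boxes[keep], scores[keep]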
7. A SAR image ship detection system based on lightweight deep learning, implementing the method of claim 1 and comprising:
The data module is used for preprocessing the large-size SAR image and selecting sub-images containing target information as training samples;
the processing module is used for introducing the Ghost module and GhostBottleneck to upgrade the YOLOv5s model into a preliminary lightweight YOLOv5s model, and training the YOLOv5s model with the training samples selected by the data module;
the fine tuning module is used for distilling the YOLOv5s model obtained after the training of the processing module, then carrying out sparsification training and pruning, and carrying out fine tuning training on the YOLOv5s model after pruning;
the inference module is used for accelerating inference of the YOLOv5s model fine-tuned by the fine tuning module using the TensorRT inference optimizer, the accelerated model being deployed on NVIDIA Jetson TX2;
the detection module is used for cutting the SAR image to be detected and sequentially sending the sub-images to the YOLOv5s model deployed on NVIDIA Jetson TX2 by the inference module for detection, obtaining corresponding sub-image detection results;
and the removing module is used for splicing the sub-image detection results obtained by the detection module, using NMS (non-maximum suppression) on the final large-size SAR image to screen the prediction frames, drawing the screened prediction frames on the original large-size image according to their values and marking the categories, thereby realizing ship detection of the large-size SAR image.