CN112926486A - Improved RFBnet target detection algorithm for ship small target - Google Patents
- Publication number
- CN112926486A (application number CN202110281458.9A)
- Authority
- CN
- China
- Prior art keywords
- training
- frames
- network
- layers
- convolution
- Prior art date
- 2021-03-16
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06V20/38 — Scenes; scene-specific elements: outdoor scenes
- G06F18/213 — Pattern recognition: feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
- G06N3/045 — Neural networks: combinations of networks
- G06N3/08 — Neural networks: learning methods
- G06V10/44 — Extraction of image or video features: local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
- G06V2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection
Abstract
The invention provides a target detection method based on an improved RFBnet, addressing the problem that, in ship target detection, ships in multi-target scenes easily occlude one another, causing missed detections and misclassification of small ship targets. First, feature fusion is performed with a pooling feature fusion module and a deconvolution feature fusion module. Second, stepwise convolution is proposed to extract the region-of-interest information of each feature unit in the original image, and a dilated convolution module incorporating an attention mechanism is designed and fused again with the new first three effective feature layers. Then, a focal classification loss function is introduced to address the imbalanced distribution of positive and negative samples during training. Finally, the network is trained on the ship detection data set SeaShips. The results show that the improved algorithm performs well, especially on small targets occluded in multi-target scenes: the mean average precision reaches 96.26%, an improvement of 4.74% over the algorithm before improvement, and the detection speed reaches 26 FPS, meeting the requirement of real-time detection.
Description
Technical field:
The invention relates to a method for detecting small targets under multi-target occlusion between ships, and in particular to a ship target detection method based on an improved RFBnet network.
Background art:
Ships are important carriers of maritime activities, and computer-vision-based ship target detection is applied in practical ship management systems, so the detection of offshore and inshore ships is widely used in both military and civil fields. At present, the natural head-up-view images used for ship target detection have the advantages of small data volume, high resolution, rich color and texture information, and easy acquisition, and have become an important data source in the target detection field. However, ships in multi-target scenes are easily occluded by one another, leading to missed detection of small targets, classification errors and similar problems; how to improve detection accuracy and speed so as to meet maritime security requirements in practical applications is an urgent problem to be solved.
Traditional ship target detection algorithms fall into three categories: statistics-based, knowledge-based and model-based target detection. These algorithms require manually extracted target features, such as scale-invariant feature transform (SIFT) matching, histogram of oriented gradients (HOG) features and speeded-up robust features (SURF), all of which involve complicated feature extraction processes and have significant disadvantages in detection accuracy and speed.
Although mainstream deep-learning target detection algorithms have been studied and improved, they typically only detect ship targets without further classifying them, which does not meet the requirements of actual ship supervision.
Summary of the invention:
Aiming at the problem that, in conventional ship target detection, ships in multi-target scenes are easily occluded by one another, causing missed detections and classification errors, the invention provides a natural-image target detection method based on an improved RFBnet, comprising the following steps:
s1, creating a SeaShips containing 7000 ship data sets with 1920 × 1080 resolution, the data set contains six types of vessels, which are defined to cover substantially all vessels present in the offshore area and to take into account background, lighting, perspective, visible hull proportion, dimensions and occlusion, the data set adopts a standard PASCALVOC labeling format, each picture is precisely labeled with a label and a boundary box of a target, preprocessing is carried out before training, the size of the picture is adjusted to 300 multiplied by 300 pixels, the ore carrier type comprises 1141 pieces, the bulk carrier type comprises 1129 pieces, the container hip type comprises 814 pieces, the general carrier hip type comprises 1188 pieces, the fisher boat type comprises 1258 pieces, the passenger hip type comprises 705 pieces, the six mixed types mainly comprise 765 pieces of mutually shielded ships in the image, and the training set, the verification set and the testing set are randomly divided according to the proportion of 7:2: 1;
s2, firstly cutting a natural image with original resolution into a size of 300 × 300 × 3, transmitting the natural image into an improved RFBnet network, keeping a common VGG only containing conv4-3 and fc7, adding some RFB modules, forming new feature layers which are respectively BasicRFB P3, P5, P6, P7 and P8, after conv4-3, changing one part of the natural image into a size of 38 × 38 × 256 through convolution, changing the other part of the natural image into a size of 19 × 19 × 512 through a mode of maximum pooling, convolution and pooling, continuing to change the natural image into a size of 19 × 19 × 1024 through a dilation convolution with a dilation rate of 6 and fc7, then changing the natural image into a size of 19 × 19 × 256 through convolution once, continuing to sample into a size of 38 × 256, and then fusing with the improvement of the first part to obtain a new feature layer, wherein the six feature layers are totally included;
s3, firstly, processing relatively shallow BasicRFB _ aP3 and BasicRFB P3 features by using maximum pooling layer 2 × 2, 3 × 3 convolution and Relu activation functions in the PFF module, enabling shallow layer network features to learn more nonlinear relations while keeping significant detail features and reducing feature dimensions of a shallow layer network, fusing with BasicRFB P5 to enable the BasicRFB P5 to acquire more edge detail information of BasicRFB _ a P3 and BasicRFB P3, then processing relatively deep BasicRFB P6, Conv 2P 5848P 7 and Conv 2P 8 by using Deconv 2 × 2, 3 × 3 convolution and Relu activation functions in the DFF module, enabling the network features to learn more nonlinear relations while filling feature contents and extracting sensitive feature information, fusing with BasicRFB P5, enabling the Conv 2P d P8 to extract more deep layer RFP 3527, and finally extracting more normalized features of the ConicRFP 638, ConicRFP 638 and Conv2, forming a new BasicRFB P5 characteristic, and similarly, performing the same operation on other layers in the backbone network, thus forming six new effective characteristic layers;
s4, adding DB1, DB2 and DB3 expansion volume blocks to the new layers of the basic RFB _ a P3, the basic RFB P3 and the basic RFB P5 in the network framework respectively, then the information of the receptive fields of the three layers of characteristic units in the original image is learned through DB1, DB2 and DB3 respectively, and the merged features are fused in a concatee mode, the number of channels of the original feature graph is increased by the merged features, finally, a convolution layer is added for increasing the learning capability of the network and simultaneously reducing the feature dimension, because the RFBnet algorithm obtains the feature map and then respectively inputs the feature map into the classification network and the positioning network, the category information and the position information are obtained by convolution of 3 x 3, i.e. information for one object is present in a feature cell of 3 x 3 size, therefore, the algorithm also uses a convolution dimension reduction mode of 3 multiplied by 3, so that the effective characteristic layers after the dimension reduction of the first three layers and the effective characteristic layers of the last three layers form six layers of latest effective characteristic layers;
s5, for the first latest effective feature layer 38 x 512, 1444 grids are contained in the first latest effective feature layer, each grid corresponds to 6 prior frames, each grid of the second, third and fourth latest effective feature layers corresponds to 6 prior frames, each grid of the fifth and sixth latest effective feature layers corresponds to 4 prior frames, finally 11620 prior frames are formed, then the 11620 frames are respectively adjusted according to the prediction result, then whether the 11620 adjusted frames contain the required object or not is judged, if yes, the frame is marked, of course, some frames obtained by utilizing the prior frames can be overlapped, therefore, the score and the overlapping condition of the frames are also judged, the required frame is found by utilizing a non-maximum inhibition method, and the type of the required frame is marked;
s6 candidate box matching and loss function design
In order to detect ship targets of different scales in the image, candidate frames with different aspect ratios are designed and matched to image targets of different scales. According to the RFBnet loss function, a set of candidate frames $D = \{d_1, d_2, \ldots, d_n\}$ is obtained, where each $d_i$ consists of the 4 coordinate values $(cx, cy, w, h)$: $(cx, cy)$ is the center-point coordinate and $w$, $h$ are the width and height of the candidate frame. Each candidate frame is matched against the real label frames to obtain its coordinates and the corresponding target category, where $l$ represents the category class; for simple notation, let $c$ represent the set of predicted class vectors and $l$ the set of predicted coordinate vectors. A candidate frame is marked as a positive sample Pos when its match with a real label frame exceeds a threshold, and as a negative sample Neg when the match falls below the threshold;
In addition, the RFBnet algorithm predicts 11620 prediction frames over 6 prediction scales, of which only a small fraction contain targets; most contain only image background information, so the network attends more to easily classified background frames and its ability to classify targets is reduced. To avoid the resulting degradation of model training, the method introduces a focal classification loss to supervise model training. The overall loss can be expressed as

$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$

where $N$ is the number of candidate frames matched to real frames. The localization loss is the smooth L1 loss between the predicted frames and the ground-truth frames, as in SSD:

$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$

The focal classification loss is

$L_{fl}(p_t) = -a_t (1 - p_t)^r \log(p_t)$

where $p_t$ denotes the probability that a prediction frame is correctly classified as background, and $a_t$ and $r$ are hyperparameters with $a_t \in [0, 1]$ and $r \in [0, 5]$. When $r > 0$, the loss of positive samples is relatively reduced and the model focuses more on training the negative samples, so adding the focal classification loss function effectively resolves the imbalanced distribution of positive and negative samples and improves the optimization efficiency of the model;
s7, network training
In the whole training process, to speed up optimization, prior boxes whose IoU with a real box exceeds 0.5 are set as positive example boxes, and hard-to-learn negative example boxes are used in training, with the ratio of positive to negative samples set to 3:1. Every 4 pictures form one batch, and the optimizer is Adam. The learning rate is controlled by a callback: when the loss has not decreased after two epochs, the learning rate is halved. Training is performed in three stages. First, the model is pre-trained on the ILSVRC CLS-LOC data set. In the second stage, the parameters of the first 20 layers of the network do not participate in training, with an initial learning rate of 0.0005 and 50 epochs. In the third stage, all network parameters participate in training, with an initial learning rate of 0.0001 and 100 epochs. To reduce training time, early stopping is added: each training stage ends when the loss value has not fallen for 4 epochs;
s8, network test
For the trained network model, the test samples are taken as input to obtain the predicted outputs, which are compared with the sample ground truth to calculate the mAP.
In summary, compared with a conventional convolutional neural network, the network model provided by the invention differs in that concatenate-style feature fusion is performed on the conventional six feature layers, interconnecting the shallow and deep layers, and a DB module is added to the first three of the new six feature layers, enhancing through fusion the shallow layers' detection of small targets.
Description of the drawings:
FIG. 1 is a flow chart of the fusion that forms the BasicRFB P5 feature.
FIG. 2 shows DB1 with the attention mechanism incorporated.
FIG. 3 shows the structures of DB1, DB2 and DB3.
Detailed description of the embodiments:
A ship target detection method based on an improved RFBnet network comprises the following steps:
s1, creating a SeaShips containing 7000 ship data sets with 1920 × 1080 resolution, the data set contains six types of vessels, which are defined to cover substantially all vessels present in the offshore area and to take into account background, lighting, perspective, visible hull proportion, dimensions and occlusion, the data set adopts a standard PASCALVOC labeling format, each picture is precisely labeled with a label and a boundary box of a target, preprocessing is carried out before training, the size of the picture is adjusted to 300 multiplied by 300 pixels, the ore carrier type comprises 1141 pieces, the bulk carrier type comprises 1129 pieces, the container hip type comprises 814 pieces, the general carrier hip type comprises 1188 pieces, the fisher boat type comprises 1258 pieces, the passenger hip type comprises 705 pieces, the six mixed types mainly comprise 765 pieces of mutually shielded ships in the image, and the training set, the verification set and the testing set are randomly divided according to the proportion of 7:2: 1;
s2, firstly cutting a natural image with an original resolution into a size of 300 × 300 × 3, transmitting the natural image into an improved RFBnet network, keeping a common VGG only containing conv4-3 and fc7, adding some RFB modules, forming new feature layers which are respectively BasicRFB P3, P5, P6, P7 and P8, after conv4-3, changing one part of the natural image into a size of 38 × 38 × 256 through convolution, changing the other part of the natural image into a size of 19 × 19 × 512 through a mode of maximum pooling, convolution and pooling, continuing to change the natural image into a size of 19 × 19 × 1024 through a dilation convolution with a dilation rate of 6 and fc7, changing the natural image into a size of 19 × 19 × 256 through a convolution, continuing to sample into a size of 38 × 256, and then fusing the natural image with the improved cone of the first part to obtain a new feature layer, wherein the size is six feature layers;
s3, firstly, processing relatively shallow BasicRFB _ aP3 and BasicRFB P3 features by using maximum pooling layer 2 × 2, 3 × 3 convolution and Relu activation functions in the PFF module, enabling shallow layer network features to learn more nonlinear relations while keeping significant detail features and reducing feature dimensions of a shallow layer network, fusing with BasicRFB P5 to enable the BasicRFB P5 to acquire more edge detail information of BasicRFB _ a P3 and BasicRFB P3, then processing relatively deep BasicRFB P6, Conv 2P 5848P 7 and Conv 2P 8 by using Deconv 2 × 2, 3 × 3 convolution and Relu activation functions in the DFF module, enabling the network features to learn more nonlinear relations while filling feature contents and extracting sensitive feature information, fusing with BasicRFB P5, enabling the Conv 2P d P8 to extract more deep layer RFP 3527, and finally extracting more normalized features of the ConicRFP 638, ConicRFP 638 and Conv2, forming a new BasicRFB P5 characteristic, and similarly, performing the same operation on other layers in the backbone network, thus forming six new effective characteristic layers;
s4, adding DB1, DB2 and DB3 expansion volume blocks to the new layers of the basic RFB _ a P3, the basic RFB P3 and the basic RFB P5 in the network framework respectively, then the information of the receptive fields of the three layers of characteristic units in the original image is learned through DB1, DB2 and DB3 respectively, and the merged features are fused in a concatee mode, the number of channels of the original feature graph is increased by the merged features, finally, a convolution layer is added for increasing the learning capability of the network and simultaneously reducing the feature dimension, because the RFBnet algorithm obtains the feature map and then respectively inputs the feature map into the classification network and the positioning network, the category information and the position information are obtained by convolution of 3 x 3, i.e. information for one object is present in a feature cell of 3 x 3 size, therefore, the algorithm also uses a convolution dimension reduction mode of 3 multiplied by 3, so that the effective characteristic layers after the dimension reduction of the first three layers and the effective characteristic layers of the last three layers form six layers of latest effective characteristic layers;
s5, for the first latest effective feature layer 38 x 512, 1444 grids are contained in the first latest effective feature layer, each grid corresponds to 6 prior frames, and each grid of the second, third and fourth latest effective feature layers corresponds to 6 prior frames; each grid of a fifth and a sixth latest effective feature layer corresponds to 4 prior frames, finally 11620 prior frames are formed, then the 11620 frames are adjusted respectively according to a prediction result, then whether the 11620 adjusted frames contain the required object or not is judged, if yes, the frame is marked, of course, some frames obtained by utilizing the prior frames are overlapped, the score and the overlapping condition of the frames are also judged, and the required frame is found by utilizing a non-maximum inhibition method and the type of the frame is marked;
s6 candidate box matching and loss function design
In order to detect ship targets with different scales in an image, candidate frames with different aspect ratios are designed to be matched to adapt to image targets with different scales, according to an RFBnet loss function, a candidate frame D (D1, D2, … dn) can be obtained, wherein di is composed of (cx, cy, w, h)4 coordinate values, the (cx, cy) is a central point coordinate, w and h are the width and the height of the candidate frame respectively, the candidate frame is matched with a real label frame, and the coordinate of the candidate frame and the corresponding target category are obtained and can be specifically expressed asl represents category class, and can be ordered for simple labelingWhereinRepresenting a set of prediction class vectors, whereinRepresenting a set of predicted coordinate vectors, marking the candidate frame as a positive sample Pos when the matching of the candidate frame and the real label frame is greater than a threshold value, and marking the candidate frame as a negative sample Neg when the matching of the candidate frame and the real label frame is less than the threshold value;
In addition, the RFBnet algorithm predicts 11620 prediction frames over 6 prediction scales, of which only a small fraction contain targets; most contain only image background information, so the network attends more to easily classified background frames and its ability to classify targets is reduced. To avoid the resulting degradation of model training, the method introduces a focal classification loss to supervise model training. The overall loss can be expressed as

$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$

where $N$ is the number of candidate frames matched to real frames. The localization loss is the smooth L1 loss between the predicted frames and the ground-truth frames, as in SSD:

$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$

The classification loss is calculated with the cross entropy, weighted into the focal form:

$L_{fl}(p_t) = -a_t (1 - p_t)^r \log(p_t)$

where $p_t$ denotes the probability that a prediction frame is correctly classified as background, and $a_t$ and $r$ are hyperparameters with $a_t \in [0, 1]$ and $r \in [0, 5]$. When $r > 0$, the loss of positive samples is relatively reduced and the model focuses more on training the negative samples, so adding the focal classification loss function effectively resolves the imbalanced distribution of positive and negative samples and improves the optimization efficiency of the model;
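A minimal sketch of the focal classification loss, assuming the standard form $-a_t (1 - p_t)^r \log(p_t)$ reconstructed above; the default values $a_t = 0.25$ and $r = 2$ and the 6-class-plus-background output layout are illustrative assumptions within the ranges the text gives.

```python
import torch
import torch.nn.functional as F

def focal_classification_loss(logits, targets, a_t=0.25, r=2.0):
    """-a_t * (1 - p_t)^r * log(p_t), averaged over candidate frames."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # ce = -log(p_t)
    p_t = torch.exp(-ce)                                     # recover p_t
    return (a_t * (1.0 - p_t) ** r * ce).mean()

logits = torch.randn(8, 7)               # 6 ship classes + background (assumed)
targets = torch.randint(0, 7, (8,))
print(focal_classification_loss(logits, targets))
```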
s7, network training
In the whole training process, to speed up optimization, prior boxes whose IoU with a real box exceeds 0.5 are set as positive example boxes, and hard-to-learn negative example boxes are used in training, with the ratio of positive to negative samples set to 3:1. Every 4 pictures form one batch, and the optimizer is Adam. The learning rate is controlled by a callback: when the loss has not decreased after two epochs, the learning rate is halved. Training is performed in three stages. First, the model is pre-trained on the ILSVRC CLS-LOC data set. In the second stage, the parameters of the first 20 layers of the network do not participate in training, with an initial learning rate of 0.0005 and 50 epochs. In the third stage, all network parameters participate in training, with an initial learning rate of 0.0001 and 100 epochs. To reduce training time, early stopping is added: each training stage ends when the loss value has not fallen for 4 epochs;
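The S7 schedule maps onto standard PyTorch utilities as sketched below; the model and the validation loss are placeholders, and only the second training stage (learning rate 0.0005, 50 epochs) is shown.

```python
import torch

model = torch.nn.Linear(10, 2)   # placeholder for the improved RFBnet
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# halve the learning rate when the loss has not decreased for two epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

best, stall = float("inf"), 0
for epoch in range(50):              # stage 2: 50 epochs, first 20 layers frozen
    val_loss = 1.0 / (epoch + 1)     # stand-in for the real validation loss
    scheduler.step(val_loss)
    if val_loss < best:
        best, stall = val_loss, 0
    else:
        stall += 1
        if stall >= 4:               # early stopping after 4 stagnant epochs
            break
```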
s8, network test
For the trained network model, the test samples are taken as input to obtain the predicted outputs, which are compared with the sample ground truth to calculate the mAP.
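The mAP of S8 is the mean of the per-class average precisions; a compact, non-interpolated AP computation is sketched below, with `scores` and `matches` (1 where a prediction matched a ground truth at IoU ≥ 0.5) as illustrative inputs.

```python
import numpy as np

def average_precision(scores, matches, n_gt):
    """Area under the precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(matches, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / n_gt
    precision = tp_cum / (tp_cum + fp_cum)
    ap, prev_r = 0.0, 0.0
    for p, rc in zip(precision, recall):
        ap += p * (rc - prev_r)      # rectangle under the PR curve
        prev_r = rc
    return ap

# mAP = mean of the six per-class APs
print(average_precision([0.9, 0.8, 0.6], [1, 0, 1], n_gt=2))  # ~0.833
```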
Claims (1)
1. A ship target detection method based on an improved RFBnet network, characterized by comprising the following steps:
s1, creating a 7000-piece 1920 × 1080-resolution ship data set SeaShips, wherein the data set comprises six types of ship types, the data set adopts a standard PASCALVOC labeling format, each picture is accurately labeled with a target label and a boundary box, preprocessing is performed before training, the size of an image is adjusted to 300 × 300 pixels, 1141 of an ore carrier type, 1129 of a bulk carrier type, 814 of a container type, 1188 of a general carrier type, 1258 of a firm boat type, 705 of a passger type, 765 of ship mutual occlusion observed by human eyes of the six types of mixed types, and the training set, the verification set and the test set are randomly divided according to a ratio of 7:2: 1;
s2, firstly cutting a natural image with original resolution into a size of 300 × 300 × 3, transmitting the natural image into an improved RFBnet network, keeping a common VGG only containing conv4-3 and fc7, adding some RFB modules, forming new feature layers which are respectively BasicRFB P3, P5, P6, P7 and P8, after conv4-3, changing one part of the natural image into a size of 38 × 38 × 256 through convolution, changing the other part of the natural image into a size of 19 × 19 × 512 through a mode of maximum pooling, convolution and pooling, continuing to change the natural image into a size of 19 × 19 × 1024 through a dilation convolution with a dilation rate of 6 and fc7, then changing the natural image into a size of 19 × 19 × 256 through convolution once, continuing to sample into a size of 38 × 256, and then fusing with the improvement of the first part to obtain a new feature layer, wherein the six feature layers are totally included;
s3, firstly, processing relatively shallow BasicRFB _ a P3 and BasicRFB P3 features by using maximum pooling layer 2 x 2, 3 x 3 convolution and Relu activation functions in a PFF module, then fusing with the BasicRFB P5, secondly, processing relatively deep BasicRFB P6, Conv2dP7 and Conv2dP8 by using Deconv 2 x 2, 3 x 3 convolution and Relu activation functions in the DFF module, fusing with the BasicRFB P5, and finally fusing the extracted features and performing L2 norm normalization operation to form new BasicRFB P5 features, and similarly, performing the same operation on other layers in a main network, thus forming six new effective feature layers in total;
s4, adding DB1, DB2 and DB3 expansion convolution blocks to new layers of BasicRFB _ a P3, BasicRFB P3 and BasicRFB P5 in a network frame respectively, then learning information of the receptive field of the three layers of feature units in an original image through DB1, DB2 and DB3 respectively, and finally adding a convolution layer, wherein after an RFBnet algorithm obtains a feature map, the feature map is input into a classification network and a positioning network respectively, category information and position information are obtained through convolution of 3 x 3, namely information of a target exists in feature units of 3 x 3 size, so that the algorithm also uses a convolution dimension reduction mode of 3 x 3, and thus an effective feature layer after the previous three layers of dimension reduction and an original three layers of effective feature layers form a six-layer latest effective feature layer together;
s5, for the first latest effective feature layer 38 x 512, 1444 grids are contained in the first latest effective feature layer, each grid corresponds to 6 prior frames, each grid of the second, third and fourth latest effective feature layers corresponds to 6 prior frames, each grid of the fifth and sixth latest effective feature layers corresponds to 4 prior frames, finally 11620 prior frames are formed, then the 11620 frames are respectively adjusted according to the prediction result, then whether the 11620 adjusted frames contain the required object or not is judged, if yes, the frame is marked, of course, some frames obtained by utilizing the prior frames can be overlapped, therefore, the score and the overlapping condition of the frames are also judged, the required frame is found by utilizing a non-maximum inhibition method, and the type of the required frame is marked;
s6 candidate box matching and loss function design
In order to detect ship targets with different scales in an image, candidate frames with different aspect ratios are designed to be matched to adapt to image targets with different scales, according to an RFBnet loss function, a candidate frame D (D1, D2, … dn) can be obtained, wherein di is composed of (cx, cy, w, h)4 coordinate values, the (cx, cy) is a central point coordinate, w and h are the width and the height of the candidate frame respectively, the candidate frame is matched with a real label frame, and the coordinate of the candidate frame and the corresponding target category are obtained and can be specifically expressed asl represents category class, and can be ordered for simple labelingWhereinRepresenting a set of prediction class vectors, whereinRepresenting a set of predicted coordinate vectors, marking the candidate frame as a positive sample Pos when the matching of the candidate frame and the real label frame is greater than a threshold value, and marking the candidate frame as a negative sample Neg when the matching of the candidate frame and the real label frame is less than the threshold value;
In addition, the RFBnet algorithm predicts 11620 prediction frames over 6 prediction scales, of which only a small fraction contain targets; most contain only image background information, so the network attends more to easily classified background frames and its ability to classify targets is reduced. To avoid the resulting degradation of model training, the method introduces a focal classification loss to supervise model training. The overall loss can be expressed as

$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$

where $N$ is the number of candidate frames matched to real frames. The localization loss is the smooth L1 loss between the predicted frames and the ground-truth frames, as in SSD:

$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$

The classification loss is calculated with the cross entropy, weighted into the focal form:

$L_{fl}(p_t) = -a_t (1 - p_t)^r \log(p_t)$

where $p_t$ denotes the probability that a prediction frame is correctly classified as background, and $a_t$ and $r$ are hyperparameters with $a_t \in [0, 1]$ and $r \in [0, 5]$; when $r > 0$, the loss of positive samples is relatively reduced and the model focuses more on training the negative samples;
s7, network training
In the whole training process, to speed up optimization, prior boxes whose IoU with a real box exceeds 0.5 are set as positive example boxes, and hard-to-learn negative example boxes are used in training, with the ratio of positive to negative samples set to 3:1. Every 4 pictures form one batch, and the optimizer is Adam. The learning rate is controlled by a callback: when the loss has not decreased after two epochs, the learning rate is halved. Training is performed in three stages. First, the model is pre-trained on the ILSVRC CLS-LOC data set. In the second stage, the parameters of the first 20 layers of the network do not participate in training, with an initial learning rate of 0.0005 and 50 epochs. In the third stage, all network parameters participate in training, with an initial learning rate of 0.0001 and 100 epochs. To reduce training time, early stopping is added: each training stage ends when the loss value has not fallen for 4 epochs;
s8, network test
For the trained network model, the test samples are taken as input to obtain the predicted outputs, which are compared with the sample ground truth to calculate the mAP.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110281458.9A CN112926486A (en) | 2021-03-16 | 2021-03-16 | Improved RFBnet target detection algorithm for ship small target |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110281458.9A CN112926486A (en) | 2021-03-16 | 2021-03-16 | Improved RFBnet target detection algorithm for ship small target |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112926486A | 2021-06-08 |
Family
ID=76175583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110281458.9A Withdrawn CN112926486A (en) | 2021-03-16 | 2021-03-16 | Improved RFBnet target detection algorithm for ship small target |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112926486A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113705327A (en) * | 2021-07-06 | 2021-11-26 | 中国电子科技集团公司第二十八研究所 | Fine-grained target classification method based on priori knowledge |
CN113705327B (en) * | 2021-07-06 | 2024-02-09 | 中国电子科技集团公司第二十八研究所 | Fine granularity target classification method based on priori knowledge |
CN113627310A (en) * | 2021-08-04 | 2021-11-09 | 中国电子科技集团公司第十四研究所 | Background and scale perception SAR ship target detection method |
CN113627310B (en) * | 2021-08-04 | 2023-11-24 | 中国电子科技集团公司第十四研究所 | SAR ship target detection method based on background and scale sensing |
CN113808164A (en) * | 2021-09-08 | 2021-12-17 | 西安电子科技大学 | Infrared video multi-target tracking method |
CN114445689A (en) * | 2022-01-29 | 2022-05-06 | 福州大学 | Multi-scale weighted fusion target detection method and system guided by target prior information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210608 |