CN113850256A - Target detection and identification method based on FSAF and fast-slow weight - Google Patents

Target detection and identification method based on FSAF and fast-slow weight

Info

Publication number
CN113850256A
CN113850256A CN202111065576.2A CN202111065576A CN113850256A CN 113850256 A CN113850256 A CN 113850256A CN 202111065576 A CN202111065576 A CN 202111065576A CN 113850256 A CN113850256 A CN 113850256A
Authority
CN
China
Prior art keywords
layer
loss
prediction
feature
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111065576.2A
Other languages
Chinese (zh)
Inventor
聂振钢
赵乐
卢继华
侯杰继
马志峰
韩航程
谢民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111065576.2A priority Critical patent/CN113850256A/en
Publication of CN113850256A publication Critical patent/CN113850256A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a target detection and identification method based on FSAF and fast-slow weights, and belongs to the technical field of supervised learning and target identification. The method comprises the following steps: 1) building a backbone network comprising convolutional layers, feature layers and prediction layers, together with a RetinaNet branch with reference frames; 2) building an FSAF branch and generating the effective area and ignored area of each image feature layer, specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, and taking the mapping frame of the standard frame on each feature layer, shrunk by a factor A1, as the effective area and, shrunk by a factor A2, as the ignored area; 3) calculating the comprehensive loss, i.e. the sum of the classification loss and the regression loss, based on the FSAF branch; 4) inputting the comprehensive loss corresponding to the selected optimal feature layer into a standard optimizer and a LookAhead optimizer so that the comprehensive loss converges. By introducing a no-reference-frame mechanism and the Lookahead fast-slow weight optimizer, the method achieves better recognition than the RetinaNet network and improves both the recognition accuracy and the convergence speed of the loss.

Description

Target detection and identification method based on FSAF and fast-slow weight
Technical Field
The invention relates to a target detection and identification method based on FSAF and fast-slow weight, belonging to the technical field of supervised learning and target identification.
Background
With the continuous development of deep-sea observation and ocean technology, the importance of underwater target identification in fishery, aquaculture, coastal defense and military applications is increasingly prominent. However, natural underwater environments are very complex and full of all kinds of interference, which greatly degrades the performance of underwater target detection. Techniques aimed at improving the accuracy of underwater target detection have therefore developed rapidly.
From the viewpoint of object recognition, underwater image data has the following problems: (1) degradation of edges and details; (2) extremely low contrast between target and background; (3) various kinds of noise caused by floating objects and marine litter. To cope with these problems, traditional methods mainly extract image features by hand: after the acquired image is processed, the specified features are located and classified manually, which is hardly feasible on large data sets. With the development of the technology, a number of target and feature recognition techniques have been proposed.
Among them, the CNN (Convolutional Neural Network) is currently the most commonly used structure for object recognition and image feature extraction. In application, the convolutional neural network relies on the following properties of the images to be recognized: (1) Locality. Usually only part of a picture contains the features of an object, and a weighted summation of these features yields the object's feature value. (2) Location invariance. A data set containing many images may contain different features and different numbers of objects at different locations, but the relative coordinates of similar objects or similar features remain consistent. (3) Stability. After down-sampling, a picture basically keeps the same feature information, and a target generally has several obvious features.
In the case of small-object recognition, the following problems arise: (1) Small objects do have certain features, but these features are very small compared with the image to be recognized, and they may appear densely or scattered across the image. (2) A small object carries a limited amount of information and, because of pose changes, the positional reference provided by its features is easily weakened. (3) Small objects may lose important information after sampling.
To solve the above problems there are many different approaches, such as simply enlarging the input image directly or pre-segmenting the image into different sizes, as in YOLOv1; more elaborately, a generative adversarial network can be used to enlarge the small objects in the image, or an independent neural network can select the places where targets may appear before recognition. In addition, feature fusion within the convolutional network structure itself is also considered, as in FPN and DSSD. The feature pyramid network FPN is a deep convolutional neural network in which the feature map of each level is used for subsequent classification and regression according to the target size, whereas an ordinary convolutional network performs target detection only on the feature map of the last convolutional layer. DSSD is an improvement on SSD.
To obtain faster regression convergence, the concept of the reference frame (anchor) was also proposed. The reference frame mechanism was introduced in YOLOv2, and the number of reference frames was increased in YOLOv3. It can roughly be understood as proposing a number of frames a priori and training against these presets, which is equivalent to artificially giving the network a limited number of positions to learn; newer networks such as the single-stage RetinaNet and the two-stage Faster R-CNN all adopt this approach. The reference frame mechanism generates a dense set of reference frames, so that the network can perform target classification and bounding-box coordinate regression directly on this dense set. Dense reference frames effectively improve the recall capability of the network, and the improvement is especially obvious for small-target detection.
However, under the reference frame mechanism each instance can only be matched, during training, with the feature layer whose scale is closest to it, and the matching rule is designed by hand in advance. The feature layer selected for each instance is therefore based entirely on such heuristics, and two similar instances may be assigned to different feature layers because of a small size difference, so the feature layer selected for a training instance may not be optimal.
On this basis, FSAF introduces a mechanism without reference frames. In practical applications the anchor-free mechanism lowers the heavy reliance of the reference frame mechanism on prior knowledge, reduces the rate of redundant frames when targets are limited, and improves the baseline performance of the backbone network. It overcomes the drawback of a purely reference-frame-based network, in which the search range cannot be guaranteed to be large enough to reach the optimal feature layer, so that each feature instance can freely select the optimal level to optimize the network. Meanwhile, the added overhead is very limited and can almost be ignored.
In addition, compared with a model that only uses the FSAF network and an internal standard optimizer (such as SGD or Adam), inserting image preprocessing and the Lookahead fast-slow weight optimizer gives a better result: the target features are fitted noticeably better and the loss converges faster.
The loss to be converged consists of two parts, the classification loss and the regression loss. During the calculation, in order to obtain the ground truth of an instance, the concepts of effective area and ignored area are introduced: the results of mapping the instance of a certain category and its standard box onto the feature maps of the feature pyramid at different scales are set to 1 and 0 as the effective area and the ignored area, respectively. The classification loss is the sum of the focal losses over the non-ignored regions, normalized by the number of pixels in the effective region. The regression loss is the average of the IoU losses over the effective box region of the image, i.e. the prediction box.
The Lookahead optimizer first updates the "fast weights" k times in its inner loop using the internal standard optimizer before updating the "slow weights" in the direction of the weight optimum, which reduces the variance. Lookahead is less sensitive to sub-optimal hyper-parameters and therefore reduces the need for extensive hyper-parameter tuning. By combining an internal standard optimizer with the Lookahead fast-slow weight optimizer, faster convergence of the loss can be achieved.
Disclosure of Invention
The invention aims to provide a target detection and identification method based on FSAF and fast-slow weights, addressing the technical defect that existing RetinaNet- and FSAF-based target detection methods have relatively low average accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the target detection and identification method based on FSAF and fast-slow weight comprises the following steps:
step 1, building a main network comprising a convolution layer, a characteristic layer and a prediction layer and a RetinaNet branch with a reference frame;
the number of convolution layers is the number of layers of the feature pyramid, and is marked as N, the number of prediction layers is also N, the number of feature layers is N-2, and the resolution of the convolution layer of the first layer is 1/2 of the resolution of the input imagel
Wherein, the value range of l is 1 to N-1, and the characteristic pyramid is the convolution layers from 1 st to N th;
step 1 comprises the following substeps:
step 1.1, taking the original image as a first layer of convolution layer, and sequentially carrying out 1/2 sampling from bottom to top to obtain 2 nd to N th layers of convolution layers to obtain a characteristic pyramid;
step 1.2, feature fusion is carried out on different layers of the feature pyramid to obtain feature layers from the 3 rd layer to the Nth layer, and the method specifically comprises the following steps:
For the N-th convolutional layer, it is used directly as the N-th feature layer; for l = 3, …, N-1, the result of 2× up-sampling of the (l+1)-th feature layer is summed with the result of a 1 × 1 convolution of the l-th convolutional layer, i.e. feature fusion is performed, giving the (N-1)-th down to the 3rd feature layer in turn after fusion;
the 1 × 1 convolution is used for ensuring that the dimension of the convolution layer participating in the fusion is consistent with that of the characteristic layer;
step 1.3, performing a 3 × 3 convolution on each of the feature layers from the N-th down to the 3rd obtained in step 1.2 to obtain the N-th to 3rd prediction layers, N-2 layers in total;
step 1.4, separately performing a 3 × 3 convolution on the N-th convolutional layer obtained in step 1.1 to obtain the (N+1)-th prediction layer;
step 1.5, performing a 3 × 3 convolution on the (N+1)-th prediction layer obtained in step 1.4 to obtain the (N+2)-th prediction layer;
so far, steps 1.3 to 1.5 yield N prediction layers, the same number as the convolutional layers, and different kinds of object features are fused in each prediction layer;
step 1.6, adding a RetinaNet branch with a reference frame behind each prediction layer;
thus, a backbone network built by the convolution layer, the characteristic layer and the prediction layer is obtained;
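As a concrete illustration of steps 1.1 to 1.6, the following PyTorch-style sketch builds the top-down feature fusion and the extra prediction layers for N = 5; the channel widths, module names and the use of nearest-neighbour up-sampling are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNHead(nn.Module):
    """Top-down feature fusion sketch: C3..C5 -> prediction layers P3..P7 (N = 5)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convolutions keep the fused layers at a common channel dimension
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions turn the fused feature layers into prediction layers P3..P5
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)
        # extra prediction layers P6 and P7 (steps 1.4 and 1.5)
        self.p6 = nn.Conv2d(in_channels[-1], out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        f5 = self.lateral[2](c5)
        f4 = self.lateral[1](c4) + F.interpolate(f5, scale_factor=2, mode="nearest")
        f3 = self.lateral[0](c3) + F.interpolate(f4, scale_factor=2, mode="nearest")
        p3, p4, p5 = self.smooth[0](f3), self.smooth[1](f4), self.smooth[2](f5)
        p6 = self.p6(c5)
        p7 = self.p7(F.relu(p6))
        return p3, p4, p5, p6, p7

# usage: feature maps with strides 8, 16, 32 of an 800x800 input
c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in [(512, 100), (1024, 50), (2048, 25)])
levels = FPNHead()(c3, c4, c5)
print([tuple(p.shape) for p in levels])
```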
step 2, building the FSAF branch on the basis of the backbone network built in step 1 and the RetinaNet branch with reference frames, and generating the effective area and ignored area of the image feature layers; specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, and taking the mapping frame of the standard frame on each feature layer, shrunk by a factor A1, as the effective area and, shrunk by a factor A2, as the ignored area;
wherein all branches without reference frames are collectively called the FSAF branch; A1 ranges from 0.15 to 0.4 and A2 ranges from 0.45 to 0.6; the no-reference-frame branch comprises a classification subnet and a regression subnet, the classification subnet being a "w × h × K convolutional layer + Sigmoid activation function" and the regression subnet a "w × h × 4 convolutional layer + ReLU activation function"; a gray area lies between the effective area and the ignored area;
wherein the output of the prediction layer of each level in the branch with reference frames is the input of the w × h × K convolutional layer in the corresponding classification subnet, and the output of that convolutional layer is the input of the Sigmoid function; the Sigmoid activation function maps the output of the w × h × K convolutional layer of the classification subnet to the range 0-1; K is the number of 3 × 3 convolution kernels of the convolutional layer in the classification subnet and corresponds to K feature classes; the ReLU function is a ramp function; w and h are the width and height of the standard box, respectively; and the standard box is annotated in the original data;
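The per-level no-reference-frame branch and the scaled effective/ignored regions can be sketched as follows; K = 4 classes, the 256-channel input, the (x1, y1, x2, y2) box format and the function names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AnchorFreeBranch(nn.Module):
    """One no-reference-frame branch: a K-class map (Sigmoid) and a 4-offset map (ReLU)."""
    def __init__(self, channels=256, num_classes=4):
        super().__init__()
        self.cls_conv = nn.Conv2d(channels, num_classes, 3, padding=1)  # "w x h x K" head
        self.reg_conv = nn.Conv2d(channels, 4, 3, padding=1)            # "w x h x 4" head
    def forward(self, feat):
        return torch.sigmoid(self.cls_conv(feat)), torch.relu(self.reg_conv(feat))

def region_masks(box_xyxy, feat_h, feat_w, stride, a1=0.25, a2=0.5):
    """Project a standard box onto one feature layer and return boolean masks for the
    A1-shrunk effective area and the A2-shrunk ignored area; the ring between the two
    masks plays the role of the gray area described above."""
    x1, y1, x2, y2 = (v / stride for v in box_xyxy)          # mapping (projected) box
    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    ys = torch.arange(feat_h).view(-1, 1).expand(feat_h, feat_w)
    xs = torch.arange(feat_w).view(1, -1).expand(feat_h, feat_w)
    def shrunk(scale):
        return ((xs >= cx - scale * w / 2) & (xs <= cx + scale * w / 2) &
                (ys >= cy - scale * h / 2) & (ys <= cy + scale * h / 2))
    return shrunk(a1), shrunk(a2)

# usage on a stride-8 prediction layer of an 800x800 image
cls_map, reg_map = AnchorFreeBranch()(torch.randn(1, 256, 100, 100))
effective, ignored = region_masks((120.0, 80.0, 360.0, 240.0), 100, 100, stride=8)
```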
step 3, calculating the comprehensive loss based on the FSAF branch;
wherein the comprehensive loss is the sum of the classification loss and the regression loss;
the step 3 specifically comprises the following steps:
step 3.A, calculating the classification loss by formula (1):

L^I_{FL}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} FL(l, i, j)    (1)

wherein I is the given instance; L^I_{FL}(l) is the classification loss of the l-th feature layer for the given instance I; N(b^l_e) is the number of pixels inside the region b^l_e; b^l_e is the effective area of this instance on this feature layer; and FL(l, i, j) is the focal loss at position (i, j) of the l-th feature layer, calculated by (2):

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)    (2)

wherein the focal loss is denoted FL(p_t); p_t is the classification probability of the different classes, γ is a hyper-parameter exponent, and α_t is a hyper-parameter coefficient;
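A minimal sketch of the focal loss of equation (2) and of the per-instance classification loss of equation (1); the γ = 2.0 and α = 0.25 values are the ones quoted later in embodiment 1, and masking out the gray ring between the effective and ignored regions follows one possible reading of the text.

```python
import torch

def focal_loss(p, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), applied per pixel.
    p: sigmoid class scores in (0, 1); targets: 1 inside the effective region
    of the class, 0 elsewhere."""
    p_t = torch.where(targets == 1, p, 1 - p)
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t + eps)

def instance_cls_loss(fl_map, effective_mask, ignored_mask):
    """Equation (1): sum the focal loss over non-gray pixels and normalise by the
    number of pixels in the effective region (gray ring = ignored minus effective)."""
    keep = ~(ignored_mask & ~effective_mask)
    return fl_map[keep].sum() / effective_mask.sum().clamp(min=1)

# toy usage with a 100x100 level and a 20x20 effective region
scores = torch.rand(100, 100)
targets = torch.zeros(100, 100)
targets[40:60, 40:60] = 1.0
eff = targets.bool()
ign = torch.zeros_like(eff); ign[30:70, 30:70] = True
print(instance_cls_loss(focal_loss(scores, targets), eff, ign))
```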
step 3.B, calculating the regression loss by formula (3):

L^I_{IoU}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} IoU(l, i, j)    (3)

wherein L^I_{IoU}(l) is the regression loss of the l-th feature layer for the given instance I, and IoU(l, i, j) is the IoU loss at position (i, j) of the l-th feature layer; the IoU loss compares the standard box and the prediction box of the training set and is calculated by formula (4):

IoU(l, i, j) = -\ln \frac{|Box_p \cap Box_l|}{|Box_p \cup Box_l|}    (4)

wherein Box_p is the current standard box and Box_l is the prediction box obtained in the current calculation; the numerator |Box_p ∩ Box_l| in formula (4) is the area of the common part of Box_p and Box_l, the denominator |Box_p ∪ Box_l| is the area of their union, and ln is the natural logarithm; after the effective area and ignored area of the instance are applied to the standard box, the region outside the ignored area is set to 0 and the effective area of the standard box is set to 1; the region between the effective area and the ignored area is regarded as a gray area, and its data is not processed;
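A small sketch of the IoU loss of equation (4), assuming boxes in (x1, y1, x2, y2) form; the epsilon terms are added only for numerical safety.

```python
import torch

def iou_loss(pred_box, std_box, eps=1e-7):
    """IoU loss of equation (4): -ln(|intersection| / |union|) for two (x1, y1, x2, y2) boxes."""
    ix1 = torch.max(pred_box[0], std_box[0]); iy1 = torch.max(pred_box[1], std_box[1])
    ix2 = torch.min(pred_box[2], std_box[2]); iy2 = torch.min(pred_box[3], std_box[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_s = (std_box[2] - std_box[0]) * (std_box[3] - std_box[1])
    union = area_p + area_s - inter
    return -torch.log(inter / (union + eps) + eps)

# example with two overlapping boxes
print(iou_loss(torch.tensor([10., 10., 50., 50.]), torch.tensor([12., 8., 48., 52.])))
```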
After online feature selection is performed on the prediction box based on the loss calculation results, the selected optimal feature layer and its loss are fed back for inference in the next iteration, specifically: after the branch of each instance produces its result, the comprehensive loss of the prediction-layer samples of each level is obtained by averaging, and the prediction layer with the smallest comprehensive loss is then selected as the feedback feature layer;
wherein the comprehensive loss of the feedback feature layer is the input of the optimizer in the RetinaNet branch with reference frames; the prediction layer with the smallest comprehensive loss is the optimal feature layer;
the prediction box is obtained by the following steps:
step 3.BA, obtaining the feature layer with the smallest comprehensive loss through formula (5):

l^* = \arg\min_l \left( L^I_{FL}(l) + L^I_{IoU}(l) \right)    (5)

wherein l^* is the number of the selected feedback feature layer, and L^I_{FL}(l) and L^I_{IoU}(l) are the classification loss and regression loss of the l-th feature layer for instance I, i.e. the losses of the feedback feature layer;
wherein the feature layer with the smallest comprehensive loss is the optimal feature layer;
step 3.BB, obtaining the coordinates of the prediction box based on the optimal feature layer obtained in step 3.BA;
step 3.BB.1, calculating the offsets, specifically: for every pixel (i, j) of the effective area, the mapping frame b^l_p of the standard box on this feature layer is represented by the distances between pixel (i, j) and the four sides of b^l_p; a normalization constant S = 4 is set, and the distances between pixel (i, j) and the four sides of b^l_p are divided by S to obtain the transmitted offsets;
wherein (i, j) are the horizontal and vertical coordinates of the pixel, and b^l_p is the mapping of the standard box on the l-th feature layer;
step 3.BB.2, taking every pixel of the effective area as a center, a candidate box whose size is consistent with the mapping frame b^l_p of this feature layer is formed; the comprehensive loss between each candidate box and the standard box is calculated, and the pixel (i_min, j_min) with the smallest comprehensive loss is selected as the center of the prediction box;
step 3.BB.3, eliminating the influence of the normalization constant S on the offsets to obtain the actual distances between the pixel and the sides of the prediction box, which gives the coordinates of the top-left and bottom-right corners of the prediction box mapped on the feature layer; the mapping frame is then scaled back to the original image resolution to obtain the prediction box on the original image;
wherein eliminating S means multiplying the transmitted offsets of step 3.BB.1 by the normalization constant S, giving the distances between pixel (i_min, j_min) and the four sides of the prediction box;
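A sketch of the decoding in steps 3.BB.1 to 3.BB.3: undo the normalization by S, form the corners around the chosen center pixel, and project back to the image plane; treating the level-l stride as 2^l is an assumption.

```python
def decode_prediction(i, j, offsets, level, S=4.0):
    """Steps 3.BB.1-3.BB.3 in reverse: the regression subnet outputs normalised
    distances (top, left, bottom, right) at feature-map pixel (i, j); multiply by S,
    form the box corners, then project to the image plane (level-l stride taken as 2**l)."""
    d_t, d_l, d_b, d_r = (S * o for o in offsets)
    y1, x1 = i - d_t, j - d_l          # top-left corner on the feature map
    y2, x2 = i + d_b, j + d_r          # bottom-right corner on the feature map
    stride = 2 ** level
    return x1 * stride, y1 * stride, x2 * stride, y2 * stride

# illustrative offsets produced at pixel (12, 20) of level 3
print(decode_prediction(12, 20, offsets=(1.5, 2.0, 1.25, 2.5), level=3))
```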
step 4, inputting the comprehensive loss corresponding to the optimal feature layer selected in step 3 into the internal standard optimizer and the LookAhead optimizer in the RetinaNet branch with reference frames, so that the comprehensive loss converges;
step 4, specifically comprising the following substeps:
step 4.1, initializing the outer-loop count, the objective function L, the slow weight parameter φ and the fast weight parameter θ;
wherein the outer-loop count is denoted t and the maximum count is denoted t_max; t is initialized to 1 and the slow weight parameter is initialized to φ_0;
step 4.2, in the t-th outer loop, assigning the slow weight parameter φ_{t-1} at outer-loop count t-1 to the initial fast weight θ, as the initial parameter of the standard optimizer;
wherein the standard optimizer runs inside the loop of the Lookahead optimizer, and the number of inner iterations is the synchronization period k; the synchronization period k is the number of iterations of the standard optimizer within one Lookahead loop, is chosen according to the optimization effect, and ranges from 1000 to 100000;
step 4.3, the standard optimizer inside the Lookahead optimizer calculates the fast weights in the i-th inner iteration according to formula (6):

\theta_{t,i} = \theta_{t,i-1} + A(\theta_{t,i-1}, d)    (6)

wherein A is the standard optimizer, which is one of standard gradient descent and stochastic gradient descent; d is the sample value of the current data; θ_{t,i-1} is the parameter required by the standard optimizer and is the fast-weight result of the (i-1)-th iteration of the standard optimizer in the t-th outer loop; θ_{t,i} is the fast-weight result of the i-th iteration of the standard optimizer in the t-th outer loop;
step 4.4, updating the slow weight parameter on the basis of the result of step 4.3 by formula (7):

\phi_t = \phi_{t-1} + \beta(\theta_{t,k} - \phi_{t-1})    (7)

wherein θ_{t,k} is the fast-weight result of the k-th iteration of the standard optimizer in the t-th outer loop; φ_{t-1} is the slow weight parameter at outer-loop count t-1; φ_t is the slow weight parameter at outer-loop count t; and β is the iteration parameter of the Lookahead optimizer;
step 4.5, judging whether the outer-loop count t equals the maximum count t_max, and deciding whether to finish, specifically: if yes, outputting the slow weight parameter φ_t at time t and completing the optimization; otherwise, setting t = t + 1 and jumping to step 4.2;
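The fast/slow-weight update of equations (6)-(7) can be sketched as below, with SGD as the internal standard optimizer A; k, β, the learning rate and the toy model and data are illustrative, not values used in the patent.

```python
import torch

def lookahead_train(model, data_iter, loss_fn, k=5, beta=0.5, outer_steps=10, lr=1e-3):
    """Run a standard optimizer (SGD) for k inner iterations on the fast weights,
    then move the slow weights a fraction beta toward the result and restart."""
    slow = [p.detach().clone() for p in model.parameters()]      # phi
    inner = torch.optim.SGD(model.parameters(), lr=lr)           # standard optimizer A
    for _ in range(outer_steps):
        for _ in range(k):                                       # fast-weight updates, eq. (6)
            x, y = next(data_iter)
            inner.zero_grad()
            loss_fn(model(x), y).backward()
            inner.step()
        with torch.no_grad():
            for p, s in zip(model.parameters(), slow):           # slow-weight update, eq. (7)
                s += beta * (p - s)                              # phi_t = phi_{t-1} + beta*(theta_{t,k} - phi_{t-1})
                p.copy_(s)                                       # fast weights restart from phi_t
    return model

# usage with a toy regression model and synthetic data
model = torch.nn.Linear(8, 1)
def batches():
    while True:
        x = torch.randn(16, 8)
        yield x, x.sum(dim=1, keepdim=True)
lookahead_train(model, batches(), torch.nn.functional.mse_loss)
```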
so far, from step 1 to step 4, the target detection and identification method based on the FSAF and the fast-slow weight is completed.
Advantageous effects
Compared with the prior art, the target detection and identification method based on FSAF and fast-slow weight has the following beneficial effects:
1. Compared with the RetinaNet network, which only introduces the reference frame mechanism, the method adds a no-reference-frame mechanism and, without noticeably increasing complexity, selects the optimal feature layer through online feature selection under this mechanism, so that on clearer underwater data sets the FSAF network achieves markedly better training accuracy than the RetinaNet network;
2. By introducing the calculation of regression loss and classification loss and combining focal loss, IoU loss and online feature selection in loss estimation and optimal-feature-layer selection, the method achieves rapid convergence of the comprehensive loss, with a convergence behaviour clearly better than that of the RetinaNet network;
3. The method inserts the Lookahead fast-slow weight optimizer, which looks ahead along the fast-weight sequence on top of the internal standard optimization method to determine the search direction, achieving a better optimization speed. In practice, inserting the Lookahead optimizer improves both accuracy and convergence speed;
4. In the comparison between FSAF and the RetinaNet network, the method removes the top prediction layer, which effectively reduces the amount of computation while maintaining good target detection accuracy;
drawings
FIG. 1 is a schematic diagram of a backbone network of the target detection and identification method based on FSAF and fast-slow weight according to the present invention;
FIG. 2 is a schematic diagram of a FSAF branch of the target detection and identification method based on FSAF and fast-slow weight according to the present invention;
FIG. 3 is a branch schematic diagram of the FSAF and fast-slow weight based target detection and identification method without reference frame according to the present invention;
FIG. 4 is a schematic diagram of an image preprocessing flow in an example of the target detection and identification method based on FSAF and fast-slow weighting according to the present invention;
FIG. 5 is a schematic diagram of a training flow in an example of the target detection and identification method based on FSAF and fast-slow weights according to the present invention;
FIG. 6 is a schematic diagram of loss convergence in an actual environment of the target detection and identification method based on FSAF and fast-slow weighting according to the present invention;
FIG. 7 is a schematic diagram of mAP in an actual environment of the target detection and identification method based on FSAF and fast-slow weight according to the present invention;
FIG. 8 is a test result of the target detection and identification method based on FSAF and fast-slow weight in underwater environment;
FIG. 9 is a graph showing the convergence of the first 40 epoch losses in a small data set comparison based on FSAF and RetinaNet according to the present invention;
FIG. 10 is a test result of the FSAF and fast-slow weight based target detection and identification method of the present invention in a clear environment;
Detailed Description
The object detection and identification method based on FSAF and fast-slow weighting according to the present invention will be further explained and described in detail with reference to the accompanying drawings and embodiments.
Example 1
The method realizes the optimal selection of the feature layer based on the feature pyramid and the RetinaNet network, combines a fast-slow weight optimizer, and has wide application space in underwater target identification and target detection with fuzzy features in dark environment; the method has great practical significance in the application scenes of marine industry, fishery industry and terrestrial fuzzy environment. In an example, simulated environmental testing is performed on a training and testing data set provided by a global underwater robot competition. The data set is obtained by intercepting non-adjacent frames of the video in a simulation environment, so that the underwater fuzzy environment is well simulated, and in addition, good effects are obtained on the data sets of classified identification such as ImageNet, PASCAL VOC, Labelme and the like.
The network model training and testing equipment configuration used in the examples is as follows: i7-9750H, 8GB memory, GPU (GTX1050) and 3GB video memory. The training results of the shallow sea environment picture provided by the URPC data set are visualized by this example. Selecting 2 pictures as a batch, forming an epoch by 1250 steps, and training 60 epochs;
mainly comprises the following steps:
a, selecting and dividing a subset of a data set from an underwater robot game;
The data set pictures are taken in underwater environments under different actual conditions and comprise 7 groups of subsets, numbered as follows: (1) 2019V1, 2019V2, 2019V3; (2) CHN083846; (3) G0024172, G0024173, G0024174; (4) GOPR0293, GOPR0294; (5) GP010293, GP010294, GP010295, GP010296; (6) YDXJ0001, YDXJ0002, YDXJ0003, YDXJ0013; (7) YN01, YN02 and YN03;
To divide the atlas used for comparing model effects, the pictures whose numbers begin with YN, 2574 in total, form the training set, and the remaining 2183 pictures form the evaluation set;
b, preprocessing the image;
the image preprocessing flow is shown in fig. 4, and step B includes the following sub-steps:
step B.1, reading the image, transferring it to the GPU, and repeatedly 1/2 down-sampling any image whose longest side exceeds 1000 pixels until its longest side is smaller than 1000 pixels;
b.2, carrying out balanced RGB three-channel color processing on the image with the same size;
weighting and adding a global average value and a local average value of a single channel of an image, then proportionally balancing the result to 100, and finally combining the three channels into an image matrix with the original size;
step B.3, defogging and brightness balancing are carried out on the matrix obtained in the step B.2;
the brightness balance is to correct the condition that the partial pixel value of the image after the defogging processing is larger than 255, so the brightness adjustment is carried out after the defogging. The method specifically comprises the following steps:
inverting and centralizing the image pixel values, and adjusting the brightness of the pixel segments larger than the average value;
b.4, performing median filtering on the result obtained in the step B.3 to filter noise in the image;
b.5, restoring the image to 800 multiplied by 800 size and saving the image to a specified folder;
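A rough sketch of the preprocessing pipeline of steps B.1 to B.5 using OpenCV; the colour-balance weights, the brightness handling and the omission of a real defogging step are simplifying assumptions, and the file paths in the commented call are hypothetical.

```python
import cv2
import numpy as np

def preprocess(path, out_path, max_side=1000, out_size=800):
    """Rough stand-in for steps B.1-B.5: repeated 1/2 down-sampling, per-channel
    colour balance toward a mean of 100, clipping as a crude brightness fix,
    median filtering, and resizing to 800x800 (no real defogging here)."""
    img = cv2.imread(path).astype(np.float32)
    while max(img.shape[:2]) > max_side:                          # B.1
        img = cv2.resize(img, (img.shape[1] // 2, img.shape[0] // 2))
    balanced = []
    for c in cv2.split(img):                                      # B.2
        mixed = 0.5 * c.mean() + 0.5 * cv2.blur(c, (31, 31))      # global + local averages (weights assumed)
        balanced.append(c * (100.0 / (mixed + 1e-6)))
    img = np.clip(cv2.merge(balanced), 0, 255)                    # B.3 (clipping stands in for brightness balance)
    img = cv2.medianBlur(img.astype(np.uint8), 3)                 # B.4
    cv2.imwrite(out_path, cv2.resize(img, (out_size, out_size)))  # B.5

# preprocess("raw/underwater_0001.jpg", "processed/underwater_0001.jpg")  # hypothetical paths
```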
step C, building a main network comprising a convolution layer, a characteristic layer and a prediction layer and a RetinaNet reference frame branch;
wherein the number of convolutional layers is set to 5, the number of prediction layers is also 5, and the number of feature layers is 3;
Step C comprises the following substeps:
step C.1, taking the original image as a first layer of convolution layer, and sequentially carrying out 1/2 sampling from bottom to top to obtain 2 nd to 5 th layers of convolution layers to obtain a characteristic pyramid;
and C.2, carrying out feature fusion on different layers of the feature pyramid, specifically:
The 5th convolutional layer is used directly as the 5th feature layer; for l = 3, 4, the result of 2× up-sampling of the (l+1)-th feature layer is summed with the result of a 1 × 1 convolution of the l-th convolutional layer, i.e. feature fusion is performed, giving the 3rd and 4th feature layers after fusion;
the 1 × 1 convolution is used for ensuring that the dimension of the convolution layer participating in the fusion is consistent with that of the characteristic layer;
step C.3, performing a 3 × 3 convolution on each of the feature layers obtained in step C.2 (the 5th, 4th and 3rd layers) to obtain the 5th to 3rd prediction layers, 3 layers in total;
step C.4, separately performing a 3 × 3 convolution on the 5th convolutional layer obtained in step C.1 to obtain the 6th prediction layer;
step C.5, performing a 3 × 3 convolution on the 6th prediction layer obtained in step C.4 to obtain the 7th prediction layer;
so far, 5 prediction layers, the same number as the convolutional layers, are obtained through steps C.3 to C.5, giving the built backbone network; the backbone network structure is shown in fig. 1;
step C.6, adding a RetinaNet branch with a reference frame behind each layer of prediction layer;
step D, building the FSAF branch on the basis of the backbone network built in step C and the RetinaNet branch with reference frames, and generating the effective area and ignored area of the image feature layers, specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, and taking the mapping frame of the standard frame on each feature layer, shrunk by a factor A1, as the effective area and, shrunk by a factor A2, as the ignored area;
wherein all branches without reference frames are collectively called the FSAF branch; A1 is set to 0.25 and A2 to 0.5; the no-reference-frame branch comprises a classification subnet and a regression subnet, the classification subnet being a "w × h × K convolutional layer + Sigmoid activation function" and the regression subnet a "w × h × 4 convolutional layer + ReLU activation function"; a gray area lies between the effective area and the ignored area;
wherein the output of the prediction layer of each level in the branch with reference frames is the input of the w × h × K convolutional layer in the corresponding classification subnet, and the output of that convolutional layer is the input of the Sigmoid function; the Sigmoid activation function maps the output of the w × h × K convolutional layer of the classification subnet to the range 0-1; K is the number of 3 × 3 convolution kernels of the convolutional layer in the classification subnet and corresponds to K feature classes; the ReLU function is a ramp function; w and h are the width and height of the standard box, respectively; and the standard box is annotated in the original data;
wherein the FSAF branch is shown in fig. 2; the branch diagram without reference frame is shown in FIG. 3;
step E, calculating the comprehensive loss based on the FSAF branch;
wherein the comprehensive loss comprises classification loss and regression loss;
the step E specifically comprises the following steps:
step E.A, calculating the classification loss by formula (1):

L^I_{FL}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} FL(l, i, j)    (1)

wherein I is the given instance; L^I_{FL}(l) is the classification loss of the l-th feature layer for the given instance I; N(b^l_e) is the number of pixels inside the region b^l_e; b^l_e is the effective area of this instance on this feature layer; and FL(l, i, j) is the focal loss at position (i, j) of the l-th feature layer, calculated by (2):

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)    (2)

wherein the focal loss is denoted FL(p_t); p_t is the classification probability of the different classes; γ is a hyper-parameter exponent with value γ = 2.0; and α_t is a hyper-parameter coefficient with value α = 0.25;
step E.B, calculating the regression loss by formula (3):

L^I_{IoU}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} IoU(l, i, j)    (3)

wherein L^I_{IoU}(l) is the regression loss of the l-th feature layer for the given instance I, and IoU(l, i, j) is the IoU loss at position (i, j) of the l-th feature layer; the IoU loss compares the standard box and the prediction box of the training set and is calculated by formula (4):

IoU(l, i, j) = -\ln \frac{|Box_p \cap Box_l|}{|Box_p \cup Box_l|}    (4)

After the effective area and ignored area of the instance are applied, the region outside the ignored area is set to 0 and the effective area of the standard box is set to 1, giving the standard box; the region between the effective area and the ignored area is regarded as a gray area, and its data is not processed; the effective-area scale is set to 0.25 and the ignored-area scale to 0.5, corresponding respectively to A1 and A2 in step D;
After online feature selection is performed on the prediction box based on the loss calculation results, the selected optimal feature layer and its loss are fed back for inference in the next iteration, specifically: after the branch of each instance produces its result, the comprehensive loss of the prediction-layer samples of each level is obtained by averaging, and the prediction layer with the smallest comprehensive loss is selected as the feedback feature layer;
wherein the comprehensive loss of the feedback feature layer is the input of the optimizer in the RetinaNet branch with reference frames;
step E.C, selecting online features, calculating to obtain coordinates of the prediction box, specifically including the following substeps:
step E.C1, obtaining the feature layer with the smallest comprehensive loss through formula (5):

l^* = \arg\min_l \left( L^I_{FL}(l) + L^I_{IoU}(l) \right)    (5)

wherein l^* is the number of the selected feedback feature layer, and L^I_{FL}(l) and L^I_{IoU}(l) are respectively the classification loss and regression loss of instance I on the l-th feature layer;
wherein the feature layer with the smallest comprehensive loss is the optimal feature layer;
step E.C2, obtaining the coordinates of the prediction box based on the optimal feature layer from step E.C1;
step E.C2.1, calculating the offsets, specifically: for every pixel (i, j) of the effective area, the mapping frame b^l_p of the standard box on this feature layer is represented by a 4-dimensional vector d_{i,j} whose components are the distances between pixel (i, j) and the four sides of b^l_p; the normalization constant S is set to 4, and for every pixel inside b^l_p the offset d_{i,j}/S is obtained;
wherein (i, j) are the horizontal and vertical coordinates of the pixel, b^l_p is the mapping of the standard box on the l-th feature layer, and the components of d_{i,j} are the distances between pixel (i, j) and the upper, lower, left and right sides of b^l_p;
step E.C2.2, taking every pixel of the effective area as a center, a candidate box whose size is consistent with the mapping frame b^l_p of this feature layer is formed; the comprehensive loss between each candidate box and the standard box is calculated, and the pixel (i_min, j_min) with the smallest comprehensive loss is selected as the center of the prediction box;
step E.C2.3, eliminating the normalization constant S from the predicted offsets to obtain the distances between the pixel and the sides of the prediction box, which gives the coordinates of the top-left and bottom-right corners of the prediction box mapped on the feature layer; the mapping frame is then scaled back to the original image resolution to obtain the prediction box on the original image;
wherein the predicted offsets denote, for the pixel selected in step E.C2.2 taken as center, the offsets of the upper, lower, left and right sides of the prediction box whose size is consistent with the feature-layer mapping frame b^l_p, and correspond to the offsets d_{i,j}/S of step E.C2.1; eliminating the influence of S on the offsets means multiplying each element of the predicted offsets by the normalization constant S of step E.C2.1, giving the distances between the pixel and the four sides of the prediction box;
Step F, inputting the comprehensive loss corresponding to the optimal characteristic layer selected in the step E into an internal standard optimizer and a LookAhead optimizer in a RetinaNet reference frame branch for comprehensive loss convergence;
the step F specifically comprises the following substeps:
step F.1, initializing the outer-loop count, the slow weight parameter φ and the fast weight parameter θ;
wherein the outer-loop count is denoted t and the maximum count is denoted t_max; t is initialized to 1 and the slow weight parameter is initialized; the fast and slow weight parameters are the weights over all the images during the comprehensive-loss convergence computed on the images;
step F.2, in the t-th outer loop, assigning the slow weight parameter φ_{t-1} at outer-loop count t-1 to the initial fast weight θ, as the initial parameter of the standard optimizer;
wherein the standard optimizer runs inside the loop of the Lookahead optimizer, and the number of inner iterations is the synchronization period k; the synchronization period k is the number of iterations of the standard optimizer within one Lookahead loop, is adjusted according to the optimization effect, and ranges from 1000 to 100000;
step F.3, the standard optimizer inside the Lookahead optimizer calculates the fast weights in the i-th inner iteration according to formula (6):

\theta_{t,i} = \theta_{t,i-1} + A(\theta_{t,i-1}, d)    (6)

wherein A is the standard optimizer, here stochastic gradient descent; d is the sample value of the current data; θ_{t,i-1} is the parameter required by the standard optimizer and is the fast-weight result of the (i-1)-th iteration of the standard optimizer in the t-th outer loop; θ_{t,i} is the fast-weight result of the i-th iteration of the standard optimizer in the t-th outer loop;
step F.4, updating the slow weight parameter on the basis of the result of step F.3 by formula (7):

\phi_t = \phi_{t-1} + \beta(\theta_{t,k} - \phi_{t-1})    (7)

wherein θ_{t,k} is the fast-weight result of the k-th iteration of the standard optimizer in the t-th outer loop; φ_{t-1} is the slow weight parameter at outer-loop count t-1; φ_t is the slow weight parameter at outer-loop count t; and β is the iteration parameter of the Lookahead optimizer;
step F.5, judging whether the outer-loop count t equals the maximum count t_max, and deciding whether to finish, specifically: if yes, outputting the slow weight parameter φ_t at time t and completing the optimization; otherwise, setting t = t + 1 and jumping to step F.2;
g, training images based on the method from the step A to the step F, wherein the training model is set as follows:
In the comparison, the training model uses the FSAF model trained in the simulated environment. Training uses a single GPU, and the learning rate of the internal standard optimizer is set to 0.000625; the weights are updated for 60 iterations in total, 0.001 is selected as the initial iteration parameter, and at the 40th iteration the learning rate is reduced to 10% of its previous value;
Training of the model is started by calling tools/train in the toolbox; the training flow is shown in fig. 5;
step H, performing visualization processing on the target identification record based on the obtained training result;
specifically, after a prediction frame and comprehensive loss are obtained, calling an mmcv library to display the obtained loss convergence of the prediction frame;
Considering that the targets are underwater organisms, which have camouflage patterns and tend to cluster, the images are screened as follows:
Prediction boxes contained inside boxes of a different class are regarded as low-score samples: when the center of such a box is close to the center of the box containing it, the box is discarded; otherwise its confidence is reduced by a certain amount. For occluded small boxes, their confidence is moderately increased. To ensure a better display, only the boxes with high confidence are kept. Because of underwater channel interference and similar factors, a high-confidence box can be taken as a result that is recognizable in practical applications;
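The screening just described can be sketched as the following heuristic; the center-distance threshold, the confidence penalty/boost amounts and the keep threshold are all illustrative assumptions.

```python
def center(b):
    x1, y1, x2, y2 = b["xyxy"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def contains(outer, inner):
    ox1, oy1, ox2, oy2 = outer["xyxy"]; ix1, iy1, ix2, iy2 = inner["xyxy"]
    return ox1 <= ix1 and oy1 <= iy1 and ox2 >= ix2 and oy2 >= iy2

def screen_boxes(boxes, close=10.0, penalty=0.1, boost=0.05, keep=0.6):
    """Drop or down-weight boxes inside a different-class box, boost occluded small
    boxes, and keep only high-confidence results (all thresholds are assumptions)."""
    out = []
    for b in boxes:
        other_cls = [o for o in boxes if o is not b and o["cls"] != b["cls"] and contains(o, b)]
        if other_cls:
            cx, cy = center(b); ox, oy = center(other_cls[0])
            if abs(cx - ox) + abs(cy - oy) < close:              # centers nearly coincide: drop
                continue
            b = dict(b, score=b["score"] - penalty)              # otherwise lower the confidence
        elif any(o is not b and contains(o, b) for o in boxes):
            b = dict(b, score=min(1.0, b["score"] + boost))      # occluded small box: raise confidence
        if b["score"] >= keep:                                   # keep only high-confidence boxes
            out.append(b)
    return out

boxes = [
    {"xyxy": (10, 10, 200, 200), "cls": "holothurian", "score": 0.9},
    {"xyxy": (80, 80, 120, 120), "cls": "scallop", "score": 0.7},
]
print(screen_boxes(boxes))
```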
Evaluation of the model further comprises two parts, prediction precision and recall, where the single-class precision formula is shown in equation (8):
P=TP/(TP+FP) (8)
the recall ratio formula is shown in equation (9):
R=TP/(TP+FN) (9)
wherein P and R are respectively the precision and recall of a single category to be identified, TP denotes the number of samples predicted as positive whose prediction is correct, FP denotes the number of samples predicted as positive whose prediction is wrong, and FN denotes the number of samples that are actually positive but are predicted as negative.
In the criteria of the data set, the average precision (mAP) and average recall (mAR) over all feature classes are used by default to measure how good the model is. mAP50 is the average precision obtained when a prediction box whose overlap with the actual box exceeds 50% is counted as a correct prediction and one below 50% as a wrong prediction, while mAP is the average of the average precisions obtained with the overlap threshold ranging from 50% to 95% in steps of 0.05. The recall rate is treated in the same way; by combining mAP and mAR, the performance of the model in practical applications can be evaluated effectively.
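Equations (8)-(9) and the mAP averaging described above amount to the following small computation; the counts and AP values in the example are made up.

```python
def precision_recall(tp, fp, fn):
    """Equations (8)-(9): P = TP / (TP + FP), R = TP / (TP + FN) for one class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

print(precision_recall(tp=80, fp=20, fn=40))   # made-up counts -> (0.8, 0.666...)

# mAP as described above: average the AP over IoU thresholds 0.50, 0.55, ..., 0.95
ap_per_threshold = [0.65, 0.62, 0.58, 0.55, 0.50, 0.44, 0.37, 0.28, 0.18, 0.08]  # made up
print("mAP50:", ap_per_threshold[0], "mAP:", sum(ap_per_threshold) / len(ap_per_threshold))
```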
The convergence of the comprehensive loss is shown in fig. 6: as can be seen from fig. 6, adding image preprocessing and introducing the LookAhead optimizer makes convergence slightly faster in the early stage but slower in the later stage; in terms of convergence speed, directly using the internal standard optimizer converges somewhat better;
and the FSAF network is better than the RetinaNet network in convergence speed;
In addition, when only one epoch is trained on all pictures, the mAP of the model optimized with LookAhead is far higher than that of the model without it. However, in the subsequent training, the evaluation results of all configurations on the evaluation set fluctuate considerably, as shown in fig. 7; furthermore, although RetinaNet is less effective in the initial period, the effect achieved at the completion of training is substantially the same as FSAF under the same conditions. The final mAP is shown in Table 1:
TABLE 1 results of evaluation of actual Environment
mAP (IoU=0.5) | FSAF | FSAF+LookAhead | FSAF+LookAhead+preprocessing | RetinaNet
First epoch | 25.9% | 39.5% | 42.5% | No result
Final result | 54.2% | 54.3% | 54.6% | 54.4%
The partial inspection visualization results are shown in fig. 8;
In each pair of adjacent pictures, the left side is the original image and the right side is the image after preprocessing. It can be seen that after training the network can mark most of the objects to be recognized. Some objects that are hard to identify with the naked eye in the original image are well marked in the preprocessed picture.
Example 2
This embodiment describes a specific implementation of the target detection and identification method based on FSAF and fast-slow weight, which realizes performance comparison between the method and a RetinaNet network in an environment with a clear small data set, shows a comparison result of the two methods, and includes the following steps:
step A, dividing an atlas for comparing model effects into training sets with the number of 200 and evaluation sets with the number of 100, wherein the division mode adopts random division;
in specific implementation, the picture is derived from a training and testing data set provided by an underwater robot competition, and the data set is derived from non-adjacent frames of an intercepted video in a simulation environment;
b, preprocessing the image;
The image preprocessing is used to obtain input images of basically consistent size and appearance; its flow is the same as in embodiment 1;
step C, building a network through an MMDetection tool box,
the model is constructed according to the requirements of the tool box, and comprises the following sub-steps:
step C.1, invoking torchvision to construct a residual network module for feature extraction;
wherein torchvision is a tool library for image operations that is independent of PyTorch, and the constructed residual network mainly performs successive 1/2 down-sampling on the original image layer C1 to obtain convolutional layers C2, C3, C4 and C5 of different sizes, forming five convolutional layers;
step C.2, performing 1 × 1 convolutions on C3, C4 and C5 to obtain prediction layers P3, P4 and P5, and a 3 × 3 convolution with stride 2 on C5 to obtain prediction layer P6; the difference from embodiment 1 is that embodiment 2 removes the extra prediction layer P7 that RetinaNet adds after the P6 layer;
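A sketch of steps C.1 and C.2 using torchvision; ResNet-50, the 256-channel width and a recent torchvision API (weights=None) are assumptions, since the patent only states that a residual network is built through torchvision.

```python
import torch
import torchvision

# assumed backbone: ResNet-50 with random weights
backbone = torchvision.models.resnet50(weights=None)

def extract_c3_c5(x):
    """Run the stem and the four residual stages, keeping C3-C5 (strides 8/16/32)."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)
    c3 = backbone.layer2(c2)
    c4 = backbone.layer3(c3)
    c5 = backbone.layer4(c4)
    return c3, c4, c5

# step C.2: 1x1 convolutions give P3-P5, a stride-2 3x3 convolution on C5 gives P6
lat3, lat4, lat5 = (torch.nn.Conv2d(c, 256, 1) for c in (512, 1024, 2048))
p6_conv = torch.nn.Conv2d(2048, 256, 3, stride=2, padding=1)

with torch.no_grad():
    c3, c4, c5 = extract_c3_c5(torch.randn(1, 3, 800, 800))
    p3, p4, p5, p6 = lat3(c3), lat4(c4), lat5(c5), p6_conv(c5)
print([tuple(t.shape) for t in (p3, p4, p5, p6)])
```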
step C.3, setting partial hyper-parameters used in training and evaluation, wherein the parameter setting is the same as that of the embodiment 1;
step D, building an FSAF branch, wherein the specific steps are consistent with those of the embodiment 1 so as to be convenient for comparison with a reference frame branch of a RetinaNet network;
step E, inserting a Lookahead fast-slow weight optimizer into the FSAF internal standard optimizer, wherein the parameter setting is consistent with that of the embodiment 1;
step F, training images based on a training set, wherein the training setting is the same as that of the embodiment 1;
g, performing result visualization processing on the target recognition based on the training result obtained in the step F; specifically, after a boundary frame, a classification and a prediction score of a predicted track are obtained, an mmcv library is called to qualitatively display the obtained prediction frame, and the specific steps and the subsequent screening mode are the same as those of the embodiment 1;
When comparing the effects of the FSAF and RetinaNet models, the learning rate of the RetinaNet network is set as learning rate / (pictures per GPU × number of GPUs) = 0.01/(2 × 8), i.e. a learning rate of 0.000625; for comparison, the FSAF model is trained with two learning rates, 0.001 and 0.000625. The convergence of the first 40 epoch losses during training is shown in fig. 9, which shows that the loss convergence speed of FSAF is significantly better than that of RetinaNet, and that the convergence at a learning rate of 0.001 is better than at 0.000625; the performance of the RetinaNet network and the FSAF network at different learning rates is shown in table 2:
From the results in table 2, when training is insufficient, the learning efficiency shown by FSAF is clearly superior to that of RetinaNet; when the learning rate is large enough, FSAF performs well and reaches a good training effect in a short time compared with RetinaNet, with mAP50 reaching 93%; in terms of recall, FSAF is 2.7% higher than RetinaNet under the same training conditions;
TABLE 2 fast evaluation results of small data sets
The simulation test results in a clearer environment are shown in fig. 10:
The top left corner of each picture records the number of recognized aquatic organisms, marked with prediction boxes 1, 2 and 3 for the different targets. It can be seen that the FSAF model achieves excellent results in a clearer environment after sufficient training.
While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (10)

1. The target detection and identification method based on FSAF and fast-slow weight is characterized by comprising the following steps: the method comprises the following steps:
step 1, building a main network comprising a convolution layer, a characteristic layer and a prediction layer and a RetinaNet reference frame branch;
the number of the convolution layers is the number of layers of the characteristic pyramid and is marked as N, the number of the prediction layers is also N, and the number of the characteristic layers is N-2;
step 1 comprises the following substeps:
step 1.1, taking the original image as a first layer of convolution layer, and sequentially carrying out 1/2 sampling from bottom to top to obtain 2 nd to N th layers of convolution layers to obtain a characteristic pyramid;
step 1.2, carrying out feature fusion on different layers of the feature pyramid to obtain feature layers from the 3 rd layer to the Nth layer;
step 1.3, performing a 3 × 3 convolution on each of the feature layers from the N-th down to the 3rd obtained in step 1.2 to obtain the N-th to 3rd prediction layers, N-2 layers in total;
step 1.4, separately performing a 3 × 3 convolution on the N-th convolutional layer obtained in step 1.1 to obtain the (N+1)-th prediction layer;
step 1.5, performing a 3 × 3 convolution on the (N+1)-th prediction layer obtained in step 1.4 to obtain the (N+2)-th prediction layer;
so far, steps 1.3 to 1.5 yield N prediction layers, the same number as the convolutional layers;
wherein, the convolution layer, the characteristic map layer and the prediction layer also form a built backbone network;
step 1.6, adding a RetinaNet branch with a reference frame behind each prediction layer;
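The following minimal PyTorch-style sketch (an illustration added for clarity, not part of the claims) shows one way the step-1 backbone could be assembled: an N-level convolutional pyramid, top-down feature fusion with 1 × 1 lateral convolutions and 2× up-sampling, and 3 × 3 convolutions producing the N prediction layers. The class name, channel width and level count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBackbone(nn.Module):
    """Sketch of the step-1 backbone; channel width and level count are illustrative."""

    def __init__(self, num_levels: int = 7, channels: int = 256):
        super().__init__()
        self.num_levels = num_levels                    # N in the claims
        # Step 1.1: successive 1/2 down-sampling giving convolution layers 2..N.
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
             for _ in range(num_levels - 1)])
        # Step 1.2: 1x1 lateral convolutions used in feature fusion (levels 3..N-1).
        self.lateral = nn.ModuleList(
            [nn.Conv2d(channels, channels, 1) for _ in range(num_levels - 3)])
        # Steps 1.3-1.5: 3x3 convolutions producing the N prediction layers.
        self.predict = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1)
             for _ in range(num_levels)])

    def forward(self, x):
        # x: (B, channels, H, W); a stem lifting an RGB image to `channels` is omitted.
        convs = [x]                                     # step 1.1: C1..CN, C1 = input
        for down in self.down:
            convs.append(down(convs[-1]))
        feats = {self.num_levels: convs[-1]}            # step 1.2: PN = CN directly
        for l in range(self.num_levels - 1, 2, -1):     # l = N-1, ..., 3
            up = F.interpolate(feats[l + 1], size=convs[l - 1].shape[-2:])
            feats[l] = self.lateral[l - 3](convs[l - 1]) + up
        # Steps 1.3-1.5: prediction layers 3..N, then the two extra levels.
        preds = [self.predict[l - 3](feats[l])
                 for l in range(3, self.num_levels + 1)]
        preds.append(self.predict[-2](convs[-1]))       # "(N+1)-th" prediction layer
        preds.append(self.predict[-1](preds[-1]))       # "(N+2)-th" prediction layer
        return preds                                    # N prediction layers in total
```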
step 2, building the FSAF branch based on the backbone network and the RetinaNet reference-frame branch built in step 1, and generating the effective area and the ignored area of the image feature layers; specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, taking the mapping box of the standard box on each feature layer, shrunk by a factor A1, as the effective area, and the same mapping box, shrunk by a factor A2, as the ignored area;
wherein, all branches without reference frame are collectively called FSAF branches;
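As a hedged illustration of step 2, the NumPy sketch below builds the effective and ignored areas of one standard box on one feature layer by shrinking its mapped box by factors A1 and A2. The values A1 = 0.2 and A2 = 0.5 (within the ranges of claim 7), the stride-based mapping and the mask convention (1 = effective, −1 = ignored ring, 0 = background) are assumptions for illustration.

```python
import numpy as np

def fsaf_regions(box_xyxy, feat_h, feat_w, stride, a1=0.2, a2=0.5):
    """Mask of shape (feat_h, feat_w): 1 = effective, -1 = ignored ring, 0 = background."""
    x1, y1, x2, y2 = [v / stride for v in box_xyxy]     # map the standard box onto the layer
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1

    def shrink(scale):
        # Shrink the mapped box around its centre and clamp it to the layer.
        return (int(max(cx - scale * w / 2, 0)), int(max(cy - scale * h / 2, 0)),
                int(min(cx + scale * w / 2, feat_w)), int(min(cy + scale * h / 2, feat_h)))

    mask = np.zeros((feat_h, feat_w), dtype=np.int8)
    ix1, iy1, ix2, iy2 = shrink(a2)                     # A2-shrunk box: ignored area
    mask[iy1:iy2, ix1:ix2] = -1
    ex1, ey1, ex2, ey2 = shrink(a1)                     # A1-shrunk box: effective area
    mask[ey1:ey2, ex1:ex2] = 1
    return mask
```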
step 3, calculating the comprehensive loss based on the FSAF branch;
wherein the comprehensive loss comprises classification loss and regression loss;
the step 3 specifically comprises the following steps:
step 3.A, calculating the classification loss by equation (1):

L_FL^I(l) = (1 / N(b_e^l)) · Σ_{(i,j) ∈ b_e^l} FL(l, i, j)    (1)

wherein I is the given example; L_FL^I(l) is the classification loss of the l-th feature layer for the given example I; N(b_e^l) is the total number of pixels in the region b_e^l; b_e^l is the effective area of this example on the l-th feature layer; FL(l, i, j) is the focal loss at position (i, j) of the l-th feature layer, calculated by equation (2):

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)    (2)

wherein the focal loss is denoted FL(p_t); p_t is the classification probability of the corresponding class; γ is a hyper-parameter exponent; α_t is a hyper-parameter coefficient;
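A minimal PyTorch sketch of the focal loss of equation (2) is given below for a per-class sigmoid formulation; α_t = 0.25 and γ = 2 are the commonly used values and are assumptions here, not values fixed by the claims.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Element-wise focal loss; `targets` holds 0/1 labels with the same shape as `logits`."""
    targets = targets.float()
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); summing this over the effective
    # area and dividing by its pixel count gives the layer classification loss of eq. (1).
    return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))
```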
step 3.B, calculating the regression loss by equation (3):

L_IoU^I(l) = (1 / N(b_e^l)) · Σ_{(i,j) ∈ b_e^l} IoU(l, i, j)    (3)

wherein L_IoU^I(l) is the regression loss of the l-th feature layer for the given example I, and IoU(l, i, j) is the IoU loss at position (i, j) of the l-th feature layer; the IoU loss compares the standard box of the training set with the prediction box and is calculated by equation (4):

IoU = -ln( |Box_p ∩ Box_l| / |Box_p ∪ Box_l| )    (4)

wherein Box_p is the current standard box and Box_l is the prediction box obtained by the current calculation; the numerator |Box_p ∩ Box_l| in equation (4) is the area of the common part of Box_p and Box_l, the denominator |Box_p ∪ Box_l| is the area of their union, and ln is the natural logarithm;
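The IoU loss of equation (4) can be illustrated with the following NumPy sketch for two boxes in (x1, y1, x2, y2) form; the function and argument names are illustrative.

```python
import numpy as np

def iou_loss(box_p, box_l, eps=1e-6):
    """IoU loss between the standard box Box_p and the prediction box Box_l, both (x1, y1, x2, y2)."""
    x1, y1 = max(box_p[0], box_l[0]), max(box_p[1], box_l[1])
    x2, y2 = min(box_p[2], box_l[2]), min(box_p[3], box_l[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)        # |Box_p ∩ Box_l|
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_l = (box_l[2] - box_l[0]) * (box_l[3] - box_l[1])
    union = area_p + area_l - inter                      # |Box_p ∪ Box_l|
    return -np.log((inter + eps) / (union + eps))        # -ln(IoU), as in eq. (4)
```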
step 3.B, the standard box is further processed with the effective area and the ignored area on the example: the region outside the ignored area is set to 0 and the effective area of the standard box is set to 1, giving the processed standard box; the region between the effective area and the ignored area is regarded as a gray area, and the data of this region are not processed;
in step 3.B, after online feature selection is carried out on the prediction boxes based on the loss calculation results, the selected optimized feature layer and its loss are returned for inference and for the next iteration, specifically: after the branch of each example obtains its result, the comprehensive loss of each prediction layer is obtained by averaging over its samples, and the prediction layer with the minimum comprehensive loss is selected as the returned feature layer;
wherein the comprehensive loss of the returned feature layer is the input of the optimizer in the RetinaNet branch with reference frames, and the prediction layer with the minimum comprehensive loss is the optimized feature layer;
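A hedged sketch of the online feature selection described above: per feature layer, the averaged classification and regression losses are summed and the layer with the smallest combined loss is returned. The dictionary-based interface is an assumption for illustration.

```python
def select_feature_layer(losses_per_layer):
    """losses_per_layer: {layer index: (mean classification loss, mean regression loss)}."""
    combined = {l: cls_loss + reg_loss
                for l, (cls_loss, reg_loss) in losses_per_layer.items()}
    best_layer = min(combined, key=combined.get)         # l* = argmin_l (L_FL + L_IoU)
    return best_layer, combined[best_layer]

# Usage sketch: layer 5 is selected and its comprehensive loss fed back to the optimizer.
layer, loss = select_feature_layer({3: (0.9, 0.7), 4: (0.5, 0.6), 5: (0.4, 0.5)})
```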
step 4, inputting the comprehensive loss corresponding to the optimized feature layer selected in the step 3 into an internal standard optimizer and a LookAhead optimizer in a reference frame branch of the RetinaNet, so that the comprehensive loss is converged;
step 4, specifically comprising the following substeps:
step 4.1, initializing an external loop count value, an objective function L, a slow weight parameter phi and a fast weight parameter theta;
wherein the outer-loop count is denoted t and the maximum loop count is denoted t_max; t is initialized to 1 and the slow weight parameter is initialized to φ_0;
step 4.2, in the t-th outer loop, the slow weight parameter φ_{t-1} at outer-loop count t-1 is assigned to the initial fast weight θ and serves as the initial parameter of the standard optimizer;
wherein the standard optimizer runs in the inner loop of the Lookahead optimizer, and the number of iterations of the inner loop is the synchronization period k;
step 4.3, the standard optimizer inside the Lookahead optimizer calculates the fast weight of the i-th inner-loop iteration by equation (6):

θ_{t,i} = θ_{t,i-1} + A(θ_{t,i-1}, d)    (6)

wherein A is the standard optimizer, being one of standard gradient descent and stochastic gradient descent; d is the current data sample; θ_{t,i-1} is the parameter supplied to the standard optimizer, i.e. the fast-weight result of the (i-1)-th inner iteration of the standard optimizer in the t-th outer loop; θ_{t,i} is the fast-weight result of the i-th inner iteration of the standard optimizer in the t-th outer loop;
step 4.4 updates the slow weight parameter based on the result of step 4.3 by equation (7):
φ_t = φ_{t-1} + β(θ_{t,k} - φ_{t-1})    (7)

wherein θ_{t,k} is the fast-weight result of the k-th inner iteration of the standard optimizer in the t-th outer loop; φ_{t-1} is the slow weight parameter at outer-loop count t-1; φ_t is the slow weight parameter at outer-loop count t; β is the iteration parameter of the Lookahead optimizer;
step 4.5, judge whether the outer-loop count t equals the maximum loop count t_max and thereby decide whether to finish the method, specifically: if yes, output the slow weight parameter φ_t at time t, completing the optimization; otherwise, set t = t + 1 and jump to step 4.2.
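As an illustration of steps 4.1 to 4.5, the following PyTorch sketch wraps a standard optimizer in a Lookahead fast-slow-weight loop: k fast-weight steps with the inner optimizer (equation (6)), then one slow-weight interpolation step (equation (7)). The values k = 5 and β = 0.5 are common defaults used here as assumptions; the claims leave k and β as tunable parameters.

```python
import torch

class Lookahead:
    """Fast-slow weight wrapper around a standard (inner) optimizer."""

    def __init__(self, inner_optimizer, k=5, beta=0.5):
        self.inner = inner_optimizer                     # standard optimizer A (e.g. SGD)
        self.k, self.beta, self.step_count = k, beta, 0
        # phi_0: initialise the slow weights from the current fast weights.
        self.slow = [p.detach().clone()
                     for group in inner_optimizer.param_groups
                     for p in group["params"]]

    def step(self):
        self.inner.step()                                # eq. (6): fast-weight update
        self.step_count += 1
        if self.step_count % self.k:                     # inner loop not yet finished
            return
        params = [p for group in self.inner.param_groups for p in group["params"]]
        for slow, fast in zip(self.slow, params):
            # eq. (7): phi_t = phi_{t-1} + beta * (theta_{t,k} - phi_{t-1})
            slow += self.beta * (fast.detach() - slow)
            fast.data.copy_(slow)                        # restart fast weights from phi_t

# Usage sketch (model and loss are assumed):
# base = torch.optim.SGD(model.parameters(), lr=0.001)
# opt = Lookahead(base); loss.backward(); opt.step()
```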
2. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 1, wherein: in step 1, the resolution of the (l+1)-th convolution layer is 1/2^l of the input image resolution, with l ranging from 1 to N-1.
3. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 2, wherein: the feature pyramid is the 1 st through nth convolutional layers.
4. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 3, wherein: step 1.2 is specifically: the N-th convolution layer is directly used as the N-th feature layer; for l = 3, …, N-1, the result of 2× up-sampling the (l+1)-th feature layer is summed with the result of a 1 × 1 convolution of the l-th convolution layer, i.e. feature fusion, giving the (N-1)-th to 3rd feature layers in turn.
5. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 4, wherein: in step 1.2, the 1 × 1 convolution is used to ensure that the dimensions of the convolution layer participating in the fusion are consistent with those of the feature layer.
6. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 5, wherein: features of different types of objects are fused into each of the prediction layers obtained in step 1.5.
7. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 6, wherein: the value range of A1 in step 2 is 0.15 to 0.4; a2 has a value in the range of 0.45 to 0.6.
8. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 7, wherein: in step 2, the branch without reference frames comprises a classification subnet and a regression subnet, the classification subnet having the structure "w × h × K convolution layer + Sigmoid activation function" and the regression subnet having the structure "w × h × 4 convolution layer + ReLU activation function"; a gray area lies between the effective area and the ignored area; the output of the prediction layer of each level in the branch with reference frames is the input of the w × h × K convolution layer of the corresponding classification subnet, and the output of that convolution layer is the input of the Sigmoid function; the Sigmoid activation function maps the output of the w × h × K convolution layer of the classification subnet to the range 0 to 1; K is the number of 3 × 3 convolution kernels of the convolution layer in the classification subnet and corresponds to K feature classes; the ReLU function is a ramp function; w and h are the width and height of the standard box, respectively; and the standard box is already marked in the original data.
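A minimal PyTorch sketch of such an anchor-free branch is given below: a K-channel 3 × 3 convolution followed by a Sigmoid for classification, and a 4-channel 3 × 3 convolution followed by a ReLU for regression. The channel width and class count are illustrative assumptions.

```python
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Classification and regression subnets of the branch without reference frames."""

    def __init__(self, in_channels: int = 256, num_classes: int = 4):
        super().__init__()
        self.cls = nn.Sequential(                        # w x h x K classification output
            nn.Conv2d(in_channels, num_classes, 3, padding=1),
            nn.Sigmoid())                                # maps scores into 0..1
        self.reg = nn.Sequential(                        # w x h x 4 regression output
            nn.Conv2d(in_channels, 4, 3, padding=1),
            nn.ReLU())                                   # ramp function, offsets >= 0

    def forward(self, prediction_layer):
        return self.cls(prediction_layer), self.reg(prediction_layer)
```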
9. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 8, wherein: the prediction box is obtained by the following steps:
step 3.BA, obtain the feature layer with the minimum comprehensive loss through equation (5):

l* = argmin_l ( L_FL^I(l) + L_IoU^I(l) )    (5)

wherein l* is the index of the feature layer selected for parameter feedback, and L_FL^I(l) and L_IoU^I(l) are respectively the classification loss and the regression loss of the l-th feature layer for example I; the returned feature layer is l*;
wherein the feature layer with the minimum comprehensive loss is the optimized feature layer;
step 3.BB, obtain the coordinates of the prediction box based on the optimized feature layer obtained in step 3.BA;
step 3.BB.1, calculate the offsets, specifically: for every pixel (i, j) of the effective area, the mapping box b_p^l of the standard box on the l-th feature layer is represented by the distances between the pixel (i, j) and the four sides of b_p^l; a normalization constant S = 4 is set, and the distances between the pixel (i, j) and the four sides of b_p^l are normalized by S to obtain the transfer offsets;
wherein (i, j) are the horizontal and vertical coordinates of the pixel, and b_p^l is the mapping of the standard box on the l-th feature layer;
step 3.BB.2, taking each pixel of the effective area as a center, form prediction boxes whose sizes are consistent with the mapping box b_p^l of the feature layer, calculate the comprehensive loss between each such prediction box and the standard box, and select the pixel (i_min, j_min) with the minimum comprehensive loss as the center of the prediction box;
step 3.BB.3, eliminate the influence of the normalization constant S on the offsets to obtain the actual distances between the pixel and the prediction box, obtain the coordinates of the upper-left and lower-right corners of the prediction box on the feature layer, and then scale the mapping box according to the mapping relation between the l-th feature layer and the original image to obtain the prediction box on the original image;
wherein eliminating S means multiplying the transfer offsets of step 3.BB.1 by the normalization constant S, which gives the distances from the selected center pixel to the four sides of the prediction box.
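The offset encoding and box decoding of steps 3.BB.1 to 3.BB.3 can be sketched as follows, assuming the distances are divided by S = 4 when encoded and the feature-layer stride is used to map the decoded box back to the original image; both assumptions are illustrative readings of the claim.

```python
import numpy as np

S = 4.0  # normalisation constant of step 3.BB.1

def encode_offsets(pixel_ij, mapped_box_xyxy):
    """Distances (top, left, bottom, right) from pixel (i, j) to the mapped box, divided by S."""
    i, j = pixel_ij                                      # i: horizontal (x), j: vertical (y)
    x1, y1, x2, y2 = mapped_box_xyxy
    return np.array([j - y1, i - x1, y2 - j, x2 - i]) / S

def decode_box(pixel_ij, offsets, stride):
    """Invert the encoding (multiply by S) and map the box back to the original image."""
    i, j = pixel_ij
    top, left, bottom, right = offsets * S               # eliminate S, as in step 3.BB.3
    x1, y1 = (i - left) * stride, (j - top) * stride     # upper-left corner
    x2, y2 = (i + right) * stride, (j + bottom) * stride # lower-right corner
    return x1, y1, x2, y2
```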
10. The method for detecting and identifying objects based on FSAF and fast-slow weights as claimed in claim 9, wherein: in step 4.2, the synchronization period k is the number of iterations of the standard optimizer in the inner loop of the Lookahead optimizer; k is chosen according to the optimization effect, and its value ranges from 1000 to 100000.
CN202111065576.2A 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight Pending CN113850256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111065576.2A CN113850256A (en) 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111065576.2A CN113850256A (en) 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight

Publications (1)

Publication Number Publication Date
CN113850256A true CN113850256A (en) 2021-12-28

Family

ID=78973732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111065576.2A Pending CN113850256A (en) 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight

Country Status (1)

Country Link
CN (1) CN113850256A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A kind of small target deteection of Enhanced feature study and recognition methods
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
WO2020102988A1 (en) * 2018-11-20 2020-05-28 西安电子科技大学 Feature fusion and dense connection based infrared plane target detection method
CN112766184A (en) * 2021-01-22 2021-05-07 东南大学 Remote sensing target detection method based on multi-level feature selection convolutional neural network
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism

Similar Documents

Publication Publication Date Title
CN109934121B (en) Orchard pedestrian detection method based on YOLOv3 algorithm
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN110930454B (en) Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN109583425B (en) Remote sensing image ship integrated recognition method based on deep learning
CN108319972B (en) End-to-end difference network learning method for image semantic segmentation
WO2023015743A1 (en) Lesion detection model training method, and method for recognizing lesion in image
CN108537102B (en) High-resolution SAR image classification method based on sparse features and conditional random field
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN110765865B (en) Underwater target detection method based on improved YOLO algorithm
CN109886271B (en) Image accurate segmentation method integrating deep learning network and improving edge detection
CN110716792B (en) Target detector and construction method and application thereof
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN113256572B (en) Gastroscope image analysis system, method and equipment based on restoration and selective enhancement
CN114581709A (en) Model training, method, apparatus, and medium for recognizing target in medical image
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN116468995A (en) Sonar image classification method combining SLIC super-pixel and graph annotation meaning network
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN115439738A (en) Underwater target detection method based on self-supervision cooperative reconstruction
CN116977960A (en) Rice seedling row detection method based on example segmentation
CN114998362A (en) Medical image segmentation method based on double segmentation models
CN113191962A (en) Underwater image color recovery method and device based on environment background light and storage medium
CN116503763A (en) Unmanned aerial vehicle cruising forest fire detection method based on binary cooperative feedback
CN114782455B (en) Cotton row center line image extraction method for agricultural machine embedded equipment
CN113850256A (en) Target detection and identification method based on FSAF and fast-slow weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination