CN113850256A - Target detection and identification method based on FSAF and fast-slow weight - Google Patents

Target detection and identification method based on FSAF and fast-slow weight

Info

Publication number
CN113850256A
CN113850256A CN202111065576.2A CN202111065576A CN113850256A CN 113850256 A CN113850256 A CN 113850256A CN 202111065576 A CN202111065576 A CN 202111065576A CN 113850256 A CN113850256 A CN 113850256A
Authority
CN
China
Prior art keywords
layer
loss
prediction
feature
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111065576.2A
Other languages
Chinese (zh)
Inventor
聂振钢
赵乐
卢继华
侯杰继
马志峰
韩航程
谢民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111065576.2A priority Critical patent/CN113850256A/en
Publication of CN113850256A publication Critical patent/CN113850256A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a target detection and identification method based on FSAF and fast-slow weights, and belongs to the technical field of supervised learning and target identification. The method comprises the following steps: 1) building a backbone network comprising convolutional layers, feature layers and prediction layers, together with a RetinaNet branch with reference frames; 2) building an FSAF branch and generating the effective area and ignored area of each image feature layer, specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, and taking the mapping frame of the standard frame on each feature layer, shrunk by a factor A1, as the effective area and, shrunk by a factor A2, as the ignored area; 3) calculating the comprehensive loss, i.e. the sum of the classification loss and the regression loss, based on the FSAF branch; 4) inputting the comprehensive loss corresponding to the selected optimal feature layer into a standard optimizer and a LookAhead optimizer so that the comprehensive loss converges. By introducing a no-reference-frame mechanism and the Lookahead fast-slow weight optimizer, the method achieves better recognition than the RetinaNet network and improves both the recognition accuracy and the convergence speed of the loss.

Description

Target detection and identification method based on FSAF and fast-slow weight
Technical Field
The invention relates to a target detection and identification method based on FSAF and fast-slow weight, belonging to the technical field of supervised learning and target identification.
Background
With the continuous development of deep-sea observation and ocean technology, the importance of underwater target identification in fishery, aquaculture, coastal defense and military applications is increasingly prominent. However, natural underwater environments are very complex and full of all kinds of interference, which greatly degrades the performance of underwater target detection. Techniques aimed at improving the accuracy of underwater target detection have therefore developed rapidly.
From the viewpoint of object recognition, underwater image data has the following problems: (1) degradation of edges and details; (2) extremely low contrast between target and background; (3) various kinds of noise caused by floating objects and marine litter. To cope with these problems, traditional methods mainly extract image features by hand: after the acquired image is processed, the specified features are located and classified manually, which is hardly feasible on large data sets. With the development of the technology, a number of target and feature recognition techniques have been proposed.
Among them, the CNN (Convolutional Neural Network) is currently the most commonly used structure for object recognition and image feature extraction. In application, the convolutional neural network relies on the following properties of the images to be recognized: (1) Locality. Usually only part of a picture contains the features of an object, and a weighted summation of these features yields the object's feature value. (2) Location invariance. A data set containing many images may contain different features and different numbers of objects at different locations, but the relative coordinates of similar objects or similar features remain consistent. (3) Stability. After down-sampling, a picture basically keeps the same feature information, and a target generally has several obvious features.
In the case of small-object recognition, the following problems arise: (1) Small objects do have certain features, but these features are very small compared with the image to be recognized, and they may appear densely or scattered across the image. (2) A small object carries a limited amount of information and, because of pose changes, the positional reference provided by its features is easily weakened. (3) Small objects may lose important information after sampling.
To solve the above problems there are many different approaches, such as simply enlarging the input image directly or pre-segmenting the image into different sizes, as in YOLOv1; more elaborately, a generative adversarial network can be used to enlarge the small objects in the image, or an independent neural network can select the places where targets may appear before recognition. In addition, feature fusion within the convolutional network structure itself is also considered, as in FPN and DSSD. The feature pyramid network FPN is a deep convolutional neural network in which the feature map of each level is used for subsequent classification and regression according to the target size, whereas an ordinary convolutional network performs target detection only on the feature map of the last convolutional layer. DSSD is an improvement on SSD.
To obtain faster regression convergence, the concept of the reference frame (anchor) was also proposed. The reference frame mechanism was introduced in YOLOv2, and the number of reference frames was increased in YOLOv3. It can roughly be understood as proposing a number of frames a priori and training against these presets, which is equivalent to artificially giving the network a limited number of positions to learn; newer networks such as the single-stage RetinaNet and the two-stage Faster R-CNN all adopt this approach. The reference frame mechanism generates a dense set of reference frames, so that the network can perform target classification and bounding-box coordinate regression directly on this dense set. Dense reference frames effectively improve the recall capability of the network, and the improvement is especially obvious for small-target detection.
However, under the reference frame mechanism each instance can only be matched, during training, with the feature layer whose scale is closest to it, and the matching rule is designed by hand in advance. The feature layer selected for each instance is therefore based entirely on such heuristics, and two similar instances may be assigned to different feature layers because of a small size difference, so the feature layer selected for a training instance may not be optimal.
On this basis, FSAF introduces a mechanism without reference frames. In practical applications the anchor-free mechanism lowers the heavy reliance of the reference frame mechanism on prior knowledge, reduces the rate of redundant frames when targets are limited, and improves the baseline performance of the backbone network. It overcomes the drawback of a purely reference-frame-based network, in which the search range cannot be guaranteed to be large enough to reach the optimal feature layer, so that each feature instance can freely select the optimal level to optimize the network. Meanwhile, the added overhead is very limited and can almost be ignored.
In addition, compared with a model that only uses the FSAF network and an internal standard optimizer (such as SGD or Adam), inserting image preprocessing and the Lookahead fast-slow weight optimizer gives a better result: the target features are fitted noticeably better and the loss converges faster.
The loss to be converged consists of two parts, the classification loss and the regression loss. During the calculation, in order to obtain the ground truth of an instance, the concepts of effective area and ignored area are introduced: the results of mapping the instance of a certain category and its standard box onto the feature maps of the feature pyramid at different scales are set to 1 and 0 as the effective area and the ignored area, respectively. The classification loss is the sum of the focal losses over the non-ignored regions, normalized by the number of pixels in the effective region. The regression loss is the average of the IoU losses over the effective box region of the image, i.e. the prediction box.
The Lookahead optimizer first updates the "fast weights" k times in its inner loop using the internal standard optimizer before updating the "slow weights" in the direction of the weight optimum, which reduces the variance. Lookahead is less sensitive to sub-optimal hyper-parameters and therefore reduces the need for extensive hyper-parameter tuning. By combining an internal standard optimizer with the Lookahead fast-slow weight optimizer, faster convergence of the loss can be achieved.
Disclosure of Invention
The invention aims to provide a target detection and identification method based on FSAF and fast-slow weights, addressing the technical defect that existing RetinaNet- and FSAF-based target detection methods have relatively low average accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the target detection and identification method based on FSAF and fast-slow weight comprises the following steps:
step 1, building a main network comprising a convolution layer, a characteristic layer and a prediction layer and a RetinaNet branch with a reference frame;
the number of convolution layers is the number of layers of the feature pyramid, and is marked as N, the number of prediction layers is also N, the number of feature layers is N-2, and the resolution of the convolution layer of the first layer is 1/2 of the resolution of the input imagel
Wherein, the value range of l is 1 to N-1, and the characteristic pyramid is the convolution layers from 1 st to N th;
step 1 comprises the following substeps:
step 1.1, taking the original image as a first layer of convolution layer, and sequentially carrying out 1/2 sampling from bottom to top to obtain 2 nd to N th layers of convolution layers to obtain a characteristic pyramid;
step 1.2, feature fusion is carried out on different layers of the feature pyramid to obtain feature layers from the 3 rd layer to the Nth layer, and the method specifically comprises the following steps:
For the N-th convolutional layer, it is used directly as the N-th feature layer; for l = 3, …, N-1, the result of 2× up-sampling of the (l+1)-th feature layer is summed with the result of a 1 × 1 convolution of the l-th convolutional layer, i.e. feature fusion is performed, giving the (N-1)-th down to the 3rd feature layer in turn after fusion;
the 1 × 1 convolution is used for ensuring that the dimension of the convolution layer participating in the fusion is consistent with that of the characteristic layer;
step 1.3, performing a 3 × 3 convolution on each of the feature layers from the N-th down to the 3rd obtained in step 1.2 to obtain the N-th to 3rd prediction layers, N-2 layers in total;
step 1.4, separately performing a 3 × 3 convolution on the N-th convolutional layer obtained in step 1.1 to obtain the (N+1)-th prediction layer;
step 1.5, performing a 3 × 3 convolution on the (N+1)-th prediction layer obtained in step 1.4 to obtain the (N+2)-th prediction layer;
so far, steps 1.3 to 1.5 yield N prediction layers, the same number as the convolutional layers, and different kinds of object features are fused in each prediction layer;
step 1.6, adding a RetinaNet branch with a reference frame behind each prediction layer;
thus, a backbone network built by the convolution layer, the characteristic layer and the prediction layer is obtained;
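As a concrete illustration of steps 1.1 to 1.6, the following PyTorch-style sketch builds the top-down feature fusion and the extra prediction layers for N = 5; the channel widths, module names and the use of nearest-neighbour up-sampling are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNHead(nn.Module):
    """Top-down feature fusion sketch: C3..C5 -> prediction layers P3..P7 (N = 5)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convolutions keep the fused layers at a common channel dimension
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions turn the fused feature layers into prediction layers P3..P5
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)
        # extra prediction layers P6 and P7 (steps 1.4 and 1.5)
        self.p6 = nn.Conv2d(in_channels[-1], out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        f5 = self.lateral[2](c5)
        f4 = self.lateral[1](c4) + F.interpolate(f5, scale_factor=2, mode="nearest")
        f3 = self.lateral[0](c3) + F.interpolate(f4, scale_factor=2, mode="nearest")
        p3, p4, p5 = self.smooth[0](f3), self.smooth[1](f4), self.smooth[2](f5)
        p6 = self.p6(c5)
        p7 = self.p7(F.relu(p6))
        return p3, p4, p5, p6, p7

# usage: feature maps with strides 8, 16, 32 of an 800x800 input
c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in [(512, 100), (1024, 50), (2048, 25)])
levels = FPNHead()(c3, c4, c5)
print([tuple(p.shape) for p in levels])
```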
step 2, building the FSAF branch on the basis of the backbone network built in step 1 and the RetinaNet branch with reference frames, and generating the effective area and ignored area of the image feature layers; specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, and taking the mapping frame of the standard frame on each feature layer, shrunk by a factor A1, as the effective area and, shrunk by a factor A2, as the ignored area;
wherein all branches without reference frames are collectively called the FSAF branch; A1 ranges from 0.15 to 0.4 and A2 ranges from 0.45 to 0.6; the no-reference-frame branch comprises a classification subnet and a regression subnet, the classification subnet being a "w × h × K convolutional layer + Sigmoid activation function" and the regression subnet a "w × h × 4 convolutional layer + ReLU activation function"; a gray area lies between the effective area and the ignored area;
wherein the output of the prediction layer of each level in the branch with reference frames is the input of the w × h × K convolutional layer in the corresponding classification subnet, and the output of that convolutional layer is the input of the Sigmoid function; the Sigmoid activation function maps the output of the w × h × K convolutional layer of the classification subnet to the range 0-1; K is the number of 3 × 3 convolution kernels of the convolutional layer in the classification subnet and corresponds to K feature classes; the ReLU function is a ramp function; w and h are the width and height of the standard box, respectively; and the standard box is annotated in the original data;
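The per-level no-reference-frame branch and the scaled effective/ignored regions can be sketched as follows; K = 4 classes, the 256-channel input, the (x1, y1, x2, y2) box format and the function names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AnchorFreeBranch(nn.Module):
    """One no-reference-frame branch: a K-class map (Sigmoid) and a 4-offset map (ReLU)."""
    def __init__(self, channels=256, num_classes=4):
        super().__init__()
        self.cls_conv = nn.Conv2d(channels, num_classes, 3, padding=1)  # "w x h x K" head
        self.reg_conv = nn.Conv2d(channels, 4, 3, padding=1)            # "w x h x 4" head
    def forward(self, feat):
        return torch.sigmoid(self.cls_conv(feat)), torch.relu(self.reg_conv(feat))

def region_masks(box_xyxy, feat_h, feat_w, stride, a1=0.25, a2=0.5):
    """Project a standard box onto one feature layer and return boolean masks for the
    A1-shrunk effective area and the A2-shrunk ignored area; the ring between the two
    masks plays the role of the gray area described above."""
    x1, y1, x2, y2 = (v / stride for v in box_xyxy)          # mapping (projected) box
    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    ys = torch.arange(feat_h).view(-1, 1).expand(feat_h, feat_w)
    xs = torch.arange(feat_w).view(1, -1).expand(feat_h, feat_w)
    def shrunk(scale):
        return ((xs >= cx - scale * w / 2) & (xs <= cx + scale * w / 2) &
                (ys >= cy - scale * h / 2) & (ys <= cy + scale * h / 2))
    return shrunk(a1), shrunk(a2)

# usage on a stride-8 prediction layer of an 800x800 image
cls_map, reg_map = AnchorFreeBranch()(torch.randn(1, 256, 100, 100))
effective, ignored = region_masks((120.0, 80.0, 360.0, 240.0), 100, 100, stride=8)
```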
step 3, calculating the comprehensive loss based on the FSAF branch;
wherein the comprehensive loss is the sum of the classification loss and the regression loss;
the step 3 specifically comprises the following steps:
step 3.A, calculating the classification loss by formula (1):

L^I_{FL}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} FL(l, i, j)    (1)

wherein I is the given instance; L^I_{FL}(l) is the classification loss of the l-th feature layer for the given instance I; N(b^l_e) is the number of pixels inside the region b^l_e; b^l_e is the effective area of this instance on this feature layer; and FL(l, i, j) is the focal loss at position (i, j) of the l-th feature layer, calculated by (2):

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)    (2)

wherein the focal loss is denoted FL(p_t); p_t is the classification probability of the different classes, γ is a hyper-parameter exponent, and α_t is a hyper-parameter coefficient;
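A minimal sketch of the focal loss of equation (2) and of the per-instance classification loss of equation (1); the γ = 2.0 and α = 0.25 values are the ones quoted later in embodiment 1, and masking out the gray ring between the effective and ignored regions follows one possible reading of the text.

```python
import torch

def focal_loss(p, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), applied per pixel.
    p: sigmoid class scores in (0, 1); targets: 1 inside the effective region
    of the class, 0 elsewhere."""
    p_t = torch.where(targets == 1, p, 1 - p)
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t + eps)

def instance_cls_loss(fl_map, effective_mask, ignored_mask):
    """Equation (1): sum the focal loss over non-gray pixels and normalise by the
    number of pixels in the effective region (gray ring = ignored minus effective)."""
    keep = ~(ignored_mask & ~effective_mask)
    return fl_map[keep].sum() / effective_mask.sum().clamp(min=1)

# toy usage with a 100x100 level and a 20x20 effective region
scores = torch.rand(100, 100)
targets = torch.zeros(100, 100)
targets[40:60, 40:60] = 1.0
eff = targets.bool()
ign = torch.zeros_like(eff); ign[30:70, 30:70] = True
print(instance_cls_loss(focal_loss(scores, targets), eff, ign))
```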
step 3.B, calculating the regression loss by formula (3):

L^I_{IoU}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} IoU(l, i, j)    (3)

wherein L^I_{IoU}(l) is the regression loss of the l-th feature layer for the given instance I, and IoU(l, i, j) is the IoU loss at position (i, j) of the l-th feature layer; the IoU loss compares the standard box and the prediction box of the training set and is calculated by formula (4):

IoU(l, i, j) = -\ln \frac{|Box_p \cap Box_l|}{|Box_p \cup Box_l|}    (4)

wherein Box_p is the current standard box and Box_l is the prediction box obtained in the current calculation; the numerator |Box_p ∩ Box_l| in formula (4) is the area of the common part of Box_p and Box_l, the denominator |Box_p ∪ Box_l| is the area of their union, and ln is the natural logarithm; after the effective area and ignored area of the instance are applied to the standard box, the region outside the ignored area is set to 0 and the effective area of the standard box is set to 1; the region between the effective area and the ignored area is regarded as a gray area, and its data is not processed;
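A small sketch of the IoU loss of equation (4), assuming boxes in (x1, y1, x2, y2) form; the epsilon terms are added only for numerical safety.

```python
import torch

def iou_loss(pred_box, std_box, eps=1e-7):
    """IoU loss of equation (4): -ln(|intersection| / |union|) for two (x1, y1, x2, y2) boxes."""
    ix1 = torch.max(pred_box[0], std_box[0]); iy1 = torch.max(pred_box[1], std_box[1])
    ix2 = torch.min(pred_box[2], std_box[2]); iy2 = torch.min(pred_box[3], std_box[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_s = (std_box[2] - std_box[0]) * (std_box[3] - std_box[1])
    union = area_p + area_s - inter
    return -torch.log(inter / (union + eps) + eps)

# example with two overlapping boxes
print(iou_loss(torch.tensor([10., 10., 50., 50.]), torch.tensor([12., 8., 48., 52.])))
```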
After online feature selection is performed on the prediction box based on the loss calculation results, the selected optimal feature layer and its loss are fed back for inference in the next iteration, specifically: after the branch of each instance produces its result, the comprehensive loss of the prediction-layer samples of each level is obtained by averaging, and the prediction layer with the smallest comprehensive loss is then selected as the feedback feature layer;
wherein the comprehensive loss of the feedback feature layer is the input of the optimizer in the RetinaNet branch with reference frames; the prediction layer with the smallest comprehensive loss is the optimal feature layer;
the prediction box is obtained by the following steps:
step 3.BA, obtaining the feature layer with the smallest comprehensive loss through formula (5):

l^* = \arg\min_l \left( L^I_{FL}(l) + L^I_{IoU}(l) \right)    (5)

wherein l^* is the number of the selected feedback feature layer, and L^I_{FL}(l) and L^I_{IoU}(l) are the classification loss and regression loss of the l-th feature layer for instance I, i.e. the losses of the feedback feature layer;
wherein the feature layer with the smallest comprehensive loss is the optimal feature layer;
step 3.BB, obtaining the coordinates of the prediction box based on the optimal feature layer obtained in step 3.BA;
step 3.BB.1, calculating the offsets, specifically: for every pixel (i, j) of the effective area, the mapping frame b^l_p of the standard box on this feature layer is represented by the distances between pixel (i, j) and the four sides of b^l_p; a normalization constant S = 4 is set, and the distances between pixel (i, j) and the four sides of b^l_p are divided by S to obtain the transmitted offsets;
wherein (i, j) are the horizontal and vertical coordinates of the pixel, and b^l_p is the mapping of the standard box on the l-th feature layer;
step 3.BB.2, taking every pixel of the effective area as a center, a candidate box whose size is consistent with the mapping frame b^l_p of this feature layer is formed; the comprehensive loss between each candidate box and the standard box is calculated, and the pixel (i_min, j_min) with the smallest comprehensive loss is selected as the center of the prediction box;
step 3.BB.3, eliminating the influence of the normalization constant S on the offsets to obtain the actual distances between the pixel and the sides of the prediction box, which gives the coordinates of the top-left and bottom-right corners of the prediction box mapped on the feature layer; the mapping frame is then scaled back to the original image resolution to obtain the prediction box on the original image;
wherein eliminating S means multiplying the transmitted offsets of step 3.BB.1 by the normalization constant S, giving the distances between pixel (i_min, j_min) and the four sides of the prediction box;
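A sketch of the decoding in steps 3.BB.1 to 3.BB.3: undo the normalization by S, form the corners around the chosen center pixel, and project back to the image plane; treating the level-l stride as 2^l is an assumption.

```python
def decode_prediction(i, j, offsets, level, S=4.0):
    """Steps 3.BB.1-3.BB.3 in reverse: the regression subnet outputs normalised
    distances (top, left, bottom, right) at feature-map pixel (i, j); multiply by S,
    form the box corners, then project to the image plane (level-l stride taken as 2**l)."""
    d_t, d_l, d_b, d_r = (S * o for o in offsets)
    y1, x1 = i - d_t, j - d_l          # top-left corner on the feature map
    y2, x2 = i + d_b, j + d_r          # bottom-right corner on the feature map
    stride = 2 ** level
    return x1 * stride, y1 * stride, x2 * stride, y2 * stride

# illustrative offsets produced at pixel (12, 20) of level 3
print(decode_prediction(12, 20, offsets=(1.5, 2.0, 1.25, 2.5), level=3))
```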
step 4, inputting the comprehensive loss corresponding to the optimal feature layer selected in step 3 into the internal standard optimizer and the LookAhead optimizer in the RetinaNet branch with reference frames, so that the comprehensive loss converges;
step 4, specifically comprising the following substeps:
step 4.1, initializing the outer-loop count, the objective function L, the slow weight parameter φ and the fast weight parameter θ;
wherein the outer-loop count is denoted t and the maximum count is denoted t_max; t is initialized to 1 and the slow weight parameter is initialized to φ_0;
step 4.2, in the t-th outer loop, assigning the slow weight parameter φ_{t-1} at outer-loop count t-1 to the initial fast weight θ, as the initial parameter of the standard optimizer;
wherein the standard optimizer runs inside the loop of the Lookahead optimizer, and the number of inner iterations is the synchronization period k; the synchronization period k is the number of iterations of the standard optimizer within one Lookahead loop, is chosen according to the optimization effect, and ranges from 1000 to 100000;
step 4.3, the standard optimizer inside the Lookahead optimizer calculates the fast weights in the i-th inner iteration according to formula (6):

\theta_{t,i} = \theta_{t,i-1} + A(\theta_{t,i-1}, d)    (6)

wherein A is the standard optimizer, which is one of standard gradient descent and stochastic gradient descent; d is the sample value of the current data; θ_{t,i-1} is the parameter required by the standard optimizer and is the fast-weight result of the (i-1)-th iteration of the standard optimizer in the t-th outer loop; θ_{t,i} is the fast-weight result of the i-th iteration of the standard optimizer in the t-th outer loop;
step 4.4, updating the slow weight parameter on the basis of the result of step 4.3 by formula (7):

\phi_t = \phi_{t-1} + \beta(\theta_{t,k} - \phi_{t-1})    (7)

wherein θ_{t,k} is the fast-weight result of the k-th iteration of the standard optimizer in the t-th outer loop; φ_{t-1} is the slow weight parameter at outer-loop count t-1; φ_t is the slow weight parameter at outer-loop count t; and β is the iteration parameter of the Lookahead optimizer;
step 4.5, judging whether the outer-loop count t equals the maximum count t_max, and deciding whether to finish, specifically: if yes, outputting the slow weight parameter φ_t at time t and completing the optimization; otherwise, setting t = t + 1 and jumping to step 4.2;
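The fast/slow-weight update of equations (6)-(7) can be sketched as below, with SGD as the internal standard optimizer A; k, β, the learning rate and the toy model and data are illustrative, not values used in the patent.

```python
import torch

def lookahead_train(model, data_iter, loss_fn, k=5, beta=0.5, outer_steps=10, lr=1e-3):
    """Run a standard optimizer (SGD) for k inner iterations on the fast weights,
    then move the slow weights a fraction beta toward the result and restart."""
    slow = [p.detach().clone() for p in model.parameters()]      # phi
    inner = torch.optim.SGD(model.parameters(), lr=lr)           # standard optimizer A
    for _ in range(outer_steps):
        for _ in range(k):                                       # fast-weight updates, eq. (6)
            x, y = next(data_iter)
            inner.zero_grad()
            loss_fn(model(x), y).backward()
            inner.step()
        with torch.no_grad():
            for p, s in zip(model.parameters(), slow):           # slow-weight update, eq. (7)
                s += beta * (p - s)                              # phi_t = phi_{t-1} + beta*(theta_{t,k} - phi_{t-1})
                p.copy_(s)                                       # fast weights restart from phi_t
    return model

# usage with a toy regression model and synthetic data
model = torch.nn.Linear(8, 1)
def batches():
    while True:
        x = torch.randn(16, 8)
        yield x, x.sum(dim=1, keepdim=True)
lookahead_train(model, batches(), torch.nn.functional.mse_loss)
```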
so far, from step 1 to step 4, the target detection and identification method based on the FSAF and the fast-slow weight is completed.
Advantageous effects
Compared with the prior art, the target detection and identification method based on FSAF and fast-slow weight has the following beneficial effects:
1. Compared with the RetinaNet network, which only introduces the reference frame mechanism, the method adds a no-reference-frame mechanism and, without noticeably increasing complexity, selects the optimal feature layer through online feature selection under this mechanism, so that on clearer underwater data sets the FSAF network achieves markedly better training accuracy than the RetinaNet network;
2. By introducing the calculation of regression loss and classification loss and combining focal loss, IoU loss and online feature selection in loss estimation and optimal-feature-layer selection, the method achieves rapid convergence of the comprehensive loss, with a convergence behaviour clearly better than that of the RetinaNet network;
3. The method inserts the Lookahead fast-slow weight optimizer, which looks ahead along the fast-weight sequence on top of the internal standard optimization method to determine the search direction, achieving a better optimization speed. In practice, inserting the Lookahead optimizer improves both accuracy and convergence speed;
4. In the comparison between FSAF and the RetinaNet network, the method removes the top prediction layer, which effectively reduces the amount of computation while maintaining good target detection accuracy;
drawings
FIG. 1 is a schematic diagram of a backbone network of the target detection and identification method based on FSAF and fast-slow weight according to the present invention;
FIG. 2 is a schematic diagram of a FSAF branch of the target detection and identification method based on FSAF and fast-slow weight according to the present invention;
FIG. 3 is a branch schematic diagram of the FSAF and fast-slow weight based target detection and identification method without reference frame according to the present invention;
FIG. 4 is a schematic diagram of an image preprocessing flow in an example of the target detection and identification method based on FSAF and fast-slow weighting according to the present invention;
FIG. 5 is a schematic diagram of a training flow in an example of the target detection and identification method based on FSAF and fast-slow weights according to the present invention;
FIG. 6 is a schematic diagram of loss convergence in an actual environment of the target detection and identification method based on FSAF and fast-slow weighting according to the present invention;
FIG. 7 is a schematic diagram of mAP in an actual environment of the target detection and identification method based on FSAF and fast-slow weight according to the present invention;
FIG. 8 is a test result of the target detection and identification method based on FSAF and fast-slow weight in underwater environment;
FIG. 9 is a graph showing the convergence of the first 40 epoch losses in a small data set comparison based on FSAF and RetinaNet according to the present invention;
FIG. 10 is a test result of the FSAF and fast-slow weight based target detection and identification method of the present invention in a clear environment;
Detailed Description
The object detection and identification method based on FSAF and fast-slow weighting according to the present invention will be further explained and described in detail with reference to the accompanying drawings and embodiments.
Example 1
The method realizes the optimal selection of the feature layer based on the feature pyramid and the RetinaNet network, combines a fast-slow weight optimizer, and has wide application space in underwater target identification and target detection with fuzzy features in dark environment; the method has great practical significance in the application scenes of marine industry, fishery industry and terrestrial fuzzy environment. In an example, simulated environmental testing is performed on a training and testing data set provided by a global underwater robot competition. The data set is obtained by intercepting non-adjacent frames of the video in a simulation environment, so that the underwater fuzzy environment is well simulated, and in addition, good effects are obtained on the data sets of classified identification such as ImageNet, PASCAL VOC, Labelme and the like.
The network model training and testing equipment configuration used in the examples is as follows: i7-9750H, 8GB memory, GPU (GTX1050) and 3GB video memory. The training results of the shallow sea environment picture provided by the URPC data set are visualized by this example. Selecting 2 pictures as a batch, forming an epoch by 1250 steps, and training 60 epochs;
mainly comprises the following steps:
a, selecting and dividing a subset of a data set from an underwater robot game;
The data set pictures are taken in underwater environments under different actual conditions and comprise 7 groups of subsets, numbered as follows: (1) 2019V1, 2019V2, 2019V3; (2) CHN083846; (3) G0024172, G0024173, G0024174; (4) GOPR0293, GOPR0294; (5) GP010293, GP010294, GP010295, GP010296; (6) YDXJ0001, YDXJ0002, YDXJ0003, YDXJ0013; (7) YN01, YN02 and YN03;
To divide the atlas used for comparing model effects, the pictures whose numbers begin with YN, 2574 in total, form the training set, and the remaining 2183 pictures form the evaluation set;
b, preprocessing the image;
the image preprocessing flow is shown in fig. 4, and step B includes the following sub-steps:
step B.1, reading the image, transferring it to the GPU, and repeatedly 1/2 down-sampling any image whose longest side exceeds 1000 pixels until its longest side is smaller than 1000 pixels;
b.2, carrying out balanced RGB three-channel color processing on the image with the same size;
weighting and adding a global average value and a local average value of a single channel of an image, then proportionally balancing the result to 100, and finally combining the three channels into an image matrix with the original size;
step B.3, defogging and brightness balancing are carried out on the matrix obtained in the step B.2;
the brightness balance is to correct the condition that the partial pixel value of the image after the defogging processing is larger than 255, so the brightness adjustment is carried out after the defogging. The method specifically comprises the following steps:
inverting and centralizing the image pixel values, and adjusting the brightness of the pixel segments larger than the average value;
b.4, performing median filtering on the result obtained in the step B.3 to filter noise in the image;
b.5, restoring the image to 800 multiplied by 800 size and saving the image to a specified folder;
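A rough sketch of the preprocessing pipeline of steps B.1 to B.5 using OpenCV; the colour-balance weights, the brightness handling and the omission of a real defogging step are simplifying assumptions, and the file paths in the commented call are hypothetical.

```python
import cv2
import numpy as np

def preprocess(path, out_path, max_side=1000, out_size=800):
    """Rough stand-in for steps B.1-B.5: repeated 1/2 down-sampling, per-channel
    colour balance toward a mean of 100, clipping as a crude brightness fix,
    median filtering, and resizing to 800x800 (no real defogging here)."""
    img = cv2.imread(path).astype(np.float32)
    while max(img.shape[:2]) > max_side:                          # B.1
        img = cv2.resize(img, (img.shape[1] // 2, img.shape[0] // 2))
    balanced = []
    for c in cv2.split(img):                                      # B.2
        mixed = 0.5 * c.mean() + 0.5 * cv2.blur(c, (31, 31))      # global + local averages (weights assumed)
        balanced.append(c * (100.0 / (mixed + 1e-6)))
    img = np.clip(cv2.merge(balanced), 0, 255)                    # B.3 (clipping stands in for brightness balance)
    img = cv2.medianBlur(img.astype(np.uint8), 3)                 # B.4
    cv2.imwrite(out_path, cv2.resize(img, (out_size, out_size)))  # B.5

# preprocess("raw/underwater_0001.jpg", "processed/underwater_0001.jpg")  # hypothetical paths
```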
step C, building a main network comprising a convolution layer, a characteristic layer and a prediction layer and a RetinaNet reference frame branch;
wherein the number of convolutional layers is set to 5, the number of prediction layers is also 5, and the number of feature layers is 3;
Step C comprises the following substeps:
step C.1, taking the original image as a first layer of convolution layer, and sequentially carrying out 1/2 sampling from bottom to top to obtain 2 nd to 5 th layers of convolution layers to obtain a characteristic pyramid;
and C.2, carrying out feature fusion on different layers of the feature pyramid, specifically:
The 5th convolutional layer is used directly as the 5th feature layer; for l = 3, 4, the result of 2× up-sampling of the (l+1)-th feature layer is summed with the result of a 1 × 1 convolution of the l-th convolutional layer, i.e. feature fusion is performed, giving the 3rd and 4th feature layers after fusion;
the 1 × 1 convolution is used for ensuring that the dimension of the convolution layer participating in the fusion is consistent with that of the characteristic layer;
step C.3, performing a 3 × 3 convolution on each of the feature layers obtained in step C.2 (the 5th, 4th and 3rd layers) to obtain the 5th to 3rd prediction layers, 3 layers in total;
step C.4, separately performing a 3 × 3 convolution on the 5th convolutional layer obtained in step C.1 to obtain the 6th prediction layer;
step C.5, performing a 3 × 3 convolution on the 6th prediction layer obtained in step C.4 to obtain the 7th prediction layer;
so far, 5 prediction layers, the same number as the convolutional layers, are obtained through steps C.3 to C.5, giving the built backbone network; the backbone network structure is shown in fig. 1;
step C.6, adding a RetinaNet branch with a reference frame behind each layer of prediction layer;
step D, building the FSAF branch on the basis of the backbone network built in step C and the RetinaNet branch with reference frames, and generating the effective area and ignored area of the image feature layers, specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, and taking the mapping frame of the standard frame on each feature layer, shrunk by a factor A1, as the effective area and, shrunk by a factor A2, as the ignored area;
wherein all branches without reference frames are collectively called the FSAF branch; A1 is set to 0.25 and A2 to 0.5; the no-reference-frame branch comprises a classification subnet and a regression subnet, the classification subnet being a "w × h × K convolutional layer + Sigmoid activation function" and the regression subnet a "w × h × 4 convolutional layer + ReLU activation function"; a gray area lies between the effective area and the ignored area;
wherein the output of the prediction layer of each level in the branch with reference frames is the input of the w × h × K convolutional layer in the corresponding classification subnet, and the output of that convolutional layer is the input of the Sigmoid function; the Sigmoid activation function maps the output of the w × h × K convolutional layer of the classification subnet to the range 0-1; K is the number of 3 × 3 convolution kernels of the convolutional layer in the classification subnet and corresponds to K feature classes; the ReLU function is a ramp function; w and h are the width and height of the standard box, respectively; and the standard box is annotated in the original data;
wherein the FSAF branch is shown in fig. 2; the branch diagram without reference frame is shown in FIG. 3;
step E, calculating the comprehensive loss based on the FSAF branch;
wherein the comprehensive loss comprises classification loss and regression loss;
the step E specifically comprises the following steps:
step E.A, calculating the classification loss by formula (1):

L^I_{FL}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} FL(l, i, j)    (1)

wherein I is the given instance; L^I_{FL}(l) is the classification loss of the l-th feature layer for the given instance I; N(b^l_e) is the number of pixels inside the region b^l_e; b^l_e is the effective area of this instance on this feature layer; and FL(l, i, j) is the focal loss at position (i, j) of the l-th feature layer, calculated by (2):

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)    (2)

wherein the focal loss is denoted FL(p_t); p_t is the classification probability of the different classes; γ is a hyper-parameter exponent with value γ = 2.0; and α_t is a hyper-parameter coefficient with value α = 0.25;
step E.B, calculating the regression loss by formula (3):

L^I_{IoU}(l) = \frac{1}{N(b^l_e)} \sum_{(i,j) \in b^l_e} IoU(l, i, j)    (3)

wherein L^I_{IoU}(l) is the regression loss of the l-th feature layer for the given instance I, and IoU(l, i, j) is the IoU loss at position (i, j) of the l-th feature layer; the IoU loss compares the standard box and the prediction box of the training set and is calculated by formula (4):

IoU(l, i, j) = -\ln \frac{|Box_p \cap Box_l|}{|Box_p \cup Box_l|}    (4)

After the effective area and ignored area of the instance are applied, the region outside the ignored area is set to 0 and the effective area of the standard box is set to 1, giving the standard box; the region between the effective area and the ignored area is regarded as a gray area, and its data is not processed; the effective-area scale is set to 0.25 and the ignored-area scale to 0.5, corresponding respectively to A1 and A2 in step D;
After online feature selection is performed on the prediction box based on the loss calculation results, the selected optimal feature layer and its loss are fed back for inference in the next iteration, specifically: after the branch of each instance produces its result, the comprehensive loss of the prediction-layer samples of each level is obtained by averaging, and the prediction layer with the smallest comprehensive loss is selected as the feedback feature layer;
wherein the comprehensive loss of the feedback feature layer is the input of the optimizer in the RetinaNet branch with reference frames;
step E.C, selecting online features, calculating to obtain coordinates of the prediction box, specifically including the following substeps:
step E.C1, obtaining the feature layer with the smallest comprehensive loss through formula (5):

l^* = \arg\min_l \left( L^I_{FL}(l) + L^I_{IoU}(l) \right)    (5)

wherein l^* is the number of the selected feedback feature layer, and L^I_{FL}(l) and L^I_{IoU}(l) are respectively the classification loss and regression loss of instance I on the l-th feature layer;
wherein the feature layer with the smallest comprehensive loss is the optimal feature layer;
step E.C2, obtaining the coordinates of the prediction box based on the optimal feature layer from step E.C1;
step E.C2.1, calculating the offsets, specifically: for every pixel (i, j) of the effective area, the mapping frame b^l_p of the standard box on this feature layer is represented by a 4-dimensional vector d_{i,j} whose components are the distances between pixel (i, j) and the four sides of b^l_p; the normalization constant S is set to 4, and for every pixel inside b^l_p the offset d_{i,j}/S is obtained;
wherein (i, j) are the horizontal and vertical coordinates of the pixel, b^l_p is the mapping of the standard box on the l-th feature layer, and the components of d_{i,j} are the distances between pixel (i, j) and the upper, lower, left and right sides of b^l_p;
step E.C2.2, taking every pixel of the effective area as a center, a candidate box whose size is consistent with the mapping frame b^l_p of this feature layer is formed; the comprehensive loss between each candidate box and the standard box is calculated, and the pixel (i_min, j_min) with the smallest comprehensive loss is selected as the center of the prediction box;
step E.C2.3, eliminating the normalization constant S from the predicted offsets to obtain the distances between the pixel and the sides of the prediction box, which gives the coordinates of the top-left and bottom-right corners of the prediction box mapped on the feature layer; the mapping frame is then scaled back to the original image resolution to obtain the prediction box on the original image;
wherein the predicted offsets denote, for the pixel selected in step E.C2.2 taken as center, the offsets of the upper, lower, left and right sides of the prediction box whose size is consistent with the feature-layer mapping frame b^l_p, and correspond to the offsets d_{i,j}/S of step E.C2.1; eliminating the influence of S on the offsets means multiplying each element of the predicted offsets by the normalization constant S of step E.C2.1, giving the distances between the pixel and the four sides of the prediction box;
Step F, inputting the comprehensive loss corresponding to the optimal characteristic layer selected in the step E into an internal standard optimizer and a LookAhead optimizer in a RetinaNet reference frame branch for comprehensive loss convergence;
the step F specifically comprises the following substeps:
step F.1, initializing the outer-loop count, the slow weight parameter φ and the fast weight parameter θ;
wherein the outer-loop count is denoted t and the maximum count is denoted t_max; t is initialized to 1 and the slow weight parameter is initialized; the fast and slow weight parameters are the weights over all the images during the comprehensive-loss convergence computed on the images;
step F.2, in the t-th outer loop, assigning the slow weight parameter φ_{t-1} at outer-loop count t-1 to the initial fast weight θ, as the initial parameter of the standard optimizer;
wherein the standard optimizer runs inside the loop of the Lookahead optimizer, and the number of inner iterations is the synchronization period k; the synchronization period k is the number of iterations of the standard optimizer within one Lookahead loop, is adjusted according to the optimization effect, and ranges from 1000 to 100000;
step F.3, the standard optimizer inside the Lookahead optimizer calculates the fast weights in the i-th inner iteration according to formula (6):

\theta_{t,i} = \theta_{t,i-1} + A(\theta_{t,i-1}, d)    (6)

wherein A is the standard optimizer, here stochastic gradient descent; d is the sample value of the current data; θ_{t,i-1} is the parameter required by the standard optimizer and is the fast-weight result of the (i-1)-th iteration of the standard optimizer in the t-th outer loop; θ_{t,i} is the fast-weight result of the i-th iteration of the standard optimizer in the t-th outer loop;
step F.4, updating the slow weight parameter on the basis of the result of step F.3 by formula (7):

\phi_t = \phi_{t-1} + \beta(\theta_{t,k} - \phi_{t-1})    (7)

wherein θ_{t,k} is the fast-weight result of the k-th iteration of the standard optimizer in the t-th outer loop; φ_{t-1} is the slow weight parameter at outer-loop count t-1; φ_t is the slow weight parameter at outer-loop count t; and β is the iteration parameter of the Lookahead optimizer;
step F.5, judging whether the outer-loop count t equals the maximum count t_max, and deciding whether to finish, specifically: if yes, outputting the slow weight parameter φ_t at time t and completing the optimization; otherwise, setting t = t + 1 and jumping to step F.2;
g, training images based on the method from the step A to the step F, wherein the training model is set as follows:
In the comparison, the training model uses the FSAF model trained in the simulated environment. Training uses a single GPU, and the learning rate of the internal standard optimizer is set to 0.000625; the weights are updated for 60 iterations in total, 0.001 is selected as the initial iteration parameter, and at the 40th iteration the learning rate is reduced to 10% of its previous value;
Training of the model is started by calling tools/train in the toolbox; the training flow is shown in fig. 5;
step H, performing visualization processing on the target identification record based on the obtained training result;
specifically, after a prediction frame and comprehensive loss are obtained, calling an mmcv library to display the obtained loss convergence of the prediction frame;
Considering that the targets are underwater organisms, which have camouflage patterns and tend to cluster, the images are screened as follows:
Prediction boxes contained inside boxes of a different class are regarded as low-score samples: when the center of such a box is close to the center of the box containing it, the box is discarded; otherwise its confidence is reduced by a certain amount. For occluded small boxes, their confidence is moderately increased. To ensure a better display, only the boxes with high confidence are kept. Because of underwater channel interference and similar factors, a high-confidence box can be taken as a result that is recognizable in practical applications;
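The screening just described can be sketched as the following heuristic; the center-distance threshold, the confidence penalty/boost amounts and the keep threshold are all illustrative assumptions.

```python
def center(b):
    x1, y1, x2, y2 = b["xyxy"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def contains(outer, inner):
    ox1, oy1, ox2, oy2 = outer["xyxy"]; ix1, iy1, ix2, iy2 = inner["xyxy"]
    return ox1 <= ix1 and oy1 <= iy1 and ox2 >= ix2 and oy2 >= iy2

def screen_boxes(boxes, close=10.0, penalty=0.1, boost=0.05, keep=0.6):
    """Drop or down-weight boxes inside a different-class box, boost occluded small
    boxes, and keep only high-confidence results (all thresholds are assumptions)."""
    out = []
    for b in boxes:
        other_cls = [o for o in boxes if o is not b and o["cls"] != b["cls"] and contains(o, b)]
        if other_cls:
            cx, cy = center(b); ox, oy = center(other_cls[0])
            if abs(cx - ox) + abs(cy - oy) < close:              # centers nearly coincide: drop
                continue
            b = dict(b, score=b["score"] - penalty)              # otherwise lower the confidence
        elif any(o is not b and contains(o, b) for o in boxes):
            b = dict(b, score=min(1.0, b["score"] + boost))      # occluded small box: raise confidence
        if b["score"] >= keep:                                   # keep only high-confidence boxes
            out.append(b)
    return out

boxes = [
    {"xyxy": (10, 10, 200, 200), "cls": "holothurian", "score": 0.9},
    {"xyxy": (80, 80, 120, 120), "cls": "scallop", "score": 0.7},
]
print(screen_boxes(boxes))
```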
Evaluation of the model further comprises two parts, prediction precision and recall, where the single-class precision formula is shown in equation (8):
P=TP/(TP+FP) (8)
the recall ratio formula is shown in equation (9):
R=TP/(TP+FN) (9)
wherein P and R are respectively the precision and recall of a single category to be identified, TP denotes the number of samples predicted as positive whose prediction is correct, FP denotes the number of samples predicted as positive whose prediction is wrong, and FN denotes the number of samples that are actually positive but are predicted as negative.
In the criteria of the data set, the average precision (mAP) and average recall (mAR) over all feature classes are used by default to measure how good the model is. mAP50 is the average precision obtained when a prediction box whose overlap with the actual box exceeds 50% is counted as a correct prediction and one below 50% as a wrong prediction, while mAP is the average of the average precisions obtained with the overlap threshold ranging from 50% to 95% in steps of 0.05. The recall rate is treated in the same way; by combining mAP and mAR, the performance of the model in practical applications can be evaluated effectively.
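Equations (8)-(9) and the mAP averaging described above amount to the following small computation; the counts and AP values in the example are made up.

```python
def precision_recall(tp, fp, fn):
    """Equations (8)-(9): P = TP / (TP + FP), R = TP / (TP + FN) for one class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

print(precision_recall(tp=80, fp=20, fn=40))   # made-up counts -> (0.8, 0.666...)

# mAP as described above: average the AP over IoU thresholds 0.50, 0.55, ..., 0.95
ap_per_threshold = [0.65, 0.62, 0.58, 0.55, 0.50, 0.44, 0.37, 0.28, 0.18, 0.08]  # made up
print("mAP50:", ap_per_threshold[0], "mAP:", sum(ap_per_threshold) / len(ap_per_threshold))
```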
The convergence of the comprehensive loss is shown in fig. 6: as can be seen from fig. 6, adding image preprocessing and introducing the LookAhead optimizer makes convergence slightly faster in the early stage but slower in the later stage; in terms of convergence speed, directly using the internal standard optimizer converges somewhat better;
and the FSAF network is better than the RetinaNet network in convergence speed;
In addition, when only one epoch is trained on all pictures, the mAP of the model optimized with LookAhead is far higher than that of the model without it. However, in the subsequent training, the evaluation results of all configurations on the evaluation set fluctuate considerably, as shown in fig. 7; furthermore, although RetinaNet is less effective in the initial period, the effect achieved at the completion of training is substantially the same as FSAF under the same conditions. The final mAP is shown in Table 1:
TABLE 1 results of evaluation of actual Environment
mAP (IoU=0.5) | FSAF | FSAF+LookAhead | FSAF+LookAhead+preprocessing | RetinaNet
First epoch | 25.9% | 39.5% | 42.5% | No result
Final result | 54.2% | 54.3% | 54.6% | 54.4%
The partial inspection visualization results are shown in fig. 8;
In each pair of adjacent pictures, the left side is the original image and the right side is the image after preprocessing. It can be seen that after training the network can mark most of the objects to be recognized. Some objects that are hard to identify with the naked eye in the original image are well marked in the preprocessed picture.
Example 2
This embodiment describes a specific implementation of the target detection and identification method based on FSAF and fast-slow weight, which realizes performance comparison between the method and a RetinaNet network in an environment with a clear small data set, shows a comparison result of the two methods, and includes the following steps:
step A, dividing an atlas for comparing model effects into training sets with the number of 200 and evaluation sets with the number of 100, wherein the division mode adopts random division;
in specific implementation, the picture is derived from a training and testing data set provided by an underwater robot competition, and the data set is derived from non-adjacent frames of an intercepted video in a simulation environment;
b, preprocessing the image;
The image preprocessing is used to obtain input images of basically consistent size and appearance; its flow is the same as in embodiment 1;
step C, building a network through an MMDetection tool box,
the model is constructed according to the requirements of the tool box, and comprises the following sub-steps:
step C.1, invoking torchvision to construct a residual network module for feature extraction;
wherein torchvision is a tool library for image operations that is independent of PyTorch, and the constructed residual network mainly performs successive 1/2 down-sampling on the original image layer C1 to obtain convolutional layers C2, C3, C4 and C5 of different sizes, forming five convolutional layers;
step C.2, performing 1 × 1 convolutions on C3, C4 and C5 to obtain prediction layers P3, P4 and P5, and a 3 × 3 convolution with stride 2 on C5 to obtain prediction layer P6; the difference from embodiment 1 is that embodiment 2 removes the extra prediction layer P7 that RetinaNet adds after the P6 layer;
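A sketch of steps C.1 and C.2 using torchvision; ResNet-50, the 256-channel width and a recent torchvision API (weights=None) are assumptions, since the patent only states that a residual network is built through torchvision.

```python
import torch
import torchvision

# assumed backbone: ResNet-50 with random weights
backbone = torchvision.models.resnet50(weights=None)

def extract_c3_c5(x):
    """Run the stem and the four residual stages, keeping C3-C5 (strides 8/16/32)."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)
    c3 = backbone.layer2(c2)
    c4 = backbone.layer3(c3)
    c5 = backbone.layer4(c4)
    return c3, c4, c5

# step C.2: 1x1 convolutions give P3-P5, a stride-2 3x3 convolution on C5 gives P6
lat3, lat4, lat5 = (torch.nn.Conv2d(c, 256, 1) for c in (512, 1024, 2048))
p6_conv = torch.nn.Conv2d(2048, 256, 3, stride=2, padding=1)

with torch.no_grad():
    c3, c4, c5 = extract_c3_c5(torch.randn(1, 3, 800, 800))
    p3, p4, p5, p6 = lat3(c3), lat4(c4), lat5(c5), p6_conv(c5)
print([tuple(t.shape) for t in (p3, p4, p5, p6)])
```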
step C.3, setting partial hyper-parameters used in training and evaluation, wherein the parameter setting is the same as that of the embodiment 1;
step D, building an FSAF branch, wherein the specific steps are consistent with those of the embodiment 1 so as to be convenient for comparison with a reference frame branch of a RetinaNet network;
step E, inserting a Lookahead fast-slow weight optimizer into the FSAF internal standard optimizer, wherein the parameter setting is consistent with that of the embodiment 1;
step F, training images based on a training set, wherein the training setting is the same as that of the embodiment 1;
g, performing result visualization processing on the target recognition based on the training result obtained in the step F; specifically, after a boundary frame, a classification and a prediction score of a predicted track are obtained, an mmcv library is called to qualitatively display the obtained prediction frame, and the specific steps and the subsequent screening mode are the same as those of the embodiment 1;
When comparing the effects of the FSAF and RetinaNet models, the learning rate of the RetinaNet network is set as learning rate / (pictures per GPU × number of GPUs) = 0.01/(2 × 8), i.e. a learning rate of 0.000625; for comparison, the FSAF model is trained with two learning rates, 0.001 and 0.000625. The convergence of the first 40 epoch losses during training is shown in fig. 9, which shows that the loss convergence speed of FSAF is significantly better than that of RetinaNet, and that the convergence at a learning rate of 0.001 is better than at 0.000625; the performance of the RetinaNet network and the FSAF network at different learning rates is shown in table 2:
From the results in table 2, when training is insufficient, the learning efficiency shown by FSAF is clearly superior to that of RetinaNet; when the learning rate is large enough, FSAF performs well and reaches a good training effect in a short time compared with RetinaNet, with mAP50 reaching 93%; in terms of recall, FSAF is 2.7% higher than RetinaNet under the same training conditions;
TABLE 2 fast evaluation results of small data sets
The simulation test results in a clearer environment are shown in fig. 10:
The top left corner of each picture records the number of recognized aquatic organisms, marked with prediction boxes 1, 2 and 3 for the different targets. It can be seen that the FSAF model achieves excellent results in a clearer environment after sufficient training.
While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (10)

1. The target detection and identification method based on FSAF and fast-slow weight is characterized by comprising the following steps: the method comprises the following steps:
step 1, building a main network comprising a convolution layer, a characteristic layer and a prediction layer and a RetinaNet reference frame branch;
the number of the convolution layers is the number of layers of the characteristic pyramid and is marked as N, the number of the prediction layers is also N, and the number of the characteristic layers is N-2;
step 1 comprises the following substeps:
step 1.1, taking the original image as a first layer of convolution layer, and sequentially carrying out 1/2 sampling from bottom to top to obtain 2 nd to N th layers of convolution layers to obtain a characteristic pyramid;
step 1.2, carrying out feature fusion on different layers of the feature pyramid to obtain feature layers from the 3 rd layer to the Nth layer;
step 1.3, performing a 3 × 3 convolution on each of the feature layers from the N-th down to the 3rd obtained in step 1.2 to obtain the N-th to 3rd prediction layers, N-2 layers in total;
step 1.4, separately performing a 3 × 3 convolution on the N-th convolutional layer obtained in step 1.1 to obtain the (N+1)-th prediction layer;
step 1.5, performing a 3 × 3 convolution on the (N+1)-th prediction layer obtained in step 1.4 to obtain the (N+2)-th prediction layer;
so far, steps 1.3 to 1.5 yield N prediction layers, the same number as the convolutional layers;
wherein, the convolution layer, the characteristic map layer and the prediction layer also form a built backbone network;
step 1.6, adding a RetinaNet branch with a reference frame behind each prediction layer;
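The following minimal PyTorch-style sketch (an illustration added for clarity, not part of the claims) shows one way the step-1 backbone could be assembled: an N-level convolutional pyramid, top-down feature fusion with 1 × 1 lateral convolutions and 2× up-sampling, and 3 × 3 convolutions producing the N prediction layers. The class name, channel width and level count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBackbone(nn.Module):
    """Sketch of the step-1 backbone; channel width and level count are illustrative."""

    def __init__(self, num_levels: int = 7, channels: int = 256):
        super().__init__()
        self.num_levels = num_levels                    # N in the claims
        # Step 1.1: successive 1/2 down-sampling giving convolution layers 2..N.
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
             for _ in range(num_levels - 1)])
        # Step 1.2: 1x1 lateral convolutions used in feature fusion (levels 3..N-1).
        self.lateral = nn.ModuleList(
            [nn.Conv2d(channels, channels, 1) for _ in range(num_levels - 3)])
        # Steps 1.3-1.5: 3x3 convolutions producing the N prediction layers.
        self.predict = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1)
             for _ in range(num_levels)])

    def forward(self, x):
        # x: (B, channels, H, W); a stem lifting an RGB image to `channels` is omitted.
        convs = [x]                                     # step 1.1: C1..CN, C1 = input
        for down in self.down:
            convs.append(down(convs[-1]))
        feats = {self.num_levels: convs[-1]}            # step 1.2: PN = CN directly
        for l in range(self.num_levels - 1, 2, -1):     # l = N-1, ..., 3
            up = F.interpolate(feats[l + 1], size=convs[l - 1].shape[-2:])
            feats[l] = self.lateral[l - 3](convs[l - 1]) + up
        # Steps 1.3-1.5: prediction layers 3..N, then the two extra levels.
        preds = [self.predict[l - 3](feats[l])
                 for l in range(3, self.num_levels + 1)]
        preds.append(self.predict[-2](convs[-1]))       # "(N+1)-th" prediction layer
        preds.append(self.predict[-1](preds[-1]))       # "(N+2)-th" prediction layer
        return preds                                    # N prediction layers in total
```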
step 2, building the FSAF branch based on the backbone network and the RetinaNet reference-frame branch built in step 1, and generating the effective area and the ignored area of the image feature layers; specifically: adding a branch without reference frames to the prediction layer of each level in the RetinaNet branch with reference frames, taking the mapping box of the standard box on each feature layer, shrunk by a factor A1, as the effective area, and the same mapping box, shrunk by a factor A2, as the ignored area;
wherein, all branches without reference frame are collectively called FSAF branches;
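As a hedged illustration of step 2, the NumPy sketch below builds the effective and ignored areas of one standard box on one feature layer by shrinking its mapped box by factors A1 and A2. The values A1 = 0.2 and A2 = 0.5 (within the ranges of claim 7), the stride-based mapping and the mask convention (1 = effective, −1 = ignored ring, 0 = background) are assumptions for illustration.

```python
import numpy as np

def fsaf_regions(box_xyxy, feat_h, feat_w, stride, a1=0.2, a2=0.5):
    """Mask of shape (feat_h, feat_w): 1 = effective, -1 = ignored ring, 0 = background."""
    x1, y1, x2, y2 = [v / stride for v in box_xyxy]     # map the standard box onto the layer
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1

    def shrink(scale):
        # Shrink the mapped box around its centre and clamp it to the layer.
        return (int(max(cx - scale * w / 2, 0)), int(max(cy - scale * h / 2, 0)),
                int(min(cx + scale * w / 2, feat_w)), int(min(cy + scale * h / 2, feat_h)))

    mask = np.zeros((feat_h, feat_w), dtype=np.int8)
    ix1, iy1, ix2, iy2 = shrink(a2)                     # A2-shrunk box: ignored area
    mask[iy1:iy2, ix1:ix2] = -1
    ex1, ey1, ex2, ey2 = shrink(a1)                     # A1-shrunk box: effective area
    mask[ey1:ey2, ex1:ex2] = 1
    return mask
```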
step 3, calculating the comprehensive loss based on the FSAF branch;
wherein the comprehensive loss comprises classification loss and regression loss;
the step 3 specifically comprises the following steps:
step 3.A, calculating the classification loss by equation (1):

L_FL^I(l) = (1 / N(b_e^l)) · Σ_{(i,j) ∈ b_e^l} FL(l, i, j)    (1)

wherein I is the given example; L_FL^I(l) is the classification loss of the l-th feature layer for the given example I; N(b_e^l) is the total number of pixels in the region b_e^l; b_e^l is the effective area of this example on the l-th feature layer; FL(l, i, j) is the focal loss at position (i, j) of the l-th feature layer, calculated by equation (2):

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)    (2)

wherein the focal loss is denoted FL(p_t); p_t is the classification probability of the corresponding class; γ is a hyper-parameter exponent; α_t is a hyper-parameter coefficient;
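A minimal PyTorch sketch of the focal loss of equation (2) is given below for a per-class sigmoid formulation; α_t = 0.25 and γ = 2 are the commonly used values and are assumptions here, not values fixed by the claims.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Element-wise focal loss; `targets` holds 0/1 labels with the same shape as `logits`."""
    targets = targets.float()
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); summing this over the effective
    # area and dividing by its pixel count gives the layer classification loss of eq. (1).
    return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))
```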
step 3.B, calculating the regression loss by equation (3):

L_IoU^I(l) = (1 / N(b_e^l)) · Σ_{(i,j) ∈ b_e^l} IoU(l, i, j)    (3)

wherein L_IoU^I(l) is the regression loss of the l-th feature layer for the given example I, and IoU(l, i, j) is the IoU loss at position (i, j) of the l-th feature layer; the IoU loss compares the standard box of the training set with the prediction box and is calculated by equation (4):

IoU = -ln( |Box_p ∩ Box_l| / |Box_p ∪ Box_l| )    (4)

wherein Box_p is the current standard box and Box_l is the prediction box obtained by the current calculation; the numerator |Box_p ∩ Box_l| in equation (4) is the area of the common part of Box_p and Box_l, the denominator |Box_p ∪ Box_l| is the area of their union, and ln is the natural logarithm;
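The IoU loss of equation (4) can be illustrated with the following NumPy sketch for two boxes in (x1, y1, x2, y2) form; the function and argument names are illustrative.

```python
import numpy as np

def iou_loss(box_p, box_l, eps=1e-6):
    """IoU loss between the standard box Box_p and the prediction box Box_l, both (x1, y1, x2, y2)."""
    x1, y1 = max(box_p[0], box_l[0]), max(box_p[1], box_l[1])
    x2, y2 = min(box_p[2], box_l[2]), min(box_p[3], box_l[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)        # |Box_p ∩ Box_l|
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_l = (box_l[2] - box_l[0]) * (box_l[3] - box_l[1])
    union = area_p + area_l - inter                      # |Box_p ∪ Box_l|
    return -np.log((inter + eps) / (union + eps))        # -ln(IoU), as in eq. (4)
```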
step 3.B, the standard box is further processed with the effective area and the ignored area on the example: the region outside the ignored area is set to 0 and the effective area of the standard box is set to 1, giving the processed standard box; the region between the effective area and the ignored area is regarded as a gray area, and the data of this region are not processed;
in step 3.B, after online feature selection is carried out on the prediction boxes based on the loss calculation results, the selected optimized feature layer and its loss are returned for inference and for the next iteration, specifically: after the branch of each example obtains its result, the comprehensive loss of each prediction layer is obtained by averaging over its samples, and the prediction layer with the minimum comprehensive loss is selected as the returned feature layer;
wherein the comprehensive loss of the returned feature layer is the input of the optimizer in the RetinaNet branch with reference frames, and the prediction layer with the minimum comprehensive loss is the optimized feature layer;
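A hedged sketch of the online feature selection described above: per feature layer, the averaged classification and regression losses are summed and the layer with the smallest combined loss is returned. The dictionary-based interface is an assumption for illustration.

```python
def select_feature_layer(losses_per_layer):
    """losses_per_layer: {layer index: (mean classification loss, mean regression loss)}."""
    combined = {l: cls_loss + reg_loss
                for l, (cls_loss, reg_loss) in losses_per_layer.items()}
    best_layer = min(combined, key=combined.get)         # l* = argmin_l (L_FL + L_IoU)
    return best_layer, combined[best_layer]

# Usage sketch: layer 5 is selected and its comprehensive loss fed back to the optimizer.
layer, loss = select_feature_layer({3: (0.9, 0.7), 4: (0.5, 0.6), 5: (0.4, 0.5)})
```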
step 4, inputting the comprehensive loss corresponding to the optimized feature layer selected in the step 3 into an internal standard optimizer and a LookAhead optimizer in a reference frame branch of the RetinaNet, so that the comprehensive loss is converged;
step 4, specifically comprising the following substeps:
step 4.1, initializing an external loop count value, an objective function L, a slow weight parameter phi and a fast weight parameter theta;
wherein the outer-loop count is denoted t and the maximum loop count is denoted t_max; t is initialized to 1 and the slow weight parameter is initialized to φ_0;
step 4.2, in the t-th outer loop, the slow weight parameter φ_{t-1} at outer-loop count t-1 is assigned to the initial fast weight θ and serves as the initial parameter of the standard optimizer;
wherein the standard optimizer runs in the inner loop of the Lookahead optimizer, and the number of iterations of the inner loop is the synchronization period k;
step 4.3, the standard optimizer inside the Lookahead optimizer calculates the fast weight of the i-th inner-loop iteration by equation (6):

θ_{t,i} = θ_{t,i-1} + A(θ_{t,i-1}, d)    (6)

wherein A is the standard optimizer, being one of standard gradient descent and stochastic gradient descent; d is the current data sample; θ_{t,i-1} is the parameter supplied to the standard optimizer, i.e. the fast-weight result of the (i-1)-th inner iteration of the standard optimizer in the t-th outer loop; θ_{t,i} is the fast-weight result of the i-th inner iteration of the standard optimizer in the t-th outer loop;
step 4.4 updates the slow weight parameter based on the result of step 4.3 by equation (7):
φ_t = φ_{t-1} + β(θ_{t,k} - φ_{t-1})    (7)

wherein θ_{t,k} is the fast-weight result of the k-th inner iteration of the standard optimizer in the t-th outer loop; φ_{t-1} is the slow weight parameter at outer-loop count t-1; φ_t is the slow weight parameter at outer-loop count t; β is the iteration parameter of the Lookahead optimizer;
step 4.5, judge whether the outer-loop count t equals the maximum loop count t_max and thereby decide whether to finish the method, specifically: if yes, output the slow weight parameter φ_t at time t, completing the optimization; otherwise, set t = t + 1 and jump to step 4.2.
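As an illustration of steps 4.1 to 4.5, the following PyTorch sketch wraps a standard optimizer in a Lookahead fast-slow-weight loop: k fast-weight steps with the inner optimizer (equation (6)), then one slow-weight interpolation step (equation (7)). The values k = 5 and β = 0.5 are common defaults used here as assumptions; the claims leave k and β as tunable parameters.

```python
import torch

class Lookahead:
    """Fast-slow weight wrapper around a standard (inner) optimizer."""

    def __init__(self, inner_optimizer, k=5, beta=0.5):
        self.inner = inner_optimizer                     # standard optimizer A (e.g. SGD)
        self.k, self.beta, self.step_count = k, beta, 0
        # phi_0: initialise the slow weights from the current fast weights.
        self.slow = [p.detach().clone()
                     for group in inner_optimizer.param_groups
                     for p in group["params"]]

    def step(self):
        self.inner.step()                                # eq. (6): fast-weight update
        self.step_count += 1
        if self.step_count % self.k:                     # inner loop not yet finished
            return
        params = [p for group in self.inner.param_groups for p in group["params"]]
        for slow, fast in zip(self.slow, params):
            # eq. (7): phi_t = phi_{t-1} + beta * (theta_{t,k} - phi_{t-1})
            slow += self.beta * (fast.detach() - slow)
            fast.data.copy_(slow)                        # restart fast weights from phi_t

# Usage sketch (model and loss are assumed):
# base = torch.optim.SGD(model.parameters(), lr=0.001)
# opt = Lookahead(base); loss.backward(); opt.step()
```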
2. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 1, wherein: in step 1, the resolution of the (l+1)-th convolution layer is 1/2^l of the input image resolution, with l ranging from 1 to N-1.
3. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 2, wherein: the feature pyramid is the 1 st through nth convolutional layers.
4. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 3, wherein: step 1.2 is specifically: the N-th convolution layer is directly used as the N-th feature layer; for l = 3, …, N-1, the result of 2× up-sampling the (l+1)-th feature layer is summed with the result of a 1 × 1 convolution of the l-th convolution layer, i.e. feature fusion, giving the (N-1)-th to 3rd feature layers in turn.
5. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 4, wherein: in step 1.2, the 1 × 1 convolution is used to ensure that the dimensions of the convolution layer participating in the fusion are consistent with those of the feature layer.
6. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 5, wherein: features of different types of objects are fused into each of the prediction layers obtained in step 1.5.
7. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 6, wherein: the value range of A1 in step 2 is 0.15 to 0.4; a2 has a value in the range of 0.45 to 0.6.
8. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 7, wherein: in step 2, the branch without reference frames comprises a classification subnet and a regression subnet, the classification subnet having the structure "w × h × K convolution layer + Sigmoid activation function" and the regression subnet having the structure "w × h × 4 convolution layer + ReLU activation function"; a gray area lies between the effective area and the ignored area; the output of the prediction layer of each level in the branch with reference frames is the input of the w × h × K convolution layer of the corresponding classification subnet, and the output of that convolution layer is the input of the Sigmoid function; the Sigmoid activation function maps the output of the w × h × K convolution layer of the classification subnet to the range 0 to 1; K is the number of 3 × 3 convolution kernels of the convolution layer in the classification subnet and corresponds to K feature classes; the ReLU function is a ramp function; w and h are the width and height of the standard box, respectively; and the standard box is already marked in the original data.
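A minimal PyTorch sketch of such an anchor-free branch is given below: a K-channel 3 × 3 convolution followed by a Sigmoid for classification, and a 4-channel 3 × 3 convolution followed by a ReLU for regression. The channel width and class count are illustrative assumptions.

```python
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Classification and regression subnets of the branch without reference frames."""

    def __init__(self, in_channels: int = 256, num_classes: int = 4):
        super().__init__()
        self.cls = nn.Sequential(                        # w x h x K classification output
            nn.Conv2d(in_channels, num_classes, 3, padding=1),
            nn.Sigmoid())                                # maps scores into 0..1
        self.reg = nn.Sequential(                        # w x h x 4 regression output
            nn.Conv2d(in_channels, 4, 3, padding=1),
            nn.ReLU())                                   # ramp function, offsets >= 0

    def forward(self, prediction_layer):
        return self.cls(prediction_layer), self.reg(prediction_layer)
```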
9. The method for detecting and identifying targets based on FSAF and fast-slow weights as claimed in claim 8, wherein: the prediction box is obtained by the following steps:
step 3.BA, obtain the feature layer with the minimum comprehensive loss through equation (5):

l* = argmin_l ( L_FL^I(l) + L_IoU^I(l) )    (5)

wherein l* is the index of the feature layer selected for parameter feedback, and L_FL^I(l) and L_IoU^I(l) are respectively the classification loss and the regression loss of the l-th feature layer for example I; the returned feature layer is l*;
wherein the feature layer with the minimum comprehensive loss is the optimized feature layer;
step 3.BB, obtain the coordinates of the prediction box based on the optimized feature layer obtained in step 3.BA;
step 3.BB.1, calculate the offsets, specifically: for every pixel (i, j) of the effective area, the mapping box b_p^l of the standard box on the l-th feature layer is represented by the distances between the pixel (i, j) and the four sides of b_p^l; a normalization constant S = 4 is set, and the distances between the pixel (i, j) and the four sides of b_p^l are normalized by S to obtain the transfer offsets;
wherein (i, j) are the horizontal and vertical coordinates of the pixel, and b_p^l is the mapping of the standard box on the l-th feature layer;
step 3.BB.2, taking each pixel of the effective area as a center, form prediction boxes whose sizes are consistent with the mapping box b_p^l of the feature layer, calculate the comprehensive loss between each such prediction box and the standard box, and select the pixel (i_min, j_min) with the minimum comprehensive loss as the center of the prediction box;
step 3.BB.3, eliminate the influence of the normalization constant S on the offsets to obtain the actual distances between the pixel and the prediction box, obtain the coordinates of the upper-left and lower-right corners of the prediction box on the feature layer, and then scale the mapping box according to the mapping relation between the l-th feature layer and the original image to obtain the prediction box on the original image;
wherein eliminating S means multiplying the transfer offsets of step 3.BB.1 by the normalization constant S, which gives the distances from the selected center pixel to the four sides of the prediction box.
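The offset encoding and box decoding of steps 3.BB.1 to 3.BB.3 can be sketched as follows, assuming the distances are divided by S = 4 when encoded and the feature-layer stride is used to map the decoded box back to the original image; both assumptions are illustrative readings of the claim.

```python
import numpy as np

S = 4.0  # normalisation constant of step 3.BB.1

def encode_offsets(pixel_ij, mapped_box_xyxy):
    """Distances (top, left, bottom, right) from pixel (i, j) to the mapped box, divided by S."""
    i, j = pixel_ij                                      # i: horizontal (x), j: vertical (y)
    x1, y1, x2, y2 = mapped_box_xyxy
    return np.array([j - y1, i - x1, y2 - j, x2 - i]) / S

def decode_box(pixel_ij, offsets, stride):
    """Invert the encoding (multiply by S) and map the box back to the original image."""
    i, j = pixel_ij
    top, left, bottom, right = offsets * S               # eliminate S, as in step 3.BB.3
    x1, y1 = (i - left) * stride, (j - top) * stride     # upper-left corner
    x2, y2 = (i + right) * stride, (j + bottom) * stride # lower-right corner
    return x1, y1, x2, y2
```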
10. The method for detecting and identifying objects based on FSAF and fast-slow weights as claimed in claim 9, wherein: in step 4.2, the synchronization period k is the number of iterations of the standard optimizer in the inner loop of the Lookahead optimizer; k is chosen according to the optimization effect, and its value ranges from 1000 to 100000.
CN202111065576.2A 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight Pending CN113850256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111065576.2A CN113850256A (en) 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111065576.2A CN113850256A (en) 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight

Publications (1)

Publication Number Publication Date
CN113850256A true CN113850256A (en) 2021-12-28

Family

ID=78973732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111065576.2A Pending CN113850256A (en) 2021-09-10 2021-09-10 Target detection and identification method based on FSAF and fast-slow weight

Country Status (1)

Country Link
CN (1) CN113850256A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A kind of small target deteection of Enhanced feature study and recognition methods
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
WO2020102988A1 (en) * 2018-11-20 2020-05-28 西安电子科技大学 Feature fusion and dense connection based infrared plane target detection method
CN112766184A (en) * 2021-01-22 2021-05-07 东南大学 Remote sensing target detection method based on multi-level feature selection convolutional neural network
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism

Similar Documents

Publication Publication Date Title
CN109934121B (en) Orchard pedestrian detection method based on YOLOv3 algorithm
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN110930454B (en) Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN109583425B (en) Remote sensing image ship integrated recognition method based on deep learning
CN108319972B (en) End-to-end difference network learning method for image semantic segmentation
WO2023015743A1 (en) Lesion detection model training method, and method for recognizing lesion in image
CN108537102B (en) High-resolution SAR image classification method based on sparse features and conditional random field
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
CN110765865B (en) Underwater target detection method based on improved YOLO algorithm
CN109886271B (en) Image accurate segmentation method integrating deep learning network and improving edge detection
CN110716792B (en) Target detector and construction method and application thereof
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN113256572B (en) Gastroscope image analysis system, method and equipment based on restoration and selective enhancement
CN114581709A (en) Model training, method, apparatus, and medium for recognizing target in medical image
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN116468995A (en) Sonar image classification method combining SLIC super-pixel and graph annotation meaning network
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN115439738A (en) Underwater target detection method based on self-supervision cooperative reconstruction
CN116977960A (en) Rice seedling row detection method based on example segmentation
CN114998362A (en) Medical image segmentation method based on double segmentation models
CN113191962A (en) Underwater image color recovery method and device based on environment background light and storage medium
CN116503763A (en) Unmanned aerial vehicle cruising forest fire detection method based on binary cooperative feedback
CN114782455B (en) Cotton row center line image extraction method for agricultural machine embedded equipment
CN113850256A (en) Target detection and identification method based on FSAF and fast-slow weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination