CN115393388A - Single-target tracking method based on position uncertainty estimation - Google Patents

Single-target tracking method based on position uncertainty estimation

Info

Publication number
CN115393388A
CN115393388A
Authority
CN
China
Prior art keywords
frame
target
search
classification
branch
Prior art date
Legal status
Pending
Application number
CN202110566900.2A
Other languages
Chinese (zh)
Inventor
武港山
徐梦强
王利民
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110566900.2A priority Critical patent/CN115393388A/en
Publication of CN115393388A publication Critical patent/CN115393388A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An accurate target tracking method based on a target transformation regression network comprises the following stages: 1) training sample generation; 2) network body training; 3) offline training of the meta classifier; 4) online tracking. The position uncertainty estimation module designed by the invention predicts confidence information for the position coordinates output by the network, and a position voting mechanism generates the final prediction box in the subsequent stage, so that an accurate regression bounding box can be given. In addition, the invention provides an online updating strategy based on meta-learning, so that the tracker can adapt to changes in the shape and scale of the target, improving its robustness. Compared with existing single-target tracking methods, the tracking method of the invention adapts better to object deformation during tracking and effectively improves target regression precision.

Description

Single-target tracking method based on position uncertainty estimation
Technical Field
The invention belongs to the technical field of computer software, relates to a single-target tracking technology, and particularly relates to a single-target tracking method based on position uncertainty estimation.
Background
Target tracking is a fundamental task in the field of computer vision. In general, the target tracking problem can be summarized simply as: given a video sequence and the position of the target in its first frame, the algorithm must accurately track the target's position through the subsequent frames, thereby obtaining the complete trajectory of the target's motion.
The object tracking problem can be viewed as a combination of a classification task and a state estimation task: classification provides coarse location information, and state estimation provides more accurate object state information on that basis. To make the tracking result more accurate, the design of the state estimation task is crucial. Current target tracking algorithms, classified by their state estimation, fall into three types. The first type mainly includes early correlation filtering methods and twin (Siamese) network methods such as DCF and SiamFC, which use simple multi-scale testing in the state estimation stage; this is both inaccurate and time-consuming. The second type is mainly the SiamRPN series, which introduces the RPN module commonly used in target detection into SiamFC, so that the tracker can regress both position and shape while dispensing with multi-scale tests. While this approach improves the accuracy of the algorithm and maintains speed, it still has many deficiencies. The third type mainly comprises the ATOM and DiMP methods, which randomly generate a number of candidate boxes around the rough position given by the classification task, then iteratively optimize the candidate boxes by gradient descent on a specially designed IoU prediction network, obtaining a more accurate prediction box. This approach, while making great progress in accuracy, is computationally expensive and introduces many hyper-parameters that require careful tuning. In addition, some anchor-free methods from the target detection field have recently been applied to target tracking with good results, but these methods are still not accurate enough and their robustness remains to be improved.
Disclosure of Invention
The invention aims to solve the following problems: in the candidate box screening stage of the target tracking process, the lack of position confidence lowers the accuracy of the algorithm, and the tracker's inability to adapt to possible changes in target shape and scale lowers its robustness. In the candidate box screening stage of previous twin-network-based tracking methods (such as SiamRPN and SiamFC++), the prediction box corresponding to the highest classification confidence is generally selected as the final prediction box, but studies in the target detection field indicate that this is not reasonable and that the model can only obtain suboptimal solutions. Meanwhile, most of these methods lack a quick and effective online updating mechanism to adapt to the changes in target shape and scale that frequently occur during tracking. Accordingly, the invention introduces a position uncertainty estimation module into the state estimation task to guide candidate box screening, and a meta-learning-based classifier into the classification task for online updating, improving the accuracy and the robustness of the tracker respectively.
The technical scheme of the invention is as follows: a single-target tracking method based on position uncertainty estimation. The network parameters of the target tracking network are first trained offline; then, during tracking, a subset of video frames with prediction results is selected as online training samples for the classification branch of the network, and the parameters of the meta-learning-based classifier are updated to improve tracking robustness; meanwhile, a position voting mechanism is used in the candidate box screening stage of target tracking to improve tracking accuracy.
Further, the method comprises the steps of generating training samples, performing main network offline training, performing meta classifier offline training and performing online tracking:
1) Training sample generation: first, apply target-region augmentation to each frame of each video in the offline training data set; then crop out the augmented target search region and scale it to a fixed size. From each cropped video frame sequence, extract two frames at a certain interval to form a positive sample pair, and randomly extract one frame from each of two different video sequences to form a negative sample pair. In each pair, one frame serves as the template frame and the other as the search frame. For positive pairs, a classification branch label and a regression branch label are generated from the search frame and its target annotation box; for negative pairs, only a classification branch label is generated from the search frame and its target annotation box;
2) Offline training of the body network, comprising training of the network body and of the meta classifier. For the network body, the template frame and search frame images are first input into a twin network to extract their respective classification and regression feature maps. The classification feature map of the template frame is used as the convolution kernel f_cls of the classification branch and applied to the classification feature map of the search frame; the convolution operation produces a class confidence map M_cls. The regression feature map of the template frame is used as the convolution kernel f_reg of the regression branch; the convolution operation produces a regression map M_reg of center-to-boundary distances and a corresponding distance confidence map M_uncert, representing respectively the distances from the target's center point to the four object boundaries and the confidence values of the predicted distances. Then, the highest-scoring point is found on the class confidence map M_cls, the offset distances corresponding to that point and its neighboring points are looked up in M_reg, and voting according to their corresponding confidences yields the final predicted target box;
during training, the classification branch uses the Focal Loss from RetinaNet as its loss function, the regression branch uses the DIoU loss, and the uncertainty estimation module uses a negative log-likelihood (NLL) loss. Combined with the labels obtained from the search frame, an SGD (stochastic gradient descent) optimizer updates the whole network's parameters through backpropagation; positive and negative samples are continuously drawn at random and the process repeated until the iteration count is reached;
3) Offline training of the meta classifier: its input is the classification feature map of the search frame in the inference phase, and its output is a classification confidence map M′_cls, which is fused with the class confidence map M_cls of step 2) by weighted summation to obtain the final class confidence map: M_cls ← α·M_cls + (1−α)·M′_cls. In the training stage, the meta classifier is trained with the MAML algorithm to find a set of initialization parameters from which the classifier can quickly learn the target's appearance using a small number of samples and a few gradient updates;
4) Online tracking: first, the target box search region in the first frame of the video to be tracked is cropped out as the template; the template frame is then augmented into an online training data set of 5 images, and the network parameters are updated with 5 gradient-descent steps so that the meta classifier can classify the current tracking target. During tracking, from every 10 frames already tracked, the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the meta classifier.
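The weighted fusion of the two classification confidence maps in steps 3) and 4) can be sketched in plain Python on nested lists for clarity; the function name and list representation are illustrative assumptions (the actual implementation operates on tensors):

```python
def fuse_maps(m_cls, m_meta, alpha=0.6):
    # Weighted fusion of the body-network map M_cls and the meta-classifier
    # map M'_cls: M_cls <- alpha * M_cls + (1 - alpha) * M'_cls.
    # alpha = 0.6 follows the value reported later in the description.
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(m_cls, m_meta)]
```

Each cell of the final response map is a convex combination of the two predictions, so neither classifier can dominate the screening on its own.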
The invention aims to construct an accurate target tracker that can adapt to target deformation, distinguish background distractors, and so on, further improving the tracker's robustness. As the preceding analysis indicates, the SiamRPN series relies on class confidence during candidate box screening, which is not reasonable and yields a suboptimal model. The ATOM algorithm and similar methods introduce an IoU prediction network and use the predicted IoU value in place of the class confidence, but this approach is computationally heavy and its accuracy gain is limited. The invention adopts a fully convolutional twin network structure and proposes a single-target tracking method based on position uncertainty estimation, named FCST (Fully Convolutional Siamese Tracker). The position uncertainty estimation module designed by the invention predicts confidence information for the position coordinates output by the network, and a position voting mechanism generates the final prediction box in the subsequent stage, so that an accurate regression bounding box can be given. In addition, the invention provides an online updating strategy based on meta-learning, so that the tracker can adapt to changes in target shape and scale, improving its robustness.
Compared with the prior art, the invention has the following advantages.
The invention provides a single-target tracking method (FCST) based on position uncertainty estimation. The method adopts a fully convolutional twin network structure, introduces a position uncertainty estimation module into the target state estimation task, and generates the final prediction box by position voting, improving tracking accuracy while preserving tracking efficiency.
The invention introduces an online-updated classifier based on meta-learning into the classification task; through a few iterations on only a small number of training samples during tracking, it can adapt to changes in the target's shape and scale. Compared with existing twin-network-based tracking methods, the proposed FCST tracker adapts better to object deformation during tracking and effectively improves the robustness of target classification.
The method achieves good results on the single-object tracking task, improving both target regression precision and target classification robustness. Compared with existing methods, the tracking method provided by the invention exhibits good tracking success rates and localization accuracy on several visual tracking benchmark data sets.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
FIG. 2 is a diagram of a meta classifier structure.
Fig. 3 is a schematic diagram of a multivariate information fusion module provided by the present invention.
Fig. 4 is a schematic diagram of feature extraction fusion proposed by the present invention.
Detailed Description
The invention provides an accurate single-target tracking method based on a target transformation regression network. Offline training is carried out on four training data sets: TrackingNet-Train, LaSOT-Train, COCO-Train, and GOT-10k-Train. Tests on the OTB100, VOT2018, LaSOT-Test, and GOT-10k-Test sets achieve high accuracy and tracking success rates. The implementation uses the Python 3.6 programming language and the PyTorch 1.4 deep learning framework.
FIG. 1 is a system framework diagram of the invention: a fully convolutional twin network generates target classification and target regression templates to guide the classification and regression tasks, combined with a strategy for updating the classification and regression templates online, implementing the target tracking task. The whole method comprises a training sample generation stage, a body network training stage, a meta classifier offline training stage, and an online tracking stage; the specific implementation steps are as follows:
1) Data preparation phase, i.e., the training sample generation phase, which produces the training samples for the offline training process. First, target-region augmentation is applied to each frame of each video in the offline training data set; the augmented target search region is then cropped out and scaled to a fixed size. Two frames are extracted from each cropped video frame sequence at a certain interval to form a positive sample pair, and one frame is randomly extracted from each of two different video sequences to form a negative sample pair. In each pair, one frame serves as the template frame and the other as the search frame. For positive pairs, a classification branch label and a regression branch label are generated from the search frame and its target annotation box; for negative pairs, only a classification branch label is generated. On the classification label map of a positive pair, a coordinate point whose corresponding position in the original image falls in the central region of the annotation box is labelled 1, a point falling outside the annotation box is labelled 0, and all other positions on the map are labelled -1; on the classification label map of a negative pair, a point falling in the central region of the annotation box is labelled 0 and all other regions are labelled -1.
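The classification-label rule above can be sketched as follows. Grid size, stride, and the fraction of the box treated as "central" are illustrative assumptions, not values fixed by the patent:

```python
# Label rule for positive pairs: 1 in the central region of the annotated
# box, 0 clearly outside the box, -1 in the ambiguous in-between region
# (ignored by the classification loss).

def make_cls_label(grid, box, center_ratio=0.5):
    """grid: list of (x, y) image coordinates; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) / 2 * center_ratio   # half-extent of the central region
    half_h = (y2 - y1) / 2 * center_ratio
    labels = []
    for (x, y) in grid:
        if abs(x - cx) <= half_w and abs(y - cy) <= half_h:
            labels.append(1)    # central region of the annotation box
        elif not (x1 <= x <= x2 and y1 <= y <= y2):
            labels.append(0)    # clearly background
        else:
            labels.append(-1)   # inside the box but off-center: ignored
    return labels
```

For a negative pair the same geometry applies, but the central region would be labelled 0 instead of 1, since the search frame does not contain the template's target.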
2) Network body training stage, specifically as follows.
2.1) Extract template branch features: the modified GoogLeNet is used as the backbone of the twin network for feature extraction. Features are extracted from the template frame Z_i ∈ R^{B×3×127×127} to obtain F_temp ∈ R^{B×256×5×5}, where the subscript temp marks a feature extracted from the template frame and B is the batch size. GoogLeNet uses parameters pre-trained on ImageNet.
2.2) Template branch feature adjustment: to adapt the extracted features to the different tasks (classification and regression), they must be adjusted. The template branch features obtained in step 2.1) are input to a network containing a single convolutional layer, which uses a 3×3 convolution kernel with stride 1, 256 input channels, and 256 output channels. The adjusted template branch feature changes from F_temp ∈ R^{B×256×5×5} to F_temp_cls ∈ R^{B×256×5×5}. Similarly, to obtain features suited to the regression task, the template branch features are input to another single-layer convolutional network whose kernel size, stride, and channel counts are the same as above; the template branch feature changes from F_temp ∈ R^{B×256×5×5} to F_temp_reg ∈ R^{B×256×5×5}. The subscript cls indicates a feature used by the template branch for the classification task, and the subscript reg a feature used by the template branch for the regression task.
2.3) Search frame feature extraction: the search frame is input into the other branch of the twin network, which is also the adjusted GoogLeNet with parameters pre-trained on ImageNet. Unlike the template frame, the search frame has size X_i ∈ R^{B×3×255×255}; after feature extraction through the backbone, the search branch feature is F_search ∈ R^{B×256×27×27}, where 256 is the number of channels.
2.4) Search frame feature adjustment: likewise, the search branch features must be adjusted to suit the different tasks. The search branch features obtained in step 2.3) are input to a network containing a single convolutional layer with a 3×3 kernel, stride 1, 256 input channels, and 256 output channels. The adjusted search branch feature changes from F_search ∈ R^{B×256×27×27} to F_search_cls ∈ R^{B×256×27×27}. Similarly, to obtain features suited to the regression task, the search branch features are input to another single-layer convolutional network whose kernel size, stride, and channel counts are the same as above; the search branch feature changes from F_search ∈ R^{B×256×27×27} to F_search_reg ∈ R^{B×256×27×27}. The subscript cls indicates a feature used by the search branch for the classification task, and the subscript reg a feature used by the search branch for the regression task.
2.5) Obtain the classification confidence map: the classification feature F_temp_cls of the template branch is used as a convolution kernel and convolved (i.e., cross-correlated) with the classification feature F_search_cls of the search branch, producing a feature of size F_cls ∈ R^{B×256×23×23}; a three-layer convolutional network then outputs the final class confidence map M_cls ∈ R^{B×1×19×19}. The first two layers of this network use 3×3 kernels with stride 1, 256 input channels, and 256 output channels; the last layer uses a 1×1 kernel with stride 1, 256 input channels, and 1 output channel, its main role being to fuse information across channels.
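The cross-correlation in step 2.5) amounts to sliding the template feature over the search feature channel by channel. The NumPy sketch below is a readability aid under the assumption that the correlation is depthwise (a production implementation would use a grouped 2-D convolution on tensors); the shapes follow the description: a 5×5 template kernel slid over a 27×27 search feature yields a 23×23 response per channel.

```python
import numpy as np

def depthwise_xcorr(search, kernel):
    """search: (C, Hs, Ws) search-branch feature;
    kernel: (C, Hk, Wk) template-branch feature used as the kernel."""
    C, Hs, Ws = search.shape
    _, Hk, Wk = kernel.shape
    Ho, Wo = Hs - Hk + 1, Ws - Wk + 1          # 27 - 5 + 1 = 23
    out = np.zeros((C, Ho, Wo))
    for c in range(C):                          # per-channel correlation
        for y in range(Ho):
            for x in range(Wo):
                out[c, y, x] = np.sum(search[c, y:y+Hk, x:x+Wk] * kernel[c])
    return out
```

The 256-channel 23×23 output matches F_cls in the description; the subsequent small convolutional head reduces it to the 19×19 single-channel confidence map.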
2.6) Obtain the distance regression map M_reg and the corresponding position uncertainty map M_uncert: the regression feature F_temp_reg of the template branch is used as a convolution kernel and convolved with the regression feature F_search_reg of the search branch, producing a feature of size F_reg ∈ R^{B×256×23×23}. This feature is input to a two-layer convolutional network (3×3 kernels, stride 1, 256 input and output channels) whose output has size F_reg ∈ R^{B×256×19×19}. Finally, it is fed to two parallel single-layer convolutions, each with a 1×1 kernel, stride 1, 256 input channels, and 4 output channels; their outputs, M_reg ∈ R^{B×4×19×19} and M_uncert ∈ R^{B×4×19×19}, represent respectively the distances from the target's center point to the four object boundaries and the confidence values of the predicted distances.
2.7) For offline training, the classification branch uses the Focal Loss proposed with RetinaNet as its loss function, the offset-distance prediction module in the regression (state prediction) branch uses the DIoU loss, and the position confidence estimation module in the regression branch uses a negative log-likelihood (NLL) loss. The experiments use an SGD optimizer with batch size 16, 20 training epochs in total, and an initial learning rate of 0.0001 that is divided by 10 after epoch 15 (decay rate 0.1); training runs on 8 RTX 2080 Ti GPUs, and the whole network's parameters are updated through backpropagation. Steps 2.1) to 2.7) are repeated until the iteration count is reached.
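The description names a negative log-likelihood loss for the uncertainty module without giving its exact form; a common choice, shown here as an assumption, is the Gaussian NLL, where predicting a large variance discounts the localization error at the cost of a log-variance penalty:

```python
import math

def gaussian_nll(mu, sigma, target):
    # NLL of `target` under N(mu, sigma^2). The 0.5*log(2*pi*sigma^2) term
    # penalizes claiming high uncertainty everywhere; the scaled squared-error
    # term penalizes confident but wrong distance predictions. Together they
    # teach the network to report calibrated confidence for each boundary.
    return 0.5 * math.log(2 * math.pi * sigma ** 2) \
        + (target - mu) ** 2 / (2 * sigma ** 2)
```

Under this loss, a boundary distance the network cannot localize well is best handled by inflating its predicted uncertainty, which is exactly the signal the later voting stage consumes.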
3) The offline training stage of the meta classifier must follow the training of the body network. The meta classifier is a network comprising several convolutional layers; it shares with the body network the feature F_cls ∈ R^{B×256×23×23} obtained by convolving the template branch classification feature with the search branch classification feature, taking it as input. Its output is still a class confidence map, of the same size as the one output in step 2.5), denoted M′_cls ∈ R^{B×1×19×19}. It is trained with the MAML algorithm (Model-Agnostic Meta-Learning), which consists mainly of inner-level and outer-level optimization. Specifically:

Given a video sequence V_i used for training, first collect a training sample set S_i, also known in the meta-learning field as the Support Set. Define the classifier as f(x; θ_0), where x is the input picture and θ_0 are the network initialization parameters. The network is updated on the training set with a k-step stochastic gradient descent algorithm:

θ_k ← θ_{k−1} − α · ∇_{θ_{k−1}} Σ_{(x,y)∈S_i} ℓ(f(x; θ_{k−1}), y),  k = 1, …, K,

where α is the step size of the inner update, ℓ is the loss function, and (x, y) is a sample pair in the training set. In the MAML algorithm, the equation above is called inner-level optimization.
To evaluate the generalization performance of the classifier, a second sample set T_i, known in the meta-learning field as the Target Set, is collected from the same video sequence V_i, and the loss of the inner-optimized model f(x; θ_K) is computed on D_i = S_i ∪ T_i:

L_i(θ_0) = Σ_{(x,y)∈D_i} ℓ(f(x; θ_K), y),

where D_i denotes the union of the Support Set and the Target Set. The training goal of the whole network is to find initialization parameters θ_0 that satisfy all video sequences as well as possible, which can be expressed as

θ_0* = argmin_{θ_0} Σ_i L_i(θ_0).

This formula is called outer-level optimization and is updated using the Adam algorithm.
The body network is first trained according to step 2), and the meta classifier is then trained offline on top of it. During offline training, 8 video sequences are randomly picked per batch (i.e., batch size = 8), with 600 iterations per epoch and 100 epochs in total. The inner level is optimized with 5 gradient updates of stochastic gradient descent at learning rate 0.01; the outer level is optimized with the Adam algorithm at learning rate 0.001.
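The inner/outer structure above can be illustrated on a one-parameter least-squares model, where the gradients can be written by hand. The data, learning rates, and scalar model are illustrative assumptions; real training backpropagates the outer loss through the 5 inner SGD steps with autograd:

```python
# Toy MAML-style sketch: a linear model y = theta * x with squared error.

def inner_update(theta0, support, lr=0.01, steps=5):
    # k-step SGD on the Support Set: theta_k = theta_{k-1} - lr * dL/dtheta,
    # mirroring the inner-level optimization (5 steps at lr 0.01 per the text).
    theta = theta0
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in support) / len(support)
        theta -= lr * grad
    return theta

def outer_loss(theta0, support, target):
    # Outer objective: loss of the inner-adapted model on Support ∪ Target.
    theta_k = inner_update(theta0, support)
    data = support + target
    return sum((theta_k * x - y) ** 2 for x, y in data) / len(data)
```

Minimizing `outer_loss` over `theta0` (with Adam, in the patent's setting) yields an initialization from which a few inner steps already fit a new sequence well.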
4) In the online tracking stage, the meta classifier must be updated online. Specifically, given a video sequence and the annotation of its first frame, the algorithm first uses the first-frame picture and its label as a positive training sample to fine-tune the meta classifier, so that it can classify the current target in the subsequent process. Because the first frame provides only one sample, the training sample set is extended by data augmentation to form the Support Set, giving the meta classifier stronger generalization ability.
In the subsequent tracking process, the method continuously collects previous tracking results for later update operations, uses the position voting mechanism in the candidate box screening stage to improve tracking accuracy, and from every 10 tracked frames selects the frame with the highest classification score, together with its tracked target box as label, to add to the online training data set for updating the meta classifier. Since tracking results are not as reliable as the first-frame annotation and may be inaccurate or even wrong, a frame is admitted to the Support Set only when its prediction box's position confidence exceeds a threshold θ_loc and its class confidence also exceeds a threshold θ_cls. In the experimental implementation, at most 15 samples are cached in the Support Set; considering the time consumed by online updating, an update is performed every 10 frames, with only one gradient-descent step per update, to save time.
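The Support Set maintenance rule can be sketched as follows; the threshold values are illustrative assumptions (the patent names θ_loc and θ_cls without fixing them), while the 15-sample cap and 10-frame cadence follow the text:

```python
MAX_SUPPORT = 15  # at most 15 samples cached, per the description

def maybe_add_support(support, frame_id, loc_conf, cls_conf,
                      theta_loc=0.8, theta_cls=0.7):
    # Admit a tracked frame only if both its position confidence and its
    # class confidence clear their thresholds; evict the oldest sample
    # once the buffer is full, since pseudo-labels go stale.
    if loc_conf > theta_loc and cls_conf > theta_cls:
        support.append(frame_id)
        if len(support) > MAX_SUPPORT:
            support.pop(0)
    return support

def should_update(frame_idx, interval=10):
    # Refresh the meta classifier every `interval` frames (one gradient step).
    return frame_idx % interval == 0 and frame_idx > 0
```

Gating on both confidences keeps unreliable tracking results out of the online training set, which is what protects the one-step updates from drift.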
When tracking starts, the target box region in the first frame of the video to be tracked is cropped out and scaled to 127×127 as the input of the template branch. For each subsequent frame, the size of the current search range is computed from the previous frame's prediction, and a search region of the corresponding size is cropped out, scaled to 255×255, and input into the other branch of the twin network. Once the body network outputs the class confidence map M_cls, the offset-distance regression map M_reg, and the position uncertainty map M_uncert, and the meta classifier outputs its class confidence map M′_cls, the two class confidence maps are first fused by weighted summation to obtain the final class response map:

M_cls ← α·M_cls + (1−α)·M′_cls,

where α is a weighting factor; α = 0.6 in the experiments.
Find the highest-scoring point on the category response map and select the N+1 prediction frames corresponding to that position and its N neighbouring positions as the candidate set

B = { (l_i, σ_l^i, t_i, σ_t^i, r_i, σ_r^i, b_i, σ_b^i) | i = 0, 1, ..., N },

where each element contains the four boundary offsets (left, upper, right, lower) of a prediction frame together with their corresponding confidences: l_i is the left-boundary value of prediction frame i and σ_l^i its corresponding uncertainty; t_i is the upper-boundary value and σ_t^i its uncertainty; r_i is the right-boundary value and σ_r^i its uncertainty; b_i is the lower-boundary value and σ_b^i its uncertainty; i is the predicted target frame sequence number. The candidate set is then divided into four subsets according to the four boundaries:

B_l = { (l_i, σ_l^i) },  B_t = { (t_i, σ_t^i) },  B_r = { (r_i, σ_r^i) },  B_b = { (b_i, σ_b^i) },

and from each subset the K items with the highest confidence are selected in turn to form new subsets B_l^K, B_t^K, B_r^K and B_b^K.
The final prediction frame is denoted B_pred = { l_pred, t_pred, r_pred, b_pred }, where l_pred, the left-boundary value of the final prediction frame, is obtained by confidence-weighted voting over B_l^K (the other three boundaries are computed analogously):

l_pred = Σ_{i ∈ B_l^K} w_i · l_i,

where the vote weight w_i of each selected candidate is derived from its predicted confidence, normalised over the K selected items.
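The per-boundary top-K screening and confidence-weighted voting described above can be sketched as follows; normalising the raw confidences into vote weights is an assumption, since the exact weighting formula appears only in the patent's figure:

```python
import numpy as np

def vote_boundaries(offsets, conf, k=3):
    """Per-boundary top-K confidence-weighted voting.

    offsets: (M, 4) array of (l, t, r, b) distances for the M candidates
    conf:    (M, 4) per-boundary confidences (higher = more certain)
    Returns the voted (l, t, r, b) prediction.
    """
    pred = np.empty(4)
    for j in range(4):                             # l, t, r, b voted independently
        top = np.argsort(conf[:, j])[-k:]          # K most confident candidates
        w = conf[top, j] / conf[top, j].sum()      # normalised vote weights (assumption)
        pred[j] = (w * offsets[top, j]).sum()      # weighted-average boundary
    return pred
```

When all candidates agree on a boundary value, the vote returns that value regardless of the confidence distribution.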
On the test data sets, the tracking speed is 30 fps. In terms of tracking accuracy, AUC reaches 70.1% and Pre reaches 91.5% on the OTB100 data set; on the VOT2018 data set, EAO reaches 0.474, robustness reaches 0.164 and accuracy reaches 0.609; Suc reaches 56.2% on the LaSOT data set; on the GOT-10k data set, SR_0.5 is 0.723, SR_0.75 is 0.530 and AR is 0.614.

Claims (4)

1. A single-target tracking method based on position uncertainty estimation, characterized in that the network parameters of a target tracking network are trained offline; during tracking, some video frames together with their prediction results are then selected as online training samples for the classification branch of the target tracking network, and the network parameters of a meta-learning-based classifier are updated to improve tracking robustness; meanwhile, a position voting mechanism is used in the candidate frame screening stage of target tracking to improve tracking accuracy.
2. The method of claim 1, wherein the method comprises generating training samples, subject network offline training, meta classifier offline training, and online tracking:
1) Generating training samples: first, target area enhancement processing is performed on each frame image of each video in the offline training data set; the enhanced target search area is then cut out and scaled to a fixed size. From each cropped video frame sequence, two frames are extracted at a certain interval to generate a positive sample pair, and one frame is randomly extracted from each of two different video sequences to generate a negative sample pair; in each sample pair, one frame serves as the template frame and the other as the search frame. For a positive sample pair, a classification branch label and a regression branch label are generated from the search frame and its target annotation frame; for a negative sample pair, only a classification branch label is generated from the search frame and its target annotation frame;
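A minimal sketch of the positive/negative pair sampling in step 1); the frame representation, sampling interval and pair counts are illustrative assumptions:

```python
import random

def sample_pairs(videos, max_gap=100, n_neg=1):
    """Sample one positive pair (two frames of the same video, within max_gap
    frames, clipped to the video end) and n_neg negative pairs (frames from
    two different videos). Frame contents are opaque here."""
    v = random.choice(videos)
    i = random.randrange(len(v))
    j = min(len(v) - 1, i + random.randrange(1, max_gap + 1))
    pairs = [((v[i], v[j]), 1)]            # label 1: same target
    for _ in range(n_neg):
        va, vb = random.sample(videos, 2)  # two distinct videos
        pairs.append(((random.choice(va), random.choice(vb)), 0))  # label 0
    return pairs
```

In each returned pair, the first frame would serve as the template frame and the second as the search frame.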
2) Subject network offline training, comprising training of the network body part and of the meta classifier. For training of the network body part, the template frame and search frame pictures are first input into the twin network to extract their respective classification and regression feature maps; the classification feature map of the template frame is used as the convolution kernel f_cls of the classification branch and acts on the classification feature map of the search frame, the convolution operation generating a category score confidence map M_cls; the regression feature map of the template frame is used as the convolution kernel f_reg of the regression branch, the convolution operation generating the centre-point-to-target-boundary distance regression map M_reg and the corresponding distance confidence map M_uncert, which respectively represent the distances from the target centre point to the four boundaries of the object and the confidence values of the predicted distances. The highest-scoring point is then found on the category confidence map M_cls, the offset distances corresponding to that point and nearby points are found in M_reg, and voting is performed according to the confidences corresponding to these offset distances to obtain the final predicted target frame;
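The step of locating the highest-scoring point on M_cls and reading out the four boundary distances at that location from M_reg can be sketched as follows; the stride and offset mapping the 19 × 19 grid back to the 255 × 255 crop are assumed values, not taken from the patent:

```python
import numpy as np

def decode_peak_box(m_cls, m_reg, stride=8, offset=31):
    """Decode the (l, t, r, b) distances at the peak of the class confidence
    map into a box in search-crop coordinates (stride/offset are assumptions)."""
    m_cls = m_cls.squeeze()                                   # (19, 19)
    y, x = np.unravel_index(np.argmax(m_cls), m_cls.shape)    # peak location
    cx, cy = offset + x * stride, offset + y * stride         # grid cell centre
    l, t, r, b = m_reg[:, y, x]                               # boundary distances
    return (cx - l, cy - t, cx + r, cy + b)                   # (x1, y1, x2, y2)
```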
during training, the classification branch uses the Focal Loss from RetinaNet as its loss function, the regression branch uses the DIoU loss function, and the uncertainty estimation module uses the negative log-likelihood loss function NPLL; combined with the labels obtained from the search frame, an SGD optimizer is used and the whole network's parameters are updated by the back-propagation algorithm, with positive and negative samples continually drawn at random and the process repeated until the number of iterations is reached;
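One common instantiation of the negative log-likelihood uncertainty loss named above is the Gaussian form with a predicted log-variance; the patent does not spell out its exact form, so the version below is an assumption:

```python
import numpy as np

def uncertainty_nll(pred, target, log_var):
    """Gaussian negative log-likelihood with a predicted log-variance per
    boundary: confident-but-wrong predictions are penalised heavily, while
    raising the predicted variance trades accuracy for a regularising term."""
    err2 = (pred - target) ** 2
    return np.mean(0.5 * np.exp(-log_var) * err2 + 0.5 * log_var)
```

With perfect predictions and unit variance (log_var = 0) the loss is exactly zero, which makes the term easy to sanity-check in training code.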
3) Meta classifier offline training: the input of the meta classifier is the classification feature map of the search frame in the inference phase, and the output is a classification confidence map M'_cls; this map is weighted and summed with the category confidence map M_cls from step 2) to obtain the final category confidence map M_cls ← α·M_cls + (1-α)·M'_cls, where α is a weighting factor. In the training stage, the meta classifier is trained with the MAML algorithm, finding a set of initialization parameters from which the classifier can quickly learn the target's information using a small number of samples and a few gradient updates;
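The MAML-style adaptation in step 3) amounts to a few inner-loop gradient steps from meta-learned initial parameters; a minimal sketch, in which the adaptation loss and its gradient function are placeholders:

```python
import numpy as np

def maml_inner_update(theta, grad_fn, steps=5, lr=0.1):
    """Adapt meta-learned initial parameters `theta` to the current target
    with a few gradient steps; grad_fn(theta) returns the gradient of the
    (placeholder) adaptation loss. The initial parameters are not modified."""
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta
```

For example, with the quadratic loss 0.5·||θ||², whose gradient is θ, each step shrinks θ by the factor (1 - lr).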
4) Online tracking: first, the target frame search area in the first frame image of the video to be tracked is cut out as the template; the template frame is then augmented into an online training data set containing 5 frame images, and the network parameters are updated through 5 gradient descents so that the meta classifier can classify the current tracking target. During tracking, from every 10 frames of the already-tracked frame sequence, the frame with the highest classification score is selected and, together with the target frame obtained by tracking as its label, added to the online training data set for updating the meta classifier.
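The "one frame per 10 tracked frames" sample selection in step 4) can be sketched as:

```python
def select_update_frames(scores, interval=10):
    """For each window of `interval` tracked frames, keep the index of the
    frame with the highest classification score; those frames (with their
    tracked boxes as labels) become online training samples."""
    picks = []
    for start in range(0, len(scores), interval):
        window = scores[start:start + interval]
        picks.append(start + max(range(len(window)), key=window.__getitem__))
    return picks
```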
3. The single-target tracking method based on the position uncertainty estimation as claimed in claim 2, wherein the network body part training specifically comprises:
2.1) Extract template branch features: for a template frame Z_i ∈ R^{B×3×127×127}, feature extraction yields the template branch feature F_temp ∈ R^{B×256×5×5};
2.2) Template branch feature adjustment: the template branch feature obtained in step 2.1) is input into two networks each containing a single convolutional layer, F_temp ∈ R^{B×256×5×5} becoming F_temp,cls and F_temp,reg respectively; the subscript cls indicates a feature used by the template branch for the classification task, and the subscript reg indicates a feature used by the template branch for the regression task;
2.3) Search frame feature extraction: a search frame of size X_i ∈ R^{B×3×255×255} passes through the backbone network, giving the search branch feature F_search ∈ R^{B×256×27×27};
2.4) Search frame feature adjustment: the search branch feature obtained in step 2.3) is input into two networks each containing a single convolutional layer, F_search ∈ R^{B×256×27×27} becoming F_search,cls and F_search,reg respectively; the subscript cls indicates a feature used by the search branch for the classification task, and the subscript reg indicates a feature used by the search branch for the regression task;
2.5) Obtain a classification confidence map: the classification feature F_temp,cls of the template branch is used as a convolution kernel and convolved with the classification feature F_search,cls of the search branch to obtain F_cls ∈ R^{B×256×23×23}, which is then passed through a three-layer convolutional network to output the final category confidence map M_cls ∈ R^{B×1×19×19};
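The "template feature as convolution kernel" operation in step 2.5) is a plain cross-correlation; a naive sketch (no padding, stride 1) reproduces the stated 27 → 23 size reduction for a 5 × 5 template feature:

```python
import numpy as np

def xcorr(search, kernel):
    """Cross-correlate the template feature (kernel) over the search feature,
    summing over channels: search (C, Hs, Ws), kernel (C, Hk, Wk)
    -> response (Hs - Hk + 1, Ws - Wk + 1)."""
    C, Hs, Ws = search.shape
    _, Hk, Wk = kernel.shape
    out = np.empty((Hs - Hk + 1, Ws - Wk + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(search[:, y:y + Hk, x:x + Wk] * kernel)
    return out
```

In a real implementation this would be a batched `conv2d` with the template feature as the weight tensor; the loops here are for clarity only.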
2.6) Obtain the distance regression map M_reg and the corresponding position uncertainty map M_uncert: the regression feature F_temp,reg of the template branch is used as a convolution kernel and convolved with the regression feature F_search,reg of the search branch to obtain a feature of size F_reg ∈ R^{B×256×23×23}; this feature is input into a two-layer convolutional network, convolution yielding the feature F_reg ∈ R^{B×256×19×19}, which is finally input into two parallel single convolutional layers whose outputs are M_reg ∈ R^{B×4×19×19} and M_uncert ∈ R^{B×4×19×19}, respectively representing the distances from the target centre point to the four boundaries of the object and the confidence values of the predicted distances;
2.7) For offline training, the classification branch uses the Focal Loss proposed with RetinaNet as its loss function, the offset-distance prediction module in the regression branch uses the DIoU loss, and the position confidence estimation module in the regression branch uses the negative log-likelihood loss function (NPLL). An SGD optimizer is used, with BatchSize set to 16, 20 training rounds in total, an initial learning rate of 0.0001 divided by 10 after 15 rounds, and a decay rate of 0.1; the whole network's parameters are updated by the back-propagation algorithm, and steps 2.1) to 2.7) are repeated until the number of iterations is reached.
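The learning-rate schedule stated in step 2.7) can be written as a simple step schedule:

```python
def lr_at_epoch(epoch, base_lr=1e-4, decay=0.1, milestone=15):
    """Step schedule from the claim: base learning rate 1e-4, multiplied by
    the decay factor 0.1 from round 15 onward (20 rounds in total)."""
    return base_lr * (decay if epoch >= milestone else 1.0)
```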
4. The single-target tracking method based on position uncertainty estimation according to claim 1 or 2, characterized in that the target tracking network adopts a fully convolutional twin network to generate target classification and target regression templates to guide the classification and regression tasks, and updates the classification and regression templates' strategies online to realize the target tracking task.
CN202110566900.2A 2021-05-24 2021-05-24 Single-target tracking method based on position uncertainty estimation Pending CN115393388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110566900.2A CN115393388A (en) 2021-05-24 2021-05-24 Single-target tracking method based on position uncertainty estimation


Publications (1)

Publication Number Publication Date
CN115393388A true CN115393388A (en) 2022-11-25

Family

ID=84114183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110566900.2A Pending CN115393388A (en) 2021-05-24 2021-05-24 Single-target tracking method based on position uncertainty estimation

Country Status (1)

Country Link
CN (1) CN115393388A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220326768A1 (en) * 2021-04-09 2022-10-13 Honda Motor Co., Ltd. Information processing apparatus, information processing method, learning method, and storage medium
US12013980B2 (en) * 2021-04-09 2024-06-18 Honda Motor Co., Ltd. Information processing apparatus, information processing method, learning method, and storage medium

Similar Documents

Publication Publication Date Title
CN110443818B (en) Graffiti-based weak supervision semantic segmentation method and system
CN108985334B (en) General object detection system and method for improving active learning based on self-supervision process
CN108647577B (en) Self-adaptive pedestrian re-identification method and system for difficult excavation
CN111144364B (en) Twin network target tracking method based on channel attention updating mechanism
CN113326731B (en) Cross-domain pedestrian re-identification method based on momentum network guidance
CN110135502B (en) Image fine-grained identification method based on reinforcement learning strategy
US11816149B2 (en) Electronic device and control method thereof
CN108875610B (en) Method for positioning action time axis in video based on boundary search
CN113628244B (en) Target tracking method, system, terminal and medium based on label-free video training
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
Li et al. Robust deep neural networks for road extraction from remote sensing images
CN112651998A (en) Human body tracking algorithm based on attention mechanism and double-current multi-domain convolutional neural network
CN112287906B (en) Template matching tracking method and system based on depth feature fusion
CN116229112A (en) Twin network target tracking method based on multiple attentives
CN115424177A (en) Twin network target tracking method based on incremental learning
CN113283467B (en) Weak supervision picture classification method based on average loss and category-by-category selection
CN115393388A (en) Single-target tracking method based on position uncertainty estimation
Chun et al. USD: Uncertainty-based One-phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection
CN116561562B (en) Sound source depth optimization acquisition method based on waveguide singular points
CN116958057A (en) Strategy-guided visual loop detection method
CN110889418A (en) Gas contour identification method
Zhu et al. Find gold in sand: Fine-grained similarity mining for domain-adaptive crowd counting
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device
CN114220086A (en) Cost-efficient scene character detection method and system
CN114596338A (en) Twin network target tracking method considering time sequence relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination