CN115393388A - Single-target tracking method based on position uncertainty estimation - Google Patents

Single-target tracking method based on position uncertainty estimation

Info

Publication number
CN115393388A
CN115393388A
Authority
CN
China
Prior art keywords
frame
target
search
classification
branch
Prior art date
Legal status
Pending
Application number
CN202110566900.2A
Other languages
Chinese (zh)
Inventor
武港山
徐梦强
王利民
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110566900.2A priority Critical patent/CN115393388A/en
Publication of CN115393388A publication Critical patent/CN115393388A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An accurate target tracking method based on a target transformation regression network comprises the following stages: 1) training sample generation; 2) network body training; 3) offline training of the meta classifier; 4) online tracking. The position uncertainty estimation module designed by the invention predicts confidence information for the position coordinates output by the network, and a position voting mechanism generates the final prediction box in the subsequent stage, so that an accurate regression bounding box can be given. In addition, the invention provides an online updating strategy based on meta-learning, so that the tracker can adapt to changes in the shape and scale of the target, improving its robustness. Compared with existing single-target tracking methods, the tracking method of the invention adapts better to object deformation during tracking and effectively improves target regression precision.

Description

Single-target tracking method based on position uncertainty estimation
Technical Field
The invention belongs to the technical field of computer software, relates to a single-target tracking technology, and particularly relates to a single-target tracking method based on position uncertainty estimation.
Background
Target tracking is a fundamental task in the field of computer vision. In general, the target tracking problem can be summarized simply as: given a video sequence and the position of the target in its first frame, the algorithm must accurately track the target's position through the subsequent frames, thereby obtaining the complete trajectory of the target's motion.
The object tracking problem can be viewed as a combination of a classification task and a state estimation task: classification provides coarse location information, and state estimation provides more accurate object state information on that basis. To make the tracking result more accurate, the design of the state estimation task is crucial. Current target tracking algorithms, classified by their state estimation, fall into three types. The first type mainly includes early correlation filtering methods and twin (Siamese) network methods such as DCF and SiamFC, which use simple multi-scale testing in the state estimation stage; this is both inaccurate and time-consuming. The second type is mainly the SiamRPN series, which introduces the RPN module commonly used in target detection into SiamFC, so that the tracker can regress both position and shape while dispensing with multi-scale tests. While this approach improves the accuracy of the algorithm and maintains speed, it still has many deficiencies. The third type mainly comprises the ATOM and DiMP methods, which randomly generate a number of candidate boxes around the rough position given by the classification task, then iteratively optimize the candidate boxes by gradient descent on a specially designed IoU prediction network, obtaining a more accurate prediction box. This approach, while making great progress in accuracy, is computationally expensive and introduces many hyper-parameters that require careful tuning. In addition, some anchor-free methods from the target detection field have recently been applied to target tracking with good results, but these methods are still not accurate enough and their robustness remains to be improved.
Disclosure of Invention
The invention aims to solve the following problems: in the candidate box screening stage of the target tracking process, the lack of position confidence lowers the accuracy of the algorithm, and the tracker's inability to adapt to possible changes in target shape and scale lowers its robustness. In the candidate box screening stage of previous twin-network-based tracking methods (such as SiamRPN and SiamFC++), the prediction box corresponding to the highest classification confidence is generally selected as the final prediction box, but studies in the target detection field indicate that this is not reasonable and that the model can only obtain suboptimal solutions. Meanwhile, most of these methods lack a quick and effective online updating mechanism to adapt to the changes in target shape and scale that frequently occur during tracking. Accordingly, the invention introduces a position uncertainty estimation module into the state estimation task to guide candidate box screening, and a meta-learning-based classifier into the classification task for online updating, improving the accuracy and the robustness of the tracker respectively.
The technical scheme of the invention is as follows: a single-target tracking method based on position uncertainty estimation. The network parameters of the target tracking network are first trained offline; then, during tracking, a subset of video frames with prediction results is selected as online training samples for the classification branch of the network, and the parameters of the meta-learning-based classifier are updated to improve tracking robustness; meanwhile, a position voting mechanism is used in the candidate box screening stage of target tracking to improve tracking accuracy.
Further, the method comprises the steps of generating training samples, performing main network offline training, performing meta classifier offline training and performing online tracking:
1) Training sample generation: first, apply target-region augmentation to each frame of each video in the offline training data set; then crop out the augmented target search region and scale it to a fixed size. From each cropped video frame sequence, extract two frames at a certain interval to form a positive sample pair, and randomly extract one frame from each of two different video sequences to form a negative sample pair. In each pair, one frame serves as the template frame and the other as the search frame. For positive pairs, a classification branch label and a regression branch label are generated from the search frame and its target annotation box; for negative pairs, only a classification branch label is generated from the search frame and its target annotation box;
2) Offline training of the body network, comprising training of the network body and of the meta classifier. For the network body, the template frame and search frame images are first input into a twin network to extract their respective classification and regression feature maps. The classification feature map of the template frame is used as the convolution kernel f_cls of the classification branch and applied to the classification feature map of the search frame; the convolution operation produces a class confidence map M_cls. The regression feature map of the template frame is used as the convolution kernel f_reg of the regression branch; the convolution operation produces a regression map M_reg of center-to-boundary distances and a corresponding distance confidence map M_uncert, representing respectively the distances from the target's center point to the four object boundaries and the confidence values of the predicted distances. Then, the highest-scoring point is found on the class confidence map M_cls, the offset distances corresponding to that point and its neighboring points are looked up in M_reg, and voting according to their corresponding confidences yields the final predicted target box;
during training, the classification branch uses the Focal Loss from RetinaNet as its loss function, the regression branch uses the DIoU loss, and the uncertainty estimation module uses a negative log-likelihood (NLL) loss. Combined with the labels obtained from the search frame, an SGD (stochastic gradient descent) optimizer updates the whole network's parameters through backpropagation; positive and negative samples are continuously drawn at random and the process repeated until the iteration count is reached;
3) Offline training of the meta classifier: its input is the classification feature map of the search frame in the inference phase, and its output is a classification confidence map M′_cls, which is fused with the class confidence map M_cls of step 2) by weighted summation to obtain the final class confidence map: M_cls ← α·M_cls + (1−α)·M′_cls. In the training stage, the meta classifier is trained with the MAML algorithm to find a set of initialization parameters from which the classifier can quickly learn the target's appearance using a small number of samples and a few gradient updates;
4) Online tracking: first, the target box search region in the first frame of the video to be tracked is cropped out as the template; the template frame is then augmented into an online training data set of 5 images, and the network parameters are updated with 5 gradient-descent steps so that the meta classifier can classify the current tracking target. During tracking, from every 10 frames already tracked, the frame with the highest classification score, together with its tracked target box as label, is added to the online training data set for updating the meta classifier.
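The weighted fusion of the two classification confidence maps in steps 3) and 4) can be sketched in plain Python on nested lists for clarity; the function name and list representation are illustrative assumptions (the actual implementation operates on tensors):

```python
def fuse_maps(m_cls, m_meta, alpha=0.6):
    # Weighted fusion of the body-network map M_cls and the meta-classifier
    # map M'_cls: M_cls <- alpha * M_cls + (1 - alpha) * M'_cls.
    # alpha = 0.6 follows the value reported later in the description.
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(m_cls, m_meta)]
```

Each cell of the final response map is a convex combination of the two predictions, so neither classifier can dominate the screening on its own.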
The invention aims to construct an accurate target tracker that can adapt to target deformation, distinguish background distractors, and so on, further improving the tracker's robustness. As the preceding analysis indicates, the SiamRPN series relies on class confidence during candidate box screening, which is not reasonable and yields a suboptimal model. The ATOM algorithm and similar methods introduce an IoU prediction network and use the predicted IoU value in place of the class confidence, but this approach is computationally heavy and its accuracy gain is limited. The invention adopts a fully convolutional twin network structure and proposes a single-target tracking method based on position uncertainty estimation, named FCST (Fully Convolutional Siamese Tracker). The position uncertainty estimation module designed by the invention predicts confidence information for the position coordinates output by the network, and a position voting mechanism generates the final prediction box in the subsequent stage, so that an accurate regression bounding box can be given. In addition, the invention provides an online updating strategy based on meta-learning, so that the tracker can adapt to changes in target shape and scale, improving its robustness.
Compared with the prior art, the invention has the following advantages.
The invention provides a single-target tracking method (FCST) based on position uncertainty estimation. The method adopts a fully convolutional twin network structure, introduces a position uncertainty estimation module into the target state estimation task, and generates the final prediction box by position voting, improving tracking accuracy while preserving tracking efficiency.
The invention introduces an online-updated classifier based on meta-learning into the classification task; through a few iterations on only a small number of training samples during tracking, it can adapt to changes in the target's shape and scale. Compared with existing twin-network-based tracking methods, the proposed FCST tracker adapts better to object deformation during tracking and effectively improves the robustness of target classification.
The method achieves good results on the single-object tracking task, improving both target regression precision and target classification robustness. Compared with existing methods, the tracking method provided by the invention exhibits good tracking success rates and localization accuracy on several visual tracking benchmark data sets.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
FIG. 2 is a diagram of a meta classifier structure.
Fig. 3 is a schematic diagram of a multivariate information fusion module provided by the present invention.
Fig. 4 is a schematic diagram of feature extraction fusion proposed by the present invention.
Detailed Description
The invention provides an accurate single-target tracking method based on a target transformation regression network. Offline training is carried out on four training data sets: TrackingNet-Train, LaSOT-Train, COCO-Train, and GOT-10k-Train. Tests on the OTB100, VOT2018, LaSOT-Test, and GOT-10k-Test sets achieve high accuracy and tracking success rates. The implementation uses the Python 3.6 programming language and the PyTorch 1.4 deep learning framework.
FIG. 1 is a system framework diagram of the invention: a fully convolutional twin network generates target classification and target regression templates to guide the classification and regression tasks, combined with a strategy for updating the classification and regression templates online, implementing the target tracking task. The whole method comprises a training sample generation stage, a body network training stage, a meta classifier offline training stage, and an online tracking stage; the specific implementation steps are as follows:
1) Data preparation phase, i.e., the training sample generation phase, which produces the training samples for the offline training process. First, target-region augmentation is applied to each frame of each video in the offline training data set; the augmented target search region is then cropped out and scaled to a fixed size. Two frames are extracted from each cropped video frame sequence at a certain interval to form a positive sample pair, and one frame is randomly extracted from each of two different video sequences to form a negative sample pair. In each pair, one frame serves as the template frame and the other as the search frame. For positive pairs, a classification branch label and a regression branch label are generated from the search frame and its target annotation box; for negative pairs, only a classification branch label is generated. On the classification label map of a positive pair, a coordinate point whose corresponding position in the original image falls in the central region of the annotation box is labelled 1, a point falling outside the annotation box is labelled 0, and all other positions on the map are labelled -1; on the classification label map of a negative pair, a point falling in the central region of the annotation box is labelled 0 and all other regions are labelled -1.
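The classification-label rule above can be sketched as follows. Grid size, stride, and the fraction of the box treated as "central" are illustrative assumptions, not values fixed by the patent:

```python
# Label rule for positive pairs: 1 in the central region of the annotated
# box, 0 clearly outside the box, -1 in the ambiguous in-between region
# (ignored by the classification loss).

def make_cls_label(grid, box, center_ratio=0.5):
    """grid: list of (x, y) image coordinates; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) / 2 * center_ratio   # half-extent of the central region
    half_h = (y2 - y1) / 2 * center_ratio
    labels = []
    for (x, y) in grid:
        if abs(x - cx) <= half_w and abs(y - cy) <= half_h:
            labels.append(1)    # central region of the annotation box
        elif not (x1 <= x <= x2 and y1 <= y <= y2):
            labels.append(0)    # clearly background
        else:
            labels.append(-1)   # inside the box but off-center: ignored
    return labels
```

For a negative pair the same geometry applies, but the central region would be labelled 0 instead of 1, since the search frame does not contain the template's target.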
2) Network body training stage, specifically as follows.
2.1) Extract template branch features: the modified GoogLeNet is used as the backbone of the twin network for feature extraction. Features are extracted from the template frame Z_i ∈ R^{B×3×127×127} to obtain F_temp ∈ R^{B×256×5×5}, where the subscript temp marks a feature extracted from the template frame and B is the batch size. GoogLeNet uses parameters pre-trained on ImageNet.
2.2) Template branch feature adjustment: to adapt the extracted features to the different tasks (classification and regression), they must be adjusted. The template branch features obtained in step 2.1) are input to a network containing a single convolutional layer, which uses a 3×3 convolution kernel with stride 1, 256 input channels, and 256 output channels. The adjusted template branch feature changes from F_temp ∈ R^{B×256×5×5} to F_temp_cls ∈ R^{B×256×5×5}. Similarly, to obtain features suited to the regression task, the template branch features are input to another single-layer convolutional network whose kernel size, stride, and channel counts are the same as above; the template branch feature changes from F_temp ∈ R^{B×256×5×5} to F_temp_reg ∈ R^{B×256×5×5}. The subscript cls indicates a feature used by the template branch for the classification task, and the subscript reg a feature used by the template branch for the regression task.
2.3) Search frame feature extraction: the search frame is input into the other branch of the twin network, which is also the adjusted GoogLeNet with parameters pre-trained on ImageNet. Unlike the template frame, the search frame has size X_i ∈ R^{B×3×255×255}; after feature extraction through the backbone, the search branch feature is F_search ∈ R^{B×256×27×27}, where 256 is the number of channels.
2.4) Search frame feature adjustment: likewise, the search branch features must be adjusted to suit the different tasks. The search branch features obtained in step 2.3) are input to a network containing a single convolutional layer with a 3×3 kernel, stride 1, 256 input channels, and 256 output channels. The adjusted search branch feature changes from F_search ∈ R^{B×256×27×27} to F_search_cls ∈ R^{B×256×27×27}. Similarly, to obtain features suited to the regression task, the search branch features are input to another single-layer convolutional network whose kernel size, stride, and channel counts are the same as above; the search branch feature changes from F_search ∈ R^{B×256×27×27} to F_search_reg ∈ R^{B×256×27×27}. The subscript cls indicates a feature used by the search branch for the classification task, and the subscript reg a feature used by the search branch for the regression task.
2.5) Obtain the classification confidence map: the classification feature F_temp_cls of the template branch is used as a convolution kernel and convolved (i.e., cross-correlated) with the classification feature F_search_cls of the search branch, producing a feature of size F_cls ∈ R^{B×256×23×23}; a three-layer convolutional network then outputs the final class confidence map M_cls ∈ R^{B×1×19×19}. The first two layers of this network use 3×3 kernels with stride 1, 256 input channels, and 256 output channels; the last layer uses a 1×1 kernel with stride 1, 256 input channels, and 1 output channel, its main role being to fuse information across channels.
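The cross-correlation in step 2.5) amounts to sliding the template feature over the search feature channel by channel. The NumPy sketch below is a readability aid under the assumption that the correlation is depthwise (a production implementation would use a grouped 2-D convolution on tensors); the shapes follow the description: a 5×5 template kernel slid over a 27×27 search feature yields a 23×23 response per channel.

```python
import numpy as np

def depthwise_xcorr(search, kernel):
    """search: (C, Hs, Ws) search-branch feature;
    kernel: (C, Hk, Wk) template-branch feature used as the kernel."""
    C, Hs, Ws = search.shape
    _, Hk, Wk = kernel.shape
    Ho, Wo = Hs - Hk + 1, Ws - Wk + 1          # 27 - 5 + 1 = 23
    out = np.zeros((C, Ho, Wo))
    for c in range(C):                          # per-channel correlation
        for y in range(Ho):
            for x in range(Wo):
                out[c, y, x] = np.sum(search[c, y:y+Hk, x:x+Wk] * kernel[c])
    return out
```

The 256-channel 23×23 output matches F_cls in the description; the subsequent small convolutional head reduces it to the 19×19 single-channel confidence map.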
2.6) Obtain the distance regression map M_reg and the corresponding position uncertainty map M_uncert: the regression feature F_temp_reg of the template branch is used as a convolution kernel and convolved with the regression feature F_search_reg of the search branch, producing a feature of size F_reg ∈ R^{B×256×23×23}. This feature is input to a two-layer convolutional network (3×3 kernels, stride 1, 256 input and output channels) whose output has size F_reg ∈ R^{B×256×19×19}. Finally, it is fed to two parallel single-layer convolutions, each with a 1×1 kernel, stride 1, 256 input channels, and 4 output channels; their outputs, M_reg ∈ R^{B×4×19×19} and M_uncert ∈ R^{B×4×19×19}, represent respectively the distances from the target's center point to the four object boundaries and the confidence values of the predicted distances.
2.7) For offline training, the classification branch uses the Focal Loss proposed with RetinaNet as its loss function, the offset-distance prediction module in the regression (state prediction) branch uses the DIoU loss, and the position confidence estimation module in the regression branch uses a negative log-likelihood (NLL) loss. The experiments use an SGD optimizer with batch size 16, 20 training epochs in total, and an initial learning rate of 0.0001 that is divided by 10 after epoch 15 (decay rate 0.1); training runs on 8 RTX 2080 Ti GPUs, and the whole network's parameters are updated through backpropagation. Steps 2.1) to 2.7) are repeated until the iteration count is reached.
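The description names a negative log-likelihood loss for the uncertainty module without giving its exact form; a common choice, shown here as an assumption, is the Gaussian NLL, where predicting a large variance discounts the localization error at the cost of a log-variance penalty:

```python
import math

def gaussian_nll(mu, sigma, target):
    # NLL of `target` under N(mu, sigma^2). The 0.5*log(2*pi*sigma^2) term
    # penalizes claiming high uncertainty everywhere; the scaled squared-error
    # term penalizes confident but wrong distance predictions. Together they
    # teach the network to report calibrated confidence for each boundary.
    return 0.5 * math.log(2 * math.pi * sigma ** 2) \
        + (target - mu) ** 2 / (2 * sigma ** 2)
```

Under this loss, a boundary distance the network cannot localize well is best handled by inflating its predicted uncertainty, which is exactly the signal the later voting stage consumes.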
3) The offline training stage of the meta classifier must follow the training of the body network. The meta classifier is a network comprising several convolutional layers; it shares with the body network the feature F_cls ∈ R^{B×256×23×23} obtained by convolving the template branch classification feature with the search branch classification feature, taking it as input. Its output is still a class confidence map, of the same size as the one output in step 2.5), denoted M′_cls ∈ R^{B×1×19×19}. It is trained with the MAML algorithm (Model-Agnostic Meta-Learning), which consists mainly of inner-level and outer-level optimization. Specifically:

Given a video sequence V_i used for training, first collect a training sample set S_i, also known in the meta-learning field as the Support Set. Define the classifier as f(x; θ_0), where x is the input picture and θ_0 are the network initialization parameters. The network is updated on the training set with a k-step stochastic gradient descent algorithm:

θ_k ← θ_{k−1} − α · ∇_{θ_{k−1}} Σ_{(x,y)∈S_i} ℓ(f(x; θ_{k−1}), y),  k = 1, …, K,

where α is the step size of the inner update, ℓ is the loss function, and (x, y) is a sample pair in the training set. In the MAML algorithm, the equation above is called inner-level optimization.
To evaluate the generalization performance of the classifier, a second sample set T_i, known in the meta-learning field as the Target Set, is collected from the same video sequence V_i, and the loss of the inner-optimized model f(x; θ_K) is computed on D_i = S_i ∪ T_i:

L_i(θ_0) = Σ_{(x,y)∈D_i} ℓ(f(x; θ_K), y),

where D_i denotes the union of the Support Set and the Target Set. The training goal of the whole network is to find initialization parameters θ_0 that satisfy all video sequences as well as possible, which can be expressed as

θ_0* = argmin_{θ_0} Σ_i L_i(θ_0).

This formula is called outer-level optimization and is updated using the Adam algorithm.
The body network is first trained according to step 2), and the meta classifier is then trained offline on top of it. During offline training, 8 video sequences are randomly picked per batch (i.e., batch size = 8), with 600 iterations per epoch and 100 epochs in total. The inner level is optimized with 5 gradient updates of stochastic gradient descent at learning rate 0.01; the outer level is optimized with the Adam algorithm at learning rate 0.001.
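The inner/outer structure above can be illustrated on a one-parameter least-squares model, where the gradients can be written by hand. The data, learning rates, and scalar model are illustrative assumptions; real training backpropagates the outer loss through the 5 inner SGD steps with autograd:

```python
# Toy MAML-style sketch: a linear model y = theta * x with squared error.

def inner_update(theta0, support, lr=0.01, steps=5):
    # k-step SGD on the Support Set: theta_k = theta_{k-1} - lr * dL/dtheta,
    # mirroring the inner-level optimization (5 steps at lr 0.01 per the text).
    theta = theta0
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in support) / len(support)
        theta -= lr * grad
    return theta

def outer_loss(theta0, support, target):
    # Outer objective: loss of the inner-adapted model on Support ∪ Target.
    theta_k = inner_update(theta0, support)
    data = support + target
    return sum((theta_k * x - y) ** 2 for x, y in data) / len(data)
```

Minimizing `outer_loss` over `theta0` (with Adam, in the patent's setting) yields an initialization from which a few inner steps already fit a new sequence well.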
4) In the online tracking stage, the meta classifier must be updated online. Specifically, given a video sequence and the annotation of its first frame, the algorithm first uses the first-frame picture and its label as a positive training sample to fine-tune the meta classifier, so that it can classify the current target in the subsequent process. Because the first frame provides only one sample, the training sample set is extended by data augmentation to form the Support Set, giving the meta classifier stronger generalization ability.
In the subsequent tracking process, the method continuously collects previous tracking results for later update operations, uses the position voting mechanism in the candidate box screening stage to improve tracking accuracy, and from every 10 tracked frames selects the frame with the highest classification score, together with its tracked target box as label, to add to the online training data set for updating the meta classifier. Since tracking results are not as reliable as the first-frame annotation and may be inaccurate or even wrong, a frame is admitted to the Support Set only when its prediction box's position confidence exceeds a threshold θ_loc and its class confidence also exceeds a threshold θ_cls. In the experimental implementation, at most 15 samples are cached in the Support Set; considering the time consumed by online updating, an update is performed every 10 frames, with only one gradient-descent step per update, to save time.
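The Support Set maintenance rule can be sketched as follows; the threshold values are illustrative assumptions (the patent names θ_loc and θ_cls without fixing them), while the 15-sample cap and 10-frame cadence follow the text:

```python
MAX_SUPPORT = 15  # at most 15 samples cached, per the description

def maybe_add_support(support, frame_id, loc_conf, cls_conf,
                      theta_loc=0.8, theta_cls=0.7):
    # Admit a tracked frame only if both its position confidence and its
    # class confidence clear their thresholds; evict the oldest sample
    # once the buffer is full, since pseudo-labels go stale.
    if loc_conf > theta_loc and cls_conf > theta_cls:
        support.append(frame_id)
        if len(support) > MAX_SUPPORT:
            support.pop(0)
    return support

def should_update(frame_idx, interval=10):
    # Refresh the meta classifier every `interval` frames (one gradient step).
    return frame_idx % interval == 0 and frame_idx > 0
```

Gating on both confidences keeps unreliable tracking results out of the online training set, which is what protects the one-step updates from drift.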
When tracking starts, the target box region in the first frame of the video to be tracked is cropped out and scaled to 127×127 as the input of the template branch. For each subsequent frame, the size of the current search range is computed from the previous frame's prediction, and a search region of the corresponding size is cropped out, scaled to 255×255, and input into the other branch of the twin network. Once the body network outputs the class confidence map M_cls, the offset-distance regression map M_reg, and the position uncertainty map M_uncert, and the meta classifier outputs its class confidence map M′_cls, the two class confidence maps are first fused by weighted summation to obtain the final class response map:

M_cls ← α·M_cls + (1−α)·M′_cls,

where α is a weighting factor; α = 0.6 in the experiments.
Find the highest-scoring point on the category response map and select the N+1 prediction frames corresponding to that position and its N neighbouring positions as the candidate set

B = { (l_i, σ_l^i, t_i, σ_t^i, r_i, σ_r^i, b_i, σ_b^i) | i = 0, 1, ..., N },

where each element contains the four boundary offsets (left, upper, right, lower) of a prediction frame together with their corresponding confidences: l_i is the left-boundary value of prediction frame i and σ_l^i its corresponding uncertainty; t_i is the upper-boundary value and σ_t^i its uncertainty; r_i is the right-boundary value and σ_r^i its uncertainty; b_i is the lower-boundary value and σ_b^i its uncertainty; i is the predicted target frame sequence number. The candidate set is then divided into four subsets according to the four boundaries:

B_l = { (l_i, σ_l^i) },  B_t = { (t_i, σ_t^i) },  B_r = { (r_i, σ_r^i) },  B_b = { (b_i, σ_b^i) },

and from each subset the K items with the highest confidence are selected in turn to form new subsets B_l^K, B_t^K, B_r^K and B_b^K.
The final prediction frame is denoted B_pred = { l_pred, t_pred, r_pred, b_pred }, where l_pred, the left-boundary value of the final prediction frame, is obtained by confidence-weighted voting over B_l^K (the other three boundaries are computed analogously):

l_pred = Σ_{i ∈ B_l^K} w_i · l_i,

where the vote weight w_i of each selected candidate is derived from its predicted confidence, normalised over the K selected items.
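The per-boundary top-K screening and confidence-weighted voting described above can be sketched as follows; normalising the raw confidences into vote weights is an assumption, since the exact weighting formula appears only in the patent's figure:

```python
import numpy as np

def vote_boundaries(offsets, conf, k=3):
    """Per-boundary top-K confidence-weighted voting.

    offsets: (M, 4) array of (l, t, r, b) distances for the M candidates
    conf:    (M, 4) per-boundary confidences (higher = more certain)
    Returns the voted (l, t, r, b) prediction.
    """
    pred = np.empty(4)
    for j in range(4):                             # l, t, r, b voted independently
        top = np.argsort(conf[:, j])[-k:]          # K most confident candidates
        w = conf[top, j] / conf[top, j].sum()      # normalised vote weights (assumption)
        pred[j] = (w * offsets[top, j]).sum()      # weighted-average boundary
    return pred
```

When all candidates agree on a boundary value, the vote returns that value regardless of the confidence distribution.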
On the test data sets, the tracking speed is 30 fps. In terms of tracking accuracy, AUC reaches 70.1% and Pre reaches 91.5% on the OTB100 data set; on the VOT2018 data set, EAO reaches 0.474, robustness reaches 0.164 and accuracy reaches 0.609; Suc reaches 56.2% on the LaSOT data set; on the GOT-10k data set, SR_0.5 is 0.723, SR_0.75 is 0.530 and AR is 0.614.

Claims (4)

1. A single-target tracking method based on position uncertainty estimation, characterized in that the network parameters of a target tracking network are trained offline; during tracking, some video frames together with their prediction results are then selected as online training samples for the classification branch of the target tracking network, and the network parameters of a meta-learning-based classifier are updated to improve tracking robustness; meanwhile, a position voting mechanism is used in the candidate frame screening stage of target tracking to improve tracking accuracy.
2. The method of claim 1, wherein the method comprises generating training samples, subject network offline training, meta classifier offline training, and online tracking:
1) Generating training samples: first, target area enhancement processing is performed on each frame image of each video in the offline training data set; the enhanced target search area is then cut out and scaled to a fixed size. From each cropped video frame sequence, two frames are extracted at a certain interval to generate a positive sample pair, and one frame is randomly extracted from each of two different video sequences to generate a negative sample pair; in each sample pair, one frame serves as the template frame and the other as the search frame. For a positive sample pair, a classification branch label and a regression branch label are generated from the search frame and its target annotation frame; for a negative sample pair, only a classification branch label is generated from the search frame and its target annotation frame;
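A minimal sketch of the positive/negative pair sampling in step 1); the frame representation, sampling interval and pair counts are illustrative assumptions:

```python
import random

def sample_pairs(videos, max_gap=100, n_neg=1):
    """Sample one positive pair (two frames of the same video, within max_gap
    frames, clipped to the video end) and n_neg negative pairs (frames from
    two different videos). Frame contents are opaque here."""
    v = random.choice(videos)
    i = random.randrange(len(v))
    j = min(len(v) - 1, i + random.randrange(1, max_gap + 1))
    pairs = [((v[i], v[j]), 1)]            # label 1: same target
    for _ in range(n_neg):
        va, vb = random.sample(videos, 2)  # two distinct videos
        pairs.append(((random.choice(va), random.choice(vb)), 0))  # label 0
    return pairs
```

In each returned pair, the first frame would serve as the template frame and the second as the search frame.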
2) Subject network offline training, comprising training of the network body part and of the meta classifier. For training of the network body part, the template frame and search frame pictures are first input into the twin network to extract their respective classification and regression feature maps; the classification feature map of the template frame is used as the convolution kernel f_cls of the classification branch and acts on the classification feature map of the search frame, the convolution operation generating a category score confidence map M_cls; the regression feature map of the template frame is used as the convolution kernel f_reg of the regression branch, the convolution operation generating the centre-point-to-target-boundary distance regression map M_reg and the corresponding distance confidence map M_uncert, which respectively represent the distances from the target centre point to the four boundaries of the object and the confidence values of the predicted distances. The highest-scoring point is then found on the category confidence map M_cls, the offset distances corresponding to that point and nearby points are found in M_reg, and voting is performed according to the confidences corresponding to these offset distances to obtain the final predicted target frame;
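The step of locating the highest-scoring point on M_cls and reading out the four boundary distances at that location from M_reg can be sketched as follows; the stride and offset mapping the 19 × 19 grid back to the 255 × 255 crop are assumed values, not taken from the patent:

```python
import numpy as np

def decode_peak_box(m_cls, m_reg, stride=8, offset=31):
    """Decode the (l, t, r, b) distances at the peak of the class confidence
    map into a box in search-crop coordinates (stride/offset are assumptions)."""
    m_cls = m_cls.squeeze()                                   # (19, 19)
    y, x = np.unravel_index(np.argmax(m_cls), m_cls.shape)    # peak location
    cx, cy = offset + x * stride, offset + y * stride         # grid cell centre
    l, t, r, b = m_reg[:, y, x]                               # boundary distances
    return (cx - l, cy - t, cx + r, cy + b)                   # (x1, y1, x2, y2)
```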
during training, the classification branch uses the Focal Loss from RetinaNet as its loss function, the regression branch uses the DIoU loss function, and the uncertainty estimation module uses the negative log-likelihood loss function NPLL; combined with the labels obtained from the search frame, an SGD optimizer is used and the whole network's parameters are updated by the back-propagation algorithm, with positive and negative samples continually drawn at random and the process repeated until the number of iterations is reached;
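One common instantiation of the negative log-likelihood uncertainty loss named above is the Gaussian form with a predicted log-variance; the patent does not spell out its exact form, so the version below is an assumption:

```python
import numpy as np

def uncertainty_nll(pred, target, log_var):
    """Gaussian negative log-likelihood with a predicted log-variance per
    boundary: confident-but-wrong predictions are penalised heavily, while
    raising the predicted variance trades accuracy for a regularising term."""
    err2 = (pred - target) ** 2
    return np.mean(0.5 * np.exp(-log_var) * err2 + 0.5 * log_var)
```

With perfect predictions and unit variance (log_var = 0) the loss is exactly zero, which makes the term easy to sanity-check in training code.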
3) Meta classifier offline training: the input of the meta classifier is the classification feature map of the search frame in the inference phase, and the output is a classification confidence map M'_cls; this map is weighted and summed with the category confidence map M_cls from step 2) to obtain the final category confidence map M_cls ← α·M_cls + (1-α)·M'_cls, where α is a weighting factor. In the training stage, the meta classifier is trained with the MAML algorithm, finding a set of initialization parameters from which the classifier can quickly learn the target's information using a small number of samples and a few gradient updates;
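The MAML-style adaptation in step 3) amounts to a few inner-loop gradient steps from meta-learned initial parameters; a minimal sketch, in which the adaptation loss and its gradient function are placeholders:

```python
import numpy as np

def maml_inner_update(theta, grad_fn, steps=5, lr=0.1):
    """Adapt meta-learned initial parameters `theta` to the current target
    with a few gradient steps; grad_fn(theta) returns the gradient of the
    (placeholder) adaptation loss. The initial parameters are not modified."""
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta
```

For example, with the quadratic loss 0.5·||θ||², whose gradient is θ, each step shrinks θ by the factor (1 - lr).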
4) Online tracking: first, the target frame search area in the first frame image of the video to be tracked is cut out as the template; the template frame is then augmented into an online training data set containing 5 frame images, and the network parameters are updated through 5 gradient descents so that the meta classifier can classify the current tracking target. During tracking, from every 10 frames of the already-tracked frame sequence, the frame with the highest classification score is selected and, together with the target frame obtained by tracking as its label, added to the online training data set for updating the meta classifier.
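The "one frame per 10 tracked frames" sample selection in step 4) can be sketched as:

```python
def select_update_frames(scores, interval=10):
    """For each window of `interval` tracked frames, keep the index of the
    frame with the highest classification score; those frames (with their
    tracked boxes as labels) become online training samples."""
    picks = []
    for start in range(0, len(scores), interval):
        window = scores[start:start + interval]
        picks.append(start + max(range(len(window)), key=window.__getitem__))
    return picks
```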
3. The single-target tracking method based on the position uncertainty estimation as claimed in claim 2, wherein the network body part training specifically comprises:
2.1) Extract template branch features: for a template frame Z_i ∈ R^{B×3×127×127}, feature extraction yields the template branch feature F_temp ∈ R^{B×256×5×5};
2.2) Template branch feature adjustment: the template branch feature obtained in step 2.1) is input into two networks each containing a single convolutional layer, F_temp ∈ R^{B×256×5×5} becoming F_temp,cls and F_temp,reg respectively; the subscript cls indicates a feature used by the template branch for the classification task, and the subscript reg indicates a feature used by the template branch for the regression task;
2.3) Search frame feature extraction: a search frame of size X_i ∈ R^{B×3×255×255} passes through the backbone network, giving the search branch feature F_search ∈ R^{B×256×27×27};
2.4) Search frame feature adjustment: the search branch feature obtained in step 2.3) is input into two networks each containing a single convolutional layer, F_search ∈ R^{B×256×27×27} becoming F_search,cls and F_search,reg respectively; the subscript cls indicates a feature used by the search branch for the classification task, and the subscript reg indicates a feature used by the search branch for the regression task;
2.5) Obtain a classification confidence map: the classification feature F_temp,cls of the template branch is used as a convolution kernel and convolved with the classification feature F_search,cls of the search branch to obtain F_cls ∈ R^{B×256×23×23}, which is then passed through a three-layer convolutional network to output the final category confidence map M_cls ∈ R^{B×1×19×19};
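The "template feature as convolution kernel" operation in step 2.5) is a plain cross-correlation; a naive sketch (no padding, stride 1) reproduces the stated 27 → 23 size reduction for a 5 × 5 template feature:

```python
import numpy as np

def xcorr(search, kernel):
    """Cross-correlate the template feature (kernel) over the search feature,
    summing over channels: search (C, Hs, Ws), kernel (C, Hk, Wk)
    -> response (Hs - Hk + 1, Ws - Wk + 1)."""
    C, Hs, Ws = search.shape
    _, Hk, Wk = kernel.shape
    out = np.empty((Hs - Hk + 1, Ws - Wk + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(search[:, y:y + Hk, x:x + Wk] * kernel)
    return out
```

In a real implementation this would be a batched `conv2d` with the template feature as the weight tensor; the loops here are for clarity only.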
2.6) Obtain the distance regression map M_reg and the corresponding position uncertainty map M_uncert: the regression feature F_temp,reg of the template branch is used as a convolution kernel and convolved with the regression feature F_search,reg of the search branch to obtain a feature of size F_reg ∈ R^{B×256×23×23}; this feature is input into a two-layer convolutional network, convolution yielding the feature F_reg ∈ R^{B×256×19×19}, which is finally input into two parallel single convolutional layers whose outputs are M_reg ∈ R^{B×4×19×19} and M_uncert ∈ R^{B×4×19×19}, respectively representing the distances from the target centre point to the four boundaries of the object and the confidence values of the predicted distances;
2.7) For offline training, the classification branch uses the Focal Loss proposed with RetinaNet as its loss function, the offset-distance prediction module in the regression branch uses the DIoU loss, and the position confidence estimation module in the regression branch uses the negative log-likelihood loss function (NPLL). An SGD optimizer is used, with BatchSize set to 16, 20 training rounds in total, an initial learning rate of 0.0001 divided by 10 after 15 rounds, and a decay rate of 0.1; the whole network's parameters are updated by the back-propagation algorithm, and steps 2.1) to 2.7) are repeated until the number of iterations is reached.
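The learning-rate schedule stated in step 2.7) can be written as a simple step schedule:

```python
def lr_at_epoch(epoch, base_lr=1e-4, decay=0.1, milestone=15):
    """Step schedule from the claim: base learning rate 1e-4, multiplied by
    the decay factor 0.1 from round 15 onward (20 rounds in total)."""
    return base_lr * (decay if epoch >= milestone else 1.0)
```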
4. The single-target tracking method based on position uncertainty estimation according to claim 1 or 2, characterized in that the target tracking network adopts a fully convolutional twin network to generate target classification and target regression templates to guide the classification and regression tasks, and updates the classification and regression templates' strategies online to realize the target tracking task.
CN202110566900.2A 2021-05-24 2021-05-24 Single-target tracking method based on position uncertainty estimation Pending CN115393388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110566900.2A CN115393388A (en) 2021-05-24 2021-05-24 Single-target tracking method based on position uncertainty estimation


Publications (1)

Publication Number Publication Date
CN115393388A true CN115393388A (en) 2022-11-25

Family

ID=84114183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110566900.2A Pending CN115393388A (en) 2021-05-24 2021-05-24 Single-target tracking method based on position uncertainty estimation

Country Status (1)

Country Link
CN (1) CN115393388A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220326768A1 (en) * 2021-04-09 2022-10-13 Honda Motor Co., Ltd. Information processing apparatus, information processing method, learning method, and storage medium
US12013980B2 (en) * 2021-04-09 2024-06-18 Honda Motor Co., Ltd. Information processing apparatus, information processing method, learning method, and storage medium

Similar Documents

Publication Publication Date Title
CN110443818B (en) Graffiti-based weak supervision semantic segmentation method and system
CN108985334B (en) General object detection system and method for improving active learning based on self-supervision process
CN108647577B (en) Self-adaptive pedestrian re-identification method and system for difficult excavation
CN111144364B (en) Twin network target tracking method based on channel attention updating mechanism
CN113326731B (en) Cross-domain pedestrian re-identification method based on momentum network guidance
CN110135502B (en) Image fine-grained identification method based on reinforcement learning strategy
US11816149B2 (en) Electronic device and control method thereof
CN108875610B (en) Method for positioning action time axis in video based on boundary search
CN113628244B (en) Target tracking method, system, terminal and medium based on label-free video training
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
Li et al. Robust deep neural networks for road extraction from remote sensing images
CN112651998A (en) Human body tracking algorithm based on attention mechanism and double-current multi-domain convolutional neural network
CN112287906B (en) Template matching tracking method and system based on depth feature fusion
CN116229112A (en) Twin network target tracking method based on multiple attentives
CN115424177A (en) Twin network target tracking method based on incremental learning
CN113283467B (en) Weak supervision picture classification method based on average loss and category-by-category selection
CN115393388A (en) Single-target tracking method based on position uncertainty estimation
Chun et al. USD: Uncertainty-based One-phase Learning to Enhance Pseudo-Label Reliability for Semi-Supervised Object Detection
CN116561562B (en) Sound source depth optimization acquisition method based on waveguide singular points
CN116958057A (en) Strategy-guided visual loop detection method
CN110889418A (en) Gas contour identification method
Zhu et al. Find gold in sand: Fine-grained similarity mining for domain-adaptive crowd counting
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device
CN114220086A (en) Cost-efficient scene character detection method and system
CN114596338A (en) Twin network target tracking method considering time sequence relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination