CN114550040A - End-to-end single target tracking method and device based on mixed attention mechanism - Google Patents

End-to-end single target tracking method and device based on mixed attention mechanism

Info

Publication number
CN114550040A
Authority
CN
China
Prior art keywords
frame
tracking
target
template
test
Prior art date
Legal status
Pending
Application number
CN202210152336.4A
Other languages
Chinese (zh)
Inventor
王利民 (Limin Wang)
崔玉涛 (Yutao Cui)
蒋承 (Cheng Jiang)
武港山 (Gangshan Wu)
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210152336.4A
Publication of CN114550040A
Legal status: Pending

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 18/00 Pattern recognition › G06F 18/20 Analysing › G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 18/00 Pattern recognition › G06F 18/20 Analysing › G06F 18/24 Classification techniques › G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology › G06N 3/045 Combinations of networks
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

An end-to-end single target tracking method based on a mixed attention mechanism constructs a Transformer-based tracking framework, MixFormer, for target tracking. Construction of the framework comprises the following stages: 1) a data preparation stage; 2) a network configuration stage; 3) an offline training stage; and 4) an online tracking stage. The invention adopts a backbone network based on mixed attention to extract features and fuse target information simultaneously, yielding a simple, clear tracking framework and effectively improving performance. In addition, the tracking method adapts well to object deformation during tracking and effectively improves the accuracy of target regression.

Description

End-to-end single target tracking method and device based on mixed attention mechanism
Technical Field
The invention belongs to the technical field of computer software, relates to single-target tracking technology, and particularly to an end-to-end single-target tracking method and device based on a mixed attention mechanism.
Background
As a basic task in computer vision, visual object tracking aims to estimate, for an arbitrary object in a video, its spatial position in each frame and mark out its bounding box. Although target tracking has made significant progress, designing a simple and efficient end-to-end tracker remains a challenge. The main difficulties arise from scale changes, object deformation, occlusion, and confusion with similar objects.
Currently popular trackers typically contain three components: (1) a CNN backbone network for extracting general features of the target to be tracked and of the search area; (2) a fusion module that exchanges information between the tracking target and the search area for subsequent target-aware localization; and (3) a bounding-box module that produces the final tracking result. The fusion module is the most critical part of a tracking algorithm: it integrates information about the target into the features of the search area so that a box can be predicted for that specific target. Traditional fusion methods include correlation-based operations and online model-update algorithms. Recently, thanks to the global and dynamic modeling capability of the Transformer, it has been introduced into tracking to perform information interaction between the tracking target and the search area, yielding good tracking performance. These methods mainly use a Transformer to fuse target features with search-area features and then make predictions on the fused features. However, such Transformer-based trackers still rely on a convolutional backbone for feature extraction and apply attention only in a relatively high-level, abstract representation space. The representation capability of a convolutional backbone is limited: first, it is usually pre-trained on generic object-recognition tasks; second, it may ignore fine-grained structural information useful for tracking.
Disclosure of Invention
The invention aims to solve the following problem: how to design a simple end-to-end target tracking framework that does not depend on a convolutional network for feature extraction, so that the feature extraction and information fusion modules can be unified.
The technical scheme of the invention is as follows: an end-to-end single target tracking method based on a mixed attention mechanism constructs a tracking framework, MixFormer, for target tracking. MixFormer is an end-to-end trained Transformer tracking network comprising a backbone network and a tracking head, and its construction comprises the following stages:
1) In the data preparation stage, target search areas are cropped from all video frames in the training data set; two frames are sampled from the first half of each video's frame sequence as template frames and one frame from the second half as the test frame; a target box is labeled on the test frame, making it a verification frame, and the diagonal-corner coordinates of the target box in each verification frame serve as the ground-truth labels for offline training;
2) In the network configuration stage, the backbone network is a feature extractor based on mixed attention modules, unifying feature extraction and information fusion in a Transformer structure, and the tracking head is a regression head implemented as a convolutional network; the template frame and the test frame are input into the backbone network simultaneously to generate test-frame features fused with template information, and the regression head then maps these features to the diagonal-corner coordinates of the target, which form the final target box for the test frame;
based on the mixed attention mechanism, the backbone network performs self-attention and mutual-attention operations on the features of the template frame and the test frame: self-attention extracts the individual features of the template frame and the test frame, while mutual attention exchanges feature information between the target template and the test frame, yielding the test-frame features fused with template information;
3) In the offline training stage, the target box from the regression head is supervised with an L1 loss function and a GIoU loss function against the ground-truth labels from the verification frames; an AdamW optimizer updates the whole network's parameters through a back-propagation algorithm, and the configured network is trained until the iteration count is reached, yielding the tracking framework MixFormer;
4) In the online tracking stage, the target search area is marked on the first frame of the video to be tracked as the template frame, each subsequent frame serves as a test frame and is input into the trained tracking framework MixFormer, and the target box output on the test frame realizes target tracking.
Furthermore, the mutual-attention operation of the backbone network is one-way: mutual attention is performed only from the template frame to the test frame, not from the test frame to the template frame, yielding the test-frame features fused with template information.
Furthermore, the tracking head also comprises a classification head that outputs a classification confidence for the test frame. The classification head holds a preset learnable confidence vector that performs attention operations with the test-frame features and the template-frame features respectively, perceiving both to predict the classification confidence of the current test frame. During tracking, frames whose confidence satisfies a condition are selected from the already-tracked video frame sequence and supplemented as template frames.
During online tracking, the target search area in the first frame image of the video to be tracked is first cropped out as the template frame F_train, and each frame to be tracked is taken as the test frame F_test; the tracking framework MixFormer then yields the target box on F_test. During tracking, every N frames, the frame with the highest confidence in the already-tracked frame sequence, together with its tracked target box as the label, is supplemented as a template frame F_train.
The invention constructs a clean and effective tracking framework comprising only a backbone network and a tracking head, with feature extraction and information fusion performed simultaneously. This coupled paradigm has the following advantages. First, it makes feature extraction more adaptive to the specific tracked target and captures more discriminative, target-related features. Second, it allows target information to be fused at multiple scales, better capturing the correlation between the target and the search area.
Based on this tracking method, the invention further provides an end-to-end single-target tracking device based on the mixed attention mechanism, comprising a computer storage medium in which a computer program implementing the MixFormer tracking framework is configured; when executed, the computer program realizes the tracking method.
Compared with the prior art, the invention has the following advantages.
The invention provides an end-to-end single target tracking method based on a mixed attention mechanism, constructing a Transformer-based tracking framework, MixFormer. A specially designed Transformer backbone, a feature extractor built from the mixed attention module MAM, extracts features and fuses target information simultaneously. As shown in Fig. 2, the concatenated vector of the target template and the test frame is first split and each part reshaped into a 2D token map; after a multi-head attention function, the two resulting outputs are concatenated and passed through a linear layer to obtain the test-frame features fused with template information. Finally, as shown in Fig. 1, the tracking target box is obtained through two simple heads, a regression head and a classification head, and the tracking templates are further updated from the online tracking results, yielding a simple, clear tracking framework that effectively improves tracking accuracy.
The invention designs an online-updatable template sample space: during tracking, a confidence prediction module screens template samples better suited to the current tracking state, improving the robustness of the model. Compared with existing tracking methods, the online tracking of the invention adapts better to object deformation during tracking, and target regression precision is effectively improved.
The invention achieves good accuracy on the visual object tracking task and improves object regression precision. Compared with existing methods, the proposed MixFormer tracker achieves the best tracking success rate and localization accuracy on multiple visual tracking benchmarks (LaSOT, TrackingNet, GOT-10k, VOT2020 and UAV123).
Drawings
FIG. 1 is a schematic diagram of a MixFormer tracking framework according to the present invention.
Fig. 2 is a schematic diagram of the mixed attention module MAM of the backbone network according to the present invention.
FIG. 3 is a schematic diagram of the confidence prediction module SPM of the classification head of the present invention.
Detailed Description
The invention provides the tracking framework MixFormer, which aims to unify the feature extraction module and the information fusion module within a Transformer structure. The attention module is a very flexible architectural building block with dynamic, global modeling capability; it makes few assumptions about data structure and applies generically to different types of relational modeling. The core idea of the invention is to exploit this flexibility of attention operations to propose the mixed attention module MAM. As shown in Fig. 2, the MAM first splits the concatenated vector of the target template frame and the test frame and reshapes each part into a 2D token map; after a multi-head attention function, the two resulting outputs are concatenated and passed through a linear layer. A MAM-based feature extractor is obtained by repeating the MAM module, with multiple serial MAM modules deepening the network. The module performs feature extraction and information interaction between the target template and the search area simultaneously. Within the MAM, the invention designs a mixed interaction scheme that performs self-attention and mutual-attention operations on the features of the target template and the search area, i.e., the template frame and the test frame. Self-attention extracts the individual features of the target template or the search area, while mutual attention provides communication between them to blend target and search-area information. In addition, to reduce the computational cost of the MAM and to allow multiple templates for handling online target deformation, a specific asymmetric attention mechanism is further proposed: during the MAM's mutual-attention step, only one-way mutual attention from the template frame to the test frame is performed, not from the test frame to the template frame.
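For concreteness, a minimal PyTorch sketch of this asymmetric mixed attention step is given below. It illustrates the scheme described above, not the patented implementation: the class name and head count are assumptions, and only the token-mixing rule (template tokens attend to template tokens only; test-frame tokens attend to all tokens) is taken from the text.

```python
import torch
import torch.nn as nn

class AsymmetricMixedAttention(nn.Module):
    """Sketch of one MAM mixing step: template tokens use self-attention
    over template tokens only (no test-frame -> template mutual attention),
    while test-frame tokens attend to the concatenation of template and
    test-frame tokens (self-attention plus one-way mutual attention)."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.h = num_heads
        self.d = dim // num_heads
        self.scale = self.d ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def _attn(self, q, k, v):
        # q: (B, Nq, dim), k/v: (B, Nk, dim) -> (B, Nq, dim)
        B, Nq, _ = q.shape
        Nk = k.shape[1]
        q = q.view(B, Nq, self.h, self.d).transpose(1, 2)
        k = k.view(B, Nk, self.h, self.d).transpose(1, 2)
        v = v.view(B, Nk, self.h, self.d).transpose(1, 2)
        a = (q @ k.transpose(-2, -1)) * self.scale
        out = a.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(B, Nq, self.h * self.d)

    def forward(self, tokens, n_template):
        # tokens: (B, N_template + N_test, dim), the concatenated vector
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        qt, qs = q[:, :n_template], q[:, n_template:]
        kt, vt = k[:, :n_template], v[:, :n_template]
        # template tokens: self-attention restricted to template tokens
        out_t = self._attn(qt, kt, vt)
        # test-frame tokens: attend to ALL tokens (self + template -> test)
        out_s = self._attn(qs, k, v)
        return self.proj(torch.cat([out_t, out_s], dim=1))
```

One consequence of this one-way scheme is that the template features stay independent of the current test frame, which is what allows multiple cached templates to be reused across frames at reduced cost, as the text notes.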
The invention provides an end-to-end single target tracking method based on a mixed attention mechanism, implemented through the tracking network described above. It is trained offline on four training data sets (TrackingNet-Train, LaSOT-Train, COCO-Train and GOT-10k-Train) and achieves high accuracy and tracking success rate on five test sets (UAV123, VOT2020, LaSOT-Test, TrackingNet-Test and GOT-10k). The method is implemented in the Python 3.7 programming language with the PyTorch 1.7 deep learning framework.
Fig. 1 is a schematic diagram of the tracking framework of the method. The target box of the object to be tracked is obtained directly through the designed end-to-end trained Transformer tracking network, realizing the target tracking task. The whole method comprises a data preparation stage, a network configuration stage, an offline training stage and an online tracking stage; the specific implementation steps are as follows:
1) Data preparation stage, i.e., training-sample generation. For offline training, training samples are generated as follows: first, target-area jitter is applied to each frame image of each video in the offline training data set, then the jittered target search area is cropped out; three frames are sampled from the first half of each video's frame sequence as template frames and one frame from the second half as the test frame; a target box is labeled on the test frame, making it a verification frame, and the coordinates of the top-left and bottom-right corners of the target box in each verification frame serve as the ground-truth labels for offline training.
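As an illustration, the sampling rule of this stage can be sketched as follows (the function name is hypothetical, jitter and cropping are left to the surrounding pipeline, and a video is assumed to be a list of frames with at least six entries):

```python
import random

def sample_training_frames(video_frames):
    """Hypothetical sampler for the rule above: three template frames from
    the first half of the sequence, one test frame from the second half
    (the test frame becomes the verification frame once labeled)."""
    n = len(video_frames)
    template_idx = random.sample(range(n // 2), 3)
    test_idx = random.randrange(n // 2, n)
    return [video_frames[i] for i in template_idx], video_frames[test_idx]
```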
2) Network configuration stage. The overall structure and flow of the network are very simple compared with other trackers and are divided into three parts; the whole framework and flow are shown in Fig. 1, and the specific operations are as follows.
2.1) Extracting test-frame features conditioned on the tracking templates: given T template frames and a test frame, the T template frames are concatenated into the template input; the input sizes of the template and the test frame are T × 128 × 128 × 3 and 320 × 320 × 3 respectively. First, a convolutional layer with kernel size 7 and stride 4 generates overlapping block vectors. The resulting block vectors are then flattened and concatenated to produce a concatenated vector F_token, which is input into the mixed attention module MAM, generating a mixed vector that fuses tracking-target information and test-frame information. The specific structure of the MAM is shown in Fig. 2. The MAM first splits the concatenated vector F_token and applies a Reshape operation, performs self-attention to obtain the individual features of the template frame and the test frame, obtains their respective query, key and value through a standard multi-head attention function, and performs the attention operations in parallel as shown in Fig. 2; finally the two outputs are concatenated, interaction is realized through a linear layer, and the result is added to the initial vector F_token to obtain a first mixed vector. This operation is repeated M times, each time splitting and reshaping the mixed vector and performing self-attention and mutual attention, deepening the network and producing the final mixed features. The features corresponding to the test frame are then split out of the mixed features and reshaped to obtain the test-frame features F_test fused with template information, of size 20 × 20 × 384.
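A hedged sketch of this pipeline, reusing the AsymmetricMixedAttention sketch above, is shown below. The real extractor reaches the 20 × 20 × 384 test-frame features through several downsampling stages; this single-stage version keeps only the elements the text fixes (7×7 stride-4 patch embedding, concatenated vector F_token, M serial MAM blocks with residual addition, splitting out the test-frame tokens), so its output resolution differs from the full network's.

```python
import torch
import torch.nn as nn

class MixedAttentionBackbone(nn.Module):
    """Single-stage sketch of step 2.1). One patch-embedding convolution
    and M serial MAM blocks stand in for the full multi-stage extractor."""

    def __init__(self, dim=384, depth=4):
        super().__init__()
        # overlapping patch embedding: kernel 7, stride 4, as in the text
        self.embed = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        self.blocks = nn.ModuleList(
            AsymmetricMixedAttention(dim) for _ in range(depth))

    def forward(self, template, test):
        # template: (B, 3, 128, 128); test: (B, 3, 320, 320)
        t = self.embed(template).flatten(2).transpose(1, 2)  # (B, Nt, dim)
        s = self.embed(test).flatten(2).transpose(1, 2)      # (B, Ns, dim)
        n_t = t.shape[1]
        tokens = torch.cat([t, s], dim=1)  # the concatenated vector F_token
        for blk in self.blocks:
            # mix, then add back to the input vector, as described above
            tokens = tokens + blk(tokens, n_t)
        f_test = tokens[:, n_t:]  # keep only the test-frame tokens
        b, n_s, d = f_test.shape
        hw = int(n_s ** 0.5)
        return f_test.transpose(1, 2).reshape(b, d, hw, hw)
```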
2.2) Obtaining the target's tracking box on the test frame: the tracking regression head is a convolutional network of 5 simple convolutional layers acting on the F_test obtained in step 2.1), with 384 input channels and 2 output channels, producing heatmaps of the target's top-left and bottom-right corners, each of size 20 × 20 × 1. The coordinates of the top-left and bottom-right corners are finally obtained by taking the maximum of each heatmap, yielding the target box, i.e., the tracking box of the target.
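A sketch of such a corner-regression head follows. The text fixes five convolutional layers, 384 input channels, 2 output channels and argmax decoding; the intermediate channel widths are assumptions.

```python
import torch
import torch.nn as nn

class CornerHead(nn.Module):
    """Five-layer convolutional head: 384-channel test-frame features in,
    a 2-channel 20x20 map out (top-left and bottom-right corner heatmaps),
    decoded by taking the per-heatmap maximum."""

    def __init__(self, dim=384):
        super().__init__()
        chans = [dim, 256, 128, 64, 32]  # assumed widths
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(chans[-1], 2, 1))  # 5th conv: two heatmaps
        self.net = nn.Sequential(*layers)

    def forward(self, f_test):
        # f_test: (B, 384, 20, 20) -> heatmaps: (B, 2, 20, 20)
        heat = self.net(f_test)
        b, _, h, w = heat.shape
        idx = heat.flatten(2).argmax(dim=-1)               # (B, 2) peak index
        ys = torch.div(idx, w, rounding_mode='floor')
        xs = idx % w
        # (x_tl, y_tl, x_br, y_br) on the 20x20 grid; scale to pixels outside
        return torch.stack([xs[:, 0], ys[:, 0], xs[:, 1], ys[:, 1]], dim=1)
```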
2.3) Obtaining the classification confidence of the test sample: for online tracking, the tracking head of the framework is also provided with a classification head. The F_test obtained in step 2.1) is passed through a classification confidence prediction module, the Score Prediction Module (SPM), which yields the classification confidence of each test frame, i.e., whether the test frame is a positive sample. The structure of the SPM is shown in Fig. 3: a preset learnable confidence vector performs attention operations with the test-frame features and the target-template-frame features respectively, perceiving both to predict the final classification confidence, which is used in the online tracking stage to select higher-quality online samples as template frames for tracking.
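A possible reading of the SPM in PyTorch is sketched below. The two-step attention layout (confidence token over test-frame features, then over template features) follows the description of Fig. 3; the layer sizes, MLP and sigmoid output are assumptions.

```python
import torch
import torch.nn as nn

class ScorePredictionModule(nn.Module):
    """A learnable confidence token attends first to the test-frame
    features and then to the template features; an MLP maps the resulting
    token to a scalar classification confidence."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.score_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn_test = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_tmpl = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, f_test, f_template):
        # f_test: (B, N_test, dim), f_template: (B, N_tmpl, dim)
        tok = self.score_token.expand(f_test.shape[0], -1, -1)
        tok, _ = self.attn_test(tok, f_test, f_test)
        tok, _ = self.attn_tmpl(tok, f_template, f_template)
        return torch.sigmoid(self.mlp(tok.squeeze(1)))  # confidence in [0, 1]
```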
The network configuration stage is described in detail below in one embodiment. The MAM-based backbone network is loaded with the parameters of an ImageNet pre-trained model and extracts template-conditioned features from the test frame; the feature map has size 20 × 20 × 384, representing its length, width and number of channels respectively. The feature map is then input into the regression head and the classification head SPM to obtain the final target box and its classification confidence respectively. The target box serves as the tracking result, and the classification confidence is used to select online samples.
3) Offline training stage: the classification branch is trained offline with a cross-entropy loss function, and the regression branch with a GIoU loss function and an L1 loss function, using an AdamW optimizer with a per-GPU batch size of 32; the total number of training epochs is set to 500, with the learning rate divided by 10 at epoch 400. Training runs on 8 Nvidia Tesla V100 GPUs, the whole network's parameters are updated through a back-propagation algorithm, and steps 2.1) to 2.3) are repeated until the iteration count is reached.
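The supervision described here can be sketched as follows. Boxes are assumed to be in (x1, y1, x2, y2) format, and the loss weights are illustrative assumptions rather than values from the text; the GIoU term is written out since the text names it directly.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, gt):
    """Generalized IoU loss for (x1, y1, x2, y2) boxes, used alongside
    the L1 term to supervise the regression branch."""
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / union.clamp(min=1e-6)
    # smallest enclosing box for the GIoU penalty
    cx1, cy1 = torch.min(pred[:, 0], gt[:, 0]), torch.min(pred[:, 1], gt[:, 1])
    cx2, cy2 = torch.max(pred[:, 2], gt[:, 2]), torch.max(pred[:, 3], gt[:, 3])
    hull = ((cx2 - cx1) * (cy2 - cy1)).clamp(min=1e-6)
    return (1 - (iou - (hull - union) / hull)).mean()

def tracking_loss(pred_box, gt_box, pred_score, is_positive,
                  w_l1=5.0, w_giou=2.0):
    # regression: L1 + GIoU; classification: cross entropy on the SPM score
    # (the weights w_l1 and w_giou are assumptions, not values from the text)
    reg = w_l1 * F.l1_loss(pred_box, gt_box) + w_giou * giou_loss(pred_box, gt_box)
    cls = F.binary_cross_entropy(pred_score, is_positive.float())
    return reg + cls
```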
4) Online tracking stage: on the first frame of the video to be tracked, the target search area is marked as the template frame; each subsequent frame serves as a test frame and is input into the trained network, and the tracking target box is obtained from the initial parameters.
As a preferred mode, the target search area in the first frame image of the video to be tracked is first cropped out as a template frame and used as the initial target template, and each frame to be tracked serves as a test frame. The target template and the test frame are input into the network of step 2) to obtain the target box on the test frame. During tracking, every 20 frames, the frame with the highest SPM classification confidence in the already-tracked frame sequence, together with its tracked target box as the label, is added to the online target-template set as a template frame, achieving online target tracking that adapts to video target deformation.
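A sketch of this online loop is given below; crop_search_area and the model interface are assumed helpers, and the every-20-frames update follows the preferred mode above.

```python
def track_video(frames, init_box, model, update_interval=20):
    """Sketch of the preferred online mode: the first frame gives the
    initial template; every `update_interval` frames, the highest-confidence
    frame seen since the last update (with its tracked box as label) joins
    the online template set. `crop_search_area` and `model` are assumed
    interfaces, not part of the patented implementation."""
    templates = [crop_search_area(frames[0], init_box)]
    best_crop, best_score = None, -1.0
    results = []
    for i, frame in enumerate(frames[1:], start=1):
        box, score = model(templates, frame)  # target box + SPM confidence
        results.append(box)
        if score > best_score:
            best_crop, best_score = crop_search_area(frame, box), score
        if i % update_interval == 0 and best_crop is not None:
            templates.append(best_crop)       # supplement the template set
            best_crop, best_score = None, -1.0
    return results
```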
On the test data sets, tracking runs at 22 fps (Nvidia GTX 1080Ti). For tracking precision, AUC reaches 70.5% on the GOT-10k data set and 69.5% on the LaSOT data set; on the TrackingNet data set, AUC reaches 83.6% and precision 82.8%; on the VOT2020 data set, EAO reaches 0.550, Robustness 0.843 and Accuracy 0.760; and on the UAV123 data set, AUC reaches 70.4%. These indicators exceed the best currently available methods.

Claims (7)

1. An end-to-end single target tracking method based on a mixed attention mechanism, characterized in that a tracking framework MixFormer is constructed for target tracking, the tracking framework MixFormer being an end-to-end trained Transformer tracking network comprising a backbone network and a tracking head, and the construction of the tracking framework MixFormer comprising the following stages:
1) In the data preparation stage, target search areas are cropped from all video frames in the training data set; two frames are sampled from the first half of each video's frame sequence as template frames and one frame from the second half as the test frame; a target box is labeled on the test frame, making it a verification frame, and the diagonal-corner coordinates of the target box in each verification frame serve as the ground-truth labels for offline training;
2) In the network configuration stage, the backbone network is a feature extractor based on mixed attention modules, unifying feature extraction and information fusion in a Transformer structure, and the tracking head is a regression head implemented as a convolutional network; the template frame and the test frame are input into the backbone network simultaneously to generate test-frame features fused with template information, and the regression head then maps these features to the diagonal-corner coordinates of the target, which form the final target box for the test frame;
based on the mixed attention mechanism, the backbone network performs self-attention and mutual-attention operations on the features of the template frame and the test frame: self-attention extracts the individual features of the template frame and the test frame, while mutual attention exchanges feature information between the target template and the test frame, yielding the test-frame features fused with template information;
3) In the offline training stage, the target box from the regression head is supervised with an L1 loss function and a GIoU loss function against the ground-truth labels from the verification frames; an AdamW optimizer updates the whole network's parameters through a back-propagation algorithm, and the configured network is trained until the iteration count is reached, yielding the tracking framework MixFormer;
4) In the online tracking stage, the target search area is marked on the first frame of the video to be tracked as the template frame, each subsequent frame serves as a test frame and is input into the trained tracking framework MixFormer, and the target box output on the test frame realizes target tracking.
2. The end-to-end single target tracking method based on a mixed attention mechanism according to claim 1, characterized in that the mutual-attention operation of the backbone network is performed only one-way, from the template frame to the test frame and not from the test frame to the template frame, thereby obtaining the test-frame features fused with template information.
3. The end-to-end single target tracking method based on a mixed attention mechanism according to claim 1, characterized in that the backbone network specifically: generates block vectors for the template frame and the test frame respectively, then flattens and concatenates the block vectors of the template frame and the test frame to obtain a concatenated vector F_token; performs self-attention to obtain the template's own features and the test frame's own features, obtains their respective query, key and value through a standard multi-head attention function, and then performs the mutual-attention operation; concatenates the outputs for the template and the test frame, adds the output of a linear layer to F_token to obtain a first mixed vector, then splits the mixed vector and performs self-attention and mutual attention to obtain a new mixed vector; repeats this M times to obtain the final mixed features, and, after splitting and a Reshape operation, obtains the test-frame features fused with template information.
4. The end-to-end single target tracking method based on a mixed attention mechanism according to claim 1, 2 or 3, characterized in that the tracking head further comprises a classification head for obtaining a classification confidence for the test frame; the classification head has a preset learnable confidence vector that performs attention operations with the test-frame features and the template-frame features respectively, perceiving both to predict the classification confidence of the current test frame; during tracking, frames whose confidence satisfies a condition are selected from the already-tracked video frame sequence and supplemented as template frames.
5. The end-to-end single target tracking method based on a mixed attention mechanism according to claim 4, characterized in that during online tracking, the target search area in the first frame image of the video to be tracked is first cropped out as the template frame F_train and each frame to be tracked is taken as the test frame F_test; the target box on the test frame F_test is obtained through the tracking framework MixFormer; during tracking, every N frames, the frame with the highest confidence in the already-tracked frame sequence, together with its tracked target box as the label, is supplemented as a template frame F_train.
6. The end-to-end single target tracking method based on a mixed attention mechanism according to claim 1, characterized in that in the data preparation stage, target-area jitter is performed on each frame image of each video in the training data set, and the jittered target search area is then cropped out.
7. An end-to-end single target tracking device based on a mixed attention mechanism, characterized by having a computer storage medium in which a computer program is configured, the computer program implementing the tracking framework MixFormer of any one of claims 1-6 and, when executed, realizing the tracking method of any one of claims 1-6.
CN202210152336.4A 2022-02-18 2022-02-18 End-to-end single target tracking method and device based on mixed attention mechanism Pending CN114550040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152336.4A CN114550040A (en) 2022-02-18 2022-02-18 End-to-end single target tracking method and device based on mixed attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210152336.4A CN114550040A (en) 2022-02-18 2022-02-18 End-to-end single target tracking method and device based on mixed attention mechanism

Publications (1)

Publication Number Publication Date
CN114550040A (en) 2022-05-27

Family

ID=81676248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210152336.4A Pending CN114550040A (en) 2022-02-18 2022-02-18 End-to-end single target tracking method and device based on mixed attention mechanism

Country Status (1)

Country Link
CN (1) CN114550040A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998601A (en) * 2022-06-29 2022-09-02 齐鲁工业大学 Online update target tracking method and system based on Transformer
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN117333515A (en) * 2023-12-01 2024-01-02 南昌工程学院 Target tracking method and system based on regional awareness
CN117333515B (en) * 2023-12-01 2024-02-09 南昌工程学院 Target tracking method and system based on regional awareness

Similar Documents

Publication Publication Date Title
CN114550040A (en) End-to-end single target tracking method and device based on mixed attention mechanism
Wang et al. Densefusion: 6d object pose estimation by iterative dense fusion
Han et al. High-resolution shape completion using deep neural networks for global structure and local geometry inference
JP6924517B2 (en) How to recognize faces using multiple patch combinations of deep neural network infrastructure to improve fault tolerance and fracture robustness in extreme situations
Magoulianitis et al. Does deep super-resolution enhance uav detection?
CN111178316B (en) High-resolution remote sensing image land coverage classification method
US10586371B2 (en) Motion retargeting method for character animation and apparatus thererof
US11804036B2 (en) Person re-identification method based on perspective-guided multi-adversarial attention
US10402695B1 (en) Learning method and learning device for convolutional neural network using 1×H convolution for image recognition to be used for hardware optimization, and testing method and testing device using the same
CN113129234B (en) Incomplete image fine restoration method based on intra-field and extra-field feature fusion
EP3686800B1 (en) Learning method and learning device for object detector based on cnn using 1xh convolution to be used for hardware optimization, and testing method and testing device using the same
KR20200063368A (en) Unsupervised stereo matching apparatus and method using confidential correspondence consistency
EP3690718A1 (en) Learning method and learning device for allowing cnn having trained in virtual world to be used in real world by runtime input transformation using photo style transformation, and testing method and testing device using the same
CN116823885A (en) End-to-end single target tracking method based on pyramid pooling attention mechanism
CN115564801A (en) Attention-based single target tracking method
CN116580184A (en) YOLOv 7-based lightweight model
Kirilenko et al. Transpath: Learning heuristics for grid-based pathfinding via transformers
KR102562386B1 (en) Learning method for image synthesis system
CN110728359B (en) Method, device, equipment and storage medium for searching model structure
CN116597267A (en) Image recognition method, device, computer equipment and storage medium
CN114596338B (en) Twin network target tracking method considering time sequence relation
CN111508024A (en) Method for estimating pose of robot based on deep learning
Zheng et al. A hardware-adaptive deep feature matching pipeline for real-time 3D reconstruction
CN117649582B (en) Single-flow single-stage network target tracking method and system based on cascade attention
US20240020854A1 (en) Bilateral attention transformer in motion-appearance neighboring space for video object segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination