CN116168061A - Visual target tracking method APR-Net based on attention pyramid residual network - Google Patents

Visual target tracking method APR-Net based on attention pyramid residual network

Info

Publication number: CN116168061A
Application number: CN202310192160.XA
Authority: CN (China)
Prior art keywords: tracking, attention, target, frame, APR
Legal status: Pending
Other languages: Chinese (zh)
Inventor: Liu Bing (刘冰)
Assignee (current and original): Chongqing University of Posts and Telecommunications

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a visual target tracking method, APR-Net, based on an attention pyramid residual network, inspired by the pyramidal convolutional residual network (PyConvResNet). The proposed APR-Net tracking method can automatically search for multi-scale features in the feature map. In addition, the invention introduces an attention mechanism into PyConvResNet to further improve tracking accuracy and stability. Experimental results on benchmark datasets show that the proposed tracking method achieves competitive tracking accuracy and stability in various challenging environments while meeting real-time tracking speed. The proposed tracking method runs on a GPU at approximately 30 frames/s.

Description

Visual target tracking method APR-Net based on attention pyramid residual network
Technical Field
The invention belongs to the field of target tracking technology, and in particular relates to a visual target tracking method APR-Net based on an attention pyramid residual network.
Background
Visual target tracking is an essential research topic in computer vision. Because of its wide range of applications, it has received extensive attention in fields such as autonomous driving, video surveillance, motion analysis, and medical diagnosis [1-3]. Despite the considerable progress made by visual target tracking methods over the last few years, there is still much more room for improvement than in other computer vision tasks [3-8]. This is because, in most tracking tasks, the target is known only in the first frame of the video sequence and is unknown in subsequent frames. That is, in the visual target tracking task, the position, state, and other attributes of the target object are not known in advance in any frame except the first.
It is relatively simple to identify and detect a specified object in a still image. However, it is difficult to robustly locate and track a specified target in a continuous video sequence, and existing tracking methods often perform poorly. In long and complex video sequences that are not constrained to a controlled environment, the video scene changes constantly. More importantly, the target object may also undergo various uncertain changes, such as significant deformation, interference from similar surrounding objects, scale changes, long- or short-term occlusion or absence from the camera view, illumination changes, and motion blur [3,6,7]. Under these severe challenges, continuously tracking a specified target is difficult even for the human visual system, let alone for computer vision tracking systems. Therefore, accurately identifying changes of the object and marking its bounding box in each frame is one of the milestone objectives in the field of visual object tracking [9, 10].
Previous studies have shown that the appearance representation of the target directly affects the performance of a tracking method. Therefore, constructing a robust feature representation model can satisfy the tracking system's requirements with respect to target changes during the inference stage. In general, most tracking methods can be classified into two types according to their appearance characterization: tracking methods based on handcrafted features and tracking methods based on deep features.
Prior to 2013, tracking methods based on handcrafted features were among the most widely used in the field of visual target tracking. The earliest such tracking methods can be traced back to 1955, when Wax [11] proposed a tracking method that utilized the correlation between points to track radar signals. In 1960, Kalman [12] proposed a linear Kalman filter for visual target tracking. In 1987, Sethi and Jain [13] proposed a point-based visual target tracking method. Thereafter, tracking methods based on low-level features such as kernel histograms [14], mean shift [15], SIFT [16], mixtures of Gaussians (MOGs) [17], integral histograms [18], histograms of oriented gradients (HOGs) [19], the Hough transform [20], and particle filtering [21] were proposed in succession. Handcrafted features require parameters to be designed manually based on experience; the number of required parameters is small and the execution speed is fast. However, handcrafted features struggle to meet the requirements of tracking accuracy and robustness.
Since deep features were first successfully introduced into the visual target tracking task in 2013 [22], tracking methods have evolved rapidly because of the successful application of deep features in visual target tracking. Deep features greatly improve tracking accuracy and robustness owing to their rich representation capability. Currently, tracking methods based on deep features dominate the tracking field. Among the various deep models used for visual target tracking, the family of convolutional neural networks (CNNs) is one of the most influential and widely used [4]. However, the expensive computational complexity of convolutional features still prevents their application in practical scenarios with relatively high real-time tracking requirements. Therefore, in visual tracking tasks based on deep models, improving computational efficiency without losing accuracy, or striking a trade-off between tracking accuracy and real-time performance, is one of the important problems to be studied.
The invention aims to capture robust appearance features by introducing a pyramid residual network (PyConvResNet) [23] with an attention mechanism [24] into the tracking framework, thereby realizing a competitive tracking method named APR-Net. Standard convolutional networks are based on kernels of a single spatial size and the same depth, and therefore have limited ability to handle multiple scales. Unlike standard convolution, APR-Net contains kernels of different sizes and depths, so that targets of different sizes can be captured automatically by different convolutional layers. APR-Net is also more efficient in terms of the number of parameters and computational effort than other existing network architectures. Therefore, the feature extraction model APR-Net provided by the invention is better suited to tracking tasks.
The tracking method provided by the invention significantly improves tracking performance, outperforms most state-of-the-art tracking methods, and achieves a near-real-time running speed. At the same time, the quality of the feature representation has been shown to depend largely on sufficient end-to-end training of the feature extraction model. With sufficient training, the feature extraction model based on the attention mechanism and the PyConvResNet architecture provided by the invention can, to the greatest extent, identify the target object among a stack of similar objects in each frame.
[1] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, Y. Wang, End-to-end active object tracking and its real-world deployment via reinforcement learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (6) (2019) 1317–1332.
[2] J. Shen, Y. Liu, X. Dong, X. Lu, F. S. Khan, S. Hoi, Distilled Siamese networks for visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (12) (2021) 8896–8909.
[3] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9) (2015) 1834–1848.
[4] H. Kiani Galoogahi, A. Fagg, C. Huang, D. Ramanan, S. Lucey, Need for speed: A benchmark for higher frame rate object tracking, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1125–1134.
[5] K. Cannons, A review of visual tracking, Dept. Comput. Sci. Eng., York Univ., Toronto, Canada, Tech. Rep. CSE-2008-07 242 (2008).
[6] L. Huang, X. Zhao, K. Huang, GOT-10k: A large high-diversity benchmark for generic object tracking in the wild, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (5) (2019) 1562–1577.
[7] M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for UAV tracking, in: European Conference on Computer Vision, Springer, 2016, pp. 445–461.
[8] H. Yang, L. Shao, F. Zheng, L. Wang, Z. Song, Recent advances and trends in visual tracking: A review, Neurocomputing 74 (18) (2011) 3823–3831.
[9] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, H. Ling, LaSOT: A high-quality benchmark for large-scale single object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5374–5383.
[10] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, B. Ghanem, TrackingNet: A large-scale dataset and benchmark for object tracking in the wild, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 300–317.
[11] N. Wax, Signal-to-noise improvement and the statistics of track populations, Journal of Applied Physics 26 (5) (1955) 586–595.
[12] R. E. Kalman, et al., A new approach to linear filtering and prediction problems, Journal of Basic Engineering 82 (1) (1960) 35–45.
[13] I. K. Sethi, R. Jain, Finding trajectories of feature points in a monocular image sequence, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9 (1) (1987) 56–73.
[14] D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, in: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), Vol. 2, IEEE, 2000, pp. 142–149.
[15] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (5) (2003) 564–577.
[16] D. G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, IEEE, 1999, pp. 1150–1157.
[17] C. Stauffer, W. E. L. Grimson, Learning patterns of activity using real-time tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 747–757.
[18] F. Porikli, Integral histogram: A fast way to extract histograms in Cartesian spaces, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, IEEE, 2005, pp. 829–836.
[19] Z. Yin, F. Porikli, R. T. Collins, Likelihood map fusion for visual object tracking, in: 2008 IEEE Workshop on Applications of Computer Vision, IEEE, 2008, pp. 1–7.
[20] A. Goldenshluger, A. Zeevi, The Hough transform estimator, The Annals of Statistics 32 (5) (2004) 1908–1932.
[21] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: European Conference on Computer Vision, Springer, 1996, pp. 343–356.
[22] N. Wang, D.-Y. Yeung, Learning a deep compact image representation for visual tracking, Advances in Neural Information Processing Systems 26 (2013).
[23] I. C. Duta, L. Liu, F. Zhu, L. Shao, Pyramidal convolution: Rethinking convolutional neural networks for visual recognition, arXiv preprint arXiv:2006.11538 (2020).
[24] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, S. Maybank, Learning attentions: Residual attentional Siamese network for high performance online visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4854–4863.
[25] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al., The sixth visual object tracking VOT2018 challenge results, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[26] J. Fan, W. Xu, Y. Wu, Y. Gong, Human tracking using convolutional neural networks, IEEE Transactions on Neural Networks 21 (10) (2010) 1610–1623.
[27] M. Danelljan, G. Bhat, F. Shahbaz Khan, M. Felsberg, ECO: Efficient convolution operators for tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6638–6646.
[28] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, M. Felsberg, Unveiling the power of deep tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 483–498.
[29] F. Zhang, S. Chang, Hierarchical convolutional features fusion for visual tracking, in: Journal of Physics: Conference Series, Vol. 1651, IOP Publishing, 2020, p. 012134.
[30] F. Li, C. Tian, W. Zuo, L. Zhang, M.-H. Yang, Learning spatial-temporal regularized correlation filters for visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4904–4913.
[31] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, W. Hu, Distractor-aware Siamese networks for visual object tracking, arXiv e-prints (2018) arXiv–1808.
[32] T. Zhang, C. Xu, M.-H. Yang, Robust structural sparse tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2) (2018) 473–486.
[33] H. Fan, H. Ling, SANet: Structure-aware network for visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 42–49.
[34] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.
[35] L. Zhang, A. Gonzalez-Garcia, J. v. d. Weijer, M. Danelljan, F. S. Khan, Learning the model update for Siamese trackers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4010–4019.
[36] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, H. Lu, Structured Siamese network for real-time visual tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 351–366.
[37] D. Held, S. Thrun, S. Savarese, Learning to track at 100 fps with deep regression networks, in: European Conference on Computer Vision, Springer, 2016, pp. 749–765.
[38] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. Torr, Fully-convolutional Siamese networks for object tracking, in: European Conference on Computer Vision, Springer, 2016, pp. 850–865.
[39] K. O'Shea, R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458 (2015).
[40] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60 (6) (2017) 84–90.
[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[42] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20 (1) (2008) 61–80.
[43] M. Danelljan, G. Bhat, F. S. Khan, M. Felsberg, ATOM: Accurate tracking by overlap maximization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4660–4669.
[44] M.-H. Guo, T.-X. Xu, J.-J. Liu, Z.-N. Liu, P.-T. Jiang, T.-J. Mu, S.-H. Zhang, R. R. Martin, M.-M. Cheng, S.-M. Hu, Attention mechanisms in computer vision: A survey, Computational Visual Media (2022) 1–38.
[45] B. A. Olshausen, C. H. Anderson, D. C. Van Essen, A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information, Journal of Neuroscience 13 (11) (1993) 4700–4719.
[46] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, T.-S. Chua, SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
[47] A. Sagar, DMSANet: Dual multi scale attention network, in: International Conference on Image Analysis and Processing, Springer, 2022, pp. 633–645.
[48] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, CBAM: Convolutional block attention module, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[49] M. Yang, J. Yuan, Y. Wu, Spatial selection for attentional visual tracking, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.
[50] J. Fan, Y. Wu, S. Dai, Discriminative spatial attention for robust tracking, in: European Conference on Computer Vision, Springer, 2010, pp. 480–493.
[51] Z. Zhu, W. Wu, W. Zou, J. Yan, End-to-end flow correlation tracking with spatial-temporal attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 548–557.
[52] B. Huang, J. Chen, T. Xu, Y. Wang, S. Jiang, Y. Wang, L. Wang, J. Li, SiamSTA: Spatio-temporal attention based Siamese tracker for tracking UAVs, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1204–1212.
[53] K. Yang, Z. He, Z. Zhou, N. Fan, SiamAtt: Siamese attention network for visual tracking, Knowledge-Based Systems 203 (2020) 106079.
[54] Y. Yu, Y. Xiong, W. Huang, M. R. Scott, Deformable Siamese attention networks for visual object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6728–6737.
[55] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, J. Y. Choi, Visual tracking using attention-modulated disintegration and integration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4321–4330.
[56] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, H. Lu, Transformer tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8126–8135.
[57] B. Jiang, R. Luo, J. Mao, T. Xiao, Y. Jiang, Acquisition of localization confidence for accurate object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–799.
[58] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with Siamese region proposal network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8971–8980.
[59] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, SiamRPN++: Evolution of Siamese visual tracking with very deep networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4282–4291.
[60] T. Yang, P. Xu, R. Hu, H. Chai, A. B. Chan, ROAM: Recurrently optimizing tracking model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6718–6727.
[61] B. Yan, H. Zhao, D. Wang, H. Lu, X. Yang, 'Skimming-perusal' tracking: A framework for real-time and robust long-term tracking, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2385–2393.
[62] Z. Zhang, H. Peng, Deeper and wider Siamese networks for real-time visual tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4591–4600.
[63] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, M.-H. Yang, VITAL: Visual tracking via adversarial learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8990–8999.
[64] K. Dai, D. Wang, H. Lu, C. Sun, J. Li, Visual tracking via adaptive spatially-regularized correlation filters, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4670–4679.
[65] M. Danelljan, G. Hager, F. Shahbaz Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4310–4318.
[66] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P. H. Torr, End-to-end representation learning for correlation filter based tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2805–2813.
[67] A. Lukezic, T. Vojir, L. Čehovin Zajc, J. Matas, M. Kristan, Discriminative correlation filter with channel and spatial reliability, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6309–6318.
[68] J. Zhang, S. Ma, S. Sclaroff, MEEM: Robust tracking via multiple experts using entropy minimization, in: European Conference on Computer Vision, Springer, 2014, pp. 188–203.
[69] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in: European Conference on Computer Vision, Springer, 2014, pp. 254–265.
[70] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 749–758.
[71] T. Zhang, C. Xu, M.-H. Yang, Learning multi-task correlation particle filters for visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2) (2018) 365–378.
[72] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Robust visual tracking via hierarchical convolutional features, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (11) (2018) 2709–2723.
[73] L. Zhang, J. Varadarajan, P. Nagaratnam Suganthan, N. Ahuja, P. Moulin, Robust visual tracking using oblique random forests, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5589–5598.
[74] A. He, C. Luo, X. Tian, W. Zeng, A twofold Siamese network for real-time object tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4834–4843.
[75] M. Che, R. Wang, Y. Lu, Y. Li, H. Zhi, C. Xiong, Channel pruning for visual tracking, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[76] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, J. Y. Choi, Context-aware deep feature compression for high-speed visual tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 479–488.
[77] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, P. H. Torr, Fast online object tracking and segmentation: A unifying approach, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1328–1338.
[78] M. Danelljan, G. Hager, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: British Machine Vision Conference, Nottingham, September 1-5, 2014, BMVA Press, 2014, p. 1.
[79] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A visual target tracking method APR-Net based on an attention pyramid residual network is provided. The technical scheme of the invention is as follows:
a visual target tracking method APR-Net based on an attention pyramid residual network comprises the following steps:
designing an attention pyramid residual network feature extraction model APR-Net, wherein the attention pyramid residual network feature extraction model APR-Net is a pyramid residual network with an attention mechanism;
the ATOM tracking method is improved and a tracking framework is designed, which comprises four parts: a feature extractor based on the pyramid network PyConvResNet, an attention module, a classifier and a predictor; the feature extractor based on the pyramid network PyConvResNet is used for extracting the multi-scale features of each frame of video image; the attention module is used for enhancing the visual expression capability of the features and helping the feature extractor pay more attention to the areas where targets may appear; the attention module is integrated into the pyramid network model through a bitwise weighting operation, and the tracking problem is thereby converted into classification and prediction problems; the target is initially located using the classifier to obtain an initial target frame, the predictor optimizes the target frame by back-propagation, and the tracking result of each frame is corrected to obtain a refined target frame.
Further, the pyramid residual network PyConvResNet extracts multi-scale convolutional features by utilizing filters of different sizes and different depths in each convolutional layer; PyConvResNet takes a fixed-size image as input and automatically learns the multi-scale features in each convolutional layer through filters of different sizes and different depths.
Further, the feature extraction model APR-Net consists of two parts: a pyramid feature layer and a mixed attention module; the pyramid network consists of four pyramid residual blocks, with the third and fourth pyramid residual blocks each followed by a mixed attention module; the mixed attention layer is treated as a bitwise weighting operation; the final feature map generated by the feature extraction model APR-Net contains multi-scale features and attention features of the target.
Furthermore, the invention replaces ResNet-18 in the original ATOM with APR-Net to extract more robust target appearance features; an attention module is then placed between PyConvResNet and the region-of-interest pooling layer PrRoIPooling.
The invention has the advantages and beneficial effects as follows:
firstly, the visual attention mechanism is integrated into the PyConvResNet network architecture, and a more robust model better suited to tracking tasks is specially designed, named the attention pyramid residual network feature extraction model APR-Net. The model helps the feature extractor pay more attention to the target and know where the target is most likely to appear. This is critical for identifying small objects, size changes, deformations, out-of-view targets and viewpoint changes; APR-Net is a more efficient alternative to the original deep residual network architecture.
Secondly, experiments show that full end-to-end training of the feature extraction model APR-Net on multiple datasets greatly helps improve tracking performance. It follows that the training strategy is one of the key factors in improving tracking performance.
Third, the present invention successfully applies the proposed APR-Net to a unified tracking framework. The result is a simple and effective tracking framework that makes a good trade-off between tracking accuracy, robustness and tracking speed.
Finally, detailed experiments and analyses of the tracking method proposed by the present invention were performed on multiple benchmark datasets. Experimental results show that, by introducing the feature extraction model APR-Net and fully training it on multiple datasets including LaSOT [9], TrackingNet [10] and GOT-10k [6], the tracking method designed by the invention is comparable or superior to existing tracking methods in tracking accuracy, robustness and real-time capability.
Drawings
Fig. 1 shows the overall structure of the tracking framework proposed by the present invention.
FIG. 2 shows the raw and ranked accuracy/robustness (A/R) plots of the tracking methods on the VOT-2018 challenge benchmark dataset.
FIG. 3 shows the raw and ranked accuracy/robustness (A/R) plots for six different visual attributes on the VOT-2018 dataset: (a) camera motion, (b) illumination change, (c) motion change, (d) occlusion, (e) size change, and (f) the empty attribute.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the present invention will focus on the progress of two classes of research work and some representative tracking methods that are closely related to the work of the present invention. I.e. review the progress of existing tracking methods based on deep learning (section 2.1) and attention mechanisms (section 2.2).
2.1 tracking method based on depth characteristics
Tracking methods based on deep features can be traced back to 2010, when Fan et al. [26] learned spatial and temporal features from two consecutive video frames by training a convolutional neural network (CNN) and proposed a tracking method specialized for people. Document [26] achieved good tracking performance using convolutional neural networks; however, the proposed method is limited to tracking people and cannot effectively track other general targets. The year 2013 was significant for tracking methods based on convolutional networks: Wang et al. [22] proposed, for the first time, a general tracking method based on a deep learning architecture. First, they learn a generic appearance representation of the target object offline using stacked denoising autoencoders. The offline-learned features are then transferred to an online tracking process, automatically deriving valuable features from previous video images. The average tracking speed of this method on a GPU is 15 frames/s. On this basis, a series of deep-feature-based tracking methods, such as ECO [27], UPDT [28], HCF [29], DeepSTRCF [30], DaSiamRPN [31], RSSTDeep [32], SANet [33], MDNet [34], UpdateNet [35] and StructSiam [36], were proposed in succession for general visual target tracking tasks.
Although existing tracking methods can achieve robust performance on benchmark datasets, there is still considerable room for improvement in challenging scenarios such as scale change, illumination change, fast motion, very small targets, complex background, out-of-plane rotation, in-plane rotation and occlusion. Furthermore, relative to the successful application of deep features in image classification and detection tasks, applying deep features to visual target tracking is more challenging because of the higher real-time requirements. Because of computational cost limitations, tracking methods based on deep features are computationally inefficient, which makes it difficult for these conventional deep convolutional feature trackers to satisfy practical applications with high real-time requirements. Most existing tracking methods based on deep convolutional networks rely on a GPU to achieve real-time tracking performance.
Recently, a series of tracking methods based on deep learning have been proposed to address the problem of low computational efficiency [27, 37, 38]. Although these tracking methods have succeeded in running in real time, they often struggle to achieve high accuracy and robustness. For most deep tracking methods, the most important task is how to trade off tracking accuracy, robustness and real-time performance.
Inspired by the successful application of the pyramid residual network in image classification and video action classification/recognition [23], the invention provides a simple and efficient tracking method (named the APR-Net tracking method) based on the combination of multi-scale features and an attention mechanism. The invention uses the pyramid residual network (PyConvResNet) as the backbone network and uses the attention mechanism to enhance the expressive power of visual features.
Standard convolutional networks, such as CNNs [39], AlexNet [40], ResNet [41] and GNNs [42], use a single type of filter kernel with the same spatial resolution and the same depth in different convolutional layers. Thus, in order to cope with size variations of objects in video images, these deep tracking methods address the multi-scale problem by brute-force search. Unlike popular standard convolutional networks that apply filters of the same size and depth, PyConvResNet extracts multi-scale convolutional features within a convolutional layer using filters of different sizes and depths. PyConvResNet takes a fixed-size image as input, and the multi-scale features in each convolutional layer can be learned automatically by filters of different sizes, so that the multi-scale features in each video image can be easily captured. As described in [23], PyConvResNet can gradually reduce the size of the feature map thanks to the flexibility of its filters, so that target features at various scales can be captured over a series of image blocks.
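The pyramidal convolution idea described above can be illustrated with a minimal sketch. The code below is an illustrative PyTorch layer, not the patent's exact implementation: the kernel sizes, group counts and channel split are assumptions chosen to mirror the PyConv design in [23].

```python
import torch
import torch.nn as nn

class PyConvLayer(nn.Module):
    """Minimal pyramidal convolution: parallel branches with different kernel
    sizes (and grouped convolutions of different 'depths'), concatenated along
    the channel axis. Illustrative sketch only."""

    def __init__(self, in_ch=64, out_ch=64, kernel_sizes=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        assert out_ch % len(kernel_sizes) == 0
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2, groups=g, bias=False)
            for k, g in zip(kernel_sizes, groups)
        ])

    def forward(self, x):
        # Each branch sees the same input with a different receptive field,
        # so small and large targets are captured by different kernels.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

if __name__ == "__main__":
    feat = torch.randn(1, 64, 72, 72)      # a hypothetical backbone feature map
    print(PyConvLayer()(feat).shape)       # torch.Size([1, 64, 72, 72])
```

Because the larger kernels use more groups, the parameter count per branch stays roughly comparable, which is the efficiency argument made for pyramidal convolution.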
Compared with the reference tracking method ATOM [43], the tracking method provided by the invention achieves more competitive tracking performance without increasing computation or storage. The APR-Net of the invention effectively improves real-time performance without affecting tracking accuracy and robustness. The invention fully considers the trade-off between tracking performance and tracking speed when designing the overall visual target tracking framework. The experimental results show that tracking accuracy and speed can be effectively improved by introducing variable-size filters in the convolutional layers. This is because variable-size filters increase scale invariance and reduce overfitting and computational complexity.
2.2 tracking methods based on attention mechanisms
Attention is a unique mechanism of the biological visual nervous system: when viewing an image, a person focuses on a target region rather than on the entire image. Thus, it is also desirable that the feature extraction network, like the human visual system, be able to focus on the areas where targets may appear. That is, by introducing an attention mechanism, the feature network can focus on critical parts of the image or video. The attention mechanism adaptively weights the features according to their importance in the input image, thereby automatically finding salient regions in various scenes [44]. Olshausen et al. [45] first applied the attention mechanism in the field of neuroscience in 1993 and achieved good performance, which was the pioneering work introducing the attention mechanism into pattern recognition. Thereafter, attention mechanisms rapidly expanded into other research areas of computer vision, such as image captioning [46], image classification [47], object detection [48] and image saliency [44].
As attention mechanisms have developed and matured, they have also been successfully transplanted into the field of visual target tracking, and research has shown that attention mechanisms can effectively improve tracking performance. Representative attention-based tracking methods include documents [49], [50], [51] and documents [52, 53, 24, 54]. These tracking methods are all based on the basic finding that, by introducing an attention mechanism, more attention can be directed to the area where the object may appear, thereby improving target recognition in the field of visual tracking.
In 2016, Choi et al. [55] proposed a tracking method named AtCF, a pioneering effort that introduced the attention mechanism into the field of visual target tracking. In document [49], a novel attention-based tracking method, AVT, was proposed, which enhances the discriminative power of the tracking model by introducing an attention module, thereby improving tracking performance. In document [50], a selective attention tracking paradigm was introduced into the visual appearance model to improve tracking performance. In addition, a series of tracking methods have attempted to improve tracking performance by integrating the attention mechanism into the Siamese network tracking framework, such as SiamSTA [52] and SiamAtt [53]. However, these tracking methods that integrate attention into the Siamese framework have difficulty achieving better performance in complex scenes such as occlusion, cluttered background and out-of-view, and their higher computational complexity makes it difficult to meet real-time requirements.
Thereafter, many attention-based tracking methods followed, such as the cross-attention and self-attention based tracking method presented in document [56], and the spatial and temporal attention based tracking methods presented in documents [50] and [51]. The above tracking methods that add an attention mechanism achieve better tracking performance on benchmark tracking datasets than the baseline tracking methods, because the attention mechanism effectively enhances the expressive power of the features.
The main factors affecting target tracking performance are the quality and predictive performance of the target's feature model, classifier and regressor, which reflect the accuracy of target identification and detection and the robustness of target localization. Thus, in order to better simulate the human visual system, the present invention carefully designs a unified tracking framework. The attention module is added to the feature map so that the feature extraction network pays more attention to the areas of the feature map where objects may appear. This helps to enhance the appearance characterization of the target, resulting in robust and accurate tracking performance.
Integrating the attention mechanism with the pyramid residual network hardly increases the complexity of the network model, and realizes the fusion of the multi-scale information in the feature map with the attention mechanism. Most importantly, the attention mechanism helps to capture global dependencies from feature maps and effectively build long-term dependencies [56], which is well suited to visual object tracking tasks. By introducing APR-Net, a better balance between tracking accuracy, robustness and real-time performance can be achieved compared with ATOM.
3 tracking method proposed by the invention
In this section, the proposed overall tracking framework is first described in section 3.1. The attention pyramid residual network architecture is described in section 3.2. Sections 3.3 and 3.4 describe feature learning/extraction and the tracking process, respectively. Finally, section 3.5 focuses on the differences between the tracking method of the present invention and ATOM.
3.1 overview of tracking method proposed by the present invention
In the invention, in order to overcome the limitation that ATOM cannot automatically adapt to multi-scale and complex scenes, and inspired by the core ideas of PyConvResNet [23] and the mixed attention mechanism [24], a better tracking method is realized by designing a pyramid residual network fused with a mixed attention mechanism. On this basis, a simple and effective tracking framework is designed, which comprises four parts: a feature extractor based on the pyramid residual network PyConvResNet, an attention module, a classifier and a predictor. Briefly, the present invention follows the idea of the ATOM tracking framework, and the proposed tracking method converts the tracking problem into classification and prediction problems by integrating the attention module into the pyramid residual network model.
The method uses a classifier to distinguish the target from the background and obtain an initial target frame; a predictor corrects the tracking result of each frame; and the prediction result is refined through a modulation module. However, unlike the original ATOM, in this patent ResNet-18 in ATOM is replaced with APR-Net to extract the target appearance features. A hybrid attention module is then placed between the pyramid residual network PyConvResNet and the region-of-interest pooling layer (PrRoIPooling) [57]. Through these improvements, the tracking method provided by the invention can better capture global information through a larger receptive field and multi-scale information without increasing the computational cost. FIG. 1 depicts the overall tracking framework of the present invention.
Fig. 1 shows the overall structure of the tracking framework proposed by the present invention. The tracking framework is composed of four components: a feature extractor based on the pyramid residual network, an attention module, a classifier and a predictor. When the template-frame image and the current-frame image flow into the feature extractor, feature representations are generated. The attention-enhanced feature maps are aggregated to a fixed size using a PrRoIPool layer. The features output for the current-frame image, the predicted target frame obtained in the current frame, the features output for the template image and its predicted target frame are input simultaneously to the IoU predictor. The pooled features with the attention mechanism are modulated by channel-wise multiplication with the coefficient vector returned by the template branch. The classifier generates confidence values from the features of the current-frame image to obtain an initial target frame. Finally, a refined predicted target frame is obtained by maximizing the IoU overlap.
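The data flow of Fig. 1 can be summarized with the following sketch. The module names (backbone, attention, classifier, iou_predictor) and the helper box_from_confidence are hypothetical placeholders used only to show how the four components are wired together; this is not the patent's actual implementation.

```python
# Sketch of the per-frame pipeline of Fig. 1 (assumed component interfaces).
def track_frame(frame, template_feat, template_box, backbone, attention,
                classifier, iou_predictor, prev_box, box_from_confidence):
    # 1. Extract multi-scale pyramid features for the current frame.
    feat = backbone(frame)
    # 2. Enhance the feature map with the mixed attention module (bitwise weighting).
    feat = attention(feat)
    # 3. Coarse localization: the online classifier yields a confidence map.
    confidence = classifier(feat)
    init_box = box_from_confidence(confidence, prev_box)
    # 4. Refinement: the IoU predictor, modulated by the template branch,
    #    scores candidate boxes; the box is refined by maximizing the IoU.
    refined_box = iou_predictor.refine(feat, init_box, template_feat, template_box)
    return refined_box
```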
3.2 attention pyramid network architecture
The invention designs a deep feature extraction model better suited to visual target tracking tasks, named APR-Net. The overall network architecture of APR-Net and its attention mechanism are described in sections 3.2.1 and 3.2.2, respectively.
3.2.1 overall network architecture
In previous visual target tracking work, attention mechanisms and pyramid networks were explored separately. The invention, for the first time, successfully introduces the attention mechanism into a pyramidal residual network for extracting the appearance features of the target object in the tracking task. The invention draws inspiration from the attention mechanism [24] and the pyramid residual network (PyConvResNet) [23]. The backbone architecture of the feature extraction network model of the present invention is the deep residual network with skip connections (ResNet-18) proposed by He et al. [41]. Unlike the original ResNet-18, APR-Net extracts features from each frame by integrating an attention module into PyConvResNet. The feature extraction model APR-Net has filters of different sizes and depths in each layer, and can automatically extract multi-scale feature representations from each feature map.
The original pyramid residual network PyConvResNet was designed for object detection, image classification and image recognition. To make it applicable to tracking tasks, the original PyConvResNet architecture is modified and an attention mechanism is added, so that the feature extraction model APR-Net can fully exploit the complementary advantages of the pyramid architecture and the attention mechanism. With these minor modifications, the pyramid residual network architecture becomes better suited to visual target tracking tasks. Thus, during tracking, the present invention uses multi-scale deep features with an attention mechanism as a generic feature description of the target.
As the name suggests, the model APR-Net for extracting deep features consists of two parts: a pyramid feature layer and an attention module. The pyramid feature network consists of four pyramid residual blocks, with the third and fourth pyramid residual blocks each followed by a mixed attention module. The present invention treats the mixed attention layer as a bitwise weighting operation. The final feature map generated by the feature extraction model APR-Net contains multi-scale features and attention features of the target, which emphasize the regions where the target may appear and suppress the background region. Thus, the feature extraction model APR-Net can attend to the most likely important regions of the target in each frame, so that the specified target can be better localized.
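The composition described here (four pyramid residual stages, with a mixed attention module applied as a bitwise weighting after the third and fourth stages) can be sketched as follows. The stage and attention modules are assumed to be supplied externally; this is a layout sketch, not the patent's network definition.

```python
import torch.nn as nn

class APRNetSketch(nn.Module):
    """Illustrative APR-Net layout: pyramid residual stages with mixed attention
    applied element-wise after stages 3 and 4 (assumption-level sketch)."""

    def __init__(self, stages, attention3, attention4):
        super().__init__()
        self.stage1, self.stage2, self.stage3, self.stage4 = stages
        self.attention3 = attention3   # mixed attention for stage-3 features
        self.attention4 = attention4   # mixed attention for stage-4 features

    def forward(self, x):
        x = self.stage2(self.stage1(x))
        f3 = self.stage3(x)
        f3 = f3 * self.attention3(f3)   # bitwise (element-wise) weighting
        f4 = self.stage4(f3)
        f4 = f4 * self.attention4(f4)
        return f3, f4                   # multi-scale, attention-enhanced feature maps
```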
The APR-Net architecture enables the feature extraction process to capture various types of complementary information, from local to global. This improvement makes the feature extraction model APR-Net more robust to the various complex scenes in each sequence, and in particular better able to identify targets under complex background, scale change, deformation, motion blur, fast motion and similar conditions.
Furthermore, integration of the attention module with the multi-scale feature map and adequate training of the network model over multiple data sets is key to improving tracking performance. This is mainly because the APR-Net network architecture can not only encode targets of various scales from the feature map, but also emphasize areas where targets may appear, which tightly couples learned features to the tracking process.
3.2.2 attention mechanisms
The invention uses a mixed attention module to learn the interdependence relationship between different features. The main motivation for designing an attention module is to emphasize the areas where the target may appear, while ignoring its background areas. In the present invention, the attention module is integrated into the classifier branch and the predictor branch, respectively. An attention module in the classifier directs the feature extraction network to enhance the most likely region of the object in each frame, thereby enhancing the robustness to object deformation. The intuition behind the attention module is that not all features provide the same contribution to the classifier and predictor. An attention module in the predictor may fine tune the results of predicting the bounding box to obtain a more accurate target bounding box.
As shown in document [24], the present invention introduces an attention factor λ into the feature x and learns deep appearance features through the attention mechanism. The attention module of the invention, \(\tilde{x}\), is formulated as:

\[ \tilde{x} = \lambda \otimes x \tag{1} \]

where \(\otimes\) denotes a bitwise (element-wise) weighting operation. To reduce computational complexity, following [24] the attention factor λ is further decomposed into a dual attention ε and a channel attention κ; the dual attention ε is used to enhance the focus on the target, and the channel attention κ is used to weight the feature channels. The attention is therefore decomposed as:

\[ \lambda = \varepsilon \otimes \kappa \tag{2} \]

where ε can be further decomposed into a superimposed attention, composed of a general attention \(\varepsilon_{g}\) (similar to a Gaussian distribution) and a residual attention \(\varepsilon_{r}\) (which can be regarded as a discriminative term):

\[ \varepsilon = \varepsilon_{g} + \varepsilon_{r} \tag{3} \]
the overlay attention module can help the feature model capture global saliency information in various video scenes, enhance the capability of the feature model to identify targets, and reduce computational complexity. Three kinds of attention are fused using the mixed attention module described in document [24] as a template feature vector: namely spatial attention, superimposed attention and channel attention.
It is of paramount importance to draw mixed attention that, after the target is specified in the first frame, target candidate regions requiring close attention can be found in the image of each successive video frame subsequent to the video sequence. And channel attention κ may further enhance the adaptation of the feature model to target variations. More details about mixed attention and its solving process are consistent with document [24 ]. However, unlike document [24] which introduces an attention mechanism on the twinning network, the present invention integrates the attention mechanism into the pyramid residual network pyconvranet, capturing complementary information with multi-scale features from local information to global information.
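A minimal sketch of the mixed attention weighting of Eqs. (1)–(3) is given below. The channel-attention and residual-attention sub-modules are simplified stand-ins (a squeeze-and-excitation-style channel branch and a 1×1-conv spatial map), and the Gaussian general attention is a fixed prior; none of these are claimed to be the exact modules of [24].

```python
import torch
import torch.nn as nn

class MixedAttentionSketch(nn.Module):
    """lambda = epsilon (dual) * kappa (channel); x_tilde = lambda (*) x.
    Sub-modules are simplified assumptions, not the patent's exact design."""

    def __init__(self, channels=256, reduction=16):
        super().__init__()
        # Channel attention kappa (squeeze-and-excitation style stand-in).
        self.kappa = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Residual (discriminative) part of the dual attention epsilon.
        self.eps_residual = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        # General attention: a fixed Gaussian prior over the search region.
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w)
        eps_general = torch.exp(-(xs ** 2 + ys ** 2) / 0.5)
        eps = eps_general + self.eps_residual(x)   # Eq. (3): superimposed attention
        lam = eps * self.kappa(x)                  # Eq. (2): dual x channel attention
        return lam * x                             # Eq. (1): bitwise weighting
```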
3.3 feature learning and feature extraction
The present invention adds mixed attention at the end of the third and fourth pyramid residual blocks; the attention module is used to weight the feature map. More specifically, the template feature x_0 from the first frame and the test feature x_t of the current frame t are fed into their respective APR-Net branches, yielding attention-enhanced feature maps \(\tilde{x}_0(i)\) and \(\tilde{x}_t(i)\) at location i. The higher the values of \(\tilde{x}_0(i)\) and \(\tilde{x}_t(i)\) at position i, the greater the likelihood that i is a target position; lower values at position i are more likely to correspond to the background region of the object.
The feature map \(\tilde{x}_0\) generated from the template image and its predicted target frame B_0, together with the feature map \(\tilde{x}_t\) of the current frame and its predicted target frame B_t, serve as the input to the IoU predictor branch network, while the classification branch network computes a target confidence score from the feature map to obtain an initial target frame. The tracking method can acquire complementary information from different convolutional layers through different types of filter kernels, providing better discriminative power for the learned deep features.
3.4 tracking procedure
The essential problem of the tracking method proposed by the present invention is how to learn the classifier and the predictor. Thus, this section focuses on two aspects: target classification and target prediction. During tracking, the method alternates between target classification and target prediction. The aim is to learn a feature extraction network that maps the current frame from an image to a feature map through a series of pyramid convolution operations. After mixed attention enhancement, the multi-scale feature map with attention is sent to the classifier and the predictor.
3.4.1 target prediction
The role of the prediction branch is to combine the previously generated modulation vector, infer the IoU overlap of the target frame, and optimize the target frame by back-propagation to maximize the IoU, thereby obtaining a refined prediction frame. The input image of each frame is resized to 288×288, and the feature maps of each layer are adjusted to a size of 38×38. The template branch models the appearance of the object from the image features x_0 of the first frame. The test branch extracts the appearance features x_t of the target in the current frame t and computes confidence and IoU overlap values.
The output of the template branch is two modulation vectors \(c(x_0, B_0)\) of size 1×1×D_z. The test branch obtains a 1×1×D_z feature vector \(z(\tilde{x}_t, B_t)\) through the PrRoI pooling operation. The final output of the test branch is the IoU overlap, given by:

\[ \mathrm{IoU}(B_t) = g\big(c(x_0, B_0) \cdot z(\tilde{x}_t, B_t)\big) \tag{4} \]

where x_0 and B_0 are the template features and the annotated frame in the template image, respectively. \(c(\cdot)\) returns a modulation vector of size 1×1×D_z consisting of positive coefficients, which is pre-computed from the template features \(\tilde{x}_0\) and the initial bounding box B_0. Target-specific appearance information is obtained through the modulation module, refining the predicted IoU overlap. The specified target prediction frame B_t is parameterized as \(B_t = (c_m, c_n, w, h)\), where (c_m, c_n) is the position of the frame center, the subscripts m and n are image coordinates, and w and h denote the width and height of the frame, respectively. The backbone feature map x, passed through the attention module to give \(\tilde{x}_t\), together with the predicted frame B_t, generates a representation \(z(\tilde{x}_t, B_t)\) of size K×K×D_z, where K is the spatial output size. \(g(\cdot)\) is an IoU estimation module consisting of three fully connected layers. The prediction error of equation (4) can be minimized given an annotated frame, and the predicted target state can be obtained by maximizing equation (4).
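A simplified sketch of the modulation-based IoU prediction of Eq. (4) and of the back-propagation-based box refinement follows. The PrRoI-pooling step is replaced by an assumed callable pool_fn, and the layer widths, step count and step size are placeholders, not the patent's values.

```python
import torch
import torch.nn as nn

class IoUPredictorSketch(nn.Module):
    """IoU(B_t) = g( c(x0, B0) * z(x_t, B_t) ): the template branch returns a
    modulation vector of positive coefficients; the test feature is modulated
    channel-wise and passed to a small fully connected head g. Sketch only."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.template_fc = nn.Linear(feat_dim, feat_dim)
        self.g = nn.Sequential(                       # IoU estimation head (3 FC layers)
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, template_vec, test_vec):
        modulation = torch.relu(self.template_fc(template_vec))   # positive coefficients
        return self.g(modulation * test_vec)                      # predicted IoU

def refine_box(predictor, template_vec, pool_fn, box, steps=5, lr=1.0):
    """Gradient ascent on the predicted IoU w.r.t. the box parameters;
    pool_fn is an assumed differentiable PrRoI-pooling callable."""
    box = box.clone().requires_grad_(True)
    for _ in range(steps):
        iou = predictor(template_vec, pool_fn(box)).sum()
        iou.backward()
        with torch.no_grad():
            box += lr * box.grad
            box.grad.zero_()
    return box.detach()
```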
3.4.2 object classification
The goal of the classification branch is to initially locate the target in the continuous video sequence. Thus, the classifier is learned online in order to capture objects in real-time in the image of each frame of video. The target classification component in the tracking framework of the present invention is defined as:
\[ f(\tilde{x}; \omega) = \phi_2\big(\omega_2 * \phi_1(\omega_1 * \tilde{x})\big) \tag{5} \]

where \(\tilde{x}\) is a backbone feature map of the image, \(\omega = \{\omega_1, \omega_2\}\) are the network parameters, \(\phi_1\) and \(\phi_2\) are activation functions, and \(*\) denotes standard multi-channel convolution.
The objective function of the classification error, inspired by the discriminative correlation filter (DCF), can be defined as:

\[ L(\omega) = \sum_{j} \alpha_j \big\| f(\tilde{x}_{t,j}; \omega) - y_{t,j} \big\|^2 + \sum_{k} \gamma_k \|\omega_k\|^2 \tag{6} \]

where \(y_{t,j}\) labels each training sample feature map \(\tilde{x}_{t,j}\) and is a Gaussian function centered on the target location, the weight \(\alpha_j\) controls the influence of each training sample, and \(\gamma_k\) regularizes the weights \(\omega_k\).
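The two-layer convolutional classifier of Eq. (5) and the DCF-style objective of Eq. (6) can be sketched as follows. The kernel sizes, channel widths and regularization weight are assumptions chosen only for illustration.

```python
import torch.nn as nn

class ClassifierSketch(nn.Module):
    """f(x; w) = phi2(w2 * phi1(w1 * x)): a two-layer convolutional head
    producing a target confidence map (Eq. (5)); sizes are placeholders."""

    def __init__(self, in_ch=256, mid_ch=64):
        super().__init__()
        self.w1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.w2 = nn.Conv2d(mid_ch, 1, kernel_size=4, padding=2)
        self.phi1 = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.w2(self.phi1(self.w1(x)))    # confidence map

def classification_loss(model, feats, labels, sample_weights, reg=1e-3):
    """Eq. (6): weighted squared error against Gaussian labels centred on the
    target, plus L2 regularization of the filter weights (assumed reg value)."""
    loss = sum(a * ((model(x) - y) ** 2).sum()
               for x, y, a in zip(feats, labels, sample_weights))
    loss += reg * sum((p ** 2).sum() for p in model.parameters())
    return loss
```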
3.4.3 target tracking
The invention uses the ATOM tracking method as the reference tracking framework [43], and the tracking framework is defined as a classifier and a predictor. The invention uses the classifier to distinguish the target from its background and obtain an initial target frame, and combines the predictor to fine-tune the tracking result. Finally, the jointly trained classifier and predictor optimize the tracking result to obtain the final tracking result in the current frame t.
In the target positioning stage, the tracking target is given by a rectangular box in the first frame. For subsequent video frames, the tracking method determines the target box position in the image of the current frame according to the state information of the target in the previous frame. During tracking, all the output feature maps of the head layers are used to predict and classify the tracking target in each video frame. First, using the classification module, a confidence map is calculated according to the position P_{t-1} and the size S_{t-1} of the previous frame t-1, and the position P_t with the highest confidence score is taken as the position of the target in the current frame t. The position P_t in the current frame t and the size S_{t-1} calculated in the previous frame t-1 constitute the initial target box.
Next, 10 candidate regions are generated and their IoU overlap rates are calculated using the target prediction module. The boxes corresponding to the three largest IoU values are taken, and their average is used as the final tracking box.
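A compressed sketch of this per-frame localization step is given below; the classifier and iou_head objects, the jitter-based candidate generation and all numeric constants are placeholders for the components described above, not the exact procedure of the original implementation.

```python
import numpy as np

def track_frame(feat, prev_pos, prev_size, classifier, iou_head,
                num_candidates=10, top_k=3):
    """One tracking step: classify -> initial box -> IoU-based refinement."""
    # 1. Confidence map from the classifier over a search region around P_{t-1}, S_{t-1};
    #    its peak gives the position P_t.
    conf = classifier(feat, prev_pos, prev_size)          # 2-D score map
    py, px = np.unravel_index(np.argmax(conf), conf.shape)

    # 2. Initial target box from P_t and the previous size S_{t-1}, as (x, y, w, h).
    init_box = np.array([px, py, prev_size[0], prev_size[1]], dtype=float)

    # 3. Generate candidate boxes around the initial box and predict their IoU.
    noise = np.random.randn(num_candidates, 4) * np.array([5.0, 5.0, 3.0, 3.0])
    candidates = np.vstack([init_box, init_box + noise])
    ious = iou_head(feat, candidates)                     # predicted IoU per candidate

    # 4. Average the boxes with the three largest predicted IoU values.
    best = np.argsort(ious)[-top_k:]
    return candidates[best].mean(axis=0)                  # final tracking box for frame t
```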
3.5 discussion
This section will focus on analyzing and discussing the differences between the tracking method proposed by the present invention and ATOM in terms of multi-scale searching (section 3.5.1) and attention mechanisms (section 3.5.2).
3.5.1 Benefit from multi-scale searching
The size of the target varies almost throughout any video sequence, so tracking performance depends largely on the ability of the tracking method to recognize target size changes. One long-standing goal of visual target tracking is therefore the ability to process inputs at multiple scales so as to accurately capture moving targets in a video sequence. ATOM attempts to identify scale changes of the target automatically by maximizing the IoU overlap rate predicted by its target estimation component. A potential problem with ATOM is reduced tracking performance caused by inconsistencies between the predicted and ground-truth boxes, which essentially explains why the original ATOM cannot further improve tracking accuracy and robustness. The feature extraction network therefore requires a more efficient strategy for searching the multi-scale information of a target.
Compared with ATOM, the tracking method provided by the present invention can automatically capture scale changes of the target through APR-Net: instead of being limited to a group of filter kernels of fixed size, each convolution layer can employ several filter kernels of different sizes and different depths. The multi-scale features of each pyramid layer are combined with the original-scale features of the current frame to serve as the final features, so that targets of different scales are captured accurately. Finally, the multi-scale features are used as the feature maps of the predictor and the classifier to capture targets of different sizes in the video frame image. Therefore, when APR-Net is used as the feature extractor, it provides significant advantages for the tracking task.
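The following is a minimal sketch of such a pyramidal convolution layer in the spirit of PyConvResNet; the specific kernel sizes, group counts and the even channel split are illustrative assumptions, and the channel counts are assumed divisible by the group counts.

```python
import torch
import torch.nn as nn

class PyConvLayer(nn.Module):
    """Sketch: parallel convolutions with different kernel sizes (and grouped
    'depths'), whose outputs are concatenated into one multi-scale feature map."""

    def __init__(self, in_channels=64, out_channels=64,
                 kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        split = out_channels // len(kernels)
        self.levels = nn.ModuleList(
            nn.Conv2d(in_channels, split, k, padding=k // 2, groups=g, bias=False)
            for k, g in zip(kernels, groups)
        )

    def forward(self, x):
        # Each level sees the same input through a different receptive field.
        return torch.cat([level(x) for level in self.levels], dim=1)
```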
3.5.2 Benefit from the attention mechanism
Unlike ATOM, which performs the PrRoIPool operation directly on the feature map, the present invention places a mixed attention module between the PrRoIPool operation and the feature map generated by APR-Net. By adding the attention mechanism to the network, the model can derive attention maps from the features and pay more attention to key regions.
In the predictor, the feature map processed by the attention module focuses on the target region, while in the classifier it is refined by focusing on the difference between the target and its background. Although the work of the present invention is an extension of the ATOM tracking method, it achieves more competitive tracking performance than ATOM in various challenging scenarios.
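To illustrate where such a module sits and what the bitwise weighting amounts to, here is a hedged sketch of a mixed (channel plus spatial) attention block applied to the backbone feature map before pooling; the channel/spatial decomposition, the 7×7 spatial kernel and the reduction ratio are assumptions, and prroi_pool stands in for an external PrRoIPool operator that is not implemented here.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Sketch of A(x) = lambda ⊙ x, with lambda built from a channel part and a spatial part."""

    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(                       # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                       # spatial attention over locations
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        lam = self.channel(x) * self.spatial(x)             # broadcast to a full attention factor
        return lam * x                                      # bitwise (element-wise) weighting

# Usage sketch (prroi_pool and candidate_boxes are assumed to exist externally):
#   attended = MixedAttention(256)(backbone_feat)
#   pooled   = prroi_pool(attended, candidate_boxes)
```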
4 implementation details
The network model APR-Net and the proposed tracking method are implemented in Python on a computer with an Intel i9 CPU, 32.0 GB of RAM and an NVIDIA RTX A2000 GPU. The average speed of the proposed tracking method is close to 30 frames/second.
The training settings of the present invention follow PyConvResNet [23] and ATOM [43]. To meet our GPU throughput requirements, all training is performed with a mini-batch size of 64. The images or image sequences used for training are cropped to a fixed size of 288 × 288. Stochastic gradient descent (SGD) with a momentum of 0.9 is applied, and the weight decay is set to 0.005 during the end-to-end training phase. The network model is trained for 50 epochs with a learning rate of 10^-5.
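A minimal sketch of these training settings is shown below, assuming a PyTorch training pipeline; the dataset object, the loss function and the data-loader details are placeholders, and only the hyper-parameters quoted above (batch size 64, 288 × 288 crops, SGD with momentum 0.9, weight decay 0.005, 50 epochs, learning rate 10^-5) are taken from the text.

```python
import torch
from torch.utils.data import DataLoader

def train_offline(model, dataset, loss_fn, epochs=50):
    """Sketch of the end-to-end offline training loop with the stated hyper-parameters."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-5,
                                momentum=0.9, weight_decay=0.005)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:        # images are assumed pre-cropped to 288 x 288
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
    return model
```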
In the present invention, APR-Net serves as the feature extractor of the tracking method; it is randomly initialized and first trained from scratch on the MS-COCO dataset in an offline fashion. The predictor branch is then fully trained offline on three dedicated tracking training datasets (LaSOT, TrackingNet and GOT-10k), enabling the predictor to learn feature representations with multi-scale and attention mechanisms from generic objects. The classifier is trained online to capture the characteristics of the specific tracking target, which improves the generalization and discrimination capability of the tracking method on specific targets, similar to ATOM.
5 experiment verification and result analysis
In this section, the validity of the proposed tracking method is verified by conducting a series of experiments on several representative benchmark tracking datasets (including LaSOT [9], UAV123 [7], UAV20L [7], VOT-2018 [25] and NFS [4]). Ablation studies are also performed to observe the influence of each component of the proposed tracking method on tracking performance. The experiments in this section show that the APR-Net model can achieve more robust tracking performance without sacrificing computational efficiency.
5.1 experiment 1: laSOT
As is well known, the LaSOT dataset [9] consists of two subsets: a dedicated training subset that provides visual and linguistic annotations for training deep tracking methods (1120 videos), and a test subset for evaluating tracking performance (280 videos). In this section, the LaSOT test subset is used to verify the validity of the proposed tracking method. Following evaluation protocol II in document [9], tracking performance is evaluated using two metrics: tracking precision and success rate.
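For reference, the two metrics can be computed as in the following sketch, which assumes the common one-pass-evaluation definitions (precision as the fraction of frames whose center-location error is within 20 pixels, success rate as the area under the success curve over IoU thresholds); the exact thresholds used by the LaSOT toolkit may differ.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are rows of (x, y, w, h)."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def overlap(pred, gt):
    """IoU between predicted and ground-truth boxes, per frame."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision(pred, gt, threshold=20.0):
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt):
    thresholds = np.linspace(0, 1, 21)
    ious = overlap(pred, gt)
    return float(np.mean([np.mean(ious > t) for t in thresholds]))
```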
5.1.1LaSOT Overall Performance
The proposed tracking method achieves an overall tracking precision of 0.524 and a success rate of 0.530 on the LaSOT test subset, performing better than the baseline tracking framework ATOM: compared with ATOM, tracking precision improves by 1.9% and success rate by 1.6%.
This excellent performance can be attributed to replacing the ResNet-18 feature extraction network used in the ATOM tracking method with the proposed APR-Net feature extraction model. The reason the proposed tracking method succeeds on a wide variety of tracking sequences is the design of the attention pyramid residual network, which provides the tracking method with a global receptive field. The contribution of the attention and pyramid networks to this excellent tracking performance is therefore very significant.
5.1.2 Attribute-based performance on LaSOT
As described in document [9], laSOT is more challenging than other mainstream baseline datasets with the attributes of target scale change, out of view of the camera, and occlusion. Even so, the tracking method still realizes the tracking precision of 0.521 and the success rate of 0.529 in the scale change attribute, and realizes the tracking precision of 0.439 and the success rate of 0.448 in the out-of-view attribute respectively. In addition, the tracking accuracy of the tracking method on the complete shielding attribute is 0.476, and the success rate is 0.446.
Most importantly, in addition to the tracking methods of the present invention achieving competitive tracking performance on the typically most challenging visual attributes (e.g., scale change, complete occlusion, and out of view), the tracking methods of the present invention have better tracking stability among other various challenging attributes (including viewpoint change, rotation, partial occlusion, motion blur, low resolution, illumination change, fast motion, deformation, camera motion, complex background, and aspect ratio change). This shows that the tracking method of the invention can be adapted to various complex scenes and achieve satisfactory tracking effects.
5.2 experiment 2: UAV123
The unmanned aerial vehicle dataset (unmanned aerial vehicles, UAVs) [7] consists of 123 high-definition, low-altitude aerial-viewpoint videos captured by a professional drone, and is therefore also referred to as the UAV123 benchmark tracking dataset. The evaluation metrics for UAV123 are the same as for LaSOT, i.e., precision and success rate. Tracking very small targets, fast-moving targets, out-of-view targets and targets in complex, changeable scenes is more challenging in UAV123 than in other target tracking datasets of field scenes. Although the videos in the UAV123 benchmark dataset are very challenging, the tracking method of the present invention still delivers very strong tracking performance, with a tracking precision of 0.854 and a success rate of 0.645. It can thus be inferred that the tracking method of the present invention is well suited to tracking objects in aerial video under complex and diverse real scenes such as changes of camera viewpoint, height and position, low resolution, significant scale/aspect-ratio changes, fast motion, very small objects, similar objects, and long-term partial or complete occlusion. The proposed tracking method is clearly superior to the baseline tracking framework ATOM (tracking precision 0.828, success rate 0.621): compared with ATOM, the tracking precision and success rate of the present invention improve by 2.6% and 2.4%, respectively.
This is mainly because the tracking method of the present invention can adapt well to changes of the target scale by extracting multi-scale features through the pyramid network. In addition, integrating the attention module into the pyramid network makes it convenient to capture features from different receptive fields and to combine related features at different scales, so that the tracking method can better recognize various changes. Most importantly, introducing the attention mechanism into the pyramid network allocates more attention to the region where the target is located. The tracking method of the present invention can therefore track very small targets stably.
5.3 experiment 3: UAV20L
UAV123 contains three subsets, of which UAV20L is one, comprising 20 long, low-altitude, high-definition videos. UAV20L is more challenging than the other, short-term videos in UAV123 because targets are more easily lost in long videos. The tracking method of the present invention outperforms the ATOM tracking method on UAV20L, improving tracking precision from 0.678 to 0.767 and success rate from 0.522 to 0.585. However, compared with its performance on UAV123, the performance of the proposed method on UAV20L drops slightly (tracking precision by 8.7% and success rate by 6%). The tracking method of the present invention improves target recognition capability by introducing the mixed attention module into the original pyramid network, thereby improving the feature representation and tracking performance. Most importantly, the study in document [29] found that shallow layers capture details such as texture, edges, contours and colors because they have small receptive fields, while deep layers, with larger receptive fields, obtain more semantic information. Therefore, the 3rd and 4th residual blocks are fed into their respective convolutional layers, facilitating the fusion of shallow features with deep features. Finally, the features of the 3rd and 4th residual blocks are used for classification and prediction, making full use of the detail information of the shallow layers and the semantic information of the deep layers. These may be the key factors that allow the tracking method of the present invention to achieve the desired tracking performance even in long videos such as UAV20L.
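The fusion described here could be realized, for instance, by the following sketch, in which the outputs of residual blocks 3 and 4 each pass through their own convolutional layer before being combined; the channel counts, the bilinear upsampling and the additive fusion are assumptions made for illustration rather than the exact design of APR-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowDeepFusion(nn.Module):
    """Sketch: project block-3 (shallow, detail) and block-4 (deep, semantic)
    features with separate conv layers, then fuse them at block-3 resolution."""

    def __init__(self, c3=128, c4=256, out_channels=256):
        super().__init__()
        self.proj3 = nn.Conv2d(c3, out_channels, kernel_size=3, padding=1)
        self.proj4 = nn.Conv2d(c4, out_channels, kernel_size=3, padding=1)

    def forward(self, feat3, feat4):
        # In a ResNet-18-style backbone, feat4 has half the spatial size of feat3.
        up4 = F.interpolate(self.proj4(feat4), size=feat3.shape[-2:],
                            mode='bilinear', align_corners=False)
        return self.proj3(feat3) + up4      # detail plus semantics for the two heads
```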
5.4 experiment 4: VOT-2018
It is well known that Visual Object Tracking (VOT) is a challenging benchmark dataset. Since 2013, improvements to this benchmark and the results of the challenged tracking methods have been reported at ICCV or ECCV almost every year. However, unlike the other benchmark datasets used for evaluation, an accuracy/robustness plot is used to measure tracking performance on the VOT dataset. In this section, the experimental verification of the present invention follows the VOT-2018 standard (see http://votchallenge.net/vot2018).
The tracking method of the present invention compares to known tracking methods, such as ATOM [43], siamDW [62], ECO [27], SA_Siam_P [74], siamRPN [58], CPT [75], updateNet [35], TRACA [76], UPDT [28], and SiamMask [77], on a VOT-2018 dataset. As shown in fig. 2, the tracking method provided by the invention is positioned at the upper right of the VOT-2018 benchmark test, and is equivalent to the tracking performance of the existing advanced tracking method.
FIG. 2. Accuracy/robustness (A/R) raw and ranking plots of the tracking methods on the VOT-2018 challenge benchmark dataset. The evaluated tracking methods are ranked according to their average performance over all sequences; the tracking method in the upper-right corner gives the best performance according to the VOT-2018 expected average overlap value.
In addition to the overall performance, Table 1 further reports the expected average overlap (EAO), overlap rate (overlap), failure rate (failure) and speed (frames/second) of the tracking methods. The tracking method of the present invention is superior to recent tracking methods in terms of expected average overlap and overlap rate, and is inferior to UPDT in terms of robustness (tracking failure rate). Notably, the tracking method of the present invention runs at 28.2 frames/second, which is faster than the baseline tracking framework ATOM (26.6 frames/second). The experimental results show that, although the feature extraction network of ATOM is implemented on a ResNet-18 backbone, the tracking speed of the proposed feature extraction network APR-Net, implemented on a PyConvResNet-18 backbone with the mixed attention module, is comparable to that of ATOM.
Table 1. Tracking performance of the tracking methods on the VOT-2018 benchmark dataset in terms of expected average overlap (EAO), overlap rate (overlap), failure rate (failure) and speed (frames/second).
In addition, FIG. 3 shows the attribute-based tracking performance of the tracking methods. The proposed tracking method is better suited to scenes involving camera motion, illumination change, motion change, occlusion, scale change and the empty attribute, and can effectively improve tracking accuracy and robustness on VOT-2018. This is because the proposed tracking method exploits the attention mechanism to better locate the target within the global scope of the video. The proposed tracking method can therefore meet the tracking performance requirements under these challenging attributes.
FIG. 3. Accuracy/robustness (A/R) ranking plots for 6 different visual attributes on the VOT-2018 dataset: (a) camera motion, (b) illumination change, (c) motion change, (d) occlusion, (e) scale change and (f) the empty attribute. The tracking methods are ranked according to their average performance over all sequences; the tracking method in the upper-right corner gives the best performance according to the VOT-2018 expected average overlap value.
5.5 experiment 5: ablation study on NFS dataset
Unlike the low-frame-rate (30 FPS) video tracking benchmark datasets, NFS (Need for Speed) [4] is the first high-frame-rate video dataset and target tracking benchmark. As described in [4], the benchmark dataset consists of 100 sequences (380K frames) captured from real scenes by high-frame-rate (240 FPS) cameras. Tracking performance on this dataset is ranked according to accuracy and real-time performance. Three variants of the tracking method proposed by the present invention are implemented and evaluated in section 5.5.1. Furthermore, the tracking performance obtained by training the feature extraction model APR-Net on different training datasets is reported in section 5.5.2. These ablation results are expected to verify why the proposed tracking method improves tracking so effectively.
5.5.1 three variants of the tracking method proposed by the present invention
To demonstrate how the different components of the tracking method, or variants thereof, affect the final tracking performance, a thorough ablation study was performed on the NFS benchmark dataset. The influence of the feature extraction network APR-Net on tracking performance is verified by transforming the ResNet-18 architecture; the three variants of the proposed model correspond to a PyConvNet+ResNet model with only pyramid layers, an Attention+ResNet model with only attention layers, and a PyConvNet+Attention+ResNet model with both pyramid and attention layers. On the NFS benchmark dataset, the tracking performance of the three variants PyConvNet+ResNet, Attention+ResNet and PyConvNet+Attention+ResNet and of ATOM is verified under the same parameter settings. The ATOM tracking method serves as the baseline tracking method of the present invention, and no modifications were made to the authors' code in the comparison.
First, the attention module is simply removed from the network architecture, leaving the pyramid residual network based on the original ResNet-18 (PyConvNet+ResNet) [23], without any other changes. After this change, the network structure used for feature extraction consists only of a series of pyramid residual blocks without attention layers. Without the mixed attention, PyConvNet+ResNet alone achieves a tracking precision of 0.719 and a success rate of 0.599.
Besides the pyramid residual network model being beneficial to tracking performance, the ablation experiments show that the attention module also has a certain influence on tracking performance. Compared with ATOM, adding only the attention module (without pyramid residual blocks) to the original ResNet-18 gives slightly better tracking performance (tracking precision 0.692, success rate 0.581). The reason is that the discrimination capability of the model is closely related to the introduction of the attention module. It follows that the attention mechanism helps to enhance the ability of the deep model to identify specific targets. This shows that the attention layer can effectively improve tracking performance and is clearly necessary in visual target tracking.
When the PyConvNet+ResNet backbone network is replaced by the proposed attention pyramid architecture PyConvNet+attention+ResNet, the tracking precision is improved by 2.1%, and the success rate is improved by 1.4%. The attention pyramid network pyconvnet+attention+res net is also superior to the attention module only (without pyramid convolution) tracking method. Thus, the excellent performance of the proposed tracking method is attributed to an efficient combination of pyconvnet+resnet and attention mechanisms. It can be found from the results of the ablation study described above that none of the variants can exceed the tracking performance of the present invention in integrating attention into the pyramid network. In general, feature extraction models with both attention and pyramid layers are superior in tracking performance to feature extraction models with attention only or pyramid layers only.
The superior performance of the tracking method provided by the invention is verified by an ablation experiment result. Integrating attention and pyramid convolutions with different filter kernel sizes and different depths into each residual block can provide significant advantages. This provides a powerful feature model for the tracking task, which is also one of the main drivers for excellent tracking performance.
Furthermore, the three variants of the feature extraction model of the present invention always provide better tracking performance than the ATOM tracking framework. Compared with ATOM, the tracking success rates when using only the attention module (Attention+ResNet), only the pyramid architecture (PyConvNet+ResNet), and both attention and pyramid architecture (PyConvNet+Attention+ResNet) improve by 1%, 2.8% and 4.2%, respectively. The tracking method based on the PyConvNet+Attention+ResNet model achieves the most significant improvement (success rate nearly 4.2% higher than ATOM) because the feature extraction model can automatically identify size changes of the target. Introducing the attention mechanism into the deep feature model can therefore effectively improve tracking performance.
In practice, many tracking methods can only achieve excellent performance on low-speed video sequences, mainly because they require a large amount of memory and have high computational complexity, which makes the computation for each frame slow. Such tracking methods therefore cannot obtain good tracking performance on high-speed video sequences. By adopting the attention mechanism and the pyramid design, the tracking method of the present invention can accurately capture targets of various sizes, locate targets stably in any video sequence, and show very strong performance (tracking precision 0.740, success rate 0.613) on the higher-frame-rate NFS dataset. These ablation results illustrate the effectiveness of the proposed feature extraction model.
5.5.2 training on different data sets
In this section, an experiment is performed to verify the effect of training the deep feature model on different training datasets on tracking performance. At the same time, the results of the tracking method based on the proposed feature extraction model APR-Net after training on different datasets are reported. After training the deep feature extraction model of the present invention on different combinations of the MS COCO [28], LaSOT [1], TrackingNet [29] and GOT-10k [30] training datasets, the evaluation is reported in terms of the area-under-curve score (AUC), the overlap-ratio precision (OP_0.50, OP_0.75) and the regularized tracking precision (Norm. Prec.), as shown in Table 2.
Tracking tasks differ from target recognition and detection tasks in that merely pre-training the feature extraction model on the MS COCO dataset leads to rather unsatisfactory tracking performance. Therefore, in addition to training on the MS COCO dataset, the proposed deep feature extraction model also needs to be trained on dedicated tracking video sequences. Without any change to the tracking framework, the proposed tracking method provides the best tracking performance after training on all four datasets (MS COCO, LaSOT, TrackingNet and GOT-10k), whereas training the feature extraction network on only one or two of the training datasets yields sub-optimal tracking performance.
As can be seen from the experimental results in table 2, under the same feature extraction network, the feature extraction model has more excellent tracking performance after being fully trained on a plurality of training data sets than the result without being fully pre-trained. It can be seen that the more sufficient the training data, the more stable the tracking performance is under the same feature extraction model. The tracking performance after training the feature extraction model on various data sets indicates the necessity of adequate end-to-end training and fine tuning over multiple data sets. It can thus be concluded that for tracking methods based on the same feature extraction model, the use of multiple larger data sets for sufficient training and fine tuning is why a powerful tracking performance is achieved. This means that the robustness of tracking can be effectively improved by increasing the number of training samples, thereby making the tracking performance better. Most importantly, efficient end-to-end training using multiple different data sets allows for tightly coupled operation of the individual components in the feature extraction model.
Table 2. After pre-training the deep feature extraction model of the present invention on different training datasets (including MS COCO, LaSOT, TrackingNet and GOT-10k), the tracking performance of the tracking method based on the feature extraction model APR-Net is reported in terms of area under the curve (AUC), overlap-ratio precision (OP_0.50, OP_0.75), precision and regularized precision (Norm. Prec.).
6 Conclusion
In the invention, the advantages of the pyramid residual error network and the attention mechanism are combined, a real-time effective tracking method is developed, and the requirements of tracking precision and robustness are met. Experiments on benchmark tracking datasets (including LaSOT, UAV123, UAV20L, VOT-2018 and NFS) show that the tracking method proposed by the present invention is superior to the advanced tracking method in terms of tracking performance. Through a plurality of experimental verification, the network architecture for feature extraction and sufficient training of the network architecture can certainly play a good role in the field of visual target tracking.
Although the proposed tracking method greatly improves tracking performance, there is still considerable room for improvement in tracking precision and real-time performance. To date there is no completely satisfactory solution, since existing solutions either trade complexity for accuracy or sacrifice accuracy for computational efficiency. In future work on tracking tasks, an attempt may be made to better understand the contributions of the various components of the tracking framework. More specifically, it is desirable to further improve tracking performance in terms of real-time speed, precision and success rate. By fusing a lightweight network into the tracking framework, a faster and more stable tracking method can be achieved. At the same time, it is hoped that more researchers will be prompted to study which components of the tracking framework can improve tracking performance more effectively.
The system, apparatus, module or unit described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (8)

1. The visual target tracking method APR-Net based on the attention pyramid residual error network is characterized by comprising the following steps of:
Designing an attention pyramid residual error network APR-Net feature extraction model, wherein the feature extraction model APR-Net is a pyramid residual error network with an attention mechanism;
taking the ATOM tracking method as the baseline tracking framework, the designed tracking method comprises four parts: a feature extractor based on the pyramid residual network PyConvResNet, an attention module, a classifier and a predictor; the feature extractor based on the pyramid residual network PyConvResNet is used to extract the features of each video frame image; the attention module is used to enhance the visual expression capability of the features and to help the feature extractor pay more attention to the regions where the target may be located; the attention module is integrated into the pyramid residual network model through a bitwise weighting operation, and the tracking problem is finally converted into a classification and prediction problem; the classifier is used to initially position the target and obtain an initial target box, and the predictor optimizes the target box by back propagation and corrects the tracking result of each frame to obtain a refined target box.
2. The visual target tracking method APR-Net based on the attention pyramid residual network according to claim 1, wherein the pyramid residual network pyconvres Net extracts multi-scale convolution features by using filters with different sizes and different depths in each convolution layer; a fixed-size image is input to PyConvResNet, and multi-scale features in each convolution layer are automatically learned by filters of different sizes and different depths.
3. The visual object tracking method APR-Net based on the attention pyramid residual network according to claim 1, wherein the feature extraction model APR-Net is composed of two parts: pyramid feature layer and mixed attention module; the pyramid network consists of four pyramid residual blocks, and the third and fourth pyramid residual blocks are respectively followed by a mixed attention module; processing the mixed attention layer into a bit-by-bit weighting operation; the final feature map generated by the model APR-Net contains various scale features and attention features of the target.
4. The visual object tracking method APR-Net based on the attention pyramid residual network according to claim 2, wherein the res Net-18 in the original ATOM is replaced by the APR-Net to extract more robust object appearance characteristics; an attention module is then placed between PyConvResNet and the pooling layer PrRoIPooling of the region of interest.
5. The visual target tracking method APR-Net based on the attention pyramid residual network according to claim 2, wherein the attention module specifically comprises: an attention factor λ is introduced into the feature x, and the deep visual feature is enhanced by the attention mechanism; the attention module A is formulated as:

A(x) = λ ⊙ x

where ⊙ denotes a bitwise weighting of the feature x; in order to reduce the number of parameters, the attention factor λ is further decomposed into a dual attention ε, used to increase the attention paid to the target, and a channel attention κ, which defines the feature channels, so that the attention model can be decomposed into:

A(x) = (ε ⊗ κ) ⊙ x

in order to effectively capture both the features that targets have in common across the images of different video frames and the differences between targets in the images of different video frames, ε can be further decomposed into the superposition of a general attention ε_g and a residual attention ε_r:

ε = ε_g + ε_r
6. The visual object tracking method APR-Net based on the attention pyramid residual network according to claim 1, wherein the predictor functions to accurately predict the target box by maximizing the IoU overlap rate in each frame through back propagation; the input template image of each frame has a size of 288 × 288, and all the output feature maps of each layer are resized to 38 × 38; the template branch models the target appearance from the first-frame image x_0 to obtain multi-scale features with attention z_{0,i}; the test branch extracts the target appearance features z_{t,i} of the image x_t in the current frame t and calculates a confidence value and an IoU overlap value, where the subscript i denotes the i-th position in the feature map; the output of the template branch is two 1 × D_z modulation vectors c(z_0); the test branch obtains a 1 × D_z feature vector z(x_t, B) through the PrRoI pooling operation; the output of the final test branch is the IoU overlap rate, given by:

IoU(B) = g(c(z_0) · z(x_t, B))
7. The visual object tracking method APR-Net based on the attention pyramid residual network according to claim 1, wherein the classifier specifically comprises: the goal of the classifier is to separate the target from its background in each frame of the continuous video sequence and obtain an initial target box; to capture the target in real time in the image of each video frame, the classifier is learned online; the target classification component in the tracking framework is defined as:

f(x; ω) = φ_2(ω_2 * φ_1(ω_1 * x))

and the objective function of the classification error may be defined as:

L(ω) = Σ_j α_j ||f(x_{t,j}; ω) − y_j||^2 + Σ_k λ_k ||ω_k||^2
8. The visual target tracking method APR-Net based on the attention pyramid residual network according to claim 7, wherein the target tracking positioning specifically comprises: in the target positioning stage, the tracking target is given by a rectangular box in the first frame; for subsequent video frames, the tracking method determines the target box position in the current frame according to the state information of the target in the previous frame; during tracking, all the output feature maps of the head layers are used to predict and classify the tracking target of each video frame; first, using the classification module, a confidence map is calculated according to the position P_{t-1} and the size S_{t-1} of the previous frame t-1, and the position P_t with the highest confidence score is taken as the position of the target in the current frame t; the position P_t in the current frame t and the size S_{t-1} calculated in the previous frame t-1 constitute the initial target box; next, 10 candidate regions are generated and their IoU overlap rates are calculated using the target prediction module; the boxes corresponding to the three largest values are taken, and their average is used as the final tracking box.