CN111696137B - Target tracking method based on multilayer feature mixing and attention mechanism

Target tracking method based on multilayer feature mixing and attention mechanism

Info

Publication number
CN111696137B
CN111696137B (application CN202010518472.1A)
Authority
CN
China
Prior art keywords
frame
cls
reg
network
target
Prior art date
Legal status
Active
Application number
CN202010518472.1A
Other languages
Chinese (zh)
Other versions
CN111696137A (en)
Inventor
王正宁
曾浩
潘力立
何庆东
刘怡君
曾仪
彭大伟
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010518472.1A
Publication of CN111696137A
Application granted
Publication of CN111696137B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method based on multilayer feature mixing and an attention mechanism. An improved FPN structure is used to better retain and exploit the shallow features of the image, and this improved FPN structure outputs fused features that combine multi-level, multi-scale information. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and erroneous tracking is reduced. At the same time, an attention mechanism makes the network pay more attention, on the spatial scale, to the positions where the target is likely to be, which reduces target loss and tracking errors caused by partial occlusion, deformation, illumination changes and the like.

Description

Target tracking method based on multilayer feature mixing and attention mechanism
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a target tracking method based on multilayer feature mixing and an attention mechanism.
Background
Visual target tracking is an important computer vision task that can be applied to visual surveillance, human-computer interaction, video compression and other fields. Despite extensive research on the problem, it remains difficult to handle complex changes in target appearance caused by illumination variation, partial occlusion, shape deformation and camera motion.
At present, target tracking algorithms fall into two main branches: one based on correlation filtering and the other based on deep learning. The target tracking method provided by the invention belongs to the deep-learning branch.
Deep-learning trackers mainly use the following models: convolutional neural networks, recurrent neural networks, generative adversarial networks, and twin (Siamese) neural networks. The convolutional-neural-network-based tracker proposed in "Learning Spatial-Aware Regressions for Visual Tracking", C. Sun, D. Wang, H. Lu, and M. Yang, in Proc. IEEE CVPR, 2018, pp. 8962-8970, constructs multiple target models to capture diverse target appearances, learns distinct target models, handles partial occlusion and deformation with part-based models, and uses a two-stream network to prevent overfitting while learning the rotation of the target. Although this method has made great progress in the accuracy of target estimation, convolutional-neural-network-based methods of this type still have high computational complexity. The invention patent CN110780290A, a multi-maneuvering-target tracking method based on an LSTM network, is a recurrent-neural-network-based tracker that uses context information to handle the influence of similar backgrounds on the tracked target. Because visual target tracking depends on both the spatial and the temporal information of the video frames, recurrent-neural-network-based approaches can also take the motion of the target into account. The number of such methods is nevertheless limited, because the large number of parameters in the model makes training difficult. Almost all of these approaches attempt to improve target modelling with additional information and memory; a second motivation for using recurrent networks is to avoid fine-tuning a pre-trained CNN model, which costs much time and is prone to overfitting. "VITAL: Visual Tracking via Adversarial Learning", Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M. H. Yang, in Proc. IEEE CVPR, 2018, pp. 8990-8999, performs target tracking with a generative adversarial network: the required samples can be generated to alleviate the unbalanced distribution of training samples and to compensate for an insufficient number of samples. Generative adversarial networks, however, are often difficult to train and evaluate, and solving this problem in practice requires considerable skill. The invention patent CN110728697A, an infrared dim and small target detection and tracking method based on a convolutional neural network, tracks the target with a twin network, extracting deep features of the image and performing feature matching to complete the tracking.
To address the problems in existing deep-learning trackers of uneven utilization of target features and of occlusion, partial occlusion, illumination change and deformation of the tracked object, the present method builds on a twin network, combines shallow and deep features with several FPNs, and uses an attention mechanism to improve robustness.
Disclosure of Invention
The invention belongs to the fields of computer vision and deep learning. By improving the feature-extraction part and the region proposal network part of a twin network, the whole target tracking network obtains stronger feature-extraction capability and robustness. The invention provides a target tracking method based on multilayer feature mixing and an attention mechanism, which comprises the following specific steps:
(1) Before training, the data set is preprocessed: the training data consists of video sequences labelled with the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features of the template frame and the search frame, and they form a twin network N_S by weight sharing. The residual network is obtained from the standard ResNet-50 by removing the padding of the first 7 × 7 convolution and changing its last two stride-2 convolutions to stride 1. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and of the search frame F_c at different levels of the network, where M is the index of the ResNet block in which the feature map resides and N is the position within that block.
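A minimal sketch of this modified backbone, built by patching torchvision's ResNet-50 (an implementation assumption; the text only describes the two modifications in words): the padding of the first 7 × 7 convolution is removed, and the stride-2 convolutions of the last two stages are set to stride 1. Weight sharing between N_1 and N_2 is obtained simply by applying the same module to both inputs.

```python
import torch.nn as nn
import torchvision

def make_backbone():
    """Modified ResNet-50 used as one branch of the twin network (a sketch)."""
    net = torchvision.models.resnet50()
    # 1) first 7x7 convolution without padding
    net.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=0, bias=False)
    # 2) change the stride-2 convolutions of the last two stages to stride 1
    for layer in (net.layer3, net.layer4):
        block = layer[0]                      # first bottleneck of the stage
        block.conv2.stride = (1, 1)
        if block.downsample is not None:
            block.downsample[0].stride = (1, 1)
    return net

class SiameseBackbone(nn.Module):
    """Twin network N_S: N_1 and N_2 share weights, so one backbone serves both."""
    def __init__(self):
        super().__init__()
        self.backbone = make_backbone()

    def forward_features(self, x):
        # Return the per-stage outputs Conv1_1, Conv2_3, Conv3_3, Conv4_6, Conv5_3.
        b = self.backbone
        c1 = b.relu(b.bn1(b.conv1(x)))        # Conv1_1
        c2 = b.layer1(b.maxpool(c1))          # Conv2_3
        c3 = b.layer2(c2)                     # Conv3_3
        c4 = b.layer3(c3)                     # Conv4_6
        c5 = b.layer4(c4)                     # Conv5_3
        return c1, c2, c3, c4, c5

    def forward(self, template, search):
        return self.forward_features(template), self.forward_features(search)
```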
(3) A feature pyramid network FPN is designed, comprising three FPNs: FPN1, FPN2 and FPN3 respectively fuse the 3 groups of outputs of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3), (Conv1_1, Conv2_3, Conv4_6) and (Conv1_1, Conv2_3, Conv5_3), to obtain 3 groups of fused features. Each FPN receives 3 feature maps of different scales, denoted, from large to small and from shallow to deep, F_1, F_2, F_3. Feature fusion is done by point-to-point addition: a 1 × 1 convolution adjusts the channel count of one feature so that the two features have the same number of channels, then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and a fused feature F_M is finally output, whose size equals that of F_3. Finally the three FPNs output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame.
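The following sketch shows one way the pairwise fusion described above could be realised, assuming the shallower map is always folded into the deeper one (so that the final F_M takes the size of F_3); the exact fusion order and the interpolation fallback for off-by-one sizes are assumptions, since the text only names the 1 × 1 channel-matching convolution and the two resizing operations.

```python
import torch.nn as nn
import torch.nn.functional as F

class PairFusion(nn.Module):
    """Fuse two feature maps by point-to-point addition (a sketch of step (3)).
    A 1x1 convolution matches the channel count of the shallow map to the deep
    map, and stride-2 3x3 convolutions match the spatial size before the sum."""
    def __init__(self, shallow_ch, deep_ch):
        super().__init__()
        self.match_ch = nn.Conv2d(shallow_ch, deep_ch, kernel_size=1)
        self.down = nn.Conv2d(deep_ch, deep_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, shallow, deep):
        x = self.match_ch(shallow)
        # Halve the shallow map until it is close to the deep map's size; any
        # remaining off-by-one difference is interpolated away (assumption).
        while min(x.shape[-2:]) >= 2 * max(deep.shape[-2:]):
            x = self.down(x)
        if x.shape[-2:] != deep.shape[-2:]:
            x = F.interpolate(x, size=deep.shape[-2:], mode='bilinear',
                              align_corners=False)
        return x + deep

class MixedFPN(nn.Module):
    """One of the three FPNs: fuses (F_1, F_2, F_3) into F_M with the size of F_3."""
    def __init__(self, ch1, ch2, ch3):
        super().__init__()
        self.fuse12 = PairFusion(ch1, ch2)
        self.fuse23 = PairFusion(ch2, ch3)

    def forward(self, f1, f2, f3):
        f12 = self.fuse12(f1, f2)     # shallow detail folded into the middle level
        return self.fuse23(f12, f3)   # result folded into the deepest level -> F_M
```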
(4) A region proposal network RPN is designed, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes. These two outputs are produced by two paths: the upper half of the RPN outputs the classification CLS and the lower half outputs the regression REG. The RPN first crops the template-frame mixed feature F_M(F_t) from the edges; the number of channels of the mixed features differs between combinations. Adjusting convolutions then bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O.
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k. In the output, the w_res × h_res dimensions are in a spatial linear relationship with the original w_c × h_c image; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, centred at that position. The 2k channels of CLS_O give, for each of the k anchor boxes, the probability P_pos that it contains the target and the probability P_neg that it does not. The 4k channels of REG_O give the width-height differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is:
dx = (T_x - A_x) / A_w,  dy = (T_y - A_y) / A_h,  dw = ln(T_w / A_w),  dh = ln(T_h / A_h)    (1)
where A_x, A_y denote the centre of the anchor (reference) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the centre coordinates and width and height of the ground-truth box; the final target is then selected by non-maximum suppression and similar methods;
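Below is a sketch of one RPN branch that is consistent with the tensor sizes given in the embodiment, using the up-channel cross-correlation of SiamRPN-style trackers (an assumption: the patent only states that the adjusted template features are cross-correlated with the adjusted search features), together with the inverse of equation (1) used to recover predicted boxes from the offsets. Layer hyper-parameters such as kernel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """One RPN branch: template features become correlation kernels (a sketch).
    Channel counts follow the embodiment (512-channel mixed features, k anchors)."""
    def __init__(self, channels=512, k=5):
        super().__init__()
        self.k = k
        # "adjusting convolutions": template -> kernels, search -> feature maps
        self.t_cls = nn.Conv2d(channels, 2 * k * channels, kernel_size=3)
        self.t_reg = nn.Conv2d(channels, 4 * k * channels, kernel_size=3)
        self.s_cls = nn.Conv2d(channels, channels, kernel_size=3)
        self.s_reg = nn.Conv2d(channels, channels, kernel_size=3)

    @staticmethod
    def _xcorr(search, kernel, out_ch):
        # Up-channel cross-correlation: each group of `channels` kernel planes
        # produces one output channel (batch size 1 assumed for brevity).
        b, c, h, w = kernel.shape
        kernel = kernel.view(out_ch, c // out_ch, h, w)
        return F.conv2d(search, kernel)

    def forward(self, f_t, f_c):
        # f_t: 1 x C x 7 x 7 cropped template feature, f_c: 1 x C x 31 x 31
        k_cls = self.t_cls(f_t)                          # 1 x (2k*C) x 5 x 5
        k_reg = self.t_reg(f_t)                          # 1 x (4k*C) x 5 x 5
        x_cls = self.s_cls(f_c)                          # 1 x C x 29 x 29
        x_reg = self.s_reg(f_c)                          # 1 x C x 29 x 29
        cls_o = self._xcorr(x_cls, k_cls, 2 * self.k)    # 1 x 2k x 25 x 25
        reg_o = self._xcorr(x_reg, k_reg, 4 * self.k)    # 1 x 4k x 25 x 25
        return cls_o, reg_o

def decode_boxes(reg, anchors):
    """Invert equation (1): recover predicted boxes from (dx, dy, dw, dh).
    anchors, reg: tensors of shape (N, 4) holding (x, y, w, h) / offsets."""
    ax, ay, aw, ah = anchors.unbind(dim=1)
    dx, dy, dw, dh = reg.unbind(dim=1)
    return torch.stack([dx * aw + ax, dy * ah + ay,
                        aw * torch.exp(dw), ah * torch.exp(dh)], dim=1)
```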
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module. Average pooling, max pooling, convolution and a Sigmoid activation produce spatial attention weights SA_c and SA_r of size w_res × h_res × 1. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and added back to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG;
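A sketch of this spatial attention module: the 7 × 7 convolution kernel is an assumption borrowed from CBAM-style spatial attention, while the rest follows the description (channel-wise average and max pooling, convolution, Sigmoid, position-wise multiplication and residual addition).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention applied to CLS_O or REG_O (a sketch of step (6))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                   # x: B x C x H x W
        avg = torch.mean(x, dim=1, keepdim=True)            # B x 1 x H x W
        mx, _ = torch.max(x, dim=1, keepdim=True)           # B x 1 x H x W
        sa = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # SA_c or SA_r
        return x * sa + x                                   # weighted map plus residual
```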
(7) The outputs of the three RPNs, RPN1, RPN2 and RPN3, are combined by weighted addition to form the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3    (2)
where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
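The weighted addition of equation (2) is straightforward; a sketch with the 0.2 / 0.3 / 0.5 weights of the embodiment used as defaults (any other preset α_i, β_i can be substituted):

```python
def fuse_rpn_outputs(cls_list, reg_list,
                     alphas=(0.2, 0.3, 0.5), betas=(0.2, 0.3, 0.5)):
    """Weighted addition of the three RPN outputs, equation (2)."""
    cls = sum(a * c for a, c in zip(alphas, cls_list))
    reg = sum(b * r for b, r in zip(betas, reg_list))
    return cls, reg
```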
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates. y denotes the label value and ŷ the network's classification output, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the width-height differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)],  L_reg = smooth_L1(dx_T - dx) + smooth_L1(dy_T - dy) + smooth_L1(dw_T - dw) + smooth_L1(dh_T - dh)    (3)
wherein:
smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise    (4)
the final loss function is as follows:
loss = L_cls + λ·L_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
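A sketch of this training loss, combining the cross-entropy classification loss and the smoothed L1 regression loss as in equation (5). The anchor-channel layout and the restriction of the regression loss to positive anchors are assumptions, and the default value of λ (lam) is only a placeholder since the patent does not give one.

```python
import torch.nn.functional as F

def tracking_loss(cls, reg, labels, reg_targets, lam=1.0):
    """Training loss of step (8), a sketch.

    cls:         B x 2k x H x W anchor classification scores
    reg:         B x 4k x H x W predicted offsets (dx, dy, dw, dh)
    labels:      B x k  x H x W, 1 where the anchor matches the target, else 0
    reg_targets: B x 4k x H x W ground-truth offsets dx_T, dy_T, dw_T, dh_T
    """
    b, twok, h, w = cls.shape
    k = twok // 2
    # Cross-entropy over the (negative, positive) score pair of every anchor;
    # the (k, 2) channel grouping is an assumption.
    cls = cls.view(b, k, 2, h, w).permute(0, 1, 3, 4, 2).reshape(-1, 2)
    flat_labels = labels.reshape(-1).long()
    l_cls = F.cross_entropy(cls, flat_labels)
    # Smoothed L1 on the normalized offsets, counted only for positive anchors.
    pos = labels.float().view(b, k, 1, h, w)
    reg = reg.view(b, k, 4, h, w)
    reg_targets = reg_targets.view(b, k, 4, h, w)
    l1 = F.smooth_l1_loss(reg, reg_targets, reduction='none')
    l_reg = (l1 * pos).sum() / pos.sum().clamp(min=1)
    return l_cls + lam * l_reg            # equation (5)
```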
The present invention uses an improved FPN structure. Whereas the deep features of a conventional FPN retain too little of the shallow features, the improved FPN structure better retains and exploits the shallow features of the image and outputs fused features that combine multi-level, multi-scale information. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and erroneous tracking is reduced. At the same time, an attention mechanism makes the network pay more attention, on the spatial scale, to positions where the target is likely to be, which reduces target loss and tracking errors caused by partial occlusion, deformation, illumination changes and the like.
Drawings
FIG. 1 is a diagram of a template frame and a search frame according to the present invention
FIG. 2 is an overall structure diagram of the target tracking network of the present invention
FIG. 3 is a diagram of the FPN structure of the present invention
FIG. 4 is a diagram of the RPN structure of the present invention
FIG. 5 is a diagram illustrating the RPN output result of the present invention
FIG. 6 is a flow chart of target tracking network training in accordance with the present invention
Detailed Description
The following detailed description of the embodiments and the working principles of the present invention will be made with reference to the accompanying drawings.
The invention provides a target tracking method based on a multilayer feature mixing and attention mechanism, which comprises the following specific steps:
(1) The data set is preprocessed before training. The training data consists of video sequences labelled with the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, as shown in Fig. 1 and Fig. 2. The template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features of the template frame and the search frame, and they form a twin network N_S by weight sharing. The residual network is obtained from the standard ResNet-50 by removing the padding of the first 7 × 7 convolution and changing its last two stride-2 convolutions to stride 1. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and of the search frame F_c at different levels of the network, where M is the index of the ResNet block in which the feature map resides and N is the position within that block.
(3) A feature pyramid network (FPN) is designed: the three FPNs (FPN1, FPN2, FPN3) respectively fuse the 3 groups of outputs of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3), (Conv1_1, Conv2_3, Conv4_6) and (Conv1_1, Conv2_3, Conv5_3), and 3 groups of fused features are obtained.
The specific structure of a single FPN used in the present invention is shown in Fig. 3. Each FPN receives 3 feature maps of different scales, denoted, from large to small and from shallow to deep, F_1, F_2, F_3. Feature fusion is done by point-to-point addition: a 1 × 1 convolution adjusts the channel count of one feature so that the two features have the same number of channels, then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and a fused feature F_M is finally output, whose size equals that of F_3. Finally the three FPNs output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame.
(4) Region proposal network (RPN): the three RPNs (RPN1, RPN2, RPN3) respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes, as shown in Fig. 2.
(5) The RPN must output the classification CLS and the regression REG of the proposal boxes, and these two outputs are produced by two paths: in Fig. 2 the upper half of the RPN outputs the classification CLS and the lower half outputs the regression REG. The RPN first crops the template-frame mixed feature F_M(F_t) from the edges; c' in Fig. 4 denotes the current number of channels of the mixed feature, which differs between combinations. Adjusting convolutions then bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O.
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k, as shown in Fig. 5. In the output, the w_res × h_res dimensions are in a spatial linear relationship with the original w_c × h_c image; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, centred at that position. The 2k channels of CLS_O give, for each of the k anchor boxes, the probability P_pos that it contains the target and the probability P_neg that it does not. The 4k channels of REG_O give the width-height differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is:
dx = (T_x - A_x) / A_w,  dy = (T_y - A_y) / A_h,  dw = ln(T_w / A_w),  dh = ln(T_h / A_h)    (1)
where A_x, A_y denote the centre of the anchor (reference) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the centre coordinates and width and height of the ground-truth box. The final target is then selected by non-maximum suppression and similar methods.
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module, shown in Fig. 4. Average pooling, max pooling, convolution and a Sigmoid activation produce spatial attention weights SA_c and SA_r of size w_res × h_res × 1. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and added back to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG.
(7) The outputs of the three RPNs (RPN1, RPN2, RPN3) are combined by weighted addition to form the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3    (2)
where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates. y denotes the label value and ŷ the network's classification output, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the width-height differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)],  L_reg = smooth_L1(dx_T - dx) + smooth_L1(dy_T - dy) + smooth_L1(dw_T - dw) + smooth_L1(dh_T - dh)    (3)
wherein:
smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise    (4)
the final loss function is as follows:
loss = L_cls + λ·L_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
The key parameters of one embodiment of the present invention are listed in Table 1; the specific parameters marked in the drawings are based on these implementation parameters:
TABLE 1 example parameters
(Table 1 is reproduced as an image in the original publication; its values correspond to the implementation parameters described below.)
The training process of the target tracking network designed by the invention is shown in Fig. 6. The specific training procedure and the related parameters of this implementation are as follows:
A video sequence in the data set is processed: according to the label information, a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels are obtained by cropping.
The template frame F_t and the search frame F_c are fed into the feature-extraction networks ResNet_N_1 and ResNet_N_2 of Fig. 2, and features of five different depth levels are extracted; the two feature-extraction networks share weights.
The three feature pyramid networks shown in Fig. 3, FPN1, FPN2 and FPN3, fuse the features of the template frame F_t and of the search frame F_c extracted at different depth levels: FPN1 fuses the features from the first, second and third blocks (layers), FPN2 fuses the features from the first, second and fourth blocks, and FPN3 fuses the features from the first, second and fifth blocks, as shown in Fig. 2. The three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame. The mixed features of the template frame are all of size 15 × 15 × 512, and the mixed features of the search frame are all of size 31 × 31 × 512.
The three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) are fed into the three region proposal networks RPN1, RPN2 and RPN3 respectively, as shown in Fig. 2. Each region proposal network has the same structure, shown in Fig. 4, with 5 anchor boxes, i.e. k = 5. First the template-frame mixed feature F_M(F_t) is cropped to remove the surrounding elements, giving a size of 7 × 7 × 512; four convolution layers then adjust the channel counts of F_M(F_t) and of the search-frame mixed feature F_M(F_c), giving: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512.
[F_M(F_t)]_c is cross-correlated with [F_M(F_c)]_c, and [F_M(F_t)]_r with [F_M(F_c)]_r, giving the intermediate classification result CLS_O and the intermediate regression result REG_O, where CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20.
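The sizes quoted above are consistent with interpreting the 5 × 5 × (10 × 512) and 5 × 5 × (20 × 512) template tensors as 10 and 20 correlation kernels of 512 channels each (an assumption about the layout); a quick shape check:

```python
import torch
import torch.nn.functional as F

# Shape check of the cross-correlation step with the embodiment's sizes
# (k = 5 anchors, 512-channel mixed features); purely illustrative.
k = 5
kernel_cls = torch.randn(2 * k, 512, 5, 5)    # [F_M(F_t)]_c reshaped: 5 x 5 x (10 x 512)
kernel_reg = torch.randn(4 * k, 512, 5, 5)    # [F_M(F_t)]_r reshaped: 5 x 5 x (20 x 512)
search_cls = torch.randn(1, 512, 29, 29)      # [F_M(F_c)]_c
search_reg = torch.randn(1, 512, 29, 29)      # [F_M(F_c)]_r

cls_o = F.conv2d(search_cls, kernel_cls)      # -> 1 x 10 x 25 x 25
reg_o = F.conv2d(search_reg, kernel_reg)      # -> 1 x 20 x 25 x 25
print(cls_o.shape, reg_o.shape)
```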
CLS_O and REG_O are fed into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r, and the products are added to the original CLS_O and REG_O, giving the final RPN classification result CLS and regression result REG. CLS has the same size as CLS_O, and REG has the same size as REG_O. The "spatial attention" block in the flow chart performs these steps.
The classification results and regression results output by RPN1, RPN2 and RPN3 are weighted and added with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and proposal-box regression result. The loss is computed and optimized according to equations (3), (4) and (5). When the preset number of training rounds, 50, is reached, training ends and testing is performed.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except for mutually exclusive features and/or steps. Any non-essential addition to or replacement of the technical features of the technical scheme of the invention made by a person skilled in the art belongs to the protection scope of the invention.

Claims (2)

1. A target tracking method based on multilayer feature mixing and an attention mechanism, characterized by comprising the following steps:
(1) Before training, the data set is preprocessed: the training data consists of video sequences labelled with the position and size of the target object; the target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is to be found; the original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, wherein the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame;
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features of the template frame and the search frame, and they form a twin network N_S by weight sharing; the residual network is obtained from the standard ResNet-50 by removing the padding of the first 7 × 7 convolution and changing its last two stride-2 convolutions to stride 1; the template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through operations including convolution, pooling and activation; ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and of the search frame F_c at different levels of the network, where M is the index of the ResNet block in which the feature output ConvM_N(F_t) or ConvM_N(F_c) resides and N is the specific position within that block;
(3) A feature pyramid network FPN is designed, comprising three FPNs: FPN1, FPN2 and FPN3 respectively fuse the 3 groups of outputs of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3), (Conv1_1, Conv2_3, Conv4_6) and (Conv1_1, Conv2_3, Conv5_3), to obtain 3 groups of fused features; each FPN receives 3 feature maps of different scales, denoted, from large to small and from shallow to deep, F_1, F_2, F_3; feature fusion is done by point-to-point addition: a 1 × 1 convolution adjusts the channel count of one feature so that the two features have the same number of channels, then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two features have the same size, after which point-to-point addition, i.e. feature fusion, is performed; the 3 features are fused and a fused feature F_M is finally output, whose size equals that of F_3; finally the three FPNs output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame;
(4) A region proposal network RPN is designed, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes, and these two outputs are produced by two paths: the upper half of the RPN outputs the classification CLS and the lower half outputs the regression REG; the RPN first crops the template-frame mixed feature F_M(F_t) from the edges, the number of channels of the mixed features differing between combinations; adjusting convolutions then bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r; cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k; in the output, the w_res × h_res dimensions are in a spatial linear relationship with the original w_c × h_c image; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, centred at that position; the 2k channels of CLS_O give, for each of the k anchor boxes, the probability P_pos that it contains the target and the probability P_neg that it does not; the 4k channels of REG_O give the width-height differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box, whose relationship to the actual target box is:
dx = (T_x - A_x) / A_w,  dy = (T_y - A_y) / A_h,  dw = ln(T_w / A_w),  dh = ln(T_h / A_h)    (1)
where A_x, A_y denote the centre of the anchor (reference) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the centre coordinates and width and height of the ground-truth box; the final target is then selected by a non-maximum suppression method;
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module; average pooling, max pooling, convolution and a Sigmoid activation produce spatial attention weights SA_c and SA_r of size w_res × h_res × 1; CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and added back to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG;
(7) The outputs of the three RPNs, RPN1, RPN2 and RPN3, are combined by weighted addition to form the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3    (2)
where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights;
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates; y denotes the label value and ŷ the network's classification output, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the width-height differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values; the loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)],  L_reg = smooth_L1(dx_T - dx) + smooth_L1(dy_T - dy) + smooth_L1(dw_T - dw) + smooth_L1(dh_T - dh)    (3)
wherein:
smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise    (4)
the final loss function is as follows:
loss = L_cls + λ·L_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
2. The target tracking method based on the multi-layer feature mixing and attention mechanism as claimed in claim 1, wherein the step (8) of training the target tracking network specifically comprises:
processing the video sequences in the data set: according to the label information, a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels are obtained by cropping;
feeding the template frame F_t and the search frame F_c into the feature-extraction networks ResNet_N_1 and ResNet_N_2 and extracting features of five different depth levels, wherein the two feature-extraction networks share weights;
fusing, with the three feature pyramid networks FPN1, FPN2 and FPN3, the features of the template frame F_t and of the search frame F_c extracted at different depth levels, wherein FPN1 fuses the features obtained from the first, second and third blocks (layers one, two and three), FPN2 fuses the features obtained from the first, second and fourth blocks, and FPN3 fuses the features obtained from the first, second and fifth blocks; the three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame; the mixed features of the template frame are all of size 15 × 15 × 512, and the mixed features of the search frame are all of size 31 × 31 × 512;
feeding the three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) into the three region proposal networks RPN1, RPN2 and RPN3 respectively, wherein each region proposal network has the same structure and 5 anchor boxes are set in total, i.e. k = 5; the template-frame mixed feature F_M(F_t) is first cropped to remove the surrounding elements, giving a size of 7 × 7 × 512, and four convolution layers adjust the channel counts of F_M(F_t) and of the search-frame mixed feature F_M(F_c), giving: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512;
cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r to obtain the intermediate classification result CLS_O and the intermediate regression result REG_O, wherein CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20;
feeding CLS_O and REG_O into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r; multiplying CLS_O and REG_O position-wise by SA_c and SA_r and adding the products to the original CLS_O and REG_O to obtain the final RPN classification result CLS and regression result REG, wherein CLS has the same size as CLS_O and REG has the same size as REG_O; the "spatial attention" block performs the above steps;
weighting and adding the classification results and regression results output by RPN1, RPN2 and RPN3 with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and proposal-box regression result, and computing and optimizing the loss according to equations (3), (4) and (5); when the preset number of training rounds, 50, is reached, training ends and testing is performed.
CN202010518472.1A 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism Active CN111696137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010518472.1A CN111696137B (en) 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010518472.1A CN111696137B (en) 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism

Publications (2)

Publication Number Publication Date
CN111696137A CN111696137A (en) 2020-09-22
CN111696137B true CN111696137B (en) 2022-08-02

Family

ID=72479929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010518472.1A Active CN111696137B (en) 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism

Country Status (1)

Country Link
CN (1) CN111696137B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258557B (en) * 2020-10-23 2022-06-10 福州大学 Visual tracking method based on space attention feature aggregation
CN112288778B (en) * 2020-10-29 2022-07-01 电子科技大学 Infrared small target detection method based on multi-frame regression depth network
CN112308013B (en) * 2020-11-16 2023-03-31 电子科技大学 Football player tracking method based on deep learning
CN112489088A (en) * 2020-12-15 2021-03-12 东北大学 Twin network visual tracking method based on memory unit
CN112651954A (en) * 2020-12-30 2021-04-13 广东电网有限责任公司电力科学研究院 Method and device for detecting insulator string dropping area
CN112669350A (en) * 2020-12-31 2021-04-16 广东电网有限责任公司电力科学研究院 Adaptive feature fusion intelligent substation human body target tracking method
CN112785624B (en) * 2021-01-18 2023-07-04 苏州科技大学 RGB-D characteristic target tracking method based on twin network
CN113298850B (en) * 2021-06-11 2023-04-21 安徽大学 Target tracking method and system based on attention mechanism and feature fusion
CN114120056A (en) * 2021-10-29 2022-03-01 中国农业大学 Small target identification method, small target identification device, electronic equipment, medium and product
CN114399533B (en) * 2022-01-17 2024-04-16 中南大学 Single-target tracking method based on multi-level attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201908574D0 (en) * 2019-06-14 2019-07-31 Vision Semantics Ltd Optimised machine learning
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid
CN110704665A (en) * 2019-08-30 2020-01-17 北京大学 Image feature expression method and system based on visual attention mechanism
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030053658A1 (en) * 2001-06-29 2003-03-20 Honeywell International Inc. Surveillance system and methods regarding same
CN110349185B (en) * 2019-07-12 2022-10-11 安徽大学 RGBT target tracking model training method and device
CN111192292B (en) * 2019-12-27 2023-04-28 深圳大学 Target tracking method and related equipment based on attention mechanism and twin network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201908574D0 (en) * 2019-06-14 2019-07-31 Vision Semantics Ltd Optimised machine learning
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid
CN110704665A (en) * 2019-08-30 2020-01-17 北京大学 Image feature expression method and system based on visual attention mechanism
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Bridging the Gap Between Detection and Tracking: A Unified Approach"; Huang L. H. et al.; IEEE; 2020-02-27; full text *
"Indoor crowd detection network based on multi-level features and hybrid attention mechanism"; Shen Wenxiang et al.; Journal of Computer Applications; 2019-12-10; Vol. 39, No. 12; full text *
"Research on optical remote sensing target detection based on deep feature enhancement"; Hu Tao; China Master's Theses Full-text Database, Engineering Science and Technology II; 2020-02-15 (No. 02, 2020); full text *

Also Published As

Publication number Publication date
CN111696137A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696137B (en) Target tracking method based on multilayer feature mixing and attention mechanism
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
Wang et al. Self-supervised multiscale adversarial regression network for stereo disparity estimation
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN110705448A (en) Human body detection method and device
CN111696136B (en) Target tracking method based on coding and decoding structure
CN108805151B (en) Image classification method based on depth similarity network
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN109766822A (en) Gesture identification method neural network based and system
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114612832A (en) Real-time gesture detection method and device
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN116343334A (en) Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture
CN115116139A (en) Multi-granularity human body action classification method based on graph convolution network
CN115222998A (en) Image classification method
CN114743162A (en) Cross-modal pedestrian re-identification method based on generation of countermeasure network
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN114882234A (en) Construction method of multi-scale lightweight dense connected target detection network
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN112990154B (en) Data processing method, computer equipment and readable storage medium
Ma et al. Land Use Classification of High-Resolution Multispectral Satellite Images With Fine-Grained Multiscale Networks and Superpixel Postprocessing
CN110197226B (en) Unsupervised image translation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant