CN111696137B - Target tracking method based on multilayer feature mixing and attention mechanism

Target tracking method based on multilayer feature mixing and attention mechanism

Info

Publication number
CN111696137B
CN111696137B (application CN202010518472.1A)
Authority
CN
China
Prior art keywords
frame
cls
reg
network
target
Prior art date
Legal status
Active
Application number
CN202010518472.1A
Other languages
Chinese (zh)
Other versions
CN111696137A (en)
Inventor
王正宁
曾浩
潘力立
何庆东
刘怡君
曾仪
彭大伟
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010518472.1A
Publication of CN111696137A
Application granted
Publication of CN111696137B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method based on multilayer feature mixing and an attention mechanism. An improved FPN structure is used to better retain and exploit the shallow features of the image, and this improved FPN structure outputs fused features that combine multi-level, multi-scale information. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and erroneous tracking is reduced. At the same time, an attention mechanism makes the network pay more attention, on the spatial scale, to the positions where the target is likely to be, which reduces target loss and tracking errors caused by partial occlusion, deformation, illumination changes and the like.

Description

Target tracking method based on multilayer feature mixing and attention mechanism
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a target tracking method based on multilayer feature mixing and an attention mechanism.
Background
Visual target tracking is an important computer vision task that can be applied to visual surveillance, human-computer interaction, video compression and other fields. Despite extensive research on the problem, it remains difficult to handle complex changes in target appearance caused by illumination variation, partial occlusion, shape deformation and camera motion.
At present, target tracking algorithms fall into two main branches: one based on correlation filtering and the other based on deep learning. The target tracking method provided by the invention belongs to the deep-learning branch.
Deep-learning trackers mainly use the following models: convolutional neural networks, recurrent neural networks, generative adversarial networks, and twin (Siamese) neural networks. The convolutional-neural-network-based tracker proposed in "Learning Spatial-Aware Regressions for Visual Tracking", C. Sun, D. Wang, H. Lu, and M. Yang, in Proc. IEEE CVPR, 2018, pp. 8962-8970, constructs multiple target models to capture diverse target appearances, learns distinct target models, handles partial occlusion and deformation with part-based models, and uses a two-stream network to prevent overfitting while learning the rotation of the target. Although this method has made great progress in the accuracy of target estimation, convolutional-neural-network-based methods of this type still have high computational complexity. The invention patent CN110780290A, a multi-maneuvering-target tracking method based on an LSTM network, is a recurrent-neural-network-based tracker that uses context information to handle the influence of similar backgrounds on the tracked target. Because visual target tracking depends on both the spatial and the temporal information of the video frames, recurrent-neural-network-based approaches can also take the motion of the target into account. The number of such methods is nevertheless limited, because the large number of parameters in the model makes training difficult. Almost all of these approaches attempt to improve target modelling with additional information and memory; a second motivation for using recurrent networks is to avoid fine-tuning a pre-trained CNN model, which costs much time and is prone to overfitting. "VITAL: Visual Tracking via Adversarial Learning", Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M. H. Yang, in Proc. IEEE CVPR, 2018, pp. 8990-8999, performs target tracking with a generative adversarial network: the required samples can be generated to alleviate the unbalanced distribution of training samples and to compensate for an insufficient number of samples. Generative adversarial networks, however, are often difficult to train and evaluate, and solving this problem in practice requires considerable skill. The invention patent CN110728697A, an infrared dim and small target detection and tracking method based on a convolutional neural network, tracks the target with a twin network, extracting deep features of the image and performing feature matching to complete the tracking.
To address the problems in existing deep-learning trackers of uneven utilization of target features and of occlusion, partial occlusion, illumination change and deformation of the tracked object, the present method builds on a twin network, combines shallow and deep features with several FPNs, and uses an attention mechanism to improve robustness.
Disclosure of Invention
The invention belongs to the fields of computer vision and deep learning. By improving the feature-extraction part and the region proposal network part of a twin network, the whole target tracking network obtains stronger feature-extraction capability and robustness. The invention provides a target tracking method based on multilayer feature mixing and an attention mechanism, which comprises the following specific steps:
(1) Before training, the data set is preprocessed: the training data consists of video sequences labelled with the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features of the template frame and the search frame, and they form a twin network N_S by weight sharing. The residual network is obtained from the standard ResNet-50 by removing the padding of the first 7 × 7 convolution and changing its last two stride-2 convolutions to stride 1. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and of the search frame F_c at different levels of the network, where M is the index of the ResNet block in which the feature map resides and N is the position within that block.
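A minimal sketch of this modified backbone, built by patching torchvision's ResNet-50 (an implementation assumption; the text only describes the two modifications in words): the padding of the first 7 × 7 convolution is removed, and the stride-2 convolutions of the last two stages are set to stride 1. Weight sharing between N_1 and N_2 is obtained simply by applying the same module to both inputs.

```python
import torch.nn as nn
import torchvision

def make_backbone():
    """Modified ResNet-50 used as one branch of the twin network (a sketch)."""
    net = torchvision.models.resnet50()
    # 1) first 7x7 convolution without padding
    net.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=0, bias=False)
    # 2) change the stride-2 convolutions of the last two stages to stride 1
    for layer in (net.layer3, net.layer4):
        block = layer[0]                      # first bottleneck of the stage
        block.conv2.stride = (1, 1)
        if block.downsample is not None:
            block.downsample[0].stride = (1, 1)
    return net

class SiameseBackbone(nn.Module):
    """Twin network N_S: N_1 and N_2 share weights, so one backbone serves both."""
    def __init__(self):
        super().__init__()
        self.backbone = make_backbone()

    def forward_features(self, x):
        # Return the per-stage outputs Conv1_1, Conv2_3, Conv3_3, Conv4_6, Conv5_3.
        b = self.backbone
        c1 = b.relu(b.bn1(b.conv1(x)))        # Conv1_1
        c2 = b.layer1(b.maxpool(c1))          # Conv2_3
        c3 = b.layer2(c2)                     # Conv3_3
        c4 = b.layer3(c3)                     # Conv4_6
        c5 = b.layer4(c4)                     # Conv5_3
        return c1, c2, c3, c4, c5

    def forward(self, template, search):
        return self.forward_features(template), self.forward_features(search)
```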
(3) A feature pyramid network FPN is designed, comprising three FPNs: FPN1, FPN2 and FPN3 respectively fuse the 3 groups of outputs of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3), (Conv1_1, Conv2_3, Conv4_6) and (Conv1_1, Conv2_3, Conv5_3), to obtain 3 groups of fused features. Each FPN receives 3 feature maps of different scales, denoted, from large to small and from shallow to deep, F_1, F_2, F_3. Feature fusion is done by point-to-point addition: a 1 × 1 convolution adjusts the channel count of one feature so that the two features have the same number of channels, then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and a fused feature F_M is finally output, whose size equals that of F_3. Finally the three FPNs output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame.
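The following sketch shows one way the pairwise fusion described above could be realised, assuming the shallower map is always folded into the deeper one (so that the final F_M takes the size of F_3); the exact fusion order and the interpolation fallback for off-by-one sizes are assumptions, since the text only names the 1 × 1 channel-matching convolution and the two resizing operations.

```python
import torch.nn as nn
import torch.nn.functional as F

class PairFusion(nn.Module):
    """Fuse two feature maps by point-to-point addition (a sketch of step (3)).
    A 1x1 convolution matches the channel count of the shallow map to the deep
    map, and stride-2 3x3 convolutions match the spatial size before the sum."""
    def __init__(self, shallow_ch, deep_ch):
        super().__init__()
        self.match_ch = nn.Conv2d(shallow_ch, deep_ch, kernel_size=1)
        self.down = nn.Conv2d(deep_ch, deep_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, shallow, deep):
        x = self.match_ch(shallow)
        # Halve the shallow map until it is close to the deep map's size; any
        # remaining off-by-one difference is interpolated away (assumption).
        while min(x.shape[-2:]) >= 2 * max(deep.shape[-2:]):
            x = self.down(x)
        if x.shape[-2:] != deep.shape[-2:]:
            x = F.interpolate(x, size=deep.shape[-2:], mode='bilinear',
                              align_corners=False)
        return x + deep

class MixedFPN(nn.Module):
    """One of the three FPNs: fuses (F_1, F_2, F_3) into F_M with the size of F_3."""
    def __init__(self, ch1, ch2, ch3):
        super().__init__()
        self.fuse12 = PairFusion(ch1, ch2)
        self.fuse23 = PairFusion(ch2, ch3)

    def forward(self, f1, f2, f3):
        f12 = self.fuse12(f1, f2)     # shallow detail folded into the middle level
        return self.fuse23(f12, f3)   # result folded into the deepest level -> F_M
```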
(4) A region proposal network RPN is designed, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes. These two outputs are produced by two paths: the upper half of the RPN outputs the classification CLS and the lower half outputs the regression REG. The RPN first crops the template-frame mixed feature F_M(F_t) from the edges; the number of channels of the mixed features differs between combinations. Adjusting convolutions then bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O.
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k. In the output, the w_res × h_res dimensions are in a spatial linear relationship with the original w_c × h_c image; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, centred at that position. The 2k channels of CLS_O give, for each of the k anchor boxes, the probability P_pos that it contains the target and the probability P_neg that it does not. The 4k channels of REG_O give the width-height differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is:
dx = (T_x - A_x) / A_w,  dy = (T_y - A_y) / A_h,  dw = ln(T_w / A_w),  dh = ln(T_h / A_h)    (1)
where A_x, A_y denote the centre of the anchor (reference) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the centre coordinates and width and height of the ground-truth box; the final target is then selected by non-maximum suppression and similar methods;
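Below is a sketch of one RPN branch that is consistent with the tensor sizes given in the embodiment, using the up-channel cross-correlation of SiamRPN-style trackers (an assumption: the patent only states that the adjusted template features are cross-correlated with the adjusted search features), together with the inverse of equation (1) used to recover predicted boxes from the offsets. Layer hyper-parameters such as kernel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """One RPN branch: template features become correlation kernels (a sketch).
    Channel counts follow the embodiment (512-channel mixed features, k anchors)."""
    def __init__(self, channels=512, k=5):
        super().__init__()
        self.k = k
        # "adjusting convolutions": template -> kernels, search -> feature maps
        self.t_cls = nn.Conv2d(channels, 2 * k * channels, kernel_size=3)
        self.t_reg = nn.Conv2d(channels, 4 * k * channels, kernel_size=3)
        self.s_cls = nn.Conv2d(channels, channels, kernel_size=3)
        self.s_reg = nn.Conv2d(channels, channels, kernel_size=3)

    @staticmethod
    def _xcorr(search, kernel, out_ch):
        # Up-channel cross-correlation: each group of `channels` kernel planes
        # produces one output channel (batch size 1 assumed for brevity).
        b, c, h, w = kernel.shape
        kernel = kernel.view(out_ch, c // out_ch, h, w)
        return F.conv2d(search, kernel)

    def forward(self, f_t, f_c):
        # f_t: 1 x C x 7 x 7 cropped template feature, f_c: 1 x C x 31 x 31
        k_cls = self.t_cls(f_t)                          # 1 x (2k*C) x 5 x 5
        k_reg = self.t_reg(f_t)                          # 1 x (4k*C) x 5 x 5
        x_cls = self.s_cls(f_c)                          # 1 x C x 29 x 29
        x_reg = self.s_reg(f_c)                          # 1 x C x 29 x 29
        cls_o = self._xcorr(x_cls, k_cls, 2 * self.k)    # 1 x 2k x 25 x 25
        reg_o = self._xcorr(x_reg, k_reg, 4 * self.k)    # 1 x 4k x 25 x 25
        return cls_o, reg_o

def decode_boxes(reg, anchors):
    """Invert equation (1): recover predicted boxes from (dx, dy, dw, dh).
    anchors, reg: tensors of shape (N, 4) holding (x, y, w, h) / offsets."""
    ax, ay, aw, ah = anchors.unbind(dim=1)
    dx, dy, dw, dh = reg.unbind(dim=1)
    return torch.stack([dx * aw + ax, dy * ah + ay,
                        aw * torch.exp(dw), ah * torch.exp(dh)], dim=1)
```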
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module. Average pooling, max pooling, convolution and a Sigmoid activation produce spatial attention weights SA_c and SA_r of size w_res × h_res × 1. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and added back to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG;
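A sketch of this spatial attention module: the 7 × 7 convolution kernel is an assumption borrowed from CBAM-style spatial attention, while the rest follows the description (channel-wise average and max pooling, convolution, Sigmoid, position-wise multiplication and residual addition).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention applied to CLS_O or REG_O (a sketch of step (6))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                   # x: B x C x H x W
        avg = torch.mean(x, dim=1, keepdim=True)            # B x 1 x H x W
        mx, _ = torch.max(x, dim=1, keepdim=True)           # B x 1 x H x W
        sa = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # SA_c or SA_r
        return x * sa + x                                   # weighted map plus residual
```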
(7) The outputs of the three RPNs, RPN1, RPN2 and RPN3, are combined by weighted addition to form the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3    (2)
where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
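The weighted addition of equation (2) is straightforward; a sketch with the 0.2 / 0.3 / 0.5 weights of the embodiment used as defaults (any other preset α_i, β_i can be substituted):

```python
def fuse_rpn_outputs(cls_list, reg_list,
                     alphas=(0.2, 0.3, 0.5), betas=(0.2, 0.3, 0.5)):
    """Weighted addition of the three RPN outputs, equation (2)."""
    cls = sum(a * c for a, c in zip(alphas, cls_list))
    reg = sum(b * r for b, r in zip(betas, reg_list))
    return cls, reg
```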
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates. y denotes the label value and ŷ the network's classification output, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the width-height differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)],  L_reg = smooth_L1(dx_T - dx) + smooth_L1(dy_T - dy) + smooth_L1(dw_T - dw) + smooth_L1(dh_T - dh)    (3)
wherein:
smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise    (4)
the final loss function is as follows:
loss = L_cls + λ·L_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
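A sketch of this training loss, combining the cross-entropy classification loss and the smoothed L1 regression loss as in equation (5). The anchor-channel layout and the restriction of the regression loss to positive anchors are assumptions, and the default value of λ (lam) is only a placeholder since the patent does not give one.

```python
import torch.nn.functional as F

def tracking_loss(cls, reg, labels, reg_targets, lam=1.0):
    """Training loss of step (8), a sketch.

    cls:         B x 2k x H x W anchor classification scores
    reg:         B x 4k x H x W predicted offsets (dx, dy, dw, dh)
    labels:      B x k  x H x W, 1 where the anchor matches the target, else 0
    reg_targets: B x 4k x H x W ground-truth offsets dx_T, dy_T, dw_T, dh_T
    """
    b, twok, h, w = cls.shape
    k = twok // 2
    # Cross-entropy over the (negative, positive) score pair of every anchor;
    # the (k, 2) channel grouping is an assumption.
    cls = cls.view(b, k, 2, h, w).permute(0, 1, 3, 4, 2).reshape(-1, 2)
    flat_labels = labels.reshape(-1).long()
    l_cls = F.cross_entropy(cls, flat_labels)
    # Smoothed L1 on the normalized offsets, counted only for positive anchors.
    pos = labels.float().view(b, k, 1, h, w)
    reg = reg.view(b, k, 4, h, w)
    reg_targets = reg_targets.view(b, k, 4, h, w)
    l1 = F.smooth_l1_loss(reg, reg_targets, reduction='none')
    l_reg = (l1 * pos).sum() / pos.sum().clamp(min=1)
    return l_cls + lam * l_reg            # equation (5)
```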
The present invention uses an improved FPN structure. Whereas the deep features of a conventional FPN retain too little of the shallow features, the improved FPN structure better retains and exploits the shallow features of the image and outputs fused features that combine multi-level, multi-scale information. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and erroneous tracking is reduced. At the same time, an attention mechanism makes the network pay more attention, on the spatial scale, to positions where the target is likely to be, which reduces target loss and tracking errors caused by partial occlusion, deformation, illumination changes and the like.
Drawings
FIG. 1 is a diagram of a template frame and a search frame according to the present invention
FIG. 2 is an overall structure diagram of the target tracking network of the present invention
FIG. 3 is a diagram of the FPN structure of the present invention
FIG. 4 is a diagram of the RPN structure of the present invention
FIG. 5 is a diagram illustrating the RPN output result of the present invention
FIG. 6 is a flow chart of target tracking network training in accordance with the present invention
Detailed Description
The following detailed description of the embodiments and the working principles of the present invention will be made with reference to the accompanying drawings.
The invention provides a target tracking method based on a multilayer feature mixing and attention mechanism, which comprises the following specific steps:
(1) The data set is preprocessed before training. The training data consists of video sequences labelled with the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is to be found. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, as shown in Fig. 1 and Fig. 2. The template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features of the template frame and the search frame, and they form a twin network N_S by weight sharing. The residual network is obtained from the standard ResNet-50 by removing the padding of the first 7 × 7 convolution and changing its last two stride-2 convolutions to stride 1. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and of the search frame F_c at different levels of the network, where M is the index of the ResNet block in which the feature map resides and N is the position within that block.
(3) A feature pyramid network (FPN) is designed: the three FPNs (FPN1, FPN2, FPN3) respectively fuse the 3 groups of outputs of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3), (Conv1_1, Conv2_3, Conv4_6) and (Conv1_1, Conv2_3, Conv5_3), and 3 groups of fused features are obtained.
The specific structure of a single FPN used in the present invention is shown in Fig. 3. Each FPN receives 3 feature maps of different scales, denoted, from large to small and from shallow to deep, F_1, F_2, F_3. Feature fusion is done by point-to-point addition: a 1 × 1 convolution adjusts the channel count of one feature so that the two features have the same number of channels, then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two features have the same size, after which point-to-point addition, i.e. feature fusion, is performed. The 3 features are fused and a fused feature F_M is finally output, whose size equals that of F_3. Finally the three FPNs output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame.
(4) Region proposal network (RPN): the three RPNs (RPN1, RPN2, RPN3) respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes, as shown in Fig. 2.
(5) The RPN must output the classification CLS and the regression REG of the proposal boxes, and these two outputs are produced by two paths: in Fig. 2 the upper half of the RPN outputs the classification CLS and the lower half outputs the regression REG. The RPN first crops the template-frame mixed feature F_M(F_t) from the edges; c' in Fig. 4 denotes the current number of channels of the mixed feature, which differs between combinations. Adjusting convolutions then bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O.
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k, as shown in Fig. 5. In the output, the w_res × h_res dimensions are in a spatial linear relationship with the original w_c × h_c image; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, centred at that position. The 2k channels of CLS_O give, for each of the k anchor boxes, the probability P_pos that it contains the target and the probability P_neg that it does not. The 4k channels of REG_O give the width-height differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. Their relationship to the actual target box is:
dx = (T_x - A_x) / A_w,  dy = (T_y - A_y) / A_h,  dw = ln(T_w / A_w),  dh = ln(T_h / A_h)    (1)
where A_x, A_y denote the centre of the anchor (reference) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the centre coordinates and width and height of the ground-truth box. The final target is then selected by non-maximum suppression and similar methods.
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module, shown in Fig. 4. Average pooling, max pooling, convolution and a Sigmoid activation produce spatial attention weights SA_c and SA_r of size w_res × h_res × 1. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and added back to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG.
(7) The outputs of the three RPNs (RPN1, RPN2, RPN3) are combined by weighted addition to form the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3    (2)
where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates. y denotes the label value and ŷ the network's classification output, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the width-height differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values. The loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)],  L_reg = smooth_L1(dx_T - dx) + smooth_L1(dy_T - dy) + smooth_L1(dw_T - dw) + smooth_L1(dh_T - dh)    (3)
wherein:
smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise    (4)
the final loss function is as follows:
loss = L_cls + λ·L_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
The key parameters of one embodiment of the present invention are listed in Table 1; the specific parameters marked in the drawings are based on these implementation parameters:
TABLE 1 example parameters
(Table 1 is reproduced as an image in the original publication; its values correspond to the implementation parameters described below.)
The training process of the target tracking network designed by the invention is shown in Fig. 6. The specific training procedure and the related parameters of this implementation are as follows:
A video sequence in the data set is processed: according to the label information, a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels are obtained by cropping.
The template frame F_t and the search frame F_c are fed into the feature-extraction networks ResNet_N_1 and ResNet_N_2 of Fig. 2, and features of five different depth levels are extracted; the two feature-extraction networks share weights.
The three feature pyramid networks shown in Fig. 3, FPN1, FPN2 and FPN3, fuse the features of the template frame F_t and of the search frame F_c extracted at different depth levels: FPN1 fuses the features from the first, second and third blocks (layers), FPN2 fuses the features from the first, second and fourth blocks, and FPN3 fuses the features from the first, second and fifth blocks, as shown in Fig. 2. The three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame. The mixed features of the template frame are all of size 15 × 15 × 512, and the mixed features of the search frame are all of size 31 × 31 × 512.
The three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) are fed into the three region proposal networks RPN1, RPN2 and RPN3 respectively, as shown in Fig. 2. Each region proposal network has the same structure, shown in Fig. 4, with 5 anchor boxes, i.e. k = 5. First the template-frame mixed feature F_M(F_t) is cropped to remove the surrounding elements, giving a size of 7 × 7 × 512; four convolution layers then adjust the channel counts of F_M(F_t) and of the search-frame mixed feature F_M(F_c), giving: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512.
[F_M(F_t)]_c is cross-correlated with [F_M(F_c)]_c, and [F_M(F_t)]_r with [F_M(F_c)]_r, giving the intermediate classification result CLS_O and the intermediate regression result REG_O, where CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20.
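The sizes quoted above are consistent with interpreting the 5 × 5 × (10 × 512) and 5 × 5 × (20 × 512) template tensors as 10 and 20 correlation kernels of 512 channels each (an assumption about the layout); a quick shape check:

```python
import torch
import torch.nn.functional as F

# Shape check of the cross-correlation step with the embodiment's sizes
# (k = 5 anchors, 512-channel mixed features); purely illustrative.
k = 5
kernel_cls = torch.randn(2 * k, 512, 5, 5)    # [F_M(F_t)]_c reshaped: 5 x 5 x (10 x 512)
kernel_reg = torch.randn(4 * k, 512, 5, 5)    # [F_M(F_t)]_r reshaped: 5 x 5 x (20 x 512)
search_cls = torch.randn(1, 512, 29, 29)      # [F_M(F_c)]_c
search_reg = torch.randn(1, 512, 29, 29)      # [F_M(F_c)]_r

cls_o = F.conv2d(search_cls, kernel_cls)      # -> 1 x 10 x 25 x 25
reg_o = F.conv2d(search_reg, kernel_reg)      # -> 1 x 20 x 25 x 25
print(cls_o.shape, reg_o.shape)
```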
CLS_O and REG_O are fed into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r, and the products are added to the original CLS_O and REG_O, giving the final RPN classification result CLS and regression result REG. CLS has the same size as CLS_O, and REG has the same size as REG_O. The "spatial attention" block in the flow chart performs these steps.
The classification results and regression results output by RPN1, RPN2 and RPN3 are weighted and added with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and proposal-box regression result. The loss is computed and optimized according to equations (3), (4) and (5). When the preset number of training rounds, 50, is reached, training ends and testing is performed.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except for mutually exclusive features and/or steps. Any non-essential addition to or replacement of the technical features of the technical scheme of the invention made by a person skilled in the art belongs to the protection scope of the invention.

Claims (2)

1. A target tracking method based on multilayer feature mixing and an attention mechanism, characterized by comprising the following steps:
(1) Before training, the data set is preprocessed: the training data consists of video sequences labelled with the position and size of the target object; the target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is to be found; the original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, wherein the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames starting from the second frame;
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract features of the template frame and the search frame, and they form a twin network N_S by weight sharing; the residual network is obtained from the standard ResNet-50 by removing the padding of the first 7 × 7 convolution and changing its last two stride-2 convolutions to stride 1; the template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through operations including convolution, pooling and activation; ConvM_N(F_t) and ConvM_N(F_c) denote the features of the template frame F_t and of the search frame F_c at different levels of the network, where M is the index of the ResNet block in which the feature output ConvM_N(F_t) or ConvM_N(F_c) resides and N is the specific position within that block;
(3) A feature pyramid network FPN is designed, comprising three FPNs: FPN1, FPN2 and FPN3 respectively fuse the 3 groups of outputs of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3), (Conv1_1, Conv2_3, Conv4_6) and (Conv1_1, Conv2_3, Conv5_3), to obtain 3 groups of fused features; each FPN receives 3 feature maps of different scales, denoted, from large to small and from shallow to deep, F_1, F_2, F_3; feature fusion is done by point-to-point addition: a 1 × 1 convolution adjusts the channel count of one feature so that the two features have the same number of channels, then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two features have the same size, after which point-to-point addition, i.e. feature fusion, is performed; the 3 features are fused and a fused feature F_M is finally output, whose size equals that of F_3; finally the three FPNs output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame;
(4) A region proposal network RPN is designed, comprising three RPNs: RPN1, RPN2 and RPN3 respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes, and these two outputs are produced by two paths: the upper half of the RPN outputs the classification CLS and the lower half outputs the regression REG; the RPN first crops the template-frame mixed feature F_M(F_t) from the edges, the number of channels of the mixed features differing between combinations; adjusting convolutions then bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r; cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives the preliminary classification result CLS_O, and cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives the preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k; in the output, the w_res × h_res dimensions are in a spatial linear relationship with the original w_c × h_c image; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, centred at that position; the 2k channels of CLS_O give, for each of the k anchor boxes, the probability P_pos that it contains the target and the probability P_neg that it does not; the 4k channels of REG_O give the width-height differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box, whose relationship to the actual target box is:
dx = (T_x - A_x) / A_w,  dy = (T_y - A_y) / A_h,  dw = ln(T_w / A_w),  dh = ln(T_h / A_h)    (1)
where A_x, A_y denote the centre of the anchor (reference) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the centre coordinates and width and height of the ground-truth box; the final target is then selected by a non-maximum suppression method;
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module; average pooling, max pooling, convolution and a Sigmoid activation produce spatial attention weights SA_c and SA_r of size w_res × h_res × 1; CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and added back to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG;
(7) The outputs of the three RPNs, RPN1, RPN2 and RPN3, are combined by weighted addition to form the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3    (2)
where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights;
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates; y denotes the label value and ŷ the network's classification output, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the width-height differences and position differences between the k actual anchor boxes and the actual target box, i.e. the ground-truth values; the loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)],  L_reg = smooth_L1(dx_T - dx) + smooth_L1(dy_T - dy) + smooth_L1(dw_T - dw) + smooth_L1(dh_T - dh)    (3)
wherein:
smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise    (4)
the final loss function is as follows:
loss = L_cls + λ·L_reg    (5)
where λ is a hyperparameter, which is used to balance the two types of losses.
2. The target tracking method based on the multi-layer feature mixing and attention mechanism as claimed in claim 1, wherein the step (8) of training the target tracking network specifically comprises:
processing the video sequences in the data set: according to the label information, a template frame F_t of 127 × 127 pixels and a search frame F_c of 255 × 255 pixels are obtained by cropping;
feeding the template frame F_t and the search frame F_c into the feature-extraction networks ResNet_N_1 and ResNet_N_2 and extracting features of five different depth levels, wherein the two feature-extraction networks share weights;
fusing, with the three feature pyramid networks FPN1, FPN2 and FPN3, the features of the template frame F_t and of the search frame F_c extracted at different depth levels, wherein FPN1 fuses the features obtained from the first, second and third blocks (layers one, two and three), FPN2 fuses the features obtained from the first, second and fourth blocks, and FPN3 fuses the features obtained from the first, second and fifth blocks; the three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame; the mixed features of the template frame are all of size 15 × 15 × 512, and the mixed features of the search frame are all of size 31 × 31 × 512;
feeding the three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) into the three region proposal networks RPN1, RPN2 and RPN3 respectively, wherein each region proposal network has the same structure and 5 anchor boxes are set in total, i.e. k = 5; the template-frame mixed feature F_M(F_t) is first cropped to remove the surrounding elements, giving a size of 7 × 7 × 512, and four convolution layers adjust the channel counts of F_M(F_t) and of the search-frame mixed feature F_M(F_c), giving: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512;
cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r to obtain the intermediate classification result CLS_O and the intermediate regression result REG_O, wherein CLS_O has size 25 × 25 × 10 and REG_O has size 25 × 25 × 20;
feeding CLS_O and REG_O into their respective spatial attention modules to obtain the spatial attention weights SA_c and SA_r; multiplying CLS_O and REG_O position-wise by SA_c and SA_r and adding the products to the original CLS_O and REG_O to obtain the final RPN classification result CLS and regression result REG, wherein CLS has the same size as CLS_O and REG has the same size as REG_O; the "spatial attention" block performs the above steps;
weighting and adding the classification results and regression results output by RPN1, RPN2 and RPN3 with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and proposal-box regression result, and computing and optimizing the loss according to equations (3), (4) and (5); when the preset number of training rounds, 50, is reached, training ends and testing is performed.
CN202010518472.1A 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism Active CN111696137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010518472.1A CN111696137B (en) 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010518472.1A CN111696137B (en) 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism

Publications (2)

Publication Number Publication Date
CN111696137A CN111696137A (en) 2020-09-22
CN111696137B true CN111696137B (en) 2022-08-02

Family

ID=72479929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010518472.1A Active CN111696137B (en) 2020-06-09 2020-06-09 Target tracking method based on multilayer feature mixing and attention mechanism

Country Status (1)

Country Link
CN (1) CN111696137B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258557B (en) * 2020-10-23 2022-06-10 福州大学 Visual tracking method based on space attention feature aggregation
CN112288778B (en) * 2020-10-29 2022-07-01 电子科技大学 Infrared small target detection method based on multi-frame regression depth network
CN112308013B (en) * 2020-11-16 2023-03-31 电子科技大学 Football player tracking method based on deep learning
CN112489088A (en) * 2020-12-15 2021-03-12 东北大学 Twin network visual tracking method based on memory unit
CN112651954A (en) * 2020-12-30 2021-04-13 广东电网有限责任公司电力科学研究院 Method and device for detecting insulator string dropping area
CN112669350A (en) * 2020-12-31 2021-04-16 广东电网有限责任公司电力科学研究院 Adaptive feature fusion intelligent substation human body target tracking method
CN112785624B (en) * 2021-01-18 2023-07-04 苏州科技大学 RGB-D characteristic target tracking method based on twin network
CN113298850B (en) * 2021-06-11 2023-04-21 安徽大学 Target tracking method and system based on attention mechanism and feature fusion
CN114120056A (en) * 2021-10-29 2022-03-01 中国农业大学 Small target identification method, small target identification device, electronic equipment, medium and product
CN114399533B (en) * 2022-01-17 2024-04-16 中南大学 Single-target tracking method based on multi-level attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201908574D0 (en) * 2019-06-14 2019-07-31 Vision Semantics Ltd Optimised machine learning
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid
CN110704665A (en) * 2019-08-30 2020-01-17 北京大学 Image feature expression method and system based on visual attention mechanism
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030053658A1 (en) * 2001-06-29 2003-03-20 Honeywell International Inc. Surveillance system and methods regarding same
CN110349185B (en) * 2019-07-12 2022-10-11 安徽大学 RGBT target tracking model training method and device
CN111192292B (en) * 2019-12-27 2023-04-28 深圳大学 Target tracking method and related equipment based on attention mechanism and twin network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201908574D0 (en) * 2019-06-14 2019-07-31 Vision Semantics Ltd Optimised machine learning
CN110544269A (en) * 2019-08-06 2019-12-06 西安电子科技大学 twin network infrared target tracking method based on characteristic pyramid
CN110704665A (en) * 2019-08-30 2020-01-17 北京大学 Image feature expression method and system based on visual attention mechanism
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Bridging the Gap Between Detection and Tracking: A Unified Approach"; Huang L. H. et al.; IEEE; 2020-02-27; full text *
"Indoor crowd detection network based on multi-level features and hybrid attention mechanism"; Shen Wenxiang et al.; Journal of Computer Applications; 2019-12-10; Vol. 39, No. 12; full text *
"Research on optical remote sensing target detection based on deep feature enhancement"; Hu Tao; China Master's Theses Full-text Database, Engineering Science and Technology II; 2020-02-15 (No. 02, 2020); full text *

Also Published As

Publication number Publication date
CN111696137A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696137B (en) Target tracking method based on multilayer feature mixing and attention mechanism
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
Wang et al. Self-supervised multiscale adversarial regression network for stereo disparity estimation
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN110705448A (en) Human body detection method and device
CN111696136B (en) Target tracking method based on coding and decoding structure
CN108805151B (en) Image classification method based on depth similarity network
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN109766822A (en) Gesture identification method neural network based and system
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114612832A (en) Real-time gesture detection method and device
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN116343334A (en) Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture
CN115116139A (en) Multi-granularity human body action classification method based on graph convolution network
CN115222998A (en) Image classification method
CN114743162A (en) Cross-modal pedestrian re-identification method based on generation of countermeasure network
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN114882234A (en) Construction method of multi-scale lightweight dense connected target detection network
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN112990154B (en) Data processing method, computer equipment and readable storage medium
Ma et al. Land Use Classification of High-Resolution Multispectral Satellite Images With Fine-Grained Multiscale Networks and Superpixel Postprocessing
CN110197226B (en) Unsupervised image translation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant