CN111696137B - Target tracking method based on multilayer feature mixing and attention mechanism - Google Patents
Abstract
The invention discloses a target tracking method based on multilayer feature mixing and an attention mechanism. An improved FPN structure better retains and exploits the shallow features of the image and outputs fused features that combine multiple depths and scales, giving the method better tracking ability for targets of different sizes and for targets whose size changes continuously. Feeding the FPN outputs into cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and mistracking is reduced. Meanwhile, an attention mechanism lets the network pay more attention, on the spatial scale, to the positions where the target is likely to be, reducing target loss or mistracking caused by partial occlusion, deformation, illumination changes, and the like.
Description
Technical Field
The invention belongs to the fields of image processing and computer vision, and particularly relates to a target tracking method based on multilayer feature mixing and an attention mechanism.
Background
Visual target tracking is an important computer vision task that can be applied to visual surveillance, human-computer interaction, video compression, and other fields. Despite extensive research on this problem, it remains difficult to handle complex changes in object appearance caused by illumination variation, partial occlusion, shape deformation, and camera motion.
At present, target tracking algorithms fall into two main branches: one based on correlation filtering and one based on deep learning. The target tracking method provided by the invention belongs to the deep-learning branch.
Deep-learning trackers mainly use the following models: convolutional neural networks, recurrent neural networks, generative adversarial networks, and twin (Siamese) neural networks. The convolutional-neural-network-based tracker of "Learning spatial-aware regressions for visual tracking, C. Sun, D. Wang, H. Lu, and M.-H. Yang, in Proc. IEEE CVPR, 2018, pp. 8962-8970" constructs multiple target models to capture various target appearances, learns different target models, handles partial occlusion and deformation with part-based models, and uses a two-stream network to prevent overfitting while learning the rotation information of the target. Although this method has made great progress in the accuracy of target estimation, such convolutional-neural-network-based methods still have high computational complexity. The invention patent CN110780290A, a multi-maneuvering-target tracking method based on an LSTM network, is a target tracking method based on a recurrent neural network; it uses context information to handle the influence of similar backgrounds on the tracked target. Since visual target tracking depends on both the spatial and the temporal information of the video frames, recurrent-neural-network-based approaches can also take the motion of the target into account. The number of such methods is limited, however, because the large number of parameters in the model makes training difficult. Almost all of these approaches attempt to improve target modeling with additional information and memory. A further goal of recurrent approaches is to avoid fine-tuning a pre-trained CNN model, which requires much time and is prone to overfitting.
"VITAL: Visual tracking via adversarial learning, Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang, in Proc. IEEE CVPR, 2018, pp. 8990-8999" performs target tracking based on a generative adversarial network: it generates the required samples to address the unbalanced distribution of training samples and, at the same time, alleviates the shortage of samples. However, generative adversarial networks are often difficult to train and evaluate, and solving this problem in practice requires considerable skill. The invention patent CN110728697A, an infrared dim and small target detection and tracking method based on a convolutional neural network, uses a twin network to track a target: it extracts deep features of the picture and performs feature matching to complete the tracking.
Aiming at the problems in existing deep-learning trackers of uneven exploitation of target features and of occlusion, partial occlusion, illumination change, and deformation of the tracked object, the method of the invention is based on a twin network, combines shallow and deep features through several FPNs, and uses an attention mechanism to improve robustness.
Disclosure of Invention
The invention belongs to the fields of computer vision and deep learning. By improving the feature-extraction part and the region-proposal-network part of a twin network, the whole target tracking network obtains stronger feature-extraction capability and robustness. The invention provides a target tracking method based on multilayer feature mixing and an attention mechanism, comprising the following specific steps:
(1) Before training, preprocess the data set: the training data consists of video sequences carrying labels of the target object's position and size. The target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is sought. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames, starting from the second frame.
(2) Design two parallel 5-block deep residual networks N_1, N_2 to extract the features of the template frame and the search frame, forming a twin network N_S through weight sharing. The residual network used removes the padding from the first 7 × 7 convolution of the standard "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation, and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the feature outputs of the template frame F_t and the search frame F_c at different levels of the network, where M denotes which block of the ResNet the feature map comes from and N denotes the specific position within that block.
(3) Design a feature pyramid network (FPN) module comprising three FPNs: FPN1, FPN2, and FPN3 respectively fuse the 3 groups of output features of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3), to obtain 3 groups of fused features. Each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep. The fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the number of channels of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e., feature fusion, is performed. Fusing the 3 features finally outputs a fused feature F_M, whose size is the same as that of F_3. Finally, the three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame.
(4) Design a region proposal network (RPN) module comprising three RPNs: RPN1, RPN2, and RPN3 respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes.
(5) The RPN outputs the classification CLS and the regression REG of the proposal boxes through two separate paths: the upper half of the RPN outputs the classification CLS, and the lower half outputs the regression REG. The RPN first crops the mixed feature F_M(F_t) of the template frame from the edges; the number of channels of the mixed features differs between the different combinations. Then adjusting convolutions bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives a preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives a preliminary regression result REG_O.
The size of CLS_O is w_res × h_res × 2k and the size of REG_O is w_res × h_res × 4k. In the output, the w_res × h_res dimensions are in a spatially linear relationship with the original image of w_c × h_c; each position of the w_res × h_res map corresponds to k anchor boxes of preset sizes, centered at that position. The 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos that the box contains the target and the probability P_neg that it does not. The 4k channels of REG_O represent the length-width differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. The relationship with the actual target box is:
dx = (T_x - A_x)/A_w,  dy = (T_y - A_y)/A_h,  dw = ln(T_w/A_w),  dh = ln(T_h/A_h)   (1)
wherein A_x, A_y denote the center point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the truth value; finally, the final target is found through non-maximum suppression and similar methods.
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module; through average pooling, max pooling, convolution, and Sigmoid activation, spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and then added to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG.
(7) The outputs of the three RPNs (RPN1, RPN2, and RPN3) are weighted and summed as the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3   (2)
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates. y denotes the label value and ŷ denotes the predicted classification value, i.e., P_pos; dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k kinds of anchor boxes and the actual target box, i.e., the truth values. The loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)]   (3)
L_reg = Σ_{d ∈ {dx, dy, dw, dh}} smooth_L1(d - d_T)   (4)
wherein:
smooth_L1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise.
The final loss function is as follows:
loss = L_cls + λ·L_reg   (5)
where λ is a hyperparameter used to balance the two types of losses.
The present invention uses an improved FPN structure. Whereas the deep features obtained in a conventional FPN retain the shallow features insufficiently, the improved structure better retains and exploits the shallow features of the image and can output fused features combining multiple depths and scales. The method therefore tracks targets of different sizes, and targets whose size changes continuously, more reliably. Feeding the FPN outputs into the cascaded RPNs makes feature extraction more accurate, so that similar distractors are better distinguished during tracking and mistracking is reduced. Meanwhile, the attention mechanism lets the network pay more attention, on the spatial scale, to the positions where the target is likely to be, reducing target loss or mistracking caused by partial occlusion, deformation, illumination changes, and the like.
Drawings
FIG. 1 is a diagram of a template frame and a search frame according to the present invention
FIG. 2 is an overall structure diagram of the target tracking network of the present invention
FIG. 3 is a diagram of the FPN structure of the present invention
FIG. 4 is a diagram of the RPN structure of the present invention
FIG. 5 is a diagram illustrating the RPN output result of the present invention
FIG. 6 is a flow chart of target tracking network training in accordance with the present invention
Detailed Description
The following detailed description of the embodiments and the working principles of the present invention will be made with reference to the accompanying drawings.
The invention provides a target tracking method based on multilayer feature mixing and an attention mechanism, comprising the following specific steps:
(1) The data set is pre-processed before training. The training data consists of video sequences with labels of the position and size of the target object. The target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is sought. The original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, as shown in fig. 1 and 2, where the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames, starting from the second frame.
(2) Two parallel 5-block deep residual networks N_1, N_2 are designed to extract the features of the template frame and the search frame, forming a twin network N_S through weight sharing. The residual network used removes the padding from the first 7 × 7 convolution of the standard "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions. The template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through convolution, pooling, activation, and similar operations. ConvM_N(F_t) and ConvM_N(F_c) denote the feature outputs of the template frame F_t and the search frame F_c at different levels of the network, where M denotes which block of the ResNet the feature map comes from and N denotes the specific position within that block.
(3) A feature pyramid network (FPN) module is designed: three FPNs (FPN1, FPN2, FPN3) respectively fuse the 3 groups of output features of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3), obtaining 3 groups of fused features.
The specific structure of a single FPN used in the present invention is shown in fig. 3. Each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep. The fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the number of channels of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e., feature fusion, is performed. Fusing the 3 features finally outputs a fused feature F_M, whose size is the same as that of F_3. Finally, the three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame.
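The fusion rule above (match channels with a 1 × 1 convolution, match sizes with a stride-2 3 × 3 operation, then add point-to-point) can be sketched as follows. This is a minimal numpy illustration, not the patented implementation: the 1 × 1 convolution weights are random, the stride-2 3 × 3 convolution is stood in for by a 3 × 3 averaging filter, and the feature sizes are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, C_in), w: (C_in, C_out); a 1x1 convolution is a per-pixel matmul
    return x @ w

def downsample_3x3_s2(x):
    # stand-in for the stride-2 3x3 convolution: 3x3 averaging, stride 2, padding 1
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H // 2, W // 2, C))
    for i in range(H // 2):
        for j in range(W // 2):
            out[i, j] = xp[2*i:2*i+3, 2*j:2*j+3].mean(axis=(0, 1))
    return out

def fuse(shallow, deep, w):
    # match channels with a 1x1 conv, match sizes with the stride-2 filter,
    # then complete the fusion by point-to-point addition
    return downsample_3x3_s2(conv1x1(shallow, w)) + deep

rng = np.random.default_rng(0)
F2 = rng.standard_normal((16, 16, 64))   # shallower, larger feature map (hypothetical size)
F3 = rng.standard_normal((8, 8, 128))    # deeper, smaller feature map (hypothetical size)
w = rng.standard_normal((64, 128))       # hypothetical 1x1 conv weights, 64 -> 128 channels
F_M = fuse(F2, F3, w)
assert F_M.shape == F3.shape             # the fused output keeps F3's size, as in step (3)
```

In the downsampling variant shown here the fused feature inherits the size of the deepest map, which matches the statement that F_M has the same size as F_3.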
(4) A region proposal network (RPN) module is designed: three RPNs (RPN1, RPN2, RPN3) respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes, as shown in fig. 2.
(5) The RPN must output the classification CLS and the regression REG of the proposal boxes, and these two different outputs are completed by two paths: the upper half of the RPN in fig. 2 outputs the classification CLS of the proposal boxes, and the lower half outputs their regression REG. The RPN first crops the mixed feature F_M(F_t) of the template frame from the edges; c' in fig. 4 denotes the current number of mixed-feature channels, which differs between the different combinations. Then adjusting convolutions bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r. Cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives a preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives a preliminary regression result REG_O.
The size of CLS_O is w_res × h_res × 2k and the size of REG_O is w_res × h_res × 4k, as shown in fig. 5. In the output, the w_res × h_res dimensions are in a spatially linear relationship with the original image of w_c × h_c; each position of the w_res × h_res map corresponds to k anchor boxes of preset sizes, centered at that position. The 2k channels of CLS_O represent, for the k anchor boxes predicted by the network, the probability P_pos that the box contains the target and the probability P_neg that it does not. The 4k channels of REG_O represent the length-width differences and position differences, dx, dy, dw, dh, between the k anchor boxes predicted by the network and the actual target box. The relationship with the actual target box is:
dx = (T_x - A_x)/A_w,  dy = (T_y - A_y)/A_h,  dw = ln(T_w/A_w),  dh = ln(T_h/A_h)   (1)
wherein A_x, A_y denote the center point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and the length and width of the truth value. The final target is then found through non-maximum suppression and similar methods.
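The anchor offsets defined above follow the usual RPN parameterization: position differences are normalised by the anchor size and length-width differences are log-ratios. A minimal sketch, assuming that convention (the patent's equation image is not reproduced here), with round-trip encoding and decoding of a hypothetical box:

```python
import numpy as np

def encode(anchor, truth):
    # anchor = (A_x, A_y, A_w, A_h), truth = (T_x, T_y, T_w, T_h)
    # offsets (dx, dy, dw, dh) between an anchor box and the target box
    A_x, A_y, A_w, A_h = anchor
    T_x, T_y, T_w, T_h = truth
    return ((T_x - A_x) / A_w, (T_y - A_y) / A_h,
            np.log(T_w / A_w), np.log(T_h / A_h))

def decode(anchor, deltas):
    # invert the encoding: recover the predicted box from (dx, dy, dw, dh)
    A_x, A_y, A_w, A_h = anchor
    dx, dy, dw, dh = deltas
    return (A_x + dx * A_w, A_y + dy * A_h,
            A_w * np.exp(dw), A_h * np.exp(dh))

anchor = (100.0, 100.0, 64.0, 32.0)   # hypothetical preset anchor
truth = (110.0, 96.0, 80.0, 40.0)     # hypothetical ground-truth box
deltas = encode(anchor, truth)
recovered = decode(anchor, deltas)
assert np.allclose(recovered, truth)  # decode(encode(.)) recovers the box
```

At inference, decoding the network's predicted deltas for each of the k anchors at each position yields the candidate boxes that are then filtered by non-maximum suppression.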
(6) After CLS_O and REG_O are obtained, they are fed into a spatial attention module, as shown in fig. 4; through average pooling, max pooling, convolution, and Sigmoid activation, spatial attention weights SA_c and SA_r of size w_res × h_res × 1 are obtained. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively and then added to the original CLS_O and REG_O, giving the final RPN outputs CLS and REG.
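The spatial attention step can be sketched as follows. This is a simplified numpy illustration under assumptions not fixed by the text: the pooling is taken channel-wise, the convolution is reduced to a 1 × 1 (a 2-to-1 linear map over the two pooled maps), and the weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w, b):
    # feat: (H, W, C). Channel-wise average- and max-pooling give two (H, W, 1)
    # maps; a convolution (here a 1x1, i.e. a 2->1 linear map) and a Sigmoid
    # turn them into an (H, W, 1) attention weight in (0, 1).
    avg = feat.mean(axis=2, keepdims=True)
    mx = feat.max(axis=2, keepdims=True)
    pooled = np.concatenate([avg, mx], axis=2)   # (H, W, 2)
    sa = sigmoid(pooled @ w + b)                 # (H, W, 1)
    # re-weight the map position-wise and add the original back, as in step (6)
    return feat * sa + feat, sa

rng = np.random.default_rng(0)
cls_o = rng.standard_normal((25, 25, 10))  # CLS_O with k = 5 anchors (2k channels)
w = rng.standard_normal((2, 1))            # hypothetical conv weights
cls_fin, sa = spatial_attention(cls_o, w, 0.0)
assert sa.shape == (25, 25, 1) and cls_fin.shape == cls_o.shape
assert np.all((sa > 0) & (sa < 1))
```

The residual addition means positions the attention map suppresses are attenuated rather than zeroed, which keeps gradients flowing through the original response maps.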
(7) The outputs of the three RPNs (RPN1, RPN2, RPN3) are weighted and summed to obtain the final output of the target tracking network:
CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3,  REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3   (2)
wherein α_1, α_2, α_3, β_1, β_2, β_3 are preset weights.
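The weighted fusion of the three RPN outputs in step (7) is a simple element-wise weighted sum. A minimal numpy sketch, using the 0.2/0.3/0.5 weights given in the embodiment and random stand-in maps of the embodiment's sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
# classification and regression maps from RPN1..RPN3 (k = 5 anchors)
cls_maps = [rng.standard_normal((25, 25, 10)) for _ in range(3)]
reg_maps = [rng.standard_normal((25, 25, 20)) for _ in range(3)]
alphas = (0.2, 0.3, 0.5)   # preset weights from the embodiment
betas = (0.2, 0.3, 0.5)

# element-wise weighted sums over the three RPN outputs, as in step (7)
CLS = sum(a * c for a, c in zip(alphas, cls_maps))
REG = sum(b * r for b, r in zip(betas, reg_maps))
assert CLS.shape == (25, 25, 10) and REG.shape == (25, 25, 20)
```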
(8) When training the target tracking network, the classification loss L_cls uses the cross-entropy loss and the regression loss L_reg uses a smoothed L1 loss with normalized coordinates. y denotes the label value and ŷ denotes the predicted classification value (i.e., P_pos); dx_T, dy_T, dw_T, dh_T denote the length-width differences and position differences between the k kinds of anchor boxes and the actual target box, i.e., the truth values. The loss functions are defined as:
L_cls = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)]   (3)
L_reg = Σ_{d ∈ {dx, dy, dw, dh}} smooth_L1(d - d_T)   (4)
wherein:
smooth_L1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise.
The final loss function is as follows:
loss = L_cls + λ·L_reg   (5)
where λ is a hyperparameter used to balance the two types of losses.
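The losses of equations (3), (4), and (5) can be sketched directly. A minimal numpy illustration, assuming the standard binary cross-entropy and smoothed L1 kernels implied by the text (the scalar inputs are hypothetical):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # equation (3): binary cross-entropy on the classification branch,
    # y is the label, y_hat the predicted probability P_pos
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def smooth_l1(x):
    # the smoothed L1 kernel assumed for equation (4)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def regression_loss(pred, truth):
    # equation (4): sum of smooth_L1 over the normalised (dx, dy, dw, dh) offsets
    return smooth_l1(np.asarray(pred) - np.asarray(truth)).sum()

def total_loss(y, y_hat, pred, truth, lam=1.0):
    # equation (5): loss = L_cls + lambda * L_reg
    return cross_entropy(y, y_hat) + lam * regression_loss(pred, truth)

l = total_loss(1.0, 0.9, [0.1, 0.2, 0.0, 0.3], [0.0, 0.2, 0.0, 0.1], lam=1.0)
assert l > 0
```

The smoothed L1 kernel is quadratic near zero and linear beyond |x| = 1, which bounds the gradient for outlier regression targets; λ trades the two terms off.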
The key parameters of one embodiment of the present invention are shown in Table 1; the specific parameters marked in the drawings are those of this implementation:
TABLE 1 example parameters
The specific training process of the target tracking network designed by the invention is shown in fig. 6; the specific training steps and the parameters of this implementation are as follows:
a video sequence in the data set is processed. According to the label information, cutting to obtain 127X 127 pixel template frame F t And a search frame F of 255 x 255 pixels c 。
The template frame F_t and the search frame F_c are fed into the feature extraction networks ResNet_N_1 and ResNet_N_2 of fig. 2, and features of five different depth levels are extracted; the two feature extraction networks share weights.
Three feature pyramid networks, FPN1, FPN2, and FPN3, shown in fig. 3, fuse the features of the template frame F_t and the search frame F_c extracted at different depth levels: FPN1 fuses the features obtained from the first, second, and third blocks, FPN2 those from the first, second, and fourth blocks, and FPN3 those from the first, second, and fifth blocks, as shown in fig. 2. The three FPN pairs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame. The mixed features of the template frame are all of size 15 × 15 × 512, and those of the search frame are all of size 31 × 31 × 512.
The three feature pairs F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) are fed into the three region proposal networks RPN1, RPN2, RPN3 respectively, as shown in fig. 2. Each region proposal network has the same structure, shown in fig. 4, with 5 anchor boxes, i.e., k = 5. First, the mixed feature F_M(F_t) of the template frame is cropped to remove the surrounding elements, giving a size of 7 × 7 × 512; then four convolution layers adjust the channel numbers of F_M(F_t) and of the search-frame mixed feature F_M(F_c), giving respectively: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512.
[F_M(F_t)]_c is cross-correlated with [F_M(F_c)]_c, and [F_M(F_t)]_r with [F_M(F_c)]_r, to obtain the intermediate classification result CLS_O and the intermediate regression result REG_O, where the size of CLS_O is 25 × 25 × 10 and the size of REG_O is 25 × 25 × 20.
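The cross-correlation above uses the template feature as a convolution kernel that slides over the search feature. A minimal numpy sketch of one output channel with the embodiment's sizes (29 - 5 + 1 = 25), using random stand-in features:

```python
import numpy as np

def xcorr(search, kernel):
    # 'valid' cross-correlation: the template feature acts as the kernel that
    # slides over the search feature, summing over space and channels
    Hs, Ws, C = search.shape
    Hk, Wk, _ = kernel.shape
    Ho, Wo = Hs - Hk + 1, Ws - Wk + 1
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[i, j] = np.sum(search[i:i+Hk, j:j+Wk] * kernel)
    return out

rng = np.random.default_rng(0)
# one of the 2k classification kernels sliced from [F_M(F_t)]_c
kernel = rng.standard_normal((5, 5, 512))
# the search-frame feature [F_M(F_c)]_c
search = rng.standard_normal((29, 29, 512))
resp = xcorr(search, kernel)
assert resp.shape == (25, 25)   # one channel of the 25 x 25 x 2k CLS_O map
```

Repeating this for all 2k (classification) and 4k (regression) kernel slices yields the full CLS_O and REG_O maps.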
CLS_O and REG_O are fed into their corresponding spatial attention modules to obtain the spatial attention weights SA_c and SA_r. CLS_O and REG_O are multiplied position-wise by SA_c and SA_r and added to the original CLS_O and REG_O, giving the final RPN classification result CLS and regression result REG. CLS has the same size as CLS_O, and REG the same size as REG_O. The "spatial attention" step in the flow chart completes these operations.
The classification results and regression results output by RPN1, RPN2, and RPN3 are weighted and summed with weights 0.2, 0.3, and 0.5 to obtain the final target classification result and proposal-box regression result. The losses are calculated according to equations (3), (4), and (5) and optimized. After the set number of training rounds, 50, the training is finished and testing is carried out.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps; any non-essential addition or replacement made on the basis of the technical features of the invention by a person skilled in the art falls within the protection scope of the invention.
Claims (2)
1. A target tracking method based on multilayer feature mixing and an attention mechanism, characterized by comprising the following steps:
(1) before training, preprocessing the data set: the training data consists of video sequences with labels of the position and size of the target object; the target tracking network takes as input a template frame corresponding to the tracked target and a search frame in which the target is sought; the original video sequence is cropped to obtain a template frame F_t of w_t × h_t pixels and a search frame F_c of w_c × h_c pixels, wherein the template frame corresponds to the first frame of the video sequence and the search frames correspond to the remaining frames, starting from the second frame;
(2) designing two parallel 5-block deep residual networks N_1, N_2 to extract the features of the template frame and the search frame, forming a twin network N_S through weight sharing; the residual network used removes the padding from the first 7 × 7 convolution of the standard "ResNet-50" and changes the last two stride-2 convolutions of this "ResNet-50" to stride-1 convolutions; the template frame F_t and the search frame F_c are fed into N_1 and N_2 respectively, and their features at different depths are extracted through operations including convolution, pooling, and activation; ConvM_N(F_t) and ConvM_N(F_c) denote the feature outputs of the template frame F_t and the search frame F_c at different levels of the network, wherein M denotes the block of the ResNet from which the feature output ConvM_N(F_t) or ConvM_N(F_c) comes, and N denotes its specific position within that block;
(3) designing a feature pyramid network (FPN) module comprising three FPNs: FPN1, FPN2, and FPN3 respectively fuse the 3 groups of output features of different depths extracted from networks N_1, N_2, namely (Conv1_1, Conv2_3, Conv3_3); (Conv1_1, Conv2_3, Conv4_6); (Conv1_1, Conv2_3, Conv5_3), to obtain 3 groups of fused features; each FPN receives 3 feature maps of different scales, denoted F_1, F_2, F_3 from large to small and from shallow to deep; the fusion is completed by point-to-point addition: a 1 × 1 convolution adjusts the number of channels of one feature so that the two features have the same number of channels, and then 2× up-sampling or a 3 × 3 convolution with stride 2 adjusts the size of the other feature so that the two adjusted features have the same size, after which point-to-point addition, i.e., feature fusion, is performed; fusing the 3 features finally outputs a fused feature F_M, whose size is the same as that of F_3; finally, the three FPNs respectively output the mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) of the template frame and the mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c) of the search frame;
(4) designing a region proposal network (RPN) module comprising three RPNs: RPN1, RPN2, and RPN3 respectively take as input the three pairs of template-frame and search-frame mixed features F_M_1(F_t), F_M_1(F_c); F_M_2(F_t), F_M_2(F_c); F_M_3(F_t), F_M_3(F_c), and output the classification result CLS and the regression result REG of the proposal boxes;
(5) the RPN outputs the classification CLS and the regression REG of the proposal boxes through two separate paths: the upper half of the RPN outputs the classification CLS, and the lower half outputs the regression REG; the RPN first crops the mixed feature F_M(F_t) of the template frame from the edges, the number of channels of the mixed features differing between the different combinations; then adjusting convolutions bring F_M(F_t) and F_M(F_c) to suitable sizes [F_M(F_t)]_c, [F_M(F_c)]_c, [F_M(F_t)]_r, [F_M(F_c)]_r; cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c gives a preliminary classification result CLS_O; cross-correlating [F_M(F_t)]_r with [F_M(F_c)]_r gives a preliminary regression result REG_O;
CLS_O has size w_res × h_res × 2k and REG_O has size w_res × h_res × 4k; in the output, the w_res × h_res dimensions are in a spatially linear relationship with the original image of size w_c × h_c; each of the w_res × h_res positions corresponds to k anchor boxes of preset sizes, each anchor box centered at that position; the 2k channels of CLS_O represent, for the k anchor boxes, the network-predicted probability P_pos that the box contains the target and the probability P_neg that it does not; the 4k channels of REG_O represent the predicted width-height and position offsets dx, dy, dw, dh between the k anchor boxes and the actual target box, related to the actual target box by:

dx = (T_x − A_x)/A_w, dy = (T_y − A_y)/A_h, dw = ln(T_w/A_w), dh = ln(T_h/A_h) (1)

where A_x, A_y denote the center point of the reference (anchor) box, A_w, A_h its width and height, and T_x, T_y, T_w, T_h the coordinates and width/height of the ground truth; the final target is then selected by non-maximum suppression;
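The anchor-to-box relationship above is the standard anchor parameterization; a minimal sketch with hypothetical `encode`/`decode` helpers (the network predicts the `encode` output, and `decode` recovers the box at inference):

```python
import numpy as np

def encode(anchor, target):
    """Offsets (dx,dy,dw,dh) the network is trained to predict, assuming
    dx=(Tx-Ax)/Aw, dy=(Ty-Ay)/Ah, dw=ln(Tw/Aw), dh=ln(Th/Ah)."""
    Ax, Ay, Aw, Ah = anchor
    Tx, Ty, Tw, Th = target
    return ((Tx - Ax) / Aw, (Ty - Ay) / Ah, np.log(Tw / Aw), np.log(Th / Ah))

def decode(anchor, deltas):
    """Inverse mapping: recover the predicted box (Tx,Ty,Tw,Th) from an
    anchor (Ax,Ay,Aw,Ah) and predicted offsets (dx,dy,dw,dh)."""
    Ax, Ay, Aw, Ah = anchor
    dx, dy, dw, dh = deltas
    return (Ax + dx * Aw, Ay + dy * Ah, Aw * np.exp(dw), Ah * np.exp(dh))

anchor = (100.0, 80.0, 40.0, 20.0)   # illustrative anchor box
target = (110.0, 85.0, 60.0, 30.0)   # illustrative ground-truth box
deltas = encode(anchor, target)
assert np.allclose(decode(anchor, deltas), target)  # round-trip check
```

Normalizing dx, dy by the anchor's width and height (and dw, dh by log-ratios) keeps the regression targets scale-invariant across anchor sizes.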
(6) after CLS_O and REG_O are obtained, they are fed into a spatial attention module, where average pooling, maximum pooling, convolution and Sigmoid activation produce w_res × h_res × 1 spatial attention weights SA_c and SA_r; CLS_O and REG_O are multiplied position-wise by SA_c and SA_r respectively, and the products are added to the original CLS_O and REG_O to obtain the final RPN outputs CLS and REG;
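The spatial attention path of step (6) can be sketched as follows; the learned convolution over the pooled maps is replaced by scalar weights `w_avg`, `w_max`, so this is a shape-level illustration of the pool–conv–sigmoid–reweight–residual pattern, not the trained module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w_avg=1.0, w_max=1.0, bias=0.0):
    """Pool over channels (average and max), combine the two pooled maps with
    a (here scalar-weighted) convolution, squash with a sigmoid to get a
    w_res x h_res x 1 weight map, reweight the input, and add it back."""
    avg = feat.mean(axis=0)                        # w_res x h_res average-pooled map
    mx = feat.max(axis=0)                          # w_res x h_res max-pooled map
    sa = sigmoid(w_avg * avg + w_max * mx + bias)  # attention weights in (0, 1)
    return feat * sa[None, :, :] + feat            # position-wise reweight + residual add

rng = np.random.default_rng(0)
cls_o = rng.standard_normal((10, 25, 25))  # CLS_O: 2k = 10 channels, 25 x 25 positions
cls_final = spatial_attention(cls_o)       # final CLS, same size as CLS_O
print(cls_final.shape)                     # (10, 25, 25)
```

The residual addition means each output value lies between 1× and 2× the input value, so attention can only emphasize positions, never suppress them to zero.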
(7) the output results of the three RPNs, RPN1, RPN2 and RPN3, are combined by weighted addition as the final output of the target tracking network:

CLS = α_1·CLS_1 + α_2·CLS_2 + α_3·CLS_3, REG = β_1·REG_1 + β_2·REG_2 + β_3·REG_3 (2)

where α_1, α_2, α_3, β_1, β_2, β_3 are preset weights;
(8) when training the target tracking network, the classification loss L_cls uses the cross-entropy loss, and the regression loss L_reg uses the smooth L1 loss with normalized coordinates; y denotes the label value and ŷ the predicted classification value, i.e. P_pos; dx_T, dy_T, dw_T, dh_T denote the true width-height and position offsets between the k anchor boxes and the actual target box; the loss functions are defined as:

L_cls = −[y·ln(ŷ) + (1 − y)·ln(1 − ŷ)] (3)

L_reg = Σ_{i∈{x,y,w,h}} smooth_L1(di_T − di) (4)

where:

smooth_L1(x) = 0.5·x², if |x| < 1; |x| − 0.5, otherwise.

The final loss function is as follows:
loss = L_cls + λ·L_reg (5)

where λ is a hyperparameter that balances the two losses.
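The losses of step (8) translate directly into code; `cross_entropy`, `smooth_l1` and `total_loss` are illustrative names, and the mean reduction over anchors is an assumption:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy between label y and predicted P_pos (eq. 3)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).mean()

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x * x, ax - 0.5)

def total_loss(y, y_hat, deltas_true, deltas_pred, lam=1.0):
    """loss = L_cls + lambda * L_reg on normalized coordinates (eq. 5)."""
    l_cls = cross_entropy(y, y_hat)
    l_reg = smooth_l1(deltas_true - deltas_pred).mean()
    return l_cls + lam * l_reg

# Illustrative labels/predictions for three anchors and their (dx,dy,dw,dh) offsets.
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
d_true = np.zeros((3, 4))
d_pred = np.full((3, 4), 0.1)
print(total_loss(y, p, d_true, d_pred))
```

The quadratic-to-linear switch in smooth L1 keeps gradients bounded for badly mispredicted boxes while staying smooth near zero error.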
2. The target tracking method based on the multi-layer feature mixing and attention mechanism as claimed in claim 1, wherein the step (8) of training the target tracking network specifically comprises:
processing the video sequences in the dataset and cropping, according to the label information, a 127 × 127 pixel template frame F_t and a 255 × 255 pixel search frame F_c;
feeding the template frame F_t and the search frame F_c into the feature extraction networks ResNet_N_1 and ResNet_N_2 to extract five features of different depth levels, the two feature extraction networks sharing weights;
fusing, in the three feature pyramid networks FPN1, FPN2 and FPN3 respectively, the extracted template-frame F_t and search-frame F_c features of different depth levels: FPN1 fuses the features from the first, second and third blocks (layers one, two and three), FPN2 fuses the features from the first, second and fourth blocks (layers one, two and four), and FPN3 fuses the features from the first, second and fifth blocks (layers one, two and five); the three FPNs respectively output the template-frame mixed features F_M_1(F_t), F_M_2(F_t), F_M_3(F_t) and the search-frame mixed features F_M_1(F_c), F_M_2(F_c), F_M_3(F_c); the template-frame mixed features are all of size 15 × 15 × 512 and the search-frame mixed features are all of size 31 × 31 × 512;
feeding the three pairs of mixed features F_M_1(F_t) and F_M_1(F_c), F_M_2(F_t) and F_M_2(F_c), F_M_3(F_t) and F_M_3(F_c) into the three region proposal networks RPN1, RPN2 and RPN3 respectively, the three RPNs having identical structure, with 5 anchor boxes in total, i.e. k = 5; first cropping the template-frame mixed feature F_M(F_t) to remove the surrounding elements, giving a size of 7 × 7 × 512, then adjusting, through four convolution layers, the channel counts of F_M(F_t) and of the search-frame mixed feature F_M(F_c) to obtain: [F_M(F_t)]_c of size 5 × 5 × (10 × 512); [F_M(F_t)]_r of size 5 × 5 × (20 × 512); [F_M(F_c)]_c of size 29 × 29 × 512; [F_M(F_c)]_r of size 29 × 29 × 512;
cross-correlating [F_M(F_t)]_c with [F_M(F_c)]_c and [F_M(F_t)]_r with [F_M(F_c)]_r to obtain the intermediate classification result CLS_O of size 25 × 25 × 10 and the intermediate regression result REG_O of size 25 × 25 × 20;
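The size arithmetic of this cross-correlation can be checked with an explicit sliding window; random features stand in for the real mixed features, and `xcorr` is an illustrative helper (a real implementation would use a framework's grouped convolution):

```python
import numpy as np

def xcorr(search, kernel):
    """Cross-correlate a search feature map with a template feature used as a
    convolution kernel (stride 1, no padding): a (C,29,29) search map and a
    (C,5,5) kernel yield a 25 x 25 response map."""
    c, H, W = search.shape
    _, kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[:, i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
search = rng.standard_normal((512, 29, 29))   # [F_M(F_c)]_c
# [F_M(F_t)]_c, 5 x 5 x (10 x 512), holds 2k = 10 kernels of shape (512, 5, 5)
kernels = rng.standard_normal((10, 512, 5, 5))
cls_o = np.stack([xcorr(search, k) for k in kernels])
print(cls_o.shape)                            # (10, 25, 25), i.e. CLS_O: 25 x 25 x 10
```

The same computation with 4k = 20 kernels from [F_M(F_t)]_r produces the 25 × 25 × 20 REG_O.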
feeding CLS_O and REG_O into the corresponding spatial attention modules to obtain the spatial attention weights SA_c and SA_r; multiplying CLS_O and REG_O position-wise by SA_c and SA_r, and adding the products to the original CLS_O and REG_O to obtain the final RPN classification output CLS and regression output REG; CLS has the same size as CLS_O, and REG the same size as REG_O; these steps are performed by the spatial attention modules;
weighting and adding the classification results and regression results output by RPN1, RPN2 and RPN3 with weights 0.2, 0.3 and 0.5 to obtain the final target classification result and proposal-box regression result, and computing and optimizing the loss according to equations (3), (4) and (5); when the preset number of training rounds, 50, is reached, training ends and testing is performed.
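The 0.2/0.3/0.5 aggregation is a plain weighted sum over the three RPN outputs; a minimal sketch with an illustrative `aggregate` helper:

```python
import numpy as np

def aggregate(cls_maps, reg_maps, alphas=(0.2, 0.3, 0.5), betas=(0.2, 0.3, 0.5)):
    """Weighted addition of the three RPNs' classification and regression maps
    (weights taken from the training setup described above)."""
    cls = sum(a * c for a, c in zip(alphas, cls_maps))
    reg = sum(b * r for b, r in zip(betas, reg_maps))
    return cls, reg

rng = np.random.default_rng(0)
cls_maps = [rng.standard_normal((10, 25, 25)) for _ in range(3)]  # CLS from RPN1..3
reg_maps = [rng.standard_normal((20, 25, 25)) for _ in range(3)]  # REG from RPN1..3
cls, reg = aggregate(cls_maps, reg_maps)
print(cls.shape, reg.shape)   # (10, 25, 25) (20, 25, 25)
```

Because the weights sum to 1, the aggregated maps stay on the same scale as the individual RPN outputs.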
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010518472.1A CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111696137A CN111696137A (en) | 2020-09-22 |
CN111696137B true CN111696137B (en) | 2022-08-02 |
Family
ID=72479929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010518472.1A Active CN111696137B (en) | 2020-06-09 | 2020-06-09 | Target tracking method based on multilayer feature mixing and attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111696137B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258557B (en) * | 2020-10-23 | 2022-06-10 | 福州大学 | Visual tracking method based on space attention feature aggregation |
CN112288778B (en) * | 2020-10-29 | 2022-07-01 | 电子科技大学 | Infrared small target detection method based on multi-frame regression depth network |
CN112308013B (en) * | 2020-11-16 | 2023-03-31 | 电子科技大学 | Football player tracking method based on deep learning |
CN112489088A (en) * | 2020-12-15 | 2021-03-12 | 东北大学 | Twin network visual tracking method based on memory unit |
CN112651954A (en) * | 2020-12-30 | 2021-04-13 | 广东电网有限责任公司电力科学研究院 | Method and device for detecting insulator string dropping area |
CN112669350A (en) * | 2020-12-31 | 2021-04-16 | 广东电网有限责任公司电力科学研究院 | Adaptive feature fusion intelligent substation human body target tracking method |
CN112785624B (en) * | 2021-01-18 | 2023-07-04 | 苏州科技大学 | RGB-D characteristic target tracking method based on twin network |
CN113298850B (en) * | 2021-06-11 | 2023-04-21 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114120056A (en) * | 2021-10-29 | 2022-03-01 | 中国农业大学 | Small target identification method, small target identification device, electronic equipment, medium and product |
CN114399533B (en) * | 2022-01-17 | 2024-04-16 | 中南大学 | Single-target tracking method based on multi-level attention mechanism |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201908574D0 (en) * | 2019-06-14 | 2019-07-31 | Vision Semantics Ltd | Optimised machine learning |
CN110544269A (en) * | 2019-08-06 | 2019-12-06 | 西安电子科技大学 | twin network infrared target tracking method based on characteristic pyramid |
CN110704665A (en) * | 2019-08-30 | 2020-01-17 | 北京大学 | Image feature expression method and system based on visual attention mechanism |
CN111126472A (en) * | 2019-12-18 | 2020-05-08 | 南京信息工程大学 | Improved target detection method based on SSD |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030053658A1 (en) * | 2001-06-29 | 2003-03-20 | Honeywell International Inc. | Surveillance system and methods regarding same |
CN110349185B (en) * | 2019-07-12 | 2022-10-11 | 安徽大学 | RGBT target tracking model training method and device |
CN111192292B (en) * | 2019-12-27 | 2023-04-28 | 深圳大学 | Target tracking method and related equipment based on attention mechanism and twin network |
Non-Patent Citations (3)
Title |
---|
"Bridging the Gap Between Detection and Tracking: A Unified Approach"; Huang LH et al.; IEEE; 2020-02-27; entire document *
"Indoor crowd detection network based on multi-level features and hybrid attention mechanism"; Shen Wenxiang et al.; Journal of Computer Applications; 2019-12-10; Vol. 39, No. 12; entire document *
"Research on optical remote sensing target detection technology based on deep feature enhancement"; Hu Tao; China Masters' Theses Full-text Database, Engineering Science and Technology II; 2020-02-15; No. 02, 2020; entire document *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111696137B (en) | Target tracking method based on multilayer feature mixing and attention mechanism | |
CN110188239B (en) | Double-current video classification method and device based on cross-mode attention mechanism | |
CN108537824B (en) | Feature map enhanced network structure optimization method based on alternating deconvolution and convolution | |
Wang et al. | Self-supervised multiscale adversarial regression network for stereo disparity estimation | |
CN110929736B (en) | Multi-feature cascading RGB-D significance target detection method | |
CN110705448A (en) | Human body detection method and device | |
CN111696136B (en) | Target tracking method based on coding and decoding structure | |
CN108805151B (en) | Image classification method based on depth similarity network | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN109766822A (en) | Gesture identification method neural network based and system | |
CN113610905B (en) | Deep learning remote sensing image registration method based on sub-image matching and application | |
CN112232134A (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN114612832A (en) | Real-time gesture detection method and device | |
CN112819951A (en) | Three-dimensional human body reconstruction method with shielding function based on depth map restoration | |
CN116030498A (en) | Virtual garment running and showing oriented three-dimensional human body posture estimation method | |
CN116343334A (en) | Motion recognition method of three-stream self-adaptive graph convolution model fused with joint capture | |
CN115116139A (en) | Multi-granularity human body action classification method based on graph convolution network | |
CN115222998A (en) | Image classification method | |
CN114743162A (en) | Cross-modal pedestrian re-identification method based on generation of countermeasure network | |
Fu et al. | Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images | |
CN114882234A (en) | Construction method of multi-scale lightweight dense connected target detection network | |
CN114743273A (en) | Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network | |
CN112990154B (en) | Data processing method, computer equipment and readable storage medium | |
Ma et al. | Land Use Classification of High-Resolution Multispectral Satellite Images With Fine-Grained Multiscale Networks and Superpixel Postprocessing | |
CN110197226B (en) | Unsupervised image translation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||