CN113947618A - Adaptive regression tracking method based on modulator - Google Patents

Adaptive regression tracking method based on modulator

Info

Publication number
CN113947618A
CN113947618A (application CN202111222510.XA)
Authority
CN
China
Prior art keywords
network
parameters
attention
context
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111222510.XA
Other languages
Chinese (zh)
Other versions
CN113947618B (en
Inventor
邬向前
卜巍
马丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202111222510.XA priority Critical patent/CN113947618B/en
Publication of CN113947618A publication Critical patent/CN113947618A/en
Application granted granted Critical
Publication of CN113947618B publication Critical patent/CN113947618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a modulator-based adaptive regression tracking method comprising the following steps: step one, design an attention-based spatio-temporal context network that generates the affine parameters corresponding to the spatio-temporal context; step two, design a trajectory network that generates the affine parameters corresponding to the trajectory; and step three, blend the two sets of parameters generated in steps one and two into each layer's parameters of a general regression network, adaptively adjusting those parameters so that the network responds more strongly to the specific target. Compared with the prior art, the invention has the following advantages: the model requires no inefficient fine-tuning process during tracking; the context prediction network encodes the relevant spatio-temporal background of past frames, which helps distinguish the target from the background; and the trajectory provides the prior knowledge needed to locate the target in the current frame.

Description

Adaptive regression tracking method based on modulator
Technical Field
The invention relates to a target tracking method, in particular to a modulator-based adaptive regression tracking method.
Background
Given an input search area, regression tracking estimates the position of the target by computing a response map, which is generated with a Gaussian function. Because the appearance of the target is affected by interference factors such as illumination changes, the model must be updated during tracking. For this reason, deep regression trackers typically fine-tune the model with hundreds of gradient-descent iterations. Although such trackers achieve good tracking accuracy, the fine-tuning process is inefficient and limits processing speed, and processing speed is often an important index for evaluating the quality of a tracking method.
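The Gaussian response map mentioned above can be sketched as follows; this is a minimal NumPy illustration, with map size, target centre, and the bandwidth `sigma` chosen arbitrarily rather than taken from the patent:

```python
import numpy as np

def gaussian_label_map(h, w, cy, cx, sigma):
    """Soft-label response map: a 2-D Gaussian centred on the target
    position (cy, cx), peaking at 1.0 at the centre."""
    ys = np.arange(h)[:, None]
    xs = np.arange(w)[None, :]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

# Example: a 64x64 map for a target centred at row 20, column 40.
m = gaussian_label_map(64, 64, 20, 40, sigma=4.0)
```

The peak of the map coincides with the target centre, which is what lets the tracker read the predicted position off the response map with a simple argmax.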
Disclosure of Invention
In order to avoid the fine tuning process with low efficiency and accelerate the regression tracking processing speed, the invention provides a modulator-based adaptive regression tracking method.
The purpose of the invention is realized by the following technical scheme:
a modulator-based adaptive regression tracking method aims to replace the fine tuning process of network parameters by the normalization (i.e. mean and variance) of the characteristics by a CBN layer, and the parameters of the network can be adjusted only through feed-forward propagation. Firstly, a modulator I-space-time context network is designed to generate a channel-level scale parameter gcTo adjust the weights of the different channels of the generic regression model. Secondly, a second-track network of the modulator is designed, and an element-level bias parameter b is generatedcTo incorporate a spatial prior in the generic regression model. And finally, embedding the CBN layer between the characteristics of each layer of the general regression model, namely, adaptively adjusting the characteristics of each layer in the network, so that the fine adjustment process with low efficiency is avoided, and the processing speed of the network is improved. The method specifically comprises the following steps:
Step one, design an attention-based spatio-temporal context network to generate the affine parameters corresponding to the spatio-temporal context. As shown in Fig. 1, modulator I, i.e., the spatio-temporal context network, is on the right side; following the conditional batch normalization (CBN) layer, its output is a 1-dimensional vector, i.e., the spatio-temporal context information is used to generate the channel weight g_c for each feature layer of the general regression network. The specific steps are as follows:
(1) design an attention-based spatio-temporal context network comprising 3 PMD units and their corresponding attention fusion modules, wherein: the kernel size in the LSTM of each PMD unit is 3 × 3, the hidden state size is set to 64, the initial state is set to 0, and each PMD unit comprises long short-term memory in 5 directions, namely the time, right, left, up, and down directions;
(2) fuse the information of each direction to obtain S:

S = [s_{t-}, s_{h-}, s_{h+}, s_{w-}, s_{w+}]^T,

and then fuse the information in S through the attention fusion module:

S' = M_c(S) ⊗ S,  S'' = M_s(S') ⊗ S',

where S' and S'' denote the channel attention feature and the spatial attention feature respectively, M_c is a 1-dimensional channel attention map, M_s is a 2-dimensional spatial attention map, and ⊗ is a bit-wise multiplication;

(3) input S'' into a 1 × 1 convolutional layer to obtain the affine parameter g_c corresponding to the spatio-temporal context.
Step two, design a trajectory network to generate the affine parameters corresponding to the trajectory. As shown in Fig. 1, modulator II, i.e., the trajectory network, is on the left side; following the conditional batch normalization (CBN) layer, its output is a 2-dimensional map, i.e., the trajectory is used to generate the offset b_c for each feature layer of the general regression network. The specific steps are as follows:
(1) design a trajectory network consisting of 3 convolutional layers with a kernel size of 3 × 3, whose outputs have 128, 128 and 64 channels respectively; ReLU activation layers are added between the convolutional layers to improve the nonlinearity of the feature representation, and 6 convolutional layers of size 1 × 1 are then used to generate b_c at the different down-sampled scales;
(2) designing a motion prior as a predicted position of a target in a previous frame, and representing a track of the previous frame as a track Gaussian map;
(3) down-sample the trajectory Gaussian map to different scales, and then use the trajectory Gaussian maps of different scales to generate the b_c of the corresponding conditional layer:

b_c = γ ⊙ M_t + β,

where M_t denotes a down-sampled trajectory Gaussian map, γ and β are scale and offset parameters learned by a 1 × 1 convolution, and b_c is the affine parameter corresponding to the trajectory;
Step three, blend the two parameters generated in steps one and two into each layer's parameters of the general regression network, adaptively adjusting those parameters so that the network responds more strongly to the specific target, wherein: a VGG-16 model pre-trained on ImageNet is used as the feature extractor; the regression model computes a Gaussian map of the target location by fusing the feature representations of conv4_3 and conv5_3; and a CBN layer is added after each of the 6 convolutional layers between conv4_1 and conv5_3, the CBN parameters of each layer coming from the g_c of the spatio-temporal context network and the b_c of the trajectory network.
Compared with the prior art, the invention has the following advantages:
1. The model requires no inefficient fine-tuning process during tracking.
2. The context prediction network encodes the relevant spatio-temporal context of past frames, helping distinguish the target from the background.
3. The trajectory provides the prior knowledge needed to locate the target in the current frame.
Drawings
FIG. 1 is a flow chart of an adaptive regression tracking method based on a modulator according to the present invention;
FIG. 2 is a structure of a spatiotemporal context network;
FIG. 3 is a comparison of the method of the present invention with other mainstream target tracking methods on the OTB2015 dataset;
FIG. 4 is a comparison of the method of the present invention with other mainstream target tracking methods on the TC128 dataset;
fig. 5 is a visual comparison of modulators.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a modulator-based adaptive regression tracking method, named CARM. In the feed-forward pass, two modulators are designed to adaptively adjust the parameters of the intermediate layers of a general regression network. Both modulators are realized as neural networks and attend to two kinds of information important for tracking: the spatio-temporal context and the trajectory. These two kinds of information are chosen for the following reasons. (1) The context is the background within a certain range of the target object, and the change of context between two adjacent frames is relatively small; there is therefore a definite spatio-temporal relationship between successive frames. At the same time, the partially redundant spatio-temporal context must be refined in order to extract the more relevant context and thus locate the target more accurately. (2) During tracking, the target moves relatively smoothly most of the time and its appearance changes slowly, so the trajectory is an important clue for locating the target in the current frame. Based on this analysis, an attention-based context prediction network and a trajectory network are designed to extract the relatively important spatio-temporal context information and the target trajectory respectively; each generates a set of parameters, which are fused into the general regression network to adjust its hierarchical feature representations. The model can then produce a higher response to a specific target without relying on a fine-tuning process.
Fig. 1 shows the overall structure of the whole network, which can be roughly divided into three parts, which are as follows:
the first part is the design of two modulators, and before describing the design of the modulators, we will briefly describe the Conditional Batch Normalization (CBN) technique, which is widely used in the learning of Conditional examples.
Batch Normalization (BN) makes training of feed-forward neural networks effective by normalizing feature statistics. Given a batch of inputs x ∈ R^{N×C×H×W}, batch normalization over the mean and standard deviation of each feature channel can be expressed as:

BN(x_c | g_c, b_c) = g_c · (x_c − m(x_c)) / s(x_c) + b_c,  (1)

where g_c and b_c are affine parameters learned from the data, and m(x_c) and s(x_c) are the mean and standard deviation, computed across the batch for each individual feature channel. Building on batch normalization, Conditional Batch Normalization (CBN) mitigates domain shift by computing the statistics in the target domain, and g_c and b_c may be produced by other parameter generators. The present invention therefore designs two networks (the modulators) to generate the affine parameters g_c and b_c corresponding to the spatio-temporal context and the trajectory respectively. In equation (1), the input x is normalized by the affine parameters g_c and b_c in the scale and offset domains respectively, where g_c is a 1-dimensional vector of channel weights and b_c is a 2-dimensional matrix of offsets over the horizontal and vertical coordinates. Accordingly, the two networks are designed so that the spatio-temporal context produces channel-level scale parameters to adjust the weights of the different channels, while the trajectory produces element-level bias parameters to incorporate a spatial prior into the regression model.
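Equation (1) can be sketched in a few lines of NumPy. This is an illustrative implementation only: the patent generates g_c and b_c with its two modulator networks, whereas here they are passed in directly, and the epsilon for numerical stability is an assumption:

```python
import numpy as np

def conditional_batch_norm(x, g_c, b_c, eps=1e-5):
    """Eq. (1): normalise each channel over batch and spatial axes,
    then apply externally generated affine parameters g_c (scale)
    and b_c (offset).  x: (N, C, H, W); g_c: (C,); b_c: (C,) or (C, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)      # m(x_c)
    std = x.std(axis=(0, 2, 3), keepdims=True) + eps  # s(x_c)
    x_hat = (x - mean) / std
    g = np.asarray(g_c).reshape(1, -1, 1, 1)
    b = np.asarray(b_c)
    b = b.reshape(1, -1, 1, 1) if b.ndim == 1 else b[None]
    return g * x_hat + b

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 5, 5))
# With g_c = 1 and b_c = 0 this reduces to plain batch normalization.
y = conditional_batch_norm(x, g_c=np.ones(4), b_c=np.zeros(4))
```

The key point the sketch makes concrete is that g_c and b_c are inputs, not trained weights of this layer, which is exactly what allows the modulators to steer the regression network in a single forward pass.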
Spatio-temporal context network: in the target tracking task, the context is the background area within a certain range around the target, and in most cases the change in context between two frames is smooth, so a strong spatio-temporal relationship exists between the contexts of successive frames. Based on this analysis, the invention designs the spatio-temporal context network to extract continuous spatio-temporal context information and generate the channel weights of the regression network.
The structure of the spatio-temporal context network is shown in Fig. 2. Each Parallel Multi-Directional LSTM unit (PMD) contains Long Short-Term Memory (LSTM) in 5 directions: the time (t−), right (w+), left (w−), up (h+) and down (h−) directions. After each PMD unit, an attention-based fusion module screens the outputs of each direction in the PMD unit and selects the spatio-temporal context information most relevant to target localization. The spatio-temporal context network is composed by stacking N such modules (a PMD unit and its corresponding attention fusion module). Mathematically, a PMD unit can be expressed as follows:
i_k = σ(W_i ∗ x_k + H_i ∗ s_{k−1}),
f_k = σ(W_f ∗ x_k + H_f ∗ s_{k−1}),
o_k = σ(W_o ∗ x_k + H_o ∗ s_{k−1}),
c̃_k = tanh(W_c ∗ x_k + H_c ∗ s_{k−1}),
c_k = f_k ⊙ c_{k−1} + i_k ⊙ c̃_k,
s_k = o_k ⊙ tanh(c_k),  (2)

where x_k denotes the input; i_k, f_k and o_k denote the input, forget and output gates respectively; the PMD unit computes the current state c_k and the hidden state s_k, with c̃_k the input to c_k; ∗ denotes a convolution operation and ⊙ a bit-wise multiplication; the non-linear functions σ and tanh both operate bit-wise; and W and H are the weights of the input and output states respectively.
If the LSTM outputs of each direction were simply added, every direction's information would carry the same weight. The invention instead uses an attention module to fuse the directional outputs and screen out the relatively important information for accurate target localization. To this end, the information of each direction is fused to obtain S:

S = [s_{t−}, s_{h−}, s_{h+}, s_{w−}, s_{w+}]^T,  (3)

and the information in S is then fused by the attention module:

S' = M_c(S) ⊗ S,  S'' = M_s(S') ⊗ S',  (4)

where M_c is a 1-dimensional channel attention map, M_s is a 2-dimensional spatial attention map, and ⊗ is a bit-wise multiplication operation. This process propagates the channel attention values along the spatial dimension and vice versa. For channel attention, the channel attention map characterizes the relationships between the feature channels. Two spatial context descriptors, S_avg and S_max, denote the average-pooled and max-pooled features respectively. The channel attention map M_c is obtained by feeding each of the two spatial context descriptors into a shared fully connected layer and fusing the output feature vectors by element-wise summation. The channel attention can be expressed as:

M_c(S) = σ(W_1(W_0(S_avg)) + W_1(W_0(S_max))),  (5)

where σ is the sigmoid function and W_0 and W_1 are the parameters of the two fully connected layers.
After the channel attention map is obtained, a spatial attention map is generated from the spatial relationships between the features. To compute spatial attention, average pooling and max pooling are first applied to S and the results are concatenated along the channel axis; a convolutional layer then generates the spatial attention map M_s, which encodes the positions to be emphasized. This process can be expressed as:

M_s(S) = σ(f([S_avg; S_max])),  (6)

where σ is the sigmoid function, [;] and f(·) denote the concatenation and convolution operations respectively, and S_avg and S_max are the average-pooled and max-pooled feature representations along the channel axis. Finally, g_c is obtained by feeding S'' into a 1 × 1 convolutional layer.
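The channel and spatial attention steps above can be sketched as follows. This is an illustrative NumPy version under stated simplifications: the shared two-layer MLP uses an assumed ReLU in between and a channel-reduction ratio of 2, and the learned convolution f in equation (6) is replaced by a fixed average of the two channel-pooled maps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(S, W0, W1):
    """Eq. (5): shared two-layer MLP over average- and max-pooled
    channel descriptors, fused by element-wise summation.  S: (C, H, W)."""
    s_avg = S.mean(axis=(1, 2))                       # (C,)
    s_max = S.max(axis=(1, 2))                        # (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)      # ReLU assumed
    return sigmoid(mlp(s_avg) + mlp(s_max))           # (C,)

def spatial_attention(S):
    """Eq. (6), with the learned convolution f replaced by a fixed
    average of the two channel-pooled maps (illustration only)."""
    pooled = np.stack([S.mean(axis=0), S.max(axis=0)])  # (2, H, W)
    return sigmoid(pooled.mean(axis=0))                 # (H, W)

rng = np.random.default_rng(2)
C, H, W = 8, 6, 6
S = rng.normal(size=(C, H, W))
W0 = rng.normal(scale=0.1, size=(C // 2, C))
W1 = rng.normal(scale=0.1, size=(C, C // 2))
Mc = channel_attention(S, W0, W1)
S1 = Mc[:, None, None] * S            # S'  = Mc(S) ⊗ S
Ms = spatial_attention(S1)
S2 = Ms[None] * S1                    # S'' = Ms(S') ⊗ S'
```

The broadcasting in the last lines is the bit-wise multiplication ⊗ of equation (4): the 1-D channel map scales whole channels, while the 2-D spatial map scales whole pixel positions.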
Trajectory network: the trajectory network plays the role of a motion prior. Here, the motion prior is the predicted position of the target in the previous frame. Since the final position of the target is estimated from a Gaussian-like map, the trajectory of the previous frame is likewise represented as a Gaussian map. To match the resolutions of the different feature maps in the general regression network, the trajectory Gaussian map is down-sampled to different scales, and the trajectory Gaussian maps of the different scales are then used to generate the b_c of the corresponding conditional layers:

b_c = γ ⊙ M_t + β,  (7)

where M_t denotes a down-sampled trajectory Gaussian map, and γ and β are scale and offset parameters learned by a 1 × 1 convolution.
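The trajectory-prior path can be sketched as follows. This is illustrative only: the down-sampling method (average pooling), the scalar stand-ins for the learned γ and β of equation (7), and all sizes are assumptions, not values from the patent:

```python
import numpy as np

def downsample(m, factor):
    """Average-pool a 2-D map by an integer factor, one simple way to
    match the feature resolutions of the regression network."""
    h, w = m.shape
    return m[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def trajectory_bias(M_t, gamma, beta):
    """Eq. (7): element-level bias b_c = gamma * M_t + beta; gamma and
    beta stand in for parameters a 1x1 convolution would learn."""
    return gamma * M_t + beta

# Trajectory Gaussian map centred on the previous frame's predicted position.
ys, xs = np.mgrid[0:32, 0:32]
M = np.exp(-((ys - 16) ** 2 + (xs - 10) ** 2) / (2 * 3.0 ** 2))
M_t = downsample(M, 2)                       # match a coarser feature map
b_c = trajectory_bias(M_t, gamma=0.5, beta=0.0)
```

Because b_c is added element-wise inside the CBN layer, the bias is largest exactly where the target was last seen, which is how the motion prior biases the response map toward plausible positions.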
The second part is the implementation, training and prediction of the network.
(1) General regression network. A VGG-16 model pre-trained on ImageNet is used as the feature extractor. The regression model computes a Gaussian map of the target position by fusing the feature representations of conv4_3 and conv5_3. A CBN layer is added after each of the 6 convolutional layers between conv4_1 and conv5_3. (2) Spatio-temporal context network. The spatio-temporal context network is realized by stacking 3 PMD units and their corresponding attention fusion modules; the kernel size in the LSTM of each PMD unit is 3 × 3, the hidden state size is set to 64, and the initial state is set to 0. (3) Trajectory network. The trajectory network consists of 3 convolutional layers with a kernel size of 3 × 3, whose outputs have 128, 128 and 64 channels respectively; ReLU activation layers between the convolutional layers improve the nonlinearity of the feature representation. Then, 6 convolutional layers of size 1 × 1 generate b_c to match the feature scales of conv4_1 through conv5_3 in the general regression network. All the above convolutional layers are randomly initialized.
Training the network proceeds in two stages. In the first stage, the general regression network is trained with a multi-domain learning strategy: in each iteration, the network is updated with a batch of 8 training samples from one sequence, where one branch (i.e., one regression model) corresponds to one domain. The general regression network is trained on ILSVRC2015 with the shrinkage loss L_shrinkage. The input search area is 5 times the target size, and the corresponding soft labels are generated by a Gaussian function. The Adam optimizer is used with a learning rate of 10^{-5} for 80,000 iterations. In the second stage, the general regression network and the two modulators (i.e., the spatio-temporal context network and the trajectory network) are trained jointly. Given a pair of consecutive frames (x_i, x_j), two corresponding search areas (sa_i, sa_j) and a soft label sl_j are generated. The search area sa_i is fed into the spatio-temporal context network, which produces the scale parameter g_c and a predicted search area ŝa. The loss between the predicted search area ŝa and the input search area sa_i is minimized:

L_pre = L_gdl(ŝa, sa_i),  (8)

where L_gdl is the image Gradient Difference Loss (GDL), with exponent p = 1. The soft label sl_j is then fed into the trajectory network to generate b_c. The whole network is optimized with:

L = L_shrinkage(ŷ_j, sl_j) + a · L_pre,  (9)

where ŷ_j is the predicted output and a = 1. The second stage is likewise trained with the Adam optimizer, at a learning rate of 10^{-6} for 11,000 iterations.
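The image Gradient Difference Loss mentioned above can be sketched as follows, under a common formulation of GDL (penalising differences between the horizontal and vertical finite-difference gradients of the two images); the exact variant used in the patent is not spelled out, so this is an assumption:

```python
import numpy as np

def gdl(pred, target, p=1):
    """Gradient difference loss: compares the vertical and horizontal
    finite-difference gradients of two images, with exponent p (p = 1
    here, as stated in the text)."""
    dy_p, dy_t = np.diff(pred, axis=0), np.diff(target, axis=0)
    dx_p, dx_t = np.diff(pred, axis=1), np.diff(target, axis=1)
    return (np.abs(dy_p - dy_t) ** p).sum() + (np.abs(dx_p - dx_t) ** p).sum()

a = np.arange(16.0).reshape(4, 4)
assert gdl(a, a) == 0.0   # identical images give zero loss
```

A notable property of GDL, visible from the code, is that a constant brightness offset between prediction and target contributes nothing to the loss, since only gradients are compared.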
Prediction: given a video frame, a search area is cropped according to the prediction result of the previous frame. The input of the whole network comprises the current search area, the search area of the previous frame, and the Gaussian response map. In the output response map, the coordinate with the largest value represents the predicted target position. Meanwhile, a scale-pyramid strategy is used to predict the scale of the target.
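The prediction step can be sketched as follows. The argmax rule is exactly what the text describes; the scale-pyramid rule, including the change-penalty factor, is an illustrative assumption since the patent does not give its details:

```python
import numpy as np

def locate_target(response):
    """The coordinate with the largest response value is the predicted
    target position."""
    return np.unravel_index(np.argmax(response), response.shape)

def pick_scale(responses, scales, penalty=0.97):
    """Illustrative scale-pyramid rule: evaluate the response at several
    scales and keep the one with the highest, slightly penalised, peak;
    the penalty discourages abrupt scale changes (assumed, not from
    the patent)."""
    peaks = [r.max() * (penalty if s != 1.0 else 1.0)
             for r, s in zip(responses, scales)]
    return scales[int(np.argmax(peaks))]

r = np.zeros((31, 31)); r[12, 20] = 1.0
pos = locate_target(r)                                   # (12, 20)
best = pick_scale([r * 0.9, r, r * 0.95], scales=[0.98, 1.0, 1.02])
```

With these responses the unchanged scale wins, since the off-scale peaks are both lower and penalised.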
Experimental results
To verify the performance of the invention, a second version of CARM, named CARM-u, is also evaluated; in CARM-u, the entire network is updated every 10 frames. Performance is compared with mainstream tracking methods on four public datasets (OTB2015, TC128, UAV123 and VOT2018). On OTB2015, the tracking methods are evaluated by two indices, Precision and Success rate; the same two metrics are used on the TC128 and UAV123 datasets. On the VOT2018 dataset, methods are evaluated by Accuracy (AR), Robustness (RR) and Expected Average Overlap (EAO).
The performance of the invention was first tested on the OTB2015 dataset. A comparison of CARM-u and CARM with SiamRPN++, SiamBAN, DSLT, KYS, DiMP, PrDiMP, DaSiamRPN, ATOM and meta_crest is shown in Fig. 3 and Table 1. According to Fig. 3 and Table 1, CARM achieves competitive results, and the updated version (CARM-u) achieves the highest precision and success scores (92.0% and 70.7%) compared with the other regression-based tracking methods. This robust performance can be attributed to two factors: first, the spatio-temporal context network captures both the target and its local context changes; second, the trajectory network provides a motion prior that pinpoints the target.
Table 1 comparison of OTB2015 data sets
The TC128 dataset contains 128 full-color video sequences. CARM and CARM-u are compared fairly with the other mainstream methods on this dataset. Fig. 4 shows that the proposed method achieves the highest success-rate score among the compared tracking methods. Since TC128 contains a large number of small target objects, this indicates that the invention performs well when tracking small targets.
The UAV123 dataset contains over 11,000 video frames captured from a drone platform. The precision and success-rate results are shown in Table 2. Among regression-based methods, CARM-u clearly outperforms DSLT, reaching precision and success-rate scores of 82.2 and 62.5 respectively. In particular, CARM, the version that requires no fine-tuning, beats DSLT in both precision and success rate while running 6 times faster. These results show that the method has good robustness and generalization ability.
TABLE 2 comparison of UAV123 data sets
VOT2018 is a challenging, recently published dataset. Table 3 shows the performance comparison of the invention with the other tracking methods. According to Table 3, the invention has the highest EAO value compared with the two regression-based tracking methods DSLT and CREST, and ranks first on both the AR and RR indices. The meaningful features extracted by the two modulators therefore not only locate the target accurately but also provide strong robustness.
TABLE 3 comparison of VOT2018 datasets
To verify the effectiveness of the components of the invention, an ablation study was performed on the OTB2015 dataset. First, the performance impact of the two modulators on the general regression network was verified; Table 4 shows the results, all of which support the contribution of both modulators to pinpointing the target. In addition, Fig. 5 shows how the modulators adjust the target response map: the first and fourth columns are input search areas, the second and fifth columns are response maps output by the general regression network alone, and the third and sixth columns are response maps of the general regression network assisted by the two modulators. Fig. 5 shows that the modulators help the tracker cope with various target appearance changes and also effectively suppress distractors in the background.
Table 4 ablation experiments on OTB2015 dataset

Claims (4)

1. A modulator-based adaptive regression tracking method, said method comprising the steps of:
designing a spatiotemporal context network based on attention, and generating affine parameters corresponding to spatiotemporal context;
designing a track network to generate affine parameters corresponding to the track;
and step three, blending the two parameters generated in step one and step two into each layer's parameters of the general regression network, and adaptively adjusting the parameters of the general regression network so that the network has a higher response to a specific target.
2. The modulator-based adaptive regression tracking method according to claim 1, wherein the specific step of the first step is as follows:
(1) designing an attention-based spatio-temporal context network comprising 3 PMD units and their corresponding attention fusion modules, wherein: the kernel size in the LSTM of each PMD unit is 3 × 3, the hidden state size is set to 64, the initial state is set to 0, and each PMD unit comprises long short-term memory in 5 directions, namely the time, right, left, up and down directions;
(2) fusing the information of each direction to obtain S:

S = [s_{t−}, s_{h−}, s_{h+}, s_{w−}, s_{w+}]^T,

and then fusing the information in S through the attention fusion module:

S' = M_c(S) ⊗ S,  S'' = M_s(S') ⊗ S',

wherein S' and S'' respectively denote the channel attention feature and the spatial attention feature, M_c is a 1-dimensional channel attention map, M_s is a 2-dimensional spatial attention map, and ⊗ is a bit-wise multiplication operation;

(3) inputting S'' into a 1 × 1 convolutional layer to obtain the affine parameter g_c corresponding to the spatio-temporal context.
3. The modulator-based adaptive regression tracking method according to claim 1, wherein the specific steps of the second step are as follows:
(1) designing a trajectory network consisting of 3 convolutional layers with a kernel size of 3 × 3, the outputs of the convolutional layers having 128, 128 and 64 channels respectively, adding ReLU activation layers between the convolutional layers to improve the nonlinearity of the feature representation, and then using 6 convolutional layers of size 1 × 1 to generate b_c at the different down-sampled scales;
(2) designing a motion prior as a predicted position of a target in a previous frame, and representing a track of the previous frame as a track Gaussian map;
(3) downsampling the trajectory Gaussian map to different scales, and then using the trajectory Gaussian maps of different scales to generate the b_c corresponding to each conditional layer:
b_c = γ ⊗ M_t + β
wherein M_t represents a down-sampled trajectory Gaussian map, γ and β are scale and offset parameters learned by a 1 × 1 convolution, and b_c is the affine parameter corresponding to the trajectory.
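Steps (2)–(3) can be sketched as follows: build a Gaussian map centered at the previous frame's predicted position, downsample it, and apply the learned scale and offset. A minimal NumPy sketch; the constant per-channel `gamma`/`beta` stand in for the parameters that the patent learns with a 1 × 1 convolution and are purely illustrative.

```python
import numpy as np

def trajectory_gaussian(h, w, cy, cx, sigma=2.0):
    """Gaussian map encoding the target position predicted in the previous frame."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def downsample2(m):
    """Naive 2x downsampling by 2 x 2 average pooling (illustrative)."""
    return m.reshape(m.shape[0] // 2, 2, m.shape[1] // 2, 2).mean(axis=(1, 3))

def trajectory_affine(M_t, gamma, beta):
    """b_c = gamma (x) M_t + beta, a per-channel scale and offset of the map."""
    return gamma[:, None, None] * M_t[None] + beta[:, None, None]

M = trajectory_gaussian(32, 32, cy=10, cx=20)  # previous-frame trajectory map
M_t = downsample2(M)                           # one coarser scale, (16, 16)
gamma = np.full(64, 0.5)                       # illustrative learned scale
beta = np.full(64, 0.1)                        # illustrative learned offset
b_c = trajectory_affine(M_t, gamma, beta)
print(b_c.shape)                               # (64, 16, 16)
```

Each conditional layer would receive a b_c generated from the trajectory map at the matching spatial scale.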
4. The modulator-based adaptive regression tracking method according to claim 1, wherein in the third step, a VGG-16 model pre-trained on ImageNet is used as the feature extractor, the regression model computes a Gaussian map of the target position by fusing the feature representations of conv4_3 and conv5_3, and after CBN layers are added to each of the 6 convolutional layers between conv4_1 and conv5_3, the CBN parameters of each layer come from g_c of the spatio-temporal context network and b_c of the trajectory network.
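A conditional batch normalization (CBN) layer of the kind described in claim 4 normalizes a feature map and then applies modulator-generated affine parameters instead of fixed learned ones. Below is a minimal per-sample NumPy sketch, assuming for illustration that g_c supplies the scale and b_c the shift; the patent does not spell out this exact split, so the assignment is an assumption.

```python
import numpy as np

def conditional_bn(x, g_c, b_c, eps=1e-5):
    """Conditional batch norm over one feature map x of shape (C, H, W).

    Normalizes each channel, then modulates with affine parameters produced
    by the modulators (here: g_c from the spatio-temporal context branch as
    scale, b_c from the trajectory branch as shift -- an illustrative split).
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return g_c[:, None, None] * x_hat + b_c[:, None, None]

rng = np.random.default_rng(1)
feat = rng.normal(loc=3.0, scale=2.0, size=(64, 14, 14))  # e.g. a conv4_x output
g_c = np.ones(64)                                          # identity scale
b_c = np.zeros(64)                                         # zero shift
out = conditional_bn(feat, g_c, b_c)
print(round(float(out.mean()), 6))                         # ~0: channels normalized
```

With identity scale and zero shift the layer reduces to plain normalization; in the tracker, g_c and b_c vary per frame, so the same frozen VGG-16 features are re-modulated by the current spatio-temporal context and trajectory.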
CN202111222510.XA 2021-10-20 2021-10-20 Self-adaptive regression tracking method based on modulator Active CN113947618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111222510.XA CN113947618B (en) 2021-10-20 2021-10-20 Self-adaptive regression tracking method based on modulator

Publications (2)

Publication Number Publication Date
CN113947618A true CN113947618A (en) 2022-01-18
CN113947618B CN113947618B (en) 2023-08-29

Family

ID=79332090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111222510.XA Active CN113947618B (en) 2021-10-20 2021-10-20 Self-adaptive regression tracking method based on modulator

Country Status (1)

Country Link
CN (1) CN113947618B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017147552A1 (en) * 2016-02-26 2017-08-31 Daniela Brunner Multi-format, multi-domain and multi-algorithm metalearner system and method for monitoring human health, and deriving health status and trajectory
CN109493364A (en) * 2018-09-26 2019-03-19 重庆邮电大学 A kind of target tracking algorism of combination residual error attention and contextual information
CN109685831A (en) * 2018-12-20 2019-04-26 山东大学 Method for tracking target and system based on residual error layering attention and correlation filter
CN110223316A (en) * 2019-06-13 2019-09-10 哈尔滨工业大学 Fast-moving target tracking method based on circulation Recurrent networks
CN110335290A (en) * 2019-06-04 2019-10-15 大连理工大学 Twin candidate region based on attention mechanism generates network target tracking method
CN110569706A (en) * 2019-06-25 2019-12-13 南京信息工程大学 Deep integration target tracking algorithm based on time and space network
CN112651995A (en) * 2020-12-21 2021-04-13 江南大学 On-line multi-target tracking method based on multifunctional aggregation and tracking simulation training
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism
CN113256677A (en) * 2021-04-16 2021-08-13 浙江工业大学 Method for tracking visual target with attention

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUN YAN et al.: "Trajectory prediction for intelligent vehicles using spatial-attention mechanism", IET Intelligent Transport Systems, vol. 14, no. 13, pages 1855-1863 *
刘业鑫 et al.: "Natural scene text detection method based on text center lines", Intelligent Computer and Applications, vol. 10, no. 2, pages 374-379 *
詹紫微 et al.: "Research on target tracking methods based on convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology, no. 2, pages 138-1271 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant