CN113947618A - Adaptive regression tracking method based on modulator - Google Patents

Adaptive regression tracking method based on modulator

Info

Publication number
CN113947618A
CN113947618A (application CN202111222510.XA)
Authority
CN
China
Prior art keywords
network
parameters
attention
context
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111222510.XA
Other languages
Chinese (zh)
Other versions
CN113947618B (en
Inventor
邬向前
卜巍
马丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202111222510.XA priority Critical patent/CN113947618B/en
Publication of CN113947618A publication Critical patent/CN113947618A/en
Application granted granted Critical
Publication of CN113947618B publication Critical patent/CN113947618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a modulator-based adaptive regression tracking method comprising the following steps: step one, design an attention-based spatio-temporal context network that generates the affine parameters corresponding to the spatio-temporal context; step two, design a trajectory network that generates the affine parameters corresponding to the trajectory; and step three, blend the two sets of parameters generated in steps one and two into each layer's parameters of a general regression network, adaptively adjusting those parameters so that the network responds more strongly to the specific target. Compared with the prior art, the invention has the following advantages: the model requires no inefficient fine-tuning process during tracking; the context prediction network encodes the relevant spatio-temporal background of past frames, which helps distinguish the target from the background; and the trajectory provides the prior knowledge needed to locate the target in the current frame.

Description

Adaptive regression tracking method based on modulator
Technical Field
The invention relates to a target tracking method, in particular to a modulator-based adaptive regression tracking method.
Background
Given an input search area, regression tracking estimates the position of the target by computing a response map, which is generated with a Gaussian function. Because the appearance of the target is affected by interference factors such as illumination changes, the model must be updated during tracking. For this reason, deep regression trackers typically fine-tune the model with hundreds of gradient-descent iterations. Although such trackers achieve good tracking accuracy, the fine-tuning process is inefficient and limits processing speed, and processing speed is often an important index for evaluating the quality of a tracking method.
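The Gaussian response map mentioned above can be sketched as follows; this is a minimal NumPy illustration, with map size, target centre, and the bandwidth `sigma` chosen arbitrarily rather than taken from the patent:

```python
import numpy as np

def gaussian_label_map(h, w, cy, cx, sigma):
    """Soft-label response map: a 2-D Gaussian centred on the target
    position (cy, cx), peaking at 1.0 at the centre."""
    ys = np.arange(h)[:, None]
    xs = np.arange(w)[None, :]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

# Example: a 64x64 map for a target centred at row 20, column 40.
m = gaussian_label_map(64, 64, 20, 40, sigma=4.0)
```

The peak of the map coincides with the target centre, which is what lets the tracker read the predicted position off the response map with a simple argmax.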
Disclosure of Invention
In order to avoid the fine tuning process with low efficiency and accelerate the regression tracking processing speed, the invention provides a modulator-based adaptive regression tracking method.
The purpose of the invention is realized by the following technical scheme:
a modulator-based adaptive regression tracking method aims to replace the fine tuning process of network parameters by the normalization (i.e. mean and variance) of the characteristics by a CBN layer, and the parameters of the network can be adjusted only through feed-forward propagation. Firstly, a modulator I-space-time context network is designed to generate a channel-level scale parameter gcTo adjust the weights of the different channels of the generic regression model. Secondly, a second-track network of the modulator is designed, and an element-level bias parameter b is generatedcTo incorporate a spatial prior in the generic regression model. And finally, embedding the CBN layer between the characteristics of each layer of the general regression model, namely, adaptively adjusting the characteristics of each layer in the network, so that the fine adjustment process with low efficiency is avoided, and the processing speed of the network is improved. The method specifically comprises the following steps:
Step one, design an attention-based spatio-temporal context network to generate the affine parameters corresponding to the spatio-temporal context. As shown in Fig. 1, modulator I, i.e., the spatio-temporal context network, is on the right side; following the conditional batch normalization (CBN) layer, its output is a 1-dimensional vector, i.e., the spatio-temporal context information is used to generate the channel weight g_c for each feature layer of the general regression network. The specific steps are as follows:
(1) design an attention-based spatio-temporal context network comprising 3 PMD units and their corresponding attention fusion modules, wherein: the kernel size in the LSTM of each PMD unit is 3 × 3, the hidden state size is set to 64, the initial state is set to 0, and each PMD unit comprises long short-term memory in 5 directions, namely the time, right, left, up, and down directions;
(2) fuse the information of each direction to obtain S:

S = [s_{t-}, s_{h-}, s_{h+}, s_{w-}, s_{w+}]^T,

and then fuse the information in S through the attention fusion module:

S' = M_c(S) ⊗ S,  S'' = M_s(S') ⊗ S',

where S' and S'' denote the channel attention feature and the spatial attention feature respectively, M_c is a 1-dimensional channel attention map, M_s is a 2-dimensional spatial attention map, and ⊗ is a bit-wise multiplication;

(3) input S'' into a 1 × 1 convolutional layer to obtain the affine parameter g_c corresponding to the spatio-temporal context.
Step two, design a trajectory network to generate the affine parameters corresponding to the trajectory. As shown in Fig. 1, modulator II, i.e., the trajectory network, is on the left side; following the conditional batch normalization (CBN) layer, its output is a 2-dimensional map, i.e., the trajectory is used to generate the offset b_c for each feature layer of the general regression network. The specific steps are as follows:
(1) design a trajectory network consisting of 3 convolutional layers with a kernel size of 3 × 3, whose outputs have 128, 128 and 64 channels respectively; ReLU activation layers are added between the convolutional layers to improve the nonlinearity of the feature representation, and 6 convolutional layers of size 1 × 1 are then used to generate b_c at the different down-sampled scales;
(2) designing a motion prior as a predicted position of a target in a previous frame, and representing a track of the previous frame as a track Gaussian map;
(3) down-sample the trajectory Gaussian map to different scales, and then use the trajectory Gaussian maps of different scales to generate the b_c of the corresponding conditional layer:

b_c = γ ⊙ M_t + β,

where M_t denotes a down-sampled trajectory Gaussian map, γ and β are scale and offset parameters learned by a 1 × 1 convolution, and b_c is the affine parameter corresponding to the trajectory;
Step three, blend the two parameters generated in steps one and two into each layer's parameters of the general regression network, adaptively adjusting those parameters so that the network responds more strongly to the specific target, wherein: a VGG-16 model pre-trained on ImageNet is used as the feature extractor; the regression model computes a Gaussian map of the target location by fusing the feature representations of conv4_3 and conv5_3; and a CBN layer is added after each of the 6 convolutional layers between conv4_1 and conv5_3, the CBN parameters of each layer coming from the g_c of the spatio-temporal context network and the b_c of the trajectory network.
Compared with the prior art, the invention has the following advantages:
1. The model requires no inefficient fine-tuning process during tracking.
2. The context prediction network encodes the relevant spatio-temporal context of past frames, helping distinguish the target from the background.
3. The trajectory provides the prior knowledge needed to locate the target in the current frame.
Drawings
FIG. 1 is a flow chart of an adaptive regression tracking method based on a modulator according to the present invention;
FIG. 2 is a structure of a spatiotemporal context network;
FIG. 3 is a comparison of the method of the present invention with other mainstream target tracking methods on the OTB2015 dataset;
FIG. 4 is a comparison of the method of the present invention with other mainstream target tracking methods on the TC128 dataset;
fig. 5 is a visual comparison of modulators.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a modulator-based adaptive regression tracking method, named CARM. In the feed-forward pass, two modulators are designed to adaptively adjust the parameters of the intermediate layers of a general regression network. Both modulators are realized as neural networks and attend to two kinds of information important for tracking: the spatio-temporal context and the trajectory. These two kinds of information are chosen for the following reasons. (1) The context is the background within a certain range of the target object, and the change of context between two adjacent frames is relatively small; there is therefore a definite spatio-temporal relationship between successive frames. At the same time, the partially redundant spatio-temporal context must be refined in order to extract the more relevant context and thus locate the target more accurately. (2) During tracking, the target moves relatively smoothly most of the time and its appearance changes slowly, so the trajectory is an important clue for locating the target in the current frame. Based on this analysis, an attention-based context prediction network and a trajectory network are designed to extract the relatively important spatio-temporal context information and the target trajectory respectively; each generates a set of parameters, which are fused into the general regression network to adjust its hierarchical feature representations. The model can then produce a higher response to a specific target without relying on a fine-tuning process.
Fig. 1 shows the overall structure of the whole network, which can be roughly divided into three parts, which are as follows:
the first part is the design of two modulators, and before describing the design of the modulators, we will briefly describe the Conditional Batch Normalization (CBN) technique, which is widely used in the learning of Conditional examples.
Batch Normalization (BN) makes training of feed-forward neural networks effective by normalizing feature statistics. Given a batch of inputs x ∈ R^{N×C×H×W}, batch normalization over the mean and standard deviation of each feature channel can be expressed as:

BN(x_c | g_c, b_c) = g_c · (x_c − m(x_c)) / s(x_c) + b_c,  (1)

where g_c and b_c are affine parameters learned from the data, and m(x_c) and s(x_c) are the mean and standard deviation, computed across the batch for each individual feature channel. Building on batch normalization, Conditional Batch Normalization (CBN) mitigates domain shift by computing the statistics in the target domain, and g_c and b_c may be produced by other parameter generators. The present invention therefore designs two networks (the modulators) to generate the affine parameters g_c and b_c corresponding to the spatio-temporal context and the trajectory respectively. In equation (1), the input x is normalized by the affine parameters g_c and b_c in the scale and offset domains respectively, where g_c is a 1-dimensional vector of channel weights and b_c is a 2-dimensional matrix of offsets over the horizontal and vertical coordinates. Accordingly, the two networks are designed so that the spatio-temporal context produces channel-level scale parameters to adjust the weights of the different channels, while the trajectory produces element-level bias parameters to incorporate a spatial prior into the regression model.
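Equation (1) can be sketched in a few lines of NumPy. This is an illustrative implementation only: the patent generates g_c and b_c with its two modulator networks, whereas here they are passed in directly, and the epsilon for numerical stability is an assumption:

```python
import numpy as np

def conditional_batch_norm(x, g_c, b_c, eps=1e-5):
    """Eq. (1): normalise each channel over batch and spatial axes,
    then apply externally generated affine parameters g_c (scale)
    and b_c (offset).  x: (N, C, H, W); g_c: (C,); b_c: (C,) or (C, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)      # m(x_c)
    std = x.std(axis=(0, 2, 3), keepdims=True) + eps  # s(x_c)
    x_hat = (x - mean) / std
    g = np.asarray(g_c).reshape(1, -1, 1, 1)
    b = np.asarray(b_c)
    b = b.reshape(1, -1, 1, 1) if b.ndim == 1 else b[None]
    return g * x_hat + b

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 5, 5))
# With g_c = 1 and b_c = 0 this reduces to plain batch normalization.
y = conditional_batch_norm(x, g_c=np.ones(4), b_c=np.zeros(4))
```

The key point the sketch makes concrete is that g_c and b_c are inputs, not trained weights of this layer, which is exactly what allows the modulators to steer the regression network in a single forward pass.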
Spatio-temporal context network: in the target tracking task, the context is the background area within a certain range around the target, and in most cases the change in context between two frames is smooth, so a strong spatio-temporal relationship exists between the contexts of successive frames. Based on this analysis, the invention designs the spatio-temporal context network to extract continuous spatio-temporal context information and generate the channel weights of the regression network.
The structure of the spatio-temporal context network is shown in Fig. 2. Each Parallel Multi-Directional LSTM unit (PMD) contains Long Short-Term Memory (LSTM) in 5 directions: the time (t−), right (w+), left (w−), up (h+) and down (h−) directions. After each PMD unit, an attention-based fusion module screens the outputs of each direction in the PMD unit and selects the spatio-temporal context information most relevant to target localization. The spatio-temporal context network is composed by stacking N such modules (a PMD unit and its corresponding attention fusion module). Mathematically, a PMD unit can be expressed as follows:
i_k = σ(W_i ∗ x_k + H_i ∗ s_{k−1}),
f_k = σ(W_f ∗ x_k + H_f ∗ s_{k−1}),
o_k = σ(W_o ∗ x_k + H_o ∗ s_{k−1}),
c̃_k = tanh(W_c ∗ x_k + H_c ∗ s_{k−1}),
c_k = f_k ⊙ c_{k−1} + i_k ⊙ c̃_k,
s_k = o_k ⊙ tanh(c_k),  (2)

where x_k denotes the input; i_k, f_k and o_k denote the input, forget and output gates respectively; the PMD unit computes the current state c_k and the hidden state s_k, with c̃_k the input to c_k; ∗ denotes a convolution operation and ⊙ a bit-wise multiplication; the non-linear functions σ and tanh both operate bit-wise; and W and H are the weights of the input and output states respectively.
If the LSTM outputs of each direction were simply added, every direction's information would carry the same weight. The invention instead uses an attention module to fuse the directional outputs and screen out the relatively important information for accurate target localization. To this end, the information of each direction is fused to obtain S:

S = [s_{t−}, s_{h−}, s_{h+}, s_{w−}, s_{w+}]^T,  (3)

and the information in S is then fused by the attention module:

S' = M_c(S) ⊗ S,  S'' = M_s(S') ⊗ S',  (4)

where M_c is a 1-dimensional channel attention map, M_s is a 2-dimensional spatial attention map, and ⊗ is a bit-wise multiplication operation. This process propagates the channel attention values along the spatial dimension and vice versa. For channel attention, the channel attention map characterizes the relationships between the feature channels. Two spatial context descriptors, S_avg and S_max, denote the average-pooled and max-pooled features respectively. The channel attention map M_c is obtained by feeding each of the two spatial context descriptors into a shared fully connected layer and fusing the output feature vectors by element-wise summation. The channel attention can be expressed as:

M_c(S) = σ(W_1(W_0(S_avg)) + W_1(W_0(S_max))),  (5)

where σ is the sigmoid function and W_0 and W_1 are the parameters of the two fully connected layers.
After the channel attention map is obtained, a spatial attention map is generated from the spatial relationships between the features. To compute spatial attention, average pooling and max pooling are first applied to S and the results are concatenated along the channel axis; a convolutional layer then generates the spatial attention map M_s, which encodes the positions to be emphasized. This process can be expressed as:

M_s(S) = σ(f([S_avg; S_max])),  (6)

where σ is the sigmoid function, [;] and f(·) denote the concatenation and convolution operations respectively, and S_avg and S_max are the average-pooled and max-pooled feature representations along the channel axis. Finally, g_c is obtained by feeding S'' into a 1 × 1 convolutional layer.
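The channel and spatial attention steps above can be sketched as follows. This is an illustrative NumPy version under stated simplifications: the shared two-layer MLP uses an assumed ReLU in between and a channel-reduction ratio of 2, and the learned convolution f in equation (6) is replaced by a fixed average of the two channel-pooled maps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(S, W0, W1):
    """Eq. (5): shared two-layer MLP over average- and max-pooled
    channel descriptors, fused by element-wise summation.  S: (C, H, W)."""
    s_avg = S.mean(axis=(1, 2))                       # (C,)
    s_max = S.max(axis=(1, 2))                        # (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)      # ReLU assumed
    return sigmoid(mlp(s_avg) + mlp(s_max))           # (C,)

def spatial_attention(S):
    """Eq. (6), with the learned convolution f replaced by a fixed
    average of the two channel-pooled maps (illustration only)."""
    pooled = np.stack([S.mean(axis=0), S.max(axis=0)])  # (2, H, W)
    return sigmoid(pooled.mean(axis=0))                 # (H, W)

rng = np.random.default_rng(2)
C, H, W = 8, 6, 6
S = rng.normal(size=(C, H, W))
W0 = rng.normal(scale=0.1, size=(C // 2, C))
W1 = rng.normal(scale=0.1, size=(C, C // 2))
Mc = channel_attention(S, W0, W1)
S1 = Mc[:, None, None] * S            # S'  = Mc(S) ⊗ S
Ms = spatial_attention(S1)
S2 = Ms[None] * S1                    # S'' = Ms(S') ⊗ S'
```

The broadcasting in the last lines is the bit-wise multiplication ⊗ of equation (4): the 1-D channel map scales whole channels, while the 2-D spatial map scales whole pixel positions.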
Trajectory network: the trajectory network plays the role of a motion prior. Here, the motion prior is the predicted position of the target in the previous frame. Since the final position of the target is estimated from a Gaussian-like map, the trajectory of the previous frame is likewise represented as a Gaussian map. To match the resolutions of the different feature maps in the general regression network, the trajectory Gaussian map is down-sampled to different scales, and the trajectory Gaussian maps of the different scales are then used to generate the b_c of the corresponding conditional layers:

b_c = γ ⊙ M_t + β,  (7)

where M_t denotes a down-sampled trajectory Gaussian map, and γ and β are scale and offset parameters learned by a 1 × 1 convolution.
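The trajectory-prior path can be sketched as follows. This is illustrative only: the down-sampling method (average pooling), the scalar stand-ins for the learned γ and β of equation (7), and all sizes are assumptions, not values from the patent:

```python
import numpy as np

def downsample(m, factor):
    """Average-pool a 2-D map by an integer factor, one simple way to
    match the feature resolutions of the regression network."""
    h, w = m.shape
    return m[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def trajectory_bias(M_t, gamma, beta):
    """Eq. (7): element-level bias b_c = gamma * M_t + beta; gamma and
    beta stand in for parameters a 1x1 convolution would learn."""
    return gamma * M_t + beta

# Trajectory Gaussian map centred on the previous frame's predicted position.
ys, xs = np.mgrid[0:32, 0:32]
M = np.exp(-((ys - 16) ** 2 + (xs - 10) ** 2) / (2 * 3.0 ** 2))
M_t = downsample(M, 2)                       # match a coarser feature map
b_c = trajectory_bias(M_t, gamma=0.5, beta=0.0)
```

Because b_c is added element-wise inside the CBN layer, the bias is largest exactly where the target was last seen, which is how the motion prior biases the response map toward plausible positions.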
The second part is the implementation, training and prediction of the network.
(1) General regression network. A VGG-16 model pre-trained on ImageNet is used as the feature extractor. The regression model computes a Gaussian map of the target position by fusing the feature representations of conv4_3 and conv5_3. A CBN layer is added after each of the 6 convolutional layers between conv4_1 and conv5_3. (2) Spatio-temporal context network. The spatio-temporal context network is realized by stacking 3 PMD units and their corresponding attention fusion modules; the kernel size in the LSTM of each PMD unit is 3 × 3, the hidden state size is set to 64, and the initial state is set to 0. (3) Trajectory network. The trajectory network consists of 3 convolutional layers with a kernel size of 3 × 3, whose outputs have 128, 128 and 64 channels respectively; ReLU activation layers between the convolutional layers improve the nonlinearity of the feature representation. Then, 6 convolutional layers of size 1 × 1 generate b_c to match the feature scales of conv4_1 through conv5_3 in the general regression network. All the above convolutional layers are randomly initialized.
Training the network proceeds in two stages. In the first stage, the general regression network is trained with a multi-domain learning strategy: in each iteration, the network is updated with a batch of 8 training samples from one sequence, where one branch (i.e., one regression model) corresponds to one domain. The general regression network is trained on ILSVRC2015 with the shrinkage loss L_shrinkage. The input search area is 5 times the target size, and the corresponding soft labels are generated by a Gaussian function. The Adam optimizer is used with a learning rate of 10^{-5} for 80,000 iterations. In the second stage, the general regression network and the two modulators (i.e., the spatio-temporal context network and the trajectory network) are trained jointly. Given a pair of consecutive frames (x_i, x_j), two corresponding search areas (sa_i, sa_j) and a soft label sl_j are generated. The search area sa_i is fed into the spatio-temporal context network, which produces the scale parameter g_c and a predicted search area ŝa. The loss between the predicted search area ŝa and the input search area sa_i is minimized:

L_pre = L_gdl(ŝa, sa_i),  (8)

where L_gdl is the image Gradient Difference Loss (GDL), with exponent p = 1. The soft label sl_j is then fed into the trajectory network to generate b_c. The whole network is optimized with:

L = L_shrinkage(ŷ_j, sl_j) + a · L_pre,  (9)

where ŷ_j is the predicted output and a = 1. The second stage is likewise trained with the Adam optimizer, at a learning rate of 10^{-6} for 11,000 iterations.
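The image Gradient Difference Loss mentioned above can be sketched as follows, under a common formulation of GDL (penalising differences between the horizontal and vertical finite-difference gradients of the two images); the exact variant used in the patent is not spelled out, so this is an assumption:

```python
import numpy as np

def gdl(pred, target, p=1):
    """Gradient difference loss: compares the vertical and horizontal
    finite-difference gradients of two images, with exponent p (p = 1
    here, as stated in the text)."""
    dy_p, dy_t = np.diff(pred, axis=0), np.diff(target, axis=0)
    dx_p, dx_t = np.diff(pred, axis=1), np.diff(target, axis=1)
    return (np.abs(dy_p - dy_t) ** p).sum() + (np.abs(dx_p - dx_t) ** p).sum()

a = np.arange(16.0).reshape(4, 4)
assert gdl(a, a) == 0.0   # identical images give zero loss
```

A notable property of GDL, visible from the code, is that a constant brightness offset between prediction and target contributes nothing to the loss, since only gradients are compared.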
Prediction: given a video frame, a search area is cropped according to the prediction result of the previous frame. The input of the whole network comprises the current search area, the search area of the previous frame, and the Gaussian response map. In the output response map, the coordinate with the largest value represents the predicted target position. Meanwhile, a scale-pyramid strategy is used to predict the scale of the target.
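The prediction step can be sketched as follows. The argmax rule is exactly what the text describes; the scale-pyramid rule, including the change-penalty factor, is an illustrative assumption since the patent does not give its details:

```python
import numpy as np

def locate_target(response):
    """The coordinate with the largest response value is the predicted
    target position."""
    return np.unravel_index(np.argmax(response), response.shape)

def pick_scale(responses, scales, penalty=0.97):
    """Illustrative scale-pyramid rule: evaluate the response at several
    scales and keep the one with the highest, slightly penalised, peak;
    the penalty discourages abrupt scale changes (assumed, not from
    the patent)."""
    peaks = [r.max() * (penalty if s != 1.0 else 1.0)
             for r, s in zip(responses, scales)]
    return scales[int(np.argmax(peaks))]

r = np.zeros((31, 31)); r[12, 20] = 1.0
pos = locate_target(r)                                   # (12, 20)
best = pick_scale([r * 0.9, r, r * 0.95], scales=[0.98, 1.0, 1.02])
```

With these responses the unchanged scale wins, since the off-scale peaks are both lower and penalised.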
Experimental results
To verify the performance of the invention, a second version of CARM, named CARM-u, is also evaluated; in CARM-u, the entire network is updated every 10 frames. Performance is compared with mainstream tracking methods on four public datasets (OTB2015, TC128, UAV123 and VOT2018). On OTB2015, the tracking methods are evaluated by two indices, Precision and Success rate; the same two metrics are used on the TC128 and UAV123 datasets. On the VOT2018 dataset, methods are evaluated by Accuracy (AR), Robustness (RR) and Expected Average Overlap (EAO).
The performance of the invention was first tested on the OTB2015 dataset. A comparison of CARM-u and CARM with SiamRPN++, SiamBAN, DSLT, KYS, DiMP, PrDiMP, DaSiamRPN, ATOM and meta_crest is shown in Fig. 3 and Table 1. According to Fig. 3 and Table 1, CARM achieves competitive results, and the updated version (CARM-u) achieves the highest precision and success scores (92.0% and 70.7%) compared with the other regression-based tracking methods. This robust performance can be attributed to two factors: first, the spatio-temporal context network captures both the target and its local context changes; second, the trajectory network provides a motion prior that pinpoints the target.
Table 1 comparison of OTB2015 data sets
The TC128 dataset contains 128 full-color video sequences. CARM and CARM-u are compared fairly with the other mainstream methods on this dataset. Fig. 4 shows that the proposed method achieves the highest success-rate score among the compared tracking methods. Since TC128 contains a large number of small target objects, this indicates that the invention performs well when tracking small targets.
The UAV123 dataset contains over 11,000 video frames captured from a drone platform. The precision and success-rate results are shown in Table 2. Among regression-based methods, CARM-u clearly outperforms DSLT, reaching precision and success-rate scores of 82.2 and 62.5 respectively. In particular, CARM, the version that requires no fine-tuning, beats DSLT in both precision and success rate while running 6 times faster. These results show that the method has good robustness and generalization ability.
TABLE 2 comparison of UAV123 data sets
VOT2018 is a challenging, recently published dataset. Table 3 shows the performance comparison of the invention with the other tracking methods. According to Table 3, the invention has the highest EAO value compared with the two regression-based tracking methods DSLT and CREST, and ranks first on both the AR and RR indices. The meaningful features extracted by the two modulators therefore not only locate the target accurately but also provide strong robustness.
TABLE 3 comparison of VOT2018 datasets
To verify the effectiveness of the components of the invention, an ablation study was performed on the OTB2015 dataset. First, the performance impact of the two modulators on the general regression network was verified; Table 4 shows the results, all of which support the contribution of both modulators to pinpointing the target. In addition, Fig. 5 shows how the modulators adjust the target response map: the first and fourth columns are input search areas, the second and fifth columns are response maps output by the general regression network alone, and the third and sixth columns are response maps of the general regression network assisted by the two modulators. Fig. 5 shows that the modulators help the tracker cope with various target appearance changes and also effectively suppress distractors in the background.
Table 4 ablation experiments on OTB2015 dataset

Claims (4)

1. A modulator-based adaptive regression tracking method, said method comprising the steps of:
designing a spatiotemporal context network based on attention, and generating affine parameters corresponding to spatiotemporal context;
designing a track network to generate affine parameters corresponding to the track;
and step three, blending the two parameters generated in step one and step two into each layer's parameters of the general regression network, and adaptively adjusting the parameters of the general regression network so that the network has a higher response to a specific target.
2. The modulator-based adaptive regression tracking method according to claim 1, wherein the specific step of the first step is as follows:
(1) designing an attention-based spatio-temporal context network comprising 3 PMD units and their corresponding attention fusion modules, wherein: the kernel size in the LSTM of each PMD unit is 3 × 3, the hidden state size is set to 64, the initial state is set to 0, and each PMD unit comprises long short-term memory in 5 directions, namely the time, right, left, up and down directions;
(2) fusing the information of each direction to obtain S:

S = [s_{t−}, s_{h−}, s_{h+}, s_{w−}, s_{w+}]^T,

and then fusing the information in S through the attention fusion module:

S' = M_c(S) ⊗ S,  S'' = M_s(S') ⊗ S',

wherein S' and S'' respectively denote the channel attention feature and the spatial attention feature, M_c is a 1-dimensional channel attention map, M_s is a 2-dimensional spatial attention map, and ⊗ is a bit-wise multiplication operation;

(3) inputting S'' into a 1 × 1 convolutional layer to obtain the affine parameter g_c corresponding to the spatio-temporal context.
3. The modulator-based adaptive regression tracking method according to claim 1, wherein the specific steps of the second step are as follows:
(1) designing a trajectory network consisting of 3 convolutional layers with a kernel size of 3 × 3, the outputs of the convolutional layers having 128, 128 and 64 channels respectively, adding ReLU activation layers between the convolutional layers to improve the nonlinearity of the feature representation, and then using 6 convolutional layers of size 1 × 1 to generate b_c at the different down-sampled scales;
(2) designing a motion prior as a predicted position of a target in a previous frame, and representing a track of the previous frame as a track Gaussian map;
(3) downsampling the trajectory Gaussian map to different scales, and then using the trajectory Gaussian maps of different scales to generate the b_c corresponding to each conditional layer:
b_c = γ ⊗ M_t + β
wherein M_t represents a down-sampled trajectory Gaussian map, γ and β are scale and offset parameters learned by a 1 × 1 convolution, and b_c is the affine parameter corresponding to the trajectory.
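Steps (2)–(3) can be sketched as follows: build a Gaussian map centered at the previous frame's predicted position, downsample it, and apply the learned scale and offset. A minimal NumPy sketch; the constant per-channel `gamma`/`beta` stand in for the parameters that the patent learns with a 1 × 1 convolution and are purely illustrative.

```python
import numpy as np

def trajectory_gaussian(h, w, cy, cx, sigma=2.0):
    """Gaussian map encoding the target position predicted in the previous frame."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def downsample2(m):
    """Naive 2x downsampling by 2 x 2 average pooling (illustrative)."""
    return m.reshape(m.shape[0] // 2, 2, m.shape[1] // 2, 2).mean(axis=(1, 3))

def trajectory_affine(M_t, gamma, beta):
    """b_c = gamma (x) M_t + beta, a per-channel scale and offset of the map."""
    return gamma[:, None, None] * M_t[None] + beta[:, None, None]

M = trajectory_gaussian(32, 32, cy=10, cx=20)  # previous-frame trajectory map
M_t = downsample2(M)                           # one coarser scale, (16, 16)
gamma = np.full(64, 0.5)                       # illustrative learned scale
beta = np.full(64, 0.1)                        # illustrative learned offset
b_c = trajectory_affine(M_t, gamma, beta)
print(b_c.shape)                               # (64, 16, 16)
```

Each conditional layer would receive a b_c generated from the trajectory map at the matching spatial scale.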
4. The modulator-based adaptive regression tracking method according to claim 1, wherein in the third step, a VGG-16 model pre-trained on ImageNet is used as the feature extractor, the regression model computes a Gaussian map of the target position by fusing the feature representations of conv4_3 and conv5_3, and after CBN layers are added to each of the 6 convolutional layers between conv4_1 and conv5_3, the CBN parameters of each layer come from g_c of the spatio-temporal context network and b_c of the trajectory network.
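A conditional batch normalization (CBN) layer of the kind described in claim 4 normalizes a feature map and then applies modulator-generated affine parameters instead of fixed learned ones. Below is a minimal per-sample NumPy sketch, assuming for illustration that g_c supplies the scale and b_c the shift; the patent does not spell out this exact split, so the assignment is an assumption.

```python
import numpy as np

def conditional_bn(x, g_c, b_c, eps=1e-5):
    """Conditional batch norm over one feature map x of shape (C, H, W).

    Normalizes each channel, then modulates with affine parameters produced
    by the modulators (here: g_c from the spatio-temporal context branch as
    scale, b_c from the trajectory branch as shift -- an illustrative split).
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return g_c[:, None, None] * x_hat + b_c[:, None, None]

rng = np.random.default_rng(1)
feat = rng.normal(loc=3.0, scale=2.0, size=(64, 14, 14))  # e.g. a conv4_x output
g_c = np.ones(64)                                          # identity scale
b_c = np.zeros(64)                                         # zero shift
out = conditional_bn(feat, g_c, b_c)
print(round(float(out.mean()), 6))                         # ~0: channels normalized
```

With identity scale and zero shift the layer reduces to plain normalization; in the tracker, g_c and b_c vary per frame, so the same frozen VGG-16 features are re-modulated by the current spatio-temporal context and trajectory.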
CN202111222510.XA 2021-10-20 2021-10-20 Self-adaptive regression tracking method based on modulator Active CN113947618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111222510.XA CN113947618B (en) 2021-10-20 2021-10-20 Self-adaptive regression tracking method based on modulator

Publications (2)

Publication Number Publication Date
CN113947618A true CN113947618A (en) 2022-01-18
CN113947618B CN113947618B (en) 2023-08-29

Family

ID=79332090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111222510.XA Active CN113947618B (en) 2021-10-20 2021-10-20 Self-adaptive regression tracking method based on modulator

Country Status (1)

Country Link
CN (1) CN113947618B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017147552A1 (en) * 2016-02-26 2017-08-31 Daniela Brunner Multi-format, multi-domain and multi-algorithm metalearner system and method for monitoring human health, and deriving health status and trajectory
CN109493364A (en) * 2018-09-26 2019-03-19 重庆邮电大学 A kind of target tracking algorism of combination residual error attention and contextual information
CN109685831A (en) * 2018-12-20 2019-04-26 山东大学 Method for tracking target and system based on residual error layering attention and correlation filter
CN110223316A (en) * 2019-06-13 2019-09-10 哈尔滨工业大学 Fast-moving target tracking method based on circulation Recurrent networks
CN110335290A (en) * 2019-06-04 2019-10-15 大连理工大学 Twin candidate region based on attention mechanism generates network target tracking method
CN110569706A (en) * 2019-06-25 2019-12-13 南京信息工程大学 Deep integration target tracking algorithm based on time and space network
CN112651995A (en) * 2020-12-21 2021-04-13 江南大学 On-line multi-target tracking method based on multifunctional aggregation and tracking simulation training
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism
CN113256677A (en) * 2021-04-16 2021-08-13 浙江工业大学 Method for tracking visual target with attention

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUN YAN et al.: "Trajectory prediction for intelligent vehicles using spatial-attention mechanism", IET Intelligent Transport Systems, vol. 14, no. 13, pages 1855-1863 *
刘业鑫 et al.: "Natural scene text detection method based on text center lines", Intelligent Computer and Applications, vol. 10, no. 2, pages 374-379 *
詹紫微 et al.: "Research on target tracking methods based on convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology, no. 2, pages 138-1271 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant