CN112150504A - Visual tracking method based on attention mechanism - Google Patents
Classifications
- G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T 7/251: Analysis of motion using feature-based methods involving models
- G06F 18/253: Fusion techniques of extracted features
- G06N 3/045: Combinations of networks
- G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
The invention discloses a visual tracking method based on an attention mechanism. The method operates as follows: (1) preprocess the image data sets, including resizing all images to a uniform size; (2) construct a deep target state estimation network that extracts the position information of the target; (3) feed the training data from step (1) into the network built in step (2) and train until convergence, obtaining a trained model that outputs adjustment vectors; (4) test with the trained model: given the target coordinates in the first frame of a test video, feed the sequence to both the target state estimation network and the target classification network of step (2); the classification network first yields a coarse target position, and the state estimation network then refines it into a more accurate target state. The method extracts the position information of the target more accurately and therefore estimates the state of the tracked target more accurately.
Description
Technical Field
The invention relates to a visual tracking method based on deep learning, and in particular to a visual tracking method based on an attention mechanism, in which an attention module helps the network extract the position information of the tracked target more effectively.
Background
Enabling computers to see, to analyze the information in videos, and thereby to assist or replace humans in tedious or dangerous work has long been a human aspiration. In recent years, with continuing advances in machine learning and deep learning research, artificial intelligence (AI) technology has gradually entered many industries and everyday life, for example autonomous driving, speech recognition, face recognition, and games. Video target tracking predicts the state of a target in a video and provides trajectory features for research such as target behavior analysis; it is an important component of computer vision (CV) and is widely applied in intelligent surveillance, human-computer interaction, autonomous driving, virtual reality, crime prediction, surgical navigation, missile guidance, and military reconnaissance. As early as 1982, Marr et al. constructed a framework for computer vision and showed that Fourier transforms of spatial-frequency-sensitive data could recover the geometry of retinal receptive fields. In 2010, David et al. proposed MOSSE, a tracking algorithm based on correlation filtering that can construct a stable correlation filter from only the first-frame target image. In 2011, Henriques et al. proposed the high-speed kernelized correlation filter KCF, a milestone among discriminative-correlation-filter video trackers, which established the connection between correlation filtering and circulant matrices and achieved competitive tracking performance. Tang et al. replaced the single kernel in KCF with multiple kernels, allowing the tracker to fully exploit the invariance-discriminative power spectrum of multiple features.
Danelljan et al. penalized the correlation filter with a spatial regularization component, yielding a spatially regularized discriminative correlation filtering tracker. In recent years, twin (Siamese) network algorithms have attracted great interest and performed very well: Tao et al. used a twin network to learn a matching function from training videos and then searched for the target in each image with the fixed matching function; Bertinetto et al. proposed a fully convolutional twin network that learns the similarity between the target and candidates; Li et al. combined twin networks with region proposal networks in a single-stage twin tracking algorithm. These trackers achieve good speed, but their estimates of the target state are not accurate enough.
Disclosure of Invention
The invention aims to improve on the prior art by providing a visual tracking method based on an attention mechanism that extracts the position information of the target more accurately and thereby estimates the state of the tracked target more accurately.
To this end, the invention adopts the following technical scheme:
a visual tracking method based on an attention mechanism comprises the following specific operation steps:
(1) constructing a cross location attention module (CCLA);
(2) splitting a visual tracking problem into two tasks of target state estimation and target classification:
a) constructing a target state estimation CNN with ResNet as the backbone to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross positioning attention module of step (1) into Block3 and Block4 of ResNet so that the target state estimation network predicts the target form more easily and accurately; training the whole network starting from models pre-trained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean square error loss, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first obtains a coarse target position, and the target state estimation network then obtains a more accurate target state.
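The training schedule in step (3) (base learning rate 10^-3, multiplied by 0.2 every 15 epochs, 40 epochs in total) can be sketched as a simple step-decay function; the function name and the use of plain Python rather than a framework scheduler are illustrative assumptions:

```python
def step_decay_lr(epoch, base_lr=1e-3, gamma=0.2, step=15):
    """Step-decay schedule from the training setup: the base learning
    rate is multiplied by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Over the 40-epoch run the schedule takes three values:
schedule = [step_decay_lr(e) for e in range(40)]
# epochs 0-14 use 1e-3, epochs 15-29 use 2e-4, epochs 30-39 use 4e-5
```

In PyTorch the same behaviour would typically be obtained with a step-decay scheduler attached to the SGD optimizer.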
Preferably, the cross positioning attention module of step (1) is constructed as follows:
(1-1) the cross positioning attention module consists of an upper branch and a lower branch connected in parallel, denoted CC^2 and CC respectively;
(1-2) the upper branch cascades two criss-cross modules and captures higher-level semantic information;
(1-3) the lower branch consists of a single criss-cross module and captures low-level information about the target; combining the lower branch with the upper branch allows the target state to be estimated accurately.
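A property worth noting behind the CC^2 / CC split: a single criss-cross module lets each position attend only to its own row and column, while cascading two criss-cross modules lets information from every position in the feature map reach every other position (any pixel is reachable in two row/column hops). A small coverage sketch, using plain Python sets rather than the actual attention arithmetic:

```python
def criss_cross_reach(h, w, start):
    """Positions reachable from `start` in ONE criss-cross step:
    every cell in the same row or the same column."""
    r, c = start
    return {(r, j) for j in range(w)} | {(i, c) for i in range(h)}

def two_step_reach(h, w, start):
    """Positions reachable after TWO cascaded criss-cross steps."""
    reached = set()
    for p in criss_cross_reach(h, w, start):
        reached |= criss_cross_reach(h, w, p)
    return reached

h, w = 6, 8
one = criss_cross_reach(h, w, (2, 3))   # h + w - 1 = 13 positions
two = two_step_reach(h, w, (2, 3))      # the full h * w = 48 positions
```

This is why the cascaded upper branch gathers full-image (higher-level) context while the single lower module stays restricted to local row/column information.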
Preferably, the target state estimation network and the discrimination network of step (2) are constructed as follows:
(2-1) the state estimation network is a twin-like (Siamese-style) network: the upper part is an adjustment module whose input is a reference image; the lower part is a testing module whose input is a test image;
(2-2) the adjustment module uses a ResNet as the backbone; Block4 of the ResNet is followed by the upper branch (CC^2) of the cross positioning attention module, obtaining richer semantic information about the target location, while Block3 is followed by the lower branch (CC), obtaining low-level information about the target position; fusing the low-level and high-level information improves the state estimate. A Precise ROI Pooling layer follows the cross positioning attention module; given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 x 1;
(2-3) the lower (testing) part also uses ResNet as the backbone; the outputs of Block3 and Block4 each pass through a convolution layer, are multiplied channel-wise by the corresponding adjustment vector, and then each pass through a Precise ROI Pooling layer. In operation, the test image is randomly perturbed to simulate a target tracking scene, 16 candidate target state boxes are generated, and each candidate box receives a score; the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built from a correlation filtering algorithm and preliminarily obtains the coarse position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
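The final fusion in (2-4), keeping the 3 highest-scoring of the 16 candidate boxes and averaging them coordinate-wise, can be sketched as follows; the (x, y, w, h) box format and the function name are illustrative assumptions:

```python
def fuse_top3(boxes, scores, k=3):
    """Select the k highest-scoring candidate boxes and average them
    coordinate-wise to produce the final target state."""
    ranked = sorted(zip(scores, boxes), key=lambda p: p[0], reverse=True)
    top = [b for _, b in ranked[:k]]
    return tuple(sum(c) / len(top) for c in zip(*top))

# In practice the 16 simulated candidates are scored by the testing
# module; here three plausible boxes dominate one outlier:
boxes  = [(10, 10, 50, 80), (12, 11, 48, 78), (8, 9, 52, 82), (100, 5, 20, 20)]
scores = [0.9, 0.8, 0.7, 0.1]
state = fuse_top3(boxes, scores)   # averages the three high-scoring boxes
```

Averaging the top candidates smooths out per-box regression noise, which is the stated motivation for taking 3 boxes rather than just the single best one.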
Compared with the prior art, the invention has the following prominent substantive features and technical progress:
the method fuses the high-level and low-level information of the target in the image as fully as possible, extracting the position information of the target more accurately and thereby estimating the state of the tracked target more accurately.
Drawings
FIG. 1 is a network flow diagram of a vision tracking method based on attention mechanism.
FIG. 2(a) is a schematic diagram of the cross positioning attention module method in step (1) of the present invention.
FIG. 2(b) is a schematic diagram of the cross module method for forming the cross location attention module in step (2) of the present invention.
Fig. 2(c) is a schematic diagram of the network model method for obtaining convergence in step (3) of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The simulation experiments of the invention were implemented with the PyTorch framework on a PC test platform with a 4 GHz CPU, 64 GB of memory, and a Titan RTX GPU with 24 GB of video memory.
The first embodiment is as follows:
referring to fig. 1, a visual tracking method based on attention mechanism is characterized by comprising the following specific steps:
(1) constructing a cross location attention module (CCLA);
(2) splitting a visual tracking problem into two tasks of target state estimation and target classification:
a) constructing a target state estimation convolutional neural network (CNN) with ResNet as the backbone to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross positioning attention module of step (1) into Block3 and Block4 of ResNet so that the target state estimation network predicts the target form more easily and accurately; training the whole network starting from models pre-trained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean square error loss, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first obtains a coarse target position, and the target state estimation network then obtains a more accurate target state.
Example two:
the present embodiment is substantially the same as the first embodiment in the following aspects:
the construction of the cross positioning attention module in the step (1) comprises the following specific steps:
(1-1) the cross positioning attention module is formed by connecting an upper branch and a lower branch in parallel, wherein the upper branch and the lower branch are respectively CC2And CC;
(1-2) for the upper support of the cross positioning attention module, two cross modules are cascaded, and the upper support module acquires higher semantic information;
(1-3) for the lower support of the cross positioning attention module, the cross positioning attention module consists of a cross module, the cross module acquires the low-level information of the target, and the state of the target is accurately estimated through the combination of the cross positioning attention module and the upper support of the cross positioning attention module.
The target state estimation network and the discrimination network of step (2) are constructed as follows:
(2-1) the state estimation network is a twin-like (Siamese-style) network: the upper part is an adjustment module whose input is a reference image; the lower part is a testing module whose input is a test image;
(2-2) the adjustment module uses a ResNet as the backbone; Block4 of the ResNet is followed by the upper branch (CC^2) of the cross positioning attention module, obtaining richer semantic information about the target location, while Block3 is followed by the lower branch (CC), obtaining low-level information about the target position; fusing the low-level and high-level information improves the state estimate. A Precise ROI Pooling layer follows the cross positioning attention module; given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 x 1;
(2-3) the lower (testing) part also uses ResNet as the backbone; the outputs of Block3 and Block4 each pass through a convolution layer, are multiplied channel-wise by the corresponding adjustment vector, and then each pass through a Precise ROI Pooling layer. In operation, the test image is randomly perturbed to simulate a target tracking scene, 16 candidate target state boxes are generated, and each candidate box receives a score; the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built from a correlation filtering algorithm and preliminarily obtains the coarse position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
Example three:
as shown in fig. 1, a visual tracking method based on attention mechanism includes the following specific steps:
(1) Three data sets suitable for visual target tracking are processed: LaSOT (227 GB, 1,400 videos, 3.52 million images), TrackingNet (1.1 TB, 30,132 training videos, 511 test videos, 27 categories) and COCO (300,000 images, 80 categories). Processing includes unifying the sizes of the input image I and the label G: both are resized to 288 x 288 and fed to the target state estimation network. As shown in fig. 2(a), a reference image is input to the upper part of the training network and a test image to the lower part; the reference and test images are at most 50 frames apart in the sequence. As shown in fig. 2(b), after features of the reference image are extracted by the backbone ResNet, they pass through the cross positioning attention module, whose upper branch is a cascade of two criss-cross attention modules and whose lower branch is a single criss-cross attention module. As shown in fig. 2(c), the criss-cross attention module proceeds as follows:
(1-1) Let the local feature map input to the module be H ∈ R^(C×W×H), where R is the set of real numbers, C the number of channels, W the width and H the height of the feature map. Two convolution blocks with 1 x 1 kernels are first applied to H, generating two feature maps Q and K respectively, with Q, K ∈ R^(C'×W×H);
(1-2) from the obtained feature maps Q and K, an attention map A ∈ R^((H+W-1)×W×H) is generated by the affinity operation. For each point u in the spatial dimension of Q a vector Q_u ∈ R^(C') is taken, where C' is a smaller number of channels than C; the feature vectors in the same row or column as position u are extracted from K to form Ω_u ∈ R^((H+W-1)×C'), whose i-th element is Ω_{i,u}. The affinity operation is

    d_{i,u} = Q_u Ω_{i,u}^T,

where d_{i,u} measures the correlation between Q_u and Ω_{i,u}; a softmax over the resulting map D ∈ R^((H+W-1)×W×H) along the channel dimension then gives the attention map A;
(1-3) a final convolution block with a 1 x 1 kernel acts directly on the feature map H to give the feature map V ∈ R^(C×W×H). For each point u in the spatial dimension of V, a vector V_u ∈ R^C and a set Φ_u ∈ R^((H+W-1)×C) are obtained, where Φ_u collects the feature vectors of V in the same row or column as position u. Long-range contextual information is gathered by the aggregation operation

    H'_u = Σ_{i=0}^{H+W-2} A_{i,u} Φ_{i,u} + H_u,

where H'_u is the feature vector of the output map H' ∈ R^(C×W×H) at position u, and A_{i,u} is the scalar attention weight at channel i and position u of the attention map A;
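The affinity and aggregation operations of (1-1)-(1-3) can be sketched in NumPy for a single spatial position u; the tensor layout, naming, and the per-position loop are illustrative assumptions rather than the patented implementation:

```python
import numpy as np

def criss_cross_at(Q, K, V, Hmap, r, c):
    """Criss-cross attention output H'_u at position u = (r, c).
    Q, K: (C', H, W) query/key maps; V, Hmap: (C, H, W) value/input maps."""
    Cq, H, W = Q.shape
    q_u = Q[:, r, c]                                   # Q_u, shape (C',)
    # Omega_u / Phi_u: the H+W-1 features in the same column or row as u
    # (u's own cell is kept exactly once).
    idx = [(i, c) for i in range(H)] + [(r, j) for j in range(W) if j != c]
    omega = np.stack([K[:, i, j] for i, j in idx])     # (H+W-1, C')
    phi   = np.stack([V[:, i, j] for i, j in idx])     # (H+W-1, C)
    d = omega @ q_u                                    # affinity d_{i,u}
    a = np.exp(d - d.max()); a /= a.sum()              # softmax -> A_{i,u}
    # aggregation plus the residual input feature H_u
    return (a[:, None] * phi).sum(axis=0) + Hmap[:, r, c]

# Toy maps: C = 4 channels, C' = 2 reduced channels, on a 3 x 5 grid
rng = np.random.default_rng(0)
Hmap = rng.standard_normal((4, 3, 5))
Q = rng.standard_normal((2, 3, 5))
K = rng.standard_normal((2, 3, 5))
V = rng.standard_normal((4, 3, 5))
out = criss_cross_at(Q, K, V, Hmap, r=1, c=2)          # H'_u, a (4,) vector
```

In the real module Q, K and V come from 1 x 1 convolutions on H, and the loop over all positions u is vectorized on the GPU.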
(2) as shown in fig. 2(b), a convolutional neural network capable of relatively accurate target state estimation is constructed: the state estimation network is a twin-like network whose upper part is an adjustment module taking a reference image as input and whose lower part is a testing module taking a test image as input; the adjustment module extracts the position information of the target and the testing module accurately estimates the target state, implemented as follows:
(2-1) extracting target position information: the adjustment module uses a ResNet as the backbone; Block4 of the ResNet is followed by the upper branch (CC^2) of the cross positioning attention module, obtaining richer semantic information about the target location, while Block3 is followed by the lower branch (CC), obtaining low-level information about the target position; fusing the low-level and high-level information improves the state estimate. A Precise ROI Pooling layer follows the cross positioning attention module; given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 x 1, which contain the target position information;
(2-2) estimating the target state: the lower part of the network also uses ResNet as the backbone; the outputs of Block3 and Block4 each pass through a convolution layer, are multiplied channel-wise by the corresponding adjustment vector, and then each pass through a Precise ROI Pooling layer. In actual operation, the test image is randomly perturbed to simulate a target tracking scene, 16 candidate target state boxes are generated, and each candidate box receives a score; the higher the score, the closer the candidate is to the true target state;
(3) as shown in fig. 2(c), the images of the three training sets processed in step (1) and their corresponding labels are input to the deep learning network constructed in step (2) for training. The weights of the backbone ResNet are pre-trained on the large-scale ImageNet data set and are not updated during training. The loss function is the minimum mean square error loss suitable for this task, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) finally, testing is performed with the network model trained in step (3). First a target classification network is constructed: a discrimination network built from a correlation filtering algorithm preliminarily obtains the coarse position of the tracked target. The network model trained in step (3) then estimates the target state accurately: i) the 3 highest-scoring candidate boxes are selected from the simulated candidates, and ii) these 3 boxes are averaged to give the final target state.
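The coarse localization of the discrimination network rests on correlation filtering of the kind cited in the background (MOSSE): a filter trained in the Fourier domain on the first-frame target, whose response peak on a search image gives the rough target position. A minimal single-frame NumPy sketch; the regularization value, Gaussian label width, and toy images are assumptions:

```python
import numpy as np

def train_filter(template, sigma=2.0, lam=1e-2):
    """MOSSE-style closed-form correlation filter from one template:
    H* = (G . conj(F)) / (F . conj(F) + lambda), with a Gaussian
    response label G peaked at the template centre."""
    h, w = template.shape
    yy, xx = np.mgrid[0:h, 0:w]
    g = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * sigma ** 2))
    F = np.fft.fft2(template)
    G = np.fft.fft2(g)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def locate(filt, search):
    """Correlate the filter with a search image; the response peak is
    the coarse target position (row, col)."""
    resp = np.real(np.fft.ifft2(filt * np.fft.fft2(search)))
    return np.unravel_index(np.argmax(resp), resp.shape)

# A bright blob at (12, 20) in a noisy 32 x 32 search image; the
# template is the same scene circularly shifted so the blob is centred.
rng = np.random.default_rng(1)
search = 0.05 * rng.standard_normal((32, 32))
search[12, 20] += 3.0
template = np.roll(np.roll(search, 16 - 12, axis=0), 16 - 20, axis=1)
filt = train_filter(template)
peak = locate(filt, search)   # expected near (12, 20)
```

The state estimation network of the invention then refines this coarse peak into the full target state, which a plain correlation filter cannot provide on its own.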
In summary of the above embodiments, the invention discloses a visual tracking method based on an attention mechanism with the following operation steps:
(1) processing the image data sets, including unifying the image size;
(2) constructing the target state estimation deep learning network, with ResNet as the backbone and the cross positioning attention module (CCLA) embedded in it, so that the position information of the target is extracted better;
(3) inputting the training data processed in step (1) into the deep learning network constructed in step (2) and training until convergence, obtaining a trained network model that outputs adjustment vectors;
(4) running experiments on the test data set with the trained network model of step (3): a target classification network first roughly estimates the target position; the network model trained in step (3) then estimates the target state accurately: i) the 3 highest-scoring candidate boxes are selected from the simulated candidates, and ii) these 3 boxes are averaged to give the final target state. The invention fuses the high-level and low-level information of the target in the image as fully as possible and can extract the position information of the target more accurately, thereby estimating the state of the tracked target more accurately.
The embodiments of the invention have been described with reference to the accompanying drawings, but the invention is not limited to these embodiments. Various changes, modifications, substitutions, combinations or simplifications made in accordance with the spirit and principle of the technical solution of the invention are equivalent substitutions and fall within the scope of protection of the invention, provided they meet the purpose of the invention and do not depart from its technical principle and inventive concept.
Claims (3)
1. A visual tracking method based on an attention mechanism is characterized by comprising the following specific steps:
(1) constructing a cross location attention module (CCLA);
(2) splitting a visual tracking problem into two tasks of target state estimation and target classification:
a) constructing a target state estimation CNN with ResNet as the backbone to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross positioning attention module of step (1) into Block3 and Block4 of ResNet so that the target state estimation network predicts the target form more easily and accurately; training the whole network starting from models pre-trained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean square error loss, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, obtaining a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first obtains a coarse target position, and the target state estimation network then obtains a more accurate target state.
2. The attention mechanism-based visual tracking method of claim 1, wherein: the construction of the cross positioning attention module in the step (1) comprises the following specific steps:
(1-1) the cross positioning attention module consists of an upper branch and a lower branch connected in parallel, denoted CC^2 and CC respectively;
(1-2) the upper branch cascades two criss-cross modules and captures higher-level semantic information;
(1-3) the lower branch consists of a single criss-cross module and captures low-level information about the target; the state of the target is estimated by combining the lower branch with the upper branch.
3. The attention-based visual tracking method of claim 2, wherein the target state estimation network and the judgment network in step (2) are constructed by the following specific steps:
(2-1) the state estimation network is a Siamese-like (twin) network; the upper part of the twin network is an adjustment module whose input is a reference image (Reference Image); the lower part is a testing module whose input is a test image (Testing Image);
(2-2) the adjustment module uses a ResNet network as its backbone; Block4 of ResNet connects to the upper branch (CC²) of the cross positioning attention module to acquire more semantic information about the target position, and Block3 of ResNet connects to the lower branch (CC) of the cross positioning attention module to acquire low-level information about the target position; the target state is estimated by fusing the low-level and high-level information; a Precise ROI Pooling layer is connected after the cross positioning attention module, so that, given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 × 1;
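The adjustment vectors of step (2-2) can be illustrated with a crude stand-in for Precise ROI Pooling (the real PrRoI Pooling integrates continuously over the ROI; here a plain crop-and-average on integer feature coordinates, an assumption for illustration only):

```python
import numpy as np

def roi_to_adjust_vector(feat, box):
    """Crop the reference feature map at the target box and pool it into
    one per-channel adjustment vector of shape (C,) (512 x 1 in the claim).
    feat: (C, H, W); box: (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = box
    region = feat[:, y0:y1, x0:x1]      # (C, y1-y0, x1-x0) crop
    return region.mean(axis=(1, 2))     # average-pool each channel
```

In the claimed network, the resulting vector modulates the testing branch channel-wise, as described in step (2-3).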
(2-3) the testing module, i.e. the lower part of the twin network, likewise uses ResNet as its backbone; the outputs of Block3 and Block4 each pass through a convolution layer and are multiplied channel-wise by the corresponding adjustment vector from the adjustment module, and each result then passes through a Precise ROI Pooling layer; in actual operation, the test image is randomly perturbed to simulate the target tracking scene, yielding 16 candidate target state boxes; each candidate box then receives a score, and the higher the score, the closer the candidate box is to the true target state;
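The random perturbation of step (2-3) can be sketched as jittering the previous target state into 16 candidate boxes (Gaussian noise on the centre and on log-scale for width/height; the noise model and magnitude are our assumption, not specified by the patent):

```python
import numpy as np

def jitter_candidates(box, n=16, scale=0.1, seed=None):
    """Simulate tracking perturbations: jitter a (cx, cy, w, h) box
    into n candidate target state boxes."""
    rng = np.random.default_rng(seed)
    cx, cy, w, h = box
    noise = rng.normal(0.0, scale, size=(n, 4))
    return np.stack([cx + noise[:, 0] * w,        # shifted centre x
                     cy + noise[:, 1] * h,        # shifted centre y
                     w * np.exp(noise[:, 2]),     # rescaled width
                     h * np.exp(noise[:, 3])],    # rescaled height
                    axis=1)
```

Each of the 16 candidates would then be scored by the network; log-scale noise keeps widths and heights strictly positive.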
(2-4) the judgment network is constructed from a correlation filtering algorithm and preliminarily obtains the rough position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidate boxes and averaged to obtain the final target state.
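The final fusion of step (2-4), selecting the 3 highest-scoring candidates and averaging them, is a few lines:

```python
import numpy as np

def final_state(candidates, scores, k=3):
    """Average the k highest-scoring candidate boxes (step 2-4).
    candidates: (n, 4) boxes; scores: (n,) network scores."""
    top = np.argsort(scores)[-k:]       # indices of the k best scores
    return candidates[top].mean(axis=0)
```

Averaging the top candidates smooths out per-box scoring noise, at the cost of slightly blurring the estimate when the top boxes disagree.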
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010765183.1A CN112150504A (en) | 2020-08-03 | 2020-08-03 | Visual tracking method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112150504A true CN112150504A (en) | 2020-12-29 |
Family
ID=73888775
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010765183.1A Pending CN112150504A (en) | 2020-08-03 | 2020-08-03 | Visual tracking method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112150504A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108171112A (en) * | 2017-12-01 | 2018-06-15 | 西安电子科技大学 | Vehicle identification and tracking based on convolutional neural networks |
CN111291679A (en) * | 2020-02-06 | 2020-06-16 | 厦门大学 | Target specific response attention target tracking method based on twin network |
CN111354017A (en) * | 2020-03-04 | 2020-06-30 | 江南大学 | Target tracking method based on twin neural network and parallel attention module |
Non-Patent Citations (2)
Title |
---|
MARTIN DANELLJAN等: "ATOM: Accurate Tracking by Overlap Maximization", 《IEEE》 * |
ZILONG HUANG等: "CCNet: Criss-Cross Attention for Semantic Segmentation", 《IEEE》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221987A (en) * | 2021-04-30 | 2021-08-06 | 西北工业大学 | Small sample target detection method based on cross attention mechanism |
CN113298850A (en) * | 2021-06-11 | 2021-08-24 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114596273A (en) * | 2022-03-02 | 2022-06-07 | 江南大学 | Intelligent detection method for multiple defects of ceramic substrate by using YOLOV4 network |
CN114596273B (en) * | 2022-03-02 | 2022-11-25 | 江南大学 | Intelligent detection method for multiple defects of ceramic substrate by using YOLOV4 network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259850B (en) | Pedestrian re-identification method integrating random batch mask and multi-scale representation learning | |
CN108615010B (en) | Facial expression recognition method based on parallel convolution neural network feature map fusion | |
CN110414432B (en) | Training method of object recognition model, object recognition method and corresponding device | |
CN106709461B (en) | Activity recognition method and device based on video | |
CN110427867B (en) | Facial expression recognition method and system based on residual attention mechanism | |
CN110909651B (en) | Method, device and equipment for identifying video main body characters and readable storage medium | |
CN111325111A (en) | Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision | |
CN112150504A (en) | Visual tracking method based on attention mechanism | |
CN113673510B (en) | Target detection method combining feature point and anchor frame joint prediction and regression | |
CN111274994B (en) | Cartoon face detection method and device, electronic equipment and computer readable medium | |
CN112529005B (en) | Target detection method based on semantic feature consistency supervision pyramid network | |
CN110503000B (en) | Teaching head-up rate measuring method based on face recognition technology | |
CN110826462A (en) | Human body behavior identification method of non-local double-current convolutional neural network model | |
CN110705566A (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
CN112966574A (en) | Human body three-dimensional key point prediction method and device and electronic equipment | |
CN113643329B (en) | Twin attention network-based online update target tracking method and system | |
CN111414875A (en) | Three-dimensional point cloud head attitude estimation system based on depth regression forest | |
CN111428650B (en) | Pedestrian re-recognition method based on SP-PGGAN style migration | |
CN114897136A (en) | Multi-scale attention mechanism method and module and image processing method and device | |
CN111444957B (en) | Image data processing method, device, computer equipment and storage medium | |
CN114022727B (en) | Depth convolution neural network self-distillation method based on image knowledge review | |
CN115049833A (en) | Point cloud component segmentation method based on local feature enhancement and similarity measurement | |
CN113420289B (en) | Hidden poisoning attack defense method and device for deep learning model | |
CN117115911A (en) | Hypergraph learning action recognition system based on attention mechanism | |
CN117576149A (en) | Single-target tracking method based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201229 ||