CN112150504A - Visual tracking method based on attention mechanism - Google Patents
Classifications
- G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T 7/251: Analysis of motion using feature-based methods involving models
- G06F 18/253: Fusion techniques of extracted features
- G06N 3/045: Combinations of networks
- G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
The invention discloses a visual tracking method based on an attention mechanism. The method operates as follows: (1) preprocess the image data sets, including resizing all images to a uniform size; (2) construct a deep target state estimation network that extracts the position information of the target; (3) feed the training data from step (1) into the network built in step (2) and train until convergence, obtaining a trained model that outputs adjustment vectors; (4) test with the trained model: given the target coordinates in the first frame of a test video, feed the sequence to both the target state estimation network and the target classification network of step (2); the classification network first yields a coarse target position, and the state estimation network then refines it into a more accurate target state. The method extracts the position information of the target more accurately and therefore estimates the state of the tracked target more accurately.
Description
Technical Field
The invention relates to a visual tracking method based on deep learning, and in particular to a visual tracking method based on an attention mechanism, in which an attention module helps the network extract the position information of the tracked target more effectively.
Background
Enabling computers to see, to analyze the information in videos, and thereby to assist or replace humans in tedious or dangerous work has long been a human aspiration. In recent years, with continuing advances in machine learning and deep learning research, artificial intelligence (AI) technology has gradually entered many industries and everyday life, for example autonomous driving, speech recognition, face recognition, and games. Video target tracking predicts the state of a target in a video and provides trajectory features for research such as target behavior analysis; it is an important component of computer vision (CV) and is widely applied in intelligent surveillance, human-computer interaction, autonomous driving, virtual reality, crime prediction, surgical navigation, missile guidance, and military reconnaissance. As early as 1982, Marr et al. constructed a framework for computer vision and showed that Fourier transforms of spatial-frequency-sensitive data could recover the geometry of retinal receptive fields. In 2010, David et al. proposed MOSSE, a tracking algorithm based on correlation filtering that can construct a stable correlation filter from only the first-frame target image. In 2011, Henriques et al. proposed the high-speed kernelized correlation filter KCF, a milestone among discriminative-correlation-filter video trackers, which established the connection between correlation filtering and circulant matrices and achieved competitive tracking performance. Tang et al. replaced the single kernel in KCF with multiple kernels, allowing the tracker to fully exploit the invariance-discriminative power spectrum of multiple features.
Danelljan et al. penalized the correlation filter with a spatial regularization component, yielding a spatially regularized discriminative correlation filtering tracker. In recent years, twin (Siamese) network algorithms have attracted great interest and performed very well: Tao et al. used a twin network to learn a matching function from training videos and then searched for the target in each image with the fixed matching function; Bertinetto et al. proposed a fully convolutional twin network that learns the similarity between the target and candidates; Li et al. combined twin networks with region proposal networks in a single-stage twin tracking algorithm. These trackers achieve good speed, but their estimates of the target state are not accurate enough.
Disclosure of Invention
The invention aims to improve on the prior art by providing a visual tracking method based on an attention mechanism that extracts the position information of the target more accurately and thereby estimates the state of the tracked target more accurately.
To this end, the invention adopts the following technical scheme:
a visual tracking method based on an attention mechanism comprises the following specific operation steps:
(1) constructing a cross location attention module (CCLA);
(2) splitting a visual tracking problem into two tasks of target state estimation and target classification:
a) constructing a target state estimation CNN with ResNet as the backbone to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross positioning attention module of step (1) into Block3 and Block4 of ResNet so that the target state estimation network predicts the target form more easily and accurately; training the whole network starting from models pre-trained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean square error loss, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first obtains a coarse target position, and the target state estimation network then obtains a more accurate target state.
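The training schedule in step (3) (base learning rate 10^-3, multiplied by 0.2 every 15 epochs, 40 epochs in total) can be sketched as a simple step-decay function; the function name and the use of plain Python rather than a framework scheduler are illustrative assumptions:

```python
def step_decay_lr(epoch, base_lr=1e-3, gamma=0.2, step=15):
    """Step-decay schedule from the training setup: the base learning
    rate is multiplied by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Over the 40-epoch run the schedule takes three values:
schedule = [step_decay_lr(e) for e in range(40)]
# epochs 0-14 use 1e-3, epochs 15-29 use 2e-4, epochs 30-39 use 4e-5
```

In PyTorch the same behaviour would typically be obtained with a step-decay scheduler attached to the SGD optimizer.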
Preferably, the cross positioning attention module of step (1) is constructed as follows:
(1-1) the cross positioning attention module consists of an upper branch and a lower branch connected in parallel, denoted CC^2 and CC respectively;
(1-2) the upper branch cascades two criss-cross modules and captures higher-level semantic information;
(1-3) the lower branch consists of a single criss-cross module and captures low-level information about the target; combining the lower branch with the upper branch allows the target state to be estimated accurately.
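A property worth noting behind the CC^2 / CC split: a single criss-cross module lets each position attend only to its own row and column, while cascading two criss-cross modules lets information from every position in the feature map reach every other position (any pixel is reachable in two row/column hops). A small coverage sketch, using plain Python sets rather than the actual attention arithmetic:

```python
def criss_cross_reach(h, w, start):
    """Positions reachable from `start` in ONE criss-cross step:
    every cell in the same row or the same column."""
    r, c = start
    return {(r, j) for j in range(w)} | {(i, c) for i in range(h)}

def two_step_reach(h, w, start):
    """Positions reachable after TWO cascaded criss-cross steps."""
    reached = set()
    for p in criss_cross_reach(h, w, start):
        reached |= criss_cross_reach(h, w, p)
    return reached

h, w = 6, 8
one = criss_cross_reach(h, w, (2, 3))   # h + w - 1 = 13 positions
two = two_step_reach(h, w, (2, 3))      # the full h * w = 48 positions
```

This is why the cascaded upper branch gathers full-image (higher-level) context while the single lower module stays restricted to local row/column information.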
Preferably, the target state estimation network and the discrimination network of step (2) are constructed as follows:
(2-1) the state estimation network is a twin-like (Siamese-style) network: the upper part is an adjustment module whose input is a reference image; the lower part is a testing module whose input is a test image;
(2-2) the adjustment module uses a ResNet as the backbone; Block4 of the ResNet is followed by the upper branch (CC^2) of the cross positioning attention module, obtaining richer semantic information about the target location, while Block3 is followed by the lower branch (CC), obtaining low-level information about the target position; fusing the low-level and high-level information improves the state estimate. A Precise ROI Pooling layer follows the cross positioning attention module; given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 x 1;
(2-3) the lower (testing) part also uses ResNet as the backbone; the outputs of Block3 and Block4 each pass through a convolution layer, are multiplied channel-wise by the corresponding adjustment vector, and then each pass through a Precise ROI Pooling layer. In operation, the test image is randomly perturbed to simulate a target tracking scene, 16 candidate target state boxes are generated, and each candidate box receives a score; the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built from a correlation filtering algorithm and preliminarily obtains the coarse position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
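The final fusion in (2-4), keeping the 3 highest-scoring of the 16 candidate boxes and averaging them coordinate-wise, can be sketched as follows; the (x, y, w, h) box format and the function name are illustrative assumptions:

```python
def fuse_top3(boxes, scores, k=3):
    """Select the k highest-scoring candidate boxes and average them
    coordinate-wise to produce the final target state."""
    ranked = sorted(zip(scores, boxes), key=lambda p: p[0], reverse=True)
    top = [b for _, b in ranked[:k]]
    return tuple(sum(c) / len(top) for c in zip(*top))

# In practice the 16 simulated candidates are scored by the testing
# module; here three plausible boxes dominate one outlier:
boxes  = [(10, 10, 50, 80), (12, 11, 48, 78), (8, 9, 52, 82), (100, 5, 20, 20)]
scores = [0.9, 0.8, 0.7, 0.1]
state = fuse_top3(boxes, scores)   # averages the three high-scoring boxes
```

Averaging the top candidates smooths out per-box regression noise, which is the stated motivation for taking 3 boxes rather than just the single best one.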
Compared with the prior art, the invention has the following prominent substantive features and technical progress:
the method fuses the high-level and low-level information of the target in the image as fully as possible, extracting the position information of the target more accurately and thereby estimating the state of the tracked target more accurately.
Drawings
FIG. 1 is a network flow diagram of a vision tracking method based on attention mechanism.
FIG. 2(a) is a schematic diagram of the cross positioning attention module method in step (1) of the present invention.
FIG. 2(b) is a schematic diagram of the cross module method for forming the cross location attention module in step (2) of the present invention.
Fig. 2(c) is a schematic diagram of the network model method for obtaining convergence in step (3) of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The simulation experiments of the invention were implemented with the PyTorch framework on a PC test platform with a 4 GHz CPU, 64 GB of memory, and a Titan RTX GPU with 24 GB of video memory.
The first embodiment is as follows:
referring to fig. 1, a visual tracking method based on attention mechanism is characterized by comprising the following specific steps:
(1) constructing a cross location attention module (CCLA);
(2) splitting a visual tracking problem into two tasks of target state estimation and target classification:
a) constructing a target state estimation convolutional neural network (CNN) with ResNet as the backbone to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross positioning attention module of step (1) into Block3 and Block4 of ResNet so that the target state estimation network predicts the target form more easily and accurately; training the whole network starting from models pre-trained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean square error loss, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first obtains a coarse target position, and the target state estimation network then obtains a more accurate target state.
Example two:
the present embodiment is substantially the same as the first embodiment in the following aspects:
the construction of the cross positioning attention module in the step (1) comprises the following specific steps:
(1-1) the cross positioning attention module is formed by connecting an upper branch and a lower branch in parallel, wherein the upper branch and the lower branch are respectively CC2And CC;
(1-2) for the upper support of the cross positioning attention module, two cross modules are cascaded, and the upper support module acquires higher semantic information;
(1-3) for the lower support of the cross positioning attention module, the cross positioning attention module consists of a cross module, the cross module acquires the low-level information of the target, and the state of the target is accurately estimated through the combination of the cross positioning attention module and the upper support of the cross positioning attention module.
The target state estimation network and the discrimination network of step (2) are constructed as follows:
(2-1) the state estimation network is a twin-like (Siamese-style) network: the upper part is an adjustment module whose input is a reference image; the lower part is a testing module whose input is a test image;
(2-2) the adjustment module uses a ResNet as the backbone; Block4 of the ResNet is followed by the upper branch (CC^2) of the cross positioning attention module, obtaining richer semantic information about the target location, while Block3 is followed by the lower branch (CC), obtaining low-level information about the target position; fusing the low-level and high-level information improves the state estimate. A Precise ROI Pooling layer follows the cross positioning attention module; given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 x 1;
(2-3) the lower (testing) part also uses ResNet as the backbone; the outputs of Block3 and Block4 each pass through a convolution layer, are multiplied channel-wise by the corresponding adjustment vector, and then each pass through a Precise ROI Pooling layer. In operation, the test image is randomly perturbed to simulate a target tracking scene, 16 candidate target state boxes are generated, and each candidate box receives a score; the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built from a correlation filtering algorithm and preliminarily obtains the coarse position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
Example three:
as shown in fig. 1, a visual tracking method based on attention mechanism includes the following specific steps:
(1) Three data sets suitable for visual target tracking are processed: LaSOT (227 GB, 1,400 videos, 3.52 million images), TrackingNet (1.1 TB, 30,132 training videos, 511 test videos, 27 categories) and COCO (300,000 images, 80 categories). Processing includes unifying the sizes of the input image I and the label G: both are resized to 288 x 288 and fed to the target state estimation network. As shown in fig. 2(a), a reference image is input to the upper part of the training network and a test image to the lower part; the reference and test images are at most 50 frames apart in the sequence. As shown in fig. 2(b), after features of the reference image are extracted by the backbone ResNet, they pass through the cross positioning attention module, whose upper branch is a cascade of two criss-cross attention modules and whose lower branch is a single criss-cross attention module. As shown in fig. 2(c), the criss-cross attention module proceeds as follows:
(1-1) Let the local feature map input to the module be H ∈ R^(C×W×H), where R is the set of real numbers, C the number of channels, W the width and H the height of the feature map. Two convolution blocks with 1 x 1 kernels are first applied to H, generating two feature maps Q and K respectively, with Q, K ∈ R^(C'×W×H);
(1-2) from the obtained feature maps Q and K, an attention map A ∈ R^((H+W-1)×W×H) is generated by the affinity operation. For each point u in the spatial dimension of Q a vector Q_u ∈ R^(C') is taken, where C' is a smaller number of channels than C; the feature vectors in the same row or column as position u are extracted from K to form Ω_u ∈ R^((H+W-1)×C'), whose i-th element is Ω_{i,u}. The affinity operation is

    d_{i,u} = Q_u Ω_{i,u}^T,

where d_{i,u} measures the correlation between Q_u and Ω_{i,u}; a softmax over the resulting map D ∈ R^((H+W-1)×W×H) along the channel dimension then gives the attention map A;
(1-3) a final convolution block with a 1 x 1 kernel acts directly on the feature map H to give the feature map V ∈ R^(C×W×H). For each point u in the spatial dimension of V, a vector V_u ∈ R^C and a set Φ_u ∈ R^((H+W-1)×C) are obtained, where Φ_u collects the feature vectors of V in the same row or column as position u. Long-range contextual information is gathered by the aggregation operation

    H'_u = Σ_{i=0}^{H+W-2} A_{i,u} Φ_{i,u} + H_u,

where H'_u is the feature vector of the output map H' ∈ R^(C×W×H) at position u, and A_{i,u} is the scalar attention weight at channel i and position u of the attention map A;
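The affinity and aggregation operations of (1-1)-(1-3) can be sketched in NumPy for a single spatial position u; the tensor layout, naming, and the per-position loop are illustrative assumptions rather than the patented implementation:

```python
import numpy as np

def criss_cross_at(Q, K, V, Hmap, r, c):
    """Criss-cross attention output H'_u at position u = (r, c).
    Q, K: (C', H, W) query/key maps; V, Hmap: (C, H, W) value/input maps."""
    Cq, H, W = Q.shape
    q_u = Q[:, r, c]                                   # Q_u, shape (C',)
    # Omega_u / Phi_u: the H+W-1 features in the same column or row as u
    # (u's own cell is kept exactly once).
    idx = [(i, c) for i in range(H)] + [(r, j) for j in range(W) if j != c]
    omega = np.stack([K[:, i, j] for i, j in idx])     # (H+W-1, C')
    phi   = np.stack([V[:, i, j] for i, j in idx])     # (H+W-1, C)
    d = omega @ q_u                                    # affinity d_{i,u}
    a = np.exp(d - d.max()); a /= a.sum()              # softmax -> A_{i,u}
    # aggregation plus the residual input feature H_u
    return (a[:, None] * phi).sum(axis=0) + Hmap[:, r, c]

# Toy maps: C = 4 channels, C' = 2 reduced channels, on a 3 x 5 grid
rng = np.random.default_rng(0)
Hmap = rng.standard_normal((4, 3, 5))
Q = rng.standard_normal((2, 3, 5))
K = rng.standard_normal((2, 3, 5))
V = rng.standard_normal((4, 3, 5))
out = criss_cross_at(Q, K, V, Hmap, r=1, c=2)          # H'_u, a (4,) vector
```

In the real module Q, K and V come from 1 x 1 convolutions on H, and the loop over all positions u is vectorized on the GPU.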
(2) as shown in fig. 2(b), a convolutional neural network capable of relatively accurate target state estimation is constructed: the state estimation network is a twin-like network whose upper part is an adjustment module taking a reference image as input and whose lower part is a testing module taking a test image as input; the adjustment module extracts the position information of the target and the testing module accurately estimates the target state, implemented as follows:
(2-1) extracting target position information: the adjustment module uses a ResNet as the backbone; Block4 of the ResNet is followed by the upper branch (CC^2) of the cross positioning attention module, obtaining richer semantic information about the target location, while Block3 is followed by the lower branch (CC), obtaining low-level information about the target position; fusing the low-level and high-level information improves the state estimate. A Precise ROI Pooling layer follows the cross positioning attention module; given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 x 1, which contain the target position information;
(2-2) estimating the target state: the lower part of the network also uses ResNet as the backbone; the outputs of Block3 and Block4 each pass through a convolution layer, are multiplied channel-wise by the corresponding adjustment vector, and then each pass through a Precise ROI Pooling layer. In actual operation, the test image is randomly perturbed to simulate a target tracking scene, 16 candidate target state boxes are generated, and each candidate box receives a score; the higher the score, the closer the candidate is to the true target state;
(3) as shown in fig. 2(c), the images of the three training sets processed in step (1) and their corresponding labels are input to the deep learning network constructed in step (2) for training. The weights of the backbone ResNet are pre-trained on the large-scale ImageNet data set and are not updated during training. The loss function is the minimum mean square error loss suitable for this task, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) finally, testing is performed with the network model trained in step (3). First a target classification network is constructed: a discrimination network built from a correlation filtering algorithm preliminarily obtains the coarse position of the tracked target. The network model trained in step (3) then estimates the target state accurately: i) the 3 highest-scoring candidate boxes are selected from the simulated candidates, and ii) these 3 boxes are averaged to give the final target state.
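The coarse localization of the discrimination network rests on correlation filtering of the kind cited in the background (MOSSE): a filter trained in the Fourier domain on the first-frame target, whose response peak on a search image gives the rough target position. A minimal single-frame NumPy sketch; the regularization value, Gaussian label width, and toy images are assumptions:

```python
import numpy as np

def train_filter(template, sigma=2.0, lam=1e-2):
    """MOSSE-style closed-form correlation filter from one template:
    H* = (G . conj(F)) / (F . conj(F) + lambda), with a Gaussian
    response label G peaked at the template centre."""
    h, w = template.shape
    yy, xx = np.mgrid[0:h, 0:w]
    g = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * sigma ** 2))
    F = np.fft.fft2(template)
    G = np.fft.fft2(g)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def locate(filt, search):
    """Correlate the filter with a search image; the response peak is
    the coarse target position (row, col)."""
    resp = np.real(np.fft.ifft2(filt * np.fft.fft2(search)))
    return np.unravel_index(np.argmax(resp), resp.shape)

# A bright blob at (12, 20) in a noisy 32 x 32 search image; the
# template is the same scene circularly shifted so the blob is centred.
rng = np.random.default_rng(1)
search = 0.05 * rng.standard_normal((32, 32))
search[12, 20] += 3.0
template = np.roll(np.roll(search, 16 - 12, axis=0), 16 - 20, axis=1)
filt = train_filter(template)
peak = locate(filt, search)   # expected near (12, 20)
```

The state estimation network of the invention then refines this coarse peak into the full target state, which a plain correlation filter cannot provide on its own.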
In summary of the above embodiments, the invention discloses a visual tracking method based on an attention mechanism with the following operation steps:
(1) processing the image data sets, including unifying the image size;
(2) constructing the target state estimation deep learning network, with ResNet as the backbone and the cross positioning attention module (CCLA) embedded in it, so that the position information of the target is extracted better;
(3) inputting the training data processed in step (1) into the deep learning network constructed in step (2) and training until convergence, obtaining a trained network model that outputs adjustment vectors;
(4) running experiments on the test data set with the trained network model of step (3): a target classification network first roughly estimates the target position; the network model trained in step (3) then estimates the target state accurately: i) the 3 highest-scoring candidate boxes are selected from the simulated candidates, and ii) these 3 boxes are averaged to give the final target state. The invention fuses the high-level and low-level information of the target in the image as fully as possible and can extract the position information of the target more accurately, thereby estimating the state of the tracked target more accurately.
The embodiments of the invention have been described with reference to the accompanying drawings, but the invention is not limited to these embodiments. Various changes, modifications, substitutions, combinations or simplifications made in accordance with the spirit and principle of the technical solution of the invention are equivalent substitutions and fall within the scope of protection of the invention, provided they meet the purpose of the invention and do not depart from its technical principle and inventive concept.
Claims (3)
1. A visual tracking method based on an attention mechanism is characterized by comprising the following specific steps:
(1) constructing a cross location attention module (CCLA);
(2) splitting a visual tracking problem into two tasks of target state estimation and target classification:
a) constructing a target state estimation CNN with ResNet as the backbone to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross positioning attention module of step (1) into Block3 and Block4 of ResNet so that the target state estimation network predicts the target form more easily and accurately; training the whole network starting from models pre-trained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean square error loss, minimized with stochastic gradient descent until the network converges; training runs for 40 epochs with a base learning rate of 10^-3, multiplied by 0.2 every 15 epochs, and a batch size of 64, obtaining a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first obtains a coarse target position, and the target state estimation network then obtains a more accurate target state.
2. The attention mechanism-based visual tracking method of claim 1, wherein: the construction of the cross positioning attention module in the step (1) comprises the following specific steps:
(1-1) the cross positioning attention module consists of an upper branch and a lower branch connected in parallel, denoted CC^2 and CC respectively;
(1-2) the upper branch cascades two criss-cross modules and captures higher-level semantic information;
(1-3) the lower branch consists of a single criss-cross module and captures low-level information about the target; the state of the target is estimated by combining the lower branch with the upper branch.
3. The attention-based visual tracking method of claim 2, wherein the target state estimation network and the judgment network in step (2) are constructed by the following specific steps:
(2-1) the state estimation network is a Siamese-like (twin) network; the upper part of the twin network is an adjustment module whose input is a reference image (Reference Image); the lower part is a testing module whose input is a test image (Testing Image);
(2-2) the adjustment module uses a ResNet network as its backbone; Block4 of ResNet connects to the upper branch (CC²) of the cross positioning attention module to acquire more semantic information about the target position, and Block3 of ResNet connects to the lower branch (CC) of the cross positioning attention module to acquire low-level information about the target position; the target state is estimated by fusing the low-level and high-level information; a Precise ROI Pooling layer is connected after the cross positioning attention module, so that, given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512 × 1;
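The adjustment vectors of step (2-2) can be illustrated with a crude stand-in for Precise ROI Pooling (the real PrRoI Pooling integrates continuously over the ROI; here a plain crop-and-average on integer feature coordinates, an assumption for illustration only):

```python
import numpy as np

def roi_to_adjust_vector(feat, box):
    """Crop the reference feature map at the target box and pool it into
    one per-channel adjustment vector of shape (C,) (512 x 1 in the claim).
    feat: (C, H, W); box: (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = box
    region = feat[:, y0:y1, x0:x1]      # (C, y1-y0, x1-x0) crop
    return region.mean(axis=(1, 2))     # average-pool each channel
```

In the claimed network, the resulting vector modulates the testing branch channel-wise, as described in step (2-3).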
(2-3) the testing module, i.e. the lower part of the twin network, likewise uses ResNet as its backbone; the outputs of Block3 and Block4 each pass through a convolution layer and are multiplied channel-wise by the corresponding adjustment vector from the adjustment module, and each result then passes through a Precise ROI Pooling layer; in actual operation, the test image is randomly perturbed to simulate the target tracking scene, yielding 16 candidate target state boxes; each candidate box then receives a score, and the higher the score, the closer the candidate box is to the true target state;
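The random perturbation of step (2-3) can be sketched as jittering the previous target state into 16 candidate boxes (Gaussian noise on the centre and on log-scale for width/height; the noise model and magnitude are our assumption, not specified by the patent):

```python
import numpy as np

def jitter_candidates(box, n=16, scale=0.1, seed=None):
    """Simulate tracking perturbations: jitter a (cx, cy, w, h) box
    into n candidate target state boxes."""
    rng = np.random.default_rng(seed)
    cx, cy, w, h = box
    noise = rng.normal(0.0, scale, size=(n, 4))
    return np.stack([cx + noise[:, 0] * w,        # shifted centre x
                     cy + noise[:, 1] * h,        # shifted centre y
                     w * np.exp(noise[:, 2]),     # rescaled width
                     h * np.exp(noise[:, 3])],    # rescaled height
                    axis=1)
```

Each of the 16 candidates would then be scored by the network; log-scale noise keeps widths and heights strictly positive.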
(2-4) the judgment network is constructed from a correlation filtering algorithm and preliminarily obtains the rough position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidate boxes and averaged to obtain the final target state.
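The final fusion of step (2-4), selecting the 3 highest-scoring candidates and averaging them, is a few lines:

```python
import numpy as np

def final_state(candidates, scores, k=3):
    """Average the k highest-scoring candidate boxes (step 2-4).
    candidates: (n, 4) boxes; scores: (n,) network scores."""
    top = np.argsort(scores)[-k:]       # indices of the k best scores
    return candidates[top].mean(axis=0)
```

Averaging the top candidates smooths out per-box scoring noise, at the cost of slightly blurring the estimate when the top boxes disagree.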
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010765183.1A CN112150504A (en) | 2020-08-03 | 2020-08-03 | Visual tracking method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112150504A true CN112150504A (en) | 2020-12-29 |
Family
ID=73888775
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010765183.1A Pending CN112150504A (en) | 2020-08-03 | 2020-08-03 | Visual tracking method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112150504A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108171112A (en) * | 2017-12-01 | 2018-06-15 | 西安电子科技大学 | Vehicle identification and tracking based on convolutional neural networks |
CN111291679A (en) * | 2020-02-06 | 2020-06-16 | 厦门大学 | Target specific response attention target tracking method based on twin network |
CN111354017A (en) * | 2020-03-04 | 2020-06-30 | 江南大学 | Target tracking method based on twin neural network and parallel attention module |
Non-Patent Citations (2)
Title |
---|
MARTIN DANELLJAN等: "ATOM: Accurate Tracking by Overlap Maximization", 《IEEE》 * |
ZILONG HUANG等: "CCNet: Criss-Cross Attention for Semantic Segmentation", 《IEEE》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221987A (en) * | 2021-04-30 | 2021-08-06 | 西北工业大学 | Small sample target detection method based on cross attention mechanism |
CN113298850A (en) * | 2021-06-11 | 2021-08-24 | 安徽大学 | Target tracking method and system based on attention mechanism and feature fusion |
CN114596273A (en) * | 2022-03-02 | 2022-06-07 | 江南大学 | Intelligent detection method for multiple defects of ceramic substrate by using YOLOV4 network |
CN114596273B (en) * | 2022-03-02 | 2022-11-25 | 江南大学 | Intelligent detection method for multiple defects of ceramic substrate by using YOLOV4 network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259850B (en) | Pedestrian re-identification method integrating random batch mask and multi-scale representation learning | |
CN108615010B (en) | Facial expression recognition method based on parallel convolution neural network feature map fusion | |
CN110414432B (en) | Training method of object recognition model, object recognition method and corresponding device | |
CN106709461B (en) | Activity recognition method and device based on video | |
CN110427867B (en) | Facial expression recognition method and system based on residual attention mechanism | |
CN110909651B (en) | Method, device and equipment for identifying video main body characters and readable storage medium | |
CN111325111A (en) | Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision | |
CN112150504A (en) | Visual tracking method based on attention mechanism | |
CN113673510B (en) | Target detection method combining feature point and anchor frame joint prediction and regression | |
CN111274994B (en) | Cartoon face detection method and device, electronic equipment and computer readable medium | |
CN112529005B (en) | Target detection method based on semantic feature consistency supervision pyramid network | |
CN110503000B (en) | Teaching head-up rate measuring method based on face recognition technology | |
CN110826462A (en) | Human body behavior identification method of non-local double-current convolutional neural network model | |
CN110705566A (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
CN112966574A (en) | Human body three-dimensional key point prediction method and device and electronic equipment | |
CN113643329B (en) | Twin attention network-based online update target tracking method and system | |
CN111414875A (en) | Three-dimensional point cloud head attitude estimation system based on depth regression forest | |
CN111428650B (en) | Pedestrian re-recognition method based on SP-PGGAN style migration | |
CN114897136A (en) | Multi-scale attention mechanism method and module and image processing method and device | |
CN111444957B (en) | Image data processing method, device, computer equipment and storage medium | |
CN114022727B (en) | Depth convolution neural network self-distillation method based on image knowledge review | |
CN115049833A (en) | Point cloud component segmentation method based on local feature enhancement and similarity measurement | |
CN113420289B (en) | Hidden poisoning attack defense method and device for deep learning model | |
CN117115911A (en) | Hypergraph learning action recognition system based on attention mechanism | |
CN117576149A (en) | Single-target tracking method based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201229 ||