CN112150504A - Visual tracking method based on attention mechanism - Google Patents

Visual tracking method based on attention mechanism

Info

Publication number
CN112150504A
CN112150504A (application CN202010765183.1A)
Authority
CN
China
Prior art keywords
network
target
module
attention
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010765183.1A
Other languages
Chinese (zh)
Inventor
吴勇
刘志
黄梦珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202010765183.1A priority Critical patent/CN112150504A/en
Publication of CN112150504A publication Critical patent/CN112150504A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual tracking method based on an attention mechanism. The specific operation steps are as follows: (1) processing the image data sets, including unifying image sizes; (2) constructing a target state estimation deep learning network that extracts the position information of the target; (3) inputting the training data processed in step (1) into the deep learning network constructed in step (2) and training until the network converges, yielding a trained network model that outputs an adjustment vector; (4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed to both the target state estimation network and the target classification network of step (2); the target classification network first yields a rough target position, and the target state estimation network then yields a more accurate target state. The method extracts the position information of the target more accurately and can therefore estimate the state of the tracked target more accurately.

Description

Visual tracking method based on attention mechanism
Technical Field
The invention relates to a visual tracking method based on deep learning, and in particular to a visual tracking method based on an attention mechanism, which uses an attention module to help the network better extract the position information of the tracked target.
Background
Giving computers vision, so that they can analyze the information in a video and assist or replace humans in tedious or dangerous work, has long been a human aspiration. In recent years, with continuing progress in machine learning and deep learning research, Artificial Intelligence (AI) technology has gradually entered many industries and everyday life, for example: autonomous driving, speech recognition, face recognition, and games. Video target tracking predicts the state of a target in a video and provides trajectory features for research such as target behavior analysis; it is an important component of Computer Vision (CV) research. It is widely applied in intelligent surveillance, human-computer interaction, autonomous driving, virtual reality, crime prediction, surgical navigation, missile guidance, and military reconnaissance. As early as 1982, Marr et al. constructed a framework for computer vision and showed that Fourier transforms of spatial-frequency-sensitive data could be used to derive the geometry of retinal receptive fields. In 2010, David et al. proposed MOSSE, a tracking algorithm based on correlation filtering that can construct a stable correlation filter given only the target image in the first frame. In 2011, Henriques et al. proposed the high-speed kernelized correlation filter KCF, a milestone among discriminative correlation-filter video tracking algorithms: it established the connection between correlation filtering and circulant matrices and achieved competitive tracking performance. Tang et al. replaced the single kernel in KCF with multiple kernels, enabling the tracker to fully exploit the complementary invariance and discriminative power of multiple features. Danelljan et al. added a spatial regularization component that penalizes the correlation filter, giving a discriminative correlation-filter tracking algorithm based on spatial regularization. In recent years, twin (Siamese) network algorithms have attracted much attention and achieved very good results. Tao et al. used a twin network to learn a matching function from training videos and then searched for the target in the image with the fixed matching function. Bertinetto et al. proposed a fully convolutional twin network to learn the similarity between the target and candidates. Li et al. combined twin networks with region proposal networks to propose a single-stage twin tracking algorithm. These tracking algorithms achieve good performance in terms of speed, but they are not accurate enough when estimating the state of the target.
Disclosure of Invention
The invention aims to improve on the prior art by providing a visual tracking method based on an attention mechanism that extracts the position information of the target more accurately and can therefore estimate the state of the tracked target more accurately.
To achieve this purpose, the invention adopts the following technical scheme:
a visual tracking method based on an attention mechanism comprises the following specific operation steps:
(1) constructing a cross locating attention module (CCLA);
(2) splitting the visual tracking problem into two tasks: target state estimation and target classification:
a) constructing a target state estimation CNN with ResNet as the backbone network to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross attention modules of step (1) into Block3 and Block4 of ResNet, so that the target state estimation network predicts the target form more easily and accurately; the whole network is trained starting from models pretrained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean-square-error loss, and stochastic gradient descent is used to minimize it until the network converges; training runs for 40 epochs with a base learning rate of 10⁻³, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model.
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first yields a rough target position, and the target state estimation network then yields a more accurate target state.
Preferably, the construction of the cross locating attention module in step (1) comprises the following specific steps:
(1-1) the cross locating attention module is formed by an upper branch and a lower branch connected in parallel; the upper branch is CC² and the lower branch is CC;
(1-2) the upper branch cascades two cross modules and captures higher-level semantic information;
(1-3) the lower branch consists of a single cross module and captures low-level information about the target; the combination of the lower branch with the upper branch allows the state of the target to be estimated accurately.
Preferably, the specific steps of the target state estimation network and the discrimination network in step (2) are as follows:
(2-1) the state estimation network is a twin-like network; the upper part of the twin network is the adjustment module, whose input is a reference image; the lower part is the testing module, whose input is a test image;
(2-2) the adjustment module is built with a ResNet network as backbone; Block4 of ResNet is connected to the upper branch (CC²) of the cross locating attention module, which yields more semantic information about the target position, and Block3 of ResNet is connected to the lower branch (CC), which yields low-level information about the target position; fusing the low-level and high-level information gives a better estimate of the target state; a Precise ROI Pooling layer follows the cross locating attention module, and, given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512×1;
(2-3) the lower part of the network likewise uses ResNet as backbone; the outputs of Block3 and Block4 each pass through a convolution layer and are multiplied channel-wise by the corresponding adjustment vector from the adjustment module, and each then passes through a Precise ROI Pooling layer; in operation, the test image is randomly perturbed to simulate a target tracking scene, yielding 16 candidate target state boxes; each candidate box then receives a score, and the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built on a correlation filtering algorithm and yields a preliminary rough position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
Compared with the prior art, the invention has the following prominent substantive features and notable technical progress:
by fusing the high-level and low-level information about the target in the image as far as possible, the method extracts the position information of the target more accurately and therefore estimates the state of the tracked target more accurately.
Drawings
FIG. 1 is a network flow diagram of a vision tracking method based on attention mechanism.
FIG. 2(a) is a schematic diagram of the cross locating attention module of step (1) of the present invention.
FIG. 2(b) is a schematic diagram of the cross module that constitutes the cross locating attention module in step (2) of the present invention.
FIG. 2(c) is a schematic diagram of obtaining the converged network model in step (3) of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The simulation experiments of the invention were implemented with PyTorch on a PC test platform with a 4 GHz CPU, 64 GB of memory, and a Titan RTX GPU with 24 GB of video memory.
Embodiment one:
Referring to FIG. 1, a visual tracking method based on an attention mechanism comprises the following specific steps:
(1) constructing a cross locating attention module (CCLA);
(2) splitting the visual tracking problem into two tasks: target state estimation and target classification:
a) constructing a target state estimation convolutional neural network (CNN) with ResNet as the backbone network to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross attention modules of step (1) into Block3 and Block4 of ResNet, so that the target state estimation network predicts the target form more easily and accurately; the whole network is trained starting from models pretrained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean-square-error loss, and stochastic gradient descent is used to minimize it until the network converges; training runs for 40 epochs with a base learning rate of 10⁻³, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model.
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first yields a rough target position, and the target state estimation network then yields a more accurate target state.
Embodiment two:
This embodiment is substantially the same as Embodiment one; its particulars are as follows:
the construction of the cross positioning attention module in the step (1) comprises the following specific steps:
(1-1) the cross positioning attention module is formed by connecting an upper branch and a lower branch in parallel, wherein the upper branch and the lower branch are respectively CC2And CC;
(1-2) for the upper support of the cross positioning attention module, two cross modules are cascaded, and the upper support module acquires higher semantic information;
(1-3) for the lower support of the cross positioning attention module, the cross positioning attention module consists of a cross module, the cross module acquires the low-level information of the target, and the state of the target is accurately estimated through the combination of the cross positioning attention module and the upper support of the cross positioning attention module.
The specific steps of the target state estimation network and the discrimination network in step (2) are as follows:
(2-1) the state estimation network is a twin-like network; the upper part of the twin network is the adjustment module, whose input is a reference image; the lower part is the testing module, whose input is a test image;
(2-2) the adjustment module is built with a ResNet network as backbone; Block4 of ResNet is connected to the upper branch (CC²) of the cross locating attention module, which yields more semantic information about the target position, and Block3 of ResNet is connected to the lower branch (CC), which yields low-level information about the target position; fusing the low-level and high-level information gives a better estimate of the target state; a Precise ROI Pooling layer follows the cross locating attention module, and, given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512×1;
(2-3) the lower part of the network likewise uses ResNet as backbone; the outputs of Block3 and Block4 each pass through a convolution layer and are multiplied channel-wise by the corresponding adjustment vector from the adjustment module, and each then passes through a Precise ROI Pooling layer; in operation, the test image is randomly perturbed to simulate a target tracking scene, yielding 16 candidate target state boxes; each candidate box then receives a score, and the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built on a correlation filtering algorithm and yields a preliminary rough position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
Embodiment three:
As shown in FIG. 1, a visual tracking method based on an attention mechanism comprises the following specific steps:
(1) Three visual tracking data sets are processed: LaSOT (227 GB; 1400 videos; about 3.52 million images), TrackingNet (1.1 TB; 30132 training videos; 511 test videos; 27 categories) and COCO (about 300,000 images; 80 categories). Processing includes unifying the sizes of the input image I and the label G: both are scaled to 288×288 and input to the target state estimation network. As shown in FIG. 2(a), the reference image is input to the upper part of the training network and the test image to the lower part; the interval between the reference image and the test image is at most 50 frames. As shown in FIG. 2(b), after the features of the reference image are extracted by the backbone ResNet, they pass through a cross locating attention module, whose upper branch cascades two cross attention modules and whose lower branch consists of one cross attention module. A minimal sketch of this preprocessing is given below.
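The following is a minimal sketch of this preprocessing in PyTorch/PIL terms. The function names and the (x, y, w, h) box convention are illustrative assumptions rather than part of the patent.

```python
import random
import torchvision.transforms.functional as TF

def resize_with_box(img, box, size=288):
    # Scale a PIL image to size x size and scale its (x, y, w, h) label box
    # by the same horizontal and vertical factors, unifying I and G sizes.
    w0, h0 = img.size
    img = TF.resize(img, [size, size])
    sx, sy = size / w0, size / h0
    return img, (box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy)

def sample_pair(frames, boxes, max_gap=50):
    # Draw a (reference, test) training pair from one video; the two frames
    # are at most 50 frames apart, as described above.
    i = random.randrange(len(frames))
    j = random.randrange(max(0, i - max_gap), min(len(frames), i + max_gap + 1))
    return resize_with_box(frames[i], boxes[i]), resize_with_box(frames[j], boxes[j])
```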
As shown in FIG. 2(c), the specific process of the cross module is as follows:
(1-1) Let the local feature map input to the module be $H \in \mathbb{R}^{C \times W \times H}$, where $\mathbb{R}$ is the set of real numbers, $C$ is the number of channels, $W$ is the width and $H$ is the height of the feature map. First, two convolution blocks with 1×1 convolution kernels are applied to $H$, generating two feature maps $Q$ and $K$ respectively, with $Q, K \in \mathbb{R}^{C' \times W \times H}$, where $C'$ is a number of channels smaller than $C$;
(1-2) Using the feature maps $Q$ and $K$, an attention map $A \in \mathbb{R}^{(H+W-1) \times W \times H}$ is generated by the affinity operation. From each point $u$ in the spatial dimension of $Q$ we obtain a vector $Q_u \in \mathbb{R}^{C'}$; the feature vectors of $K$ in the same row or column as position $u$ are collected into $\Omega_u \in \mathbb{R}^{(H+W-1) \times C'}$, with $\Omega_{i,u}$ the $i$-th element of $\Omega_u$. The affinity operation is
$$ d_{i,u} = Q_u \, \Omega_{i,u}^{\mathrm{T}}, \qquad i = 1, \ldots, H+W-1, $$
where $\mathrm{T}$ is the transpose symbol and $d_{i,u} \in D$ measures the affinity between $Q_u$ and $\Omega_{i,u}$, with $D \in \mathbb{R}^{(H+W-1) \times W \times H}$; the attention map $A$ is then obtained by applying a softmax to $D$ over the channel dimension;
(1-3) The last convolution block with a 1×1 convolution kernel acts directly on the feature map $H$, giving the feature map $V \in \mathbb{R}^{C \times W \times H}$. At each point $u$ in the spatial dimension of $V$ we obtain $V_u \in \mathbb{R}^{C}$ and $\Phi_u \in \mathbb{R}^{(H+W-1) \times C}$, the set of feature vectors of $V$ in the same row or column as position $u$. Long-range contextual information is collected by the aggregation operation
$$ H'_u = \sum_{i=1}^{H+W-1} A_{i,u} \, \Phi_{i,u} + H_u, $$
where $H'_u$ is the feature vector of the output feature map $H' \in \mathbb{R}^{C \times W \times H}$ at position $u$, and $A_{i,u}$ is the scalar attention weight at channel $i$ and position $u$ of $A$.
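The cross module described in (1-1) to (1-3) matches the criss-cross attention of the cited CCNet literature; below is a sketch in PyTorch under that assumption. The channel-reduction factor of 8 is an illustrative choice, and the official CCNet code additionally masks one duplicate self-affinity term, which is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModule(nn.Module):
    # One cross attention step: every position u attends to the H + W - 1
    # positions in its own row and column, per steps (1-1) to (1-3).
    def __init__(self, channels, reduction=8):
        super().__init__()
        c = channels // reduction                      # C' < C, step (1-1)
        self.query = nn.Conv2d(channels, c, 1)         # 1x1 conv -> Q
        self.key = nn.Conv2d(channels, c, 1)           # 1x1 conv -> K
        self.value = nn.Conv2d(channels, channels, 1)  # 1x1 conv -> V, step (1-3)

    def forward(self, x):
        b, C, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Row affinities d_{i,u} = Q_u . Omega_{i,u}^T along each row (step (1-2)).
        q_row = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        k_row = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        e_row = torch.bmm(q_row, k_row.transpose(1, 2)).view(b, h, w, w)

        # Column affinities along each column.
        q_col = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        k_col = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        e_col = torch.bmm(q_col, k_col.transpose(1, 2)).view(b, w, h, h)
        e_col = e_col.permute(0, 2, 1, 3)              # -> (b, h, w, h)

        # Softmax over the H + W - 1 row/column affinities -> attention map A.
        attn = F.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)
        a_row = attn[..., :w].reshape(b * h, w, w)
        a_col = attn[..., w:].permute(0, 2, 1, 3).reshape(b * w, h, h)

        # Aggregation H'_u = sum_i A_{i,u} Phi_{i,u} + H_u (step (1-3)).
        v_row = v.permute(0, 2, 3, 1).reshape(b * h, w, C)
        out_row = torch.bmm(a_row, v_row).view(b, h, w, C).permute(0, 3, 1, 2)
        v_col = v.permute(0, 3, 2, 1).reshape(b * w, h, C)
        out_col = torch.bmm(a_col, v_col).view(b, w, h, C).permute(0, 3, 2, 1)
        return x + out_row + out_col

# Cross locating attention module: the upper branch CC^2 cascades two cross
# modules (for Block4), the lower branch CC uses one cross module (for Block3).
def make_ccla_branches(c_block4, c_block3):
    return nn.Sequential(CrossModule(c_block4), CrossModule(c_block4)), CrossModule(c_block3)
```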
(2) As shown in FIG. 2(b), a convolutional neural network that can estimate the state of the target relatively accurately is constructed. The state estimation network is a twin-like network: the upper part of the twin network is the adjustment module, whose input is a reference image, and the lower part is the testing module, whose input is a test image. The adjustment module extracts the position information of the target, and the testing module accurately estimates the state of the target. This is implemented as follows:
(2-1) Extracting target position information: the adjustment module is built with a ResNet network as backbone. Block4 of ResNet is connected to the upper branch (CC²) of the cross locating attention module, yielding more semantic information about the target position; Block3 of ResNet is connected to the lower branch (CC), yielding low-level information about the target position; fusing the low-level and high-level information gives a better estimate of the target state. A Precise ROI Pooling layer follows the cross locating attention module, and, given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512×1 that carry the target position information; one such reference-side head is sketched below;
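A sketch of one reference-side head follows, reusing CrossModule from the sketch above. torchvision's roi_align stands in for the Precise ROI Pooling layer, and the 4×4 pooled size and the final linear layer are assumptions; the patent fixes only the 512×1 output size.

```python
import torch.nn as nn
from torchvision.ops import roi_align  # stand-in for Precise ROI Pooling

class AdjustHead(nn.Module):
    # Reference-side head: cross attention branch + ROI pooling over the
    # reference target box -> one 512 x 1 adjustment vector (step (2-1)).
    def __init__(self, branch, channels, pool=4):
        super().__init__()
        self.branch = branch   # CC^2 branch for Block4 features, CC for Block3
        self.pool = pool
        self.fc = nn.Linear(channels * pool * pool, 512)

    def forward(self, feat, rois):
        # feat: (B, C, H, W) backbone features; rois: (B, 5) rows of
        # (batch_index, x1, y1, x2, y2) in feature-map coordinates.
        x = self.branch(feat)
        x = roi_align(x, rois, output_size=self.pool)
        return self.fc(x.flatten(1))   # (B, 512) adjustment vector
```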
(2-2) Estimating the target state: the lower part of the network likewise uses ResNet as backbone; the outputs of Block3 and Block4 each pass through a convolution layer and are multiplied channel-wise by the corresponding adjustment vector from the adjustment module, and each then passes through a Precise ROI Pooling layer. In operation, the test image is randomly perturbed to simulate a target tracking scene, yielding 16 candidate target state boxes; each candidate box then receives a score, and the higher the score, the closer the candidate is to the true target state; the candidate generation and scoring are sketched below;
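The sketch below covers this test side under the same assumptions: the jitter magnitudes and the scoring head are illustrative, since the patent fixes only the count of 16 candidates.

```python
import torch
from torchvision.ops import roi_align  # stand-in for Precise ROI Pooling

def jitter_candidates(box, n=16, shift=0.1, scale=0.1):
    # Randomly perturb a rough (cx, cy, w, h) box into n candidate target
    # states, simulating the target tracking scene (step (2-2)).
    cx, cy, w, h = box
    d = torch.randn(n, 2) * shift * torch.tensor([w, h])  # centre shifts
    s = torch.exp(torch.randn(n, 2) * scale)              # width/height scales
    return torch.stack([cx + d[:, 0], cy + d[:, 1], w * s[:, 0], h * s[:, 1]], dim=1)

def score_candidates(test_feat, adjust_vec, cand, head):
    # Channel-wise modulation of the test features by the adjustment vector,
    # then ROI pooling per candidate box and one score per candidate.
    modulated = test_feat * adjust_vec.view(1, -1, 1, 1)  # (1, C, H, W)
    xy1 = cand[:, :2] - cand[:, 2:] / 2                   # (cx,cy,w,h) -> corners
    xy2 = cand[:, :2] + cand[:, 2:] / 2
    rois = torch.cat([torch.zeros(len(cand), 1), xy1, xy2], dim=1)
    pooled = roi_align(modulated, rois, output_size=4)    # (n, C, 4, 4)
    return head(pooled.flatten(1)).squeeze(1)             # (n,) scores
```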
(3) As shown in FIG. 2(c), the three processed training sets from step (1), together with their corresponding labels, are input into the deep learning network constructed in step (2) for training. The weight parameters of the backbone ResNet are pretrained on the ImageNet large-scale data set, and the backbone weights are not updated during training. The loss function is the minimum mean-square-error loss suited to this task, and stochastic gradient descent is used to minimize it until the network converges. Training runs for 40 epochs with a base learning rate of 10⁻³, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model; the training configuration is sketched below.
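A sketch of this training configuration follows. The model interface and the loader fields are assumptions; the scored-candidate regression target follows the spirit of the cited ATOM literature rather than an explicit statement in the patent.

```python
import torch

def train(model, loader, epochs=40):
    # Backbone weights are ImageNet-pretrained and not updated (step (3)).
    for p in model.backbone.parameters():
        p.requires_grad = False

    criterion = torch.nn.MSELoss()  # minimum mean-square-error loss
    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=1e-3)
    # Base learning rate 1e-3, multiplied by 0.2 every 15 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.2)

    for _ in range(epochs):  # 40 epochs; the loader batches 64 pairs at a time
        for ref, ref_box, test, cand, target in loader:
            optimizer.zero_grad()
            pred = model(ref, ref_box, test, cand)  # scores for the candidates
            loss = criterion(pred, target)
            loss.backward()
            optimizer.step()
        scheduler.step()
```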
(4) Finally, the network model trained in step (3) is tested. First a target classification network is constructed: a discrimination network built on a correlation filtering algorithm yields a preliminary rough position of the tracked target. The network model trained in step (3) then estimates the target state accurately: i) the 3 highest-scoring candidate boxes are selected from the simulated candidates, and ii) these 3 candidate boxes are averaged to give the final target state, as in the sketch below.
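A sketch of this two-stage refinement, reusing jitter_candidates and score_candidates from the sketch above; the rough box is assumed to come from the correlation-filter classification network.

```python
def refine_state(rough_box, test_feat, adjust_vec, head):
    # i) jitter the rough box into 16 candidates and score them;
    # ii) average the 3 highest-scoring boxes -> final target state (step (4)).
    cand = jitter_candidates(rough_box, n=16)
    scores = score_candidates(test_feat, adjust_vec, cand, head)
    top3 = scores.topk(3).indices
    return cand[top3].mean(dim=0)  # final (cx, cy, w, h) target state
```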
Summarizing the above embodiments, the invention discloses a visual tracking method based on an attention mechanism. The specific operation steps are as follows:
(1) processing the image data sets, including unifying the sizes;
(2) constructing the target state estimation deep learning network, with ResNet as the backbone network and the cross locating attention module (CCLA) embedded in the ResNet backbone, so that the position information of the target is extracted better;
(3) inputting the training data processed in step (1) into the deep learning network constructed in step (2) and training until the network converges, yielding a trained network model that outputs adjustment vectors;
(4) running experiments on the test data set with the network model trained in step (3): first, a target classification network roughly estimates the target position; then the network model trained in step (3) estimates the target state accurately, i) selecting the 3 highest-scoring candidate boxes from the simulated candidates and ii) averaging these 3 boxes to give the final target state. The invention fuses the high-level and low-level information about the target in the image as far as possible and can extract the position information of the target more accurately, thereby estimating the state of the tracked target more accurately.
Embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to these embodiments; various changes and modifications may be made according to the purpose of the invention. Any change, modification, substitution, combination or simplification made according to the spirit and principle of the technical solution of the invention shall be an equivalent substitution and shall fall within the protection scope of the invention, as long as it meets the purpose of the invention and does not depart from the technical principle and inventive concept of the invention.

Claims (3)

1. A visual tracking method based on an attention mechanism is characterized by comprising the following specific steps:
(1) constructing a cross locating attention module (CCLA);
(2) splitting the visual tracking problem into two tasks: target state estimation and target classification:
a) constructing a target state estimation CNN with ResNet as the backbone network to accurately predict the target form during tracking;
b) constructing a discrimination network to separate the target from the background;
(3) embedding the cross attention modules of step (1) into Block3 and Block4 of ResNet, so that the target state estimation network predicts the target form more easily and accurately; the whole network is trained starting from models pretrained for image classification (ResNet-18, ResNet-34 and ResNet-50); the training sets are COCO (20 GB), LaSOT (227 GB) and TrackingNet (1.1 TB); the loss function is the minimum mean-square-error loss, and stochastic gradient descent is used to minimize it until the network converges; training runs for 40 epochs with a base learning rate of 10⁻³, multiplied by 0.2 every 15 epochs, and a batch size of 64, yielding a converged network model;
(4) testing with the network model trained in step (3): for a video sequence to be tested, the target coordinates in the first frame are given and fed simultaneously to the target state estimation network and the target classification network of step (2); the target classification network first yields a rough target position, and the target state estimation network then yields a more accurate target state.
2. The visual tracking method based on an attention mechanism of claim 1, wherein the construction of the cross locating attention module in step (1) comprises the following specific steps:
(1-1) the cross locating attention module is formed by an upper branch and a lower branch connected in parallel; the upper branch is CC² and the lower branch is CC;
(1-2) the upper branch cascades two cross modules and captures higher-level semantic information;
(1-3) the lower branch consists of a single cross module and captures low-level information about the target; the combination of the lower branch with the upper branch allows the state of the target to be estimated.
3. The visual tracking method based on an attention mechanism of claim 2, wherein the specific steps of the target state estimation network and the discrimination network in step (2) are as follows:
(2-1) the state estimation network is a twin-like network; the upper part of the twin network is the adjustment module, whose input is a reference image; the lower part is the testing module, whose input is a test image;
(2-2) the adjustment module is built with a ResNet network as backbone; Block4 of ResNet is connected to the upper branch (CC²) of the cross locating attention module, yielding more semantic information about the target position, and Block3 of ResNet is connected to the lower branch (CC), yielding low-level information about the target position; fusing the low-level and high-level information allows the target state to be estimated; a Precise ROI Pooling layer follows the cross locating attention module, and, given the target position in the reference image, the adjustment module finally outputs two adjustment vectors of size 512×1;
(2-3) the lower part of the network likewise uses ResNet as backbone; the outputs of Block3 and Block4 each pass through a convolution layer and are multiplied channel-wise by the corresponding adjustment vector from the adjustment module, and each then passes through a Precise ROI Pooling layer; in operation, the test image is randomly perturbed to simulate a target tracking scene, yielding 16 candidate target state boxes; each candidate box then receives a score, and the higher the score, the closer the candidate is to the true target state;
(2-4) the discrimination network is built on a correlation filtering algorithm and yields a preliminary rough position of the tracked target; the adjustment module then estimates the target state accurately: the 3 highest-scoring candidate boxes are selected from the simulated candidates and averaged to give the final target state.
CN202010765183.1A 2020-08-03 2020-08-03 Visual tracking method based on attention mechanism Pending CN112150504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010765183.1A CN112150504A (en) 2020-08-03 2020-08-03 Visual tracking method based on attention mechanism


Publications (1)

Publication Number Publication Date
CN112150504A (en) 2020-12-29

Family

ID=73888775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010765183.1A Pending CN112150504A (en) 2020-08-03 2020-08-03 Visual tracking method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112150504A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171112A (en) * 2017-12-01 2018-06-15 西安电子科技大学 Vehicle identification and tracking based on convolutional neural networks
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network
CN111354017A (en) * 2020-03-04 2020-06-30 江南大学 Target tracking method based on twin neural network and parallel attention module

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Martin Danelljan et al.: "ATOM: Accurate Tracking by Overlap Maximization", IEEE *
Zilong Huang et al.: "CCNet: Criss-Cross Attention for Semantic Segmentation", IEEE *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221987A (en) * 2021-04-30 2021-08-06 西北工业大学 Small sample target detection method based on cross attention mechanism
CN113298850A (en) * 2021-06-11 2021-08-24 安徽大学 Target tracking method and system based on attention mechanism and feature fusion
CN114596273A (en) * 2022-03-02 2022-06-07 江南大学 Intelligent detection method for multiple defects of ceramic substrate by using YOLOV4 network
CN114596273B (en) * 2022-03-02 2022-11-25 江南大学 Intelligent detection method for multiple defects of ceramic substrate by using YOLOV4 network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201229)