CN116402858B - Transformer-based space-time information fusion infrared target tracking method - Google Patents

Transformer-based space-time information fusion infrared target tracking method Download PDF

Info

Publication number
CN116402858B
CN116402858B (application number CN202310406030.1A)
Authority
CN
China
Prior art keywords
network
formula
image
infrared
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310406030.1A
Other languages
Chinese (zh)
Other versions
CN116402858A (en)
Inventor
齐美彬
汪沁昕
庄硕
张可
李坤袁
刘一敏
杨艳芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202310406030.1A priority Critical patent/CN116402858B/en
Publication of CN116402858A publication Critical patent/CN116402858A/en
Application granted granted Critical
Publication of CN116402858B publication Critical patent/CN116402858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Transformer-based spatio-temporal information fusion infrared target tracking method, which comprises the following steps: first, preprocessing an infrared image; second, constructing an infrared target tracking network comprising an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network; third, constructing a loss function for the infrared target tracking network; and fourth, optimizing the infrared target tracking network with a two-stage training method. By designing these components, the invention fuses spatial and temporal information during infrared target tracking, with the aim of improving the accuracy and robustness of infrared target tracking across different tracking scenes.

Description

Transformer-based space-time information fusion infrared target tracking method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a Transformer-based spatio-temporal information fusion infrared target tracking method.
Background
Thermal infrared target tracking is a promising research direction in the field of visual target tracking. Its task is to continuously predict the position of a target in subsequent video frames, given the initial state of the target to be tracked in an infrared video sequence. Because infrared imaging does not depend on illumination intensity and is related only to the temperature radiated by objects, infrared target tracking can follow a target under low visibility and even in complete darkness, giving it all-weather capability in complex environments. It is therefore widely applied in fields such as maritime rescue, video surveillance and night driving assistance.
Despite these unique advantages, infrared target tracking also faces many challenges. For example, infrared targets carry no color information, lack rich texture features and have blurred contours. These shortcomings deprive infrared targets of local detail features, preventing existing feature extraction models designed for visible-light images from obtaining strongly discriminative feature representations of infrared targets. In addition, thermal infrared target tracking must cope with a range of difficulties such as thermal crossover, occlusion and scale changes. To address these problems, infrared target tracking models based on hand-crafted features have been proposed; although these methods have made some progress, the limited representation capability of hand-crafted features still restricts further improvement of tracker performance.
Given the powerful feature representation capability of convolutional neural networks, some researchers have tried to introduce CNN features into the infrared target tracking task. For example, MCFTS uses a pre-trained convolutional neural network to extract features from multiple convolutional layers of a thermal infrared target and, in combination with correlation filters, builds an ensemble infrared tracker. In recent years, Siamese networks have been widely applied to visible-light tracking, where tracking is treated as a matching problem: the matching network is trained offline and then performs online tracking. Inspired by this, many infrared trackers based on the Siamese framework have been developed. Among them, MMNet integrates TIR-specific discriminative features and fine-grained features in a multi-task matching framework, while SiamMSS proposes multi-group spatial shift models to enhance the details of the feature map. However, existing Siamese infrared trackers focus only on spatial information: either the first frame is used as a fixed template for matching the target in subsequent frames, or a correlation filter is combined with the Siamese network so that the template is updated with historical prediction information. Although these tracking algorithms achieve good performance and real-time speed in many conventional scenarios, they may drift severely and cannot recover from tracking failure when the target undergoes drastic appearance changes, non-rigid deformation or partial occlusion.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a Transformer-based spatio-temporal information fusion infrared target tracking method. It captures global dependencies among infrared image features through the Transformer attention mechanism, and introduces spatio-temporal information of reference value into the model by exploiting salient point information and an IOU-Aware evaluation criterion, thereby further improving the accuracy and robustness of infrared target tracking.
The invention adopts the following technical scheme to solve the above problems:
The Transformer-based spatio-temporal information fusion infrared target tracking method of the invention is characterized by comprising the following steps:
step one, preprocessing an infrared image;
Step 1.1: Arbitrarily select a video sequence V containing an infrared target Obj from an infrared target tracking dataset, and crop and scale the i-th frame image V_i, the j-th frame image V_j and the k-th frame image V_k of the video sequence V to obtain, respectively, the preprocessed static template image V'_i of size H_T × W_T × C', the preprocessed dynamic template image V'_j of size H_D × W_D × C' and the preprocessed search image V'_k of size H_S × W_S × C'. V'_i, V'_j and V'_k serve as the input of the infrared target tracking network, where H_T, W_T are the height and width of V'_i, H_D, W_D are the height and width of V'_j, H_S, W_S are the height and width of V'_k, and C' is the number of channels of each image;
step two, constructing an infrared target tracking network, which comprises the following steps: an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network;
Step 2.1: The feature extraction sub-network is a ResNet50 network, which extracts features from the preprocessed static template image V'_i, dynamic template image V'_j and search image V'_k, respectively, to obtain the static template feature map of size (H_T/d) × (W_T/d) × C, the dynamic template feature map of size (H_D/d) × (W_D/d) × C and the search image feature map of size (H_S/d) × (W_S/d) × C, where d is the downsampling factor of the feature extraction network and C is the number of channels of each downsampled feature map;
Step 2.2: Flatten the three feature maps along the spatial dimension to obtain the corresponding static template feature sequence, dynamic template feature sequence and search image feature sequence, and concatenate them to obtain the hybrid feature sequence f_m;
Step 2.3: Add a sinusoidal positional encoding to the hybrid feature sequence f_m to obtain the hybrid feature sequence f_M containing the positional encoding;
Step 2.4: Construct the infrared image feature fusion sub-network to process the hybrid feature sequence f_M and obtain the search feature map F'_S;
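As an illustration of steps 2.2 and 2.3 above, the following PyTorch-style sketch builds the hybrid feature sequence. The tensor names F_T, F_D, F_S and the use of torch are assumptions; the patent does not tie the method to a specific framework.

```python
import torch

def build_hybrid_sequence(F_T, F_D, F_S, pos_encoding):
    """Flatten the three feature maps along their spatial dimensions, concatenate
    them into the hybrid feature sequence f_m, and add a sinusoidal position code
    to obtain f_M.  Each input has shape (H/d, W/d, C) as in step 2.1."""
    f_T = F_T.reshape(-1, F_T.shape[-1])      # static template feature sequence
    f_D = F_D.reshape(-1, F_D.shape[-1])      # dynamic template feature sequence
    f_S = F_S.reshape(-1, F_S.shape[-1])      # search image feature sequence
    f_m = torch.cat([f_T, f_D, f_S], dim=0)   # hybrid feature sequence f_m
    return f_m + pos_encoding                 # f_M, input to the fusion sub-network
```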
Step 2.5: The corner prediction head sub-network consists of two fully convolutional networks, each containing A stacked Conv-BN-ReLU layers and one Conv layer. It performs corner probability prediction for the prediction bounding box of the infrared target Obj contained in F'_S, so that the two fully convolutional networks output, respectively, the corner probability distribution map P_tl of the top-left corner and the corner probability distribution map P_br of the bottom-right corner of the prediction bounding box;
Step 2.6: Calculate the top-left corner coordinates (x'_tl, y'_tl) and bottom-right corner coordinates (x'_br, y'_br) of the prediction bounding box using equation (1), thereby obtaining the prediction bounding box B' = (x'_tl, y'_tl, x'_br, y'_br) of the infrared target Obj in the search image V'_k, where (x, y) denotes a coordinate on the corner probability distribution maps P_tl, P_br;
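Equation (1) is reproduced only as an image in the source. A common reading consistent with the surrounding text is a soft-argmax, i.e. the expected coordinate under each corner probability map; the sketch below follows that assumption.

```python
import torch

def soft_argmax(P):
    """P: a normalized corner probability distribution map of shape (H, W).
    Returns the expected coordinate, i.e. the sum over (x, y) of coordinate * P(x, y)."""
    H, W = P.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=P.dtype),
                            torch.arange(W, dtype=P.dtype), indexing="ij")
    return (xs * P).sum(), (ys * P).sum()

# x_tl, y_tl = soft_argmax(P_tl); x_br, y_br = soft_argmax(P_br)
# B' = (x_tl, y_tl, x_br, y_br)  -- the prediction bounding box of step 2.6
```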
step 2.7: the salient point focusing sub-network is used for extracting salient point characteristics
Step 2.8: The IOU-Aware target state evaluation head sub-network consists of a multi-layer perceptron; all salient point features within B_F contained in F'_S are input into the IOU-Aware target state evaluation head sub-network, which outputs the IOU Score of the prediction bounding box B';
step three, constructing a loss function of the infrared target tracking network;
Step 3.1: Construct the loss function L_bp of the corner prediction head sub-network using equation (2);
In equation (2), λ_L1 and λ_GIOU are real-valued weighting hyperparameters, and B = (x_tl, y_tl, x_br, y_br) denotes the four corner coordinates of the ground-truth box of the infrared target Obj; L1_loss denotes the loss over the distances between the four corners of the prediction bounding box and the ground-truth box, and is obtained from equation (3); GIOU_loss denotes the generalized intersection-over-union loss of the prediction bounding box and the ground-truth box, and is obtained from equation (4);
In equation (3), B'_t denotes the t-th corner coordinate of the prediction bounding box B', and B_t denotes the t-th corner coordinate of the ground-truth box B;
GIOU_loss = 1 - GIOU (4)
In equation (4), GIOU denotes the generalized intersection-over-union of B' and B, and is obtained from equation (5);
In equation (5), rec denotes the area of the smallest rectangular box enclosing B' and B, and is obtained from equation (6); IOU denotes the intersection-over-union of B' and B, and is obtained from equation (8);
rec = (x_4 - x_1)(y_4 - y_1) (6)
In equation (6), x_4, y_4 denote the maxima of the bottom-right corner coordinates of B' and B, and x_1, y_1 denote the minima of the top-left corner coordinates of B' and B, respectively, and are obtained from equation (7);
In equation (8), union denotes the union area of B' and B, and is obtained from equation (9);
union = S' + S - inter (9)
In equation (9), inter denotes the intersection area of B' and B, and is obtained from equation (10); S' denotes the area of B', S denotes the area of B, and both are obtained from equation (11);
inter = (x_3 - x_2)(y_3 - y_2) (10)
In equation (10), x_2, y_2 denote the maxima of the top-left corner coordinates of B' and B, and x_3, y_3 denote the minima of the bottom-right corner coordinates of B' and B, respectively, and are obtained from equation (12);
In equation (11), B'_w, B'_h denote the width and height of B', and B_w, B_h denote the width and height of B, respectively, and are obtained from equation (13);
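Equations (2)-(13) appear only as images in the source; the sketch below restates the corner prediction loss in code purely from the textual definitions above. The variable names, the zero-clamping of the intersection, and the λ values shown are assumptions or placeholders.

```python
def corner_loss(Bp, B, lambda_l1=5.0, lambda_giou=2.0):
    """Bp, B: (x_tl, y_tl, x_br, y_br) for the predicted and ground-truth boxes.
    lambda_l1 / lambda_giou are placeholder weights; the patent only states that
    they are real-valued hyperparameters."""
    # Equation (3): L1 loss over the four corner coordinates
    l1_loss = sum(abs(p - g) for p, g in zip(Bp, B))

    # Intersection (equations (10), (12)) and box areas (equations (11), (13))
    x2, y2 = max(Bp[0], B[0]), max(Bp[1], B[1])
    x3, y3 = min(Bp[2], B[2]), min(Bp[3], B[3])
    inter = max(0.0, x3 - x2) * max(0.0, y3 - y2)   # clamp added for disjoint boxes
    S_p = (Bp[2] - Bp[0]) * (Bp[3] - Bp[1])
    S_g = (B[2] - B[0]) * (B[3] - B[1])

    # Union (equation (9)) and IOU (equation (8))
    union = S_p + S_g - inter
    iou = inter / union

    # Smallest enclosing rectangle (equations (6), (7)) and GIOU (equation (5))
    x1, y1 = min(Bp[0], B[0]), min(Bp[1], B[1])
    x4, y4 = max(Bp[2], B[2]), max(Bp[3], B[3])
    rec = (x4 - x1) * (y4 - y1)
    giou = iou - (rec - union) / rec

    # Equations (4) and (2)
    giou_loss = 1.0 - giou
    return lambda_l1 * l1_loss + lambda_giou * giou_loss
```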
Step 3.2: Construct the loss function L_IATSE of the IOU-Aware target state evaluation head sub-network using equation (14):
L_IATSE = -|IOU - Score|^β · ((1 - IOU)·log(1 - Score) + IOU·log(Score)) (14)
In equation (14), β is a real-valued hyperparameter;
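A direct transcription of equation (14) is sketched below. Treating Score as the sigmoid output of the evaluation head and adding a small epsilon for numerical stability are assumptions beyond the formula itself.

```python
import math

def iou_aware_loss(iou, score, beta=2.0):
    """Equation (14): a focal-style binary cross-entropy whose target is the IOU
    between the prediction bounding box and the ground-truth box.
    beta = 2 follows the embodiment; iou and score are expected in (0, 1)."""
    eps = 1e-6                                   # numerical guard, not in the patent formula
    score = min(max(score, eps), 1.0 - eps)
    weight = abs(iou - score) ** beta
    return -weight * ((1.0 - iou) * math.log(1.0 - score) + iou * math.log(score))
```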
Step four, optimizing the infrared target tracking network by adopting a two-stage training method;
Step 4.1: In the first training stage, freeze the IOU-Aware target state evaluation head sub-network, train the remaining parts of the infrared target tracking network with a gradient descent algorithm, and update the network parameters by minimizing the loss function shown in equation (2); stop training when the number of training iterations reaches the set number, thereby obtaining the preliminarily trained infrared target tracking network;
Step 4.2: In the second training stage, freeze the preliminarily trained infrared image feature extraction sub-network, infrared image feature fusion sub-network and salient point focusing sub-network, train the preliminarily trained corner prediction head sub-network and the IOU-Aware target state evaluation head sub-network with a gradient descent algorithm, and update the network parameters by minimizing the loss function shown in equation (15); stop training when the number of training iterations reaches the set number, thereby obtaining the trained infrared target tracking model for continuous and accurate localization of the infrared target;
In equation (15), the loss-weighting coefficient is a real-valued hyperparameter.
The Transformer-based spatio-temporal information fusion infrared target tracking method is further characterized in that the infrared image feature fusion sub-network in step 2.4 comprises a Transformer-based encoder module, a Transformer-based decoder module and a codec post-processing module, and the search feature map F'_S is obtained according to the following steps:
Step 2.4.1: The Transformer-based encoder module consists of R multi-head self-attention blocks; the hybrid feature sequence f_M containing the positional encoding is input into the encoder module to model global relationships in the spatial and temporal dimensions, yielding a discriminative spatio-temporal feature sequence f'_M, where R is the number of multi-head self-attention blocks in the encoder module;
Step 2.4.2: The Transformer-based decoder module consists of N multi-head self-attention blocks; the spatio-temporal feature sequence f'_M and a single target query are input into the decoder module for cross-attention processing, which outputs the enhanced target query, where N is the number of multi-head self-attention blocks in the decoder module;
Step 2.4.3: The codec post-processing module decouples the corresponding search-region feature sequence f'_S from the spatio-temporal feature sequence f'_M, computes the similarity score att between f'_S and the target query oq, multiplies att and f'_S element-wise to obtain the enhanced search-region feature sequence, and finally restores this sequence to the enhanced search feature map F'_S;
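A minimal PyTorch sketch of the encoder-decoder fusion of steps 2.4.1-2.4.3 follows. The module hyperparameters (d_model, nhead), the dot-product-plus-softmax form of the similarity score att, and the class and attribute names are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the fusion sub-network: an R-block encoder, an N-block decoder
    driven by a single target query, and post-processing that re-weights the
    search-region part of the encoded sequence."""
    def __init__(self, d_model=256, nhead=8, R=6, N=6, len_search=400):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=R)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=N)
        self.query = nn.Parameter(torch.zeros(1, 1, d_model))   # single target query
        self.len_search = len_search   # length of the search sub-sequence (H_S*W_S/d^2)

    def forward(self, f_M):
        # f_M: (B, L, C) hybrid sequence with position code -> spatio-temporal f'_M
        f_M_enc = self.encoder(f_M)
        # cross-attention of the target query over f'_M -> enhanced target query oq
        oq = self.decoder(self.query.expand(f_M.size(0), -1, -1), f_M_enc)
        # decouple the search-region sequence, score it against oq, re-weight it
        f_S = f_M_enc[:, -self.len_search:, :]
        att = torch.softmax((f_S * oq).sum(-1), dim=-1).unsqueeze(-1)
        f_S_enh = att * f_S
        return f_S_enh   # reshape outside into the enhanced search feature map F'_S
```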
The salient point focusing sub-network in step 2.7 comprises a salient point coordinate prediction module and a salient point feature extraction module, which are used to obtain the salient point features contained in the search image V'_k;
Step 2.7.1: The salient point coordinate prediction module maps B' onto F'_S to obtain the mapped coordinates B_F, and then extracts the region-level features F_P of size K × K corresponding to B_F from F'_S via the ROIAlign operation, where K denotes the width and height of F_P;
The salient point coordinate prediction module applies a convolutional layer to F_P for dimensionality reduction to obtain the reduced region-level features F'_P, flattens F'_P into a one-dimensional tensor, and then inputs it into a multi-layer perceptron to predict the coordinates Loc_sp of the L salient points of F_P, where C' denotes the number of channels of F'_P and L denotes the number of salient points;
Step 2.7.2: After restoring Loc_sp to a two-dimensional tensor, the salient point feature extraction module samples the salient point features corresponding to Loc_sp from F_P by bilinear interpolation;
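The sketch below illustrates steps 2.7.1-2.7.2 with torchvision's roi_align and bilinear grid sampling. The function and argument names, the folding of the channel-reduction convolution into point_mlp, and the assumption that the predicted coordinates lie in [0, 1] are all illustrative choices, not taken from the patent.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def sample_salient_features(F_S_enh, box, point_mlp, K=7, L=8):
    """F_S_enh: (1, C, H, W) enhanced search feature map F'_S; box: (1, 4) mapped
    coordinates B_F on that feature map; point_mlp: MLP predicting 2*L coordinates."""
    # ROIAlign extracts the K x K region-level feature F_P
    rois = torch.cat([torch.zeros(1, 1), box], dim=1)      # prepend batch index
    F_P = roi_align(F_S_enh, rois, output_size=(K, K))     # (1, C, K, K)

    # The MLP (with channel reduction folded in here) predicts L salient point
    # coordinates Loc_sp inside the K x K patch, assumed normalized to [0, 1]
    loc = point_mlp(F_P.flatten(1)).view(1, L, 2)

    # Bilinear sampling of the salient point features from F_P
    grid = loc.unsqueeze(2) * 2.0 - 1.0                    # (1, L, 1, 2) in [-1, 1]
    F_sp = F.grid_sample(F_P, grid, mode="bilinear", align_corners=False)
    return F_sp.squeeze(-1).transpose(1, 2)                # (1, L, C) salient point features
```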
The electronic device of the invention comprises a memory and a processor, wherein the memory is used for storing a program for supporting the processor to execute the infrared target tracking method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the infrared target tracking method.
Compared with the prior art, the invention has the beneficial effects that:
1. Most existing infrared target tracking techniques ignore temporal information, which makes it difficult for the model to recover when tracking fails. Therefore, on top of the conventional Siamese two-branch tracking framework (static template image and search image), the invention adds a dynamic template selection branch that introduces a dynamic template evolving over time, which is fed into the model together with the static template and the search image. In addition, the invention uses the encoder-decoder structure of the Transformer in the feature fusion stage to capture global dependencies of the spatio-temporal information, overcoming the limitation that common infrared tracking techniques can only model target feature information locally.
2. In order to further capture the change of the target's state over time, the invention introduces salient point information into the dynamic template selection branch. By explicitly searching for multiple salient points on the target image and aggregating the information of all salient points, the quality of the target image is evaluated, so that suitable candidates are selected for updating the template image; this improves the tracking performance of the infrared target tracking method when the target undergoes appearance changes, non-rigid deformation and similar conditions.
3. Existing target trackers that use a dynamic template selection module to introduce temporal information fail to provide an explicit criterion for assessing the quality of target images during training. They randomly assign labels to target images in the training phase (i.e., 1 for positive samples and 0 for negative samples), and an image is selected as a dynamic template when its label is 1. Such a fuzzy estimate of target image quality prevents the model from accurately assessing the current state of the target image at test time, so redundant temporal information without reference value is introduced into the model and the effect of the template updating module is weakened. To address this problem, the invention uses the IOU-Aware score between the prediction bounding box and the ground-truth box as the training target of the dynamic template selection module. This score defines the criterion for whether a target image can serve as a dynamic template of the tracker in terms of the localization accuracy of the corner prediction head; since the training target now has a clear evaluation standard, the model achieves a better tracking effect at test time.
Drawings
FIG. 1 is a flow chart of a network of the present invention;
FIG. 2 is a block diagram of a network of the present invention;
FIG. 3 is a block diagram of an IOU-Aware target state evaluation head according to the present invention.
Detailed Description
In this embodiment, a Transformer-based spatio-temporal information fusion infrared target tracking method, as shown in Fig. 1, comprises the following steps:
step one, preprocessing an infrared image;
Step 1.1: Arbitrarily select a video sequence V = {V_1, V_2, …, V_n, …, V_I} containing a specific infrared target Obj from the infrared target tracking dataset, where I denotes the total number of frames of the selected infrared video sequence, V_n denotes the n-th frame image of the video sequence, and n ∈ [1, I]. Crop and scale the i-th frame image V_i, the j-th frame image V_j and the k-th frame image V_k of the video sequence V to obtain, respectively, the preprocessed static template image V'_i, the preprocessed dynamic template image V'_j and the preprocessed search image V'_k, which serve as the input of the infrared target tracking network, where H_T, W_T are the height and width of V'_i, H_D, W_D are the height and width of V'_j, H_S, W_S are the height and width of V'_k, C' is the initial number of channels of each image, and i, j, k ∈ [1, I]. In this embodiment, the height and width of V'_i are H_T = W_T = 128, the height and width of V'_j are H_D = W_D = 128, the height and width of V'_k are H_S = W_S = 320, and the initial channel number of each image is C' = 3;
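A sketch of the crop-and-scale preprocessing is given below. The interpretation that the crop factor applies to the box area (per the tracking-flow description later in this embodiment: factor 2 with output 128 × 128 for templates, factor 5 with output 320 × 320 for the search image) and the use of OpenCV are assumptions.

```python
import cv2
import numpy as np

def crop_and_resize(frame, box, area_factor, out_size):
    """Crop a square region centred on the target box whose area is `area_factor`
    times the box area, then resize it to out_size x out_size."""
    x, y, w, h = box                              # target box: top-left corner, width, height
    cx, cy = x + w / 2.0, y + h / 2.0
    side = int(round(np.sqrt(area_factor * w * h)))
    x0, y0 = int(round(cx - side / 2.0)), int(round(cy - side / 2.0))
    patch = np.zeros((side, side, frame.shape[2]), dtype=frame.dtype)  # pad out-of-image area
    xs, ys = max(0, x0), max(0, y0)
    xe, ye = min(frame.shape[1], x0 + side), min(frame.shape[0], y0 + side)
    patch[ys - y0:ye - y0, xs - x0:xe - x0] = frame[ys:ye, xs:xe]
    return cv2.resize(patch, (out_size, out_size))

# template = crop_and_resize(frame0, init_box, area_factor=2, out_size=128)
# search   = crop_and_resize(frame_k, prev_box, area_factor=5, out_size=320)
```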
step two, constructing an infrared target tracking network, which comprises the following steps: an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network;
Step 2.1: The feature extraction sub-network is a ResNet50 network, which extracts features from the preprocessed static template image V'_i, dynamic template image V'_j and search image V'_k, respectively, to obtain the corresponding static template feature map, dynamic template feature map and search image feature map. In this embodiment, the downsampling factor of the feature extraction network is d = 16, and the number of channels of each downsampled feature map is C = 256;
Step 2.2: Flatten the three feature maps along the spatial dimension to obtain the corresponding static template feature sequence, dynamic template feature sequence and search image feature sequence, and concatenate them to obtain the hybrid feature sequence f_m;
Step 2.3: Add a sinusoidal positional encoding to the hybrid feature sequence f_m to obtain the hybrid feature sequence f_M containing the positional encoding;
Step 2.4: Construct the infrared image feature fusion sub-network, which comprises a Transformer-based encoder module, a Transformer-based decoder module and a codec post-processing module:
Step 2.4.1: The Transformer-based encoder module consists of R multi-head self-attention blocks; the hybrid feature sequence f_M containing the positional encoding is input into the encoder module to model global relationships in the spatial and temporal dimensions, yielding a discriminative spatio-temporal feature sequence f'_M, where R is the number of multi-head self-attention blocks in the encoder module. In this embodiment, R = 6;
Step 2.4.2: The Transformer-based decoder module consists of N multi-head self-attention blocks; the spatio-temporal feature sequence f'_M and a single target query are input into the decoder module for cross-attention processing, which outputs the enhanced target query, where N is the number of multi-head self-attention blocks in the decoder module. In this embodiment, N = 6;
Step 2.4.3: The codec post-processing module decouples the corresponding search-region feature sequence f'_S from the spatio-temporal feature sequence f'_M, computes the similarity score att between f'_S and the target query oq, multiplies att and f'_S element-wise to obtain the enhanced search-region feature sequence, and finally restores this sequence to the enhanced search feature map F'_S;
Step 2.5: The corner prediction head sub-network consists of two fully convolutional networks, each containing A stacked Conv-BN-ReLU layers and one Conv layer. It performs corner probability prediction for the prediction bounding box of the infrared target Obj contained in F'_S, so that the two fully convolutional networks output, respectively, the corner probability distribution map P_tl of the top-left corner and the corner probability distribution map P_br of the bottom-right corner of the prediction bounding box. In this embodiment, A = 4; the Conv layer in each Conv-BN-ReLU block has a kernel size of 3 × 3, a stride of 1 and a padding of 1, the BN layer uses momentum = 0.1, and the final separate Conv layer has a kernel size of 1 × 1 and a stride of 1.
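One corner branch of step 2.5 can be sketched as follows with the hyperparameters stated in this embodiment (A = 4 Conv-BN-ReLU blocks with 3 × 3 kernels, stride 1, padding 1, BN momentum 0.1, then a 1 × 1 Conv). The channel widths and the softmax normalization of the output map are assumptions.

```python
import torch
import torch.nn as nn

def make_corner_head(in_channels=256, mid_channels=256, num_blocks=4):
    """One of the two fully convolutional corner branches; the final 1x1 Conv
    produces a single-channel corner response map."""
    layers, c = [], in_channels
    for _ in range(num_blocks):
        layers += [nn.Conv2d(c, mid_channels, kernel_size=3, stride=1, padding=1),
                   nn.BatchNorm2d(mid_channels, momentum=0.1),
                   nn.ReLU(inplace=True)]
        c = mid_channels
    layers.append(nn.Conv2d(c, 1, kernel_size=1, stride=1))
    return nn.Sequential(*layers)

# top_left_head, bottom_right_head = make_corner_head(), make_corner_head()
# P_tl = torch.softmax(top_left_head(F_S_enh).flatten(1), dim=-1)   # normalized map (assumed)
```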
Step 2.6: calculating the upper left corner coordinates (x 'of the prediction boundary box using equation (1)' tl ,y′ tl ) And lower right angular position (x' br ,y′ br ) Thereby obtaining the search image V 'of the infrared target Obj' k In (c) a prediction bounding box B '= (x' tl ,y′ tl ,x′ br ,y′ br ) Wherein (x, y) represents the corner probability distribution map P tl ,P br Upper coordinates, and
step 2.7: a salient point focusing sub-network comprising: the salient point coordinate prediction module and the salient point feature extraction module are used for obtaining a search image V '' k The salient point features contained;
step 2.7.1: the salient point coordinate prediction module maps B 'to F' S After the mapping, the mapped coordinates B are obtained F Then from F 'by the ROIAlign operation' S Extraction of B F Corresponding region level featuresWherein K represents F P Is the width and height of (2);
the salient point coordinate prediction module performs the convolutional layer on F P After the dimension reduction operation is carried out, obtaining the region level characteristics after dimension reductionThen F 'is carried out' P Flattened as one-dimensional tensor->Then inputting the obtained product into a multi-layer perceptron to predict to obtain F P Predicted coordinates corresponding to L salient points +.>Wherein C 'represents F' P L represents the number of salient points. In this example, k=7, l=8, and the multi-layered perceptron is formed by connecting 4 linear layers, wherein the output channel of the first linear layer is 256, the output channel of the second linear layer is 512, the output channel of the third linear layer is 512, and the output channel of the fourth linear layer is 16;
step 2.7.2: will beRestoring to two-dimensional tensor->The salient point feature extraction module extracts the salient points from F by bilinear interpolation P Mid-sampling Loc' sp Corresponding salient Point feature->
Step 2.8: the IOU-Aware target state evaluation head subnetwork is composed of multiple layers of perceptron and F' S Comprising B F All salient point features insideInput into the IOU-Aware target state evaluation head subnetwork, and output the predicted IOU Score for B'.
The training target of the dynamic selection module of a typical spatio-temporal tracking model is a classification score (i.e., "1" for foreground and "0" for background). The invention instead proposes an IOU-Aware target state evaluation head composed of a multi-layer perceptron, whose structure is shown in Fig. 3. In this embodiment, the IOU-Aware target state evaluation head consists of 4 connected linear layers, in which the output channels of the first, second, third and fourth linear layers are 1024, 512, 256 and 1, respectively. Its input is the features of all salient points within the prediction bounding box of the target image, and its output follows the IOU-Aware design: the training target is replaced by the IOU score between the prediction bounding box and the ground-truth box rather than the usual classification score, so as to strengthen the connection between the classification and regression branches. With this re-chosen training target, the Score output by the IOU-Aware target state evaluation head aggregates the information of all salient points in the prediction bounding box and represents the IoU score of that box; it is therefore referred to as the IoU-Aware target state evaluation score. This score provides an IOU-Aware criterion for evaluating the current state of the target image. By integrating the salient point information of the target object into the IOU-Aware evaluation, a joint representation of the regression box itself and the most discriminative features contained in it is obtained. The criterion for whether a target image can serve as a dynamic template of the tracker is thus defined in terms of the localization accuracy of the corner prediction head: the more accurately the corner prediction head predicts the target bounding box, the more useful information the box contains for evaluating the quality of the target image, and the more accurate the evaluation result.
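A sketch of the evaluation head with the layer widths stated above (1024, 512, 256, 1) follows. The ReLU activations, the sigmoid on the output, and flattening the L = 8 salient point features (C = 256 channels each) into the input vector are assumptions.

```python
import torch
import torch.nn as nn

class IoUAwareHead(nn.Module):
    """IOU-Aware target state evaluation head (cf. Fig. 3): four linear layers
    with output widths 1024, 512, 256 and 1."""
    def __init__(self, in_dim=8 * 256):          # L = 8 salient points, C = 256 channels
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, F_sp):                      # F_sp: (B, L, C) salient point features
        score = self.mlp(F_sp.flatten(1))         # aggregate all salient point information
        return torch.sigmoid(score)               # IoU-Aware target state evaluation Score
```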
Step three, constructing a loss function of the infrared target tracking network;
Step 3.1: Construct the loss function L_bp of the corner prediction head sub-network using equation (2);
In equation (2), B = (x_tl, y_tl, x_br, y_br) denotes the four corner coordinates of the ground-truth box of the infrared target Obj; L1_loss denotes the loss over the distances between the four corners of the prediction bounding box and the ground-truth box, and is obtained from equation (3); GIOU_loss denotes the generalized intersection-over-union loss of the prediction bounding box and the ground-truth box, and is obtained from equation (4);
In equation (3), B'_t denotes the t-th corner coordinate of the prediction bounding box B', and B_t denotes the t-th corner coordinate of the ground-truth box B.
GIOU_loss = 1 - GIOU (4)
In equation (4), GIOU denotes the generalized intersection-over-union of B' and B, and is obtained from equation (5);
In equation (5), rec denotes the area of the smallest rectangular box enclosing B' and B, and is obtained from equation (6); IOU denotes the intersection-over-union of B' and B, and is obtained from equation (8);
rec = (x_4 - x_1)(y_4 - y_1) (6)
In equation (6), x_4, y_4 denote the maxima of the bottom-right corner coordinates of B' and B, and x_1, y_1 denote the minima of the top-left corner coordinates of B' and B, respectively, and are obtained from equation (7);
In equation (8), union denotes the union area of B' and B, and is obtained from equation (9);
union = S' + S - inter (9)
In equation (9), inter denotes the intersection area of B' and B, and is obtained from equation (10); S' denotes the area of B', S denotes the area of B, and both are obtained from equation (11);
inter = (x_3 - x_2)(y_3 - y_2) (10)
In equation (10), x_2, y_2 denote the maxima of the top-left corner coordinates of B' and B, and x_3, y_3 denote the minima of the bottom-right corner coordinates of B' and B, respectively, and are obtained from equation (12);
In equation (11), B'_w, B'_h denote the width and height of B', and B_w, B_h denote the width and height of B, respectively, and are obtained from equation (13);
Step 3.2, constructing the loss function L of the IOU-Aware target state evaluation head sub-network by utilizing the step (14) IATSE
L IATSE =-|IOU-Score| β ((1-IOU)log(1-Score)+IOU log(Score)) (14)
In equation (14), β is a real-domain super-parameter. In this example, β=2;
Step four, optimizing the infrared target tracking network by adopting a two-stage training method;
Step 4.1: In the first training stage, freeze the IOU-Aware target state evaluation head sub-network, train the remaining parts of the infrared target tracking network with a gradient descent algorithm, and update the network parameters by minimizing the loss function shown in equation (2); stop training when the number of training iterations reaches the set number, thereby obtaining the preliminarily trained infrared target tracking network;
Step 4.2: In the second training stage, freeze the preliminarily trained infrared image feature extraction sub-network, infrared image feature fusion sub-network and salient point focusing sub-network, train the preliminarily trained corner prediction head sub-network and the IOU-Aware target state evaluation head sub-network with a gradient descent algorithm, and update the network parameters by minimizing the loss function shown in equation (15); stop training when the number of training iterations reaches the set number, thereby obtaining the trained infrared target tracking model for continuous and accurate localization of the specific infrared target;
In equation (15), the loss-weighting coefficient is a real-valued hyperparameter; its value is fixed in this embodiment.
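The two-stage optimization of steps 4.1-4.2 can be sketched as follows. The optimizer choice, learning rate, model attribute names and the exact form of equation (15) (taken here as L_bp plus a weighted L_IATSE) are assumptions; equation (15) appears only as an image in the source.

```python
import torch

def train_two_stage(model, loader, iters_stage1, iters_stage2, lam=1.0):
    """model.iou_aware_head, model.backbone, model.fusion, model.salient_point_focus,
    model.corner_loss and model.iou_aware_loss are hypothetical attribute names."""
    # Stage 1: freeze the IOU-Aware head, train everything else with loss (2)
    for p in model.iou_aware_head.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
    for _, batch in zip(range(iters_stage1), loader):
        loss = model.corner_loss(batch)                           # equation (2)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze backbone, fusion and salient point sub-networks,
    # train the corner head and the IOU-Aware head with loss (15)
    for m in (model.backbone, model.fusion, model.salient_point_focus):
        for p in m.parameters():
            p.requires_grad = False
    for p in model.iou_aware_head.parameters():
        p.requires_grad = True
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
    for _, batch in zip(range(iters_stage2), loader):
        loss = model.corner_loss(batch) + lam * model.iou_aware_loss(batch)  # assumed form of (15)
        opt.zero_grad(); loss.backward(); opt.step()
```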
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
The invention establishes two criteria for the dynamic template updating mechanism: (1) an update threshold and (2) an update interval. Only when the update interval has been reached and the Score output by the IOU-Aware target state evaluation head reaches the update threshold is the current search image selected as the dynamic template for the subsequent tracking process; the dynamic template is thus updated continually during tracking. The overall tracking flow of the invention is shown in Fig. 1. Specifically, the first frame of the video sequence is selected as the fixed static template image; centred on the centre of its target box, a region whose area is 2 times the area of the target box is cropped and scaled to obtain a preprocessed static template image of size 128 × 128. All frames other than the first frame of the video sequence are search images; to preprocess a search image, the current search frame is cropped and scaled, centred on the centre of the target box predicted in the previous frame, over a region whose area is 5 times the area of that target box, yielding a preprocessed search image of size 320 × 320. The dynamic template is determined by the dynamic template selection module: if the current search image used for predicting the target box position satisfies the update conditions of the dynamic template selection module, then, centred on the centre of the predicted target box, a region whose area is 2 times the area of the target box is cropped and scaled from that search image to obtain a preprocessed dynamic template image of size 128 × 128. When the tracker predicts the target position in the next search frame, the preprocessed static template image, dynamic template image and search image are used together as the input of the tracking network. During testing, the two templates and the current search frame are fed into the network for feature extraction and fusion. The corner prediction head then outputs the predicted target bounding box of the current search frame, and salient points are searched for within the region enclosed by this bounding box. Finally, all salient point features extracted by bilinear interpolation are fed into the IOU-Aware target state evaluation head to obtain the state evaluation score of the current search frame. When this score meets the update threshold and the update interval has been reached, the current search frame is taken as the dynamic template for the subsequent tracking process.
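The update rule just described can be sketched as follows; the concrete threshold and interval values are assumptions, since the text leaves them open.

```python
class DynamicTemplateSelector:
    """Dynamic template update rule: adopt the current search frame as the new
    dynamic template only when the update interval has elapsed and the IoU-Aware
    state evaluation score exceeds the update threshold."""
    def __init__(self, update_threshold=0.8, update_interval=50):
        self.update_threshold = update_threshold
        self.update_interval = update_interval
        self.frames_since_update = 0

    def should_update(self, score):
        self.frames_since_update += 1
        if (self.frames_since_update >= self.update_interval
                and score >= self.update_threshold):
            self.frames_since_update = 0
            return True   # crop the current search frame into a new 128x128 dynamic template
        return False
```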
Table 1.1 comparison of ablation experimental results
Table 1.2 comparison of ablation experimental results
Table 2 comparison of results of different ir tracking algorithms on PTB-TIR datasets
Table 3 comparison of results of different ir tracking algorithms on LSOTB-TIR dataset
The Transformer-based spatio-temporal information fusion infrared tracking network structure is shown in Fig. 2. A Transformer-based encoder-decoder structure is used to capture the global dependencies between the elements of the dual-template feature sequences and the search feature sequence, and salient point information is used so that the dynamic template selection module focuses on the most discriminative features. The algorithm also introduces an IOU-Aware evaluation component that integrates the quality assessment of the dynamic template into the IOU prediction, providing a more reliable criterion for dynamic template quality evaluation. Table 1.1 compares the ablation results of the salient point focusing (SPF) component and the IOU-Aware (I-A) component of the invention. The experiments take the Stark-s algorithm from the RGB tracking field as the baseline model; adding the SPF component and the I-A component to the baseline separately shows the clear advantage of the invention in exploiting spatio-temporal information. Accuracy (Acc) is the accuracy metric, Robustness (Rob) is the robustness metric, and EAO is the expected average overlap. A larger Acc indicates a smaller difference between the centre distances of the ground-truth and predicted boxes, a larger Rob indicates fewer tracking losses, and a larger EAO indicates better average tracker performance. The results in Table 1.1 show that exploiting salient point information and the IOU-Aware evaluation component effectively improves the tracking performance of the network. The baseline model introduces the information of the whole target image into the dynamic template evaluation, whereas the invention restricts the search range of the salient points to the prediction bounding box and thus introduces only part of the information into the evaluation; Table 1.2 compares the search ranges of the dynamic template information. The table shows that the invention outperforms the baseline model on all evaluation metrics, which suggests that the quality estimation of the target image depends more on identifying key features than on assigning equal importance to all features.
Tables 2 and 3 compare the evaluation results of the invention with other infrared target tracking algorithms on the PTB-TIR and LSOTB-TIR infrared datasets. STFT (Ours) denotes the invention; ECO-deep, ECO-TIR and MCFTS are correlation filter trackers based on depth features; MDNet and VITAL are other deep trackers; SiamFC, SiamRPN++, SiamMask, SiamMSS, HSSNet, MLSSNet, MMNet, STMTrack, Stark-s and Stark-st are Siamese-network-based trackers. Success is the success rate metric, Precision is the precision metric, and Norm Precision is the normalized precision metric; a larger Success indicates a higher overlap between the predicted box and the ground-truth box, and larger Precision and Norm Precision indicate a smaller difference between the centre distances of the predicted and ground-truth boxes. The results in Tables 2 and 3 show that, under the current evaluation metrics, the overall performance of the invention is superior to the infrared tracking methods listed above.

Claims (4)

1. A Transformer-based spatio-temporal information fusion infrared target tracking method, characterized by comprising the following steps:
step one, preprocessing an infrared image;
Step 1.1: Arbitrarily select a video sequence V containing an infrared target Obj from an infrared target tracking dataset, and crop and scale the i-th frame image V_i, the j-th frame image V_j and the k-th frame image V_k of the video sequence V to obtain, respectively, the preprocessed static template image V'_i of size H_T × W_T × C', the preprocessed dynamic template image V'_j of size H_D × W_D × C' and the preprocessed search image V'_k of size H_S × W_S × C'. V'_i, V'_j and V'_k serve as the input of the infrared target tracking network, where H_T, W_T are the height and width of V'_i, H_D, W_D are the height and width of V'_j, H_S, W_S are the height and width of V'_k, and C' is the number of channels of each image;
step two, constructing an infrared target tracking network, which comprises the following steps: an infrared image feature extraction sub-network, an infrared image feature fusion sub-network, a corner prediction head sub-network, a salient point focusing sub-network and an IOU-Aware target state evaluation head sub-network;
Step 2.1: The feature extraction sub-network is a ResNet50 network, which extracts features from the preprocessed static template image V'_i, dynamic template image V'_j and search image V'_k, respectively, to obtain the static template feature map of size (H_T/d) × (W_T/d) × C, the dynamic template feature map of size (H_D/d) × (W_D/d) × C and the search image feature map of size (H_S/d) × (W_S/d) × C, where d is the downsampling factor of the feature extraction network and C is the number of channels of each downsampled feature map;
Step 2.2: Flatten the three feature maps along the spatial dimension to obtain the corresponding static template feature sequence, dynamic template feature sequence and search image feature sequence, and concatenate them to obtain the hybrid feature sequence f_m;
Step 2.3: Add a sinusoidal positional encoding to the hybrid feature sequence f_m to obtain the hybrid feature sequence f_M containing the positional encoding;
Step 2.4: Construct the infrared image feature fusion sub-network to process the hybrid feature sequence f_M and obtain the search feature map F'_S;
The infrared image feature fusion sub-network comprises a Transformer-based encoder module, a Transformer-based decoder module and a codec post-processing module, and the search feature map F'_S is obtained according to the following steps:
Step 2.4.1: the transducer-based encoder module consists of R multi-headed self-attention blocks and will contain a position-coded hybrid feature sequence f M Modeling global relationships in spatial and temporal dimensions in an input encoder module to obtain a discriminative spatio-temporal feature sequence f' M R is the number of multi-headed self-attention blocks in the encoder module;
step 2.4.2: the transducer-based decoder module consists of N multi-headed self-attention blocks and combines a spatio-temporal feature sequence f' M And a single target queryThe input decoder module carries out cross attention processing and outputs enhanced target inquiry +.>N is the number of multi-headed self-attention blocks in the decoder block;
step 2.4.3: the codec post-processing module generates a temporal-spatial feature sequence f' M Is decoupled from the corresponding search region feature sequenceAnd calculate f' S Similarity score with oq->And then similarity scores att and f' S After element-wise multiplication, an enhanced search region feature sequence is obtained>Finally f S Restoring to enhanced search feature map +.>
Step 2.5: the corner prediction head sub-network consists of two full convolution networks, each full convolution network comprises A stacked Conv-BN-ReLU layers and one Conv layer for F' S The prediction boundary frame of the infrared target Obj is included to conduct angular point probability prediction, so that the two full convolution networks respectively output angular point probability distribution diagrams of the upper left corner of the prediction boundary frameAnd the corner probability distribution map of the lower right corner +.>
Step 2.6: calculating the upper left corner coordinates (x 'of the prediction boundary box using equation (1)' tl ,y′ tl ) And lower right angular position (x' br ,y′ br ) Thereby obtaining the search image V 'of the infrared target Obj' k In (c) a prediction bounding box B '= (x' tl ,y′ tl ,x′ br ,y′ br ) Wherein (x, y) represents the corner probability distribution map P tl ,P br Upper coordinates, and
step 2.7: the salient point focusing sub-network is used for extracting salient point characteristics
Step 2.8: the IOU-Aware target state evaluation head sub-network consists of a plurality of layers of perceptron and F 'is adopted' S Comprising B F All salient point features insideInputting the prediction boundary box B 'into an IOU-Aware target state evaluation head subnetwork, and outputting an IOU Score of the prediction boundary box B'; l represents the number of salient points; b (B) F Representing the mapping of B 'to F' S After the mapping, the mapped coordinates are obtained;
The update conditions of the dynamic template are set as follows: the update interval of the IOU-Aware target state evaluation head sub-network is reached, and the IOU Score it outputs reaches the update threshold;
The first frame of the video sequence is taken as the fixed static template image; centred on the centre of the target box of the static template image, the region where the target box is located is cropped and scaled to obtain the preprocessed static template image. All frames other than the first frame of the video sequence are search target images; to preprocess a search target image, the current search target image is cropped and scaled, centred on the centre of the target box predicted from the previous frame, to obtain the preprocessed search image. If the current search image used for predicting the position of the target box satisfies the update conditions of the dynamic template, the current search image is selected as the dynamic template image for the subsequent tracking process: centred on the centre of the predicted target box, the region where the target box is located is cropped and scaled from the current search target image to obtain the preprocessed dynamic template image;
inputting the preprocessed static template image, the preprocessed dynamic template image and the preprocessed search image into an IOU-Aware target state evaluation head subnetwork to predict the IOU score of the prediction boundary box;
step three, constructing a loss function of the infrared target tracking network;
Step 3.1: Construct the loss function L_bp of the corner prediction head sub-network using equation (2);
In equation (2), λ_L1 and λ_GIOU are real-valued weighting hyperparameters, and B = (x_tl, y_tl, x_br, y_br) denotes the four corner coordinates of the ground-truth box of the infrared target Obj; L1_loss denotes the loss over the distances between the four corners of the prediction bounding box and the ground-truth box, and is obtained from equation (3); GIOU_loss denotes the generalized intersection-over-union loss of the prediction bounding box and the ground-truth box, and is obtained from equation (4);
In equation (3), B'_t denotes the t-th corner coordinate of the prediction bounding box B', and B_t denotes the t-th corner coordinate of the ground-truth box B;
GIOU_loss = 1 - GIOU (4)
In equation (4), GIOU denotes the generalized intersection-over-union of B' and B, and is obtained from equation (5);
In equation (5), rec denotes the area of the smallest rectangular box enclosing B' and B, and is obtained from equation (6); IOU denotes the intersection-over-union of B' and B, and is obtained from equation (8);
rec = (x_4 - x_1)(y_4 - y_1) (6)
In equation (6), x_4, y_4 denote the maxima of the bottom-right corner coordinates of B' and B, and x_1, y_1 denote the minima of the top-left corner coordinates of B' and B, respectively, and are obtained from equation (7);
In equation (8), union denotes the union area of B' and B, and is obtained from equation (9);
union = S' + S - inter (9)
In equation (9), inter denotes the intersection area of B' and B, and is obtained from equation (10); S' denotes the area of B', S denotes the area of B, and both are obtained from equation (11);
inter = (x_3 - x_2)(y_3 - y_2) (10)
In equation (10), x_2, y_2 denote the maxima of the top-left corner coordinates of B' and B, and x_3, y_3 denote the minima of the bottom-right corner coordinates of B' and B, respectively, and are obtained from equation (12);
In equation (11), B'_w, B'_h denote the width and height of B', and B_w, B_h denote the width and height of B, respectively, and are obtained from equation (13);
Step 3.2: Construct the loss function L_IATSE of the IOU-Aware target state evaluation head sub-network using equation (14):
L_IATSE = -|IOU - Score|^β · ((1 - IOU)·log(1 - Score) + IOU·log(Score)) (14)
In equation (14), β is a real-valued hyperparameter;
Step four, optimizing the infrared target tracking network by adopting a two-stage training method;
step 4.1: during the first stage training, freezing the IOU-Aware target state evaluation head subnetwork, training other networks except the IOU-Aware target state evaluation head subnetwork in the infrared target tracking network by using a gradient descent algorithm, updating network parameters by minimizing a loss function shown in the step (2), and stopping training when the training iteration number reaches the set number of times, thereby obtaining the infrared target tracking network after preliminary training;
step 4.2: during the second stage training, freezing the infrared image feature extraction sub-network after the preliminary training, the infrared image feature fusion sub-network after the preliminary training and the salient point focusing sub-network after the preliminary training, training the corner prediction head sub-network and the IOU-Aware target state evaluation head sub-network after the preliminary training by using a gradient descent algorithm, updating network parameters by minimizing a loss function shown in the (15), and stopping training when the training iteration number reaches the set number of times, thereby obtaining a trained infrared target tracking model for realizing continuous accurate positioning of an infrared target;
in the formula (15), the amino acid sequence of the compound,is a real-domain super-parameter.
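The following Python sketch outlines the two-stage optimization of steps 4.1 and 4.2, reusing the corner_prediction_loss and iatse_loss functions sketched earlier. The attribute names on the network (iatse_head, backbone, fusion_net, focus_net), the dictionary keys of the batches and predictions, the optimizer choice and learning rates, and the assumption that the stage-two loss of formula (15) combines the corner loss and the IATSE loss with a real-valued weight lam are all illustrative and not taken from the patent.

import itertools
import torch
from torchvision.ops import box_iou

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def train_stage1(net, loader, num_iters: int) -> None:
    # Step 4.1: freeze the IOU-Aware head, train everything else by minimizing formula (2).
    set_requires_grad(net.iatse_head, False)
    opt = torch.optim.AdamW([p for p in net.parameters() if p.requires_grad], lr=1e-4)
    for batch in itertools.islice(loader, num_iters):
        pred = net(batch["static_template"], batch["dynamic_template"], batch["search"])
        loss = corner_prediction_loss(pred["box"], batch["gt_box"])
        opt.zero_grad()
        loss.backward()
        opt.step()

def train_stage2(net, loader, num_iters: int, lam: float = 1.0) -> None:
    # Step 4.2: freeze the feature extraction, feature fusion and salient point focusing
    # subnetworks, train the corner head and the IOU-Aware head.  The combined loss below
    # is an assumed stand-in for formula (15).
    for m in (net.backbone, net.fusion_net, net.focus_net):
        set_requires_grad(m, False)
    set_requires_grad(net.iatse_head, True)
    opt = torch.optim.AdamW([p for p in net.parameters() if p.requires_grad], lr=1e-5)
    for batch in itertools.islice(loader, num_iters):
        pred = net(batch["static_template"], batch["dynamic_template"], batch["search"])
        # IOU of the predicted box with the ground-truth box, used as the soft target.
        iou_target = box_iou(pred["box"].detach(), batch["gt_box"]).diagonal()
        loss = corner_prediction_loss(pred["box"], batch["gt_box"]) \
               + lam * iatse_loss(pred["score"], iou_target)
        opt.zero_grad()
        loss.backward()
        opt.step()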
2. The Transformer-based space-time information fusion infrared target tracking method according to claim 1, wherein the salient point focusing sub-network in step 2.7 comprises a salient point coordinate prediction module and a salient point feature extraction module, which are used to obtain the salient point features contained in the search image V″_k;
Step 2.7.1: the salient point coordinate prediction module maps B′ onto F′_S to obtain the mapped coordinates B_F, and then extracts from F′_S, by the ROIAlign operation, the region-level features F_P corresponding to B_F, where K denotes the width and height of F_P;
the salient point coordinate prediction module performs a dimension-reduction operation on F_P through a convolution layer to obtain the reduced region-level features F′_P, then flattens F′_P into a one-dimensional tensor and inputs it into a multi-layer perceptron to predict the coordinates of the L salient points corresponding to F_P, where C′ denotes the number of channels of F′_P and L denotes the number of salient points;
Step 2.7.2: after the predicted coordinates are restored to a two-dimensional tensor Loc′_sp, the salient point feature extraction module samples from F_P, by bilinear interpolation, the salient point features corresponding to Loc′_sp; a code sketch of this sub-network is given below;
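The following PyTorch sketch illustrates the salient point focusing sub-network of steps 2.7.1-2.7.2 under stated assumptions: the channel widths, the ROIAlign resolution K, the number of salient points L, the depth of the multi-layer perceptron and the normalization of the predicted coordinates to [0, 1] are illustrative, and torchvision's roi_align together with torch.nn.functional.grid_sample stand in for the ROIAlign and bilinear-interpolation operations named in the claim.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class SalientPointFocus(nn.Module):
    def __init__(self, in_ch: int = 256, red_ch: int = 64, K: int = 7, L: int = 4):
        super().__init__()
        self.K, self.L = K, L
        self.reduce = nn.Conv2d(in_ch, red_ch, kernel_size=1)      # dimension reduction -> F'_P
        self.mlp = nn.Sequential(                                  # salient point coordinate prediction
            nn.Linear(red_ch * K * K, 256), nn.ReLU(),
            nn.Linear(256, 2 * L), nn.Sigmoid())                   # L (x, y) pairs normalized to [0, 1]

    def forward(self, feat_search: torch.Tensor, box_feat: torch.Tensor) -> torch.Tensor:
        # feat_search: search-region features F'_S of shape (B, C, H, W)
        # box_feat:    predicted box B' mapped to feature-map coordinates B_F, shape (B, 4)
        batch_idx = torch.arange(feat_search.size(0), dtype=feat_search.dtype,
                                 device=feat_search.device).unsqueeze(1)
        f_p = roi_align(feat_search, torch.cat([batch_idx, box_feat], dim=1),
                        output_size=self.K)                        # region-level features F_P
        loc = self.mlp(self.reduce(f_p).flatten(1)).view(-1, self.L, 2)  # predicted salient point coords
        grid = loc.unsqueeze(1) * 2.0 - 1.0                        # to the [-1, 1] grid expected by grid_sample
        feats = F.grid_sample(f_p, grid, mode="bilinear", align_corners=False)  # (B, C, 1, L)
        return feats.squeeze(2).permute(0, 2, 1)                   # one feature vector per salient point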
3. An electronic device, comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor in performing the infrared target tracking method of claim 1 or 2, and the processor is configured to execute the program stored in the memory.
4. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when run by a processor performs the steps of the infrared target tracking method of claim 1 or 2.
CN202310406030.1A 2023-04-11 2023-04-11 Transformer-based space-time information fusion infrared target tracking method Active CN116402858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310406030.1A CN116402858B (en) 2023-04-11 2023-04-11 Transformer-based space-time information fusion infrared target tracking method

Publications (2)

Publication Number Publication Date
CN116402858A (en) 2023-07-07
CN116402858B (en) 2023-11-21

Family

ID=87017716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310406030.1A Active CN116402858B (en) 2023-04-11 2023-04-11 Transformer-based space-time information fusion infrared target tracking method

Country Status (1)

Country Link
CN (1) CN116402858B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036417A (en) * 2023-09-12 2023-11-10 南京信息工程大学 Multi-scale transducer target tracking method based on space-time template updating
CN116912649B (en) * 2023-09-14 2023-11-28 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129341B (en) * 2021-04-20 2021-12-14 广东工业大学 Landing tracking control method and system based on light-weight twin network and unmanned aerial vehicle
US20230033548A1 (en) * 2021-07-26 2023-02-02 Manpreet Singh TAKKAR Systems and methods for performing computer vision task using a sequence of frames

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019137912A1 (en) * 2018-01-12 2019-07-18 Connaught Electronics Ltd. Computer vision pre-fusion and spatio-temporal tracking
CN114550040A (en) * 2022-02-18 2022-05-27 南京大学 End-to-end single target tracking method and device based on mixed attention mechanism
CN114638862A (en) * 2022-03-24 2022-06-17 清华大学深圳国际研究生院 Visual tracking method and tracking device
CN114862844A (en) * 2022-06-13 2022-08-05 合肥工业大学 Infrared small target detection method based on feature fusion
CN114972439A (en) * 2022-06-17 2022-08-30 贵州大学 Novel target tracking algorithm for unmanned aerial vehicle
CN115205337A (en) * 2022-07-28 2022-10-18 西安热工研究院有限公司 RGBT target tracking method based on modal difference compensation
CN115147459A (en) * 2022-07-31 2022-10-04 哈尔滨理工大学 Unmanned aerial vehicle target tracking method based on Swin transducer
CN115239765A (en) * 2022-08-02 2022-10-25 合肥工业大学 Infrared image target tracking system and method based on multi-scale deformable attention
CN115330837A (en) * 2022-08-18 2022-11-11 厦门理工学院 Robust target tracking method and system based on graph attention Transformer network
CN115482375A (en) * 2022-08-25 2022-12-16 南京信息技术研究院 Cross-mirror target tracking method based on time-space communication data driving
CN115690152A (en) * 2022-10-18 2023-02-03 南京航空航天大学 Target tracking method based on attention mechanism
CN115620206A (en) * 2022-11-04 2023-01-17 雷汝霖 Training method of multi-template visual target tracking network and target tracking method
CN115909110A (en) * 2022-12-16 2023-04-04 四川中科朗星光电科技有限公司 Lightweight infrared unmanned aerial vehicle target tracking method based on Simese network
CN115908500A (en) * 2022-12-30 2023-04-04 长沙理工大学 High-performance video tracking method and system based on 3D twin convolutional network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An IoU-aware Siamese network for real-time visual tracking; Bingbing Wei et al.; Neurocomputing; vol. 527; pp. 13-26 *
SwinTrack: A simple and strong baseline for transformer tracking; Lin L et al.; Advances in Neural Information Processing Systems; vol. 35; pp. 16743-16754 *
Transformer Tracking; Xin Chen et al.; CVPR 2021; pp. 8126-8135 *
Video target tracking algorithm based on FasterMDNet; Wang Ling et al.; Computer Engineering and Applications; no. 14; pp. 123-130 *
Particle filter infrared single-target tracking with multi-feature fusion; Cheng Wen et al.; Computer Knowledge and Technology; no. 14; pp. 178-180, 185 *

Also Published As

Publication number Publication date
CN116402858A (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN116402858B (en) Transformer-based space-time information fusion infrared target tracking method
Han et al. Active object detection with multistep action prediction using deep q-network
Nandhini et al. Detection of Crime Scene Objects using Deep Learning Techniques
CN111862145A (en) Target tracking method based on multi-scale pedestrian detection
Wang et al. Adaptive fusion CNN features for RGBT object tracking
CN116309725A (en) Multi-target tracking method based on multi-scale deformable attention mechanism
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
Fan et al. Complementary tracking via dual color clustering and spatio-temporal regularized correlation learning
CN116563337A (en) Target tracking method based on double-attention mechanism
CN114724185A (en) Light-weight multi-person posture tracking method
CN113554679A (en) Anchor-frame-free target tracking algorithm for computer vision application
CN115205336A (en) Feature fusion target perception tracking method based on multilayer perceptron
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
Cheng et al. Tiny object detection via regional cross self-attention network
Zhu et al. Srdd: a lightweight end-to-end object detection with transformer
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
Zhou et al. Retrieval and localization with observation constraints
CN112883928A (en) Multi-target tracking algorithm based on deep neural network
CN113011359A (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN117576149A (en) Single-target tracking method based on attention mechanism
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
Huang et al. A spatial–temporal contexts network for object tracking
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
Zhang et al. Promptvt: Prompting for efficient and accurate visual tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant