CN116563343A - RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought - Google Patents

RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Info

Publication number
CN116563343A
CN116563343A (application number CN202310575583.XA)
Authority
CN
China
Prior art keywords
rgb
features
tracking
network
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310575583.XA
Other languages
Chinese (zh)
Inventor
秦玉文
陈建明
豆嘉真
钟丽云
邸江磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202310575583.XA priority Critical patent/CN116563343A/en
Publication of CN116563343A publication Critical patent/CN116563343A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/246 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06N 3/0464 — Neural network architectures; convolutional networks [CNN, ConvNet]
    • G06N 3/084 — Learning methods; backpropagation, e.g. using gradient descent
    • G06N 3/098 — Learning methods; distributed learning, e.g. federated learning
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/764 — Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/765 — Classification using rules for classification or partitioning the feature space
    • G06V 10/766 — Image or video recognition or understanding using regression, e.g. by projecting features on hyperplanes
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06T 2207/10024 — Image acquisition modality: color image
    • G06T 2207/10048 — Image acquisition modality: infrared image
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • G06T 2207/20084 — Special algorithmic details: artificial neural networks [ANN]
    • Y02T 10/40 — Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of image processing and discloses an RGBT target tracking method based on a twin (Siamese) network structure and an anchor frame self-adaptive idea, which addresses the difficulty conventional RGBT target tracking methods have in achieving robust tracking under low visibility, poor illumination, and similar conditions. The model comprises a feature extraction network based on the twin network structure, a cross-modal information complementary fusion network, and a tracking prediction network based on the anchor frame self-adaptive idea. By exploiting the complementarity and consistency of visible-light and thermal infrared image information, the feature extraction network based on the twin network structure enhances the network's ability to represent the target; a cross-modal information complementary fusion scheme strengthens the robustness of the tracker in complex scenes; and the tracking prediction network based on the anchor frame self-adaptive idea gives the tracker greater flexibility. The method can track targets against complex backgrounds with high tracking accuracy and good efficiency.

Description

RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought
Technical Field
The invention belongs to the field of image processing, and particularly relates to an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea.
Background
The RGBT tracking task aims to exploit the complementary advantages of visible-light data and thermal infrared data to achieve visual target tracking in complex environments, i.e., to determine the location and size of a given target in various scenarios. As a fundamental and challenging task in the field of computer vision, target tracking technology has been widely applied in numerous practical fields such as intelligent security, traffic control, medical treatment and diagnosis, human-computer interaction, and modern military applications. Although significant progress has been made in related research and applications, most existing target trackers are built on single-modality data, and their robustness and reliability are limited in complex environments; for example, a tracker based only on visible-light data can hardly achieve robust tracking under low visibility or poor illumination. In recent years, a large number of RGBT tracking methods have been proposed to address these problems, but tracking drift still commonly occurs because the target feature information contained in the multi-modal data is not mined efficiently.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea, which can track targets against complex backgrounds with high tracking accuracy and good efficiency.
The technical scheme for solving the technical problems is as follows:
an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea comprises the following steps:
(S1), constructing a data set: screening data as required from publicly available RGB data sets and RGBT target tracking data sets to obtain the corresponding pre-training data set and training data set;
(S2), constructing a network: the network comprises a feature extraction network based on a twin network structure, a cross-modal information complementary fusion network, and a tracking prediction network based on an anchor frame self-adaptive idea;
(S3), pre-training the feature extraction network based on the twin network structure with the pre-training data set obtained in step (S1), training by gradient descent until the loss value essentially converges; then fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, training by stochastic gradient descent until the loss value essentially converges, so as to obtain a trained network;
(S4), acquiring the target templates to be tracked in the visible-light image and the thermal infrared image, calculating the search area of subsequent frames, and tracking the visible-light and thermal infrared video sequences with the trained network to obtain the tracking results.
Preferably, in step (S2), the construction of the feature extraction network based on the twin network structure, the cross-modal information complementary fusion network, and the tracking prediction network based on the anchor frame adaptive idea includes the following steps:
(S2-1), constructing the feature extraction network: the feature extraction network is based on a twin network structure, adopts a deep, multi-branch structure, and comprises a feature extraction part and a feature enhancement part; the feature extraction part consists of 4 modified ResNet-50 backbones, and the feature enhancement part contains two image enhancement modules based on an attention mechanism;
(S2-2), constructing the cross-modal information complementary fusion network: cross-modal feature fusion is realized through four 1×1 convolution modules; the fused results undergo a cross-correlation operation to obtain a response map for predicting the target tracking result;
(S2-3), constructing the tracking prediction network based on the anchor frame self-adaptive idea: the network comprises two tracking prediction heads with identical structure; each prediction head comprises 3 branches: a classification branch for predicting the category of each position in the response map, a regression branch for computing the target bounding box at that position, and a centrality branch for computing the centrality score of each position and suppressing outliers. A hedged sketch of such a prediction head is given below.
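As an illustrative sketch only (not the invention's exact implementation), the following PyTorch-style module shows one possible realization of a 3-branch anchor-free prediction head; the channel counts, tower depth and activation choices are assumptions.

```python
import torch
import torch.nn as nn

class AnchorFreePredictionHead(nn.Module):
    """Sketch of a 3-branch anchor-free tracking head (256-channel input is an assumption)."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.BatchNorm2d(in_channels),
                nn.ReLU(inplace=True),
            )
        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls = nn.Conv2d(in_channels, 2, 3, padding=1)   # classification score per position
        self.cen = nn.Conv2d(in_channels, 1, 3, padding=1)   # centrality (center-ness) score per position
        self.reg = nn.Conv2d(in_channels, 4, 3, padding=1)   # distances (l, t, r, b) to the box edges

    def forward(self, response_map: torch.Tensor):
        c = self.cls_tower(response_map)
        r = self.reg_tower(response_map)
        cls = self.cls(c)
        cen = torch.sigmoid(self.cen(c))
        ltrb = torch.exp(self.reg(r))  # keep the predicted distances positive
        return cls, cen, ltrb
```

Two such heads, one per modality, would then consume the RGB and thermal infrared response maps respectively.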
Preferably, in step (S3), the pre-training of the feature extraction network based on the twin network structure and the fine tuning of the tracking model, comprises the steps of:
(S3-1), the feature extraction network comprises 4 ResNet-50 backbones with the same structure; one feature extraction branch is pre-trained with the pre-training data set based on visible-light data, the input image size of the pre-training model is 127×127, the tracking model is optimized by stochastic gradient descent until the model converges, and the trained pre-training model is saved;
(S3-2), initializing part of the feature-extraction parameters of the feature extraction network with the pre-training model saved in step (S3-1), freezing the first two layers of all ResNet-50 backbones, fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, and training by stochastic gradient descent until the loss value essentially converges, thereby obtaining a trained network. A minimal sketch of this fine-tuning setup is given below.
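A minimal sketch of the freeze-and-fine-tune setup described above, assuming PyTorch and illustrative parameter-name prefixes and hyperparameter values (none of which are fixed by the text):

```python
import torch

def build_finetune_optimizer(model,
                             frozen_prefixes=("backbone.layer0", "backbone.layer1"),
                             lr=1e-3, momentum=0.9, weight_decay=1e-4):
    """Freeze the first two backbone stages and fine-tune the rest with SGD.

    The prefixes and learning-rate values here are assumptions for illustration.
    """
    for name, p in model.named_parameters():
        if name.startswith(frozen_prefixes):
            p.requires_grad_(False)          # frozen: first two layers of every ResNet-50
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=momentum, weight_decay=weight_decay)
```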
Preferably, in step (S4), the target template to be tracked in the visible light image and the thermal infrared image is acquired and tracked, including the following steps:
(S4-1), template acquisition: at the start of tracking, the target template is the target in the initial frame of the video sequence, and subsequent frames serve as candidate frames;
(S4-2), model input: image blocks cropped from the first frame and from the frame to be detected of the bimodal video sequences; the input sizes of the RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image are unified, with template images set to 127×127 pixels and candidate images set to 255×255 pixels;
(S4-3), passing the RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image obtained in steps (S4-1) and (S4-2) through the feature extraction part and the feature enhancement part of the feature extraction network to obtain, respectively, RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features, as well as RGB template enhancement features, RGB candidate enhancement features, thermal infrared template enhancement features and thermal infrared candidate enhancement features;
(S4-4), on the basis of step (S4-3), fusing, through the cross-modal information complementary fusion network, the RGB template features with the thermal infrared template enhancement features, the RGB candidate features with the thermal infrared candidate enhancement features, the thermal infrared template features with the RGB template enhancement features, and the thermal infrared candidate features with the RGB candidate enhancement features, to obtain, respectively, cross-modal-enhanced RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features;
(S4-5), pairing the newly generated RGB template features with the RGB candidate features and the thermal infrared template features with the thermal infrared candidate features, and performing a cross-correlation operation on each pair to obtain response maps for tracking prediction;
(S4-6), finally inputting the response maps into the tracking prediction network based on the anchor frame self-adaptive idea, and completing the position prediction by generating a 6D vector t = (cls, cen, l, t, r, b), wherein cls denotes the classification score, cen denotes the centrality score, and l+r and t+b denote the width and height of the predicted target in the current frame; a sketch of decoding this vector into a bounding box is given below.
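As a hedged sketch of how the 6D vector of step (S4-6) can be turned into an image-space box, the mapping below assumes a feature stride of 8 and a simple cls·cen ranking score; both are illustrative assumptions, not values given in the text.

```python
import numpy as np

def decode_prediction(cls, cen, l, t, r, b, i, j, stride=8):
    """Decode t = (cls, cen, l, t, r, b) predicted at response-map cell (i, j).

    The feature stride and the cls*cen ranking score are assumptions.
    """
    score = cls * cen                    # rank positions by classification weighted by centrality
    cx, cy = j * stride, i * stride      # assumed mapping of the cell centre into image coordinates
    x1, y1 = cx - l, cy - t
    x2, y2 = cx + r, cy + b
    w, h = l + r, t + b                  # width and height of the predicted target
    return score, np.array([x1, y1, x2, y2]), (w, h)
```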
Preferably, in step (S2-1), the feature extraction network is designed in combination with a feature pyramid structure, and the ResNet-50 network is modified as needed: the downsampling operations in the last 2 convolution blocks (Conv4 and Conv5) of the original ResNet-50 are removed to provide more detailed spatial information for the tracker's predictions, and the convolution kernels in Conv4 and Conv5 are replaced with dilated (hole) convolutions with dilation rates of 2 and 4 to enlarge the receptive field. Finally, the number of channels of the output feature maps of the last 3 convolution blocks of ResNet-50 is reduced to 256 by 1×1 convolution, and the features are aggregated along the channel dimension. A sketch of this backbone modification follows.
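One way to sketch this modification on top of torchvision's ResNet-50 is shown below; the use of `replace_stride_with_dilation` (which sets the last two stages to stride 1 with dilations 2 and 4) and the exact layer naming are assumptions, not the invention's code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ModifiedResNet50(nn.Module):
    """Sketch: stride removal + dilation in Conv4/Conv5, 1x1 reduction to 256 channels."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # reduce the outputs of the last 3 stages (512, 1024, 2048 channels) to 256 each
        self.reduce = nn.ModuleList([nn.Conv2d(c, 256, 1) for c in (512, 1024, 2048)])

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)   # dilation 2, no downsampling
        c5 = self.layer4(c4)   # dilation 4, no downsampling
        feats = [red(c) for red, c in zip(self.reduce, (c3, c4, c5))]
        return torch.cat(feats, dim=1)   # aggregate the three 256-channel maps along the channel dimension
```

Because the last two stages no longer downsample, the three output maps share the same spatial resolution and can be concatenated directly.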
Preferably, in step (S2-1), the feature enhancement part includes a template image feature enhancement module based on channel attention and a candidate image feature enhancement module based on channel-spatial attention.
Preferably, in step (S2-2), the depth-wise cross-correlation operation may be defined as:

M_rgb = X_n^rgb ★ Z_n^rgb (1)

M_t = X_n^t ★ Z_n^t (2)

wherein ★ denotes the channel-by-channel cross-correlation operation; Z_n^rgb, X_n^rgb, Z_n^t and X_n^t denote the newly generated RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features; M_rgb and M_t denote the generated visible-light response map and thermal infrared response map, respectively. A sketch of this operation is given below.
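A compact sketch of the channel-by-channel (depth-wise) cross-correlation of Eqs. (1)–(2), implemented with grouped convolution; treating the template feature as a per-channel kernel slid over the candidate feature is the standard realization, and the batching trick here is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel cross-correlation: each template channel is correlated
    with the corresponding search channel (Eqs. (1)-(2))."""
    b, c, h, w = search_feat.shape
    x = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

# M_rgb = depthwise_xcorr(X_rgb, Z_rgb);  M_t = depthwise_xcorr(X_t, Z_t)
```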
Preferably, in step (S3-1), the total loss function used to train the algorithm can be expressed as:

L_total = Σ_m∈{rgb,t} ( L_cls^m + λ_1·L_cen^m + λ_2·L_reg^m ) (3)

wherein λ_1 and λ_2 are hyperparameters used to balance the centrality loss and the regression loss; L_cls^m (m = rgb, t) denotes the cross-entropy loss functions for visible-light-modality and thermal-infrared-modality classification; L_reg^m denotes the regression loss used to predict the position of the prediction box; L_cen^m denotes the centrality loss. The regression loss L_reg^m can be defined as:

L_reg^m = (1 / Σ_(i,j) 1{g_(i,j) > 0}) · Σ_(i,j) 1{g_(i,j) > 0} · L_iou (4)

wherein L_iou is the intersection-over-union loss between the ground truth and the prediction box and can be calculated from g_(i,j); g_(i,j) denotes the distances from point (i, j) to the 4 edges of the ground-truth box. The centrality loss of the centrality branch L_cen^m can be expressed as:

L_cen^m = −(1/N)·Σ_(i,j) [ c*_(i,j)·log c_(i,j) + (1 − c*_(i,j))·log(1 − c_(i,j)) ] (5)

wherein c*_(i,j) denotes the centrality score of the location point (i, j). A hedged code sketch of these loss terms is given below.
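The sketch below follows the standard anchor-free formulation (IoU regression loss plus binary cross-entropy on a center-ness target) that the description outlines; the weighting values, the dictionary layout of `outputs`/`targets`, and the reductions are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_ltrb, gt_ltrb, eps=1e-6):
    """-log(IoU) between boxes expressed as (l, t, r, b) distances from the same point."""
    pl, pt, pr, pb = pred_ltrb.unbind(-1)
    gl, gt_, gr, gb = gt_ltrb.unbind(-1)
    inter = (torch.min(pl, gl) + torch.min(pr, gr)) * (torch.min(pt, gt_) + torch.min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt_ + gb) - inter
    return -torch.log(inter / union.clamp(min=eps) + eps)

def centerness_target(gt_ltrb):
    """Ground-truth centrality c*_(i,j) derived from the distances to the 4 box edges."""
    l, t, r, b = gt_ltrb.unbind(-1)
    return torch.sqrt((torch.min(l, r) / torch.max(l, r)) * (torch.min(t, b) / torch.max(t, b)))

def total_loss(outputs, targets, lambda1=1.0, lambda2=3.0):
    """Sketch of Eq. (3): per-modality classification + centrality + regression losses.
    `outputs`/`targets` are assumed to be dicts keyed by modality; lambda values are illustrative."""
    loss = 0.0
    for m in ("rgb", "t"):
        cls, cen, reg = outputs[m]["cls"], outputs[m]["cen"], outputs[m]["reg"]
        labels, gt_ltrb, pos = targets[m]["labels"], targets[m]["ltrb"], targets[m]["pos_mask"]
        l_cls = F.cross_entropy(cls, labels)
        l_reg = iou_loss(reg[pos], gt_ltrb[pos]).mean()
        l_cen = F.binary_cross_entropy(cen[pos], centerness_target(gt_ltrb[pos]))
        loss = loss + l_cls + lambda1 * l_cen + lambda2 * l_reg
    return loss
```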
Preferably, during the tracking of step (S4-6), a penalty term is applied to the classification score map to suppress large deformations of the predicted target in scale and aspect ratio (Eq. 6), where k denotes a hyperparameter, r denotes the aspect ratio and r' the aspect ratio of the previous frame, and s denotes the target scale and s' the target scale of the previous frame. At the same time, a cosine-window penalty H is used to suppress large displacements. The final classification score cls can be expressed as:

cls = (1 − α)·cls·penalty + α·H (7)

wherein α denotes a hyperparameter. After the above steps, the classification scores cls_rgb and cls_t of the two prediction heads are obtained, and in the last step the index of the peak position is obtained by the peak self-adaptive selection module:

P = argmax(cls_rgb, cls_t) (8)

wherein argmax denotes a position-index operation that compares the peak scores of the two input arrays and returns the index of the larger peak. A hedged sketch of this post-processing follows.
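The exact penalty expression of Eq. (6) is not reproduced in the text, so the sketch below assumes the scale/aspect-ratio change penalty commonly used in Siamese trackers; `k`, `alpha`, and the Hanning-window construction are illustrative assumptions.

```python
import numpy as np

def change(x):
    return np.maximum(x, 1.0 / x)

def apply_penalty_and_window(cls, w, h, w_prev, h_prev, k=0.04, alpha=0.4, hanning=None):
    """Penalise large scale / aspect-ratio changes relative to the previous frame,
    then mix in a cosine-window penalty (Eq. 7). The penalty form is assumed."""
    r, r_prev = w / h, w_prev / h_prev                     # aspect ratios (current / previous frame)
    s, s_prev = np.sqrt(w * h), np.sqrt(w_prev * h_prev)   # target scales (current / previous frame)
    penalty = np.exp(-k * (change(r / r_prev) * change(s / s_prev) - 1.0))
    if hanning is None:
        hanning = np.outer(np.hanning(cls.shape[0]), np.hanning(cls.shape[1]))
    return (1.0 - alpha) * cls * penalty + alpha * hanning

def select_peak(cls_rgb, cls_t):
    """Eq. (8): compare the peak scores of the two modalities and return the
    position index of the larger peak."""
    maps = [cls_rgb, cls_t]
    best = int(np.argmax([m.max() for m in maps]))
    return np.unravel_index(np.argmax(maps[best]), maps[best].shape)
```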
Preferably, the model weights with the minimum loss during the model training of step (S3) are selected, and the accurate position of the target in the current frame is output.
Compared with the prior art, the invention has the following beneficial effects:
1. In the RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive idea, the twin network structure is introduced into RGBT target tracking and used to construct a deep, multi-branch feature extraction network, which makes it convenient to fully mine the semantic information of the visible-light and thermal infrared images. Meanwhile, a template image enhancement module and a candidate image enhancement module are designed to strengthen the network's representation of shallow target information and improve tracking accuracy.
2. In the RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive idea, a cross-modal information complementary fusion scheme is designed in the tracking model, which enhances the robustness of the tracking model in complex scenes such as low visibility, interference from similar objects, or occlusion.
3. The RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive idea designs a tracking prediction network based on the anchor frame self-adaptive idea, which offers greater flexibility for targets with large scale changes or deformation, reduces the computational complexity of the tracker, and further improves the tracking efficiency of the tracking model.
Drawings
Fig. 1 is a flowchart of an RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation of the present invention.
Fig. 2 is a schematic diagram of an image enhancement module in a feature extraction network part of the RGBT target tracking method based on the twin network structure and anchor frame adaptive concept of the present invention, including a template image feature enhancement module and a candidate image feature enhancement module.
Fig. 3 is a schematic diagram of a tracking prediction network based on the anchor frame adaptive idea in the RGBT target tracking method based on the twin network structure and the anchor frame adaptive idea of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Referring to fig. 1, the RGBT target tracking method based on the twin network structure and anchor frame adaptive concept of the present invention includes the following steps:
(S1), constructing a data set: screening data as required from publicly available RGB data sets and RGBT target tracking data sets to obtain the corresponding pre-training data set and training data set;
(S2), constructing a network: the network comprises a feature extraction network based on a twin network structure, a cross-modal information complementary fusion network, and a tracking prediction network based on an anchor frame self-adaptive idea;
(S3), pre-training the feature extraction network based on the twin network structure with the pre-training data set obtained in step (S1), training by gradient descent until the loss value essentially converges; then fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, training by stochastic gradient descent until the loss value essentially converges, so as to obtain a trained network;
(S4), acquiring the target templates to be tracked in the visible-light image and the thermal infrared image, calculating the search area of subsequent frames, and tracking the visible-light and thermal infrared video sequences with the trained network to obtain the tracking results.
Referring to fig. 1, in step (S2), the construction of a feature extraction network based on a twin network structure, a fusion network complementary to cross-modal information, and a tracking prediction network based on an anchor frame adaptive idea includes the steps of:
(S2-1), constructing the feature extraction network: the feature extraction network is based on a twin network structure, adopts a deep, multi-branch structure, and comprises a feature extraction part and a feature enhancement part; the feature extraction part consists of 4 modified ResNet-50 backbones, and the feature enhancement part comprises two image enhancement modules based on an attention mechanism;
(S2-2), the cross-modal information complementary fusion network: cross-modal feature fusion is realized through four 1×1 convolution modules (a sketch of this fusion is given after this list); the fused results undergo a cross-correlation operation to obtain a response map for predicting the target tracking result;
(S2-3), the tracking prediction network based on the anchor frame self-adaptive idea: the network comprises two tracking prediction heads with identical structure; each prediction head comprises 3 branches: a classification branch for predicting the category of each position in the response map, a regression branch for computing the target bounding box at that position, and a centrality branch for computing the centrality score of each position and suppressing outliers.
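The fusion of step (S2-2) can be sketched as four 1×1 convolution modules, each consuming the concatenation of a feature from one modality and the enhancement feature of the other modality (concatenation followed by dimensionality reduction, as described later in this embodiment). The channel count and the module signature below are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalComplementaryFusion(nn.Module):
    """Sketch of the cross-modal complementary fusion via four 1x1 convolution modules."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse = nn.ModuleList([nn.Conv2d(2 * channels, channels, 1) for _ in range(4)])

    def forward(self, z_rgb, z_t, x_rgb, x_t, z_rgb_e, z_t_e, x_rgb_e, x_t_e):
        # pair each modality's feature with the other modality's enhancement feature
        pairs = [(z_rgb, z_t_e),   # RGB template  + thermal template enhancement
                 (x_rgb, x_t_e),   # RGB candidate + thermal candidate enhancement
                 (z_t, z_rgb_e),   # thermal template  + RGB template enhancement
                 (x_t, x_rgb_e)]   # thermal candidate + RGB candidate enhancement
        return [f(torch.cat(p, dim=1)) for f, p in zip(self.fuse, pairs)]
```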
Referring to fig. 1, in step (S3), the pre-training of the feature extraction network based on the twin network structure and the fine tuning of the tracking model, includes the steps of:
(S3-1), the feature extraction network comprises 4 ResNet-50 backbones with the same structure; one feature extraction branch is pre-trained with the pre-training data set based on visible-light data, the input image size of the pre-training model is 127×127, the tracking model is optimized by stochastic gradient descent until the model converges, and the trained pre-training model is saved;
(S3-2), initializing part of the feature-extraction parameters of the feature extraction network with the pre-training model saved in step (S3-1), freezing the first two layers of all ResNet-50 backbones, fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, and training by stochastic gradient descent until the loss value essentially converges, thereby obtaining a trained network.
Referring to fig. 1, in step (S4), a target template to be tracked in a visible light image and a thermal infrared image is acquired and tracked, including the steps of:
(S4-1), template acquisition: at the start of tracking, the target template is the target in the initial frame of the video sequence, and subsequent frames serve as candidate frames;
(S4-2), model input: image blocks cropped from the first frame and from the frame to be detected of the bimodal video sequences; the input sizes of the RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image are unified, with template images set to 127×127 pixels and candidate images set to 255×255 pixels;
(S4-3), passing the RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image obtained in steps (S4-1) and (S4-2) through the feature extraction part and the feature enhancement part of the feature extraction network to obtain, respectively, RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features, as well as RGB template enhancement features, RGB candidate enhancement features, thermal infrared template enhancement features and thermal infrared candidate enhancement features;
(S4-4), on the basis of step (S4-3), fusing, through the cross-modal information complementary fusion network, the RGB template features with the thermal infrared template enhancement features, the RGB candidate features with the thermal infrared candidate enhancement features, the thermal infrared template features with the RGB template enhancement features, and the thermal infrared candidate features with the RGB candidate enhancement features, to obtain, respectively, cross-modal-enhanced RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features;
(S4-5), pairing the newly generated RGB template features with the RGB candidate features and the thermal infrared template features with the thermal infrared candidate features, and performing a cross-correlation operation on each pair to obtain response maps for tracking prediction;
(S4-6), finally inputting the response maps into the tracking prediction network based on the anchor frame self-adaptive idea, and respectively completing the prediction of the target position in the two modalities by generating a 6D vector t = (cls, cen, l, t, r, b), wherein cls denotes the classification score, cen denotes the centrality score, and l+r and t+b denote the width and height of the predicted target in the current frame.
The twin-network-based ResNet-50 backbone only extracts the feature-map outputs of the last 3 convolution blocks, so the low-level target information obtained by the network, such as colour and texture, is not fully utilised, which makes it difficult for the algorithm to achieve robust tracking in scenes with occlusion, scale change, similar-object interference and the like.
To this end, the present embodiment designs a template image feature enhancement module based on the attention mechanism, see fig. 2. The template image feature enhancement module is designed on the basis of channel attention: the template feature maps from the RGB modality and the thermal infrared modality are concatenated along the channel dimension to obtain a joint feature; a global average pooling operation (Global Avg Pooling) and a convolution operation are then applied to the joint feature to reduce its dimensionality into a weight matrix. Finally, the weights are applied to the original features channel by channel through multiplication, completing the recalibration of the original features in the channel dimension. A sketch of this module follows.
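A minimal sketch of such a channel-attention template enhancement module; the channel counts and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class TemplateFeatureEnhancement(nn.Module):
    """Sketch: concatenate RGB and thermal template features, squeeze to channel weights,
    and recalibrate the joint feature channel by channel."""
    def __init__(self, channels_per_modality: int = 256):
        super().__init__()
        joint = 2 * channels_per_modality
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),              # Global Avg Pooling
            nn.Conv2d(joint, joint // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(joint // 4, joint, 1),
            nn.Sigmoid(),
        )

    def forward(self, z_rgb: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([z_rgb, z_t], dim=1)    # concatenate along the channel dimension
        weights = self.fc(joint)                  # per-channel weight matrix
        # channel-wise recalibration; the result can be split back per modality
        # with .chunk(2, dim=1) when separate enhanced template features are needed
        return joint * weights
```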
Referring to fig. 2, the present embodiment combines spatial and channel attention mechanisms to design the candidate image feature enhancement module. In this module, the candidate feature maps from the two modalities are first concatenated along the channel dimension to obtain a joint feature; the joint feature then passes through a channel attention module that generates a weight matrix, and channel-by-channel multiplication completes the modelling of the joint feature in the channel dimension; next, a spatial attention module completes the spatial modelling of the target features and generates an enhanced joint feature map; finally, the enhanced joint feature map is split along the channel dimension to generate the enhanced RGB candidate features and thermal infrared candidate features, respectively. A sketch of this module follows.
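A minimal sketch of the candidate enhancement module (channel attention followed by spatial attention, then a split back into the two modalities); the reduction ratio and 7×7 spatial kernel are assumptions.

```python
import torch
import torch.nn as nn

class CandidateFeatureEnhancement(nn.Module):
    """Sketch: channel attention, then spatial attention, on the joint RGB + thermal feature."""
    def __init__(self, channels_per_modality: int = 256):
        super().__init__()
        joint = 2 * channels_per_modality
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(joint, joint // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(joint // 4, joint, 1), nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x_rgb: torch.Tensor, x_t: torch.Tensor):
        joint = torch.cat([x_rgb, x_t], dim=1)
        joint = joint * self.channel_att(joint)                         # channel modelling
        avg = joint.mean(dim=1, keepdim=True)
        mx, _ = joint.max(dim=1, keepdim=True)
        joint = joint * self.spatial_att(torch.cat([avg, mx], dim=1))   # spatial modelling
        return joint.chunk(2, dim=1)   # enhanced RGB / thermal infrared candidate features
```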
Through the above arrangement, the ResNet-50 network and the feature pyramid structure are used to make full use of the deep, multi-scale features of the target, while the image enhancement modules constructed with the attention mechanism further strengthen the network's representation of the shallow features of the target, which improves the tracking robustness of the algorithm in complex scenes.
In addition, the total loss function for algorithm training can be expressed as:

L_total = Σ_m∈{rgb,t} ( L_cls^m + λ_1·L_cen^m + λ_2·L_reg^m ) (3)

wherein λ_1 and λ_2 are hyperparameters used to balance the centrality loss and the regression loss; L_cls^m (m = rgb, t) denotes the cross-entropy loss functions for visible-light-modality and thermal-infrared-modality classification; L_reg^m denotes the regression loss used to predict the position of the prediction box; L_cen^m denotes the centrality loss. The regression loss L_reg^m can be defined as:

L_reg^m = (1 / Σ_(i,j) 1{g_(i,j) > 0}) · Σ_(i,j) 1{g_(i,j) > 0} · L_iou (4)

wherein L_iou is the intersection-over-union loss between the ground truth and the prediction box and can be calculated from g_(i,j); g_(i,j) denotes the distances from point (i, j) to the 4 edges of the ground-truth box. The centrality loss of the centrality branch L_cen^m can be expressed as:

L_cen^m = −(1/N)·Σ_(i,j) [ c*_(i,j)·log c_(i,j) + (1 − c*_(i,j))·log(1 − c_(i,j)) ] (5)

wherein c*_(i,j) denotes the centrality score of the location point (i, j).
In addition, a penalty term is applied to the classification score map to suppress large deformations of the predicted target in scale and aspect ratio (Eq. 6), where k denotes a hyperparameter, r denotes the aspect ratio and r' the aspect ratio of the previous frame, and s denotes the target scale and s' the target scale of the previous frame. At the same time, a cosine-window penalty H is used to suppress large displacements. The final classification score cls can be expressed as:

cls = (1 − α)·cls·penalty + α·H (7)

wherein α denotes a hyperparameter. After the above steps, the classification scores cls_rgb and cls_t of the two prediction heads are obtained, and in the last step the index of the peak position is obtained by the peak self-adaptive selection module:

P = argmax(cls_rgb, cls_t) (8)

wherein argmax denotes a position-index operation that compares the peak scores of the two input arrays and returns the index of the larger peak.
Referring to fig. 3, the RGBT target tracking method of the present invention is described in the following specific case:
First, the input RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image pass through the feature extraction part and the feature enhancement part of the feature extraction network to obtain, respectively, the RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features, together with the RGB template enhancement features, RGB candidate enhancement features, thermal infrared template enhancement features and thermal infrared candidate enhancement features. Then, the features from the different modalities are concatenated along the channel dimension and reduced in dimensionality to realize cross-modal feature fusion, and the fused results undergo a cross-correlation operation to obtain the response maps used to predict the RGB-modality and thermal-infrared-modality target tracking results, respectively. Finally, the generated response maps of the two modalities are fed into the tracking prediction network based on the anchor frame self-adaptive idea to complete the bimodal target classification and position localisation, and the results produced by the two prediction heads are recalibrated by the designed peak self-adaptive selection module to generate the optimal tracking result.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof; various changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. An RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea is characterized by comprising the following steps:
(S1), constructing a data set: screening data as required from publicly available RGB data sets and RGBT target tracking data sets to obtain the corresponding pre-training data set and training data set;
(S2), constructing a network: the network comprises a feature extraction network based on a twin network structure, a cross-modal information complementary fusion network, and a tracking prediction network based on an anchor frame self-adaptive idea;
(S3), pre-training the feature extraction network based on the twin network structure with the pre-training data set obtained in step (S1), training by gradient descent until the loss value essentially converges; then fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, training by stochastic gradient descent until the loss value essentially converges, so as to obtain a trained network;
(S4), acquiring the target templates to be tracked in the visible-light image and the thermal infrared image, calculating the search area of subsequent frames, and tracking the visible-light and thermal infrared video sequences with the trained network to obtain the tracking results.
2. The RGBT target tracking method based on the idea of twin network structure and anchor frame adaptation according to claim 1, wherein in step (S2), the construction of the feature extraction network based on twin network structure, the fusion network with complementary cross-modal information, and the tracking prediction network based on the idea of anchor frame adaptation comprises the steps of:
(S2-1), constructing the feature extraction network: the feature extraction network is based on a twin network structure, adopts a deep, multi-branch structure, and comprises a feature extraction part and a feature enhancement part; the feature extraction part consists of 4 modified ResNet-50 backbones, and the feature enhancement part comprises two image enhancement modules based on an attention mechanism;
(S2-2), constructing the cross-modal information complementary fusion network: cross-modal feature fusion is realized through four 1×1 convolution modules; the fused results undergo a cross-correlation operation to obtain a response map for predicting the target tracking result;
(S2-3), constructing the tracking prediction network based on the anchor frame self-adaptive idea: the network comprises two tracking prediction heads with identical structure; each prediction head comprises 3 branches: a classification branch for predicting the category of each position in the response map, a regression branch for computing the target bounding box at that position, and a centrality branch for computing the centrality score of each position and suppressing outliers.
3. RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 1, characterized in that in step (S3) the pre-training of the feature extraction network based on twin network architecture and the fine tuning of the tracking model comprises the steps of:
(S3-1), the feature extraction network comprises 4 ResNet-50 backbones with the same structure; one feature extraction branch is pre-trained with the pre-training data set based on visible-light data, the input image size of the pre-training model is 127×127, the tracking model is optimized by stochastic gradient descent until the model converges, and the trained pre-training model is saved;
(S3-2), initializing part of the feature-extraction parameters of the feature extraction network with the pre-training model saved in step (S3-1), freezing the first two layers of all ResNet-50 backbones, fine-tuning the tracking model with the training data set obtained in step (S1) at a reduced learning rate, and training by stochastic gradient descent until the loss value essentially converges, thereby obtaining a trained network.
4. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 1, wherein in step (S4), a target template to be tracked in a visible light image and a thermal infrared image is acquired and tracked, comprising the steps of:
(S4-1), template acquisition: at the start of tracking, the target template is the target in the initial frame of the video sequence, and subsequent frames serve as candidate frames;
(S4-2), model input: image blocks cropped from the first frame and from the frame to be detected of the bimodal video sequences; the input sizes of the RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image are unified, with template images set to 127×127 pixels and candidate images set to 255×255 pixels;
(S4-3), passing the RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image obtained in steps (S4-1) and (S4-2) through the feature extraction part and the feature enhancement part of the feature extraction network to obtain, respectively, RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features, as well as RGB template enhancement features, RGB candidate enhancement features, thermal infrared template enhancement features and thermal infrared candidate enhancement features;
(S4-4), on the basis of step (S4-3), fusing, through the cross-modal information complementary fusion network, the RGB template features with the thermal infrared template enhancement features, the RGB candidate features with the thermal infrared candidate enhancement features, the thermal infrared template features with the RGB template enhancement features, and the thermal infrared candidate features with the RGB candidate enhancement features, to obtain, respectively, cross-modal-enhanced RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features;
(S4-5), pairing the newly generated RGB template features with the RGB candidate features and the thermal infrared template features with the thermal infrared candidate features, and performing a cross-correlation operation on each pair to obtain response maps for tracking prediction;
(S4-6), finally inputting the response maps into the tracking prediction network based on the anchor frame self-adaptive idea, and respectively completing the prediction of the target position in the two modalities by generating a 6D vector t = (cls, cen, l, t, r, b), wherein cls denotes the classification score, cen denotes the centrality score, and l+r and t+b denote the width and height of the predicted target in the current frame.
5. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 2, characterized in that in step (S2-1), the feature extraction network is designed in combination with a feature pyramid structure and the ResNet-50 network is modified as needed: the downsampling operations in the last 2 convolution blocks (Conv4 and Conv5) of the original ResNet-50 are removed to provide more detailed spatial information for the tracker's predictions, and the convolution kernels in Conv4 and Conv5 are replaced with dilated (hole) convolutions with dilation rates of 2 and 4 to enlarge the receptive field; finally, the number of channels of the output feature maps of the last 3 convolution blocks of ResNet-50 is reduced to 256 by 1×1 convolution, and the features are aggregated along the channel dimension.
6. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 2, wherein in step (S2-1), the feature enhancement part comprises a template image feature enhancement module based on channel attention and a candidate image feature enhancement module based on channel-spatial attention.
7. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 2, characterized in that in step (S2-2), the depth-wise cross-correlation operation is defined as:

M_rgb = X_n^rgb ★ Z_n^rgb (1)

M_t = X_n^t ★ Z_n^t (2)

wherein ★ denotes the channel-by-channel cross-correlation operation; Z_n^rgb, X_n^rgb, Z_n^t and X_n^t denote the newly generated RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features; M_rgb and M_t denote the generated visible-light response map and thermal infrared response map, respectively.
8. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation as claimed in claim 3, characterized in that in step (S3-1) the total loss function of the algorithm training can be expressed as:

L_total = Σ_m∈{rgb,t} ( L_cls^m + λ_1·L_cen^m + λ_2·L_reg^m ) (3)

wherein λ_1 and λ_2 are hyperparameters used to balance the centrality loss and the regression loss; L_cls^m (m = rgb, t) denotes the cross-entropy loss functions for visible-light and thermal infrared modality classification; L_reg^m (m = rgb, t) denotes the regression loss used to predict the position of the prediction box; L_cen^m (m = rgb, t) denotes the centrality loss; the regression loss L_reg^m (m = rgb, t) can be defined as:

L_reg^m = (1 / Σ_(i,j) 1{g_(i,j) > 0}) · Σ_(i,j) 1{g_(i,j) > 0} · L_iou (4)

wherein L_iou is the intersection-over-union loss between the ground truth and the prediction box and can be calculated from g_(i,j); g_(i,j) denotes the distances from point (i, j) to the 4 edges of the ground-truth box; the centrality loss of the centrality branch L_cen^m (m = rgb, t) can be expressed as:

L_cen^m = −(1/N)·Σ_(i,j) [ c*_(i,j)·log c_(i,j) + (1 − c*_(i,j))·log(1 − c_(i,j)) ] (5)

wherein c*_(i,j) denotes the centrality score of the location point (i, j).
9. The RGBT target tracking method based on the adaptive idea of twin network architecture and anchor frame of claim 4, wherein during the tracking of step (S4-6), a penalty term is applied to the classification score map to suppress large deformations of the predicted target in scale and aspect ratio (Eq. 6), where k denotes a hyperparameter, r denotes the aspect ratio and r' the aspect ratio of the previous frame, and s denotes the target scale and s' the target scale of the previous frame; at the same time, a cosine-window penalty H is used to suppress large displacements, and the final classification score cls can be expressed as:

cls = (1 − α)·cls·penalty + α·H (7)

wherein α denotes a hyperparameter; after the above steps, the classification scores cls_rgb and cls_t of the two prediction heads are obtained, and in the last step the index of the peak position is obtained by the peak self-adaptive selection module:

P = argmax(cls_rgb, cls_t) (8)

wherein argmax denotes a position-index operation that compares the peak scores of the two input arrays and returns the index of the larger peak.
10. The RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive idea according to claim 4, wherein the model weights with the minimum loss during the model training of step (S3) are selected, and the accurate position of the target in the current frame is output.
CN202310575583.XA 2023-05-22 2023-05-22 RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought Pending CN116563343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310575583.XA CN116563343A (en) 2023-05-22 2023-05-22 RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310575583.XA CN116563343A (en) 2023-05-22 2023-05-22 RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Publications (1)

Publication Number Publication Date
CN116563343A true CN116563343A (en) 2023-08-08

Family

ID=87491347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310575583.XA Pending CN116563343A (en) 2023-05-22 2023-05-22 RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Country Status (1)

Country Link
CN (1) CN116563343A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912649A (en) * 2023-09-14 2023-10-20 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance
CN116912649B (en) * 2023-09-14 2023-11-28 武汉大学 Infrared and visible light image fusion method and system based on relevant attention guidance

Similar Documents

Publication Publication Date Title
CN111797716B (en) Single target tracking method based on Siamese network
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN108090443B (en) Scene text detection method and system based on deep reinforcement learning
CN111696137B (en) Target tracking method based on multilayer feature mixing and attention mechanism
CN112347861B (en) Human body posture estimation method based on motion feature constraint
CN110544269A (en) twin network infrared target tracking method based on characteristic pyramid
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN111462191A (en) Non-local filter unsupervised optical flow estimation method based on deep learning
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN113807188A (en) Unmanned aerial vehicle target tracking method based on anchor frame matching and Simese network
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
CN116563343A (en) RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought
Tao et al. An adaptive frame selection network with enhanced dilated convolution for video smoke recognition
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN116912804A (en) Efficient anchor-frame-free 3-D target detection and tracking method and model
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN111882581A (en) Multi-target tracking method for depth feature association
Zhang et al. Spatial-information guided adaptive context-aware network for efficient RGB-D semantic segmentation
CN114495170A (en) Pedestrian re-identification method and system based on local self-attention inhibition
Weilharter et al. Atlas-mvsnet: Attention layers for feature extraction and cost volume regularization in multi-view stereo
CN117576753A (en) Micro-expression recognition method based on attention feature fusion of facial key points
CN111160354B (en) Ship image segmentation method based on joint image information under sea and sky background
CN117576149A (en) Single-target tracking method based on attention mechanism
Duan [Retracted] Deep Learning‐Based Multitarget Motion Shadow Rejection and Accurate Tracking for Sports Video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination