CN116563343A

CN116563343A - RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Info

Publication number: CN116563343A
Application number: CN202310575583.XA
Authority: CN
Inventors: 秦玉文; 陈建明; 豆嘉真; 钟丽云; 邸江磊
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2023-05-22
Filing date: 2023-05-22
Publication date: 2023-08-08

Abstract

The invention belongs to the field of image processing, and discloses an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea, which is used for solving the problem that the traditional RGBT target tracking method is difficult to realize robust tracking under the conditions of low visibility or poor illumination condition and the like. The model comprises a feature extraction network based on a twin network structure, a cross-modal information complementary fusion network and a tracking prediction network based on an anchor frame self-adaptive thought; the invention designs the feature extraction network based on the twin network structure by utilizing the complementarity and consistency of the visible light and thermal infrared image information, thereby enhancing the representation capability of the network to the target; meanwhile, a fusion scheme with complementary cross-modal information is designed, so that the robustness of the tracker of the tracking model in a complex scene is enhanced; tracking prediction networks based on anchor frame adaptation ideas allow for greater flexibility in the tracker. The method can realize tracking of the target with complex background, and has higher tracking precision and better efficiency.

Description

RGBT target tracking method based on twin network structure and anchor frame self-adaptive thought

Technical Field

The invention belongs to the field of image processing, and particularly relates to an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea.

Background

RGBT tracking tasks aim to exploit the complementary advantages of visible light data and thermal infrared data to achieve visual target tracking in complex environments, with the aim of determining the location and size of a given target in various scenarios. As a basic and challenging task in the field of computer vision, target tracking technology has been widely applied in numerous practical fields such as intelligent security, traffic control, medical treatment and diagnosis, human-computer interaction, and modern military. Although significant progress has been made in related research and applications, most of the existing object trackers are realized based on single-mode data, and the robustness and reliability of the existing object trackers are limited in complex environments, such as a target tracker based on visible light data is difficult to realize a tracking effect with strong robustness under the condition of low visibility or poor illumination condition. In recent years, a large number of RGBT tracking methods have been proposed to solve these problems, but tracking drift is generally caused by an inability to efficiently mine target feature information contained in multi-modal information.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea, wherein the RGBT target tracking method can track a target with a complex background, and has higher tracking precision and better efficiency.

The technical scheme for solving the technical problems is as follows:

an RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea comprises the following steps:

(S1), constructing a data set: screening data from the disclosed RGB data set and RGBT target tracking data set as required to obtain a corresponding pre-training data set and training data set;

(S2), constructing a network: the system comprises a feature extraction network based on a twin network structure, a cross-modal information complementary fusion network and a tracking prediction network based on an anchor frame self-adaptive idea;

(S3) pre-training the feature extraction network based on the twin network structure by the pre-training data set obtained in the step (S1), and training by adopting a gradient descent method until the loss value is basically converged; then, the tracking model is finely adjusted by utilizing the training data set obtained in the step S1, the learning rate is reduced for training, and a random gradient descent method is adopted for training until the loss value is basically converged, so that a trained network is obtained;

and S4, acquiring a target template to be tracked in the visible light image and the thermal infrared image, calculating a search area of a subsequent frame, and tracking the visible light and the infrared video sequence by using a trained network to obtain a tracking result.

Preferably, in step (S2), the construction of the feature extraction network based on the twin network structure, the cross-modal information complementary fusion network, and the tracking prediction network based on the anchor frame adaptive idea includes the following steps:

(S2-1), constructing a feature extraction network: the feature extraction network is a feature extraction network based on a twin network structure, adopts a deep and multi-branch structure and comprises two parts of feature extraction and feature enhancement; the feature extraction part consists of a part containing 4 pieces of modified ResNet-50, the feature enhancement part contains two image enhancement modules based on an attention mechanism,

(S2-2) constructing a cross-modal information complementary fusion network: the cross-modal feature fusion scheme is characterized in that cross-modal feature fusion is realized through 4 1 multiplied by 1 convolution modules; the fused results are subjected to cross-correlation operation to obtain a response chart for predicting a target tracking result;

(S2-3) constructing a tracking prediction network based on an anchor frame self-adaptive idea: the tracking prediction network based on the anchor frame self-adaptive thought comprises two tracking prediction heads with the same structure. Each detection head comprises 3 branches, which are respectively used for predicting classification branches of each position category in the response chart; a regression branch for calculating a target bounding box for the location; the method is used for calculating the centrality score of each position and eliminating the centrality branches of the abnormal values.

Preferably, in step (S3), the pre-training of the feature extraction network based on the twin network structure and the fine tuning of the tracking model, comprises the steps of:

(S3-1) the feature extraction network comprises 4 ResNet-50 with the same structure, a pre-training data set based on visible light data is used for pre-training one feature extraction network, the input image size of a pre-training model is 127 multiplied by 127, a random gradient descent method is adopted for optimizing a tracking model until the model converges, and a trained pre-training model is stored;

initializing partial parameters of feature extraction in a feature extraction network by using the pre-training model saved in the step (S3-2), freezing the first two layers of all ResNet-50, trimming the tracking model by using the training data set obtained in the step (S1), reducing the learning rate, training, and training by adopting a random gradient descent method until the loss value is basically converged; and obtaining a trained network.

Preferably, in step (S4), the target template to be tracked in the visible light image and the thermal infrared image is acquired and tracked, including the following steps:

(S4-1) the template acquisition of the target is that the template of the target is the target in the initial frame of the video sequence in the initial stage of tracking, and the subsequent frame is a candidate frame;

(S4-2) inputting a model, namely, an image block cut out of a first frame and a frame to be detected of the two-mode video sequence, wherein the input sizes of an RGB template image, an RGB candidate image, a thermal infrared template image and a thermal infrared candidate image are unified, and are respectively set to 127 multiplied by 127 pixels and 255 multiplied by 255 pixels;

(S4-3) passing the RGB template image, the RGB candidate image, the thermal infrared template image, and the thermal infrared candidate image obtained in the steps (S4-1) and (S4-2) through a feature extraction part and a feature enhancement part of a feature extraction network, respectively, to obtain RGB template features, RGB candidate features, thermal infrared template features, and thermal infrared candidate features, RGB template enhancement features, RGB candidate enhancement features, thermal infrared template enhancement features, and thermal infrared candidate enhancement features, respectively;

(S4-4) respectively fusing RGB template features and thermal infrared template enhancement features, fusing RGB candidate features and thermal infrared candidate enhancement features, fusing thermal infrared template features and RGB template enhancement features and fusing thermal infrared candidate features and RGB candidate enhancement features through a fusion network with complementary cross-modal information on the basis of the step (S4-3) to respectively obtain RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features with complementary cross-modal information enhancement;

(S4-5) combining the newly generated RGB template features, RGB candidate features, thermal infrared template features, and thermal infrared candidate features; performing mutual correlation operation on the two parts to obtain a response chart for tracking prediction;

(S4-6), finally inputting the response diagram into a tracking prediction network based on the anchor frame self-adaptive thought, and completing the prediction of the position by generating a 6D vector t= (cls, cen, l, t, r, b), wherein cls represents a classification score, cen represents a centrality score, and l+r and t+b represent the width and height of the target prediction in the current frame.

Preferably, in step (S2-1), we design the feature extraction network in combination with the feature pyramid structure, and make necessary modifications to the ResNet-50 network, remove the downsampling operations in the last 2 convolutions (Conv 4 and Conv 5) in the original ResNet-50 to provide more detailed spatial detail for the tracker' S predictions, and replace the convolution kernels in Conv4 and Conv5 with hole convolutions with hole rates of 2 and 4 to improve the receptive field range. Finally, the number of channels of the output feature map of the last 3 convolution modules of ResNet-50 is reduced to 256 by 1X 1 convolution, and the features are aggregated by channel dimension.

Preferably, in step (S2-1), the feature enhancement section includes a template image feature enhancement module based on channel attention and a candidate image feature enhancement module based on channel-space attention.

Preferably, in step (S2-2), the deep cross-correlation operation may be defined as:

M _rgb ＝X ⁿ _rgb ★Z ⁿ _rgb (1)

M _t ＝X ⁿ _t ★X ⁿ _t (2)

wherein ∈ represents a channel-by-channel cross-correlation operation;newly generated RGB template features, RGB candidate features, thermal infrared template features and thermal infrared candidate features; m is M _rgb 、M _t The generated visible light response map and thermal infrared response map are shown, respectively.

Preferably, in step (S3-1), the total loss function trained by the algorithm can be expressed as:

wherein ,λ₁ 、λ ₂ Is a super parameter and is used for balancing the centrality loss and the regression loss;respectively representing cross entropy loss functions for visible light mode and thermal infrared mode classification; />Representing regression loss for predicting the position of a prediction block；/>Representing loss of centrality; regression lossCan be defined as:

wherein L_iou Is the intersection ratio between the true value and the prediction frame and can be obtained by g _(i，j) And (5) calculating.Representing the distance of point (i, j) to 4 edges of the real value. Loss of centrality of the centrality branch->Can be expressed as:

wherein ,the centrality score representing the location point (i, j).

Preferably, during the tracking of step (S4-6), we suppress the predicted target large deformation in the classification score graph at scale and scale penalties:

where k represents the super parameter, r represents the aspect ratio, r 'represents the aspect ratio of the last frame, s represents the target scale, s' represents the target scale of the last frame. At the same time, the cosine window penalty H is used to suppress large displacements. The final generation of cls can be expressed as:

cls＝(1-α)cls*penalty+αH (7)

wherein α represents a hyper-parameter. Obtaining cls of two detection heads through the steps _rgb and cls_t The last step is to obtain the index of the peak position by the peak adaptive selection module:

P＝argmax(cls _rgb ,cls _t ) (8)

wherein argmax represents a position index operation that compares the peak scores in two input arrays and returns a larger peak.

Preferably, the model weight with the minimum loss in the model training process in the step (S3) is selected, and the accurate position of the target in the current frame is output.

Compared with the prior art, the invention has the following beneficial effects:

1. according to the RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive thought, the twin network structure is introduced into RGBT target tracking, and a deep and multi-branch feature extraction network is constructed through the twin network structure, so that semantic information of a visible light image and a thermal infrared image can be conveniently fully mined. Meanwhile, a template image enhancement module and a candidate image enhancement module are designed to enhance the characterization capability of the network on target shallow information and improve tracking precision.

2. According to the RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive thought, a cross-modal information complementary fusion scheme is designed in the tracking model, and the robustness of the tracking model in complex scenes such as low visibility, interference of similar objects or shielding is enhanced.

3. The RGBT target tracking method based on the twin network structure and the anchor frame self-adaptive idea designs the tracking prediction network based on the anchor frame self-adaptive idea, has stronger flexibility for targets with large scale change or deformation, reduces the calculation complexity of a tracker, and further enhances the tracking efficiency of a tracking model.

Drawings

Fig. 1 is a flowchart of an RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation of the present invention.

Fig. 2 is a schematic diagram of an image enhancement module in a feature extraction network part of the RGBT target tracking method based on the twin network structure and anchor frame adaptive concept of the present invention, including a template image feature enhancement module and a candidate image feature enhancement module.

Fig. 3 is a schematic diagram of a tracking prediction network based on the anchor frame adaptive idea in the RGBT target tracking method based on the twin network structure and the anchor frame adaptive idea of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.

Referring to fig. 1, the RGBT target tracking method based on the twin network structure and anchor frame adaptive concept of the present invention includes the following steps:

Referring to fig. 1, in step (S2), the construction of a feature extraction network based on a twin network structure, a fusion network complementary to cross-modal information, and a tracking prediction network based on an anchor frame adaptive idea includes the steps of:

(S2-1), constructing a feature extraction network: the feature extraction network is a feature extraction network based on a twin network structure, adopts a deep and multi-branch structure and comprises two parts of feature extraction and feature enhancement; the feature extraction part consists of 4 modified ResNet-50, and the feature enhancement part comprises two image enhancement modules based on an attention mechanism;

(S2-2), a converged network of cross-modal information complementarity: the cross-modal feature fusion scheme is characterized in that cross-modal feature fusion is realized through 4 1 multiplied by 1 convolution modules; the fused results are subjected to cross-correlation operation to obtain a response chart for predicting a target tracking result;

(S2-3), a tracking prediction network based on an anchor frame adaptive idea: the tracking prediction network based on the anchor frame self-adaptive thought comprises two tracking prediction heads with the same structure. Each detection head comprises 3 branches, which are respectively used for predicting classification branches of each position category in the response chart; a regression branch for calculating a target bounding box for the location; the method is used for calculating the centrality score of each position and eliminating the centrality branches of the abnormal values.

Referring to fig. 1, in step (S3), the pre-training of the feature extraction network based on the twin network structure and the fine tuning of the tracking model, includes the steps of:

(S3-2) initializing partial parameters of feature extraction in a feature extraction network by using the pre-training model stored in the step S2-1, freezing the first two layers of all ResNet-50, fine-tuning a tracking model by using the training data set obtained in the step S1, reducing the learning rate for training, and training by adopting a random gradient descent method until the loss value is basically converged; and obtaining a trained network.

Referring to fig. 1, in step (S4), a target template to be tracked in a visible light image and a thermal infrared image is acquired and tracked, including the steps of:

(S4-6), finally inputting the response diagram into a tracking prediction network based on an anchor frame self-adaptive thought, and respectively completing the prediction of the target position in two modes by generating a 6D vector t= (cls, cen, l, t, r, b), wherein cls represents a classification score, cen represents a centrality score, and l+r and t+b represent the width and height of the target prediction in the current frame.

Only the characteristic diagram output of the last 3 convolution blocks is extracted for the backbone structure of the ResNet-50 based on the twin network, and the characteristics of the target bottom layer information, such as color, texture and the like, obtained by the network are not fully utilized, so that the robustness tracking effect is difficult to realize under the scenes of coping with shielding, scale change, similar interference and the like of the algorithm.

For this purpose, the present embodiment designs a template image feature enhancement module based on the attention mechanism, see fig. 2. The method comprises the steps that a template image characteristic enhancement module is designed based on channel attention, and the template characteristic images from RGB mode and thermal infrared mode are spliced according to channel dimension to obtain joint characteristics; and then carrying out global average pooling operation (Global Avg Pooling) and convolution operation on the combined features to reduce the vitamins into a weight matrix. Finally, the weights are weighted to the previous features channel by channel through multiplication, and the recalibration of the original features in the channel dimension is completed.

Referring to fig. 2, the present embodiment combines the spatial-channel attention mechanisms to design candidate image feature enhancement modules. In the module, firstly, the candidate feature graphs from two modes are spliced according to the channel dimension to obtain a combined feature, then the combined feature is firstly generated into a weight matrix through a channel attention module, and modeling of the combined feature in the channel dimension is completed through channel-by-channel multiplication; then, through a spatial attention module, completing spatial modeling of target features, and generating an enhanced joint feature map; and finally, dividing the enhanced joint feature map according to the channel dimension to respectively generate enhanced RGB candidate features and thermal infrared candidate features.

Through the arrangement, the ResNet-50 network and the feature pyramid structure are utilized to fully utilize the deep and multi-scale features of the target, and meanwhile, the image enhancement module is constructed by combining the attention mechanism to further strengthen the representation of the network on the shallow features of the target, so that the tracking robustness of the algorithm in a complex scene can be improved.

In addition, the total loss function for algorithm training can be expressed as:

wherein ,λ₁ 、λ ₂ Is a super parameter and is used for balancing the centrality loss and the regression loss;respectively representing cross entropy loss functions for visible light mode and thermal infrared mode classification; />Representing regression loss for predicting the position of the prediction frame; />Representing loss of centrality; regression lossCan be defined as:

wherein ,the centrality score representing the location point (i, j).

In addition, we penalize the scale and scale in the classification score graph to suppress the predicted target large deformation:

cls＝(1-α)cls*penalty+αH (7)

P＝argmax(cls _rgb ,cls _t ) (8)

Referring to fig. 3, the RGBT target tracking method of the present invention is described in the following specific case:

firstly, respectively obtaining RGB template characteristics, RGB candidate characteristics, thermal infrared template characteristics and thermal infrared candidate characteristics, RGB template enhancement characteristics, RGB candidate enhancement characteristics, thermal infrared template enhancement characteristics and thermal infrared candidate enhancement characteristics through a characteristic extraction part and a characteristic enhancement part of a characteristic extraction network by an input RGB template image, RGB candidate image, thermal infrared template image and thermal infrared candidate image; then, the features from different modes are spliced and reduced in dimension according to the number of channels, so that the cross-mode features are fused, and the fused results are subjected to cross-correlation operation to obtain response diagrams for predicting RGB mode and thermal infrared mode target tracking results respectively; and finally, respectively inputting the generated response graphs of the two modes into a tracking prediction network based on an anchor frame self-adaptive thought to finish the prediction and the position positioning of the bimodal target class, and carrying out recalibration on the results generated by the two pre-measuring heads through a designed peak self-adaptive selection module to generate an optimal tracking scheme.

The foregoing is illustrative of the present invention and is not to be construed as limiting thereof, but rather as various changes, modifications, substitutions, combinations, and simplifications which may be made therein without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. An RGBT target tracking method based on a twin network structure and an anchor frame self-adaptive idea is characterized by comprising the following steps:

2. The RGBT target tracking method based on the idea of twin network structure and anchor frame adaptation according to claim 1, wherein in step (S2), the construction of the feature extraction network based on twin network structure, the fusion network with complementary cross-modal information, and the tracking prediction network based on the idea of anchor frame adaptation comprises the steps of:

3. RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 1, characterized in that in step (S3) the pre-training of the feature extraction network based on twin network architecture and the fine tuning of the tracking model comprises the steps of:

4. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 1, wherein in step (S4), a target template to be tracked in a visible light image and a thermal infrared image is acquired and tracked, comprising the steps of:

5. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 2, characterized in that in step (S2-1), we design the feature extraction network in combination with the feature pyramid architecture and make necessary improvements to the ResNet-50 network, eliminating the downsampling operations in the last 2 convolved blocks (Conv 4 and Conv 5) in the original ResNet-50 to provide more detailed spatial detail for the tracker' S predictions, and using the hole convolutions with hole rates of 2 and 4 instead of the convolution kernels in Conv4 and Conv5 to improve the receptive field range. Finally, the number of channels of the output feature map of the last 3 convolution modules of ResNet-50 is reduced to 256 by 1X 1 convolution, and the features are aggregated by channel dimension.

6. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 2, wherein in step (S2-1), the feature enhancement part comprises a template image feature enhancement module based on channel attention and a candidate image feature enhancement module based on channel-space attention.

7. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation according to claim 2, characterized in that in step (S2-2), the deep cross correlation operation is defined as:

M _rgb ＝X ⁿ _rgb ★Z ⁿ _rgb (1)

M _t ＝X ⁿ _t ★X ⁿ _t (2)

8. The RGBT target tracking method based on the idea of twin network architecture and anchor frame adaptation as claimed in claim 3, characterized in that in step (S3-1) the total loss function of the algorithm training can be expressed as:

wherein ,λ₁ 、λ ₂ Is a super parameter and is used for balancing the centrality loss and the regression loss;(m=rgb, t) represents cross entropy loss functions for visible and thermal infrared mode classification, respectively; />(m=rgb, t) represents regression loss for predicting the position of the prediction block; />(m=rgb, t) represents a loss of centrality; regression loss->(m=rgb, t) can be defined as:

wherein L_iou Is the intersection ratio between the true value and the prediction frame and can be obtained by g _(i，j) And (5) calculating.Representing the distance of point (i, j) to 4 edges of the real value. Loss of centrality of the centrality branch->(m=rgb, t) can be expressed as:

wherein ,the centrality score representing the location point (i, j).

9. The RGBT target tracking method based on the adaptive idea of twin network architecture and anchor frame of claim 4, wherein during tracking in step (S4-6), we suppress the predicted target large deformation in the classification score graph at scale and scale penalties:

cls＝(1-α)cls*penalty+αH (7)

P＝argmax(cls _rgb ,cls _t ) (8)

10. The RGBT target tracking method based on the twinning network structure and the anchor frame adaptive idea according to claim 4, wherein the model weight of the minimum loss in the model training process of the step (S3) is selected, and the accurate position of the target in the current frame is output.