CN112418203B - Robustness RGB-T tracking method based on bilinear convergence four-stream network - Google Patents

Robustness RGB-T tracking method based on bilinear convergence four-stream network

Info

Publication number
CN112418203B
CN112418203B (application number CN202011251625.7A)
Authority
CN
China
Prior art keywords
bilinear
embedding
template
pair
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011251625.7A
Other languages
Chinese (zh)
Other versions
CN112418203A (en)
Inventor
梅峻熙
康彬
颜俊
吴晓欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011251625.7A
Publication of CN112418203A
Application granted
Publication of CN112418203B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/143Sensing or illuminating at different wavelengths

Abstract

The invention provides an RGB-T target tracking method based on a bilinear-fusion graph convolutional network, which comprises the following steps. Step S1: divide the feature embeddings into template embedding pairs and candidate embedding pairs, where a template embedding pair consists of the first-frame regions of the visible-light image and the infrared image. Step S2: crop images of the same size as the template embedding pair from the candidate embedding pair, and extract features through a convolutional neural network to form four multi-channel feature maps. Step S3: train the feature maps with a graph convolutional neural network to obtain the final feature maps. Step S4: perform a bilinear fusion operation on the final feature maps to obtain a similarity score. Step S5: repeat steps S2-S4, splice the scores obtained each time into a matrix, and locate the target at the position of the maximum score, thereby achieving tracking of the whole sequence. The invention overcomes the problem that the intrinsic element interactions between feature maps are not fully mined.

Description

Robust RGB-T tracking method based on a bilinear-fusion four-stream network
Technical Field
The invention relates to an image target tracking method, in particular to an RGB-T tracking method, and belongs to the technical field of visual tracking.
Background
With the rapid development of the Internet of Things, thermal infrared cameras have become economical and practical products and have been successfully applied in advanced driver-assistance systems and intelligent vehicle/road systems. Such a camera captures the thermal infrared radiation emitted by any object whose temperature is above absolute zero, and is therefore well suited to night-time monitoring. Combining an RGB camera with a thermal infrared camera thus has two advantages: 1) the thermal infrared camera is robust to illumination changes and can provide supplementary data for the visible spectrum captured under low-light conditions; 2) the grayscale features of the RGB camera help to resolve the thermal crossover problem in monitoring based on thermal infrared cameras. Therefore, using RGB features together with thermal infrared features in RGB-T tracking can effectively address challenges such as inclement weather.
In RGB-T tracking, the RGB and thermal video sequences are acquired in pairs (see Fig. 1, where the car is heavily occluded yet clearly distinguishable from the background in the thermal infrared image). To solve the multi-modal fusion problem and exploit the complementarity of RGB and thermal information, the state-of-the-art methods can be briefly divided into three categories. The first is RGB-T trackers based on particle fusion. The second builds a multi-graph fusion model to effectively explore the spatial relationships between RGB and thermal target blocks. The third relies on sparse representations for multi-modal fusion. All of the above methods use handcrafted features. Compared with handcrafted features, deep convolutional features can extract translation-invariant, deep semantic information about the target and are highly robust. The Siamese network, with its simple structure and fast tracking speed, is a research hotspot in visual tracking based on RGB cameras. In RGB tracking based on the Siamese network, Bertinetto et al. first designed the Siamese network structure, in which the current tracking result is obtained by sequentially computing the similarity between the template image and each candidate image in the search area; cross-correlation is typically employed as the similarity measure. Follow-up studies that improve on Bertinetto's work can be briefly divided into three areas: 1) attention-based Siamese networks, which effectively use back-propagated gradients and channel attention mechanisms to focus the target appearance on informative sub-regions; 2) local-pattern-based Siamese networks, which can explore the spatial relationships between different target blocks; 3) RPN-based Siamese networks, which introduce a region proposal network into the Siamese network and thereby avoid the time-consuming multi-scale estimation step. None of the above work extends easily to RGB-T tracking, for the following reason: existing RGB trackers explore the relationships between different target blocks in the Siamese network and introduce attention mechanisms, but all of this work operates in a single image domain (the RGB domain).
Disclosure of Invention
The invention aims to provide a robust RGB-T tracking method based on a bilinear-fusion four-stream network, which overcomes the drawbacks that tracking is carried out only in a single image domain (the RGB domain) and that the inherent part-feature interactions present in multi-source embedding pairs are not exploited, so that the intrinsic element interactions between feature maps cannot be fully mined.
The purpose of the invention is achieved as follows: a robust RGB-T tracking method based on a bilinear-fusion four-stream network comprises the following steps:
step S1: dividing the feature embeddings into template embedding pairs and candidate embedding pairs, where each embedding pair consists of two streams, so that together they form a four-stream convolutional neural network structure, and the template embedding pair consists of the ground-truth region of the first frame of the visible-light image and of the infrared image;
step S2: cropping images of the same size as the template embedding pair from the candidate embedding pair, and extracting features from them, together with the template embedding pair, through a convolutional neural network to form four multi-channel feature maps;
step S3: training the feature maps obtained in S2 with a graph convolutional neural network to obtain the final feature maps;
step S4: performing bilinear fusion on the final feature maps from step S3, obtaining two bilinear vectors through a two-layer fully connected network, and taking the inner product of the two bilinear vectors to finally obtain a similarity score;
step S5: repeating steps S2-S4, splicing the scores obtained each time into a similarity score matrix, and locating the target at the position of the maximum score, thereby achieving tracking over the whole sequence.
As a further technical solution of the present invention, in step S2 the selected convolutional neural network is a VGG-16 network; to make the extracted features more robust, features from different layers of VGG-16 are selected, so that the positional information of the lower layers is combined with the semantic information of the higher layers, and finally four multi-channel feature maps fusing information from multiple layers are output.
As a further limitation of the present invention, in step S3 the nodes of the graph convolutional neural network are constructed from the multi-channel feature maps of S2 according to the spatial arrangement of the feature-map pixels, and every two adjacent nodes are connected to form an edge of the graph; the graph structure can be expressed as $\Phi_1 = (\nu, \varepsilon)$, where $\nu$ denotes the node set of the graph and $\varepsilon$ its edge set. After passing through the two-layer graph convolutional network, feature maps with stronger expressive power are generated; a minimal sketch of this graph construction and two-layer graph convolution is given below.
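A minimal, illustrative sketch of this graph construction and two-layer graph convolution is shown below in PyTorch. The 4-neighbour grid connectivity follows the description above, but the normalised propagation rule (Kipf-Welling style) and the layer widths are assumptions the patent does not specify, and all function and module names are hypothetical.

```python
import torch
import torch.nn as nn


def grid_adjacency(h: int, w: int) -> torch.Tensor:
    """Normalised adjacency of an h x w grid graph: nodes are feature-map pixels,
    edges connect 4-neighbours, self-loops are added before normalisation."""
    n = h * w
    a = torch.zeros(n, n)
    for y in range(h):
        for x in range(w):
            i = y * w + x
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    a[i, ny * w + nx] = 1.0
    a = a + torch.eye(n)                                  # self-loops
    d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt                    # D^-1/2 (A + I) D^-1/2


class TwoLayerGCN(nn.Module):
    """Two graph-convolution layers applied to the node features of one stream."""

    def __init__(self, in_ch: int, hid_ch: int, out_ch: int):
        super().__init__()
        self.w1 = nn.Linear(in_ch, hid_ch, bias=False)
        self.w2 = nn.Linear(hid_ch, out_ch, bias=False)

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # x: (N, C) node features, a_hat: (N, N) normalised adjacency
        x = torch.relu(a_hat @ self.w1(x))
        return a_hat @ self.w2(x)


# Example: one 14 x 14 feature map with 2048 channels (sizes from the embodiment below).
a_hat = grid_adjacency(14, 14)
nodes = torch.randn(14 * 14, 2048)
refined = TwoLayerGCN(2048, 512, 256)(nodes, a_hat)       # (196, 256) refined node features
```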
As a further improvement of the present invention, in step S4 a bilinear fusion method is adopted, in which the outer product is used to explore the pairwise correlations between feature channels. Specifically, bilinear fusion is performed on the final feature maps of the first two streams and, separately, on the final feature maps of the last two streams in S3; each fusion takes two feature maps of sizes $A \in \mathbb{R}^{M\times K\times C}$ and $B \in \mathbb{R}^{M\times K\times C}$, which are first reshaped into matrices $A \in \mathbb{R}^{MK\times C}$ and $B \in \mathbb{R}^{MK\times C}$. Taking the outer product at each position of the two tensors and combining all the products, the resulting bilinear vector can be expressed as

$$u = \operatorname{vec}\!\left(A^{\mathrm T}B\right), \qquad u_{(j-1)\cdot C+i} = a_i^{\mathrm T}\, b_j,$$

where $u_{(j-1)\cdot C+i}$ denotes the $((j-1)\cdot C+i)$-th element of the bilinear vector $u \in \mathbb{R}^{C^2}$, $a_i \in \mathbb{R}^{MK}$ is the one-dimensional vector obtained by reshaping the feature map of the $i$-th channel of $A$, $b_j \in \mathbb{R}^{MK}$ is the corresponding vector for the $j$-th channel of $B$, $i$ and $j$ denote the $i$-th row and the $j$-th column of the bilinear matrix, respectively, and $C$ is the total number of channels of the feature map. Because the bilinear vector $u$ is high-dimensional, reducing its dimensionality lowers the parameter count of the whole model, which reduces memory consumption and increases the tracking speed; the dimension-reduced bilinear vectors are obtained with a two-layer fully connected network, and the inner product of the two dimension-reduced bilinear vectors finally gives a similarity score. An illustrative sketch of this fusion and scoring is given below.
As a further improvement of the present invention, in step S5 regions of the same size as the template embedding pair are sequentially cropped from the candidate embedding pair, from left to right and from top to bottom; steps S2-S4 are then repeated, and the score values are spliced, in the same order, into a similarity score map denoted $Q(Z,X)$, whose final expression is

$$Q(Z,X) = \big[\, q_1,\; q_2,\; \dots,\; q_k \,\big],$$

with the $k$ scores arranged row by row into a matrix, where $k$ is the number of crops of the same size as the template embedding pair taken from the candidate embedding pair, i.e. the total number of similarity scores obtained, $Z$ and $X_m$ denote the template embedding pair and the $m$-th cropped candidate embedding pair, respectively, and each element $q_m$ of the matrix is the similarity score obtained for $X_m$ in the corresponding step.
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
1. The invention makes full use of the characteristics of the infrared image, which can provide supplementary data for the visible image under low-light conditions; RGB-T tracking can therefore effectively cope with challenges such as severe weather and occlusion;
2. The bilinear-fusion four-stream graph convolutional network structure makes full use of the inherent part-feature interactions present in multi-source embedding pairs, and the internal element interactions between different feature maps can be fully exploited, so that the learned features are more robust and the tracking accuracy is improved;
3. Instead of using cross-correlation to evaluate the similarity between the template and the candidate samples, the method trains the feature embedding pairs and the bilinear-fusion graph convolutional network end to end with a logistic loss based on the inner product, so that the true score between the cropped image and the target template can be evaluated accurately; the tracking effect is better and the generalization ability is stronger.
Drawings
Fig. 1 is a diagram of challenging scenes from the existing RGBT234 data set.
FIG. 2 is an overall flow diagram of a method of an embodiment of the present invention.
FIG. 3 shows the overall tracking performance on the GTOT data set when practicing the present invention, where (a) is the precision plot and (b) the success plot; the distance precision score and the AUC score are given in the legends of the two plots, respectively.
FIG. 4 shows the overall tracking performance on the RGBT234 data set when practicing the present invention, where (a) is the precision plot and (b) the success plot.
Fig. 5 shows qualitative results of the present invention on six video pairs, where (a) is the Diamond video pair, (b) the Elecbike3 video pair, (c) the Fog video pair, (d) the Kite4 video pair, (e) the manasterrain video pair, and (f) the rightthreepeope video pair.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the accompanying drawings.
As shown in Fig. 2, the present embodiment provides a robust RGB-T tracking method based on a bilinear-fusion four-stream network, which comprises the following steps:
step S1: the feature embeddings are divided into template embedding pairs and candidate embedding pairs, each embedding pair consisting of two streams, so that together they form a four-stream convolutional neural network structure; the template embedding pair consists of the ground-truth region of the first frame of the visible-light image and of the infrared image. The template embedding pair $Z_1$ and $Z_2$ has a size of 112 × 112, and the candidate-region embedding pair $X_1$ and $X_2$ has a size of 224 × 224. A VGG-16 network is selected as the convolutional neural network; to make the extracted features more robust, the feature maps of the 9th, 10th, 12th and 13th layers of VGG-16 are selected, so that the positional information of the lower layers is combined with the semantic information of the higher layers. All feature maps are resized to 14 × 14, and each of these layers contributes 512 feature maps; the four layer feature maps are concatenated, and four feature maps fusing information from multiple levels are finally output, each with 2048 channels. These features are then used as the input of the graph convolutional network; a sketch of this shared feature extraction is given below;
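As a concrete illustration, the sketch below (PyTorch/torchvision) shows one way to implement this shared multi-layer VGG-16 feature extraction. Mapping the 9th, 10th, 12th and 13th layers onto conv4_2, conv4_3, conv5_2 and conv5_3 of torchvision's VGG-16, the bilinear resizing, and all names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision


class MultiLayerVGG(torch.nn.Module):
    """Shared backbone of the four streams: tap several VGG-16 conv layers,
    resize each tapped map to 14 x 14 and concatenate along the channel axis."""

    # torchvision indices of conv4_2, conv4_3, conv5_2 and conv5_3 in vgg16().features
    TAP_INDICES = (19, 21, 26, 28)

    def __init__(self):
        super().__init__()
        # ImageNet-pretrained weights would normally be loaded here.
        self.features = torchvision.models.vgg16().features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        taps = []
        for idx, layer in enumerate(self.features):
            x = layer(x)
            if idx in self.TAP_INDICES:
                taps.append(F.interpolate(x, size=(14, 14), mode="bilinear",
                                          align_corners=False))
        return torch.cat(taps, dim=1)            # (B, 4 * 512 = 2048, 14, 14)


backbone = MultiLayerVGG()
z_rgb = torch.randn(1, 3, 112, 112)              # template Z1 (RGB); Z2, X1, X2 analogous
print(backbone(z_rgb).shape)                     # torch.Size([1, 2048, 14, 14])
```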
step S2: images of the same size as the template embedding pair are cropped from the candidate embedding pair, and features are extracted from them, together with the template embedding pair, through the convolutional neural network to form four multi-channel feature maps;
step S3: the feature maps obtained in S2 are trained with a graph convolutional neural network to obtain the final feature maps; bilinear fusion is then performed on the final feature maps of the first two streams and, separately, on those of the last two streams, each yielding a high-dimensional bilinear vector u. A two-layer fully connected network is then applied, with 1024 neurons in the first hidden layer and 256 neurons in the second, so that the reduced bilinear vectors have dimension 256; the inner product of the reduced bilinear vectors finally gives a similarity score, which represents the similarity between the template embedding pair and the cropped image region. In the training phase the ADAM optimization algorithm is adopted with a learning rate of 0.01, and the model is trained for 50 epochs with a batch size of 64. During training, FS-Siamese is first trained with videos from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC2015) data set, and the first 5 frames of the thermal video sequences in RGBT234 are then used for fine-tuning (a sketch of this training setup is given after step S4 below);
step S4: bilinear fusion is carried out on the final feature maps from step S3. Specifically, the two feature maps have sizes $A \in \mathbb{R}^{6\times 6\times 256}$ and $B \in \mathbb{R}^{6\times 6\times 256}$; $A$ and $B$ are reshaped into matrices $A \in \mathbb{R}^{36\times 256}$ and $B \in \mathbb{R}^{36\times 256}$, and taking the outer product at each position of the two tensors and combining all the products gives a bilinear vector $u = \operatorname{vec}\!\left(A^{\mathrm T}B\right) \in \mathbb{R}^{65536\times 1}$. Two final bilinear vectors $u_1, u_2 \in \mathbb{R}^{1024\times 1}$ are obtained through the two-layer fully connected network, and the inner product of these two bilinear vectors finally gives the similarity score $\mathrm{Score} = u_1 \cdot u_2$;
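The snippet below sketches the training setup described in step S3 (ADAM with learning rate 0.01, 50 epochs, batch size 64) together with the inner-product logistic loss mentioned in the advantages section. The `model` and `loader` arguments stand in for whatever implements steps S1-S4 and the ILSVRC2015/RGBT234 sampling; they, like all names here, are assumptions rather than the patent's own code.

```python
import torch


def logistic_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Inner-product logistic loss: labels are +1 for target crops and -1 otherwise."""
    return torch.log1p(torch.exp(-labels * scores)).mean()


def train(model: torch.nn.Module, loader, epochs: int = 50) -> None:
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(epochs):
        for z_pair, x_pair, labels in loader:   # template pair, candidate crops, +/-1 labels
            scores = model(z_pair, x_pair)      # similarity scores from steps S2-S4
            loss = logistic_loss(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```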
step S5: in the manner of a sliding window, regions of the same size as the template embedding pair are cropped from the candidate embedding pair in order, from left to right and from top to bottom, with a stride of 8; steps S2-S4 are then repeated, and the score values are spliced, in the same order, into a similarity score map $Q(Z,X) \in \mathbb{R}^{17\times 17}$. Since the size of the candidate image is 224 × 224, $Q(Z,X)$ is upsampled to the same size as the candidate image by interpolation, and the position of the highest score is then taken as the centre position of the target object, thereby tracking the object; a sketch of this scanning and localization step is given below.
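The sketch below illustrates this scanning and localization step. The crop size (112), stride (8) and candidate size (224) follow the embodiment; with these numbers the scan yields a 15 × 15 grid, so reproducing the 17 × 17 map stated above would additionally require padding the candidate region, which is omitted here. `score_fn` stands in for steps S2-S4, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def score_map(z_pair, x_pair, score_fn, crop=112, stride=8):
    """Scan the candidate embedding pair left-to-right, top-to-bottom and collect
    the similarity score of every crop into the map Q(Z, X)."""
    side = x_pair[0].shape[-1]                  # candidate side length, e.g. 224
    out = (side - crop) // stride + 1           # number of crops per axis
    q = torch.zeros(out, out)
    for row in range(out):
        for col in range(out):
            y0, x0 = row * stride, col * stride
            crops = tuple(x[..., y0:y0 + crop, x0:x0 + crop] for x in x_pair)
            q[row, col] = score_fn(z_pair, crops)   # steps S2-S4 for this crop
    return q


def locate(q: torch.Tensor, candidate_size: int = 224):
    """Upsample Q(Z, X) to the candidate size; the peak gives the target centre."""
    up = F.interpolate(q[None, None], size=(candidate_size, candidate_size),
                       mode="bilinear", align_corners=False)[0, 0]
    idx = int(torch.argmax(up))
    return idx // candidate_size, idx % candidate_size   # (row, col) of the maximum
```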
To test the effectiveness of the network structure, comprehensive experiments were performed on two widely used RGB-T data sets, the GTOT data set and the RGBT234 data set, as shown in figs. 3-4. Compared with the current state-of-the-art methods, the FS-Siamese network of the invention achieves excellent performance on both data sets. Tracking performance was evaluated with four objective indices: position error, overlap score, precision plot and success plot.
The overall tracking performance on the GTOT data set is shown in fig. 3. The tests clearly show that the method of the invention provides the best precision performance; in particular, its distance precision score is more than 5% higher than that of ECO-RGBT. The tracking performance in fig. 3(a) verifies the validity of the proposed fusion module. The method of the invention also gives the highest AUC score in fig. 3(b), 1% higher than the best RGB-T tracker, SGT, which shows that the method can use bounding-box scaling to locate the target. The performance on the RGBT234 data set is shown in fig. 4. RGBT234 contains more video pairs and more challenging factors, so the tracking performance it measures is convincing. From fig. 4(a) it can be clearly seen that the distance precision score of the method is significantly higher than that of the other 13 comparison methods. Similarly, the method also takes first place in the success plot, as shown in fig. 4(b), with an AUC score 1.5% higher than that of correlation-filter-based trackers such as MDNet + RGBT and ECO + RGBT. This further verifies the validity of the proposed network structure.
Finally, six scenes were chosen as examples to demonstrate the qualitative tracking performance in fig. 5, with three video sequences randomly chosen from each scene. In fig. 5(a) the moving object is often occluded by tree trunks, and most state-of-the-art methods tend to lose the target after severe occlusion; it can be clearly seen that the method of the invention can still track the target under both partial and heavy occlusion. In fig. 5(b) the object moves together with an adjacent pedestrian, causing serious background clutter; in this case the method of the invention achieves the same effect as ECO-RGBT and provides good tracking performance. Fig. 5(c) contains severe haze and, in addition to this challenging factor, also involves occlusion and background clutter; the tests clearly show that the method of the invention can still locate the target with a suitable bounding box. The kite sequence is very challenging because the target is extremely small; in this sequence other methods start to drift after about the 300th frame, while the present method can still track the kite throughout the video, as shown in fig. 5(d). Figs. 5(e) and (f) suffer from low illumination in rainy and night-time scenes; these two examples show that the method of the invention can effectively supplement the RGB sequence with thermal information.
The above description is only one embodiment of the present invention, but the scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. A robust RGB-T tracking method based on a bilinear-fusion four-stream network, characterized by comprising the following steps:
step S1: dividing the feature embeddings into template embedding pairs and candidate embedding pairs, where each embedding pair consists of two streams, so that together they form a four-stream convolutional neural network structure, and the template embedding pair consists of the ground-truth region of the first frame of the visible-light image and of the infrared image;
step S2: cropping images of the same size as the template embedding pair from the candidate embedding pair, and extracting features from them and from the template embedding pair through a convolutional neural network to form four multi-channel feature maps;
step S3: training the feature maps obtained in S2 with a graph convolutional neural network to obtain the final feature maps;
step S4: performing bilinear fusion on the final feature maps from step S3, obtaining two bilinear vectors through a two-layer fully connected network, and taking the inner product of the two bilinear vectors to finally obtain a similarity score; in step S4 a bilinear fusion method is adopted, in which the outer product is used to explore the pairwise correlations between feature channels, specifically: bilinear fusion is performed on the final feature maps of the first two streams and, separately, on the final feature maps of the last two streams in S3; each fusion takes two feature maps of sizes $A \in \mathbb{R}^{M\times K\times C}$ and $B \in \mathbb{R}^{M\times K\times C}$, which are reshaped into matrices $A \in \mathbb{R}^{MK\times C}$ and $B \in \mathbb{R}^{MK\times C}$; taking the outer product at each position of the two tensors and combining all the products, the resulting bilinear vector can be expressed as

$$u = \operatorname{vec}\!\left(A^{\mathrm T}B\right), \qquad u_{(j-1)\cdot C+i} = a_i^{\mathrm T}\, b_j,$$

where $u_{(j-1)\cdot C+i}$ denotes the $((j-1)\cdot C+i)$-th element of the vector $u$, $a_i \in \mathbb{R}^{MK}$ is the one-dimensional vector obtained by reshaping the feature map of the $i$-th channel, $b_j \in \mathbb{R}^{MK}$ is the corresponding vector for the $j$-th channel, $i$ and $j$ respectively denote the $i$-th row and the $j$-th column of the bilinear matrix, and $C$ is the total number of channels of the feature map;
step S5: repeating steps S2-S4, splicing the scores obtained each time into a similarity score matrix, and locating the target at the position of the maximum score, thereby achieving tracking over the whole sequence.
2. The robust RGB-T tracking method according to claim 1, wherein in step S2 the selected convolutional neural network is a VGG-16 network, features from different layers of VGG-16 are selected so that the positional information of the lower layers is combined with the semantic information of the higher layers, and finally four multi-channel feature maps fusing information from multiple layers are output.
3. The robust RGB-T tracking method based on a bilinear-fusion four-stream network as claimed in claim 2, wherein in step S3 the nodes of the graph convolutional neural network are constructed from the multi-channel feature maps of S2 according to the spatial arrangement of the feature-map pixels, and every two adjacent nodes are connected to form an edge of the graph; the graph structure can be expressed as $\Phi_1 = (\nu, \varepsilon)$, where $\nu$ denotes the node set of the graph and $\varepsilon$ its edge set, and feature maps with stronger expressive power are generated after the two-layer graph convolutional network.
4. The robust RGB-T tracking method based on a bilinear-fusion four-stream network as claimed in claim 3, wherein in step S5 regions of the same size as the template embedding pair are sequentially cropped from the candidate embedding pair, from left to right and from top to bottom; steps S2-S4 are then repeated, and each score value is spliced, in the same order, into a similarity score map denoted $Q(Z,X)$, whose final expression is

$$Q(Z,X) = \big[\, q_1,\; q_2,\; \dots,\; q_k \,\big],$$

with the $k$ scores arranged row by row into a matrix, where $k$ is the number of crops of the same size as the template embedding pair taken from the candidate embedding pair, i.e. the total number of similarity scores obtained, $Z$ and $X_m$ denote the template embedding pair and the $m$-th cropped candidate embedding pair, respectively, and each element $q_m$ of the matrix is the similarity score obtained for $X_m$ in the corresponding step.
CN202011251625.7A 2020-11-11 2020-11-11 Robustness RGB-T tracking method based on bilinear convergence four-stream network Active CN112418203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011251625.7A CN112418203B (en) 2020-11-11 2020-11-11 Robustness RGB-T tracking method based on bilinear convergence four-stream network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011251625.7A CN112418203B (en) 2020-11-11 2020-11-11 Robustness RGB-T tracking method based on bilinear convergence four-stream network

Publications (2)

Publication Number Publication Date
CN112418203A CN112418203A (en) 2021-02-26
CN112418203B true CN112418203B (en) 2022-08-30

Family

ID=74781816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011251625.7A Active CN112418203B (en) 2020-11-11 2020-11-11 Robustness RGB-T tracking method based on bilinear convergence four-stream network

Country Status (1)

Country Link
CN (1) CN112418203B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN113837296B (en) * 2021-09-28 2024-05-31 安徽大学 RGBT visual tracking method and system based on two-stage fusion structure search

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520529A (en) * 2018-03-30 2018-09-11 上海交通大学 Visible light based on convolutional neural networks and infrared video method for tracking target
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Online adaptive Siamese network tracking algorithm based on attention mechanism; Dong Jifu et al.; Laser & Optoelectronics Progress; 2020-01-25 (No. 02); full text *

Also Published As

Publication number Publication date
CN112418203A (en) 2021-02-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant