CN116543021A - Siamese network video single-target tracking method based on feature fusion - Google Patents


Info

Publication number
CN116543021A
CN116543021A (application number CN202310596182.2A)
Authority
CN
China
Prior art keywords
feature
convolution
convolution block
image
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310596182.2A
Other languages
Chinese (zh)
Inventor
董宇欣
刘皓
史志平
孙采萱
张立国
江俊慧
李思照
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze University
Harbin Engineering University
Original Assignee
Yangtze University
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze University, Harbin Engineering University filed Critical Yangtze University
Priority to CN202310596182.2A priority Critical patent/CN116543021A/en
Publication of CN116543021A publication Critical patent/CN116543021A/en
Pending legal-status Critical Current


Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments (G06T 7/00 Image analysis; G06T 7/20 Analysis of motion)
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0475: Generative networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/806: Fusion, i.e. combining data from various sources at the feature extraction level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30232: Surveillance
    • Y02T 10/40: Engine management systems


Abstract

A Siamese network video single-target tracking method based on feature fusion, in particular a Siamese network surveillance-video single-target tracking method based on feature fusion. It aims to solve the problems that the Siamese network single-target tracking algorithm has low tracking capability when facing complex environments and when obvious background interference exists near the tracked target, cannot track the target accurately, and outputs an insufficiently accurate tracking region when tracking some specific targets. A constructed model, which sequentially comprises a ResNet-50 network based on a mixed attention mechanism and a twin feature fusion network, is trained with a template-region image set and a search-region image set and outputs feature maps of the template image and the search image respectively; the feature maps of the template image and the search image are input into an RPN network for similarity comparison, and the prediction region in the search image with the highest similarity to the template image is output, realizing tracking of a single target. The invention belongs to the field of target tracking.

Description

Siamese network video single-target tracking method based on feature fusion
Technical Field
The invention relates to a target tracking method, in particular to a Siamese network surveillance-video single-target tracking method based on feature fusion, and belongs to the field of target tracking.
Background
Single-target tracking technology plays an important role in military fields such as long-range strike and enemy reconnaissance, and in daily-life fields such as searching for lost children and pursuing illegal vehicles. Existing Siamese network single-target tracking algorithms have been widely studied and applied. A Siamese network single-target tracking algorithm uses a convolutional neural network as the backbone for feature extraction, and the extracted feature map is directly subjected to target classification and regression. Because the semantic information contained in the features extracted by different convolution layers of the convolutional neural network differs, the image features learned by the Siamese network single-target tracking algorithm often lack the context of the tracked image and cannot make full use of the appearance and semantic information of the tracked object. As a result, the tracking capability of the Siamese network single-target tracking algorithm is somewhat insufficient and its tracking accuracy is low when it faces complex environments such as lighting changes, occlusion within the scene and camera shake.
Meanwhile, in pursuit of higher real-time tracking speed, an offline-trained model is usually chosen for tracking, and offline training tends to use a rather general feature extraction network, so the tracking region output by the Siamese network single-target tracking algorithm is not accurate enough when tracking some specific targets. On the other hand, when obvious background interference exists near the tracked target in the tracking image, the features extracted by the single-target tracking algorithm are more easily concentrated on the interfering object in the background, so that the Siamese network single-target tracking algorithm cannot track the target accurately.
Disclosure of Invention
The invention aims to solve the problems that the Siamese network single-target tracking algorithm has low tracking capability when facing complex environments and when obvious background interference exists near the tracked target, cannot track the target accurately, and outputs an insufficiently accurate tracking region when tracking some specific targets, and therefore provides a Siamese network video single-target tracking method based on feature fusion.
The technical scheme adopted by the invention is as follows:
it comprises the following steps:
s1, acquiring a section of monitoring video, extracting each frame of image in the monitoring video, carrying out target labeling on all the extracted images, preprocessing all the labeled images, cutting each frame of image into a pair of images with fixed size, dividing each cut frame of image into a template image and a search image, forming a template area image set by all the template images, and forming a search area image set by all the search images;
s2, constructing a model, wherein the model sequentially comprises a ResNet-50 network and a twin feature fusion network based on a mixed attention mechanism, inputting a template region image set and a search region image set into the model for training, and respectively outputting feature graphs of the template image and the search image until the upper limit of iteration times is met, so as to obtain a trained model, wherein the specific process is as follows:
S21, inputting a template region image set into a model in a template branch, and sequentially processing each template image by a ResNet-50 network and a twin feature fusion network based on a mixed attention mechanism to output a feature map;
s22, in a searching branch, inputting a next frame of searching image corresponding to each template image into a model, and sequentially processing a ResNet-50 network and a twin feature fusion network based on a mixed attention mechanism to output a feature map;
s23, inputting the feature image output by the S21 and the feature image output by the S22 into an RPN network for similarity comparison, and outputting a prediction area with highest similarity with the template image in the search image;
s3, acquiring a monitoring video to be tracked, selecting a certain target in a first frame image of the monitoring video to track, outputting a predicted area of the target selected in the first frame image in a second frame image by using the trained model in S2 based on the first frame image of the selected target and the second frame image of the monitoring video, and obtaining the position of the target in the second frame image;
and outputting a predicted area of the target in the second frame image in the third frame image by using the trained model in the S2 based on the second frame image containing the target and the third frame image of the monitoring video, and the like, so as to realize tracking of the single target.
Further, the size of the template image in S1 is 127×127, and the size of the search image is 255×255.
Further, the ResNet-50 network based on the mixed attention mechanism in the S2 sequentially comprises a convolution block 1, a pooling layer, a convolution block 2, a convolution block 3, a convolution block 4 and a convolution block 5, wherein the convolution block 1 comprises one convolution layer, the convolution block 2 sequentially comprises three convolution layers, the convolution block 3 sequentially comprises four convolution layers, the convolution block 4 sequentially comprises six convolution layers, and the convolution block 5 sequentially comprises three convolution layers;
a channel attention mechanism and a spatial attention mechanism are arranged in sequence inside each of convolution block 2, convolution block 3, convolution block 4 and convolution block 5, between the last two convolution layers of the block;
the outputs of the convolution blocks 3, 4 and 5 are taken as the final outputs of the ResNet-50 network based on a mixed attention mechanism, and the corresponding scales of the outputs of different convolution blocks are different.
Further, the twin feature fusion network in S2 sequentially includes a convolution block a, a convolution block b, a convolution block c and a pyramid feature extraction network, the convolution block a includes a 1×1 convolution layer, the convolution block b sequentially includes a 1×1 convolution layer and a bilinear interpolation layer, the convolution block c sequentially includes a 1×1 convolution layer and a bilinear interpolation layer, the pyramid feature extraction network has three layers, each layer sequentially includes a 1×1 convolution layer, a 3×3 convolution layer, a 1×1 convolution layer and a transposed convolution layer.
Further, the RPN network in S23 includes three RPN modules, each of which includes an anchor point generator, a convolution feature extractor, and a candidate box regressor.
Further, the specific process of S21 is as follows:
S211, inputting each template image into the ResNet-50 network based on the mixed attention mechanism; after being processed by convolution block 1 and the pooling layer, each template image is input into convolution block 2; since the channel attention mechanism and the spatial attention mechanism are arranged inside convolution block 2, the input of convolution block 2 is processed by the first two convolution layers of convolution block 2 to obtain a corresponding feature vector X, whose total number of feature channels is denoted C;
inputting the feature vector X into the channel attention mechanism, which outputs the channel weighting feature M_c of the feature vector X;
inputting the channel weighting feature M_c into the spatial attention mechanism, which outputs an attention map A_s;
inputting the attention map A_s into the third convolution layer of convolution block 2, and outputting the result of convolution block 2;
inputting the result output by convolution block 2 in turn into convolution block 3, convolution block 4 and convolution block 5, in which the processing steps of convolution block 2 are repeated; since the outputs of different convolution blocks have different scales, three features of different scales are obtained;
S212, inputting the three scale features of each template image into convolution block a, convolution block b and convolution block c of the twin feature fusion network respectively, where the 1×1 convolution layer in each convolution block reduces the dimensionality of the input scale feature, giving three dimension-reduced features;
regarding each feature as a feature map, and taking the feature map output by convolution block 3 of the ResNet-50 network based on the mixed attention mechanism as the base feature map; if the size of the feature map obtained by convolution block b or convolution block c of the twin feature fusion network is smaller than that of the base feature map, inputting the dimension-reduced feature map obtained by convolution block b or convolution block c into the corresponding bilinear interpolation layer, adjusting its size to that of the base feature map by bilinear interpolation, and taking the adjusted feature map as the final output of the corresponding convolution block; concatenating the feature maps output by convolution block a, convolution block b and convolution block c to obtain a spliced feature map;
inputting the spliced feature map into each layer of the pyramid feature extraction network, where it is processed in turn by a 1×1 convolution layer, a 3×3 convolution layer and a 1×1 convolution layer to generate a low-resolution feature map; adding the tensors at corresponding positions of the low-resolution feature map and the spliced feature map to obtain the multi-scale feature map output by the current layer, giving three multi-scale feature maps;
S213, for the three multi-scale feature maps of each template image, adjusting the channel size of each multi-scale feature map by a deconvolution operation, giving three adjusted new feature maps.
Further, the channel weighting feature M_c is:
M_c = sigmoid(W(BN(X)))
where BN denotes a batch normalization operation, X denotes the feature vector, and W denotes the attention weight.
Further, the attention map A_s is:
A_s = M_c · σ(f_max(Q(f_avg(M_c))))
where Q denotes a convolution operation, f_avg and f_max denote the average and maximum of the input channel-weighted feature map along the channel dimension respectively, and σ is the sigmoid function.
Further, the specific process of S22 is:
S221, inputting each search image into the ResNet-50 network based on the mixed attention mechanism; after being processed by convolution block 1 and the pooling layer, each search image is input into convolution block 2; since the channel attention mechanism and the spatial attention mechanism are arranged inside convolution block 2, the input of convolution block 2 is processed by the first two convolution layers of convolution block 2 to obtain a corresponding feature vector X, whose total number of feature channels is denoted C;
inputting the feature vector X into the channel attention mechanism, which outputs the channel weighting feature M_c of the feature vector X;
inputting the channel weighting feature M_c into the spatial attention mechanism, which outputs an attention map A_s;
inputting the attention map A_s into the third convolution layer of convolution block 2, and outputting the result of convolution block 2;
inputting the result output by convolution block 2 in turn into convolution block 3, convolution block 4 and convolution block 5, in which the processing steps of convolution block 2 are repeated; since the outputs of different convolution blocks have different scales, three features of different scales are obtained;
S222, inputting the three scale features of each search image into convolution block a, convolution block b and convolution block c of the twin feature fusion network respectively, where the 1×1 convolution layer in each convolution block reduces the dimensionality of the input scale feature, giving three dimension-reduced features;
regarding each feature as a feature map, and taking the feature map output by convolution block 3 of the ResNet-50 network based on the mixed attention mechanism as the base feature map; if the size of the feature map obtained by convolution block b or convolution block c of the twin feature fusion network is smaller than that of the base feature map, inputting the dimension-reduced feature map obtained by convolution block b or convolution block c into the corresponding bilinear interpolation layer, adjusting its size to that of the base feature map by bilinear interpolation, and taking the adjusted feature map as the final output of the corresponding convolution block; concatenating the feature maps output by convolution block a, convolution block b and convolution block c to obtain a spliced feature map;
inputting the spliced feature map into each layer of the pyramid feature extraction network, where it is processed in turn by a 1×1 convolution layer, a 3×3 convolution layer and a 1×1 convolution layer to generate a low-resolution feature map; adding the tensors at corresponding positions of the low-resolution feature map and the spliced feature map to obtain the multi-scale feature map output by the current layer, giving three multi-scale feature maps;
S223, for the three multi-scale feature maps of each search image, adjusting the channel size of each multi-scale feature map by a deconvolution operation, giving three adjusted new feature maps.
Further, the specific process of S23 is:
inputting the three adjusted new feature maps obtained in S213 and the three adjusted new feature maps obtained in S223 into the RPN network; each RPN module receives one adjusted new feature map from S213 and one from S223, compares the similarity of the two input feature maps, and outputs a feature map carrying target candidate boxes;
because the three RPN modules have outputs of the same size, the outputs of the RPN modules are weighted and summed to obtain the prediction region in the search image with the highest similarity to the template image.
The beneficial effects are that:
The invention uses a ResNet-50 network into which a mixed attention mechanism (a channel attention mechanism and a spatial attention mechanism) is introduced to extract features from the target image, obtaining image features of different levels and different semantics. A twin feature fusion module combines and splices these image features of different levels and semantics, and a pyramid feature extraction network then generates deep image features carrying multiple levels of semantic information. The appearance and semantic information of the tracked target is thereby used more fully, the influence of the tracking environment on the tracking result is reduced, and a certain robustness to changes in target appearance and scale is obtained. As a result, when the environment is complex, when obvious background interference exists near the tracked target, and when some specific targets are tracked, the tracked target and the background information in the target image can be distinguished more effectively, an accurate tracking region is obtained, and both the tracking capability and the tracking accuracy are improved. Experiments in the embodiment show that, compared with the conventional Siamese-network-based single-target tracking algorithm, the method improves tracking accuracy, anti-interference capability and robustness.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a channel attention mechanism;
FIG. 3 is a schematic diagram of a spatial attention mechanism;
FIG. 4 is a schematic diagram of a feature fusion module;
Detailed Description
The first embodiment is as follows: with reference to FIGS. 1-4, this embodiment describes a Siamese network video single-target tracking method based on feature fusion, which comprises the following steps:
S1, acquiring a segment of surveillance video, extracting every frame of the video, labeling the target in all extracted frames, preprocessing all labeled frames, cropping each frame into a pair of fixed-size images, dividing each cropped frame into a template image and a search image, forming a template-region image set from all template images, and forming a search-region image set from all search images.
The size of the template image is 127×127, the size of the search image is 255×255, and both initially have 3 channels. Since Siamese network single-target tracking computes the similarity between the tracked target in the first frame and each subsequent frame and then selects the region of highest similarity in each subsequent frame as the prediction region, the training of the Siamese network single-target tracking algorithm optimizes the model by repeatedly comparing the similarity between images in the template-region image set and images in the search-region image set.
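As an illustration of the cropping step in S1, the following is a minimal sketch assuming SiamFC/SiamRPN-style context cropping; the function name, the context margin and the mean-color padding strategy are assumptions for illustration, not details taken from the patent.

    import cv2
    import numpy as np

    def square_crop(frame, box, out_size):
        """Crop a square patch centered on the labeled target and resize it.

        frame    : H x W x 3 image (one video frame)
        box      : (cx, cy, w, h) target annotation in pixels
        out_size : 127 for template images, 255 for search images
        """
        cx, cy, w, h = box
        context = 0.5 * (w + h)                          # illustrative context margin
        side = int(np.sqrt((w + context) * (h + context)))
        if out_size == 255:                              # the search crop covers a larger area
            side = side * 255 // 127
        pad = side                                       # pad so the crop never leaves the frame
        mean_color = frame.mean(axis=(0, 1)).tolist()
        padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad,
                                    cv2.BORDER_CONSTANT, value=mean_color)
        x0 = int(cx - side / 2) + pad
        y0 = int(cy - side / 2) + pad
        patch = padded[y0:y0 + side, x0:x0 + side]
        return cv2.resize(patch, (out_size, out_size))

    # template crop (127x127) from the labeled frame, search crop (255x255) from the next frame:
    # z = square_crop(frame_t, box_t, 127)
    # x = square_crop(frame_t_plus_1, box_t, 255)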
S2, as shown in FIG. 1, constructing a model which sequentially comprises a ResNet-50 network based on a mixed attention mechanism and a twin feature fusion network, inputting the template-region image set and the search-region image set into the model for training, and outputting feature maps of the template image and the search image respectively, until the upper limit of the number of iterations is reached, to obtain a trained model; the specific process is as follows:
The ResNet-50 network based on the mixed attention mechanism is used for feature extraction. It sequentially comprises convolution block 1, a pooling layer, convolution block 2, convolution block 3, convolution block 4 and convolution block 5. Convolution block 1 comprises one convolution layer. Convolution block 2 comprises three convolution layers, Conv2_1, Conv2_2 and Conv2_3, and a channel attention mechanism and a spatial attention mechanism are arranged in sequence between its last two convolution layers (Conv2_2 and Conv2_3). Convolution block 3 comprises four convolution layers, Conv3_1, Conv3_2, Conv3_3 and Conv3_4, with a channel attention mechanism and a spatial attention mechanism arranged in sequence between its last two convolution layers (Conv3_3 and Conv3_4). Convolution block 4 comprises six convolution layers, Conv4_1 to Conv4_6, with a channel attention mechanism and a spatial attention mechanism arranged in sequence between its last two convolution layers (Conv4_5 and Conv4_6). Convolution block 5 comprises three convolution layers, Conv5_1, Conv5_2 and Conv5_3, with a channel attention mechanism and a spatial attention mechanism arranged in sequence between its last two convolution layers (Conv5_2 and Conv5_3).
The outputs of the convolution block 3, the convolution block 4 and the convolution block 5 are all taken as the final output of the ResNet-50 network based on the mixed attention mechanism, and the corresponding scales of the outputs of the different convolution blocks are different.
In the invention, the effective stride of convolution blocks 4 and 5 is reduced to 8 pixels, and dilated convolution is used at the output layers of convolution blocks 4 and 5 to enlarge the receptive field of feature extraction, so that different layers of the ResNet-50 network have different receptive fields. The shallow convolution layers (convolution blocks 1 and 2) mainly extract low-level information such as texture and shape, which plays an important role in locating the target position, but such shallow features lack semantic information and are difficult to use for identifying the target. The features extracted by the deep convolution layers (convolution blocks 3, 4 and 5) carry rich semantic information and are more helpful in complex situations such as target occlusion, illumination change and fast motion.
When the ResNet-50 network with the mixed attention mechanism extracts features from the target image, the channel attention mechanism and the spatial attention mechanism let the network focus, while learning the feature map, on the feature channels carrying richer information and on the regions of the feature map rich in semantic information, thereby reducing the negative influence of less important feature channels on target tracking.
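For illustration, the following PyTorch sketch shows how a channel attention module and a spatial attention module can be inserted between the last two convolution layers of a convolution block, as described above. The layer shapes are simplified stand-ins for the actual ResNet-50 bottleneck blocks, and ChannelAttention / SpatialAttention refer to the modules whose formulas are detailed below; none of these names come from the original filing.

    import torch.nn as nn

    class AttentionConvBlock(nn.Module):
        """Conv block with channel and spatial attention placed between its
        last two convolution layers (e.g. between Conv2_2 and Conv2_3).

        Layer and channel sizes are illustrative; the real blocks are
        ResNet-50 bottleneck blocks."""
        def __init__(self, in_ch, out_ch, n_convs, channel_attn, spatial_attn):
            super().__init__()
            layers = []
            ch = in_ch
            for _ in range(n_convs - 1):                 # Conv*_1 ... Conv*_(n-1)
                layers += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
                ch = out_ch
            self.front = nn.Sequential(*layers)
            self.channel_attn = channel_attn             # outputs M_c
            self.spatial_attn = spatial_attn             # outputs A_s
            self.last = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # Conv*_n

        def forward(self, x):
            x = self.front(x)
            x = self.channel_attn(x)     # channel weighting feature M_c
            x = self.spatial_attn(x)     # attention map A_s
            return self.last(x)          # fed into the block's last conv layer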
The twin feature fusion network sequentially comprises convolution block a, convolution block b, convolution block c and a pyramid feature extraction network. Convolution block a comprises a 1×1 convolution layer; convolution block b sequentially comprises a 1×1 convolution layer and a bilinear interpolation layer; convolution block c sequentially comprises a 1×1 convolution layer and a bilinear interpolation layer. The pyramid feature extraction network has three layers, each of which sequentially comprises a 1×1 convolution layer, a 3×3 convolution layer, a 1×1 convolution layer and a transposed convolution layer.
S21, in the template branch, the template-region image set is input into the model, and each template image is processed in sequence by the ResNet-50 network based on the mixed attention mechanism and the twin feature fusion network to output a feature map. The specific process is as follows:
S211, as shown in FIG. 1, each template image is input into the ResNet-50 network based on the mixed attention mechanism; after being processed by convolution block 1 and the pooling layer, it is input into convolution block 2. Since the channel attention mechanism and the spatial attention mechanism are arranged inside convolution block 2, the input of convolution block 2 is processed by the first two convolution layers of convolution block 2 to obtain a corresponding feature vector X, whose total number of feature channels is denoted C. As shown in FIGS. 1 and 2, the feature vector X is input into the channel attention mechanism, a batch normalization (BN, Batch Normalization) operation is performed on X, and the scaling factor z_c of each channel c (c ∈ C) is computed:
where γ_c denotes the scaling factor of channel c in the BN operation, x_c denotes the values of channel c in the feature vector X (a two-dimensional slice), x̄_c denotes the mean of channel c, σ_c denotes the standard deviation of channel c, and ε is a non-zero constant (typically a small floating-point constant) that prevents the denominator from being zero.
The scaling factor z_c is multiplied with the corresponding channel mean to obtain the scaled channel mean s_c of the current channel c.
The channel means s_c are input into a fully connected layer, which outputs the attention score vector S of the feature vector X:
S = W_1 · s + b_1
where W_1 and b_1 denote the weight and bias of the fully connected layer; the weight of the fully connected layer is obtained with the Xavier initialization method, the bias is initialized to 0, and the bias value is continuously updated during model training. s is the scaled channel mean vector, s = [s_1, s_2, …, s_C].
Next, the attention score vector S is converted into an attention weight vector W using a normalization function, as shown in the following equation:
where W_i denotes the attention weight of the i-th feature vector, S_i denotes the attention score of the i-th feature vector, S_j denotes the attention score of the j-th feature vector, and j runs over all feature vectors.
The input feature vector X is weighted and summed with the corresponding attention weight vector W to obtain the channel attention weighted feature vector A:
A = Σ_i W_i X_i
Finally, a sigmoid activation function is used to obtain the channel weighting feature M_c of the feature vector X:
M_c = sigmoid(A)
The above steps can be simplified into the following formula:
M_c = sigmoid(W(BN(X)))
where X is the feature vector and W is the attention weight. The channel weighting feature M_c can be regarded as a feature map.
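A minimal PyTorch sketch of the simplified channel attention formula M_c = sigmoid(W(BN(X))) follows, with the attention weight W derived from the BN scaling factors as in the z_c definition above. The fully connected scoring step is folded into this per-channel weighting for brevity, so this is an interpretation of the text rather than the exact implementation of the patent.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Channel attention: M_c = sigmoid(W(BN(X))) with per-channel weights
        z_c = gamma_c / sum_j gamma_j taken from the BN scale factors."""
        def __init__(self, channels):
            super().__init__()
            self.bn = nn.BatchNorm2d(channels)

        def forward(self, x):
            x_bn = self.bn(x)                              # BN(X)
            gamma = self.bn.weight.abs()                   # BN scaling factors gamma_c
            w = (gamma / gamma.sum()).view(1, -1, 1, 1)    # z_c = gamma_c / sum_j gamma_j
            return torch.sigmoid(w * x_bn)                 # channel weighting feature M_c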
The invention introduces the spatial attention mechanism of the mixed attention mechanism CBAM into the ResNet-50 network. During feature extraction, the input feature map is modeled in the spatial dimension, the importance of different regions on different channels is computed, and the response of the important regions is strengthened, thereby improving model performance.
The channel-weighted feature map M_c ∈ R^{C×H×W} output by the channel attention mechanism (where C, H and W denote the number of channels, the height and the width respectively) is input into the spatial attention mechanism. Global average pooling is performed on each channel of the channel-weighted feature map, giving over all channels a vector z ∈ R^C containing C elements. The a-th element z_a of the vector z is the average of the channel-weighted feature map over the a-th channel. The computation proceeds as follows:
To extract useful information from the vector z, z is input into fully connected layers to obtain a vector f(z) ∈ R^r containing r elements, where r denotes the vector dimension after the channel transformation:
f(z) = W_2 δ(W_1 z)
where W_1 and W_2 denote the weight matrices of the fully connected layers, W_1 ∈ R^{h×C}, W_2 ∈ R^{r×h}; in the invention the weights of the fully connected layers are generated with the Xavier initialization method, and δ denotes a sigmoid activation function.
The vector f(z) is then passed sequentially through two convolution layers. The first convolution layer maps f(z) into a new feature map M ∈ R^{d×H×W} through a 1×1 convolution kernel, where d denotes the number of channels. The second convolution layer maps M into a new feature map S′ ∈ R^{1×H×W} through a 3×3 convolution kernel followed by batch normalization (BN) and a ReLU activation function. Global max pooling is applied to the feature map S′ and a sigmoid function yields the spatial attention tensor S ∈ R^{1×H×W}. The specific process is as follows:
M = W_3 f(z)
S′ = W_4 ReLU(BN(W_5 M))
where W_3, W_4 and W_5 denote the weight matrices of the first convolution layer, the second convolution layer and the BN layer respectively, W_3 ∈ R^{d×r}, W_4 ∈ R^{1×d}, W_5 ∈ R^{d×d}. Each weight is randomly generated when the neural network is initialized, and the parameters are continuously updated as training deepens. Stochastic gradient descent is used to minimize the loss function, and the parameters of the weight matrices are optimized through back propagation. After global max pooling of the feature map S′, the spatial attention tensor S is obtained with a sigmoid function:
S=sigmoid(Maxpooling(S′))
Finally, the spatial attention tensor S is multiplied with the input channel-weighted feature map to obtain the feature map A_s ∈ R^{C×H×W} after feature redirection:
A_s = S · M_c
where · denotes element-wise multiplication.
Finally, the process of the spatial attention module described above can be simplified to:
A_s = M_c · σ(f_max(Q(f_avg(M_c))))
where Q denotes a convolution operation, f_avg and f_max denote the average and maximum of the input channel-weighted feature map along the channel dimension respectively, σ is the sigmoid function, and A_s is the attention map output by the spatial attention mechanism.
The attention map A_s is input into the third convolution layer (Conv2_3) of convolution block 2, and the result of convolution block 2 is output.
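A minimal PyTorch sketch of the simplified spatial attention formula A_s = M_c · σ(f_max(Q(f_avg(M_c)))) follows; the intermediate channel count of the convolution Q and its kernel size are illustrative choices, not values given in the patent.

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        """Spatial attention: A_s = M_c * sigmoid(f_max(Q(f_avg(M_c))))."""
        def __init__(self, mid_channels=4, kernel_size=7):
            super().__init__()
            # Q: a small convolution applied to the channel-averaged map
            self.q = nn.Conv2d(1, mid_channels, kernel_size, padding=kernel_size // 2)

        def forward(self, m_c):
            avg = m_c.mean(dim=1, keepdim=True)        # f_avg: B x 1 x H x W
            q = self.q(avg)                            # Q(f_avg(M_c)): B x mid x H x W
            s, _ = q.max(dim=1, keepdim=True)          # f_max along the channel dimension
            return m_c * torch.sigmoid(s)              # attention map A_s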
The above description takes convolution block 2 as an example. The result output by convolution block 2 is input into convolution block 3, the processing steps of convolution block 2 are repeated, and the result of convolution block 3 is output; the result of convolution block 3 is input into convolution block 4 and processed in the same way, and the result of convolution block 4 is output; the result of convolution block 4 is input into convolution block 5 and processed in the same way, and the result of convolution block 5 is output. In this way the features output by convolution block 3, convolution block 4 and convolution block 5 are obtained for each template image; since the outputs of different convolution blocks have different scales, three features of different scales are obtained, which completes the processing of each template image by the ResNet-50 network based on the mixed attention mechanism.
S212, the three scale features of each template image are input into convolution block a, convolution block b and convolution block c of the twin feature fusion network respectively, and the 1×1 convolution layer (Conv1×1) in each convolution block reduces the dimensionality of the input scale feature, giving three dimension-reduced features.
Each feature is regarded as a feature map. The feature map output by convolution block 3 (Conv3_4) of the ResNet-50 network based on the mixed attention mechanism is taken as the base feature map. If the size of the feature map obtained by convolution block b or convolution block c of the twin feature fusion network is smaller than that of the base feature map, the dimension-reduced feature map obtained by convolution block b or convolution block c is input into the corresponding bilinear interpolation layer, and its size is adjusted to that of the base feature map by bilinear interpolation so that all features have the same spatial size; the adjusted feature map is taken as the final output of the corresponding convolution block. The feature maps output by convolution block a, convolution block b and convolution block c are then concatenated to obtain a spliced feature map.
Let Y_i denote the feature maps of the three scales that are to be fused; the twin feature fusion module can then be expressed by the following formula:
Y_f = θ_f{Z_i(Y_i)}
where Y_f denotes the spliced feature map, Z_i denotes the transformation applied to each scale feature map before fusion (the convolution and interpolation operations above, not a closed-form formula), and θ_f denotes the feature fusion function (the concatenation described above).
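A PyTorch sketch of the twin feature fusion module Y_f = θ_f{Z_i(Y_i)} follows; the input channel numbers are the usual ResNet-50 block-3/4/5 widths and are an assumption, as is the reduced channel width.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwinFeatureFusion(nn.Module):
        """Z_i: 1x1 dimension reduction (plus bilinear resize to the block-3 size);
        theta_f: channel-wise concatenation into the spliced feature map Y_f."""
        def __init__(self, in_channels=(512, 1024, 2048), reduced=256):
            super().__init__()
            self.reduce = nn.ModuleList(nn.Conv2d(c, reduced, 1) for c in in_channels)

        def forward(self, feats):                      # feats: block-3/4/5 outputs
            base_size = feats[0].shape[-2:]            # block-3 map is the base feature map
            outs = []
            for f, conv in zip(feats, self.reduce):
                f = conv(f)                            # 1x1 dimension reduction
                if f.shape[-2:] != base_size:          # smaller maps are upsampled
                    f = F.interpolate(f, size=base_size, mode='bilinear',
                                      align_corners=False)
                outs.append(f)
            return torch.cat(outs, dim=1)              # spliced feature map Y_f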
After the feature fusion operation, in order to feed the fused (spliced) feature map independently into each subsequent RPN module for tracking, the invention builds an improved pyramid feature extraction structure on top of the network structure of the SiamRPN++ algorithm: three additional layers are appended after the feature fusion operation and together form the pyramid feature extraction structure, i.e. the structure has three layers, each of which sequentially comprises a 1×1 convolution layer, a 3×3 convolution layer, a 1×1 convolution layer and a transposed convolution layer, thereby generating three independent multi-scale fusion feature maps. The additional layers are modified from the bottleneck residual block. Because of the dimension-raising and splicing operations in the feature fusion module, the size of the fused feature map does not meet the input-size requirement of the subsequent RPN modules; in order to match the fused feature map to the input size of the RPN modules, the invention adds a transposed convolution operation to the original additional layer, thereby changing the size of the fused feature map.
First, the spliced feature map is input into each layer of the pyramid feature extraction network and processed in turn by a 1×1 convolution layer (Conv1×1), a 3×3 convolution layer (Conv3×3) and a 1×1 convolution layer (Conv1×1) to generate a low-resolution feature map. The first Conv1×1 reduces the dimensionality of the feature map to lower the computation of the following Conv3×3, and the second Conv1×1 raises the dimensionality of the reduced feature map back to the same size as the input spliced feature map, giving the low-resolution feature map. The tensors at corresponding positions of the low-resolution feature map and the input spliced feature map are added (a shortcut operation) to obtain the multi-scale feature map output by the current layer; with the three-layer structure of the pyramid feature extraction network, three multi-scale feature maps are obtained. The specific process is as follows:
Y′_p = θ_P(Y_f)
where Y′_p denotes a multi-scale feature map, θ_P denotes the pyramid feature extraction network, and p denotes the layer of the pyramid feature extraction network to which the multi-scale feature map belongs.
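A sketch of one layer of the pyramid feature extraction network follows (1×1, 3×3 and 1×1 convolutions forming a bottleneck, a shortcut addition with the spliced map, then a transposed convolution that adjusts the channel size for the corresponding RPN module); the channel choices are illustrative.

    import torch.nn as nn

    class PyramidLayer(nn.Module):
        """One additional layer: bottleneck + shortcut, then transposed convolution."""
        def __init__(self, channels, rpn_channels):
            super().__init__()
            mid = channels // 4
            self.bottleneck = nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),   # reduce dimension
                nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1))                          # restore dimension
            self.adjust = nn.ConvTranspose2d(channels, rpn_channels, kernel_size=1)

        def forward(self, y_f):
            low = self.bottleneck(y_f)      # low-resolution feature map
            y_p = low + y_f                 # tensor addition at corresponding positions
            return self.adjust(y_p)         # transposed convolution adjusts the channel size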
S213, for the three multi-scale feature maps of each template image, the channel size of each multi-scale feature map is raised and adjusted by a deconvolution operation, giving three adjusted new feature maps. The adjusted channel size is determined by the input feature-map size set in the RPN modules; in the invention the input channel numbers of the three RPN modules are set to 256, 512 and 1024 respectively.
When extracting features from the target image, the ResNet-50 network with the attention mechanism can better distinguish the tracked target from the background information and extract the feature channels and feature regions of the target image that carry richer information. Meanwhile, the feature fusion module fuses the feature maps extracted from the convolution layers of ResNet-50 to obtain deep features with multiple levels of semantic information, so that the processed feature maps make full use of the appearance and semantic information of the current tracked target while remaining distinguishable from the background, which ultimately improves the robustness of the algorithm to target appearance, scale change and background interference. The invention thus improves the tracking accuracy and anti-interference capability of the Siamese-network-based single-target tracking algorithm.
Part of the code for the above procedure is as follows:
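(The code listing of the original filing is not reproduced in this text. As an illustrative stand-in, the following is a minimal PyTorch sketch of the template-branch forward pass; the module names, the fused channel width and the reuse of the TwinFeatureFusion and PyramidLayer sketches above are assumptions, not the original code.)

    import torch.nn as nn

    class TemplateBranch(nn.Module):
        """Attention ResNet-50 -> twin feature fusion -> three pyramid layers
        whose transposed convolutions adjust the channels to 256, 512 and 1024
        for the three RPN modules."""
        def __init__(self, backbone, fusion, fused_channels=768):
            super().__init__()
            self.backbone = backbone        # returns the block-3/4/5 feature maps
            self.fusion = fusion            # e.g. TwinFeatureFusion(...) sketched above
            self.pyramid = nn.ModuleList(   # PyramidLayer refers to the sketch above
                PyramidLayer(fused_channels, c) for c in (256, 512, 1024))

        def forward(self, z):
            feats = self.backbone(z)                        # three scales
            y_f = self.fusion(feats)                        # spliced feature map
            return [layer(y_f) for layer in self.pyramid]   # three adjusted feature maps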
S22, in the search branch, the search image of the next frame corresponding to each template image is input into the model and processed in sequence by the ResNet-50 network based on the mixed attention mechanism and the twin feature fusion network, and a feature map is output. The specific process is as follows:
the processing procedure of each search image in the model is identical to the procedure of S21 described above.
S221, inputting each search image into a ResNet-50 network based on a mixed attention mechanism, inputting each search image into a convolution block 2 after processing of the convolution block 1 and a pooling layer, and processing the input of the convolution block 2 through the first two convolution layers of the convolution block 2 to obtain a corresponding feature vector X, wherein the total number of feature channels of the feature vector X is set as C because the channel attention mechanism and the spatial attention mechanism are arranged in the convolution block 2.
Inputting the feature vector X into a channel attention mechanism and outputting channel weighting feature M of the feature vector X c
Weighting the channel by a characteristic M c In the input spatial attention mechanism, an output attention force diagram A s
Will pay attention to force diagram A s In the third convolution layer of the input convolution block 2, the result of the convolution block 2 is output.
And sequentially inputting the results output by the convolution block 2 into the convolution block 3, the convolution block 4 and the convolution block 5, and repeatedly executing the processing steps of the convolution block 2 in the convolution block 3, the convolution block 4 and the convolution block 5, wherein the output of different convolution blocks corresponds to different scales, so that three characteristics with different scales are obtained.
S222, the three scale features of each search image are input into convolution block a, convolution block b and convolution block c of the twin feature fusion network respectively, and the 1×1 convolution layer in each convolution block reduces the dimensionality of the input scale feature, giving three dimension-reduced features.
Each feature is regarded as a feature map, and the feature map output by convolution block 3 of the ResNet-50 network based on the mixed attention mechanism is taken as the base feature map. If the size of the feature map obtained by convolution block b or convolution block c of the twin feature fusion network is smaller than that of the base feature map, the dimension-reduced feature map obtained by convolution block b or convolution block c is input into the corresponding bilinear interpolation layer, its size is adjusted to that of the base feature map by bilinear interpolation, and the adjusted feature map is taken as the final output of the corresponding convolution block; the feature maps output by convolution block a, convolution block b and convolution block c are concatenated to obtain a spliced feature map.
The spliced feature map is input into each layer of the pyramid feature extraction network and processed in turn by a 1×1 convolution layer, a 3×3 convolution layer and a 1×1 convolution layer to generate a low-resolution feature map; the tensors at corresponding positions of the low-resolution feature map and the spliced feature map are added to obtain the multi-scale feature map output by the current layer, giving three multi-scale feature maps.
S223, for the three multi-scale feature maps of each search image, the channel size of each multi-scale feature map is adjusted by a deconvolution operation, giving three adjusted new feature maps.
S23, the feature map output in S21 and the feature map output in S22 are input into the RPN network for similarity comparison, and the prediction region in the search image with the highest similarity to the template image is output. The RPN network comprises three RPN modules, each of which comprises an anchor generator (Anchor Generator), a convolutional feature extractor (Convolutional Feature Extractor) and a candidate box regressor (Box Regressor). The specific process is as follows:
The three adjusted new feature maps obtained in S213 and the three adjusted new feature maps obtained in S223 are input into the RPN network; each RPN module receives one adjusted new feature map from S213 and one from S223, compares the similarity of the two input feature maps, and outputs a feature map carrying target candidate boxes. Because the three RPN modules have outputs of the same size, the outputs of the RPN modules are weighted and summed to obtain the prediction region in the search image with the highest similarity to the template image.
The specific design of the Siamese RPN module used in the invention is the same as in the SiamRPN++ algorithm and is cited directly from the prior art. The RPN module provides candidate boxes for the subsequent tracking modules and continuously updates them during tracking to improve tracking accuracy and robustness.
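Assuming SiamRPN++-style depthwise cross-correlation inside each RPN module (the text states that the module design follows SiamRPN++), the following sketch shows the similarity comparison and the weighted summation of the three RPN outputs; the learned fusion weights are an assumption mirroring SiamRPN++.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def depthwise_xcorr(search_feat, template_feat):
        """Depthwise cross-correlation: each channel of the search feature map is
        correlated with the matching channel of the template feature map."""
        b, c, h, w = search_feat.shape
        x = search_feat.reshape(1, b * c, h, w)
        kernel = template_feat.reshape(b * c, 1, *template_feat.shape[-2:])
        out = F.conv2d(x, kernel, groups=b * c)
        return out.reshape(b, c, out.shape[-2], out.shape[-1])

    class WeightedRPNFusion(nn.Module):
        """Weighted sum of the same-sized outputs of the three RPN modules."""
        def __init__(self, num_modules=3):
            super().__init__()
            self.weights = nn.Parameter(torch.ones(num_modules))

        def forward(self, outputs):                      # list of classification or
            w = torch.softmax(self.weights, dim=0)       # regression maps of equal size
            return sum(wi * o for wi, o in zip(w, outputs))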
Examples
The invention trains the algorithm with the COCO, LaSOT and YouTube-BB datasets, preprocessed in the same way as for the SiamRPN++ single-target tracking algorithm.
During training, the size of the target template is set to 127×127 and the size of the search template to 256×256. The algorithm trains the model (the ResNet-50 network based on the mixed attention mechanism, the twin feature fusion network and the RPN network). Model parameters are continuously optimized by stochastic gradient descent, and the model is trained on one Nvidia GTX 2080 Ti with the batch size set to 8. As in the SiamRPN++ single-target tracking algorithm, 20 epochs are trained; during the first 10 epochs the parameters of the twin feature fusion network are fixed and the later parts of the model are trained and optimized, and during the last 10 epochs the fixing of the last 3 blocks of the twin feature fusion network is lifted so that the whole model is trained.
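A sketch of this training schedule (SGD, batch size 8, 20 epochs, part of the model frozen for the first 10 epochs) is given below; the learning rate, the parameter-name prefix used for freezing and the loss interface are illustrative assumptions.

    import torch

    def build_optimizer(model, epoch, lr=0.005):
        """Freeze the designated sub-network for the first 10 epochs, then let the
        whole model train; only unfrozen parameters are handed to SGD."""
        for name, p in model.named_parameters():
            # the 'fusion' prefix is an illustrative naming assumption
            p.requires_grad = not (epoch < 10 and name.startswith('fusion'))
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.SGD(trainable, lr=lr, momentum=0.9, weight_decay=1e-4)

    # skeleton of the 20-epoch loop with batch size 8 (loader and model are assumed):
    # for epoch in range(20):
    #     optimizer = build_optimizer(model, epoch)
    #     for z, x, labels in loader:          # template crops, search crops, labels
    #         loss = model(z, x, labels)
    #         optimizer.zero_grad()
    #         loss.backward()
    #         optimizer.step()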
To verify the effectiveness of the proposed single-target tracking algorithm, the invention is compared with several recent advanced Siamese-network-based single-target tracking algorithms on the OTB100 and VOT2018 datasets.
OTB dataset: a classical visual object tracking dataset published by Wu et al. in 2013. It is a video-based tracking and evaluation dataset containing 100 video sequences that cover a variety of target categories, scenes and environments and include many challenging factors. Each video sequence of the OTB dataset contains a moving object and provides ground-truth annotations of the object's position, size and shape. The video sequences in the dataset are all at standard VGA resolution, 640×480 pixels.
VOT dataset: a widely used and standardized visual object tracking dataset and evaluation platform intended to provide a unified reference for comparing and evaluating tracking algorithms. The dataset and evaluation platform were developed jointly by multiple European computer vision research teams and are continuously updated and maintained. The VOT dataset mainly comprises two parts: video sequence data and evaluation tools. The video sequences come from different sources such as movies, television, the web and self-recordings, and contain various objects and scenes such as moving cars, people, animals and natural landscapes.
Evaluation method: to verify the effectiveness of the invention, an ablation experiment is performed with the improved mixed attention mechanism against the original SiamRPN++ algorithm, and the invention is tested and compared with several Siamese-network-based single-target tracking algorithms (SiamFC, SiamRPN, DaSiamRPN, SiamRPN++); the experimental results are analyzed both quantitatively and qualitatively.
First, SiamRPN++ is used as the baseline algorithm and only the improved channel attention mechanism (CA module) is added, with the results shown in Table 1. The SiamRPN++ single-target tracking algorithm with the channel attention mechanism brings an accuracy improvement of about 1.1% on the VOT2018 dataset; compared with the SiamRPN++ single-target tracking algorithm, its tracking robustness also improves to a certain extent, and the number of tracking losses is reduced by 3 compared with SiamRPN++.
Then, only the spatial attention mechanism (SA module) is added on top of the SiamRPN++ single-target tracking algorithm, as shown in Table 1. The SiamRPN++ single-target tracking algorithm with the spatial attention mechanism brings an accuracy improvement of about 0.8% on the VOT2018 dataset.
Finally, the channel attention mechanism and the spatial attention mechanism are added simultaneously on top of the SiamRPN++ single-target tracking algorithm. The experimental results show that, compared with the SiamRPN++ single-target tracking algorithm, the SiamRPN++ algorithm with the proposed improved mixed attention mechanism achieves better accuracy on the VOT2018 dataset and improves to a certain extent in precision, robustness and other aspects.
Table 1 comparison of experimental results in the VOT2018 dataset
To ensure the objectivity of the experimental results, four relatively advanced recent tracking algorithms, namely SiamFC, SiamRPN, DaSiamRPN and SiamRPN++, are introduced for comparison with the proposed algorithm. The 100 sequences of the OTB100 dataset are tracked, and the evaluation metrics for comparing the performance of the tracking algorithms are tracking success rate and tracking precision. For algorithms capable of real-time tracking, a single evaluation generates success-rate and precision plots. FIG. 1 shows the precision and success-rate test results of five different target tracking algorithms (ff-SiamRPN++-a, SiamFC, SiamRPN, DaSiamRPN, SiamRPN++) on the OTB100 dataset.
Table 2 shows the tracking precision and success rate of the proposed single-target tracking algorithm ff-SiamRPN++-a and the other four Siamese-network-based single-target tracking algorithms on the OTB100 dataset. The invention introduces the attention mechanism and the twin feature fusion module on top of the SiamRPN++ single-target tracking algorithm; the results show that, compared with SiamRPN++, the precision and success rate of the proposed improved algorithm increase by 2.7% and 1.8% respectively, reaching 0.914 and 0.671 on the OTB100 dataset. Of course, the attention module also has certain drawbacks: as shown in the V_FPS column of Table 2, the invention, which incorporates the attention mechanism and the feature fusion module, loses some tracking speed, because the feature fusion module and some convolution blocks in the attention mechanism increase the computational load of the model. Nevertheless, the real-time inference speed of the improved ff-SiamRPN++-a algorithm still meets the real-time requirement.
Table 2 Precision and success rate of the five algorithms on the OTB100 dataset
To compare the tracking performance of the proposed algorithm with that of the other algorithms more intuitively, this section visualizes representative tracking results of the five single-target tracking algorithms on part of the test data. Four image sequences with different interference factors are selected from the OTB100 dataset, and the tracking results of the five Siamese-network-based single-target tracking algorithms (ff-SiamRPN++-a, SiamFC, SiamRPN, DaSiamRPN, SiamRPN++) are shown, with the results of the different algorithms drawn as rectangular boxes of different colors. As shown in FIG. 2, the tracking result images show that the feature-fusion-based Siamese network single-target tracking algorithm ff-SiamRPN++-a proposed by the invention maintains good tracking ability when the tracked target encounters similar-target interference, background interference, and fast motion during tracking.

Claims (10)

1. A Siamese network video single-target tracking method based on feature fusion, characterized by comprising the following steps:
S1, acquiring a section of monitoring video, extracting each frame of image in the monitoring video, carrying out target labeling on all the extracted images, preprocessing all the labeled images, cropping each frame of image into a pair of fixed-size images, dividing each cropped frame into a template image and a search image, forming a template region image set from all the template images, and forming a search region image set from all the search images;
S2, constructing a model, wherein the model sequentially comprises a ResNet-50 network based on a mixed attention mechanism and a twin feature fusion network; inputting the template region image set and the search region image set into the model for training, and respectively outputting feature maps of the template image and the search image, until the upper limit of the number of iterations is reached, so as to obtain a trained model, the specific process being as follows:
S21, in the template branch, inputting the template region image set into the model, where each template image is processed sequentially by the ResNet-50 network based on the mixed attention mechanism and the twin feature fusion network, and a feature map is output;
S22, in the search branch, inputting the next frame of search image corresponding to each template image into the model, where it is processed sequentially by the ResNet-50 network based on the mixed attention mechanism and the twin feature fusion network, and a feature map is output;
S23, inputting the feature map output by S21 and the feature map output by S22 into an RPN network for similarity comparison, and outputting the prediction region in the search image with the highest similarity to the template image;
S3, acquiring a monitoring video to be tracked and selecting a target to track in the first frame image of the monitoring video; based on the first frame image containing the selected target and the second frame image of the monitoring video, outputting, by using the model trained in S2, the predicted region of the selected target in the second frame image, thereby obtaining the position of the target in the second frame image;
based on the second frame image containing the target and the third frame image of the monitoring video, outputting, by using the model trained in S2, the predicted region of the target in the third frame image; and so on, thereby realizing tracking of the single target.
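For illustration, the frame-by-frame tracking of S3 can be driven by a loop of the following form. This is a minimal sketch in which `model.predict`, the context-padded cropping, and the border handling are assumptions rather than details fixed by the claims; the 127×127 and 255×255 crop sizes follow claim 2.

```python
import cv2
import numpy as np

def crop_patch(frame, box, out_size):
    """Crop a square, context-padded patch centered on box=(x, y, w, h) and resize it."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = int(np.sqrt((w + (w + h) / 2.0) * (h + (w + h) / 2.0)))  # SiamFC-style context margin
    x0, y0 = max(0, int(cx - side / 2)), max(0, int(cy - side / 2))
    patch = frame[y0:y0 + side, x0:x0 + side]                       # border padding omitted for brevity
    return cv2.resize(patch, (out_size, out_size))

def track_video(model, video_path, init_box):
    """Track the target selected in the first frame through the remaining frames (S3)."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    template = crop_patch(frame, init_box, 127)    # 127x127 template image
    box, boxes = init_box, [init_box]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        search = crop_patch(frame, box, 255)       # 255x255 search image around the previous box
        box = model.predict(template, search)      # region most similar to the template (S23)
        boxes.append(box)
    cap.release()
    return boxes
```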
2. The Siamese network video single-target tracking method based on feature fusion as claimed in claim 1, characterized in that: the template image in S1 has a size of 127×127, and the search image has a size of 255×255.
3. The Siamese network video single-target tracking method based on feature fusion as claimed in claim 2, wherein the method is characterized by comprising the following steps: the ResNet-50 network based on the mixed attention mechanism in the S2 sequentially comprises a convolution block 1, a pooling layer, a convolution block 2, a convolution block 3, a convolution block 4 and a convolution block 5, wherein the convolution block 1 comprises one convolution layer, the convolution block 2 sequentially comprises three convolution layers, the convolution block 3 sequentially comprises four convolution layers, the convolution block 4 sequentially comprises six convolution layers, and the convolution block 5 sequentially comprises three convolution layers;
a channel attention mechanism and a space attention mechanism are sequentially arranged between the convolution blocks 2, 3, 4 and 5;
the outputs of the convolution blocks 3, 4 and 5 are taken as the final outputs of the ResNet-50 network based on a mixed attention mechanism, and the corresponding scales of the outputs of different convolution blocks are different.
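A minimal PyTorch sketch of such a backbone is given below, with torchvision's ResNet-50 stages standing in for convolution blocks 1-5 and an attention module inserted between successive blocks; the exact insertion points, channel widths, and the use of torchvision are assumptions made for illustration (the claim itself counts individual convolution layers inside each block).

```python
import torch.nn as nn
from torchvision.models import resnet50

class AttentionResNet50(nn.Module):
    """ResNet-50 backbone with attention between blocks, returning three scales."""
    def __init__(self, attention_cls):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)   # convolution block 1 + pooling
        self.block2, self.block3 = r.layer1, r.layer2
        self.block4, self.block5 = r.layer3, r.layer4
        self.att2 = attention_cls(256)    # attention between blocks 2 and 3
        self.att3 = attention_cls(512)    # attention between blocks 3 and 4
        self.att4 = attention_cls(1024)   # attention between blocks 4 and 5

    def forward(self, x):
        x = self.att2(self.block2(self.stem(x)))
        f3 = self.att3(self.block3(x))     # output of convolution block 3
        f4 = self.att4(self.block4(f3))    # output of convolution block 4
        f5 = self.block5(f4)               # output of convolution block 5
        return f3, f4, f5                  # three outputs at different scales
```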
4. A Siamese network video single-target tracking method based on feature fusion as claimed in claim 3, wherein the method is characterized by comprising the following steps: the S2 twin feature fusion network sequentially comprises a convolution block a, a convolution block b, a convolution block c and a pyramid feature extraction network, wherein the convolution block a comprises a 1×1 convolution layer, the convolution block b sequentially comprises a 1×1 convolution layer and a bilinear interpolation layer, the convolution block c sequentially comprises a 1×1 convolution layer and a bilinear interpolation layer, the pyramid feature extraction network comprises three layers, and each layer sequentially comprises a 1×1 convolution layer, a 3×3 convolution layer, a 1×1 convolution layer and a transposed convolution layer.
5. The Siamese network video single-target tracking method based on feature fusion in claim 4, wherein the method comprises the following steps: the RPN network in S23 includes three RPN modules, each of which includes an anchor point generator, a convolution feature extractor, and a candidate box regressor.
6. The Siamese network video single-target tracking method based on feature fusion in claim 5, wherein the method comprises the following steps: the specific process of the S21 is as follows:
S211, inputting each template image into the ResNet-50 network based on the mixed attention mechanism; after being processed by convolution block 1 and the pooling layer, each template image is input into convolution block 2, and the input of convolution block 2 is processed by the first two convolution layers of convolution block 2 to obtain a corresponding feature vector X; since the channel attention mechanism and the spatial attention mechanism are arranged in convolution block 2, the total number of feature channels of the feature vector X is denoted as C;
inputting the feature vector X into the channel attention mechanism, and outputting the channel weighting feature M_c of the feature vector X;
inputting the channel weighting feature M_c into the spatial attention mechanism, and outputting the attention map A_s;
inputting the attention map A_s into the third convolution layer of convolution block 2, and outputting the result of convolution block 2;
Sequentially inputting the output result of the convolution block 2 into the convolution block 3, the convolution block 4 and the convolution block 5, and repeatedly executing the processing steps of the convolution block 2 in the convolution block 3, the convolution block 4 and the convolution block 5, wherein the output of different convolution blocks corresponds to different scales, so that three characteristics with different scales are obtained;
S212, respectively inputting the three scale features of each template image into convolution block a, convolution block b and convolution block c of the twin feature fusion network, and reducing the dimensions of the input scale features by the 1×1 convolution layer in each convolution block to obtain three dimension-reduced features;
regarding each feature as a feature map, taking the feature map output by a convolution block 3 in a ResNet-50 network based on a mixed attention mechanism as a basic feature map, if the dimension of the feature map obtained by a convolution block b or a convolution block c of the twin feature fusion network is smaller than that of the basic feature map, inputting the dimension-reduced feature map obtained by the convolution block b or the convolution block c into a corresponding bilinear interpolation layer, adjusting the dimension of the feature map to that of the basic feature map by bilinear interpolation, taking the adjusted feature map as the final output of the corresponding convolution block, and splicing the feature maps output by the convolution block a, the convolution block b and the convolution block c to obtain a spliced feature map;
respectively inputting the obtained spliced feature map into each layer of the pyramid feature extraction network; in each layer, the spliced feature map is processed sequentially by a 1×1 convolution layer, a 3×3 convolution layer and a 1×1 convolution layer to generate a low-resolution feature map, and element-wise tensor addition is performed between the low-resolution feature map and the corresponding positions of the spliced feature map to obtain the multi-scale feature map output by the current layer, so that three multi-scale feature maps are obtained;
S213, for the three multi-scale feature maps of each template image, adjusting the channel size of each multi-scale feature map by a deconvolution operation to obtain three adjusted new feature maps.
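A minimal sketch of the fusion in S212 follows, under the assumption that the three backbone outputs carry 512, 1024, and 2048 channels (as in a standard ResNet-50) and that a single pyramid layer with a residual addition is enough for illustration; the channel widths and layer counts shown are not fixed by the claims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinFeatureFusion(nn.Module):
    """1x1 reduction (blocks a/b/c), bilinear resizing to the block-3 scale, splicing,
    and one pyramid layer (1x1 -> 3x3 -> 1x1) added back onto the spliced map."""
    def __init__(self, in_chs=(512, 1024, 2048), mid_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, mid_ch, 1) for c in in_chs])
        self.pyramid = nn.Sequential(
            nn.Conv2d(3 * mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 3 * mid_ch, 1),
        )

    def forward(self, f3, f4, f5):
        base_size = f3.shape[-2:]                      # block-3 map sets the base scale
        feats = []
        for conv, f in zip(self.reduce, (f3, f4, f5)):
            f = conv(f)                                # 1x1 dimension reduction
            if f.shape[-2:] != base_size:              # blocks b/c: bilinear interpolation
                f = F.interpolate(f, size=base_size, mode='bilinear', align_corners=False)
            feats.append(f)
        spliced = torch.cat(feats, dim=1)              # spliced feature map
        return spliced + self.pyramid(spliced)         # element-wise tensor addition
```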
7. The Siamese network video single-target tracking method based on feature fusion as claimed in claim 6, characterized in that: the channel weighting feature M_c is
M_c = sigmoid(W(BN(X)))
where BN represents a batch normalization operation, X represents the feature vector, and W represents the attention weight.
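One possible reading of this formula, sketched below, treats W as the learned batch-normalization scale factors (normalized per channel), so the module gates each channel of X; this interpretation, and the choice to return the re-weighted feature, are assumptions, since the claim does not fix the exact form of W.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention gate M_c = sigmoid(W(BN(X))), applied back onto X."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        xn = self.bn(x)                                           # BN(X)
        w = self.bn.weight.abs() / self.bn.weight.abs().sum()     # per-channel weight W
        mc = torch.sigmoid(xn * w.view(1, -1, 1, 1))              # M_c
        return x * mc                                             # channel-weighted feature
```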
8. The Siamese network video single-target tracking method based on feature fusion as claimed in claim 7, characterized in that: the attention map A_s is
A_s = M_c · σ(f_max(Q(f_avg(M_c))))
where Q represents a convolution operation, f_avg and f_max denote the average and maximum values of the input channel-weighted feature taken along the channel dimension, respectively, and σ is the sigmoid function.
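The sketch below follows the literal nesting of the formula: a channel-wise average, a convolution Q, a maximum over Q's output channels, a sigmoid, and multiplication back onto M_c. The 7×7 kernel and the number of intermediate channels in Q are assumptions, as is the small wrapper combining it with the channel attention sketched after claim 7.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention A_s = M_c * sigmoid(f_max(Q(f_avg(M_c))))."""
    def __init__(self, hidden=8, kernel_size=7):
        super().__init__()
        self.q = nn.Conv2d(1, hidden, kernel_size, padding=kernel_size // 2)  # convolution Q

    def forward(self, mc):
        avg = mc.mean(dim=1, keepdim=True)       # f_avg: average along the channel dimension
        q = self.q(avg)                          # Q(f_avg(M_c))
        mx, _ = q.max(dim=1, keepdim=True)       # f_max: maximum over Q's output channels
        return mc * torch.sigmoid(mx)            # A_s

class MixedAttention(nn.Module):
    """Channel attention followed by spatial attention, as in claims 7 and 8."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)     # from the sketch after claim 7
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```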
9. The Siamese network video single-target tracking method based on feature fusion as claimed in claim 8, wherein the method is characterized by comprising the following steps: the specific process of S22 is as follows:
S221, inputting each search image into the ResNet-50 network based on the mixed attention mechanism; after being processed by convolution block 1 and the pooling layer, each search image is input into convolution block 2, and the input of convolution block 2 is processed by the first two convolution layers of convolution block 2 to obtain a corresponding feature vector X; since the channel attention mechanism and the spatial attention mechanism are arranged in convolution block 2, the total number of feature channels of the feature vector X is denoted as C;
inputting the feature vector X into the channel attention mechanism, and outputting the channel weighting feature M_c of the feature vector X;
inputting the channel weighting feature M_c into the spatial attention mechanism, and outputting the attention map A_s;
inputting the attention map A_s into the third convolution layer of convolution block 2, and outputting the result of convolution block 2;
sequentially inputting the output result of the convolution block 2 into the convolution block 3, the convolution block 4 and the convolution block 5, and repeatedly executing the processing steps of the convolution block 2 in the convolution block 3, the convolution block 4 and the convolution block 5, wherein the output of different convolution blocks corresponds to different scales, so that three characteristics with different scales are obtained;
S222, respectively inputting the three scale features of each search image into convolution block a, convolution block b and convolution block c of the twin feature fusion network, and reducing the dimensions of the input scale features by the 1×1 convolution layer in each convolution block to obtain three dimension-reduced features;
regarding each feature as a feature map, taking the feature map output by a convolution block 3 in a ResNet-50 network based on a mixed attention mechanism as a basic feature map, if the dimension of the feature map obtained by a convolution block b or a convolution block c of the twin feature fusion network is smaller than that of the basic feature map, inputting the dimension-reduced feature map obtained by the convolution block b or the convolution block c into a corresponding bilinear interpolation layer, adjusting the dimension of the feature map to that of the basic feature map by bilinear interpolation, taking the adjusted feature map as the final output of the corresponding convolution block, and splicing the feature maps output by the convolution block a, the convolution block b and the convolution block c to obtain a spliced feature map;
respectively inputting the obtained spliced feature map into each layer of the pyramid feature extraction network; in each layer, the spliced feature map is processed sequentially by a 1×1 convolution layer, a 3×3 convolution layer and a 1×1 convolution layer to generate a low-resolution feature map, and element-wise tensor addition is performed between the low-resolution feature map and the corresponding positions of the spliced feature map to obtain the multi-scale feature map output by the current layer, so that three multi-scale feature maps are obtained;
S223, for the three multi-scale feature maps of each search image, adjusting the channel size of each multi-scale feature map by a deconvolution operation to obtain three adjusted new feature maps.
10. The Siamese network video single-target tracking method based on feature fusion as claimed in claim 9, wherein the method is characterized by comprising the following steps: the specific process of S23 is as follows:
inputting the three adjusted new feature maps obtained in S213 and the three adjusted new feature maps obtained in S223 into the RPN network, wherein each RPN module receives one adjusted new feature map from S213 and one adjusted new feature map from S223, performs a similarity comparison on the two input feature maps, and outputs a feature map with target candidate boxes;
because the three RPN modules have the same output size, the output of each RPN module is weighted and summed to obtain the prediction area with the highest similarity with the template image in the search image.
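A minimal sketch of the similarity comparison and fusion is given below; the use of depthwise cross-correlation (the template feature acting as a per-channel kernel over the search feature) follows SiamRPN++ and is an assumption about this RPN, as are the learnable fusion weights.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Correlate each channel of the search feature with the matching template channel."""
    b, c, h, w = search.shape
    kernel = template.reshape(b * c, 1, *template.shape[-2:])
    out = F.conv2d(search.reshape(1, b * c, h, w), kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

def fuse_rpn_outputs(responses, weights):
    """Weighted sum of the three equally sized RPN outputs (claim 10)."""
    weights = torch.softmax(weights, dim=0)            # normalize the three fusion weights
    return sum(w * r for w, r in zip(weights, responses))
```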
CN202310596182.2A 2023-05-24 2023-05-24 Siamese network video single-target tracking method based on feature fusion Pending CN116543021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310596182.2A CN116543021A (en) 2023-05-24 2023-05-24 Siamese network video single-target tracking method based on feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310596182.2A CN116543021A (en) 2023-05-24 2023-05-24 Siamese network video single-target tracking method based on feature fusion

Publications (1)

Publication Number Publication Date
CN116543021A true CN116543021A (en) 2023-08-04

Family

ID=87455958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310596182.2A Pending CN116543021A (en) 2023-05-24 2023-05-24 Siamese network video single-target tracking method based on feature fusion

Country Status (1)

Country Link
CN (1) CN116543021A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197249A (en) * 2023-11-08 2023-12-08 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium
CN117197249B (en) * 2023-11-08 2024-01-30 北京观微科技有限公司 Target position determining method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhang et al. C2FDA: Coarse-to-fine domain adaptation for traffic object detection
CN110298404B (en) Target tracking method based on triple twin Hash network learning
CN110209859A (en) The method and apparatus and electronic equipment of place identification and its model training
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
CN114399533B (en) Single-target tracking method based on multi-level attention mechanism
EP1801731B1 (en) Adaptive scene dependent filters in online learning environments
CN111476133A (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112489119A (en) Monocular vision positioning method for enhancing reliability
CN116543021A (en) Siamese network video single-target tracking method based on feature fusion
CN113205103A (en) Lightweight tattoo detection method
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN116402851A (en) Infrared dim target tracking method under complex background
CN115205903A (en) Pedestrian re-identification method for generating confrontation network based on identity migration
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN116311472B (en) Micro-expression recognition method and device based on multi-level graph convolution network
CN117576149A (en) Single-target tracking method based on attention mechanism
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN110942463A (en) Video target segmentation method based on generation countermeasure network
CN116311518A (en) Hierarchical character interaction detection method based on human interaction intention information
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
Ma et al. Fov-net: Field-of-view extrapolation using self-attention and uncertainty
CN111914751B (en) Image crowd density identification detection method and system
CN114821632A (en) Method for re-identifying blocked pedestrians

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination