CN113609904A - Single-target tracking algorithm based on dynamic global information modeling and twin network - Google Patents

Single-target tracking algorithm based on dynamic global information modeling and twin network

Info

Publication number
CN113609904A
CN113609904A
Authority
CN
China
Prior art keywords
network
training
target
global information
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110732036.9A
Other languages
Chinese (zh)
Other versions
CN113609904B (en)
Inventor
盛庆华
黄箭
周超宇
李贺贺
李竹
殷海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110732036.9A priority Critical patent/CN113609904B/en
Publication of CN113609904A publication Critical patent/CN113609904A/en
Application granted granted Critical
Publication of CN113609904B publication Critical patent/CN113609904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-target tracking algorithm based on dynamic global information modeling and a twin (Siamese) network, comprising the following steps. Step S1: acquire a training set and a test set, and preprocess the training set. Step S2: build the network, design the loss function and train. Step S3: verify the effect of the algorithm on the test set using the network obtained in steps S1 and S2. In the technical scheme of the invention, a twin network is built as the backbone network, the preprocessed data set is used for training, and the position of the target object in the video sequence is then determined from the output response map. By combining the dynamic global information modeling algorithm, the network receptive field can be adjusted dynamically according to the size and aspect ratio of the object, and the information of any two points in the feature map can be modeled, which greatly improves tracking accuracy and preserves real-time performance while only slightly increasing network complexity.

Description

Single-target tracking algorithm based on dynamic global information modeling and twin network
Technical Field
The invention relates to the field of artificial intelligence and single target tracking, in particular to a single target tracking algorithm based on dynamic global information modeling and a twin network, which can be used for improving the accuracy and the real-time performance of single target tracking.
Background
Visual target tracking is an important branch of the computer vision field. Its purpose is to automatically, accurately and quickly locate an object in subsequent frames, given only the object's position in the first frame marked by a human. The main challenges of this task are object occlusion, target blurring, fast motion, deformation and illumination variation. Provided the tracker remains real-time, target tracking technology is applied in many fields, such as video surveillance, human-computer interaction and autonomous driving.
Current mainstream target tracking algorithms, such as SiamRPN, add a region proposal structure to the network, which consists of a template branch and a detection branch and enables offline end-to-end training. Although the anchor-based strategy allows the object's position to be predicted more accurately, the region proposal structure makes the tracker very sensitive to the number, size and aspect ratio of anchor boxes, so parameter tuning is difficult and labor-intensive. The SiamFC++ algorithm is based on an anchor-free strategy, does not rely on excessive prior knowledge, and regresses the target position adaptively. However, it also has problems: it ignores the different receptive-field requirements of targets of different scales and aspect ratios, and it does not model global context information, which can cause tracking failure in complex scenes involving highly variable object shapes, fast motion and occlusion.
Disclosure of Invention
Aiming at the defects of mainstream single-target tracking algorithms, namely difficult parameter tuning, a fixed receptive field and the lack of global information modeling, the invention designs a single-target tracking algorithm based on dynamic global information modeling and a twin network, so as to improve the success rate and accuracy of target tracking in difficult scenes.
The technical scheme for realizing the invention is as follows:
a single target tracking algorithm based on dynamic global information modeling and twin networks is characterized by comprising the following steps:
step S1: acquiring a training set and a test set, and preprocessing the training set;
step S2: building a network, designing a loss function and training;
step S3: according to the network obtained in steps S1 and S2, verifying the algorithm effect on the test set using metrics such as accuracy, success rate and expected average overlap;
the step S1 further includes:
step S11: cropping each original picture of the training set into new pictures of 127 × 127 and 255 × 255 centered on the target object; the regions of a new picture that extend beyond the original picture are padded with the channel-mean pixel value;
step S12: converting the picture information of the training set generated in S11, including the picture path, the coordinates of the top-left corner of the target object and the coordinates of the bottom-right corner of the target object, into JSON format for storage; in the training stage, pictures can be read from the stored picture paths, and the size and position information of the target object obtained at the same time;
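For illustration only, the following Python sketch shows one plausible implementation of steps S11 and S12: cropping a template picture (127 × 127) and a search picture (255 × 255) centered on the target, padding with the channel-mean pixel, and storing the picture information as JSON. The function names and the simple fixed-size crop (without any context scaling) are assumptions and are not taken from the patent.

import json
import cv2
import numpy as np

def crop_centered(image, box, out_size):
    """Crop a square patch of side out_size centered on box = (x1, y1, x2, y2).
    Regions falling outside the original picture are padded with the channel mean."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = out_size / 2.0
    mean_pixel = image.mean(axis=(0, 1))                      # per-channel mean used for padding
    patch = np.tile(mean_pixel, (out_size, out_size, 1)).astype(image.dtype)
    # part of the crop window that actually lies inside the original picture
    sx1, sy1 = int(max(cx - half, 0)), int(max(cy - half, 0))
    sx2, sy2 = int(min(cx + half, image.shape[1])), int(min(cy + half, image.shape[0]))
    # where that part lands inside the padded output patch
    dx1, dy1 = int(sx1 - (cx - half)), int(sy1 - (cy - half))
    patch[dy1:dy1 + (sy2 - sy1), dx1:dx1 + (sx2 - sx1)] = image[sy1:sy2, sx1:sx2]
    return patch

def preprocess_sample(image_path, box, json_path):
    image = cv2.imread(image_path)
    template = crop_centered(image, box, 127)                 # 127 x 127 template picture (step S11)
    search = crop_centered(image, box, 255)                   # 255 x 255 search picture (step S11)
    record = {"path": image_path,                             # picture information stored as JSON (step S12)
              "top_left": [box[0], box[1]],
              "bottom_right": [box[2], box[3]]}
    with open(json_path, "w") as f:
        json.dump(record, f)
    return template, search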
the step S2 further includes:
step S21: building the twin sub-networks, using a modified ResNet50 as the feature-extraction backbone: the input first passes through a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence;
step S22: building the classification-regression and intersection-over-union (IOU) prediction sub-network;
the step S22 further includes:
step S221: constructing the dynamic global information modeling module, which consists of 3 parts: split, fuse and select. In the split part, the input feature X is convolved in parallel by 1 × 3, 3 × 3 and 3 × 1 kernels to obtain features U_A, U_B and U_C; the convolution kernels of the 3 branches differ in size, and U_A, U_B and U_C are fused according to learned weights. The purpose is to make the tracker more flexible, so that the receptive field can be selected dynamically according to the size and aspect ratio of the target. In the fuse part, the information of any two pixel points at arbitrary distance within the same frame is modeled to obtain a vector s, and the number of channels of s is reduced from 256 to 32 to obtain a vector z. The main function of this part is to transform the global information produced by the modeling and assign it to different channels; expressed as a formula:

z = δ(B(Ws))

where δ denotes the non-linear activation function, B denotes batch normalization, and W ∈ R^(32×256). The vector z is passed through a softmax operator to obtain soft attention vectors v_A, v_B and v_C for the features U_A, U_B and U_C. In the select part, the soft attention vectors v_A, v_B and v_C are multiplied with their respective feature matrices U_A, U_B and U_C and the results are summed to obtain the feature V;
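The split-fuse-select structure of step S221 could be sketched in PyTorch as below. The layer names, the use of global average pooling to obtain the vector s, and the way the three soft attention vectors are produced from z are assumptions made for illustration; in particular, the patent's fuse part models pairs of pixels at arbitrary distance, which is only approximated here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGlobalInfoModule(nn.Module):
    """Split-fuse-select block: dynamic receptive field + global information modeling (sketch)."""
    def __init__(self, channels=256, reduced=32):
        super().__init__()
        # split: three parallel branches with differently shaped kernels (1x3, 3x3, 3x1)
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=(3, 3), padding=(1, 1))
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        # fuse: z = delta(BN(W s)), W in R^(32x256), implemented as a 1x1 convolution
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True))
        # select: one soft attention map per branch, recovered from the reduced vector z
        self.expand = nn.Conv2d(reduced, channels * 3, kernel_size=1)

    def forward(self, x):
        u_a, u_b, u_c = self.conv_a(x), self.conv_b(x), self.conv_c(x)
        u = u_a + u_b + u_c
        # global statistics s over the whole frame (global average pooling is an assumption)
        s = u.mean(dim=(2, 3), keepdim=True)                  # (N, 256, 1, 1)
        z = self.reduce(s)                                    # (N, 32, 1, 1)
        attn = self.expand(z).view(x.size(0), 3, -1, 1, 1)    # (N, 3, 256, 1, 1)
        v_a, v_b, v_c = F.softmax(attn, dim=1).unbind(dim=1)  # soft attention v_A, v_B, v_C
        return v_a * u_a + v_b * u_b + v_c * u_c              # fused feature V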
step S222: constructing the classification-regression branches; the 2 branches have the same structure, each consisting of 4 convolution layers with 3 × 3 kernels;
step S223: constructing the intersection-over-union prediction module, which is used to predict the intersection-over-union (IOU) between the real box and the prediction box, expressed as a formula:

IOU = |B ∩ B*| / |B ∪ B*|

where B* denotes the real box and B denotes the prediction box;
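A minimal sketch of the intersection-over-union of step S223, assuming boxes in (x1, y1, x2, y2) format:

def iou(box_pred, box_real):
    """IOU = |B ∩ B*| / |B ∪ B*| for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_pred[0], box_real[0]), max(box_pred[1], box_real[1])
    ix2, iy2 = min(box_pred[2], box_real[2]), min(box_pred[3], box_real[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_p = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    area_r = (box_real[2] - box_real[0]) * (box_real[3] - box_real[1])
    union = area_p + area_r - inter
    return inter / union if union > 0 else 0.0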
step S224: constructing the classification prediction module, which outputs a feature map p_cls; its main function is to predict the probability that the region a feature point maps back to in the original image is the target;
step S225: constructing the regression prediction module, which outputs a feature map p_reg; its main function is to predict the position of the target box;
step S23: network training. A pair of template and search pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing ResNet50 model, which outputs feature matrices A_1 and A_2 of sizes 15 × 15 and 31 × 31 respectively, each with 256 channels. After passing through the dynamic global information modeling module, A_1 and A_2 are cross-correlated to obtain the classification-branch input A_cls and the regression-branch input A_reg, both of size 25 × 25 with 256 channels. A_reg passes through the regression-branch convolution group to obtain the feature maps p_iou and p_reg: p_iou has size 25 × 25 and 1 channel and predicts the IOU between the real box and the predicted box, while p_reg has size 25 × 25 and 4 channels and represents the position of the prediction box. A_cls passes through the classification-branch convolution group to obtain the feature map p_cls, of size 25 × 25 with 1 channel, which predicts the probability that the region a feature point maps back to in the original image is the target;
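The data flow of step S23 can be illustrated with the depth-wise cross-correlation sketch below. The group-convolution implementation is a common choice assumed here; note that obtaining the stated 25 × 25 maps from a 31 × 31 search feature implies a 7 × 7 kernel (31 − 7 + 1 = 25), so a central 7 × 7 crop of the 15 × 15 template feature is assumed.

import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Depth-wise cross correlation: each template channel is correlated with the
    matching search channel. search_feat: (N, C, 31, 31), template_feat: (N, C, k, k)."""
    n, c, h, w = search_feat.shape
    kernel = template_feat.reshape(n * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search_feat.reshape(1, n * c, h, w), kernel, groups=n * c)
    return out.reshape(n, c, out.shape[2], out.shape[3])

# Sizes from step S23: template feature 15 x 15, search feature 31 x 31, 256 channels.
template = torch.randn(1, 256, 15, 15)
search = torch.randn(1, 256, 31, 31)
# Assumption: the central 7 x 7 region of the template feature is used as the kernel,
# which yields the stated 25 x 25 response map (31 - 7 + 1 = 25).
a_cls = depthwise_xcorr(search, template[:, :, 4:11, 4:11])
print(a_cls.shape)   # torch.Size([1, 256, 25, 25])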
step S24: computing the loss function and, taking it as the reference, training the neural network through the back-propagation algorithm, optimizing the corresponding weights, biases and other parameters so that the convolutional neural network tracks better and finally fits the training samples well. The loss function consists of 3 parts: a classification loss L_cls, an IOU prediction loss L_iou and a regression loss L_reg. In their definitions, I{·} is an indicator function whose value is 1 if the condition is satisfied and 0 otherwise, and y_(i,j) represents the state of point (i, j) on the feature map (whether or not it is a positive sample). For the classification task and the IOU prediction task, a binary cross-entropy loss function is adopted; their ground-truth labels are, respectively, a discrete value representing the state of point (i, j) on the feature map and a continuous value representing the IOU between the real box and the predicted box. Finally, the 3 losses are combined to obtain the total loss:

L = L_cls + λ1 · L_iou + λ2 · L_reg

In the process of training the model, λ1 = 1 and λ2 = 1.2.
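A minimal sketch of how the three-part loss of step S24 could be computed, assuming binary cross entropy for the classification and IOU-prediction terms as stated above, an IOU-based regression term restricted to positive points, and the weighted combination with λ1 = 1 and λ2 = 1.2; the exact per-term expressions in the patent's equations may differ.

import torch
import torch.nn.functional as F

def total_loss(p_cls, y_cls, p_iou, y_iou, box_iou, pos_mask, lambda1=1.0, lambda2=1.2):
    """p_cls, p_iou: (N, 1, 25, 25) logits from the classification and IOU heads.
    y_cls: float {0, 1} labels; y_iou: continuous IOU targets in [0, 1];
    box_iou: IOU between the predicted and real boxes at every location;
    pos_mask: the indicator I{.} marking positive feature-map points."""
    l_cls = F.binary_cross_entropy_with_logits(p_cls, y_cls)
    l_iou = F.binary_cross_entropy_with_logits(p_iou, y_iou)
    pos = pos_mask.bool()
    # assumed regression term: 1 - IOU, averaged over positive points only
    l_reg = (1.0 - box_iou[pos]).mean() if pos.any() else p_cls.sum() * 0.0
    return l_cls + lambda1 * l_iou + lambda2 * l_reg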
The step S3 further includes:
step S31: acquiring a test video, parsing the video into picture frames, naming all pictures in temporal order (for example, the first frame is named 00001.jpg), and placing them in the same folder;
step S32: inputting the first frame picture as a template picture into a network;
step S33: inputting the subsequent frame pictures into the network as search pictures, and predicting the position of the target in each frame;
step S34: using the target position predicted in each frame to obtain the tracking success rate and tracking accuracy, where the success rate requires computing the overlap rate between the prediction box and the real target box, and the accuracy requires computing the center offset distance between the prediction box and the real target box.
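The success rate and accuracy of step S34 could be computed as in the sketch below, assuming the common convention that success counts frames whose overlap exceeds a threshold and accuracy counts frames whose center offset stays below a pixel threshold (both thresholds are assumptions, not values from the patent).

import numpy as np

def evaluate(pred_boxes, real_boxes, overlap_thr=0.5, center_thr=20.0):
    """pred_boxes, real_boxes: arrays of shape (T, 4) as (x1, y1, x2, y2), one row per frame."""
    pred_boxes, real_boxes = np.asarray(pred_boxes, float), np.asarray(real_boxes, float)
    # overlap rate (IOU) between the prediction box and the real target box
    ix1 = np.maximum(pred_boxes[:, 0], real_boxes[:, 0])
    iy1 = np.maximum(pred_boxes[:, 1], real_boxes[:, 1])
    ix2 = np.minimum(pred_boxes[:, 2], real_boxes[:, 2])
    iy2 = np.minimum(pred_boxes[:, 3], real_boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_r = (real_boxes[:, 2] - real_boxes[:, 0]) * (real_boxes[:, 3] - real_boxes[:, 1])
    overlap = inter / np.maximum(area_p + area_r - inter, 1e-9)
    # center offset distance between the prediction box and the real target box
    cp = (pred_boxes[:, :2] + pred_boxes[:, 2:]) / 2.0
    cr = (real_boxes[:, :2] + real_boxes[:, 2:]) / 2.0
    center_dist = np.linalg.norm(cp - cr, axis=1)
    success_rate = float((overlap > overlap_thr).mean())
    accuracy = float((center_dist < center_thr).mean())   # often called precision in the literature
    return success_rate, accuracy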
In the above technical scheme, the specific implementation process is as follows:
(1) Network building. The invention adopts a modified convolutional neural network ResNet50 as the feature-extraction framework: a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence. A pair of pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing framework; the 2 output feature maps enter the classification-regression and intersection-over-union (IOU) prediction sub-network, and after the sub-network's dynamic receptive-field fusion and global information modeling, a classification map, a regression map and an IOU prediction map are finally output;
(2) Network composition. The feature-extraction network is the modified convolutional neural network ResNet50, and the classification-regression and intersection-over-union (IOU) prediction sub-network comprises a dynamic global information modeling module, classification-regression branches, a classification prediction module, a regression prediction module and an IOU prediction module;
(3) Network training. n pairs of template and search pictures of sizes 127 × 127 and 255 × 255 are input to the weight-sharing ResNet50 model to obtain feature maps; the 2 feature maps are then fed into the classification-regression and IOU prediction sub-network to finally obtain the classification map, regression map and IOU prediction map. To shorten training time and train better, ResNet50 is loaded with parameters pre-trained on the ImageNet dataset before the network is trained.
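As an example of loading ImageNet pre-trained parameters into ResNet50 before training, a sketch using torchvision is given below; torchvision is an assumed source of the weights, and the stock (unmodified) ResNet-50 is used here, whereas the patent uses a modified version whose strides and channels yield 31 × 31, 256-channel features.

import torch
import torchvision

# Load an ImageNet pre-trained ResNet-50 and keep only the layers named in step S21:
# the 7x7 convolution, the 3x3 max pooling, and the first 3 convolution groups.
resnet = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3)

x = torch.randn(1, 3, 255, 255)
print(backbone(x).shape)  # (1, 1024, 16, 16) for the stock stride-16 ResNet-50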
(4) The convolutional neural network is trained by computing the loss function and, taking it as the reference, applying the back-propagation algorithm to optimize the corresponding weights, biases and other parameters, so that the convolutional neural network extracts features better and finally fits the training samples well.
(5) In the testing stage, mainstream target tracking test sets such as VOT2019, OTB100 and GOT-10k are used; different test sets are used, and it is ensured that the pictures used for testing do not appear in the training set. Verification against several mainstream algorithms yields the success rate, accuracy and other information.
Compared with the prior art, the invention has the following technical effects:
(1) The invention adopts the modified ResNet-50 as the network framework to obtain richer information. The anchor-free design (no prior box) avoids complicated hyper-parameter settings and makes the tracker more flexible; the network structure is simple yet achieves good results.
(2) The dynamic global information modeling module adopted by the invention provides convolution kernels with different sizes and proportions in the X and Y dimensions, so that the tracker can adaptively select convolution kernels for different targets and meet different receptive-field requirements. Meanwhile, the embedded global information modeling module can capture long-distance dependencies of the target to complete global information modeling.
(3) An IOU prediction branch sharing weights with the regression branch is introduced; it reduces the weight of prediction boxes far from the target and indirectly improves the robustness of the classification branch.
(4) As shown in figs. 4, 5 and 6, the tracker exceeds existing mainstream algorithms such as SiamFC and SiamRPN on test data sets including VOT2019, OTB100 and GOT-10k, and runs at 65 FPS, ensuring real-time performance.
Drawings
FIG. 1 is a flow chart of training according to one embodiment of the present invention;
FIG. 2 is a diagram of an overall network architecture according to one embodiment of the method of the present invention;
FIG. 3 is a block diagram of a dynamic global information modeling architecture in accordance with one embodiment of the method of the present invention;
FIG. 4 compares the Expected Average Overlap (EAO) measured on the test set VOT2019 for one embodiment of the present invention with other algorithms;
FIG. 5 compares the accuracy and success rate measured on the test set OTB100 for one embodiment of the present invention with other algorithms;
FIG. 6 compares the success rate measured on the test set GOT-10k for one embodiment of the present invention with other algorithms.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described below with reference to the accompanying drawings and specific examples. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, and all such variations fall within the scope of the present invention.
The invention relates to a single-target tracking algorithm based on dynamic global information modeling and a twin network: the twin network is built as the backbone network, the preprocessed data set is used for training, and the position of the target object in the video sequence is then determined from the output response map. By combining the dynamic global information modeling algorithm, the network receptive field can be adjusted dynamically according to the size and aspect ratio of the object, and the information of any two points in the global feature map can be modeled, which greatly improves tracking accuracy and preserves real-time performance while only slightly increasing network complexity.
As shown in fig. 1, the present invention includes 3 major steps, step S1: acquiring a training set and a test set, and preprocessing the training set; step S2: building a network, designing a loss function and training; step S3: verifying the effect of the algorithm by using a test set according to the network obtained in the steps S1 and S2; the following is a detailed description of the specific process of the tracking algorithm:
step S1: acquiring a training set and a test set, and preprocessing the training set;
step S2: building a network, designing a loss function and training;
step S3: according to the network obtained in steps S1 and S2, verifying the algorithm effect on the test set using metrics such as accuracy, success rate and expected average overlap;
the step S1 further includes:
step S11: cropping each original picture of the training set into new pictures of 127 × 127 and 255 × 255 centered on the target object; the regions of a new picture that extend beyond the original picture are padded with the channel-mean pixel value;
step S12: converting the picture information of the training set generated in S11, including the picture path, the coordinates of the top-left corner of the target object and the coordinates of the bottom-right corner of the target object, into JSON format for storage; in the training stage, pictures can be read from the stored picture paths, and the size and position information of the target object obtained at the same time;
the step S2 further includes:
step S21: building the twin sub-networks, using a modified ResNet50 as the feature-extraction backbone: the input first passes through a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence;
step S22: building the classification-regression and intersection-over-union (IOU) prediction sub-network;
the step S22 further includes:
step S221: constructing the dynamic global information modeling module, which consists of 3 parts: split, fuse and select. In the split part, the input feature X is convolved in parallel by 1 × 3, 3 × 3 and 3 × 1 kernels to obtain features U_A, U_B and U_C; the convolution kernels of the 3 branches differ in size, and U_A, U_B and U_C are fused according to learned weights. The purpose is to make the tracker more flexible, so that the receptive field can be selected dynamically according to the size and aspect ratio of the target. In the fuse part, the information of any two pixel points at arbitrary distance within the same frame is modeled to obtain a vector s, and the number of channels of s is reduced from 256 to 32 to obtain a vector z. The main function of this part is to transform the global information produced by the modeling and assign it to different channels; expressed as a formula:

z = δ(B(Ws))

where δ denotes the non-linear activation function, B denotes batch normalization, and W ∈ R^(32×256). The vector z is passed through a softmax operator to obtain soft attention vectors v_A, v_B and v_C for the features U_A, U_B and U_C. In the select part, the soft attention vectors v_A, v_B and v_C are multiplied with their respective feature matrices U_A, U_B and U_C and the results are summed to obtain the feature V;
step S222: constructing the classification-regression branches; the 2 branches have the same structure, each consisting of 4 convolution layers with 3 × 3 kernels;
step S223: constructing the intersection-over-union prediction module, which is used to predict the intersection-over-union (IOU) between the real box and the prediction box, expressed as a formula:

IOU = |B ∩ B*| / |B ∪ B*|

where B* denotes the real box and B denotes the prediction box;
step S224: constructing the classification prediction module, which outputs a feature map p_cls; its main function is to predict the probability that the region a feature point maps back to in the original image is the target;
step S225: constructing the regression prediction module, which outputs a feature map p_reg; its main function is to predict the position of the target box;
step S23: network training. A pair of template and search pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing ResNet50 model, which outputs feature matrices A_1 and A_2 of sizes 15 × 15 and 31 × 31 respectively, each with 256 channels. After passing through the dynamic global information modeling module, A_1 and A_2 are cross-correlated to obtain the classification-branch input A_cls and the regression-branch input A_reg, both of size 25 × 25 with 256 channels. A_reg passes through the regression-branch convolution group to obtain the feature maps p_iou and p_reg: p_iou has size 25 × 25 and 1 channel and predicts the IOU between the real box and the predicted box, while p_reg has size 25 × 25 and 4 channels and represents the position of the prediction box. A_cls passes through the classification-branch convolution group to obtain the feature map p_cls, of size 25 × 25 with 1 channel, which predicts the probability that the region a feature point maps back to in the original image is the target;
step S24: computing the loss function and, taking it as the reference, training the neural network through the back-propagation algorithm, optimizing the corresponding weights, biases and other parameters so that the convolutional neural network tracks better and finally fits the training samples well. The loss function consists of 3 parts: a classification loss L_cls, an IOU prediction loss L_iou and a regression loss L_reg. In their definitions, I{·} is an indicator function whose value is 1 if the condition is satisfied and 0 otherwise, and y_(i,j) represents the state of point (i, j) on the feature map (whether or not it is a positive sample). For the classification task and the IOU prediction task, a binary cross-entropy loss function is adopted; their ground-truth labels are, respectively, a discrete value representing the state of point (i, j) on the feature map and a continuous value representing the IOU between the real box and the predicted box. Finally, the 3 losses are combined to obtain the total loss:

L = L_cls + λ1 · L_iou + λ2 · L_reg

In the process of training the model, λ1 = 1 and λ2 = 1.2.
The step S3 further includes:
step S31: acquiring a test video, parsing the video into picture frames, naming all pictures in temporal order (for example, the first frame is named 00001.jpg), and placing them in the same folder;
step S32: inputting the first frame picture as a template picture into a network;
step S33: inputting the subsequent frame pictures into the network as search pictures, and predicting the position of the target in each frame;
step S34: using the target position predicted in each frame to obtain the tracking success rate and tracking accuracy, where the success rate requires computing the overlap rate between the prediction box and the real target box, and the accuracy requires computing the center offset distance between the prediction box and the real target box.
In conclusion, the modified ResNet-50 is adopted as the network framework to obtain richer information. The anchor-free design (no prior box) avoids complicated hyper-parameter settings and makes the tracker more flexible. Meanwhile, the dynamic global information modeling module provides convolution kernels with different sizes and proportions in the X and Y dimensions, so that the tracker can adaptively select convolution kernels for different targets and meet different receptive-field requirements, and the embedded global information modeling module can capture long-distance dependencies of the target to complete global information modeling.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (4)

1. A single target tracking algorithm based on dynamic global information modeling and twin networks is characterized in that: the method comprises the following steps:
step S1: acquiring a training set and a test set, and preprocessing the training set;
step S2: building a network, designing a loss function and training;
step S3: verifying the algorithm effect on the test set using metrics such as accuracy, success rate and expected average overlap, according to the network obtained in steps S1 and S2.
2. The single target tracking algorithm based on dynamic global information modeling and twin network as claimed in claim 1, wherein: the step S1 further includes:
step S11: cropping each original picture of the training set into new pictures of 127 × 127 and 255 × 255 centered on the target object; the regions of a new picture that extend beyond the original picture are padded with the channel-mean pixel value;
step S12: converting the picture information of the training set generated in S11, including the picture path, the coordinates of the top-left corner of the target object and the coordinates of the bottom-right corner of the target object, into JSON format for storage; in the training stage, pictures can be read from the stored picture paths, and the size and position information of the target object obtained at the same time.
3. The single target tracking algorithm based on dynamic global information modeling and twin network as claimed in claim 1, wherein: the step S2 further includes:
step S21: building the twin sub-networks, using a modified ResNet50 as the feature-extraction backbone: the input first passes through a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence;
step S22: building the classification-regression and intersection-over-union (IOU) prediction sub-network;
the step S22 further includes:
step S221: constructing the dynamic global information modeling module, which consists of 3 parts: split, fuse and select. In the split part, the input feature X is convolved in parallel by 1 × 3, 3 × 3 and 3 × 1 kernels to obtain features U_A, U_B and U_C; the convolution kernels of the 3 branches differ in size, and U_A, U_B and U_C are fused according to learned weights. The purpose is to make the tracker more flexible, so that the receptive field can be selected dynamically according to the size and aspect ratio of the target. In the fuse part, the information of any two pixel points at arbitrary distance within the same frame is modeled to obtain a vector s, and the number of channels of s is reduced from 256 to 32 to obtain a vector z. The main function of this part is to transform the global information produced by the modeling and assign it to different channels; expressed as a formula:

z = δ(B(Ws))

where δ denotes the non-linear activation function, B denotes batch normalization, and W ∈ R^(32×256). The vector z is passed through a softmax operator to obtain soft attention vectors v_A, v_B and v_C for the features U_A, U_B and U_C. In the select part, the soft attention vectors v_A, v_B and v_C are multiplied with their respective feature matrices U_A, U_B and U_C and the results are summed to obtain the feature V;
step S222: constructing the classification-regression branches; the 2 branches have the same structure, each consisting of 4 convolution layers with 3 × 3 kernels;
step S223: constructing the intersection-over-union prediction module, which is used to predict the intersection-over-union (IOU) between the real box and the prediction box, expressed as a formula:

IOU = |B ∩ B*| / |B ∪ B*|

where B* denotes the real box and B denotes the prediction box;
step S224: constructing the classification prediction module, which outputs a feature map p_cls; its main function is to predict the probability that the region a feature point maps back to in the original image is the target;
step S225: constructing the regression prediction module, which outputs a feature map p_reg; its main function is to predict the position of the target box;
step S23: network training. A pair of template and search pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing ResNet50 model, which outputs feature matrices A_1 and A_2 of sizes 15 × 15 and 31 × 31 respectively, each with 256 channels. After passing through the dynamic global information modeling module, A_1 and A_2 are cross-correlated to obtain the classification-branch input A_cls and the regression-branch input A_reg, both of size 25 × 25 with 256 channels. A_reg passes through the regression-branch convolution group to obtain the feature maps p_iou and p_reg: p_iou has size 25 × 25 and 1 channel and predicts the IOU between the real box and the predicted box, while p_reg has size 25 × 25 and 4 channels and represents the position of the prediction box. A_cls passes through the classification-branch convolution group to obtain the feature map p_cls, of size 25 × 25 with 1 channel, which predicts the probability that the region a feature point maps back to in the original image is the target;
step S24: computing the loss function and, taking it as the reference, training the neural network through the back-propagation algorithm, optimizing the corresponding weights, biases and other parameters so that the convolutional neural network tracks better and finally fits the training samples well. The loss function consists of 3 parts: a classification loss L_cls, an IOU prediction loss L_iou and a regression loss L_reg. In their definitions, I{·} is an indicator function whose value is 1 if the condition is satisfied and 0 otherwise, and y_(i,j) represents the state of point (i, j) on the feature map (whether or not it is a positive sample). For the classification task and the IOU prediction task, a binary cross-entropy loss function is adopted; their ground-truth labels are, respectively, a discrete value representing the state of point (i, j) on the feature map and a continuous value representing the IOU between the real box and the predicted box. Finally, the 3 losses are combined to obtain the total loss:

L = L_cls + λ1 · L_iou + λ2 · L_reg

In the process of training the model, λ1 = 1 and λ2 = 1.2.
4. The single target tracking algorithm based on dynamic global information modeling and twin network as claimed in claim 1, wherein: the step S3 further includes:
step S31: acquiring a test video, parsing the video into picture frames, naming all pictures in temporal order (for example, the first frame is named 00001.jpg), and placing them in the same folder;
step S32: inputting the first frame picture as a template picture into a network;
step S33: inputting the subsequent frame pictures into the network as search pictures, and predicting the position of the target in each frame;
step S34: using the target position predicted in each frame to obtain the tracking success rate and tracking accuracy, where the success rate requires computing the overlap rate between the prediction box and the real target box, and the accuracy requires computing the center offset distance between the prediction box and the real target box.
CN202110732036.9A 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network Active CN113609904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732036.9A CN113609904B (en) 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732036.9A CN113609904B (en) 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network

Publications (2)

Publication Number Publication Date
CN113609904A true CN113609904A (en) 2021-11-05
CN113609904B CN113609904B (en) 2024-03-29

Family

ID=78303871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732036.9A Active CN113609904B (en) 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network

Country Status (1)

Country Link
CN (1) CN113609904B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648703A (en) * 2022-04-08 2022-06-21 安徽工业大学 Fruit automatic picking method based on improved SimFC
CN114821431A (en) * 2022-05-05 2022-07-29 南京大学 Real-time multi-class multi-target tracking method in tunnel

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020016903A1 (en) * 1991-07-08 2002-02-07 Nguyen Le Trong High-performance, superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN112509008A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Target tracking method based on intersection-to-parallel ratio guided twin network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020016903A1 (en) * 1991-07-08 2002-02-07 Nguyen Le Trong High-performance, superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN112509008A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Target tracking method based on intersection-to-parallel ratio guided twin network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648703A (en) * 2022-04-08 2022-06-21 安徽工业大学 Fruit automatic picking method based on improved SimFC
CN114821431A (en) * 2022-05-05 2022-07-29 南京大学 Real-time multi-class multi-target tracking method in tunnel

Also Published As

Publication number Publication date
CN113609904B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN111523521B (en) Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN112990211B (en) Training method, image processing method and device for neural network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN111539343B (en) Black smoke vehicle detection method based on convolution attention network
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN111275171B (en) Small target detection method based on parameter sharing multi-scale super-division reconstruction
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN110826411B (en) Vehicle target rapid identification method based on unmanned aerial vehicle image
CN113609904A (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN112149620A (en) Method for constructing natural scene character region detection model based on no anchor point
CN113688830B (en) Deep learning target detection method based on center point regression
CN111797841A (en) Visual saliency detection method based on depth residual error network
CN116704273A (en) Self-adaptive infrared and visible light dual-mode fusion detection method
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
CN112766102A (en) Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN117034100A (en) Self-adaptive graph classification method, system, equipment and medium based on hierarchical pooling architecture
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN111079900B (en) Image processing method and device based on self-adaptive connection neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant