CN113609904A - Single-target tracking algorithm based on dynamic global information modeling and twin network - Google Patents

Single-target tracking algorithm based on dynamic global information modeling and twin network

Info

Publication number
CN113609904A
CN113609904A
Authority
CN
China
Prior art keywords
network
training
target
global information
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110732036.9A
Other languages
Chinese (zh)
Other versions
CN113609904B (en)
Inventor
盛庆华
黄箭
周超宇
李贺贺
李竹
殷海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110732036.9A priority Critical patent/CN113609904B/en
Publication of CN113609904A publication Critical patent/CN113609904A/en
Application granted granted Critical
Publication of CN113609904B publication Critical patent/CN113609904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-target tracking algorithm based on dynamic global information modeling and a twin (Siamese) network, comprising the following steps. Step S1: acquire a training set and a test set, and preprocess the training set. Step S2: build the network, design the loss function and train. Step S3: verify the effect of the algorithm on the test set using the network obtained in steps S1 and S2. In the technical scheme of the invention, a twin network is built as the backbone network, the preprocessed data set is used for training, and the position of the target object in the video sequence is then determined from the output response map. By combining the dynamic global information modeling algorithm, the network receptive field can be adjusted dynamically according to the size and aspect ratio of the object, and the information of any two points in the feature map can be modeled, which greatly improves tracking accuracy and preserves real-time performance while only slightly increasing network complexity.

Description

Single-target tracking algorithm based on dynamic global information modeling and twin network
Technical Field
The invention relates to the field of artificial intelligence and single target tracking, in particular to a single target tracking algorithm based on dynamic global information modeling and a twin network, which can be used for improving the accuracy and the real-time performance of single target tracking.
Background
Visual target tracking is an important branch of the computer vision field. Its purpose is to automatically, accurately and quickly locate an object in subsequent frames, given only the object's position in the first frame marked by a human. The main challenges of this task are object occlusion, target blurring, fast motion, deformation and illumination variation. Provided the tracker remains real-time, target tracking technology is applied in many fields, such as video surveillance, human-computer interaction and autonomous driving.
Current mainstream target tracking algorithms, such as SiamRPN, add a region proposal structure to the network, which consists of a template branch and a detection branch and enables offline end-to-end training. Although the anchor-based strategy allows the object's position to be predicted more accurately, the region proposal structure makes the tracker very sensitive to the number, size and aspect ratio of anchor boxes, so parameter tuning is difficult and labor-intensive. The SiamFC++ algorithm is based on an anchor-free strategy, does not rely on excessive prior knowledge, and regresses the target position adaptively. However, it also has problems: it ignores the different receptive-field requirements of targets of different scales and aspect ratios, and it does not model global context information, which can cause tracking failure in complex scenes involving highly variable object shapes, fast motion and occlusion.
Disclosure of Invention
Aiming at the defects of mainstream single-target tracking algorithms, namely difficult parameter tuning, a fixed receptive field and the lack of global information modeling, the invention designs a single-target tracking algorithm based on dynamic global information modeling and a twin network, so as to improve the success rate and accuracy of target tracking in difficult scenes.
The technical scheme for realizing the invention is as follows:
a single target tracking algorithm based on dynamic global information modeling and twin networks is characterized by comprising the following steps:
step S1: acquiring a training set and a test set, and preprocessing the training set;
step S2: building a network, designing a loss function and training;
step S3: according to the network obtained in steps S1 and S2, verifying the algorithm effect on the test set using metrics such as accuracy, success rate and expected average overlap;
the step S1 further includes:
step S11: cropping each original picture of the training set into new pictures of 127 × 127 and 255 × 255 centered on the target object; the regions of a new picture that extend beyond the original picture are padded with the channel-mean pixel value;
step S12: converting the picture information of the training set generated in S11, including the picture path, the coordinates of the top-left corner of the target object and the coordinates of the bottom-right corner of the target object, into JSON format for storage; in the training stage, pictures can be read from the stored picture paths, and the size and position information of the target object obtained at the same time;
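For illustration only, the following Python sketch shows one plausible implementation of steps S11 and S12: cropping a template picture (127 × 127) and a search picture (255 × 255) centered on the target, padding with the channel-mean pixel, and storing the picture information as JSON. The function names and the simple fixed-size crop (without any context scaling) are assumptions and are not taken from the patent.

import json
import cv2
import numpy as np

def crop_centered(image, box, out_size):
    """Crop a square patch of side out_size centered on box = (x1, y1, x2, y2).
    Regions falling outside the original picture are padded with the channel mean."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = out_size / 2.0
    mean_pixel = image.mean(axis=(0, 1))                      # per-channel mean used for padding
    patch = np.tile(mean_pixel, (out_size, out_size, 1)).astype(image.dtype)
    # part of the crop window that actually lies inside the original picture
    sx1, sy1 = int(max(cx - half, 0)), int(max(cy - half, 0))
    sx2, sy2 = int(min(cx + half, image.shape[1])), int(min(cy + half, image.shape[0]))
    # where that part lands inside the padded output patch
    dx1, dy1 = int(sx1 - (cx - half)), int(sy1 - (cy - half))
    patch[dy1:dy1 + (sy2 - sy1), dx1:dx1 + (sx2 - sx1)] = image[sy1:sy2, sx1:sx2]
    return patch

def preprocess_sample(image_path, box, json_path):
    image = cv2.imread(image_path)
    template = crop_centered(image, box, 127)                 # 127 x 127 template picture (step S11)
    search = crop_centered(image, box, 255)                   # 255 x 255 search picture (step S11)
    record = {"path": image_path,                             # picture information stored as JSON (step S12)
              "top_left": [box[0], box[1]],
              "bottom_right": [box[2], box[3]]}
    with open(json_path, "w") as f:
        json.dump(record, f)
    return template, search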
the step S2 further includes:
step S21: building the twin sub-networks, using a modified ResNet50 as the feature-extraction backbone: the input first passes through a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence;
step S22: building the classification-regression and intersection-over-union (IOU) prediction sub-network;
the step S22 further includes:
step S221: constructing the dynamic global information modeling module, which consists of 3 parts: split, fuse and select. In the split part, the input feature X is convolved in parallel by 1 × 3, 3 × 3 and 3 × 1 kernels to obtain features U_A, U_B and U_C; the convolution kernels of the 3 branches differ in size, and U_A, U_B and U_C are fused according to learned weights. The purpose is to make the tracker more flexible, so that the receptive field can be selected dynamically according to the size and aspect ratio of the target. In the fuse part, the information of any two pixel points at arbitrary distance within the same frame is modeled to obtain a vector s, and the number of channels of s is reduced from 256 to 32 to obtain a vector z. The main function of this part is to transform the global information produced by the modeling and assign it to different channels; expressed as a formula:

z = δ(B(Ws))

where δ denotes the non-linear activation function, B denotes batch normalization, and W ∈ R^(32×256). The vector z is passed through a softmax operator to obtain soft attention vectors v_A, v_B and v_C for the features U_A, U_B and U_C. In the select part, the soft attention vectors v_A, v_B and v_C are multiplied with their respective feature matrices U_A, U_B and U_C and the results are summed to obtain the feature V;
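The split-fuse-select structure of step S221 could be sketched in PyTorch as below. The layer names, the use of global average pooling to obtain the vector s, and the way the three soft attention vectors are produced from z are assumptions made for illustration; in particular, the patent's fuse part models pairs of pixels at arbitrary distance, which is only approximated here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGlobalInfoModule(nn.Module):
    """Split-fuse-select block: dynamic receptive field + global information modeling (sketch)."""
    def __init__(self, channels=256, reduced=32):
        super().__init__()
        # split: three parallel branches with differently shaped kernels (1x3, 3x3, 3x1)
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=(3, 3), padding=(1, 1))
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        # fuse: z = delta(BN(W s)), W in R^(32x256), implemented as a 1x1 convolution
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True))
        # select: one soft attention map per branch, recovered from the reduced vector z
        self.expand = nn.Conv2d(reduced, channels * 3, kernel_size=1)

    def forward(self, x):
        u_a, u_b, u_c = self.conv_a(x), self.conv_b(x), self.conv_c(x)
        u = u_a + u_b + u_c
        # global statistics s over the whole frame (global average pooling is an assumption)
        s = u.mean(dim=(2, 3), keepdim=True)                  # (N, 256, 1, 1)
        z = self.reduce(s)                                    # (N, 32, 1, 1)
        attn = self.expand(z).view(x.size(0), 3, -1, 1, 1)    # (N, 3, 256, 1, 1)
        v_a, v_b, v_c = F.softmax(attn, dim=1).unbind(dim=1)  # soft attention v_A, v_B, v_C
        return v_a * u_a + v_b * u_b + v_c * u_c              # fused feature V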
step S222: constructing the classification-regression branches; the 2 branches have the same structure, each consisting of 4 convolution layers with 3 × 3 kernels;
step S223: constructing the intersection-over-union prediction module, which is used to predict the intersection-over-union (IOU) between the real box and the prediction box, expressed as a formula:

IOU = |B ∩ B*| / |B ∪ B*|

where B* denotes the real box and B denotes the prediction box;
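A minimal sketch of the intersection-over-union of step S223, assuming boxes in (x1, y1, x2, y2) format:

def iou(box_pred, box_real):
    """IOU = |B ∩ B*| / |B ∪ B*| for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_pred[0], box_real[0]), max(box_pred[1], box_real[1])
    ix2, iy2 = min(box_pred[2], box_real[2]), min(box_pred[3], box_real[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_p = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    area_r = (box_real[2] - box_real[0]) * (box_real[3] - box_real[1])
    union = area_p + area_r - inter
    return inter / union if union > 0 else 0.0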
step S224: constructing the classification prediction module, which outputs a feature map p_cls; its main function is to predict the probability that the region a feature point maps back to in the original image is the target;
step S225: constructing the regression prediction module, which outputs a feature map p_reg; its main function is to predict the position of the target box;
step S23: network training. A pair of template and search pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing ResNet50 model, which outputs feature matrices A_1 and A_2 of sizes 15 × 15 and 31 × 31 respectively, each with 256 channels. After passing through the dynamic global information modeling module, A_1 and A_2 are cross-correlated to obtain the classification-branch input A_cls and the regression-branch input A_reg, both of size 25 × 25 with 256 channels. A_reg passes through the regression-branch convolution group to obtain the feature maps p_iou and p_reg: p_iou has size 25 × 25 and 1 channel and predicts the IOU between the real box and the predicted box, while p_reg has size 25 × 25 and 4 channels and represents the position of the prediction box. A_cls passes through the classification-branch convolution group to obtain the feature map p_cls, of size 25 × 25 with 1 channel, which predicts the probability that the region a feature point maps back to in the original image is the target;
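The data flow of step S23 can be illustrated with the depth-wise cross-correlation sketch below. The group-convolution implementation is a common choice assumed here; note that obtaining the stated 25 × 25 maps from a 31 × 31 search feature implies a 7 × 7 kernel (31 − 7 + 1 = 25), so a central 7 × 7 crop of the 15 × 15 template feature is assumed.

import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Depth-wise cross correlation: each template channel is correlated with the
    matching search channel. search_feat: (N, C, 31, 31), template_feat: (N, C, k, k)."""
    n, c, h, w = search_feat.shape
    kernel = template_feat.reshape(n * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search_feat.reshape(1, n * c, h, w), kernel, groups=n * c)
    return out.reshape(n, c, out.shape[2], out.shape[3])

# Sizes from step S23: template feature 15 x 15, search feature 31 x 31, 256 channels.
template = torch.randn(1, 256, 15, 15)
search = torch.randn(1, 256, 31, 31)
# Assumption: the central 7 x 7 region of the template feature is used as the kernel,
# which yields the stated 25 x 25 response map (31 - 7 + 1 = 25).
a_cls = depthwise_xcorr(search, template[:, :, 4:11, 4:11])
print(a_cls.shape)   # torch.Size([1, 256, 25, 25])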
step S24: computing the loss function and, taking it as the reference, training the neural network through the back-propagation algorithm, optimizing the corresponding weights, biases and other parameters so that the convolutional neural network tracks better and finally fits the training samples well. The loss function consists of 3 parts: a classification loss L_cls, an IOU prediction loss L_iou and a regression loss L_reg. In their definitions, I{·} is an indicator function whose value is 1 if the condition is satisfied and 0 otherwise, and y_(i,j) represents the state of point (i, j) on the feature map (whether or not it is a positive sample). For the classification task and the IOU prediction task, a binary cross-entropy loss function is adopted; their ground-truth labels are, respectively, a discrete value representing the state of point (i, j) on the feature map and a continuous value representing the IOU between the real box and the predicted box. Finally, the 3 losses are combined to obtain the total loss:

L = L_cls + λ1 · L_iou + λ2 · L_reg

In the process of training the model, λ1 = 1 and λ2 = 1.2.
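A minimal sketch of how the three-part loss of step S24 could be computed, assuming binary cross entropy for the classification and IOU-prediction terms as stated above, an IOU-based regression term restricted to positive points, and the weighted combination with λ1 = 1 and λ2 = 1.2; the exact per-term expressions in the patent's equations may differ.

import torch
import torch.nn.functional as F

def total_loss(p_cls, y_cls, p_iou, y_iou, box_iou, pos_mask, lambda1=1.0, lambda2=1.2):
    """p_cls, p_iou: (N, 1, 25, 25) logits from the classification and IOU heads.
    y_cls: float {0, 1} labels; y_iou: continuous IOU targets in [0, 1];
    box_iou: IOU between the predicted and real boxes at every location;
    pos_mask: the indicator I{.} marking positive feature-map points."""
    l_cls = F.binary_cross_entropy_with_logits(p_cls, y_cls)
    l_iou = F.binary_cross_entropy_with_logits(p_iou, y_iou)
    pos = pos_mask.bool()
    # assumed regression term: 1 - IOU, averaged over positive points only
    l_reg = (1.0 - box_iou[pos]).mean() if pos.any() else p_cls.sum() * 0.0
    return l_cls + lambda1 * l_iou + lambda2 * l_reg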
The step S3 further includes:
step S31: acquiring a test video, parsing the video into picture frames, naming all pictures in temporal order (for example, the first frame is named 00001.jpg), and placing them in the same folder;
step S32: inputting the first frame picture as a template picture into a network;
step S33: inputting the subsequent frame pictures into the network as search pictures, and predicting the position of the target in each frame;
step S34: using the target position predicted in each frame to obtain the tracking success rate and tracking accuracy, where the success rate requires computing the overlap rate between the prediction box and the real target box, and the accuracy requires computing the center offset distance between the prediction box and the real target box.
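The success rate and accuracy of step S34 could be computed as in the sketch below, assuming the common convention that success counts frames whose overlap exceeds a threshold and accuracy counts frames whose center offset stays below a pixel threshold (both thresholds are assumptions, not values from the patent).

import numpy as np

def evaluate(pred_boxes, real_boxes, overlap_thr=0.5, center_thr=20.0):
    """pred_boxes, real_boxes: arrays of shape (T, 4) as (x1, y1, x2, y2), one row per frame."""
    pred_boxes, real_boxes = np.asarray(pred_boxes, float), np.asarray(real_boxes, float)
    # overlap rate (IOU) between the prediction box and the real target box
    ix1 = np.maximum(pred_boxes[:, 0], real_boxes[:, 0])
    iy1 = np.maximum(pred_boxes[:, 1], real_boxes[:, 1])
    ix2 = np.minimum(pred_boxes[:, 2], real_boxes[:, 2])
    iy2 = np.minimum(pred_boxes[:, 3], real_boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_r = (real_boxes[:, 2] - real_boxes[:, 0]) * (real_boxes[:, 3] - real_boxes[:, 1])
    overlap = inter / np.maximum(area_p + area_r - inter, 1e-9)
    # center offset distance between the prediction box and the real target box
    cp = (pred_boxes[:, :2] + pred_boxes[:, 2:]) / 2.0
    cr = (real_boxes[:, :2] + real_boxes[:, 2:]) / 2.0
    center_dist = np.linalg.norm(cp - cr, axis=1)
    success_rate = float((overlap > overlap_thr).mean())
    accuracy = float((center_dist < center_thr).mean())   # often called precision in the literature
    return success_rate, accuracy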
In the above technical scheme, the specific implementation process is as follows:
(1) Network building. The invention adopts a modified convolutional neural network ResNet50 as the feature-extraction framework: a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence. A pair of pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing framework; the 2 output feature maps enter the classification-regression and intersection-over-union (IOU) prediction sub-network, and after the sub-network's dynamic receptive-field fusion and global information modeling, a classification map, a regression map and an IOU prediction map are finally output;
(2) Network composition. The feature-extraction network is the modified convolutional neural network ResNet50, and the classification-regression and intersection-over-union (IOU) prediction sub-network comprises a dynamic global information modeling module, classification-regression branches, a classification prediction module, a regression prediction module and an IOU prediction module;
(3) Network training. n pairs of template and search pictures of sizes 127 × 127 and 255 × 255 are input to the weight-sharing ResNet50 model to obtain feature maps; the 2 feature maps are then fed into the classification-regression and IOU prediction sub-network to finally obtain the classification map, regression map and IOU prediction map. To shorten training time and train better, ResNet50 is loaded with parameters pre-trained on the ImageNet dataset before the network is trained.
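As an example of loading ImageNet pre-trained parameters into ResNet50 before training, a sketch using torchvision is given below; torchvision is an assumed source of the weights, and the stock (unmodified) ResNet-50 is used here, whereas the patent uses a modified version whose strides and channels yield 31 × 31, 256-channel features.

import torch
import torchvision

# Load an ImageNet pre-trained ResNet-50 and keep only the layers named in step S21:
# the 7x7 convolution, the 3x3 max pooling, and the first 3 convolution groups.
resnet = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3)

x = torch.randn(1, 3, 255, 255)
print(backbone(x).shape)  # (1, 1024, 16, 16) for the stock stride-16 ResNet-50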
(4) The convolutional neural network is trained by computing the loss function and, taking it as the reference, applying the back-propagation algorithm to optimize the corresponding weights, biases and other parameters, so that the convolutional neural network extracts features better and finally fits the training samples well.
(5) In the testing stage, mainstream target tracking test sets such as VOT2019, OTB100 and GOT-10k are used; different test sets are used, and it is ensured that the pictures used for testing do not appear in the training set. Verification against several mainstream algorithms yields the success rate, accuracy and other information.
Compared with the prior art, the invention has the following technical effects:
(1) The invention adopts the modified ResNet-50 as the network framework to obtain richer information. The anchor-free design (no prior box) avoids complicated hyper-parameter settings and makes the tracker more flexible; the network structure is simple yet achieves good results.
(2) The dynamic global information modeling module adopted by the invention provides convolution kernels with different sizes and proportions in the X and Y dimensions, so that the tracker can adaptively select convolution kernels for different targets and meet different receptive-field requirements. Meanwhile, the embedded global information modeling module can capture long-distance dependencies of the target to complete global information modeling.
(3) An IOU prediction branch sharing weights with the regression branch is introduced; it reduces the weight of prediction boxes far from the target and indirectly improves the robustness of the classification branch.
(4) As shown in figs. 4, 5 and 6, the tracker exceeds existing mainstream algorithms such as SiamFC and SiamRPN on test data sets including VOT2019, OTB100 and GOT-10k, and runs at 65 FPS, ensuring real-time performance.
Drawings
FIG. 1 is a flow chart of training according to one embodiment of the present invention;
FIG. 2 is a diagram of an overall network architecture according to one embodiment of the method of the present invention;
FIG. 3 is a block diagram of a dynamic global information modeling architecture in accordance with one embodiment of the method of the present invention;
FIG. 4 compares the Expected Average Overlap (EAO) measured on the test set VOT2019 for one embodiment of the present invention with other algorithms;
FIG. 5 compares the accuracy and success rate measured on the test set OTB100 for one embodiment of the present invention with other algorithms;
FIG. 6 compares the success rate measured on the test set GOT-10k for one embodiment of the present invention with other algorithms.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described below with reference to the accompanying drawings and specific examples. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, and all such variations fall within the scope of the present invention.
The invention relates to a single-target tracking algorithm based on dynamic global information modeling and a twin network: the twin network is built as the backbone network, the preprocessed data set is used for training, and the position of the target object in the video sequence is then determined from the output response map. By combining the dynamic global information modeling algorithm, the network receptive field can be adjusted dynamically according to the size and aspect ratio of the object, and the information of any two points in the global feature map can be modeled, which greatly improves tracking accuracy and preserves real-time performance while only slightly increasing network complexity.
As shown in fig. 1, the present invention includes 3 major steps, step S1: acquiring a training set and a test set, and preprocessing the training set; step S2: building a network, designing a loss function and training; step S3: verifying the effect of the algorithm by using a test set according to the network obtained in the steps S1 and S2; the following is a detailed description of the specific process of the tracking algorithm:
step S1: acquiring a training set and a test set, and preprocessing the training set;
step S2: building a network, designing a loss function and training;
step S3: according to the network obtained in steps S1 and S2, verifying the algorithm effect on the test set using metrics such as accuracy, success rate and expected average overlap;
the step S1 further includes:
step S11: cropping each original picture of the training set into new pictures of 127 × 127 and 255 × 255 centered on the target object; the regions of a new picture that extend beyond the original picture are padded with the channel-mean pixel value;
step S12: converting the picture information of the training set generated in S11, including the picture path, the coordinates of the top-left corner of the target object and the coordinates of the bottom-right corner of the target object, into JSON format for storage; in the training stage, pictures can be read from the stored picture paths, and the size and position information of the target object obtained at the same time;
the step S2 further includes:
step S21: building the twin sub-networks, using a modified ResNet50 as the feature-extraction backbone: the input first passes through a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence;
step S22: building the classification-regression and intersection-over-union (IOU) prediction sub-network;
the step S22 further includes:
step S221: constructing the dynamic global information modeling module, which consists of 3 parts: split, fuse and select. In the split part, the input feature X is convolved in parallel by 1 × 3, 3 × 3 and 3 × 1 kernels to obtain features U_A, U_B and U_C; the convolution kernels of the 3 branches differ in size, and U_A, U_B and U_C are fused according to learned weights. The purpose is to make the tracker more flexible, so that the receptive field can be selected dynamically according to the size and aspect ratio of the target. In the fuse part, the information of any two pixel points at arbitrary distance within the same frame is modeled to obtain a vector s, and the number of channels of s is reduced from 256 to 32 to obtain a vector z. The main function of this part is to transform the global information produced by the modeling and assign it to different channels; expressed as a formula:

z = δ(B(Ws))

where δ denotes the non-linear activation function, B denotes batch normalization, and W ∈ R^(32×256). The vector z is passed through a softmax operator to obtain soft attention vectors v_A, v_B and v_C for the features U_A, U_B and U_C. In the select part, the soft attention vectors v_A, v_B and v_C are multiplied with their respective feature matrices U_A, U_B and U_C and the results are summed to obtain the feature V;
step S222: constructing the classification-regression branches; the 2 branches have the same structure, each consisting of 4 convolution layers with 3 × 3 kernels;
step S223: constructing the intersection-over-union prediction module, which is used to predict the intersection-over-union (IOU) between the real box and the prediction box, expressed as a formula:

IOU = |B ∩ B*| / |B ∪ B*|

where B* denotes the real box and B denotes the prediction box;
step S224: constructing the classification prediction module, which outputs a feature map p_cls; its main function is to predict the probability that the region a feature point maps back to in the original image is the target;
step S225: constructing the regression prediction module, which outputs a feature map p_reg; its main function is to predict the position of the target box;
step S23: network training. A pair of template and search pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing ResNet50 model, which outputs feature matrices A_1 and A_2 of sizes 15 × 15 and 31 × 31 respectively, each with 256 channels. After passing through the dynamic global information modeling module, A_1 and A_2 are cross-correlated to obtain the classification-branch input A_cls and the regression-branch input A_reg, both of size 25 × 25 with 256 channels. A_reg passes through the regression-branch convolution group to obtain the feature maps p_iou and p_reg: p_iou has size 25 × 25 and 1 channel and predicts the IOU between the real box and the predicted box, while p_reg has size 25 × 25 and 4 channels and represents the position of the prediction box. A_cls passes through the classification-branch convolution group to obtain the feature map p_cls, of size 25 × 25 with 1 channel, which predicts the probability that the region a feature point maps back to in the original image is the target;
step S24: computing the loss function and, taking it as the reference, training the neural network through the back-propagation algorithm, optimizing the corresponding weights, biases and other parameters so that the convolutional neural network tracks better and finally fits the training samples well. The loss function consists of 3 parts: a classification loss L_cls, an IOU prediction loss L_iou and a regression loss L_reg. In their definitions, I{·} is an indicator function whose value is 1 if the condition is satisfied and 0 otherwise, and y_(i,j) represents the state of point (i, j) on the feature map (whether or not it is a positive sample). For the classification task and the IOU prediction task, a binary cross-entropy loss function is adopted; their ground-truth labels are, respectively, a discrete value representing the state of point (i, j) on the feature map and a continuous value representing the IOU between the real box and the predicted box. Finally, the 3 losses are combined to obtain the total loss:

L = L_cls + λ1 · L_iou + λ2 · L_reg

In the process of training the model, λ1 = 1 and λ2 = 1.2.
The step S3 further includes:
step S31: acquiring a test video, parsing the video into picture frames, naming all pictures in temporal order (for example, the first frame is named 00001.jpg), and placing them in the same folder;
step S32: inputting the first frame picture as a template picture into a network;
step S33: inputting the subsequent frame pictures into the network as search pictures, and predicting the position of the target in each frame;
step S34: using the target position predicted in each frame to obtain the tracking success rate and tracking accuracy, where the success rate requires computing the overlap rate between the prediction box and the real target box, and the accuracy requires computing the center offset distance between the prediction box and the real target box.
In conclusion, the modified ResNet-50 is adopted as the network framework to obtain richer information. The anchor-free design (no prior box) avoids complicated hyper-parameter settings and makes the tracker more flexible. Meanwhile, the dynamic global information modeling module provides convolution kernels with different sizes and proportions in the X and Y dimensions, so that the tracker can adaptively select convolution kernels for different targets and meet different receptive-field requirements, and the embedded global information modeling module can capture long-distance dependencies of the target to complete global information modeling.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (4)

1. A single target tracking algorithm based on dynamic global information modeling and twin networks is characterized in that: the method comprises the following steps:
step S1: acquiring a training set and a test set, and preprocessing the training set;
step S2: building a network, designing a loss function and training;
step S3: verifying the algorithm effect on the test set using metrics such as accuracy, success rate and expected average overlap, according to the network obtained in steps S1 and S2.
2. The single target tracking algorithm based on dynamic global information modeling and twin network as claimed in claim 1, wherein: the step S1 further includes:
step S11: cropping each original picture of the training set into new pictures of 127 × 127 and 255 × 255 centered on the target object; the regions of a new picture that extend beyond the original picture are padded with the channel-mean pixel value;
step S12: converting the picture information of the training set generated in S11, including the picture path, the coordinates of the top-left corner of the target object and the coordinates of the bottom-right corner of the target object, into JSON format for storage; in the training stage, pictures can be read from the stored picture paths, and the size and position information of the target object obtained at the same time.
3. The single target tracking algorithm based on dynamic global information modeling and twin network as claimed in claim 1, wherein: the step S2 further includes:
step S21: building the twin sub-networks, using a modified ResNet50 as the feature-extraction backbone: the input first passes through a 7 × 7 convolution, then a 3 × 3 max-pooling layer, and finally 3 large convolution groups containing 3, 4 and 6 Bottleneck modules respectively, each Bottleneck consisting of 1 × 1, 3 × 3 and 1 × 1 convolutions in sequence;
step S22: building the classification-regression and intersection-over-union (IOU) prediction sub-network;
the step S22 further includes:
step S221: constructing the dynamic global information modeling module, which consists of 3 parts: split, fuse and select. In the split part, the input feature X is convolved in parallel by 1 × 3, 3 × 3 and 3 × 1 kernels to obtain features U_A, U_B and U_C; the convolution kernels of the 3 branches differ in size, and U_A, U_B and U_C are fused according to learned weights. The purpose is to make the tracker more flexible, so that the receptive field can be selected dynamically according to the size and aspect ratio of the target. In the fuse part, the information of any two pixel points at arbitrary distance within the same frame is modeled to obtain a vector s, and the number of channels of s is reduced from 256 to 32 to obtain a vector z. The main function of this part is to transform the global information produced by the modeling and assign it to different channels; expressed as a formula:

z = δ(B(Ws))

where δ denotes the non-linear activation function, B denotes batch normalization, and W ∈ R^(32×256). The vector z is passed through a softmax operator to obtain soft attention vectors v_A, v_B and v_C for the features U_A, U_B and U_C. In the select part, the soft attention vectors v_A, v_B and v_C are multiplied with their respective feature matrices U_A, U_B and U_C and the results are summed to obtain the feature V;
step S222: constructing the classification-regression branches; the 2 branches have the same structure, each consisting of 4 convolution layers with 3 × 3 kernels;
step S223: constructing the intersection-over-union prediction module, which is used to predict the intersection-over-union (IOU) between the real box and the prediction box, expressed as a formula:

IOU = |B ∩ B*| / |B ∪ B*|

where B* denotes the real box and B denotes the prediction box;
step S224: constructing the classification prediction module, which outputs a feature map p_cls; its main function is to predict the probability that the region a feature point maps back to in the original image is the target;
step S225: constructing the regression prediction module, which outputs a feature map p_reg; its main function is to predict the position of the target box;
step S23: network training. A pair of template and search pictures of sizes 127 × 127 and 255 × 255 is input to the weight-sharing ResNet50 model, which outputs feature matrices A_1 and A_2 of sizes 15 × 15 and 31 × 31 respectively, each with 256 channels. After passing through the dynamic global information modeling module, A_1 and A_2 are cross-correlated to obtain the classification-branch input A_cls and the regression-branch input A_reg, both of size 25 × 25 with 256 channels. A_reg passes through the regression-branch convolution group to obtain the feature maps p_iou and p_reg: p_iou has size 25 × 25 and 1 channel and predicts the IOU between the real box and the predicted box, while p_reg has size 25 × 25 and 4 channels and represents the position of the prediction box. A_cls passes through the classification-branch convolution group to obtain the feature map p_cls, of size 25 × 25 with 1 channel, which predicts the probability that the region a feature point maps back to in the original image is the target;
step S24: computing the loss function and, taking it as the reference, training the neural network through the back-propagation algorithm, optimizing the corresponding weights, biases and other parameters so that the convolutional neural network tracks better and finally fits the training samples well. The loss function consists of 3 parts: a classification loss L_cls, an IOU prediction loss L_iou and a regression loss L_reg. In their definitions, I{·} is an indicator function whose value is 1 if the condition is satisfied and 0 otherwise, and y_(i,j) represents the state of point (i, j) on the feature map (whether or not it is a positive sample). For the classification task and the IOU prediction task, a binary cross-entropy loss function is adopted; their ground-truth labels are, respectively, a discrete value representing the state of point (i, j) on the feature map and a continuous value representing the IOU between the real box and the predicted box. Finally, the 3 losses are combined to obtain the total loss:

L = L_cls + λ1 · L_iou + λ2 · L_reg

In the process of training the model, λ1 = 1 and λ2 = 1.2.
4. The single target tracking algorithm based on dynamic global information modeling and twin network as claimed in claim 1, wherein: the step S3 further includes:
step S31: acquiring a test video, parsing the video into picture frames, naming all pictures in temporal order (for example, the first frame is named 00001.jpg), and placing them in the same folder;
step S32: inputting the first frame picture as a template picture into a network;
step S33: inputting the subsequent frame pictures into the network as search pictures, and predicting the position of the target in each frame;
step S34: using the target position predicted in each frame to obtain the tracking success rate and tracking accuracy, where the success rate requires computing the overlap rate between the prediction box and the real target box, and the accuracy requires computing the center offset distance between the prediction box and the real target box.
CN202110732036.9A 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network Active CN113609904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732036.9A CN113609904B (en) 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732036.9A CN113609904B (en) 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network

Publications (2)

Publication Number Publication Date
CN113609904A true CN113609904A (en) 2021-11-05
CN113609904B CN113609904B (en) 2024-03-29

Family

ID=78303871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732036.9A Active CN113609904B (en) 2021-06-30 2021-06-30 Single-target tracking algorithm based on dynamic global information modeling and twin network

Country Status (1)

Country Link
CN (1) CN113609904B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648703A (en) * 2022-04-08 2022-06-21 安徽工业大学 Fruit automatic picking method based on improved SimFC
CN114821431A (en) * 2022-05-05 2022-07-29 南京大学 Real-time multi-class multi-target tracking method in tunnel

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020016903A1 (en) * 1991-07-08 2002-02-07 Nguyen Le Trong High-performance, superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN112509008A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Target tracking method based on intersection-to-parallel ratio guided twin network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020016903A1 (en) * 1991-07-08 2002-02-07 Nguyen Le Trong High-performance, superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN112509008A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Target tracking method based on intersection-to-parallel ratio guided twin network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648703A (en) * 2022-04-08 2022-06-21 安徽工业大学 Fruit automatic picking method based on improved SimFC
CN114821431A (en) * 2022-05-05 2022-07-29 南京大学 Real-time multi-class multi-target tracking method in tunnel

Also Published As

Publication number Publication date
CN113609904B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN111523521B (en) Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN112990211B (en) Training method, image processing method and device for neural network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN111539343B (en) Black smoke vehicle detection method based on convolution attention network
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN111275171B (en) Small target detection method based on parameter sharing multi-scale super-division reconstruction
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN110826411B (en) Vehicle target rapid identification method based on unmanned aerial vehicle image
CN113609904A (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN112149620A (en) Method for constructing natural scene character region detection model based on no anchor point
CN113688830B (en) Deep learning target detection method based on center point regression
CN111797841A (en) Visual saliency detection method based on depth residual error network
CN116704273A (en) Self-adaptive infrared and visible light dual-mode fusion detection method
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
CN112766102A (en) Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN117034100A (en) Self-adaptive graph classification method, system, equipment and medium based on hierarchical pooling architecture
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN111079900B (en) Image processing method and device based on self-adaptive connection neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant