CN115424177A - Twin network target tracking method based on incremental learning - Google Patents

Twin network target tracking method based on incremental learning

Info

Publication number
CN115424177A
Authority
CN
China
Prior art keywords
network
target
tracking
frame
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211073134.7A
Other languages
Chinese (zh)
Inventor
汲清波
陈奎丞
候长波
李子琦
吴江江
孔德强
戚宇飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202211073134.7A priority Critical patent/CN115424177A/en
Publication of CN115424177A publication Critical patent/CN115424177A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a twin network target tracking method based on incremental learning, which applies incremental learning to the updating process of a target tracking network model. First, the RPN of the tracking network SiamRPN++ is copied into a student model, and high-confidence targets generated during tracking are used as a small sample set for online training. Then, the small sample set generated from the previous frame is learned in an incremental fashion, and the student model is trained by means of domain expansion and knowledge distillation. Finally, the target information produced by the student network model and the teacher network model is dynamically weighted and fused to update the target position. By means of incremental learning, the offline-trained model acquires adaptive learning capability: the historical information of the target is used effectively during tracking, large-scale offline retraining of the model is avoided, and the ability of the twin network algorithm to handle target deformation during tracking is improved.

Description

Twin network target tracking method based on incremental learning
Technical Field
The invention relates to a twin network target tracking method based on incremental learning, and belongs to the technical field of target tracking of computer vision.
Background
Target tracking refers to specifying a target in the initial frame of a video sequence and continuously predicting its motion state in subsequent frames. The motion state generally refers to the position and size of the target object in a new video frame, represented by a rectangular bounding box.
Current tracking algorithms can be roughly divided into two types. The first applies the idea of correlation filtering to the tracking field and mainly includes kernelized circulant-structure and kernelized correlation filtering algorithms. These algorithms convert the solution of the tracker template from complex time-domain operations into point-wise multiplication in the Fourier domain, greatly reducing the amount of computation and markedly improving tracker speed, although the accuracy is not ideal. Correlation filtering variants that replace hand-crafted features with features extracted by a neural network can improve accuracy to some extent, but computational efficiency drops sharply during model updating. The second type is tracking algorithms based on deep learning; the recently developed twin (Siamese) network trackers perform well in both accuracy and speed. The fully convolutional twin network SiamFC is the pioneering work of this line: SiamFC acquires feature extraction capability through large-scale offline training, then computes the similarity between the region to be searched and the template image during tracking, and the position with the highest response is the estimated target location. A series of improvements on the SiamFC framework have since been proposed, including complementary twin network branches, attention mechanisms, graph convolutional neural networks, and reinforcement learning for adjusting model parameters. For example, Li et al. proposed SiamRPN, which introduces an RPN structure into SiamFC, obtaining the target location via a classification branch and an accurate estimate of the target scale via a regression branch.
Thereafter, DaSiamRPN improved the discriminative power of twin networks by mining negative sample pairs during the training phase. Li et al. introduced the ResNet backbone into the twin network and proposed the SiamRPN++ algorithm, achieving the then-best performance on multiple target tracking datasets. The twin region convolutional neural network Siam R-CNN proposed by Voigtlaender et al. uses re-detection to improve tracking success rate. UpdateNet updates the template using the initial frame, historical frames, and the current frame together, making full use of target appearance change information and improving the robustness of the tracking process. The DiMP algorithm augments samples through data enhancement, learns an online-updated branch by gradient descent, and updates the model intermittently, further improving the algorithm's overall performance. Ocean adopts an anchor-free design and uses a fast conjugate-gradient algorithm to train the online branch during inference, improving both precision and success rate in the tracking and segmentation directions.
Although the field of object tracking has developed rapidly in recent years, its inherent challenges still make tracking a difficult task: the target object continually undergoes deformation, occlusion, rotation, scale change, illumination change, and so on. In a tracked video sequence, the target object is usually specified in the initial frame. The twin network generates a target template from the initial frame, and target tracking is completed by template matching in subsequent frames. In a real scene, if the tracker relies only on the initial-frame target template, subsequent tracking may lose the target, because the target may deform or otherwise change during motion. Current target tracking techniques use the target's historical information to adjust the model mainly in three ways: fixed updating, adaptive updating, and online learning. Fixed updating linearly superimposes target templates during tracking with a constant update rate, but the update effect is poor and inflexible. Adaptive updating trains a separate network to learn the target's appearance change trend; for example, the update network of UpdateNet requires additional offline training dedicated to learning the motion changes of various objects, but the training stages are numerous and complex, training can only be tailored to individual test datasets, and generalization to other scenes is lacking. Online learning trains a separate feature extraction network online using the tracking result at the previous moment and performs inference with that network at the next moment.
For example, DiMP trains a separate small network concurrently during model inference: a tracked high-confidence target is generally selected as a sample, the small network is learned by gradient descent, and intermittent training keeps the network's inference speed acceptable. Given the lack of flexibility, complex training, and forgetting of old knowledge in these methods, it is necessary to design an effective tracking method for adjusting the model online.
Disclosure of Invention
In view of the drawbacks of the prior art, the present invention aims to provide a twin network target tracking method based on incremental learning: the tracking target undergoes domain expansion through a teacher-student network, and a feature change loss reduces the difference in feature mapping between the initial target template and the new target tracking result, thereby improving the network's ability to classify the foreground and background of the new target; a distillation loss prevents old knowledge from being forgotten during model training, enhances the continuous use of the target's historical appearance information, alleviates problems such as deformation during tracking, and improves the applicability of the model to different scenes.
To achieve this purpose, the technical scheme adopted by the invention is a twin network target tracking method based on incremental learning in which, through domain expansion training and distillation loss during online training of the student network, the model learns new appearance information of the target while not forgetting past characteristics. The method mainly comprises the following steps:
1. based on a target tracking model, the target tracking model comprises a twin network tracking module, a sample generation module, a template linear updating module, an online training module, a domain expansion module, a knowledge distillation module and a selection module, wherein the modules have the functions of:
The twin network tracking module performs similarity calculation between the target template and the search region of the current video frame; its inputs are the target template and the search region, the initial value of the target template is manually selected from the first frame, and its output is the tracking bounding box of the target in the next frame.
The sample library generation module randomly shifts the tracking box around a high-confidence target: a shifted box whose intersection-over-union (IoU) with the tracked target exceeds a threshold (generally set to 0.8) is a positive sample, and one whose IoU falls below a threshold (generally set to 0.1) is a negative sample. The ratio and number of positive and negative samples in the library are dynamic, determined by the tracking response obtained by the algorithm: when a high-confidence tracking result is produced, the proportion and number of positive samples increase; when a low-confidence result is produced, the proportion and number of negative samples increase. The sample library is replaced when the result of the next frame is generated;
the template linear updating module is used for carrying out linear weighting on the features of the tracked target in the initial frame and the features of the subsequently tracked high-confidence sample;
The online training module comprises the domain expansion module and the knowledge distillation module. At the initial frame, the RPN network of SiamRPN++ is copied as the student network, and the original RPN network is frozen as the teacher network that guides the student network's learning. The small sample library is converted into labelled image data for online training of the student network; the loss function uses cross-entropy loss to optimize the classification branch of the student network. A feature loss function and a distillation loss function are added to the domain expansion branch and the knowledge distillation branch respectively, and tracking continues after training; this module is a prerequisite for the domain expansion module and the knowledge distillation module. The domain expansion module freezes certain layers of the student network's classification branch during training to maintain its decision boundary, and uses the feature change loss to reduce the difference in feature mapping between the initial target template and the new target tracking result. The knowledge distillation module preserves the student network's ability to predict the target's past appearance information: to introduce the distillation loss as additional supervision from the old model, the same frame is fed to the teacher and student networks simultaneously, their results are predicted separately, and the teacher network's predictions guide the training process to prevent catastrophic forgetting;
The selection module completes the subsequent calculation of the twin network tracking box and fuses the classification branches of the student and teacher networks; the fusion may adopt linear weighting.
2. The twin tracking module comprises two branches and three region proposal network (RPN) units. Each branch is a feature extraction unit: the first branch extracts features of the target template, with the target template as input, and the second branch extracts features of the search region, with the search region as input; each branch extracts three features, namely shallow, middle, and deep features. The three region proposal network units form a cascade structure; after cascade fusion, operations such as window penalty are applied to the output, and the target bounding box of the search region is produced, where the fusion proportionality constant of each region proposal network is obtained through training. The inputs to the first, second, and third region proposal network units are, respectively, the shallow, middle, and deep features of the target template and the search region.
3. The working process of the online training module is as follows:
(1) Copying an RPN network of an original model into a student network during an initial frame, and freezing the RPN network of the original model into a teacher network;
(2) Generating a positive and negative sample library from the twin network tracking result, labelling positive samples as target and negative samples as background, and feeding them into the student network for normal training for only one epoch, with the learning rate consistent with the original algorithm's setting;
(3) A domain expansion module is added in the training process, and the difference of feature mapping between the initial target template and a new target tracking result is reduced by using feature change loss;
4. the working process of the domain expansion module is as follows:
(1) After feature extraction by the backbone network, the current positive and negative sample library passes through the classification and regression branches' depth-wise correlation convolution parts of the teacher and student networks to obtain the features before the Head is input;
(2) Calculating the mean square error between the pre-Head features obtained by the two networks from the backbone-extracted features, and using this feature change loss to reduce the difference in feature mapping between the initial target template and the new target tracking result; the feature loss function is:
$$L_{DE} = \alpha \, F\big(\varphi_{W_T}(x), \varphi_{W_S}(x)\big)$$

where $x$ is the feature map of the image extracted by the backbone network; $W_T$ and $W_S$ are the weights of the teacher model and the student model respectively; $\varphi(\cdot)$ denotes the features in the RPN before the classifier; $F(\cdot)$ is the mean square error (MSE); and $\alpha$ is a hyperparameter used to balance the loss.
(3) And after training, storing the parameters of the student model, waiting for the next frame of image to enter, and simultaneously sending the image to a teacher and a student network for normal tracking.
5. The working process of the distillation module is as follows:
(1) When a new video frame is sent to the twin network tracking module for tracking, the student model of the previous frame after on-line training is still used for carrying out relevant calculation, and then the credibility of the tracking results of the teacher network and the student network is compared;
(2) If the reliability of the teacher network is greater than that of the student network, distillation loss training is performed; if it is less than or equal to that of the student network, distillation loss training is not performed. The prediction results of the teacher and student networks are denoted $p_T$ and $p_S$ respectively, and the distillation loss is

$$L_{KD} = \beta \, F(y'_T, y'_S)$$

where $F(\cdot)$ is a loss function that reduces the difference between the two distributions and $\beta$ is a hyperparameter used to balance the loss; $p_{S'}$ is a variant of $p_S$, namely the student's prediction of the previous-frame information of the tracked object; $y'_T$ and $y'_S$ are variants of $p_T$ and $p_S$ called "soft labels", produced by a converter, a modified softmax function:

$$y'_i = \frac{\exp(p_i/T)}{\sum_j \exp(p_j/T)}$$

where $T$ is a smoothing parameter called the temperature. The higher the temperature, the softer the label, i.e. the flatter the probability distribution.
(3) After distillation loss training is carried out, weighting is carried out on the classification responses of the teacher network and the student network, and the weighted classification responses are sent to a selection module to finish final regression and other operations of the tracking frame.
The invention has the beneficial effects that:
the invention relates to a learning method for performing domain expansion on a tracked target by a teacher student network, which reduces the difference of feature mapping between an initial target template and a new target tracking result by using feature change loss, and further improves the foreground and background classification capability of the network on the new target. The distillation loss is utilized in the knowledge distillation module to prevent the model from forgetting old knowledge after training, the continuous utilization of the historical appearance characteristic information of the target can be enhanced, and then the problems of deformation, shielding and the like in the target tracking process are solved, and the universal capability of the model in different scenes is realized.
Drawings
FIG. 1 is a schematic diagram of an overall structure of a twin network target tracking method based on incremental learning provided by the present invention;
FIG. 2 is a flow chart of the modules of a twin network target tracking method based on incremental learning provided by the invention;
FIG. 3 is a schematic diagram of a teacher-student network architecture provided by the present invention;
FIG. 4 is a schematic diagram of the distillation loss provided by the present invention;
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
The embodiment of the invention provides a twin network target tracking method based on incremental learning, and referring to fig. 1 and fig. 2, the method comprises the following steps:
step 1: constructing a SiamRPN + + network framework;
the first branch is used for extracting the characteristics of the target template and inputting the characteristics as the target template, and the second branch is used for extracting the characteristics of the search area and inputting the characteristics as the search area; extracting three features, namely shallow, medium and deep features, from each branch, and outputting the features corresponding to a SiamRPN + + structure, namely the features of the second layer, the third layer and the fourth layer of the ResNet50 of the main network; the three area suggestion network units are of a cascade structure, the output of the three units after cascade fusion outputs a boundary box of a target of a search area through regression operations such as window punishment, and the like, wherein the fusion proportionality constant of each area suggestion network is obtained through training; the corresponding inputs to the first, second and third area proposal network elements are shallow, medium and deep features of the target template and the search area.
The network structure of SiamRPN + +, the ResNet50 network, and the RPN network are well known to those skilled in the art, and are not described in detail in the embodiments of the present invention.
Step 2: adding a sample generation module, a template linear updating module, an online training module, a domain expansion module, a knowledge distillation module and a selection module on the basis of a network structure of SiamRPN + +;
Wherein, taking the first and second frames as an example: the target template of the first frame and the image of the second frame are fed into SiamRPN++, and a normal tracking result and its reliability are obtained through the calculation of the original algorithm.
And step 3: entering a sample generation module:
and (4) sending the tracked result into a sample generation library, and generating positive and negative samples in a certain proportion according to the confidence coefficient, wherein the confidence coefficient is between 0 and 1.
For example, if the reliability of the tracking result is 0.8, positive and negative samples are generated at a ratio of 1:2, nine in total; the total number of samples is set manually, generally no more than 20, to prevent overfitting during subsequent online training. Positive and negative samples are generated by randomly shifting the tracking box: a shifted box whose intersection-over-union with the tracked target is above the positive-sample confidence threshold of 0.8 is a positive sample; one below 0.1 is set as a negative sample. The positive and negative samples are stored temporarily.
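The jitter-and-threshold sampling in this step can be sketched as follows. The 0.8/0.1 IoU thresholds follow the text; the box format (x, y, w, h), the shift range, and the sample counts are illustrative assumptions:

```python
import random

def iou(a, b):
    # boxes as (x, y, w, h); intersection-over-union
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def generate_samples(target_box, n_pos=3, n_neg=6, pos_thr=0.8, neg_thr=0.1,
                     max_shift=40.0, seed=0):
    # randomly jitter the tracking box; IoU > pos_thr -> positive sample,
    # IoU < neg_thr -> negative sample (thresholds per the patent: 0.8 / 0.1)
    rng = random.Random(seed)
    pos, neg = [], []
    while len(pos) < n_pos or len(neg) < n_neg:
        dx = rng.uniform(-max_shift, max_shift)
        dy = rng.uniform(-max_shift, max_shift)
        cand = (target_box[0] + dx, target_box[1] + dy,
                target_box[2], target_box[3])
        o = iou(cand, target_box)
        if o > pos_thr and len(pos) < n_pos:
            pos.append(cand)
        elif o < neg_thr and len(neg) < n_neg:
            neg.append(cand)
    return pos, neg
```

Rejection sampling is cheap here because only box coordinates are drawn; the actual image crops would be taken only for the accepted boxes.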
And 4, step 4: entering a template linear updating stage;
the new template takes the linear weighting of the target feature tracked from the previous frame and the target template from the initial frame.
Taking the first frame and the second frame as an example, after the step 3 is finished, sending the high-confidence-degree result tracked by the second frame into the backbone network to complete the feature extraction once, and performing linear weighting on the extracted features and the target features of the first frame to serve as a target template of the third frame. The linear weighting formula is as follows:
$$z' = (1 - Lr)\times z_0 + Lr\times z$$

where $z_0$ represents the first-frame target feature, $z$ represents the high-confidence target feature generated in the previous frame, $\times$ represents multiplication, and $Lr$ represents the update rate, typically set to a constant, e.g. 0.0102.
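The linear update above amounts to a single blend operation; a small numpy sketch (the function name and array shapes are illustrative):

```python
import numpy as np

def update_template(z0, z_prev, lr=0.0102):
    # z' = (1 - Lr) * z0 + Lr * z : blend the initial-frame template feature
    # z0 with the high-confidence target feature from the previous frame
    z0 = np.asarray(z0, dtype=float)
    z_prev = np.asarray(z_prev, dtype=float)
    return (1.0 - lr) * z0 + lr * z_prev
```

With the small default rate, the template stays anchored to the initial frame while slowly absorbing appearance change.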
And 5: entering an online training stage;
firstly, copying an RPN network model of the SimRPN + + network into a student network, freezing an original RPN network into a teacher network, taking a positive and negative sample library generated in the step 3 as data with label information, continuously training the student network by using cross entropy loss, and improving the training mode according to the original SimRPN + + training mode;
the training mode takes the first and second frames as an example, and specifically includes: the template is cut out with the object of the first frame as the center and resized to 127 × 127. Similarly, the second frame image is cropped on the current frame to be twice as large as the template, and then resized to 255 × 255. The loss is cross entropy and regression loss (smooth L1 loss), the learning rate is 0.001, the period is one round, and the training process is optimized by using an SGD optimizer.
The cross entropy and the regression loss are well known to those skilled in the art, and are not described in detail in the embodiments of the present invention.
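A toy stand-in for the one-epoch online training described above: a single SGD pass minimising binary cross-entropy on a linear classifier rather than the actual RPN classification branch. All names and shapes are illustrative; only the learning rate (0.001) and the one-epoch schedule follow the text:

```python
import numpy as np

def sgd_epoch(W, b, feats, labels, lr=0.001):
    """One epoch of SGD on binary cross-entropy, mirroring the one-round
    online training of the student's classification branch."""
    for x, y in zip(feats, labels):
        z = float(W @ x + b)
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid prediction
        g = p - y                      # gradient of BCE w.r.t. the logit
        W = W - lr * g * x
        b = b - lr * g
    return W, b
```

In the real method this role is played by an SGD optimizer stepping through the positive/negative sample library once per frame.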
Step 6: an entry domain expansion module;
and keeping the decision boundary of the classification branch of the student network on the original target characteristic by utilizing the characteristic loss function. As shown in fig. 3. Taking the first frame and the second frame as an example, in the step 5, after the positive and negative sample base of the second frame is subjected to the feature extraction of the main network, the features before the Head is input are obtained through the classification of the teacher and student networks and the deep correlation convolution part of the regression branch.
Calculating the mean square error of the features before the input of the Head and the features extracted by the backbone network obtained by the two networks, and reducing the difference of feature mapping between an initial target template and a new target tracking result by using the feature change loss, wherein the feature loss function is as follows:
$$L_{DE} = \alpha \, F\big(\varphi_{W_T}(x), \varphi_{W_S}(x)\big)$$

where $x$ is the feature map of the image extracted by the backbone network; $W_T$ and $W_S$ are the weights of the teacher model and the student model respectively; $\varphi(\cdot)$ denotes the features in the RPN network before the classifier; $F(\cdot)$ is the mean square error (MSE); and $\alpha$ is a hyperparameter used to balance the loss.
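The feature change loss reduces to an α-scaled MSE between the teacher's and student's pre-Head feature maps; a minimal numpy sketch (the feature shapes and the α value are illustrative):

```python
import numpy as np

def feature_change_loss(feat_teacher, feat_student, alpha=0.5):
    # L_DE = alpha * MSE(phi_{W_T}(x), phi_{W_S}(x)): mean squared error
    # between teacher and student pre-classifier features, scaled by alpha
    t = np.asarray(feat_teacher, dtype=float)
    s = np.asarray(feat_student, dtype=float)
    return alpha * float(np.mean((t - s) ** 2))
```

The loss is zero when the student reproduces the teacher's feature mapping exactly and grows quadratically as the mappings diverge.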
And 7: entering a selection module;
and 6, according to the comparison of the class responses tracked by the teachers and the students after training in the step 6, determining whether the next training is carried out knowledge distillation treatment or not.
Taking the second and third frames as an example, the student network trained with the positive and negative samples generated from the second frame and the teacher network both perform tracking calculation on the third-frame image; the confidences of the two networks' tracking results are then compared, and if the teacher network's confidence is greater than the student network's, the teacher network guides the student network's next training round; otherwise it does not.
And 8: entering a knowledge distillation module;
to preserve the predictive power of student networks on past appearance information of tracked targets, distillation loss was introduced in the old model as an extra supervision to prevent catastrophic forgetfulness, as in fig. 4.
When a new video frame is sent to the twin tracking module for tracking, the student model parameters trained online in steps 5 and 6 on the previous frame are used, and the reliabilities of the teacher and student networks' tracking results are compared as in step 7. If the teacher network's confidence is greater than the student network's, distillation loss training is performed; otherwise it is not.
Taking the first, second, and third frames as an example, the predictions of the teacher network and the student network on the third frame are denoted p_T and p_S, and the distillation loss is

L_dis = β · F(y'_T, y'_S)

where F(·) is a loss function that reduces the difference between the two distributions, and β is a hyper-parameter that balances the loss. p_{S'} is a variant of p_S: the student network's prediction based on the previous-frame information of the tracked target. y'_T and y'_S are variants of p_T and p_S, called "soft labels". The converter is a modified softmax function:

y'_i = exp(p_i / T) / Σ_j exp(p_j / T)

In the formula, T is a smoothing parameter called the temperature. The higher the temperature, the softer the labels, i.e., the flatter the probability distribution.
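A minimal sketch of the soft-label conversion and distillation loss, assuming MSE as the concrete choice of F(·) and plain Python lists for the logits; the function names and default values are illustrative:

```python
import math

def soft_labels(logits, temperature=2.0):
    """Modified softmax: y'_i = exp(p_i / T) / sum_j exp(p_j / T).
    A higher temperature T flattens the distribution ("softer" labels)."""
    scaled = [p / temperature for p in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits,
                      temperature=2.0, beta=1.0):
    """L_dis = beta * F(y'_T, y'_S); MSE stands in for F(.) here."""
    y_t = soft_labels(teacher_logits, temperature)
    y_s = soft_labels(student_logits, temperature)
    return beta * sum((a - b) ** 2 for a, b in zip(y_t, y_s)) / len(y_t)
```

Identical teacher and student logits give zero loss, and raising T visibly flattens the resulting distribution.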
The classification responses of the teacher network and the student network are then weighted and sent to the selection module, which completes the final regression of the tracking box and the remaining operations.
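The weighted fusion of the two classification responses can be sketched as follows; the fixed weight is illustrative, since the text describes the weighting as dynamic:

```python
def fuse_responses(teacher_resp, student_resp, w_teacher=0.6):
    """Linear weighted fusion of the two classification response maps
    (flattened to lists); the fused map is then handed to the selection
    module for box regression. w_teacher is an illustrative placeholder
    for the dynamically chosen weight."""
    w_student = 1.0 - w_teacher
    return [w_teacher * t + w_student * s
            for t, s in zip(teacher_resp, student_resp)]
```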
The whole tracking process can therefore be summarized as follows: step 1 is executed once, copying the RPN network into the student model at the initial frame; steps 2 to 8 are then executed in order for every subsequent frame, forming the continuous tracking loop.
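The loop just summarized can be sketched at a high level; every callable below is an assumed interface standing in for the real networks, not the patent's actual API:

```python
def run_tracker(frames, init_student, track, train_student):
    """High-level sketch of the tracking loop described above.

    Assumed interfaces:
      init_student()                    -> student net copied from the teacher RPN
      track(net, frame)                 -> (box, confidence) for that network
      train_student(student, box, d)    -> student retrained online; d says
                                           whether distillation is enabled
    """
    student = init_student()          # step 1: one-time copy at the initial frame
    results = []
    for frame in frames:              # steps 2-8 repeat for every new frame
        t_box, t_conf = track("teacher", frame)
        s_box, s_conf = track(student, frame)
        distill = t_conf > s_conf     # selection module picks the supervision mode
        box = t_box if distill else s_box
        student = train_student(student, box, distill)
        results.append(box)
    return results
```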
The foregoing is illustrative only of the principles and efficacy of the invention, and is not limiting thereof. It will be appreciated by those skilled in the art that modifications may be made to the examples described above without departing from the spirit and scope of the invention.
In summary, the invention discloses a twin network target tracking method based on incremental learning, which applies incremental learning to the update process of a target tracking network model. First, the RPN (region proposal network) of the tracking network SiamRPN++ is copied as a student model, and the high-confidence targets generated during tracking are used as a small sample set for online training. Then, the small sample set generated from the previous frame is learned in an incremental fashion, the student model being trained through domain expansion and knowledge distillation. Finally, the target information generated by the student network model and by the teacher network model is fused by dynamic weighting to update the target position. Addressing the problems that existing model update methods lack flexibility, require complex training, and forget old knowledge, the method gives an offline-trained model adaptive learning capability through incremental learning; it effectively exploits the target's historical information during tracking, avoids large-scale offline retraining of the model, and improves the twin network algorithm's ability to handle target deformation and similar conditions during tracking.

Claims (7)

1. A method of twin network target tracking based on incremental learning, the method comprising:
on the basis of a twin network framework, using the target information of the first frame as a template, the current frame is fed into the tracking network to generate a new target position; a positive and negative sample library is generated from the high-confidence target samples produced during tracking, and the target template is fused in a linear updating manner;
taking the positive and negative sample library generated from high-confidence targets as new samples, the RPN part of the twin network framework is copied as a student network for incremental training, which comprises domain expansion and knowledge distillation of the student network, the purpose of the knowledge distillation being to prevent the model from forgetting old knowledge through a distillation loss; and
after the next frame of image is input, tracking is performed again with the student network trained on the previous frame and with the teacher network, i.e., the original network; the credibility of the targets generated by the teacher and student networks is judged from their classification response values, and the classification feature maps of the two networks are linearly weighted and fused to generate a high-confidence target sample again, which further updates the sample library and the incremental training of the student network, realizing continuous tracking.
2. The twin network target tracking method based on incremental learning of claim 1, wherein the twin network framework is SiamRPN++.
3. The twin network target tracking method based on incremental learning according to claim 1, wherein the generating of the positive and negative sample banks specifically comprises:
taking a high-confidence target as the center, the tracking box is randomly shifted; shifted boxes whose intersection-over-union with the tracked target exceeds a threshold, typically set greater than 0.8, are positive samples, and those whose intersection-over-union is below a threshold, typically set to 0.1, are negative samples; the positive-to-negative ratio and the number of samples in the library are dynamic values, determined by the tracking response obtained by the algorithm: when a high-confidence tracking result is produced, the proportion and number of positive samples increase; when a low-confidence tracking result is produced, the proportion and number of negative samples increase; and the sample library is replaced when the next frame's new result is generated.
4. The twin network target tracking method based on incremental learning according to claim 1, wherein the generated target template linear updating method specifically comprises:
z' = (1 − Lr) × z_0 + Lr × z

wherein z' is the updated template feature, z_0 denotes the first-frame target feature, z denotes the high-confidence target feature generated in the previous frame, × denotes multiplication, and Lr denotes the update rate, typically set to a constant, e.g., 0.0102.
5. The twin network target tracking method based on incremental learning according to claim 1, wherein the RPN network domain expansion manner is specifically:
in the initial frame, the RPN network of SiamRPN++ is copied as the student network and the original RPN network is frozen as the teacher network to guide the student network's learning; during this process, some layers of the student network's classification branch are frozen to maintain the decision boundary; in addition, a feature change loss is used to reduce the difference in feature mapping between the target template and the new target tracking result:

L_fea = α · F( f_{θ_T}(x), f_{θ_S}(x) )

in the formula, x is the feature map of the image extracted by the backbone network; θ_T and θ_S are the weights of the teacher model and the student model of the RPN part, respectively; f_{θ_T}(x) and f_{θ_S}(x) denote the features in the RPN before the classifier; F(·) is the mean squared error (MSE); and α is a hyper-parameter that balances the loss.
6. The twin network target tracking method based on incremental learning of claim 1, wherein the knowledge distillation process is as follows:
the teacher network and the student network are fed the same frame of image simultaneously, and their predictions are denoted p_T and p_S; so that the classification branch of the student network learns new target appearance information, it is optimized with the traditional cross-entropy loss:

L_ce = − Σ_i y_i log p_{S,i}

in addition, to preserve the student network's ability to predict from the target's past appearance information, p_T is used to guide the training process, and the following distillation loss with respect to the old model is introduced as extra supervision to prevent catastrophic forgetting:

L_dis = β · F(y'_T, y'_S)

wherein F(·) is a loss function that reduces the difference between the two distributions, and β is a hyper-parameter that balances the loss; p_{S'} is a variant of p_S: the student network's prediction based on the previous-frame information of the tracked target; y'_T and y'_S are variants of p_T and p_S, called "soft labels"; the converter is a modified softmax function:

y'_i = exp(p_i / T) / Σ_j exp(p_j / T)

in the formula, T is a smoothing parameter referred to as the temperature; the higher the temperature, the softer the labels, i.e., the flatter the probability distribution.
7. A twin network target tracking method based on incremental learning is characterized in that the modules and functions comprise:
the sample generation module, which, on the basis of the twin network framework, sends the initial-frame template and the current-frame image into the network to obtain a tracking result, selects high-confidence targets as positive samples, and generates a small positive and negative sample library according to intersection-over-union;
the template linear updating module, which feeds the initial-frame template and the subsequently tracked high-confidence target features into the network in a linear updating manner, adjusting the template information to update the target features while keeping the initial template's information always dominant;
the training module, comprising a domain expansion module and a knowledge distillation module, which copies the RPN network of SiamRPN++ as the student network in the initial frame, freezes the original RPN network as the teacher network to guide the student network's learning, and then trains the student network online, the loss functions being cross-entropy loss and regression loss, with a feature loss function added in the domain expansion branch and a distillation loss function added in the knowledge distillation branch, the target being tracked after training; and
the selection module, which completes the subsequent calculation of the tracking box output by the twin network and fuses the classification branches of the student network and the teacher network, the fusion being able to adopt linear weighting.
CN202211073134.7A 2022-09-02 2022-09-02 Twin network target tracking method based on incremental learning Pending CN115424177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211073134.7A CN115424177A (en) 2022-09-02 2022-09-02 Twin network target tracking method based on incremental learning


Publications (1)

Publication Number Publication Date
CN115424177A true CN115424177A (en) 2022-12-02

Family

ID=84201477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211073134.7A Pending CN115424177A (en) 2022-09-02 2022-09-02 Twin network target tracking method based on incremental learning

Country Status (1)

Country Link
CN (1) CN115424177A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797794A (en) * 2023-01-17 2023-03-14 南京理工大学 Knowledge distillation-based satellite video multi-target tracking method
CN116486203A (en) * 2023-04-24 2023-07-25 燕山大学 Single-target tracking method based on twin network and online template updating
CN116883459A (en) * 2023-09-07 2023-10-13 南昌工程学院 Dual knowledge distillation-based teacher and student network target tracking method and system
CN117726884A (en) * 2024-02-09 2024-03-19 腾讯科技(深圳)有限公司 Training method of object class identification model, object class identification method and device
CN117951653A (en) * 2024-01-31 2024-04-30 兰州理工大学 Smooth tracking method based on Student's t process regression



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination