CN113902773A - Long-term target tracking method using double detectors - Google Patents

Long-term target tracking method using double detectors

Info

Publication number
CN113902773A
Authority
CN
China
Prior art keywords
target
image
tracking
filter
detector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111119613.3A
Other languages
Chinese (zh)
Inventor
胡昭华
李奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202111119613.3A priority Critical patent/CN113902773A/en
Publication of CN113902773A publication Critical patent/CN113902773A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/10 - Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30241 - Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a long-term target tracking method using dual detectors, which comprises the following steps: extracting the fHOG features of an image containing the target and its background, inputting them into a pre-trained initial filter to calculate the maximum response value over that image, and taking the position of the maximum response value as the target position in the current image; updating this filter every frame; cutting out an image containing only the target at the estimated position, extracting its HOG features, inputting them into a second pre-trained filter, and calculating the maximum response value of the target image; when the maximum response value of the target image is less than a threshold, tracking is considered lost and a re-detection module is started to re-detect the target at the predicted position in the current picture; when it is greater than the threshold, tracking is considered correct and the second filter is updated. The invention re-detects the target with a dual-detector scheme: when the target is lost, the current search area can be examined several times, first by a fast detector and then by a deep detector, which increases the success rate of detection.

Description

Long-term target tracking method using double detectors
Technical Field
The invention relates to a long-term target tracking method using double detectors, belonging to the technical field of computer vision and image processing.
Background
Object tracking, one of the important research branches in the field of computer vision, has made great progress in the last decade. It is widely applied in fields such as medical treatment, intelligent transportation and autonomous driving. Its working principle is that the initial state of the target (namely its position and size in the image) is given only in the first frame of the video, and the computer is required to estimate the state of the target in the subsequent video sequence.
The target tracking technology currently has two main development directions: correlation filtering and deep learning. In recent years, target tracking algorithms based on correlation filtering have improved greatly, and many researchers have built on the MOSSE algorithm. For example, Henriques et al. (Henriques J F, Caseiro R, Martins P, et al. High-speed tracking with kernelized correlation filters [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596) proposed the kernelized correlation filter (KCF), which represents the target with multi-channel histogram of oriented gradients (HOG) features and uses a kernel function to handle linearly inseparable data, greatly improving the tracking speed. However, because the training samples of correlation filtering algorithms are all generated by cyclic shifts, the boundary effect inevitably degrades the tracking result. Danelljan et al. (Martin D, Gustav H, Fahad S K, et al. Learning spatially regularized correlation filters for visual tracking [C] // Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE Press, 2015: 4310-) introduced a spatial regularization term that penalizes filter coefficients outside the target region, thereby alleviating the boundary effect. However, this optimization increases the complexity of the algorithm and greatly reduces the tracking speed; moreover, a fixed negative-Gaussian-shaped matrix is used as the spatial regularization weight, so when errors occur during tracking the tracker cannot respond flexibly, which affects its performance. For long-term tracking, Ma et al. (Ma C, Yang X K, Zhang C Y, et al. Long-term correlation tracking [C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 5388-) added an online re-detection module to the correlation filtering framework so that the target can be recovered after tracking failure.
Disclosure of Invention
The invention aims to provide a long-term target tracking method using dual detectors, in order to overcome the defect that, when a target is occluded for a long time and then reappears, the tracker cannot re-identify it and tracking fails.
A method of long term object tracking using dual detectors, the method comprising the steps of:
extracting the fHOG features of the image containing the target and its background, inputting them into the pre-trained initial filter F1, calculating the maximum response value over the image containing the target background, and taking the position of the maximum response value as the target position in the current image; the filter F1 is updated every frame;
cutting out the image containing only the target at the target position in the current image, extracting its HOG features, inputting them into the pre-trained filter F2, and calculating the maximum response value of the target image;
comparing the maximum response value of the target image with a preset threshold: when the maximum response value of the target image is less than the threshold, tracking is considered lost and a re-detection module is started to re-detect the target at the predicted position in the current picture; when it is greater than the threshold, tracking is considered correct and the filter F2 is updated.
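To make the above flow concrete, the following minimal sketch (Python) lays out the per-frame decision logic under stated assumptions: the objects f1, f2 and det and their methods locate, update, max_response, redetect and update_svm are hypothetical placeholders standing in for the filter F1, the filter F2 and the re-detection module, and the two thresholds correspond to the update and re-detection thresholds described later.

```python
def track_frame(frame, f1, f2, det, T_a, T_b):
    """One step of the tracking loop described above (illustrative sketch).
    f1, f2 and det are hypothetical objects standing in for the position
    filter F1, the confidence filter F2 and the dual re-detection module."""
    # 1) Locate the target with filter F1 (F1 is updated every frame).
    pos = f1.locate(frame)
    f1.update(frame, pos)

    # 2) Score the target-only patch at the predicted position with F2.
    max_R = f2.max_response(frame, pos)

    # 3) Compare the confidence score against the two thresholds.
    if max_R < T_b:
        # Tracking lost: run the dual detectors (SVM detector first,
        # then the deep twin-network detector) around the predicted position.
        pos = det.redetect(frame, pos)
    elif max_R > T_a:
        # Tracking reliable: refresh F2 and the SVM detector with
        # samples drawn around the predicted target.
        f2.update(frame, pos)
        det.update_svm(frame, pos)
    # Between the two thresholds, nothing is updated and nothing re-detected.
    return pos
```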
Further, the objective function for solving the filter F1 is:

E(F, w) = (1/2)‖ Σ_{k=1}^{K} x_k ∗ f_k − y ‖² + (λ1/2) Σ_{k=1}^{K} ‖ w ⊙ f_k ‖² + (λ2/2) ‖ w − w_r ‖²

where the filter F = [f_1, f_2, ..., f_K]; the first term is a ridge regression term, K denotes the total number of channels, x_k denotes the k-th channel of the sample features, f_k denotes the corresponding k-th filter, and y is the desired response; in the second term, w is the adaptive spatial regularization weight, which is updated from the information of the current frame as each frame is tracked; in the third term, a prior reference weight w_r of w is introduced to prevent model degradation; λ1 and λ2 are regularization parameters.
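For illustration only, the following NumPy sketch evaluates a simplified, single-sample, spatial-domain version of such an objective (circular correlation per channel plus the two regularization terms); the function name, array shapes and random test data are assumptions made for the sketch, not the patent's implementation.

```python
import numpy as np

def objective(x, f, y, w, w_r, lam1, lam2):
    """Evaluate a simplified version of the objective:
       0.5 * || sum_k x_k * f_k - y ||^2
     + 0.5 * lam1 * sum_k || w . f_k ||^2
     + 0.5 * lam2 * || w - w_r ||^2
    x, f : (K, H, W) feature channels and per-channel filters
    y    : (H, W) desired response
    w    : (H, W) adaptive spatial regularization weight
    w_r  : (H, W) prior reference weight
    """
    K = x.shape[0]
    # Circular correlation per channel, computed in the Fourier domain.
    resp = np.zeros(y.shape, dtype=np.complex128)
    for k in range(K):
        resp += np.fft.ifft2(np.fft.fft2(x[k]) * np.conj(np.fft.fft2(f[k])))
    data_term = 0.5 * np.linalg.norm(resp.real - y) ** 2
    reg_term = 0.5 * lam1 * sum(np.linalg.norm(w * f[k]) ** 2 for k in range(K))
    prior_term = 0.5 * lam2 * np.linalg.norm(w - w_r) ** 2
    return data_term + reg_term + prior_term

# Tiny usage example with random data.
K, H, W = 3, 16, 16
rng = np.random.default_rng(0)
val = objective(rng.normal(size=(K, H, W)), rng.normal(size=(K, H, W)),
                rng.normal(size=(H, W)), np.ones((H, W)), np.ones((H, W)),
                lam1=1e-2, lam2=1e-1)
print(val)
```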
Further, the re-detection module comprises a support vector machine detector and a deep twin network detector; the support vector machine detector draws dense training samples around the tracked target position and scale and incrementally trains the classifier by assigning positive and negative labels to the samples according to their overlap rate with the target.
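A minimal sketch of how such positive and negative labels could be assigned to densely drawn samples is given below; the 0.7/0.3 overlap thresholds and the box format (x, y, w, h) are illustrative assumptions, since the text only states that labels are assigned according to the overlap rate with the target.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax1 + aw, bx1 + bw), min(ay1 + ah, by1 + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_samples(sample_boxes, target_box, pos_thr=0.7, neg_thr=0.3):
    """Assign +1 / -1 labels to densely drawn samples by their overlap
    with the tracked target; samples in between are discarded.
    The 0.7 / 0.3 thresholds are illustrative assumptions."""
    labeled = []
    for box in sample_boxes:
        o = iou(box, target_box)
        if o >= pos_thr:
            labeled.append((box, +1))
        elif o <= neg_thr:
            labeled.append((box, -1))
    return labeled

# Usage with a few samples drawn around a target box.
target = (50, 40, 30, 60)
samples = [(48, 38, 30, 60), (120, 40, 30, 60), (55, 45, 30, 60)]
print(label_samples(samples, target))
```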
Further, the training method of the support vector machine detector comprises the following steps:
when N samples are collected in one frame of image, the training set is {(v_i, c_i), i = 1, 2, ..., N}, where v_i is the feature vector of the i-th sample and the class label c_i ∈ {−1, +1};
let the loss function of the hyperplane h be l(h_i) = max{0, 1 − c_i⟨h_i, v_i⟩}, where ⟨h, v⟩ denotes the inner product of h and v;
an objective function over the hyperplane h is then formed from this loss and minimized online, where ∇_h l(h_i) denotes the gradient of the loss function with respect to the hyperplane h and τ ∈ (0, +∞) is a hyperparameter that controls the update rate of h.
Further, the support vector machine detector updates the hyperplane parameters with an online passive-aggressive learning algorithm, the hyperplane being computed as:

h_{i+1} = h_i − τ∇_h l(h_i)

where ∇_h l(h_i) is the gradient of the loss function with respect to the hyperplane h and τ ∈ (0, +∞) is the hyperparameter that controls the update rate of h.
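The following sketch shows one reading of this update: a hinge-loss gradient step with a fixed update-rate hyperparameter τ, where the "passive" branch leaves h unchanged when the sample is already classified with sufficient margin. It is an illustrative interpretation, not the patent's exact procedure.

```python
import numpy as np

def hinge_loss(h, v, c):
    """l(h) = max{0, 1 - c * <h, v>} for one labeled sample (v, c)."""
    return max(0.0, 1.0 - c * float(np.dot(h, v)))

def pa_update(h, v, c, tau):
    """One online update of the hyperplane h on sample (v, c).
    If the hinge loss is zero the sample is already classified with
    sufficient margin and h is left unchanged; otherwise h moves
    against the loss gradient, scaled by tau."""
    if hinge_loss(h, v, c) == 0.0:
        return h
    grad = -c * v            # gradient of the hinge loss w.r.t. h
    return h - tau * grad

# Usage: one update on a small toy sample.
h = np.zeros(4)
v = np.array([0.5, -1.0, 0.2, 0.0])
h = pa_update(h, v, c=+1, tau=0.5)
print(h)
```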
Further, the training method of the deep twin network detector comprises the following steps:
training the detector from the target information given in the first frame picture, namely classifying target and background patches with the k-means clustering method to form a target template pool;
obtaining a plurality of candidate regions, respectively calculating Euclidean distance between each candidate region and a target template through a twin network to be used as matching similarity, and selecting a region with the highest similarity as a tracking target; the similarity calculation formula is as follows:
ŝ = argmax_{s_i ∈ S} υ(p, s_i),  S = {s_1, s_2, ..., s_N}

where ŝ is the candidate region with the highest similarity score, S denotes the sample set of the N candidate regions, and υ(p, s_i) denotes the similarity score between the tracking target p and the i-th candidate sample s_i.
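The selection step can be sketched as follows, assuming the similarity υ(p, s_i) is taken as the negative Euclidean distance between twin-network embeddings (consistent with the text, since a smaller distance means a higher similarity); the embedding dimension and the random test data are placeholders.

```python
import numpy as np

def select_candidate(target_embedding, candidate_embeddings):
    """Pick the candidate region most similar to the target template.
    Similarity is taken as the negative Euclidean distance between the
    twin-network embeddings (assumption: smaller distance = higher similarity)."""
    dists = np.linalg.norm(candidate_embeddings - target_embedding, axis=1)
    scores = -dists                      # upsilon(p, s_i) for each candidate
    best = int(np.argmax(scores))        # region with the highest similarity
    return best, scores

# Usage with random embeddings for N = 5 candidates.
rng = np.random.default_rng(1)
p = rng.normal(size=128)
S = rng.normal(size=(5, 128))
idx, scores = select_candidate(p, S)
print(idx, scores)
```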
Further, the method for comparing the maximum response value of the target image with the preset thresholds comprises the following steps:
the maximum response value max_R is compared with the preset update threshold T_a and re-detection threshold T_b;
if max_R < T_b, tracking has failed; the detector part is activated, and the support vector machine detector and the deep twin network detector re-detect the target at the predicted position in the current picture;
if max_R > T_a, tracking is successful; the support vector machine detector is updated with dense samples drawn at the predicted target, and the filter F2 of the target information processing part is updated.
Compared with the prior art, the invention has the following beneficial effects:
the invention adopts a double-detector system to detect the target again, and can carry out multiple detection on the current search area in a quick detection and depth detection mode when the target is lost, thereby increasing the success rate of detection;
in the invention, a mode of self-adaptive spatial regularization weight is adopted, target information is fully highlighted, a target background is obviously inhibited, the influence of a boundary effect is reduced, and a more robust filter can be obtained;
in the optimization solution of the objective function, the objective function is optimized by adopting an alternating direction multiplier method, so that the calculated amount can be effectively reduced, and the running speed of the model is accelerated.
Drawings
FIG. 1 is an overall frame diagram of the present invention;
FIG. 2 is a schematic diagram of a re-detection module according to the present invention;
FIG. 3 is a comparison of long-term tracking effectiveness of the present invention;
FIG. 4 is a comparison graph of detector effectiveness tracking results of the present invention;
FIG. 5 is a graph comparing the overall performance of the present invention tracking on a data set;
FIG. 6 is a comparison of the tracking characteristic results on a data set of the present invention;
fig. 7 is a sample frame of an actual tracking result of the present invention.
Detailed Description
In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.
As shown in fig. 1-7, a long-term object tracking method using dual detectors is disclosed, the method comprising the steps of:
the method comprises the following steps: initializing, before performing target tracking, model initialization is performed according to target information given in the first frame,
First, the fHOG features of the image containing the target and its background are extracted to train the initial filter F1 of the target tracking part, which is used to calculate the target position in the subsequent tracking process and is updated every frame to adapt to changes of the target. Then the HOG features of the image containing only the target are extracted to train the filter F2, which is used to judge the current tracking state during tracking; this filter is updated only when the current tracking result is reliable, ensuring that it does not accumulate noise that would corrupt the judgment. Finally, the target image is used to train the support vector machine detector and the deep twin network detector, so that the target can be effectively recovered and the tracker corrected when the target is lost, improving robustness during long-term tracking.
Step two: calculating the target position. From the second frame on, the target tracking part extracts the features of the target search area of the current frame, calculates the response values, and takes the position of the maximum response value as the target position in the current image; at the same time, the filter F1 is updated every frame.
When solving the filter, an adaptive spatial regularization term is added to suppress the boundary effect and highlight the target information, and the objective function is optimized with the alternating direction method of multipliers so that the result is obtained more efficiently.
The objective function for solving the filter F1 is:

E(F, w) = (1/2)‖ Σ_{k=1}^{K} x_k ∗ f_k − y ‖² + (λ1/2) Σ_{k=1}^{K} ‖ w ⊙ f_k ‖² + (λ2/2) ‖ w − w_r ‖²    (1)

where the filter F = [f_1, f_2, ..., f_K]. The first term is a ridge regression term, K denotes the total number of channels, x_k denotes the k-th channel of the sample features, f_k denotes the corresponding k-th filter, and y is the desired response. In the second term, w is the adaptive spatial regularization weight, which is updated from the information of the current frame as each frame is tracked. In the third term, a prior reference weight w_r of w is introduced to prevent model degradation. λ1 and λ2 are regularization parameters.
Because the objective function has no closed-form solution, the computationally efficient alternating direction method of multipliers (ADMM) is adopted for the optimization. In formula (1), the second term is the added adaptive spatial regularization term; during the optimization of the objective function, a single ADMM pass is used to solve the spatial weight of the current frame once, realizing spatial-weight adaptation, so that the current filter has large weights at the target position and small weights in non-target areas, thereby suppressing the background.
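As a small illustration of the alternating-direction idea (not the patent's frequency-domain solver), the following sketch applies ADMM to a simplified spatially weighted ridge problem with the splitting f = g, so that one subproblem handles the data term and the other the weighted regularizer; all names, sizes and parameter values are assumptions for the sketch.

```python
import numpy as np

def admm_spatially_weighted_ridge(A, y, w, lam, mu=1.0, iters=50):
    """Illustrative ADMM for  min_f 0.5*||A f - y||^2 + 0.5*lam*||w . f||^2
    using the splitting f = g (f handles the data term, g the weighted
    regularizer).  A generic small dense-matrix sketch of the
    alternating-direction idea only."""
    n = A.shape[1]
    f = np.zeros(n)
    g = np.zeros(n)
    u = np.zeros(n)                       # scaled dual variable
    AtA = A.T @ A
    Aty = A.T @ y
    for _ in range(iters):
        # f-subproblem: (A^T A + mu I) f = A^T y + mu (g - u)
        f = np.linalg.solve(AtA + mu * np.eye(n), Aty + mu * (g - u))
        # g-subproblem: elementwise, since w acts as a diagonal weight
        g = mu * (f + u) / (lam * w ** 2 + mu)
        # dual update
        u = u + f - g
    return f

# Usage on a tiny random problem.
rng = np.random.default_rng(2)
A = rng.normal(size=(30, 10))
y = rng.normal(size=30)
w = np.linspace(0.5, 2.0, 10)
f = admm_spatially_weighted_ridge(A, y, w, lam=0.1)
print(np.round(f, 3))
```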
Step three: judging a tracking state; and determining whether the detector needs to be started or target information needs to be updated according to the current tracking state. The target information processing part cuts out an image only containing the target at the current position by using the estimated target position obtained in the step two, extracts the characteristics of the image, finally calculates the response value of the target image, obtains the current tracking state by comparing the magnitude relation between the maximum response value and the relevant threshold value, and executes different operations according to the difference of the tracking states:
if the maximum response value is less than TbAnd starting the detector to detect the target again: if the maximum response value is larger than TaThe SVM detector is updated to ensure that the latest and most reliable target information is stored in the detector. The method can detect the target again when the target is lost, and can update the target to adapt to the change of the target under the condition of reliable tracking.
Step four: re-detection. The invention mainly addresses long-term tracking; if the target is lost during tracking, the current tracking would fail, so a re-detection module is added that can re-detect the target and correct the tracker, ensuring that tracking continues smoothly.
in the third step, if the current estimated target position is not reliable enough, the detector is started to detect again, so that the algorithm execution efficiency is improved. The re-detection module is mainly composed of a support vector machine detector (Det)1) With depth twin network detector (Det)2) Two parts, during operation, through the cooperation of the two partsThe target is accurately and efficiently detected.
The support vector machine detector draws dense training samples around the tracked target position and scale, and incrementally trains the classifier by assigning positive and negative labels to the samples according to their overlap rate with the target. During tracking, this detector can quickly make a detection judgment on the target area of the current picture and is suitable for relatively simple picture information.
Assume that N samples are collected in one frame of image, so that the training set is {(v_i, c_i), i = 1, 2, ..., N}, where v_i is the feature vector of the i-th sample and the class label c_i ∈ {−1, +1}. Let the loss function of the hyperplane h be l(h_i) = max{0, 1 − c_i⟨h_i, v_i⟩}, where ⟨h, v⟩ denotes the inner product of h and v. An objective function over the hyperplane h is then formed from this loss.
In order to solve for the hyperplane more efficiently during subsequent updates of the detector, the invention uses an online passive-aggressive learning algorithm to update the hyperplane parameters of the classifier:

h_{i+1} = h_i − τ∇_h l(h_i)

where ∇_h l(h_i) is the gradient of the loss function with respect to the hyperplane h and τ ∈ (0, +∞) is the hyperparameter that controls the update rate of h.
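Detection with Det1 then amounts to scoring densely sampled candidate windows with the learned hyperplane; a minimal sketch is given below, where the candidate feature matrix is assumed to have been extracted upstream and the bias term is an illustrative addition.

```python
import numpy as np

def svm_detect(candidate_features, h, bias=0.0):
    """Score every densely sampled candidate window with the learned
    hyperplane h (a plain linear decision function <h, v> + b) and return
    the index and score of the best-scoring window."""
    scores = candidate_features @ h + bias
    best = int(np.argmax(scores))
    return best, scores[best]

# Usage: 100 candidate windows with 31-dimensional features.
rng = np.random.default_rng(3)
V = rng.normal(size=(100, 31))
h = rng.normal(size=31)
idx, score = svm_detect(V, h)
print(idx, round(float(score), 3))
```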
When the background information is more complicated, the detection performance of the support vector machine detector Det1 drops noticeably, and the twin network detector Det2 is enabled instead. Feature extraction and tracking based on deep learning can express rich target features and therefore remain robust in the face of complex picture information, but the complex network structure also greatly reduces the tracking speed. The invention therefore uses a fast and accurate twin network structure and extracts deep features of the image with the VGGNet convolutional neural network, so that clear target features can be extracted while the tracking speed is maintained. Since the detector needs to detect the target, an additional target template pool is added to the network structure; it processes the multiple candidate target regions generated from the picture and selects the region closest to the template branch as the target region. The candidate regions are selected in a square area centred at the position obtained by the target tracking part, whose side length is determined jointly by the estimated target width w and height h in the current frame and a weight coefficient ρ that controls the size of the region.
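A small sketch of cropping such a search region is shown below; the side length ρ·sqrt(w·h) and the default value of ρ are assumptions standing in for the region-size formula, which the text only characterizes through w, h and ρ.

```python
import numpy as np

def crop_search_region(frame, center, w, h, rho=2.5):
    """Crop a square search region centred on the position predicted by
    the tracking part.  The side length rho * sqrt(w * h) and the default
    rho value are assumptions; rho controls how much context is included."""
    side = int(round(rho * np.sqrt(w * h)))
    cx, cy = center
    x1 = max(0, int(cx - side // 2))
    y1 = max(0, int(cy - side // 2))
    x2 = min(frame.shape[1], x1 + side)
    y2 = min(frame.shape[0], y1 + side)
    return frame[y1:y2, x1:x2]

# Usage on a dummy grayscale frame.
frame = np.zeros((480, 640), dtype=np.uint8)
patch = crop_search_region(frame, center=(320, 240), w=60, h=40)
print(patch.shape)
```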
The twin network detector Det2 is trained from the target information given in the first frame picture, namely the k-means clustering method is used to classify target and background patches into a target template pool. In the subsequent tracking process, after a plurality of candidate regions have been obtained, the Euclidean distance between each candidate region and the target template is calculated through the twin network as the matching similarity, and the region with the highest similarity is selected as the tracking target:

ŝ = argmax_{s_i ∈ S} υ(p, s_i),  S = {s_1, s_2, ..., s_N}

where ŝ is the candidate region with the highest similarity score, S denotes the sample set of the N candidate regions, and υ(p, s_i) denotes the similarity score between the tracking target p and the i-th candidate sample s_i.
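Building the target template pool by clustering can be sketched as follows; the use of scikit-learn's KMeans, the number of templates and the embedding dimension are illustrative choices, since the text only states that k-means clustering separates target and background templates.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_template_pool(patch_embeddings, n_templates=5):
    """Cluster embeddings of target / background patches from the first
    frame with k-means and keep the cluster centres as the template pool.
    n_templates and the use of scikit-learn's KMeans are illustrative."""
    km = KMeans(n_clusters=n_templates, n_init=10, random_state=0)
    km.fit(patch_embeddings)
    return km.cluster_centers_

# Usage: 200 first-frame patch embeddings of dimension 128.
rng = np.random.default_rng(4)
pool = build_template_pool(rng.normal(size=(200, 128)))
print(pool.shape)   # (5, 128)
```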
In this embodiment, as shown in fig. 1, the model can be roughly divided into three parts: a target tracking section, a target information processing section and a detector section.
When processing the first frame image, the target tracking part extracts the features of the image containing the background and of the target-only image to train the initial filter F1 and the filter F2, respectively.
In the target information processing part, the tracking state is judged through the maximum response value, and the simpler histogram of oriented gradients (HOG) feature is used; in the target tracking part, an improved gradient orientation histogram feature, namely the 31-dimensional fHOG feature, is used, which better reflects the edge information of the target image and the local appearance and shape of the image.
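For the plain HOG feature used by the target information processing part, a minimal sketch based on scikit-image is shown below; the cell and block sizes are illustrative, and the 31-channel fHOG variant used by the tracking part is not covered here because it requires a dedicated implementation.

```python
import numpy as np
from skimage.feature import hog

def hog_feature(patch):
    """Plain histogram-of-oriented-gradients descriptor for a grayscale
    patch, as used to score the target-only image with filter F2.
    Cell and block sizes are illustrative choices."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# Usage on a dummy 64x64 patch.
patch = np.random.default_rng(5).random((64, 64))
print(hog_feature(patch).shape)
```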
Fig. 2 shows the specific composition of the detector section, i.e. the section mainly consists of a support vector machine detector and a twin network detector, denoted Det1 and Det2, respectively. The specific implementation of each part in the algorithm is as follows:
a target tracking section: firstly, a sliding window containing background information is cut out according to given target position information in a first frame, fHOG characteristics are extracted, and a filter F is trained1. And starting from the second frame, extracting features again according to the target position obtained in the previous frame, calculating a response graph, and predicting the position of the target according to the position of the maximum response value. Solving filter F1The target function of (2) is shown in formula (1); it is noted that a picture is represented numerically as a matrix, where each value in the matrix is a pixel. The extraction of picture features and the calculation of the response map are both matrix calculations. The resulting matrix of response values we call the response map.
Because the spatial regularization term added in formula (1) prevents a closed-form solution, the formula is iteratively optimized with the alternating direction method of multipliers to obtain the optimal solution and, finally, the latest filter. The spatial regularization weight added in formula (1) effectively highlights the target information and reduces the influence of the boundary effect; a single optimization step is performed on it during the solving process so that the spatial weight adapts itself.
The target information processing section: in this section, the target image is first cut out from the first frame, its HOG features are extracted, and the filter F2 is trained. From the second frame on, the target image of the current frame is cut out according to the position predicted by the target tracking part, its feature information is extracted, and the maximum response value max_R is calculated. The maximum response value is compared with the preset thresholds T_a (update threshold) and T_b (re-detection threshold). The action of the detector is determined by the relation between the maximum response value obtained by the target information processing part and these thresholds: if max_R < T_b, the detector parts Det1 and Det2 are activated to re-detect at the predicted position of the current picture; if max_R > T_a, Det1 is updated with dense samples at the predicted target and the filter F2 of the target information processing section is updated.
A detector section: in processing the first frame of image, the object and its surroundings are first densely sampled to obtain a large number of positive and negative samples about the object for training the initial detector Det1, and the detector Det2 is trained using the object image.
The re-detection module re-detects the target when the tracker fails during tracking. Det1 is updated with fresh positive and negative samples whenever the tracking reliability is high, ensuring that the detector keeps an up-to-date target state; when the tracking reliability is low, it detects the target in the current frame. Because Det2 uses a deep network, its detection result is more accurate but its operation takes more time. Weighing these factors, the method uses the faster Det1, whose detection results are sufficiently accurate, as the primary detector, and starts Det2 to re-detect the target when the result shows that detection has failed.
After the detector detects a new target position, whether the tracking is successful is judged by comparing the magnitude relation between the maximum response value at the position and a set threshold value, if so, the target position obtained by the detector is adopted, and the subsequent tracking is continued. In the tracking process, the problem of target re-retrieval under various target backgrounds can be effectively solved through the cooperative work of the double detectors, and the robustness of long-term tracking is effectively improved.
Evaluation criteria: experiments were performed with the OTB-2015 data set, which contains 100 video sequences, each containing several challenge factors, including illumination change, target deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, target out of view, background clutter and low resolution. The performance of the algorithm was evaluated with OPE (one-pass evaluation), comparing the present invention (Ours) with several other advanced trackers (SRDCF, SiamFC, PTAV, UDT, CACF, CFNet). Fig. 3 shows the results of an experiment on the 16 longest video sequences in OTB-2015, each containing more than 1000 frames of images.
It can be seen from the figure that the tracking performance of the invention is ideal and the effect is improved significantly when facing long video sequences.
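For reference, the two OPE scores reported in such comparisons can be computed as follows: precision counts frames whose centre location error is below a pixel threshold (20 pixels is the usual OTB convention) and success counts frames whose overlap with the ground truth exceeds an IoU threshold; the box format (x, y, w, h) is assumed.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between predicted and ground-truth box centres;
    boxes are (x, y, w, h)."""
    pc = np.array([pred[0] + pred[2] / 2, pred[1] + pred[3] / 2])
    gc = np.array([gt[0] + gt[2] / 2, gt[1] + gt[3] / 2])
    return float(np.linalg.norm(pc - gc))

def overlap(pred, gt):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def ope_scores(preds, gts, dist_thr=20.0, iou_thr=0.5):
    """Precision (centre error <= dist_thr pixels) and success
    (IoU >= iou_thr) over one sequence, as in standard OTB evaluation."""
    errs = [center_error(p, g) for p, g in zip(preds, gts)]
    ious = [overlap(p, g) for p, g in zip(preds, gts)]
    precision = float(np.mean([e <= dist_thr for e in errs]))
    success = float(np.mean([o >= iou_thr for o in ious]))
    return precision, success
```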
Fig. 4 compares the actual tracking results of the algorithm without the detectors (Ours_ND) and with the detectors (Ours); the data in the figure show that the re-detection module added in the invention can effectively recover the target when it is lost, improving tracking robustness.
Fig. 5 is a comparison graph of the overall performance of the present invention on a data set with other algorithms, and it can be seen that the present invention can achieve the best results in both tracking accuracy and tracking success rate.
Fig. 6 is a comparison graph of the results of the present invention and other comparison algorithms on various tracking characteristics, and it can be seen that the present invention can achieve the optimal results on all characteristics.
Fig. 7 is a diagram of an actual tracking result of the present invention and other comparison algorithms, and it can be seen from the diagram that the present invention can track to a target more accurately when facing various challenge factors.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (7)

1. A method for long term object tracking using dual detectors, the method comprising the steps of:
extracting the fHOG features of the image containing the target and its background, inputting them into the pre-trained initial filter F1, calculating the maximum response value over the image containing the target background, and taking the position of the maximum response value as the target position in the current image; the filter F1 is updated every frame;
cutting out the image containing only the target at the target position in the current image, extracting its HOG features, inputting them into the pre-trained filter F2, and calculating the maximum response value of the target image;
comparing the maximum response value of the target image with a preset threshold: when the maximum response value of the target image is less than the threshold, tracking is considered lost and a re-detection module is started to re-detect the target at the predicted position in the current picture; when it is greater than the threshold, tracking is considered correct and the filter F2 is updated.
2. The method according to claim 1, characterized in that the objective function for solving the filter F1 is:

E(F, w) = (1/2)‖ Σ_{k=1}^{K} x_k ∗ f_k − y ‖² + (λ1/2) Σ_{k=1}^{K} ‖ w ⊙ f_k ‖² + (λ2/2) ‖ w − w_r ‖²

where the filter F = [f_1, f_2, ..., f_K]; the first term is a ridge regression term, K denotes the total number of channels, x_k denotes the k-th channel of the sample features, f_k denotes the corresponding k-th filter, and y is the desired response; in the second term, w is the adaptive spatial regularization weight, namely it is updated from the information of the current frame as each frame of image is tracked; in the third term, a prior reference weight w_r of w is introduced to prevent model degradation; λ1 and λ2 are regularization parameters.
3. The method of claim 1, wherein the re-detection module comprises a support vector machine detector and a deep twin network detector; the support vector machine detector draws dense training samples around the tracked target position and scale and incrementally trains the classifier by assigning positive and negative labels to the samples according to their overlap rate with the target.
4. The method of claim 3, wherein the training method of the support vector machine detector comprises the following steps:
when N samples are collected in one frame of image, the training set is {(v_i, c_i), i = 1, 2, ..., N}, where v_i is the feature vector of the i-th sample and the class label c_i ∈ {−1, +1};
letting the loss function of the hyperplane h be l(h_i) = max{0, 1 − c_i⟨h_i, v_i⟩}, where ⟨h, v⟩ denotes the inner product of h and v;
then forming an objective function over the hyperplane h from this loss and minimizing it online, where ∇_h l(h_i) is the gradient of the loss function with respect to the hyperplane h and τ ∈ (0, +∞) is the hyperparameter that controls the update rate of h.
5. The method of claim 4, wherein the support vector machine detector updates the hyperplane parameters with an online passive-aggressive learning algorithm, the hyperplane being computed as:

h_{i+1} = h_i − τ∇_h l(h_i)

where ∇_h l(h_i) is the gradient of the loss function with respect to the hyperplane h and τ ∈ (0, +∞) is the hyperparameter that controls the update rate of h.
6. The method of claim 3, wherein the deep twin network detector training method comprises the steps of:
training the detector from the target information given in the first frame picture, namely classifying target and background patches with the k-means clustering method to form a target template pool;
obtaining a plurality of candidate regions, respectively calculating Euclidean distance between each candidate region and a target template through a twin network to be used as matching similarity, and selecting a region with the highest similarity as a tracking target; the similarity calculation formula is as follows:
ŝ = argmax_{s_i ∈ S} υ(p, s_i),  S = {s_1, s_2, ..., s_N}

where ŝ is the candidate region with the highest similarity score, S denotes the sample set of the N candidate regions, and υ(p, s_i) denotes the similarity score between the tracking target p and the i-th candidate sample s_i.
7. The method of claim 1, wherein the maximum response value of the target image is compared with the preset thresholds as follows:
the maximum response value max_R is compared with the preset update threshold T_a and re-detection threshold T_b;
if max_R < T_b, tracking has failed; the detector part is activated, and the support vector machine detector and the deep twin network detector re-detect the target at the predicted position in the current picture;
if max_R > T_a, tracking is successful; the support vector machine detector is updated with dense samples drawn at the predicted target, and the filter F2 of the target information processing section is updated.
CN202111119613.3A 2021-09-24 2021-09-24 Long-term target tracking method using double detectors Pending CN113902773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111119613.3A CN113902773A (en) 2021-09-24 2021-09-24 Long-term target tracking method using double detectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111119613.3A CN113902773A (en) 2021-09-24 2021-09-24 Long-term target tracking method using double detectors

Publications (1)

Publication Number Publication Date
CN113902773A true CN113902773A (en) 2022-01-07

Family

ID=79029210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111119613.3A Pending CN113902773A (en) 2021-09-24 2021-09-24 Long-term target tracking method using double detectors

Country Status (1)

Country Link
CN (1) CN113902773A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423702A (en) * 2017-07-20 2017-12-01 西安电子科技大学 Video target tracking method based on TLD tracking systems
CN110211157A (en) * 2019-06-04 2019-09-06 重庆邮电大学 A kind of target long time-tracking method based on correlation filtering
CN110232350A (en) * 2019-06-10 2019-09-13 哈尔滨工程大学 A kind of real-time water surface multiple mobile object detecting and tracking method based on on-line study
CN110490899A (en) * 2019-07-11 2019-11-22 东南大学 A kind of real-time detection method of the deformable construction machinery of combining target tracking
CN111476819A (en) * 2020-03-19 2020-07-31 重庆邮电大学 Long-term target tracking method based on multi-correlation filtering model
CN113033356A (en) * 2021-03-11 2021-06-25 大连海事大学 Scale-adaptive long-term correlation target tracking method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘万军; 张壮; 姜文涛; 张晟: "Multi-scale correlation filtering tracking algorithm with occlusion discrimination", Journal of Image and Graphics, no. 12, 16 December 2018 (2018-12-16) *
胡昭华; 韩庆; 李奇: "Correlation filter tracking algorithm based on temporal awareness and adaptive spatial regularization", Acta Optica Sinica, no. 03 *
胡昭华 et al.: "A long-term target tracking method using dual detectors", Microelectronics & Computer, 5 September 2021 (2021-09-05) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114660554A (en) * 2022-05-25 2022-06-24 中国人民解放军空军预警学院 Radar target and interference detection and classification method and system
CN114660554B (en) * 2022-05-25 2022-09-23 中国人民解放军空军预警学院 Radar target and interference detection and classification method and system
CN114897941A (en) * 2022-07-13 2022-08-12 长沙超创电子科技有限公司 Target tracking method based on Transformer and CNN

Similar Documents

Publication Publication Date Title
US10990191B2 (en) Information processing device and method, program and recording medium for identifying a gesture of a person from captured image data
Zhang et al. Online multi-person tracking by tracker hierarchy
Yang et al. Robust superpixel tracking
Wang et al. Superpixel tracking
CN108027972B (en) System and method for object tracking
CN110197502B (en) Multi-target tracking method and system based on identity re-identification
CN108921873B (en) Markov decision-making online multi-target tracking method based on kernel correlation filtering optimization
CN111368683B (en) Face image feature extraction method and face recognition method based on modular constraint CenterFace
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN113327272B (en) Robustness long-time tracking method based on correlation filtering
CN113902773A (en) Long-term target tracking method using double detectors
Jeong et al. Mean shift tracker combined with online learning-based detector and Kalman filtering for real-time tracking
CN111986225A (en) Multi-target tracking method and device based on angular point detection and twin network
CN113537077B (en) Label multiple Bernoulli video multi-target tracking method based on feature pool optimization
CN112785622B (en) Method and device for tracking unmanned captain on water surface and storage medium
CN104637052A (en) Object tracking method based on target guide significance detection
Wang et al. Face tracking using motion-guided dynamic template matching
CN112613565A (en) Anti-occlusion tracking method based on multi-feature fusion and adaptive learning rate updating
CN111862147A (en) Method for tracking multiple vehicles and multiple human targets in video
CN108985216B (en) Pedestrian head detection method based on multivariate logistic regression feature fusion
CN116342653A (en) Target tracking method, system, equipment and medium based on correlation filter
CN113706580B (en) Target tracking method, system, equipment and medium based on relevant filtering tracker
Zhu et al. Visual tracking with dynamic model update and results fusion
Zhang et al. Visual object tracking with saliency refiner and adaptive updating
CN110781769A (en) Method for rapidly detecting and tracking pedestrians

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination