CN113256680A - High-precision target tracking system based on unsupervised learning - Google Patents

High-precision target tracking system based on unsupervised learning

Info

Publication number
CN113256680A
CN113256680A
Authority
CN
China
Prior art keywords
tracker
target
tracking
frame
selection module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110523935.8A
Other languages
Chinese (zh)
Inventor
胡硕
王洁
周思恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202110523935.8A priority Critical patent/CN113256680A/en
Publication of CN113256680A publication Critical patent/CN113256680A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a high-precision target tracking system based on unsupervised learning, which comprises an image acquisition module, a tracking module and a selection module. The image acquisition module is used for acquiring video images. The tracking module comprises a tracker 1 and a tracker 2 and is used for obtaining image features and the target rectangular frame. The selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; it takes the feature maps of the candidate trackers and the tracker results as input and outputs the best tracking result through the selector. The invention obtains results through two different trackers, outputs the optimal result through the judgment of the selector, and continues tracking with that result in subsequent frames, so as to adapt to target tracking in different scenes.

Description

High-precision target tracking system based on unsupervised learning
Technical Field
The invention relates to the technical field of computer vision tracking, in particular to a high-precision target tracking system based on unsupervised learning.
Background
Target tracking is a fundamental task in computer vision, whose purpose is to locate a target object in a video given a bounding-box annotation in the first frame. Target tracking currently has a wide range of applications, such as intelligent transportation systems, the medical field, human-computer interaction and athlete match analysis.
Current advanced deep tracking methods typically use pre-trained CNN models for feature extraction. These models are trained in a supervised fashion and require a large number of annotation labels; manual labeling is expensive and time consuming, whereas unlabeled videos are readily available on the Internet. It is therefore worth exploring how to perform visual tracking using unlabeled video sequences.
Data are now extremely easy to obtain on the Internet, and the development of unsupervised techniques alleviates the manual-labeling problem and plays a large role in deep-learning-based target tracking. Unsupervised learning on video has produced a great deal of research: to learn visual features from unlabeled data, unsupervised methods explore the intrinsic information inside images or videos from different perspectives as supervision signals, and train by designing loss functions and proxy tasks. In the prior art, for example, the patent "Unsupervised correlation-filter target tracking method and system based on a jigsaw task" trains a network on a prediction task that indexes image-block positions in order to learn feature-extraction capability, and mainly trains a single tracker to adapt to different scenes; however, a single tracker usually has inherent defects that make it difficult to adapt to different scenes. The prior art therefore has certain defects and shortcomings, and there is room for further improvement.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a high-precision target tracking system based on unsupervised learning. The system adapts to target tracking in different scenes by using two different trackers, and a selection module, based on a comparison of the tracking results, selects the optimal result for continued tracking in subsequent frames. The system has a simple structure and improves both precision and robustness.
The technical scheme adopted by the invention is as follows:
the invention provides a high-precision target tracking system based on unsupervised learning, which comprises the following modules:
the image acquisition module is used for acquiring a video image;
the tracking module comprises a tracker 1 and a tracker 2 and is used for obtaining the characteristics of the image and the target rectangular frame;
the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the module takes the feature maps and results of the candidate trackers as input and outputs the best tracking result through the selector.
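By way of illustration, a minimal sketch of such a selection network is given below. It assumes PyTorch, global-average pooling of the input feature map, and illustrative layer sizes; none of these implementation details are specified in the patent.

import torch
import torch.nn as nn

class TrackerSelector(nn.Module):
    """Selection-module sketch: two fully connected layers (each linear + ReLU)
    followed by a softmax layer; it scores one tracker's feature map with the
    probability that the corresponding target box is accurate."""
    def __init__(self, in_channels=512, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Linear(in_channels, hidden_dim), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Linear(hidden_dim, 2), nn.ReLU())
        self.softmax = nn.Softmax(dim=1)

    def forward(self, feature_map):
        # feature_map: (batch, channels, height, width); pooling is an assumed fusion step.
        x = feature_map.flatten(start_dim=2).mean(dim=2)
        x = self.fc2(self.fc1(x))
        return self.softmax(x)[:, 1]  # probability that the predicted box is accurate

At inference time both trackers' feature maps are scored with this module and the box with the higher probability is kept, as described in the detailed description below.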
Further, tracker 1 and tracker 2 in the tracking module are two different trackers, suited to tracking in different scenes, and are trained with two different loss functions, specifically as follows:
L_C = ||Z_T - R_T||_2^2 (1)
L_m = Σ_i z_i (2)
z_i = 0.5 (Î_t^i - I_t^i)^2 if |Î_t^i - I_t^i| < 1, and z_i = |Î_t^i - I_t^i| - 0.5 otherwise (3)
wherein L_C is the loss function of tracker 1, R_T is the label obtained by cropping the template patch from the initial frame, which is a Gaussian response centered at the center of the initial bounding box, and Z_T is the response map generated from the second search frame during backward tracking; training uses cycle consistency. L_m is the Huber loss function of tracker 2, Î_t is the reconstructed frame and I_t is the real frame; training uses the consistency of pixel reconstruction.
Further, the specific content of the selection module is as follows:
the selection module aims at selecting the most appropriate target result according to the tracking results; in the selection module, the results of the two trackers need to be run simultaneously, so that the better of tracker 1 and tracker 2 can be selected;
(1) acquiring the overlap value IOU between the candidate box and the pseudo label: after the two trackers track forward to obtain a predicted target position, the predicted target position is used as a pseudo label and tracked backward across an interval of n frames; a new predicted position is obtained in the initial frame, and the overlap IOU is computed between the newly estimated target frame and the annotation frame in the initial frame; the label P required for selector training is obtained from the IOU value, and P is computed as in equation (4) (an illustrative sketch of the IOU computation and selector training follows this list):
(Equation (4), giving the label P as a function of the IOU value, is presented as an image in the original publication.)
(2) The selection module consists of two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature map obtained by each candidate tracker is input into the selection module, which yields a probability value for the precision estimate of each of the two target frames; the selector is trained using a cross-entropy loss function, given as equation (5):
L_s = -Σ [p ln a + (1 - p) ln(1 - a)] (5)
wherein a is the probability value of the target-box precision estimate produced by the selector from the features, and p is the label required for selector training;
(3) in the tracking stage, the trackers locate the target, the feature maps and localization results obtained by the trackers are used as the input of the selection module, and the selection module directly judges and outputs the result of the better tracker.
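The sketch below illustrates the IOU computation and one selector training step. The (x, y, w, h) box format, the helper names and the 0.5 IOU threshold used to form the label p are assumptions made for illustration; the threshold merely stands in for equation (4), whose exact form is given only as an image in the original.

import torch
import torch.nn.functional as F

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def selector_training_step(selector, feature_map, estimated_box, annotation_box, optimizer):
    """One training step of the selector with the cross-entropy loss of equation (5).
    The label p is derived here by thresholding the IOU at 0.5 (an assumption)."""
    p = torch.tensor([1.0 if iou(estimated_box, annotation_box) > 0.5 else 0.0])
    a = selector(feature_map)              # probability that the predicted box is accurate
    loss = F.binary_cross_entropy(a, p)    # equation (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()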
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, a combined model is designed to comprise an image acquisition module, a tracking module and a selection module, results obtained by two different trackers are judged by a selector to obtain the optimal result output, and tracking is continued in subsequent frames; the target tracking algorithm faces huge challenges such as motion blur, shielding and the like, and the method has the advantages that the two trackers have different applied target motion scenes, the result of the proper tracker is selected for tracking, the structure is simple, and the precision and the robustness can be effectively improved.
Drawings
FIG. 1 is a block flow diagram of the system of the present invention;
FIG. 2 is a schematic diagram of a training method of the tracker 1;
fig. 3 is a schematic diagram of a training method of the tracker 2.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
As shown in fig. 1, the high-precision target tracking system based on unsupervised learning proposed by the present invention includes the following modules:
the image acquisition module is used for acquiring a video image;
the tracking module comprises a tracker 1 and a tracker 2 and is used for obtaining the characteristics of the image and the target rectangular frame;
the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the module takes the feature maps and results of the candidate trackers as input and outputs the best tracking result through the selector.
The tracking module comprises two branches, tracker 1 and tracker 2. Tracker 1 tracks targets in scenes without abrupt changes with high precision, good robustness and high speed; tracker 2 has a memory bank and can therefore track moving targets with higher precision under occlusion or temporary target loss. By complementing each other's advantages, the two trackers adjust automatically in different scenes to ensure high target-tracking precision. The two trackers are trained as follows:
Tracker 1 is trained with the idea of cycle consistency, as shown in fig. 2. The specific steps are as follows: three patches are randomly selected within 10 consecutive frames of a video; any one patch is set as the first-frame template and the remaining patches are set as search patches. Given the target object annotated on the template frame, forward tracking is performed twice over the two subsequent frames; the position predicted in the last frame is then used as the initial target annotation to track backward directly to the first frame. In principle the initial annotation of the first frame should coincide with the target position predicted by backward tracking in the first frame, so the error between the initial annotation and the backward-tracking result is used for training. The training loss is equation (1):
L_C = ||Z_T - R_T||_2^2 (1)
wherein L_C is the loss function of tracker 1, R_T is the label obtained by cropping the template patch from the initial frame, which is a Gaussian response centered at the center of the initial bounding box, and Z_T is the response map generated from the second search frame during backward tracking; training uses cycle consistency.
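A minimal sketch of one such cycle-consistency training step follows. The callable tracker(feats_a, label_a, feats_b), assumed to return the response map of the target on frame b given its (pseudo) label on frame a, and the helper to_label that turns a response map into a pseudo label, are illustrative stand-ins rather than the patent's actual interfaces.

import torch.nn.functional as F

def cycle_consistency_step(tracker, to_label, feats, gaussian_label, optimizer):
    """One training step for tracker 1: forward-track twice, backward-track to the
    first frame, and minimize equation (1) between the back-tracked response map
    and the Gaussian label of the initial annotation."""
    f0, f1, f2 = feats                       # features of three patches from 10 consecutive frames
    # Forward tracking: frame 0 -> 1 -> 2, using predicted responses as pseudo labels.
    resp_1 = tracker(f0, gaussian_label, f1)
    resp_2 = tracker(f1, to_label(resp_1), f2)
    # Backward tracking: frame 2 -> 0, starting from the last forward prediction.
    resp_back = tracker(f2, to_label(resp_2), f0)
    # Cycle-consistency loss, equation (1).
    loss = F.mse_loss(resp_back, gaussian_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()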
Tracker 2 uses fine-grained pixel matching, as shown in fig. 3, and uses a memory bank to store information from multiple frames; its advantage is that it can exploit more feature information. The tracker is designed with long-term and short-term memory: in a video sequence the target changes over time, so if previously good features are not reused when the target changes during tracking, errors are easily amplified and the target may even be lost in subsequent tracking. The memory bank is configured to store 5 frames of information; frame 0 and frame 5 are fixed as long-term memory, which ensures that earlier feature information is always retained over a long video sequence, while I_{t-5}, I_{t-3} and I_{t-1} are used as short-term memory to provide the latest feature information. Tracker 2 is trained by reconstructing the target frame as a linear combination of pixels from the reference frames (e.g. I_{t-1}). Specifically, for each input frame I_t there is a triplet (Q_t, K_t, V_t), i.e. Query, Key and Value. Taking the current frame and several past frames in the memory bank as input, a trained feature encoder computes the affinity matrix between Q of the target frame and K of the reference frames, and the pixels of frame t are reconstructed as in equations (2) and (3):
A_t^{ij} = exp(<Q_t^i, K_r^j>) / Σ_p exp(<Q_t^i, K_r^p>) (2)
Î_t^i = Σ_j A_t^{ij} V_r^j (3)
wherein <·,·> is the dot product of two vectors; Q and K are the feature representations obtained after the twin network, with Q the features of the current target frame I_t and K the features of the several reference frames; A_t^{ij} is the affinity between pixel i of the target frame and pixel j of the reference frames; and V is the original reference frame.
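The attention-style reconstruction of equations (2) and (3) can be sketched as follows; the tensor shapes and the flattening of spatial positions into rows are assumptions made for illustration.

import torch

def reconstruct_target_frame(q, k, v):
    """Reconstruct the target-frame pixels from the memory bank.
    q: (N, C) query features of the target frame, one row per pixel
    k: (M, C) key features of the reference frames in the memory bank
    v: (M, D) values of the reference-frame pixels (e.g. colors)
    Returns the reconstructed pixels, shape (N, D)."""
    affinity = torch.softmax(q @ k.t(), dim=1)   # equation (2): A_t[i, j]
    return affinity @ v                          # equation (3): weighted sum of reference pixels

# Illustrative usage with random features: 5 memory frames of 64 pixels each.
q = torch.randn(64, 128)
k = torch.randn(5 * 64, 128)
v = torch.randn(5 * 64, 3)
reconstructed = reconstruct_target_frame(q, k, v)   # (64, 3)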
The tracker is trained with the loss between the reconstructed frame Î_t and the original frame I_t, with the loss function given by equations (4) and (5):
L_m = Σ_i z_i (4)
z_i = 0.5 (Î_t^i - I_t^i)^2 if |Î_t^i - I_t^i| < 1, and z_i = |Î_t^i - I_t^i| - 0.5 otherwise (5)
wherein L_m is the Huber loss function of tracker 2, trained using the consistency of pixel reconstruction.
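Under the assumption of a threshold of 1, this Huber loss corresponds to PyTorch's built-in smooth-L1 loss, as the brief sketch below shows.

import torch.nn.functional as F

def reconstruction_loss(reconstructed, original):
    """Huber loss L_m between the reconstructed frame and the real frame,
    equations (4) and (5); summing over pixels matches L_m = sum_i z_i."""
    return F.smooth_l1_loss(reconstructed, original, reduction="sum", beta=1.0)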
The purpose of the selection module is to select the most suitable target result according to the tracking results; in the selection module, the results of the two trackers need to be run simultaneously, so that the better of tracker T1 and tracker T2 can be selected. The specific contents are as follows:
(1) acquiring the overlap value IOU between the candidate box and the pseudo label: since unsupervised learning provides no manually annotated label, after the two trackers track forward to obtain the predicted target position, the predicted target position is used as a pseudo label and tracked backward across an interval of n frames; a new predicted position is obtained in the initial frame, and the overlap IOU is computed between the new predicted target frame and the annotation frame in the initial frame. The label P required for selector training is obtained from the IOU value, as in equation (6):
(Equation (6), giving the label P as a function of the IOU value, is presented as an image in the original publication.)
(2) The selection module consists of two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function. The feature map obtained by each candidate tracker is input into the selection module, which yields a probability value for the precision estimate of each of the two target frames. The selector is trained using a cross-entropy loss function, given as equation (7):
L_s = -Σ [p ln a + (1 - p) ln(1 - a)] (7)
wherein a is the probability value of the target-box precision estimate produced by the selector from the features, and p is the label required for selector training.
(3) In the tracking stage, the trackers locate the target, the feature maps and localization results obtained by the trackers are used as the input of the selection module, and the selection module directly judges and outputs the result of the better tracker, as in the sketch below.
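A minimal sketch of this inference-time selection is given below; it assumes the TrackerSelector sketched earlier and generic tracker objects with a hypothetical track(frame) method returning a (box, feature_map) pair.

import torch

def select_best_result(selector, tracker1, tracker2, frame):
    """Run both trackers on the current frame, score each feature map with the
    selection module, and keep the box of the tracker with the higher probability."""
    box1, feat1 = tracker1.track(frame)
    box2, feat2 = tracker2.track(frame)
    with torch.no_grad():
        p1 = selector(feat1.unsqueeze(0)).item()   # probability that box1 is accurate
        p2 = selector(feat2.unsqueeze(0)).item()
    return box1 if p1 >= p2 else box2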
The above-mentioned embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope; various modifications and improvements made to the technical solution of the present invention by those skilled in the art, without departing from the spirit of the present invention, shall fall within the protection scope defined by the claims of the present invention.

Claims (3)

1. A high-precision target tracking system based on unsupervised learning is characterized in that: the system comprises the following modules:
the image acquisition module is used for acquiring a video image;
the tracking module comprises a tracker 1 and a tracker 2 and is used for obtaining the characteristics of the image and the target rectangular frame;
the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the module takes the feature maps and results of the candidate trackers as input and outputs the best tracking result through the selector.
2. The high-precision target tracking system based on unsupervised learning of claim 1, characterized in that: tracker 1 and tracker 2 in the tracking module are two different trackers, suited to tracking in different scenes, and are trained with two different loss functions, specifically as follows:
L_C = ||Z_T - R_T||_2^2 (1)
L_m = Σ_i z_i (2)
z_i = 0.5 (Î_t^i - I_t^i)^2 if |Î_t^i - I_t^i| < 1, and z_i = |Î_t^i - I_t^i| - 0.5 otherwise (3)
wherein L_C is the loss function of tracker 1, R_T is the label obtained by cropping the template patch from the initial frame, which is a Gaussian response centered at the center of the initial bounding box, and Z_T is the response map generated from the second search frame during backward tracking, trained with cycle consistency; L_m is the Huber loss function of tracker 2, Î_t is the reconstructed frame and I_t is the real frame, trained with the consistency of pixel reconstruction.
3. The high-precision target tracking system based on unsupervised learning of claim 1, characterized in that the specific contents of the selection module are as follows:
the selection module aims at selecting the most appropriate target result according to the tracking results; in the selection module, the results of the two trackers need to be run simultaneously, so that the better of tracker 1 and tracker 2 can be selected;
(1) acquiring the overlap value IOU between the candidate box and the pseudo label: after the two trackers track forward to obtain a predicted target position, the predicted target position is used as a pseudo label and tracked backward across an interval of n frames; a new predicted position is obtained in the initial frame, and the overlap IOU is computed between the newly estimated target frame and the annotation frame in the initial frame; the label P required for selector training is obtained from the IOU value, as in equation (4):
(Equation (4), giving the label P as a function of the IOU value, is presented as an image in the original publication.)
(2) The selection module consists of two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature map obtained by each candidate tracker is input into the selection module, which yields a probability value for the precision estimate of each of the two target frames; the selector is trained using a cross-entropy loss function, given as equation (5):
L_s = -Σ [p ln a + (1 - p) ln(1 - a)] (5)
wherein a is the probability value of the target-box precision estimate produced by the selector from the features, and p is the label required for selector training;
(3) in the tracking stage, the trackers locate the target, the feature maps and localization results obtained by the trackers are used as the input of the selection module, and the selection module directly judges and outputs the result of the better tracker.
CN202110523935.8A 2021-05-13 2021-05-13 High-precision target tracking system based on unsupervised learning Pending CN113256680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110523935.8A CN113256680A (en) 2021-05-13 2021-05-13 High-precision target tracking system based on unsupervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110523935.8A CN113256680A (en) 2021-05-13 2021-05-13 High-precision target tracking system based on unsupervised learning

Publications (1)

Publication Number Publication Date
CN113256680A true CN113256680A (en) 2021-08-13

Family

ID=77181801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110523935.8A Pending CN113256680A (en) 2021-05-13 2021-05-13 High-precision target tracking system based on unsupervised learning

Country Status (1)

Country Link
CN (1) CN113256680A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682696A (en) * 2016-12-29 2017-05-17 华中科技大学 Multi-example detection network based on refining of online example classifier and training method thereof
US20190228266A1 (en) * 2018-01-22 2019-07-25 Qualcomm Incorporated Failure detection for a neural network object tracker
US20190347806A1 (en) * 2018-05-09 2019-11-14 Figure Eight Technologies, Inc. Video object tracking
CN109978045A (en) * 2019-03-20 2019-07-05 深圳市道通智能航空技术有限公司 A kind of method for tracking target, device and unmanned plane
CN110569793A (en) * 2019-09-09 2019-12-13 西南交通大学 Target tracking method for unsupervised similarity discrimination learning
CN111161558A (en) * 2019-12-16 2020-05-15 华东师范大学 Method for judging forklift driving position in real time based on deep learning
CN111950367A (en) * 2020-07-08 2020-11-17 中国科学院大学 Unsupervised vehicle re-identification method for aerial images

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BINENG ZHONG等: "Visual tracking via weakly supervised learning from multiple imperfect oracles", 《2010 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
NING WANG等: "Unsupervised deep tracking", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
ZIHANG LAI等: "MAST: A Memory-Augmented Self-Supervised Tracker", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117951648A (en) * 2024-03-26 2024-04-30 成都正扬博创电子技术有限公司 Airborne multisource information fusion method and system
CN117951648B (en) * 2024-03-26 2024-06-07 成都正扬博创电子技术有限公司 Airborne multisource information fusion method and system

Similar Documents

Publication Publication Date Title
Sun et al. Simultaneous detection and tracking with motion modelling for multiple object tracking
CN103336957B (en) A kind of network homology video detecting method based on space-time characteristic
Lu et al. Monet: Motion-based point cloud prediction network
CN111182364B (en) Short video copyright detection method and system
CN112446342A (en) Key frame recognition model training method, recognition method and device
CN111523463B (en) Target tracking method and training method based on matching-regression network
Oh et al. Space-time memory networks for video object segmentation with user guidance
Porav et al. Don’t worry about the weather: Unsupervised condition-dependent domain adaptation
GB2579262A (en) Space-time memory network for locating target object in video content
CN112801068A (en) Video multi-target tracking and segmenting system and method
CN111612825A (en) Image sequence motion occlusion detection method based on optical flow and multi-scale context
Wu et al. Cavit: Contextual alignment vision transformer for video object re-identification
Yang et al. SAM-Net: Semantic probabilistic and attention mechanisms of dynamic objects for self-supervised depth and camera pose estimation in visual odometry applications
CN113256680A (en) High-precision target tracking system based on unsupervised learning
Xi et al. Implicit motion-compensated network for unsupervised video object segmentation
Venator et al. Self-supervised learning of domain-invariant local features for robust visual localization under challenging conditions
Li et al. Collaborative convolution operators for real-time coarse-to-fine tracking
Wang et al. Monocular VO based on deep siamese convolutional neural network
CN115131362A (en) Large-scale point cloud local area feature coding method
CN114882067A (en) Encoder, encoder and decoder framework and multi-target tracking and partitioning method
CN113963021A (en) Single-target tracking method and system based on space-time characteristics and position changes
Li et al. Traffic4d: Single view reconstruction of repetitious activity using longitudinal self-supervision
Wang et al. Cross complementary fusion network for video salient object detection
Li et al. Traffic4d: Single view longitudinal 4d reconstruction of repetitious activity using self-supervised experts
CN118262275B (en) Weak supervision target tracking method based on co-saliency learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210813)