CN112116630A - Target tracking method - Google Patents

Target tracking method

Info

Publication number
CN112116630A
CN112116630A
Authority
CN
China
Prior art keywords
target
frame
network
layer
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010834753.8A
Other languages
Chinese (zh)
Inventor
许剑华 (Xu Jianhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Supremind Intelligent Technology Co Ltd
Original Assignee
Shanghai Supremind Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Supremind Intelligent Technology Co Ltd filed Critical Shanghai Supremind Intelligent Technology Co Ltd
Priority to CN202010834753.8A priority Critical patent/CN112116630A/en
Publication of CN112116630A publication Critical patent/CN112116630A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to video surveillance technology, and in particular to a target tracking method. The method implements a single-target tracking scheme based on a reinforcement learning (RL) algorithm, treating each tracking step of each target as a policy problem: the control policy is optimized from the tracked state reward, which in turn trains the deep learning network. Unlike general RL schemes, which predict the target directly from a cropped image, the method combines a Siamese network structure with the RL scheme and predicts correlation features from the target template and the full image, so it can use the information around the target, improving its semantic background discrimination and tracking robustness. General RL schemes also use DQN for tracking, which does not decouple value-function estimation from policy optimization, so they easily fail to converge or overfit during training.

Description

Target tracking method
Technical Field
The invention relates to video surveillance technology, and in particular to a target tracking method.
Background
Video surveillance technology is widely applied in intelligent traffic systems. Current research focuses on vehicle detection, identification, tracking, traffic statistics, traffic dispersion, and violation detection.
Existing target tracking methods mainly include the following.
First, the patent with application No. 201810592957.8, "Multi-target tracking based on multi-agent deep reinforcement learning":
in that patent, each frame is detected with YOLOv3, targets are cropped out using the detection boxes, an agent is created for each cropped target, features are then extracted from the cropped targets, and an action is predicted for each target through an LSTM and a DQN.
The single-target tracking in that scheme extracts features for each target with a CNN, uses an LSTM to extract the correlated feature information of all targets, and predicts each target box's action (the prediction of the target box position) through the policy learned by the DQN. A total of 9 actions are involved.
That patent has the following disadvantages:
1) the DQN couples action selection with the value function, so it overfits easily and is hard to train;
2) the target prediction problem is treated as a discrete problem, which limits tracking precision; the change of target position should be treated as a continuous problem in space (the two formulations are contrasted in the sketch below).
Second, the patent with application No. 201810220513.1, "Multi-target tracking method based on deep reinforcement learning":
in that patent, each target is assigned an agent, and the target box is then predicted by a DQN network; the image fed into the DQN network is the cropped detection-box image.
That patent has the following disadvantage: DQN is an action-value optimization algorithm that optimizes the policy by maximizing the action value and converges to a local maximum; its convergence is sensitive to the reward, and it does not converge easily.
Third, SiameseRPN-series single-target tracking:
the template image and a search region of the global image are fed separately into a CNN, correlated by an RPN network, and the target's box and category are predicted by two different branches on the correlated features.
This method has the following disadvantages: the SiameseRPN-series schemes assume that the target's current position is near its position in the previous frame, so the correlation is not computed over the whole image; instead, an image patch in a small range around the target's previous-frame position is cropped for correlation. The main problems with this are that the semantic background discrimination of the SiameseRPN correlation is poor, which easily causes mis-tracking, and that the range of feature extraction is limited. Meanwhile, the SiameseRPN network does not consider the target's historical position information and relies only on the current template and the current frame for prediction.
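For reference, the correlation operation at the heart of SiameseRPN-style trackers (and of the correlation feature map used by the method of this disclosure) can be sketched as follows; PyTorch and a SiamFC/SiamRPN-style depthwise cross-correlation are assumed, and all names are illustrative.

    # Minimal sketch of depthwise cross-correlation between template features
    # and search-region features; assumes PyTorch, names are illustrative.
    import torch
    import torch.nn.functional as F

    def depthwise_xcorr(search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
        # search_feat:   (B, C, Hs, Ws) features of the search image/region
        # template_feat: (B, C, Ht, Wt) features of the target template
        # returns:       (B, C, Hs-Ht+1, Ws-Wt+1) correlation response map
        b, c, h, w = search_feat.shape
        # Fold the batch into the channel dimension so each sample is
        # correlated only with its own template (grouped convolution).
        x = search_feat.view(1, b * c, h, w)
        kernel = template_feat.view(b * c, 1, *template_feat.shape[2:])
        out = F.conv2d(x, kernel, groups=b * c)
        return out.view(b, c, out.size(2), out.size(3))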
Accordingly, there is a need for improvements in the art.
Disclosure of Invention
The invention aims to provide an efficient target tracking method.
In order to solve the above technical problem, the present invention provides a target tracking method, including the following steps:
1) Input a target box that selects the target in the current frame image; go to step 2.
2) The Siamese feature extraction module crops the target using the input target box (or, on later frames, the predicted target box) to obtain a target patch, then extracts features from the target patch and from the current frame image with the same CNN; the two resulting feature maps are correlated to obtain a correlation feature map, which serves as the input of the RL tracking module (see the correlation sketch in the Background above); go to step 3.
3) The RL tracking module predicts the position of the target box in the next frame image based on the correlation feature map of the current frame extracted in step 2, obtaining the offset of the next frame relative to the current frame, and computes the predicted target box position from this offset and the current target box position (the box update is sketched after these steps); it then crops features from the next frame image according to the predicted target box position and predicts a score, yielding the tracked state reward; go to step 4.
4) Optimize the control policy according to the tracked state reward obtained in step 3.
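A minimal sketch of the box update in step 3, assuming the box is parameterized by its center, width and height, and that Δx, Δy, Δw, Δh are plain additive offsets; the disclosure does not fix an exact parameterization, so this is one consistent reading.

    from typing import Tuple

    Box = Tuple[float, float, float, float]  # (cx, cy, w, h): center, width, height

    def apply_deltas(box: Box, dx: float, dy: float, dw: float, dh: float) -> Box:
        # Apply the predicted offsets to the current target box. Assumes the
        # offsets are plain additive differences between frames, one plausible
        # reading of the disclosure's dx, dy, dw, dh.
        cx, cy, w, h = box
        return (cx + dx, cy + dy, w + dw, h + dh)

    # Example: a box centered at (100, 80) drifting right and growing slightly.
    pred_box = apply_deltas((100.0, 80.0, 40.0, 30.0), dx=3.0, dy=-1.0, dw=2.0, dh=0.5)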
As an improvement of the target tracking method of the present invention:
the score prediction may use IoU, GIoU, L2 distance, and the like (sketched below).
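Hedged sketches of the IoU and GIoU measures named above, for axis-aligned boxes given as (x1, y1, x2, y2) corners; this is standard geometry, not code taken from the disclosure.

    def iou(a, b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def giou(a, b):
        # Generalized IoU: IoU minus the fraction of the smallest enclosing box
        # not covered by the union, so non-overlapping boxes still get a useful
        # (negative) signal.
        cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
        cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
        c_area = (cx2 - cx1) * (cy2 - cy1)
        inter = max(0.0, min(a[2], b[2]) - max(a[0], b[0])) * \
                max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return (inter / union if union > 0 else 0.0) - (c_area - union) / c_area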
As a further improvement of the target tracking method of the present invention:
the Siamese feature extraction module comprises three convolutional layers;
the RL tracking module comprises an Actor network and a Critic network; the Actor and Critic networks share three convolutional layers; the Actor network further comprises a convolutional layer, two FC layers and an action layer, and the Critic network further comprises a Crop layer, a convolutional layer, two FC layers and a score layer;
the Siamese feature extraction module feeds the output of its last convolutional layer into the convolutional layer of the Actor network and the Crop layer of the Critic network, and the output of the Actor network's action layer is fed into the Crop layer of the Critic network.
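The layer layout just described can be sketched as below, assuming PyTorch; the channel counts, kernel sizes, and the use of roi_align as the Crop layer are assumptions, since the disclosure specifies only the layer types and their wiring.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align  # assumed realization of the Crop layer

    class ActorCritic(nn.Module):
        # Layer layout per the description above: three shared conv layers;
        # Actor = conv + two FC + action layer; Critic = Crop + conv + two FC +
        # score layer. Channel counts and sizes are illustrative assumptions.

        def __init__(self, in_ch: int = 256, ch: int = 128):
            super().__init__()
            self.shared = nn.Sequential(  # three shared convolutional layers
                nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            )
            self.actor_conv = nn.Conv2d(ch, ch, 3, padding=1)
            self.actor_fc = nn.Sequential(nn.Flatten(),
                                          nn.LazyLinear(256), nn.ReLU(),
                                          nn.Linear(256, 256), nn.ReLU())
            self.action = nn.Linear(256, 4)   # action layer -> (dx, dy, dw, dh)
            self.critic_conv = nn.Conv2d(ch, ch, 3, padding=1)
            self.critic_fc = nn.Sequential(nn.Flatten(),
                                           nn.LazyLinear(256), nn.ReLU(),
                                           nn.Linear(256, 256), nn.ReLU())
            self.score = nn.Linear(256, 1)    # score layer -> predicted IoU

        def features(self, corr_feat: torch.Tensor) -> torch.Tensor:
            return self.shared(corr_feat)

        def act(self, f: torch.Tensor) -> torch.Tensor:
            return self.action(self.actor_fc(self.actor_conv(f)))

        def criticize(self, f: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
            # Crop layer: cut the box region out of the shared feature map;
            # rois are (batch_index, x1, y1, x2, y2) rows in feature coordinates.
            idx = torch.arange(box.size(0), dtype=box.dtype, device=box.device)
            rois = torch.cat([idx.unsqueeze(1),
                              box[:, :2] - box[:, 2:] / 2,
                              box[:, :2] + box[:, 2:] / 2], dim=1)
            crops = roi_align(f, rois, output_size=(7, 7))
            return self.score(self.critic_fc(self.critic_conv(crops)))

        def forward(self, corr_feat: torch.Tensor, cur_box: torch.Tensor):
            # cur_box: (B, 4) boxes as (cx, cy, w, h) in feature-map coordinates.
            f = self.features(corr_feat)
            deltas = self.act(f)
            return deltas, self.criticize(f, cur_box + deltas)  # action -> Crop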
As a further improvement of the target tracking method of the present invention:
in step 3:
the Actor network predicts the action of the next-frame target box based on the current target box, obtaining the offset of the next frame relative to the current frame, including but not limited to the following parameters: Δx, Δy, Δw, Δh;
where Δx and Δy are the predicted change of the target's center point in the next frame relative to the current frame, and Δw and Δh are the predicted change of the target's width and height in the next frame relative to the current frame;
the predicted target box position is obtained from the current target box position and Δx, Δy, Δw, Δh;
the Critic network crops the target at an intermediate layer using the predicted target box position, i.e., it crops the corresponding features from an intermediate layer of the Critic network according to the target position and continues feature extraction from them;
the target's IoU is then predicted through the convolutional layer and the FC layers; this predicted value serves as the action-value score of the RL network and is used to evaluate the reliability of the current action.
The target tracking method has the following technical advantages:
the method implements a single-target tracking scheme based on a reinforcement learning algorithm, treating each tracking step of each target as a policy problem: the control policy is optimized from the tracked state reward, which in turn trains the deep learning network.
1) Unlike general RL schemes, which perform target prediction directly on a cropped image, the method predicts correlation features from the target template and the full image, so it can use the information around the target, improving its semantic background discrimination and tracking robustness.
2) The prior art predicts the target box position from a cropped image lacking the target's surroundings, so its tracking robustness is poor; it also uses DQN for tracking, which does not decouple value-function estimation from policy optimization, so training easily fails to converge or overfits.
Drawings
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the algorithmic framework of the target tracking system of the present invention;
FIG. 2 is a flow chart of the target tracking method of the present invention;
FIG. 3 is a block diagram of the Siamese feature extraction module;
FIG. 4 is a block diagram of the Siamese feature extraction module connected to the RL tracking module.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
Embodiment 1: a target tracking system, as shown in FIGS. 1-4, comprising a Siamese feature extraction module and an RL tracking module connected to each other.
The RL tracking module performs the tracking operation using the features obtained by the Siamese network. It adopts an actor-critic scheme and trains the network with a policy gradient, so it does not depend directly on the reward, which increases the robustness of the network; the actor-critic scheme also decouples policy training from action-value prediction. The RL tracking module comprises an Actor network and a Critic network.
The Actor and Critic networks share features through the preceding shared convolutional layers.
Actor network:
the Actor predicts the action of the current target box based on the previous-frame target box, including but not limited to the following parameters: Δx, Δy, Δw, Δh;
Δx and Δy are the predicted change of the target's center point in the current frame relative to the previous frame, and Δw and Δh are the changes in the width and height of the target box.
Critic network:
the Critic network crops the target at an intermediate layer using the predicted target position, then predicts the target's IoU through a convolutional layer and the FC layers. The predicted value serves as the action value of the RL network.
The Siamese feature extraction module comprises three convolutional layers; the Actor and Critic networks share three convolutional layers; the Actor network further comprises a convolutional layer, two FC layers and an action layer, and the Critic network further comprises a Crop layer, a convolutional layer, two FC layers and a score layer.
The Siamese feature extraction module feeds the output of its last convolutional layer into the convolutional layers of the Actor network and the Critic network, and the output of the Actor network's action layer is fed into the Crop layer of the Critic network.
The target tracking method comprises the following steps:
1) First, a target box selecting the target in the current frame image is input externally to initialize the tracking algorithm; go to step 2.
2) The Siamese module crops the target using the input target box or the predicted target box to obtain a target patch, then extracts features from the target patch and the current frame image with the same CNN; the two feature maps are correlated to obtain a correlation feature map, which serves as the input of the RL tracking module. Go to step 3.
3) The RL tracking module predicts the position of the target box in the next frame image based on the correlation feature map of the current frame extracted in step 2, obtaining the offset of the next frame relative to the current frame, and computes the predicted target box position from this offset and the current target box position; features are then extracted from the next frame image according to the predicted target box position, and the network outputs a score as the score prediction, which evaluates the difference between the predicted box and the ground-truth box and serves as the tracked state reward; the score prediction generally uses IoU, GIoU, L2 distance, or a similar measure.
4) Optimize the control policy according to the tracked state reward obtained in step 3, and train the network of the RL tracking module with a policy gradient (a hedged training sketch follows these steps).
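The disclosure says the policy is trained with a policy gradient and that the Critic's predicted IoU serves as the action value. The sketch below is one consistent reading, not the disclosure's verbatim procedure: Gaussian exploration around the Actor's output (an assumption; the disclosure does not describe exploration), a Critic regressed to the measured IoU, and a REINFORCE-style actor update weighted by the Critic's score. ActorCritic is the sketch from earlier, and reward_fn is a caller-supplied function returning the measured IoU of a box against the ground-truth box.

    # Hedged training sketch for step 4; assumes PyTorch and the ActorCritic
    # sketch above. Hyperparameters and the exploration scheme are assumptions.
    import torch
    import torch.nn.functional as F

    model = ActorCritic()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    SIGMA = 0.05  # illustrative exploration noise

    def train_step(corr_feat, cur_box, reward_fn):
        # corr_feat: (B, C, H, W) correlation features; cur_box: (B, 4) boxes
        # as (cx, cy, w, h); reward_fn(box) -> (B,) measured IoU vs. ground
        # truth (the tracked state reward).
        f = model.features(corr_feat)
        deltas = model.act(f)

        # Explore around the deterministic action (assumption, see lead-in).
        dist = torch.distributions.Normal(deltas, SIGMA)
        sample = dist.sample()  # no gradient flows through the sample itself
        pred_box = cur_box + sample
        reward = reward_fn(pred_box)

        # Critic update: regress the score layer toward the measured IoU.
        score = model.criticize(f, pred_box)
        critic_loss = F.mse_loss(score.squeeze(1), reward)

        # Actor update: REINFORCE-style policy gradient, with the Critic's
        # (detached) predicted IoU as the return of the sampled action.
        actor_loss = -(dist.log_prob(sample).sum(dim=1)
                       * score.detach().squeeze(1)).mean()

        loss = critic_loss + actor_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        return critic_loss.item(), actor_loss.item()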
The specific prediction method is as follows:
the Actor network predicts the action of the next-frame target box based on the current target box, obtaining the offset of the next frame relative to the current frame, including but not limited to the following parameters: Δx, Δy, Δw, Δh;
where Δx and Δy are the predicted change of the target's center point in the next frame relative to the current frame, and Δw and Δh are the predicted change of the target's width and height.
The predicted target box position is obtained from the current target box position and Δx, Δy, Δw, Δh.
The Critic network crops the target at an intermediate layer using the predicted target box position, i.e., it crops the corresponding features from an intermediate layer of the Critic network according to the target position and continues feature extraction from them;
the target's IoU is then predicted through the convolutional layer and the FC layers. The predicted value serves as the action value of the RL network and is used to evaluate the reliability of the current action.
If multiple single targets need to be tracked, multiple tracking instances can be started to track them, as sketched below.
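A minimal sketch of that multi-instance arrangement; SingleTargetTracker is a hypothetical wrapper name for the Siamese and RL modules described above, stubbed here so the example runs.

    # Hedged sketch: one independent tracker instance per target, keyed by a
    # caller-assigned id. SingleTargetTracker is a hypothetical wrapper around
    # the Siamese feature extraction module and the RL tracking module.
    class SingleTargetTracker:
        def __init__(self, frame, init_box):
            self.box = init_box   # stand-in for real module initialization

        def update(self, next_frame):
            return self.box       # stand-in for a real prediction step

    trackers = {}

    def start_tracking(target_id, frame, init_box):
        trackers[target_id] = SingleTargetTracker(frame, init_box)

    def step_all(next_frame):
        # Each instance independently tracks its own single target.
        return {tid: t.update(next_frame) for tid, t in trackers.items()}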
Finally, it should be noted that the above merely illustrates a few specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can directly derive or suggest from the disclosure of the present invention shall be considered within the scope of the invention.

Claims (4)

1. A target tracking method, characterized in that it comprises the following steps:
1) inputting a target box that selects the target in the current frame image; executing step 2;
2) the Siamese feature extraction module crops the target using the input target box or the predicted target box to obtain a target patch, then extracts features from the target patch and the current frame image with the same CNN; the two feature maps are correlated to obtain a correlation feature map, which serves as the input of the RL tracking module; executing step 3;
3) the RL tracking module predicts the position of the target box in the next frame image based on the correlation feature map of the current frame extracted in step 2, obtaining the offset of the next frame relative to the current frame, and obtains the predicted target box position from this offset and the current target box position; performing feature extraction from the next frame image according to the predicted target box position, the network outputting a score as the score prediction, which serves as the tracked state reward; executing step 4;
4) optimizing the control policy according to the tracked state reward obtained in step 3.
2. The target tracking method according to claim 1, characterized in that:
the score prediction includes, but is not limited to, IoU, GIoU, L2 distance, and the like.
3. The target tracking method according to claim 2, characterized in that:
the Siamese feature extraction module comprises three convolutional layers;
the RL tracking module comprises an Actor network and a Critic network; the Actor and Critic networks share three convolutional layers; the Actor network further comprises a convolutional layer, two FC layers and an action layer, and the Critic network further comprises a Crop layer, a convolutional layer, two FC layers and a score layer;
the Siamese feature extraction module feeds the output of its last convolutional layer into the convolutional layer of the Actor network and the Crop layer of the Critic network, and the output of the Actor network's action layer is fed into the Crop layer of the Critic network.
4. The target tracking method according to claim 3, characterized in that:
in step 3:
the Actor network predicts the action of the next-frame target box based on the current target box, obtaining the offset of the next frame relative to the current frame, including but not limited to the following parameters: Δx, Δy, Δw, Δh;
where Δx and Δy are the predicted change of the target's center point in the next frame relative to the current frame, and Δw and Δh are the predicted change of the target's width and height in the next frame relative to the current frame;
the predicted target box position is obtained from the current target box position and Δx, Δy, Δw, Δh;
the Critic network crops the target at an intermediate layer using the predicted target box position, i.e., it crops the corresponding features from an intermediate layer of the Critic network according to the target position and continues feature extraction;
the target's IoU is then predicted through the convolutional layer and the FC layers; the predicted value serves as the action-value score of the RL network and is used to evaluate the reliability of the current action.
CN202010834753.8A 2020-08-19 2020-08-19 Target tracking method Pending CN112116630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010834753.8A CN112116630A (en) 2020-08-19 2020-08-19 Target tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010834753.8A CN112116630A (en) 2020-08-19 2020-08-19 Target tracking method

Publications (1)

Publication Number Publication Date
CN112116630A true CN112116630A (en) 2020-12-22

Family

ID=73803769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010834753.8A Pending CN112116630A (en) 2020-08-19 2020-08-19 Target tracking method

Country Status (1)

Country Link
CN (1) CN112116630A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767456A (en) * 2019-01-09 2019-05-17 上海大学 A kind of method for tracking target based on SiameseFC frame and PFP neural network
CN110120064A (en) * 2019-05-13 2019-08-13 南京信息工程大学 A kind of depth related objective track algorithm based on mutual reinforcing with the study of more attention mechanisms
US20200026954A1 (en) * 2019-09-27 2020-01-23 Intel Corporation Video tracking with deep siamese networks and bayesian optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINGYU WAN et al.: "Visual Tracking Using Online Deep Reinforcement Learning with Heatmap", 2019 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023535672A (en) * 2021-06-30 2023-08-21 Beijing Baidu Netcom Science Technology Co., Ltd. Object segmentation method, object segmentation apparatus, and electronic device
JP7372487B2 2021-06-30 2023-10-31 Beijing Baidu Netcom Science Technology Co., Ltd. Object segmentation method, object segmentation device and electronic equipment
CN113421287A (en) * 2021-07-16 2021-09-21 上海微电机研究所(中国电子科技集团公司第二十一研究所) Robot based on vision active target tracking and control method and system thereof

Similar Documents

Publication Publication Date Title
Hoermann et al. Dynamic occupancy grid prediction for urban autonomous driving: A deep learning approach with fully automatic labeling
CN107516321B (en) Video multi-target tracking method and device
CN106845487B (en) End-to-end license plate identification method
Hoermann et al. Object detection on dynamic occupancy grid maps using deep learning and automatic label generation
CN112052802B (en) Machine vision-based front vehicle behavior recognition method
CN112734808B (en) Trajectory prediction method for vulnerable road users in vehicle driving environment
CN112489081B (en) Visual target tracking method and device
WO2011028380A2 (en) Foreground object detection in a video surveillance system
WO2011022277A2 (en) Inter-trajectory anomaly detection using adaptive voting experts in a video surveillance system
CN112116630A (en) Target tracking method
JP2012190159A (en) Information processing device, information processing method, and program
CN111950498A (en) Lane line detection method and device based on end-to-end instance segmentation
Pool et al. Crafted vs learned representations in predictive models—A case study on cyclist path prediction
CN116985793B (en) Automatic driving safety control system and method based on deep learning algorithm
Lim et al. Gaussian process auto regression for vehicle center coordinates trajectory prediction
CN113947208A (en) Method and apparatus for creating machine learning system
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN111160089A (en) Trajectory prediction system and method based on different vehicle types
Tran et al. A probabilistic discriminative approach for situation recognition in traffic scenarios
CN115861386A (en) Unmanned aerial vehicle multi-target tracking method and device through divide-and-conquer association
WO2022127819A1 (en) Sequence processing for a dataset with frame dropping
CN110244746B (en) Robot dynamic barrier avoiding method and system based on visual attention
EP3879461A1 (en) Device and method for training a neuronal network
CN111273779B (en) Dynamic gesture recognition method based on self-adaptive space supervision
Priya et al. Vehicle Detection in Autonomous Vehicles Using Computer Vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201222