CN108573496B - Multi-target tracking method based on LSTM network and deep reinforcement learning - Google Patents


Info

Publication number
CN108573496B
CN108573496B (application number CN201810271429.2A)
Authority
CN
China
Prior art keywords
target
tracking
time
reinforcement learning
deep reinforcement
Prior art date
Legal status
Active
Application number
CN201810271429.2A
Other languages
Chinese (zh)
Other versions
CN108573496A (en)
Inventor
姜明新
常波
贾银洁
Current Assignee
Dragon Totem Technology Hefei Co ltd
Shanghai Xinghao Information Technology Co ltd
Original Assignee
Huaiyin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201810271429.2A
Publication of CN108573496A
Application granted
Publication of CN108573496B
Legal status: Active
Anticipated expiration

Classifications

    • G06T7/20 Analysis of motion (G06T7/00 Image analysis; G06T Image data processing or generation; G06 Computing; G Physics)
    • G06N3/045 Combinations of networks (G06N3/04 Architecture, e.g. interconnection topology; G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models)
    • G06T2207/10016 Video; image sequence (G06T2207/10 Image acquisition modality; G06T2207/00 Indexing scheme for image analysis or image enhancement)
    • G06T2207/20084 Artificial neural networks [ANN] (G06T2207/20 Special algorithmic details; G06T2207/00 Indexing scheme for image analysis or image enhancement)


Abstract

The invention discloses a multi-target tracking method based on an LSTM network and deep reinforcement learning. A target detector detects each frame of the video to be processed and outputs the detection set D_t = {d_t^j, j = 1, …, N}. A plurality of single-target trackers are constructed based on deep reinforcement learning; each tracker comprises a convolutional neural network built on the VGG-16 network and a fully connected layer, and outputs its tracking result, giving the set T_t = {b_t^i, i = 1, …, M}. A similarity matrix S_t for data association is computed, whose element S_t(i, j) is the Euclidean distance between b_t^i and d_t^j. A data association module is constructed based on an LSTM network; inputting the similarity matrix yields the assignment probability vector A_t^i, whose element A_t^{ij} is the matching probability between the i-th target and detection result j, and the detection result with the maximum matching probability is taken as the tracking result of the i-th target. The method is robust to mutual occlusion, similar appearance and a continuously changing number of targets during multi-target tracking, and improves multi-target tracking accuracy and precision.

Description

Multi-target tracking method based on LSTM network and deep reinforcement learning
Technical Field
The invention belongs to the field of computer vision, relates to a video multi-target tracking method, and particularly relates to a multi-target tracking method based on an LSTM network and deep reinforcement learning.
Background
Multi-target tracking is a hot problem in the field of computer vision and plays an important role in many applications, such as artificial intelligence, virtual reality and autonomous driving. Despite a large body of earlier work, multi-target tracking remains challenging because of frequent occlusion, the similar appearance of multiple targets and the continuously changing number of targets during tracking.
In recent years, detection-based multi-target tracking methods have been successful; they divide multi-target tracking into two parts, multi-target detection and data association. Deep learning and deep reinforcement learning have recently been widely applied in computer vision, but there were as yet no related research results in multi-target tracking.
Disclosure of Invention
Purpose of the invention: to overcome the technical defects of the prior art, in which manually designed models are not comprehensive enough and tracking results are not accurate enough, the invention provides a multi-target tracking method based on an LSTM network and deep reinforcement learning.
The technical scheme is as follows: a multi-target tracking method based on an LSTM network and deep reinforcement learning comprises the following steps:
(1) detecting each frame of the video to be processed with a target detector and outputting the detection results; the detection results of the image at time t form the set D_t = {d_t^j, j = 1, …, N}, where d_t^j is the j-th detection result of the image at time t and N is the total number of detections;
(2) constructing a plurality of single-target trackers based on deep reinforcement learning, each comprising a convolutional neural network built on the VGG-16 network and a fully connected layer, and outputting the tracking result of each single-target tracker; the tracking results of the image at time t form the set T_t = {b_t^i, i = 1, …, M}, where b_t^i is the output of the i-th single-target tracker at time t and M is the total number of targets tracked simultaneously at time t;
(3) computing, from the set D_t of step (1) and the set T_t of step (2), the similarity matrix S_t used for data association, where the element S_t(i, j) is the Euclidean distance between b_t^i and d_t^j, i = 1, …, M, j = 1, …, N;
(4) constructing a data association module based on an LSTM network and inputting the similarity matrix S_t to obtain the assignment probability vector A_t^i = (A_t^{i1}, …, A_t^{iN}), which collects the matching probabilities between the i-th target and all possible target detections; A_t^{ij} is the matching probability between the i-th target and detection result j, and the target detection result with the maximum matching probability is taken as the tracking result of the i-th target.
Further, the target detector in the step (1) adopts YOLO V2.
Further, the detection results output by the target detector in step (1) and the outputs of the target trackers in step (2) are both four-dimensional vectors, d_t^j = (x_t', y_t', w_t', h_t') and b_t^i = (x_t, y_t, w_t, h_t), where (x_t', y_t') are the center coordinates, w_t' the width and h_t' the height of the target rectangle in the target detector, and (x_t, y_t) are the center coordinates, w_t the width and h_t the height of the target tracking rectangle in the single-target tracker.
Further, the specific method by which the single-target tracker in step (2) outputs its tracking result is as follows:
The single-target tracker based on deep reinforcement learning regards each target as an agent, trains the agent with deep reinforcement learning, and lets each agent decide its action according to its own state and the feedback given by the environment.
The action a_t at time t is a seven-dimensional vector covering movement in the two horizontal directions, movement in the two vertical directions, size changes and a terminate-search action. Let the state vector at time t be s_t = (p_t, v_t), where v_t is the vector of historical actions. The state vector at time t+1, s_{t+1} = (p_{t+1}, v_{t+1}), is determined from the state s_t = (p_t, v_t): the state transition equation p_{t+1} = f_p(p_t, a_t) and the action change equation v_{t+1} = f_v(v_t, a_t) predict the state vector at time t+1 from the action and the state vector at time t.
During training the agent receives a feedback signal r_t. At every intermediate iteration of the tracking process r_t = 0; when the terminate-search action is selected at the termination time T, the feedback signal r_T is a threshold function of the intersection-over-union IoU:
r_T = +1 if IoU(p_T, g) > τ, and r_T = −1 otherwise,
where IoU(p_T, g) = area(p_T ∩ g) / area(p_T ∪ g), g is the ground truth of the image block, i.e. the manually calibrated true position of the target, and τ is a manually set threshold.
Further, the LSTM network in step (4) takes as input parameters the hidden state h_i of step i, the cell state c_i of step i and the similarity matrix S_t, and outputs as parameters the hidden state h_{i+1} of step i+1, the cell state c_{i+1} of step i+1 and the assignment probability vector A_t^i. The hidden state h_i and the cell state c_i are first initialized; then, step by step, h_i, c_i and the similarity matrix S_t are input, and h_{i+1}, c_{i+1} and A_t^i are output.
has the advantages that: the LSTM network and depth reinforcement learning technology is applied to the video multi-target tracking method for the first time, the technical defects that a manually designed model is not comprehensive enough and the tracking result is not accurate enough are overcome, the influence of mutual shielding, similar appearance and continuous change of quantity in the multi-target tracking process is avoided, and the multi-target tracking accuracy are improved.
Drawings
FIG. 1 is a system diagram of the multi-target tracking method based on the LSTM network and the deep reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of a single target tracker;
FIG. 3 is a block diagram of deep reinforcement learning;
fig. 4 is a schematic diagram of an LSTM network.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
As shown in fig. 1, the multi-target tracking method based on LSTM network and deep reinforcement learning includes the following steps:
(1) Each frame of the video to be processed is detected with a YOLO V2 target detector and the detection results are output; the detection results of the image at time t form the set D_t = {d_t^j, j = 1, …, N}, where d_t^j is the j-th detection result of the image at time t and N is the total number of detections;
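For concreteness, a minimal sketch of assembling the detection set D_t is given below. The run_yolo_v2 function is a hypothetical wrapper for whichever YOLO V2 implementation is used; it is assumed to return per-frame boxes as (x, y, w, h) center coordinates, width and height.

```python
import numpy as np

def run_yolo_v2(frame):
    """Hypothetical wrapper around a YOLO V2 detector: returns a list of
    (x, y, w, h) boxes (center coordinates, width, height) for one frame."""
    raise NotImplementedError("plug in the actual detector here")

def detection_set(frame):
    """Assemble D_t = {d_t^j, j = 1..N} as an (N, 4) array, one row per detection."""
    boxes = run_yolo_v2(frame)
    return np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
```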
(2) As shown in FIG. 2, a plurality of single-target trackers based on deep reinforcement learning are constructed; each single-target tracker comprises a convolutional neural network (CNN) and a fully connected layer (FC). The convolutional neural network is built on the VGG-16 network; VGG-16 belongs to the prior art and is widely used in deep learning methods. The CNN designed by the invention comprises 5 pooling layers, following the Conv1-2, Conv2-2, Conv3-3, Conv4-3 and Conv5-3 blocks, and the features output by Conv3-3, Conv4-3 and Conv5-3 are used as the target representation features during tracking. Each single-target tracker outputs its tracking result; the tracking results of the image at time t form the set T_t = {b_t^i, i = 1, …, M}, where b_t^i is the output of the i-th single-target tracker at time t and M is the total number of targets tracked simultaneously at time t;
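As an illustrative sketch (not code from the patent) of this target representation, the following PyTorch snippet extracts the Conv3-3, Conv4-3 and Conv5-3 feature maps from a standard torchvision VGG-16; the layer indices 15, 22 and 29 are the ReLU outputs of those convolutions in torchvision's layer ordering, and the weights argument follows recent torchvision versions.

```python
import torch
import torchvision

def vgg16_tracking_features(image_batch):
    """Return the Conv3-3, Conv4-3 and Conv5-3 feature maps of VGG-16,
    used as the target representation features during tracking.

    image_batch: float tensor of shape (B, 3, H, W), ImageNet-normalized.
    """
    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
    taps = {15: "conv3_3", 22: "conv4_3", 29: "conv5_3"}  # ReLU outputs
    feats = {}
    x = image_batch
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in taps:
                feats[taps[idx]] = x
    return feats
```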
As shown in FIG. 3, the single-target tracker based on deep reinforcement learning regards each target as an agent; the agent is trained with deep reinforcement learning, and each agent decides its action according to its own state and the feedback given by the environment.
the action set (action) A adopted by us is composed of 6 actions in different directions and 1 action for terminating search, including horizontal two-direction movement { right, left }, vertical two-direction movement { up, down }, size change { scaleup, scale down }, and action for terminating search, that is, action a at time ttConsisting of a 7-dimensional vector. Let the state vector at time t be st=(pt,vt),ptRepresentative image blocks, vtIs a vector of historical actions, this patent stores 10 historical actions, which means vtIs a 70-dimensional vector. State vector s at time t +1t+1=(pt+1,vt+1) From the state s at time tt=(pt,vt) Determining, based on the state transition equation, that p ist+1=ft(pt,at) And the equation of motion change vt+1=fv(vt,at) Predicting the state vector at the t +1 moment according to the motion at the t moment and the state vector at the t moment, as shown in FIG. 2; the specific prediction method is described as follows: let ptIs [ x ]t,yt,wt,ht]The state transition equation includes: Δ xt=αwt,Δyt=αhtα is 0.03, p is calculated by using the action change equationt+1If act atBy "left", i.e. pt+1Is [ x ]t-Δxt,yt,wt,ht](ii) a If action atTo "scaleup", then pt+1Is [ x ]t,yt,wt+Δxt,ht+Δyt]。
During training, the agent receives a feedback signal r_t whose role is to tell the agent how to move or whether to terminate. At every intermediate iteration of the tracking process r_t = 0; when the terminate-search action is selected at the termination time T, the feedback signal r_T is a threshold function of the intersection-over-union (IoU):
r_T = +1 if IoU(p_T, g) > τ, and r_T = −1 otherwise,
where IoU(p_T, g) = area(p_T ∩ g) / area(p_T ∪ g), g is the ground truth of the image block, i.e. the manually calibrated true position of the target, and τ is a manually set threshold.
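A minimal sketch of this terminal feedback, assuming boxes in (center x, center y, width, height) format and the +1/−1 threshold form given above; the value tau = 0.7 is only an illustrative default, since the patent leaves the threshold to be set manually.

```python
def iou(p, g):
    """IoU(p, g) = area(p ∩ g) / area(p ∪ g) for boxes given as
    (center x, center y, width, height)."""
    px1, py1, px2, py2 = p[0] - p[2]/2, p[1] - p[3]/2, p[0] + p[2]/2, p[1] + p[3]/2
    gx1, gy1, gx2, gy2 = g[0] - g[2]/2, g[1] - g[3]/2, g[0] + g[2]/2, g[1] + g[3]/2
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))   # intersection width
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))   # intersection height
    inter = iw * ih
    union = p[2]*p[3] + g[2]*g[3] - inter
    return inter / union if union > 0 else 0.0

def terminal_reward(p_T, g, tau=0.7):
    """r_T = +1 if IoU(p_T, g) > tau, else -1 (tau is the manually set threshold)."""
    return 1.0 if iou(p_T, g) > tau else -1.0
```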
(3) The detection results output by the target detector in step (1) and the outputs of the target trackers in step (2) are both four-dimensional vectors, d_t^j = (x_t', y_t', w_t', h_t') and b_t^i = (x_t, y_t, w_t, h_t), where (x_t', y_t') are the center coordinates, w_t' the width and h_t' the height of the target rectangle in the target detector, and (x_t, y_t) are the center coordinates, w_t the width and h_t the height of the target tracking rectangle in the single-target tracker. From the set D_t of step (1) and the set T_t of step (2), the similarity matrix S_t used for data association is computed; its element S_t(i, j) is the Euclidean distance between b_t^i and d_t^j, i = 1, …, M, j = 1, …, N.
The data association similarity matrix S_t thus measures the degree of similarity between the outputs b_t^i of the single-target trackers and the outputs d_t^j of the target detector.
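A minimal sketch of this computation, assuming T_t and D_t are stored as (M, 4) and (N, 4) NumPy arrays of the four-dimensional vectors defined above:

```python
import numpy as np

def similarity_matrix(T_t, D_t):
    """S_t(i, j) = Euclidean distance between tracker output b_t^i
    (row i of T_t) and detection d_t^j (row j of D_t)."""
    diff = T_t[:, None, :] - D_t[None, :, :]   # broadcast to shape (M, N, 4)
    return np.linalg.norm(diff, axis=2)        # shape (M, N)
```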
(4) The data association module is constructed on the basis of an LSTM network. The LSTM network takes as input parameters the hidden state h_i of step i, the cell state c_i of step i and the similarity matrix S_t, and outputs as parameters the hidden state h_{i+1} of step i+1, the cell state c_{i+1} of step i+1 and the assignment probability vector A_t^i. The hidden state h_i and the cell state c_i are first initialized; then, step by step, h_i, c_i and the similarity matrix S_t are input, and h_{i+1}, c_{i+1} and A_t^i are output. A_t^i = (A_t^{i1}, …, A_t^{iN}) collects the matching probabilities between the i-th target and all possible target detections; A_t^{ij} is the matching probability between the i-th target and detection result j, and the target detection result with the maximum matching probability is taken as the tracking result of the i-th target.
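The association step can be sketched with a PyTorch LSTM cell. The following is an illustrative reconstruction under stated assumptions, not the patent's exact architecture: the hidden size of 128, feeding one similarity row S_t(i, ·) per step, and a linear-plus-softmax output head are implementation choices the patent does not fix.

```python
import torch
import torch.nn as nn

class LSTMAssociation(nn.Module):
    """One LSTM step per target: input the similarity row S_t(i, .),
    output the assignment probability vector A_t^i over the N detections."""

    def __init__(self, num_detections, hidden_size=128):
        super().__init__()
        self.cell = nn.LSTMCell(num_detections, hidden_size)
        self.out = nn.Linear(hidden_size, num_detections)

    def forward(self, S_t):
        M, N = S_t.shape
        h = S_t.new_zeros(1, self.cell.hidden_size)  # initialized hidden state h_0
        c = S_t.new_zeros(1, self.cell.hidden_size)  # initialized cell state c_0
        rows = []
        for i in range(M):                            # step i -> step i+1
            h, c = self.cell(S_t[i].unsqueeze(0), (h, c))
            A_i = torch.softmax(self.out(h), dim=1)   # A_t^i, length N
            rows.append(A_i.squeeze(0))
        return torch.stack(rows)                      # (M, N) matching probabilities

# The tracking result of the i-th target is the detection with maximum
# matching probability: j_star = A.argmax(dim=1)[i].
```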
In order to verify the technical effect of the method, the following experiments were carried out:
the Windows 10 operating system was used for experiments, MATLAB R2016a was used as the software platform, and the computer was configured primarily as Intel (R) core (TM) i7-4712MQ CPU @3.40GHz (with 16G memory) with TITAN GPU (12.00GB memory).
To measure the performance of the tracking method, 4 indexes are compared: multi-target tracking accuracy (MOTA), multi-target tracking precision (MOTP), false positives (FP) and identity switches (IDSW). Table 1 lists the comparison between the proposed method (LSTM_DRL) and 4 prior-art methods; the data are based on 11 test videos.
TABLE 1 Comparison results (the table is reproduced as an image in the original publication).
Experiments show that the method overcomes the insufficient accuracy of prior-art tracking results, is robust to mutual occlusion, similar appearance and a continuously changing number of targets during multi-target tracking, and improves multi-target tracking accuracy and precision.

Claims (4)

1. A multi-target tracking method based on an LSTM network and deep reinforcement learning is characterized by comprising the following steps:
(1) detecting each frame of the video to be processed with a target detector and outputting the detection results; the detection results of the image at time t form the set D_t = {d_t^j, j = 1, …, N}, where d_t^j is the j-th detection result of the image at time t and N is the total number of detections;
(2) constructing a plurality of single-target trackers based on deep reinforcement learning, each comprising a convolutional neural network built on the VGG-16 network and a fully connected layer, and outputting the tracking result of each single-target tracker; the tracking results of the image at time t form the set T_t = {b_t^i, i = 1, …, M}, where b_t^i is the output of the i-th single-target tracker at time t and M is the total number of targets tracked simultaneously at time t;
the specific method for outputting the tracking result by the single-target tracker in the step (2) is as follows:
the single target tracker based on the deep reinforcement learning technology takes each target as an intelligent agent, trains the intelligent agent by utilizing the deep reinforcement learning technology, and determines actions according to feedback given by the intelligent agent according to the self state and the environment;
action a at time ttThe vector is a seven-dimensional vector which comprises motion in two horizontal directions, motion in two vertical directions, size change and search termination action, and the state vector at the moment t is set as st=(pt,vt),vtIs the direction of historical actionsAmount, ptFor the tracking target image block at time t, the state vector s at time t +1t+1=(pt+1,vt+1) From the state s at time tt=(pt,vt) Determination of pt+1The image block of the tracked target at the time of t +1 is p according to a state conversion equationt+1=ft(pt,at) And the equation of motion change vt+1=fv(vt,at) Predicting a state vector at the t +1 moment according to the motion at the t moment and the state vector at the t moment;
in the training process, the intelligent body receives a feedback signal rtAt each iteration instant in the tracking process, rt0, when the termination of the search action is selected at the termination time T, the feedback signal rTIs a threshold function of the intersection ratio IoU:
Figure FDA0002482679050000015
wherein IoU (p)T,g)=area(pT∩g)/area(pT∪ g), g is the real value of the image block, i.e. the real position of the target calibrated manually, tau is the threshold set manually, pTA tracking target image block at the termination time T;
(3) computing, from the set D_t of step (1) and the set T_t of step (2), the similarity matrix S_t used for data association, where the element S_t(i, j) is the Euclidean distance between b_t^i and d_t^j, i = 1, …, M, j = 1, …, N;
(4) constructing a data association module based on an LSTM network and inputting the similarity matrix S_t to obtain the assignment probability vector A_t^i = (A_t^{i1}, …, A_t^{iN}), which collects the matching probabilities between the i-th target and all possible target detections; A_t^{ij} is the matching probability between the i-th target and detection result j, and the target detection result with the maximum matching probability is taken as the tracking result of the i-th target.
2. The LSTM network and deep reinforcement learning-based multi-target tracking method of claim 1, wherein the target detector in step (1) adopts YOLO V2.
3. The LSTM network and deep reinforcement learning based multi-target tracking method of claim 1, wherein the detection results output by the target detector in step (1) and the outputs of the target trackers in step (2) are both four-dimensional vectors, d_t^j = (x_t', y_t', w_t', h_t') and b_t^i = (x_t, y_t, w_t, h_t), where (x_t', y_t') are the center coordinates, w_t' the width and h_t' the height of the target rectangle in the target detector; (x_t, y_t) are the center coordinates, w_t the width and h_t the height of the target tracking rectangle in the single-target tracker.
4. The LSTM network and deep reinforcement learning based multi-target tracking method of claim 1, wherein the LSTM network in step (4) takes as input parameters the hidden state h_i of step i, the cell state c_i of step i and the similarity matrix S_t, and outputs as parameters the hidden state h_{i+1} of step i+1, the cell state c_{i+1} of step i+1 and the assignment probability vector A_t^i; the hidden state h_i and the cell state c_i are first initialized, then, step by step, h_i, c_i and the similarity matrix S_t are input, and h_{i+1}, c_{i+1} and A_t^i are output.
CN201810271429.2A 2018-03-29 2018-03-29 Multi-target tracking method based on LSTM network and deep reinforcement learning Active CN108573496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810271429.2A CN108573496B (en) 2018-03-29 2018-03-29 Multi-target tracking method based on LSTM network and deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810271429.2A CN108573496B (en) 2018-03-29 2018-03-29 Multi-target tracking method based on LSTM network and deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108573496A CN108573496A (en) 2018-09-25
CN108573496B true CN108573496B (en) 2020-08-11

Family

ID=63574660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810271429.2A Active CN108573496B (en) 2018-03-29 2018-03-29 Multi-target tracking method based on LSTM network and deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108573496B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109360226B (en) * 2018-10-17 2021-09-24 武汉大学 Multi-target tracking method based on time series multi-feature fusion
CN109685100B (en) * 2018-11-12 2024-05-10 平安科技(深圳)有限公司 Character recognition method, server and computer readable storage medium
CN109682392B (en) * 2018-12-28 2020-09-01 山东大学 Visual navigation method and system based on deep reinforcement learning
CN109816701B (en) * 2019-01-17 2021-07-27 北京市商汤科技开发有限公司 Target tracking method and device and storage medium
CN110544268B (en) * 2019-07-29 2023-03-24 燕山大学 Multi-target tracking method based on structured light and SiamMask network
CN111127513B (en) * 2019-12-02 2024-03-15 北京交通大学 Multi-target tracking method
CN111027505B (en) * 2019-12-19 2022-12-23 吉林大学 Hierarchical multi-target tracking method based on significance detection
CN111354023A (en) * 2020-03-09 2020-06-30 中振同辂(江苏)机器人有限公司 Camera-based visual multi-target tracking method
CN112053385B (en) * 2020-08-28 2023-06-02 西安电子科技大学 Remote sensing video shielding target tracking method based on deep reinforcement learning
CN112381021B (en) * 2020-11-20 2022-07-12 安徽一视科技有限公司 Personnel detection counting method based on deep learning
CN113762231B (en) * 2021-11-10 2022-03-22 中电科新型智慧城市研究院有限公司 End-to-end multi-pedestrian posture tracking method and device and electronic equipment
CN117788511B (en) * 2023-12-26 2024-06-25 兰州理工大学 Multi-expansion target tracking method based on deep neural network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022239A (en) * 2016-05-13 2016-10-12 电子科技大学 Multi-target tracking method based on recurrent neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242266B2 (en) * 2016-03-02 2019-03-26 Mitsubishi Electric Research Laboratories, Inc. Method and system for detecting actions in videos

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022239A (en) * 2016-05-13 2016-10-12 电子科技大学 Multi-target tracking method based on recurrent neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Da Zhang et al., "Deep Reinforcement Learning for Visual Object Tracking in Videos," https://arxiv.org/pdf/1701.08936v2.pdf, 2017-04-13, pp. 1-5. *
Anton Milan et al., "Online Multi-Target Tracking Using Recurrent Neural Networks," https://arxiv.org/pdf/1604.03635.pdf, 2016-12-10, pp. 1-7. *

Also Published As

Publication number Publication date
CN108573496A (en) 2018-09-25

Similar Documents

Publication Publication Date Title
CN108573496B (en) Multi-target tracking method based on LSTM network and deep reinforcement learning
Chiu et al. Probabilistic 3D multi-modal, multi-object tracking for autonomous driving
CN111127513B (en) Multi-target tracking method
CN109800689B (en) Target tracking method based on space-time feature fusion learning
WO2020215492A1 (en) Multi-bernoulli multi-target video detection and tracking method employing yolov3
CN107516321B (en) Video multi-target tracking method and device
CN111932580A (en) Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm
CN112052802B (en) Machine vision-based front vehicle behavior recognition method
CN112836640A (en) Single-camera multi-target pedestrian tracking method
CN104200488A (en) Multi-target tracking method based on graph representation and matching
CN104574439A (en) Kalman filtering and TLD (tracking-learning-detection) algorithm integrated target tracking method
CN109191488B (en) Target tracking system and method based on CSK and TLD fusion algorithm
CN110363165B (en) Multi-target tracking method and device based on TSK fuzzy system and storage medium
CN110349188B (en) Multi-target tracking method, device and storage medium based on TSK fuzzy model
CN111626194A (en) Pedestrian multi-target tracking method using depth correlation measurement
CN108022254A (en) A kind of space-time contextual target tracking based on sign point auxiliary
CN108898612A (en) Multi-object tracking method based on the enhancing study of multiple agent depth
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
WO2018227491A1 (en) Method and device for association of fuzzy data of multiple targets in video
CN111639570B (en) Online multi-target tracking method based on motion model and single-target clue
CN116977367A (en) Campus multi-target tracking method based on transform and Kalman filtering
CN113971688B (en) Anchor-free multi-target tracking method for enhancing ID re-identification
Wang et al. Effective multiple pedestrian tracking system in video surveillance with monocular stationary camera
Qing et al. A novel particle filter implementation for a multiple-vehicle detection and tracking system using tail light segmentation
CN111429481A (en) Target tracking method, device and terminal based on adaptive expression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20180925

Assignee: Huaian xiaobaihu coating Engineering Co.,Ltd.

Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY

Contract record no.: X2020980006698

Denomination of invention: Multi-target tracking method based on LSTM network and deep reinforcement learning

Granted publication date: 20200811

License type: Common License

Record date: 20201010

TR01 Transfer of patent right

Effective date of registration: 20231129

Address after: 200000 Building 1, 377 Nanqiao Road, Nanqiao Town, Fengxian District, Shanghai

Patentee after: Shanghai Xinghao Information Technology Co.,Ltd.

Address before: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Dragon totem Technology (Hefei) Co.,Ltd.

Effective date of registration: 20231129

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co.,Ltd.

Address before: 223005 Jiangsu Huaian economic and Technological Development Zone, 1 East Road.

Patentee before: HUAIYIN INSTITUTE OF TECHNOLOGY