CN113096156B - Automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method and device - Google Patents


Info

Publication number
CN113096156B
CN113096156B (application CN202110441246.2A)
Authority
CN
China
Prior art keywords
frame
bounding box
state set
state
state data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110441246.2A
Other languages
Chinese (zh)
Other versions
CN113096156A (en)
Inventor
张宇翔
张昱
张燕咏
吉建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110441246.2A
Publication of CN113096156A
Application granted
Publication of CN113096156B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving. When the sensor data of the t-th frame are received, two processes are executed in parallel: the tracking result of the t-th frame is predicted from the most recently updated state set, namely the state set of the r-th frame (r < t), and the detection bounding boxes of the objects contained in the sensor data of the t-th frame are detected; once the state set has been updated to the t-1 th frame, the state set of the t-1 th frame is associated with the detection bounding boxes to obtain the state set of the t-th frame, and the stored state set is updated to the state set of the t-th frame. Because the tracking result of the t-th frame is predicted directly from the state set of the r-th frame as soon as the sensor data of the t-th frame are received, three-dimensional multi-target tracking is realized and its efficiency is improved, while updating the state set based on the sensor data of the t-th frame ensures the accuracy of three-dimensional multi-target tracking.

Description

Automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method and device
Technical Field
The application relates to the field of automatic driving, in particular to an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving.
Background
In an automatic driving scene, the road conditions along the route of an autonomous vehicle change constantly. If the road conditions cannot be obtained in real time and the route adjusted accordingly in time, a collision may occur. Road conditions can be obtained through three-dimensional multi-target tracking, so three-dimensional target tracking plays a vital role in subsequent route planning and collision avoidance.
Therefore, how to provide a technical solution for implementing three-dimensional multi-target tracking is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The application aims to solve the technical problem of providing an end-to-end real-time three-dimensional multi-target tracking method for automatic driving so as to realize three-dimensional multi-target tracking.
The application also provides an end-to-end real-time three-dimensional multi-target tracking device oriented to automatic driving, which is used for ensuring the realization and application of the method in practice.
An end-to-end real-time three-dimensional multi-target tracking method for automatic driving comprises the following steps:
In the automatic driving process, when sensor data of a t frame is received, a prediction step and a state updating step are executed in parallel; wherein t is a positive integer;
The predicting step includes:
Acquiring a state set of the r-th frame, wherein r is smaller than t; the state set of the r-th frame is the most recently updated set, obtained by updating with the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
Predicting a tracking result of the t-th frame according to the state set of the r-th frame; the tracking result is used for representing the position and the pose of each object in the real scene in the t-th frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame, and acquiring a state set of the t-1 frame when the state set is updated to the t-1 frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame to the state set of the t-th frame.
In the above method, optionally, predicting the tracking result of the t frame according to the state set of the r frame includes:
calculating bounding boxes of each object in the t frame based on object state data of each object included in the state set of the r frame;
For each object, forming a sub-tracking result of the object in a t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in a state set of the r frame;
and forming each sub-tracking result into a tracking result of the t frame.
The method, optionally, the calculating, based on the object state data of each object included in the state set of the r-th frame, a bounding box of each object in the t-th frame includes:
calculating a first displacement of each object in each dimension according to the movement speed in the object state data of each object included in the state set of the r-th frame;
And calculating the bounding box of each object in the t frame according to the bounding box of each object in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
In the above method, optionally, the associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame includes:
Calculating a prediction bounding box of each object in the t-th frame based on object state data of each object included in the state set of the t-1 th frame;
constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t frame;
Solving the affinity matrix to obtain unmatched detection bounding boxes, unmatched prediction bounding boxes and a plurality of matched pairs; each matching pair includes a detection bounding box and a prediction bounding box;
And obtaining the state set of the t frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the matched pairs and the state set of the t-1 frame.
The method, optionally, the obtaining the state set of the t frame based on each unmatched detection bounding box, each unmatched prediction bounding box, each matched pair, and the state set of the t-1 frame includes:
For each matching pair, determining the respective weights of the detection bounding box and the prediction bounding box in the matching pair, carrying out a weighted average of the detection bounding box and the prediction bounding box in the matching pair according to their respective weights to obtain a first bounding box, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming first object state data from the first bounding box, the new first counting result, and the movement speed, track identifier and second counting result in the initial object state data; the initial object state data is the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1 th frame, the first counting result is used for representing the observed times of the object, and the second counting result is used for representing the continuous unobserved times of the object;
For each unmatched detection bounding box, a track identifier is allocated to the detection bounding box, the movement speed and the second counting result corresponding to the detection bounding box are initialized, the first counting result corresponding to the detection bounding box is assigned 1, and the detection bounding box and its corresponding track identifier, movement speed, first counting result and second counting result form second object state data;
For each unmatched prediction bounding box, adding one to a second counting result in object state data corresponding to the prediction bounding box in the state set of the t-1 th frame to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold value, forming third object state data by the prediction bounding box, the new second counting result, a movement speed, a track identifier and a first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame;
and forming a state set of the t frame by all the first object state data, the second object state data and the third object state data.
The method, optionally, calculates a predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame, including:
Calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
A predicted bounding box for each object in the t-th frame is calculated from bounding boxes in the object state data for each object included in the state set of the t-1 th frame, and the second displacement of each object in each dimension.
In the above method, optionally, the detecting a plurality of objects included in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame includes:
And calling a three-dimensional target detector to detect a plurality of objects contained in the sensor data of the t frame, and obtaining a plurality of detection bounding boxes of the t frame.
An end-to-end real-time three-dimensional multi-target tracking device for automatic driving, comprising:
an execution unit for executing the prediction step and the state update step in parallel when sensor data of the t-th frame is received in the process of automatic driving; wherein t is a positive integer;
The predicting step includes:
Acquiring a state set of the r-th frame, wherein r is smaller than t; the state set of the r-th frame is the most recently updated set, obtained by updating with the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
Predicting a tracking result of the t-th frame according to the state set of the r-th frame; the tracking result is used for representing the position and the pose of each object in the real scene in the t-th frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame, and acquiring a state set of the t-1 frame when the state set is updated to the t-1 frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame to the state set of the t-th frame.
In the foregoing apparatus, optionally, the execution unit is configured to predict a tracking result of the t-th frame according to the state set of the r-th frame, and the execution unit is specifically configured to:
calculating bounding boxes of each object in the t frame based on object state data of each object included in the state set of the r frame;
For each object, forming a sub-tracking result of the object in a t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in a state set of the r frame;
and forming each sub-tracking result into a tracking result of the t frame.
The above apparatus, optionally, the execution unit is configured to calculate, based on object state data of each object included in the state set of the r-th frame, a bounding box of each object in the t-th frame, where the execution unit is specifically configured to:
calculating a first displacement of each object in each dimension according to the movement speed in the object state data of each object included in the state set of the r-th frame;
And calculating the bounding box of each object in the t frame according to the bounding box of each object in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
A storage medium comprising a stored program, wherein when the program runs, it controls a device where the storage medium is located to execute the above automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method.
An electronic device comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the above end-to-end real-time three-dimensional multi-target tracking method for automatic driving.
Compared with the prior art, the application has the following advantages:
The application provides an end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving. In the process of automatic driving, when the sensor data of the t-th frame are received, a prediction step and a state update step are executed in parallel, where t is a positive integer. The prediction step comprises: acquiring the state set of the r-th frame, wherein r is smaller than t, the state set of the r-th frame is the most recently updated set, obtained by updating with the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects; and predicting the tracking result of the t-th frame according to the state set of the r-th frame, the tracking result being used for representing the position and pose of each object in the real scene in the t-th frame. The state update step comprises: detecting the objects contained in the sensor data of the t-th frame to obtain the detection bounding boxes of the t-th frame, and acquiring the state set of the t-1 th frame when the state set has been updated to the t-1 th frame; associating each detection bounding box of the t-th frame with the state set of the t-1 th frame to obtain the state set of the t-th frame; and updating the state set of the t-1 th frame to the state set of the t-th frame. According to this technical scheme, when the sensor data of the t-th frame are received, the tracking result of the t-th frame is predicted directly based on the state set of the r-th frame, which realizes three-dimensional multi-target tracking and improves its efficiency, while updating the state set based on the sensor data of the t-th frame ensures the accuracy of three-dimensional multi-target tracking.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 2 is a flowchart of another method of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 3 is a flowchart of another method of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 4 is a diagram of an example of an end-to-end real-time three-dimensional multi-target tracking method for autopilot according to the present application;
FIG. 5 is a diagram of another example of an end-to-end real-time three-dimensional multi-target tracking method for automatic driving according to the present application;
FIG. 6 is a schematic structural diagram of an end-to-end real-time three-dimensional multi-target tracking device for automatic driving according to the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides an end-to-end real-time three-dimensional multi-target tracking method for automatic driving, which can be applied to a ThunderMOT system. A flow chart of the method is shown in fig. 1; the method specifically comprises the following steps:
S101, in the automatic driving process, receiving sensor data of the t-th frame.
In this embodiment, the sensor data of the t frame is received, where t is a positive integer, and the sensor data includes, but is not limited to, RGB image data, laser point cloud data, and inertial measurement unit data.
The sensor data is sensor data within a preset observation range of the vehicle.
In the present embodiment, after receiving the sensor data of the t-th frame, step S102 and step S104 are executed in parallel.
S102, acquiring a state set of the r frame.
In this embodiment, when sensor data of a t frame in a preset observation range of a vehicle is received, a state set of the r frame is obtained; wherein r is smaller than t, the state set of the r frame is the latest updated set, the state set of the r frame is updated by the sensor data of the r frame, and the state set comprises object state data of a plurality of objects.
In this embodiment, the object state data may be denoted by s. The object state data s includes motion state data s_m and control state data s_c of the object: s_m includes a bounding box b and a motion velocity v, and s_c includes a trajectory identifier α, a first count result β and a second count result γ, where the bounding box b = (h, w, l, x, y, z, θ), (h, w, l) represents the bounding box size, (x, y, z) represents the center of the bounding box bottom face, θ represents the bounding box yaw angle, and v = (v_x, v_y, v_z).
In this embodiment, the first count result β is used to characterize the number of times the object corresponding to the object state data has been observed, the second count result γ is used to characterize the number of consecutive times that object has not been observed, and the track identifier uniquely identifies the motion track of that object. It should be noted that different objects correspond to different track identifiers.
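For concreteness, the following is a minimal sketch of this state structure; the class and field names are illustrative assumptions, not identifiers from the patent.
```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ObjectState:
    # Motion state s_m: bounding box b = (h, w, l, x, y, z, theta) and velocity v
    box: np.ndarray          # shape (7,): h, w, l, x, y, z, yaw
    velocity: np.ndarray     # shape (3,): v_x, v_y, v_z
    # Control state s_c
    track_id: int            # trajectory identifier alpha, unique per object
    observed_count: int = 1  # beta: number of times the object has been observed
    missed_count: int = 0    # gamma: consecutive times the object was not observed
```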
S103, predicting the tracking result of the t frame according to the state set of the r frame.
In this embodiment, according to the object state data of each object in the state set of the r-th frame, the position and pose of each object in the t-th frame are predicted; that is, the bounding box of each object in the t-th frame is predicted. The predicted bounding box of each object in the t-th frame and the track identifier in the object state data corresponding to that object form the sub-tracking result of the object in the t-th frame, and the sub-tracking results of all objects in the t-th frame are taken as the tracking result of the t-th frame. The tracking result is used for representing the position and pose of each object in the real scene at the t-th frame.
Referring to fig. 2, a process for predicting a tracking result of the t frame according to the state set of the r frame specifically includes:
S201, calculating bounding boxes of each object in the t-th frame based on object state data of each object included in the state set of the r-th frame.
According to the object state data of each object included in the state set of the r-th frame, the bounding box of each object is calculated; specifically, the bounding box of each object in the t-th frame is predicted based on the bounding box in the object state data of each object included in the state set of the r-th frame.
Specifically, the process of calculating the bounding box of each object in the t frame based on the object state data of each object included in the state set of the r frame specifically includes:
calculating a first displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the r-th frame;
The bounding box of each object in the t frame is calculated from the bounding box in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
In this embodiment, the first displacement of each object in each dimension is calculated from the motion velocity in the object state data of each object included in the state set of the r-th frame and the unit time interval, where the unit time interval is the time interval between the arrival of two adjacent frames of sensor data. Specifically, the time interval between the arrival of the t-th frame of sensor data and the arrival of the r-th frame of sensor data is calculated; based on this time interval and the motion velocity in the object state data of each object included in the state set of the r-th frame, the first displacement of each object in each dimension is obtained; and the bounding box of each object in the t-th frame is then calculated from the bounding box in the object state data of each object and the first displacement of each object in each dimension.
The process of calculating the bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the r-th frame is exemplified as follows:
For example, suppose the state set of the r-th frame includes object state data s corresponding to object A, where s includes a bounding box b = (h, w, l, x, y, z, θ) and a motion velocity v = (v_x, v_y, v_z), and the unit time interval is Δt. Then the first displacement of object A along the x-axis is Δx = v_x(t-r)Δt, along the y-axis Δy = v_y(t-r)Δt, and along the z-axis Δz = v_z(t-r)Δt.
The bounding box of object A at the t-th frame is b' = (h, w, l, x+Δx, y+Δy, z+Δz, θ).
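The fast prediction step above can be sketched in a few lines, reusing the ObjectState sketch from earlier; the function names are illustrative assumptions.
```python
import numpy as np

def predict_box(state: ObjectState, t: int, r: int, dt: float) -> np.ndarray:
    """Predict the bounding box of one object at frame t from its state at frame r."""
    h, w, l, x, y, z, theta = state.box
    dx, dy, dz = state.velocity * (t - r) * dt   # first displacement in each dimension
    return np.array([h, w, l, x + dx, y + dy, z + dz, theta])

def predict_frame(states: list[ObjectState], t: int, r: int, dt: float):
    """Tracking result of frame t: one (bounding box, track identifier) sub-result per object."""
    return [(predict_box(s, t, r, dt), s.track_id) for s in states]
```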
S202, for each object, forming a sub-tracking result of the object in the t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in a state set of the r frame.
For each object, the bounding box of the object in the t frame and the track identifier in the object state data corresponding to the object in the state set of the r frame form a sub-tracking result of the object in the t frame, that is, the sub-tracking result of the object in the t frame comprises the bounding box of the object in the t frame and the track identifier corresponding to the object.
S203, forming each sub-tracking result into a tracking result of the t frame.
And forming the sub-tracking results of all objects in the t frame into the tracking result of the t frame.
S104, detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame.
And detecting each object included in the sensor data of the t frame to obtain a detection bounding box of each object included in the sensor data of the t frame.
In this embodiment, a process of detecting a plurality of objects included in sensor data of a t frame to obtain a plurality of detection bounding boxes of the t frame specifically includes:
And calling a three-dimensional target detector to detect a plurality of objects contained in the sensor data of the t frame, and obtaining a plurality of detection bounding boxes of the t frame.
In this embodiment, the sensor data are sent to the three-dimensional object detector, which detects them, and the detection bounding box of each object included in the sensor data is obtained from the detector's feedback.
It should be noted that, in this embodiment, a plurality of three-dimensional object detectors may run at the same time. After the sensor data of a frame are received, the three-dimensional object detector that will detect the sensor data of the current frame is determined by polling, the sensor data of the current frame are sent to the determined detector, and that detector performs the detection. In this way, no frame of sensor data has to wait after reaching a three-dimensional object detector, which improves the detection efficiency of the sensor data.
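A minimal sketch of such round-robin dispatch over several concurrently running detectors follows; the Detector interface and class names are assumptions for illustration.
```python
import itertools

class DetectorPool:
    def __init__(self, detectors):
        self._cycle = itertools.cycle(detectors)  # poll the detectors in turn

    def detect(self, sensor_data):
        detector = next(self._cycle)              # next detector in the rotation
        return detector.detect(sensor_data)       # -> list of detection bounding boxes
```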
S105, when the state set is updated to the t-1 frame, the state set of the t-1 frame is acquired.
In this embodiment, it is determined whether the state set is updated to the t-1 frame, and if the state set is not updated to the t-1 frame, the step of determining whether the state set is updated to the t-1 frame is performed again until the state set is updated to the t-1 frame. And when the state set is updated to the t-1 frame, acquiring the state set of the t-1 frame.
In this embodiment, each time a state set is updated, an old state set is updated to a new state set, and a frame number corresponding to the old state set is updated to a frame number corresponding to the new state set.
In this embodiment, since the state set and the corresponding frame number are stored, it is possible to determine whether the state set is updated to the t-1 frame by determining whether the frame number is t-1.
In this embodiment, the acquisition blocks until the state set has been updated to the state set of the t-1 th frame; that is, this step waits for the state set to become the state set of the t-1 th frame before acquiring it.
S106, associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame.
In this embodiment, the state set of the t-1 th frame and the detection bounding boxes of the t-th frame are associated to determine which detection bounding boxes and which object state data in the state set of the t-1 th frame match each other, and which detection bounding boxes and object state data remain unmatched; the matched and unmatched items are then processed to obtain the state set of the t-th frame.
Referring to fig. 3, the process of associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame specifically includes the following steps:
S301, calculating a prediction bounding box of each object in the t-th frame based on object state data of each object included in the state set of the t-1 th frame.
According to the object state data of each object included in the state set of the t-1 th frame, the predicted bounding box of each object in the t-th frame is calculated; that is, the bounding box of each object in the t-th frame is predicted based on the bounding box in the object state data of each object included in the state set of the t-1 th frame.
Specifically, the process of calculating the predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame includes:
Calculating a second displacement of each object in each dimension according to the movement speed in the object state data of each object included in the state set of the t-1 th frame;
A predicted bounding box for each object in the t-th frame is calculated from bounding boxes in the object state data for each object included in the state set of the t-1 th frame, and the second displacement of each object in each dimension.
In this embodiment, the second displacement of each object in each dimension is calculated from the motion velocity in the object state data of each object included in the state set of the t-1 th frame and the unit time interval, where the unit time interval is the time interval between the arrival of two adjacent frames of sensor data, in this case the time interval between the arrival of the t-th frame of sensor data and the arrival of the t-1 th frame of sensor data. Based on the unit time interval and the motion velocity in the object state data of each object included in the state set of the t-1 th frame, the second displacement of each object in each dimension is calculated, and the predicted bounding box of each object in the t-th frame is obtained from the bounding box in the object state data of each object included in the state set of the t-1 th frame and the second displacement of each object in each dimension.
The process of calculating the predicted bounding box of each object in the t-th frame based on the object state data of each object included in the state set of the t-1 th frame is exemplified as follows:
For example, suppose the state set of the t-1 th frame includes object state data s_1 corresponding to object B, where s_1 includes a bounding box b_1 = (h, w, l, x, y, z, θ) and a motion velocity v_1 = (v_x, v_y, v_z), and the unit time interval is Δt. Then the second displacement of object B along the x-axis is Δx_1 = v_xΔt, along the y-axis Δy_1 = v_yΔt, and along the z-axis Δz_1 = v_zΔt. The prediction bounding box of object B at the t-th frame is b_1' = (h, w, l, x+Δx_1, y+Δy_1, z+Δz_1, θ).
S302, constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t frame.
In this embodiment, for each detection bounding box of the t-th frame, the intersection-over-union of the detection bounding box and each prediction bounding box in three-dimensional space is calculated, and an affinity matrix is constructed from the calculated values.
The affinity matrix is A_ij = iou3d(b_i^d, b_j^p), where b_i^d ∈ D_t and b_j^p ∈ P_t.
Here b_i^d denotes the i-th detection bounding box, b_j^p denotes the j-th prediction bounding box, and iou3d() denotes the intersection-over-union of two bounding boxes in three-dimensional space.
S303, solving the affinity matrix to obtain each unmatched detection bounding box, each unmatched prediction bounding box and a plurality of matched pairs.
In this embodiment, the affinity matrix is solved to obtain each unmatched detection bounding box, each unmatched prediction bounding box and a plurality of matching pairs, where each matching pair includes one detection bounding box and one prediction bounding box, and it should be noted that the detection bounding box and the prediction bounding box in each matching pair are matched with each other.
It should be noted that, each detection bounding box has one prediction bounding box matched with it, or has no prediction bounding box matched with it.
In this embodiment, the Hungarian algorithm is adopted to solve the affinity matrix, obtaining each unmatched detection bounding box, each unmatched prediction bounding box and a plurality of matched pairs.
In this embodiment, the problem of matching the detection bounding boxes against the previous state set is abstracted as a maximum bipartite graph matching problem; that is, an affinity matrix is constructed and solved with the Hungarian algorithm to obtain the unmatched detection bounding boxes, the unmatched prediction bounding boxes and the matching pairs.
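A sketch of this association step under the stated formulation follows: the affinity matrix is filled with 3D IoU values and the maximum matching is solved with the Hungarian method via SciPy's linear_sum_assignment. The iou3d() routine and the minimum-IoU cutoff for accepting a match are assumptions; the patent does not specify a threshold.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_boxes, pred_boxes, iou3d, min_iou=0.01):
    # Affinity matrix A_ij = iou3d(b_i^d, b_j^p), as in formula (2)
    A = np.zeros((len(det_boxes), len(pred_boxes)))
    for i, bd in enumerate(det_boxes):
        for j, bp in enumerate(pred_boxes):
            A[i, j] = iou3d(bd, bp)
    rows, cols = linear_sum_assignment(-A)        # Hungarian algorithm, maximizing affinity
    matches = [(i, j) for i, j in zip(rows, cols) if A[i, j] >= min_iou]
    unmatched_det = set(range(len(det_boxes))) - {i for i, _ in matches}
    unmatched_pred = set(range(len(pred_boxes))) - {j for _, j in matches}
    return matches, unmatched_det, unmatched_pred
```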
S304, a state set of the t frame is obtained based on each unmatched detection bounding box, each unmatched prediction bounding box, each matched pair and the state set of the t-1 frame.
In this embodiment, a state set of the t frame is obtained according to each unmatched detection bounding box, each unmatched prediction bounding box, each matched pair, and the previous state set.
In this embodiment, different processing methods are adopted for each unmatched detection bounding box, each unmatched prediction bounding box and each matched pair, so as to obtain a state set of the t frame.
Specifically, the process of obtaining the state set of the t frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the unmatched matching pairs and the state set of the t-1 frame includes:
For each matching pair, determining the respective weights of the detection bounding box and the prediction bounding box in the matching pair, carrying out a weighted average of the detection bounding box and the prediction bounding box in the matching pair according to their respective weights to obtain a first bounding box, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming first object state data from the first bounding box, the new first counting result, and the movement speed, track identifier and second counting result in the initial object state data; the initial object state data are the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1 th frame, the first counting result is used for representing the observed times of an object, and the second counting result is used for representing the continuous unobserved times of the object;
For each unmatched detection bounding box, allocating a track identifier to the detection bounding box, initializing the movement speed and the second counting result corresponding to the detection bounding box, assigning the first counting result corresponding to the detection bounding box to 1, and forming second object state data from the detection bounding box and its corresponding track identifier, movement speed, first counting result and second counting result;
For each unmatched prediction bounding box, adding one to a second counting result in object state data corresponding to the prediction bounding box in the state set of the t-1 th frame to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold value, forming third object state data by the prediction bounding box, the new second counting result, a movement speed, a track identifier and a first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame;
and forming a state set of the t frame by all the first object state data, the second object state data and the third object state data.
In this embodiment, since the autopilot scene is highly dynamic, both the observation point and the observed objects move: a tracked object may disappear from the observation range, and a new object may enter the observation range and start being tracked. This embodiment employs a state machine, as shown in fig. 4, to manage the appearance and disappearance of objects within the observation range. A threshold N1 is introduced to evaluate the stability of an object observed over a period of time: when the observed count β of the object is less than N1, the object is said to be in an unstable state, and it is difficult to judge whether the object actually exists or is a short-lived false positive caused by a prediction error; when β ≥ N1, the object is in a stable state and can be considered to be stably present in the observation range. A further threshold N2 evaluates whether a previously observed object has disappeared from the observation range: when an observed object is not observed in a certain frame, its not-observed count γ is set to 1, and thereafter γ is increased by 1 for each consecutive frame in which the object is not observed; when γ ≥ N2, the object is considered to have completely disappeared from the observation range. Both N1 and N2 are positive integers.
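A minimal sketch of the stability and disappearance tests implied by these thresholds follows, reusing the ObjectState sketch from earlier; the concrete values of N1 and N2 are illustrative assumptions.
```python
N1 = 3   # assumed value: observations required before an object counts as stable
N2 = 2   # assumed value: consecutive misses after which an object is considered gone

def is_stable(state: ObjectState) -> bool:
    return state.observed_count >= N1    # beta >= N1: stably present in range

def has_vanished(state: ObjectState) -> bool:
    return state.missed_count >= N2      # gamma >= N2: completely left the range
```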
It should be noted that, the unmatched detection bounding box corresponds to a new object entering the observation range. An unmatched predicted bounding box corresponds to an original object not currently observed.
In this embodiment, for each matching pair, the weight of the detection bounding box is determined by the uncertainty of the detection bounding box, and the weight of the prediction bounding box by the uncertainty of the prediction bounding box; the weights are calculated with a filter, which may optionally be a Kalman filter. Specifically, the detection bounding box and the prediction bounding box in the matching pair are input into the Kalman filter to obtain their respective weights, and a weighted average of the detection bounding box and the prediction bounding box according to these weights yields the first bounding box corresponding to the matching pair. The update of the control state data corresponding to the matching pair corresponds to the directed edge in fig. 4 whose transition condition is 'observed', that is, the object is observed: the first count result in the initial object state data is increased by one to obtain a new first count result; the track identifier and the second count result in the initial object state data form the first control state data; the first bounding box and the motion velocity in the initial object state data form the first motion state data; and the first control state data and the first motion state data form the first object state data. The first object state data are the object state data obtained after updating the initial object state data, where the initial object state data are the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1 th frame.
In this embodiment, for each unmatched detection bounding box, new object state data needs to be created in the state set of the t-th frame. Specifically, a track identifier is allocated to the detection bounding box, and the motion velocity v and the second count result γ are initialized: the motion velocity may be initialized to 0, and the initialization of the second count result, which corresponds to the directed edge from the 'start' state to the 'unstable' state in fig. 4, sets γ to 0. The first count result β is assigned 1. The detection bounding box and the motion velocity form the second motion state data; the track identifier, the first count result and the second count result form the second control state data; and the second motion state data and the second control state data form the second object state data corresponding to the detection bounding box, which is the newly created object state data.
In this embodiment, for each unmatched prediction bounding box, only the prediction bounding box can be trusted, owing to the lack of an observation; the update of the control state data corresponds to the directed edge in fig. 4 whose transition condition is 'not observed'. Specifically, the second count result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame is increased by one to obtain a new second count result, and it is judged whether the new second count result is smaller than the disappearance threshold, which corresponds to N2 in fig. 4. If the new second count result is not smaller than the disappearance threshold, the object is considered to have completely disappeared from the observation range, and no new object state data need be computed for it. If the new second count result is smaller than the disappearance threshold, the prediction bounding box and the motion velocity in the corresponding object state data form the third motion state data; the new second count result together with the track identifier and the first count result in the corresponding object state data form the third control state data; and together they form the third object state data, which are the object state data obtained after updating the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame.
In this embodiment, all the first object state data, all the second object state data, and all the third object state data are combined into the state set of the t-th frame.
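Putting the three cases together, the following sketch updates the state set, consuming the outputs of the associate() sketch and reusing ObjectState and N2 from earlier. The fixed-weight box average and the reset of γ upon re-observation are simplifying assumptions; the patent derives the weights from the box uncertainties via a Kalman filter.
```python
import itertools

import numpy as np

_next_id = itertools.count()

def update_states(prev_states, det_boxes, pred_boxes, matches,
                  unmatched_det, unmatched_pred, w_det=0.5):
    # pred_boxes[j] is assumed to have been predicted from prev_states[j]
    new_states = []
    for i, j in matches:                              # case 1: matched pairs
        s = prev_states[j]
        s.box = w_det * det_boxes[i] + (1 - w_det) * pred_boxes[j]  # first bounding box
        s.observed_count += 1                         # "observed" transition: beta + 1
        s.missed_count = 0                            # assumed: gamma resets when observed
        new_states.append(s)
    for i in unmatched_det:                           # case 2: newly appeared objects
        new_states.append(ObjectState(box=det_boxes[i], velocity=np.zeros(3),
                                      track_id=next(_next_id)))   # beta=1, gamma=0
    for j in unmatched_pred:                          # case 3: objects not observed
        s = prev_states[j]
        s.missed_count += 1                           # "not observed" transition: gamma + 1
        if s.missed_count < N2:                       # keep only below the vanish threshold
            s.box = pred_boxes[j]                     # trust the prediction fully
            new_states.append(s)
    return new_states
```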
S107, updating the state set of the t-1 th frame to the state set of the t-th frame.
In this embodiment, the state set is updated: the state set of the t-1 th frame is replaced by the state set of the t-th frame.
According to the end-to-end real-time three-dimensional multi-target tracking method for automatic driving provided by this embodiment, when the sensor data of the t-th frame are received during automatic driving, two tasks are executed in parallel: the state set of the r-th frame is acquired and the tracking result of the t-th frame is predicted from it; and the objects contained in the sensor data of the t-th frame are detected to obtain the detection bounding boxes of the t-th frame, each detection bounding box is associated with the state set of the t-1 th frame once the state set has been updated to the t-1 th frame, yielding the state set of the t-th frame, and the state set of the t-1 th frame is updated to the state set of the t-th frame. By applying this method, the tracking result of the t-th frame is predicted directly based on the state set of the r-th frame as soon as the sensor data of the t-th frame are received, without blocking to wait for the state set to be updated to that of the t-th frame; this realizes three-dimensional multi-target tracking and improves its efficiency, while updating the state set based on the sensor data of the t-th frame ensures the accuracy of three-dimensional multi-target tracking.
Referring to fig. 5, the specific implementation procedure of the above-mentioned end-to-end real-time three-dimensional multi-target tracking method for automatic driving is illustrated as follows:
state definition:
Given an object o, its state s consists of two parts, a motion state s_m and a control state s_c. s_m includes a bounding box b = (h, w, l, x, y, z, θ) and a motion velocity v = (v_x, v_y, v_z), where (h, w, l) is the bounding box size, (x, y, z) is the center point of the bounding box bottom face, and θ is the bounding box yaw angle. s_c includes a trajectory identifier α, a counter β of the number of times object o has been observed, and a counter γ of the number of consecutive times it has not been observed. The states of all tracked objects in the t-th frame constitute a set S_t = {s_i | i = 1, ..., n_t}, where n_t denotes the number of tracked objects in the t-th frame.
Specifically, after the sensor data I_t of the t-th frame are received, the fast prediction module and the slow update module are started and executed in parallel.
The slow detection module first calls the three-dimensional object detector to detect I_t, obtaining an object bounding box set D_t; the slow tracking module then associates D_t with the previous state set S_{t-1} and updates it to obtain the state set S_t of the t-th frame. The fast prediction module directly predicts the bounding box of each tracked object in the t-th frame from the current state set S_r and takes the bounding boxes and the corresponding track identifiers α as the final output, where r ≤ t because the state update speed may be slower than the data arrival speed.
Wherein, for the fast prediction module:
Since the update rate of the state set by the slow tracking module is likely slower than the data arrival rate, when the fast prediction task of the t-th frame is received, the globally shared state set (the state pool) may have been updated only to the r-th frame (r ≤ t); this set is denoted S_r.
At this time, unlike the conventional tracking-by-detection paradigm, which waits for the state set S_r to be updated to S_t and only then outputs the tracking result of the t-th frame, the fast prediction module does not block: it applies a constant-velocity model to each s ∈ S_r to predict the bounding box of the object in the t-th frame and quickly produces the tracking result of the t-th frame.
Specifically, suppose the state of an object is s ∈ S_r and the arrival interval of each frame of data is fixed at Δt. According to the constant-velocity model, the displacement of the object from the r-th frame to the t-th frame in the observation coordinate system is estimated as:
Δx = v_x(t-r)Δt, Δy = v_y(t-r)Δt, Δz = v_z(t-r)Δt (1)
The bounding box of the t-th frame is then obtained as b = (h, w, l, x+Δx, y+Δy, z+Δz, θ).
For the slow detection module:
The object detection step calls a three-dimensional object detector to obtain the set of object bounding boxes in the input data. The specific implementation of the detection step is not limited: any detector can be integrated into the ThunderMOT system as long as it satisfies the definition of the input and output interfaces. This enables ThunderMOT to flexibly plug in different detectors according to the scene and realizes hot-swapping of the detection model.
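A minimal sketch of such an input/output interface contract follows; the Protocol definition and method name are assumptions used for illustration.
```python
from typing import List, Protocol

class Detector3D(Protocol):
    def detect(self, sensor_data_path: str) -> List[list]:
        """Return the 7-DoF object bounding boxes (h, w, l, x, y, z, yaw) in the frame."""
        ...
```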
For the slow tracking module:
the slow tracking module includes a data association step and a status update step.
The data association step specifically comprises the following steps:
The bounding box set D_t produced by the slow detection module is matched against the state set S_{t-1}. It should be noted that this step blocks until the slow tracking module has updated the state set to S_{t-1}.
Specifically, the matching problem between the bounding box set D_t of the t-th frame and the state set S_{t-1} of the t-1 th frame is abstracted as a maximum bipartite graph matching problem and solved with the Hungarian algorithm. The affinity matrix A ∈ R^{m_t × n_{t-1}} is given by formula (2), where m_t is the size of the bounding box set D_t and n_{t-1} is the size of the state set S_{t-1}; P_t is the set of t-th frame bounding boxes predicted according to formula (1), and the function iou3d() denotes the intersection-over-union of two bounding boxes in three-dimensional space.
A_ij = iou3d(b_i^d, b_j^p), where b_i^d ∈ D_t, b_j^p ∈ P_t (2)
The output of the data association step is three sets: the matching set M_t, the unmatched bounding box set D_t', and the unmatched state set S_{t-1}'.
A state updating step:
For each matching tuple (b_t^d, b_t^p, s_{t-1}) ∈ M_t, the elements s_{t-1}, b_t^d and b_t^p denote, respectively, the state of an object o at the t-1 th frame, the observed bounding box of the object at the t-th frame, and the predicted bounding box of the object at the t-th frame. Taking the observed bounding box b_t^d as the observation, in the t-th frame, of the object corresponding to state s_{t-1}, a Kalman filter is called for state estimation, yielding the motion state s_t^m of the object in the t-th frame. According to the Bayes rule, the updated motion state s_t^m is a weighted average in the state space of b_t^d and b_t^p, with the weights (i.e., the Kalman gain) determined by the uncertainties of b_t^d and b_t^p. The control state transition from s_{t-1}^c to s_t^c corresponds to the directed edge of fig. 5 for which the transition condition is 'observed'.
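A toy sketch of this Bayes-rule weighted average follows: with a scalar uncertainty per box, the Kalman gain K sets the balance between the predicted and observed boxes. The variance values are illustrative assumptions, and the full filter would also update the velocity.
```python
import numpy as np

def kalman_box_update(b_pred, b_obs, var_pred=1.0, var_obs=0.5):
    b_pred, b_obs = np.asarray(b_pred), np.asarray(b_obs)
    K = var_pred / (var_pred + var_obs)   # Kalman gain from the two uncertainties
    b_new = b_pred + K * (b_obs - b_pred) # weighted average of prediction and observation
    var_new = (1 - K) * var_pred          # posterior uncertainty shrinks after the update
    return b_new, var_new
```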
For each unmatched detection bounding box b_t^d ∈ D_t', a new object state s_t is created in the state set S_t of the t-th frame: b in s_t^m is initialized to b_t^d, and v is initialized to 0. The control state initialization corresponds to the directed edge from the 'start' state to the 'unstable' state in fig. 5.
For each unmatched tracked object state s_{t-1} ∈ S_{t-1}', only the bounding box b_t^p predicted according to formula (1) can be fully trusted, owing to the lack of an observation: b in the motion state is set directly to b_t^p, thereby obtaining the motion state s_t^m. The control state transition from s_{t-1}^c to s_t^c corresponds to the directed edge of fig. 5 for which the transition condition is 'not observed'.
In this embodiment, apart from avoiding read-write conflicts on the state set, the slow tracking module and the fast prediction module have no explicit synchronization, so the execution time of the fast prediction module is the response delay of each frame of data. The fast prediction module is realized by motion-model prediction, and the motion prediction cost per object is very small compared with the conventional tracking-by-detection method.
In this embodiment, in order to let multiple deep-learning-based 3D object detection models that rely on different software environments (e.g., different Python versions, different deep learning frameworks, or different versions of the same framework) access the system, slow object detection is implemented as a local server, with HTTP as the application-layer communication protocol. When each frame of data arrives, the fast prediction task and the slow tracking task are submitted to a thread pool for parallel execution. The slow tracking module, acting as a client, sends a request to the detection server; after obtaining the detection result of the frame, it calls the associate() method and then the update() method to realize association and update of the object states.
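The per-frame dispatch can be sketched as follows; the endpoint URL, payload layout and the fast_predict/slow_track callables are assumptions standing in for the system's actual components.
```python
from concurrent.futures import ThreadPoolExecutor

import requests

pool = ThreadPoolExecutor(max_workers=4)
DETECT_URL = "http://localhost:8080/detect"   # assumed address of the local detection server

def detect_remote(t: int, data_path: str):
    # components share the file system, so only the storage path is transmitted
    resp = requests.post(DETECT_URL, json={"frame": t, "path": data_path})
    return resp.json()["boxes"]

def on_frame(t: int, data_path: str, fast_predict, slow_track):
    fut = pool.submit(fast_predict, t)                    # fast path: answer from current state set
    pool.submit(slow_track, t, detect_remote, data_path)  # slow path: detect, associate, update
    return fut.result()                                   # per-frame output is the fast prediction
```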
In this embodiment, the object state is shared by the two tasks of fast prediction and slow tracking: the former calls the predict() method of the object state, and the latter calls the update() method of the motion state. ThunderMOT introduces Read-Copy-Update (RCU) locks at the object level to ensure consistency of the motion state under concurrency, while ensuring that fast prediction tasks do not time out because write operations of slow tracking tasks block access to the object state. Under this mechanism, the fast prediction task acts as a reader and can access the motion state of an object without acquiring any lock, while the slow update task acts as a writer: it first copies the state, then modifies the copy, and finally, at the appropriate time, redirects the pointer from the historical state to the updated state. Prediction and update of different objects are packaged as independent tasks submitted to the thread pool, and they do not affect each other.
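A simplified stand-in for this copy-then-swap discipline is sketched below: the reader takes the current reference without locking, and the writer publishes a modified copy with a single reference assignment (atomic under CPython's GIL). The class and method names are assumptions.
```python
import copy

class SharedState:
    def __init__(self, state):
        self._state = state          # currently published version

    def read(self):                  # fast-prediction task: lock-free read
        return self._state

    def update(self, mutate):        # slow-tracking task: copy, modify, swap
        new_state = copy.deepcopy(self._state)
        mutate(new_state)
        self._state = new_state      # pointer swap publishes the new version
```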
In this embodiment, ThunderMOT, the detection server, the tracking server and the data sensor share the file system. Raw sensor data are passed between components by transferring the storage path of the sensor data in the file system, which avoids the significant communication overhead of explicitly transferring the sensor data as byte streams.
In order to better illustrate the effect of the end-to-end real-time three-dimensional multi-target tracking method for automatic driving, the inventor evaluates ThunderMOT systems from two aspects of tracking speed and tracking precision through experiments.
The experimental environment of the application is a server equipped with 2 Intel Xeon E5-2690 v3 CPUs (each with 12 physical cores and hyper-threading enabled), 4 GeForce RTX 2080 Ti GPUs (each with 4352 cores and 12 GB of video memory) and 256 GB of memory. Server software configuration: the operating system is Ubuntu 18.04, the Python version is 3.7.7, and the CUDA version is 10.2. Tracking accuracy is evaluated on the KITTI multi-target tracking dataset.
The evaluation results show that, on the KITTI multi-target tracking dataset, the average delay is 2.0 milliseconds, the worst-case delay is 8.6 milliseconds, and the multiple object tracking accuracy (MOTA) reaches 83.71%, demonstrating excellent tracking speed and tracking accuracy.
Corresponding to the method shown in fig. 1, the embodiment of the application further provides an end-to-end real-time three-dimensional multi-target tracking device for automatic driving, which is used for implementing the method in fig. 1, and a schematic structure diagram of the device is shown in fig. 6, and specifically includes:
An execution unit 601 for executing the prediction step and the status update step in parallel when the sensor data of the t-th frame is received in the course of automatic driving; wherein t is a positive integer;
The predicting step includes:
Acquiring a state set of the r-th frame, wherein r is smaller than t; the state set of the r-th frame is the most recently updated set, obtained by updating with the sensor data of the r-th frame, and the state set comprises object state data of a plurality of objects;
Predicting a tracking result of the t-th frame according to the state set of the r-th frame; the tracking result is used for representing the position and the pose of each object in the real scene in the t-th frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame, and acquiring a state set of the t-1 frame when the state set is updated to the t-1 frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame to the state set of the t frame.
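The parallel structure of the execution unit could be sketched as follows; predict_step and state_update_step are illustrative stand-ins for the two steps just described, not identifiers from the patent:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)

def on_sensor_frame(t, sensor_data, tracker):
    # Fast path: predict frame t's tracking result from the most recently
    # updated state set (that of some frame r <= t-1).
    prediction = pool.submit(tracker.predict_step, t)
    # Slow path: detect, associate, and advance the state set to frame t.
    pool.submit(tracker.state_update_step, t, sensor_data)
    # The tracking result is returned without waiting for the update to finish.
    return prediction.result()
```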
According to the end-to-end real-time three-dimensional multi-target tracking device for automatic driving, when the sensor data of the t frame is received, the tracking result of the t frame is directly predicted based on the state set of the r frame, without blocking to wait for the state set to be updated to the state set of the t frame. Three-dimensional multi-target tracking is thereby achieved and its efficiency improved, while the state set is still updated based on the sensor data of the t frame, so that the accuracy of three-dimensional multi-target tracking is ensured.
In one embodiment of the present application, based on the foregoing solution, the execution unit 601 is configured to predict a tracking result of the t frame according to the state set of the r frame, and the execution unit 601 is specifically configured to:
calculating bounding boxes of each object in the t frame based on object state data of each object included in the state set of the r frame;
For each object, forming a sub-tracking result of the object in a t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in a state set of the r frame;
and forming each sub-tracking result into a tracking result of the t frame.
In one embodiment of the present application, based on the foregoing solution, the execution unit 601 is configured to calculate, based on object state data of each object included in the state set of the r frame, a bounding box of each object in the t frame, where the execution unit 601 is specifically configured to:
calculating a first displacement of each object in each dimension according to the movement speed in the object state data of each object included in the state set of the r-th frame;
And calculating the bounding box of each object in the t frame according to the bounding box of each object in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension.
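Under a constant-velocity motion model, the prediction path described above might be sketched as follows; the dictionary fields and the 7-parameter box layout (center, size, yaw) are assumptions for illustration:

```python
def predict_tracking_result(state_set, r, t):
    """Predict frame t's tracking result from the state set of frame r."""
    result = []
    for obj in state_set:
        # First displacement in each dimension: movement speed times the
        # number of elapsed frames (t - r).
        disp = [v * (t - r) for v in obj["velocity"]]
        cx, cy, cz, l, w, h, yaw = obj["bbox"]
        bbox_t = (cx + disp[0], cy + disp[1], cz + disp[2], l, w, h, yaw)
        # Sub-tracking result: the predicted box plus the track identifier.
        result.append({"track_id": obj["track_id"], "bbox": bbox_t})
    return result
```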
In one embodiment of the present application, based on the foregoing scheme, the execution unit 601 is configured to associate each detection bounding box of the t frame with the state set of the t-1 th frame to obtain the state set of the t frame, and the execution unit 601 is specifically configured to:
Calculating a prediction bounding box of each object in the t frame based on object state data of each object included in the state set of the t-1 frame;
constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t frame;
Solving the affinity matrix to obtain unmatched detection bounding boxes, unmatched prediction bounding boxes and a plurality of matched pairs; each matching pair includes a detection bounding box and a prediction bounding box;
And obtaining the state set of the t frame based on the unmatched detection bounding boxes, the unmatched prediction bounding boxes, the matched pairs and the state set of the t-1 frame.
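The application does not fix the affinity measure or the solver for the affinity matrix; the sketch below uses negative center distance as the affinity and SciPy's linear_sum_assignment (the Hungarian method) as one plausible instantiation, with the rejection threshold an assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_boxes, det_boxes, min_affinity=-4.0):
    """pred_boxes, det_boxes: float arrays of shape (N, 7) and (M, 7)."""
    # Affinity: negative Euclidean distance between box centers (first 3 dims).
    aff = -np.linalg.norm(
        pred_boxes[:, None, :3] - det_boxes[None, :, :3], axis=2)
    rows, cols = linear_sum_assignment(-aff)     # maximize total affinity
    matches, used_p, used_d = [], set(), set()
    for i, j in zip(rows, cols):
        if aff[i, j] >= min_affinity:            # reject implausible pairs
            matches.append((i, j)); used_p.add(i); used_d.add(j)
    unmatched_pred = [i for i in range(len(pred_boxes)) if i not in used_p]
    unmatched_det = [j for j in range(len(det_boxes)) if j not in used_d]
    return matches, unmatched_pred, unmatched_det
```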
In one embodiment of the present application, based on the foregoing scheme, the execution unit 601 is configured to obtain the state set of the t-th frame based on each unmatched detection bounding box, each unmatched prediction bounding box, each matched pair, and the state set of the t-1 th frame, where the execution unit 601 is specifically configured to:
For each matching pair, determining the respective weights of the detection bounding box and the prediction bounding box in the matching pair, carrying out a weighted average of the detection bounding box and the prediction bounding box in the matching pair according to their respective weights to obtain a first bounding box, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming first object state data from the first bounding box, the new first counting result, and the movement speed, track identifier and second counting result in the initial object state data; the initial object state data is the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1 frame, the first counting result is used for representing the number of times the object has been observed, and the second counting result is used for representing the number of consecutive times the object has not been observed;
For each unmatched detection bounding box, a track identifier is allocated to the detection bounding box, the movement speed and the second counting result corresponding to the detection bounding box are initialized, the first counting result corresponding to the detection bounding box is assigned the value 1, and the track identifier, the movement speed, the first counting result and the second counting result corresponding to the detection bounding box form second object state data;
For each unmatched prediction bounding box, adding one to a second counting result in object state data corresponding to the prediction bounding box in the state set of the t-1 th frame to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold value, forming third object state data by the prediction bounding box, the new second counting result, a movement speed, a track identifier and a first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame;
and forming a state set of the t frame by all the first object state data, the second object state data and the third object state data.
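Continuing the association sketch above, the three kinds of object state data could then be formed as sketched below; the fusion weight, the zero initial velocity, and the concrete vanish threshold are assumptions:

```python
import itertools

_next_track_id = itertools.count()  # hypothetical global track-id allocator

def update_state_set(prev_states, pred_boxes, det_boxes,
                     matches, unmatched_pred, unmatched_det,
                     w_det=0.6, vanish_threshold=3):
    new_states = []
    # First object state data: fuse matched detection and prediction boxes.
    for i, j in matches:
        s = prev_states[i]
        fused = [w_det * d + (1 - w_det) * p
                 for d, p in zip(det_boxes[j], pred_boxes[i])]
        new_states.append({"bbox": fused, "velocity": s["velocity"],
                           "track_id": s["track_id"],
                           "hits": s["hits"] + 1,     # first counting result
                           "misses": s["misses"]})    # second counting result
    # Second object state data: new tracks from unmatched detections.
    for j in unmatched_det:
        new_states.append({"bbox": list(det_boxes[j]),
                           "velocity": [0.0, 0.0, 0.0],   # initialized
                           "track_id": next(_next_track_id),
                           "hits": 1, "misses": 0})
    # Third object state data: unmatched predictions kept until they vanish.
    for i in unmatched_pred:
        s = prev_states[i]
        if s["misses"] + 1 < vanish_threshold:
            new_states.append({**s, "bbox": list(pred_boxes[i]),
                               "misses": s["misses"] + 1})
    return new_states
```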
In one embodiment of the present application, based on the foregoing scheme, the execution unit 601 is configured to calculate, based on object state data of each object included in the state set of the t-1 th frame, a predicted bounding box of each object in the t frame, where the execution unit 601 is specifically configured to:
Calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
A predicted bounding box for each object in the t-th frame is calculated from bounding boxes in the object state data for each object included in the state set of the t-1 th frame, and the second displacement of each object in each dimension.
In one embodiment of the present application, based on the foregoing solution, the execution unit is configured to detect a plurality of objects included in the sensor data of the t frame, and obtain a plurality of detection bounding boxes of the t frame, where the execution unit is specifically configured to:
And calling a three-dimensional target detector to detect a plurality of objects contained in the sensor data of the t frame, and obtaining a plurality of detection bounding boxes of the t frame.
The embodiment of the application also provides a storage medium comprising stored instructions, wherein, when the instructions run, the device on which the storage medium is located is controlled to execute the automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method.
The embodiment of the present application further provides an electronic device, a schematic structural diagram of which is shown in fig. 7, specifically including a memory 701 and one or more instructions 702, where the one or more instructions 702 are stored in the memory 701 and configured to be executed by one or more processors 703, and the one or more instructions 702 perform the following operations:
In the automatic driving process, when sensor data of a t frame is received, a prediction step and a state updating step are executed in parallel; wherein t is a positive integer,
The predicting step includes:
Acquiring a state set of an r frame; the state set of the r frame is the most recently updated set, obtained by updating with the sensor data of the r frame, and the state set includes object state data of a plurality of objects;
Predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and pose of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame, and acquiring a state set of the t-1 frame when the state set is updated to the t-1 frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
and updating the state set of the t-1 th frame to the state set of the t frame.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for identical and similar parts between the embodiments, reference may be made to one another. Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to the description of the method embodiments for relevant points.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The end-to-end real-time three-dimensional multi-target tracking method and device for automatic driving provided by the application are described in detail above. Specific examples are used herein to illustrate the principle and implementation of the application, and the description of the above examples is only intended to help understand the method and core idea of the application. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the application. In view of the above, the contents of this description should not be construed as limiting the application.

Claims (4)

1. An end-to-end real-time three-dimensional multi-target tracking method for automatic driving is characterized by comprising the following steps:
In the automatic driving process, when sensor data of a t frame is received, a prediction step and a state updating step are executed in parallel; wherein t is a positive integer,
The predicting step includes:
Acquiring a state set of an r frame; the state set of the r frame is the most recently updated set, obtained by updating with the sensor data of the r frame, and the state set includes object state data of a plurality of objects;
Predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and pose of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame, and acquiring a state set of the t-1 frame when the state set is updated to the t-1 frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
Updating the state set of the t-1 th frame to the state set of the t frame;
Wherein predicting the tracking result of the t frame according to the state set of the r frame includes:
calculating a first displacement of each object in each dimension according to the movement speed in the object state data of each object included in the state set of the r-th frame;
Calculating bounding boxes of each object in the t frame according to the bounding boxes in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension;
For each object, forming a sub-tracking result of the object in a t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in a state set of the r frame;
each sub-tracking result is formed into a tracking result of the t frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame, including:
Calculating a prediction bounding box of each object in the t frame based on object state data of each object included in the state set of the t-1 frame;
constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t frame;
Solving the affinity matrix to obtain unmatched detection bounding boxes, unmatched prediction bounding boxes and a plurality of matched pairs; each matching pair includes a detection bounding box and a prediction bounding box;
For each matching pair, determining the respective weights of the detection bounding box and the prediction bounding box in the matching pair, carrying out a weighted average of the detection bounding box and the prediction bounding box in the matching pair according to their respective weights to obtain a first bounding box, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming first object state data from the first bounding box, the new first counting result, and the movement speed, track identifier and second counting result in the initial object state data; the initial object state data is the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1 frame, the first counting result is used for representing the number of times the object has been observed, and the second counting result is used for representing the number of consecutive times the object has not been observed;
For each unmatched detection bounding box, a track identifier is allocated to the detection bounding box, the movement speed and the second counting result corresponding to the detection bounding box are initialized, the first counting result corresponding to the detection bounding box is assigned the value 1, and the track identifier, the movement speed, the first counting result and the second counting result corresponding to the detection bounding box form second object state data;
For each unmatched prediction bounding box, adding one to the second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold value, forming third object state data from the prediction bounding box, the new second counting result, and the movement speed, track identifier and first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame;
and forming a state set of the t frame by all the first object state data, the second object state data and the third object state data.
2. The method of claim 1, wherein the calculating a predicted bounding box for each object at the t-th frame based on the object state data for each object included in the state set of the t-1 th frame comprises:
Calculating a second displacement of each object in each dimension according to the motion speed in the object state data of each object included in the state set of the t-1 th frame;
A predicted bounding box for each object in the t-th frame is calculated from bounding boxes in the object state data for each object included in the state set of the t-1 th frame, and the second displacement of each object in each dimension.
3. The method according to claim 1, wherein detecting a plurality of objects contained in the sensor data of the t-th frame to obtain a plurality of detection bounding boxes of the t-th frame includes:
and calling a three-dimensional target detector to detect a plurality of objects included in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame.
4. An end-to-end real-time three-dimensional multi-target tracking device for automatic driving, comprising:
an execution unit for executing the prediction step and the state update step in parallel when sensor data of the t-th frame is received in the process of automatic driving; wherein t is a positive integer;
The predicting step includes:
Acquiring a state set of an r frame; the state set of the r frame is the most recently updated set, obtained by updating with the sensor data of the r frame, and the state set includes object state data of a plurality of objects;
Predicting a tracking result of the t frame according to the state set of the r frame; the tracking result is used for representing the position and pose of each object in the real scene in the t frame;
the state updating step includes:
detecting a plurality of objects contained in the sensor data of the t frame to obtain a plurality of detection bounding boxes of the t frame, and acquiring a state set of the t-1 frame when the state set is updated to the t-1 frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame;
Updating the state set of the t-1 th frame to the state set of the t frame;
Wherein predicting the tracking result of the t frame according to the state set of the r frame includes:
calculating a first displacement of each object in each dimension according to the movement speed in the object state data of each object included in the state set of the r-th frame;
Calculating bounding boxes of each object in the t frame according to the bounding boxes in the object state data of each object included in the state set of the r frame and the first displacement of each object in each dimension;
For each object, forming a sub-tracking result of the object in a t frame by using a bounding box of the object in the t frame and a track identifier in object state data corresponding to the object in a state set of the r frame;
each sub-tracking result is formed into a tracking result of the t frame;
Associating each detection bounding box of the t frame with the state set of the t-1 frame to obtain the state set of the t frame, including:
Calculating a prediction bounding box of each object in the t frame based on object state data of each object included in the state set of the t-1 frame;
constructing an affinity matrix based on each prediction bounding box and each detection bounding box of the t frame;
Solving the affinity matrix to obtain unmatched detection bounding boxes, unmatched prediction bounding boxes and a plurality of matched pairs; each matching pair includes a detection bounding box and a prediction bounding box;
For each matching pair, determining the respective weights of the detection bounding box and the prediction bounding box in the matching pair, carrying out a weighted average of the detection bounding box and the prediction bounding box in the matching pair according to their respective weights to obtain a first bounding box, adding one to a first counting result in initial object state data to obtain a new first counting result, and forming first object state data from the first bounding box, the new first counting result, and the movement speed, track identifier and second counting result in the initial object state data; the initial object state data is the object state data corresponding to the prediction bounding box of the matching pair in the state set of the t-1 frame, the first counting result is used for representing the number of times the object has been observed, and the second counting result is used for representing the number of consecutive times the object has not been observed;
For each unmatched detection bounding box, a track identifier is allocated to the detection bounding box, the movement speed and the second counting result corresponding to the detection bounding box are initialized, the first counting result corresponding to the detection bounding box is assigned the value 1, and the track identifier, the movement speed, the first counting result and the second counting result corresponding to the detection bounding box form second object state data;
For each unmatched prediction bounding box, adding one to the second counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame to obtain a new second counting result, and if the new second counting result is smaller than a disappearance threshold value, forming third object state data from the prediction bounding box, the new second counting result, and the movement speed, track identifier and first counting result in the object state data corresponding to the prediction bounding box in the state set of the t-1 th frame;
and forming a state set of the t frame by all the first object state data, the second object state data and the third object state data.
CN202110441246.2A 2021-04-23 2021-04-23 Automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method and device Active CN113096156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441246.2A CN113096156B (en) 2021-04-23 2021-04-23 Automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method and device

Publications (2)

Publication Number Publication Date
CN113096156A CN113096156A (en) 2021-07-09
CN113096156B true CN113096156B (en) 2024-05-24

Family

ID=76679720

Country Status (1)

Country Link
CN (1) CN113096156B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185503A1 (en) * 2016-04-29 2017-11-02 高鹏 Target tracking method and apparatus
CN106709938A (en) * 2016-11-18 2017-05-24 电子科技大学 Multi-target tracking method based on improved TLD (tracking-learning-detected)
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN111932580A (en) * 2020-07-03 2020-11-13 江苏大学 Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Yujie; Dou Changhong; Zhao Qilu; Li Zongmin. Online multi-object tracking based on state prediction and motion structure. Journal of Computer-Aided Design & Computer Graphics, 2018, (02), full text. *
Zheng Xiaomeng; Zhang Dehai. Moving object matching and tracking algorithm based on effective feature points. Electronic Design Engineering, 2018, (20), full text. *

Also Published As

Publication number Publication date
CN113096156A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
US11010921B2 (en) Distributed pose estimation
US20200011668A1 (en) Simultaneous location and mapping (slam) using dual event cameras
US9952689B2 (en) Application programming interface for a multi-pointer indirect touch input device
US11331801B2 (en) System and method for probabilistic multi-robot positioning
US11620755B2 (en) Method and system for tracking trajectory based on visual localization and odometry
Darbandi et al. involving Kalman filter technique for increasing the reliability and efficiency of cloud computing
CN112540609A (en) Path planning method and device, terminal equipment and storage medium
CN116501512A (en) Efficient multi-device synchronization barrier using multicasting
CN110238841B (en) Obstacle avoiding method and device
CN116432060A (en) Target self-adaptive clustering method, device, equipment and storage medium based on radar
CN113096156B (en) Automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method and device
EP3907679B1 (en) Enhanced robot fleet navigation and sequencing
Marín et al. Event based distributed Kalman filter for limited resource multirobot cooperative localization
JP2021131875A (en) Method and system for detecting automatic input
CN115830065A (en) Image-based speed determination method, device, equipment and storage medium
CN114964204A (en) Map construction method, map using method, map constructing device, map using equipment and storage medium
US10915343B2 (en) Server computer execution of client executable code
CN109031192B (en) Object positioning method, object positioning device and electronic equipment
CN110399892B (en) Environmental feature extraction method and device
CN112506203A (en) Robot motion dynamic feedback method and system
WO2020103495A1 (en) Exposure duration adjustment method and device, electronic apparatus, and storage medium
Chai et al. Tourist Street View Navigation and Tourist Positioning Based on Multimodal Wireless Virtual Reality
Graff et al. Operating system support for mobile robot swarms
Gustavsson An Efficient Approach for Detecting Moving Objects and Deriving Their Positions and Velocities
US20240067217A1 (en) Autonomous vehicle control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant