CN116363171A - Three-dimensional multi-target tracking method integrating point cloud and image information - Google Patents

Three-dimensional multi-target tracking method integrating point cloud and image information

Info

Publication number
CN116363171A
Authority
CN
China
Prior art keywords
dimensional
target
detection frame
dimensional detection
background
Prior art date
Legal status
Pending
Application number
CN202310165319.9A
Other languages
Chinese (zh)
Inventor
才华
郑延阳
付强
马智勇
王伟刚
李英超
Current Assignee
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202310165319.9A
Publication of CN116363171A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/292 Multi-camera tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/273 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion removing elements interfering with the pattern to be recognised
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10032 Satellite or aerial image; Remote sensing
    • G06T2207/10044 Radar image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a three-dimensional multi-target tracking method integrating point cloud and image information, belonging to the fields of computer vision and image processing. For a low-score three-dimensional detection frame, the method continues to use the target sub-region and spatial background features that a sub-region and background information network extracts from the two-dimensional detection frame of the same target matched to that low-score three-dimensional detection frame, thereby alleviating the problems of target loss and frequent identity switching caused by motion-induced mutual occlusion between tracked targets and between tracked targets and scene objects.

Description

Three-dimensional multi-target tracking method integrating point cloud and image information
Technical Field
The invention relates to the field of computer vision technology and image processing, in particular to a multi-target tracking method based on fusion of laser radar point cloud data, image target subregions and space background information thereof.
Background
Tracking-by-detection is one of the most common frameworks in the field of multi-target tracking (MOT): it acquires detection frames directly from detectors, constructs tracks across frames, and assigns the same ID to the same target.
The most important problem currently faced by multi-target tracking is target loss or identity switching caused by occlusion between objects in a scene. A detection-based tracking framework relies heavily on its detector; that is, a poorly performing target detector directly degrades the performance of the tracker. Moreover, since detection-based tracking frameworks for three-dimensional visual tracking mostly use a single three-dimensional detector, the point cloud data they rely on carries unique depth information but lacks the semantic information, such as texture, that is unique to images.
At present, most trackers perform cross-frame matching with the Hungarian algorithm. However, when an object is occluded its detection score drops, and the tracker automatically discards low-score detection frames according to a manually set threshold; obviously, a low detection score does not mean that the target in the scene has been lost.
Disclosure of Invention
In view of the above, the present invention aims to solve the problem of target loss caused by motion-induced mutual occlusion between tracked targets and between tracked targets and scene objects during multi-target tracking in a three-dimensional scene, and provides a multi-target tracking method based on the fusion of laser radar point cloud data with image target sub-regions and their spatial background information.
The technical scheme adopted by the invention for achieving the purpose is as follows: a three-dimensional multi-target tracking method for fusing point cloud and image information is characterized by comprising the following steps:
step 1, acquiring three-dimensional point cloud data and two-dimensional image data of a time-synchronized target space;
step 2, extracting point cloud data features by using a three-dimensional target detector to obtain a three-dimensional detection frame of the target, and extracting a two-dimensional detection frame of the target on the obtained two-dimensional image by using a two-dimensional target detector;
step 3, associating the two-dimensional detection frame and the three-dimensional detection frame from the same target, so that the two-dimensional detection frame and the three-dimensional detection frame from the same target in the same frame can be determined by searching the value of the association number p;
step 4, traversing all two-dimensional detection frames detected by the two-dimensional target detector and determining the frame of each two-dimensional detection frame in the two-dimensional image; for each two-dimensional detection frame, acquiring a detection frame containing a new background in the corresponding frame, wherein the detection frame containing the new background has the same center as the two-dimensional detection frame and satisfies (W, H) = α(w, h), where α denotes the magnification factor, W and H are the width and height of the detection frame containing the new background, and w and h are the width and height of the two-dimensional detection frame; inputting the acquired detection frame containing the new background into the trained sub-region and background information network, which identifies the sub-regions within the target, extracts and encodes their features, and finally outputs a sub-region feature vector R_j, where R_j denotes the sub-region feature vector of the j-th detection; the network likewise identifies the spatial background information, extracts and encodes its features, and outputs a background feature vector S_j, where S_j denotes the background feature vector of the j-th detection; the sub-region and background information network has a vertically parallel, symmetric structure, and the uplink and downlink branches each consist of, in sequence, a first convolution layer, a second convolution layer, a maximum pooling layer, a first residual module, a second residual module, a third residual module, a fourth residual module, an average pooling layer and a fully connected layer;
step 5, the three-dimensional detection frame detected by the three-dimensional target detector and the characteristics output by the subarea and the background information network in the step 4 are sent into a target tracking network for data association, specifically:
sequencing all three-dimensional detection frames output by a three-dimensional target detector according to the confidence score, and setting a score threshold to divide the three-dimensional detection frames into a high-score detection frame and a low-score detection frame;
the high-score detection frame adopts 3D GIOU as an association index to directly carry out coordinate-based data association with the track;
for a low-score detection frame, first judging whether there exists a two-dimensional detection frame from the same target corresponding to it; if no such two-dimensional detection frame exists, the low-score detection frame is directly discarded; if it exists, the feature codes output by the sub-region and background information network are further queried through the association number p to perform data association based on appearance features, specifically as follows:
querying, through the association number p, the sub-region feature codes output by the sub-region and background information network in step 4, and performing appearance-feature-based data association with the feature vectors corresponding to those codes; specifically, the association uses the cosine distance D_1(i, j) as the index measuring the similarity of feature vectors between two frames, where D_1(i, j) is:

D_1(i, j) = 1 - (R_i^det · R_j^trk) / (||R_i^det|| ||R_j^trk||)

where R_i^det denotes the sub-region feature vector of detection target i and R_j^trk denotes the sub-region feature vector of tracked target j;
the Hungarian algorithm is adopted for matching, and if even one sub-region block within the target has a sufficiently close feature similarity, the target is considered successfully matched and its state is still updated with Kalman filtering;
if no sub-region is matched successfully, the background information is searched instead: using the background feature codes output by the downlink network in step 4, the cosine distance continues to be used as the feature similarity index, with the cosine distance D_2(i, j) given by:

D_2(i, j) = 1 - (S_i^det · S_j^trk) / (||S_i^det|| ||S_j^trk||)

where S_i^det denotes the feature vector of detection background i and S_j^trk denotes the feature vector of tracked background j;
background information matching is performed with the Hungarian algorithm; if the background features can be matched, the target is considered not to have moved but merely to be occluded, and its state continues to be updated with Kalman filtering;
step 6, using 3D Kalman filter to update state
For a three-dimensional detection frame successfully matched by coordinate-based data association, its state is sent directly into the Kalman filter, and its target state is obtained as

(x, y, z, θ, l, w, h, v_x, v_y, v_z)

where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and v_x, v_y, v_z denote the velocities of the target along the x, y and z directions;
for a two-dimensional instance successfully matched by appearance-feature-based data association, the corresponding three-dimensional instance state information is queried according to the association number p and sent into the Kalman filter for track updating, so that its target state takes the same form (x, y, z, θ, l, w, h, v_x, v_y, v_z) as in coordinate-based data association;
the three-dimensional Kalman filter state is updated using the detection result of the three-dimensional target detector at the current moment as the observation; the predicted state of the corresponding track is updated through the matched detections to obtain the final matched track of frame t, and the final 3D-Kalman-updated track is (x', y', z', θ', l', w', h', v_x', v_y', v_z');
Step 7: tracking lifecycle management
It is specified that if a target receives no update for E consecutive frames, or its background information fails to match for more than the set maximum number of frames, the target is considered to have left the scene; for a potential new object entering the scene, if its three-dimensional instances are detected and matched for N consecutive frames, it is regarded as a target newly added to the scene and a track is allocated for it.
Further, the two-dimensional image is acquired by a camera, and the three-dimensional point cloud data is acquired by a laser radar.
Further, the step 2 specifically includes:
the three-dimensional detection frame is obtained by utilizing a three-dimensional object detector PV-RCNN, and the output characteristics of the three-dimensional detection frame are (x, y, z, theta, l, w, h and p), wherein (x, y, z) is the coordinate of an object in a three-dimensional space, theta is the rotation angle of the object in an xy plane, l, w and h represent the length, width and height of the three-dimensional detection frame, and p is used for representing the association number;
the two-dimensional target detector YOLOv7 is utilized to acquire a two-dimensional detection frame, the output characteristic of the two-dimensional detection frame is (x, y, w, h, p), wherein (x, y) is the coordinate of a target center point, w and h are the width and the height of the two-dimensional detection frame respectively, and p is expressed as an association number.
Further, in step 3, the process of associating the two-dimensional detection frame and the three-dimensional detection frame from the same target is:
firstly, a forward projection of the three-dimensional detection frame of the target is made, and a parameter set of the forward-projected three-dimensional detection frames is established and stored; meanwhile, a parameter set of the two-dimensional detection frames of the target detected by the two-dimensional target detector is established; then, the IOU values between the forward-projected detection frames and the two-dimensional detection frames output by the two-dimensional target detector are calculated in the two-dimensional image domain, the detection frames corresponding to the same target are matched with a greedy algorithm, and the same association number p is assigned to the detection frames corresponding to the same target.
Further, in step 4, the target sub-region features are extracted from the detection frame containing the new background as follows:
firstly, the RGB values of the background part outside the target are set to 0 to delete the background, so that the sub-region features of the target are extracted only from the retained original two-dimensional detection frame containing the target; the sub-regions of the target are then detected and identified by the pre-trained sub-region and background information network, and the identified sub-regions are feature-encoded.
Further, in step 4, the spatial background features are extracted from the detection frame containing the new background as follows: firstly, all RGB values of the part of the detection frame containing the target are set to 0 to delete the target; the detection frame containing the new background is then sent into the downlink branch of the sub-region and background information network for background information detection and identification, and the identified background information part is feature-encoded.
In step 5, the data association process of the high-score detection frame using the 3D GIOU as the association index and directly performing the coordinate-based data association with the track includes:
setting GIOU3d as a correlation index to establish a data correlation matrix, wherein the GIOU3d expression is:
GIOU3d(B_1, B_2) = V_I / V_U - (V_C - V_U) / V_C

where B_1 and B_2 denote two three-dimensional detection frames, I denotes the intersection of B_1 and B_2, U denotes the union of B_1 and B_2, C is the smallest closed box enclosing U, and V denotes the volume of a polyhedron: V_I is the volume of the intersection of the two three-dimensional detection frames, V_U is the volume of their union, and V_C is the volume of the closed box fully enclosing the two three-dimensional detection frames;
thereby obtaining an association matrix A between two frames as follows:
A = [GIOU3d_(m,n)]

where GIOU3d_(m,n) denotes the GIOU3d value between the m-th three-dimensional detection frame of frame t and the n-th three-dimensional detection frame of frame t+1;
matching is performed with a greedy algorithm, a minimum GIOU3d threshold is set, and if the GIOU3d value is smaller than this threshold the target is considered to have no matched object.
Compared with the prior art, the invention has the following advantages:
1. the invention makes full use of the rich texture features of the image data, and uses the image data to compensate for the fact that a single point-cloud-based three-dimensional target detector carries only depth information, so as to achieve a better tracking effect at later stages.
2. Further target sub-region features are extracted from the target-containing two-dimensional detection frames output by the two-dimensional target detector, and appearance-feature-based data association is performed for the low-score detection frames output by the three-dimensional target detector, which alleviates target loss caused by partial occlusion and similar factors and improves tracking accuracy.
3. With the aid of the two-dimensional image data and the invariance of spatial background information, the background information of adjacent frames is searched for targets in the scene, thereby retaining the tracks of occluded targets.
Drawings
FIG. 1 is a diagram of a three-dimensional object detector framework;
FIG. 2 is a diagram of a subregion and background information network structure;
FIG. 3 is a block diagram of residual modules in a subregion and a background information network;
fig. 4 is a flow chart of multi-objective tracking for fusing point cloud and image information.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the present invention is not limited by the following examples, and specific embodiments can be determined according to the technical scheme and practical situation of the present invention. Well-known methods, procedures, and flows have not been described in detail so as not to obscure the nature of the invention.
As shown in fig. 1, fig. 2, fig. 3 and fig. 4, the present invention proposes a three-dimensional multi-target tracking method fusing point cloud and image information, specifically a multi-target tracking method based on the fusion of laser radar point cloud data with image target sub-regions and their spatial background information. The method mainly exploits the complementary advantages of multi-modal data: the three-dimensional detection frames output by a point-cloud-based three-dimensional target detector serve as the main input of the tracking network, assisted by the two-dimensional detection frames of an image-based two-dimensional target detector; the same target detected by the two different sensors, laser radar and camera, is associated, and meanwhile the two-dimensional detection frames output by the two-dimensional target detector are input into the sub-region and background information network of the present invention to further extract features of the target sub-regions and their spatial background information. The invention sorts the three-dimensional detection frames output by the three-dimensional target detector according to the output confidence score; the high-score three-dimensional detection frames are used directly for coordinate-based data association, while for the low-score three-dimensional detection frames the feature information of the corresponding two-dimensional detection frames, processed by the sub-region and background information network, is used for appearance-feature-based data association, thereby reducing the problems of target loss and frequent identity switching caused by motion-induced mutual occlusion between tracked targets and between tracked targets and scene objects. The specific implementation steps are shown in fig. 4:
step 1, acquiring three-dimensional point cloud data and two-dimensional image data of a time-synchronized target space;
the three-dimensional point cloud data and the two-dimensional image data are respectively from two sensors of a laser radar and a camera of the same vehicle in the same time period;
step 2, extracting point cloud data features by using a three-dimensional target detector to obtain a three-dimensional detection frame of the target, and extracting a two-dimensional detection frame of the target on the obtained two-dimensional image by using a two-dimensional target detector;
specifically, the three-dimensional detection frame is obtained with the three-dimensional target detector PV-RCNN, whose structure is shown in fig. 1; it is prior art and is not described in detail herein. The parameters of the resulting three-dimensional detection frame are (x, y, z, θ, l, w, h, p), where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and p denotes the association number. In addition, the output of the three-dimensional target detector also contains the confidence and category common in three-dimensional detection;
similarly, step 2 further includes obtaining a two-dimensional detection frame with the two-dimensional target detector YOLOv7, whose output is characterized by (x, y, w, h, p), where (x, y) are the coordinates of the target center point, w and h are the width and height of the two-dimensional detection frame, and p is the association number. p is mainly used to associate the same target across the two target detectors. In addition, the output of the two-dimensional target detector also contains the confidence and category common in two-dimensional detection;
step 3: associating two-dimensional and three-dimensional detection frames from the same target
In step 3, a forward projection is first made of the three-dimensional detection frames of the targets detected from the three-dimensional point cloud data acquired by the laser radar, and a parameter set of the forward-projected three-dimensional detection frames is established and stored; meanwhile, a parameter set of the two-dimensional detection frames of the targets detected from the two-dimensional image data acquired by the camera is established. Then, the IOU values between the forward-projected three-dimensional detection frames and the two-dimensional detection frames output by the two-dimensional target detector are calculated in the two-dimensional image domain, the detection frames corresponding to the same target are matched with a greedy algorithm, and the same association number p is assigned to the detection frames corresponding to the same target. The initial value of the association number p is set to 1, and the value of p is incremented by one every time a group of detection frames identified as the same target is processed, so that each group of targets has its own association number p; this makes it convenient, at a later stage, to determine the two-dimensional and three-dimensional detection frames from the same target by looking up the value of p. After the three-dimensional and two-dimensional detection frames have been associated, some frames obviously remain unassociated; since the invention is mainly aimed at the field of three-dimensional multi-target tracking, the remaining unmatched three-dimensional detection frames are retained, while the unmatched two-dimensional detection frames output by the two-dimensional detector are not processed further.
Step 4: and sending the two-dimensional detection frame detected by the two-dimensional target detector into a subarea and a background information network for further feature processing.
In step 4 specifically, the invention designs a sub-region and background information network with two parallel branches, an uplink and a downlink; the network structure is shown in fig. 2 and fig. 3. The network has a symmetric uplink-downlink structure. Taking the uplink as an example, the input first passes through two 3×3 convolution layers, followed by a max-pooling layer (Maxpool) and 4 residual blocks (residual modules); the BN (Batch Normalization) layers perform batch normalization, the ReLU layer is the activation function, Avgpool is an average pooling layer, and FC denotes a fully connected layer. The main difference between the uplink and the downlink is the kind of object whose features they extract. The network is pre-trained. When a target is occluded, the scores of all detectors decrease, and the invention focuses on mining the parts of the target that may still be exposed in the scene; the network is therefore mainly trained to recognize objects that commonly appear in everyday tracking scenes. The uplink is mainly aimed at items associated with pedestrians such as handbags, caps and bicycles, and of course includes the head, which is most easily exposed under occlusion. The downlink is mainly aimed at categories of information that may appear on the road surface, such as street lamps and trees. The sub-region and background information network identifies these instances and feature-encodes the identified instances for use in subsequent tracking. The input of this parallel network is the cropped image obtained by expanding the range of the two-dimensional detection frame output by the two-dimensional detector in step 2.
Specifically, after the three-dimensional detection frames output by the three-dimensional target detector have been divided into high and low confidence scores, for each three-dimensional detection frame that has been matched to a two-dimensional detection frame considered to come from the same target, a new detection frame containing background information is generated from the two-dimensional detection frame output by the two-dimensional target detector. The generation process is as follows: assume the original width and height of the two-dimensional detection frame output for the p-th target by the two-dimensional detector are w_p and h_p respectively, denoted m_p(w_p, h_p), where p is the association number. The width and height parameters of the two-dimensional detection frames (m_1, m_2, …, m_p) are each expanded by a factor α to obtain the widths and heights (M_1, M_2, …, M_p) of the new detection frames containing background information:
(M_1, M_2, …, M_p) = α(m_1, m_2, …, m_p)
where α denotes the magnification factor and M_p, specifically M_p(W_p, H_p), gives the width W_p and height H_p of the generated p-th detection frame containing the new background, p being the association number. This method only uses the parameters of the original two-dimensional detection frame to generate the parameters of the new background-containing detection frame; the image inside the original two-dimensional detection frame is not enlarged. The centre position of the detection frame containing the new background is the same as that of the two-dimensional detection frame containing only the target, so the final parameters of the detection frame containing the new background are (x, y, W, H, p). For the expanded detection frame containing the new background, a subordination relation between the background and the target is established through the association number p, and the expanded detection frame is used to crop the image, which serves as the input of the sub-region and background information network;
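As an illustration of this expansion step, the sketch below (hypothetical helper names; it assumes (x, y) is the frame centre and the image is a NumPy array) generates the background-containing frame with the same centre and α-times width and height, and crops it from the image as the network input:

```python
def expand_detection_frame(box_xywh, alpha):
    """Generate the background-containing frame (x, y, W, H) from a 2D frame
    (x, y, w, h): same centre, W = alpha * w, H = alpha * h.  Only the frame
    parameters are scaled; the image content itself is not enlarged."""
    x, y, w, h = box_xywh          # (x, y) is the centre of the original frame
    return (x, y, alpha * w, alpha * h)

def crop_expanded_frame(image, box_xywh, alpha):
    """Crop the expanded region from the image; this crop is the input to the
    sub-region and background information network."""
    x, y, W, H = expand_detection_frame(box_xywh, alpha)
    x1, y1 = int(round(x - W / 2)), int(round(y - H / 2))
    x2, y2 = int(round(x + W / 2)), int(round(y + H / 2))
    h_img, w_img = image.shape[:2]
    x1, y1 = max(0, x1), max(0, y1)            # clip to the image boundaries
    x2, y2 = min(w_img, x2), min(h_img, y2)
    return image[y1:y2, x1:x2]
```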
Furthermore, the uplink of the sub-region and background information network feature-encodes the sub-region features of the target within the expanded detection frame containing the new background. First, for the image input into the sub-region and background information network, the RGB values of the parts outside the original two-dimensional detection frame are set to 0, so that the background part is deleted. Features of the target are then extracted from the retained original two-dimensional detection frame containing only the target, and the pre-trained sub-region and background information network identifies the sub-region details within the target and extracts their features.
Taking pedestrians as an example, the sub-region and background information network detects and identifies the sub-region details of the target; more specifically, it identifies the partial sub-regions that a pedestrian easily exposes when occlusion occurs, such as the head, shoes, a carried bag, or a ridden electric bicycle. The identified target sub-regions are then feature-encoded, and a sub-region feature vector R_j is finally output, where R_j denotes the sub-region feature vector of the j-th detection.
Further, for the sub-region and background information network mentioned in step 4, the downlink part performs background information identification and simultaneously extracts background features from the expanded detection frame containing the new background.
More specifically, the background near the target in the expanded detection frame containing the new background is sampled by the sub-region and background information network; objects that the background part may contain, such as trees and street lamps, are identified and the background feature information is extracted. The sampling process is as follows: for the input expanded detection frame containing the new background, all RGB values of the part containing the target are first set to 0 to delete the target; the expanded detection frame containing the new background is then sent into the downlink branch of the sub-region and background information network for feature extraction, the identified background part is feature-encoded, and a background feature vector S_j is likewise output, where S_j denotes the background feature vector of the j-th detection.
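The two masking operations described for the uplink and downlink branches can be sketched as follows (an illustrative sketch; the coordinate convention of the target box inside the crop is an assumption):

```python
import numpy as np

def uplink_input(expanded_crop, target_box_in_crop):
    """Input to the uplink (sub-region) branch: RGB values outside the original
    two-dimensional detection frame are set to 0, deleting the background."""
    x1, y1, x2, y2 = target_box_in_crop      # target box in crop coordinates
    masked = np.zeros_like(expanded_crop)
    masked[y1:y2, x1:x2] = expanded_crop[y1:y2, x1:x2]
    return masked

def downlink_input(expanded_crop, target_box_in_crop):
    """Input to the downlink (background) branch: RGB values of the target part
    are set to 0, deleting the target and keeping the surrounding background."""
    x1, y1, x2, y2 = target_box_in_crop
    masked = expanded_crop.copy()
    masked[y1:y2, x1:x2] = 0
    return masked
```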
Step 5: and sending the three-dimensional detection frame detected by the three-dimensional target detector, the subareas and the characteristics output by the background information network into a target tracking network for data association.
The invention comprises two independent, parallel data association modules. Specifically, step 5 retains all detection frames output by the three-dimensional target detector together with their confidence scores, sets a score threshold (the threshold is assigned manually according to the scene), and divides the three-dimensional detection frames output by the three-dimensional detector into high-score and low-score detection frames according to this score threshold. The high-score detection frames are directly associated with the trajectories through coordinate-based data association.
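A minimal sketch of this confidence split (the dictionary key "score" and the threshold handling are assumptions) could look like:

```python
def split_by_score(detections_3d, score_thresh):
    """Sort 3D detections by confidence and split them into high-score and
    low-score groups according to a manually chosen threshold."""
    dets = sorted(detections_3d, key=lambda d: d["score"], reverse=True)
    high = [d for d in dets if d["score"] >= score_thresh]
    low = [d for d in dets if d["score"] < score_thresh]
    return high, low
```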
Specifically, the high-score detection frame track is associated, GIOU3d is set to be used as an associated index to establish a data associated matrix, wherein the expression of the GIOU3d is as follows:
Figure BDA0004095746510000101
in B of 1 ,B 2 Representing a three-dimensional detection frame, I represents B 1 And B 2 Intersection of U tableShow B 1 And B 2 C is the closed total outsourcing of U, V represents the volume of the polyhedron, where V I Representing the volume of the intersection of two three-dimensional inspection frames, V U Representing the volume of the union part of two three-dimensional detection frames, V C Representing the volume of the enclosed fully encased portion of the two three-dimensional inspection frames;
thereby obtaining an association matrix A between two frames as follows:
A = [GIOU3d_(m,n)]

where GIOU3d_(m,n) denotes the GIOU3d value between the m-th three-dimensional detection frame of frame t and the n-th three-dimensional detection frame of frame t+1;
matching is performed with a greedy algorithm, and a minimum GIOU3d threshold is set (the threshold is assigned manually according to the scene); if the GIOU3d value is smaller than this threshold, the target is considered to have no matched object;
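For illustration, the following sketch computes a simplified GIOU3d and the association matrix A. It is not the patent's implementation: the boxes are treated as axis-aligned and the yaw angle θ is ignored, which keeps the volume computations short; a full version would intersect rotated boxes.

```python
import numpy as np

def giou_3d_axis_aligned(b1, b2):
    """Simplified GIOU3d for boxes (x, y, z, theta, l, w, h) with (x, y, z) the
    box centre; theta is ignored here (axis-aligned approximation).
    GIOU3d = V_I / V_U - (V_C - V_U) / V_C."""
    def corners(b):
        x, y, z, _theta, l, w, h = b
        half = np.array([l, w, h]) / 2.0
        centre = np.array([x, y, z])
        return centre - half, centre + half

    min1, max1 = corners(b1)
    min2, max2 = corners(b2)
    inter_dims = np.maximum(0.0, np.minimum(max1, max2) - np.maximum(min1, min2))
    v_i = float(np.prod(inter_dims))                       # intersection volume
    v1 = float(np.prod(max1 - min1))
    v2 = float(np.prod(max2 - min2))
    v_u = v1 + v2 - v_i                                    # union volume
    v_c = float(np.prod(np.maximum(max1, max2) - np.minimum(min1, min2)))
    return v_i / v_u - (v_c - v_u) / v_c

def association_matrix(dets_t, dets_t1):
    """A[m, n] is GIOU3d between the m-th detection of frame t and the n-th
    detection of frame t+1."""
    return np.array([[giou_3d_axis_aligned(a, b) for b in dets_t1] for a in dets_t])
```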
further, a detection frame divided into the low-score group is not discarded directly; for each low-score detection frame, it is first determined whether there exists a two-dimensional detection frame from the same target corresponding to it; if no such two-dimensional detection frame exists, the low-score detection frame is discarded directly, and if it exists, the feature codes output by the sub-region and background information network are further queried through the association number p to perform data association based on appearance features, specifically expressed as follows:
a data association module is established for the low-score three-dimensional detection frames, and the feature codes output by the sub-region and background information network in step 4 are queried through the association number p. The uplink of the sub-region and background information network in step 4 has already detected and identified the target sub-regions and feature-encoded them; in the present invention, these sub-region feature vectors are used for data association, which is more robust to problems such as occlusion and densely packed targets. The data association adopts the cosine distance D_1(i, j) as the index measuring the similarity of feature vectors between two frames, where D_1(i, j) is:
D_1(i, j) = 1 - (R_i^det · R_j^trk) / (||R_i^det|| ||R_j^trk||)

where R_i^det denotes the sub-region feature vector of detection target i and R_j^trk denotes the sub-region feature vector of tracked target j.
In the invention, the Hungarian algorithm is adopted for matching; if even one sub-region block within the target has a sufficiently close feature similarity, the target is considered successfully matched, and Kalman filtering is still used to update its state.
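The appearance-feature association can be sketched as below (an illustrative sketch: the cosine-distance form and the gating threshold max_dist are assumptions). It builds the cosine-distance matrix between detection and track sub-region feature vectors and solves the assignment with the Hungarian algorithm via scipy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_distance_matrix(det_feats, trk_feats):
    """D[i, j] = 1 - cos(R_i_det, R_j_trk), computed from L2-normalised features."""
    det = det_feats / (np.linalg.norm(det_feats, axis=1, keepdims=True) + 1e-12)
    trk = trk_feats / (np.linalg.norm(trk_feats, axis=1, keepdims=True) + 1e-12)
    return 1.0 - det @ trk.T

def hungarian_match(det_feats, trk_feats, max_dist=0.4):
    """Hungarian (linear assignment) matching on the cosine-distance matrix.
    Pairs with a distance above max_dist are rejected (gating)."""
    dist = cosine_distance_matrix(det_feats, trk_feats)
    rows, cols = linear_sum_assignment(dist)
    return [(i, j) for i, j in zip(rows, cols) if dist[i, j] <= max_dist]
```

The same routine can be reused for the background feature vectors S in the next step, since only the feature source changes.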
If no sub-region is matched successfully, the background information is searched. The spatial background information of a stationary target is unchanged when it is occluded by other moving targets, so the background information features are used to continue mining potentially occluded stationary targets. Using the background feature codes output by the downlink network in step 4, the cosine distance continues to be used as the feature similarity index, with the cosine distance D_2(i, j) given by:

D_2(i, j) = 1 - (S_i^det · S_j^trk) / (||S_i^det|| ||S_j^trk||)

where S_i^det denotes the feature vector of detection background i and S_j^trk denotes the feature vector of tracked background j;
the Hungarian algorithm is still adopted for matching; if the background features can be matched, the target is considered not to have moved but merely to be occluded, and the position of the target from the previous frame is retained. The target position information is deleted when the number of background-information matches exceeds a set threshold, or when the background information can no longer be matched after a previously successful match.
Step 6: status update using 3D kalman filter
The invention updates the tracks with the state information of instances successfully matched in the two independent parallel data association modules. A three-dimensional detection frame successfully matched by coordinate-based data association is sent directly into the Kalman filter, and its target state comprises

(x, y, z, θ, l, w, h, v_x, v_y, v_z)

where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and v_x, v_y, v_z denote the velocities of the target along the x, y and z directions.
For a two-dimensional instance successfully matched through the image target sub-region and background information, the corresponding three-dimensional instance state information is queried according to the association number p and sent into the Kalman filter for track updating, so that its target state takes the same form as in the first, coordinate-based, data association.
The invention treats the target as a constant-velocity motion model; in the current frame the target is predicted as (x_now, y_now, z_now, θ, l, w, h, v_x, v_y, v_z), and this predicted state is input into the data association module for matching.
The three-dimensional Kalman filter state is updated with the current detection result as the observation. The predicted state of the corresponding track is updated through the matched detections to obtain the final matched track of frame t; the final 3D-Kalman-updated track is (x', y', z', θ', l', w', h', v_x', v_y', v_z').
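A minimal constant-velocity 3D Kalman filter over the state (x, y, z, θ, l, w, h, v_x, v_y, v_z) with 7-dimensional box observations might be sketched as follows; the noise covariances are illustrative values, not taken from the patent:

```python
import numpy as np

class ConstantVelocityKalman3D:
    """Constant-velocity Kalman filter: 10-dimensional state
    (x, y, z, theta, l, w, h, vx, vy, vz), 7-dimensional observation
    (x, y, z, theta, l, w, h)."""

    def __init__(self, init_state):
        self.x = np.asarray(init_state, dtype=float).reshape(10, 1)
        self.P = np.eye(10) * 10.0                 # state covariance
        self.F = np.eye(10)                        # transition: position += velocity
        self.F[0, 7] = self.F[1, 8] = self.F[2, 9] = 1.0
        self.H = np.zeros((7, 10))                 # observation picks the first 7 states
        self.H[:7, :7] = np.eye(7)
        self.Q = np.eye(10) * 0.01                 # process noise (illustrative)
        self.R = np.eye(7) * 0.1                   # observation noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x.ravel()                      # predicted state for association

    def update(self, detection_7d):
        z = np.asarray(detection_7d, dtype=float).reshape(7, 1)
        y = z - self.H @ self.x                    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(10) - K @ self.H) @ self.P
        return self.x.ravel()                      # updated (x', y', z', theta', ...)
```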
Step 7: tracking lifecycle management
This section mainly addresses the birth and death of tracks: tracked objects leaving the scene and new objects joining the scene. The present invention uses a simple set of rules for lifecycle management: if a target receives no update for E consecutive frames, or its background information fails to match for more than the set maximum number of frames, the invention considers that the target has left the scene and discards it. For a potential new object entering the scene, if its three-dimensional instances are detected and matched for N consecutive frames, it is regarded as a target newly added to the scene and a track is allocated for it.
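The lifecycle rules can be sketched as simple per-track bookkeeping (the default values of E, N and the maximum number of background-only matches are placeholders, since the patent leaves them to be set per scene):

```python
class TrackLifecycle:
    """Birth/death bookkeeping: a track dies after E frames without any update
    (or after too many background-only matches), and a candidate becomes a
    confirmed track after N consecutive matched frames."""

    def __init__(self, max_misses_E=3, min_hits_N=3, max_background_matches=5):
        self.max_misses_E = max_misses_E
        self.min_hits_N = min_hits_N
        self.max_background_matches = max_background_matches
        self.misses = 0
        self.hits = 0
        self.background_matches = 0
        self.confirmed = False

    def on_matched(self, by_background=False):
        self.misses = 0
        self.hits += 1
        self.background_matches = self.background_matches + 1 if by_background else 0
        if self.hits >= self.min_hits_N:
            self.confirmed = True          # target newly added to the scene

    def on_missed(self):
        self.hits = 0                      # consecutive-hit count restarts
        self.misses += 1

    def should_delete(self):
        return (self.misses >= self.max_misses_E
                or self.background_matches > self.max_background_matches)
```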

Claims (7)

1. A three-dimensional multi-target tracking method for fusing point cloud and image information is characterized by comprising the following steps:
step 1, acquiring three-dimensional point cloud data and two-dimensional image data of a time-synchronized target space;
step 2, extracting point cloud data features by using a three-dimensional target detector to obtain a three-dimensional detection frame of the target, and extracting a two-dimensional detection frame of the target on the obtained two-dimensional image by using a two-dimensional target detector;
step 3, associating the two-dimensional detection frame and the three-dimensional detection frame from the same target, so that the two-dimensional detection frame and the three-dimensional detection frame from the same target in the same frame can be determined by searching the value of the association number p;
step 4, traversing all two-dimensional detection frames detected by the two-dimensional target detector and determining the frame of each two-dimensional detection frame in the two-dimensional image; for each two-dimensional detection frame, acquiring a detection frame containing a new background in the corresponding frame, wherein the detection frame containing the new background has the same center as the two-dimensional detection frame and satisfies (W, H) = α(w, h), where α denotes the magnification factor, W and H are the width and height of the detection frame containing the new background, and w and h are the width and height of the two-dimensional detection frame; inputting the acquired detection frame containing the new background into the trained sub-region and background information network, which identifies the sub-regions within the target, extracts and encodes their features, and finally outputs a sub-region feature vector R_j, where R_j denotes the sub-region feature vector of the j-th detection; the network likewise identifies the spatial background information, extracts and encodes its features, and outputs a background feature vector S_j, where S_j denotes the background feature vector of the j-th detection; the sub-region and background information network has a vertically parallel, symmetric structure, and the uplink and downlink network structures each consist of, in sequence, a first convolution layer, a second convolution layer, a maximum pooling layer, a first residual module, a second residual module, a third residual module, a fourth residual module, an average pooling layer and a fully connected layer;
step 5, the three-dimensional detection frame detected by the three-dimensional target detector and the characteristics output by the subarea and the background information network in the step 4 are sent into a target tracking network for data association, specifically:
sequencing all three-dimensional detection frames output by a three-dimensional target detector according to the confidence score, and setting a score threshold to divide the three-dimensional detection frames into a high-score detection frame and a low-score detection frame;
the high-score detection frame adopts 3D GIOU as an association index to directly carry out coordinate-based data association with the track;
for a low-score detection frame, first judging whether there exists a two-dimensional detection frame from the same target corresponding to it; if no such two-dimensional detection frame exists, the low-score detection frame is directly discarded; if it exists, the feature codes output by the sub-region and background information network are further queried through the association number p to perform data association based on appearance features, specifically as follows:
querying, through the association number p, the sub-region feature codes output by the sub-region and background information network in step 4, and performing appearance-feature-based data association with the feature vectors corresponding to those codes; specifically, the association uses the cosine distance D_1(i, j) as the index measuring the similarity of feature vectors between two frames, where D_1(i, j) is:

D_1(i, j) = 1 - (R_i^det · R_j^trk) / (||R_i^det|| ||R_j^trk||)

where R_i^det denotes the sub-region feature vector of detection target i and R_j^trk denotes the sub-region feature vector of tracked target j;
the Hungarian algorithm is adopted for matching, and if even one sub-region block within the target has a sufficiently close feature similarity, the target is considered successfully matched and its state is still updated with Kalman filtering;
if no sub-region is matched successfully, the background information is searched instead: using the background feature codes output by the downlink network in step 4, the cosine distance continues to be used as the feature similarity index, with the cosine distance D_2(i, j) given by:

D_2(i, j) = 1 - (S_i^det · S_j^trk) / (||S_i^det|| ||S_j^trk||)

where S_i^det denotes the feature vector of detection background i and S_j^trk denotes the feature vector of tracked background j;
background information matching is performed with the Hungarian algorithm; if the background features can be matched, the target is considered not to have moved but merely to be occluded, and its state continues to be updated with Kalman filtering;
step 6, using 3D Kalman filter to update state
For a three-dimensional detection frame successfully matched by coordinate-based data association, its state is sent directly into the Kalman filter, and its target state is obtained as

(x, y, z, θ, l, w, h, v_x, v_y, v_z)

where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and v_x, v_y, v_z denote the velocities of the target along the x, y and z directions;
for a two-dimensional instance successfully matched by appearance-feature-based data association, the corresponding three-dimensional instance state information is queried according to the association number p and sent into the Kalman filter for track updating, so that its target state takes the same form (x, y, z, θ, l, w, h, v_x, v_y, v_z) as in coordinate-based data association;
the three-dimensional Kalman filter state is updated using the detection result of the three-dimensional target detector at the current moment as the observation; the predicted state of the corresponding track is updated through the matched detections to obtain the final matched track of frame t, and the final 3D-Kalman-updated track is (x', y', z', θ', l', w', h', v_x', v_y', v_z');
Step 7: tracking lifecycle management
It is specified that if a target receives no update for E consecutive frames, or its background information fails to match for more than the set maximum number of frames, the target is considered to have left the scene; for a potential new object entering the scene, if its three-dimensional instances are detected and matched for N consecutive frames, it is regarded as a target newly added to the scene and a track is allocated for it.
2. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: the two-dimensional image is acquired by a camera, and the three-dimensional point cloud data is acquired by a laser radar.
3. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: the step 2 specifically includes:
the three-dimensional detection frame is obtained by utilizing a three-dimensional object detector PV-RCNN, and the output characteristics of the three-dimensional detection frame are (x, y, z, theta, l, w, h and p), wherein (x, y, z) is the coordinate of an object in a three-dimensional space, theta is the rotation angle of the object in an xy plane, l, w and h represent the length, width and height of the three-dimensional detection frame, and p is used for representing the association number;
the two-dimensional target detector YOLOv7 is utilized to acquire a two-dimensional detection frame, the output characteristic of the two-dimensional detection frame is (x, y, w, h, p), wherein (x, y) is the coordinate of a target center point, w and h are the width and the height of the two-dimensional detection frame respectively, and p is expressed as an association number.
4. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 3, the process of associating the two-dimensional detection frame and the three-dimensional detection frame from the same target is as follows:
firstly, a forward projection of the three-dimensional detection frame of the target is made, and a parameter set of the forward-projected three-dimensional detection frames is established and stored; meanwhile, a parameter set of the two-dimensional detection frames of the target detected by the two-dimensional target detector is established; then, the IOU values between the forward-projected detection frames and the two-dimensional detection frames output by the two-dimensional target detector are calculated in the two-dimensional image domain, the detection frames corresponding to the same target are matched with a greedy algorithm, and the same association number p is assigned to the detection frames corresponding to the same target.
5. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 4, the target sub-region features are extracted from the detection frame containing the new background as follows:
firstly, the RGB values of the background part outside the target are set to 0 to delete the background, so that the sub-region features of the target are extracted only from the retained original two-dimensional detection frame containing the target; the sub-regions of the target are then detected and identified by the pre-trained sub-region and background information network, and the identified sub-regions are feature-encoded.
6. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 4, the spatial background features are extracted from the detection frame containing the new background as follows: firstly, all RGB values of the part of the detection frame containing the target are set to 0 to delete the target; the detection frame containing the new background is then sent into the downlink branch of the sub-region and background information network for background information detection and identification, and the identified background information part is feature-encoded.
7. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 5, the data association process of the high-score detection frame using the 3D GIOU as the association index and directly performing the coordinate-based data association with the track includes:
setting GIOU3d as a correlation index to establish a data correlation matrix, wherein the GIOU3d expression is:
GIOU3d(B_1, B_2) = V_I / V_U - (V_C - V_U) / V_C

where B_1 and B_2 denote two three-dimensional detection frames, I denotes the intersection of B_1 and B_2, U denotes the union of B_1 and B_2, C is the smallest closed box enclosing U, and V denotes the volume of a polyhedron: V_I is the volume of the intersection of the two three-dimensional detection frames, V_U is the volume of their union, and V_C is the volume of the closed box fully enclosing the two three-dimensional detection frames;
thereby obtaining an association matrix A between two frames as follows:
A = [GIOU3d_(m,n)]

where GIOU3d_(m,n) denotes the GIOU3d value between the m-th three-dimensional detection frame of frame t and the n-th three-dimensional detection frame of frame t+1;
matching is performed with a greedy algorithm, a minimum GIOU3d threshold is set, and if the GIOU3d value is smaller than this threshold the target is considered to have no matched object.
Application CN202310165319.9A (priority date 2023-02-27, filing date 2023-02-27): Three-dimensional multi-target tracking method integrating point cloud and image information. Status: Pending. Publication: CN116363171A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310165319.9A CN116363171A (en) 2023-02-27 2023-02-27 Three-dimensional multi-target tracking method integrating point cloud and image information


Publications (1)

Publication Number Publication Date
CN116363171A true CN116363171A (en) 2023-06-30

Family

ID=86940453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310165319.9A Pending CN116363171A (en) 2023-02-27 2023-02-27 Three-dimensional multi-target tracking method integrating point cloud and image information

Country Status (1)

Country Link
CN (1) CN116363171A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758110A (en) * 2023-08-15 2023-09-15 中国科学技术大学 Robust multi-target tracking method under complex motion scene
CN116758110B (en) * 2023-08-15 2023-11-17 中国科学技术大学 Robust multi-target tracking method under complex motion scene
CN117152199A (en) * 2023-08-30 2023-12-01 成都信息工程大学 Dynamic target motion vector estimation method, system, equipment and storage medium
CN117152199B (en) * 2023-08-30 2024-05-31 成都信息工程大学 Dynamic target motion vector estimation method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination