CN116363171A - Three-dimensional multi-target tracking method integrating point cloud and image information - Google Patents

Three-dimensional multi-target tracking method integrating point cloud and image information

Info

Publication number
CN116363171A
Authority
CN
China
Prior art keywords
dimensional
target
detection frame
dimensional detection
background
Prior art date
Legal status
Pending
Application number
CN202310165319.9A
Other languages
Chinese (zh)
Inventor
才华
郑延阳
付强
马智勇
王伟刚
李英超
Current Assignee
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202310165319.9A
Publication of CN116363171A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/292 Multi-camera tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/273 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion removing elements interfering with the pattern to be recognised
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10032 Satellite or aerial image; Remote sensing
    • G06T2207/10044 Radar image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a three-dimensional multi-target tracking method integrating point cloud and image information, belonging to the fields of computer vision and image processing. For a low-score three-dimensional detection frame, the method continues to use the target sub-region and spatial background features that a sub-region and background information network extracts from the two-dimensional detection frame of the same target matched to that low-score three-dimensional detection frame, thereby alleviating the problems of target loss and frequent identity switching caused by motion-induced mutual occlusion between tracked targets and between tracked targets and scene objects.

Description

Three-dimensional multi-target tracking method integrating point cloud and image information
Technical Field
The invention relates to the field of computer vision technology and image processing, in particular to a multi-target tracking method based on fusion of laser radar point cloud data, image target subregions and space background information thereof.
Background
Tracking-by-detection is one of the most common frameworks in the field of multi-target tracking (MOT): it acquires detection frames directly from detectors, constructs tracks across frames, and assigns the same ID to the same target.
The most important problem currently faced by multi-target tracking is target loss or identity switching caused by occlusion between objects in a scene. A detection-based tracking framework relies heavily on its detector; that is, a poorly performing target detector directly degrades the performance of the tracker. Moreover, since detection-based tracking frameworks for three-dimensional visual tracking mostly use a single three-dimensional detector, the point cloud data they rely on carries unique depth information but lacks the semantic information, such as texture, that is unique to images.
At present, most trackers perform cross-frame matching with the Hungarian algorithm. However, when an object is occluded its detection score drops, and the tracker automatically discards low-score detection frames according to a manually set threshold; obviously, a low detection score does not mean that the target in the scene has been lost.
Disclosure of Invention
In view of the above, the present invention aims to solve the problem of target loss caused by motion-induced mutual occlusion between tracked targets and between tracked targets and scene objects during multi-target tracking in a three-dimensional scene, and provides a multi-target tracking method based on the fusion of laser radar point cloud data with image target sub-regions and their spatial background information.
The technical scheme adopted by the invention for achieving the purpose is as follows: a three-dimensional multi-target tracking method for fusing point cloud and image information is characterized by comprising the following steps:
step 1, acquiring three-dimensional point cloud data and two-dimensional image data of a time-synchronized target space;
step 2, extracting point cloud data features by using a three-dimensional target detector to obtain a three-dimensional detection frame of the target, and extracting a two-dimensional detection frame of the target on the obtained two-dimensional image by using a two-dimensional target detector;
step 3, associating the two-dimensional detection frame and the three-dimensional detection frame from the same target, so that the two-dimensional detection frame and the three-dimensional detection frame from the same target in the same frame can be determined by searching the value of the association number p;
step 4, traversing all two-dimensional detection frames detected by the two-dimensional target detector and determining the frame of each two-dimensional detection frame in the two-dimensional image; for each two-dimensional detection frame, acquiring a detection frame containing a new background in the corresponding frame, wherein the detection frame containing the new background has the same center as the two-dimensional detection frame and satisfies (W, H) = α(w, h), where α denotes the magnification factor, W and H are the width and height of the detection frame containing the new background, and w and h are the width and height of the two-dimensional detection frame; inputting the acquired detection frame containing the new background into the trained sub-region and background information network, which identifies the sub-regions within the target, extracts and encodes their features, and finally outputs a sub-region feature vector R_j, where R_j denotes the sub-region feature vector of the j-th detection; the network likewise identifies the spatial background information, extracts and encodes its features, and outputs a background feature vector S_j, where S_j denotes the background feature vector of the j-th detection; the sub-region and background information network has a vertically parallel, symmetric structure, and the uplink and downlink branches each consist of, in sequence, a first convolution layer, a second convolution layer, a maximum pooling layer, a first residual module, a second residual module, a third residual module, a fourth residual module, an average pooling layer and a fully connected layer;
step 5, the three-dimensional detection frame detected by the three-dimensional target detector and the characteristics output by the subarea and the background information network in the step 4 are sent into a target tracking network for data association, specifically:
sequencing all three-dimensional detection frames output by a three-dimensional target detector according to the confidence score, and setting a score threshold to divide the three-dimensional detection frames into a high-score detection frame and a low-score detection frame;
the high-score detection frame adopts 3D GIOU as an association index to directly carry out coordinate-based data association with the track;
for a low-score detection frame, first judging whether there exists a two-dimensional detection frame from the same target corresponding to it; if no such two-dimensional detection frame exists, the low-score detection frame is directly discarded; if it exists, the feature codes output by the sub-region and background information network are further queried through the association number p to perform data association based on appearance features, specifically as follows:
querying, through the association number p, the sub-region feature codes output by the sub-region and background information network in step 4, and performing appearance-feature-based data association with the feature vectors corresponding to those codes; specifically, the association uses the cosine distance D_1(i, j) as the index measuring the similarity of feature vectors between two frames, where D_1(i, j) is:

D_1(i, j) = 1 - (R_i^det · R_j^trk) / (||R_i^det|| ||R_j^trk||)

where R_i^det denotes the sub-region feature vector of detection target i and R_j^trk denotes the sub-region feature vector of tracked target j;
the Hungarian algorithm is adopted for matching, and if even one sub-region block within the target has a sufficiently close feature similarity, the target is considered successfully matched and its state is still updated with Kalman filtering;
if no sub-region is matched successfully, the background information is searched instead: using the background feature codes output by the downlink network in step 4, the cosine distance continues to be used as the feature similarity index, with the cosine distance D_2(i, j) given by:

D_2(i, j) = 1 - (S_i^det · S_j^trk) / (||S_i^det|| ||S_j^trk||)

where S_i^det denotes the feature vector of detection background i and S_j^trk denotes the feature vector of tracked background j;
background information matching is performed with the Hungarian algorithm; if the background features can be matched, the target is considered not to have moved but merely to be occluded, and its state continues to be updated with Kalman filtering;
step 6, using 3D Kalman filter to update state
For a three-dimensional detection frame successfully matched by coordinate-based data association, its state is sent directly into the Kalman filter, and its target state is obtained as

(x, y, z, θ, l, w, h, v_x, v_y, v_z)

where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and v_x, v_y, v_z denote the velocities of the target along the x, y and z directions;
for a two-dimensional instance successfully matched by appearance-feature-based data association, the corresponding three-dimensional instance state information is queried according to the association number p and sent into the Kalman filter for track updating, so that its target state takes the same form (x, y, z, θ, l, w, h, v_x, v_y, v_z) as in coordinate-based data association;
the three-dimensional Kalman filter state is updated using the detection result of the three-dimensional target detector at the current moment as the observation; the predicted state of the corresponding track is updated through the matched detections to obtain the final matched track of frame t, and the final 3D-Kalman-updated track is (x', y', z', θ', l', w', h', v_x', v_y', v_z');
Step 7: tracking lifecycle management
It is specified that if a target receives no update for E consecutive frames, or its background information fails to match for more than the set maximum number of frames, the target is considered to have left the scene; for a potential new object entering the scene, if its three-dimensional instances are detected and matched for N consecutive frames, it is regarded as a target newly added to the scene and a track is allocated for it.
Further, the two-dimensional image is acquired by a camera, and the three-dimensional point cloud data is acquired by a laser radar.
Further, the step 2 specifically includes:
the three-dimensional detection frame is obtained by utilizing a three-dimensional object detector PV-RCNN, and the output characteristics of the three-dimensional detection frame are (x, y, z, theta, l, w, h and p), wherein (x, y, z) is the coordinate of an object in a three-dimensional space, theta is the rotation angle of the object in an xy plane, l, w and h represent the length, width and height of the three-dimensional detection frame, and p is used for representing the association number;
the two-dimensional target detector YOLOv7 is utilized to acquire a two-dimensional detection frame, the output characteristic of the two-dimensional detection frame is (x, y, w, h, p), wherein (x, y) is the coordinate of a target center point, w and h are the width and the height of the two-dimensional detection frame respectively, and p is expressed as an association number.
Further, in step 3, the process of associating the two-dimensional detection frame and the three-dimensional detection frame from the same target is:
firstly, a forward projection of the three-dimensional detection frame of the target is made, and a parameter set of the forward-projected three-dimensional detection frames is established and stored; meanwhile, a parameter set of the two-dimensional detection frames of the target detected by the two-dimensional target detector is established; then, the IOU values between the forward-projected detection frames and the two-dimensional detection frames output by the two-dimensional target detector are calculated in the two-dimensional image domain, the detection frames corresponding to the same target are matched with a greedy algorithm, and the same association number p is assigned to the detection frames corresponding to the same target.
Further, in step 4, the target sub-region features are extracted from the detection frame containing the new background as follows:
firstly, the RGB values of the background part outside the target are set to 0 to delete the background, so that the sub-region features of the target are extracted only from the retained original two-dimensional detection frame containing the target; the sub-regions of the target are then detected and identified by the pre-trained sub-region and background information network, and the identified sub-regions are feature-encoded.
Further, in step 4, the spatial background features are extracted from the detection frame containing the new background as follows: firstly, all RGB values of the part of the detection frame containing the target are set to 0 to delete the target; the detection frame containing the new background is then sent into the downlink branch of the sub-region and background information network for background information detection and identification, and the identified background information part is feature-encoded.
In step 5, the data association process of the high-score detection frame using the 3D GIOU as the association index and directly performing the coordinate-based data association with the track includes:
setting GIOU3d as a correlation index to establish a data correlation matrix, wherein the GIOU3d expression is:
GIOU3d(B_1, B_2) = V_I / V_U - (V_C - V_U) / V_C

where B_1 and B_2 denote two three-dimensional detection frames, I denotes the intersection of B_1 and B_2, U denotes the union of B_1 and B_2, C is the smallest closed box enclosing U, and V denotes the volume of a polyhedron: V_I is the volume of the intersection of the two three-dimensional detection frames, V_U is the volume of their union, and V_C is the volume of the closed box fully enclosing the two three-dimensional detection frames;
thereby obtaining an association matrix A between two frames as follows:
A = [GIOU3d_(m,n)]

where GIOU3d_(m,n) denotes the GIOU3d value between the m-th three-dimensional detection frame of frame t and the n-th three-dimensional detection frame of frame t+1;
matching is performed with a greedy algorithm, a minimum GIOU3d threshold is set, and if the GIOU3d value is smaller than this threshold the target is considered to have no matched object.
Compared with the prior art, the invention has the following advantages:
1. the invention makes full use of the rich texture features of the image data, and uses the image data to compensate for the fact that a single point-cloud-based three-dimensional target detector carries only depth information, so as to achieve a better tracking effect at later stages.
2. Further target sub-region features are extracted from the target-containing two-dimensional detection frames output by the two-dimensional target detector, and appearance-feature-based data association is performed for the low-score detection frames output by the three-dimensional target detector, which alleviates target loss caused by partial occlusion and similar factors and improves tracking accuracy.
3. With the aid of the two-dimensional image data and the invariance of spatial background information, the background information of adjacent frames is searched for targets in the scene, thereby retaining the tracks of occluded targets.
Drawings
FIG. 1 is a diagram of a three-dimensional object detector framework;
FIG. 2 is a diagram of a subregion and background information network structure;
FIG. 3 is a block diagram of residual modules in a subregion and a background information network;
fig. 4 is a flow chart of multi-objective tracking for fusing point cloud and image information.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the present invention is not limited by the following examples, and specific embodiments can be determined according to the technical scheme and practical situation of the present invention. Well-known methods, procedures, and flows have not been described in detail so as not to obscure the nature of the invention.
As shown in fig. 1, fig. 2, fig. 3 and fig. 4, the present invention proposes a three-dimensional multi-target tracking method fusing point cloud and image information, specifically a multi-target tracking method based on the fusion of laser radar point cloud data with image target sub-regions and their spatial background information. The method mainly exploits the complementary advantages of multi-modal data: the three-dimensional detection frames output by a point-cloud-based three-dimensional target detector serve as the main input of the tracking network, assisted by the two-dimensional detection frames of an image-based two-dimensional target detector; the same target detected by the two different sensors, laser radar and camera, is associated, and meanwhile the two-dimensional detection frames output by the two-dimensional target detector are input into the sub-region and background information network of the present invention to further extract features of the target sub-regions and their spatial background information. The invention sorts the three-dimensional detection frames output by the three-dimensional target detector according to the output confidence score; the high-score three-dimensional detection frames are used directly for coordinate-based data association, while for the low-score three-dimensional detection frames the feature information of the corresponding two-dimensional detection frames, processed by the sub-region and background information network, is used for appearance-feature-based data association, thereby reducing the problems of target loss and frequent identity switching caused by motion-induced mutual occlusion between tracked targets and between tracked targets and scene objects. The specific implementation steps are shown in fig. 4:
step 1, acquiring three-dimensional point cloud data and two-dimensional image data of a time-synchronized target space;
the three-dimensional point cloud data and the two-dimensional image data are respectively from two sensors of a laser radar and a camera of the same vehicle in the same time period;
step 2, extracting point cloud data features by using a three-dimensional target detector to obtain a three-dimensional detection frame of the target, and extracting a two-dimensional detection frame of the target on the obtained two-dimensional image by using a two-dimensional target detector;
specifically, the three-dimensional detection frame is obtained with the three-dimensional target detector PV-RCNN, whose structure is shown in fig. 1; it is prior art and is not described in detail herein. The parameters of the resulting three-dimensional detection frame are (x, y, z, θ, l, w, h, p), where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and p denotes the association number. In addition, the output of the three-dimensional target detector also contains the confidence and category common in three-dimensional detection;
similarly, step 2 further includes obtaining a two-dimensional detection frame with the two-dimensional target detector YOLOv7, whose output is characterized by (x, y, w, h, p), where (x, y) are the coordinates of the target center point, w and h are the width and height of the two-dimensional detection frame, and p is the association number. p is mainly used to associate the same target across the two target detectors. In addition, the output of the two-dimensional target detector also contains the confidence and category common in two-dimensional detection;
step 3: associating two-dimensional and three-dimensional detection frames from the same target
In step 3, a forward projection is first made of the three-dimensional detection frames of the targets detected from the three-dimensional point cloud data acquired by the laser radar, and a parameter set of the forward-projected three-dimensional detection frames is established and stored; meanwhile, a parameter set of the two-dimensional detection frames of the targets detected from the two-dimensional image data acquired by the camera is established. Then, the IOU values between the forward-projected three-dimensional detection frames and the two-dimensional detection frames output by the two-dimensional target detector are calculated in the two-dimensional image domain, the detection frames corresponding to the same target are matched with a greedy algorithm, and the same association number p is assigned to the detection frames corresponding to the same target. The initial value of the association number p is set to 1, and the value of p is incremented by one every time a group of detection frames identified as the same target is processed, so that each group of targets has its own association number p; this makes it convenient, at a later stage, to determine the two-dimensional and three-dimensional detection frames from the same target by looking up the value of p. After the three-dimensional and two-dimensional detection frames have been associated, some frames obviously remain unassociated; since the invention is mainly aimed at the field of three-dimensional multi-target tracking, the remaining unmatched three-dimensional detection frames are retained, while the unmatched two-dimensional detection frames output by the two-dimensional detector are not processed further.
Step 4: and sending the two-dimensional detection frame detected by the two-dimensional target detector into a subarea and a background information network for further feature processing.
In step 4 specifically, the invention designs a sub-region and background information network with two parallel branches, an uplink and a downlink; the network structure is shown in fig. 2 and fig. 3. The network has a symmetric uplink-downlink structure. Taking the uplink as an example, the input first passes through two 3×3 convolution layers, followed by a max-pooling layer (Maxpool) and 4 residual blocks (residual modules); the BN (Batch Normalization) layers perform batch normalization, the ReLU layer is the activation function, Avgpool is an average pooling layer, and FC denotes a fully connected layer. The main difference between the uplink and the downlink is the kind of object whose features they extract. The network is pre-trained. When a target is occluded, the scores of all detectors decrease, and the invention focuses on mining the parts of the target that may still be exposed in the scene; the network is therefore mainly trained to recognize objects that commonly appear in everyday tracking scenes. The uplink is mainly aimed at items associated with pedestrians such as handbags, caps and bicycles, and of course includes the head, which is most easily exposed under occlusion. The downlink is mainly aimed at categories of information that may appear on the road surface, such as street lamps and trees. The sub-region and background information network identifies these instances and feature-encodes the identified instances for use in subsequent tracking. The input of this parallel network is the cropped image obtained by expanding the range of the two-dimensional detection frame output by the two-dimensional detector in step 2.
Specifically, after the three-dimensional detection frames output by the three-dimensional target detector have been divided into high and low confidence scores, for each three-dimensional detection frame that has been matched to a two-dimensional detection frame considered to come from the same target, a new detection frame containing background information is generated from the two-dimensional detection frame output by the two-dimensional target detector. The generation process is as follows: assume the original width and height of the two-dimensional detection frame output for the p-th target by the two-dimensional detector are w_p and h_p respectively, denoted m_p(w_p, h_p), where p is the association number. The width and height parameters of the two-dimensional detection frames (m_1, m_2, …, m_p) are each expanded by a factor α to obtain the widths and heights (M_1, M_2, …, M_p) of the new detection frames containing background information:
(M_1, M_2, …, M_p) = α(m_1, m_2, …, m_p)
where α denotes the magnification factor and M_p, specifically M_p(W_p, H_p), gives the width W_p and height H_p of the generated p-th detection frame containing the new background, p being the association number. This method only uses the parameters of the original two-dimensional detection frame to generate the parameters of the new background-containing detection frame; the image inside the original two-dimensional detection frame is not enlarged. The centre position of the detection frame containing the new background is the same as that of the two-dimensional detection frame containing only the target, so the final parameters of the detection frame containing the new background are (x, y, W, H, p). For the expanded detection frame containing the new background, a subordination relation between the background and the target is established through the association number p, and the expanded detection frame is used to crop the image, which serves as the input of the sub-region and background information network;
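As an illustration of this expansion step, the sketch below (hypothetical helper names; it assumes (x, y) is the frame centre and the image is a NumPy array) generates the background-containing frame with the same centre and α-times width and height, and crops it from the image as the network input:

```python
def expand_detection_frame(box_xywh, alpha):
    """Generate the background-containing frame (x, y, W, H) from a 2D frame
    (x, y, w, h): same centre, W = alpha * w, H = alpha * h.  Only the frame
    parameters are scaled; the image content itself is not enlarged."""
    x, y, w, h = box_xywh          # (x, y) is the centre of the original frame
    return (x, y, alpha * w, alpha * h)

def crop_expanded_frame(image, box_xywh, alpha):
    """Crop the expanded region from the image; this crop is the input to the
    sub-region and background information network."""
    x, y, W, H = expand_detection_frame(box_xywh, alpha)
    x1, y1 = int(round(x - W / 2)), int(round(y - H / 2))
    x2, y2 = int(round(x + W / 2)), int(round(y + H / 2))
    h_img, w_img = image.shape[:2]
    x1, y1 = max(0, x1), max(0, y1)            # clip to the image boundaries
    x2, y2 = min(w_img, x2), min(h_img, y2)
    return image[y1:y2, x1:x2]
```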
Furthermore, the uplink of the sub-region and background information network feature-encodes the sub-region features of the target within the expanded detection frame containing the new background. First, for the image input into the sub-region and background information network, the RGB values of the parts outside the original two-dimensional detection frame are set to 0, so that the background part is deleted. Features of the target are then extracted from the retained original two-dimensional detection frame containing only the target, and the pre-trained sub-region and background information network identifies the sub-region details within the target and extracts their features.
Taking pedestrians as an example, the sub-region and background information network detects and identifies the sub-region details of the target; more specifically, it identifies the partial sub-regions that a pedestrian easily exposes when occlusion occurs, such as the head, shoes, a carried bag, or a ridden electric bicycle. The identified target sub-regions are then feature-encoded, and a sub-region feature vector R_j is finally output, where R_j denotes the sub-region feature vector of the j-th detection.
Further, for the sub-region and background information network mentioned in step 4, the downlink part performs background information identification and simultaneously extracts background features from the expanded detection frame containing the new background.
More specifically, the background near the target in the expanded detection frame containing the new background is sampled by the sub-region and background information network; objects that the background part may contain, such as trees and street lamps, are identified and the background feature information is extracted. The sampling process is as follows: for the input expanded detection frame containing the new background, all RGB values of the part containing the target are first set to 0 to delete the target; the expanded detection frame containing the new background is then sent into the downlink branch of the sub-region and background information network for feature extraction, the identified background part is feature-encoded, and a background feature vector S_j is likewise output, where S_j denotes the background feature vector of the j-th detection.
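The two masking operations described for the uplink and downlink branches can be sketched as follows (an illustrative sketch; the coordinate convention of the target box inside the crop is an assumption):

```python
import numpy as np

def uplink_input(expanded_crop, target_box_in_crop):
    """Input to the uplink (sub-region) branch: RGB values outside the original
    two-dimensional detection frame are set to 0, deleting the background."""
    x1, y1, x2, y2 = target_box_in_crop      # target box in crop coordinates
    masked = np.zeros_like(expanded_crop)
    masked[y1:y2, x1:x2] = expanded_crop[y1:y2, x1:x2]
    return masked

def downlink_input(expanded_crop, target_box_in_crop):
    """Input to the downlink (background) branch: RGB values of the target part
    are set to 0, deleting the target and keeping the surrounding background."""
    x1, y1, x2, y2 = target_box_in_crop
    masked = expanded_crop.copy()
    masked[y1:y2, x1:x2] = 0
    return masked
```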
Step 5: and sending the three-dimensional detection frame detected by the three-dimensional target detector, the subareas and the characteristics output by the background information network into a target tracking network for data association.
The invention comprises two independent, parallel data association modules. Specifically, step 5 retains all detection frames output by the three-dimensional target detector together with their confidence scores, sets a score threshold (the threshold is assigned manually according to the scene), and divides the three-dimensional detection frames output by the three-dimensional detector into high-score and low-score detection frames according to this score threshold. The high-score detection frames are directly associated with the trajectories through coordinate-based data association.
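A minimal sketch of this confidence split (the dictionary key "score" and the threshold handling are assumptions) could look like:

```python
def split_by_score(detections_3d, score_thresh):
    """Sort 3D detections by confidence and split them into high-score and
    low-score groups according to a manually chosen threshold."""
    dets = sorted(detections_3d, key=lambda d: d["score"], reverse=True)
    high = [d for d in dets if d["score"] >= score_thresh]
    low = [d for d in dets if d["score"] < score_thresh]
    return high, low
```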
Specifically, the high-score detection frame track is associated, GIOU3d is set to be used as an associated index to establish a data associated matrix, wherein the expression of the GIOU3d is as follows:
Figure BDA0004095746510000101
in B of 1 ,B 2 Representing a three-dimensional detection frame, I represents B 1 And B 2 Intersection of U tableShow B 1 And B 2 C is the closed total outsourcing of U, V represents the volume of the polyhedron, where V I Representing the volume of the intersection of two three-dimensional inspection frames, V U Representing the volume of the union part of two three-dimensional detection frames, V C Representing the volume of the enclosed fully encased portion of the two three-dimensional inspection frames;
thereby obtaining an association matrix A between two frames as follows:
A = [GIOU3d_(m,n)]

where GIOU3d_(m,n) denotes the GIOU3d value between the m-th three-dimensional detection frame of frame t and the n-th three-dimensional detection frame of frame t+1;
matching is performed with a greedy algorithm, and a minimum GIOU3d threshold is set (the threshold is assigned manually according to the scene); if the GIOU3d value is smaller than this threshold, the target is considered to have no matched object;
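For illustration, the following sketch computes a simplified GIOU3d and the association matrix A. It is not the patent's implementation: the boxes are treated as axis-aligned and the yaw angle θ is ignored, which keeps the volume computations short; a full version would intersect rotated boxes.

```python
import numpy as np

def giou_3d_axis_aligned(b1, b2):
    """Simplified GIOU3d for boxes (x, y, z, theta, l, w, h) with (x, y, z) the
    box centre; theta is ignored here (axis-aligned approximation).
    GIOU3d = V_I / V_U - (V_C - V_U) / V_C."""
    def corners(b):
        x, y, z, _theta, l, w, h = b
        half = np.array([l, w, h]) / 2.0
        centre = np.array([x, y, z])
        return centre - half, centre + half

    min1, max1 = corners(b1)
    min2, max2 = corners(b2)
    inter_dims = np.maximum(0.0, np.minimum(max1, max2) - np.maximum(min1, min2))
    v_i = float(np.prod(inter_dims))                       # intersection volume
    v1 = float(np.prod(max1 - min1))
    v2 = float(np.prod(max2 - min2))
    v_u = v1 + v2 - v_i                                    # union volume
    v_c = float(np.prod(np.maximum(max1, max2) - np.minimum(min1, min2)))
    return v_i / v_u - (v_c - v_u) / v_c

def association_matrix(dets_t, dets_t1):
    """A[m, n] is GIOU3d between the m-th detection of frame t and the n-th
    detection of frame t+1."""
    return np.array([[giou_3d_axis_aligned(a, b) for b in dets_t1] for a in dets_t])
```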
further, a detection frame divided into the low-score group is not discarded directly; for each low-score detection frame, it is first determined whether there exists a two-dimensional detection frame from the same target corresponding to it; if no such two-dimensional detection frame exists, the low-score detection frame is discarded directly, and if it exists, the feature codes output by the sub-region and background information network are further queried through the association number p to perform data association based on appearance features, specifically expressed as follows:
a data association module is established for the low-score three-dimensional detection frames, and the feature codes output by the sub-region and background information network in step 4 are queried through the association number p. The uplink of the sub-region and background information network in step 4 has already detected and identified the target sub-regions and feature-encoded them; in the present invention, these sub-region feature vectors are used for data association, which is more robust to problems such as occlusion and densely packed targets. The data association adopts the cosine distance D_1(i, j) as the index measuring the similarity of feature vectors between two frames, where D_1(i, j) is:
D_1(i, j) = 1 - (R_i^det · R_j^trk) / (||R_i^det|| ||R_j^trk||)

where R_i^det denotes the sub-region feature vector of detection target i and R_j^trk denotes the sub-region feature vector of tracked target j.
In the invention, the Hungarian algorithm is adopted for matching; if even one sub-region block within the target has a sufficiently close feature similarity, the target is considered successfully matched, and Kalman filtering is still used to update its state.
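The appearance-feature association can be sketched as below (an illustrative sketch: the cosine-distance form and the gating threshold max_dist are assumptions). It builds the cosine-distance matrix between detection and track sub-region feature vectors and solves the assignment with the Hungarian algorithm via scipy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_distance_matrix(det_feats, trk_feats):
    """D[i, j] = 1 - cos(R_i_det, R_j_trk), computed from L2-normalised features."""
    det = det_feats / (np.linalg.norm(det_feats, axis=1, keepdims=True) + 1e-12)
    trk = trk_feats / (np.linalg.norm(trk_feats, axis=1, keepdims=True) + 1e-12)
    return 1.0 - det @ trk.T

def hungarian_match(det_feats, trk_feats, max_dist=0.4):
    """Hungarian (linear assignment) matching on the cosine-distance matrix.
    Pairs with a distance above max_dist are rejected (gating)."""
    dist = cosine_distance_matrix(det_feats, trk_feats)
    rows, cols = linear_sum_assignment(dist)
    return [(i, j) for i, j in zip(rows, cols) if dist[i, j] <= max_dist]
```

The same routine can be reused for the background feature vectors S in the next step, since only the feature source changes.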
If no sub-region is matched successfully, the background information is searched. The spatial background information of a stationary target is unchanged when it is occluded by other moving targets, so the background information features are used to continue mining potentially occluded stationary targets. Using the background feature codes output by the downlink network in step 4, the cosine distance continues to be used as the feature similarity index, with the cosine distance D_2(i, j) given by:

D_2(i, j) = 1 - (S_i^det · S_j^trk) / (||S_i^det|| ||S_j^trk||)

where S_i^det denotes the feature vector of detection background i and S_j^trk denotes the feature vector of tracked background j;
the Hungarian algorithm is still adopted for matching; if the background features can be matched, the target is considered not to have moved but merely to be occluded, and the position of the target from the previous frame is retained. The target position information is deleted when the number of background-information matches exceeds a set threshold, or when the background information can no longer be matched after a previously successful match.
Step 6: status update using 3D kalman filter
The invention updates the tracks with the state information of instances successfully matched in the two independent parallel data association modules. A three-dimensional detection frame successfully matched by coordinate-based data association is sent directly into the Kalman filter, and its target state comprises

(x, y, z, θ, l, w, h, v_x, v_y, v_z)

where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and v_x, v_y, v_z denote the velocities of the target along the x, y and z directions.
For a two-dimensional instance successfully matched through the image target sub-region and background information, the corresponding three-dimensional instance state information is queried according to the association number p and sent into the Kalman filter for track updating, so that its target state takes the same form as in the first, coordinate-based, data association.
The invention treats the target as a constant-velocity motion model; in the current frame the target is predicted as (x_now, y_now, z_now, θ, l, w, h, v_x, v_y, v_z), and this predicted state is input into the data association module for matching.
The three-dimensional Kalman filter state is updated with the current detection result as the observation. The predicted state of the corresponding track is updated through the matched detections to obtain the final matched track of frame t; the final 3D-Kalman-updated track is (x', y', z', θ', l', w', h', v_x', v_y', v_z').
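A minimal constant-velocity 3D Kalman filter over the state (x, y, z, θ, l, w, h, v_x, v_y, v_z) with 7-dimensional box observations might be sketched as follows; the noise covariances are illustrative values, not taken from the patent:

```python
import numpy as np

class ConstantVelocityKalman3D:
    """Constant-velocity Kalman filter: 10-dimensional state
    (x, y, z, theta, l, w, h, vx, vy, vz), 7-dimensional observation
    (x, y, z, theta, l, w, h)."""

    def __init__(self, init_state):
        self.x = np.asarray(init_state, dtype=float).reshape(10, 1)
        self.P = np.eye(10) * 10.0                 # state covariance
        self.F = np.eye(10)                        # transition: position += velocity
        self.F[0, 7] = self.F[1, 8] = self.F[2, 9] = 1.0
        self.H = np.zeros((7, 10))                 # observation picks the first 7 states
        self.H[:7, :7] = np.eye(7)
        self.Q = np.eye(10) * 0.01                 # process noise (illustrative)
        self.R = np.eye(7) * 0.1                   # observation noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x.ravel()                      # predicted state for association

    def update(self, detection_7d):
        z = np.asarray(detection_7d, dtype=float).reshape(7, 1)
        y = z - self.H @ self.x                    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(10) - K @ self.H) @ self.P
        return self.x.ravel()                      # updated (x', y', z', theta', ...)
```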
Step 7: tracking lifecycle management
This section mainly addresses the birth and death of tracks: tracked objects leaving the scene and new objects joining the scene. The present invention uses a simple set of rules for lifecycle management: if a target receives no update for E consecutive frames, or its background information fails to match for more than the set maximum number of frames, the invention considers that the target has left the scene and discards it. For a potential new object entering the scene, if its three-dimensional instances are detected and matched for N consecutive frames, it is regarded as a target newly added to the scene and a track is allocated for it.
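The lifecycle rules can be sketched as simple per-track bookkeeping (the default values of E, N and the maximum number of background-only matches are placeholders, since the patent leaves them to be set per scene):

```python
class TrackLifecycle:
    """Birth/death bookkeeping: a track dies after E frames without any update
    (or after too many background-only matches), and a candidate becomes a
    confirmed track after N consecutive matched frames."""

    def __init__(self, max_misses_E=3, min_hits_N=3, max_background_matches=5):
        self.max_misses_E = max_misses_E
        self.min_hits_N = min_hits_N
        self.max_background_matches = max_background_matches
        self.misses = 0
        self.hits = 0
        self.background_matches = 0
        self.confirmed = False

    def on_matched(self, by_background=False):
        self.misses = 0
        self.hits += 1
        self.background_matches = self.background_matches + 1 if by_background else 0
        if self.hits >= self.min_hits_N:
            self.confirmed = True          # target newly added to the scene

    def on_missed(self):
        self.hits = 0                      # consecutive-hit count restarts
        self.misses += 1

    def should_delete(self):
        return (self.misses >= self.max_misses_E
                or self.background_matches > self.max_background_matches)
```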

Claims (7)

1. A three-dimensional multi-target tracking method for fusing point cloud and image information is characterized by comprising the following steps:
step 1, acquiring three-dimensional point cloud data and two-dimensional image data of a time-synchronized target space;
step 2, extracting point cloud data features by using a three-dimensional target detector to obtain a three-dimensional detection frame of the target, and extracting a two-dimensional detection frame of the target on the obtained two-dimensional image by using a two-dimensional target detector;
step 3, associating the two-dimensional detection frame and the three-dimensional detection frame from the same target, so that the two-dimensional detection frame and the three-dimensional detection frame from the same target in the same frame can be determined by searching the value of the association number p;
step 4, traversing all two-dimensional detection frames detected by the two-dimensional target detector and determining the frame of each two-dimensional detection frame in the two-dimensional image; for each two-dimensional detection frame, acquiring a detection frame containing a new background in the corresponding frame, wherein the detection frame containing the new background has the same center as the two-dimensional detection frame and satisfies (W, H) = α(w, h), where α denotes the magnification factor, W and H are the width and height of the detection frame containing the new background, and w and h are the width and height of the two-dimensional detection frame; inputting the acquired detection frame containing the new background into the trained sub-region and background information network, which identifies the sub-regions within the target, extracts and encodes their features, and finally outputs a sub-region feature vector R_j, where R_j denotes the sub-region feature vector of the j-th detection; the network likewise identifies the spatial background information, extracts and encodes its features, and outputs a background feature vector S_j, where S_j denotes the background feature vector of the j-th detection; the sub-region and background information network has a vertically parallel, symmetric structure, and the uplink and downlink network structures each consist of, in sequence, a first convolution layer, a second convolution layer, a maximum pooling layer, a first residual module, a second residual module, a third residual module, a fourth residual module, an average pooling layer and a fully connected layer;
step 5, the three-dimensional detection frame detected by the three-dimensional target detector and the characteristics output by the subarea and the background information network in the step 4 are sent into a target tracking network for data association, specifically:
sequencing all three-dimensional detection frames output by a three-dimensional target detector according to the confidence score, and setting a score threshold to divide the three-dimensional detection frames into a high-score detection frame and a low-score detection frame;
the high-score detection frame adopts 3D GIOU as an association index to directly carry out coordinate-based data association with the track;
for a low-score detection frame, first judging whether there exists a two-dimensional detection frame from the same target corresponding to it; if no such two-dimensional detection frame exists, the low-score detection frame is directly discarded; if it exists, the feature codes output by the sub-region and background information network are further queried through the association number p to perform data association based on appearance features, specifically as follows:
querying, through the association number p, the sub-region feature codes output by the sub-region and background information network in step 4, and performing appearance-feature-based data association with the feature vectors corresponding to those codes; specifically, the association uses the cosine distance D_1(i, j) as the index measuring the similarity of feature vectors between two frames, where D_1(i, j) is:

D_1(i, j) = 1 - (R_i^det · R_j^trk) / (||R_i^det|| ||R_j^trk||)

where R_i^det denotes the sub-region feature vector of detection target i and R_j^trk denotes the sub-region feature vector of tracked target j;
the Hungarian algorithm is adopted for matching, and if even one sub-region block within the target has a sufficiently close feature similarity, the target is considered successfully matched and its state is still updated with Kalman filtering;
if no sub-region is matched successfully, the background information is searched instead: using the background feature codes output by the downlink network in step 4, the cosine distance continues to be used as the feature similarity index, with the cosine distance D_2(i, j) given by:

D_2(i, j) = 1 - (S_i^det · S_j^trk) / (||S_i^det|| ||S_j^trk||)

where S_i^det denotes the feature vector of detection background i and S_j^trk denotes the feature vector of tracked background j;
background information matching is performed with the Hungarian algorithm; if the background features can be matched, the target is considered not to have moved but merely to be occluded, and its state continues to be updated with Kalman filtering;
step 6, using 3D Kalman filter to update state
For a three-dimensional detection frame successfully matched by coordinate-based data association, its state is sent directly into the Kalman filter, and its target state is obtained as

(x, y, z, θ, l, w, h, v_x, v_y, v_z)

where (x, y, z) are the coordinates of the target in three-dimensional space, θ is the rotation angle of the target in the xy plane, l, w and h denote the length, width and height of the three-dimensional detection frame, and v_x, v_y, v_z denote the velocities of the target along the x, y and z directions;
for a two-dimensional instance successfully matched by appearance-feature-based data association, the corresponding three-dimensional instance state information is queried according to the association number p and sent into the Kalman filter for track updating, so that its target state takes the same form (x, y, z, θ, l, w, h, v_x, v_y, v_z) as in coordinate-based data association;
the three-dimensional Kalman filter state is updated using the detection result of the three-dimensional target detector at the current moment as the observation; the predicted state of the corresponding track is updated through the matched detections to obtain the final matched track of frame t, and the final 3D-Kalman-updated track is (x', y', z', θ', l', w', h', v_x', v_y', v_z');
Step 7: tracking lifecycle management
It is specified that if a target receives no update for E consecutive frames, or its background information fails to match for more than the set maximum number of frames, the target is considered to have left the scene; for a potential new object entering the scene, if its three-dimensional instances are detected and matched for N consecutive frames, it is regarded as a target newly added to the scene and a track is allocated for it.
2. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: the two-dimensional image is acquired by a camera, and the three-dimensional point cloud data is acquired by a laser radar.
3. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: the step 2 specifically includes:
the three-dimensional detection frame is obtained by utilizing a three-dimensional object detector PV-RCNN, and the output characteristics of the three-dimensional detection frame are (x, y, z, theta, l, w, h and p), wherein (x, y, z) is the coordinate of an object in a three-dimensional space, theta is the rotation angle of the object in an xy plane, l, w and h represent the length, width and height of the three-dimensional detection frame, and p is used for representing the association number;
the two-dimensional target detector YOLOv7 is utilized to acquire a two-dimensional detection frame, the output characteristic of the two-dimensional detection frame is (x, y, w, h, p), wherein (x, y) is the coordinate of a target center point, w and h are the width and the height of the two-dimensional detection frame respectively, and p is expressed as an association number.
4. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 3, the process of associating the two-dimensional detection frame and the three-dimensional detection frame from the same target is as follows:
firstly, a forward projection of the three-dimensional detection frame of the target is made, and a parameter set of the forward-projected three-dimensional detection frames is established and stored; meanwhile, a parameter set of the two-dimensional detection frames of the target detected by the two-dimensional target detector is established; then, the IOU values between the forward-projected detection frames and the two-dimensional detection frames output by the two-dimensional target detector are calculated in the two-dimensional image domain, the detection frames corresponding to the same target are matched with a greedy algorithm, and the same association number p is assigned to the detection frames corresponding to the same target.
5. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 4, the target sub-region features are extracted from the detection frame containing the new background as follows:
firstly, the RGB values of the background part outside the target are set to 0 to delete the background, so that the sub-region features of the target are extracted only from the retained original two-dimensional detection frame containing the target; the sub-regions of the target are then detected and identified by the pre-trained sub-region and background information network, and the identified sub-regions are feature-encoded.
6. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 4, the spatial background features are extracted from the detection frame containing the new background as follows: firstly, all RGB values of the part of the detection frame containing the target are set to 0 to delete the target; the detection frame containing the new background is then sent into the downlink branch of the sub-region and background information network for background information detection and identification, and the identified background information part is feature-encoded.
7. The three-dimensional multi-target tracking method for fusing point cloud and image information according to claim 1, wherein: in step 5, the data association process of the high-score detection frame using the 3D GIOU as the association index and directly performing the coordinate-based data association with the track includes:
setting GIOU3d as a correlation index to establish a data correlation matrix, wherein the GIOU3d expression is:
GIOU3d(B_1, B_2) = V_I / V_U - (V_C - V_U) / V_C

where B_1 and B_2 denote two three-dimensional detection frames, I denotes the intersection of B_1 and B_2, U denotes the union of B_1 and B_2, C is the smallest closed box enclosing U, and V denotes the volume of a polyhedron: V_I is the volume of the intersection of the two three-dimensional detection frames, V_U is the volume of their union, and V_C is the volume of the closed box fully enclosing the two three-dimensional detection frames;
thereby obtaining an association matrix A between two frames as follows:
A = [GIOU3d_(m,n)]

where GIOU3d_(m,n) denotes the GIOU3d value between the m-th three-dimensional detection frame of frame t and the n-th three-dimensional detection frame of frame t+1;
matching is performed with a greedy algorithm, a minimum GIOU3d threshold is set, and if the GIOU3d value is smaller than this threshold the target is considered to have no matched object.
Application CN202310165319.9A (priority date 2023-02-27, filing date 2023-02-27): Three-dimensional multi-target tracking method integrating point cloud and image information. Status: Pending. Publication: CN116363171A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310165319.9A CN116363171A (en) 2023-02-27 2023-02-27 Three-dimensional multi-target tracking method integrating point cloud and image information


Publications (1)

Publication Number Publication Date
CN116363171A true CN116363171A (en) 2023-06-30

Family

ID=86940453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310165319.9A Pending CN116363171A (en) 2023-02-27 2023-02-27 Three-dimensional multi-target tracking method integrating point cloud and image information

Country Status (1)

Country Link
CN (1) CN116363171A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758110A (en) * 2023-08-15 2023-09-15 中国科学技术大学 Robust multi-target tracking method under complex motion scene
CN116758110B (en) * 2023-08-15 2023-11-17 中国科学技术大学 Robust multi-target tracking method under complex motion scene
CN117152199A (en) * 2023-08-30 2023-12-01 成都信息工程大学 Dynamic target motion vector estimation method, system, equipment and storage medium
CN117152199B (en) * 2023-08-30 2024-05-31 成都信息工程大学 Dynamic target motion vector estimation method, system, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination