CN117576150A - Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship - Google Patents

Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship

Info

Publication number
CN117576150A
CN117576150A (application CN202311459231.4A; granted publication CN117576150B)
Authority
CN
China
Prior art keywords
frame
target
coordinate system
detection
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311459231.4A
Other languages
Chinese (zh)
Other versions
CN117576150B (en)
Inventor
周思远
朱玉鹤
周春云
包敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou Wanfang Technology Co ltd
Original Assignee
Yangzhou Wanfang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou Wanfang Technology Co ltd filed Critical Yangzhou Wanfang Technology Co ltd
Priority to CN202311459231.4A priority Critical patent/CN117576150B/en
Priority claimed from CN202311459231.4A external-priority patent/CN117576150B/en
Publication of CN117576150A publication Critical patent/CN117576150A/en
Application granted granted Critical
Publication of CN117576150B publication Critical patent/CN117576150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of 3D multi-target tracking, and provides a multi-mode multi-target 3D tracking method and device considering a far-frame dependency relationship. The tracking method comprises the following steps: acquiring RGB and 3D point cloud information of a multi-frame scene; filling the sliding window; aligning the 3D coordinates to a 3D coordinate system of learnable parameters; dividing a 3D grid and extracting frame-by-frame 3D features; extracting 2D pixel-by-pixel deep features, mapping them, and extracting frame-by-frame 3D mapping features; querying a vector table to generate an initial 3D detection frame representation; simultaneously predicting multi-frame multi-target detection frames through a Transformer decoder; performing ID matching of the 3D detection frames of each frame; and updating the 3D detection frames by adopting a Kalman filter. The tracking method considers the long-distance context dependency relationship among the multiple frames in the sliding window, automatically learns the feature weights among different points of different frames by using a Transformer, and effectively solves the problems of target occlusion and target loss during multi-frame tracking.

Description

Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship
Technical Field
The invention relates to the technical field of 3D multi-target tracking, in particular to a multi-mode multi-target 3D tracking method and device considering far-frame dependency.
Background
The 3D multi-target tracking task aims at tracking individual targets in a multi-frame 3D scene, identifying the 3D location and object class of the targets. The 3D multi-target tracking task plays an important role in the fields of automatic driving, machine navigation, intelligent safety, medical diagnosis and the like. One common problem with multi-target tracking tasks is the phenomena of target occlusion and target loss during multi-frame tracking, wherein multiple targets are difficult to identify through a limited visible area when entering the occlusion, and the phenomena of target confusion and target loss occur when leaving the occlusion. To cope with this problem, we need to mine the dependency of the far-frame features, that is, not only judge according to the current frame and a small number of frames before and after, but also process in combination with the far-frame context features more widely.
At present, convolution-based multi-target tracking methods are widely applied in the field of 3D multi-target tracking; limited by the local window characteristic of convolution kernels, they incur a high cost when mining far-frame feature dependencies and generally only adopt a single-frame detect-then-track mode. This results in a convolutional network of limited depth being unable to find the correlation between frames, and the identification of the detection frame only uses the features of the current frame. Unlike stacking a plurality of convolution layers in succession, the Transformer can effectively track across far frames by mining the attention weights of the various features through a multi-head attention mechanism, but still faces the problem of expensive computational overhead.
Meanwhile, the 3D multi-target tracking task often involves multiple modalities, such as 3D point cloud, RGB, far infrared, and so on. Due to the complexity of feature alignment, the existing method mostly adopts a post-fusion scheme, namely, each mode respectively detects the target frame and then carries out matching and correction of the detection frame. The scheme has high requirements on the matching time complexity and precision, and the characteristic information of other modes is not fully utilized in the detection stage of each mode.
Disclosure of Invention
In view of this, the embodiment of the invention provides a multi-mode multi-target 3D tracking method and device considering the far-frame dependency relationship, which provides a scheme for simultaneously mining the attention weights of each feature dimension, both across frames and across modes, for the multi-frame multi-target data in a 3D sliding window, so as to solve or partially solve the above problems.
In a first aspect, an embodiment of the present invention provides a multi-mode multi-target 3D tracking method considering far-frame dependency, including the following steps:
s1: acquiring RGB information and 3D point cloud information of a multi-frame scene;
s2: filling a tracking scene into the 3D sliding window to a fixed window size;
s3: aligning 3D coordinates of each frame within the sliding window to a 3D coordinate system of a learnable parameter;
s4: dividing a 3D grid, and extracting frame-by-frame 3D features according to the 3D grid;
s5: extracting 2D RGB pixel-by-pixel deep features, mapping the deep features to a 3D coordinate system of a learnable parameter, and extracting frame-by-frame 3D mapping features according to the 3D grid;
s6: querying a vector table of the learnable parameters to generate an initial 3D detection frame representation;
s7: simultaneously predicting the positions, the sizes, the angles and the category probabilities of multi-frame multi-target detection frames through a Transformer decoder;
s8: adopting a Hungarian algorithm to carry out ID matching of the 3D detection frames of each frame;
s9: and updating the 3D detection frame by adopting a Kalman filter.
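To make the flow of steps S1 to S9 easier to follow, a minimal orchestration sketch is given below; every function name and the module dictionary are hypothetical placeholders introduced purely for illustration and are not the actual implementation of the invention.

```python
# Hypothetical orchestration of steps S1-S9; every helper name below is a
# placeholder for illustration only, not part of the patented implementation.
def track_sliding_window(frames, window_size, modules):
    # S2: pad the tracking scene to a fixed window size
    window = modules["fill"](frames, window_size)
    # S3: align each frame's 3D coordinates to the learnable coordinate system
    aligned = modules["align"](window)
    # S4: divide the 3D grid and extract frame-by-frame 3D features
    feats_3d = modules["grid_features"](aligned)
    # S5: extract 2D RGB deep features, map them to 3D, pool per grid cell
    feats_map = modules["map_features"](window, aligned)
    # S6: query the learnable vector table for the initial 3D box representation
    anchors = modules["anchors"](window_size)
    # S7: the Transformer decoder predicts all frames' boxes at once
    boxes = modules["decoder"](key_value=(feats_3d, feats_map), query=anchors)
    # S8: Hungarian matching assigns consistent IDs across frames
    tracks = modules["match_ids"](boxes)
    # S9: Kalman filtering smooths and updates the matched boxes
    return modules["kalman_update"](tracks)
```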
According to a specific implementation manner of the embodiment of the present invention, the step S2 specifically includes: filling the tracking scene into the 3D sliding window to a fixed window size, and supplementing n-t frames if the last sliding window frame number t is smaller than the window size n; in implementation, the supplementary frames can be taken from the preceding consecutive windows.
According to a specific implementation manner of the embodiment of the present invention, the step S3 specifically includes:
s3.1: generating a 3D coordinate system transformation matrix M of learnable parameters; the matrix M is initialized as an identity matrix;
s3.2: transforming each frame scene within the window into the 3D coordinate system of the learnable parameters:
P_i^l = P_i^w M
wherein P_i^w represents the point set of the i-th frame in the world coordinate system, and k represents the total number of points in the i-th frame; the matrix M of learnable parameters can explore a more effective 3D characterization space, giving P_i^l, the point set of the i-th frame in the 3D coordinate system of the learnable parameters.
According to a specific implementation manner of the embodiment of the present invention, in the step S4, before dividing the 3D grid, inter-frame space coding and intra-frame position coding are further performed on the 3D coordinates of each frame in the sliding window, which specifically includes:
the inter-frame space coding PE_inter^i of the i-th frame is a learnable parameter of dimension 8; PE_inter^i is sampled from a standard Gaussian distribution:
ω_j ~ N(0,1), j = 1,2,...,8;
the intra-frame position coding PE_inner^i of the i-th frame is a feature representation of dimension 128; it is extracted from the segmentation task main network of PointNet++:
PE_inner^i = PN_seg(P_i^l)
wherein P_i^l represents the point set of the i-th frame in the 3D coordinate system of the learnable parameters, PE_inner^i represents the intra-frame position coding of the i-th frame, and PN_seg represents the segmentation task main network of PointNet++;
the final 3D feature representation PE_i ∈ R^(k×136) is the concatenation of the inter-frame space coding PE_inter^i and the intra-frame position coding PE_inner^i:
PE_i = con(rep(PE_inter^i, k), PE_inner^i)
wherein the rep function repeats the inter-frame space coding PE_inter^i k times, and the con function concatenates the repeated inter-frame space coding and the intra-frame position coding PE_inner^i in the second dimension.
According to a specific implementation manner of the embodiment of the present invention, the step S4 specifically includes:
s4.1: dividing the 3D grid and grouping all points according to the 3D grid, wherein the 3D ROI space is evenly divided into an = m×m×m 3D grids;
s4.2: for all points in each group, calculating the inverse-distance-weighted mean of their 3D feature representations PE, called the frame-by-frame 3D features F_3d:
w_i(x) = 1 / d(x, x_i)^p,  F_3d(x) = Σ_i w_i(x) PE(x_i) / Σ_i w_i(x)
wherein p = 2, d(x, x_i)^p represents the squared distance from each point to the grid center point, and w_i(x) represents the inverse weighting using the squared distance; the 3D feature representation PE ∈ R^(n×k×136) is thereby transformed into the frame-by-frame 3D features F_3d.
According to a specific implementation manner of the embodiment of the present invention, the step S5 specifically includes:
s5.1: extracting 2D semantic features by using the real-time semantic segmentation network BiSeNetV2 as the 2D backbone:
R_2d^i = MLP(BN_v2(R_i))
wherein R_i ∈ R^(h×w×3) represents the RGB values of the 2D pixels of the i-th frame, BN_v2 represents a BiSeNetV2 backbone network pre-trained on the ImageNet dataset, the MLP function represents transforming the tail features to dimension 44 through a single-layer MLP, and R_2d^i represents the 2D features of the i-th frame;
s5.2: calculating the 3D coordinates, in the 3D coordinate system of the learnable parameters, corresponding to the j-th pixel point of the i-th frame:
p_j^l = M (T_i^{w2s})^{-1} (T_i^{s2c})^{-1} v_j
wherein v_j is the spatial coordinate of the rasterized 3D camera viewing cone (frustum) corresponding to the pixel point and d_j is the depth of the j-th pixel point; T_i^{s2c} represents the transformation matrix from the self coordinate system of the i-th frame to the camera coordinate system; T_i^{w2s} represents the transformation matrix from the world coordinate system of the i-th frame to the self coordinate system; M represents the 3D coordinate system transformation matrix of the learnable parameters; the first three dimensions of p_j^l are the 3D coordinates (α_j, β_j, γ_j) in the 3D coordinate system of the learnable parameters;
s5.3: for all 3D mapping points in each 3D ROI grid, calculating the inverse-distance-weighted mean of their 3D mapping features R_3d according to the formula in S4.2, called the frame-by-frame 3D mapping features F_3d^map, wherein the 3D ROI space is evenly divided into an = m×m×m 3D grids.
According to a specific implementation manner of the embodiment of the present invention, the step S6 specifically includes: generating an initial 3D detection frame characteristic representation of each frame in a sliding window by using a query vector table of the learnable parameters, wherein the characteristic representation is specifically as follows:
anchor_x = embed(arange(1, m)), anchor_y = embed(arange(1, m)), anchor_z = embed(arange(1, m))
wherein the 3D ROI space is divided into m×m×m 3D grids, each grid being a candidate 3D initial detection box; anchor_x, anchor_y and anchor_z represent the hidden vector representations corresponding to each spatial dimension; the arange function is used for generating all integers from 1 to m; the embed function is a query vector table of learnable parameters, which has m rows, each row being a 60-dimensional learnable parameter initialized from a uniform distribution:
ω_j ~ U(0,1), j = 1,2,...,60;
splicing the hidden vector representations corresponding to the spatial dimensions to obtain the hidden vector representations of the 3D anchors, called the initial 3D detection frame representation:
anchor_i = con(anchor_x, anchor_y, anchor_z)
anchor = rep(anchor_i, n)
wherein anchor ∈ R^(n×an×180) represents all 3D anchor feature representations of the sliding window, referred to as the initial 3D detection frame representation, wherein an = m×m×m. The above two equations omit the reshape operation.
According to a specific implementation manner of the embodiment of the present invention, the step S7 specifically includes: the concatenation of the frame-by-frame 3D features and the frame-by-frame 3D mapping features is used as Key and Value, the initial 3D detection frame representation is used as Query, and a Transformer decoder with alternating sparse and dense attention is used for generating the multi-frame multi-target 3D detection frames, specifically:
K = V = con(F_3d, F_3d^map);
Q = anchor;
wherein F_3d represents the frame-by-frame 3D features, F_3d^map represents the frame-by-frame 3D mapping features, and the mask function represents alternately performing dense and sparse masks; subsequent blocks of the Transformer use a multi-head self-attention mechanism with consistent Key, Value and Query:
K_{i+1} = V_{i+1} = Q_{i+1} = block_i(K_i, V_i, Q_i);
the output of the last block is fed into a two-layer MLP to obtain the predicted multi-frame multi-target detection frames:
bbox ∈ R^(n×an×(m+7))
wherein bbox represents the predicted multi-frame multi-target detection frames, m represents the total number of categories, and m+7 represents, respectively, the classified one-hot probability and the regressed 3D frame size, relative position and angle; m is 1 more than the number of real categories, with the first category representing whether there is an object.
In a second aspect, an embodiment of the present invention provides a multi-mode multi-target 3D tracking apparatus considering far-frame dependency, including:
the acquisition module is used for acquiring RGB information and 3D point cloud information of a multi-frame scene;
a filling module to fill a tracking scene within a 3D sliding window to a fixed window size;
an alignment module to align 3D coordinates of each frame within the sliding window to a 3D coordinate system of a learnable parameter;
the division module is used for dividing the 3D grids and extracting frame-by-frame 3D features according to the 3D grids;
the extraction module is used for extracting 2D RGB pixel-by-pixel deep features, mapping the deep features to a 3D coordinate system of a learnable parameter, and extracting frame-by-frame 3D mapping features according to the 3D grid;
the query module is used for querying a vector table of the learnable parameters to generate an initial 3D detection frame representation;
the prediction module is used for simultaneously predicting the positions, the sizes, the angles and the category probabilities of the multi-frame multi-target detection frames through the Transformer decoder;
the matching module is used for carrying out ID matching of the 3D detection frames of each frame by adopting a Hungarian algorithm;
and the updating module is used for carrying out noise filtering and updating of the 3D detection frame by adopting a Kalman filter.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the multi-mode multi-target 3D tracking method in the foregoing first aspect or any implementation manner of the first aspect when the program is executed.
The embodiment of the invention has at least the following technical effects:
firstly, the invention considers the long-distance context dependency relationship among the multiple frames in the sliding window, automatically learns the feature weights among different points of different frames by using a Transformer, and mines the inter-frame and intra-frame correlations within the sliding window, thereby effectively solving the problems of target occlusion and target loss during multi-frame tracking;
secondly, different from methods that first obtain 2D and 3D detection frames and then perform feature extraction and fusion tracking on those detection frames, the method unifies the multi-mode features into the 3D sliding-window grid representation space in a single stage and then directly predicts the 3D detection frames, thereby saving computation cost as well as the frame-by-frame and mode-by-mode read-write time;
thirdly, different from extracting features from each mode's detection frames, the method directly fuses the hidden-layer features of each mode, so that more mode information is retained;
fourthly, the attention mechanism of the invention directly acts on each dimension of multi-frame multi-mode characteristics, namely, the attention is mined among modes while the attention is mined in the modes, the attention mechanism is executed on all characteristic dimensions of all modes, the relevance between the modes and each dimension in the modes is comprehensively considered, and the fusion weight is implicitly calculated in the tracking process, so that the multi-mode characteristics are fused more reasonably;
fifth, the invention unifies each mode to the 3D grid characterization space, and puts forward the concept of frame-by-frame 3D characteristics to represent the multi-mode characteristic hidden vector, thereby overcoming the interference caused by different 3D points of each mode and enabling the multi-mode to be pluggable and easy to expand.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
FIG. 1 is a diagram depicting a target occlusion and target loss problem for cross-multiframe tracking.
Fig. 2 is a flowchart illustrating a multi-mode multi-target 3D tracking method considering far-frame dependency in embodiment 1.
Fig. 3 is a flowchart for calculating the Transformer's initial Key, Query and Value in embodiment 1.
Fig. 4 is a network architecture diagram of the network designed in example 1.
Fig. 5 is a flowchart of the Transformer calculating the detection frame position offset and the class probability in embodiment 1.
Fig. 6 is a block diagram of a multi-mode multi-target 3D tracking device in which far-frame dependency is considered in embodiment 3.
Fig. 7 is a schematic structural diagram of an electronic device in embodiment 4.
Detailed Description
Embodiments of the technical scheme of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and thus are merely examples, which should not be construed as limiting the scope of the present invention.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.
One common problem with multi-target tracking tasks is the problem of target occlusion and target loss during multi-frame tracking, as shown in fig. 1, when multiple targets enter the occlusion, the targets are difficult to distinguish through a limited visible area, and when multiple targets leave the occlusion, the phenomena of target confusion and target loss are easy to occur.
At present, convolution-based multi-target tracking methods are widely applied in the field of 3D multi-target tracking and, limited by the local window characteristic of convolution kernels, incur a high cost when mining far-frame feature dependencies. This results in a convolutional network of limited depth being unable to find the correlation between far frames, with insufficient capability for handling the target occlusion and target loss problems of tracking across multiple frames. Unlike stacking a plurality of continuous convolutional layers, the Transformer can effectively handle the far-frame tracking problem by mining the attention weight of each feature through a multi-head attention mechanism, but still faces the problem of expensive computational overhead. Meanwhile, the 3D multi-target tracking task often involves multiple modalities, such as 3D point cloud, RGB, far infrared, and so on. Due to the complexity of feature alignment, most existing methods adopt a post-fusion scheme, that is, each mode detects target frames separately and then performs matching and correction. This scheme places high requirements on the time complexity and precision of the matching, and the feature information of the other modes is not fully utilized in each mode's detection stage.
Based on the above problems, the invention provides a novel multi-mode multi-target 3D tracking method considering the far-frame dependency relationship. The strategy can mine long-distance context dependencies across multiple frames, and the proposed multi-mode fusion strategy executes the attention mechanism on every feature dimension of all modes, achieving an overall consideration of the weights of each feature element between modes and within modes.
Example 1:
fig. 2 is a step flowchart of a multi-mode multi-target 3D tracking method considering far-frame dependency, which is provided in an embodiment of the present invention, referring to fig. 2, the multi-mode multi-target 3D tracking method includes the following steps:
step1: and acquiring RGB information and 3D point cloud information of the multi-frame scene.
RGBD data and 3D point cloud data are acquired through 3D sensors such as a laser radar, a millimeter-wave radar and an RGBD camera, and multiple frames are shot continuously to obtain multi-frame multi-mode scene input data. The RGB data represents the input information of the 2D structure-texture mode; the depth data gives, for each RGB pixel, the distance to the object in 3D space, both coming from the same RGBD camera; the point cloud data is sampled from a plurality of 3D sensors, such as the multiple laser radar and millimeter-wave radar sensors carried in automatic driving, and is obtained by aligning them to the same world coordinate system through coordinate transformation and then modeling. These data may be captured by driving an unmanned vehicle or autonomous driving vehicle equipped with multiple 3D sensors and RGBD cameras.
Step2: filling the 3D sliding window.
The total frame sequence is evenly divided into 3D sliding windows, and in the last sliding window the tracking scene is padded to a fixed window size. If the frame number t of the last sliding window is smaller than the window size n, n-t frames are supplemented at the beginning of the last sliding window. The supplementary frames are obtained by sampling the same index position in each of the sliding windows preceding the last one, specifically: sample the first frame of each of the n-t sliding windows immediately preceding the last sliding window; if the number of frames missing in the last sliding window is larger than the number of currently existing sliding windows, directly take the n-t frames positioned immediately before the current sliding window in the total frame sequence; if frames are still insufficient, repeat the first frame of the total frame sequence to supplement.
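A minimal sketch of this padding rule is given below, assuming the frames are held in a Python list; the fallback order (first frames of earlier windows, then the frames immediately preceding the window, then repetition of the first frame) follows the paragraph above, and all names are illustrative.

```python
def pad_last_window(total_frames, window_size):
    """Split a frame sequence into fixed-size sliding windows and pad the
    last window to window_size following the rule of Step 2 (illustrative)."""
    n = window_size
    windows = [total_frames[i:i + n] for i in range(0, len(total_frames), n)]
    last = windows[-1]
    missing = n - len(last)
    if missing == 0:
        return windows
    full_windows = windows[:-1]
    if missing <= len(full_windows):
        # take the first frame of each of the n-t windows preceding the last one
        pad = [w[0] for w in full_windows[-missing:]]
    elif missing <= len(total_frames) - len(last):
        # otherwise take the n-t frames immediately before the current window
        start = len(total_frames) - len(last) - missing
        pad = total_frames[start:len(total_frames) - len(last)]
    else:
        # still not enough: repeat the first frame of the whole sequence
        pad = [total_frames[0]] * missing
    windows[-1] = pad + last
    return windows

# Example 2 scenario: 100 frames, window size 16 -> the last window is padded
# with the 12 frames immediately preceding it before its own 4 frames.
wins = pad_last_window(list(range(100)), 16)
assert len(wins[-1]) == 16
```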
Step3: the 3D coordinates of each frame within the sliding window are aligned to the 3D coordinate system of the learnable parameters.
Step3.1, generating a 3D coordinate transformation matrix M of learnable parameters; the matrix M is initialized as an identity matrix.
Step3.2, transforming each frame scene within the window into the 3D coordinate system of the learnable parameters:
P_i^l = P_i^w M
wherein P_i^w represents the point set of the i-th frame in the world coordinate system, and k represents the total number of points in the i-th frame; the matrix M of learnable parameters can explore a more effective 3D characterization space, giving P_i^l, the point set of the i-th frame in the 3D coordinate system of the learnable parameters.
Step4-6 is used to calculate the initial Key, Query and Value of the Transformer; the flowchart is shown in fig. 3, specifically as follows:
step4: the 3D grid is partitioned and the frame-by-frame 3D features are computed.
Step4.1, generating for each frame a learnable parameter of dimension 8 as the inter-frame space coding; the initialization weights are sampled from a Gaussian distribution.
The inter-frame space coding PE_inter^i of the i-th frame is a learnable parameter of dimension 8; PE_inter^i is sampled from a standard Gaussian distribution:
ω_j ~ N(0,1), j = 1,2,...,8 (2)
step4.2, using a partition task main network of PointNet++ to extract a frame-by-frame and point-by-point characteristic, wherein the dimension is 128, and the characteristic is used as intra-frame position coding.
Intra-frame position coding of an ith frameIs a feature representation of dimension 128. Intra position coding of the i-th frame +.>The method is extracted from a PointNet++ segmentation task main network, and comprehensively considers the characteristics of local structural relation, global structural relation, disorder, replacement inequality and the like of the point cloud.
Wherein,point set representation representing the ith frame in 3D coordinate system of the learnable parameters, +.>Intra-frame position representing the ith frameCoding, PN seg A split task master network representing PointNet++.
Step4.3, splicing the inter-frame space coding and the intra-frame position coding: each point in the same frame concatenates its own intra-frame position coding with the inter-frame space coding of the frame it belongs to, yielding a point-by-point 3D encoding of dimension 136.
The point-by-point 3D encoding PE_i ∈ R^(k×136) is the concatenation of the inter-frame space coding PE_inter^i and the intra-frame position coding PE_inner^i:
PE_i = con(rep(PE_inter^i, k), PE_inner^i)
wherein the rep function repeats the inter-frame space coding PE_inter^i k times, and the con function concatenates the repeated inter-frame space coding and the intra-frame position coding PE_inner^i in the second dimension.
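The following sketch assembles the 136-dimensional point-by-point encoding from an 8-dimensional learnable inter-frame code and a 128-dimensional per-point feature; the PointNet++ segmentation backbone is replaced by a small per-point MLP stub so the sketch stays self-contained, and all module names are assumptions.

```python
import torch
import torch.nn as nn

class PointwiseEncoding(nn.Module):
    """Builds the 136-d point-by-point 3D encoding of Steps 4.1-4.3.
    The PointNet++ segmentation backbone is replaced here by a per-point MLP
    stub purely so the sketch is self-contained."""
    def __init__(self, num_frames: int):
        super().__init__()
        # Step 4.1: one 8-d learnable inter-frame space code per frame, Gaussian init.
        self.inter = nn.Parameter(torch.randn(num_frames, 8))
        # Stand-in for the PointNet++ segmentation backbone producing 128-d features.
        self.intra_backbone = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))

    def forward(self, frame_idx: int, points: torch.Tensor) -> torch.Tensor:
        # points: (k, 3) coordinates of one frame in the learnable coordinate system
        intra = self.intra_backbone(points)                       # (k, 128) intra-frame coding
        inter = self.inter[frame_idx].expand(points.shape[0], 8)  # repeat the 8-d code k times
        return torch.cat([inter, intra], dim=1)                   # (k, 136) point-by-point code

enc = PointwiseEncoding(num_frames=16)
print(enc(0, torch.rand(100, 3)).shape)  # torch.Size([100, 136])
```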
Step4.4, the 3D ROI (region of interest) space is divided into an = m×m×m 3D grids, each grid being a candidate 3D initial detection box.
Step4.5, grouping all points according to the 3D initial detection boxes.
Step4.6, for all points in each initial detection box, calculating the inverse-distance-weighted mean of their 3D sliding-window representations, of dimension 136, called the frame-by-frame 3D features F_3d:
w_i(x) = 1 / d(x, x_i)^p,  F_3d(x) = Σ_i w_i(x) PE(x_i) / Σ_i w_i(x)
wherein p = 2, d(x, x_i)^p represents the squared distance from each point to the grid center point, and w_i(x) represents the inverse weighting using the squared distance.
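A sketch of Steps 4.4 to 4.6 is given below: points are bucketed into m×m×m ROI grid cells and each cell feature is the inverse-squared-distance-weighted mean of the 136-dimensional encodings of its points; the ROI bounds and the uniform cell layout are assumptions made for illustration.

```python
import torch

def grid_idw_pool(points, feats, roi_min, roi_max, m: int, p: int = 2, eps: float = 1e-6):
    """points: (k, 3) coordinates, feats: (k, c) per-point encodings.
    Returns (m*m*m, c) pooled features: per grid cell, the inverse
    distance (power p) weighted mean of the features of its points."""
    cell_size = (roi_max - roi_min) / m
    idx3 = ((points - roi_min) / cell_size).floor().clamp(0, m - 1).long()   # (k, 3) cell index
    flat = idx3[:, 0] * m * m + idx3[:, 1] * m + idx3[:, 2]                  # (k,) flattened cell id
    centers = roi_min + (idx3.float() + 0.5) * cell_size                     # centre of each point's cell
    w = 1.0 / (((points - centers) ** 2).sum(dim=1) + eps) ** (p / 2)        # inverse squared-distance weight
    out = torch.zeros(m * m * m, feats.shape[1])
    norm = torch.zeros(m * m * m)
    out.index_add_(0, flat, feats * w.unsqueeze(1))   # accumulate weighted features per cell
    norm.index_add_(0, flat, w)                       # accumulate weights per cell
    return out / norm.clamp_min(eps).unsqueeze(1)

pooled = grid_idw_pool(torch.rand(500, 3), torch.rand(500, 136),
                       roi_min=torch.zeros(3), roi_max=torch.ones(3), m=10)
print(pooled.shape)  # torch.Size([1000, 136])
```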
Step5: 2D pixel-by-pixel deep features are extracted and mapped, and frame-by-frame 3D mapping features are calculated.
Step5.1, extracting 2D semantic features using the real-time semantic segmentation network BiSeNetV2 as the 2D backbone:
R_2d^i = MLP(BN_v2(R_i))
wherein R_i ∈ R^(h×w×3) represents the RGB values of the 2D pixels of the i-th frame, BN_v2 represents a BiSeNetV2 backbone network pre-trained on the ImageNet dataset, the MLP function represents transforming the tail features to dimension 44 through a single-layer MLP, and R_2d^i represents the 2D features of the i-th frame.
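Below is a sketch of Step 5.1 with the BiSeNetV2 backbone replaced by a small convolutional stub, since loading a pre-trained BiSeNetV2 is outside the scope of this illustration; only the single-layer MLP projecting to 44 channels follows the description literally.

```python
import torch
import torch.nn as nn

class Pixelwise2DFeatures(nn.Module):
    """Step 5.1 sketch: a 2D backbone produces per-pixel deep features and a
    single-layer MLP projects them to 44 channels. The tiny conv stack below
    merely stands in for an ImageNet-pretrained BiSeNetV2 backbone."""
    def __init__(self, backbone_channels: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(                       # stand-in backbone
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, backbone_channels, 3, padding=1), nn.ReLU(),
        )
        self.mlp = nn.Linear(backbone_channels, 44)          # single-layer MLP to 44-d

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (n, 3, h, w) image batch for the frames of one window
        feat = self.backbone(rgb)                            # (n, c, h, w) deep features
        feat = feat.permute(0, 2, 3, 1)                      # (n, h, w, c) pixel-by-pixel layout
        return self.mlp(feat)                                # (n, h, w, 44) 2D features

net = Pixelwise2DFeatures()
print(net(torch.rand(2, 3, 32, 48)).shape)  # torch.Size([2, 32, 48, 44])
```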
Step5.2, mapping the 2D features to the 3D point cloud space.
Calculating the 3D coordinates, in the 3D coordinate system of the learnable parameters, corresponding to the j-th pixel point of the i-th frame:
p_j^l = M (T_i^{w2s})^{-1} (T_i^{s2c})^{-1} v_j
wherein v_j is the spatial coordinate of the rasterized 3D camera viewing cone (frustum) corresponding to the pixel point and d_j is the depth of the j-th pixel point; T_i^{s2c} represents the transformation matrix from the self coordinate system of the i-th frame to the camera coordinate system; T_i^{w2s} represents the transformation matrix from the world coordinate system of the i-th frame to the self coordinate system; M represents the 3D coordinate system transformation matrix of the learnable parameters. The first three dimensions of the resulting p_j^l are the 3D coordinates (α_j, β_j, γ_j) in the 3D coordinate system of the learnable parameters. These points are collectively referred to as the 3D mapping features R_3d ∈ R^(n×h×w×44); in this step the pixel points and the 3D points are in one-to-one correspondence, and a single camera yields h×w 3D points after the mapping.
Step5.3, extracting the frame-by-frame 3D mapping features according to the 3D grid grouping, specifically:
for all 3D mapping points in each 3D ROI grid, calculating the inverse-distance-weighted mean of their 3D mapping features R_3d according to the formula in Step4.6, called the frame-by-frame 3D mapping features F_3d^map, wherein the 3D ROI space is evenly divided into an = m×m×m 3D grids.
Step6: the query vector table generates an initial 3D detection box representation.
Generating the initial 3D detection frame feature representation of each frame in the sliding window by using the query vector table of learnable parameters, specifically:
anchor_x = embed(arange(1, m)), anchor_y = embed(arange(1, m)), anchor_z = embed(arange(1, m))
wherein the 3D ROI (region of interest) space is divided into m×m×m 3D grids, each grid being a candidate 3D initial detection box; anchor_x, anchor_y and anchor_z represent the hidden vector representations corresponding to each spatial dimension; the arange function is used for generating all integers from 1 to m; the embed function is a query vector table of learnable parameters, which has m rows, each row being a 60-dimensional learnable parameter initialized from a uniform distribution:
ω_j ~ U(0,1), j = 1,2,...,60 (10)
splicing the hidden vector representations corresponding to the spatial dimensions to obtain the hidden vector representations of the 3D anchors, called the initial 3D detection frame representation:
anchor_i = con(anchor_x, anchor_y, anchor_z)
anchor = rep(anchor_i, n) (12)
wherein anchor ∈ R^(n×an×180) represents all 3D anchor feature representations of the sliding window, referred to as the initial 3D detection frame representation, wherein an = m×m×m. The above two equations omit the reshape operation.
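A sketch of Step 6 is given below: three learnable 60-dimensional embedding tables, one per spatial axis, are looked up for the m cell indices, broadcast over the m×m×m grid, concatenated into 180 dimensions and repeated for the n frames of the window; the use of nn.Embedding with uniform initialization is an assumed concrete realization of the query vector table.

```python
import torch
import torch.nn as nn

class AnchorTable(nn.Module):
    """Step 6 sketch: per-axis learnable query vector tables producing the
    initial 3D detection-box representation of shape (n, m*m*m, 180)."""
    def __init__(self, m: int, dim: int = 60):
        super().__init__()
        self.m = m
        self.tables = nn.ModuleList(nn.Embedding(m, dim) for _ in range(3))
        for t in self.tables:                       # uniform initialization, as described
            nn.init.uniform_(t.weight, 0.0, 1.0)

    def forward(self, n_frames: int) -> torch.Tensor:
        idx = torch.arange(self.m)
        ex, ey, ez = (t(idx) for t in self.tables)  # each (m, 60)
        # broadcast each axis code over the full m x m x m grid, then concatenate
        grid = torch.cat([
            ex[:, None, None, :].expand(self.m, self.m, self.m, -1),
            ey[None, :, None, :].expand(self.m, self.m, self.m, -1),
            ez[None, None, :, :].expand(self.m, self.m, self.m, -1),
        ], dim=-1).reshape(self.m ** 3, -1)          # (m*m*m, 180)
        return grid.unsqueeze(0).expand(n_frames, -1, -1)  # repeat for every frame

anchors = AnchorTable(m=10)(n_frames=16)
print(anchors.shape)  # torch.Size([16, 1000, 180])
```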
Step7: the transducer predicts the multi-frame multi-target detection frame position, size, angle and class probability at the same time.
The network architecture of the network designed by the present invention is shown in fig. 4, where the Transformer employs alternating sparse and dense attention mechanisms, with the MLP at the tail of each block also using relatively small dimensions 256 and 180. In addition, the attention mechanism in the figure acts directly on each dimension of the multi-frame multi-mode features, that is, attention is mined between modes while it is mined within modes; the feature fusion strategy executes the attention mechanism on all feature dimensions of all modes, comprehensively considers the correlation between modes and between the dimensions within each mode, and implicitly calculates the fusion weights during tracking, so that the multi-mode features are fused more reasonably.
The flow chart of the Transformer calculating the detection frame position offsets and class probabilities is shown in fig. 5; the steps are as follows:
step7.1, using a concatenation of frame-by-frame 3D features and frame-by-frame 3D mapping features as Key and Value, using an initial 3D detection frame representation as Query, and using a transducer decoder with alternating sparse and dense attention to generate a multi-frame multi-target 3D detection frame, specifically:
Q=anchor (14)
wherein,the function represents alternately performing dense and sparse masks, involving an alternation of the three modes of expanded attention atrous, local attention local and non-local attention non. Expansion attention requires that each element be associated only with elements of relative distance k,2k,3k 2 /k); local attention requires that each element is associated with only k elements before and after, the time complexity O (k); non-local attention quadratic calculation of the product of any two, time complexity O (n 2 ). The mask is embodied by subtracting a maximum number, such as 1e5, from the non-interesting locations. Subsequent blocks of the transducer use a multi-headed self-attention mechanism consistent with Key, value and Query:
K i+1 =V i+1 =Q i+1 =block i (K i ,V i ,Q i ) (16)
the output of the last block is fed into a two-layer MLP to obtain the predicted multi-frame multi-target detection frames:
bbox ∈ R^(n×an×(m+7))
wherein bbox represents the predicted multi-frame multi-target detection frames, m represents the total number of categories, and m+7 represents, respectively, the classified one-hot probability and the regressed 3D frame size, relative position and angle; m is 1 more than the number of real categories, with the first category representing whether there is an object.
In the concrete implementation, the attention mode is alternated according to the block index modulo 3, and whether the block index is 0 determines the source of Key, Value and Query.
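The three attention modes and their alternation could be realized as additive masks, for example as in the sketch below: non-local attention leaves every position visible, local attention keeps only positions within k steps, and dilated (atrous) attention keeps only positions at multiples of k, with blocked positions receiving a large negative value (here -1e5) before the softmax. Which dimensions of the n×an×180 tensors form the attention sequence is not restated here, so this is a stand-alone illustrative example rather than the network itself.

```python
import torch

def attention_mask(seq_len: int, mode: str, k: int = 4) -> torch.Tensor:
    """Additive attention mask (0 = attend, -1e5 = blocked) for one block."""
    i = torch.arange(seq_len)
    dist = (i[:, None] - i[None, :]).abs()
    if mode == "non_local":                 # dense: every pair interacts, O(n^2)
        keep = torch.ones(seq_len, seq_len, dtype=torch.bool)
    elif mode == "local":                   # each element sees +-k neighbours
        keep = dist <= k
    elif mode == "atrous":                  # dilated: relative distances 0, k, 2k, 3k, ...
        keep = dist % k == 0
    else:
        raise ValueError(mode)
    return torch.where(keep, torch.zeros(()), torch.full((), -1e5))

def masked_attention(q, kk, v, mask):
    """Scaled dot-product attention with the additive mask applied to the scores."""
    scores = q @ kk.transpose(-2, -1) / q.shape[-1] ** 0.5 + mask
    return torch.softmax(scores, dim=-1) @ v

x = torch.rand(1000, 180)
modes = ["non_local", "atrous", "local"]    # blocks alternate by index modulo 3
for block_idx in range(9):
    m = attention_mask(1000, modes[block_idx % 3])
    x = masked_attention(x, x, x, m)
print(x.shape)  # torch.Size([1000, 180])
```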
Step8: the hungarian algorithm performs inter-frame ID matching.
Step9: if the training stage is not performed, a Kalman filter is adopted to perform noise filtering and updating (reasoning stage) of the 3D detection frame; if the training phase is the training phase, the classification of the detection frame is guided by using the Focal Loss with balanced category, the regression of the coordinate offset of the detection frame is guided by using the L1 Loss (training phase), and the network weight is updated by using a random gradient descent algorithm with momentum (training phase). Focal Loss evaluation is used for predicting the class Loss of the detection frame, and the Loss can effectively relieve the influence caused by class imbalance. The position regression Loss for the test frames was evaluated using L1 Loss. The final loss is a weighted sum of the two.
The above embodiment has at least the following technical effects:
firstly, the 3D tracking method considers the long-distance context dependency relationship among the multiple frames in a sliding window and automatically learns the feature weights among different points of different frames by using a Transformer, so that the problems of target occlusion and target loss during multi-frame tracking are effectively relieved;
secondly, unlike the method of detecting before tracking, the 3D tracking method predicts multi-frame multi-target detection frames at the same time, which saves the read-write time consumed by frame-by-frame prediction;
thirdly, the 3D tracking method uses a Transformer with alternating sparse and dense attention to decode, so that the algorithm as a whole has a relatively small parameter count and computation cost;
fourth, the attention mechanism of the 3D tracking method directly acts on each dimension of multi-frame multi-mode features, namely, the inter-mode attention mining is carried out while the intra-mode attention mining is carried out, the attention mechanism is executed on all feature dimensions of all modes, the relevance between the modes and each dimension in the modes is comprehensively considered, and the fusion weight is implicitly calculated in the tracking process, so that the multi-mode features are fused more reasonably;
fifth, the 3D tracking method unifies all modes into a 3D grid representation space, puts forward a concept of frame-by-frame 3D characteristics to represent multi-mode characteristic hidden vectors, overcomes interference caused by different 3D points of all modes, and enables the multi-modes to be pluggable and easy to expand.
Example 2:
taking unmanned aerial vehicle tracking multi-target in a certain area of a certain city as an example, the invention further discloses a multi-mode multi-target 3D tracking method considering a far-frame dependency relationship, which comprises the following steps: the system receives point cloud and pixel information acquired by an unmanned aerial vehicle airborne laser radar and an RGBD camera, the information is acquired by continuous flying and pitching of the unmanned aerial vehicle, and a camera parameter matrix of the unmanned aerial vehicle, a transformation matrix from a camera coordinate system to a radar coordinate system of each frame, a transformation matrix from a self coordinate system to a radar coordinate system of each frame and a transformation matrix from a world coordinate system of each frame to a self coordinate system are all known. It is assumed that one RGBD camera and a plurality of lidars and millimeter wave radars are mounted on board.
Step1: and acquiring RGB information and 3D point cloud information of the multi-frame scene.
Assume that an RGBD camera obtains RGB data with shape (100, 3, h, w) and depth data with shape (100, 1, h, w); and aligning the plurality of laser radars and the millimeter wave radars to a 3D world coordinate system to obtain continuous 100 point cloud scenes.
Step2: filling the 3D sliding window.
Assuming a sliding window size of 16, 100 frames of data are equally divided into 7 groups, with the last group of missing 12 frames sampled from frame 85 to frame 96. After division into windows, RGB data shape is (7, 16,3, h, w), depth data shape is (7, 16,1, h, w), and the point cloud scene is supplemented to 112 frames.
Step3: the 3D coordinates of each frame within the sliding window are aligned to the 3D coordinate system of the learnable parameters.
Step3.1, generating a 3D coordinate transformation matrix M of learnable parameters; the matrix M is initialized as an identity matrix.
Step3.2, transforming each frame of scene in the window into a 3D coordinate system of the learnable parameters, and transforming the coordinates of all points in the 112-frame point cloud scene into the 3D coordinate system of the same learnable parameters.
Step4: the 3D grid is partitioned and the frame-by-frame 3D features are computed.
Step4.1, generating for each frame a learnable parameter of dimension 8 as the inter-frame space coding, with initialization weights sampled from a Gaussian distribution; in the described case the shape of the inter-frame space coding PE_inter is (7, 16, 8).
Step4.2, using the segmentation task main network of PointNet++ to extract frame-by-frame, point-by-point features of dimension 128 as the intra-frame position coding; in the described case the shape of the intra-frame position coding PE_inner is (7, 16, k, 128), where k is not a constant value and represents the number of 3D points each frame has.
Step4.3, splicing the inter-frame space coding and the intra-frame position coding: each point in the same frame concatenates its own intra-frame position coding with the inter-frame space coding of the frame it belongs to, yielding a point-by-point 3D encoding of dimension 136. In the described case the shape of the spliced feature PE is (7, 16, k, 136).
Step4.4, the 3D ROI (region of interest) space is divided into 10×10×10 3D grids, each of which is a candidate 3D initial detection box.
Step4.5, grouping all points according to the 3D initial detection box.
Step4.6, for all points in each grid, calculating the inverse-distance-weighted mean of the spliced inter-frame and intra-frame position coding features PE, of dimension 136, called the frame-by-frame 3D features; in the described case the shape of the frame-by-frame 3D features is (7, 16, 1000, 136).
Step5: 2D pixel-by-pixel deep features are extracted and mapped, and frame-by-frame 3D mapping features are calculated.
Step5.1, extracting 2D semantic features of the RGB modality using the lightweight semantic segmentation network BiSeNetV2 as the 2D backbone, and splicing an MLP to obtain a feature representation of dimension 44; in the described case the shape of the 2D features R_2d is (7, 16, h, w, 44).
Step5.2, mapping the 2D features to the 3D coordinate system of the learnable parameters to obtain point-by-point 2D mapping features of dimension 44; in the described case the shape of the point-by-point mapping features R_3d is (7, 16, h×w, 44).
Step5.3, for all 3D mapping points within each 3D ROI grid, calculating the inverse-distance-weighted mean of their 3D mapping features R_3d, called the frame-by-frame 3D mapping features; in the described case the shape of the frame-by-frame 3D mapping features is (7, 16, 1000, 44).
Step6: generating initial 3D detection frame feature representations of each frame in the sliding window by using a query vector table capable of learning parameters, and splicing hidden vector representations corresponding to the space dimension to obtain hidden vector representations of the 3D anchor, which are called initial 3D detection frame representations. The initial 3D detection box of the described case indicates that shape of the anchor is (7, 16, 1000, 180), where 180=60+60+60, representing the sum of the look-up vector table encodings of the three spatial coordinate axes.
Step7: transformer is identical toAnd predicting the position, the size, the angle and the class probability of the multi-frame multi-target detection frame. And taking the concatenation of the frame-by-frame 3D features and the frame-by-frame 3D mapping features as Key and Value, wherein shape is (7, 16, 1000, 180), and 180 = 136+44. And taking the initial 3D detection frame representation as a Query (Key, value and Query of the subsequent block are all from the output of the last block), generating a multi-frame multi-target 3D detection frame by using a transducer decoder with alternate sparse and dense attention, and splicing one two-layer MLP by the output of the last block to obtain a predicted multi-frame multi-target detection frame. Suppose that the transducer uses 9 blocks in total, there are 12 categories of targets in the scene. In the described case, the attention mechanism is alternately controlled by judging the result of the current block for the remainder 3; and judging whether the current block is 0 to control the values of Key, value and Query. The 0 th block will frame-by-frame 3D featureAs Key and Value, the initial 3D detection box represents an anchor as Query; the 1 st to 8 th blocks use the output of the last block as Key, value and Query; the 0,3, 6 th blocks use non-local attention; the 1 st, 4 th and 7 th blocks use expansion attention; the 2 nd, 5 th and 8 th blocks use local attention; the shape of the respective inputs and outputs of each block is (7, 16, 1000, 180); the output of the last block is spliced with a two-layer MLP to obtain a predicted multi-frame multi-target detection frame +.>Is (7, 16, 1000, 20). Wherein 20=1+12+7, the first dimension represents the empty class, the first 13 dimensions are calculated from softmax, and the last 7 dimensions represent the relative position, size, angle of the 3D detection frame:
(h_shift, w_shift, d_shift, center_x_shift, center_y_shift, center_z_shift, cosθ).
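To make the seven regression channels concrete, the sketch below converts one candidate's output into an absolute box, assuming each candidate corresponds to a known ROI grid-cell centre and a nominal base size; those anchor quantities and the use of arccos to recover the yaw are assumptions made for illustration, since only the channel layout is given.

```python
import numpy as np

def decode_box(reg, cell_center, base_size=(1.0, 1.0, 1.0)):
    """reg: the 7 regression channels (h_shift, w_shift, d_shift,
    center_x_shift, center_y_shift, center_z_shift, cos_theta).
    Returns (center xyz, size hwd, yaw) for one 3D detection box."""
    h_shift, w_shift, d_shift, cx, cy, cz, cos_theta = reg
    size = np.array(base_size) + np.array([h_shift, w_shift, d_shift])   # regressed 3D size
    center = np.asarray(cell_center) + np.array([cx, cy, cz])            # relative position offset
    yaw = float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))                # angle from its cosine
    return center, size, yaw

center, size, yaw = decode_box(
    reg=[0.2, -0.1, 0.0, 0.3, 0.0, -0.2, 0.5],
    cell_center=[12.0, 4.0, 0.5])
print(center, size, round(yaw, 3))
```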
step8: the hungarian algorithm performs inter-frame ID matching.
Step9: if the training stage is not performed, a Kalman filter is adopted to perform noise filtering and updating (reasoning stage) of the 3D detection frame; if the training phase is the training phase, the classification of the detection frame is guided by using the Focal Loss with balanced category, the regression of the coordinate offset of the detection frame is guided by using the L1 Loss (training phase), and the network weight is updated by using a random gradient descent algorithm with momentum (training phase).
Example 3:
fig. 6 is a block diagram of a multi-mode multi-target 3D tracking device according to an embodiment of the present invention, where the multi-mode multi-target 3D tracking device considers far-frame dependency, and the device includes:
the acquisition module is used for acquiring RGB information and 3D point cloud information of a multi-frame scene;
a filling module to fill a tracking scene within a 3D sliding window to a fixed window size;
an alignment module to align 3D coordinates of each frame within the sliding window to a 3D coordinate system of a learnable parameter;
the division module is used for dividing the 3D grids and extracting frame-by-frame 3D features according to the 3D grids;
the extraction module is used for extracting 2D RGB pixel-by-pixel deep features, mapping the deep features to a 3D coordinate system of a learnable parameter, and extracting frame-by-frame 3D mapping features according to the 3D grid;
the query module is used for querying a vector table of the learnable parameters to generate an initial 3D detection frame representation;
the prediction module is used for simultaneously predicting the positions, the sizes, the angles and the category probabilities of the multi-frame multi-target detection frames through the Transformer decoder;
the matching module is used for carrying out ID matching of the 3D detection frames of each frame by adopting a Hungarian algorithm;
and the updating module is used for carrying out noise filtering and updating of the 3D detection frame by adopting a Kalman filter.
The functions of each module in embodiment 3 correspond to the content in the corresponding method embodiment, and are not described herein.
Example 4:
fig. 7 shows a schematic structural diagram of an electronic device 70 according to an embodiment of the present invention, where the electronic device 70 includes at least one processor 701 (e.g. a CPU), at least one input/output interface 704, a memory 702, and at least one communication bus 703 for enabling connection communication between these components. At least one processor 701 is configured to execute computer instructions stored in a memory 702 to enable the at least one processor 701 to perform an embodiment of any one of the 3D tracking methods described previously. The memory 702 is a non-transitory memory (non-transitory memory) that may include volatile memory, such as high-speed random access memory (RAM: random Access Memory), or may include non-volatile memory, such as at least one disk memory. Communication connection(s) with at least one other device or unit is effected through at least one input output interface 704 (which may be a wired or wireless communication interface).
In some embodiments, the memory 702 stores a program 7021, and the processor 701 executes the program 7021 to perform any of the foregoing 3D tracking method embodiments.
The electronic device may exist in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include: smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones, etc.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc., such as iPad.
(3) Portable entertainment device: such devices may display and play multimedia content. The device comprises: audio, video players (e.g., iPod), palm game consoles, electronic books, and smart toys and portable car navigation devices.
(4) Specific server: the configuration of the server includes a processor, a hard disk, a memory, a system bus, and the like, and the server is similar to a general computer architecture, but is required to provide highly reliable services, and thus has high requirements in terms of processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A multi-mode multi-target 3D tracking method considering far-frame dependency relationship is characterized by comprising the following steps:
s1: acquiring RGB information and 3D point cloud information of a multi-frame scene;
s2: filling a tracking scene into the 3D sliding window to a fixed window size;
s3: aligning 3D coordinates of each frame within the sliding window to a 3D coordinate system of a learnable parameter;
s4: dividing a 3D grid, and extracting frame-by-frame 3D features according to the 3D grid;
s5: extracting 2D RGB pixel-by-pixel deep features, mapping the deep features to a 3D coordinate system of a learnable parameter, and extracting frame-by-frame 3D mapping features according to the 3D grid;
s6: querying a vector table of the learnable parameters to generate an initial 3D detection frame representation;
s7: simultaneously predicting the positions, the sizes, the angles and the category probabilities of multi-frame multi-target detection frames through a Transformer decoder;
s8: adopting a Hungarian algorithm to carry out ID matching of the 3D detection frames of each frame;
s9: and updating the 3D detection frame by adopting a Kalman filter.
2. The multi-mode multi-target 3D tracking method according to claim 1, wherein the step S2 specifically comprises: filling the tracking scene into the 3D sliding window to a fixed window size, and supplementing n-t frames if the last sliding window frame number t is smaller than the window size n.
3. The multi-mode multi-target 3D tracking method according to claim 1, wherein the step S3 specifically comprises:
s3.1: generating a 3D coordinate system transformation matrix M of learnable parameters; the matrix M is initialized as an identity matrix;
s3.2: transforming each frame scene within the window into the 3D coordinate system of the learnable parameters:
P_i^l = P_i^w M
wherein P_i^w represents the point set of the i-th frame in the world coordinate system, and k represents the total number of points in the i-th frame.
4. The multi-mode multi-target 3D tracking method according to claim 3, wherein in step S4, before dividing the 3D grid, inter-frame position encoding and intra-frame position encoding are further performed on each frame 3D coordinate in the sliding window, specifically:
the inter-frame space coding PE_inter^i of the i-th frame is a learnable parameter of dimension 8; PE_inter^i is sampled from a standard Gaussian distribution:
ω_j ~ N(0,1), j = 1,2,...,8;
the intra-frame position coding PE_inner^i of the i-th frame is a feature representation of dimension 128; it is extracted from the segmentation task main network of PointNet++:
PE_inner^i = PN_seg(P_i^l)
wherein P_i^l represents the point set of the i-th frame in the 3D coordinate system of the learnable parameters, PE_inner^i represents the intra-frame position coding of the i-th frame, and PN_seg represents the segmentation task main network of PointNet++;
the final 3D feature representation PE_i ∈ R^(k×136) is the concatenation of the inter-frame space coding PE_inter^i and the intra-frame position coding PE_inner^i:
PE_i = con(rep(PE_inter^i, k), PE_inner^i)
wherein the rep function repeats the inter-frame space coding PE_inter^i k times, and the con function concatenates the repeated inter-frame space coding and the intra-frame position coding PE_inner^i in the second dimension.
5. The multi-modal multi-target 3D tracking method according to claim 4, wherein the step S4 specifically comprises:
S4.1: dividing the 3D grid and grouping all points according to the 3D grid, wherein the 3D ROI space is evenly divided into an = m×m×m 3D grid cells;
S4.2: for all points within each group, calculating the inverse-distance-weighted mean of their 3D feature representations PE, called the frame-by-frame 3D features F^3d:
F^3d(x) = Σ_i w_i(x) · PE(x_i) / Σ_i w_i(x),  w_i(x) = 1 / d(x, x_i)^p
wherein x denotes the grid-cell center point, x_i the points within the cell, p = 2, d(x, x_i)^p represents the squared distance from each point to the center point, and w_i(x) represents the inverse weighting using the squared distance.
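Illustrative note on claim 5 (not part of the claim text): a minimal sketch of inverse-squared-distance pooling of per-point features into the cell centers of an m×m×m ROI grid; the ROI bounds and the handling of empty cells are assumptions.

```python
import numpy as np

def idw_grid_pool(points, feats, m=4, roi_min=-10.0, roi_max=10.0, eps=1e-6):
    """Pool per-point features into an m*m*m grid by inverse squared-distance weighting.

    points: (k, 3) coordinates in the (learnable) 3D coordinate system.
    feats:  (k, c) per-point features (e.g. the k x 136 encodings of claim 4).
    Returns an (m*m*m, c) array of per-cell features; empty cells stay zero."""
    cell = (roi_max - roi_min) / m
    # Cell index of each point along each axis, clipped into the ROI.
    idx = np.clip(((points - roi_min) // cell).astype(int), 0, m - 1)
    flat = idx[:, 0] * m * m + idx[:, 1] * m + idx[:, 2]
    # Center coordinate of the cell containing each point, and its inverse squared-distance weight.
    centers = roi_min + (idx + 0.5) * cell
    w = 1.0 / (np.sum((points - centers) ** 2, axis=1) + eps)   # p = 2 as in the claim
    out = np.zeros((m * m * m, feats.shape[1]))
    wsum = np.zeros(m * m * m)
    np.add.at(out, flat, w[:, None] * feats)                    # accumulate weighted features per cell
    np.add.at(wsum, flat, w)
    nonempty = wsum > 0
    out[nonempty] /= wsum[nonempty, None]
    return out

print(idw_grid_pool(np.random.uniform(-10, 10, (500, 3)), np.random.rand(500, 136)).shape)
# (64, 136)
```

Weighting by inverse squared distance gives points near the cell center more influence than points near the cell boundary, which is the usual motivation for this kind of pooling.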
6. The multi-mode multi-target 3D tracking method according to claim 5, wherein the step S5 specifically comprises:
S5.1: extracting 2D semantic features by using the real-time semantic segmentation network BiSeNetV2 as the 2D backbone:
F_i^2d = MLP(BN_v2(R_i))
wherein R_i ∈ R^{h×w×3} represents the RGB values of the 2D pixels of the i-th frame, BN_v2 represents a BiSeNetV2 backbone network pre-trained on the ImageNet dataset, the MLP function represents transforming the tail features to 44 dimensions via a single-layer MLP, and F_i^2d represents the 2D features of the i-th frame;
S5.2: calculating the 3D coordinates, in the 3D coordinate system of learnable parameters, corresponding to the j-th pixel point of the i-th frame:
(α_j, β_j, γ_j, 1) = T · (T_i^{w2s})^{-1} · (T_i^{s2c})^{-1} · (D_j · c_j)
wherein c_j is the spatial coordinate of the rasterized 3D camera viewing frustum corresponding to the pixel point, D_j is the depth of the j-th pixel point, T_i^{s2c} represents the transformation matrix from the self coordinate system of the i-th frame to the camera coordinate system, T_i^{w2s} represents the transformation matrix from the world coordinate system of the i-th frame to the self coordinate system, T represents the learnable-parameter 3D coordinate system transformation matrix, and (α_j, β_j, γ_j) are the 3D coordinates in the 3D coordinate system of learnable parameters;
S5.3: for all 3D mapped points in each 3D ROI grid cell, obtained according to the formula in S5.2, calculating the inverse-distance-weighted mean of their 3D mapping features R^3d, called the frame-by-frame 3D mapping features F^map, wherein the 3D ROI space is evenly divided into an = m×m×m 3D grid cells.
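Illustrative note on step S5.2 of claim 6 (not part of the claim text): a minimal sketch of lifting one pixel with known depth through the camera, self (ego), world and learnable coordinate systems using homogeneous 4x4 matrices; the intrinsics handling and matrix conventions are assumptions, since the claim only names the transforms involved.

```python
import numpy as np

def pixel_to_learnable(u, v, depth, K_intr, T_self2cam, T_world2self, T_learn):
    """Map one pixel (u, v) with depth into the learnable 3D coordinate system.

    K_intr:       3x3 camera intrinsics (defines the viewing-frustum ray of the pixel).
    T_self2cam:   4x4 self(ego)-to-camera transform of this frame.
    T_world2self: 4x4 world-to-self transform of this frame.
    T_learn:      4x4 learnable world-to-learnable-frame transform.
    Returns (alpha, beta, gamma) in the learnable coordinate system."""
    # Frustum ray scaled by depth -> 3D point in the camera coordinate system.
    ray = np.linalg.inv(K_intr) @ np.array([u, v, 1.0])
    p_cam = np.append(depth * ray, 1.0)                       # homogeneous camera-frame point
    p_self = np.linalg.inv(T_self2cam) @ p_cam                # camera -> self (ego)
    p_world = np.linalg.inv(T_world2self) @ p_self            # self -> world
    p_learn = T_learn @ p_world                               # world -> learnable frame
    return p_learn[:3] / p_learn[3]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
I4 = np.eye(4)
print(pixel_to_learnable(320, 240, 5.0, K, I4, I4, I4))  # [0. 0. 5.]
```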
7. The multi-modal multi-target 3D tracking method according to claim 6, wherein the step S6 specifically comprises: generating the initial 3D detection frame feature representation of each frame within the sliding window by using the query vector table of learnable parameters:
anchor_d = embed(arange(1, m)),  d ∈ {x, y, z}
wherein the 3D ROI space is divided into an = m×m×m 3D grid cells, each cell being a candidate initial 3D detection box; anchor_d represents the hidden vector representation corresponding to each spatial dimension; the arange function generates all integers from 1 to m; the embed function is a query vector table of learnable parameters with m rows, each row being a 60-dimensional learnable parameter initialized from a uniform distribution:
ω_j ~ U(0,1), j = 1,2,...,60;
the hidden vector representations corresponding to the three spatial dimensions are concatenated for every grid cell to obtain the hidden vector representation anchor_i ∈ R^{an×180} of the 3D anchors of one frame, which is then repeated over the window:
anchor = rep(anchor_i, n)
wherein anchor ∈ R^{n×an×180} represents all 3D anchor feature representations of the sliding window, referred to as the initial 3D detection frame representation, and an = m×m×m.
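Illustrative note on claim 7 (not part of the claim text): a minimal sketch of building the anchor representation from a learnable query vector table: m rows of 60-dimensional parameters, combined per spatial axis into m^3 cells of 180 dimensions and repeated over the n frames; sharing one table across the three axes is an assumption.

```python
import torch
import torch.nn as nn

class AnchorTable(nn.Module):
    """Initial 3D detection-box (anchor) representation built from a learnable query vector table."""
    def __init__(self, m: int, dim: int = 60):
        super().__init__()
        self.m, self.dim = m, dim
        # Query vector table: m rows of dim-dimensional learnable parameters, initialized from U(0, 1).
        self.embed = nn.Embedding(m, dim)
        nn.init.uniform_(self.embed.weight, 0.0, 1.0)

    def forward(self, n_frames: int) -> torch.Tensor:
        e = self.embed(torch.arange(self.m))                     # (m, dim) per-axis hidden vectors
        # Combine the x/y/z hidden vectors of every grid cell into one 3*dim vector per cell.
        ex = e[:, None, None, :].expand(self.m, self.m, self.m, self.dim)
        ey = e[None, :, None, :].expand(self.m, self.m, self.m, self.dim)
        ez = e[None, None, :, :].expand(self.m, self.m, self.m, self.dim)
        anchor_i = torch.cat([ex, ey, ez], dim=-1).reshape(self.m ** 3, 3 * self.dim)  # (an, 180)
        return anchor_i.unsqueeze(0).expand(n_frames, -1, -1)    # rep(anchor_i, n) -> (n, an, 180)

print(AnchorTable(m=4)(n_frames=5).shape)  # torch.Size([5, 64, 180])
```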
8. The multi-modal multi-target 3D tracking method according to claim 7, wherein the step S7 specifically comprises: using the concatenation of the frame-by-frame 3D features and the frame-by-frame 3D mapping features as Key and Value and the initial 3D detection frame representation as Query, and generating the multi-frame multi-target 3D detection frames with a Transformer decoder that alternates sparse and dense attention:
K_0 = V_0 = con(F^3d, F^map),  Q_0 = anchor
wherein F^3d represents the frame-by-frame 3D features, F^map represents the frame-by-frame 3D mapping features, and the mask function alternately applies dense and sparse attention masks; subsequent blocks of the Transformer use a multi-head self-attention mechanism in which Key, Value and Query are identical:
K_{i+1} = V_{i+1} = Q_{i+1} = block_i(K_i, V_i, Q_i);
the output of the last block is passed through a two-layer MLP to obtain the predicted multi-frame multi-target detection frames:
B = MLP(Q_out)
wherein Q_out denotes the output of the last block, B represents the predicted multi-frame multi-target detection frames, m represents the total number of categories, and the m+7 output dimensions respectively represent the one-hot classification probabilities and the regressed 3D box size, relative position and angle.
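Illustrative note on claim 8 (not part of the claim text): a minimal sketch of the decoder wiring, with the concatenated scene features as Key/Value, the anchor representation as Query, stacked attention blocks in which Key, Value and Query coincide after the first block, and a two-layer MLP head producing m+7 outputs per box. Standard dense multi-head attention is used throughout; the alternating dense/sparse masking of the claim is not reproduced.

```python
import torch
import torch.nn as nn

class BoxDecoder(nn.Module):
    """Anchor queries attend to per-cell scene features, then an MLP regresses m+7 values per box."""
    def __init__(self, d_model: int, num_classes: int, num_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads=4, batch_first=True) for _ in range(num_blocks)]
        )
        # Two-layer MLP head: class probabilities (m) + size/position/angle (7).
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_classes + 7)
        )

    def forward(self, anchor: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # anchor: (n, an, d) initial 3D detection-box representation (Query).
        # feats:  (n, cells, d) concatenation of frame-by-frame 3D and 3D mapping features (Key = Value).
        q, k, v = anchor, feats, feats
        for block in self.blocks:
            q, _ = block(q, k, v)       # subsequent blocks reuse the updated representation
            k = v = q                   # Key = Value = Query, as in the claim
        return self.head(q)             # (n, an, num_classes + 7)

dec = BoxDecoder(d_model=180, num_classes=3)
print(dec(torch.rand(5, 64, 180), torch.rand(5, 200, 180)).shape)  # torch.Size([5, 64, 10])
```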
9. A multi-modal multi-target 3D tracking apparatus taking into account far-frame dependencies, comprising:
the acquisition module is used for acquiring RGB information and 3D point cloud information of a multi-frame scene;
the filling module is used for filling a tracking scene within a 3D sliding window to a fixed window size;
the alignment module is used for aligning the 3D coordinates of each frame within the sliding window to the 3D coordinate system of learnable parameters;
the division module is used for dividing the 3D grids and extracting frame-by-frame 3D features according to the 3D grids;
the extraction module is used for extracting pixel-by-pixel 2D RGB deep features, mapping the deep features to the 3D coordinate system of learnable parameters, and extracting frame-by-frame 3D mapping features according to the 3D grid;
the query module is used for querying the vector table of learnable parameters to generate an initial 3D detection frame representation;
the prediction module is used for simultaneously predicting the positions, sizes, angles and category probabilities of multi-frame multi-target detection frames through a Transformer decoder;
the matching module is used for performing ID matching of the 3D detection frames of each frame by adopting the Hungarian algorithm;
and the updating module is used for carrying out noise filtering and updating of the 3D detection frame by adopting a Kalman filter.
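Illustrative note on the matching and updating modules (steps S8 and S9; not part of the claim text): a minimal sketch of Hungarian ID matching on box centers followed by a constant-velocity Kalman update; the center-distance cost, gating threshold and noise covariances are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def match_ids(prev_centers, curr_centers, max_dist=2.0):
    """Assign current detections to previous track IDs by center distance.
    Returns (prev_idx, curr_idx) pairs; unmatched detections would start new IDs."""
    if len(prev_centers) == 0 or len(curr_centers) == 0:
        return []
    cost = np.linalg.norm(prev_centers[:, None, :] - curr_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

class CenterKalman:
    """Constant-velocity Kalman filter over a 3D box center (state: x, y, z, vx, vy, vz)."""
    def __init__(self, center, dt=0.1):
        self.x = np.hstack([center, np.zeros(3)])
        self.P = np.eye(6)
        self.F = np.eye(6); self.F[:3, 3:] = dt * np.eye(3)   # state transition
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])     # only the center is observed
        self.Q = 0.01 * np.eye(6)                             # process noise (assumed)
        self.R = 0.1 * np.eye(3)                              # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)              # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]

prev = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
curr = np.array([[0.2, 0.1, 0.0], [5.1, 0.0, 0.0]])
print(match_ids(prev, curr))  # [(0, 0), (1, 1)]
```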
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the multi-modal multi-target 3D tracking method according to any one of claims 1 to 8.
CN202311459231.4A 2023-11-03 Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship Active CN117576150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311459231.4A CN117576150B (en) 2023-11-03 Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship

Publications (2)

Publication Number Publication Date
CN117576150A true CN117576150A (en) 2024-02-20
CN117576150B CN117576150B (en) 2024-07-26

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10798380B1 (en) * 2019-03-18 2020-10-06 Amazon Technologies, Inc. Adaptive use of search modes based on neighboring blocks
CN110717403A (en) * 2019-09-16 2020-01-21 国网江西省电力有限公司电力科学研究院 Face multi-target tracking method
CN112037269A (en) * 2020-08-24 2020-12-04 大连理工大学 Visual moving target tracking method based on multi-domain collaborative feature expression
CN113033657A (en) * 2021-03-24 2021-06-25 武汉理工大学 Multi-user behavior identification method based on Transformer network
CN115063717A (en) * 2022-06-08 2022-09-16 南京信息技术研究院 Video target detection and tracking method based on key area live-action modeling
CN115861883A (en) * 2022-07-20 2023-03-28 国能宁夏灵武发电有限公司 Multi-target detection tracking method
CN115393680A (en) * 2022-08-08 2022-11-25 武汉理工大学 3D target detection method and system for multi-mode information space-time fusion in foggy day scene
CN115345905A (en) * 2022-08-22 2022-11-15 河北科技大学 Target object tracking method, device, terminal and storage medium
CN115457288A (en) * 2022-09-26 2022-12-09 北京易航远智科技有限公司 Multi-target tracking method and device based on aerial view angle, storage medium and equipment
CN115731382A (en) * 2022-11-17 2023-03-03 福思(杭州)智能科技有限公司 Point cloud target detection method and device, computer equipment and storage medium
CN116309725A (en) * 2023-03-30 2023-06-23 中国矿业大学 Multi-target tracking method based on multi-scale deformable attention mechanism
CN116883458A (en) * 2023-09-06 2023-10-13 中国科学技术大学 Transformer-based multi-target tracking system fusing motion characteristics with observation as center

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hsu-Kuang Chiu, Jie Li, Rareş Ambruş, et al.: "Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving", 2021 IEEE International Conference on Robotics and Automation (ICRA), 18 October 2021 (2021-10-18), pages 1-7 *

Similar Documents

Publication Publication Date Title
CN108230361B (en) Method and system for enhancing target tracking by fusing unmanned aerial vehicle detector and tracker
CN111968229B (en) High-precision map making method and device
CN111291885A (en) Near-infrared image generation method, network generation training method and device
Yang et al. Reactive obstacle avoidance of monocular quadrotors with online adapted depth prediction network
KR20200075727A (en) Method and apparatus for calculating depth map
dos Santos Rosa et al. Sparse-to-continuous: Enhancing monocular depth estimation using occupancy maps
CN111428619A (en) Three-dimensional point cloud head attitude estimation system and method based on ordered regression and soft labels
CN116402876A (en) Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN114943757A (en) Unmanned aerial vehicle forest exploration system based on monocular depth of field prediction and depth reinforcement learning
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN114217303B (en) Target positioning and tracking method and device, underwater robot and storage medium
CN113592015B (en) Method and device for positioning and training feature matching network
CN115880555B (en) Target detection method, model training method, device, equipment and medium
CN115866229B (en) Viewing angle conversion method, device, equipment and medium for multi-viewing angle image
CN117576150B (en) Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN116486038A (en) Three-dimensional construction network training method, three-dimensional model generation method and device
CN117576150A (en) Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship
CN105100768A (en) Method for stereo matching and method for up-sampling
Yeboah et al. Autonomous indoor robot navigation via siamese deep convolutional neural network
CN114972465A (en) Image target depth detection method and device, electronic equipment and storage medium
CN116868239A (en) Static occupancy tracking
Tian Effective image enhancement and fast object detection for improved UAV applications
CN116912488B (en) Three-dimensional panorama segmentation method and device based on multi-view camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant