CN117710659A - Deformable attention three-dimensional point cloud target detection method, system and equipment - Google Patents

Deformable attention three-dimensional point cloud target detection method, system and equipment

Info

Publication number
CN117710659A
CN117710659A (application CN202311822752.1A)
Authority
CN
China
Prior art keywords
point cloud
dimensional
point
interest
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311822752.1A
Other languages
Chinese (zh)
Inventor
李垚辰
唐文能
李一帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202311822752.1A
Publication of CN117710659A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a deformable attention three-dimensional point cloud target detection method, system and equipment, comprising: acquiring point cloud data and preprocessing the point cloud data; extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes; filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest; extracting the features of the grid points of each region of interest; and adjusting the box size and confidence from the grid-point features of the region of interest and generating the candidate box positions, realizing deformable attention three-dimensional point cloud target detection. The invention greatly improves the detection accuracy of existing methods on difficult samples and small-volume samples, further improves the safety of autonomous vehicles under extreme conditions, and has high practical application value.

Description

Deformable attention three-dimensional point cloud target detection method, system and equipment
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a deformable attention three-dimensional point cloud target detection method, system and equipment.
Background
The three-dimensional object detection task is a key technology in fields such as automatic driving, robot navigation and face recognition; it requires detecting information such as the category, position and heading angle of an object from three-dimensional space.
In recent years, point-cloud-based three-dimensional object detection for outdoor scenes, an important research topic for urban traffic scenes, has attracted more and more researchers. Common three-dimensional object detectors can be divided into one-stage and two-stage methods. A one-stage method is described in the publication "SECOND: Sparsely Embedded Convolutional Detection" by Yan et al. in the journal Sensors, where features of a three-dimensional grid are extracted with three-dimensional sparse convolution layers, the grid is then converted to a bird's-eye view, and two-dimensional convolution layers generate the detection results. Two-stage methods, such as "Voxel R-CNN: Towards high performance voxel-based 3d object detection" by Shi et al. at the AAAI Conference on Artificial Intelligence, 2021, build on a one-stage method and further extract features of the regions of interest generated in the first stage, including features from three-dimensional sparse feature maps, three-dimensional points, etc., to produce more accurate predictions.
However, existing two-stage methods do not fully utilize the predictions of the previous stage and the three-dimensional sparse features extracted by the sparse convolution layers: feature extraction is confined to the predicted boxes, rich context information around the predicted target is hard to obtain, and more accurate three-dimensional predictions are therefore difficult to generate.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a higher-precision deformable attention three-dimensional point cloud target detection method, system and equipment.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a three-dimensional point cloud target detection method of deformable attention comprises the following steps:
acquiring point cloud data and preprocessing the point cloud data;
extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
extracting the features of the grid points of the region of interest according to the region of interest;
and adjusting the box size and confidence from the grid-point features of the region of interest and generating the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
Further, the point cloud data comprise training data and test data; the training data are enhanced, and the enhanced training data and the test data are voxelized;
the voxelization of the enhanced training data and the test data proceeds as follows: the enhanced training data and the test data are uniformly divided into three-dimensional voxels of equal size along the x-, y- and z-axes; the sizes are recorded as the total length H on the x-axis, the total length W on the y-axis, the total length D on the z-axis, and the per-voxel lengths V_h on the x-axis, V_w on the y-axis and V_d on the z-axis; each voxel contains at most N points, and points in excess of the threshold N are discarded.
Further, the three-dimensional voxel features are extracted by the following formula:

    f = (1/N_p) ∑_{i=1}^{N_p} p_i

where N_p denotes the number of points in a voxel, p_i denotes the coordinates of a point in the voxel, and f denotes the three-dimensional voxel feature;
the point cloud features are extracted by the following process: two submanifold sparse convolution layers extract features from the three-dimensional voxel features, then a spatial sparse convolution layer promotes interaction among all voxels according to the extracted features, yielding a three-dimensional sparse feature map; the submanifold convolution layers have kernel size 3, stride 1 and padding 1, and the spatial sparse convolution layer has kernel size 3, stride 2 and padding 2.
Further, extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes comprises the following steps: compressing the z-axis of the three-dimensional sparse feature map to obtain a two-dimensional bird's-eye-view feature map, applying two-dimensional convolution layers to the bird's-eye-view feature map to obtain candidate boxes, and obtaining from the candidate boxes the prediction results of category, detection box size and heading angle;
wherein the classification loss L_cls adopted by the category prediction is:

    L_cls = CrossEntropy(\hat{c}_i, c_i)

where c_i denotes the true category of the target to which the candidate point belongs and \hat{c}_i denotes the class confidence predicted by the network;
the detection box size prediction adopts regression losses, comprising the position regression loss of the target center point, the bounding box size regression loss and the heading angle regression loss;
the position regression loss of the target center point and the bounding box size regression loss are computed with the Smooth L1 loss function L_loc:

    L_loc = SmoothL1(loc_i, \hat{loc}_i) + SmoothL1(size_i, \hat{size}_i)

where loc_i denotes the offset from the candidate point to the center of the target it belongs to, \hat{loc}_i denotes the offset predicted by the model, size_i denotes the size of the target bounding box the candidate point belongs to, and \hat{size}_i denotes the size predicted by the model;
the heading angle prediction loss L_angle is:

    L_angle = L_angle^{cls} + L_angle^{reg} = CrossEntropy(\hat{R}_c, R_c) + SmoothL1(R_r, \hat{R}_r)

where R_c denotes the interval containing the true heading angle, \hat{R}_c denotes the model's predicted confidence for each interval (a cross-entropy classification loss), R_r denotes the offset of the true heading angle within its interval, \hat{R}_r denotes the model-predicted offset within the interval, L_angle^{cls} is the heading angle classification loss, and L_angle^{reg} is the heading angle regression loss;
further, filtering the generated candidate frame by using a post-processing algorithm, wherein the candidate frame is used as a region of interest, and the method comprises the following steps: filtering out candidate frames with classification confidence coefficient lower than a confidence coefficient threshold value by adopting a confidence coefficient filtering algorithm, and filtering out overlapped candidate frames by adopting a non-maximum suppression algorithm to obtain a region of interest.
Further, extracting the features of the grid points of the region of interest according to the region of interest comprises the following steps: for each region of interest, generating uniformly distributed grid points of size (n_x, n_y, n_z) in the region of interest at a fixed ratio, where n_x, n_y and n_z denote the numbers of grid points along the x-, y- and z-axes in the local coordinate system of the region of interest;
calculating the position of a grid point at the voxel corresponding to the feature map as:

    \tilde{P}_i^l = ( ⌊x_{p_i}/d⌋, ⌊y_{p_i}/d⌋, ⌊z_{p_i}/d⌋ )

where \tilde{P}_i^l denotes the position of point P_i on the feature map of scale l, x_{p_i}, y_{p_i} and z_{p_i} denote the x-, y- and z-coordinates of P_i, d denotes the downsampling factor of the feature map, and ⌊·⌋ denotes rounding down;
if the position corresponding to a grid point is an empty voxel, using the 0 vector as the grid-point feature, otherwise using the feature of the corresponding voxel as the grid-point feature, denoted f_i^l, the feature of point P_i at scale l;
for each sampling point, calculating the coordinate offset of the sampling point relative to its initial position and the attention weight of the sampling point from the grid point's feature at the corresponding voxel of the feature map;
fetching the feature at the sampling point's offset position in the feature map according to the coordinate offset, weighting and summing the sampled features with the attention weights of the sampling points as the grid-point feature, and concatenating the features of all grid points in the region of interest to obtain the grid-point features of the region of interest.
Further, the coordinate offset ΔP_{lk} of a sampling point relative to its initial position is calculated by the following formula:

    ΔP_{lk} = MLP(ReLU(MLP(f_i^l)))

where MLP denotes a linear layer and ReLU denotes the ReLU activation function;
the attention weight of each sampling point is calculated by:

    A_i^l = Softmax(MLP(f_i^l))

where A_i^l denotes the attention-weight vector of the sampling points of P_i at scale l.
Further, the grid-point features of the region of interest are calculated with the following formula:

    F_i = Concat_{l=1}^{L} ( ∑_{k=1}^{K} A_i^{lk} · x_l(\tilde{P}_i^l + ΔP_{lk}) )

where F_i denotes the feature of one grid point in the candidate box; Concat denotes that, for each grid point, the features from the different scales are concatenated to form the final feature; A_i^{lk} denotes the attention weight calculated for the k-th sampling point of the i-th grid point at scale l, the attention weights of all K sampling points of grid point i at scale l summing to 1; \tilde{P}_i^l denotes the sampling-point coordinates at scale l; ΔP_{lk} denotes the offset corresponding to those coordinates; and x_l denotes the function that fetches the feature at the given position on the three-dimensional sparse feature map of scale l.
A deformable attention three-dimensional point cloud target detection system, comprising:
the data acquisition and processing module, used to acquire point cloud data and preprocess it;
the prediction module, used to extract three-dimensional voxel features from the preprocessed point cloud data, extract point cloud features from the voxel features, and predict the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
the filtering module, used to filter the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
the grid-point feature extraction module, used to extract the features of the grid points of the region of interest;
and the candidate box position generation module, used to adjust the box size and confidence from the grid-point features of the region of interest and generate the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the deformable attention three-dimensional point cloud target detection method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
Three-dimensional voxel features are extracted from the preprocessed point cloud data, point cloud features are extracted from the voxel features, and the candidate box category, detection box size and heading angle are predicted from the point cloud features to obtain candidate boxes; the generated candidate boxes are filtered with a post-processing algorithm to obtain the regions of interest; and the features of the grid points of each region of interest are extracted with a deformable attention method, so that the predictions for the region of interest can be further adjusted. Because the deformable attention method adaptively adjusts the grid-point positions, it obtains richer context features of the region of interest than traditional feature extraction based on fixed grid points. For targets with few points that lie far from the sensor, the deformable attention method can extract the surrounding context through larger offsets, enriching their features and improving their prediction accuracy. For small-scale targets, where the one-stage prediction is inaccurate, the deformable attention method can adaptively adjust the sampling positions so that the sampling points tend to sample at more accurate locations, extracting finer features to further refine such targets. In addition, to fully exploit the features acquired at different scales, the invention applies the deformable attention method at multiple scales, providing a more comprehensive feature description of the target and helping to generate more accurate detection results.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a block diagram of a three-dimensional sparse feature extraction module of the method of the present invention;
FIG. 3 is a block diagram of a two-dimensional candidate block generation module according to the present invention;
FIG. 4 is a flow chart of a deformable attention region of interest feature extraction module of the present invention;
FIG. 5 is a visual experimental diagram of the detection result of 000006 frame data on a large public data set KITTI; wherein, (a) is 000006 frame label information, (b) is 000006 frame detection result of the invention, (c) is 000006 frame detection result of CT3D method;
FIG. 6 is a visual experimental diagram of the detection result of 000025 frame data on a large public data set KITTI; wherein, (a) 000025 frame label information, (b) is 000025 frame detection result of the invention, (c) is 000025 frame detection result of CT3D method;
FIG. 7 is a visual experimental diagram of the detection result of 000039 frame data on a large public data set KITTI; wherein, (a) is 000039 frame label information, (b) is 000039 frame detection result of the invention, (c) is 000039 frame detection result of CT3D method;
fig. 8 is a schematic diagram of a deformable attention three-dimensional point cloud object detection system of the present invention.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. The drawings illustrate preferred embodiments of the invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Referring to FIG. 1, the deformable attention three-dimensional point cloud target detection method of the invention extracts the features of candidate regions with a deformable attention method and obtains rich context information of the candidate region through adaptive offsets and weights, so that more accurate predictions are generated on the basis of the original detection results; the network can detect small and distant targets more accurately, greatly improving the detection precision of the original algorithm. The specific steps are as follows:
Step 1: point cloud data acquisition and preprocessing. The invention uses the large open-source KITTI dataset, with 7481 training and test samples in total: 3712 training samples and 3769 test samples. Each sample contains one frame of point cloud data and the corresponding target labels. Each frame of point cloud data is first preprocessed, and the training data additionally undergo data enhancement. In addition, the training and test samples are voxelized, as follows:
the three-dimensional point cloud data are uniformly divided into three-dimensional voxels of equal size along the x-, y- and z-axes; the sizes are recorded as the total length H on the x-axis, the total length W on the y-axis, the total length D on the z-axis, and the per-voxel lengths V_h on the x-axis, V_w on the y-axis and V_d on the z-axis; each voxel contains at most N points, and points in excess of the threshold N are discarded.
Preferably, for the KITTI dataset the total length W on the y-axis is set to 70.4 m with range [0, 70.4], the total length H on the x-axis is set to 80 m with range [-40, 40], and the total length D on the z-axis is set to 4 m with range [-3, 1]. The per-voxel lengths V_h on the x-axis and V_w on the y-axis are set to 0.05 m, and the per-voxel length V_d on the z-axis is set to 0.1 m. The threshold N is set to 5, i.e. when a voxel contains more than 5 points, the redundant points are randomly discarded.
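As a minimal sketch of this voxelization step (NumPy, using the ranges and voxel sizes stated above; the function name and the dictionary layout are illustrative, not the patent's implementation):

```python
import numpy as np

# Ranges and voxel sizes as stated above: x in [-40, 40], y in [0, 70.4], z in [-3, 1].
RANGE_MIN = np.array([-40.0, 0.0, -3.0])
RANGE_MAX = np.array([40.0, 70.4, 1.0])
VOXEL_SIZE = np.array([0.05, 0.05, 0.1])  # V_h, V_w, V_d
MAX_POINTS = 5                            # threshold N

def voxelize(points):
    """Map each in-range point to its voxel; randomly drop points beyond MAX_POINTS per voxel."""
    in_range = np.all((points >= RANGE_MIN) & (points < RANGE_MAX), axis=1)
    pts = points[in_range]
    coords = ((pts - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)  # integer voxel coordinates
    voxels = {}
    for i in np.random.permutation(len(pts)):  # random order makes the overflow drop random
        key = tuple(coords[i])
        bucket = voxels.setdefault(key, [])
        if len(bucket) < MAX_POINTS:
            bucket.append(pts[i])
    return voxels  # {(ix, iy, iz): [points]}
```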
Step 2: and extracting three-dimensional voxel features of the preprocessed point cloud data by using a voxel feature extractor, extracting point cloud features of the voxel features by using a three-dimensional sparse convolution layer, and obtaining candidate frames by using a two-dimensional convolution layer according to the point cloud features.
The invention uses a mean voxel feature extractor to extract the three-dimensional voxel feature of each occupied grid of the preprocessed point cloud; its basic principle can be described by the following formula:

    f = (1/N_p) ∑_{i=1}^{N_p} p_i

where N_p denotes the number of points in a voxel, p_i denotes the coordinates of a point in the voxel, and f denotes the three-dimensional voxel feature.
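A small PyTorch sketch of this mean voxel feature, assuming each point already carries the flat index of its voxel (names illustrative):

```python
import torch

def mean_vfe(point_xyz, voxel_ids, num_voxels):
    """f = (1/N_p) * sum(p_i): average the coordinates of the points in each voxel."""
    feats = torch.zeros(num_voxels, point_xyz.shape[1])
    counts = torch.zeros(num_voxels)
    feats.index_add_(0, voxel_ids, point_xyz)                  # per-voxel coordinate sums
    counts.index_add_(0, voxel_ids, torch.ones(len(point_xyz)))
    return feats / counts.clamp(min=1).unsqueeze(-1)           # avoid division by zero
```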
Because the three-dimensional sparse convolution feature extraction module adopted by the invention extracts the features, empty voxels containing no points need not be processed. Since most three-dimensional voxels in a scene are empty, a three-dimensional sparse convolution feature extractor is used to process the whole point cloud scene and generate a three-dimensional sparse feature map, i.e. the point cloud features. This greatly reduces the amount of computation and the occupation of computing resources.
For the three-dimensional sparse convolution feature extraction module, the invention adopts the design of the SECOND method, which contains two kinds of sparse convolution layers: submanifold sparse convolution and spatial sparse convolution. The difference is that a submanifold sparse convolution layer produces an output only where the center of the convolution kernel lies on an active position, while a spatial sparse convolution layer produces an output wherever the kernel covers any active position. The spatial sparse convolution layer therefore makes the sparse data grow very quickly and destroys the sparsity of the data. Accordingly, in the three-dimensional feature extraction, the submanifold and spatial sparse convolution layers are used together, preserving the data sparsity while promoting interaction among the active positions. The specific structure is shown in FIG. 2: two submanifold convolution layers extract features from the three-dimensional voxel features, then a spatial sparse convolution layer promotes interaction among all voxels according to the extracted features, yielding a three-dimensional sparse feature map. The submanifold convolution layers have kernel size 3, stride 1 and padding 1; the spatial sparse convolution layer has kernel size 3, stride 2 and padding 2. With this structure, the feature-map size is unchanged by the submanifold layers and halved by the spatial sparse convolution layer, and the number of channels grows exponentially as downsampling proceeds. The invention stacks three such structures, so the resulting space is downsampled 8 times compared with the input space. The feature maps of the different scales are denoted x_1, x_2, x_3, where x_3 is the 8x-downsampled map. In the deformable feature extraction structure, the invention makes full use of the three-dimensional feature maps of different scales to further enrich the region-of-interest features and thus improve detection precision.
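A sketch of one such stage, assuming the spconv library; the layer names and channel widths are illustrative, and the padding of the stride-2 layer is set to 1 here so the map halves exactly (the text above states a padding of 2):

```python
import torch.nn as nn
import spconv.pytorch as spconv

def sparse_stage(c_in, c_out, key):
    """Two submanifold convs (k=3, s=1, p=1) + one spatial sparse conv (k=3, s=2)."""
    return spconv.SparseSequential(
        spconv.SubMConv3d(c_in, c_in, 3, stride=1, padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(c_in), nn.ReLU(),
        spconv.SubMConv3d(c_in, c_in, 3, stride=1, padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(c_in), nn.ReLU(),
        spconv.SparseConv3d(c_in, c_out, 3, stride=2, padding=1, bias=False),  # 2x downsample
        nn.BatchNorm1d(c_out), nn.ReLU(),
    )

# Three stacked stages -> x1, x2, x3, with x3 downsampled 8x as in the text.
backbone = nn.ModuleList([sparse_stage(16, 32, "subm1"),
                          sparse_stage(32, 64, "subm2"),
                          sparse_stage(64, 128, "subm3")])
```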
Before the two-dimensional convolution layers are applied, the obtained three-dimensional sparse feature map must be converted into a two-dimensional bird's-eye-view feature map. The conversion compresses the z-axis of the three-dimensional sparse feature map: assuming the three-dimensional sparse feature map has size (C, D, H, W), where C denotes the number of channels, D the extent on the z-axis, H the extent on the y-axis and W the extent on the x-axis, compressing along the z-axis yields a bird's-eye-view feature map of size (C×D, H, W), i.e. the number of channels of the two-dimensional feature map becomes C×D.
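The compression itself is a single reshape once the sparse map has been densified; a sketch (`dense_map` is an assumed name for the densified backbone output):

```python
import torch

def to_bev(dense_map: torch.Tensor) -> torch.Tensor:
    """(C, D, H, W) -> (C*D, H, W): fold the z-extent into the channel dimension."""
    c, d, h, w = dense_map.shape
    return dense_map.reshape(c * d, h, w)
```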
After the bird's-eye-view feature map is obtained, as shown in FIG. 3, several downsampling convolution layers and transposed convolution layers extract features of the bird's-eye-view map at different levels; the obtained features are concatenated, and finally two-dimensional convolution layers output the prediction result, which consists of three parts: category prediction, detection box size prediction and heading angle prediction.
In the training process, the classification loss L_cls is:

    L_cls = CrossEntropy(\hat{c}_i, c_i)

where c_i denotes the true category of the target to which the candidate point belongs and \hat{c}_i denotes the class confidence predicted by the network. The regression loss consists of three parts: the position regression loss of the target center point, the bounding box size regression loss and the heading angle regression loss. The position regression loss and the bounding box size regression loss are computed with the Smooth L1 loss function L_loc:

    L_loc = SmoothL1(loc_i, \hat{loc}_i) + SmoothL1(size_i, \hat{size}_i)

where loc_i denotes the offset from the candidate point to the center of the target it belongs to, \hat{loc}_i denotes the offset predicted by the model, size_i denotes the size of the target bounding box the candidate point belongs to, and \hat{size}_i denotes the size predicted by the model.
The heading angle prediction is divided into two parts: the classification of the heading angle and the offset within the interval it belongs to. The invention uniformly divides the target heading angle into 12 intervals; the classification loss decides which interval the heading angle belongs to, and the regression loss regresses the offset of the heading angle within that interval. The heading angle prediction loss L_angle is therefore:

    L_angle = L_angle^{cls} + L_angle^{reg} = CrossEntropy(\hat{R}_c, R_c) + SmoothL1(R_r, \hat{R}_r)

where R_c denotes the interval containing the true heading angle, \hat{R}_c denotes the model's predicted confidence for each interval (a cross-entropy classification loss), R_r denotes the offset of the true heading angle within its interval, and \hat{R}_r denotes the model-predicted offset within the interval (a Smooth L1 loss); L_angle^{cls} is the heading angle classification loss and L_angle^{reg} the heading angle regression loss.
Thus, the total regression loss L_reg is:

    L_reg = L_loc + L_angle
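A sketch of the bin-based heading loss, assuming 12 uniform bins and a residual measured from the bin center (the text does not fix the residual's reference point, so that choice is an assumption):

```python
import torch
import torch.nn.functional as F

NUM_BINS = 12
BIN = 2 * torch.pi / NUM_BINS

def heading_targets(gt_angle):
    """Split a ground-truth heading angle into a bin label and an in-bin residual."""
    a = gt_angle % (2 * torch.pi)
    bin_id = torch.clamp((a / BIN).long(), max=NUM_BINS - 1)
    residual = a - (bin_id.float() + 0.5) * BIN   # offset from the bin center
    return bin_id, residual

def angle_loss(bin_logits, res_pred, gt_angle):
    """L_angle = CrossEntropy over the 12 bins + Smooth L1 on the true bin's residual."""
    bin_id, residual = heading_targets(gt_angle)
    cls = F.cross_entropy(bin_logits, bin_id)
    res = res_pred.gather(1, bin_id.unsqueeze(1)).squeeze(1)
    return cls + F.smooth_l1_loss(res, residual)
```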
Step 3: filter the generated candidate boxes with a post-processing algorithm to obtain the final regions of interest.
Two post-processing algorithms are used. First, a confidence filtering algorithm filters out candidate boxes whose classification confidence falls below the confidence threshold T_s; the invention uses T_s = 0.3, i.e. candidate boxes with confidence below 0.3 are discarded directly and no longer participate in subsequent calculations. Second, to further remove duplicate predictions of the same target, a non-maximum suppression algorithm is introduced to further process the candidate boxes, specifically to filter out overlapping ones. This algorithm is widely used in the field of target detection; its core idea is to select, from a group of overlapping prediction boxes, the box with the highest confidence as the group's result. The IoU index is used to evaluate the overlap of prediction boxes: when the IoU between prediction boxes exceeds the threshold 0.01, the two boxes are judged to belong to the same group. Finally, the prediction with the highest confidence in each group is selected as the group's final result, and the remaining detections do not participate in the second-stage calculation.
This step reduces the number of regions of interest entering the second stage and thus the amount of computation required.
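A sketch of this confidence filter followed by greedy non-maximum suppression; `bev_iou` is an assumed helper returning the rotated bird's-eye-view IoU of two boxes:

```python
import numpy as np

def filter_and_nms(boxes, scores, bev_iou, score_thr=0.3, iou_thr=0.01):
    """Drop boxes with confidence < score_thr, then keep only the highest-scoring
    box of each group of mutually overlapping predictions."""
    keep = scores >= score_thr                 # confidence filtering
    boxes, scores = boxes[keep], scores[keep]
    order = list(np.argsort(-scores))          # highest confidence first
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [j for j in order if bev_iou(boxes[best], boxes[j]) <= iou_thr]
    return boxes[kept], scores[kept]
```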
Step 4: and constructing a deformable attention-based region of interest feature extraction module for extracting the region of interest features according to the final region of interest.
This step aims to further enrich the context features of the target by exploiting the predictions of the previous stage and the features extracted by the three-dimensional sparse convolution layers, so as to adjust the target predictions and achieve more accurate results.
Referring to FIG. 4, for each region of interest, uniformly distributed grid points of size (n_x, n_y, n_z) are first generated in the region of interest at a fixed ratio, where n_x, n_y and n_z denote the numbers of grid points along the x-, y- and z-axes in the local coordinate system of the region of interest.
The invention uses 6 × 6 × 6 grid points. For each grid point, its real coordinates in the three-dimensional scene are computed from the size and position of the region of interest and denoted P_i. For each grid point and the feature map of the corresponding scale, the position of the grid point at the voxel corresponding to the feature map is computed as:

    \tilde{P}_i^l = ( ⌊x_{p_i}/d⌋, ⌊y_{p_i}/d⌋, ⌊z_{p_i}/d⌋ )

where \tilde{P}_i^l denotes the position of point P_i on the feature map of scale l, x_{p_i}, y_{p_i} and z_{p_i} denote the x-, y- and z-coordinates of P_i, d denotes the downsampling factor of the feature map, and ⌊·⌋ denotes rounding down.
Each grid point has a corresponding voxel. If the corresponding position is an empty voxel, the 0 vector is used as the grid-point feature; otherwise the feature of the corresponding voxel is used, denoted f_i^l, the feature of point P_i at scale l.
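A sketch of this lookup, assuming the occupied voxels of the scale-l sparse map are indexed by a coordinate-to-row dictionary (all names illustrative):

```python
import torch

def lookup_grid_features(grid_pts, sparse_feats, coord_to_row, d):
    """Fetch each grid point's voxel feature at scale l; empty voxels give the 0 vector.

    grid_pts: (M, 3) grid-point coordinates in voxel units of the input resolution;
    sparse_feats: (V, C) features of the occupied voxels of the scale-l map;
    coord_to_row: {(ix, iy, iz): row in sparse_feats}; d: downsampling factor."""
    out = torch.zeros(grid_pts.shape[0], sparse_feats.shape[1])
    cells = torch.div(grid_pts, d, rounding_mode="floor").long()  # floor(P_i / d)
    for m, cell in enumerate(cells):
        row = coord_to_row.get(tuple(cell.tolist()))
        if row is not None:                 # empty voxel -> keep the 0 vector
            out[m] = sparse_feats[row]
    return out
```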
Each grid point corresponds to K sampling points. For each sampling point, the coordinate offset relative to its initial position and its attention weight are computed from the grid point's feature, so that the grid points adaptively focus on important locations and important features when extracting features. The offset ΔP_{lk} of a sampling point is computed as:

    ΔP_{lk} = MLP(ReLU(MLP(f_i^l)))

where MLP denotes a linear layer and ReLU denotes the ReLU activation function; the output size of this expression is 3, and the input size follows the channel size of the feature map at the corresponding scale.
After the offset of each sampling point is obtained, the feature at the sampling point's position in the feature map is fetched, denoted x_l(\tilde{P}_i^l + ΔP_{lk}). The attention weight of each sampling point is computed with a linear layer:

    A_i^l = Softmax(MLP(f_i^l))

where A_i^l denotes the attention-weight vector of the sampling points of P_i at scale l; it has length K, and the weights of the sampling points sum to 1.
The grid-point features of the region of interest are thus calculated with the following formula:

    F_i = Concat_{l=1}^{L} ( ∑_{k=1}^{K} A_i^{lk} · x_l(\tilde{P}_i^l + ΔP_{lk}) )

where F_i denotes the feature of one grid point in the candidate box; Concat denotes that, for each grid point, the features from the different scales are concatenated to form the final feature; A_i^{lk} denotes the attention weight calculated for the k-th sampling point of the i-th grid point at scale l, the attention weights of all K sampling points of grid point i at scale l summing to 1; \tilde{P}_i^l denotes the sampling-point coordinates at scale l; ΔP_{lk} denotes the offset corresponding to those coordinates; and x_l denotes the function that fetches the feature at the given position on the three-dimensional sparse feature map of scale l.
The features fetched at the sampling points' positions in the feature map are weighted and summed to form the final grid-point feature, and the features of all grid points in the region of interest are concatenated as the feature of the region of interest.
In addition, the invention makes full use of the three-dimensional features of different scales: the grid-point features are computed on the feature maps of each scale, and the per-scale features are concatenated as the final grid-point features of the region of interest.
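A condensed module-level sketch of this deformable grid-point attention; the two-layer offset MLP and the `lookup` callable are illustrative stand-ins, and the multi-scale concatenation happens outside the module:

```python
import torch
import torch.nn as nn

class DeformableGridAttention(nn.Module):
    """Per grid point: predict K offsets and K softmax weights from its base
    feature, fetch the K offset features from the scale-l map, and weighted-sum."""

    def __init__(self, feat_dim, k=27):
        super().__init__()
        self.k = k
        self.offset_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                        nn.Linear(feat_dim, 3 * k))  # ΔP_{lk}, 3 values per point
        self.weight_mlp = nn.Linear(feat_dim, k)                     # logits of A_i^l

    def forward(self, base_feat, base_pos, lookup):
        """base_feat: (M, C) f_i^l; base_pos: (M, 3) positions on the scale-l map;
        lookup(pos) -> (C,) feature at a map position (0 vector if empty)."""
        m = base_feat.shape[0]
        offsets = self.offset_mlp(base_feat).view(m, self.k, 3)
        weights = torch.softmax(self.weight_mlp(base_feat), dim=-1)  # sums to 1 over K
        sampled = torch.stack([torch.stack([lookup(base_pos[i] + offsets[i, j])
                                            for j in range(self.k)]) for i in range(m)])
        return (weights.unsqueeze(-1) * sampled).sum(dim=1)          # (M, C) grid-point features
```

Concatenating the outputs of this module over the three sparse maps x_1, x_2, x_3 gives the multi-scale grid-point feature F_i described above.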
Step 5: feed the grid-point features of the region of interest into two multi-layer perceptrons to further adjust the box size and confidence and generate the candidate box positions, realizing three-dimensional point cloud target detection and improving the prediction accuracy of the invention.
This step uses the grid-point features of the region of interest to further refine the detection results and computes the loss from the predictions.
The numbers of grid points n_x, n_y and n_z along the x-, y- and z-axes in the local coordinate system of the region of interest are set to 6, and K is set to 27.
The detection head of the multi-layer perceptron, whose structure is shown in FIG. 3, consists of two parallel linear layers: one linear layer predicts the confidence of the region of interest, and the other predicts the vector required for the region-of-interest regression. The confidence label is calculated as follows:

    S_i(IoU_i) = 0                            if IoU_i < θ_L
    S_i(IoU_i) = (IoU_i - θ_L)/(θ_H - θ_L)    if θ_L ≤ IoU_i < θ_H
    S_i(IoU_i) = 1                            if IoU_i ≥ θ_H

where IoU_i denotes the maximum IoU between the current region of interest and the truth boxes, θ_L denotes the IoU threshold for assigning a region of interest as background, and θ_H denotes the IoU threshold for assigning a region of interest as foreground. The classification loss can be described by the following formula:
    L_cls-rcnn = CrossEntropy(p_i, S_i(IoU_i))
where p_i denotes the predicted confidence of the region of interest and CrossEntropy denotes that the loss is calculated with the cross-entropy loss. The regression loss is calculated as follows:

    L_reg-rcnn = 1(IoU_i ≥ θ_reg) · SmoothL1(Δ_i, \hat{Δ}_i)

where 1(IoU_i ≥ θ_reg) denotes the indicator function, whose value is 1 when IoU_i ≥ θ_reg and 0 otherwise, Δ_i denotes the predicted value of the detection head, and \hat{Δ}_i denotes the true value of the target regression.
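A sketch of this head's training targets and losses; the threshold values are illustrative placeholders (the text names θ_L, θ_H and θ_reg without fixing them), and the binary cross-entropy below is the two-class instance of the cross-entropy named above:

```python
import torch
import torch.nn.functional as F

def soft_label(iou, th_l=0.25, th_h=0.75):
    """S_i(IoU_i): 0 below th_l, 1 above th_h, linear in between."""
    return ((iou - th_l) / (th_h - th_l)).clamp(0.0, 1.0)

def head_losses(conf_logit, reg_pred, iou, reg_target, th_reg=0.55):
    """Confidence loss on the IoU-guided soft label + regression loss gated by 1(IoU >= th_reg)."""
    cls = F.binary_cross_entropy_with_logits(conf_logit, soft_label(iou))
    gate = (iou >= th_reg).float()                       # indicator function
    per_roi = F.smooth_l1_loss(reg_pred, reg_target, reduction="none").sum(dim=-1)
    reg = (gate * per_roi).sum() / gate.sum().clamp(min=1.0)
    return cls + reg
```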
Given three-dimensional data to be detected, the method can likewise perform three-dimensional target detection.
Referring to FIG. 8, the invention further provides a deformable attention three-dimensional point cloud target detection system, comprising:
the data acquisition and processing module, used to acquire point cloud data and preprocess it;
the prediction module, used to extract three-dimensional voxel features from the preprocessed point cloud data, extract point cloud features from the voxel features, and predict the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
the filtering module, used to filter the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
the grid-point feature extraction module, used to extract the features of the grid points of the region of interest;
and the candidate box position generation module, used to adjust the box size and confidence from the grid-point features of the region of interest and generate the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the deformable attention three-dimensional point cloud target detection method when executing the computer program.
The following is a specific example.
The neural network is implemented with the PyTorch framework, and the training workstation is equipped with two 2080Ti GPUs for acceleration. Experiments were performed on the large open-source KITTI dataset, which contains 7481 training and test samples: 3712 training samples and 3769 test samples. The dataset contains three categories: car, pedestrian and rider. Targets are divided into three difficulty levels, simple, medium and difficult, according to their size and degree of occlusion. For the car class, predictions with IoU above 0.7 are counted as correct detections; the threshold for the remaining classes is 0.5, and the average precision is calculated as the final evaluation result. Training used the open-source framework OpenPCDet, with the learning rate set to 0.003 and the number of training epochs set to 80. Common data enhancement methods were used during training to accelerate model convergence.
The targets in the validation set are detected with the trained model, and the detection accuracy is calculated with the official KITTI R40 (40 recall positions) protocol. The detection accuracy is shown in Table 1; compared with existing methods, the invention significantly improves the three-dimensional detection accuracy.
Table 1: KITTI dataset validation set mAP_R40 experiment results
From the results shown in Table 1, the proposed method leads the other methods in detection accuracy by a large margin; especially for severely occluded, distant and small-volume targets, the detection effect is far superior to comparable methods.
In addition, the invention illustrates the visualization of the detection results: in FIG. 5, (a), (b) and (c) show the effect on frame 000006 of the KITTI dataset, in FIG. 6 on frame 000025, and in FIG. 7 on frame 000039. The first row of pictures shows the labels of the dataset, the second row the detection results of the invention, and the third row the detection results of CT3D. As can be seen from FIGS. 5 and 7, the invention detects distant and severely occluded difficult targets that the CT3D method misses. As can be seen from FIG. 6, the position, size and heading angle of the detected boxes are all more accurate than those of the CT3D method.
The experimental results show that the proposed deformable attention method fully exploits the advantages of adaptive region-of-interest feature extraction and greatly surpasses the detection accuracy of traditional fixed-grid methods. The proposed deformable attention ensures that, when extracting region-of-interest features, the model is not confined to the region of interest: it can adaptively adjust the region, adjust the positions attended to during feature extraction, and assign different attention weights to different positions, thereby avoiding interference from erroneous region-of-interest predictions and background points. The experiments show that the invention effectively improves the detection accuracy for difficult samples and small targets such as pedestrians and riders, and can be applied to fields such as automatic driving and robot navigation.
The invention extracts the features of candidate regions with the deformable attention method and obtains rich context information of the candidate region through adaptive offsets and weights, so that more accurate predictions are generated on the basis of the original detection results; small and distant targets can be detected more accurately, and the detection precision is greatly improved.
The foregoing is illustrative of the preferred embodiments of the present invention and is not to be construed as limiting the claims. The present invention is not limited to the above embodiments, and the specific structure thereof is allowed to vary. It is intended that all such variations as fall within the scope of the appended claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Claims (10)

1. A deformable attention three-dimensional point cloud target detection method, characterized by comprising the following steps:
acquiring point cloud data and preprocessing the point cloud data;
extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
extracting the features of the grid points of the region of interest according to the region of interest;
and adjusting the box size and confidence from the grid-point features of the region of interest and generating the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
2. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein the point cloud data comprise training data and test data, the training data are enhanced, and the enhanced training data and the test data are voxelized;
the voxelization of the enhanced training data and the test data proceeds as follows: the enhanced training data and the test data are uniformly divided into three-dimensional voxels of equal size along the x-, y- and z-axes; the sizes are recorded as the total length H on the x-axis, the total length W on the y-axis, the total length D on the z-axis, and the per-voxel lengths V_h on the x-axis, V_w on the y-axis and V_d on the z-axis; each voxel contains at most N points, and points in excess of the threshold N are discarded.
3. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein the three-dimensional voxel features are extracted by the following formula:

    f = (1/N_p) ∑_{i=1}^{N_p} p_i

where N_p denotes the number of points in a voxel, p_i denotes the coordinates of a point in the voxel, and f denotes the three-dimensional voxel feature;
the point cloud features are extracted by the following process: two submanifold sparse convolution layers extract features from the three-dimensional voxel features, then a spatial sparse convolution layer promotes interaction among all voxels according to the extracted features, yielding a three-dimensional sparse feature map; the submanifold convolution layers have kernel size 3, stride 1 and padding 1, and the spatial sparse convolution layer has kernel size 3, stride 2 and padding 2.
4. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes comprises the following steps: compressing the z-axis of the three-dimensional sparse feature map to obtain a two-dimensional bird's-eye-view feature map, applying two-dimensional convolution layers to the bird's-eye-view feature map to obtain candidate boxes, and obtaining from the candidate boxes the prediction results of category, detection box size and heading angle;
wherein the classification loss L_cls adopted by the category prediction is:

    L_cls = CrossEntropy(\hat{c}_i, c_i)

where c_i denotes the true category of the target to which the candidate point belongs and \hat{c}_i denotes the class confidence predicted by the network;
the detection box size prediction adopts regression losses, comprising the position regression loss of the target center point, the bounding box size regression loss and the heading angle regression loss;
the position regression loss of the target center point and the bounding box size regression loss are computed with the Smooth L1 loss function L_loc:

    L_loc = SmoothL1(loc_i, \hat{loc}_i) + SmoothL1(size_i, \hat{size}_i)

where loc_i denotes the offset from the candidate point to the center of the target it belongs to, \hat{loc}_i denotes the offset predicted by the model, size_i denotes the size of the target bounding box the candidate point belongs to, and \hat{size}_i denotes the size predicted by the model;
the heading angle prediction loss L_angle is:

    L_angle = L_angle^{cls} + L_angle^{reg} = CrossEntropy(\hat{R}_c, R_c) + SmoothL1(R_r, \hat{R}_r)

where R_c denotes the interval containing the true heading angle, \hat{R}_c denotes the model's predicted confidence for each interval (a cross-entropy classification loss), R_r denotes the offset of the true heading angle within its interval, \hat{R}_r denotes the model-predicted offset within the interval (a Smooth L1 regression loss), L_angle^{cls} is the heading angle classification loss, and L_angle^{reg} is the heading angle regression loss.
5. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest comprises the following steps: filtering out candidate boxes whose classification confidence is below a confidence threshold with a confidence filtering algorithm, and filtering out overlapping candidate boxes with a non-maximum suppression algorithm to obtain the regions of interest.
6. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein extracting the features of the grid points of the region of interest according to the region of interest comprises the following steps: for each region of interest, generating uniformly distributed grid points of size (n_x, n_y, n_z) in the region of interest at a fixed ratio, where n_x, n_y and n_z denote the numbers of grid points along the x-, y- and z-axes in the local coordinate system of the region of interest;
calculating the position of a grid point at the voxel corresponding to the feature map as:

    \tilde{P}_i^l = ( ⌊x_{p_i}/d⌋, ⌊y_{p_i}/d⌋, ⌊z_{p_i}/d⌋ )

where \tilde{P}_i^l denotes the position of point P_i on the feature map of scale l, x_{p_i}, y_{p_i} and z_{p_i} denote the x-, y- and z-coordinates of P_i, d denotes the downsampling factor of the feature map, and ⌊·⌋ denotes rounding down;
if the position corresponding to a grid point is an empty voxel, using the 0 vector as the grid-point feature, otherwise using the feature of the corresponding voxel as the grid-point feature, denoted f_i^l, the feature of point P_i at scale l;
for each sampling point, calculating the coordinate offset of the sampling point relative to its initial position and the attention weight of the sampling point from the grid point's feature at the corresponding voxel of the feature map;
fetching the feature at the sampling point's offset position in the feature map according to the coordinate offset, weighting and summing the sampled features with the attention weights of the sampling points as the grid-point feature, and concatenating the features of all grid points in the region of interest to obtain the grid-point features of the region of interest.
7. The deformable attention three-dimensional point cloud target detection method according to claim 6, wherein the coordinate offset ΔP_{lk} of a sampling point relative to its initial position is calculated by the following formula:

    ΔP_{lk} = MLP(ReLU(MLP(f_i^l)))

where MLP denotes a linear layer and ReLU denotes the ReLU activation function;
the attention weight of each sampling point is calculated by:

    A_i^l = Softmax(MLP(f_i^l))

where A_i^l denotes the attention-weight vector of the sampling points of P_i at scale l.
8. The deformable attention three-dimensional point cloud target detection method according to claim 6, wherein the grid-point features of the region of interest are calculated with the following formula:

    F_i = Concat_{l=1}^{L} ( ∑_{k=1}^{K} A_i^{lk} · x_l(\tilde{P}_i^l + ΔP_{lk}) )

where F_i denotes the feature of one grid point in the candidate box; Concat denotes that, for each grid point, the features from the different scales are concatenated to form the final feature; A_i^{lk} denotes the attention weight calculated for the k-th sampling point of the i-th grid point at scale l, the attention weights of all K sampling points of grid point i at scale l summing to 1; \tilde{P}_i^l denotes the sampling-point coordinates at scale l; ΔP_{lk} denotes the offset corresponding to those coordinates; and x_l denotes the function that fetches the feature at the given position on the three-dimensional sparse feature map of scale l.
9. A deformable attention three-dimensional point cloud target detection system, characterized by comprising:
the data acquisition and processing module, used to acquire point cloud data and preprocess it;
the prediction module, used to extract three-dimensional voxel features from the preprocessed point cloud data, extract point cloud features from the voxel features, and predict the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
the filtering module, used to filter the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
the grid-point feature extraction module, used to extract the features of the grid points of the region of interest;
and the candidate box position generation module, used to adjust the box size and confidence from the grid-point features of the region of interest and generate the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the deformable attention three-dimensional point cloud target detection method of any one of claims 1 to 8 when executing the computer program.
CN202311822752.1A 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment Pending CN117710659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311822752.1A CN117710659A (en) 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311822752.1A CN117710659A (en) 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment

Publications (1)

Publication Number Publication Date
CN117710659A 2024-03-15

Family

ID=90160683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311822752.1A Pending CN117710659A (en) 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment

Country Status (1)

Country Link
CN (1) CN117710659A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination