CN117710659A - Deformable attention three-dimensional point cloud target detection method, system and equipment - Google Patents

Deformable attention three-dimensional point cloud target detection method, system and equipment

Info

Publication number
CN117710659A
CN117710659A (application CN202311822752.1A)
Authority
CN
China
Prior art keywords
point cloud
dimensional
point
interest
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311822752.1A
Other languages
Chinese (zh)
Inventor
李垚辰
唐文能
李一帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202311822752.1A
Publication of CN117710659A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a deformable attention three-dimensional point cloud target detection method, system and equipment, comprising: acquiring point cloud data and preprocessing the point cloud data; extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes; filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest; extracting the features of the grid points of each region of interest; and adjusting the box size and confidence from the grid-point features of the region of interest and generating the candidate box positions, realizing deformable attention three-dimensional point cloud target detection. The invention greatly improves the detection accuracy of existing methods on difficult samples and small-volume samples, further improves the safety of autonomous vehicles under extreme conditions, and has high practical application value.

Description

Deformable attention three-dimensional point cloud target detection method, system and equipment
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a deformable attention three-dimensional point cloud target detection method, system and equipment.
Background
The three-dimensional object detection task is a key technology in fields such as automatic driving, robot navigation and face recognition; it requires detecting information such as the category, position and heading angle of an object from three-dimensional space.
In recent years, point-cloud-based three-dimensional object detection for outdoor scenes, an important research topic for urban traffic scenes, has attracted more and more researchers. Common three-dimensional object detectors can be divided into one-stage and two-stage methods. A one-stage method is described in the publication "SECOND: Sparsely Embedded Convolutional Detection" by Yan et al. in the journal Sensors, where features of a three-dimensional grid are extracted with three-dimensional sparse convolution layers, the grid is then converted to a bird's-eye view, and two-dimensional convolution layers generate the detection results. Two-stage methods, such as "Voxel R-CNN: Towards high performance voxel-based 3d object detection" by Shi et al. at the AAAI Conference on Artificial Intelligence, 2021, build on a one-stage method and further extract features of the regions of interest generated in the first stage, including features from three-dimensional sparse feature maps, three-dimensional points, etc., to produce more accurate predictions.
However, existing two-stage methods do not fully utilize the predictions of the previous stage and the three-dimensional sparse features extracted by the sparse convolution layers: feature extraction is confined to the predicted boxes, rich context information around the predicted target is hard to obtain, and more accurate three-dimensional predictions are therefore difficult to generate.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a higher-precision deformable attention three-dimensional point cloud target detection method, system and equipment.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a three-dimensional point cloud target detection method of deformable attention comprises the following steps:
acquiring point cloud data and preprocessing the point cloud data;
extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
extracting the features of the grid points of the region of interest according to the region of interest;
and adjusting the box size and confidence from the grid-point features of the region of interest and generating the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
Further, the point cloud data comprise training data and test data; the training data are enhanced, and the enhanced training data and the test data are voxelized;
the voxelization of the enhanced training data and the test data proceeds as follows: the enhanced training data and the test data are uniformly divided into three-dimensional voxels of equal size along the x-, y- and z-axes; the sizes are recorded as the total length H on the x-axis, the total length W on the y-axis, the total length D on the z-axis, and the per-voxel lengths V_h on the x-axis, V_w on the y-axis and V_d on the z-axis; each voxel contains at most N points, and points in excess of the threshold N are discarded.
Further, the three-dimensional voxel features are extracted by the following formula:

    f = (1/N_p) ∑_{i=1}^{N_p} p_i

where N_p denotes the number of points in a voxel, p_i denotes the coordinates of a point in the voxel, and f denotes the three-dimensional voxel feature;
the point cloud features are extracted by the following process: two submanifold sparse convolution layers extract features from the three-dimensional voxel features, then a spatial sparse convolution layer promotes interaction among all voxels according to the extracted features, yielding a three-dimensional sparse feature map; the submanifold convolution layers have kernel size 3, stride 1 and padding 1, and the spatial sparse convolution layer has kernel size 3, stride 2 and padding 2.
Further, extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes comprises the following steps: compressing the z-axis of the three-dimensional sparse feature map to obtain a two-dimensional bird's-eye-view feature map, applying two-dimensional convolution layers to the bird's-eye-view feature map to obtain candidate boxes, and obtaining from the candidate boxes the prediction results of category, detection box size and heading angle;
wherein the classification loss L_cls adopted by the category prediction is:

    L_cls = CrossEntropy(\hat{c}_i, c_i)

where c_i denotes the true category of the target to which the candidate point belongs and \hat{c}_i denotes the class confidence predicted by the network;
the detection box size prediction adopts regression losses, comprising the position regression loss of the target center point, the bounding box size regression loss and the heading angle regression loss;
the position regression loss of the target center point and the bounding box size regression loss are computed with the Smooth L1 loss function L_loc:

    L_loc = SmoothL1(loc_i, \hat{loc}_i) + SmoothL1(size_i, \hat{size}_i)

where loc_i denotes the offset from the candidate point to the center of the target it belongs to, \hat{loc}_i denotes the offset predicted by the model, size_i denotes the size of the target bounding box the candidate point belongs to, and \hat{size}_i denotes the size predicted by the model;
the heading angle prediction loss L_angle is:

    L_angle = L_angle^{cls} + L_angle^{reg} = CrossEntropy(\hat{R}_c, R_c) + SmoothL1(R_r, \hat{R}_r)

where R_c denotes the interval containing the true heading angle, \hat{R}_c denotes the model's predicted confidence for each interval (a cross-entropy classification loss), R_r denotes the offset of the true heading angle within its interval, \hat{R}_r denotes the model-predicted offset within the interval, L_angle^{cls} is the heading angle classification loss, and L_angle^{reg} is the heading angle regression loss;
further, filtering the generated candidate frame by using a post-processing algorithm, wherein the candidate frame is used as a region of interest, and the method comprises the following steps: filtering out candidate frames with classification confidence coefficient lower than a confidence coefficient threshold value by adopting a confidence coefficient filtering algorithm, and filtering out overlapped candidate frames by adopting a non-maximum suppression algorithm to obtain a region of interest.
Further, extracting the features of the grid points of the region of interest according to the region of interest comprises the following steps: for each region of interest, generating uniformly distributed grid points of size (n_x, n_y, n_z) in the region of interest at a fixed ratio, where n_x, n_y and n_z denote the numbers of grid points along the x-, y- and z-axes in the local coordinate system of the region of interest;
calculating the position of a grid point at the voxel corresponding to the feature map as:

    \tilde{P}_i^l = ( ⌊x_{p_i}/d⌋, ⌊y_{p_i}/d⌋, ⌊z_{p_i}/d⌋ )

where \tilde{P}_i^l denotes the position of point P_i on the feature map of scale l, x_{p_i}, y_{p_i} and z_{p_i} denote the x-, y- and z-coordinates of P_i, d denotes the downsampling factor of the feature map, and ⌊·⌋ denotes rounding down;
if the position corresponding to a grid point is an empty voxel, using the 0 vector as the grid-point feature, otherwise using the feature of the corresponding voxel as the grid-point feature, denoted f_i^l, the feature of point P_i at scale l;
for each sampling point, calculating the coordinate offset of the sampling point relative to its initial position and the attention weight of the sampling point from the grid point's feature at the corresponding voxel of the feature map;
fetching the feature at the sampling point's offset position in the feature map according to the coordinate offset, weighting and summing the sampled features with the attention weights of the sampling points as the grid-point feature, and concatenating the features of all grid points in the region of interest to obtain the grid-point features of the region of interest.
Further, the coordinate offset ΔP_{lk} of a sampling point relative to its initial position is calculated by the following formula:

    ΔP_{lk} = MLP(ReLU(MLP(f_i^l)))

where MLP denotes a linear layer and ReLU denotes the ReLU activation function;
the attention weight of each sampling point is calculated by:

    A_i^l = Softmax(MLP(f_i^l))

where A_i^l denotes the attention-weight vector of the sampling points of P_i at scale l.
Further, the grid-point features of the region of interest are calculated with the following formula:

    F_i = Concat_{l=1}^{L} ( ∑_{k=1}^{K} A_i^{lk} · x_l(\tilde{P}_i^l + ΔP_{lk}) )

where F_i denotes the feature of one grid point in the candidate box; Concat denotes that, for each grid point, the features from the different scales are concatenated to form the final feature; A_i^{lk} denotes the attention weight calculated for the k-th sampling point of the i-th grid point at scale l, the attention weights of all K sampling points of grid point i at scale l summing to 1; \tilde{P}_i^l denotes the sampling-point coordinates at scale l; ΔP_{lk} denotes the offset corresponding to those coordinates; and x_l denotes the function that fetches the feature at the given position on the three-dimensional sparse feature map of scale l.
A deformable attention three-dimensional point cloud target detection system, comprising:
the data acquisition and processing module, used to acquire point cloud data and preprocess it;
the prediction module, used to extract three-dimensional voxel features from the preprocessed point cloud data, extract point cloud features from the voxel features, and predict the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
the filtering module, used to filter the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
the grid-point feature extraction module, used to extract the features of the grid points of the region of interest;
and the candidate box position generation module, used to adjust the box size and confidence from the grid-point features of the region of interest and generate the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the deformable attention three-dimensional point cloud target detection method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
Three-dimensional voxel features are extracted from the preprocessed point cloud data, point cloud features are extracted from the voxel features, and the candidate box category, detection box size and heading angle are predicted from the point cloud features to obtain candidate boxes; the generated candidate boxes are filtered with a post-processing algorithm to obtain the regions of interest; and the features of the grid points of each region of interest are extracted with a deformable attention method, so that the predictions for the region of interest can be further adjusted. Because the deformable attention method adaptively adjusts the grid-point positions, it obtains richer context features of the region of interest than traditional feature extraction based on fixed grid points. For targets with few points that lie far from the sensor, the deformable attention method can extract the surrounding context through larger offsets, enriching their features and improving their prediction accuracy. For small-scale targets, where the one-stage prediction is inaccurate, the deformable attention method can adaptively adjust the sampling positions so that the sampling points tend to sample at more accurate locations, extracting finer features to further refine such targets. In addition, to fully exploit the features acquired at different scales, the invention applies the deformable attention method at multiple scales, providing a more comprehensive feature description of the target and helping to generate more accurate detection results.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a block diagram of a three-dimensional sparse feature extraction module of the method of the present invention;
FIG. 3 is a block diagram of a two-dimensional candidate block generation module according to the present invention;
FIG. 4 is a flow chart of a deformable attention region of interest feature extraction module of the present invention;
FIG. 5 is a visual experimental diagram of the detection result of 000006 frame data on a large public data set KITTI; wherein, (a) is 000006 frame label information, (b) is 000006 frame detection result of the invention, (c) is 000006 frame detection result of CT3D method;
FIG. 6 is a visual experimental diagram of the detection result of 000025 frame data on a large public data set KITTI; wherein, (a) 000025 frame label information, (b) is 000025 frame detection result of the invention, (c) is 000025 frame detection result of CT3D method;
FIG. 7 is a visual experimental diagram of the detection result of 000039 frame data on a large public data set KITTI; wherein, (a) is 000039 frame label information, (b) is 000039 frame detection result of the invention, (c) is 000039 frame detection result of CT3D method;
fig. 8 is a schematic diagram of a deformable attention three-dimensional point cloud object detection system of the present invention.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. The drawings illustrate preferred embodiments of the invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Referring to FIG. 1, the deformable attention three-dimensional point cloud target detection method of the invention extracts the features of candidate regions with a deformable attention method and obtains rich context information of the candidate region through adaptive offsets and weights, so that more accurate predictions are generated on the basis of the original detection results; the network can detect small and distant targets more accurately, greatly improving the detection precision of the original algorithm. The specific steps are as follows:
Step 1: point cloud data acquisition and preprocessing. The invention uses the large open-source KITTI dataset, with 7481 training and test samples in total: 3712 training samples and 3769 test samples. Each sample contains one frame of point cloud data and the corresponding target labels. Each frame of point cloud data is first preprocessed, and the training data additionally undergo data enhancement. In addition, the training and test samples are voxelized, as follows:
the three-dimensional point cloud data are uniformly divided into three-dimensional voxels of equal size along the x-, y- and z-axes; the sizes are recorded as the total length H on the x-axis, the total length W on the y-axis, the total length D on the z-axis, and the per-voxel lengths V_h on the x-axis, V_w on the y-axis and V_d on the z-axis; each voxel contains at most N points, and points in excess of the threshold N are discarded.
Preferably, for the KITTI dataset the total length W on the y-axis is set to 70.4 m with range [0, 70.4], the total length H on the x-axis is set to 80 m with range [-40, 40], and the total length D on the z-axis is set to 4 m with range [-3, 1]. The per-voxel lengths V_h on the x-axis and V_w on the y-axis are set to 0.05 m, and the per-voxel length V_d on the z-axis is set to 0.1 m. The threshold N is set to 5, i.e. when a voxel contains more than 5 points, the redundant points are randomly discarded.
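As a minimal sketch of this voxelization step (NumPy, using the ranges and voxel sizes stated above; the function name and the dictionary layout are illustrative, not the patent's implementation):

```python
import numpy as np

# Ranges and voxel sizes as stated above: x in [-40, 40], y in [0, 70.4], z in [-3, 1].
RANGE_MIN = np.array([-40.0, 0.0, -3.0])
RANGE_MAX = np.array([40.0, 70.4, 1.0])
VOXEL_SIZE = np.array([0.05, 0.05, 0.1])  # V_h, V_w, V_d
MAX_POINTS = 5                            # threshold N

def voxelize(points):
    """Map each in-range point to its voxel; randomly drop points beyond MAX_POINTS per voxel."""
    in_range = np.all((points >= RANGE_MIN) & (points < RANGE_MAX), axis=1)
    pts = points[in_range]
    coords = ((pts - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)  # integer voxel coordinates
    voxels = {}
    for i in np.random.permutation(len(pts)):  # random order makes the overflow drop random
        key = tuple(coords[i])
        bucket = voxels.setdefault(key, [])
        if len(bucket) < MAX_POINTS:
            bucket.append(pts[i])
    return voxels  # {(ix, iy, iz): [points]}
```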
Step 2: and extracting three-dimensional voxel features of the preprocessed point cloud data by using a voxel feature extractor, extracting point cloud features of the voxel features by using a three-dimensional sparse convolution layer, and obtaining candidate frames by using a two-dimensional convolution layer according to the point cloud features.
The invention uses a mean voxel feature extractor to extract the three-dimensional voxel feature of each occupied grid of the preprocessed point cloud; its basic principle can be described by the following formula:

    f = (1/N_p) ∑_{i=1}^{N_p} p_i

where N_p denotes the number of points in a voxel, p_i denotes the coordinates of a point in the voxel, and f denotes the three-dimensional voxel feature.
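A small PyTorch sketch of this mean voxel feature, assuming each point already carries the flat index of its voxel (names illustrative):

```python
import torch

def mean_vfe(point_xyz, voxel_ids, num_voxels):
    """f = (1/N_p) * sum(p_i): average the coordinates of the points in each voxel."""
    feats = torch.zeros(num_voxels, point_xyz.shape[1])
    counts = torch.zeros(num_voxels)
    feats.index_add_(0, voxel_ids, point_xyz)                  # per-voxel coordinate sums
    counts.index_add_(0, voxel_ids, torch.ones(len(point_xyz)))
    return feats / counts.clamp(min=1).unsqueeze(-1)           # avoid division by zero
```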
Because the three-dimensional sparse convolution feature extraction module adopted by the invention extracts the features, empty voxels containing no points need not be processed. Since most three-dimensional voxels in a scene are empty, a three-dimensional sparse convolution feature extractor is used to process the whole point cloud scene and generate a three-dimensional sparse feature map, i.e. the point cloud features. This greatly reduces the amount of computation and the occupation of computing resources.
For the three-dimensional sparse convolution feature extraction module, the invention adopts the design of the SECOND method, which contains two kinds of sparse convolution layers: submanifold sparse convolution and spatial sparse convolution. The difference is that a submanifold sparse convolution layer produces an output only where the center of the convolution kernel lies on an active position, while a spatial sparse convolution layer produces an output wherever the kernel covers any active position. The spatial sparse convolution layer therefore makes the sparse data grow very quickly and destroys the sparsity of the data. Accordingly, in the three-dimensional feature extraction, the submanifold and spatial sparse convolution layers are used together, preserving the data sparsity while promoting interaction among the active positions. The specific structure is shown in FIG. 2: two submanifold convolution layers extract features from the three-dimensional voxel features, then a spatial sparse convolution layer promotes interaction among all voxels according to the extracted features, yielding a three-dimensional sparse feature map. The submanifold convolution layers have kernel size 3, stride 1 and padding 1; the spatial sparse convolution layer has kernel size 3, stride 2 and padding 2. With this structure, the feature-map size is unchanged by the submanifold layers and halved by the spatial sparse convolution layer, and the number of channels grows exponentially as downsampling proceeds. The invention stacks three such structures, so the resulting space is downsampled 8 times compared with the input space. The feature maps of the different scales are denoted x_1, x_2, x_3, where x_3 is the 8x-downsampled map. In the deformable feature extraction structure, the invention makes full use of the three-dimensional feature maps of different scales to further enrich the region-of-interest features and thus improve detection precision.
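A sketch of one such stage, assuming the spconv library; the layer names and channel widths are illustrative, and the padding of the stride-2 layer is set to 1 here so the map halves exactly (the text above states a padding of 2):

```python
import torch.nn as nn
import spconv.pytorch as spconv

def sparse_stage(c_in, c_out, key):
    """Two submanifold convs (k=3, s=1, p=1) + one spatial sparse conv (k=3, s=2)."""
    return spconv.SparseSequential(
        spconv.SubMConv3d(c_in, c_in, 3, stride=1, padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(c_in), nn.ReLU(),
        spconv.SubMConv3d(c_in, c_in, 3, stride=1, padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(c_in), nn.ReLU(),
        spconv.SparseConv3d(c_in, c_out, 3, stride=2, padding=1, bias=False),  # 2x downsample
        nn.BatchNorm1d(c_out), nn.ReLU(),
    )

# Three stacked stages -> x1, x2, x3, with x3 downsampled 8x as in the text.
backbone = nn.ModuleList([sparse_stage(16, 32, "subm1"),
                          sparse_stage(32, 64, "subm2"),
                          sparse_stage(64, 128, "subm3")])
```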
Before the two-dimensional convolution layers are applied, the obtained three-dimensional sparse feature map must be converted into a two-dimensional bird's-eye-view feature map. The conversion compresses the z-axis of the three-dimensional sparse feature map: assuming the three-dimensional sparse feature map has size (C, D, H, W), where C denotes the number of channels, D the extent on the z-axis, H the extent on the y-axis and W the extent on the x-axis, compressing along the z-axis yields a bird's-eye-view feature map of size (C×D, H, W), i.e. the number of channels of the two-dimensional feature map becomes C×D.
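The compression itself is a single reshape once the sparse map has been densified; a sketch (`dense_map` is an assumed name for the densified backbone output):

```python
import torch

def to_bev(dense_map: torch.Tensor) -> torch.Tensor:
    """(C, D, H, W) -> (C*D, H, W): fold the z-extent into the channel dimension."""
    c, d, h, w = dense_map.shape
    return dense_map.reshape(c * d, h, w)
```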
After the bird's-eye-view feature map is obtained, as shown in FIG. 3, several downsampling convolution layers and transposed convolution layers extract features of the bird's-eye-view map at different levels; the obtained features are concatenated, and finally two-dimensional convolution layers output the prediction result, which consists of three parts: category prediction, detection box size prediction and heading angle prediction.
In the training process, the classification loss L_cls is:

    L_cls = CrossEntropy(\hat{c}_i, c_i)

where c_i denotes the true category of the target to which the candidate point belongs and \hat{c}_i denotes the class confidence predicted by the network. The regression loss consists of three parts: the position regression loss of the target center point, the bounding box size regression loss and the heading angle regression loss. The position regression loss and the bounding box size regression loss are computed with the Smooth L1 loss function L_loc:

    L_loc = SmoothL1(loc_i, \hat{loc}_i) + SmoothL1(size_i, \hat{size}_i)

where loc_i denotes the offset from the candidate point to the center of the target it belongs to, \hat{loc}_i denotes the offset predicted by the model, size_i denotes the size of the target bounding box the candidate point belongs to, and \hat{size}_i denotes the size predicted by the model.
The heading angle prediction is divided into two parts: the classification of the heading angle and the offset within the interval it belongs to. The invention uniformly divides the target heading angle into 12 intervals; the classification loss decides which interval the heading angle belongs to, and the regression loss regresses the offset of the heading angle within that interval. The heading angle prediction loss L_angle is therefore:

    L_angle = L_angle^{cls} + L_angle^{reg} = CrossEntropy(\hat{R}_c, R_c) + SmoothL1(R_r, \hat{R}_r)

where R_c denotes the interval containing the true heading angle, \hat{R}_c denotes the model's predicted confidence for each interval (a cross-entropy classification loss), R_r denotes the offset of the true heading angle within its interval, and \hat{R}_r denotes the model-predicted offset within the interval (a Smooth L1 loss); L_angle^{cls} is the heading angle classification loss and L_angle^{reg} the heading angle regression loss.
Thus, the total regression loss L_reg is:

    L_reg = L_loc + L_angle
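A sketch of the bin-based heading loss, assuming 12 uniform bins and a residual measured from the bin center (the text does not fix the residual's reference point, so that choice is an assumption):

```python
import torch
import torch.nn.functional as F

NUM_BINS = 12
BIN = 2 * torch.pi / NUM_BINS

def heading_targets(gt_angle):
    """Split a ground-truth heading angle into a bin label and an in-bin residual."""
    a = gt_angle % (2 * torch.pi)
    bin_id = torch.clamp((a / BIN).long(), max=NUM_BINS - 1)
    residual = a - (bin_id.float() + 0.5) * BIN   # offset from the bin center
    return bin_id, residual

def angle_loss(bin_logits, res_pred, gt_angle):
    """L_angle = CrossEntropy over the 12 bins + Smooth L1 on the true bin's residual."""
    bin_id, residual = heading_targets(gt_angle)
    cls = F.cross_entropy(bin_logits, bin_id)
    res = res_pred.gather(1, bin_id.unsqueeze(1)).squeeze(1)
    return cls + F.smooth_l1_loss(res, residual)
```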
Step 3: filter the generated candidate boxes with a post-processing algorithm to obtain the final regions of interest.
Two post-processing algorithms are used. First, a confidence filtering algorithm filters out candidate boxes whose classification confidence falls below the confidence threshold T_s; the invention uses T_s = 0.3, i.e. candidate boxes with confidence below 0.3 are discarded directly and no longer participate in subsequent calculations. Second, to further remove duplicate predictions of the same target, a non-maximum suppression algorithm is introduced to further process the candidate boxes, specifically to filter out overlapping ones. This algorithm is widely used in the field of target detection; its core idea is to select, from a group of overlapping prediction boxes, the box with the highest confidence as the group's result. The IoU index is used to evaluate the overlap of prediction boxes: when the IoU between prediction boxes exceeds the threshold 0.01, the two boxes are judged to belong to the same group. Finally, the prediction with the highest confidence in each group is selected as the group's final result, and the remaining detections do not participate in the second-stage calculation.
This step reduces the number of regions of interest entering the second stage and thus the amount of computation required.
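A sketch of this confidence filter followed by greedy non-maximum suppression; `bev_iou` is an assumed helper returning the rotated bird's-eye-view IoU of two boxes:

```python
import numpy as np

def filter_and_nms(boxes, scores, bev_iou, score_thr=0.3, iou_thr=0.01):
    """Drop boxes with confidence < score_thr, then keep only the highest-scoring
    box of each group of mutually overlapping predictions."""
    keep = scores >= score_thr                 # confidence filtering
    boxes, scores = boxes[keep], scores[keep]
    order = list(np.argsort(-scores))          # highest confidence first
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [j for j in order if bev_iou(boxes[best], boxes[j]) <= iou_thr]
    return boxes[kept], scores[kept]
```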
Step 4: and constructing a deformable attention-based region of interest feature extraction module for extracting the region of interest features according to the final region of interest.
This step aims to further enrich the context features of the target by exploiting the predictions of the previous stage and the features extracted by the three-dimensional sparse convolution layers, so as to adjust the target predictions and achieve more accurate results.
Referring to FIG. 4, for each region of interest, uniformly distributed grid points of size (n_x, n_y, n_z) are first generated in the region of interest at a fixed ratio, where n_x, n_y and n_z denote the numbers of grid points along the x-, y- and z-axes in the local coordinate system of the region of interest.
The invention uses 6 × 6 × 6 grid points. For each grid point, its real coordinates in the three-dimensional scene are computed from the size and position of the region of interest and denoted P_i. For each grid point and the feature map of the corresponding scale, the position of the grid point at the voxel corresponding to the feature map is computed as:

    \tilde{P}_i^l = ( ⌊x_{p_i}/d⌋, ⌊y_{p_i}/d⌋, ⌊z_{p_i}/d⌋ )

where \tilde{P}_i^l denotes the position of point P_i on the feature map of scale l, x_{p_i}, y_{p_i} and z_{p_i} denote the x-, y- and z-coordinates of P_i, d denotes the downsampling factor of the feature map, and ⌊·⌋ denotes rounding down.
Each grid point has a corresponding voxel. If the corresponding position is an empty voxel, the 0 vector is used as the grid-point feature; otherwise the feature of the corresponding voxel is used, denoted f_i^l, the feature of point P_i at scale l.
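A sketch of this lookup, assuming the occupied voxels of the scale-l sparse map are indexed by a coordinate-to-row dictionary (all names illustrative):

```python
import torch

def lookup_grid_features(grid_pts, sparse_feats, coord_to_row, d):
    """Fetch each grid point's voxel feature at scale l; empty voxels give the 0 vector.

    grid_pts: (M, 3) grid-point coordinates in voxel units of the input resolution;
    sparse_feats: (V, C) features of the occupied voxels of the scale-l map;
    coord_to_row: {(ix, iy, iz): row in sparse_feats}; d: downsampling factor."""
    out = torch.zeros(grid_pts.shape[0], sparse_feats.shape[1])
    cells = torch.div(grid_pts, d, rounding_mode="floor").long()  # floor(P_i / d)
    for m, cell in enumerate(cells):
        row = coord_to_row.get(tuple(cell.tolist()))
        if row is not None:                 # empty voxel -> keep the 0 vector
            out[m] = sparse_feats[row]
    return out
```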
Each grid point corresponds to K sampling points. For each sampling point, the coordinate offset relative to its initial position and its attention weight are computed from the grid point's feature, so that the grid points adaptively focus on important locations and important features when extracting features. The offset ΔP_{lk} of a sampling point is computed as:

    ΔP_{lk} = MLP(ReLU(MLP(f_i^l)))

where MLP denotes a linear layer and ReLU denotes the ReLU activation function; the output size of this expression is 3, and the input size follows the channel size of the feature map at the corresponding scale.
After the offset of each sampling point is obtained, the feature at the sampling point's position in the feature map is fetched, denoted x_l(\tilde{P}_i^l + ΔP_{lk}). The attention weight of each sampling point is computed with a linear layer:

    A_i^l = Softmax(MLP(f_i^l))

where A_i^l denotes the attention-weight vector of the sampling points of P_i at scale l; it has length K, and the weights of the sampling points sum to 1.
The grid-point features of the region of interest are thus calculated with the following formula:

    F_i = Concat_{l=1}^{L} ( ∑_{k=1}^{K} A_i^{lk} · x_l(\tilde{P}_i^l + ΔP_{lk}) )

where F_i denotes the feature of one grid point in the candidate box; Concat denotes that, for each grid point, the features from the different scales are concatenated to form the final feature; A_i^{lk} denotes the attention weight calculated for the k-th sampling point of the i-th grid point at scale l, the attention weights of all K sampling points of grid point i at scale l summing to 1; \tilde{P}_i^l denotes the sampling-point coordinates at scale l; ΔP_{lk} denotes the offset corresponding to those coordinates; and x_l denotes the function that fetches the feature at the given position on the three-dimensional sparse feature map of scale l.
The features fetched at the sampling points' positions in the feature map are weighted and summed to form the final grid-point feature, and the features of all grid points in the region of interest are concatenated as the feature of the region of interest.
In addition, the invention makes full use of the three-dimensional features of different scales: the grid-point features are computed on the feature maps of each scale, and the per-scale features are concatenated as the final grid-point features of the region of interest.
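A condensed module-level sketch of this deformable grid-point attention; the two-layer offset MLP and the `lookup` callable are illustrative stand-ins, and the multi-scale concatenation happens outside the module:

```python
import torch
import torch.nn as nn

class DeformableGridAttention(nn.Module):
    """Per grid point: predict K offsets and K softmax weights from its base
    feature, fetch the K offset features from the scale-l map, and weighted-sum."""

    def __init__(self, feat_dim, k=27):
        super().__init__()
        self.k = k
        self.offset_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                        nn.Linear(feat_dim, 3 * k))  # ΔP_{lk}, 3 values per point
        self.weight_mlp = nn.Linear(feat_dim, k)                     # logits of A_i^l

    def forward(self, base_feat, base_pos, lookup):
        """base_feat: (M, C) f_i^l; base_pos: (M, 3) positions on the scale-l map;
        lookup(pos) -> (C,) feature at a map position (0 vector if empty)."""
        m = base_feat.shape[0]
        offsets = self.offset_mlp(base_feat).view(m, self.k, 3)
        weights = torch.softmax(self.weight_mlp(base_feat), dim=-1)  # sums to 1 over K
        sampled = torch.stack([torch.stack([lookup(base_pos[i] + offsets[i, j])
                                            for j in range(self.k)]) for i in range(m)])
        return (weights.unsqueeze(-1) * sampled).sum(dim=1)          # (M, C) grid-point features
```

Concatenating the outputs of this module over the three sparse maps x_1, x_2, x_3 gives the multi-scale grid-point feature F_i described above.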
Step 5: feed the grid-point features of the region of interest into two multi-layer perceptrons to further adjust the box size and confidence and generate the candidate box positions, realizing three-dimensional point cloud target detection and improving the prediction accuracy of the invention.
This step uses the grid-point features of the region of interest to further refine the detection results and computes the loss from the predictions.
The numbers of grid points n_x, n_y and n_z along the x-, y- and z-axes in the local coordinate system of the region of interest are set to 6, and K is set to 27.
The detection head of the multi-layer perceptron, whose structure is shown in FIG. 3, consists of two parallel linear layers: one linear layer predicts the confidence of the region of interest, and the other predicts the vector required for the region-of-interest regression. The confidence label is calculated as follows:

    S_i(IoU_i) = 0                            if IoU_i < θ_L
    S_i(IoU_i) = (IoU_i - θ_L)/(θ_H - θ_L)    if θ_L ≤ IoU_i < θ_H
    S_i(IoU_i) = 1                            if IoU_i ≥ θ_H

where IoU_i denotes the maximum IoU between the current region of interest and the truth boxes, θ_L denotes the IoU threshold for assigning a region of interest as background, and θ_H denotes the IoU threshold for assigning a region of interest as foreground. The classification loss can be described by the following formula:
    L_cls-rcnn = CrossEntropy(p_i, S_i(IoU_i))
where p_i denotes the predicted confidence of the region of interest and CrossEntropy denotes that the loss is calculated with the cross-entropy loss. The regression loss is calculated as follows:

    L_reg-rcnn = 1(IoU_i ≥ θ_reg) · SmoothL1(Δ_i, \hat{Δ}_i)

where 1(IoU_i ≥ θ_reg) denotes the indicator function, whose value is 1 when IoU_i ≥ θ_reg and 0 otherwise, Δ_i denotes the predicted value of the detection head, and \hat{Δ}_i denotes the true value of the target regression.
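A sketch of this head's training targets and losses; the threshold values are illustrative placeholders (the text names θ_L, θ_H and θ_reg without fixing them), and the binary cross-entropy below is the two-class instance of the cross-entropy named above:

```python
import torch
import torch.nn.functional as F

def soft_label(iou, th_l=0.25, th_h=0.75):
    """S_i(IoU_i): 0 below th_l, 1 above th_h, linear in between."""
    return ((iou - th_l) / (th_h - th_l)).clamp(0.0, 1.0)

def head_losses(conf_logit, reg_pred, iou, reg_target, th_reg=0.55):
    """Confidence loss on the IoU-guided soft label + regression loss gated by 1(IoU >= th_reg)."""
    cls = F.binary_cross_entropy_with_logits(conf_logit, soft_label(iou))
    gate = (iou >= th_reg).float()                       # indicator function
    per_roi = F.smooth_l1_loss(reg_pred, reg_target, reduction="none").sum(dim=-1)
    reg = (gate * per_roi).sum() / gate.sum().clamp(min=1.0)
    return cls + reg
```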
Given three-dimensional data to be detected, the method can likewise perform three-dimensional target detection.
Referring to FIG. 8, the invention further provides a deformable attention three-dimensional point cloud target detection system, comprising:
the data acquisition and processing module, used to acquire point cloud data and preprocess it;
the prediction module, used to extract three-dimensional voxel features from the preprocessed point cloud data, extract point cloud features from the voxel features, and predict the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
the filtering module, used to filter the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
the grid-point feature extraction module, used to extract the features of the grid points of the region of interest;
and the candidate box position generation module, used to adjust the box size and confidence from the grid-point features of the region of interest and generate the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the deformable attention three-dimensional point cloud target detection method when executing the computer program.
The following is a specific example.
The neural network is implemented with the PyTorch framework, and the training workstation is equipped with two 2080Ti GPUs for acceleration. Experiments were performed on the large open-source KITTI dataset, which contains 7481 training and test samples: 3712 training samples and 3769 test samples. The dataset contains three categories: car, pedestrian and rider. Targets are divided into three difficulty levels, simple, medium and difficult, according to their size and degree of occlusion. For the car class, predictions with IoU above 0.7 are counted as correct detections; the threshold for the remaining classes is 0.5, and the average precision is calculated as the final evaluation result. Training used the open-source framework OpenPCDet, with the learning rate set to 0.003 and the number of training epochs set to 80. Common data enhancement methods were used during training to accelerate model convergence.
The targets in the validation set are detected with the trained model, and the detection accuracy is calculated with the official KITTI R40 (40 recall positions) protocol. The detection accuracy is shown in Table 1; compared with existing methods, the invention significantly improves the three-dimensional detection accuracy.
Table 1: KITTI dataset validation set mAP_R40 experiment results
From the results shown in Table 1, the proposed method leads the other methods in detection accuracy by a large margin; especially for severely occluded, distant and small-volume targets, the detection effect is far superior to comparable methods.
In addition, the invention illustrates the visualization of the detection results: in FIG. 5, (a), (b) and (c) show the effect on frame 000006 of the KITTI dataset, in FIG. 6 on frame 000025, and in FIG. 7 on frame 000039. The first row of pictures shows the labels of the dataset, the second row the detection results of the invention, and the third row the detection results of CT3D. As can be seen from FIGS. 5 and 7, the invention detects distant and severely occluded difficult targets that the CT3D method misses. As can be seen from FIG. 6, the position, size and heading angle of the detected boxes are all more accurate than those of the CT3D method.
The experimental results show that the proposed deformable attention method fully exploits the advantages of adaptive region-of-interest feature extraction and greatly surpasses the detection accuracy of traditional fixed-grid methods. The proposed deformable attention ensures that, when extracting region-of-interest features, the model is not confined to the region of interest: it can adaptively adjust the region, adjust the positions attended to during feature extraction, and assign different attention weights to different positions, thereby avoiding interference from erroneous region-of-interest predictions and background points. The experiments show that the invention effectively improves the detection accuracy for difficult samples and small targets such as pedestrians and riders, and can be applied to fields such as automatic driving and robot navigation.
The invention extracts the features of candidate regions with the deformable attention method and obtains rich context information of the candidate region through adaptive offsets and weights, so that more accurate predictions are generated on the basis of the original detection results; small and distant targets can be detected more accurately, and the detection precision is greatly improved.
The foregoing is illustrative of the preferred embodiments of the present invention and is not to be construed as limiting the claims. The present invention is not limited to the above embodiments, and the specific structure thereof is allowed to vary. It is intended that all such variations as fall within the scope of the appended claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Claims (10)

1. A deformable attention three-dimensional point cloud target detection method, characterized by comprising the following steps:
acquiring point cloud data and preprocessing the point cloud data;
extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
extracting the features of the grid points of the region of interest according to the region of interest;
and adjusting the box size and confidence from the grid-point features of the region of interest and generating the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
2. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein the point cloud data comprise training data and test data, the training data are enhanced, and the enhanced training data and the test data are voxelized;
the voxelization of the enhanced training data and the test data proceeds as follows: the enhanced training data and the test data are uniformly divided into three-dimensional voxels of equal size along the x-, y- and z-axes; the sizes are recorded as the total length H on the x-axis, the total length W on the y-axis, the total length D on the z-axis, and the per-voxel lengths V_h on the x-axis, V_w on the y-axis and V_d on the z-axis; each voxel contains at most N points, and points in excess of the threshold N are discarded.
3. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein the three-dimensional voxel features are extracted by the following formula:

    f = (1/N_p) ∑_{i=1}^{N_p} p_i

where N_p denotes the number of points in a voxel, p_i denotes the coordinates of a point in the voxel, and f denotes the three-dimensional voxel feature;
the point cloud features are extracted by the following process: two submanifold sparse convolution layers extract features from the three-dimensional voxel features, then a spatial sparse convolution layer promotes interaction among all voxels according to the extracted features, yielding a three-dimensional sparse feature map; the submanifold convolution layers have kernel size 3, stride 1 and padding 1, and the spatial sparse convolution layer has kernel size 3, stride 2 and padding 2.
4. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein extracting three-dimensional voxel features from the preprocessed point cloud data, extracting point cloud features from the voxel features, and predicting the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes comprises the following steps: compressing the z-axis of the three-dimensional sparse feature map to obtain a two-dimensional bird's-eye-view feature map, applying two-dimensional convolution layers to the bird's-eye-view feature map to obtain candidate boxes, and obtaining from the candidate boxes the prediction results of category, detection box size and heading angle;
wherein the classification loss L_cls adopted by the category prediction is:

    L_cls = CrossEntropy(\hat{c}_i, c_i)

where c_i denotes the true category of the target to which the candidate point belongs and \hat{c}_i denotes the class confidence predicted by the network;
the detection box size prediction adopts regression losses, comprising the position regression loss of the target center point, the bounding box size regression loss and the heading angle regression loss;
the position regression loss of the target center point and the bounding box size regression loss are computed with the Smooth L1 loss function L_loc:

    L_loc = SmoothL1(loc_i, \hat{loc}_i) + SmoothL1(size_i, \hat{size}_i)

where loc_i denotes the offset from the candidate point to the center of the target it belongs to, \hat{loc}_i denotes the offset predicted by the model, size_i denotes the size of the target bounding box the candidate point belongs to, and \hat{size}_i denotes the size predicted by the model;
the heading angle prediction loss L_angle is:

    L_angle = L_angle^{cls} + L_angle^{reg} = CrossEntropy(\hat{R}_c, R_c) + SmoothL1(R_r, \hat{R}_r)

where R_c denotes the interval containing the true heading angle, \hat{R}_c denotes the model's predicted confidence for each interval (a cross-entropy classification loss), R_r denotes the offset of the true heading angle within its interval, \hat{R}_r denotes the model-predicted offset within the interval (a Smooth L1 regression loss), L_angle^{cls} is the heading angle classification loss, and L_angle^{reg} is the heading angle regression loss.
5. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein filtering the generated candidate boxes with a post-processing algorithm to obtain the regions of interest comprises the following steps: filtering out candidate boxes whose classification confidence is below a confidence threshold with a confidence filtering algorithm, and filtering out overlapping candidate boxes with a non-maximum suppression algorithm to obtain the regions of interest.
6. The deformable attention three-dimensional point cloud target detection method according to claim 1, wherein extracting the features of the grid points of the region of interest according to the region of interest comprises the following steps: for each region of interest, generating uniformly distributed grid points of size (n_x, n_y, n_z) in the region of interest at a fixed ratio, where n_x, n_y and n_z denote the numbers of grid points along the x-, y- and z-axes in the local coordinate system of the region of interest;
calculating the position of a grid point at the voxel corresponding to the feature map as:

    \tilde{P}_i^l = ( ⌊x_{p_i}/d⌋, ⌊y_{p_i}/d⌋, ⌊z_{p_i}/d⌋ )

where \tilde{P}_i^l denotes the position of point P_i on the feature map of scale l, x_{p_i}, y_{p_i} and z_{p_i} denote the x-, y- and z-coordinates of P_i, d denotes the downsampling factor of the feature map, and ⌊·⌋ denotes rounding down;
if the position corresponding to a grid point is an empty voxel, using the 0 vector as the grid-point feature, otherwise using the feature of the corresponding voxel as the grid-point feature, denoted f_i^l, the feature of point P_i at scale l;
for each sampling point, calculating the coordinate offset of the sampling point relative to its initial position and the attention weight of the sampling point from the grid point's feature at the corresponding voxel of the feature map;
fetching the feature at the sampling point's offset position in the feature map according to the coordinate offset, weighting and summing the sampled features with the attention weights of the sampling points as the grid-point feature, and concatenating the features of all grid points in the region of interest to obtain the grid-point features of the region of interest.
7. The deformable attention three-dimensional point cloud target detection method according to claim 6, wherein the coordinate offset ΔP_{lk} of a sampling point relative to its initial position is calculated by the following formula:

    ΔP_{lk} = MLP(ReLU(MLP(f_i^l)))

where MLP denotes a linear layer and ReLU denotes the ReLU activation function;
the attention weight of each sampling point is calculated by:

    A_i^l = Softmax(MLP(f_i^l))

where A_i^l denotes the attention-weight vector of the sampling points of P_i at scale l.
8. The deformable attention three-dimensional point cloud target detection method according to claim 6, wherein the grid-point features of the region of interest are calculated with the following formula:

    F_i = Concat_{l=1}^{L} ( ∑_{k=1}^{K} A_i^{lk} · x_l(\tilde{P}_i^l + ΔP_{lk}) )

where F_i denotes the feature of one grid point in the candidate box; Concat denotes that, for each grid point, the features from the different scales are concatenated to form the final feature; A_i^{lk} denotes the attention weight calculated for the k-th sampling point of the i-th grid point at scale l, the attention weights of all K sampling points of grid point i at scale l summing to 1; \tilde{P}_i^l denotes the sampling-point coordinates at scale l; ΔP_{lk} denotes the offset corresponding to those coordinates; and x_l denotes the function that fetches the feature at the given position on the three-dimensional sparse feature map of scale l.
9. A deformable attention three-dimensional point cloud target detection system, characterized by comprising:
the data acquisition and processing module, used to acquire point cloud data and preprocess it;
the prediction module, used to extract three-dimensional voxel features from the preprocessed point cloud data, extract point cloud features from the voxel features, and predict the candidate box category, detection box size and heading angle from the point cloud features to obtain candidate boxes;
the filtering module, used to filter the generated candidate boxes with a post-processing algorithm to obtain the regions of interest;
the grid-point feature extraction module, used to extract the features of the grid points of the region of interest;
and the candidate box position generation module, used to adjust the box size and confidence from the grid-point features of the region of interest and generate the candidate box positions, realizing deformable attention three-dimensional point cloud target detection.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the deformable attention three-dimensional point cloud target detection method of any one of claims 1 to 8 when executing the computer program.
CN202311822752.1A 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment Pending CN117710659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311822752.1A CN117710659A (en) 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311822752.1A CN117710659A (en) 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment

Publications (1)

Publication Number Publication Date
CN117710659A 2024-03-15

Family

ID=90160683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311822752.1A Pending CN117710659A (en) 2023-12-27 2023-12-27 Deformable attention three-dimensional point cloud target detection method, system and equipment

Country Status (1)

Country Link
CN (1) CN117710659A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination