CN113920499A - Laser point cloud three-dimensional target detection model and method for complex traffic scene - Google Patents

Laser point cloud three-dimensional target detection model and method for complex traffic scene

Info

Publication number
CN113920499A
CN113920499A (application CN202111255417.9A)
Authority
CN
China
Prior art keywords
dimensional
candidate
convolution
frames
voxel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111255417.9A
Other languages
Chinese (zh)
Inventor
王海
陈智宇
蔡英凤
陈龙
刘擎超
李祎承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202111255417.9A
Publication of CN113920499A
Pending legal-status Critical Current

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2431: Pattern recognition; classification techniques relating to the number of classes; multiple classes
    • G06N3/045: Computing arrangements based on biological models; neural networks; architecture; combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a laser point cloud three-dimensional target detection model and method for complex traffic scenes. The three-dimensional encoder in the model helps detect long-distance targets and small targets, and its sparse convolution and sub-manifold convolution greatly improve the efficiency of voxel feature encoding. The residual structure of the two-dimensional encoder keeps more complete information, which also helps detect long-distance and small targets and makes the network easier to optimize. The self-calibrated convolution extracts features at both the original scale and a self-calibrated scale, enlarging the receptive field and extracting more complete and richer features, while spatial attention and channel attention enhance useful feature expression and suppress useless information along the spatial and channel directions. The final detection accuracy is 81.88% for vehicles, 47.82% for pedestrians and 69.81% for riders, with an average accuracy of 66.25%, an improvement of 9.9% over the existing Voxel R-CNN algorithm; the model reaches 13.8 FPS on an RTX 2080 Ti graphics card, so both the detection accuracy and speed can meet the perception requirements of an intelligent vehicle in a complex traffic environment.

Description

Laser point cloud three-dimensional target detection model and method for complex traffic scene
Technical Field
The invention belongs to the field of intelligent automobile perception, and particularly relates to a laser point cloud three-dimensional target detection model and method for a complex traffic scene.
Background
An intelligent car is a complex system comprising sensing, decision-making and control. Environment perception technology is the foundation of the intelligent vehicle and provides the environmental information needed for subsequent decision-making and control. The accuracy of traditional machine learning algorithms cannot meet the operating requirements of current intelligent vehicles, so perception algorithms based on deep learning have developed rapidly and have made great progress in two-dimensional target detection and segmentation. However, cameras are affected by night, rain, fog, strong light and similar conditions, which degrades detection performance. With the falling cost of laser radar and the increase in computing power, three-dimensional target detection is increasingly applied to intelligent vehicles. Two-dimensional target detection can only provide the position of a target in the two-dimensional image, whereas three-dimensional target detection provides the position, shape and heading angle of the target in the real environment, which is very important for subsequent decision-making and planning.
Disclosure of Invention
The invention addresses the problems that most existing three-dimensional target detection algorithms cannot adapt to complex traffic scenes, detect long-distance vehicles and nearby small targets such as pedestrians and riders poorly, and therefore cannot meet the perception requirements of intelligent vehicles under complex traffic conditions. Based on the Voxel R-CNN algorithm, the invention designs an improved three-dimensional target detection model for complex traffic environments, so that the detection accuracy of long-distance targets and small targets is improved and the model can adapt to complex traffic environments. The invention achieves this technical purpose through the following technical scheme.
The invention provides a laser point cloud three-dimensional target detection model facing a complex traffic scene, which comprises a voxelization processing module, a three-dimensional encoder module, a two-dimensional encoder module, a candidate frame generation module, a candidate frame pooling module and a full-connection layer module;
the voxelization processing module: the system is used for carrying out voxelization processing on input laser radar point cloud data;
the three-dimensional encoder module: extracting the characteristics of the non-empty voxels, and projecting the obtained voxel characteristics to a bird's-eye view to generate a pseudo-image expression;
the two-dimensional encoder module: carrying out feature extraction on the pseudo-image expression to obtain a two-dimensional feature expression;
the candidate box generation module: performing target classification and regression by using the two-dimensional features to generate a high-quality candidate frame;
the candidate frame pooling module: performing voxel pooling on the candidate frames to obtain pooling characteristics;
the full connection layer module: and refining the candidate frame aiming at the pooled features.
Further, the voxelization processing module performs voxelization processing on the laser radar point cloud data in the following specific process:
the method comprises the steps of carrying out voxelization pretreatment on input point cloud along the direction X, Y, Z, dividing the whole point cloud scene into uniform voxels, taking the center of a laser radar as an origin, and taking the forward direction of a vehicle as an X axis, the leftward direction as a Y axis and the upward direction as a Z axis. The range of the whole point cloud scene (X, Y, Z) is [ (-75.2,75.2), (-75.2,75.2), (-5,3) ] meters, the size of each voxel is set to be (0.1,0.1,0.2) meter, therefore, the whole scene is divided into 1504 × 1504 × 40 voxels with equal size, if the number of original points in each voxel exceeds 5, the down sampling is carried out to 5, and the mean value of X, Y, Z, coordinates and reflection intensity of all points in each voxel after the down sampling is taken as the original characteristic of the voxel.
Further, the three-dimensional encoder module comprises a residual structure, sparse convolution and sub-manifold convolution. Its input is a 40 × 1504 × 1504 voxel expression, where each voxel has a 4-dimensional original feature (x, y, z, reflection intensity). After the input, the original voxel features are first extracted by a 3 × 3 × 3 sub-manifold convolution at the original (1×) resolution, followed by 2 consecutive residual sub-manifold blocks for feature extraction; 3 × 3 × 3 sparse convolutions then downsample by 2, 4 and 8 times along the X, Y and Z directions to obtain multi-scale three-dimensional features, and each sparse convolution is followed by 2 consecutive residual sub-manifold blocks for feature extraction. A residual sub-manifold block is formed by two 3 × 3 × 3 sub-manifold convolutions, with the input of the first sub-manifold convolution added to the output of the second, and each sub-manifold convolution is followed by a BatchNorm layer and a ReLU layer. Finally, a sparse convolution downsamples the Z direction by 2 times, and the result is projected onto the bird's-eye view and converted into a 256 × 188 × 188 pseudo-image expression.
Further, the two-dimensional encoder module comprises five parts: a residual structure, two-dimensional convolution, self-calibrated convolution, spatial attention and channel attention. The module downsamples the 256 × 188 × 188 pseudo-image expression obtained by the three-dimensional encoder module by 1, 2 and 4 times with a 3 × 3 two-dimensional convolution to obtain multi-scale features; after each two-dimensional convolution, 2 consecutive self-calibrated convolutions extract the multi-scale features, with the input of the first self-calibrated convolution added to the output of the second, and spatial attention and channel attention mechanisms are added after the self-calibrated convolutions to enhance useful feature expression and reduce useless information along the spatial and channel directions. Finally, the multi-scale features are brought back to the same size by deconvolution, concatenated along the channel direction, and compressed with a two-dimensional convolution to obtain a 64 × 188 × 188 two-dimensional feature expression.
Further, for the two-dimensional feature expression of size 64 × 188 × 188 obtained by the two-dimensional encoder module, the candidate frame generation module places 10 anchor boxes at each pixel of the 188 × 188 two-dimensional feature map, covering 5 categories (car, truck, bus, pedestrian and rider), each in the 0° and 180° directions. Category prediction and size regression (x, y, z, w, h, l, θ) are performed for each anchor box using the two-dimensional feature map, giving 10 × 188 × 188 three-dimensional boxes; during training, the 9000 three-dimensional boxes with the highest classification scores are selected and sent to the non-maximum suppression module to obtain 512 high-quality candidate boxes.
Further, for the 512 high-quality candidate boxes generated by the candidate frame generation module, the candidate frame pooling module randomly samples 128 candidate boxes for the refinement stage, 64 positive samples and 64 negative samples, and extracts multi-scale voxel features with the voxel candidate-box pooling module: 6 × 6 × 6 grid points are first uniformly sampled in each candidate box, non-empty voxels within a certain distance around each grid point are queried, and the non-empty voxel features are then extracted with a PointNet, a fully connected layer and a max pooling layer to obtain the pooled features; when querying the voxels around the grid points, the voxels downsampled 2, 4 and 8 times are queried respectively to obtain multi-scale voxel features.
Further, the fully connected layer module takes the pooled features of the 128 candidate boxes randomly sampled by the candidate frame pooling module as input and performs confidence prediction and regression-box refinement through 2 fully connected layers. For the confidence prediction branch, the classification confidence c_i of each candidate box is predicted, and its target is obtained by the following formula:
c_i* = min(1, max(0, (IoU_i - θ_L) / (θ_H - θ_L)))

where c_i* is the confidence target value of the i-th candidate box, IoU_i is the intersection over union (IoU) between the i-th candidate box and its ground-truth box, and θ_H and θ_L are the foreground and background IoU thresholds, set to 0.75 and 0.25 respectively;
Associating the IoU between the prediction box and the ground truth with the classification confidence alleviates the mismatch between classification confidence and localization accuracy; for the regression-box refinement branch, more accurate position information (x, y, z, w, h, l, θ) is predicted by the fully connected layers.
Further, the dataset used by the model adopts the ONCE dataset as the training dataset and validation dataset, and a class-balanced sampling enhancement method is added during training: training scenes containing rare classes are randomly duplicated and fed into the training dataset, which alleviates class imbalance and expands the number of training samples;
further, the model training adopts end-to-end training, and five categories of cars, trucks, buses, riders and pedestrians are trained simultaneously; 2 RTX 2080Ti GPUS training 90 rounds of the network, wherein an optimizer is Adam, a learning rate change mode adopts a cosine fire mode, the maximum learning rate is 0.003, a frequency division coefficient is 10, momentum is from 0.95 to 0.85, weight attenuation is 0.02, and batch size is 6; total loss L thereofTOTALIncluding RPN loss LRPNAnd RCNN loss LRCNNIn the RPN phase, the classification Loss L is calculated by Focal localclsCalculating the regression Loss L by using Smooth L1 Lossreg1In the RCNN stage, the confidence loss L is calculated by using the binary cross entropy lossiouCalculating the regression Loss L by using Smooth L1 Lossreg2
L_TOTAL = L_RPN + L_RCNN

L_RPN = (1 / N_fg) [ Σ_i L_cls(s_i, s_i*) + α Σ_i L_reg1(r_i, r_i*) ]

where N_fg is the number of foreground candidate boxes, s_i is the predicted category of a candidate box, s_i* is its ground-truth category, α indicates that the regression loss is computed only for foreground candidate boxes, r_i are the predicted box regression parameters, and r_i* are their ground-truth values;

L_RCNN = (1 / N_s) [ Σ_i L_iou(c_i, c_i*) + β Σ_i L_reg2(r_i, r_i*) ]

where N_s is the number of sampled candidate boxes, c_i is the predicted confidence, c_i* is the confidence target value, β indicates that the regression loss is computed only for sampled foreground candidate boxes, r_i are the predicted box regression parameters, and r_i* are their ground-truth values.
Based on the model, the invention provides a detection method of a laser point cloud three-dimensional target facing a complex traffic scene, which comprises the following steps:
Step 1: select or build a dataset as the training and validation datasets of the detection network, adding a class-balanced sampling enhancement method during training.
Step 2: voxelize the laser radar point cloud input to the network.
Step 3: extract the features of the non-empty voxels with a three-dimensional encoder composed of a residual structure, sparse convolution and sub-manifold convolution, and project the obtained voxel features onto the bird's-eye view to generate a pseudo-image expression.
Step 4: perform feature extraction on the pseudo-image expression with a two-dimensional encoder composed of a residual structure, two-dimensional convolution, self-calibrated convolution, spatial attention and channel attention, obtaining a two-dimensional feature expression.
Step 5: classify and regress targets using the two-dimensional features to generate high-quality candidate boxes.
Step 6: perform voxel pooling on the candidate boxes to obtain pooled features.
Step 7: refine the candidate boxes with the fully connected layers using the pooled features.
Step 8: train the network of steps 2 to 7.
Step 9: visualize the detection results.
The specific implementation process of each step is described in the detailed description of the specific embodiments.
The invention has the beneficial effects that:
1. The invention uses the autonomous driving dataset ONCE for network training; data collected on different roads, in different weather and at different times can represent complex traffic conditions, so a network trained on this dataset can adapt to complex traffic conditions.
2. The three-dimensional encoder designed by the invention is composed of a residual error structure, sparse convolution and sub-manifold convolution, more complete information can be kept through the residual error structure, the detection of a long-distance target and a small target is facilitated, the coding efficiency of voxel characteristics can be greatly improved through the sparse convolution and the sub-manifold convolution, and the real-time detection is facilitated.
3. The two-dimensional encoder designed by the invention consists of a residual error structure, two-dimensional convolution, self-calibration convolution, space attention and channel attention. More complete information can be kept through the residual error structure, the detection effect of a long-distance target and a small target is facilitated, and meanwhile, the network is easier to optimize. The self-calibration convolution can be used for respectively extracting the features in the original scale and the self-calibration scale, so that the receptive field is enlarged, and more complete and richer features can be extracted. The useful feature expression can be enhanced in the spatial direction and the channel direction through the spatial attention and the channel attention, and the useless information is suppressed.
4. In the training process, a class balance sampling method is added, so that on one hand, class imbalance is relieved, the detection effect of pedestrians and riders with small sample amount in a data set is improved, and on the other hand, the robustness of an algorithm is improved by expanding the training samples of the data set.
5. The final detection accuracy is 81.88% for vehicles, 47.82% for pedestrians and 69.81% for riders, with an average accuracy of 66.25%, an improvement of 9.9% over the original Voxel R-CNN algorithm; the model reaches 13.8 FPS on an RTX 2080 Ti graphics card, so the detection accuracy and speed can meet the perception requirements of intelligent vehicles in complex traffic environments.
Drawings
FIG. 1 is an algorithm flow of a laser point cloud three-dimensional target detection method for a complex traffic scene.
Fig. 2 is a network configuration diagram of the three-dimensional encoder of the present invention.
Fig. 3 is a network configuration diagram of the two-dimensional encoder of the present invention.
Fig. 4 is a diagram of the detection visualization effect of the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention provides a laser point cloud three-dimensional target detection method facing a complex traffic environment, which specifically comprises the following processes as shown in figure 1:
Step 1: select the Huawei ONCE dataset as the training and validation datasets of the detection network, and add a class-balanced sampling enhancement method during training.
The Huawei ONCE dataset was acquired with 7 cameras and a 40-beam lidar and contains 5 categories: car, truck, bus, pedestrian and cyclist. It covers scenes in different weather (sunny, cloudy and rainy), at different times (morning, noon, afternoon and evening) and on different road types (city center, suburb, expressway, tunnel and bridge), so it represents complex traffic conditions well. The ONCE dataset collects 144 hours of autonomous driving scenes with roughly 1,000,000 frames, of which 16,000 frames are labeled; 5,000 frames are used for training, 3,000 for validation and 8,000 for testing. Because car samples are plentiful in the training data while truck, bus, pedestrian and rider samples are scarce, a class-balanced sampling enhancement method is adopted during training: training scenes containing rare classes are randomly duplicated and fed into the training dataset, which alleviates class imbalance and improves the detection accuracy of rare classes. At the same time, the training set is expanded from 5,000 to 16,600 frames, a factor of 3.3, and the larger number of training samples improves the robustness of the model.
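As an illustration of the class-balanced sampling enhancement described above, the following Python sketch duplicates frames that contain rare classes before training; the field names (frame_infos, gt_names) and the duplication factor are assumptions for illustration, not the exact procedure used in the invention.

```python
import random

def class_balanced_resample(frame_infos,
                            rare_classes=("Truck", "Bus", "Pedestrian", "Cyclist"),
                            duplicate_factor=3):
    """Sketch of class-balanced sampling: frames containing rare classes are
    randomly duplicated so they appear more often in the training set.
    `frame_infos` is assumed to be a list of dicts with a 'gt_names' entry."""
    resampled = list(frame_infos)
    for info in frame_infos:
        if any(name in rare_classes for name in info.get("gt_names", [])):
            # append extra copies of rare-class frames (factor is illustrative)
            resampled.extend([info] * random.randint(1, duplicate_factor))
    random.shuffle(resampled)
    return resampled
```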
Step 2: voxelize the laser radar point cloud input to the network.
The input point cloud is voxelized along the X, Y and Z directions, dividing the whole point cloud scene into uniform voxels so that point cloud features can be extracted efficiently by sparse convolution and sub-manifold convolution. The center of the laser radar is taken as the origin, with the forward direction of the vehicle as the X axis, the leftward direction as the Y axis and the upward direction as the Z axis. The range of the whole point cloud scene in (X, Y, Z) is [(-75.2, 75.2), (-75.2, 75.2), (-5, 3)] meters and the size of each voxel is set to (0.1, 0.1, 0.2) meters, so the whole scene is divided into 1504 × 1504 × 40 voxels of equal size. If the number of raw points in a voxel exceeds 5, they are downsampled to 5, and the mean of the x, y, z coordinates and reflection intensity of all points remaining in each voxel is taken as the original feature of that voxel. Voxelization greatly improves the efficiency of point cloud feature extraction and facilitates real-time detection in autonomous driving scenes.
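The voxelization step can be illustrated with the following minimal Python sketch, which groups points into the 1504 × 1504 × 40 grid, keeps at most 5 points per voxel and averages (x, y, z, intensity) as the initial voxel feature; this dictionary-based version is for clarity only and is much slower than the batched voxelization used in practice.

```python
import numpy as np

def voxelize(points, pc_range=(-75.2, -75.2, -5.0, 75.2, 75.2, 3.0),
             voxel_size=(0.1, 0.1, 0.2), max_points_per_voxel=5):
    """Sketch of voxelization: `points` is (N, 4) with columns (x, y, z, intensity)."""
    pc_range = np.asarray(pc_range, dtype=np.float32)
    voxel_size = np.asarray(voxel_size, dtype=np.float32)

    # keep only points inside the detection range
    mask = np.all((points[:, :3] >= pc_range[:3]) & (points[:, :3] < pc_range[3:]), axis=1)
    points = points[mask]

    # integer voxel coordinates along x, y, z (a 1504 x 1504 x 40 grid here)
    coords = ((points[:, :3] - pc_range[:3]) / voxel_size).astype(np.int64)

    voxels = {}
    for pt, c in zip(points, map(tuple, coords)):
        bucket = voxels.setdefault(c, [])
        if len(bucket) < max_points_per_voxel:   # keep at most 5 points per voxel
            bucket.append(pt)

    voxel_coords = np.array(list(voxels.keys()), dtype=np.int64)
    voxel_feats = np.array([np.mean(v, axis=0) for v in voxels.values()], dtype=np.float32)
    return voxel_coords, voxel_feats  # (M, 3) indices and (M, 4) mean (x, y, z, intensity)
```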
In fig. 2, an expression of the form SubM(a, b)-c × d × e-(f, g, h) denotes a sub-manifold convolution, where a and b are the input and output dimensions, c × d × e is the convolution kernel size in the x, y, z directions, and (f, g, h) are the strides in the x, y, z directions; an expression of the form SpConv(a, b)-c × d × e-(f, g, h) denotes a sparse convolution with the same meaning of the parameters; an expression of the form Residual SubM Block(a, b) denotes a residual sub-manifold block with input dimension a and output dimension b, whose composition is shown by the dashed boxes in fig. 2.
Step 3: extract the features of the non-empty voxels with a three-dimensional encoder composed of a residual structure, sparse convolution and sub-manifold convolution, and project the obtained voxel features onto the bird's-eye view to generate a pseudo-image expression.
The input of the three-dimensional encoder is a 40 × 1504 × 1504 voxel expression, where each voxel has a 4-dimensional original feature (x, y, z, reflection intensity). The network structure of the three-dimensional encoder is shown in fig. 2. The original voxel features are first extracted by a 3 × 3 × 3 sub-manifold convolution at the original (1×) resolution, followed by 2 consecutive residual sub-manifold blocks for feature extraction. Then 3 × 3 × 3 sparse convolutions downsample by 2, 4 and 8 times along the X, Y and Z directions to obtain multi-scale three-dimensional features, and each sparse convolution is followed by 2 consecutive residual sub-manifold blocks for feature extraction. Specifically, a residual sub-manifold block is formed by two 3 × 3 × 3 sub-manifold convolutions, and the input of the first sub-manifold convolution is added to the output of the second; the residual structure helps keep more complete information during downsampling and improves the detection of small and long-distance targets. Each sub-manifold convolution is followed by a BatchNorm layer and a ReLU layer, which speed up network convergence, prevent gradient explosion and vanishing gradients, and prevent overfitting. Finally, a sparse convolution downsamples the Z direction by 2 times, and the result is projected onto the bird's-eye view and converted into a 256 × 188 × 188 pseudo-image expression. The extensive use of sparse convolution and sub-manifold convolution in the three-dimensional encoder speeds up the extraction of non-empty voxel features.
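The residual sub-manifold block described above can be sketched as follows, assuming the spconv 2.x API (SubMConv3d and SparseConvTensor.replace_feature); the channel width and the placement of the final ReLU after the residual addition are assumptions rather than details specified by the invention.

```python
import torch.nn as nn
import spconv.pytorch as spconv

class ResidualSubMBlock(nn.Module):
    """Sketch of a residual sub-manifold block: two 3x3x3 sub-manifold
    convolutions, each followed by BatchNorm and ReLU, with the block input
    added to the output of the second convolution."""
    def __init__(self, channels, indice_key=None):
        super().__init__()
        self.conv1 = spconv.SubMConv3d(channels, channels, 3, padding=1,
                                       bias=False, indice_key=indice_key)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = spconv.SubMConv3d(channels, channels, 3, padding=1,
                                       bias=False, indice_key=indice_key)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = out.replace_feature(self.relu(self.bn1(out.features)))
        out = self.conv2(out)
        out = out.replace_feature(self.bn2(out.features))
        # residual connection: sub-manifold convolutions keep the active voxel
        # set unchanged, so the input and output features can be added directly
        out = out.replace_feature(self.relu(out.features + identity.features))
        return out
```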
Step 4: perform feature extraction on the pseudo-image expression with a two-dimensional encoder composed of a residual structure, two-dimensional convolution, self-calibrated convolution, spatial attention and channel attention, obtaining a two-dimensional feature expression.
The 256 × 188 × 188 pseudo-image expression obtained in step 3 is sent to the two-dimensional encoder for feature extraction. The network structure of the two-dimensional encoder is shown in fig. 3. The original pseudo-image expression is downsampled 1, 2 and 4 times with a 3 × 3 two-dimensional convolution to obtain multi-scale features. After each two-dimensional convolution, 2 consecutive self-calibrated convolutions extract the multi-scale features, and the input of the first self-calibrated convolution is added to the output of the second. The self-calibrated convolution extracts features in both the original scale space and the self-calibrated scale space, greatly enlarging the receptive field and producing more salient features. In addition, spatial attention and channel attention mechanisms are added after the self-calibrated convolutions, enhancing useful feature expression and reducing useless information along the spatial and channel directions. Finally, the multi-scale features are brought back to the same size by deconvolution, concatenated along the channel direction, and compressed with a two-dimensional convolution to obtain a 64 × 188 × 188 two-dimensional feature expression.
In fig. 3, an expression of the form Conv(a, b)-c × d-(e, f) denotes a two-dimensional convolution, where a and b are the input and output dimensions, c × d is the convolution kernel size in the x, y directions, and (e, f) are the strides in the x, y directions; Conv.T(a, b)-c × d-(e, f) denotes a deconvolution with the same meaning of the parameters; SC-Conv(a, b) denotes a self-calibrated convolution with input dimension a and output dimension b; CBAM(a, b) denotes channel and spatial attention with input dimension a and output dimension b; a shape written as a × b × c denotes a feature with dimension a and sizes b and c in the x and y directions respectively.
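A minimal sketch of the channel and spatial attention (CBAM-style) block applied after the self-calibrated convolutions is given below; the reduction ratio and the spatial kernel size are illustrative assumptions, not values specified by the invention.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Sketch of CBAM-style attention: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # channel attention: shared MLP over global average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # spatial attention: convolution over channel-wise average and max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3), keepdim=True)) +
                           self.mlp(x.amax(dim=(2, 3), keepdim=True)))
        x = x * ca                                    # emphasize informative channels
        sa = torch.sigmoid(self.spatial(torch.cat([x.mean(dim=1, keepdim=True),
                                                   x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa                                 # emphasize informative locations
```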
Step 5: classify and regress targets using the two-dimensional features to generate high-quality candidate boxes.
The two-dimensional feature expression of size 64 × 188 × 188 is obtained from step 4. 10 anchor boxes (5 categories: car, truck, bus, pedestrian and rider, each in the 0° and 180° directions) are placed at each pixel of the 188 × 188 two-dimensional feature map, and category prediction and size regression (x, y, z, w, h, l, θ) are performed for each anchor box using the two-dimensional feature map, giving 10 × 188 × 188 three-dimensional boxes. During training, the 9000 three-dimensional boxes with the highest classification scores are selected and sent to the non-maximum suppression module to obtain 512 high-quality candidate boxes.
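The proposal selection described above (top 9000 boxes by classification score, then non-maximum suppression down to 512) can be sketched as follows; for simplicity the sketch uses axis-aligned bird's-eye-view NMS from torchvision with an assumed IoU threshold, whereas the actual implementation would use rotated BEV/3D NMS.

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes, scores, pre_nms_top_n=9000, post_nms_top_n=512, iou_thresh=0.8):
    """Sketch of proposal selection. `boxes` is assumed to be (N, 7) as
    (x, y, z, w, l, h, theta) and `scores` is (N,); iou_thresh is illustrative."""
    scores, order = scores.topk(min(pre_nms_top_n, scores.numel()))
    boxes = boxes[order]

    # axis-aligned BEV boxes (x1, y1, x2, y2), ignoring heading, as an approximation
    bev = torch.stack([boxes[:, 0] - boxes[:, 3] / 2, boxes[:, 1] - boxes[:, 4] / 2,
                       boxes[:, 0] + boxes[:, 3] / 2, boxes[:, 1] + boxes[:, 4] / 2], dim=1)
    keep = nms(bev, scores, iou_thresh)[:post_nms_top_n]
    return boxes[keep], scores[keep]
```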
Step 6: perform voxel pooling on the candidate boxes to obtain pooled features.
Of the 512 candidate boxes generated in step 5, 128 are randomly sampled and sent to the refinement stage, 64 positive samples and 64 negative samples. Multi-scale voxel features are extracted with the voxel candidate-box pooling module: 6 × 6 × 6 grid points are first uniformly sampled in each candidate box, non-empty voxels within a certain distance around each grid point are queried, and the non-empty voxel features are then extracted with a PointNet, a fully connected layer and a max pooling layer to obtain the pooled features. When querying the voxels around the grid points, the voxels downsampled 2, 4 and 8 times are queried respectively, so multi-scale voxel features are obtained, which helps detect targets of different sizes.
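Generating the 6 × 6 × 6 grid points inside each candidate box can be sketched as follows; the box layout (x, y, z, w, l, h, θ) and the assignment of w, l, h to the x, y, z extents are assumptions made for illustration.

```python
import torch

def roi_grid_points(rois, grid_size=6):
    """Sketch of grid-point generation for voxel RoI pooling: grid_size^3 points
    are placed uniformly inside each box and rotated by its heading.
    `rois` is assumed to be (N, 7) as (x, y, z, w, l, h, theta)."""
    # normalized grid offsets in (-0.5, 0.5) along each axis
    idx = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size - 0.5
    gx, gy, gz = torch.meshgrid(idx, idx, idx, indexing="ij")
    grid = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)          # (216, 3)

    local = grid.unsqueeze(0) * rois[:, 3:6].unsqueeze(1)            # scale by (w, l, h)
    cos, sin = torch.cos(rois[:, 6]), torch.sin(rois[:, 6])          # rotate around z
    rx = local[..., 0] * cos.unsqueeze(1) - local[..., 1] * sin.unsqueeze(1)
    ry = local[..., 0] * sin.unsqueeze(1) + local[..., 1] * cos.unsqueeze(1)
    rotated = torch.stack([rx, ry, local[..., 2]], dim=-1)
    return rotated + rois[:, :3].unsqueeze(1)                        # (N, 216, 3) coordinates
```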
Step 7: refine the candidate boxes with the fully connected layers using the pooled features.
The pooled features of the 128 candidate boxes from step 6 are used as input, and confidence prediction and regression-box refinement are performed by 2 fully connected layers. For the confidence prediction branch, the classification confidence c_i of each candidate box is predicted, and its target is obtained by the following formula:
c_i* = min(1, max(0, (IoU_i - θ_L) / (θ_H - θ_L)))

where c_i* is the confidence target value of the i-th candidate box, IoU_i is the intersection over union (IoU) between the i-th candidate box and its ground-truth box, and θ_H and θ_L are the foreground and background IoU thresholds, set to 0.75 and 0.25 respectively.
Associating the IoU between the prediction box and the ground truth with the classification confidence alleviates the mismatch between classification confidence and localization accuracy. For the regression-box refinement branch, more accurate position information (x, y, z, w, h, l, θ) is predicted by the fully connected layers.
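The IoU-guided confidence target of the formula above maps directly to a small helper function; the following is a minimal sketch of that mapping.

```python
def confidence_target(iou, theta_h=0.75, theta_l=0.25):
    """IoU-guided confidence target: 0 below the background threshold,
    1 above the foreground threshold, linear in between."""
    return min(1.0, max(0.0, (iou - theta_l) / (theta_h - theta_l)))

# e.g. confidence_target(0.5) == 0.5 and confidence_target(0.8) == 1.0
```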
Step 8: train the network of steps 2 to 7.
Steps 2 to 7 form the network model of the invention, and the network is trained end to end, with the five categories of cars, trucks, buses, riders and pedestrians trained simultaneously. The network is trained for 90 epochs on 2 RTX 2080 Ti GPUs with the Adam optimizer; the learning rate follows a cosine annealing schedule with a maximum learning rate of 0.003, a division factor of 10, momentum from 0.95 to 0.85, weight decay of 0.02 and a batch size of 6. The total loss L_TOTAL consists of the RPN loss L_RPN and the RCNN loss L_RCNN. In the RPN stage, the classification loss L_cls is computed with Focal Loss and the regression loss L_reg1 with Smooth L1 Loss; in the RCNN stage, the confidence loss L_iou is computed with binary cross-entropy loss and the regression loss L_reg2 with Smooth L1 Loss:
L_TOTAL = L_RPN + L_RCNN

L_RPN = (1 / N_fg) [ Σ_i L_cls(s_i, s_i*) + α Σ_i L_reg1(r_i, r_i*) ]

where N_fg is the number of foreground candidate boxes, s_i is the predicted category of a candidate box, s_i* is its ground-truth category, α indicates that the regression loss is computed only for foreground candidate boxes, r_i are the predicted box regression parameters, and r_i* are their ground-truth values.

L_RCNN = (1 / N_s) [ Σ_i L_iou(c_i, c_i*) + β Σ_i L_reg2(r_i, r_i*) ]

where N_s is the number of sampled candidate boxes, c_i is the predicted confidence, c_i* is the confidence target value, β indicates that the regression loss is computed only for sampled foreground candidate boxes, r_i are the predicted box regression parameters, and r_i* are their ground-truth values.
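A rough sketch of the two-stage loss described above is given below: a focal classification loss and Smooth L1 regression loss normalized by the number of foreground anchors for the RPN, and a binary cross-entropy confidence loss with a Smooth L1 refinement loss normalized by the number of sampled proposals for the RCNN head. The tensor layouts, the per-class sigmoid formulation and the weights alpha and beta are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, reg_pred, reg_targets, fg_mask,
               iou_pred, iou_targets, rcnn_reg_pred, rcnn_reg_targets, rcnn_fg_mask,
               alpha=1.0, beta=1.0, gamma=2.0):
    """Sketch of L_TOTAL = L_RPN + L_RCNN (assumed shapes: cls_logits/cls_targets
    (N, C), reg_* (N, 7), fg_mask (N,) bool, iou_pred/iou_targets (M,),
    rcnn_reg_* (M, 7), rcnn_fg_mask (M,) bool)."""
    n_fg = fg_mask.sum().clamp(min=1).float()

    # RPN classification: focal loss with focusing parameter gamma (per-class sigmoid)
    p = torch.sigmoid(cls_logits)
    pt = torch.where(cls_targets > 0, p, 1.0 - p)
    l_cls = (-(1.0 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).sum() / n_fg
    # RPN regression: Smooth L1 on foreground anchors only
    l_reg1 = F.smooth_l1_loss(reg_pred[fg_mask], reg_targets[fg_mask], reduction="sum") / n_fg
    l_rpn = l_cls + alpha * l_reg1

    n_s = float(iou_targets.numel())
    # RCNN confidence: binary cross-entropy against the IoU-guided target
    l_iou = F.binary_cross_entropy_with_logits(iou_pred, iou_targets, reduction="sum") / n_s
    # RCNN regression: Smooth L1 on sampled foreground proposals only
    l_reg2 = F.smooth_l1_loss(rcnn_reg_pred[rcnn_fg_mask], rcnn_reg_targets[rcnn_fg_mask],
                              reduction="sum") / n_s
    l_rcnn = l_iou + beta * l_reg2
    return l_rpn + l_rcnn
```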
Step 9: visualize the detection results.
The point cloud scene is input into the trained network to obtain the category, confidence and refined candidate box of each target. The candidate boxes of the targets in the point cloud road scene are sorted from high to low by classification confidence and boxes with low confidence are filtered out; the box with the highest score is then selected from the remaining boxes, boxes whose IoU with it exceeds a certain threshold are deleted, and this operation (non-maximum suppression) is repeated over the remaining boxes to generate and visualize the final detection result. The visualization effect is shown in fig. 4, where green denotes vehicles (cars, trucks and buses are collectively classified as vehicles), yellow denotes pedestrians and blue denotes riders. Table 1 compares the detection accuracy of the original Voxel R-CNN and the detection model of the invention at different distances.
Table 1: comparison of the detection accuracy of the original Voxel R-CNN and the present algorithm at different distances (the table is provided as an image in the original publication).
As can be seen from Table 1, compared with the original Voxel R-CNN, the detection accuracy of the present algorithm is improved by 7.58% for vehicles, 12.16% for pedestrians and 9.96% for riders, and the average accuracy is improved by 9.9%. This detection accuracy can meet the perception requirements of intelligent vehicles under complex traffic conditions; in addition, measuring the detection time per point cloud frame shows that the model reaches 13.8 FPS on an RTX 2080 Ti graphics card, which satisfies the real-time requirement.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A laser point cloud three-dimensional target detection model facing a complex traffic scene is characterized by comprising a voxelization processing module, a three-dimensional encoder module, a two-dimensional encoder module, a candidate frame generating module, a candidate frame pooling module and a full-connection layer module;
the voxelization processing module: the system is used for carrying out voxelization processing on input laser radar point cloud data;
the three-dimensional encoder module: extracting the characteristics of the non-empty voxels, and projecting the obtained voxel characteristics to a bird's-eye view to generate a pseudo-image expression;
the two-dimensional encoder module: carrying out feature extraction on the pseudo-image expression to obtain a two-dimensional feature expression;
the candidate box generation module: performing target classification and regression by using the two-dimensional features to generate a high-quality candidate frame;
the candidate frame pooling module: performing voxel pooling on the candidate frames to obtain pooling characteristics;
the full connection layer module: and refining the candidate frame aiming at the pooled features.
2. The complex traffic scene-oriented laser point cloud three-dimensional target detection model as recited in claim 1, wherein the voxelization processing module performs voxelization processing on the laser radar point cloud data in the following specific process:
the method comprises the steps of performing voxelization preprocessing on an input point cloud along the direction X, Y, Z, dividing an entire point cloud scene into uniform voxels, taking the center of a laser radar as an origin, an X axis along the front direction of a vehicle, a Y axis along the left direction and a Z axis along the upper direction, setting the size of each voxel to be (0.1,0.1,0.2) meters, and (5, 3) of the entire point cloud scene, wherein the range of the entire point cloud scene is [ (-75.2,75.2), (-75.2,75.2), and (-5,3) ], dividing the entire scene into 1504 × 1504 × 40 voxels with equal size, downsampling to 5 if the number of original points in each voxel exceeds 5, and taking the mean value of X, Y, Z, coordinates and reflection intensity of all the points in each voxel after downsampling as the original characteristics of the voxel.
3. The laser point cloud three-dimensional target detection model for the complex traffic scene according to claim 1, wherein the three-dimensional encoder module comprises a residual structure, sparse convolution and sub-manifold convolution; the input of the three-dimensional encoder module is a 40 × 1504 × 1504 voxel expression, where each voxel has a 4-dimensional original feature (x, y, z, reflection intensity); after the input, the original voxel features are first extracted by a 3 × 3 × 3 sub-manifold convolution at the original (1×) resolution, followed by 2 consecutive residual sub-manifold blocks for feature extraction; 3 × 3 × 3 sparse convolutions then downsample by 2, 4 and 8 times along the X, Y and Z directions to obtain multi-scale three-dimensional features, and each sparse convolution is followed by 2 consecutive residual sub-manifold blocks for feature extraction; a residual sub-manifold block is formed by two 3 × 3 × 3 sub-manifold convolutions, the input of the first sub-manifold convolution is added to the output of the second, and each sub-manifold convolution is followed by a BatchNorm layer and a ReLU layer; finally, a sparse convolution downsamples the Z direction by 2 times, and the result is projected onto the bird's-eye view and converted into a 256 × 188 × 188 pseudo-image expression.
4. The laser point cloud three-dimensional target detection model for the complex traffic scene according to claim 1, wherein the two-dimensional encoder module comprises five parts: a residual structure, two-dimensional convolution, self-calibrated convolution, spatial attention and channel attention; the module downsamples the 256 × 188 × 188 pseudo-image expression obtained by the three-dimensional encoder module by 1, 2 and 4 times with a 3 × 3 two-dimensional convolution to obtain multi-scale features; after each two-dimensional convolution, 2 consecutive self-calibrated convolutions extract the multi-scale features, and the input of the first self-calibrated convolution is added to the output of the second; spatial attention and channel attention mechanisms are added after the self-calibrated convolutions, enhancing useful feature expression and reducing useless information along the spatial and channel directions; finally, the multi-scale features are brought back to the same size by deconvolution, concatenated along the channel direction, and compressed with a two-dimensional convolution to obtain a 64 × 188 × 188 two-dimensional feature expression.
5. The laser point cloud three-dimensional target detection model oriented to the complex traffic scene as claimed in claim 1, wherein, for the two-dimensional feature expression of size 64 × 188 × 188 obtained by the two-dimensional encoder module, the candidate frame generation module places 10 anchor boxes at each pixel of the 188 × 188 two-dimensional feature map, covering 5 categories (car, truck, bus, pedestrian and rider), each in the 0° and 180° directions; category prediction and size regression (x, y, z, w, h, l, θ) are performed for each anchor box using the two-dimensional feature map, giving 10 × 188 × 188 three-dimensional boxes; during training, the 9000 three-dimensional boxes with the highest classification scores are selected and sent to the non-maximum suppression module to obtain 512 high-quality candidate boxes.
6. The complex traffic scene-oriented laser point cloud three-dimensional target detection model as claimed in claim 1, wherein, for the 512 high-quality candidate boxes generated by the candidate frame generation module, the candidate frame pooling module randomly samples 128 candidate boxes for the refinement stage, 64 positive samples and 64 negative samples, and extracts multi-scale voxel features with the voxel candidate-box pooling module: 6 × 6 × 6 grid points are first uniformly sampled in each candidate box, non-empty voxels within a certain distance around each grid point are queried, and the non-empty voxel features are then extracted with a PointNet, a fully connected layer and a max pooling layer to obtain the pooled features; when querying the voxels around the grid points, the voxels downsampled 2, 4 and 8 times are queried respectively to obtain multi-scale voxel features.
7. The laser point cloud three-dimensional target detection model for the complex traffic scene according to claim 1, wherein the fully connected layer module takes the pooled features of the 128 candidate boxes randomly sampled by the candidate frame pooling module as input and performs confidence prediction and regression-box refinement through 2 fully connected layers; for the confidence prediction branch, the classification confidence c_i of each candidate box is predicted, and its target is obtained by the following formula:
c_i* = min(1, max(0, (IoU_i - θ_L) / (θ_H - θ_L)))

where c_i* is the confidence target value of the i-th candidate box, IoU_i is the intersection over union (IoU) between the i-th candidate box and its ground-truth box, and θ_H and θ_L are the foreground and background IoU thresholds, set to 0.75 and 0.25 respectively;
associating the IoU between the prediction box and the ground truth with the classification confidence alleviates the mismatch between classification confidence and localization accuracy; for the regression-box refinement branch, more accurate position information (x, y, z, w, h, l, θ) is predicted by the fully connected layers.
8. The complex traffic scene-oriented laser point cloud three-dimensional target detection model as claimed in any one of claims 1 to 7, wherein the dataset used by the model adopts the ONCE dataset as the training dataset and validation dataset, and a class-balanced sampling enhancement method is added during training: training scenes containing rare classes are randomly duplicated and fed into the training dataset, which alleviates class imbalance and expands the number of training samples.
9. The complex traffic scene-oriented laser point cloud three-dimensional target detection model as claimed in any one of claims 1 to 7, wherein the model is trained end to end, with the five categories of cars, trucks, buses, riders and pedestrians trained simultaneously; the network is trained for 90 epochs on 2 RTX 2080 Ti GPUs with the Adam optimizer, a cosine annealing learning-rate schedule, a maximum learning rate of 0.003, a division factor of 10, momentum from 0.95 to 0.85, weight decay of 0.02 and a batch size of 6; the total loss L_TOTAL consists of the RPN loss L_RPN and the RCNN loss L_RCNN; in the RPN stage, the classification loss L_cls is computed with Focal Loss and the regression loss L_reg1 with Smooth L1 Loss; in the RCNN stage, the confidence loss L_iou is computed with binary cross-entropy loss and the regression loss L_reg2 with Smooth L1 Loss:
L_TOTAL = L_RPN + L_RCNN

L_RPN = (1 / N_fg) [ Σ_i L_cls(s_i, s_i*) + α Σ_i L_reg1(r_i, r_i*) ]

where N_fg is the number of foreground candidate boxes, s_i is the predicted category of a candidate box, s_i* is its ground-truth category, α indicates that the regression loss is computed only for foreground candidate boxes, r_i are the predicted box regression parameters, and r_i* are their ground-truth values;

L_RCNN = (1 / N_s) [ Σ_i L_iou(c_i, c_i*) + β Σ_i L_reg2(r_i, r_i*) ]

where N_s is the number of sampled candidate boxes, c_i is the predicted confidence, c_i* is the confidence target value, β indicates that the regression loss is computed only for sampled foreground candidate boxes, r_i are the predicted box regression parameters, and r_i* are their ground-truth values.
10. A detection method based on a laser point cloud three-dimensional target detection model facing a complex traffic scene is characterized by comprising the following steps:
s1, performing voxelization on the laser radar point cloud input into the network;
performing voxelization preprocessing on the input point cloud along the X, Y and Z directions and dividing the whole point cloud scene into uniform voxels, wherein the range of the whole point cloud scene in (X, Y, Z) is [(-75.2, 75.2), (-75.2, 75.2), (-5, 3)] meters and the size of each voxel is set to (0.1, 0.1, 0.2) meters, so the whole scene is divided into 1504 × 1504 × 40 voxels of equal size; if the number of raw points in a voxel exceeds 5, they are downsampled to 5, and the mean of the x, y, z coordinates and reflection intensity of all points remaining in each voxel is taken as the original feature of the voxel;
s2, extracting the features of the non-empty voxels by using a three-dimensional encoder, and projecting the obtained voxel features onto a bird' S-eye view to generate a pseudo-image expression;
the input of the three-dimensional encoder is a 40 × 1504 × 1504 voxel expression, where each voxel has a 4-dimensional original feature (x, y, z, reflection intensity); after the input, the original voxel features are first extracted by a 3 × 3 × 3 sub-manifold convolution at the original (1×) resolution, followed by 2 consecutive residual sub-manifold blocks for feature extraction; 3 × 3 × 3 sparse convolutions then downsample by 2, 4 and 8 times along the X, Y and Z directions to obtain multi-scale three-dimensional features, and each sparse convolution is followed by 2 consecutive residual sub-manifold blocks for feature extraction; a residual sub-manifold block is formed by two 3 × 3 × 3 sub-manifold convolutions, the input of the first sub-manifold convolution is added to the output of the second, and each sub-manifold convolution is followed by a BatchNorm layer and a ReLU layer; finally, a sparse convolution downsamples the Z direction by 2 times, and the result is projected onto the bird's-eye view and converted into a 256 × 188 × 188 pseudo-image expression;
s3, performing feature extraction on the pseudo-image expression by using a two-dimensional encoder to obtain two-dimensional feature expression;
the method comprises the steps that 256 × 188 × 188 pseudo-image expressions obtained by a three-dimensional encoder are respectively subjected to 1-time, 2-time and 4-time down-sampling by 13 × 3 two-dimensional convolution to obtain multi-scale features, then 2 continuous self-calibration convolutions are used for extracting the multi-scale features after each two-dimensional convolution, the input of the first self-calibration convolution and the output of the second self-calibration convolution are added, and space attention and channel attention mechanisms are added after the self-calibration convolutions, so that useful feature expressions are enhanced in the space and channel directions, and useless information is reduced; finally, returning the multi-scale features to the same size through deconvolution, splicing along the channel direction, and performing dimension compression by using a two-dimensional convolution to obtain a 64X 188 two-dimensional feature expression;
s4, classifying and regressing the target by using the two-dimensional characteristics to generate a high-quality candidate frame;
for the two-dimensional feature expression of size 64 × 188 × 188 obtained by the two-dimensional encoder module, 10 anchor boxes are placed at each pixel of the 188 × 188 two-dimensional feature map, covering 5 categories (car, truck, bus, pedestrian and rider), each in the 0° and 180° directions; category prediction and size regression (x, y, z, w, h, l, θ) are performed for each anchor box using the two-dimensional feature map, giving 10 × 188 × 188 three-dimensional boxes; during training, the 9000 three-dimensional boxes with the highest classification scores are selected and sent to the non-maximum suppression module to obtain 512 high-quality candidate boxes.
S5, voxel pooling is carried out on the candidate frames to obtain pooling characteristics;
for 512 high-quality candidate frames generated by the candidate frame generation module, randomly sampling 128 candidate frames and sending the candidate frames to a refinement stage, wherein 64 of the candidate frames are positive samples, and 64 of the candidate frames are negative samples, and extracting multi-scale voxel characteristics by using a voxel candidate frame pooling module: firstly, uniformly sampling 6 x 6 grid points for each candidate frame, then inquiring non-empty voxels within a certain distance around the grid points, and then extracting the non-empty voxels features by using a pointnet, a full connection layer and a maximum pooling layer to obtain pooling features; when the voxels around the grid points are inquired, the voxels sampled by 2 times, 4 times and 8 times are inquired respectively, so that the multi-scale voxel characteristics are obtained;
s6, refining the candidate frame by using the full connection layer according to the pooling characteristics;
for the 128 candidate boxes randomly sampled by the candidate frame pooling module, their pooled features are used as input, and confidence prediction and regression-box refinement are performed through 2 fully connected layers; for the confidence prediction branch, the classification confidence c_i of each candidate box is predicted, and its target is obtained by the following formula:
c_i* = min(1, max(0, (IoU_i - θ_L) / (θ_H - θ_L)))

where c_i* is the confidence target value of the i-th candidate box, IoU_i is the intersection over union (IoU) between the i-th candidate box and its ground-truth box, and θ_H and θ_L are the foreground and background IoU thresholds, set to 0.75 and 0.25 respectively;
associating the IoU between the prediction box and the ground truth with the classification confidence alleviates the mismatch between classification confidence and localization accuracy; for the regression-box refinement branch, more accurate position information (x, y, z, w, h, l, θ) is predicted by the fully connected layers;
s7, visualizing the detection effect;
sorting the candidate boxes of the targets in the refined point cloud road scene from high to low by classification confidence, filtering out boxes with low confidence, selecting the box with the highest score from the remaining boxes and deleting boxes whose IoU with it exceeds a threshold, then repeating this non-maximum suppression operation over the remaining boxes to generate and visualize the final detection result.
CN202111255417.9A 2021-10-27 2021-10-27 Laser point cloud three-dimensional target detection model and method for complex traffic scene Pending CN113920499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111255417.9A CN113920499A (en) 2021-10-27 2021-10-27 Laser point cloud three-dimensional target detection model and method for complex traffic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111255417.9A CN113920499A (en) 2021-10-27 2021-10-27 Laser point cloud three-dimensional target detection model and method for complex traffic scene

Publications (1)

Publication Number Publication Date
CN113920499A true CN113920499A (en) 2022-01-11

Family

ID=79243235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111255417.9A Pending CN113920499A (en) 2021-10-27 2021-10-27 Laser point cloud three-dimensional target detection model and method for complex traffic scene

Country Status (1)

Country Link
CN (1) CN113920499A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082902A (en) * 2022-07-22 2022-09-20 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115082902B (en) * 2022-07-22 2022-11-11 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN116030023A (en) * 2023-02-02 2023-04-28 泉州装备制造研究所 Point cloud detection method and system
CN116229174A (en) * 2023-03-10 2023-06-06 南京审计大学 Hyperspectral multi-class change detection method based on spatial spectrum combined attention mechanism
CN116299247A (en) * 2023-05-19 2023-06-23 中国科学院精密测量科学与技术创新研究院 InSAR atmospheric correction method based on sparse convolutional neural network
CN116299247B (en) * 2023-05-19 2023-08-04 中国科学院精密测量科学与技术创新研究院 InSAR atmospheric correction method based on sparse convolutional neural network
CN116664874A (en) * 2023-08-02 2023-08-29 安徽大学 Single-stage fine-granularity light-weight point cloud 3D target detection system and method
CN116664874B (en) * 2023-08-02 2023-10-20 安徽大学 Single-stage fine-granularity light-weight point cloud 3D target detection system and method
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN116740668B (en) * 2023-08-16 2023-11-14 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN117173655A (en) * 2023-08-28 2023-12-05 南京航空航天大学 Multi-mode 3D target detection method based on semantic propagation and cross-attention mechanism
CN117252899A (en) * 2023-09-26 2023-12-19 探维科技(苏州)有限公司 Target tracking method and device
CN117252899B (en) * 2023-09-26 2024-05-17 探维科技(苏州)有限公司 Target tracking method and device
CN117315724A (en) * 2023-11-29 2023-12-29 烟台大学 Open scene-oriented three-dimensional pedestrian detection method, system, equipment and medium
CN117315724B (en) * 2023-11-29 2024-03-08 烟台大学 Open scene-oriented three-dimensional pedestrian detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN113920499A (en) Laser point cloud three-dimensional target detection model and method for complex traffic scene
CN113128348B (en) Laser radar target detection method and system integrating semantic information
CN111242041B (en) Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111695448B (en) Roadside vehicle identification method based on visual sensor
CN110766098A (en) Traffic scene small target detection method based on improved YOLOv3
CN113095152B (en) Regression-based lane line detection method and system
Chao et al. Multi-lane detection based on deep convolutional neural network
CN110599497A (en) Drivable region segmentation method based on deep neural network
CN115187964A (en) Automatic driving decision-making method based on multi-sensor data fusion and SoC chip
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN116486368A (en) Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN114782949B (en) Traffic scene semantic segmentation method for boundary guide context aggregation
CN115359455A (en) Lightweight vehicle detection method based on deep learning
CN114821508A (en) Road three-dimensional target detection method based on implicit context learning
CN114495050A (en) Multitask integrated detection method for automatic driving forward vision detection
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN114120246B (en) Front vehicle detection algorithm based on complex environment
CN116486352A (en) Lane line robust detection and extraction method based on road constraint
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN115953660A (en) Point cloud 3D target detection method based on pseudo label and oriented to automatic driving
Jiangzhou et al. Research on real-time object detection algorithm in traffic monitoring scene
CN114882205A (en) Target detection method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination