CN111862101A - 3D point cloud semantic segmentation method under aerial view coding visual angle - Google Patents


Info

Publication number
CN111862101A
CN111862101A (application CN202010681588.7A)
Authority
CN
China
Prior art keywords
point cloud
network
convolution
module
residual error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010681588.7A
Other languages
Chinese (zh)
Inventor
杨树明
李述胜
袁野
王腾
胡鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010681588.7A priority Critical patent/CN111862101A/en
Publication of CN111862101A publication Critical patent/CN111862101A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]


Abstract

The invention discloses a 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective. The method converts an input 3D point cloud to the bird's-eye view through a voxel-based encoding scheme, extracts the features of each voxel with a simplified PointNet network, and converts them into a feature image that a 2D convolutional network can process directly. The encoded feature image is then processed by a fully convolutional network built from residual modules reconstructed with factorized and dilated convolutions, yielding an end-to-end pixel-level semantic segmentation result. The method accelerates point cloud semantic segmentation and accomplishes high-precision, real-time segmentation of large scenes under limited hardware. It can be applied directly to tasks such as robotics, autonomous driving and unordered grasping; owing to the design of the encoding scheme and the network structure, it combines high-precision point cloud semantic segmentation with low system overhead, making it particularly suitable for hardware-constrained scenarios such as robots and autonomous vehicles.

Description

3D point cloud semantic segmentation method under aerial view coding visual angle
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective.
Background
Since the R-CNN convolutional neural network was proposed in 2014, the hand-crafted feature extraction methods that dominated before the early 2010s have gradually been replaced by feature extraction based on convolutional neural networks. Convolutional approaches to two-dimensional images now drive the development of computer vision technology. The keys to their success are the effective extraction of image features by the convolution operation, the accurate, data-driven fitting of model parameters, and the robustness and extensibility brought by the redundant structure of deep networks. With sufficient data and a carefully designed architecture, convolutional neural networks can accomplish the task of enabling a computer to understand its environment.
Two-dimensional convolution has achieved great success in the image domain, so when understanding and analyzing large-scale three-dimensional scenes it is natural to extend it to three-dimensional convolution and process point clouds with 3D convolution directly. However, because point clouds are generally highly sparse and lack surface texture information, processing them directly with 3D convolution incurs excessive system overhead and makes real-time processing of point cloud information difficult.
To reduce the overhead of processing point clouds directly with 3D convolution, some studies propose dividing the point cloud into voxels and then applying 3D convolution on the voxel grid. This reduces the overhead somewhat but still has limitations: the system overhead and prediction accuracy of voxel-grid methods are closely tied to the grid resolution, forcing researchers to trade prediction accuracy against runtime efficiency. Later work proposed encoding the point cloud structure with octrees, but even this remains insufficient to guarantee efficient processing of large areas of point cloud.
In 2017, Qi et al. of Stanford University proposed PointNet, a pioneering network structure for processing unordered point clouds that opened a new line of end-to-end point cloud feature extraction. The network struggles to capture structural relations between points and is insufficient at acquiring global features, so it is difficult to apply directly to semantic segmentation of large-scale point clouds, but the idea has inspired many other point cloud segmentation networks and frequently appears inside point cloud feature extraction modules.
Point cloud data are usually acquired by sensors such as stereo cameras, depth cameras and lidar, and typically describe only the surface of objects, so the data are highly sparse; at the same time, a point cloud is discontinuous, making it difficult to express the surface texture of objects. This contrasts with truly volumetric three-dimensional data such as medical images. Processing point clouds or voxels directly with 3D convolution therefore generates a large number of wasted operations.
To address the excessive cost of three-dimensional convolution, much recent research projects the point cloud onto a top (bird's-eye) view or a front view, divides it into voxels or pillars of a fixed format, and extracts features from the points inside each fixed voxel to form feature maps that two-dimensional convolution can process directly.
(1) Three-dimensional point cloud semantic segmentation
Semantic segmentation of three-dimensional point clouds is one of the important directions in computer understanding of 3D scenes, but because point cloud data are highly redundant and non-uniform, traditional segmentation methods based on hand-crafted features struggle to obtain satisfactory results in complex scenes. In recent years, with the rise of deep networks, various deep-network-based point cloud segmentation techniques such as PointNet, PointNet++ and VoxelNet have been proposed, and these models have rapidly improved the comprehension of complex scenes.
(2) Development of deep networks for three-dimensional scene understanding
As deep networks continue to evolve in computer vision, more and more research focuses on processing and understanding three-dimensional scene data with deep learning models. Three-dimensional data are inherently more complex: while a two-dimensional image is easily expressed as a matrix, the representation of 3D data varies with the scene and includes point clouds, triangle meshes, voxels, multi-view images and other forms. Deep network models can be divided into two-stage networks and end-to-end networks according to how they process the point cloud, or into 3D-convolution networks and 2D-convolution networks according to their convolution kernels.
Because an end-to-end network performs feature extraction and prediction within a single model, it is generally more efficient. Since 2D convolutions have far fewer parameters than 3D convolutions, an end-to-end 2D convolutional network is better suited to real-time processing of large-scale point clouds.
Disclosure of Invention
The invention provides a 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective, aiming to solve the technical problems of existing point cloud scene understanding: high data sparsity, insufficient robustness of local features and excessive system overhead, all of which make it difficult to process large-scale point clouds in real time. Within the proposed model framework, the semantic segmentation of large-scale point cloud scenes can be completed rapidly, accurately and in real time.
The invention is realized by adopting the following technical scheme:
A 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective comprises the following steps:
(1) Projecting the point cloud under the bird's-eye view and encoding it to construct a feature map that 2D convolution can process directly: first, in the world coordinate system, a grid is drawn under the bird's-eye view; each point of the point cloud is assigned, according to its x, y and z coordinates, to one of the voxels produced by the grid division, and the features of all points within each voxel are extracted with a simplified PointNet to form an (H, W, C) feature map that 2D convolution can process directly;
(2) Point cloud semantic segmentation network: the data processed by the network is the (H, W, C) feature map obtained in step (1); the network structure consists of an encoder and a decoder built from residual modules composed of factorized convolutions and dilated convolutions; the residual modules and downsampling modules form the encoder, the residual modules and upsampling modules form the decoder, and together they constitute an end-to-end pixel-level point cloud semantic segmentation network;
(3) network training:
the network input is unordered point cloud data, and the model is trained in a data-driven manner; the cross-entropy function is used as the loss function, and penalty weights are added to the error losses of the different classes to alleviate the imbalance of the data distribution:
L = -\sum_{c} w_c \, y_c \log(\hat{y}_c)

where the subscript c denotes the class, y_c is the ground-truth label for class c, w_c represents the penalty weight, determined by f_c, f_c represents the frequency with which objects of class c occur in the data set, and \hat{y}_c represents the network's predicted probability for class c;
the total error of the network is calculated with the above loss function, the network parameters are updated by error backpropagation and stochastic gradient descent, and iteration continues until the loss function of the model converges, completing the training.
In a further refinement of the invention, step (1) is implemented as follows:
(1.1) grid division under the view angle of the aerial view:
the grid is drawn under the bird's-eye view and divided in the x-y plane according to a set size; each point p in the point cloud carries features in the three dimensions x, y and z and is assigned, according to its x and y coordinates, to one of the voxels produced by the grid division; then, for every voxel that contains points, the mean of the coordinates of all points inside the voxel is computed and recorded as x_c, y_c, z_c, and the offset of each point from the voxel centre in x and y is computed and recorded as x_p and y_p; after this expansion, every point in the point cloud has a feature of at least D = 9 dimensions;
(1.2) converting the divided point cloud into a feature map:
the maximum number of points per voxel is limited to N, and the point cloud is converted into a (P, N, D) tensor; the features of each point are then mapped to a high-dimensional feature space with a simplified PointNet network, giving an output of shape (P, N, C); if richer point cloud features are desired, several such modules may be stacked; a max-pooling layer over the N dimension is added at the end of the module to keep the most salient feature among all points in each voxel, giving an output of shape (P, C); finally, the P voxels are scattered back to their original positions according to the previously recorded x-y grid indices, producing a feature map of shape (H, W, C), where H and W denote the height and width of the feature map; the positions of empty voxels are filled with zeros.
In a further refinement of the invention, step (2) is implemented as follows:
(2.1) reconstructing residual module architecture:
the reconstructed residual module replaces the 3 × 3 convolutions of the conventional residual module with 1 × 3 and 3 × 1 factorized convolutions and enlarges the receptive field by adding dilated convolution; the size of the feature map is unchanged before and after the reconstructed residual module; by introducing dilated convolution into the convolution layers of the reconstructed residual module, the receptive field of the network is enlarged without changing the number of parameters, so the network sees a wider context and has stronger feature extraction and recognition ability; the skip connection of the residual module accelerates the training of deep networks and helps predict the details of targets;
(2.2) downsampling module architecture:
the downsampling module consists of a max-pooling layer with stride 2 and a convolution layer connected in parallel; the feature map produced by the pooling layer and the feature map produced by the convolution layer are concatenated along the feature-channel dimension C to form a new feature map, which is then output; since both the pooling layer and the convolution layer use stride 2, the size of the feature map is halved after the downsampling module and higher-dimensional features are extracted; processing the pooling layer and the convolution layer in parallel preserves the detail features of the network;
(2.3) overall network architecture:
the whole network consists of an encoder and a decoder; the encoder is composed of downsampling modules and reconstructed residual modules, and stacking several reconstructed residual modules extracts high-dimensional features more accurately and quickly; the decoder is composed of upsampling modules and reconstructed residual modules, where an upsampling module consists of a deconvolution layer whose purpose is to recover the spatial dimensions of the features and restore the size of the feature map; likewise, each upsampling module is followed by several reconstructed residual modules, so the network can recover feature details more finely and achieve pixel-level point cloud segmentation; after the decoder has restored the feature map to the size of the input feature map, two 1 × 1 convolution layers reduce the number of channels to the number of target classes n_class, and the softmax function normalizes the result into a probability distribution

\hat{y}_c = \exp(a_c) / \sum_{c'} \exp(a_{c'})

where a_c is the network output for class c and the subscript c denotes the different classes.
The invention has at least the following beneficial technical effects:
For the point cloud segmentation task, traditional algorithms pursue accuracy by finely dividing the voxels over the entire point cloud scene, which makes the number of parameters grow exponentially, leaves the network model inefficient and the system overhead high, and makes it hard to meet practical requirements. At the same time, the complexity and sparsity of point cloud data make it difficult for a network to extract enough features to predict details of the three-dimensional scene.
The 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective provided by the invention encodes the point cloud under the bird's-eye view, extracts the required features from the data with a simplified and improved PointNet, and uses reconstructed residual modules and downsampling modules to enlarge the receptive field of the convolution kernels while reducing the number of network parameters, so that the network model can be made deeper and wider. An end-to-end pixel-level fully convolutional network model is thus constructed, achieving very high point cloud segmentation efficiency (20 frames per second on a 1080 Ti) while maintaining leading segmentation accuracy. Thanks to its high accuracy, high efficiency and low system overhead, the invention is well suited for direct use in tasks such as robot navigation and autonomous driving.
Drawings
Fig. 1 is a schematic view of encoding under a bird's eye view.
Fig. 2 is a backbone network of the model, each square representing a feature map, and the lower number representing the number of channels of the feature map.
Fig. 3 is a simplified modified structure diagram of PointNet.
FIG. 4(a) is a structure diagram of the reconstructed residual module; FIG. 4(b) is a structure diagram of the downsampling module.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the specific embodiments.
The invention provides a 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective, which comprises the following steps:
(1) As shown in FIG. 1, the point cloud is projected under the bird's-eye view and encoded to construct a feature map that 2D convolution can process directly: first, in the world coordinate system, a grid is drawn under the bird's-eye view; each point of the point cloud is assigned, according to its x, y and z coordinates, to one of the voxels produced by the grid division, and the features of all points within each voxel are extracted with a simplified PointNet to form a feature map that 2D convolution can process directly;
(1.1) grid division under the view angle of the aerial view:
the grid is drawn under the bird's-eye view and divided in the x-y plane according to a set size; the grid size can be chosen according to the actual situation, for example 0.16 m × 0.16 m in a large-scale scene. Each point p in the point cloud carries features in the three dimensions x, y and z, and point clouds obtained from sensors such as lidar also carry a reflectivity r; each point is assigned to one of the voxels produced by the grid division according to its x and y coordinates. Then, for every voxel that contains points, the mean of the coordinates of all points inside the voxel is computed and recorded as x_c, y_c, z_c, and the offset of each point from the voxel centre in x and y is computed and recorded as x_p and y_p; after this expansion, every point in the point cloud has a feature of at least D = 9 dimensions;
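For illustration only, a minimal NumPy sketch of this grid assignment and point decoration is given below; the detection range, the 0.16 m cell size and the exact ordering of the D = 9 point feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p) are assumptions and not values fixed by the invention.

```python
import numpy as np

def decorate_points(points, grid_size=0.16, x_range=(0.0, 70.4), y_range=(-40.0, 40.0)):
    """Assign each point to a bird's-eye-view grid cell and expand it to a 9-D
    feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p).  `points` is an (M, 4) array
    of x, y, z, reflectivity; range and cell size are illustrative values."""
    # keep only points that fall inside the chosen bird's-eye-view range
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]

    ix = ((points[:, 0] - x_range[0]) / grid_size).astype(np.int64)  # column in the BEV grid
    iy = ((points[:, 1] - y_range[0]) / grid_size).astype(np.int64)  # row in the BEV grid
    cell_id = ix * 100_000 + iy                                      # one id per occupied voxel

    feats = np.zeros((points.shape[0], 9), dtype=np.float32)
    feats[:, :4] = points[:, :4]
    for vid in np.unique(cell_id):
        m = cell_id == vid
        feats[m, 4:7] = points[m, :3].mean(axis=0)                   # voxel centroid x_c, y_c, z_c
        center = np.array([(ix[m][0] + 0.5) * grid_size + x_range[0],
                           (iy[m][0] + 0.5) * grid_size + y_range[0]])
        feats[m, 7:9] = points[m, :2] - center                       # offsets x_p, y_p to cell centre
    return feats, ix, iy
```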
(1.2) converting the divided point cloud into a feature map:
Because of the high sparsity of the point cloud, most voxels produced by the grid division are empty. Taking the public KITTI data set as an example, if the point cloud is divided in the x-y plane at a 0.16 m × 0.16 m resolution, the number P of non-empty voxels in the result is roughly 10,000 to 20,000. The maximum number of points per voxel is limited to N, and the point cloud is converted into a (P, N, D) tensor accordingly; the features of each point are then mapped to a high-dimensional feature space with a simplified PointNet network, giving an output of shape (P, N, C). As shown in FIG. 3, the simplified PointNet mainly consists of modules built from a linear transformation layer, a batch normalization layer and a ReLU layer; if richer point cloud features are desired, several such modules may be stacked, and a max-pooling layer over the N dimension is added at the end of the module to keep the most salient feature among all points in each voxel, giving an output of shape (P, C). Finally, the P voxels are scattered back to their original positions according to the previously recorded x-y grid indices, producing a feature map of shape (H, W, C), where H and W denote the height and width of the feature map; the positions of empty voxels are filled with zeros. The processed feature map can then be handled directly by a 2D convolutional network; for example, an RGB colour image is simply a feature map of shape (H, W, C) with C = 3.
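A short PyTorch sketch of the simplified PointNet and of the scatter step is shown below; the module and function names (SimplifiedPointNet, scatter_to_bev) and the channel width C = 64 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedPointNet(nn.Module):
    """Linear -> BatchNorm -> ReLU applied to every point, followed by a
    max-pool over the N points of each voxel (one module of the simplified
    PointNet; several such modules may be stacked)."""
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (P, N, D) non-empty voxels
        p, n, _ = x.shape
        x = self.linear(x).view(p * n, -1)      # per-point linear transform
        x = self.relu(self.bn(x)).view(p, n, -1)
        return x.max(dim=1).values              # (P, C): most salient feature per voxel


def scatter_to_bev(voxel_feats, ix, iy, height, width):
    """Place the (P, C) voxel features back at their x-y grid positions
    (ix, iy hold one index pair per non-empty voxel) and zero-fill empty
    cells, yielding a (C, H, W) image that 2D convolution can process."""
    c = voxel_feats.shape[1]
    canvas = torch.zeros(c, height, width, dtype=voxel_feats.dtype)
    canvas[:, iy, ix] = voxel_feats.t()
    return canvas
```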
(2) Point cloud semantic segmentation network: the data processed by the network is the (H, W, C) feature map obtained in the previous step; the network structure mainly consists of an encoder and a decoder built from residual modules composed of factorized convolutions and dilated convolutions; the residual modules and downsampling modules form the encoder, the residual modules and upsampling modules form the decoder, and together they constitute an end-to-end pixel-level semantic segmentation network; the overall network framework is shown in FIG. 2;
(2.1) reconstructing residual module architecture:
as shown in FIG. 4(a), the reconstructed residual module replaces the 3 × 3 convolutions of the conventional residual module with 1 × 3 and 3 × 1 factorized convolutions and enlarges the receptive field by adding dilated convolution, which preserves the feature extraction capability and accuracy of the network while reducing its parameters as much as possible to reach a faster processing speed; the size of the feature map is unchanged before and after the reconstructed residual module. By designing a reconstructed residual module with fewer parameters, the parameter count stays small even with a deeper network structure, maintaining the efficiency of data processing; meanwhile, introducing dilated convolution enlarges the receptive field of the network without changing the number of parameters, so the network sees a wider context and has stronger feature extraction and recognition ability; the skip connection of the residual module also lets deep networks train better and faster and predict the details of targets more accurately.
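The reconstructed residual module could be written roughly as the following PyTorch sketch; the dilation rate, the placement of batch normalization and the use of a single 1 × 3 / 3 × 1 pair per block are assumptions made for brevity.

```python
import torch.nn as nn

class ReconstructedResidual(nn.Module):
    """Residual module built from 1x3 and 3x1 factorized convolutions with
    dilation; the spatial size of the feature map is unchanged and a skip
    connection is added around the convolutions."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.conv1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                                 padding=(0, dilation), dilation=(1, dilation))
        self.conv3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(dilation, 0), dilation=(dilation, 1))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1x3(x))        # factorized convolution, horizontal
        out = self.bn(self.conv3x1(out))        # factorized convolution, vertical
        return self.relu(out + x)               # skip connection preserves detail
```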
(2.2) downsampling module architecture:
as shown in FIG. 4(b), the downsampling module consists of a max-pooling layer with stride 2 and a convolution layer connected in parallel; the feature map produced by the pooling layer and the feature map produced by the convolution layer are concatenated along the feature-channel dimension C to form a new feature map, which is then output. Since both the pooling layer and the convolution layer use stride 2, the size of the feature map is halved after the downsampling module, so only the salient features are attended to and the number of parameters is reduced; processing the pooling layer and the convolution layer in parallel also lets the module keep as many useful detail features as possible, improving the network's semantic segmentation of small targets.
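A possible PyTorch sketch of this parallel downsampling module follows; giving the convolution branch out_channels − in_channels output channels so that the concatenation yields out_channels (with out_channels > in_channels) is an assumption in the spirit of similar downsampler designs.

```python
import torch
import torch.nn as nn

class DownsamplingModule(nn.Module):
    """Stride-2 max pooling and a stride-2 convolution run in parallel; their
    outputs are concatenated along the channel dimension C, so the spatial
    size of the feature map is halved while detail from both branches is kept."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_channels, out_channels - in_channels,
                              kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.pool(x), self.conv(x)], dim=1)  # concat on channel C
        return self.relu(self.bn(out))
```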
(2.3) overall network architecture:
the whole network consists of an encoder and a decoder; the encoder mainly consists of downsampling modules and reconstructed residual modules, and stacking several reconstructed residual modules extracts high-dimensional features more accurately and quickly. The decoder mainly consists of upsampling modules and reconstructed residual modules, where an upsampling module is built around a deconvolution layer whose purpose is to recover the spatial dimensions of the features and restore the size of the feature map; likewise, each upsampling module is followed by several reconstructed residual modules, so the network can recover feature details more finely and achieve pixel-level point cloud segmentation. After the decoder has restored the feature map to the size of the input feature map, two 1 × 1 convolution layers reduce the number of channels to the number of target classes n_class, and the softmax function normalizes the result into a probability distribution

\hat{y}_c = \exp(a_c) / \sum_{c'} \exp(a_{c'})

where a_c is the network output for class c and the subscript c denotes the different classes.
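The upsampling module and the classification head might be sketched in PyTorch as follows; the intermediate channel width of the two 1 × 1 convolutions is an assumption, and when the weighted cross-entropy loss of step (3) is used for training, the softmax is normally folded into the loss and the head returns the raw logits a_c instead.

```python
import torch
import torch.nn as nn

class UpsamplingModule(nn.Module):
    """Deconvolution (transposed convolution) that doubles the spatial size of
    the feature map to recover the resolution lost in the encoder."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=3,
                                         stride=2, padding=1, output_padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.deconv(x)))


class SegmentationHead(nn.Module):
    """Two 1x1 convolutions reduce the channel count to n_class; softmax turns
    the logits a_c into a per-pixel probability distribution (for training with
    nn.CrossEntropyLoss one would return the logits instead)."""
    def __init__(self, in_channels, n_class):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, n_class, kernel_size=1),
        )

    def forward(self, x):
        return torch.softmax(self.reduce(x), dim=1)  # (B, n_class, H, W)
```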
(3) Network training:
the network input is unordered point cloud data, for example the SemanticKITTI public data set; the model is trained in a data-driven manner, the cross-entropy function is used as the loss function, and penalty weights are added to the error losses of the different classes to alleviate the imbalance of the data distribution:
L = -\sum_{c} w_c \, y_c \log(\hat{y}_c)

where the subscript c denotes the class, y_c is the ground-truth label for class c, \hat{y}_c is the network's predicted probability for class c, w_c represents the penalty weight, mainly determined by f_c, and f_c represents the frequency with which objects of class c occur in the data set. In general, object classes that occur frequently in the data set are given a smaller weight.
The total error of the network is calculated with the above loss function, the network parameters are updated by error backpropagation and stochastic gradient descent, and iteration continues until the loss function of the model converges, completing the training.
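A compact PyTorch sketch of this training procedure is given below. The 1 / ln(k + f_c) mapping from class frequency to penalty weight, the learning rate and the momentum are assumptions — the description only states that w_c is determined by f_c and that frequent classes receive smaller weights; note also that nn.CrossEntropyLoss applies log-softmax internally, so the model passed here should output raw logits.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def class_weights(frequencies, k=1.02):
    """Penalty weights w_c computed from the class frequencies f_c; classes
    that occur often receive a smaller weight (the 1 / ln(k + f_c) form is
    only one possible choice)."""
    return 1.0 / torch.log(k + frequencies)

def train(model, loader, frequencies, epochs=30, lr=1e-3, device="cuda"):
    """Weighted cross-entropy training with backpropagation and SGD."""
    criterion = nn.CrossEntropyLoss(weight=class_weights(frequencies).to(device))
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.to(device).train()
    for _ in range(epochs):
        for bev_features, labels in loader:          # (B, C, H, W) inputs, (B, H, W) labels
            optimizer.zero_grad()
            logits = model(bev_features.to(device))  # raw class scores a_c
            loss = criterion(logits, labels.to(device))
            loss.backward()                          # error backpropagation
            optimizer.step()                         # stochastic gradient descent update
```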
Examples
The invention provides a 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective.
1. Training network model
Training the 3D point cloud semantic segmentation network model under the bird's-eye-view encoding perspective first requires sufficient point cloud data. Each point cloud scene sample should contain the XYZ coordinates, the reflectivity and the semantic class of every point. Taking the SemanticKITTI outdoor lidar point cloud data set as an example, 15,000 frames of scene point clouds are used as the training set and 3,000 frames of real point clouds as the validation set.
After an adequate point cloud data set has been obtained, each frame of point cloud is first encoded under the bird's-eye view into grid voxels seen from above; the features of each point are then mapped to a high-dimensional feature space with the simplified and improved PointNet network, whose structure is shown in FIG. 3. Max pooling is applied at the end of the module to keep the most salient feature of all points in each voxel; finally, the voxels are scattered back to their original positions according to the previously recorded x-y grid indices, with zero filling at the positions of empty voxels, which yields the feature map.
An end-to-end fully convolutional network model is then built according to FIG. 2; the feature map is processed with the reconstructed residual modules and downsampling modules shown in FIG. 4. The point cloud network error and the penalty weight of each class are calculated with the loss function and class weights defined above, the parameters are updated iteratively by gradient backpropagation, and the computation is accelerated with a GPU until the network error falls below a set threshold or the number of iterations meets the requirement.
2. Point cloud semantic segmentation process
For a frame of large-scale point cloud, the frame is first sent to the bird's-eye-view encoding module for encoding and processed with the simplified PointNet for feature mapping to obtain the feature map; the feature map is then fed into the trained semantic segmentation network model for point cloud segmentation, and a predicted label is assigned to each point. The scene point cloud segmentation result can be used directly in tasks such as autonomous driving and robot navigation.
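The inference path of the embodiment might be wired together as in the sketch below; the encoder wrapper and the read-back of one label per point via the stored cell indices are illustrative assumptions built on the earlier sketches.

```python
import torch

@torch.no_grad()
def segment_frame(points, encoder, backbone):
    """Encode a raw point cloud under the bird's-eye view, run the trained 2D
    segmentation network and read one semantic label back for every point.
    `encoder` is assumed to wrap the voxelization, simplified PointNet and
    scatter steps sketched above and to return the (C, H, W) feature map
    together with the per-point cell indices (ix, iy)."""
    bev, ix, iy = encoder(points)                           # (C, H, W) feature map
    probs = backbone(bev.unsqueeze(0))                      # (1, n_class, H, W) probabilities
    labels = probs.argmax(dim=1).squeeze(0).cpu().numpy()   # (H, W) per-pixel labels
    return labels[iy, ix]                                   # one predicted label per input point
```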

Claims (3)

1. A 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective, characterized by comprising the following steps:
(1) Projecting the point cloud under the bird's-eye view and encoding it to construct a feature map that 2D convolution can process directly: first, in the world coordinate system, a grid is drawn under the bird's-eye view; each point of the point cloud is assigned, according to its x, y and z coordinates, to one of the voxels produced by the grid division, and the features of all points within each voxel are extracted with a simplified PointNet to form an (H, W, C) feature map that 2D convolution can process directly;
(2) Point cloud semantic segmentation network: the data processed by the network is the (H, W, C) feature map obtained in step (1); the network structure consists of an encoder and a decoder built from residual modules composed of factorized convolutions and dilated convolutions; the residual modules and downsampling modules form the encoder, the residual modules and upsampling modules form the decoder, and together they constitute an end-to-end pixel-level point cloud semantic segmentation network;
(3) network training:
the network input is unordered point cloud data, and the model is trained in a data-driven manner; the cross-entropy function is used as the loss function, and penalty weights are added to the error losses of the different classes to alleviate the imbalance of the data distribution:
L = -\sum_{c} w_c \, y_c \log(\hat{y}_c)

where the subscript c denotes the class, y_c is the ground-truth label for class c, w_c represents the penalty weight, determined by f_c, f_c represents the frequency with which objects of class c occur in the data set, and \hat{y}_c represents the network's predicted probability for class c;
the total error of the network is calculated with the above loss function, the network parameters are updated by error backpropagation and stochastic gradient descent, and iteration continues until the loss function of the model converges, completing the training.
2. The 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective according to claim 1, characterized in that step (1) is implemented as follows:
(1.1) grid division under the view angle of the aerial view:
the grid is drawn under the bird's-eye view and divided in the x-y plane according to a set size; each point p in the point cloud carries features in the three dimensions x, y and z and is assigned, according to its x and y coordinates, to one of the voxels produced by the grid division; then, for every voxel that contains points, the mean of the coordinates of all points inside the voxel is computed and recorded as x_c, y_c, z_c, and the offset of each point from the voxel centre in x and y is computed and recorded as x_p and y_p; after this expansion, every point in the point cloud has a feature of at least D = 9 dimensions;
(1.2) converting the divided point cloud into a feature map:
the maximum number of points per voxel is limited to N, and the point cloud is converted into a (P, N, D) tensor; the features of each point are then mapped to a high-dimensional feature space with a simplified PointNet network, giving an output of shape (P, N, C); if richer point cloud features are desired, several such modules may be stacked; a max-pooling layer over the N dimension is added at the end of the module to keep the most salient feature among all points in each voxel, giving an output of shape (P, C); finally, the P voxels are scattered back to their original positions according to the previously recorded x-y grid indices, producing a feature map of shape (H, W, C), where H and W denote the height and width of the feature map; the positions of empty voxels are filled with zeros.
3. The 3D point cloud semantic segmentation method under a bird's-eye-view encoding perspective according to claim 2, characterized in that step (2) is implemented as follows:
(2.1) reconstructing residual module architecture:
the reconstructed residual module replaces the 3 × 3 convolutions of the conventional residual module with 1 × 3 and 3 × 1 factorized convolutions and enlarges the receptive field by adding dilated convolution; the size of the feature map is unchanged before and after the reconstructed residual module; by introducing dilated convolution into the convolution layers of the reconstructed residual module, the receptive field of the network is enlarged without changing the number of parameters, so the network sees a wider context and has stronger feature extraction and recognition ability; the skip connection of the residual module accelerates the training of deep networks and helps predict the details of targets;
(2.2) downsampling module architecture:
the downsampling module consists of a max-pooling layer with stride 2 and a convolution layer connected in parallel; the feature map produced by the pooling layer and the feature map produced by the convolution layer are concatenated along the feature-channel dimension C to form a new feature map, which is then output; since both the pooling layer and the convolution layer use stride 2, the size of the feature map is halved after the downsampling module and higher-dimensional features are extracted; processing the pooling layer and the convolution layer in parallel preserves the detail features of the network;
(2.3) overall network architecture:
the whole network consists of an encoder and a decoder; the encoder is composed of downsampling modules and reconstructed residual modules, and in a common configuration several reconstructed residual modules follow one downsampling module to extract point cloud features while reducing the network parameters; stacking several reconstructed residual modules extracts high-dimensional features more accurately and quickly; the decoder is composed of upsampling modules and reconstructed residual modules, where an upsampling module consists of a deconvolution layer whose purpose is to recover the spatial dimensions of the features and restore the size of the feature map; likewise, each upsampling module is followed by several reconstructed residual modules, so the network can recover feature details more finely and achieve pixel-level point cloud segmentation; after the decoder has restored the feature map to the size of the input feature map, two 1 × 1 convolution layers reduce the number of channels to the number of target classes n_class, and the softmax function normalizes the result into a probability distribution

\hat{y}_c = \exp(a_c) / \sum_{c'} \exp(a_{c'})

where a_c is the network output for class c and the subscript c denotes the different classes.
CN202010681588.7A 2020-07-15 2020-07-15 3D point cloud semantic segmentation method under aerial view coding visual angle Pending CN111862101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010681588.7A CN111862101A (en) 2020-07-15 2020-07-15 3D point cloud semantic segmentation method under aerial view coding visual angle


Publications (1)

Publication Number Publication Date
CN111862101A true CN111862101A (en) 2020-10-30

Family

ID=72983162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010681588.7A Pending CN111862101A (en) 2020-07-15 2020-07-15 3D point cloud semantic segmentation method under aerial view coding visual angle

Country Status (1)

Country Link
CN (1) CN111862101A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190096125A1 (en) * 2017-09-28 2019-03-28 Nec Laboratories America, Inc. Generating occlusion-aware bird eye view representations of complex road scenes
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
WO2020053611A1 (en) * 2018-09-12 2020-03-19 Toyota Motor Europe Electronic device, system and method for determining a semantic grid of an environment of a vehicle
CN109410307A (en) * 2018-10-16 2019-03-01 大连理工大学 A kind of scene point cloud semantic segmentation method
CN110879994A (en) * 2019-12-02 2020-03-13 中国科学院自动化研究所 Three-dimensional visual inspection detection method, system and device based on shape attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEX H. LANG等: ""PointPillars: Fast Encoders for Object Detection from Point Clouds"", 《HTTPS://ARXIV.ORG/ABS/1812.05784》 *
EDUARDO ROMERA等: ""Efficient ConvNet for Real-time Semantic Segmentation"", 《2017 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV)》 *
黄忠义等: ""Kinect点云的平面提取算法研究"", 《全球定位***》 *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330680A (en) * 2020-11-04 2021-02-05 中山大学 Lookup table-based method for accelerating point cloud segmentation
CN112330680B (en) * 2020-11-04 2023-07-21 中山大学 Method for accelerating point cloud segmentation based on lookup table
CN113759338B (en) * 2020-11-09 2024-04-16 北京京东乾石科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113759338A (en) * 2020-11-09 2021-12-07 北京京东乾石科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112488117A (en) * 2020-12-11 2021-03-12 南京理工大学 Point cloud analysis method based on direction-induced convolution
CN112488117B (en) * 2020-12-11 2022-09-27 南京理工大学 Point cloud analysis method based on direction-induced convolution
CN112560865A (en) * 2020-12-23 2021-03-26 清华大学 Semantic segmentation method for point cloud under outdoor large scene
CN112560865B (en) * 2020-12-23 2022-08-12 清华大学 Semantic segmentation method for point cloud under outdoor large scene
CN112465844A (en) * 2020-12-29 2021-03-09 华北电力大学 Multi-class loss function for image semantic segmentation and design method thereof
CN112731339A (en) * 2021-01-04 2021-04-30 东风汽车股份有限公司 Three-dimensional target detection system based on laser point cloud and detection method thereof
CN112818756A (en) * 2021-01-13 2021-05-18 上海西井信息科技有限公司 Target detection method, system, device and storage medium
CN112819833A (en) * 2021-02-05 2021-05-18 四川大学 Large scene point cloud semantic segmentation method
CN112819833B (en) * 2021-02-05 2022-07-12 四川大学 Large scene point cloud semantic segmentation method
CN113011317A (en) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN113256793A (en) * 2021-05-31 2021-08-13 浙江科技学院 Three-dimensional data processing method and system
CN113392842A (en) * 2021-06-03 2021-09-14 电子科技大学 Point cloud semantic segmentation method based on point data network structure improvement
CN113392841A (en) * 2021-06-03 2021-09-14 电子科技大学 Three-dimensional point cloud semantic segmentation method based on multi-feature information enhanced coding
CN113392841B (en) * 2021-06-03 2022-11-18 电子科技大学 Three-dimensional point cloud semantic segmentation method based on multi-feature information enhanced coding
CN113378756A (en) * 2021-06-24 2021-09-10 深圳市赛维网络科技有限公司 Three-dimensional human body semantic segmentation method, terminal device and storage medium
CN113378756B (en) * 2021-06-24 2022-06-14 深圳市赛维网络科技有限公司 Three-dimensional human body semantic segmentation method, terminal device and storage medium
CN113936139B (en) * 2021-10-29 2024-06-11 江苏大学 Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
CN113936139A (en) * 2021-10-29 2022-01-14 江苏大学 Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
CN114187310A (en) * 2021-11-22 2022-03-15 华南农业大学 Large-scale point cloud segmentation method based on octree and PointNet ++ network
CN114359902A (en) * 2021-12-03 2022-04-15 武汉大学 Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion
CN114359902B (en) * 2021-12-03 2024-04-26 武汉大学 Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion
CN114359562A (en) * 2022-03-20 2022-04-15 宁波博登智能科技有限公司 Automatic semantic segmentation and labeling system and method for four-dimensional point cloud
WO2023193400A1 (en) * 2022-04-06 2023-10-12 合众新能源汽车股份有限公司 Point cloud detection and segmentation method and apparatus, and electronic device
WO2023213083A1 (en) * 2022-05-05 2023-11-09 北京京东乾石科技有限公司 Object detection method and apparatus and driverless car
CN114638956A (en) * 2022-05-23 2022-06-17 南京航空航天大学 Whole airplane point cloud semantic segmentation method based on voxelization and three-view
US11836896B2 (en) 2022-05-23 2023-12-05 Nanjing University Of Aeronautics And Astronautics Semantic segmentation method for aircraft point cloud based on voxelization and three views
CN115035296A (en) * 2022-06-15 2022-09-09 清华大学 Flying vehicle 3D semantic segmentation method and system based on aerial view projection
CN115035296B (en) * 2022-06-15 2024-07-12 清华大学 Flying car 3D semantic segmentation method and system based on aerial view projection
WO2024015891A1 (en) * 2022-07-15 2024-01-18 The Regents Of The University Of California Image and depth sensor fusion methods and systems
CN114972763B (en) * 2022-07-28 2022-11-04 香港中文大学(深圳)未来智联网络研究院 Laser radar point cloud segmentation method, device, equipment and storage medium
CN114972763A (en) * 2022-07-28 2022-08-30 香港中文大学(深圳)未来智联网络研究院 Laser radar point cloud segmentation method, device, equipment and storage medium
CN115760886B (en) * 2022-11-15 2024-04-05 中国平安财产保险股份有限公司 Land parcel dividing method and device based on unmanned aerial vehicle aerial view and related equipment
CN115760886A (en) * 2022-11-15 2023-03-07 中国平安财产保险股份有限公司 Plot partitioning method and device based on aerial view of unmanned aerial vehicle and related equipment
CN116958557A (en) * 2023-08-11 2023-10-27 安徽大学 Three-dimensional indoor scene semantic segmentation method based on residual impulse neural network

Similar Documents

Publication Publication Date Title
CN111862101A (en) 3D point cloud semantic segmentation method under aerial view coding visual angle
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN109410307B (en) Scene point cloud semantic segmentation method
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN109410321A (en) Three-dimensional rebuilding method based on convolutional neural networks
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN110135227B (en) Laser point cloud outdoor scene automatic segmentation method based on machine learning
CN111028335B (en) Point cloud data block surface patch reconstruction method based on deep learning
CN110827295A (en) Three-dimensional semantic segmentation method based on coupling of voxel model and color information
CN116229057B (en) Method and device for three-dimensional laser radar point cloud semantic segmentation based on deep learning
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN112270332A (en) Three-dimensional target detection method and system based on sub-stream sparse convolution
CN115471634B (en) Modeling method and device for urban green plant twins
CN114549537A (en) Unstructured environment point cloud semantic segmentation method based on cross-modal semantic enhancement
CN112001293A (en) Remote sensing image ground object classification method combining multi-scale information and coding and decoding network
CN114373104A (en) Three-dimensional point cloud semantic segmentation method and system based on dynamic aggregation
CN113822825B (en) Optical building target three-dimensional reconstruction method based on 3D-R2N2
CN116402851A (en) Infrared dim target tracking method under complex background
CN115482268A (en) High-precision three-dimensional shape measurement method and system based on speckle matching network
Gu et al. Ue4-nerf: Neural radiance field for real-time rendering of large-scale scene
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
Chen et al. Ground 3D object reconstruction based on multi-view 3D occupancy network using satellite remote sensing image
CN116597071A (en) Defect point cloud data reconstruction method based on K-nearest neighbor point sampling capable of learning
CN110675381A (en) Intrinsic image decomposition method based on serial structure network
CN112785684B (en) Three-dimensional model reconstruction method based on local information weighting mechanism

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201030)