CN113554653A - Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration - Google Patents

Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Info

Publication number
CN113554653A
CN113554653A (application CN202110631495.8A)
Authority
CN
China
Prior art keywords
point cloud
point
attention
cloud data
mutual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110631495.8A
Other languages
Chinese (zh)
Inventor
Li Mengtian (李梦甜)
Xie Yuan (谢源)
Ma Lizhuang (马利庄)
Zhang Zhizhong (张志忠)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Zhejiang Lab
Original Assignee
East China Normal University
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University, Zhejiang Lab filed Critical East China Normal University
Priority to CN202110631495.8A
Publication of CN113554653A
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration. The method comprises two parallel attention mechanisms designed for large-scale 3D scenes, one over spatial positions and one over channels, both used to capture long-range global information. Unlike traditional non-local attention operators in 2D images, the proposed spatial attention supports large-scale 3D point cloud input. The invention then introduces a merging operation to enhance feature discrimination from a global perspective. In addition, the invention proposes an imbalance adjustment loss function, which makes the network pay more attention to the recognition of classes that occur with lower frequency, together with an occupancy regression loss function, which constrains the predicted number of points occupied by each semantic class. Both losses benefit the training process and encourage the network to produce scene segmentation results with better boundaries.

Description

Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
Technical Field
The invention relates to the technical field of semantic segmentation, in particular to a semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration.
Background
Large-scale 3D scene segmentation, which aims to assign a semantic class label to each point, has recently been studied extensively and actively and underpins a variety of challenging and meaningful applications (e.g. autonomous driving, robotics, and place recognition). To accomplish such tasks efficiently, we need to distinguish ambiguous shapes and parts and account for objects with different appearances. For example, if structural information and discriminative features are not well encoded in the embedding space, adjacent chairs and tables can easily be confused and grouped into a single class.
Existing methods adopt a design concept similar to the typical convolutional neural networks used on 2D images and are mainly designed to learn richer local structures and capture wider contextual information from point clouds. Despite the series of efforts made on common datasets, some problems remain to be solved.
First, although a global representation is obtained in deep learning models, the complex global relationships within point clouds have not been explicitly exploited, which is crucial for better segmentation. For example, wall and door regions are often indistinguishable, and components of tables and chairs may be confounded by their similar structures. It is necessary to enhance the discriminative power of the feature representation for point-level identification. The ELGS method borrows the channel and spatial attention mechanisms from 2D image tasks. The MPRM method replaces convolutions with 1 × 1 Conv without structural changes to the channel and spatial attention mechanisms of the 2D image task. The PT and PCT methods design transformer layers for point cloud processing; both use self-attention operators in place of the convolution operators in the neural network. However, all the above methods can only operate on a sub-cloud of the entire point cloud with a limited number of points and cannot process a large 3D scene point cloud. Secondly, existing 3D methods pay little attention to the inherent properties of real-world 3D data itself. On the one hand, point clouds collected from the real world typically exhibit an imbalanced or long-tailed label distribution, with a few common classes absolutely predominant in number, causing the model to be biased towards these predominant classes and to ignore the less numerous ones. For example, the wall and floor classes appear in almost every indoor scene, while in outdoor scenes the road and building classes occupy most positions. On the other hand, 3D data is intrinsic, with no occlusion or scale ambiguity, so the number of points of an object does not change in a 3D scene. In contrast, in a 2D image the same object is imaged as a different number of pixels depending on camera distance and angle. The pixels/points occupied by each object (expressed as occupancy) are unpredictable in 2D images but can be reliably predicted from 3D scenes. The recent RandLA-Net method effectively segments large point clouds while ignoring the long-tail distribution and imbalance problems of real-world point cloud data.
Disclosure of Invention
In order to overcome the defects of the prior art, a novel framework is provided, introducing a neighborhood refinement module for large-scale 3D scene segmentation together with two loss functions aimed at the balance problem of network training, so as to handle large-scale point cloud feature input and to avoid inter-class confusion and intra-class inconsistency in large-scale point cloud scene semantic segmentation. The technical scheme is as follows:
a semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration comprises the following steps:
S1, inputting large-scale 3D point cloud data;
S2, extracting point cloud features;
S3, acquiring spatial position attention supporting large-scale input;
S4, acquiring expanded channel position attention;
S5, performing feature fusion: splicing the feature maps output by the spatial position attention supporting large-scale input and by the expanded channel position attention, performing attention feature fusion, and up-sampling so that the scale of the output point cloud matches that of the input point cloud;
S6, constructing a joint loss function to force the neural network to learn the inherent attributes of the input points:

$$\mathcal{L}_{total} = \mathcal{L}_{ia} + \mathcal{L}_{occ} + \mathcal{L}_{ce}$$

where $\mathcal{L}_{total}$ represents the joint cost function; $\mathcal{L}_{ia}$ represents the imbalance adjustment loss function, used to compensate for the imbalanced, long-tailed label distribution; $\mathcal{L}_{occ}$ represents the occupancy regression loss function, used to regress the occupancy of the category to which each point belongs; and $\mathcal{L}_{ce}$ represents the cross-entropy loss function used for the final semantic segmentation prediction;
and S7, outputting the point cloud segmentation result.
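To make the overall pipeline concrete, the following is a minimal PyTorch sketch of the S1 to S7 forward pass; the submodule names used here (backbone, spatial/channel attention, fusion head) are placeholders for the components detailed in the steps below, not the patent's actual implementation:

```python
import torch
import torch.nn as nn

class LongTailSegNet(nn.Module):
    """Sketch of the S1-S7 pipeline; submodule internals are placeholders."""
    def __init__(self, backbone, spatial_attn, channel_attn, fusion_head):
        super().__init__()
        self.backbone = backbone          # S2: point cloud feature extraction (e.g. RandLA-Net)
        self.spatial_attn = spatial_attn  # S3: spatial attention supporting large-scale input
        self.channel_attn = channel_attn  # S4: expanded channel attention
        self.fusion_head = fusion_head    # S5: splice, fuse, and upsample to N points

    def forward(self, points):            # S1: points of shape (batch, channels, N)
        a = self.backbone(points)         # per-point feature map A
        g = self.spatial_attn(a)          # spatial attention map G
        e = self.channel_attn(a)          # channel attention map E
        return self.fusion_head(g, e)     # S5/S7: per-point class logits
```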
Further, the S3 includes the following steps:
S31, obtaining an N × C output feature map A from the feature extraction network and feeding A to two different 1 × 1 convolution layers to obtain different feature maps B and C, where N denotes the number of points and C the feature dimension;
S32, carrying out matrix multiplication, i.e. an attention operation serving feature enhancement, between the transposes of B and A, obtaining a C × 1 output matrix D, where the attention value on D is given by:

$$D = \sum_{j=1}^{N} \frac{\exp(B_j)}{\sum_{i=1}^{N} \exp(B_i)}\, A_j$$

where the indices i and j denote point i and point j respectively, $A_j$ denotes the j-th point in feature map A, and $B_i$ the i-th point in feature map B;
S33, feeding the transpose of D through two further 1 × 1 convolutional layers, expressed as a bottleneck transform:

$$F_j = \mathrm{ReLU}(\mathrm{LN}(D_j))$$

where LN denotes the normalization layer and ReLU is the activation function;
S34, performing matrix multiplication, again an attention operation serving feature enhancement, between D and C:

$$E_j = \sum_{i=1}^{C} D_i\, C_{j,i}$$

S35, weighting the aggregated feature maps E and F in the summation with two scale parameters α and β, both learnable parameters initialized to 0, generating the spatial attention map G, a further feature-enhancing attention operation:

$$G_j = \alpha E_j + \beta F_j + A_j$$
further, the S4 includes the following steps:
S41, performing matrix multiplication between the transposed A and the original A, obtaining a C × C channel attention map B;
S42, propagating the feature map C using matrix multiplication between B and the original A, an attention operation serving feature enhancement, expressed as:

$$C_j = \sum_{i=1}^{M} \frac{\exp(B_{i,j})}{\sum_{m=1}^{M} \exp(B_{m,j})}\, A_i$$

where M is the channel dimension, the subscripts i and j denote channel i and channel j, and $C_j$ denotes the j-th position in feature map C;
S43, defining a cross-channel operator to capture adjacent-channel relationships, realized with a 1 × 1 convolution of kernel size h; the local cross-channel interaction, an attention operation serving feature enhancement, is expressed as:

$$D_j = \sigma\Big(\sum_{i \in \Omega_j^{h}} W_i\, C_i\Big)$$

where W is an h × M parameter matrix, h denotes the number of adjacent channels covered in one stride (i.e. the kernel size), $\Omega_j^{h}$ denotes the h channels adjacent to channel j, and σ is the sigmoid function;
S44, setting the weighting parameter λ and generating the channel attention map E, represented as:

$$E_j = \lambda C_j + D_j + A_j$$
further, the imbalance adjustment loss function in S6
Figure BDA0003103898800000034
Using the minimum softmax cross entropy:
Figure BDA0003103898800000035
wherein theta represents parameters of the neural network, and (x, y) -D represent training data, wherein x represents data, y represents supervisory information, D represents distribution, and p representsθ(y | x) represents an unknown profile;
let fy(x; θ) is the result before the softmax function, i.e., logit, thus yielding:
Figure BDA0003103898800000041
wherein f isy(x; theta) represents the current parameter distribution of the neural network, and K represents the number of candidate semantic categories;
from the experience of observation, for example in a conference room, certain categories (e.g. table and chairs) often occur simultaneously; while other classes (e.g., sofas and columns) tend to avoid each other. To describe this, we need a number to represent "how many times the probability of two classes co-existing in the same 3D scene as they randomly encounter? "we describe this phenomenon using a point-by-point mutual information PMI, since it is a measure of the number of relationships between two random variables sampled at the same time, expressed as:
Figure BDA0003103898800000042
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, then both classes tend to occur simultaneously, otherwise, tend to avoid each other;
From the above discussion, the pointwise mutual information (PMI) is an effective measure that actually reveals the internal relationships between classes. Therefore, we let the model fit the PMI directly so that the network learns this more fundamental knowledge; we model the PMI as:

$$f_y(x;\theta) \sim \log \frac{p_{\theta}(y \mid x)}{p(y)}$$

and re-normalize it using the softmax function, expressed as:

$$\log p_{\theta}(y \mid x) \sim f_y(x;\theta) + \log p(y)$$

For generality, an adjustment factor τ is added, and the resulting imbalance adjustment loss function is expressed as:

$$\mathcal{L}_{ia} = -\log \frac{e^{f_y(x;\theta) + \tau \log p(y)}}{\sum_{k=1}^{K} e^{f_k(x;\theta) + \tau \log p(k)}}$$

The proposed imbalance adjustment loss applies a label-dependent offset to each logit; by embedding the PMI between scene semantics and introducing it into the segmentation task, it helps the network reduce inter-class confusion.
Further, the adjustment factor τ is 1.
Further, the occupancy regression loss function $\mathcal{L}_{occ}$ in S6 is motivated as follows. In 2D images, an object is imaged as a varying number of pixels depending on camera distance and angle, so the pixels/points occupied by each object (expressed as occupancy) are unpredictable. In contrast, 3D data is intrinsic, with no occlusion or scale ambiguity, so the number of points of an object does not change within a 3D scene; an object contains an approximately fixed number of points, and points with the same label tend toward a stable count, which we call the occupancy scale. For indoor and outdoor scenes we partition the point clouds using sub-grids of size 4 cm and 6 cm respectively, with the points in each sub-grid represented by a single labeled point (the center of gravity), a procedure similar to voxelizing the point cloud; we can then sample a fixed number of points (10^5) from each point cloud as input. Unlabeled points in the scene are not fed into the loss computation. The setting of $o_i$ therefore helps the network correct the data imbalance problem during training. In addition, for the point cloud datasets adopted in the experiments, the original setting is, from any perspective, that each labeled point carries exactly one label, while unlabeled points carry none. For the i-th point in the k-th semantic class, a positive value $o_i$ is predicted to indicate the number of points occupied by the current semantic class, and the average of $o_i$ is used as the expected occupancy of that class; for more reliable prediction we regress the logarithm of the point count rather than the raw value:

$$\mathcal{L}_{occ} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \left( o_i - \log N_k \right)^2$$

where $N_k$ is the number of points in the k-th semantic class and K the number of semantic classes. The occupancy regression loss regresses, for each point, the occupancy of its class, i.e. an inherent property of each 3D object class; it adjusts the proportion of each semantic class during training, which benefits the network by effectively preventing intra-class inconsistency.
Further, in S2, the input point cloud data is described as

$$P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times F}$$

an unordered set of raw points with F dimensions, where N is the number of points and $p_i$ is a feature vector comprising coordinates, colors, and labels.
Further, in S2, data enhancement is performed on the input point cloud data, including randomly shuffling the order of the points, randomly rotating the point cloud, and randomly rotating the spatial coordinates and normal vectors.
Further, in S43, h is set to 3.
Further, in S44, λ is set to 0.1, which was found to be most effective.
The invention has the advantages and beneficial effects that:
the neighborhood refinement module proposed by the invention comprises two types of attention blocks, supports spatial attention in a large scale, and pre-forms expanded channel attention in a parallel manner, and can process a large number of points (for example, 10 points) at a time5Whereas the conventional method treats 10 at most once4A number of input point clouds) without increasing computational complexity and time cost; the invention provides two loss functions, and the two loss functions jointly utilize the inherent long tail label distribution in a 3D scene to guide a network to solve intra-class inconsistency and inter-class confusion; the invention can train the network in an end-to-end mode and is superior to the traditional method in the aspects of efficiency and effectiveness; the model and the training loss function provided by the invention can achieve better effect on large-scale scene point cloud segmentation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a spatial location attention module supporting large-scale input according to the present invention.
FIG. 3 is a schematic diagram of an expanded channel position attention module of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1, the semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration comprises the following steps:
and step T1-1, inputting a large-scale 3D point cloud file.
The format of input data, the storage form of data is txt format, and the keywords are: xyz, rgb, label, the data dimensions of the key index are: xyz (40960, 3), rgb (40960, 3), label (40960, 1).
Step T1-2, point cloud feature extraction network.
We use RandLA-Net as the point cloud feature extraction network to process large-scale point clouds and generate a rich semantic representation for each point. The input data is described as

$$P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times F}$$

an unordered set of raw points with F dimensions, where N is the number of points and $p_i$ is a feature vector that may contain the 3D spatial coordinates (x, y, z), color (r, g, b), and label. We set F = 3 to use only the 3D coordinates as input. Considering that real-world point clouds may contain a very large number of samples, letting every point in each point set participate in the computation leads to high computational cost and a vanishing-gradient problem, since the weight/influence of each point on the others becomes very small. During training, the input point cloud is first randomly downsampled to 40960 points; the training epochs are set to 250, the batch size to 8, the learning rate to 0.001, the gradient-descent momentum to 0.9, and the optimizer to Adam. The number of neighbors of each point is set to 16. After the point cloud data is fed into the network, data enhancement is applied during training, including randomly shuffling the point order, randomly rotating the point cloud, and randomly rotating the x, y, z spatial coordinates and normal vectors.
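As an illustration of this input pipeline, a minimal NumPy sketch of the data enhancement just described (the choice of a z-axis rotation is an assumption made for concreteness; the patent does not fix the rotation axis):

```python
import numpy as np

def augment_point_cloud(xyz, normals=None, rng=None):
    """Shuffle point order and apply a random rotation to coordinates (and normals)."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(xyz))        # randomly shuffle the point order
    xyz = xyz[perm]
    theta = rng.uniform(0.0, 2.0 * np.pi)   # random rotation angle about the z-axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    xyz = xyz @ rot.T                        # rotate spatial coordinates
    if normals is not None:
        normals = normals[perm] @ rot.T      # rotate normal vectors consistently
    return xyz, normals
```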
Step T1-3, spatial position attention supporting large-scale input.
As shown in fig. 2, an N × C output feature map A (N the number of points, C the feature dimension) is acquired from the feature extraction network and fed to two different 1 × 1 convolution layers, producing feature maps B and C. Matrix multiplication between the transposes of B and A, an attention operation serving feature enhancement, then yields a C × 1 output matrix D. The attention value on D is computed as:

$$D = \sum_{j=1}^{N} \frac{\exp(B_j)}{\sum_{i=1}^{N} \exp(B_i)}\, A_j$$

where the indices i and j denote point i and point j respectively, $A_j$ denotes the j-th point in feature map A, and $B_i$ the i-th point in feature map B.
The transpose of D is then fed through two further 1 × 1 convolutional layers as a bottleneck transform:

$$F_j = \mathrm{ReLU}(\mathrm{LN}(D_j))$$

where LN denotes the normalization layer and ReLU is the activation function.
The matrix multiplication between D and C, again an attention operation serving feature enhancement, is expressed as:

$$E_j = \sum_{i=1}^{C} D_i\, C_{j,i}$$

Thereafter, in the summation process, the aggregated feature maps E and F are weighted using two scaling parameters α and β, both learnable parameters initialized to 0. We generate the spatial attention map G, a further feature-enhancing attention operation:

$$G_j = \alpha E_j + \beta F_j + A_j$$
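A minimal PyTorch sketch of this spatial attention block follows. It assumes, consistently with D being C × 1, that B is a single attention logit per point (which is what keeps memory linear in the number of points N); the layer widths and bottleneck ratio are illustrative, not the patent's exact settings:

```python
import torch
import torch.nn as nn

class LargeScaleSpatialAttention(nn.Module):
    """Sketch of step T1-3; shapes follow the text (A: N x C, D: C x 1)."""
    def __init__(self, c, r=4):
        super().__init__()
        self.to_b = nn.Conv1d(c, 1, 1)             # 1x1 conv -> per-point logits B
        self.to_c = nn.Conv1d(c, c, 1)             # 1x1 conv -> feature map C
        self.down = nn.Conv1d(c, c // r, 1)        # bottleneck transform, first 1x1 conv
        self.ln = nn.LayerNorm([c // r, 1])        # LN inside the bottleneck
        self.up = nn.Conv1d(c // r, c, 1)          # bottleneck transform, second 1x1 conv
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scale, initialized to 0
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, a):                          # a: (batch, c, n) feature map A
        w = torch.softmax(self.to_b(a), dim=-1)    # softmax over all n points
        d = torch.matmul(a, w.transpose(1, 2))     # D = sum_j softmax(B)_j A_j: (batch, c, 1)
        f = self.up(torch.relu(self.ln(self.down(d))))     # F = bottleneck(D): (batch, c, 1)
        e = torch.matmul(self.to_c(a).transpose(1, 2), d)  # E_j = sum_i C_ji D_i: (batch, n, 1)
        return self.alpha * e.transpose(1, 2) + self.beta * f + a  # G = aE + bF + A
```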
expanded tunnel position attention of step T1-4
As shown in fig. 3, we perform matrix multiplication directly between the transposed A and the original A, obtaining a C × C channel attention map B. We then propagate feature map C using matrix multiplication between B and the original A, an attention operation serving feature enhancement, expressed as:

$$C_j = \sum_{i=1}^{M} \frac{\exp(B_{i,j})}{\sum_{m=1}^{M} \exp(B_{m,j})}\, A_i$$

where M is the channel dimension, the subscripts i and j denote channel i and channel j, and $C_j$ denotes the j-th position in feature map C. We then define a cross-channel operator to capture adjacent-channel relationships, implemented with a 1 × 1 convolution of kernel size h. The local cross-channel interaction, an attention operation serving feature enhancement, can be expressed as:

$$D_j = \sigma\Big(\sum_{i \in \Omega_j^{h}} W_i\, C_i\Big)$$

where W is an h × M parameter matrix, h denotes the number of adjacent channels covered in one stride (i.e. the kernel size), and σ is the sigmoid function; we set h to 3.
Then, we set the weight parameter λ to 0.1 and generate the channel attention map E, represented as:

$$E_j = \lambda C_j + D_j + A_j$$
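A corresponding PyTorch sketch of the expanded channel attention is given below; applying the kernel-size-h convolution along the channel dimension of each point is an assumption about how the cross-channel operator is realized:

```python
import torch
import torch.nn as nn

class ExpandedChannelAttention(nn.Module):
    """Sketch of step T1-4: channel-affinity attention plus a local
    cross-channel 1D convolution of kernel size h (here h = 3)."""
    def __init__(self, h=3, lam=0.1):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, h, padding=h // 2, bias=False)  # weights over adjacent channels
        self.lam = lam                                              # weight parameter lambda

    def forward(self, a):                              # a: (batch, m, n), m channels
        b, m, n = a.shape
        affinity = torch.matmul(a, a.transpose(1, 2))  # channel attention map B: (batch, m, m)
        attn = torch.softmax(affinity, dim=-1)
        c = torch.matmul(attn, a)                      # propagated feature map C: (batch, m, n)
        x = c.transpose(1, 2).reshape(b * n, 1, m)     # each point's channel vector
        d = torch.sigmoid(self.conv(x))                # local cross-channel interaction D
        d = d.reshape(b, n, m).transpose(1, 2)         # back to (batch, m, n)
        return self.lam * c + d + a                    # E_j = lambda*C_j + D_j + A_j
```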
step T1-5 feature fusion.
In this step, the feature maps output by the spatial position attention supporting large-scale input of step T1-3 and by the expanded channel position attention of step T1-4 are first spliced together; attention feature fusion is then performed through one 1 × 1 Conv layer, and the point cloud is up-sampled through three 1 × 1 Conv (convolutional) layers so that the output point cloud scale equals the input point cloud scale.
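A minimal sketch of this fusion head, with the nearest-neighbour up-sampling back to the full input resolution elided (an assumption consistent with RandLA-Net-style decoders); layer widths are illustrative:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of step T1-5: splice G and E, fuse with one 1x1 conv,
    then apply three 1x1 convs producing per-point class logits."""
    def __init__(self, c, num_classes):
        super().__init__()
        self.fuse = nn.Conv1d(2 * c, c, 1)          # attention feature fusion
        self.head = nn.Sequential(
            nn.Conv1d(c, c, 1), nn.ReLU(),
            nn.Conv1d(c, c // 2, 1), nn.ReLU(),
            nn.Conv1d(c // 2, num_classes, 1),      # per-point class scores
        )

    def forward(self, g, e):                        # both (batch, c, n)
        return self.head(torch.relu(self.fuse(torch.cat([g, e], dim=1))))
```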
Step T1-6, joint loss function.
To jointly exploit the inherent properties of 3D real-world data, we design two effective losses that guide the network to learn the inherent attributes of the input points, namely the imbalance adjustment loss and the occupancy regression loss. The network is trained by minimizing the joint cost function:

$$\mathcal{L}_{total} = \mathcal{L}_{ia} + \mathcal{L}_{occ} + \mathcal{L}_{ce}$$

where $\mathcal{L}_{ia}$ is the imbalance adjustment loss function that compensates for the imbalanced, long-tailed label distribution of actual 3D scenes, $\mathcal{L}_{occ}$ is the occupancy regression loss function used to regress the occupancy of the category to which each point belongs, and $\mathcal{L}_{ce}$ is the conventional cross-entropy loss for the final semantic segmentation prediction.
We first analyze and define $\mathcal{L}_{ia}$. 3D point clouds collected from the real world typically exhibit an imbalanced or long-tailed label distribution. In this case, the batches sampled during training have little chance of containing the low-frequency classes compared to the high-frequency ones, which easily leads the model to ignore them, although in practice we usually care more about the recognition of the low-frequency classes. Considering K candidate semantic categories with training data (x, y) ∼ D, where x denotes the data, y the supervision information, D the distribution, and $p_{\theta}(y \mid x)$ the unknown conditional distribution, one conventionally minimizes the softmax cross entropy:

$$\mathop{\arg\min}_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[-\log p_{\theta}(y \mid x)\right]$$

where θ denotes the parameters of the neural network. Let $f_y(x; \theta)$, the output of the network under its current parameters, be the result before the softmax function, i.e. the logit, so we get:

$$-\log p_{\theta}(y \mid x) = -f_y(x;\theta) + \log \sum_{k=1}^{K} e^{f_k(x;\theta)}$$
from the experience of observation, for example in a conference room, certain categories (e.g. table and chairs) often occur simultaneously; while other classes (e.g., sofas and columns) tend to avoid each other. To describe this, we need a number to represent "how many times the probability of two classes co-existing in the same 3D scene as they randomly encounter? ". We describe this phenomenon using point-by-Point Mutual Information (PMI) because it is a quantity that measures the relationship between two random variables sampled at the same time, expressed as:
Figure BDA0003103898800000084
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, it means that the two classes tend to occur simultaneously, and conversely, they tend to avoid each other.
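As a toy illustration of the PMI (the frequencies below are invented for the example, not measured from any dataset):

```python
import math

# Hypothetical scene statistics: tables appear in 60% of scenes, chairs in
# 50%, and the two co-occur in 45% of scenes.
p_table, p_chair, p_joint = 0.60, 0.50, 0.45
pmi = math.log(p_joint / (p_table * p_chair))  # log(0.45 / 0.30) ~ 0.405
print(f"PMI(table, chair) = {pmi:.3f}")        # > 0: the classes tend to co-occur
```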
From the above discussion, the pointwise mutual information (PMI) is an effective measure that actually reveals the internal relationships between classes. Therefore, we let the model fit the PMI directly so that the network learns this more fundamental knowledge. We model the PMI as:

$$f_y(x;\theta) \sim \log \frac{p_{\theta}(y \mid x)}{p(y)}$$

Then we re-normalize it using the softmax function, as:

$$\log p_{\theta}(y \mid x) \sim f_y(x;\theta) + \log p(y)$$

For generality, we add an adjustment factor τ (set to 1), and the resulting imbalance adjustment loss is expressed as:

$$\mathcal{L}_{ia} = -\log \frac{e^{f_y(x;\theta) + \tau \log p(y)}}{\sum_{k=1}^{K} e^{f_k(x;\theta) + \tau \log p(k)}}$$

The proposed imbalance adjustment loss applies a label-dependent offset to each logit. By embedding the PMI between scene semantics and introducing it into the segmentation task, it helps the network reduce inter-class confusion.
We next analyze and define $\mathcal{L}_{occ}$. In a 2D image, an object is imaged as a varying number of pixels depending on camera distance and angle, so the pixels/points occupied by each object (expressed as occupancy) are unpredictable. In contrast, 3D data is intrinsic, with no occlusion or scale ambiguity, so the number of points of an object does not change in the 3D scene. This means that an object contains an approximately fixed number of points. As a result, points with the same label tend toward a stable count, which we call the occupancy scale.
For indoor and outdoor scenes, we partition the point cloud using sub-grids of size 4 cm and 6 cm respectively, leaving the points in each small grid represented by a single labeled point (the center of gravity). This step is similar to voxelizing a point cloud. We can sample a fixed number of points, 10^5, from each point cloud as input, which in our experimental setup is 40960. Unlabeled points in the scene are not fed into the loss function. The setting of $o_i$ therefore helps the network correct the data imbalance problem during training. Furthermore, for the point cloud datasets used in our experiments, the original setup is, from any perspective, that each labeled point carries exactly one label, while unlabeled points carry none.
For the i-th point in the k-th semantic class, we predict a positive value $o_i$ indicating the number of points occupied by the current semantic class; the average of $o_i$ is then used as the expected occupancy of that class. For more reliable prediction, we regress the logarithm instead of the raw value:

$$\mathcal{L}_{occ} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \left( o_i - \log N_k \right)^2$$

where $N_k$ is the number of points in the k-th semantic class and K the number of semantic classes. The proposed occupancy regression loss regresses, for each point, the occupancy of its class, an inherent property of each 3D object class. It adjusts the proportion of each semantic category during training, which benefits the network by effectively preventing intra-category inconsistency.
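A sketch of the occupancy regression loss, assuming a squared-error penalty between each predicted $o_i$ and log N_k (the text specifies regressing the logarithm of the point count but does not spell out the exact penalty form):

```python
import torch

def occupancy_regression_loss(occ_pred, labels, num_classes):
    """Regress each point's predicted occupancy o_i toward log N_k, the
    log point count of its semantic class (the 'occupancy scale').

    occ_pred: (num_points,) predicted positive occupancy values o_i
    labels:   (num_points,) class index per point (unlabeled points excluded)
    """
    loss, present = occ_pred.new_zeros(()), 0
    for k in range(num_classes):
        mask = labels == k
        n_k = mask.sum()
        if n_k == 0:                              # skip classes absent from this batch
            continue
        target = torch.log(n_k.float())           # log N_k
        loss = loss + ((occ_pred[mask] - target) ** 2).mean()
        present += 1
    return loss / max(present, 1)                 # average over classes present
```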
Step T1-7, outputting the large-scale 3D point cloud segmentation result, i.e. the point cloud semantic segmentation result predicted by the model.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration, characterized by comprising the following steps:
S1, inputting point cloud data;
S2, extracting point cloud features;
S3, acquiring spatial position attention supporting large-scale input;
S4, acquiring expanded channel position attention;
S5, performing feature fusion: splicing the feature maps output by the spatial position attention supporting large-scale input and by the expanded channel position attention, performing attention feature fusion, and up-sampling so that the scale of the output point cloud matches that of the input point cloud;
S6, constructing a joint loss function to force the neural network to learn the inherent attributes of the input points:

$$\mathcal{L}_{total} = \mathcal{L}_{ia} + \mathcal{L}_{occ} + \mathcal{L}_{ce}$$

where $\mathcal{L}_{total}$ represents the joint cost function; $\mathcal{L}_{ia}$ represents the imbalance adjustment loss function, used to compensate for the imbalanced, long-tailed label distribution; $\mathcal{L}_{occ}$ represents the occupancy regression loss function, used to regress the occupancy of the category to which each point belongs; and $\mathcal{L}_{ce}$ represents the cross-entropy loss function used for the final semantic segmentation prediction;
and S7, outputting the point cloud segmentation result.
2. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein S3 comprises the following steps:
S31, obtaining an output feature map A from the feature extraction network, and obtaining different feature maps B and C from A, where N denotes the number of points;
S32, carrying out matrix multiplication between the transposes of B and A to obtain an output matrix D, where the attention value on D is given by:

$$D = \sum_{j=1}^{N} \frac{\exp(B_j)}{\sum_{i=1}^{N} \exp(B_i)}\, A_j$$

where the indices i and j denote point i and point j respectively, $A_j$ denotes the j-th point in feature map A, and $B_i$ the i-th point in feature map B;
S33, representing the transpose of D as a bottleneck transform:

$$F_j = \mathrm{ReLU}(\mathrm{LN}(D_j))$$

where LN denotes the normalization layer and ReLU is the activation function;
S34, performing matrix multiplication between D and C, expressed as:

$$E_j = \sum_{i=1}^{C} D_i\, C_{j,i}$$

S35, weighting the aggregated feature maps E and F using the scale parameters α and β, both learnable parameters initialized to 0, to generate the spatial attention map G:

$$G_j = \alpha E_j + \beta F_j + A_j$$
3. the method for semantic segmentation based on mutual information calibration point cloud data long-tail distribution according to claim 1, wherein the step S4 comprises the following steps:
s41, performing matrix multiplication between the transposed A and the original A to obtain an attention graph B;
s42, propagating signature C using matrix multiplication between B and original a, represented as:
Figure FDA0003103898790000021
where M is the channel dimension, and the subscripts i, j denote channel i and channel j, CjRepresents the j-th position in the characteristic diagram C;
s43, defining a cross-channel operator to capture the adjacent channel relation, and performing local cross-channel interaction, wherein the local cross-channel interaction is represented as:
Figure FDA0003103898790000022
where W is an h M parameter matrix, h represents the adjacent channel in stride, and σ is an S-type function.
S44, setting the weight parameter λ, and generating a channel attention map E:
Ej=λCj+Dj+Aj
4. the method of claim 1, wherein the imbalance adjustment loss function in S6 is a semantic segmentation method based on mutual information calibration point cloud data long tail distribution
Figure FDA0003103898790000023
Using the minimum softmax cross entropy:
Figure FDA0003103898790000024
wherein theta represents parameters of the neural network, and (x, y) -D represent training data, wherein x represents data, y represents supervisory information, D represents distribution, and p representsθ(y | x) represents an unknown profile;
let fy(x; θ) is the result before the softmax function, i.e., logit, thus yielding:
Figure FDA0003103898790000025
wherein f isy(x; theta) represents the current parameter distribution of the neural network, and K represents the number of candidate semantic categories;
the relation between two random variables sampled simultaneously is measured by adopting point-by-point mutual information PMI, and is expressed as follows:
Figure FDA0003103898790000026
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, then both classes tend to occur simultaneously, otherwise, tend to avoid each other;
PMI was modeled and expressed as:
Figure FDA0003103898790000031
it is re-normalized using the softmax function, expressed as:
log pθ(y∣x)~fy(x;θ)+log p(y)
adding the adjustment factor τ, the resulting imbalance adjustment loss function is expressed as:
Figure FDA0003103898790000032
5. the method of claim 4, wherein the adjustment factor τ is 1.
6. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein for the occupancy regression loss function $\mathcal{L}_{occ}$ in S6, a positive value $o_i$ is predicted for the i-th point in the k-th semantic class to indicate the number of points occupied by the current semantic class, the average of $o_i$ is used as the expected occupancy of the current semantic class, and the logarithm of the point count is regressed:

$$\mathcal{L}_{occ} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \left( o_i - \log N_k \right)^2$$

where $N_k$ is the number of points in the k-th semantic class and K denotes the number of semantic categories; the occupancy regression loss regresses, for each point, the occupancy of its class, i.e. an inherent attribute of each class of object.
7. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein in S2 the input point cloud data is described as

$$P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times F}$$

an unordered set of raw points with F dimensions, where N is the number of points and $p_i$ is a feature vector comprising coordinates, colors, and labels.
8. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein in S2 the input point cloud data is subjected to data enhancement, including randomly shuffling the point order, randomly rotating the point cloud, and randomly rotating the spatial coordinates and normal vectors.
9. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 3, wherein in S43, h is set to 3.
10. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 3, wherein in S44, λ is 0.1.
CN202110631495.8A 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration Pending CN113554653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110631495.8A CN113554653A (en) 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110631495.8A CN113554653A (en) 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Publications (1)

Publication Number Publication Date
CN113554653A true CN113554653A (en) 2021-10-26

Family

ID=78130320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110631495.8A Pending CN113554653A (en) 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Country Status (1)

Country Link
CN (1) CN113554653A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023098807A1 (en) * 2021-12-03 2023-06-08 维沃移动通信有限公司 Point cloud encoding processing method and apparatus, point cloud decoding processing method and apparatus, and encoding device and decoding device
CN114638336A (en) * 2021-12-26 2022-06-17 海南大学 Unbalanced learning focusing on strange samples
CN114638336B (en) * 2021-12-26 2023-09-22 海南大学 Unbalanced learning focused on strange samples

Similar Documents

Publication Publication Date Title
US10733431B2 (en) Systems and methods for optimizing pose estimation
US10796452B2 (en) Optimizations for structure mapping and up-sampling
CN112308200B (en) Searching method and device for neural network
CN109902798A (en) The training method and device of deep neural network
CN111507378A (en) Method and apparatus for training image processing model
CN109493303A Image defogging method based on a generative adversarial network
CN106845471A Visual saliency prediction method based on a generative adversarial network
JP2019082978A Skip-architecture neural network device and method for improved semantic segmentation
CN113705769A (en) Neural network training method and device
CN110070107A (en) Object identification method and device
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN110222717A (en) Image processing method and device
CN109711401A Text detection method for natural scene images based on Faster RCNN
CN113095254B (en) Method and system for positioning key points of human body part
WO2021103731A1 (en) Semantic segmentation method, and model training method and apparatus
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113011562A (en) Model training method and device
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
WO2022111387A1 (en) Data processing method and related apparatus
CN112580720A (en) Model training method and device
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
CN115512251A (en) Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement
Luvizon et al. SSP-Net: Scalable sequential pyramid networks for real-Time 3D human pose regression
CN113706544A (en) Medical image segmentation method based on complete attention convolution neural network
CN113066018A (en) Image enhancement method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination