CN113554653A - Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration - Google Patents
Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
- Publication number: CN113554653A (application CN202110631495.8A)
- Authority: CN (China)
- Prior art keywords: point cloud, point, attention, cloud data, mutual information
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/10 — Segmentation; Edge detection
- G06F18/2415 — Classification techniques based on parametric or probabilistic models
- G06F18/2431 — Classification techniques, multiple classes
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06T3/4038 — Image mosaicing, e.g. composing plane images from plane sub-images
- G06T5/50 — Image enhancement or restoration using two or more images
- G06T2207/10028 — Range image; Depth image; 3D point clouds
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; Image merging
Abstract
The invention discloses a semantic segmentation method for the long-tail distribution of point cloud data, based on mutual information calibration. The method comprises two parallel attention modules designed for large-scale 3D scenes: one operates over spatial positions and the other over channels, and both capture long-range global information. Unlike traditional non-local attention operators from 2D images, the proposed spatial attention supports large-scale input of 3D point clouds. A merging operation is then introduced to enhance feature discrimination from a global perspective. The invention further proposes an imbalance adjustment loss function, which makes the network pay more attention to the recognition of classes with a low frequency of occurrence, together with an occupancy regression loss function that constrains the predicted number of points occupied by each semantic class. Both losses benefit the training process and encourage the network to produce scene segmentation results with better boundaries.
Description
Technical Field
The invention relates to the technical field of semantic segmentation, and in particular to a semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration.
Background
Large-scale 3D scene segmentation, which aims to assign a semantic class label to each point, has recently been studied extensively and actively, and serves a variety of challenging and meaningful applications (e.g. autonomous driving, robotics, and place recognition). To accomplish such tasks efficiently, we need to distinguish ambiguous shapes and parts and account for objects with varying appearance. For example, if structural information and discriminative features are not well encoded in the embedding space, adjacent chairs and tables can easily be confused and grouped into a single class.
The existing method adopts a design concept similar to a typical convolutional neural network in a 2D image, and is mainly provided for learning richer local structures and capturing more extensive context information of point cloud. Despite the series of efforts that have been made on common data sets, there are still some problems to be solved.
First, although a global representation is obtained in deep learning models, the complex global relationships between points have not been explicitly exploited, which is crucial for better segmentation. For example, wall and door regions are often indistinguishable, and parts of tables and chairs may be confused because of their similar structures. It is necessary to enhance the discriminative power of the feature representation for point-level identification. The ELGS method borrows channel and spatial attention mechanisms from 2D image tasks. The MPRM method replaces convolutions with 1 × 1 Conv, without structural changes to the channel and spatial attention mechanisms of 2D image tasks. The PT and PCT methods design transformer layers for point cloud processing; both use self-attention operators in place of convolution operators in the neural network. However, all of the above methods can only operate on a sub-cloud of the whole point cloud with a limited number of points, and cannot process a large 3D scene point cloud.

Second, existing 3D methods pay little attention to the inherent properties of real-world 3D data itself. On the one hand, point clouds collected from the real world typically exhibit an imbalanced or long-tailed label distribution, with a few common classes absolutely dominant in number; this biases the model toward the dominant classes while the rarer classes are ignored. For example, walls and floors appear in almost every indoor scene, while roads and buildings occupy most positions in outdoor scenes. On the other hand, 3D data is metric, with no occlusion or scale ambiguity, so the number of points of an object does not change within the 3D scene. In contrast, in a 2D image the same object is imaged as a different number of pixels depending on camera distance and angle.
The occupied pixels/points per object (its occupancy) are thus unpredictable in 2D images, but can be reliably predicted from 3D scenes. The recent RandLA-Net method segments large point clouds effectively, but ignores the long-tailed, imbalanced label distribution of real-world point cloud data.
Disclosure of Invention
To overcome the shortcomings of the prior art, a novel framework is provided, comprising a neighborhood refinement module for large-scale 3D scene segmentation and two loss functions addressing the imbalance problem in network training. The framework can process large-scale point cloud input and avoids inter-class confusion and intra-class inconsistency in large-scale point cloud scene semantic segmentation. The technical scheme is as follows:
a semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration comprises the following steps:
s1, inputting large-scale 3D point cloud data;
s2, extracting point cloud features;
s3, acquiring spatial position attention supporting large-scale input;
s4, acquiring expanded channel position attention;
s5, performing feature fusion, namely splicing feature maps supporting large-scale input space position attention and expanded channel position attention output, performing attention feature fusion, and performing up-sampling to enable the scale of the output point cloud to be equivalent to that of the input point cloud;
s6, constructing a joint loss function, and forcing the neural network to learn the inherent attributes of the input points:
a joint cost function is represented that is,representing an imbalance adjustment loss function for performing imbalance adjustment of the imbalance and long tail label distribution,expressing an occupancy rate regression loss function for regressing the occupancy size of the category to which each point belongs,representing a cross entropy loss function for final semantic segmentation prediction;
and S7, outputting the point cloud segmentation result.
Further, the S3 includes the following steps:
s31, obtaining N × C output feature maps A from the feature extraction network, and feeding A to two different 1 × 1 convolution layers to obtain different feature maps B and C, wherein N represents points and C represents dimensions;
s32, carrying out matrix multiplication, namely attention operation, between transposes of B and A, playing a role of characteristic enhancement, obtaining a C' × 1 output matrix D, wherein the attention value formula on D is as follows:
the indices i and j denote point i and point j, A, respectivelyjIndicates the position of the jth point in the feature map A, BiRepresenting the position of the ith point in the feature map B;
s33, converting the transpose of D into two other 1 × 1 convolutional layers, which are represented as bottleneck conversions:
Fj=ReLU(LN(Dj))
where LN represents the normalization layer and ReLU is the activation function;
s34, the matrix multiplication between fig. D and C, i.e. the attention operation, has the effect of feature enhancement, and is expressed as:
s35, weighting the aggregated feature maps E and F by using two proportional parameters α and β in the summation process, where α and β are learnable parameters initialized to 0, and we generate the spatial attention map G, namely the attention operation, which has the effect of feature enhancement:
Gj=αEj+βFj+Aj。
further, the S4 includes the following steps:
s41, performing matrix multiplication between the transposed A and the original A, wherein, we obtain a C × C channel attention diagram B;
s42, propagating the feature map C by using matrix multiplication between B and original A, and utilizing an attention operation to achieve the effect of feature enhancement, wherein the effect is expressed as:
wherein M is the channel dimensionThe subscripts i, j denote channel i and channel j, CjRepresents the j-th position in the characteristic diagram C;
s43, defining a cross-channel operator to capture the adjacent channel relationship, using 1 × 1 convolution with kernel size h to realize, then, local cross-channel interaction, namely attention operation, plays a role of feature enhancement, and is expressed as:
where W is an h M parameter matrix, h represents the adjacent channel (i.e., kernel size) in stride, and σ is a sigmoid function (i.e., sigmoid function).
S44, setting the weighting parameter λ and generating the channel attention map E, represented as:
E_j = λC_j + D_j + A_j.
Further, the imbalance adjustment loss function in S6 is constructed as follows: θ represents the parameters of the neural network; (x, y) ~ D represents the training data, where x is the data, y the supervisory information, and D the distribution; p_θ(y | x) represents the unknown distribution;
let f_y(x; θ) be the result before the softmax function, i.e. the logit, thus yielding:
where f_y(x; θ) is the output of the neural network under the current parameters and K is the number of candidate semantic classes;
from the experience of observation, for example in a conference room, certain categories (e.g. table and chairs) often occur simultaneously; while other classes (e.g., sofas and columns) tend to avoid each other. To describe this, we need a number to represent "how many times the probability of two classes co-existing in the same 3D scene as they randomly encounter? "we describe this phenomenon using a point-by-point mutual information PMI, since it is a measure of the number of relationships between two random variables sampled at the same time, expressed as:
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, then both classes tend to occur simultaneously, otherwise, tend to avoid each other;
from the above discussion, point-by-Point Mutual Information (PMI) is an effective measure, actually revealing the internal relationships between classes. Therefore, let the model fit directly to the PMI for the network to learn more basic knowledge, we model the PMI and express it as:
it is re-normalized using the softmax function, expressed as:
log pθ(y∣x)~fy(x;θ)+log p(y)
for the sake of generalization, adding a tuning factor τ, the resulting imbalance tuning loss function is expressed as:
the imbalance adjustment penalty we propose applies a tag-dependent offset to each logit; by embedding PMIs between scene semantics and introducing them into segmentation tasks, the network can be helped to reduce inter-class confusion problems.
Further, the adjustment factor τ is 1.
Further, regarding the occupancy regression loss function L_occ in S6: in 2D images, an object is imaged as a different number of pixels depending on camera distance and angle, so the occupied pixels/points per object (its occupancy) are unpredictable; 3D data, in contrast, is metric, with no occlusion or scale ambiguity, so the number of points of an object is unchanged in a 3D scene. This means an object contains an approximately fixed number of points; as a result, points with the same label tend to have a stable count, which we call the occupancy scale. For indoor and outdoor scenes we divide the point cloud using sub-grids of size 4 cm and 6 cm respectively, with the points in each sub-grid represented by a single labelled point (the center of gravity), a procedure similar to voxelizing the point cloud; we can then sample a fixed number of points (on the order of 10^5) from each point cloud as input. Unlabelled points in the scene are not put into the loss function, so the setting of o_i helps the network correct the data-imbalance problem during training; in addition, for the point cloud datasets used in the experiments, the original setting is that each labelled point has exactly one label and unlabelled points have none. For the i-th point in the k-th semantic class, a positive value o_i is predicted to indicate the number of points occupied by the current semantic class, and the mean of o_i is used as the expected occupancy of that class; for more reliable prediction, the logarithm of the point count, rather than the raw value, is regressed:
where N_k is the number of points in the k-th semantic class and K is the number of semantic classes. The occupancy regression loss regresses the occupancy of the class each point belongs to, i.e. an inherent property of each 3D object class, and adjusts the proportion of each semantic class during training, which benefits the network by effectively preventing intra-class inconsistency.
Further, in S2, the input point cloud data is described as an original unordered set of points with F dimensions, where N is the number of points and p_i is a feature vector comprising coordinates, colors, and labels.
Further, in S2, data enhancement is performed on the input point cloud data, including randomly shuffling the point order, randomly rotating the point cloud, and randomly rotating the spatial coordinates and normal vectors.
Further, in S43, h is set to 3.
Further, in S44, λ is set to 0.1, which proved most effective.
The invention has the advantages and beneficial effects that:
the neighborhood refinement module proposed by the invention comprises two types of attention blocks, supports spatial attention in a large scale, and pre-forms expanded channel attention in a parallel manner, and can process a large number of points (for example, 10 points) at a time5Whereas the conventional method treats 10 at most once4A number of input point clouds) without increasing computational complexity and time cost; the invention provides two loss functions, and the two loss functions jointly utilize the inherent long tail label distribution in a 3D scene to guide a network to solve intra-class inconsistency and inter-class confusion; the invention can train the network in an end-to-end mode and is superior to the traditional method in the aspects of efficiency and effectiveness; the model and the training loss function provided by the invention can achieve better effect on large-scale scene point cloud segmentation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a spatial location attention module supporting large-scale input according to the present invention.
FIG. 3 is a schematic diagram of an expanded channel position attention module of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1, the semantic segmentation method based on mutual information calibration point cloud data long tail distribution includes the following steps:
and step T1-1, inputting a large-scale 3D point cloud file.
The input data is stored in txt format with the keywords xyz, rgb, and label; the data dimensions of the keyword indices are xyz (40960, 3), rgb (40960, 3), and label (40960, 1).
Step T1-2, point cloud feature extraction network.
We use RandLA-Net as the point cloud feature extraction network to process large-scale point clouds and generate a rich semantic representation for each point. The input data is an original unordered set of points with F dimensions, where N is the number of points and p_i is a feature vector that may contain 3D spatial coordinates (x, y, z), color (r, g, b), and a label. We set F = 3 to use only the 3D coordinates as input. Since the number of samples in a real-world point cloud may be very large, letting every point in each point set participate in the computation leads to high computational cost and to vanishing gradients, because the weight/influence of each point on the others becomes very small. During training, the input point cloud is first randomly downsampled to 40960 points; the number of training epochs is set to 250, the batch size to 8, the learning rate to 0.001, the gradient-descent momentum to 0.9, and the optimizer to Adam. The number of neighbor nodes of each point is set to 16. After the point cloud data is fed into the network, data enhancement is applied during training: randomly shuffling the point order, randomly rotating the point cloud, and randomly rotating the x, y, z spatial coordinates and normal vectors.
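The data-enhancement steps above can be sketched in a few lines of numpy. Rotation about the vertical (z) axis is an assumption on our part; the text only says the point cloud and its coordinates are randomly rotated:

```python
import numpy as np

def augment_point_cloud(points, seed=None):
    """Randomly shuffle point order and rotate coordinates about the z-axis.

    `points` is an (N, 3) array of xyz coordinates.
    """
    rng = np.random.default_rng(seed)
    points = points[rng.permutation(len(points))]   # random point order
    theta = rng.uniform(0.0, 2.0 * np.pi)           # random rotation angle
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T                           # rotated coordinates
```

When normal vectors are part of the input features, the same rotation matrix would be applied to them as well, as the text describes.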
Step T1-3, spatial position attention supporting large-scale input.
As shown in FIG. 2, an N × C' output feature map A (N is the number of points, C' the feature dimension) is acquired from the feature extraction network and fed to two different 1 × 1 convolution layers, yielding feature maps B and C. Matrix multiplication (an attention operation that enhances the features) is then performed between the transpose of B and A, obtaining a C' × 1 output matrix D. The attention value on D is given by:
where the indices i and j denote points i and j respectively, A_j is the position of the j-th point in feature map A, and B_i the position of the i-th point in feature map B.
The transpose of D is then fed to two further 1 × 1 convolutional layers as a bottleneck transformation:
F_j = ReLU(LN(D_j))
where LN denotes the normalization layer and ReLU is the activation function.
Matrix multiplication between D and C (an attention operation that enhances the features) is expressed as:
Then, in the summation, the aggregated feature maps E and F are weighted using two scale parameters α and β, where α and β are learnable parameters initialized to 0. We generate the spatial attention map G (an attention operation that enhances the features):
G_j = αE_j + βF_j + A_j
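The spatial-attention branch can be sketched loosely in numpy. The exact shapes and layer composition are assumptions (the figures are not reproduced in this text); `Wb` and `Wf` stand in for the learned 1 × 1 convolution weights, layer normalization is omitted, and the two scale parameters are folded into one. The sketch only illustrates the property the text claims: attention flows through a compact C' × C global descriptor instead of an N × N affinity matrix, so memory and time stay linear in the number of points N:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(A, Wb, Wf, alpha=0.1):
    """Sketch of large-scale spatial attention.

    A  : (N, C) point features from the extractor.
    Wb : (C, C') assumed 1x1-conv weights producing feature map B.
    Wf : (C, C)  assumed bottleneck weights (LN omitted).
    """
    B = A @ Wb                      # (N, C') projected feature map B
    w = softmax(B, axis=0)          # attention weights over the N points
    D = w.T @ A                     # (C', C) compact global descriptor
    F = np.maximum(0.0, D @ Wf)     # bottleneck transform with ReLU
    E = softmax(B, axis=1) @ F      # (N, C) context redistributed to points
    return alpha * E + A            # residual, echoing G = aE + bF + A
```

Nothing here ever materializes an N × N matrix, which is why such a branch can accept far larger N than classical non-local attention.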
expanded tunnel position attention of step T1-4
As shown in fig. 3, we perform the matrix multiplication directly between transposed a and original a. Here we get a C × C channel attention B, then we propagate feature map C using matrix multiplication between B and original a, using the attention operation, with the effect of feature enhancement, expressed as:
where M is the channel dimension, and the subscripts i, j denote channel i and channel j, CjRepresenting the jth position in the signature C. Then we define the cross-channel operator to capture the adjacent channel relations, implemented with a 1 × 1 convolution of kernel size h. Then, the local cross-channel interaction, namely the attribute operation, has the effect of feature enhancement, and can be expressed as:
where W is an hXM parameter matrix, h denotes the adjacent channel in stride (i.e., kernel size), we set h to 3, and σ is a sigmoid function.
Then, setting the weight parameter λ to 0.1, the channel attention map E is generated as:
E_j = λC_j + D_j + A_j
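A rough numpy sketch of the channel branch follows, under stated assumptions: the C × C map B = AᵀA is crudely normalized, and the local cross-channel convolution of kernel size h uses an averaging kernel as a stand-in for the learned parameter matrix W:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(A, lam=0.1, h=3):
    """Sketch of expanded channel attention: E = lam*Cmap + D + A.

    A : (N, C) point features. The normalization of B and the fixed
    averaging kernel are assumptions for illustration only.
    """
    N, C = A.shape
    B = A.T @ A                                  # (C, C) channel affinity map
    B = B / np.abs(B).max()                      # crude normalization (assumed)
    Cmap = A @ B.T                               # (N, C) propagated feature map C
    pad = h // 2
    padded = np.pad(Cmap, ((0, 0), (pad, pad)), mode="edge")
    kernel = np.full(h, 1.0 / h)                 # stand-in for learned W
    D = np.stack([padded[:, j:j + h] @ kernel for j in range(C)], axis=1)
    D = sigmoid(D)                               # local cross-channel gating
    return lam * Cmap + D + A                    # E = lam*C + D + A
```

Because the affinity matrix is C × C rather than N × N, this branch is also linear in the number of points.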
step T1-5 feature fusion.
In this step, feature maps supporting large-scale input spatial position attention in the step T1-3 and output of expanded channel position attention in the step T1-4 are first stitched together, then attention feature fusion is performed through a layer of 1 × 1Conv, and then point clouds are up-sampled through three layers of 1 × 1Conv (convolutional layers) so that the output point cloud scale is equal to the input point cloud scale.
Step T1-6, the joint loss function.
To jointly exploit the inherent properties of 3D real-world data, we design two effective losses that guide the network to learn the inherent attributes of the input points: an imbalance adjustment loss and an occupancy regression loss. The network is trained by minimizing the joint cost function
L = L_ia + L_occ + L_ce
where L_ia is the imbalance adjustment loss function, which handles the imbalanced, long-tailed label distribution of real 3D scenes; L_occ is the occupancy regression loss function, used to regress the occupancy of the class each point belongs to; and L_ce is the conventional cross-entropy loss for the final semantic segmentation prediction.
We first analyze and define L_ia. 3D point clouds collected from the real world typically exhibit an imbalanced or long-tailed label distribution. In this case, batches sampled during training have little chance of containing the low-frequency classes compared with the high-frequency ones, which easily leads the model to ignore them, although in practice we usually care more about recognizing exactly these low-frequency classes. Consider K candidate semantic classes with training data (x, y) ~ D, where x is the data, y the supervisory information, and D the distribution; the unknown conditional distribution is p_θ(y | x), with θ the parameters of the neural network. One typically minimizes the softmax cross entropy. Let f_y(x; θ) be the network output before the softmax function, i.e. the logit; we then obtain:
p_θ(y | x) = exp(f_y(x; θ)) / Σ_{y'=1..K} exp(f_{y'}(x; θ))
from the experience of observation, for example in a conference room, certain categories (e.g. table and chairs) often occur simultaneously; while other classes (e.g., sofas and columns) tend to avoid each other. To describe this, we need a number to represent "how many times the probability of two classes co-existing in the same 3D scene as they randomly encounter? ". We describe this phenomenon using point-by-Point Mutual Information (PMI) because it is a quantity that measures the relationship between two random variables sampled at the same time, expressed as:
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, it means that the two classes tend to occur simultaneously, and conversely, they tend to avoid each other.
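The PMI is straightforward to compute from class probabilities. For instance, if tables and chairs each appeared in 40% and 50% of scenes but co-occurred in 30%, their PMI would be log 1.5 > 0 (the numbers are illustrative only, not from the text):

```python
import numpy as np

def pmi(p_joint, p1, p2):
    """Point-wise mutual information: log p(y1, y2) / (p(y1) * p(y2)).

    Positive when two classes co-occur more often than chance,
    negative when they tend to avoid each other.
    """
    return np.log(p_joint / (p1 * p2))
```

Here `pmi(0.3, 0.4, 0.5)` is positive (co-occurrence) and `pmi(0.1, 0.4, 0.5)` is negative (avoidance); independence, `p_joint == p1 * p2`, gives exactly 0.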
From the above discussion, point-wise mutual information (PMI) is an effective measure that reveals the internal relationships between classes. We therefore let the model fit the PMI directly so that the network learns more fundamental knowledge. Modelling the PMI and re-normalizing it with the softmax function gives:
log p_θ(y | x) ~ f_y(x; θ) + log p(y)
For generality, we add an adjustment factor τ, set to 1, and the resulting imbalance adjustment loss is expressed as:
L_ia = -log [ exp(f_y(x; θ) + τ log p(y)) / Σ_{y'} exp(f_{y'}(x; θ) + τ log p(y')) ]
the imbalance adjustment penalty we propose applies a tag-dependent offset to each logit. By embedding PMIs between scene semantics and introducing them into segmentation tasks, the network can be helped to reduce inter-class confusion problems.
Next we analyze and define L_occ. In a 2D image, an object is imaged as a different number of pixels depending on camera distance and angle, so the occupied pixels/points of each object (its occupancy) are unpredictable. In contrast, 3D data is metric, with no occlusion or scale ambiguity, so the number of points of an object does not change within the 3D scene. This means an object contains an approximately fixed number of points; as a result, points with the same label tend to have a stable count, which we call the occupancy scale.
For indoor and outdoor scenes, we partition the point cloud using sub-grids of size 4 cm and 6 cm respectively, so that the points in each small grid are represented by a single labelled point (the centre of gravity). This step is similar to voxelizing the point cloud. We then sample a fixed number of points from each point cloud as input, 40960 in our experimental setup. Unlabelled points in the scene are not passed into the loss function for computation. The setting of o_i can therefore help the network correct the data imbalance problem during training. Furthermore, for the point cloud datasets used in our experiments, the original setup is that each labelled point carries exactly one label, while the remaining points are unlabelled.
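The grid partition step above could be sketched as follows, keeping one representative point per occupied cell (a simplification: the patent keeps a labelled centre-of-gravity point, whereas this sketch only averages coordinates):

```python
import numpy as np

def grid_subsample(points, cell=0.04):
    """Assign each point to a cubic grid cell of the given size and keep
    the centroid of each occupied cell (akin to voxelizing the cloud)."""
    keys = np.floor(points / cell).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)          # guard against NumPy-version shape quirks
    n_cells = inverse.max() + 1
    sums = np.zeros((n_cells, points.shape[1]))
    counts = np.zeros(n_cells)
    np.add.at(sums, inverse, points)       # sum coordinates per cell
    np.add.at(counts, inverse, 1.0)        # count points per cell
    return sums / counts[:, None]          # centroid of each cell

# two points share the first 4 cm cell, the third falls into another cell
pts = np.array([[0.01, 0.01, 0.01], [0.02, 0.02, 0.02], [0.10, 0.10, 0.10]])
sub = grid_subsample(pts, cell=0.04)       # -> 2 representative points
```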
For the ith point in the kth semantic class, we predict a positive value o_i to indicate the number of points occupied by the current semantic class. The mean of o_i is then used as the expected occupancy of that semantic class. For a more reliable prediction, we regress the logarithm instead of the raw value, represented by the following expression:
wherein N_k is the number of points in the kth semantic class. The proposed occupancy regression loss regresses, for each point, the occupancy size of its class, i.e., an inherent property of each 3D class of object. It adjusts the proportion of each semantic category during training, which benefits the network by effectively preventing intra-class inconsistency.
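A sketch of how the per-point occupancy target (the logarithm of the class point count N_k) might be computed; the L1 penalty below is an assumption, since the excerpt does not reproduce the exact form of the regression loss:

```python
import numpy as np

def occupancy_targets(labels, num_classes):
    """Per-point regression target: log of the number of points that share
    the point's semantic label (the 'occupancy scale' described above)."""
    counts = np.bincount(labels, minlength=num_classes)
    return np.log(counts[labels].astype(float))

def occupancy_l1_loss(pred, labels, num_classes):
    # L1 form is an assumption; the excerpt elides the exact penalty
    return np.abs(pred - occupancy_targets(labels, num_classes)).mean()

labels = np.array([0, 0, 0, 1])          # class 0 has 3 points, class 1 has 1
targets = occupancy_targets(labels, 2)   # -> [log 3, log 3, log 3, log 1]
```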
Step T1-7: outputting the large-scale 3D point cloud segmentation result, namely the point cloud semantic segmentation result predicted by the model.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration, characterized by comprising the following steps:
S1, inputting point cloud data;
S2, extracting point cloud features;
S3, acquiring spatial position attention supporting large-scale input;
S4, acquiring expanded channel position attention;
S5, performing feature fusion: splicing the feature maps output by the spatial position attention supporting large-scale input and the expanded channel position attention, fusing the attention features, and up-sampling so that the scale of the output point cloud matches that of the input point cloud;
S6, constructing a joint loss function to force the neural network to learn the inherent attributes of the input points, wherein the joint cost function comprises: an imbalance adjustment loss function for adjusting the unbalanced, long-tailed label distribution; an occupancy regression loss function for regressing the occupancy size of the category to which each point belongs; and a cross-entropy loss function for the final semantic segmentation prediction;
and S7, outputting the point cloud segmentation result.
2. The semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein the step S3 comprises the following steps:
S31, acquiring an output feature map A from the feature extraction network, and obtaining different feature maps B and C from A, wherein N represents the number of points;
S32, performing matrix multiplication between the transpose of B and A to obtain an output matrix D, wherein the attention value on D is formulated as:
wherein the indices i and j denote point i and point j respectively, A_j indicates the position of the jth point in feature map A, and B_i represents the position of the ith point in feature map B;
S33, applying a bottleneck transformation to the transpose of D:
F_j = ReLU(LN(D_j))
where LN represents the layer normalization and ReLU is the activation function;
S34, performing matrix multiplication between the feature maps D and C, expressed as:
S35, weighting the aggregated feature maps E and F using scale parameters α and β, which are learnable parameters initialized to 0, to generate the spatial attention map G:
G_j = αE_j + βF_j + A_j.
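Steps S31-S35 can be sketched as follows with NumPy. The projections Wb and Wc stand in for the learned transforms producing B and C; the (N, N) attention shape and the back-projection of the bottleneck F onto the feature width are assumptions where the translated claim is ambiguous:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def spatial_attention(A, Wb, Wc, alpha=0.0, beta=0.0):
    B = A @ Wb                                  # S31: feature map B
    C = A @ Wc                                  # S31: feature map C
    D = softmax(B @ A.T, axis=-1)               # S32: point-wise attention map (N, N)
    F = np.maximum(layer_norm(D), 0.0) @ C      # S33: ReLU(LN(D)) bottleneck, projected back
    E = D @ C                                   # S34: aggregate C with attention D
    return alpha * E + beta * F + A             # S35: G_j = alpha*E_j + beta*F_j + A_j

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))                     # 8 points, 4 features
Wb, Wc = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
G = spatial_attention(A, Wb, Wc)                # alpha = beta = 0 -> G equals A
```

With α and β initialized to 0, as in S35, the module starts as an identity mapping over the residual A and learns to mix in attention gradually.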
3. The semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein the step S4 comprises the following steps:
S41, performing matrix multiplication between the transpose of A and the original A to obtain an attention map B;
S42, propagating feature map C using matrix multiplication between B and the original A, represented as:
where M is the channel dimension, the subscripts i and j denote channel i and channel j, and C_j represents the jth position in feature map C;
S43, defining a cross-channel operator to capture adjacent-channel relations and perform local cross-channel interaction, represented as:
where W is an h × M parameter matrix, h represents the stride over adjacent channels, and σ is a sigmoid function.
S44, setting the weight parameter λ, and generating a channel attention map E:
E_j = λC_j + D_j + A_j.
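Steps S41-S44 can be sketched similarly; the uniform h-wide averaging standing in for the learned parameter matrix W, and the mean-pooled channel descriptor, are assumptions made to keep the sketch self-contained:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(A, lam=0.1, h=3):
    """Sketch of claim 3: B = softmax(A^T A) is the channel attention map,
    C propagates A through B, and a 1-D h-wide cross-channel mix followed
    by a sigmoid (sigma) models the local cross-channel interaction D."""
    N, M = A.shape
    B = softmax(A.T @ A, axis=-1)       # S41: channel attention map (M, M)
    C = A @ B.T                         # S42: propagate A through B -> (N, M)
    pooled = C.mean(axis=0)             # per-channel descriptor (assumption)
    mixed = np.convolve(pooled, np.ones(h) / h, mode="same")  # S43: h-wide mixing
    D = 1.0 / (1.0 + np.exp(-mixed))    # sigma: sigmoid gate per channel
    return lam * C + D[None, :] + A     # S44: E_j = lambda*C_j + D_j + A_j

rng = np.random.default_rng(1)
A = rng.normal(size=(16, 8))            # 16 points, 8 channels
E = channel_attention(A, lam=0.1, h=3)  # lam = 0.1 and h = 3 follow claims 9-10
```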
4. The semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein the imbalance adjustment loss function in S6 minimizes the softmax cross entropy:
wherein θ represents the parameters of the neural network, (x, y) ~ D represents the training data, where x is the data, y is the supervision information, D is the data distribution, and p_θ(y | x) represents the unknown distribution to be fitted;
letting f_y(x; θ) be the result before the softmax function, i.e., the logit, thus yielding:
wherein f_y(x; θ) represents the output of the neural network under its current parameters, and K represents the number of candidate semantic categories;
the relation between two random variables sampled simultaneously is measured using pointwise mutual information (PMI), expressed as:
wherein p(y1) and p(y2) are the marginal probabilities of categories y1 and y2 respectively; if the PMI is much larger than 0, the two classes tend to occur simultaneously; otherwise, they tend to avoid each other;
the PMI is modeled and expressed as:
it is re-normalized using the softmax function, expressed as:
log p_θ(y | x) ~ f_y(x; θ) + log p(y)
adding the adjustment factor τ, the resulting imbalance adjustment loss function is expressed as:
5. the method of claim 4, wherein the adjustment factor τ is 1.
6. The semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein the occupancy regression loss function in S6 predicts, for the ith point in the kth semantic class, a positive value o_i to indicate the number of points occupied by the current semantic class; the mean of o_i is used as the expected occupancy of the current semantic class, and the point count is regressed:
wherein N_k is the number of points in the kth semantic class and K represents the number of semantic categories; the occupancy regression loss regresses, for each point, the occupancy size of its class, i.e., an inherent attribute of each class of object.
7. The semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein in step S2 the input point cloud data is described as an original unordered point set with F dimensions, where N is the number of points and p_i is a feature vector comprising coordinates, colors, and labels.
8. The semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein in S2 the input point cloud data is subjected to data enhancement, including random point shuffling, random point cloud rotation, and random rotation of the spatial coordinates and normal vectors.
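A sketch of the augmentations listed in claim 8 (the choice of rotation about the vertical axis is an assumption; the claim does not fix the axis):

```python
import numpy as np

def augment(points, normals=None, seed=None):
    """Random point shuffling plus a random rotation applied to the
    spatial coordinates and, if given, the normal vectors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(points))              # random point shuffling
    theta = rng.uniform(0.0, 2.0 * np.pi)           # random rotation angle
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation about z
    pts = points[idx] @ R.T
    nrm = None if normals is None else normals[idx] @ R.T
    return pts, nrm

P = np.arange(12.0).reshape(4, 3)                   # 4 toy points
pts, nrm = augment(P, seed=0)                       # shuffled and rotated copy
```

Because the rotation is orthogonal, per-point norms are preserved, which makes the transform easy to sanity-check.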
9. The method for semantic segmentation based on mutual information calibration point cloud data long-tail distribution according to claim 3, wherein in the step S43, h is set to be 3.
10. The method for semantic segmentation based on mutual information calibration point cloud data long-tail distribution according to claim 3, wherein in S44, λ is 0.1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110631495.8A CN113554653A (en) | 2021-06-07 | 2021-06-07 | Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113554653A true CN113554653A (en) | 2021-10-26 |
Family
ID=78130320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110631495.8A Pending CN113554653A (en) | 2021-06-07 | 2021-06-07 | Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113554653A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023098807A1 (en) * | 2021-12-03 | 2023-06-08 | 维沃移动通信有限公司 | Point cloud encoding processing method and apparatus, point cloud decoding processing method and apparatus, and encoding device and decoding device |
CN114638336A (en) * | 2021-12-26 | 2022-06-17 | 海南大学 | Unbalanced learning focusing on strange samples |
CN114638336B (en) * | 2021-12-26 | 2023-09-22 | 海南大学 | Unbalanced learning focused on strange samples |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||