CN113554653A - Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration - Google Patents

Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Info

Publication number
CN113554653A
CN113554653A (application CN202110631495.8A)
Authority
CN
China
Prior art keywords
point cloud
point
attention
cloud data
mutual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110631495.8A
Other languages
Chinese (zh)
Inventor
Li Mengtian (李梦甜)
Xie Yuan (谢源)
Ma Lizhuang (马利庄)
Zhang Zhizhong (张志忠)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Zhejiang Lab
Original Assignee
East China Normal University
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University, Zhejiang Lab filed Critical East China Normal University
Priority to CN202110631495.8A
Publication of CN113554653A
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration. The method comprises two parallel attention mechanisms designed for large-scale 3D scenes, one over spatial positions and one over channels, both used to capture long-range global information. Unlike traditional non-local attention operators in 2D images, the proposed spatial attention supports large-scale 3D point cloud input. The invention then introduces a merging operation to enhance feature discrimination from a global perspective. In addition, the invention proposes an imbalance adjustment loss function, which makes the network pay more attention to the recognition of classes that occur with lower frequency, together with an occupancy regression loss function, which constrains the predicted number of points occupied by each semantic class. Both losses benefit the training process and encourage the network to produce scene segmentation results with better boundaries.

Description

Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
Technical Field
The invention relates to the technical field of semantic segmentation, in particular to a semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration.
Background
Large-scale 3D scene segmentation, which aims to assign a semantic class label to each point, has recently been studied extensively and actively and underpins a variety of challenging and meaningful applications (e.g. autonomous driving, robotics, and place recognition). To accomplish such tasks efficiently, we need to distinguish ambiguous shapes and parts and account for objects with different appearances. For example, if structural information and discriminative features are not well encoded in the embedding space, adjacent chairs and tables can easily be confused and grouped into a single class.
Existing methods adopt a design concept similar to the typical convolutional neural networks used on 2D images and are mainly designed to learn richer local structures and capture wider contextual information from point clouds. Despite the series of efforts made on common datasets, some problems remain to be solved.
First, although a global representation is obtained in deep learning models, the complex global relationships within point clouds have not been explicitly exploited, which is crucial for better segmentation. For example, wall and door regions are often indistinguishable, and components of tables and chairs may be confounded by their similar structures. It is necessary to enhance the discriminative power of the feature representation for point-level identification. The ELGS method borrows the channel and spatial attention mechanisms from 2D image tasks. The MPRM method replaces convolutions with 1 × 1 Conv without structural changes to the channel and spatial attention mechanisms of the 2D image task. The PT and PCT methods design transformer layers for point cloud processing; both use self-attention operators in place of the convolution operators in the neural network. However, all the above methods can only operate on a sub-cloud of the entire point cloud with a limited number of points and cannot process a large 3D scene point cloud. Secondly, existing 3D methods pay little attention to the inherent properties of real-world 3D data itself. On the one hand, point clouds collected from the real world typically exhibit an imbalanced or long-tailed label distribution, with a few common classes absolutely predominant in number, causing the model to be biased towards these predominant classes and to ignore the less numerous ones. For example, the wall and floor classes appear in almost every indoor scene, while in outdoor scenes the road and building classes occupy most positions. On the other hand, 3D data is intrinsic, with no occlusion or scale ambiguity, so the number of points of an object does not change in a 3D scene. In contrast, in a 2D image the same object is imaged as a different number of pixels depending on camera distance and angle. The pixels/points occupied by each object (expressed as occupancy) are unpredictable in 2D images but can be reliably predicted from 3D scenes. The recent RandLA-Net method effectively segments large point clouds while ignoring the long-tail distribution and imbalance problems of real-world point cloud data.
Disclosure of Invention
In order to overcome the defects of the prior art, a novel framework is provided, introducing a neighborhood refinement module for large-scale 3D scene segmentation together with two loss functions aimed at the balance problem of network training, so as to handle large-scale point cloud feature input and to avoid inter-class confusion and intra-class inconsistency in large-scale point cloud scene semantic segmentation. The technical scheme is as follows:
a semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration comprises the following steps:
S1, inputting large-scale 3D point cloud data;
S2, extracting point cloud features;
S3, acquiring spatial position attention supporting large-scale input;
S4, acquiring expanded channel position attention;
S5, performing feature fusion: splicing the feature maps output by the spatial position attention supporting large-scale input and by the expanded channel position attention, performing attention feature fusion, and up-sampling so that the scale of the output point cloud matches that of the input point cloud;
S6, constructing a joint loss function to force the neural network to learn the inherent attributes of the input points:

$$\mathcal{L}_{total} = \mathcal{L}_{ia} + \mathcal{L}_{occ} + \mathcal{L}_{ce}$$

where $\mathcal{L}_{total}$ represents the joint cost function; $\mathcal{L}_{ia}$ represents the imbalance adjustment loss function, used to compensate for the imbalanced, long-tailed label distribution; $\mathcal{L}_{occ}$ represents the occupancy regression loss function, used to regress the occupancy of the category to which each point belongs; and $\mathcal{L}_{ce}$ represents the cross-entropy loss function used for the final semantic segmentation prediction;
and S7, outputting the point cloud segmentation result.
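To make the overall pipeline concrete, the following is a minimal PyTorch sketch of the S1 to S7 forward pass; the submodule names used here (backbone, spatial/channel attention, fusion head) are placeholders for the components detailed in the steps below, not the patent's actual implementation:

```python
import torch
import torch.nn as nn

class LongTailSegNet(nn.Module):
    """Sketch of the S1-S7 pipeline; submodule internals are placeholders."""
    def __init__(self, backbone, spatial_attn, channel_attn, fusion_head):
        super().__init__()
        self.backbone = backbone          # S2: point cloud feature extraction (e.g. RandLA-Net)
        self.spatial_attn = spatial_attn  # S3: spatial attention supporting large-scale input
        self.channel_attn = channel_attn  # S4: expanded channel attention
        self.fusion_head = fusion_head    # S5: splice, fuse, and upsample to N points

    def forward(self, points):            # S1: points of shape (batch, channels, N)
        a = self.backbone(points)         # per-point feature map A
        g = self.spatial_attn(a)          # spatial attention map G
        e = self.channel_attn(a)          # channel attention map E
        return self.fusion_head(g, e)     # S5/S7: per-point class logits
```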
Further, the S3 includes the following steps:
S31, obtaining an N × C output feature map A from the feature extraction network and feeding A to two different 1 × 1 convolution layers to obtain different feature maps B and C, where N denotes the number of points and C the feature dimension;
S32, carrying out matrix multiplication, i.e. an attention operation serving feature enhancement, between the transposes of B and A, obtaining a C × 1 output matrix D, where the attention value on D is given by:

$$D = \sum_{j=1}^{N} \frac{\exp(B_j)}{\sum_{i=1}^{N} \exp(B_i)}\, A_j$$

where the indices i and j denote point i and point j respectively, $A_j$ denotes the j-th point in feature map A, and $B_i$ the i-th point in feature map B;
S33, feeding the transpose of D through two further 1 × 1 convolutional layers, expressed as a bottleneck transform:

$$F_j = \mathrm{ReLU}(\mathrm{LN}(D_j))$$

where LN denotes the normalization layer and ReLU is the activation function;
S34, performing matrix multiplication, again an attention operation serving feature enhancement, between D and C:

$$E_j = \sum_{i=1}^{C} D_i\, C_{j,i}$$

S35, weighting the aggregated feature maps E and F in the summation with two scale parameters α and β, both learnable parameters initialized to 0, generating the spatial attention map G, a further feature-enhancing attention operation:

$$G_j = \alpha E_j + \beta F_j + A_j$$
further, the S4 includes the following steps:
S41, performing matrix multiplication between the transposed A and the original A, obtaining a C × C channel attention map B;
S42, propagating the feature map C using matrix multiplication between B and the original A, an attention operation serving feature enhancement, expressed as:

$$C_j = \sum_{i=1}^{M} \frac{\exp(B_{i,j})}{\sum_{m=1}^{M} \exp(B_{m,j})}\, A_i$$

where M is the channel dimension, the subscripts i and j denote channel i and channel j, and $C_j$ denotes the j-th position in feature map C;
S43, defining a cross-channel operator to capture adjacent-channel relationships, realized with a 1 × 1 convolution of kernel size h; the local cross-channel interaction, an attention operation serving feature enhancement, is expressed as:

$$D_j = \sigma\Big(\sum_{i \in \Omega_j^{h}} W_i\, C_i\Big)$$

where W is an h × M parameter matrix, h denotes the number of adjacent channels covered in one stride (i.e. the kernel size), $\Omega_j^{h}$ denotes the h channels adjacent to channel j, and σ is the sigmoid function;
S44, setting the weighting parameter λ and generating the channel attention map E, represented as:

$$E_j = \lambda C_j + D_j + A_j$$
further, the imbalance adjustment loss function in S6
Figure BDA0003103898800000034
Using the minimum softmax cross entropy:
Figure BDA0003103898800000035
wherein theta represents parameters of the neural network, and (x, y) -D represent training data, wherein x represents data, y represents supervisory information, D represents distribution, and p representsθ(y | x) represents an unknown profile;
let fy(x; θ) is the result before the softmax function, i.e., logit, thus yielding:
Figure BDA0003103898800000041
wherein f isy(x; theta) represents the current parameter distribution of the neural network, and K represents the number of candidate semantic categories;
from the experience of observation, for example in a conference room, certain categories (e.g. table and chairs) often occur simultaneously; while other classes (e.g., sofas and columns) tend to avoid each other. To describe this, we need a number to represent "how many times the probability of two classes co-existing in the same 3D scene as they randomly encounter? "we describe this phenomenon using a point-by-point mutual information PMI, since it is a measure of the number of relationships between two random variables sampled at the same time, expressed as:
Figure BDA0003103898800000042
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, then both classes tend to occur simultaneously, otherwise, tend to avoid each other;
From the above discussion, the pointwise mutual information (PMI) is an effective measure that actually reveals the internal relationships between classes. Therefore, we let the model fit the PMI directly so that the network learns this more fundamental knowledge; we model the PMI as:

$$f_y(x;\theta) \sim \log \frac{p_{\theta}(y \mid x)}{p(y)}$$

and re-normalize it using the softmax function, expressed as:

$$\log p_{\theta}(y \mid x) \sim f_y(x;\theta) + \log p(y)$$

For generality, an adjustment factor τ is added, and the resulting imbalance adjustment loss function is expressed as:

$$\mathcal{L}_{ia} = -\log \frac{e^{f_y(x;\theta) + \tau \log p(y)}}{\sum_{k=1}^{K} e^{f_k(x;\theta) + \tau \log p(k)}}$$

The proposed imbalance adjustment loss applies a label-dependent offset to each logit; by embedding the PMI between scene semantics and introducing it into the segmentation task, it helps the network reduce inter-class confusion.
Further, the adjustment factor τ is 1.
Further, the occupancy regression loss function $\mathcal{L}_{occ}$ in S6 is motivated as follows. In 2D images, an object is imaged as a varying number of pixels depending on camera distance and angle, so the pixels/points occupied by each object (expressed as occupancy) are unpredictable. In contrast, 3D data is intrinsic, with no occlusion or scale ambiguity, so the number of points of an object does not change within a 3D scene; an object contains an approximately fixed number of points, and points with the same label tend toward a stable count, which we call the occupancy scale. For indoor and outdoor scenes we partition the point clouds using sub-grids of size 4 cm and 6 cm respectively, with the points in each sub-grid represented by a single labeled point (the center of gravity), a procedure similar to voxelizing the point cloud; we can then sample a fixed number of points (10^5) from each point cloud as input. Unlabeled points in the scene are not fed into the loss computation. The setting of $o_i$ therefore helps the network correct the data imbalance problem during training. In addition, for the point cloud datasets adopted in the experiments, the original setting is, from any perspective, that each labeled point carries exactly one label, while unlabeled points carry none. For the i-th point in the k-th semantic class, a positive value $o_i$ is predicted to indicate the number of points occupied by the current semantic class, and the average of $o_i$ is used as the expected occupancy of that class; for more reliable prediction we regress the logarithm of the point count rather than the raw value:

$$\mathcal{L}_{occ} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \left( o_i - \log N_k \right)^2$$

where $N_k$ is the number of points in the k-th semantic class and K the number of semantic classes. The occupancy regression loss regresses, for each point, the occupancy of its class, i.e. an inherent property of each 3D object class; it adjusts the proportion of each semantic class during training, which benefits the network by effectively preventing intra-class inconsistency.
Further, in S2, the input point cloud data is described as

$$P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times F}$$

an unordered set of raw points with F dimensions, where N is the number of points and $p_i$ is a feature vector comprising coordinates, colors, and labels.
Further, in S2, data enhancement is performed on the input point cloud data, including randomly shuffling the order of the points, randomly rotating the point cloud, and randomly rotating the spatial coordinates and normal vectors.
Further, in S43, h is set to 3.
Further, in S44, λ is set to 0.1, which was found to be most effective.
The invention has the advantages and beneficial effects that:
the neighborhood refinement module proposed by the invention comprises two types of attention blocks, supports spatial attention in a large scale, and pre-forms expanded channel attention in a parallel manner, and can process a large number of points (for example, 10 points) at a time5Whereas the conventional method treats 10 at most once4A number of input point clouds) without increasing computational complexity and time cost; the invention provides two loss functions, and the two loss functions jointly utilize the inherent long tail label distribution in a 3D scene to guide a network to solve intra-class inconsistency and inter-class confusion; the invention can train the network in an end-to-end mode and is superior to the traditional method in the aspects of efficiency and effectiveness; the model and the training loss function provided by the invention can achieve better effect on large-scale scene point cloud segmentation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a spatial location attention module supporting large-scale input according to the present invention.
FIG. 3 is a schematic diagram of an expanded channel position attention module of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1, the semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration comprises the following steps:
and step T1-1, inputting a large-scale 3D point cloud file.
The format of input data, the storage form of data is txt format, and the keywords are: xyz, rgb, label, the data dimensions of the key index are: xyz (40960, 3), rgb (40960, 3), label (40960, 1).
Step T1-2, point cloud feature extraction network.
We use RandLA-Net as the point cloud feature extraction network to process large-scale point clouds and generate a rich semantic representation for each point. The input data is described as

$$P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times F}$$

an unordered set of raw points with F dimensions, where N is the number of points and $p_i$ is a feature vector that may contain the 3D spatial coordinates (x, y, z), color (r, g, b), and label. We set F = 3 to use only the 3D coordinates as input. Considering that real-world point clouds may contain a very large number of samples, letting every point in each point set participate in the computation leads to high computational cost and a vanishing-gradient problem, since the weight/influence of each point on the others becomes very small. During training, the input point cloud is first randomly downsampled to 40960 points; the training epochs are set to 250, the batch size to 8, the learning rate to 0.001, the gradient-descent momentum to 0.9, and the optimizer to Adam. The number of neighbors of each point is set to 16. After the point cloud data is fed into the network, data enhancement is applied during training, including randomly shuffling the point order, randomly rotating the point cloud, and randomly rotating the x, y, z spatial coordinates and normal vectors.
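As an illustration of this input pipeline, a minimal NumPy sketch of the data enhancement just described (the choice of a z-axis rotation is an assumption made for concreteness; the patent does not fix the rotation axis):

```python
import numpy as np

def augment_point_cloud(xyz, normals=None, rng=None):
    """Shuffle point order and apply a random rotation to coordinates (and normals)."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(xyz))        # randomly shuffle the point order
    xyz = xyz[perm]
    theta = rng.uniform(0.0, 2.0 * np.pi)   # random rotation angle about the z-axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    xyz = xyz @ rot.T                        # rotate spatial coordinates
    if normals is not None:
        normals = normals[perm] @ rot.T      # rotate normal vectors consistently
    return xyz, normals
```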
Step T1-3, spatial position attention supporting large-scale input.
As shown in fig. 2, an N × C output feature map A (N the number of points, C the feature dimension) is acquired from the feature extraction network and fed to two different 1 × 1 convolution layers, producing feature maps B and C. Matrix multiplication between the transposes of B and A, an attention operation serving feature enhancement, then yields a C × 1 output matrix D. The attention value on D is computed as:

$$D = \sum_{j=1}^{N} \frac{\exp(B_j)}{\sum_{i=1}^{N} \exp(B_i)}\, A_j$$

where the indices i and j denote point i and point j respectively, $A_j$ denotes the j-th point in feature map A, and $B_i$ the i-th point in feature map B.
The transpose of D is then fed through two further 1 × 1 convolutional layers as a bottleneck transform:

$$F_j = \mathrm{ReLU}(\mathrm{LN}(D_j))$$

where LN denotes the normalization layer and ReLU is the activation function.
The matrix multiplication between D and C, again an attention operation serving feature enhancement, is expressed as:

$$E_j = \sum_{i=1}^{C} D_i\, C_{j,i}$$

Thereafter, in the summation process, the aggregated feature maps E and F are weighted using two scaling parameters α and β, both learnable parameters initialized to 0. We generate the spatial attention map G, a further feature-enhancing attention operation:

$$G_j = \alpha E_j + \beta F_j + A_j$$
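A minimal PyTorch sketch of this spatial attention block follows. It assumes, consistently with D being C × 1, that B is a single attention logit per point (which is what keeps memory linear in the number of points N); the layer widths and bottleneck ratio are illustrative, not the patent's exact settings:

```python
import torch
import torch.nn as nn

class LargeScaleSpatialAttention(nn.Module):
    """Sketch of step T1-3; shapes follow the text (A: N x C, D: C x 1)."""
    def __init__(self, c, r=4):
        super().__init__()
        self.to_b = nn.Conv1d(c, 1, 1)             # 1x1 conv -> per-point logits B
        self.to_c = nn.Conv1d(c, c, 1)             # 1x1 conv -> feature map C
        self.down = nn.Conv1d(c, c // r, 1)        # bottleneck transform, first 1x1 conv
        self.ln = nn.LayerNorm([c // r, 1])        # LN inside the bottleneck
        self.up = nn.Conv1d(c // r, c, 1)          # bottleneck transform, second 1x1 conv
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scale, initialized to 0
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, a):                          # a: (batch, c, n) feature map A
        w = torch.softmax(self.to_b(a), dim=-1)    # softmax over all n points
        d = torch.matmul(a, w.transpose(1, 2))     # D = sum_j softmax(B)_j A_j: (batch, c, 1)
        f = self.up(torch.relu(self.ln(self.down(d))))     # F = bottleneck(D): (batch, c, 1)
        e = torch.matmul(self.to_c(a).transpose(1, 2), d)  # E_j = sum_i C_ji D_i: (batch, n, 1)
        return self.alpha * e.transpose(1, 2) + self.beta * f + a  # G = aE + bF + A
```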
expanded tunnel position attention of step T1-4
As shown in fig. 3, we perform matrix multiplication directly between the transposed A and the original A, obtaining a C × C channel attention map B. We then propagate feature map C using matrix multiplication between B and the original A, an attention operation serving feature enhancement, expressed as:

$$C_j = \sum_{i=1}^{M} \frac{\exp(B_{i,j})}{\sum_{m=1}^{M} \exp(B_{m,j})}\, A_i$$

where M is the channel dimension, the subscripts i and j denote channel i and channel j, and $C_j$ denotes the j-th position in feature map C. We then define a cross-channel operator to capture adjacent-channel relationships, implemented with a 1 × 1 convolution of kernel size h. The local cross-channel interaction, an attention operation serving feature enhancement, can be expressed as:

$$D_j = \sigma\Big(\sum_{i \in \Omega_j^{h}} W_i\, C_i\Big)$$

where W is an h × M parameter matrix, h denotes the number of adjacent channels covered in one stride (i.e. the kernel size), and σ is the sigmoid function; we set h to 3.
Then, we set the weight parameter λ to 0.1 and generate the channel attention map E, represented as:

$$E_j = \lambda C_j + D_j + A_j$$
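A corresponding PyTorch sketch of the expanded channel attention is given below; applying the kernel-size-h convolution along the channel dimension of each point is an assumption about how the cross-channel operator is realized:

```python
import torch
import torch.nn as nn

class ExpandedChannelAttention(nn.Module):
    """Sketch of step T1-4: channel-affinity attention plus a local
    cross-channel 1D convolution of kernel size h (here h = 3)."""
    def __init__(self, h=3, lam=0.1):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, h, padding=h // 2, bias=False)  # weights over adjacent channels
        self.lam = lam                                              # weight parameter lambda

    def forward(self, a):                              # a: (batch, m, n), m channels
        b, m, n = a.shape
        affinity = torch.matmul(a, a.transpose(1, 2))  # channel attention map B: (batch, m, m)
        attn = torch.softmax(affinity, dim=-1)
        c = torch.matmul(attn, a)                      # propagated feature map C: (batch, m, n)
        x = c.transpose(1, 2).reshape(b * n, 1, m)     # each point's channel vector
        d = torch.sigmoid(self.conv(x))                # local cross-channel interaction D
        d = d.reshape(b, n, m).transpose(1, 2)         # back to (batch, m, n)
        return self.lam * c + d + a                    # E_j = lambda*C_j + D_j + A_j
```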
step T1-5 feature fusion.
In this step, the feature maps output by the spatial position attention supporting large-scale input of step T1-3 and by the expanded channel position attention of step T1-4 are first spliced together; attention feature fusion is then performed through one 1 × 1 Conv layer, and the point cloud is up-sampled through three 1 × 1 Conv (convolutional) layers so that the output point cloud scale equals the input point cloud scale.
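A minimal sketch of this fusion head, with the nearest-neighbour up-sampling back to the full input resolution elided (an assumption consistent with RandLA-Net-style decoders); layer widths are illustrative:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of step T1-5: splice G and E, fuse with one 1x1 conv,
    then apply three 1x1 convs producing per-point class logits."""
    def __init__(self, c, num_classes):
        super().__init__()
        self.fuse = nn.Conv1d(2 * c, c, 1)          # attention feature fusion
        self.head = nn.Sequential(
            nn.Conv1d(c, c, 1), nn.ReLU(),
            nn.Conv1d(c, c // 2, 1), nn.ReLU(),
            nn.Conv1d(c // 2, num_classes, 1),      # per-point class scores
        )

    def forward(self, g, e):                        # both (batch, c, n)
        return self.head(torch.relu(self.fuse(torch.cat([g, e], dim=1))))
```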
Step T1-6, joint loss function.
To jointly exploit the inherent properties of 3D real-world data, we design two effective losses that guide the network to learn the inherent attributes of the input points, namely the imbalance adjustment loss and the occupancy regression loss. The network is trained by minimizing the joint cost function:

$$\mathcal{L}_{total} = \mathcal{L}_{ia} + \mathcal{L}_{occ} + \mathcal{L}_{ce}$$

where $\mathcal{L}_{ia}$ is the imbalance adjustment loss function that compensates for the imbalanced, long-tailed label distribution of actual 3D scenes, $\mathcal{L}_{occ}$ is the occupancy regression loss function used to regress the occupancy of the category to which each point belongs, and $\mathcal{L}_{ce}$ is the conventional cross-entropy loss for the final semantic segmentation prediction.
We first analyze and define $\mathcal{L}_{ia}$. 3D point clouds collected from the real world typically exhibit an imbalanced or long-tailed label distribution. In this case, the batches sampled during training have little chance of containing the low-frequency classes compared to the high-frequency ones, which easily leads the model to ignore them, although in practice we usually care more about the recognition of the low-frequency classes. Considering K candidate semantic categories with training data (x, y) ∼ D, where x denotes the data, y the supervision information, D the distribution, and $p_{\theta}(y \mid x)$ the unknown conditional distribution, one conventionally minimizes the softmax cross entropy:

$$\mathop{\arg\min}_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[-\log p_{\theta}(y \mid x)\right]$$

where θ denotes the parameters of the neural network. Let $f_y(x; \theta)$, the output of the network under its current parameters, be the result before the softmax function, i.e. the logit, so we get:

$$-\log p_{\theta}(y \mid x) = -f_y(x;\theta) + \log \sum_{k=1}^{K} e^{f_k(x;\theta)}$$
from the experience of observation, for example in a conference room, certain categories (e.g. table and chairs) often occur simultaneously; while other classes (e.g., sofas and columns) tend to avoid each other. To describe this, we need a number to represent "how many times the probability of two classes co-existing in the same 3D scene as they randomly encounter? ". We describe this phenomenon using point-by-Point Mutual Information (PMI) because it is a quantity that measures the relationship between two random variables sampled at the same time, expressed as:
Figure BDA0003103898800000084
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, it means that the two classes tend to occur simultaneously, and conversely, they tend to avoid each other.
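As a toy illustration of the PMI (the frequencies below are invented for the example, not measured from any dataset):

```python
import math

# Hypothetical scene statistics: tables appear in 60% of scenes, chairs in
# 50%, and the two co-occur in 45% of scenes.
p_table, p_chair, p_joint = 0.60, 0.50, 0.45
pmi = math.log(p_joint / (p_table * p_chair))  # log(0.45 / 0.30) ~ 0.405
print(f"PMI(table, chair) = {pmi:.3f}")        # > 0: the classes tend to co-occur
```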
From the above discussion, the pointwise mutual information (PMI) is an effective measure that actually reveals the internal relationships between classes. Therefore, we let the model fit the PMI directly so that the network learns this more fundamental knowledge. We model the PMI as:

$$f_y(x;\theta) \sim \log \frac{p_{\theta}(y \mid x)}{p(y)}$$

Then we re-normalize it using the softmax function, as:

$$\log p_{\theta}(y \mid x) \sim f_y(x;\theta) + \log p(y)$$

For generality, we add an adjustment factor τ (set to 1), and the resulting imbalance adjustment loss is expressed as:

$$\mathcal{L}_{ia} = -\log \frac{e^{f_y(x;\theta) + \tau \log p(y)}}{\sum_{k=1}^{K} e^{f_k(x;\theta) + \tau \log p(k)}}$$

The proposed imbalance adjustment loss applies a label-dependent offset to each logit. By embedding the PMI between scene semantics and introducing it into the segmentation task, it helps the network reduce inter-class confusion.
We next analyze and define $\mathcal{L}_{occ}$. In a 2D image, an object is imaged as a varying number of pixels depending on camera distance and angle, so the pixels/points occupied by each object (expressed as occupancy) are unpredictable. In contrast, 3D data is intrinsic, with no occlusion or scale ambiguity, so the number of points of an object does not change in the 3D scene. This means that an object contains an approximately fixed number of points. As a result, points with the same label tend toward a stable count, which we call the occupancy scale.
For indoor and outdoor scenes, we partition the point cloud using sub-grids of size 4 cm and 6 cm respectively, leaving the points in each small grid represented by a single labeled point (the center of gravity). This step is similar to voxelizing a point cloud. We can sample a fixed number of points, 10^5, from each point cloud as input, which in our experimental setup is 40960. Unlabeled points in the scene are not fed into the loss function. The setting of $o_i$ therefore helps the network correct the data imbalance problem during training. Furthermore, for the point cloud datasets used in our experiments, the original setup is, from any perspective, that each labeled point carries exactly one label, while unlabeled points carry none.
For the i-th point in the k-th semantic class, we predict a positive value $o_i$ indicating the number of points occupied by the current semantic class; the average of $o_i$ is then used as the expected occupancy of that class. For more reliable prediction, we regress the logarithm instead of the raw value:

$$\mathcal{L}_{occ} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \left( o_i - \log N_k \right)^2$$

where $N_k$ is the number of points in the k-th semantic class and K the number of semantic classes. The proposed occupancy regression loss regresses, for each point, the occupancy of its class, an inherent property of each 3D object class. It adjusts the proportion of each semantic category during training, which benefits the network by effectively preventing intra-category inconsistency.
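A sketch of the occupancy regression loss, assuming a squared-error penalty between each predicted $o_i$ and log N_k (the text specifies regressing the logarithm of the point count but does not spell out the exact penalty form):

```python
import torch

def occupancy_regression_loss(occ_pred, labels, num_classes):
    """Regress each point's predicted occupancy o_i toward log N_k, the
    log point count of its semantic class (the 'occupancy scale').

    occ_pred: (num_points,) predicted positive occupancy values o_i
    labels:   (num_points,) class index per point (unlabeled points excluded)
    """
    loss, present = occ_pred.new_zeros(()), 0
    for k in range(num_classes):
        mask = labels == k
        n_k = mask.sum()
        if n_k == 0:                              # skip classes absent from this batch
            continue
        target = torch.log(n_k.float())           # log N_k
        loss = loss + ((occ_pred[mask] - target) ** 2).mean()
        present += 1
    return loss / max(present, 1)                 # average over classes present
```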
Step T1-7, outputting the large-scale 3D point cloud segmentation result, i.e. the point cloud semantic segmentation result predicted by the model.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration, characterized by comprising the following steps:
S1, inputting point cloud data;
S2, extracting point cloud features;
S3, acquiring spatial position attention supporting large-scale input;
S4, acquiring expanded channel position attention;
S5, performing feature fusion: splicing the feature maps output by the spatial position attention supporting large-scale input and by the expanded channel position attention, performing attention feature fusion, and up-sampling so that the scale of the output point cloud matches that of the input point cloud;
S6, constructing a joint loss function to force the neural network to learn the inherent attributes of the input points:

$$\mathcal{L}_{total} = \mathcal{L}_{ia} + \mathcal{L}_{occ} + \mathcal{L}_{ce}$$

where $\mathcal{L}_{total}$ represents the joint cost function; $\mathcal{L}_{ia}$ represents the imbalance adjustment loss function, used to compensate for the imbalanced, long-tailed label distribution; $\mathcal{L}_{occ}$ represents the occupancy regression loss function, used to regress the occupancy of the category to which each point belongs; and $\mathcal{L}_{ce}$ represents the cross-entropy loss function used for the final semantic segmentation prediction;
and S7, outputting the point cloud segmentation result.
2. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein S3 comprises the following steps:
S31, obtaining an output feature map A from the feature extraction network, and obtaining different feature maps B and C from A, where N denotes the number of points;
S32, carrying out matrix multiplication between the transposes of B and A to obtain an output matrix D, where the attention value on D is given by:

$$D = \sum_{j=1}^{N} \frac{\exp(B_j)}{\sum_{i=1}^{N} \exp(B_i)}\, A_j$$

where the indices i and j denote point i and point j respectively, $A_j$ denotes the j-th point in feature map A, and $B_i$ the i-th point in feature map B;
S33, representing the transpose of D as a bottleneck transform:

$$F_j = \mathrm{ReLU}(\mathrm{LN}(D_j))$$

where LN denotes the normalization layer and ReLU is the activation function;
S34, performing matrix multiplication between D and C, expressed as:

$$E_j = \sum_{i=1}^{C} D_i\, C_{j,i}$$

S35, weighting the aggregated feature maps E and F using the scale parameters α and β, both learnable parameters initialized to 0, to generate the spatial attention map G:

$$G_j = \alpha E_j + \beta F_j + A_j$$
3. the method for semantic segmentation based on mutual information calibration point cloud data long-tail distribution according to claim 1, wherein the step S4 comprises the following steps:
s41, performing matrix multiplication between the transposed A and the original A to obtain an attention graph B;
s42, propagating signature C using matrix multiplication between B and original a, represented as:
Figure FDA0003103898790000021
where M is the channel dimension, and the subscripts i, j denote channel i and channel j, CjRepresents the j-th position in the characteristic diagram C;
s43, defining a cross-channel operator to capture the adjacent channel relation, and performing local cross-channel interaction, wherein the local cross-channel interaction is represented as:
Figure FDA0003103898790000022
where W is an h M parameter matrix, h represents the adjacent channel in stride, and σ is an S-type function.
S44, setting the weight parameter λ, and generating a channel attention map E:
Ej=λCj+Dj+Aj
4. the method of claim 1, wherein the imbalance adjustment loss function in S6 is a semantic segmentation method based on mutual information calibration point cloud data long tail distribution
Figure FDA0003103898790000023
Using the minimum softmax cross entropy:
Figure FDA0003103898790000024
wherein theta represents parameters of the neural network, and (x, y) -D represent training data, wherein x represents data, y represents supervisory information, D represents distribution, and p representsθ(y | x) represents an unknown profile;
let fy(x; θ) is the result before the softmax function, i.e., logit, thus yielding:
Figure FDA0003103898790000025
wherein f isy(x; theta) represents the current parameter distribution of the neural network, and K represents the number of candidate semantic categories;
the relation between two random variables sampled simultaneously is measured by adopting point-by-point mutual information PMI, and is expressed as follows:
Figure FDA0003103898790000026
wherein, p (y)1)、p(y2) Are respectively category y1、y2If the PMI is much larger than 0, then both classes tend to occur simultaneously, otherwise, tend to avoid each other;
PMI was modeled and expressed as:
Figure FDA0003103898790000031
it is re-normalized using the softmax function, expressed as:
log pθ(y∣x)~fy(x;θ)+log p(y)
adding the adjustment factor τ, the resulting imbalance adjustment loss function is expressed as:
Figure FDA0003103898790000032
5. the method of claim 4, wherein the adjustment factor τ is 1.
6. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein for the occupancy regression loss function $\mathcal{L}_{occ}$ in S6, a positive value $o_i$ is predicted for the i-th point in the k-th semantic class to indicate the number of points occupied by the current semantic class, the average of $o_i$ is used as the expected occupancy of the current semantic class, and the logarithm of the point count is regressed:

$$\mathcal{L}_{occ} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \left( o_i - \log N_k \right)^2$$

where $N_k$ is the number of points in the k-th semantic class and K denotes the number of semantic categories; the occupancy regression loss regresses, for each point, the occupancy of its class, i.e. an inherent attribute of each class of object.
7. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein in S2 the input point cloud data is described as

$$P = \{p_i\}_{i=1}^{N} \in \mathbb{R}^{N \times F}$$

an unordered set of raw points with F dimensions, where N is the number of points and $p_i$ is a feature vector comprising coordinates, colors, and labels.
8. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 1, wherein in S2 the input point cloud data is subjected to data enhancement, including randomly shuffling the point order, randomly rotating the point cloud, and randomly rotating the spatial coordinates and normal vectors.
9. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 3, wherein in S43, h is set to 3.
10. The semantic segmentation method for the long-tail distribution of point cloud data based on mutual information calibration according to claim 3, wherein in S44, λ is 0.1.
CN202110631495.8A 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration Pending CN113554653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110631495.8A CN113554653A (en) 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110631495.8A CN113554653A (en) 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Publications (1)

Publication Number Publication Date
CN113554653A true CN113554653A (en) 2021-10-26

Family

ID=78130320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110631495.8A Pending CN113554653A (en) 2021-06-07 2021-06-07 Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration

Country Status (1)

Country Link
CN (1) CN113554653A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023098807A1 (en) * 2021-12-03 2023-06-08 维沃移动通信有限公司 Point cloud encoding processing method and apparatus, point cloud decoding processing method and apparatus, and encoding device and decoding device
CN114638336A (en) * 2021-12-26 2022-06-17 海南大学 Unbalanced learning focusing on strange samples
CN114638336B (en) * 2021-12-26 2023-09-22 海南大学 Unbalanced learning focused on strange samples

Similar Documents

Publication Publication Date Title
US10733431B2 (en) Systems and methods for optimizing pose estimation
US10796452B2 (en) Optimizations for structure mapping and up-sampling
CN112308200B (en) Searching method and device for neural network
CN109902798A (en) The training method and device of deep neural network
CN111507378A (en) Method and apparatus for training image processing model
CN109493303A Image defogging method based on a generative adversarial network
CN106845471A Visual saliency prediction method based on a generative adversarial network
JP2019082978A Skip-architecture neural network device and method for improved semantic segmentation
CN113705769A (en) Neural network training method and device
CN110070107A (en) Object identification method and device
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN110222717A (en) Image processing method and device
CN109711401A Text detection method for natural scene images based on Faster RCNN
CN113095254B (en) Method and system for positioning key points of human body part
WO2021103731A1 (en) Semantic segmentation method, and model training method and apparatus
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113011562A (en) Model training method and device
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
WO2022111387A1 (en) Data processing method and related apparatus
CN112580720A (en) Model training method and device
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
CN115512251A (en) Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement
Luvizon et al. SSP-Net: Scalable sequential pyramid networks for real-Time 3D human pose regression
CN113706544A (en) Medical image segmentation method based on complete attention convolution neural network
CN113066018A (en) Image enhancement method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination