CN113870160A - Point cloud data processing method based on Transformer neural network - Google Patents

Point cloud data processing method based on Transformer neural network

Info

Publication number
CN113870160A
Authority
CN
China
Prior art keywords
point cloud
data
task
model
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111060998.0A
Other languages
Chinese (zh)
Other versions
CN113870160B (en)
Inventor
王旭
曾宇乔
金一
岑翼刚
孙宇霄
李浥东
郎丛妍
王涛
冯松鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202111060998.0A priority Critical patent/CN113870160B/en
Publication of CN113870160A publication Critical patent/CN113870160A/en
Application granted granted Critical
Publication of CN113870160B publication Critical patent/CN113870160B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a point cloud data processing method based on a Transformer neural network. The method comprises the following steps: constructing a three-dimensional object symmetry detection model that obtains the symmetric points of input point cloud data by detecting the object's symmetry planes/axes and uses them to derive rotation-translation operations that transform the point cloud into coordinates whose projection plane is a symmetric structure, yielding multiple groups of data-enhanced point cloud data; extracting global and local feature information from the multiple groups of data-enhanced point cloud data through a Transformer network model to obtain down-sampled point cloud data; and, according to the requirements of different target tasks, constructing a task-driven task network model and inputting the down-sampled point cloud data into it to obtain the target task result. The method effectively combines the three-dimensional object symmetry detection model with the Transformer network model, improves the robustness of the down-sampling model, and minimizes the accuracy loss of the target task, thereby improving both the down-sampling scale and the accuracy of the target task.

Description

Point cloud data processing method based on Transformer neural network
Technical Field
The invention relates to the technical field of point cloud data down-sampling, and in particular to a point cloud data processing method based on a Transformer neural network.
Background
The Transformer is a deep learning framework proposed by the Google machine translation team in the 2017 article "Attention Is All You Need". In the deep learning domain, the Transformer has an encoder-decoder structure that contains three main modules: an input data embedding module (input embedding), a position encoding module (positional encoding), and a self-attention module (self-attention).
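For orientation, the following minimal PyTorch sketch shows how such a self-attention module derives Q, K, and V from an embedded input and aggregates the values by normalized attention scores; the class name and dimensions are illustrative assumptions, not code from the patent.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: three learned transforms produce Q, K, V;
    normalized Q.K^T scores weight the values."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim) embedded inputs, position encoding already added
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        scores = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return scores @ v  # attention-weighted aggregation of the values
```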
Point cloud data in a rail transit system is a set of vectors in a three-dimensional coordinate system acquired by three-dimensional acquisition equipment such as lidar or stereo cameras. Each point contains a three-dimensional coordinate, and some point cloud data also include information such as color, depth, and reflection intensity.
Point cloud data acquired in a rail transit system are often large in scale; for example, a single point cloud frame can contain hundreds of thousands to millions of points. Constrained by indexes such as time and energy consumption, existing embedded devices can hardly operate on such large-scale data directly. Meanwhile, owing to weather, road bumps, illumination changes, and the like, point cloud data often contain a large number of noise points, which can seriously degrade data accuracy and thus reduce the accuracy of unmanned-driving and other analysis systems that depend on large-scale data. Therefore, a practical point cloud processing system usually includes a point cloud down-sampling operation, that is, the removal of noise points and redundant points from the point cloud data.
Data enhancement comprises a series of techniques for expanding existing training samples, which mainly fall into two categories: traditional enhancement methods such as random scaling, rotation, jittering, and translation, and deep-learning-based methods such as learned training-sample migration transformation and component recombination. The purpose of applying data enhancement is to enlarge the number of training samples of a neural network model and to increase the generalization of the model. A minimal sketch of the traditional operations is given below.
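The following Python/NumPy sketch illustrates the traditional enhancement operations named above; the parameter ranges are illustrative assumptions.

```python
import numpy as np

def traditional_augment(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random rotation about the z-axis, scaling, jitter, and translation.
    points: (N, 3). Parameter ranges are illustrative."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.8, 1.2)                                   # random scaling
    jitter = np.clip(rng.normal(0.0, 0.01, points.shape), -0.05, 0.05)
    shift = rng.uniform(-0.1, 0.1, (1, 3))                          # random translation
    return (points @ rot.T) * scale + jitter + shift
```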
With the development of three-dimensional sensing and sampling technologies such as lidar, three-dimensional sensors play an increasingly important role in computer vision, especially in automatic driving and environment perception. Classifying or segmenting objects or scenes described by three-dimensional point clouds with deep neural networks has become a hot topic in the field. In automatic driving, for example, vehicles are often equipped with multiple three-dimensional sensors in a 360° shooting mode to ensure that enough redundant information is captured so that the deep neural network can be more accurate and robust. However, visual tasks represented by automatic driving place high demands on response time; a large amount of unprocessed point cloud data is difficult to use directly, and three-dimensional point cloud data usually must be down-sampled to reduce the data scale and remove redundant and noise points, thereby accelerating computation and reducing power consumption.
Down-sampling methods in the prior art are mainly divided into traditional methods and deep learning methods. Traditional down-sampling is represented by farthest point sampling and random sampling. Farthest point sampling starts from a given sampling point, each time selects the point with the largest Euclidean distance to the already-selected set as the next sampling point, and repeats this operation until K sampling points have been selected; random sampling simply draws sample points at random from the original data, imposing no artificial intention on the sampling strategy. Although traditional methods can effectively reduce the scale of point cloud data and to some extent preserve model output accuracy, such non-task-driven down-sampling is hard to link with the subsequent task network and ignores task requirements during sampling, so it often yields suboptimal results and can hardly maintain the output accuracy of the target task while reducing the input data scale. A sketch of farthest point sampling follows.
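The following sketch illustrates the farthest point sampling procedure just described; the starting point and helper names are illustrative.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Pick k indices: start anywhere, then repeatedly take the point with
    the largest Euclidean distance to the already-selected set.
    points: (N, 3) array."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=np.int64)      # chosen[0] = 0: any start works
    dist = np.full(n, np.inf)                 # distance to nearest chosen point
    for i in range(1, k):
        gap = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, gap)
        chosen[i] = int(np.argmax(dist))      # farthest from the selected set
    return chosen
```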
Down-sampling methods based on deep learning have also been proposed in recent years. One popular approach uses pooling operations. Pooling is a common operation in convolutional neural networks that imitates the human visual system to reduce data dimensionality; it is generally applied after a convolutional layer to reduce the dimensionality of the layer's output features and thereby effectively reduce network parameters. In the prior art, following the idea of max pooling, the points whose feature attribute values are largest during network learning are regarded as key points and retained, which completes the down-sampling operation. Another main down-sampling technique uses an adaptive weighting mechanism: sampling is first performed with the farthest point sampling algorithm, the neighborhood points of each sampling point are then obtained with a K-nearest-neighbor algorithm, the neighborhoods are learned through fully connected layers, the K neighborhood points are adaptively weighted by the learned weights, and the weighted average is finally taken as a new key point. Although such methods consider the requirements of the target task to some extent, they inevitably degrade model performance. Meanwhile, different deep learning framework structures also affect point cloud data. Point cloud learning networks based on convolution operations generally need to voxelize the point cloud into three-dimensional grids so that a three-dimensional convolutional neural network can learn the objects. Point-based deep learning frameworks, such as shared fully connected networks, creatively combine multi-layer perceptrons with max pooling and effectively reduce the computation and storage overhead of the neural network, but their input layer reorders the point cloud and destroys the original spatial distribution characteristics, and the matrix multiplication performed in the hidden layers is a mapping of the original features to other dimensions that does not effectively consider the spatial structure information of the point cloud.
In summary, prior-art point cloud down-sampling methods share the following disadvantage: none of them incorporates a Transformer network framework into the design of the depth model, nor do they reduce the problem of balancing down-sampling scale against target task accuracy to a task-driven point cloud self-attention learning problem.
Deep learning methods are data-driven, and a large number of diverse training samples are needed to improve the accuracy of a deep network model. For traditional two-dimensional visual tasks, the public data sets are large and of high quality; for example, ImageNet comprises 22,000 categories and more than 15 million manually annotated images, at least 1 million of which have bounding boxes of the target object, which greatly facilitates visual tasks such as object classification and target detection on two-dimensional images and allows researchers to explore further with massive high-quality data. However, existing public three-dimensional point cloud data sets are small and unfavorable for training depth models; for example, the Sydney Urban Objects Dataset contains 631 labeled objects, the RGB-D Object Dataset contains 300 objects in 51 categories, and the New York University depth data set (NYU-Depth) contains 2,347 labeled frames and 108,617 unlabeled frames. Existing public data sets are extremely limited in size.
Disclosure of Invention
The embodiment of the invention provides a point cloud data processing method based on a Transformer neural network, which aims to achieve a balance between the point cloud data down-sampling scale and the accuracy of the point cloud target task.
In order to achieve the purpose, the invention adopts the following technical scheme.
A point cloud data processing method based on a Transformer neural network comprises the following steps:
step S1, constructing a three-dimensional object symmetry detection model, wherein the model obtains the symmetric points of the input point cloud data by detecting the object's symmetry planes/axes and uses these symmetric points to derive rotation-translation operations that transform the point cloud into coordinates whose projection plane is a symmetric structure, yielding multiple groups of data-enhanced point cloud data;
step S2, constructing a Transformer network model, extracting the global and local feature information of the multiple groups of data-enhanced point cloud data through the Transformer network model, acquiring the importance degree of each point in the point cloud data, and learning the down-sampled point cloud data;
and step S3, constructing a task-driven task network model according to the requirements of different target tasks, inputting the down-sampled point cloud data into the task network model, and having the task network model perform target task learning and output the target task result.
Preferably, the step S1 specifically includes:
constructing a neural-network-based self-attention mechanism module, collecting and labeling training samples with symmetry information, and training the self-attention mechanism module with the training samples;
connecting a plurality of self-attention modules in parallel and introducing a diversity loss function to obtain a shared self-attention module, wherein different self-attention modules in the shared module attend to different targets;
constructing the three-dimensional object symmetry detection model on the basis of the shared self-attention module, and inputting the original point cloud data P ∈ R^{3+f} into the three-dimensional object symmetry detection model, so that under the constraint of the diversity loss function L_var each self-attention model learns the feature information of one symmetry plane of the original point cloud data; concatenating the original point cloud data P ∈ R^{3+f} with all the feature information, inputting the concatenation result into a shared fully connected network, and simultaneously learning multiple groups of rotation and translation matrices of the point cloud, where f denotes the feature information other than the three-dimensional coordinates in the point cloud data, and the concatenation operation is expressed as:

F_output = concat(f_i^1, f_i^2, …, f_i^9, f_i^{10}, P)

and multiplying the original point cloud data by the multiple groups of learned rotation and translation matrices to obtain multiple groups of data-enhanced point cloud data under new coordinates in which the projection plane is a symmetric structure.
Preferably, the self-attention mechanism creates three vectors for each point of the point cloud data in three-dimensional coordinates: a query vector Q, a key vector K, and a value vector V, and scores the semantic association degree between the input point and each point in the point cloud by computing the product of Q and K;

the self-attention function is formalized as:

y_i = θ(ρ(γ(Q, K)) · V, x_i)

where y_i is the new output feature generated by the self-attention module for input point x_i; β and α denote point-wise feature transformation operations; the three vectors Q, K, and V are obtained by point-multiplying the point embedding vector with three feature transformation matrices separately trained during neural network training; γ and θ are matrix functions, with γ denoting the multiplication operation that computes the product of Q and K, and θ denoting the aggregation operation between the importance-score matrix of the value vector and the originally input point cloud data; and ρ denotes a normalization function.
Preferably, the diversity loss function L_var is expressed as follows:

L_var = Σ_i Σ_{p≠q} (w_i^p · w_i^q) / (‖w_i^p‖ ‖w_i^q‖)

where i indexes the different point cloud samples, w denotes the learned attention weights, and p and q denote two different self-attention models in the same shared attention module.
Preferably, the shared fully connected network consists of three cascaded parts: a multi-layer perceptron, a batch normalization function, and a linear rectification function, and its mathematical expression is:

F_output = ReLU(BN(MLP(F_in))).
preferably, the step S2 specifically includes:
constructing a Transformer network model comprising an input embedding module, a position encoding module, and a self-attention module, training the Transformer network model with a loss function, combining the input embedding module with the position encoding module, and using the natural position coordinate information of the three-dimensional point cloud to model, through the combined modules, the spatial distribution of the multiple groups of data-enhanced point cloud data;

analyzing the multiple groups of data-enhanced point cloud data with the self-attention model on the basis of their spatial distribution model, and extracting the global feature information of the data-enhanced point cloud data;

constructing a local feature extraction unit comprising a sampling-and-grouping layer and convolutional layers, establishing a plurality of hierarchical point cloud subsets of the multiple groups of data-enhanced point cloud data through the sampling-and-grouping layer, and performing feature extraction on the point cloud subsets with a convolutional neural network layer to obtain the fine-grained local features of the data-enhanced point cloud data;

and integrating, by the self-attention module, the global and local feature information of the multiple groups of data-enhanced point cloud data, and selecting the set of three-dimensional points that contributes most to the task discrimination accuracy of the task network to obtain the down-sampled point cloud data.
Preferably, the training of the Transformer network model with the loss function includes:

for an input point cloud P = {p_i ∈ R^{3+f}, i = 1, 2, …, n} containing n points, the training goal of the Transformer network is to learn a subset P_s such that s < n while minimizing the task sampling loss L; the objective function L is expressed as:

L = Σ_i ℓ(y_i, t_i)

where y_i denotes the network prediction and t_i denotes the true value; to satisfy the objective function L, a sampling regularization loss function L_sampling is introduced, expressed as:

L_sampling(P_s, P) = L_f(P_s, P) + L_m(P_s, P) + L_b(P_s, P)

where L_f and L_m denote the mean and maximum nearest-neighbor losses, respectively, and L_b denotes the close-neighbor matching loss.
Preferably, the step S3 specifically includes:
constructing a task-driven task network model based on the Transformer neural network, inputting the down-sampled point cloud data into the task network model, designing the three-dimensional object symmetry detection model and the Transformer network model around the task network model, and designing the end-to-end loss function as:

L_total(P, P_s) = α L_var(P) + β L_sampling(P, P_s) + L_task(P_s)

where α and β denote weights;
taking the end-to-end loss function as the training loss function of the three-dimensional object symmetry detection model and the Transformer network model, updating the weight parameters in these models through the back-propagation algorithm inherent to neural networks, and continuously optimizing the output accuracy of the three-dimensional object symmetry detection model, the Transformer network model, and the task network model;
and down-sampling the input point cloud data through the finally optimized symmetry detection model, Transformer network model, and task network model, mapping the down-sampled point cloud input to a feature space, and learning the point cloud input features in that feature space through shared fully connected layers to obtain the output result of the target task.
According to the technical scheme provided by the embodiment of the invention, a task-driven robust point cloud down-sampling framework based on a Transformer neural network is provided. The framework effectively combines a three-dimensional object symmetry detection model with a Transformer network model and cascades the target task network to form an end-to-end deep learning model, finally reaching a balance point between the point cloud down-sampling scale and the accuracy of the point cloud target task. The method can therefore improve the robustness of the down-sampling model while minimizing the accuracy loss of the target task, achieving a bidirectional improvement of down-sampling scale and target task accuracy.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a processing flow chart of the task-driven robust point cloud down-sampling method based on a Transformer neural network according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a specific instantiation of a self-attention model according to an embodiment of the present invention;
fig. 3 is a specific instantiation structure diagram of a three-dimensional object symmetry detection model according to an embodiment of the present invention;
fig. 4 is a specific instantiation structure diagram of a local feature extraction module according to an embodiment of the present invention;
fig. 5 is a specific instantiation structure diagram of the task-driven Transformer network model according to an embodiment of the present invention;
fig. 6 shows a training sample and the corresponding down-sampled point cloud image according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Unmanned-driving application scenarios place high requirements on the reliability of three-dimensional point cloud analysis; for example, classification or recognition networks must be accurate enough to make reliable judgments. However, distant or small objects in a three-dimensional point cloud are often represented by sparse points, and during feature extraction in a neural network their features are usually lost as the number of layers deepens, which increases the model's missed-detection and false-detection rates. To address this, the embodiment of the invention introduces a local feature extraction module based on a convolutional neural network framework, finely extracts the detailed semantic information of the point cloud, and supplements the local features neglected in the global features, so as to improve the robustness of the whole down-sampling model.
Drawing on the Transformer model theory that has risen in recent years, the embodiment of the invention attempts, for the first time, to cascade a Transformer model with a target task network and convert the problem into a task-driven point cloud self-attention learning problem. By designing a dedicated three-dimensional object symmetry detection model, the input point cloud data is rotated in three-dimensional space and translated into new coordinate systems whose projection is a symmetry plane, which enlarges the scale of the training samples and improves the generalization ability of the subsequently trained model. A Transformer network model is built to acquire as much rich point cloud semantic information as possible, so that the whole model framework can effectively learn the key points, redundant points, and noise points in the point cloud and obtain the importance degree of each point in the data. Metric-based down-sampling is then performed according to this point-wise importance information, so as to minimize the accuracy loss of the target task.
The embodiment of the invention provides a task-driven robust point cloud down-sampling framework based on a Transformer neural network. The framework effectively combines a three-dimensional object symmetry detection model with a Transformer network model and cascades the target task network to form an end-to-end deep learning model, finally reaching a balance point between the point cloud down-sampling scale and the accuracy of the point cloud target task, so that the method improves the robustness of the down-sampling model while giving the model the ability to minimize the accuracy loss of the target task, thereby achieving a bidirectional improvement of down-sampling scale and target task accuracy.
The problem of balancing the down-sampling scale against the target task accuracy in a point cloud down-sampling algorithm is thus reduced to a task-driven point cloud self-attention metric learning problem: the symmetry detection model enlarges the training samples and improves generalization, the Transformer network model learns the key points, redundant points, and noise points together with the per-point importance, and metric-based down-sampling according to the point-wise importance information minimizes the accuracy loss of the target task.
Three-dimensional point cloud data possess translation, rotation, and scale invariance: translating, rotating, or scaling a point cloud as a whole does not change the true representation of the data or the expression of its semantic information. A symmetric structure is a basic geometric attribute of most natural objects and is widespread in real rail transit scenes; pedestrians, cars, bicycles, and the like all have generalized symmetric structures, so understanding object symmetry is an important problem for deep learning models that must understand the real world and for intelligent interaction of unmanned vehicles. To address the insufficient scale of three-dimensional point cloud training data sets in prior-art deep learning, the three-dimensional object symmetry detection model learns to detect the object's symmetry planes/axes and the symmetric corresponding points of the input point cloud, and seeks an accurate three-dimensional transformation of the input point cloud so as to obtain multiple groups of point clouds under new coordinates whose projection planes are symmetric structures.
This overcomes the problems of traditional down-sampling methods, such as sampling strategies that are unrelated to the target task and sensitive to noise, as well as the damage to the spatial distribution characteristics of the point cloud caused by down-sampling strategies built on existing deep learning frameworks. A task-driven robust point cloud down-sampling framework based on a Transformer neural network is designed that fully exploits the permutation invariance of the self-attention module when processing point cloud sequences, effectively extracting the global features of the point cloud while protecting its spatial distribution characteristics. A local feature extraction module based on a convolutional neural network framework is introduced to extract the detailed semantic information of the point cloud at fine granularity and to supplement the local features neglected in the global features, thereby improving the robustness of the whole down-sampling model and bidirectionally improving the robustness of the down-sampling model and the accuracy of the target task.
The point cloud data processing method based on the Transformer neural network mainly comprises the following processing stages:

(1) constructing a three-dimensional object symmetry detection model. Based on a shared attention mechanism, the model introduces a diversity loss function that encourages different attention mechanisms to concentrate on learning specific symmetry-plane information of the point cloud, thereby outputting multiple groups of rotation and translation matrices of the same point cloud simultaneously;

(2) constructing a Transformer network model. The model consists of three modules: a local feature extraction module, a global feature extraction module, and a point cloud reconstruction module. The local feature extraction module consists of a group of cascaded two-dimensional convolutional neural networks and extracts fine-grained semantic feature information from the input point cloud data. The global feature extraction module consists of a group of cascaded self-attention modules and extracts the global semantic features of the input point cloud data. The point cloud reconstruction module cascades three groups of shared fully connected neural networks, fuses the important information learned from the global and local features, and reconstructs a three-dimensional point cloud containing importance-degree information;

(3) constructing a task-driven task network model. According to different task requirements, the three-dimensional object symmetry detection model, the Transformer network model, and the task model are cascaded to form an end-to-end integrated self-learning model framework.
The processing flow of the task-driven robust point cloud down-sampling method based on a Transformer neural network provided by the embodiment of the invention is shown in fig. 1, and specifically comprises the following steps:
step S1: a three-dimensional object symmetry detection model is constructed based on a neural network, a shared attention mechanism is adopted in the three-dimensional object symmetry detection model, a variety of loss functions are added, and training samples with symmetry information are collected and labeled.
The three-dimensional object symmetry detection model obtains the symmetric points of the input point cloud data by detecting the object's symmetry planes/axes, and uses these symmetric points to derive rotation and translation operations that transform the point cloud into coordinates whose projection plane is a symmetric structure, yielding multiple groups of data-enhanced point cloud data.
Step S1-1: constructing a self-attention mechanism module based on the neural network.
The self-attention mechanism is the computer's way of imitating the synaptic activation process in the brain during human observation: internal experience is fused and aligned with external perceptual information to enhance the fineness of observation in the region of interest. The special structure of the attention mechanism allows it to rapidly extract important features from sparse data, so it is applied here to three-dimensional point cloud processing tasks with sparse spatial distributions, in particular the point cloud down-sampling task. A specific instantiation of the self-attention model is shown in fig. 2. The self-attention mechanism creates three vectors for each point of the point cloud data in three-dimensional coordinates: a query vector (Q), a key vector (K), and a value vector (V). These three vectors are the computer's high-level abstraction for computing attention; the product of Q and K is then calculated to score the importance of the semantic association between the input point and each point in the point cloud, and these scores determine the influence of the other points on the current point, that is, their importance to it.
The scores are then normalized by a logistic regression function so that all scores are positive and sum to 1. The normalized scores are multiplied by V to obtain the importance-score matrix of the point cloud value vectors. Finally, a matrix function aggregates the importance-score matrix of the value vectors with the originally input point cloud data to obtain point cloud data containing both coordinate information and importance information. The self-attention function is formalized as:

y_i = θ(ρ(γ(Q, K)) · V, x_i)

where y_i is the new output feature generated by the self-attention module for input point x_i; β and α denote point-wise feature transformation operations, and the three vectors Q, K, and V are obtained by point-multiplying the point embedding vector with three feature transformation matrices separately trained during neural network training; γ and θ are matrix functions, where γ denotes the multiplication of Q and K in fig. 2, and θ denotes the aggregation operation between the importance-score matrix of the value vectors and the originally input point cloud data, common choices being addition, subtraction, multiplication, and concatenation; δ denotes the position encoding function of the point cloud; and ρ denotes a normalization function, for which the invention uses the logistic regression (softmax) function.
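A hedged PyTorch sketch of this attention step is given below; it uses the concatenation form of the aggregation θ (see step S2-2), and all names and dimensions are assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class PointSelfAttention(nn.Module):
    """gamma = product of Q and K, rho = softmax normalization, theta =
    aggregation with the original input (concatenation variant)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)  # three transform matrices
        self.theta = nn.Linear(2 * dim, dim)               # aggregation after concat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim) point embeddings with position encoding added
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        attended = scores @ v                              # importance scores times V
        return self.theta(torch.cat([attended, x], dim=-1))
```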
Step S1-2: a shared self-attention mechanism module is constructed, and a diversity loss function is introduced.
As the scale of point cloud data in rail transit grows, the semantic information contained in the data becomes more and more complex, and a single attention mechanism can hardly attend to all important targets. A shared self-attention module is therefore constructed: multiple groups of the self-attention models of step S1-1 are connected in parallel so that, during deep learning training, each self-attention module focuses on a specific target, improving the feature extraction of semantic detail. Meanwhile, so that the different self-attention modules within the shared module effectively focus on different targets, distinct from those of the other attention modules, the invention introduces a diversity loss function that pushes the model to consciously learn diverse targets during training. The diversity loss function L_var is expressed as follows:

L_var = Σ_i Σ_{p≠q} (w_i^p · w_i^q) / (‖w_i^p‖ ‖w_i^q‖)

where i indexes the different point cloud samples, w denotes the learned attention weights, and p and q denote two different self-attention models in the same shared attention module.
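The sketch below implements a diversity loss with the stated properties, penalizing pairwise similarity between the attention weights of different heads p ≠ q; the cosine-similarity form is an assumption consistent with the definitions above.

```python
import torch
import torch.nn.functional as F

def diversity_loss(weights: torch.Tensor) -> torch.Tensor:
    """weights: (batch, heads, n_points) attention weights of the shared
    self-attention module. Penalizes similarity between every head pair
    p != q so that different heads attend to different targets."""
    w = F.normalize(weights, dim=-1)
    sim = w @ w.transpose(-2, -1)                          # (batch, heads, heads)
    diag = torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    heads = w.shape[1]
    return (sim - diag).sum() / (w.shape[0] * heads * (heads - 1))
```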
Step S1-3: and constructing a three-dimensional object symmetry detection model structure based on a shared self-attention mechanism.
A three-dimensional object symmetry detection model is designed based on the shared self-attention module of S1-2; a specific instantiation is shown in fig. 3. The original input point cloud data is fed into a shared self-attention model containing ten self-attention modules, and under the constraint of the diversity loss function L_var of S1-2, each self-attention model learns the feature information of one symmetry plane of the point cloud. The original point cloud data P ∈ R^{3+f} is then concatenated with all the feature information, and the result is input into a shared fully connected network to learn multiple groups of rotation and translation matrices of the point cloud simultaneously. Here f denotes the feature information other than the three-dimensional coordinates in the point cloud data, generally including image RGB, reflectivity, features learned by deep networks, and the like. The concatenation operation can be expressed as:

F_output = concat(f_i^1, f_i^2, …, f_i^9, f_i^{10}, P)

Specifically, the concatenation is a combination along the channel dimension of the neural network; that is, features describing the point cloud itself are added, while the information under each feature is not summed. The shared fully connected network consists of three cascaded parts: a multi-layer perceptron (MLP), batch normalization (BN), and a rectified linear unit (ReLU). Its mathematical representation is as follows:

F_output = ReLU(BN(MLP(F_in)))
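A minimal PyTorch sketch of this shared fully connected block follows; implementing the MLP as a 1×1 convolution shares one set of weights across all points, and the channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SharedFC(nn.Module):
    """F_output = ReLU(BN(MLP(F_in))); the 1x1 convolution shares one set of
    weights across all points."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.mlp = nn.Conv1d(c_in, c_out, kernel_size=1)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # f_in: (batch, c_in, n_points), e.g. the channel-wise concatenation of
        # the ten per-head features with the original point features P
        return torch.relu(self.bn(self.mlp(f_in)))
```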
and finally, multiplying the input point cloud data by the learned multiple groups of rotation and translation matrixes to obtain multiple groups of data enhanced point cloud picture data under the new coordinates with the symmetric projection planes.
Step S1-4: training samples with symmetric information are collected and labeled.
Current task networks based on deep learning rely on data driving; methods such as neural network learning on big data can improve human efficiency and even replace manual work in specific scenarios, realizing machine intelligence. The greatest advantage of the three-dimensional object symmetry detection model designed by the invention is that it is entirely data-driven, requires no manual intervention, and makes maximal use of the information contained in the sample data to detect the symmetry planes of point clouds. Therefore, to maximize the detection accuracy of the model, the invention starts from public data sets containing symmetry-information labels, the ShapeNet data set and the YCB data set (public data set access address: https://github.com/GodZaarathusta/SymmetryNet), which can be continuously expanded with sample data containing symmetric structures that appear in real rail transit scenes, such as pedestrians, bicycles, and cars, so that the accuracy of the symmetry detection model improves gradually.
The purpose of this module is to design a neural network that can learn the symmetry information of three-dimensional point cloud data and apply rotation transformations to the data. A neural network is just a piece of machine code; after training on a large number of data samples it learns specific weights (which can be thought of as matrix parameters or individual values) and then processes its input with those fixed weights. Therefore, the first step in having the network learn three-dimensional point cloud data is to collect training samples, label them, and train the neural network with them, thereby fulfilling the module's goal: a neural network that learns the symmetry information of three-dimensional point cloud data and applies rotation transformations to the data.
Step S2: constructing a Transformer network model, extracting the global and local feature information of the multiple groups of data-enhanced point cloud data through the Transformer network model, acquiring the importance degree of each point in the point cloud data, and learning the down-sampled point cloud data.
The processing of the input point clouds above yields the richer training set required to train the point cloud Transformer network model. The point cloud Transformer network model of the invention mainly comprises two modules: a coordinate-based position encoding module and a self-attention module; the self-attention module is the core of the Transformer and obtains refined global feature information of the input point cloud.
The method also introduces a local feature extraction module based on a convolutional neural network framework, which extracts the detailed semantic information of the point cloud at fine granularity and supplements the local features neglected in the global features, improving the robustness of the whole down-sampling model. The following sections explain these modules in detail.
Step S2-1: Input embedding module
The point cloud Transformer network model mainly comprises three modules: an input embedding module, a position encoding module, and a self-attention module. Notably, a point cloud is permutation-invariant: different orderings do not change the true representation of the point cloud data or the expression of its semantic information. The input embedding module and the position encoding module are therefore combined, and the natural position coordinate information of the three-dimensional point cloud is used to model the spatial distribution of the point cloud.
The point cloud Transformer network model is in fact a robust point cloud down-sampling model that can down-sample all kinds of point cloud inputs, that is, reduce the scale of the point cloud so that the S3 task network needs fewer points, effectively reducing the computation and memory costs of the network.
Step S2-2: self-attention model
Self-attention models have proven their effectiveness in many existing point cloud processing tasks. The invention therefore analyzes the data-enhanced point cloud data with a self-attention model and extracts its global feature information. The instantiated self-attention model is identical in structure to fig. 2 of step S1-1. In particular, various functions can serve as the aggregation operation θ; common choices are expressed by the following formulas:

addition: θ(SA(x_i); x) = SA(x_i) + x
subtraction: θ(SA(x_i); x) = SA(x_i) − x
concatenation: θ(SA(x_i); x) = [SA(x_i), x]
Hadamard product: θ(SA(x_i); x) = SA(x_i) ⊙ x
dot product: θ(SA(x_i); x) = SA(x_i) · x
Testing these five common functions in an ablation experiment, the invention finds that for the point cloud down-sampling task the concatenation operation contributes the highest accuracy to the deep learning network. The invention also verifies that cascading three groups of attention mechanisms achieves the best feature extraction effect.
Step S2-3: constructing local feature extraction units
Convolutional neural networks have a strong local feature extraction capability and can identify detailed information and effectively fuse features of complex scenes. In a conventional convolutional network, the output at a position in a two-dimensional image depends not only on the input at that position but also on the inputs at surrounding positions, with different positions carrying different weights. A three-dimensional point cloud, however, is a sparsely structured data form in which no point can be guaranteed at any given position, so convolution cannot be applied directly to point cloud tasks. To solve this, the invention proposes a new feature extraction module with two main components, a sampling-and-grouping layer and convolutional layers; a specific instantiation is shown in fig. 4.
The goal of the sampling-and-grouping layer is to build hierarchical subsets of the input point cloud. The specific steps are: (1) obtain the indexes of M initial sampling points with the farthest point sampling (FPS) function, extract those points from the original point cloud P through the sampling indexes, denote them new_points, and keep their spatial distribution in three-dimensional space; (2) set a spherical radius parameter r and establish a spherical coordinate system with each point of new_points as the center and r as the radius; (3) extract all points of the original cloud lying inside each sphere centered on new_points with radius r, forming a new point cloud data graph denoted new_ball_points; (4) set the number of sampling points K inside the spherical coordinates, sample K points from each new_ball_points with the K-nearest-neighbor algorithm, and remove the non-sampled points, forming a new point cloud data graph with a fixed number of sampling points, denoted new_ball_sampled_points; (5) subtract the value of new_points from the points in the new_ball_sampled_points region and concatenate the new and old features at each point, finally obtaining uniform point cloud patches constrained by the spherical volume and containing a fixed number of points. In mathematical notation, for an input point cloud P containing N points, the sampling-and-grouping layer yields M point cloud subsets {p_m}, m = 1, …, M, where each subset p_m consists of the K nearest-neighbor points of the corresponding center coordinate point c_m ∈ new_points, and the neighbor points satisfy the Euclidean metric ρ(x, c_m) = ‖x − c_m‖_2 ≤ r, giving M new dense point cloud subsets. Feature extraction is then performed on the point cloud with a convolutional neural network layer to obtain fine-grained local features of the point cloud, supplementing the local features neglected in the global features and improving the robustness of the whole down-sampling model. The invention uses two groups of cascaded local feature extraction modules to complete the local fine-grained semantic feature extraction.
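A compact sketch of steps (1) through (4) of the sampling-and-grouping layer follows, reusing the farthest point sampling sketch from the background section; it assumes every ball contains at least K points, and all helper names are illustrative.

```python
import torch

def sample_and_group(p: torch.Tensor, m: int, radius: float, k: int):
    """FPS picks M centers (new_points), a ball query of radius r gathers
    candidates, kNN keeps K per ball, and each patch is re-centered.
    p: (n, 3); reuses farthest_point_sampling() from the earlier sketch."""
    idx = torch.as_tensor(farthest_point_sampling(p.numpy(), m))
    centers = p[idx]                                      # new_points: (M, 3)
    d = torch.cdist(centers, p)                           # (M, n) Euclidean distances
    d = torch.where(d <= radius, d, torch.full_like(d, float("inf")))
    knn = d.topk(k, largest=False).indices                # K nearest inside each ball
    grouped = p[knn]                                      # (M, K, 3) local patches
    return grouped - centers.unsqueeze(1), centers        # subtract the center value
```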
Step S2-4: constructing the loss function for training the Transformer network model
For an input point cloud P = {p_i ∈ R^{3+f}, i = 1, 2, …, n} containing n points, the training goal of the Transformer network is to learn a subset P_s such that s < n while minimizing the task sampling loss L. The objective function L can be expressed as:

L = Σ_i ℓ(y_i, t_i)

where y_i denotes the network prediction and t_i denotes the true value. To satisfy the objective function L, the invention introduces a sampling regularization loss function L_sampling, expressed as:

L_sampling(P_s, P) = L_f(P_s, P) + L_m(P_s, P) + L_b(P_s, P)

where L_f and L_m denote the mean and maximum nearest-neighbor losses, respectively, and L_b denotes the close-neighbor matching loss.
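The sketch below realizes L_sampling under the assumption of equal weighting of the three named terms; the source specifies only the components L_f, L_m, and L_b.

```python
import torch

def sampling_regularization(ps: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """L_sampling = L_f + L_m + L_b for one cloud. ps: (s, 3) sampled subset,
    p: (n, 3) original cloud."""
    d = torch.cdist(ps, p)              # (s, n) pairwise distances
    nn_ps = d.min(dim=1).values         # each sampled point to its nearest original
    l_f = nn_ps.mean()                  # mean nearest-neighbor loss
    l_m = nn_ps.max()                   # maximum nearest-neighbor loss
    l_b = d.min(dim=0).values.mean()    # close-neighbor matching term
    return l_f + l_m + l_b
```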
The point cloud Transformer network model selects the set of three-dimensional points that contributes most to the task discrimination accuracy of the task network and takes it as the down-sampled point cloud data.
Step S3: constructing a task-driven task network model according to the requirements of different target tasks, inputting the down-sampled point cloud data into the task network model, and having the task network model learn the target task and output the target task result.
For specific point cloud processing tasks, such as point cloud classification and point cloud reconstruction, the invention provides a task-driven task network model based on the Transformer neural network. The task network model can be regarded as a three-dimensional point selection mechanism used to fulfil the original task requirements, such as target classification.
The down-sampled point cloud data obtained in step S2 serves as the input of the task network model to complete the target task. To balance the down-sampling scale against task accuracy, the three-dimensional object symmetry detection model, the Transformer network model, and the task network model are cascaded into an end-to-end integrated self-learning model framework, through which the point cloud data is down-sampled under the specified task.
Target tasks vary: different tasks such as target detection, target classification, and semantic segmentation can be performed on the same data set. The point cloud target task of the invention refers to one specific task, and the neural network model is trained with that task as the objective. Because the same data set can serve different applications in a neural network (for example target classification, target detection, or target reconstruction), and module S2 down-samples the same data set differently for different tasks, that is, the network learns different features, the qualifier "under the specified task" is added in S3.
Fig. 5 is a specific instantiation structure diagram of the task-driven Transformer network model according to an embodiment of the present invention. First, the three-dimensional object symmetry detection model constructed in step S1 performs data enhancement on the training samples. The enhanced data is then input into the Transformer network of step S2 to learn a simplified point cloud. Finally, the simplified point cloud is input into the task network model, which outputs the target task result. The overall end-to-end loss function is expressed as:

L_total(P, P_s) = α L_var(P) + β L_sampling(P, P_s) + L_task(P_s)

where α and β denote weights. This end-to-end loss function is used as the neural network training loss function in step S2, and the weight parameters in the network are updated through the back-propagation algorithm inherent to neural networks, continuously optimizing the output accuracy of the network. Here α and β are proportional coefficients by which L_var(P) and L_sampling(P, P_s) are multiplied within L_total(P, P_s), with the value range (0, 1], that is, a number between 0 and 1 where 0 cannot be taken and 1 can. P denotes the original point cloud input and P_s denotes the down-sampled point cloud data.
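One optimization step with this end-to-end loss might look as follows, reusing the loss sketches above; the function signature, the default weights, and the use of cross-entropy as L_task are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(optimizer, attn_weights, p, ps, task_logits, targets,
               alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """One step on L_total = alpha*L_var + beta*L_sampling + L_task; cross-entropy
    stands in for L_task of a classification task network whose own parameters
    stay frozen."""
    loss = (alpha * diversity_loss(attn_weights)
            + beta * sampling_regularization(ps, p)
            + F.cross_entropy(task_logits, targets))
    optimizer.zero_grad()
    loss.backward()          # back-propagation updates the S1/S2 module weights
    optimizer.step()
    return loss.detach()
```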
The task network may be replaced according to the user's target task. The invention designs the point cloud down-sampling model for an already-trained task neural network; that is, S1 and S2 are the learnable parts, trained according to the point cloud input and the network parameters obtained from the task network. The task network is defined as a neural network trained in advance with fixed parameters. The significance of the invention is that any task network S3 can, through the S1 and S2 modules, learn the smaller point cloud set it requires, effectively reducing the computational cost of the whole network while keeping the overall accuracy of the task network up to the user's requirements.
The invention takes the classification target task as an example for the design of the S3 model. The original point cloud data is input into the target network; feature mapping is performed first, i.e., the three-dimensional point cloud data is mapped into a feature space; the point cloud input features are then learned in that feature space through shared fully connected layers, with the weight parameters of the task network continuously updated until the task network reaches its maximum output accuracy. Finally, the trained task network model is fixed as the S3 model.
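A hedged sketch of such a classification task network, in the spirit of PointNet-style shared fully connected layers (implemented here as point-wise 1x1 convolutions); the layer widths and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PointClassifier(nn.Module):
    """Sketch: per-point feature mapping through shared fully connected
    layers, a symmetric max-pool over all points, and a classification head."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, x):            # x: (batch, 3, num_points)
        f = self.features(x)         # per-point features in feature space
        g = f.max(dim=2).values      # aggregate over points
        return self.head(g)          # class logits

logits = PointClassifier()(torch.randn(2, 3, 256))  # -> shape (2, 40)
```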
Fig. 6 is a schematic diagram of the overall network structure, with an example of a training sample and its corresponding down-sampled point cloud, according to an embodiment of the present invention.
In conclusion, the problem of balancing down-sampling scale against target task accuracy in a point cloud down-sampling algorithm is reduced to a task-driven point cloud self-attention metric learning problem. By designing a dedicated three-dimensional object symmetry detection model, the input point cloud data is rotated in three-dimensional space and translated into several new coordinate systems whose projection planes are symmetric, which enlarges the scale of the training samples and improves the generalization capability of the subsequently trained model. A converter network model is built to obtain point cloud semantic information as rich as possible, so that the whole model framework can effectively learn the key points, redundant points and noise points in the point cloud and obtain the importance degree information of each point in the data. Metric-based down-sampling is then performed according to this point-wise importance information, so as to minimize the accuracy loss of the target task.
The scheme of the invention effectively alleviates the problem of insufficient training samples and contributes to the robustness and generalization of subsequently trained models. The invention exploits the rich geometric and semantic information contained in an object's symmetry plane to rotate the point cloud input data dynamically through multiple angles, enlarging the scale of the training data and strengthening the generalization capability of the model. A self-attention and local feature extraction model is introduced to extract features of the input data in both the global and the local dimension and to obtain semantic information as rich as possible, so that the whole model can effectively distinguish key points, redundant points and noise points in the point cloud data. Combining these modules, a complete point cloud down-sampling model is designed that self-learns according to the specified point cloud task, finally achieving task-driven point cloud down-sampling that minimizes the accuracy loss of the target task.
Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and that the blocks or flows in the figures are not necessarily required for practicing the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, their description is relatively brief, and for the relevant parts reference may be made to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A point cloud data processing method based on a converter neural network is characterized by comprising the following steps:
step S1, constructing a three-dimensional object symmetry detection model, wherein the three-dimensional object symmetry detection model obtains the symmetric points of the input point cloud data by detecting the symmetry plane/axis of an object, and uses the symmetric points to convert the projection plane of the point cloud data into a symmetric structure through rotation and translation operations, obtaining multiple groups of data-enhanced point cloud image data;
step S2, constructing a converter network model, extracting global feature information and local feature information of the multiple groups of data-enhanced point cloud image data through the converter network model, obtaining the importance degree information of each point in the point cloud data, and learning the down-sampled point cloud data; and
step S3, constructing a task-driven task network model according to different target task requirements, inputting the down-sampled point cloud data into the task network model, and performing target task learning by the task network model to output a target task result.
2. The method according to claim 1, wherein the step S1 specifically includes:
constructing a self-attention mechanism module based on a neural network, collecting and labeling training samples carrying symmetry information, and training the self-attention mechanism module with the training samples;
connecting a plurality of three-dimensional object symmetry detection models in parallel by introducing a diversity loss function, to obtain a shared self-attention module, wherein different self-attention modules within the shared self-attention module attend to different targets;
constructing a three-dimensional object symmetry detection model based on the shared self-attention module, and inputting the original point cloud data P ∈ R^{3+f} into the three-dimensional object symmetry detection model, wherein, under the constraint of the diversity loss function L_var, each self-attention model learns the feature information of one symmetry plane of the original point cloud data; concatenating the original point cloud data P ∈ R^{3+f} with all of the feature information, inputting the concatenation result into a shared fully connected network, and learning multiple groups of rotation and translation matrices of the point cloud image simultaneously, wherein f denotes the feature information of the point cloud data other than the three-dimensional coordinates, the concatenation operation being represented as:
[formula shown as image FDA0003256368320000021 in the original publication]
and multiplying the original point cloud data by the multiple groups of learned rotation and translation matrices to obtain multiple groups of data-enhanced point cloud image data in new coordinates whose projection planes have a symmetric structure.
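For illustration, a minimal PyTorch sketch of this final multiplication step: applying k learned rotation and translation matrices to one point cloud. The identity/zero stand-ins replace the matrices that the shared self-attention module would actually learn, and the function name is hypothetical.

```python
import torch

def apply_rigid_transforms(points, rotations, translations):
    """points: (n, 3); rotations: (k, 3, 3); translations: (k, 3).
    Returns k transformed copies of the cloud, one per learned
    symmetry-plane-aligned coordinate frame."""
    # result[k, n, i] = sum_j points[n, j] * rotations[k, i, j], then translate
    return torch.einsum('nj,kij->kni', points, rotations) + translations[:, None, :]

points = torch.randn(1024, 3)
rotations = torch.eye(3).repeat(4, 1, 1)   # stand-ins for learned matrices
translations = torch.zeros(4, 3)
augmented = apply_rigid_transforms(points, rotations, translations)  # (4, 1024, 3)
```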
3. The method of claim 2, wherein the self-attention mechanism creates three vectors for each point of the point cloud data in three-dimensional coordinates: a query vector Q, a key vector K and a value vector V, and scores the semantic association degree between each input point and the points of the point cloud by computing the product of Q and K;
the self-attention function is formalized as:
y_i = θ(ρ(γ(Q, K)), V)
where y_i is the new output feature generated by the self-attention module; β and α denote point-wise feature transformation operations, and the three vectors Q, K and V are obtained by multiplying the point embedding vectors with three feature transformation matrices respectively learned during training of the neural network; γ and θ are matrix functions, with γ denoting the multiplication operation on Q and K and θ denoting the aggregation of the importance score matrix with the value vectors of the originally input point cloud data; and ρ denotes the normalization function.
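Under the reading of claim 3 given above (γ as the Q-K product, ρ as a softmax normalization, θ as aggregation with V), a minimal sketch might look as follows; this is an interpretation under stated assumptions, not the patented implementation:

```python
import torch

def point_self_attention(Q, K, V):
    """Q, K, V: (num_points, dim) per-point query/key/value vectors."""
    scores = Q @ K.transpose(0, 1)           # gamma: pairwise association scores
    weights = torch.softmax(scores, dim=-1)  # rho: normalization
    return weights @ V                       # theta: aggregate the value vectors

Q = K = V = torch.randn(256, 64)
y = point_self_attention(Q, K, V)            # (256, 64) new per-point features
```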
4. The method of claim 2, wherein the diversity loss function L_var is expressed as follows:
[formula shown as image FDA0003256368320000023 in the original publication]
where i indexes the different point cloud samples, w denotes the learned attention weights, and p and q denote two different self-attention models within the same shared attention module.
5. The method of claim 2, wherein the shared fully connected network consists of three parts in series: a multilayer perceptron, a batch normalization function and a linear rectification function, the mathematical expression of the shared fully connected network being:
F_output = ReLU(BN(MLP(F_in)))
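A minimal PyTorch sketch of this shared fully connected unit, assuming a point-wise 1x1 convolution as the MLP so that the same weights are shared across all points; the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

# F_output = ReLU(BN(MLP(F_in))), applied point-wise with shared weights
shared_fc = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=1),  # MLP applied to each point
    nn.BatchNorm1d(128),                # batch normalization
    nn.ReLU(),                          # linear rectification
)
F_in = torch.randn(8, 64, 1024)         # (batch, channels, points)
F_output = shared_fc(F_in)              # (8, 128, 1024)
```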
6. the method according to any one of claims 2 to 5, wherein the step S2 specifically comprises:
constructing a converter network model comprising an input embedding module, a position encoding module and a self-attention module, and training the converter network model with a loss function; combining the input embedding module with the position encoding module, and modeling the spatial distribution of the multiple groups of data-enhanced point cloud image data through the combined modules using the natural position coordinate information of the three-dimensional point cloud;
analyzing the multiple groups of data-enhanced point cloud image data with the self-attention model on the basis of their spatial distribution model, and extracting the global feature information of the data-enhanced point cloud image data;
constructing a local feature extraction unit comprising a sampling-and-grouping layer and a convolution layer, establishing multiple hierarchical point cloud subsets of the multiple groups of data-enhanced point cloud image data through the sampling-and-grouping layer, and performing feature extraction on the multiple point cloud subsets with the convolutional neural network layer to obtain fine-grained local features of the data-enhanced point cloud image data; and
integrating, by the self-attention module, the global feature information and the local feature information of the multiple groups of data-enhanced point cloud image data, and selecting the three-dimensional point set that contributes most to the task discrimination accuracy of the task network, to obtain the down-sampled point cloud data.
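For illustration, a hedged sketch of a sampling-and-grouping layer in the usual farthest-point-sampling / k-nearest-neighbor style; the patent does not specify these exact routines, so the function names and parameters are assumptions:

```python
import torch

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from those already chosen.
    points: (n, 3); returns indices of m sampled points (seeded at index 0)."""
    n = points.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    for i in range(1, m):
        dist = torch.minimum(dist, ((points - points[idx[i - 1]]) ** 2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def group_neighbors(centers, points, k):
    """Group each sampled center with its k nearest neighbors, forming the
    hierarchical point cloud subsets that the convolution layer consumes."""
    d = torch.cdist(centers, points)         # (m, n) pairwise distances
    return d.topk(k, largest=False).indices  # (m, k) neighbor indices

pts = torch.randn(1024, 3)
centers_idx = farthest_point_sampling(pts, 128)
groups = group_neighbors(pts[centers_idx], pts, 16)  # (128, 16)
```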
7. The method of claim 6, wherein training the converter network model with the loss function comprises:
for an input point cloud P = {p_i ∈ R^{3+f}, i = 1, 2, …, n} containing n points, the training goal of the converter network is to learn a subset P_s of s points such that s < n while minimizing the task sampling loss L, the objective function L being expressed as:
[formula shown as image FDA0003256368320000041 in the original publication]
where t_i denotes the ground-truth value; to satisfy the objective function L, a sampling regularization loss function L_sampling is introduced, expressed as follows:
[formula shown as image FDA0003256368320000042 in the original publication]
where L_f and L_m denote the average and maximum nearest-neighbor losses, respectively, and L_b denotes the nearest-neighbor matching loss.
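A hedged sketch of one common form of such a regularization term (in the style of SampleNet simplification losses): the average and maximum nearest-neighbor distances from P_s to P, plus a matching term from P back to P_s. The weights gamma and delta are illustrative assumptions, not values from the patent:

```python
import torch

def sampling_regularization(P, Ps, gamma=1.0, delta=1.0):
    """P: (n, 3) original cloud; Ps: (s, 3) sampled subset."""
    d = torch.cdist(Ps, P) ** 2    # (s, n) squared pairwise distances
    near_sp = d.min(dim=1).values  # each sampled point to its nearest original
    near_ps = d.min(dim=0).values  # each original point to its nearest sample
    L_f = near_sp.mean()           # average nearest-neighbor loss
    L_m = near_sp.max()            # maximum nearest-neighbor loss
    L_b = near_ps.mean()           # nearest-neighbor matching term
    return L_f + gamma * L_m + delta * L_b

loss = sampling_regularization(torch.randn(1024, 3), torch.randn(64, 3))
```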
8. The method according to claim 7, wherein the step S3 specifically includes:
constructing a task-driven task network model based on the converter neural network, inputting the down-sampled point cloud data into the task network model, designing the three-dimensional object symmetry detection model and the converter network model on the basis of the task network model, and designing the end-to-end loss function as:
L_total(P, P_s) = α·L_var(P) + β·L_sampling(P, P_s) + L_task(P_s)
where α and β represent weights.
taking the end-to-end loss function as the training loss function of the three-dimensional object symmetry detection model and the converter network model, updating the weight parameters in the three-dimensional object symmetry detection model and the converter network model through the back-propagation algorithm inherent to neural networks, and continuously optimizing the output accuracy of the three-dimensional object symmetry detection model, the converter network model and the task network model; and
performing down-sampling on the input point cloud data through the finally optimized symmetry detection model, converter network model and task network model, mapping the down-sampled point cloud data to a feature space, and learning the point cloud input features in the feature space through shared fully connected layers to obtain the output result of the target task.
CN202111060998.0A 2021-09-10 2021-09-10 Point cloud data processing method based on transformer neural network Active CN113870160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111060998.0A CN113870160B (en) 2021-09-10 2021-09-10 Point cloud data processing method based on transformer neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111060998.0A CN113870160B (en) 2021-09-10 2021-09-10 Point cloud data processing method based on transformer neural network

Publications (2)

Publication Number Publication Date
CN113870160A true CN113870160A (en) 2021-12-31
CN113870160B CN113870160B (en) 2024-02-27

Family

ID=78995275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111060998.0A Active CN113870160B (en) 2021-09-10 2021-09-10 Point cloud data processing method based on transformer neural network

Country Status (1)

Country Link
CN (1) CN113870160B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091628A (en) * 2022-01-20 2022-02-25 山东大学 Three-dimensional point cloud up-sampling method and system based on double branch network
CN115049786A (en) * 2022-06-17 2022-09-13 北京交通大学 Task-oriented point cloud data down-sampling method and system
CN117274454A (en) * 2023-08-29 2023-12-22 西交利物浦大学 Three-dimensional point cloud completion method, device and storage medium based on component information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018159690A1 (en) * 2017-02-28 2018-09-07 国立研究開発法人理化学研究所 Point cloud data extraction method and point cloud data extraction device
CN111340145A (en) * 2020-05-19 2020-06-26 北京数字绿土科技有限公司 Point cloud data classification method and device and classification equipment
CN112614071A (en) * 2020-12-29 2021-04-06 清华大学 Self-attention-based diverse point cloud completion method and device
CN112633330A (en) * 2020-12-06 2021-04-09 西安电子科技大学 Point cloud segmentation method, system, medium, computer device, terminal and application
CN112837356A (en) * 2021-02-06 2021-05-25 湖南大学 WGAN-based unsupervised multi-view three-dimensional point cloud joint registration method
CN113128591A (en) * 2021-04-14 2021-07-16 中山大学 Rotation robust point cloud classification method based on self-supervision learning
CN113345106A (en) * 2021-06-24 2021-09-03 西南大学 Three-dimensional point cloud analysis method and system based on multi-scale multi-level converter

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIN FANG et al.: "RotPredictor: Unsupervised Canonical Viewpoint Learning for Point Cloud Classification", 2020 International Conference on 3D Vision (3DV)
CUI Tuantuan et al.: "Airborne LiDAR Point Cloud Classification Based on Multi-Feature Fusion and Neural Network", Journal of Computer Applications

Also Published As

Publication number Publication date
CN113870160B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN111563508B (en) Semantic segmentation method based on spatial information fusion
Hussain et al. A deep neural network and classical features based scheme for objects recognition: an application for machine inspection
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
Kim et al. Beyond classification: Directly training spiking neural networks for semantic segmentation
CN113870160B (en) Point cloud data processing method based on transformer neural network
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN110222718B (en) Image processing method and device
Hurtado et al. Semantic scene segmentation for robotics
CN114596520A (en) First visual angle video action identification method and device
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN112861931B (en) Multi-level change detection method, system, medium and electronic device based on difference attention neural network
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN112434723B (en) Day/night image classification and object detection method based on attention network
CN113610144A (en) Vehicle classification method based on multi-branch local attention network
Zhou et al. MMSMCNet: Modal memory sharing and morphological complementary networks for RGB-T urban scene semantic segmentation
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
Wang et al. Global perception-based robust parking space detection using a low-cost camera
Luo et al. RBD-Net: robust breakage detection algorithm for industrial leather
Li et al. Multi-view convolutional vision transformer for 3D object recognition
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
Kim et al. Resolution reconstruction of climate data with pixel recursive model
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant