CN117132650A - Category-level 6D object pose estimation method based on point cloud image attention network - Google Patents

Category-level 6D object pose estimation method based on point cloud graph attention network

Info

Publication number
CN117132650A
CN117132650A (application CN202311083936.0A)
Authority
CN
China
Prior art keywords
point
point cloud
points
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311083936.0A
Other languages
Chinese (zh)
Inventor
黄章进
邹露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202311083936.0A
Publication of CN117132650A
Legal status: Pending (current)

Classifications

    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a category-level 6D object pose estimation method based on a point cloud graph attention network, which comprises the following steps: S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera; S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network; S3, reconstructing a 3D point cloud model of the object by using a shape prior adaptation mechanism and a category shape prior point cloud, and regressing the object's normalized object coordinate space (NOCS) coordinates; and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object. Experiments performed on the NOCS-REAL dataset demonstrate that the proposed scheme outperforms the prior art and achieves better results.

Description

Category-level 6D object pose estimation method based on point cloud graph attention network
Technical Field
The invention relates to the technical field of computer vision and object pose estimation, and in particular to a category-level 6D object pose estimation method based on a point cloud graph attention network.
Background
Category-level six-degree-of-freedom (6D) object pose estimation is a fundamental problem in computer vision: it involves predicting the 3D rotation and 3D translation of an object from the object coordinate system to the camera coordinate system, as well as the 3D size of the object. The technology is widely used in robotics, augmented reality, autonomous driving, and other fields. The problem is extremely challenging because of the large shape variations among objects within a category.
To address this problem, correspondence-based methods recover the pose and size of an object by computing a similarity transformation between the observed point cloud and reconstructed NOCS (normalized object coordinate space) coordinates. The quality of the reconstructed NOCS coordinates therefore implicitly determines the accuracy of the subsequent pose estimation. To improve the reconstruction quality of the NOCS coordinates, some methods, such as SPD and SGPA, reconstruct a 3D model of the object by deforming a category shape prior point cloud that represents the mean shape of objects in the same category, and establish 3D-3D correspondences between the observed point cloud and the reconstructed point cloud model, thereby achieving 6D pose and size estimation. Although these correspondence-based approaches have made great progress, they fail to fully exploit the unique structural features of individual instances, which limits their predictive capability. How to provide a category-level 6D object pose estimation method based on a point cloud graph attention network is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a category-level 6D object pose estimation method based on a point cloud graph attention network. Experiments performed on the NOCS-REAL dataset demonstrate that the proposed scheme outperforms the prior art and achieves better results.
According to an embodiment of the invention, the category-level 6D object pose estimation method based on a point cloud graph attention network comprises the following steps:
S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera;
S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network;
S3, reconstructing a 3D point cloud model of the object by using a shape prior adaptation mechanism and a category shape prior point cloud, and regressing the object's normalized NOCS coordinates;
and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object (a minimal end-to-end sketch of this pipeline is given after these steps).
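Purely for illustration, the following Python sketch shows how steps S1 to S4 fit together. Every name in it (segmenter, network, backproject, umeyama_ransac, prior_points) is a placeholder passed in by the caller and is not part of the claimed method; the callables are sketched further below in the detailed description.
```python
def estimate_pose(rgb, depth, intrinsics, prior_points,
                  segmenter, network, backproject, umeyama_ransac):
    """Illustrative S1-S4 pipeline; all callables are placeholders supplied by the caller."""
    # S1: instance segmentation and back-projection of the masked depth pixels
    mask = segmenter(rgb)                              # e.g. a Mask R-CNN wrapper
    observed = backproject(depth, mask, intrinsics)    # observed point cloud, shape (N_o, 3)

    # S2 + S3: the point cloud graph attention network and the shape prior adaptation
    # mechanism regress a deformation field D and a correspondence matrix A
    D, A = network(observed, prior_points)             # D: (N_r, 3), A: (N_o, N_r)
    nocs = A @ (prior_points + D)                      # reconstructed NOCS coordinates, (N_o, 3)

    # S4: similarity transform (scale, rotation, translation) via Umeyama + RANSAC
    scale, rotation, translation = umeyama_ransac(nocs, observed)
    return rotation, translation, scale
```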
Optionally, step S1 specifically includes:
S11, segmenting and detecting the objects in the RGB-D image data by using Mask R-CNN to obtain the object mask region;
S12, mapping the object mask region onto the depth image of the object to obtain the object depth region;
S13, converting the depth information of the object into a three-dimensional object point cloud by using the camera parameters, generating the observed point cloud data seen by the camera.
Optionally, the point cloud graph attention network is an encoder-decoder architecture comprising:
a graph attention encoder for extracting multi-scale local-to-global object features from the observed point cloud;
the graph attention encoder takes the observed point cloud P_o ∈ ℝ^(N_o×3) as input,
where ℝ denotes the set of real numbers, N_o denotes the number of points, and 3 denotes the XYZ three-dimensional coordinates of each point;
the original three-dimensional coordinates of the observed point cloud are converted into high-dimensional features by a position embedding module;
local-to-global instance geometric features are extracted from the input feature embedding in a hierarchical manner by a graph attention module;
and an iterative non-parametric decoder for aggregating the multi-scale geometric features.
Optionally, the position embedding module encodes the position information in the observed point cloud by using a 3D graph convolution layer; for each observed point p_n in the observed point cloud:
the coordinate set {p_m}, m = 1, ..., M, of its M nearest neighbors is found by a nearest-neighbor search algorithm and used as the receptive field of the 3D graph convolution kernel,
where M denotes the number of nearest neighbors, m indexes one of those points, and p_m denotes the three-dimensional coordinates of that point;
the direction vectors d_(m,n) within the receptive field obtained by the nearest-neighbor search algorithm are computed as
d_(m,n) = p_m - p_n;
and the support-point kernel vectors k_s are initialized from a uniform distribution,
where S denotes the number of support points and each support point k_s is a three-dimensional coordinate;
p_n is embedded into a C_0-dimensional feature vector by correlating the direction vectors with the support kernel vectors through the normalized inner product ⟨d_(m,n), k_s⟩ / (‖d_(m,n)‖·‖k_s‖) and taking the maximum over the neighbors, and the obtained feature vector is passed through a ReLU activation function to generate the position embedding,
where max denotes the maximum operation, ⟨·,·⟩ denotes the vector inner product, and ‖·‖ denotes the length of a vector.
Optionally, the graph attention module performs a multi-stage operation on the position embedding:
G_e(P_o), denoted G_e ∈ ℝ^(N_o×C_0), where N_o denotes the number of points and C_0 the feature dimension of each point;
each stage has a different hidden dimension C_i, and in each stage i three different point-wise feature extraction layers are applied in turn to convert the input point features into the corresponding dimension C_i:
i) a graph convolution layer GCL;
ii) a point-wise self-attention layer PSAL;
iii) a feed-forward layer FFN.
Optionally, the graph convolution layer GCL extracts local geometric features F_i ∈ ℝ^(N_i×C_i) of the object from the input point features by using the graph structure defined by the neighboring points of each point,
where N_i denotes the number of points in the i-th stage and C_i the feature dimension of each point;
the graph convolution layer GCL consists of a 3D graph convolution layer and a ReLU function.
The point-wise self-attention layer PSAL uses a point cloud self-attention mechanism to extract global geometric features G_i from the local geometric features F_i.
The point-wise self-attention layer applies shared multi-layer perceptron networks to project the local geometric features F_i to query, key, and value vectors, denoted Q_i, K_i, and V_i respectively:
Q_i = F_i W_i^Q
K_i = F_i W_i^K
V_i = F_i W_i^V
where W_i^Q, W_i^K, and W_i^V are learnable matrices of dimension C_i × C_i;
point-to-point attention weights between the query vectors and the key vectors are computed to capture the global geometric relationships between different points;
the global geometric features G_i are obtained by multiplying the attention weights with the value vectors.
The feed-forward layer FFN generates the final output of each stage of the graph attention module by using a shared multi-layer perceptron network and a ReLU activation function;
a residual connection to the local geometric features F_i is added;
the obtained geometric features are used as the input point features of the next stage;
the geometric features of the last stage are passed through a global max-pooling operation followed by a repeat operation to obtain the global shape feature of the object, where C_4 = C_5, which describes the global shape information of the instance.
Optionally, the graph attention module includes two graph max-pooling layers, and the output of the graph attention module consists of five parts:
at each stage, the graph convolution layer GCL, the point-wise self-attention layer PSAL, and the feed-forward layer FFN are stacked in order, extracting the object geometric features from the input point features in a point-wise manner.
Optionally, the iterative non-parametric decoder operates as follows:
let P_i denote the point set of the i-th stage of the graph attention module, where for i ∈ {2, 3, 4} the geometric features are defined on a downsampled set of points;
in each iteration of the i-th stage, a nearest-neighbor search algorithm is used to find, for each point p_(n,i-1) of the previous-stage point set P_(i-1), its nearest neighbor q_(n,i) in the i-th stage;
once the nearest neighbor q_(n,i) of each point p_(n,i-1) is determined, the features of point q_(n,i) are propagated to point p_(n,i-1);
the updated features of the points of the (i-1)-th stage are then used to update the features of the points of the (i-2)-th stage;
the whole process proceeds iteratively until the features of all points of the first stage are determined, and it is applied to the output of every stage of the graph attention module;
once all multi-scale geometric features are aligned to the same point set, they are aggregated through a concatenation operation with the position embedding G_e and the global shape feature to generate the final geometric features,
whose feature dimension is the sum of the dimensions of all concatenated features.
Optionally, step S3 includes:
given the observed point cloud P_o ∈ ℝ^(N_o×3) and the corresponding category shape prior point cloud P_r ∈ ℝ^(N_r×3), where N_o and N_r denote the respective numbers of points and each point is an XYZ three-dimensional coordinate;
a point-wise prior feature G_r is extracted from the category shape prior, the point-wise prior feature G_r being the concatenation of a local prior feature and a global prior feature;
the local prior features are generated by a three-layer multi-layer perceptron network, and the global prior features are generated from the local prior features by a further two-layer multi-layer perceptron network, where D_1 and D_2 are the dimensions of the local and global prior features;
a ReLU activation function is used after each multi-layer perceptron layer, and an adaptive max-pooling operation is used after the last ReLU activation function to generate the global prior feature;
after the prior features G_r and the geometric features of the observed point cloud are obtained, the shape prior adaptation mechanism generates from these features a deformation field D ∈ ℝ^(N_r×3) and a correspondence matrix A ∈ ℝ^(N_o×N_r), where N_r and N_o are the numbers of points, for NOCS coordinate regression;
each row d_i of D represents the deformation of the corresponding point from the prior point cloud P_r to the reconstructed point cloud P_r + D;
each row a_i of A, whose elements sum to 1, represents the soft correspondence between a point of the observed point cloud P_o and all points of the reconstructed point cloud;
the shape prior adaptation stage uses two parallel networks, each consisting of a three-layer multi-layer perceptron network, to regress the deformation field D and the correspondence matrix A, respectively;
multiplying A with the reconstructed point cloud P_r + D gives the NOCS coordinates of the object: P_nocs = A (P_r + D).
optionally, the observation point cloud P o And reconstructed NOCS coordinates thereofThe Umeyama algorithm is used in combination with the RANSAC algorithm to calculate the 6D object pose and 3D dimensions.
The beneficial effects of the invention are as follows:
(1) The invention provides a novel point cloud graph attention network, which adopts a network model with an encoder-decoder architecture to extract the unique structural features of an individual instance from the object point cloud, thereby improving the accuracy of category-level object pose estimation.
(2) The invention provides a graph attention encoder, which first uses 3D graph convolution to extract multi-scale local structural features of the point cloud and then adopts a self-attention mechanism to extract multi-scale global structural features from the local structural features.
(3) The invention provides an iterative non-parametric decoder, which propagates the multi-scale global structural features from fine granularity to coarse granularity, retaining the multi-scale structural features while avoiding information loss during feature propagation.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of the category-level 6D object pose estimation method based on a point cloud graph attention network;
FIG. 2 is a block diagram of the point cloud graph attention network in the category-level 6D object pose estimation method based on a point cloud graph attention network according to the present invention;
FIG. 3 shows visual results of 6D pose and 3D size estimation on the REAL275 dataset by the present invention and state-of-the-art methods;
FIG. 4 shows visual results of 3D shape reconstruction on the REAL275 dataset by the present invention and state-of-the-art methods.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, a category-level 6D object pose estimation method based on a point cloud graph attention network includes:
S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera;
in this embodiment, S1 specifically includes:
S11, segmenting and detecting the objects in the RGB-D image data by using Mask R-CNN to obtain the object mask region;
S12, mapping the object mask region onto the depth image of the object to obtain the object depth region;
S13, converting the depth information of the object into a three-dimensional object point cloud by using the camera parameters, generating the observed point cloud data seen by the camera (a minimal back-projection sketch is given below).
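As an illustration of S13 only, the following is a minimal sketch of back-projecting a masked depth image into a point cloud with the pinhole camera model; the function name, the assumption that depth is stored in millimeters, and the fixed subsampling size are illustrative choices, not part of the claimed method.
```python
import numpy as np

def backproject(depth, mask, intrinsics, depth_scale=1000.0):
    """Convert masked depth pixels to an (N, 3) point cloud in camera coordinates.

    depth:      (H, W) depth image (assumed to be in millimeters here)
    mask:       (H, W) boolean instance mask, e.g. from Mask R-CNN
    intrinsics: 3x3 camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    vs, us = np.nonzero(mask & (depth > 0))   # pixel rows (v) and columns (u)
    z = depth[vs, us] / depth_scale           # metric depth
    x = (us - cx) * z / fx                    # pinhole back-projection
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=1)      # (N, 3) observed point cloud

    # optionally subsample to a fixed number of points (e.g. N_o = 1024)
    if len(points) > 1024:
        idx = np.random.choice(len(points), 1024, replace=False)
        points = points[idx]
    return points
```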
S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network;
referring to fig. 2, in this embodiment the point cloud graph attention network is an encoder-decoder architecture, which includes:
a graph attention encoder for extracting multi-scale local-to-global object features from the observed point cloud;
the graph attention encoder takes the observed point cloud P_o ∈ ℝ^(N_o×3) as input,
where ℝ denotes the set of real numbers, N_o denotes the number of points, and 3 denotes the XYZ three-dimensional coordinates of each point;
the original three-dimensional coordinates of the observed point cloud are converted into high-dimensional features by a position embedding module;
local-to-global instance geometric features are extracted from the input feature embedding in a hierarchical manner by a graph attention module;
in this embodiment, the position embedding module encodes the position information in the observed point cloud by using a 3D graph convolution layer; for each observed point p_n in the observed point cloud:
the coordinate set {p_m}, m = 1, ..., M, of its M nearest neighbors is found by a nearest-neighbor search algorithm and used as the receptive field of the 3D graph convolution kernel,
where M denotes the number of nearest neighbors, m indexes one of those points, and p_m denotes the three-dimensional coordinates of that point;
the direction vectors d_(m,n) within the receptive field obtained by the nearest-neighbor search algorithm are computed as
d_(m,n) = p_m - p_n;
and the support-point kernel vectors k_s are initialized from a uniform distribution,
where S denotes the number of support points and each support point k_s is a three-dimensional coordinate;
p_n is embedded into a C_0-dimensional feature vector by correlating the direction vectors with the support kernel vectors through the normalized inner product ⟨d_(m,n), k_s⟩ / (‖d_(m,n)‖·‖k_s‖) and taking the maximum over the neighbors, and the obtained feature vector is passed through a ReLU activation function to generate the position embedding,
where max denotes the maximum operation, ⟨·,·⟩ denotes the vector inner product, and ‖·‖ denotes the length of a vector.
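Under the assumption that each output channel has its own set of support kernel vectors and that the per-neighbor response is exactly the normalized inner product described above, a minimal sketch of this position embedding is given below; the precise kernel parameterization and weighting of the 3D graph convolution in the embodiment may differ.
```python
import torch
import torch.nn as nn

class GraphConvPositionEmbedding(nn.Module):
    """Sketch of a 3D graph-convolution position embedding (assumed form)."""

    def __init__(self, out_dim=128, num_neighbors=50, num_support=1):
        super().__init__()
        self.M = num_neighbors
        # one set of S support kernel vectors per output channel, uniformly initialized
        self.kernels = nn.Parameter(torch.empty(out_dim, num_support, 3).uniform_(-1, 1))

    def forward(self, points):                      # points: (N, 3)
        d = torch.cdist(points, points)             # pairwise distances, (N, N)
        knn = d.topk(self.M + 1, largest=False).indices[:, 1:]   # (N, M) neighbor ids, self removed
        dirs = points[knn] - points[:, None, :]     # direction vectors d_{m,n}, (N, M, 3)
        dirs = dirs / (dirs.norm(dim=-1, keepdim=True) + 1e-8)

        k = self.kernels / (self.kernels.norm(dim=-1, keepdim=True) + 1e-8)
        # normalized inner product between every direction vector and every kernel vector
        sim = torch.einsum('nmd,csd->nmcs', dirs, k)             # (N, M, C0, S)
        # max over neighbors and support vectors, then ReLU -> (N, C0) position embedding
        return torch.relu(sim.amax(dim=(1, 3)))

# usage: emb = GraphConvPositionEmbedding()(torch.randn(1024, 3))  # -> (1024, 128)
```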
In this embodiment, the graph attention module performs a multi-stage operation on the position embedding:
G_e(P_o), denoted G_e ∈ ℝ^(N_o×C_0), where N_o denotes the number of points and C_0 the feature dimension of each point;
each stage has a different hidden dimension C_i, and in each stage i three different point-wise feature extraction layers are applied in turn to convert the input point features into the corresponding dimension C_i:
i) a graph convolution layer GCL;
ii) a point-wise self-attention layer PSAL;
iii) a feed-forward layer FFN.
In the present embodiment, the graph convolution layer GCL extracts local geometric features F_i ∈ ℝ^(N_i×C_i) of the object from the input point features by using the graph structure defined by the neighboring points of each point,
where N_i denotes the number of points in the i-th stage and C_i the feature dimension of each point;
the graph convolution layer GCL consists of a 3D graph convolution layer and a ReLU function.
The point-wise self-attention layer PSAL uses a point cloud self-attention mechanism to extract global geometric features G_i from the local geometric features F_i.
The point-wise self-attention layer applies shared multi-layer perceptron networks to project the local geometric features F_i to query, key, and value vectors, denoted Q_i, K_i, and V_i respectively:
Q_i = F_i W_i^Q
K_i = F_i W_i^K
V_i = F_i W_i^V
where W_i^Q, W_i^K, and W_i^V are learnable matrices of dimension C_i × C_i;
point-to-point attention weights between the query vectors and the key vectors are computed to capture the global geometric relationships between different points;
the global geometric features G_i are obtained by multiplying the attention weights with the value vectors.
The feed-forward layer FFN generates the final output of each stage of the graph attention module by using a shared multi-layer perceptron network and a ReLU activation function;
a residual connection to the local geometric features F_i is added;
the obtained geometric features are used as the input point features of the next stage;
the geometric features of the last stage are passed through a global max-pooling operation followed by a repeat operation to obtain the global shape feature of the object, where C_4 = C_5, which describes the global shape information of the instance.
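As a sketch only, one point-wise self-attention layer followed by a feed-forward layer with the residual connection described above might look as follows; the scaled dot-product softmax used to normalize the attention weights is an assumption made for concreteness, since the exact normalization is not spelled out here.
```python
import torch
import torch.nn as nn

class PointSelfAttentionStage(nn.Module):
    """Sketch of one PSAL + FFN stage with a residual connection to F_i (assumed form)."""

    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)   # W_i^Q
        self.wk = nn.Linear(dim, dim, bias=False)   # W_i^K
        self.wv = nn.Linear(dim, dim, bias=False)   # W_i^V
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f):                            # f: (N_i, C_i) local features F_i
        q, k, v = self.wq(f), self.wk(f), self.wv(f)
        # point-to-point attention weights between queries and keys (softmax assumed)
        attn = torch.softmax(q @ k.t() / f.shape[-1] ** 0.5, dim=-1)   # (N_i, N_i)
        g = attn @ v                                 # global geometric features G_i
        return self.ffn(g) + f                       # FFN output plus residual to F_i

# usage: out = PointSelfAttentionStage(256)(torch.randn(1024, 256))  # -> (1024, 256)
```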
In this embodiment, the graph attention module includes two graph max-pooling layers, and the output of the graph attention module consists of five parts:
at each stage, the graph convolution layer GCL, the point-wise self-attention layer PSAL, and the feed-forward layer FFN are stacked in order, and the object geometric features are extracted from the input point features in a point-wise manner, realizing a local-to-global representation that effectively describes complex object geometries.
And an iterative non-parametric decoder is used for aggregating the multi-scale geometric features.
In this embodiment, the iterative non-parametric decoder operates as follows:
let P_i denote the point set of the i-th stage of the graph attention module, where for i ∈ {2, 3, 4} the geometric features are defined on a downsampled set of points;
in each iteration of the i-th stage, a nearest-neighbor search algorithm is used to find, for each point p_(n,i-1) of the previous-stage point set P_(i-1), its nearest neighbor q_(n,i) in the i-th stage;
once the nearest neighbor q_(n,i) of each point p_(n,i-1) is determined, the features of point q_(n,i) are propagated to point p_(n,i-1);
the updated features of the points of the (i-1)-th stage are then used to update the features of the points of the (i-2)-th stage;
the whole process proceeds iteratively until the features of all points of the first stage are determined, and it is applied to the output of every stage of the graph attention module;
once all multi-scale geometric features are aligned to the same point set, they are aggregated through a concatenation operation with the position embedding G_e and the global shape feature to generate the final geometric features,
whose feature dimension is the sum of the dimensions of all concatenated features.
The iterative non-parametric decoder enables the network to propagate the point-wise geometric features from fine granularity to coarse granularity in a progressive manner. It retains the multi-scale geometric features, avoids information loss during feature propagation between different scales, and requires no additional learnable parameters.
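For illustration, a minimal sketch of this nearest-neighbor feature propagation and final concatenation is given below; the data layout (lists of point sets and features ordered from the full-resolution stage to the most downsampled one) is an assumption, and the exact stage bookkeeping of the embodiment may differ.
```python
import torch

def iterative_decoder(point_sets, features, pos_embedding, global_shape):
    """Propagate multi-scale features back to the full-resolution point set.

    point_sets:    list of (N_i, 3) tensors, stage 1 first (N_1 >= N_2 >= ...)
    features:      list of (N_i, C_i) tensors, one per stage
    pos_embedding: (N_1, C_0) position embedding G_e
    global_shape:  (N_1, C_g) tiled global shape feature
    """
    aligned = [features[0]]
    for i in range(1, len(point_sets)):
        # walk the stage-i features back to stage 1, one nearest-neighbor
        # lookup per intermediate stage; no learned parameters are involved
        feat = features[i]
        for j in range(i, 0, -1):
            d = torch.cdist(point_sets[j - 1], point_sets[j])   # (N_{j-1}, N_j)
            nn_idx = d.argmin(dim=1)                            # nearest neighbor per point
            feat = feat[nn_idx]                                 # propagate features
        aligned.append(feat)                                    # now aligned to (N_1, C_i)
    # concatenate all aligned scales with the position embedding and global shape feature
    return torch.cat(aligned + [pos_embedding, global_shape], dim=1)
```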
S3, reconstructing a 3D point cloud model of the object by using a shape priori adaptation mechanism and a category shape priori point cloud, and regressing an object normalized NOCS coordinate;
in this embodiment, S3 includes:
given the observed point cloud P_o ∈ ℝ^(N_o×3) and the corresponding category shape prior point cloud P_r ∈ ℝ^(N_r×3), where N_o and N_r denote the respective numbers of points and each point is an XYZ three-dimensional coordinate;
a point-wise prior feature G_r is extracted from the category shape prior, the point-wise prior feature G_r being the concatenation of a local prior feature and a global prior feature;
the local prior features are generated by a three-layer multi-layer perceptron network, and the global prior features are generated from the local prior features by a further two-layer multi-layer perceptron network, where D_1 and D_2 are the dimensions of the local and global prior features;
a ReLU activation function is used after each multi-layer perceptron layer, and an adaptive max-pooling operation is used after the last ReLU activation function to generate the global prior feature;
after the prior features G_r and the geometric features of the observed point cloud are obtained, the shape prior adaptation mechanism generates from these features a deformation field D ∈ ℝ^(N_r×3) and a correspondence matrix A ∈ ℝ^(N_o×N_r), where N_r and N_o are the numbers of points, for NOCS coordinate regression;
each row d_i of D represents the deformation of the corresponding point from the prior point cloud P_r to the reconstructed point cloud P_r + D;
each row a_i of A, whose elements sum to 1, represents the soft correspondence between a point of the observed point cloud P_o and all points of the reconstructed point cloud;
the shape prior adaptation stage uses two parallel networks, each consisting of a three-layer multi-layer perceptron network, to regress the deformation field D and the correspondence matrix A, respectively;
multiplying A with the reconstructed point cloud P_r + D gives the NOCS coordinates of the object: P_nocs = A (P_r + D).
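For illustration only, a minimal sketch of the two regression heads and the NOCS computation is given below. The hidden sizes (512, 256) follow those listed in Example 1, but the heads here are applied point-wise and directly to the per-point observed and prior features, which is a simplification of the embodiment's feature fusion; the class and argument names are placeholders.
```python
import torch
import torch.nn as nn

def mlp(dims):
    """Small helper: MLP with ReLU between layers and no activation after the last layer."""
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

class ShapePriorAdaptation(nn.Module):
    """Sketch: regress deformation field D and correspondence matrix A, then NOCS coordinates."""

    def __init__(self, feat_dim, n_prior=1024):
        super().__init__()
        self.deform_head = mlp([feat_dim, 512, 256, 3])        # one 3D offset per prior point
        self.corr_head = mlp([feat_dim, 512, 256, n_prior])    # one row of A per observed point

    def forward(self, obs_feat, prior_feat, prior_points):
        # obs_feat: (N_o, F), prior_feat: (N_r, F); both are assumed to share dimension F
        D = self.deform_head(prior_feat)                       # deformation field, (N_r, 3)
        A = torch.softmax(self.corr_head(obs_feat), dim=-1)    # soft correspondences, rows sum to 1
        reconstructed = prior_points + D                       # reconstructed point cloud model
        nocs = A @ reconstructed                               # NOCS coordinates, (N_o, 3)
        return nocs, D, A

# usage sketch: nocs, D, A = ShapePriorAdaptation(1280)(obs_feat, prior_feat, prior_points)
```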
and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object.
In the present embodiment, given the observed point cloud P_o and its reconstructed NOCS coordinates, the 6D object pose and 3D size are computed using the Umeyama algorithm in combination with the RANSAC algorithm. The Umeyama algorithm estimates the optimal similarity transformation parameters, i.e. rotation, translation, and scale, where the rotation and translation parameters correspond to the 6D object pose and the scale parameter corresponds to the object size. The RANSAC algorithm is used to remove outliers and achieve a robust estimation.
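The following is a minimal sketch of the classical Umeyama closed-form solution wrapped in a simple RANSAC loop; the RANSAC parameters (iteration count, sample size, inlier threshold) are illustrative defaults and not values prescribed by the embodiment.
```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity transform (s, R, t) such that dst is approximately s * R @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance between target and source
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])                 # reflection correction
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def umeyama_ransac(nocs, observed, iters=100, sample=5, thresh=0.01):
    """Robustly fit the NOCS-to-camera similarity transform with a RANSAC loop."""
    best, best_inliers = None, 0
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(nocs), sample, replace=False)
        s, R, t = umeyama(nocs[idx], observed[idx])
        err = np.linalg.norm((s * (R @ nocs.T)).T + t - observed, axis=1)
        inliers = (err < thresh).sum()
        if inliers > best_inliers:
            best, best_inliers = (s, R, t), inliers
    return best   # (scale, rotation, translation) -> object size and 6D pose
```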
Example 1:
This Example 1 was implemented using the PyTorch framework, and experiments were performed on one desktop equipped with an NVIDIA GeForce RTX 3090 GPU with a batch size of 64. First, the depth image is cropped using the instance segmentation mask generated by Mask R-CNN, and the cropped depth image is resized to 256 × 256 pixels. Then, N_o = 1024 points are randomly sampled from the point cloud converted from the depth image to form the observed point cloud. Next, N_r = 1024 points are sampled from a category shape prior point cloud pre-trained with the SPD technique to obtain the prior point cloud. In step two, the hidden-layer dimensions of the point cloud graph attention network are set to C_0 = 128, C_1 = 128, C_2 = 256, C_3 = 256, C_4 = 512, C_5 = 512. The hyperparameters of the 3D graph convolution layer all use default settings, i.e. the number of nearest neighbors is set to M = 50 and the number of support-point kernel vectors is S = 1. In step three, the hidden-layer dimensions of the multi-layer perceptron network for local prior feature extraction are [64, 64, 64], the hidden-layer dimensions of the multi-layer perceptron network for global prior feature extraction are [128, 1024], and D_1 = 64 and D_2 = 1024. For deformation field regression, the hidden-layer dimensions of the multi-layer perceptron network are set to [512, 256, N_o × 3]; for correspondence regression, the hidden-layer dimensions are set to [512, 256, N_o × N_r]. During training, the network is optimized using the Adam optimizer with an initial learning rate of 1e-4, and the model is trained for a total of 100 epochs. The learning rate is decayed by factors of 0.6, 0.3, 0.1, and 0.01 every 20 epochs. The same loss function as in the SPD technique is used to train the network, and all categories are trained with a single model.
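As a sketch of the training schedule only, the following shows how the stated Adam settings and step-wise learning-rate decay could be configured in PyTorch; the interpretation of the decay factors as being applied at epochs 20, 40, 60, and 80 is an assumption, and `model` is a stand-in placeholder for the full network.
```python
import torch

# model = PointCloudGraphAttentionNetwork(...)   # placeholder for the full network
model = torch.nn.Linear(3, 3)                    # stand-in so this snippet runs on its own

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# multiply the initial learning rate by 0.6, 0.3, 0.1 and 0.01 at epochs 20, 40, 60 and 80
factors = {20: 0.6, 40: 0.3, 60: 0.1, 80: 0.01}
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: next((f for e, f in sorted(factors.items(), reverse=True)
                                  if epoch >= e), 1.0),
)

for epoch in range(100):
    # ... one training epoch over the NOCS-REAL training data would go here ...
    scheduler.step()
```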
This example reports the average precision of the 3D intersection-over-union (IoU) at the 50% and 75% thresholds to comprehensively evaluate the accuracy of rotation, translation, and size estimation. To directly compare rotation and translation errors, the 5° 2 cm, 5° 5 cm, 10° 2 cm, and 10° 5 cm metrics are also used: a pose is considered correct if its rotation and translation errors are both below the given thresholds. Furthermore, the Chamfer distance is used to evaluate the accuracy of the 3D model reconstruction results.
Table 1: Quantitative comparison of 6D pose and 3D size estimation between the present invention and state-of-the-art methods on the REAL275 dataset
From the results in Table 1 it can be clearly seen that the proposed method is significantly superior to the prior art in object pose and size estimation and achieves the best performance. In the comprehensive evaluation of rotation, translation, and size estimation accuracy, the proposed scheme improves the 3D50 metric by 4.0% and the 3D75 metric by 40.3% compared with the NOCS technique, which uses only RGB features. Compared with the SGPA technique, which uses RGB-D features, the proposed scheme improves the 3D50 metric by 1.9% and the 3D75 metric by 8.5%. In the direct evaluation of rotation and translation accuracy, compared with the NOCS technique using only RGB features, the proposed scheme improves the 5° 2 cm metric by 38.7%, the 5° 5 cm metric by 43.8%, the 10° 2 cm metric by 49.3%, and the 10° 5 cm metric by 52.5%. Compared with the SGPA technique using RGB-D features, the proposed scheme improves the 5° 2 cm metric by 10.0%, the 5° 5 cm metric by 14.2%, the 10° 2 cm metric by 1.8%, and the 10° 5 cm metric by 7.0%. These results clearly demonstrate that the proposed method provides a significant improvement over the prior art on the REAL275 dataset and exhibits the best results over multiple evaluation metrics, compared both with methods using only RGB features and with methods using RGB-D features.
Table 2: Quantitative comparison of 3D shape reconstruction between the present invention and state-of-the-art methods on the REAL275 dataset
From the results in Table 2 it is clearly seen that the proposed method achieves the lowest shape reconstruction errors for the bottle, can, and notebook computer categories of the REAL275 dataset. For the bowl and camera categories, the error of the proposed method is only 0.05 worse than the best SGPA technique, and on the cup category the error is only 0.14 worse than the best SPD technique. The average error over the six categories is lower than that of all other methods and is reduced by 0.44 compared with the best SGPA technique, yielding the best three-dimensional shape reconstruction results. These results fully demonstrate the superiority of the proposed method in category-level object pose estimation, particularly for shape reconstruction in the bottle, can, and notebook computer categories, and further demonstrate its effectiveness in pose estimation tasks for objects with complex structures.
Referring to fig. 3, it is apparent that, in object pose and size estimation, the proposed method is closer to the ground-truth labels (white bounding boxes) than the SGPA method. This means that the proposed method better captures the geometric features of the object, thereby achieving more accurate pose and size estimation and effectively reducing the error with respect to the ground-truth labels.
Referring to fig. 4, it can be clearly observed that the 3D shapes reconstructed by the proposed method are very close to the real shapes of the objects. This demonstrates that the method achieves excellent performance in recovering the three-dimensional shape of an object from point cloud data: it accurately captures the geometric structure and details of the object and achieves high-quality three-dimensional shape reconstruction.
In summary, the present invention provides a category-level object pose estimation method based on a graph attention network. It extracts unique geometric features from the observed object point cloud by using a point cloud graph attention network composed of a graph attention encoder and an iterative non-parametric decoder, and progressively perceives the structural information of the object from local to global. A shape prior adaptation mechanism is then adopted to regress the normalized coordinates of the object, and finally the six-degree-of-freedom pose and size information of the object are obtained through the Umeyama algorithm.
The method provided by the invention achieves state-of-the-art performance on the REAL275 dataset in the category-level object pose estimation task. Its innovation lies in using a graph attention network to learn and aggregate multi-scale geometric features and in introducing a shape prior adaptation mechanism, which significantly improves pose estimation accuracy for objects with complex structures. The method brings new ideas and breakthroughs to the field of category-level object pose estimation and has important research and application value for machine vision and three-dimensional perception.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the scope of protection of the present invention.

Claims (10)

1. A category-level 6D object pose estimation method based on a point cloud graph attention network, characterized by comprising the following steps:
S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera;
S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network;
S3, reconstructing a 3D point cloud model of the object by using a shape prior adaptation mechanism and a category shape prior point cloud, and regressing the object's normalized NOCS coordinates;
and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object.
2. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 1, wherein S1 specifically comprises:
S11, segmenting and detecting the objects in the RGB-D image data by using Mask R-CNN to obtain the object mask region;
S12, mapping the object mask region onto the depth image of the object to obtain the object depth region;
S13, converting the depth information of the object into a three-dimensional object point cloud by using the camera parameters, generating the observed point cloud data seen by the camera.
3. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 1, wherein the point cloud graph attention network is an encoder-decoder architecture comprising:
a graph attention encoder for extracting multi-scale local-to-global object features from the observed point cloud;
the graph attention encoder takes the observed point cloud P_o ∈ ℝ^(N_o×3) as input,
where ℝ denotes the set of real numbers, N_o denotes the number of points, and 3 denotes the XYZ three-dimensional coordinates of each point;
the original three-dimensional coordinates of the observed point cloud are converted into high-dimensional features by a position embedding module;
local-to-global instance geometric features are extracted from the input feature embedding in a hierarchical manner by a graph attention module;
and an iterative non-parametric decoder for aggregating the multi-scale geometric features.
4. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 3, wherein the position embedding module encodes the position information in the observed point cloud by using a 3D graph convolution layer; for each observed point p_n in the observed point cloud:
the coordinate set {p_m}, m = 1, ..., M, of its M nearest neighbors is found by a nearest-neighbor search algorithm and used as the receptive field of the 3D graph convolution kernel,
where M denotes the number of nearest neighbors, m indexes one of those points, and p_m denotes the three-dimensional coordinates of that point;
the direction vectors d_(m,n) within the receptive field obtained by the nearest-neighbor search algorithm are computed as
d_(m,n) = p_m - p_n;
and the support-point kernel vectors k_s are initialized from a uniform distribution,
where S denotes the number of support points and each support point k_s is a three-dimensional coordinate;
p_n is embedded into a C_0-dimensional feature vector by correlating the direction vectors with the support kernel vectors through the normalized inner product ⟨d_(m,n), k_s⟩ / (‖d_(m,n)‖·‖k_s‖) and taking the maximum over the neighbors, and the obtained feature vector is passed through a ReLU activation function to generate the position embedding,
where max denotes the maximum operation, ⟨·,·⟩ denotes the vector inner product, and ‖·‖ denotes the length of a vector.
5. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 3, wherein the graph attention module performs a multi-stage operation on the position embedding:
G_e(P_o), denoted G_e ∈ ℝ^(N_o×C_0), where N_o denotes the number of points and C_0 the feature dimension of each point;
each stage has a different hidden dimension C_i, and in each stage i three different point-wise feature extraction layers are applied in turn to convert the input point features into the corresponding dimension C_i:
i) a graph convolution layer GCL;
ii) a point-wise self-attention layer PSAL;
iii) a feed-forward layer FFN.
6. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 5, wherein the graph convolution layer GCL extracts local geometric features F_i ∈ ℝ^(N_i×C_i) of the object from the input point features by using the graph structure defined by the neighboring points of each point,
where N_i denotes the number of points in the i-th stage and C_i the feature dimension of each point;
the graph convolution layer GCL consists of a 3D graph convolution layer and a ReLU function;
the point-wise self-attention layer PSAL uses a point cloud self-attention mechanism to extract global geometric features G_i from the local geometric features F_i;
the point-wise self-attention layer applies shared multi-layer perceptron networks to project the local geometric features F_i to query, key, and value vectors, denoted Q_i, K_i, and V_i respectively:
Q_i = F_i W_i^Q
K_i = F_i W_i^K
V_i = F_i W_i^V
where W_i^Q, W_i^K, and W_i^V are learnable matrices of dimension C_i × C_i;
point-to-point attention weights between the query vectors and the key vectors are computed to capture the global geometric relationships between different points;
the global geometric features G_i are obtained by multiplying the attention weights with the value vectors;
the feed-forward layer FFN generates the final output of each stage of the graph attention module by using a shared multi-layer perceptron network and a ReLU activation function;
a residual connection to the local geometric features F_i is added;
the obtained geometric features are used as the input point features of the next stage;
the geometric features of the last stage are passed through a global max-pooling operation followed by a repeat operation to obtain the global shape feature of the object, where C_4 = C_5, which describes the global shape information of the instance.
7. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 6, wherein the graph attention module comprises two graph max-pooling layers, and the output of the graph attention module consists of five parts:
at each stage, the graph convolution layer GCL, the point-wise self-attention layer PSAL, and the feed-forward layer FFN are stacked in order, extracting the object geometric features from the input point features in a point-wise manner.
8. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 7, wherein the iterative non-parametric decoder operates as follows:
let P_i denote the point set of the i-th stage of the graph attention module, where for i ∈ {2, 3, 4} the geometric features are defined on a downsampled set of points;
in each iteration of the i-th stage, a nearest-neighbor search algorithm is used to find, for each point p_(n,i-1) of the previous-stage point set P_(i-1), its nearest neighbor q_(n,i) in the i-th stage;
once the nearest neighbor q_(n,i) of each point p_(n,i-1) is determined, the features of point q_(n,i) are propagated to point p_(n,i-1);
the updated features of the points of the (i-1)-th stage are then used to update the features of the points of the (i-2)-th stage;
the whole process proceeds iteratively until the features of all points of the first stage are determined, and it is applied to the output of every stage of the graph attention module;
once all multi-scale geometric features are aligned to the same point set, they are aggregated through a concatenation operation with the position embedding G_e and the global shape feature to generate the final geometric features,
whose feature dimension is the sum of the dimensions of all concatenated features.
9. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 8, wherein S3 comprises:
given the observed point cloud P_o ∈ ℝ^(N_o×3) and the corresponding category shape prior point cloud P_r ∈ ℝ^(N_r×3), where N_o and N_r denote the respective numbers of points and each point is an XYZ three-dimensional coordinate;
a point-wise prior feature G_r is extracted from the category shape prior, the point-wise prior feature G_r being the concatenation of a local prior feature and a global prior feature;
the local prior features are generated by a three-layer multi-layer perceptron network, and the global prior features are generated from the local prior features by a further two-layer multi-layer perceptron network, where D_1 and D_2 are the dimensions of the local and global prior features;
a ReLU activation function is used after each multi-layer perceptron layer, and an adaptive max-pooling operation is used after the last ReLU activation function to generate the global prior feature;
after the prior features G_r and the geometric features of the observed point cloud are obtained, the shape prior adaptation mechanism generates from these features a deformation field D ∈ ℝ^(N_r×3) and a correspondence matrix A ∈ ℝ^(N_o×N_r), where N_r and N_o are the numbers of points, for NOCS coordinate regression;
each row d_i of D represents the deformation of the corresponding point from the prior point cloud P_r to the reconstructed point cloud P_r + D;
each row a_i of A, whose elements sum to 1, represents the soft correspondence between a point of the observed point cloud P_o and all points of the reconstructed point cloud;
the shape prior adaptation stage uses two parallel networks, each consisting of a three-layer multi-layer perceptron network, to regress the deformation field D and the correspondence matrix A, respectively;
multiplying A with the reconstructed point cloud P_r + D gives the NOCS coordinates of the object: P_nocs = A (P_r + D).
10. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 9, wherein, given the observed point cloud P_o and its reconstructed NOCS coordinates, the Umeyama algorithm is used in combination with the RANSAC algorithm to compute the 6D object pose and 3D size.
CN202311083936.0A 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network Pending CN117132650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311083936.0A CN117132650A (en) 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311083936.0A CN117132650A (en) 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network

Publications (1)

Publication Number Publication Date
CN117132650A true CN117132650A (en) 2023-11-28

Family

ID=88854043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311083936.0A Pending CN117132650A (en) 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network

Country Status (1)

Country Link
CN (1) CN117132650A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117522990A (en) * 2024-01-04 2024-02-06 山东科技大学 Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
CN117522990B (en) * 2024-01-04 2024-03-29 山东科技大学 Category-level pose estimation method based on multi-head attention mechanism and iterative refinement

Similar Documents

Publication Publication Date Title
US10248664B1 (en) Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
US11501415B2 (en) Method and system for high-resolution image inpainting
US11645835B2 (en) Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
Chen et al. Efficient approximation of deep relu networks for functions on low dimensional manifolds
CN110597970B (en) Multi-granularity medical entity joint identification method and device
US10204299B2 (en) Unsupervised matching in fine-grained datasets for single-view object reconstruction
US20230359865A1 (en) Modeling Dependencies with Global Self-Attention Neural Networks
CN112288011B (en) Image matching method based on self-attention deep neural network
US20230043026A1 (en) Learning-based active surface model for medical image segmentation
CN112330719B (en) Deep learning target tracking method based on feature map segmentation and self-adaptive fusion
Liu et al. Optimization-based key frame extraction for motion capture animation
CN115203442B (en) Cross-modal deep hash retrieval method, system and medium based on joint attention
CN111709270B (en) Three-dimensional shape recovery and attitude estimation method and device based on depth image
CN113205523A (en) Medical image segmentation and identification system, terminal and storage medium with multi-scale representation optimization
CN117132650A (en) Category-level 6D object pose estimation method based on point cloud image attention network
CN111680550A (en) Emotion information identification method and device, storage medium and computer equipment
CN115222998A (en) Image classification method
US20220382246A1 (en) Differentiable simulator for robotic cutting
CN114048845B (en) Point cloud repairing method and device, computer equipment and storage medium
CN115936992A (en) Garbage image super-resolution method and system of lightweight transform
CN116363750A (en) Human body posture prediction method, device, equipment and readable storage medium
CN117522990B (en) Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
Wang et al. MDISN: Learning multiscale deformed implicit fields from single images
CN116912296A (en) Point cloud registration method based on position-enhanced attention mechanism
CN111753736A (en) Human body posture recognition method, device, equipment and medium based on packet convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination