CN117132650A - Category-level 6D object pose estimation method based on point cloud image attention network - Google Patents

Category-level 6D object pose estimation method based on point cloud graph attention network

Info

Publication number
CN117132650A
CN117132650A (application CN202311083936.0A)
Authority
CN
China
Prior art keywords
point
point cloud
points
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311083936.0A
Other languages
Chinese (zh)
Inventor
黄章进
邹露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202311083936.0A
Publication of CN117132650A
Legal status: Pending (current)

Classifications

    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a category-level 6D object pose estimation method based on a point cloud graph attention network, which comprises the following steps: S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera; S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network; S3, reconstructing a 3D point cloud model of the object by using a shape prior adaptation mechanism and a category shape prior point cloud, and regressing the object's normalized object coordinate space (NOCS) coordinates; and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object. Experiments performed on the NOCS-REAL dataset demonstrate that the proposed scheme outperforms the prior art and achieves better results.

Description

Category-level 6D object pose estimation method based on point cloud graph attention network
Technical Field
The invention relates to the technical field of computer vision and object pose estimation, and in particular to a category-level 6D object pose estimation method based on a point cloud graph attention network.
Background
Category-level six-degree-of-freedom (6D) object pose estimation is a fundamental problem in computer vision: it involves predicting the 3D rotation and 3D translation of an object from the object coordinate system to the camera coordinate system, as well as the 3D size of the object. The technology is widely used in robotics, augmented reality, autonomous driving, and other fields. The problem is extremely challenging because of the large shape variations among objects within a category.
To address this problem, correspondence-based methods recover the pose and size of an object by computing a similarity transformation between the observed point cloud and reconstructed NOCS (normalized object coordinate space) coordinates. The quality of the reconstructed NOCS coordinates therefore implicitly determines the accuracy of the subsequent pose estimation. To improve the reconstruction quality of the NOCS coordinates, some methods, such as SPD and SGPA, reconstruct a 3D model of the object by deforming a category shape prior point cloud that represents the mean shape of objects in the same category, and establish 3D-3D correspondences between the observed point cloud and the reconstructed point cloud model, thereby achieving 6D pose and size estimation. Although these correspondence-based approaches have made great progress, they fail to fully exploit the unique structural features of individual instances, which limits their predictive capability. How to provide a category-level 6D object pose estimation method based on a point cloud graph attention network is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a category-level 6D object pose estimation method based on a point cloud graph attention network. Experiments performed on the NOCS-REAL dataset demonstrate that the proposed scheme outperforms the prior art and achieves better results.
According to an embodiment of the invention, the category-level 6D object pose estimation method based on a point cloud graph attention network comprises the following steps:
S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera;
S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network;
S3, reconstructing a 3D point cloud model of the object by using a shape prior adaptation mechanism and a category shape prior point cloud, and regressing the object's normalized NOCS coordinates;
and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object (a minimal end-to-end sketch of this pipeline is given after these steps).
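Purely for illustration, the following Python sketch shows how steps S1 to S4 fit together. Every name in it (segmenter, network, backproject, umeyama_ransac, prior_points) is a placeholder passed in by the caller and is not part of the claimed method; the callables are sketched further below in the detailed description.
```python
def estimate_pose(rgb, depth, intrinsics, prior_points,
                  segmenter, network, backproject, umeyama_ransac):
    """Illustrative S1-S4 pipeline; all callables are placeholders supplied by the caller."""
    # S1: instance segmentation and back-projection of the masked depth pixels
    mask = segmenter(rgb)                              # e.g. a Mask R-CNN wrapper
    observed = backproject(depth, mask, intrinsics)    # observed point cloud, shape (N_o, 3)

    # S2 + S3: the point cloud graph attention network and the shape prior adaptation
    # mechanism regress a deformation field D and a correspondence matrix A
    D, A = network(observed, prior_points)             # D: (N_r, 3), A: (N_o, N_r)
    nocs = A @ (prior_points + D)                      # reconstructed NOCS coordinates, (N_o, 3)

    # S4: similarity transform (scale, rotation, translation) via Umeyama + RANSAC
    scale, rotation, translation = umeyama_ransac(nocs, observed)
    return rotation, translation, scale
```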
Optionally, step S1 specifically includes:
S11, segmenting and detecting the objects in the RGB-D image data by using Mask R-CNN to obtain the object mask region;
S12, mapping the object mask region onto the depth image of the object to obtain the object depth region;
S13, converting the depth information of the object into a three-dimensional object point cloud by using the camera parameters, generating the observed point cloud data seen by the camera.
Optionally, the point cloud graph attention network is an encoder-decoder architecture comprising:
a graph attention encoder for extracting multi-scale local-to-global object features from the observed point cloud;
the graph attention encoder takes the observed point cloud P_o ∈ ℝ^(N_o×3) as input,
where ℝ denotes the set of real numbers, N_o denotes the number of points, and 3 denotes the XYZ three-dimensional coordinates of each point;
the original three-dimensional coordinates of the observed point cloud are converted into high-dimensional features by a position embedding module;
local-to-global instance geometric features are extracted from the input feature embedding in a hierarchical manner by a graph attention module;
and an iterative non-parametric decoder for aggregating the multi-scale geometric features.
Optionally, the position embedding module encodes the position information in the observed point cloud by using a 3D graph convolution layer; for each observed point p_n in the observed point cloud:
the coordinate set {p_m}, m = 1, ..., M, of its M nearest neighbors is found by a nearest-neighbor search algorithm and used as the receptive field of the 3D graph convolution kernel,
where M denotes the number of nearest neighbors, m indexes one of those points, and p_m denotes the three-dimensional coordinates of that point;
the direction vectors d_(m,n) within the receptive field obtained by the nearest-neighbor search algorithm are computed as
d_(m,n) = p_m - p_n;
and the support-point kernel vectors k_s are initialized from a uniform distribution,
where S denotes the number of support points and each support point k_s is a three-dimensional coordinate;
p_n is embedded into a C_0-dimensional feature vector by correlating the direction vectors with the support kernel vectors through the normalized inner product ⟨d_(m,n), k_s⟩ / (‖d_(m,n)‖·‖k_s‖) and taking the maximum over the neighbors, and the obtained feature vector is passed through a ReLU activation function to generate the position embedding,
where max denotes the maximum operation, ⟨·,·⟩ denotes the vector inner product, and ‖·‖ denotes the length of a vector.
Optionally, the graph attention module performs a multi-stage operation on the position embedding:
G_e(P_o), denoted G_e ∈ ℝ^(N_o×C_0), where N_o denotes the number of points and C_0 the feature dimension of each point;
each stage has a different hidden dimension C_i, and in each stage i three different point-wise feature extraction layers are applied in turn to convert the input point features into the corresponding dimension C_i:
i) a graph convolution layer GCL;
ii) a point-wise self-attention layer PSAL;
iii) a feed-forward layer FFN.
Optionally, the graph convolution layer GCL extracts local geometric features F_i ∈ ℝ^(N_i×C_i) of the object from the input point features by using the graph structure defined by the neighboring points of each point,
where N_i denotes the number of points in the i-th stage and C_i the feature dimension of each point;
the graph convolution layer GCL consists of a 3D graph convolution layer and a ReLU function.
The point-wise self-attention layer PSAL uses a point cloud self-attention mechanism to extract global geometric features G_i from the local geometric features F_i.
The point-wise self-attention layer applies shared multi-layer perceptron networks to project the local geometric features F_i to query, key, and value vectors, denoted Q_i, K_i, and V_i respectively:
Q_i = F_i W_i^Q
K_i = F_i W_i^K
V_i = F_i W_i^V
where W_i^Q, W_i^K, and W_i^V are learnable matrices of dimension C_i × C_i;
point-to-point attention weights between the query vectors and the key vectors are computed to capture the global geometric relationships between different points;
the global geometric features G_i are obtained by multiplying the attention weights with the value vectors.
The feed-forward layer FFN generates the final output of each stage of the graph attention module by using a shared multi-layer perceptron network and a ReLU activation function;
a residual connection to the local geometric features F_i is added;
the obtained geometric features are used as the input point features of the next stage;
the geometric features of the last stage are passed through a global max-pooling operation followed by a repeat operation to obtain the global shape feature of the object, where C_4 = C_5, which describes the global shape information of the instance.
Optionally, the graph attention module includes two graph max-pooling layers, and the output of the graph attention module consists of five parts:
at each stage, the graph convolution layer GCL, the point-wise self-attention layer PSAL, and the feed-forward layer FFN are stacked in order, extracting the object geometric features from the input point features in a point-wise manner.
Optionally, the iterative non-parametric decoder operates as follows:
let P_i denote the point set of the i-th stage of the graph attention module, where for i ∈ {2, 3, 4} the geometric features are defined on a downsampled set of points;
in each iteration of the i-th stage, a nearest-neighbor search algorithm is used to find, for each point p_(n,i-1) of the previous-stage point set P_(i-1), its nearest neighbor q_(n,i) in the i-th stage;
once the nearest neighbor q_(n,i) of each point p_(n,i-1) is determined, the features of point q_(n,i) are propagated to point p_(n,i-1);
the updated features of the points of the (i-1)-th stage are then used to update the features of the points of the (i-2)-th stage;
the whole process proceeds iteratively until the features of all points of the first stage are determined, and it is applied to the output of every stage of the graph attention module;
once all multi-scale geometric features are aligned to the same point set, they are aggregated through a concatenation operation with the position embedding G_e and the global shape feature to generate the final geometric features,
whose feature dimension is the sum of the dimensions of all concatenated features.
Optionally, step S3 includes:
given the observed point cloud P_o ∈ ℝ^(N_o×3) and the corresponding category shape prior point cloud P_r ∈ ℝ^(N_r×3), where N_o and N_r denote the respective numbers of points and each point is an XYZ three-dimensional coordinate;
a point-wise prior feature G_r is extracted from the category shape prior, the point-wise prior feature G_r being the concatenation of a local prior feature and a global prior feature;
the local prior features are generated by a three-layer multi-layer perceptron network, and the global prior features are generated from the local prior features by a further two-layer multi-layer perceptron network, where D_1 and D_2 are the dimensions of the local and global prior features;
a ReLU activation function is used after each multi-layer perceptron layer, and an adaptive max-pooling operation is used after the last ReLU activation function to generate the global prior feature;
after the prior features G_r and the geometric features of the observed point cloud are obtained, the shape prior adaptation mechanism generates from these features a deformation field D ∈ ℝ^(N_r×3) and a correspondence matrix A ∈ ℝ^(N_o×N_r), where N_r and N_o are the numbers of points, for NOCS coordinate regression;
each row d_i of D represents the deformation of the corresponding point from the prior point cloud P_r to the reconstructed point cloud P_r + D;
each row a_i of A, whose elements sum to 1, represents the soft correspondence between a point of the observed point cloud P_o and all points of the reconstructed point cloud;
the shape prior adaptation stage uses two parallel networks, each consisting of a three-layer multi-layer perceptron network, to regress the deformation field D and the correspondence matrix A, respectively;
multiplying A with the reconstructed point cloud P_r + D gives the NOCS coordinates of the object: P_nocs = A (P_r + D).
optionally, the observation point cloud P o And reconstructed NOCS coordinates thereofThe Umeyama algorithm is used in combination with the RANSAC algorithm to calculate the 6D object pose and 3D dimensions.
The beneficial effects of the invention are as follows:
(1) The invention provides a novel point cloud graph attention network, which adopts a network model with an encoder-decoder architecture to extract the unique structural features of an individual instance from the object point cloud, thereby improving the accuracy of category-level object pose estimation.
(2) The invention provides a graph attention encoder, which first uses 3D graph convolution to extract multi-scale local structural features of the point cloud and then adopts a self-attention mechanism to extract multi-scale global structural features from the local structural features.
(3) The invention provides an iterative non-parametric decoder, which propagates the multi-scale global structural features from fine granularity to coarse granularity, retaining the multi-scale structural features while avoiding information loss during feature propagation.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of the category-level 6D object pose estimation method based on a point cloud graph attention network;
FIG. 2 is a block diagram of the point cloud graph attention network in the category-level 6D object pose estimation method based on a point cloud graph attention network according to the present invention;
FIG. 3 shows visual results of 6D pose and 3D size estimation on the REAL275 dataset by the present invention and state-of-the-art methods;
FIG. 4 shows visual results of 3D shape reconstruction on the REAL275 dataset by the present invention and state-of-the-art methods.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, a category-level 6D object pose estimation method based on a point cloud graph attention network includes:
S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera;
in this embodiment, S1 specifically includes:
S11, segmenting and detecting the objects in the RGB-D image data by using Mask R-CNN to obtain the object mask region;
S12, mapping the object mask region onto the depth image of the object to obtain the object depth region;
S13, converting the depth information of the object into a three-dimensional object point cloud by using the camera parameters, generating the observed point cloud data seen by the camera (a minimal back-projection sketch is given below).
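As an illustration of S13 only, the following is a minimal sketch of back-projecting a masked depth image into a point cloud with the pinhole camera model; the function name, the assumption that depth is stored in millimeters, and the fixed subsampling size are illustrative choices, not part of the claimed method.
```python
import numpy as np

def backproject(depth, mask, intrinsics, depth_scale=1000.0):
    """Convert masked depth pixels to an (N, 3) point cloud in camera coordinates.

    depth:      (H, W) depth image (assumed to be in millimeters here)
    mask:       (H, W) boolean instance mask, e.g. from Mask R-CNN
    intrinsics: 3x3 camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    vs, us = np.nonzero(mask & (depth > 0))   # pixel rows (v) and columns (u)
    z = depth[vs, us] / depth_scale           # metric depth
    x = (us - cx) * z / fx                    # pinhole back-projection
    y = (vs - cy) * z / fy
    points = np.stack([x, y, z], axis=1)      # (N, 3) observed point cloud

    # optionally subsample to a fixed number of points (e.g. N_o = 1024)
    if len(points) > 1024:
        idx = np.random.choice(len(points), 1024, replace=False)
        points = points[idx]
    return points
```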
S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network;
referring to fig. 2, in this embodiment the point cloud graph attention network is an encoder-decoder architecture, which includes:
a graph attention encoder for extracting multi-scale local-to-global object features from the observed point cloud;
the graph attention encoder takes the observed point cloud P_o ∈ ℝ^(N_o×3) as input,
where ℝ denotes the set of real numbers, N_o denotes the number of points, and 3 denotes the XYZ three-dimensional coordinates of each point;
the original three-dimensional coordinates of the observed point cloud are converted into high-dimensional features by a position embedding module;
local-to-global instance geometric features are extracted from the input feature embedding in a hierarchical manner by a graph attention module;
in this embodiment, the position embedding module encodes the position information in the observed point cloud by using a 3D graph convolution layer; for each observed point p_n in the observed point cloud:
the coordinate set {p_m}, m = 1, ..., M, of its M nearest neighbors is found by a nearest-neighbor search algorithm and used as the receptive field of the 3D graph convolution kernel,
where M denotes the number of nearest neighbors, m indexes one of those points, and p_m denotes the three-dimensional coordinates of that point;
the direction vectors d_(m,n) within the receptive field obtained by the nearest-neighbor search algorithm are computed as
d_(m,n) = p_m - p_n;
and the support-point kernel vectors k_s are initialized from a uniform distribution,
where S denotes the number of support points and each support point k_s is a three-dimensional coordinate;
p_n is embedded into a C_0-dimensional feature vector by correlating the direction vectors with the support kernel vectors through the normalized inner product ⟨d_(m,n), k_s⟩ / (‖d_(m,n)‖·‖k_s‖) and taking the maximum over the neighbors, and the obtained feature vector is passed through a ReLU activation function to generate the position embedding,
where max denotes the maximum operation, ⟨·,·⟩ denotes the vector inner product, and ‖·‖ denotes the length of a vector.
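Under the assumption that each output channel has its own set of support kernel vectors and that the per-neighbor response is exactly the normalized inner product described above, a minimal sketch of this position embedding is given below; the precise kernel parameterization and weighting of the 3D graph convolution in the embodiment may differ.
```python
import torch
import torch.nn as nn

class GraphConvPositionEmbedding(nn.Module):
    """Sketch of a 3D graph-convolution position embedding (assumed form)."""

    def __init__(self, out_dim=128, num_neighbors=50, num_support=1):
        super().__init__()
        self.M = num_neighbors
        # one set of S support kernel vectors per output channel, uniformly initialized
        self.kernels = nn.Parameter(torch.empty(out_dim, num_support, 3).uniform_(-1, 1))

    def forward(self, points):                      # points: (N, 3)
        d = torch.cdist(points, points)             # pairwise distances, (N, N)
        knn = d.topk(self.M + 1, largest=False).indices[:, 1:]   # (N, M) neighbor ids, self removed
        dirs = points[knn] - points[:, None, :]     # direction vectors d_{m,n}, (N, M, 3)
        dirs = dirs / (dirs.norm(dim=-1, keepdim=True) + 1e-8)

        k = self.kernels / (self.kernels.norm(dim=-1, keepdim=True) + 1e-8)
        # normalized inner product between every direction vector and every kernel vector
        sim = torch.einsum('nmd,csd->nmcs', dirs, k)             # (N, M, C0, S)
        # max over neighbors and support vectors, then ReLU -> (N, C0) position embedding
        return torch.relu(sim.amax(dim=(1, 3)))

# usage: emb = GraphConvPositionEmbedding()(torch.randn(1024, 3))  # -> (1024, 128)
```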
In this embodiment, the graph attention module performs a multi-stage operation on the position embedding:
G_e(P_o), denoted G_e ∈ ℝ^(N_o×C_0), where N_o denotes the number of points and C_0 the feature dimension of each point;
each stage has a different hidden dimension C_i, and in each stage i three different point-wise feature extraction layers are applied in turn to convert the input point features into the corresponding dimension C_i:
i) a graph convolution layer GCL;
ii) a point-wise self-attention layer PSAL;
iii) a feed-forward layer FFN.
In the present embodiment, the graph convolution layer GCL extracts local geometric features F_i ∈ ℝ^(N_i×C_i) of the object from the input point features by using the graph structure defined by the neighboring points of each point,
where N_i denotes the number of points in the i-th stage and C_i the feature dimension of each point;
the graph convolution layer GCL consists of a 3D graph convolution layer and a ReLU function.
The point-wise self-attention layer PSAL uses a point cloud self-attention mechanism to extract global geometric features G_i from the local geometric features F_i.
The point-wise self-attention layer applies shared multi-layer perceptron networks to project the local geometric features F_i to query, key, and value vectors, denoted Q_i, K_i, and V_i respectively:
Q_i = F_i W_i^Q
K_i = F_i W_i^K
V_i = F_i W_i^V
where W_i^Q, W_i^K, and W_i^V are learnable matrices of dimension C_i × C_i;
point-to-point attention weights between the query vectors and the key vectors are computed to capture the global geometric relationships between different points;
the global geometric features G_i are obtained by multiplying the attention weights with the value vectors.
The feed-forward layer FFN generates the final output of each stage of the graph attention module by using a shared multi-layer perceptron network and a ReLU activation function;
a residual connection to the local geometric features F_i is added;
the obtained geometric features are used as the input point features of the next stage;
the geometric features of the last stage are passed through a global max-pooling operation followed by a repeat operation to obtain the global shape feature of the object, where C_4 = C_5, which describes the global shape information of the instance.
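As a sketch only, one point-wise self-attention layer followed by a feed-forward layer with the residual connection described above might look as follows; the scaled dot-product softmax used to normalize the attention weights is an assumption made for concreteness, since the exact normalization is not spelled out here.
```python
import torch
import torch.nn as nn

class PointSelfAttentionStage(nn.Module):
    """Sketch of one PSAL + FFN stage with a residual connection to F_i (assumed form)."""

    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)   # W_i^Q
        self.wk = nn.Linear(dim, dim, bias=False)   # W_i^K
        self.wv = nn.Linear(dim, dim, bias=False)   # W_i^V
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f):                            # f: (N_i, C_i) local features F_i
        q, k, v = self.wq(f), self.wk(f), self.wv(f)
        # point-to-point attention weights between queries and keys (softmax assumed)
        attn = torch.softmax(q @ k.t() / f.shape[-1] ** 0.5, dim=-1)   # (N_i, N_i)
        g = attn @ v                                 # global geometric features G_i
        return self.ffn(g) + f                       # FFN output plus residual to F_i

# usage: out = PointSelfAttentionStage(256)(torch.randn(1024, 256))  # -> (1024, 256)
```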
In this embodiment, the graph attention module includes two graph max-pooling layers, and the output of the graph attention module consists of five parts:
at each stage, the graph convolution layer GCL, the point-wise self-attention layer PSAL, and the feed-forward layer FFN are stacked in order, and the object geometric features are extracted from the input point features in a point-wise manner, realizing a local-to-global representation that effectively describes complex object geometries.
And an iterative non-parametric decoder is used for aggregating the multi-scale geometric features.
In this embodiment, the iterative non-parametric decoder operates as follows:
let P_i denote the point set of the i-th stage of the graph attention module, where for i ∈ {2, 3, 4} the geometric features are defined on a downsampled set of points;
in each iteration of the i-th stage, a nearest-neighbor search algorithm is used to find, for each point p_(n,i-1) of the previous-stage point set P_(i-1), its nearest neighbor q_(n,i) in the i-th stage;
once the nearest neighbor q_(n,i) of each point p_(n,i-1) is determined, the features of point q_(n,i) are propagated to point p_(n,i-1);
the updated features of the points of the (i-1)-th stage are then used to update the features of the points of the (i-2)-th stage;
the whole process proceeds iteratively until the features of all points of the first stage are determined, and it is applied to the output of every stage of the graph attention module;
once all multi-scale geometric features are aligned to the same point set, they are aggregated through a concatenation operation with the position embedding G_e and the global shape feature to generate the final geometric features,
whose feature dimension is the sum of the dimensions of all concatenated features.
The iterative non-parametric decoder enables the network to propagate the point-wise geometric features from fine granularity to coarse granularity in a progressive manner. It retains the multi-scale geometric features, avoids information loss during feature propagation between different scales, and requires no additional learnable parameters.
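For illustration, a minimal sketch of this nearest-neighbor feature propagation and final concatenation is given below; the data layout (lists of point sets and features ordered from the full-resolution stage to the most downsampled one) is an assumption, and the exact stage bookkeeping of the embodiment may differ.
```python
import torch

def iterative_decoder(point_sets, features, pos_embedding, global_shape):
    """Propagate multi-scale features back to the full-resolution point set.

    point_sets:    list of (N_i, 3) tensors, stage 1 first (N_1 >= N_2 >= ...)
    features:      list of (N_i, C_i) tensors, one per stage
    pos_embedding: (N_1, C_0) position embedding G_e
    global_shape:  (N_1, C_g) tiled global shape feature
    """
    aligned = [features[0]]
    for i in range(1, len(point_sets)):
        # walk the stage-i features back to stage 1, one nearest-neighbor
        # lookup per intermediate stage; no learned parameters are involved
        feat = features[i]
        for j in range(i, 0, -1):
            d = torch.cdist(point_sets[j - 1], point_sets[j])   # (N_{j-1}, N_j)
            nn_idx = d.argmin(dim=1)                            # nearest neighbor per point
            feat = feat[nn_idx]                                 # propagate features
        aligned.append(feat)                                    # now aligned to (N_1, C_i)
    # concatenate all aligned scales with the position embedding and global shape feature
    return torch.cat(aligned + [pos_embedding, global_shape], dim=1)
```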
S3, reconstructing a 3D point cloud model of the object by using a shape priori adaptation mechanism and a category shape priori point cloud, and regressing an object normalized NOCS coordinate;
in this embodiment, S3 includes:
given the observed point cloud P_o ∈ ℝ^(N_o×3) and the corresponding category shape prior point cloud P_r ∈ ℝ^(N_r×3), where N_o and N_r denote the respective numbers of points and each point is an XYZ three-dimensional coordinate;
a point-wise prior feature G_r is extracted from the category shape prior, the point-wise prior feature G_r being the concatenation of a local prior feature and a global prior feature;
the local prior features are generated by a three-layer multi-layer perceptron network, and the global prior features are generated from the local prior features by a further two-layer multi-layer perceptron network, where D_1 and D_2 are the dimensions of the local and global prior features;
a ReLU activation function is used after each multi-layer perceptron layer, and an adaptive max-pooling operation is used after the last ReLU activation function to generate the global prior feature;
after the prior features G_r and the geometric features of the observed point cloud are obtained, the shape prior adaptation mechanism generates from these features a deformation field D ∈ ℝ^(N_r×3) and a correspondence matrix A ∈ ℝ^(N_o×N_r), where N_r and N_o are the numbers of points, for NOCS coordinate regression;
each row d_i of D represents the deformation of the corresponding point from the prior point cloud P_r to the reconstructed point cloud P_r + D;
each row a_i of A, whose elements sum to 1, represents the soft correspondence between a point of the observed point cloud P_o and all points of the reconstructed point cloud;
the shape prior adaptation stage uses two parallel networks, each consisting of a three-layer multi-layer perceptron network, to regress the deformation field D and the correspondence matrix A, respectively;
multiplying A with the reconstructed point cloud P_r + D gives the NOCS coordinates of the object: P_nocs = A (P_r + D).
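For illustration only, a minimal sketch of the two regression heads and the NOCS computation is given below. The hidden sizes (512, 256) follow those listed in Example 1, but the heads here are applied point-wise and directly to the per-point observed and prior features, which is a simplification of the embodiment's feature fusion; the class and argument names are placeholders.
```python
import torch
import torch.nn as nn

def mlp(dims):
    """Small helper: MLP with ReLU between layers and no activation after the last layer."""
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

class ShapePriorAdaptation(nn.Module):
    """Sketch: regress deformation field D and correspondence matrix A, then NOCS coordinates."""

    def __init__(self, feat_dim, n_prior=1024):
        super().__init__()
        self.deform_head = mlp([feat_dim, 512, 256, 3])        # one 3D offset per prior point
        self.corr_head = mlp([feat_dim, 512, 256, n_prior])    # one row of A per observed point

    def forward(self, obs_feat, prior_feat, prior_points):
        # obs_feat: (N_o, F), prior_feat: (N_r, F); both are assumed to share dimension F
        D = self.deform_head(prior_feat)                       # deformation field, (N_r, 3)
        A = torch.softmax(self.corr_head(obs_feat), dim=-1)    # soft correspondences, rows sum to 1
        reconstructed = prior_points + D                       # reconstructed point cloud model
        nocs = A @ reconstructed                               # NOCS coordinates, (N_o, 3)
        return nocs, D, A

# usage sketch: nocs, D, A = ShapePriorAdaptation(1280)(obs_feat, prior_feat, prior_points)
```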
and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object.
In the present embodiment, given the observed point cloud P_o and its reconstructed NOCS coordinates, the 6D object pose and 3D size are computed using the Umeyama algorithm in combination with the RANSAC algorithm. The Umeyama algorithm estimates the optimal similarity transformation parameters, i.e. rotation, translation, and scale, where the rotation and translation parameters correspond to the 6D object pose and the scale parameter corresponds to the object size. The RANSAC algorithm is used to remove outliers and achieve a robust estimation.
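The following is a minimal sketch of the classical Umeyama closed-form solution wrapped in a simple RANSAC loop; the RANSAC parameters (iteration count, sample size, inlier threshold) are illustrative defaults and not values prescribed by the embodiment.
```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity transform (s, R, t) such that dst is approximately s * R @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance between target and source
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    D = np.diag([1.0, 1.0, d])                 # reflection correction
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def umeyama_ransac(nocs, observed, iters=100, sample=5, thresh=0.01):
    """Robustly fit the NOCS-to-camera similarity transform with a RANSAC loop."""
    best, best_inliers = None, 0
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(nocs), sample, replace=False)
        s, R, t = umeyama(nocs[idx], observed[idx])
        err = np.linalg.norm((s * (R @ nocs.T)).T + t - observed, axis=1)
        inliers = (err < thresh).sum()
        if inliers > best_inliers:
            best, best_inliers = (s, R, t), inliers
    return best   # (scale, rotation, translation) -> object size and 6D pose
```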
Example 1:
This Example 1 was implemented using the PyTorch framework, and experiments were performed on one desktop equipped with an NVIDIA GeForce RTX 3090 GPU with a batch size of 64. First, the depth image is cropped using the instance segmentation mask generated by Mask R-CNN, and the cropped depth image is resized to 256 × 256 pixels. Then, N_o = 1024 points are randomly sampled from the point cloud converted from the depth image to form the observed point cloud. Next, N_r = 1024 points are sampled from a category shape prior point cloud pre-trained with the SPD technique to obtain the prior point cloud. In step two, the hidden-layer dimensions of the point cloud graph attention network are set to C_0 = 128, C_1 = 128, C_2 = 256, C_3 = 256, C_4 = 512, C_5 = 512. The hyperparameters of the 3D graph convolution layer all use default settings, i.e. the number of nearest neighbors is set to M = 50 and the number of support-point kernel vectors is S = 1. In step three, the hidden-layer dimensions of the multi-layer perceptron network for local prior feature extraction are [64, 64, 64], the hidden-layer dimensions of the multi-layer perceptron network for global prior feature extraction are [128, 1024], and D_1 = 64 and D_2 = 1024. For deformation field regression, the hidden-layer dimensions of the multi-layer perceptron network are set to [512, 256, N_o × 3]; for correspondence regression, the hidden-layer dimensions are set to [512, 256, N_o × N_r]. During training, the network is optimized using the Adam optimizer with an initial learning rate of 1e-4, and the model is trained for a total of 100 epochs. The learning rate is decayed by factors of 0.6, 0.3, 0.1, and 0.01 every 20 epochs. The same loss function as in the SPD technique is used to train the network, and all categories are trained with a single model.
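As a sketch of the training schedule only, the following shows how the stated Adam settings and step-wise learning-rate decay could be configured in PyTorch; the interpretation of the decay factors as being applied at epochs 20, 40, 60, and 80 is an assumption, and `model` is a stand-in placeholder for the full network.
```python
import torch

# model = PointCloudGraphAttentionNetwork(...)   # placeholder for the full network
model = torch.nn.Linear(3, 3)                    # stand-in so this snippet runs on its own

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# multiply the initial learning rate by 0.6, 0.3, 0.1 and 0.01 at epochs 20, 40, 60 and 80
factors = {20: 0.6, 40: 0.3, 60: 0.1, 80: 0.01}
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: next((f for e, f in sorted(factors.items(), reverse=True)
                                  if epoch >= e), 1.0),
)

for epoch in range(100):
    # ... one training epoch over the NOCS-REAL training data would go here ...
    scheduler.step()
```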
This example reports the average precision of the 3D intersection-over-union (IoU) at the 50% and 75% thresholds to comprehensively evaluate the accuracy of rotation, translation, and size estimation. To directly compare rotation and translation errors, the 5° 2 cm, 5° 5 cm, 10° 2 cm, and 10° 5 cm metrics are also used: a pose is considered correct if its rotation and translation errors are both below the given thresholds. Furthermore, the Chamfer distance is used to evaluate the accuracy of the 3D model reconstruction results.
Table 1: Quantitative comparison of 6D pose and 3D size estimation between the present invention and state-of-the-art methods on the REAL275 dataset
From the results in Table 1 it can be clearly seen that the proposed method is significantly superior to the prior art in object pose and size estimation and achieves the best performance. In the comprehensive evaluation of rotation, translation, and size estimation accuracy, the proposed scheme improves the 3D50 metric by 4.0% and the 3D75 metric by 40.3% compared with the NOCS technique, which uses only RGB features. Compared with the SGPA technique, which uses RGB-D features, the proposed scheme improves the 3D50 metric by 1.9% and the 3D75 metric by 8.5%. In the direct evaluation of rotation and translation accuracy, compared with the NOCS technique using only RGB features, the proposed scheme improves the 5° 2 cm metric by 38.7%, the 5° 5 cm metric by 43.8%, the 10° 2 cm metric by 49.3%, and the 10° 5 cm metric by 52.5%. Compared with the SGPA technique using RGB-D features, the proposed scheme improves the 5° 2 cm metric by 10.0%, the 5° 5 cm metric by 14.2%, the 10° 2 cm metric by 1.8%, and the 10° 5 cm metric by 7.0%. These results clearly demonstrate that the proposed method provides a significant improvement over the prior art on the REAL275 dataset and exhibits the best results over multiple evaluation metrics, compared both with methods using only RGB features and with methods using RGB-D features.
Table 2: Quantitative comparison of 3D shape reconstruction between the present invention and state-of-the-art methods on the REAL275 dataset
From the results in Table 2 it is clearly seen that the proposed method achieves the lowest shape reconstruction errors for the bottle, can, and notebook computer categories of the REAL275 dataset. For the bowl and camera categories, the error of the proposed method is only 0.05 worse than the best SGPA technique, and on the cup category the error is only 0.14 worse than the best SPD technique. The average error over the six categories is lower than that of all other methods and is reduced by 0.44 compared with the best SGPA technique, yielding the best three-dimensional shape reconstruction results. These results fully demonstrate the superiority of the proposed method in category-level object pose estimation, particularly for shape reconstruction in the bottle, can, and notebook computer categories, and further demonstrate its effectiveness in pose estimation tasks for objects with complex structures.
Referring to fig. 3, it is apparent that, in object pose and size estimation, the proposed method is closer to the ground-truth labels (white bounding boxes) than the SGPA method. This means that the proposed method better captures the geometric features of the object, thereby achieving more accurate pose and size estimation and effectively reducing the error with respect to the ground-truth labels.
Referring to fig. 4, it can be clearly observed that the 3D shapes reconstructed by the proposed method are very close to the real shapes of the objects. This demonstrates that the method achieves excellent performance in recovering the three-dimensional shape of an object from point cloud data: it accurately captures the geometric structure and details of the object and achieves high-quality three-dimensional shape reconstruction.
In summary, the present invention provides a category-level object pose estimation method based on a graph attention network. It extracts unique geometric features from the observed object point cloud by using a point cloud graph attention network composed of a graph attention encoder and an iterative non-parametric decoder, and progressively perceives the structural information of the object from local to global. A shape prior adaptation mechanism is then adopted to regress the normalized coordinates of the object, and finally the six-degree-of-freedom pose and size information of the object are obtained through the Umeyama algorithm.
The method provided by the invention achieves state-of-the-art performance on the REAL275 dataset in the category-level object pose estimation task. Its innovation lies in using a graph attention network to learn and aggregate multi-scale geometric features and in introducing a shape prior adaptation mechanism, which significantly improves pose estimation accuracy for objects with complex structures. The method brings new ideas and breakthroughs to the field of category-level object pose estimation and has important research and application value for machine vision and three-dimensional perception.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the scope of protection of the present invention.

Claims (10)

1. A category-level 6D object pose estimation method based on a point cloud graph attention network, characterized by comprising the following steps:
S1, preprocessing the input RGB-D image data and extracting the observed point cloud of the object under the depth camera;
S2, extracting multi-scale local-to-global object structural features from the observed point cloud by using a point cloud graph attention network;
S3, reconstructing a 3D point cloud model of the object by using a shape prior adaptation mechanism and a category shape prior point cloud, and regressing the object's normalized NOCS coordinates;
and S4, computing the similarity transformation between the reconstructed NOCS coordinates and the observed point cloud through the Umeyama algorithm to obtain the pose and size information of the object.
2. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 1, wherein S1 specifically comprises:
S11, segmenting and detecting the objects in the RGB-D image data by using Mask R-CNN to obtain the object mask region;
S12, mapping the object mask region onto the depth image of the object to obtain the object depth region;
S13, converting the depth information of the object into a three-dimensional object point cloud by using the camera parameters, generating the observed point cloud data seen by the camera.
3. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 1, wherein the point cloud graph attention network is an encoder-decoder architecture comprising:
a graph attention encoder for extracting multi-scale local-to-global object features from the observed point cloud;
the graph attention encoder takes the observed point cloud P_o ∈ ℝ^(N_o×3) as input,
where ℝ denotes the set of real numbers, N_o denotes the number of points, and 3 denotes the XYZ three-dimensional coordinates of each point;
the original three-dimensional coordinates of the observed point cloud are converted into high-dimensional features by a position embedding module;
local-to-global instance geometric features are extracted from the input feature embedding in a hierarchical manner by a graph attention module;
and an iterative non-parametric decoder for aggregating the multi-scale geometric features.
4. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 3, wherein the position embedding module encodes the position information in the observed point cloud by using a 3D graph convolution layer; for each observed point p_n in the observed point cloud:
the coordinate set {p_m}, m = 1, ..., M, of its M nearest neighbors is found by a nearest-neighbor search algorithm and used as the receptive field of the 3D graph convolution kernel,
where M denotes the number of nearest neighbors, m indexes one of those points, and p_m denotes the three-dimensional coordinates of that point;
the direction vectors d_(m,n) within the receptive field obtained by the nearest-neighbor search algorithm are computed as
d_(m,n) = p_m - p_n;
and the support-point kernel vectors k_s are initialized from a uniform distribution,
where S denotes the number of support points and each support point k_s is a three-dimensional coordinate;
p_n is embedded into a C_0-dimensional feature vector by correlating the direction vectors with the support kernel vectors through the normalized inner product ⟨d_(m,n), k_s⟩ / (‖d_(m,n)‖·‖k_s‖) and taking the maximum over the neighbors, and the obtained feature vector is passed through a ReLU activation function to generate the position embedding,
where max denotes the maximum operation, ⟨·,·⟩ denotes the vector inner product, and ‖·‖ denotes the length of a vector.
5. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 3, wherein the graph attention module performs a multi-stage operation on the position embedding:
G_e(P_o), denoted G_e ∈ ℝ^(N_o×C_0), where N_o denotes the number of points and C_0 the feature dimension of each point;
each stage has a different hidden dimension C_i, and in each stage i three different point-wise feature extraction layers are applied in turn to convert the input point features into the corresponding dimension C_i:
i) a graph convolution layer GCL;
ii) a point-wise self-attention layer PSAL;
iii) a feed-forward layer FFN.
6. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 5, wherein the graph convolution layer GCL extracts local geometric features F_i ∈ ℝ^(N_i×C_i) of the object from the input point features by using the graph structure defined by the neighboring points of each point,
where N_i denotes the number of points in the i-th stage and C_i the feature dimension of each point;
the graph convolution layer GCL consists of a 3D graph convolution layer and a ReLU function;
the point-wise self-attention layer PSAL uses a point cloud self-attention mechanism to extract global geometric features G_i from the local geometric features F_i;
the point-wise self-attention layer applies shared multi-layer perceptron networks to project the local geometric features F_i to query, key, and value vectors, denoted Q_i, K_i, and V_i respectively:
Q_i = F_i W_i^Q
K_i = F_i W_i^K
V_i = F_i W_i^V
where W_i^Q, W_i^K, and W_i^V are learnable matrices of dimension C_i × C_i;
point-to-point attention weights between the query vectors and the key vectors are computed to capture the global geometric relationships between different points;
the global geometric features G_i are obtained by multiplying the attention weights with the value vectors;
the feed-forward layer FFN generates the final output of each stage of the graph attention module by using a shared multi-layer perceptron network and a ReLU activation function;
a residual connection to the local geometric features F_i is added;
the obtained geometric features are used as the input point features of the next stage;
the geometric features of the last stage are passed through a global max-pooling operation followed by a repeat operation to obtain the global shape feature of the object, where C_4 = C_5, which describes the global shape information of the instance.
7. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 6, wherein the graph attention module comprises two graph max-pooling layers, and the output of the graph attention module consists of five parts:
at each stage, the graph convolution layer GCL, the point-wise self-attention layer PSAL, and the feed-forward layer FFN are stacked in order, extracting the object geometric features from the input point features in a point-wise manner.
8. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 7, wherein the iterative non-parametric decoder operates as follows:
let P_i denote the point set of the i-th stage of the graph attention module, where for i ∈ {2, 3, 4} the geometric features are defined on a downsampled set of points;
in each iteration of the i-th stage, a nearest-neighbor search algorithm is used to find, for each point p_(n,i-1) of the previous-stage point set P_(i-1), its nearest neighbor q_(n,i) in the i-th stage;
once the nearest neighbor q_(n,i) of each point p_(n,i-1) is determined, the features of point q_(n,i) are propagated to point p_(n,i-1);
the updated features of the points of the (i-1)-th stage are then used to update the features of the points of the (i-2)-th stage;
the whole process proceeds iteratively until the features of all points of the first stage are determined, and it is applied to the output of every stage of the graph attention module;
once all multi-scale geometric features are aligned to the same point set, they are aggregated through a concatenation operation with the position embedding G_e and the global shape feature to generate the final geometric features,
whose feature dimension is the sum of the dimensions of all concatenated features.
9. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 8, wherein S3 comprises:
given the observed point cloud P_o ∈ ℝ^(N_o×3) and the corresponding category shape prior point cloud P_r ∈ ℝ^(N_r×3), where N_o and N_r denote the respective numbers of points and each point is an XYZ three-dimensional coordinate;
a point-wise prior feature G_r is extracted from the category shape prior, the point-wise prior feature G_r being the concatenation of a local prior feature and a global prior feature;
the local prior features are generated by a three-layer multi-layer perceptron network, and the global prior features are generated from the local prior features by a further two-layer multi-layer perceptron network, where D_1 and D_2 are the dimensions of the local and global prior features;
a ReLU activation function is used after each multi-layer perceptron layer, and an adaptive max-pooling operation is used after the last ReLU activation function to generate the global prior feature;
after the prior features G_r and the geometric features of the observed point cloud are obtained, the shape prior adaptation mechanism generates from these features a deformation field D ∈ ℝ^(N_r×3) and a correspondence matrix A ∈ ℝ^(N_o×N_r), where N_r and N_o are the numbers of points, for NOCS coordinate regression;
each row d_i of D represents the deformation of the corresponding point from the prior point cloud P_r to the reconstructed point cloud P_r + D;
each row a_i of A, whose elements sum to 1, represents the soft correspondence between a point of the observed point cloud P_o and all points of the reconstructed point cloud;
the shape prior adaptation stage uses two parallel networks, each consisting of a three-layer multi-layer perceptron network, to regress the deformation field D and the correspondence matrix A, respectively;
multiplying A with the reconstructed point cloud P_r + D gives the NOCS coordinates of the object: P_nocs = A (P_r + D).
10. The category-level 6D object pose estimation method based on a point cloud graph attention network according to claim 9, wherein, given the observed point cloud P_o and its reconstructed NOCS coordinates, the Umeyama algorithm is used in combination with the RANSAC algorithm to compute the 6D object pose and 3D size.
CN202311083936.0A 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network Pending CN117132650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311083936.0A CN117132650A (en) 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311083936.0A CN117132650A (en) 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network

Publications (1)

Publication Number Publication Date
CN117132650A true CN117132650A (en) 2023-11-28

Family

ID=88854043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311083936.0A Pending CN117132650A (en) 2023-08-25 2023-08-25 Category-level 6D object pose estimation method based on point cloud image attention network

Country Status (1)

Country Link
CN (1) CN117132650A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117522990A (en) * 2024-01-04 2024-02-06 山东科技大学 Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
CN117522990B (en) * 2024-01-04 2024-03-29 山东科技大学 Category-level pose estimation method based on multi-head attention mechanism and iterative refinement

Similar Documents

Publication Publication Date Title
US10248664B1 (en) Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
US11501415B2 (en) Method and system for high-resolution image inpainting
US11645835B2 (en) Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
Chen et al. Efficient approximation of deep relu networks for functions on low dimensional manifolds
CN110597970B (en) Multi-granularity medical entity joint identification method and device
US10204299B2 (en) Unsupervised matching in fine-grained datasets for single-view object reconstruction
US20230359865A1 (en) Modeling Dependencies with Global Self-Attention Neural Networks
CN112288011B (en) Image matching method based on self-attention deep neural network
US20230043026A1 (en) Learning-based active surface model for medical image segmentation
CN112330719B (en) Deep learning target tracking method based on feature map segmentation and self-adaptive fusion
Liu et al. Optimization-based key frame extraction for motion capture animation
CN115203442B (en) Cross-modal deep hash retrieval method, system and medium based on joint attention
CN111709270B (en) Three-dimensional shape recovery and attitude estimation method and device based on depth image
CN113205523A (en) Medical image segmentation and identification system, terminal and storage medium with multi-scale representation optimization
CN117132650A (en) Category-level 6D object pose estimation method based on point cloud image attention network
CN111680550A (en) Emotion information identification method and device, storage medium and computer equipment
CN115222998A (en) Image classification method
US20220382246A1 (en) Differentiable simulator for robotic cutting
CN114048845B (en) Point cloud repairing method and device, computer equipment and storage medium
CN115936992A (en) Garbage image super-resolution method and system of lightweight transform
CN116363750A (en) Human body posture prediction method, device, equipment and readable storage medium
CN117522990B (en) Category-level pose estimation method based on multi-head attention mechanism and iterative refinement
Wang et al. MDISN: Learning multiscale deformed implicit fields from single images
CN116912296A (en) Point cloud registration method based on position-enhanced attention mechanism
CN111753736A (en) Human body posture recognition method, device, equipment and medium based on packet convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination