CN114140418A - Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image - Google Patents

Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Info

Publication number
CN114140418A
CN114140418A (application number CN202111418398.7A)
Authority
CN
China
Prior art keywords
grabbing
image
feasible
depth
grab
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111418398.7A
Other languages
Chinese (zh)
Inventor
孙景慷
张克勤
裘焱枫
杨根科
褚健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Original Assignee
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University filed Critical Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority to CN202111418398.7A priority Critical patent/CN114140418A/en
Publication of CN114140418A publication Critical patent/CN114140418A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image, relating to the field of computer vision, which comprises the following steps: step 1, converting the depth image into point cloud data and projecting it to obtain a three-channel X-Y-Z image; step 2, encoding the information of the RGB image and the X-Y-Z image with ResNet-50 to obtain a target segmentation result and a feasible-grab semantic segmentation result; step 3, completing the depth image to obtain a dense point cloud; step 4, calculating the normal vector and the two principal curvature directions of each feasible grabbing point from the feasible grabbing points and the dense point cloud to form a grabbing coordinate system; step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points to generate a plurality of grabbing candidates, each grabbing candidate corresponding to one grabbing closed region; step 6, inputting the points in each grabbing closed region into PointNet and filtering the grabbing candidates to obtain a final grabbing posture set; and step 7, projecting the grabbing candidates onto the target to generate the final targeted grabbing posture.

Description

Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image
Technical Field
The invention relates to the field of computer vision, in particular to a seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image.
Background
Robust robotic arm grasping is a fundamental requirement for robots in most industrial scenarios and in daily life. A complete grasping pipeline is divided into two parts: grabbing detection and path planning. Grabbing detection uses scene information obtained from sensors such as a monocular camera, a depth camera or a binocular camera to generate the six-degree-of-freedom posture that the end of the mechanical arm is required to reach; this six-degree-of-freedom posture refers to the position and the coordinate system to be reached by the center of the end of the mechanical arm. Path planning determines how to plan a motion path of the mechanical arm in the workspace for the six-degree-of-freedom posture generated by grabbing detection, so that the mechanical arm does not collide with the scene and its motion dynamics constraints are satisfied.
In recent years, with the development of deep learning, vision-based mechanical arm grabbing detection algorithms have developed rapidly. Vision-based mechanical arm grabbing detection schemes can be divided into three broad categories (see Fang H S, Wang C, Gou M, et al., GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping, CVPR 2020). The first category of detection schemes takes as input the RGB image captured by a monocular camera sensor and then detects a feasible grabbing frame in the 2D image, the grabbing frame containing the grabbing position and an angle representing the in-plane rotation. Such algorithms limit grabbing to be vertical to the desktop, which seriously limits the grabbing degrees of freedom and can make stacked objects in a cluttered scene difficult to grab. The second category of detection schemes detects the six-degree-of-freedom pose transformation of the object and uses it to transform the corresponding grabbing of the object from a reference coordinate system into the current coordinate system (see Zhao W, Zhang S, Guan Z, et al.). The problem with such algorithms is that they can only be used to grab targets that already exist in the data set; for a new target, 3D modeling is needed first and the grabbing postures must be marked manually, so the cost of obtaining the data set is too high. The third category of detection schemes takes point cloud data as input and uses the geometric and semantic information of the point cloud in 3D space to directly obtain, in a single-stage or two-stage manner, the six-degree-of-freedom posture that the end of the mechanical arm needs to reach (see Liang H, Ma X, Li S, et al., PointNetGPD: Detecting Grasp Configurations from Point Sets, ICRA 2019). The advantage of such methods is that the trained model generalizes well and can produce unrestricted grabbing postures, but in most cases these detection schemes take only unstable point cloud data as input and cannot carry out targeted grabbing.
Existing six-degree-of-freedom grabbing posture detection schemes are few, and none of them applies RGB data to overcome the instability of point cloud data and to generate object-oriented grabbing. In the patent application entitled "A robot grasping detection method based on multi-class object segmentation" (Chinese Patent No. CN112861667A), RGB images are used for image segmentation and semantic recognition, and a grasping rectangular frame containing an in-plane rotation angle is generated for the segmented object. In "Robot grasping pose estimation method based on object recognition deep learning model" (Chinese Patent No. 01810803444), Li Mingyang et al. fuse two-dimensional and three-dimensional visual information to obtain the point cloud of the target object, and then estimate the pose of the target object by registering this point cloud against an object point cloud template in a template library. Qian et al., in the patent application "A robot grabbing detection scheme based on instance segmentation under single-view point cloud" (Chinese Patent No. CN110363815A), use the RGB image to perform target segmentation, map the segmented target pixel set into a point cloud, generate an initial grabbing coordinate system at randomly sampled points according to the geometric structure of the original point cloud data, and finally generate the final six-degree-of-freedom grabbing pose through translation and filtering.
With the continuous development of deep learning, the role of RGB data in posture detection is gradually being mined. RGB data can be used to predict points on an image that carry specific semantic information, such as key points on the human body (see Sun K, Xiao B, Liu D, et al., Deep High-Resolution Representation Learning for Human Pose Estimation, CVPR 2019), and can also be used to predict a grab rotation matrix for each point on the image (see Gou M, Fang H S, Zhu Z, et al., RGB Matters: Learning 7-DoF Grasp Poses on Monocular RGBD Images, ICRA 2021).
Therefore, for the capture detection part, those skilled in the art are dedicated to developing a seven-degree-of-freedom capture pose detection method based on RGB images and depth images, overcoming the defects of point cloud data instability and incapability of performing targeted capture in the point cloud-based capture method, and improving the accuracy and stability of the generated capture pose.
Disclosure of Invention
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to solve the problems of instability of point cloud data and lack of pertinence of generated capture in the existing point cloud-based six-degree-of-freedom capture detection method, so that the accuracy and stability of the finally generated capture pose are improved.
In order to achieve the above object, the present invention provides an improved grab detection method using RGB data, which generates a robust seven-degree-of-freedom grab pose for a parallel two-finger grab based on monocular RGB data and depth data for a grab detection section. Compared with the six-degree-of-freedom posture, the seven-degree-of-freedom grabbing posture increases the grabbing width.
The invention provides a seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image, which comprises the following steps of:
step 1, converting a depth image into point cloud data, and projecting coordinates of the point cloud data into a two-dimensional image to obtain a three-channel coordinate image X-Y-Z image;
step 2, using ResNet-50 to encode the information of the RGB image and the X-Y-Z image, then using a target segmentation decoding network and a feasible capture semantic segmentation decoding network to simultaneously decode the encoded information to obtain a target segmentation result and a feasible capture semantic segmentation result of each pixel in the image, and obtaining a feasible capture point from the feasible capture semantic segmentation result;
step 3, completing the depth image by using a PENet and the RGB image to obtain a completed dense depth image and further obtain dense point cloud;
step 4, calculating normal vectors and two principal curvature directions of the feasible grabbing points by using the feasible grabbing points obtained from the feasible grabbing semantic segmentation result in the step 2 and the dense point cloud obtained in the step 3 to form a grabbing coordinate system;
step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points by using a heuristic algorithm to obtain a plurality of grabbing candidates, wherein the grabbing depth of the grabbing candidates is the maximum, and the grabbing width of the grabbing candidates is the minimum; each grabbing candidate corresponds to one grabbing closed area;
step 6, inputting points in the grabbing closed area of the grabbing candidates into PointNet, and filtering out infeasible grabbing candidates to obtain a final grabbing gesture set;
and 7: and combining the target segmentation result in the step 2, projecting the grabbing candidates in the grabbing gesture set onto the target of the target segmentation result, and generating a targeted grabbing gesture.
Further, the grabber adopted in the method is a two-finger parallel grabber, and the seven-degree-of-freedom grabbing parameters are expressed as: g = (x, y, z, α, β, γ, w), where (x, y, z) denotes the position of the gripper in the world coordinate system, (α, β, γ) denotes the rotation of the gripper coordinate system around the x, y, z axes of the world coordinate system, and w denotes the tip width of the gripper.
Further, in the step 1, the specific method of projecting the coordinates of the point cloud data to the two-dimensional image to obtain the X-Y-Z image is as follows:
X = (u - u0) · D / fx, Y = (v - v0) · D / fy, Z = D
wherein D represents the depth image (the depth value at pixel (u, v)), and u0, v0, fx, fy represent the camera intrinsic parameters.
Further, in the step 2, a multitask semantic segmentation module is used for carrying out pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image; the object segmentation is used for detecting the class of the pixel, and the feasible grabbing semantic segmentation is used for detecting whether the pixel is suitable to be used as a grabbing center;
the target segmentation decoding network and the feasible capture semantic segmentation decoding network are composed of intensive up-sampling convolution modules with different layer numbers;
the loss function of the target division decoding network uses an improved cross-entropy loss function LsemDefined as:
Figure BDA0003375888040000032
wherein N represents the total number of pixels of the image; n is a radical ofcRepresenting a category total; w is acRepresents the weight of the category c in all categories and has the calculation formula of
Figure BDA0003375888040000033
TcIndicates the total number of pixels with class truth value of c, TcFor balancing the case of an unbalanced number of classes;
Figure BDA0003375888040000034
values of 0 or 1, 0 being indicative of the classThe class truth value is different from the class truth value corresponding to the pixel, and the fact that the class truth value is 1 means that the class truth value is the same as the class truth value corresponding to the pixel;
Figure BDA0003375888040000041
confidence score representing that pixel x belongs to list c, using
Figure BDA0003375888040000042
The difficulty degrees of different samples are balanced, the loss weight of the sample with higher confidence score is reduced, and gamma is an adjustable parameter.
Furthermore, the feasible-grab semantic segmentation decoding network is a two-class network and adopts a common cross-entropy loss function L_ga, specifically defined as:
L_ga = -(1/N) Σ_x [ α · x_g · log p(x) + β · (1 - x_g) · log(1 - p(x)) ]
where N represents the total number of pixels of the image, x_g ∈ {0, 1} represents the class truth value of pixel x, and p(x) represents the predicted confidence score that pixel x is graspable; α is set to 1 and β is set to 0.1 so that the loss of pixels labeled as graspable points occupies a larger weight.
Further, the loss function L of the multitask semantic segmentation module is defined as:
L = γ1 · L_sem + γ2 · L_ga
wherein γ1 and γ2 are adjustable parameters.
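For illustration only, a possible PyTorch rendering of the two losses and their weighted sum is sketched below; it is not the patented implementation. The class-weight vector w_c is passed in precomputed from T_c (its exact formula is not reproduced here), and the default focal exponent and all tensor shapes are assumptions.

```python
# Illustrative sketch only (not the patented code): a possible PyTorch
# rendering of L_sem, L_ga and their weighted sum. The class weights w_c
# are supplied precomputed from T_c; their exact formula, the default
# gamma, and all tensor shapes are assumptions.
import torch
import torch.nn.functional as F


def weighted_focal_ce(logits, target, class_weights, gamma=2.0):
    """L_sem: class-weighted focal cross-entropy.
    logits: (B, C, H, W) raw scores, target: (B, H, W) integer class labels,
    class_weights: (C,) tensor of w_c values."""
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()
    log_p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_c*(x)
    p_t = p.gather(1, target.unsqueeze(1)).squeeze(1)          # p_c*(x)
    w = class_weights[target]                                  # w_c per pixel
    return -(w * (1.0 - p_t) ** gamma * log_p_t).mean()


def weighted_bce(logits, target, alpha=1.0, beta=0.1):
    """L_ga: cross-entropy with a larger weight on graspable pixels.
    logits, target: (B, H, W), target values in {0, 1} (float)."""
    p = torch.sigmoid(logits)
    loss = -(alpha * target * torch.log(p + 1e-8)
             + beta * (1.0 - target) * torch.log(1.0 - p + 1e-8))
    return loss.mean()


def multitask_loss(sem_logits, sem_target, ga_logits, ga_target,
                   class_weights, gamma1=1.0, gamma2=1.0):
    """L = gamma1 * L_sem + gamma2 * L_ga."""
    return (gamma1 * weighted_focal_ce(sem_logits, sem_target, class_weights)
            + gamma2 * weighted_bce(ga_logits, ga_target))
```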
Further, in the step 3, a depth image completion module is used to complete the depth image;
the PENet algorithm adopts a two-channel framework, and a similar encoder-decoder network is constructed by using a deep convolution neural network and a deconvolution mode, wherein one channel takes color information as dominant input to obtain a color dominant depth map; and the other channel uses the original depth image as a dominant input, combines the color dominant depth map to obtain a depth dominant depth map, then fuses the obtained color dominant depth map and the depth dominant depth map in a weighting mode to obtain a preliminary dense depth image, and finally refines the dense depth image by using DA-CSPN + + to obtain a final complemented dense depth image.
Further, in the step 4, the rotation matrix detection module is used for calculating the grabbing coordinate system where the feasible grabbing point is located;
sampling K nearest neighbor points near the feasible grabbing point by using a K nearest neighbor algorithm to form a point set, fitting a plane nearest to the point set, obtaining a normal direction of the feasible grabbing point according to the fitted plane, cutting a curved surface where the feasible grabbing point is located by using the plane passing through the normal direction, wherein the curved surface and the plane can be intersected to form a curve, the curve has a curvature at the feasible grabbing point, and selecting a direction with the maximum curvature and the minimum curvature at the feasible grabbing point in different curves as two main curvature directions of the feasible grabbing point.
Further, in the step 5, a grasp depth and grasp width detection module is used to determine the grasp depth and grasp width of the feasible grasp point;
taking the z value of the feasible grabbing point as an interval center, sampling two sides of the interval center by adopting a heuristic algorithm, and judging whether different z and w meet the following conditions: 1) the grabber does not collide with the scene point cloud before closing; 2) the closed area of the gripper needs to contain a centre of grip.
Further, in said step 6 and said step 7, determining said targeted grasp gesture using a grasp classification and assignment module;
and using the PointNet as an encoder to encode the generated information of the points in the grabbing closed region of the grabbing candidates, classifying by using a full connection layer, and filtering the grabbing candidates which are infeasible to obtain the final grabbing gesture set.
The improved capture detection method using RGB data provided by the invention at least has the following technical effects:
1. The existing six-degree-of-freedom grabbing detection methods simply use point cloud data as input and obtain candidates for feasible grabbing points by randomly sampling the point cloud, which has two problems: first, the point cloud data obtained by a depth camera is noisy and is sparse at some thin edges of an object, so noise points are sampled and grabs cannot be generated at the thin edges of the object; second, the distribution of feasible grabbing points in the scene is not uniform, so random sampling results in a large number of invalid operations and unnecessary computation overhead. Moreover, most existing six-degree-of-freedom grabbing detection methods generate non-targeted grabs, and a grabbing posture may even be generated on a background region that merely satisfies the grabbing space requirement, so they cannot be applied to scenes with high stability requirements. The technical scheme provided by the invention provides a multitask semantic segmentation module that combines RGB data and point cloud data to obtain a pixel-by-pixel class label and a graspable/non-graspable label, which are used respectively for the subsequent generation of targeted grabs and of grabbing postures, thereby overcoming the blindness of random sampling and the instability of the point cloud and facilitating the formation of the final targeted grab;
2. The technical scheme provided by the invention introduces a depth image completion algorithm into the six-degree-of-freedom grabbing detection algorithm, which facilitates the generation of a grabbing coordinate system at thin edges of an object that are not captured by the sensor, significantly alleviates the instability of point cloud data in six-degree-of-freedom grabbing detection methods, and enables a stable and targeted seven-degree-of-freedom grabbing posture to be detected.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is an overall flow diagram of a preferred embodiment of the present invention;
FIG. 2 is an overall framework of a multitask semantic segmentation module in the embodiment shown in FIG. 1;
FIG. 3 is a diagram of exemplary pertinence grabbing of original scene, object segmentation, feasible grabbing semantic segmentation and generation in the embodiment shown in FIG. 1.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
The existing six-degree-of-freedom grabbing detection method simply uses point cloud data as input, and candidates of feasible grabbing points are obtained by randomly sampling the point cloud data. This method has two problems: firstly, because point cloud data obtained by a depth camera has more noise and is sparser at some thin edges of an object, noise points are sampled and grabbing cannot be generated at the thin edges of the object; secondly, the distribution of feasible capture points is not uniform in a scene, and a random sampling mode can cause a large amount of invalid operations, thereby causing unnecessary operation expenditure. Meanwhile, most of the existing six-degree-of-freedom grabbing detection methods generate non-pertinence grabbing, and even a grabbing gesture may be generated at a background meeting the grabbing space requirement, so that the existing six-degree-of-freedom grabbing detection methods cannot be applied to scenes with higher stability requirements.
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to solve the problems of instability of point cloud data and lack of pertinence of generated capture in the existing point cloud-based six-degree-of-freedom capture detection method, so that the accuracy and stability of the finally generated capture pose are improved.
In order to achieve the above object, the present invention provides an improved grab detection method using RGB data, which generates a robust seven-degree-of-freedom grab pose for a parallel two-finger grab based on monocular RGB data and depth data for a grab detection section. Compared with the six-degree-of-freedom posture, the seven-degree-of-freedom grabbing posture increases the grabbing width. Specifically, RGB images and depth images of a scene are obtained firstly, coded information is processed by using ResNet-50, target segmentation and feasible capture semantic segmentation are performed through two decoding networks, and the depth images are complemented by using the RGB images, so that dense point clouds are obtained. Generating normal vectors and two principal curvature directions of feasible grabbing points in the dense point cloud to serve as a grabbing coordinate system, then sampling the grabbing depth and width in a heuristic mode, and reserving the grabbing posture with the largest grabbing depth and the smallest grabbing width to serve as a grabbing candidate with seven degrees of freedom. And finally, classifying the grabbing candidates by using PointNet, filtering infeasible grabbing to obtain a final feasible seven-degree-of-freedom grabbing posture, and projecting the grabbing posture to a corresponding target according to a grabbing center to obtain a targeted grabbing posture.
As shown in fig. 1, the method comprises the following steps after obtaining the RGB image and the depth image in advance:
step 1, converting a depth image into point cloud data, and projecting coordinates of the point cloud data into a two-dimensional image to obtain a three-channel coordinate image X-Y-Z image;
step 2, using ResNet-50 to encode the information of RGB image and X-Y-Z image, then using target segmentation decoding network and feasible capture semantic segmentation decoding network to decode the encoded information at the same time, obtaining the target segmentation result and feasible capture semantic segmentation result of each pixel in the image, and obtaining feasible capture point from the feasible capture semantic segmentation result;
step 3, completing the depth image by using a PENet and an RGB image to obtain a completed dense depth image and further obtain dense point cloud;
step 4, calculating normal vectors and two principal curvature directions of the feasible grabbing points by using the feasible grabbing points obtained from the feasible grabbing semantic segmentation result in the step 2 and the dense point cloud obtained in the step 3 to form a grabbing coordinate system;
step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points by using a heuristic algorithm to obtain a plurality of grabbing candidates, wherein the grabbing depth of the grabbing candidates is the maximum, and the grabbing width of the grabbing candidates is the minimum; each grabbing candidate corresponds to one grabbing closed area;
step 6, inputting points in the grabbing closed area of the grabbing candidates into PointNet, filtering out infeasible grabbing candidates, and obtaining a final grabbing gesture set;
and 7: and (3) combining the target segmentation result in the step (2), projecting the grabbing candidates in the grabbing gesture set to the target of the target segmentation result, and generating a targeted grabbing gesture.
Specifically, the grabber adopted in the method is a two-finger parallel grabber, and the seven-degree-of-freedom grabbing parameters are expressed as: g = (x, y, z, α, β, γ, w), where (x, y, z) denotes the position of the gripper in the world coordinate system, (α, β, γ) denotes the rotation of the gripper coordinate system around the x, y, z axes of the world coordinate system, and w denotes the tip width of the gripper.
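As a purely illustrative aid, the seven-degree-of-freedom grasp g = (x, y, z, α, β, γ, w) could be held in a small container such as the hypothetical Grasp7DoF class below; the class name and the extrinsic 'xyz' Euler convention (rotations about the fixed world axes) are assumptions made for this sketch.

```python
# Hypothetical container for a 7-DoF grasp (illustration only). The class
# name and the extrinsic 'xyz' Euler convention are assumptions.
from dataclasses import dataclass

import numpy as np
from scipy.spatial.transform import Rotation


@dataclass
class Grasp7DoF:
    x: float        # gripper position in the world frame
    y: float
    z: float
    alpha: float    # rotation about the world x axis (radians)
    beta: float     # rotation about the world y axis
    gamma: float    # rotation about the world z axis
    width: float    # tip opening width w of the two-finger gripper

    def rotation_matrix(self) -> np.ndarray:
        """3x3 rotation of the gripper frame with respect to the world frame."""
        return Rotation.from_euler(
            "xyz", [self.alpha, self.beta, self.gamma]).as_matrix()

    def pose_matrix(self) -> np.ndarray:
        """4x4 homogeneous pose combining the rotation and the translation."""
        T = np.eye(4)
        T[:3, :3] = self.rotation_matrix()
        T[:3, 3] = [self.x, self.y, self.z]
        return T
```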
Because the RGB image has abundant semantic information and texture information in a two-dimensional space, and the point cloud data has semantic information and space geometric information in a three-dimensional space, the two are combined to perform a semantic segmentation task.
In step 1, the specific method for projecting the coordinates of the point cloud data to the two-dimensional image to obtain the X-Y-Z image comprises the following steps:
X = (u - u0) · D / fx, Y = (v - v0) · D / fy, Z = D
where D denotes the depth image (the depth value at pixel (u, v)), and u0, v0, fx, fy denote the camera intrinsic parameters.
And then using the obtained (R, G, B, X, Y, Z) six-channel image as the input image of the step 2.
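A minimal sketch of this step is given below, assuming a metric depth map and the pinhole model above; the function name and the channel stacking order are illustrative, not taken from the patent.

```python
# Minimal sketch of step 1: back-project a metric depth image into an
# X-Y-Z coordinate image with the pinhole model given above, then stack it
# with the RGB image. Function and variable names are illustrative.
import numpy as np


def depth_to_xyz_image(depth, fx, fy, u0, v0):
    """depth: (H, W) array of depth values D; returns an (H, W, 3) X-Y-Z image."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - u0) * depth / fx
    y = (v - v0) * depth / fy
    z = depth
    return np.stack([x, y, z], axis=-1)


# Six-channel input for step 2: concatenate RGB and X-Y-Z along the channels.
# rgbxyz = np.concatenate([rgb.astype(np.float32), xyz], axis=-1)  # (H, W, 6)
```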
In step 2, a multitask semantic segmentation module is used for carrying out pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image (namely (R, G, B, X, Y, Z) six-channel image); the object segmentation is used for detecting the class to which the pixel belongs, and the feasible capture semantic segmentation is used for detecting whether the pixel is suitable to be used as a capture center.
The target segmentation decoding network and the feasible capture semantic segmentation decoding network are composed of intensive upsampling convolution modules with different layer numbers;
loss function usage of target-partition decoding network improved cross-entropy loss function LsemDefined as:
Figure BDA0003375888040000072
wherein N represents the total number of pixels of the image; n is a radical ofcRepresenting a category total; w is acRepresents the weight of the category c in all categories and has the calculation formula of
Figure BDA0003375888040000073
TcIndicates the total number of pixels with class truth value of c, TcFor balancing the case of an unbalanced number of classes;
Figure BDA0003375888040000074
the value is 0 or 1, the value of 0 indicates that the class is different from the class truth value corresponding to the pixel, and the value of 1 indicates that the class is the same as the class truth value corresponding to the pixel;
Figure BDA0003375888040000075
confidence score representing that pixel x belongs to list c, using
Figure BDA0003375888040000076
The difficulty degrees of different samples are balanced, the loss weight of the sample with higher confidence score is reduced, and gamma is an adjustable parameter.
The feasible capture semantic segmentation decoding network is a two-class network and adopts a common cross-entropy loss function L_ga, specifically defined as:
L_ga = -(1/N) Σ_x [ α · x_g · log p(x) + β · (1 - x_g) · log(1 - p(x)) ]
where N represents the total number of pixels of the image, x_g ∈ {0, 1} represents the class truth value of pixel x, and p(x) represents the predicted confidence score that pixel x is graspable; α is set to 1 and β is set to 0.1 so that the loss of pixels labeled as graspable points occupies a larger weight.
The loss function L of the multitask semantic segmentation module is defined as:
L = γ1 · L_sem + γ2 · L_ga
where γ1 and γ2 are adjustable parameters.
The training data of the multitask semantic segmentation module is from a GraspNet-1Billion data set. The dataset contains object segmentation labels, and to obtain feasible capture semantic segmentation labels, the capture center of the 6DOF capture pose in the dataset may be projected into the 2D image.
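A possible realization of this label projection is sketched below, assuming the 6DOF grasp centers have already been transformed into the camera frame and that a single pixel is marked per center (the footprint of the label is not specified in the text).

```python
# Possible realization of the label projection: project each 6-DOF grasp
# centre (X, Y, Z), assumed to be in the camera frame, onto the image plane
# and mark that pixel as graspable. One pixel per centre is an assumption.
import numpy as np


def grasp_centers_to_label(centers_cam, h, w, fx, fy, u0, v0):
    """centers_cam: (M, 3) grasp centres; returns an (h, w) 0/1 label map."""
    label = np.zeros((h, w), dtype=np.uint8)
    X, Y, Z = centers_cam[:, 0], centers_cam[:, 1], centers_cam[:, 2]
    u = np.round(fx * X / Z + u0).astype(int)
    v = np.round(fy * Y / Z + v0).astype(int)
    valid = (Z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    label[v[valid], u[valid]] = 1  # pixels marked as feasible grasp centres
    return label
```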
As shown in fig. 2, it is an overall framework of the multitask semantic segmentation module in step 2. In this step, a multitask semantic segmentation module is used to perform pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image. The object segmentation is used for detecting the class to which the pixel belongs, and the feasible capture semantic segmentation is used for detecting whether the pixel is suitable to be used as a capture center. Specifically, information of an RGB image and an X-Y-Z image is coded by using ResNet-50, and then a target segmentation result and a feasible capture semantic segmentation result of each pixel are obtained by decoding the coded information by using a target segmentation decoding network and a feasible capture semantic segmentation decoding network. The target segmentation decoding network and the feasible capture semantic segmentation decoding network both adopt dense upsampling convolutional networks, but the layer number is different.
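The following sketch mirrors the framework of fig. 2 at a very coarse level: a ResNet-50 encoder over the six-channel input and two decoding heads. The head depths, channel counts and the use of PixelShuffle as a stand-in for the dense upsampling convolution modules are assumptions, not the patented architecture.

```python
# Coarse sketch of the fig. 2 framework (assumed details, not the patent's
# exact network): ResNet-50 encoder over the six-channel RGB + X-Y-Z input
# and two decoding heads for target segmentation and graspability.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class MultitaskSegNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)
        # Accept six input channels (R, G, B, X, Y, Z) instead of three.
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Drop the average-pooling and fully connected layers; the encoder
        # outputs a (B, 2048, H/32, W/32) feature map.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.seg_head = self._upsample_head(2048, num_classes)  # target segmentation
        self.grasp_head = self._upsample_head(2048, 2)          # graspable or not

    @staticmethod
    def _upsample_head(in_ch: int, out_ch: int) -> nn.Module:
        # Predict out_ch * 32^2 channels, then rearrange them into a
        # full-resolution map (dense-upsampling-style decoding).
        return nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, out_ch * 32 * 32, 1),
            nn.PixelShuffle(32),
        )

    def forward(self, rgbxyz: torch.Tensor):
        feat = self.encoder(rgbxyz)
        return self.seg_head(feat), self.grasp_head(feat)
```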
Specifically, in step 3, the depth image is complemented using a depth image complementing module.
The PENet algorithm adopts a two-branch framework in which encoder-decoder-style networks are constructed with deep convolutional and deconvolutional layers. One branch takes the color information as the dominant input to obtain a color-dominant depth map; the other branch takes the original depth image as the dominant input and combines it with the color-dominant depth map to obtain a depth-dominant depth map. The two depth maps are then fused by weighting to obtain a preliminary dense depth image, and finally the dense depth image is refined with DA-CSPN++ to obtain the final completed dense depth image.
Specifically, in step 4, a rotation matrix detection module is used to calculate a grabbing coordinate system where the feasible grabbing points are located.
K nearest neighbor points near the feasible grabbing point are sampled with the K-Nearest-Neighbor (KNN) algorithm to form a point set, and the plane that best fits this point set is computed; the normal direction of the feasible grabbing point is obtained from the fitted plane. The curved surface on which the feasible grabbing point lies is then cut by planes passing through this normal direction; each cutting plane intersects the surface in a curve that has a curvature at the feasible grabbing point, and the directions of maximum and minimum curvature at the feasible grabbing point among the different curves are selected as the two principal curvature directions of the feasible grabbing point.
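A simplified stand-in for this step is sketched below using local PCA: the smallest-variance eigenvector gives the fitted-plane normal, and the two remaining eigenvectors, which span the tangent plane, are used in place of the principal-curvature directions obtained by the curve-cutting construction described above. This is an approximation chosen for the sketch, not a literal implementation of the patented procedure.

```python
# Simplified stand-in: local PCA over the K nearest neighbours. The tangent
# eigenvectors only approximate the principal-curvature directions.
import numpy as np
from scipy.spatial import cKDTree


def grasp_frame(point, cloud, k=30):
    """Return a 3x3 rotation matrix [d1, d2, normal] at a feasible grasp point."""
    tree = cKDTree(cloud)
    _, idx = tree.query(point, k=k)
    nbrs = cloud[idx]
    centered = nbrs - nbrs.mean(axis=0)
    # Eigen-decomposition of the local covariance; eigenvalues ascending.
    eigval, eigvec = np.linalg.eigh(centered.T @ centered)
    normal = eigvec[:, 0]   # smallest variance: fitted-plane normal
    d2 = eigvec[:, 1]       # tangent direction (stand-in for min curvature)
    d1 = eigvec[:, 2]       # tangent direction (stand-in for max curvature)
    return np.stack([d1, d2, normal], axis=1)
```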
Specifically, in step 5, the grab depth and grab width of the feasible grab point are determined using the grab depth and grab width detection module.
Taking the z value of the feasible grabbing point as an interval center, sampling two sides of the interval center by adopting a heuristic algorithm, and judging whether different z and w meet the following conditions: 1) the grabber does not collide with the scene point cloud before closing; 2) the closed area of the gripper needs to contain the centre of grip.
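A hedged sketch of this heuristic search is given below, with the gripper approximated by thin finger slabs in the grasp frame; the sampling ranges, the finger and hand dimensions, and the exact form of the two checks are assumptions made for the example. Because candidates are visited from largest depth to smallest width, the first valid hit keeps the maximum grasp depth and minimum grasp width.

```python
# Hedged sketch of the z/w heuristic search; geometry and thresholds are
# assumptions, not the patented collision and closure tests.
import numpy as np


def sample_depth_width(cloud_local, depths, widths,
                       finger_thickness=0.01, hand_height=0.02):
    """cloud_local: (M, 3) scene points in the grasp frame, grasp centre at the
    origin, closing direction along x, approach direction along -z."""
    for z in sorted(depths, reverse=True):          # prefer larger depth
        for w in sorted(widths):                    # prefer smaller width
            in_hand = (np.abs(cloud_local[:, 1]) < hand_height / 2) \
                      & (cloud_local[:, 2] > -z) & (cloud_local[:, 2] < 0)
            # 1) the fingers (slabs at x = +/- w/2) must not hit the scene
            #    before closing.
            fingers = np.abs(np.abs(cloud_local[:, 0]) - w / 2) < finger_thickness / 2
            if np.any(fingers & in_hand):
                continue
            # 2) the closing region must enclose points around the grasp centre.
            closing = (np.abs(cloud_local[:, 0]) < w / 2) & in_hand
            if not np.any(closing):
                continue
            return z, w, closing
    return None
```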
Specifically, in step 6 and step 7, the targeted grab pose is determined using a grab classification and assignment module.
And using PointNet as an encoder to encode information of points in the grabbing closed region of the generated grabbing candidates, classifying by using a full-connection layer, and filtering infeasible grabbing candidates to obtain a final grabbing posture set.
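A minimal PointNet-style classifier consistent with this description (shared per-point MLP, global max pooling, fully connected head) is sketched below; all layer sizes are assumptions, since the text only names PointNet and a fully connected classification layer.

```python
# Minimal PointNet-style grasp-candidate classifier; layer sizes are assumed.
import torch
import torch.nn as nn


class GraspCandidateClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(     # shared MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 128, 1), nn.ReLU(inplace=True),
            nn.Conv1d(128, 256, 1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(    # fully connected head
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2),              # feasible / infeasible
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        """pts: (B, 3, N) points inside each candidate's grabbing closed region."""
        feat = self.point_mlp(pts).max(dim=2).values  # global max pooling
        return self.classifier(feat)


# Usage sketch: keep only the candidates predicted as feasible.
# scores = GraspCandidateClassifier()(candidate_points)  # (B, 2)
# keep = scores.argmax(dim=1) == 1
```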
As shown in fig. 3, exemplary images of the original scene, the target segmentation, the feasible-grab semantic segmentation and the generated targeted grab in the method shown in fig. 1 are given: the upper-left image in fig. 3 is the original scene, the upper-right image is the target segmentation, the lower-left image is the feasible-grab semantic segmentation, and the lower-right image is the generated targeted grab.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image is characterized by comprising the following steps of:
step 1, converting a depth image into point cloud data, and projecting coordinates of the point cloud data into a two-dimensional image to obtain a three-channel coordinate image X-Y-Z image;
step 2, using ResNet-50 to encode the information of the RGB image and the X-Y-Z image, then using a target segmentation decoding network and a feasible capture semantic segmentation decoding network to simultaneously decode the encoded information to obtain a target segmentation result and a feasible capture semantic segmentation result of each pixel in the image, and obtaining a feasible capture point from the feasible capture semantic segmentation result;
step 3, completing the depth image by using a PENet and the RGB image to obtain a completed dense depth image and further obtain dense point cloud;
step 4, calculating normal vectors and two principal curvature directions of the feasible grabbing points by using the feasible grabbing points obtained from the feasible grabbing semantic segmentation result in the step 2 and the dense point cloud obtained in the step 3 to form a grabbing coordinate system;
step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points by using a heuristic algorithm to obtain a plurality of grabbing candidates, wherein the grabbing depth of the grabbing candidates is the maximum, and the grabbing width of the grabbing candidates is the minimum; each grabbing candidate corresponds to one grabbing closed area;
step 6, inputting points in the grabbing closed area of the grabbing candidates into PointNet, and filtering out infeasible grabbing candidates to obtain a final grabbing gesture set;
and 7: and combining the target segmentation result in the step 2, projecting the grabbing candidates in the grabbing gesture set onto the target of the target segmentation result, and generating a targeted grabbing gesture.
2. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 1, wherein the grabber used in the method is a two-finger parallel grabber, and the seven-degree-of-freedom grab parameters are expressed as: g = (x, y, z, α, β, γ, w), where (x, y, z) denotes the position of the gripper in the world coordinate system, (α, β, γ) denotes the rotation of the gripper coordinate system around the x, y, z axes of the world coordinate system, and w denotes the tip width of the gripper.
3. The method for detecting a capturing pose with seven degrees of freedom based on an RGB image and a depth image as claimed in claim 1, wherein the step 1 of projecting the coordinates of the point cloud data to the two-dimensional image to obtain the X-Y-Z image comprises:
X = (u - u0) · D / fx, Y = (v - v0) · D / fy, Z = D
wherein D represents the depth image, and u0, v0, fx, fy represent the camera intrinsic parameters.
4. The RGB image and depth image-based seven-degree-of-freedom capture pose detection method of claim 1, wherein in the step 2, a multitask semantic segmentation module is used to perform pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image; the object segmentation is used for detecting the class of the pixel, and the feasible grabbing semantic segmentation is used for detecting whether the pixel is suitable to be used as a grabbing center;
the target segmentation decoding network and the feasible capture semantic segmentation decoding network are composed of intensive up-sampling convolution modules with different layer numbers;
the loss function of the target division decoding network uses an improved cross-entropy loss function LsemDefined as:
Figure FDA0003375888030000021
wherein N represents the total number of pixels of the image; n is a radical ofcRepresenting a category total; w is acRepresents the weight of the category c in all categories and has the calculation formula of
Figure FDA0003375888030000022
TcIndicates the total number of pixels with class truth value of c, TcFor balancing the case of an unbalanced number of classes;
Figure FDA0003375888030000023
the value is 0 or 1, the value of 0 indicates that the class is different from the class truth value corresponding to the pixel, and the value of 1 indicates that the class is the same as the class truth value corresponding to the pixel;
Figure FDA0003375888030000024
confidence score representing that pixel x belongs to list c, using
Figure FDA0003375888030000025
The difficulty degrees of different samples are balanced, the loss weight of the sample with higher confidence score is reduced, and gamma is an adjustable parameter.
5. The RGB image and depth image based seven-degree-of-freedom grab gesture detection method of claim 4, wherein the feasible grab semantic segmentation decoding network is a two-class network using a common cross-entropy loss function L_ga, specifically defined as:
L_ga = -(1/N) Σ_x [ α · x_g · log p(x) + β · (1 - x_g) · log(1 - p(x)) ]
where N represents the total number of pixels of the image, x_g ∈ {0, 1} represents the class truth value of pixel x, and p(x) represents the predicted confidence score that pixel x is graspable; α is set to 1 and β is set to 0.1 so that the loss of pixels labeled as graspable points occupies a larger weight.
6. The method of seven-degree-of-freedom grabbing posture detection based on an RGB image and a depth image as claimed in claim 5, wherein the loss function L of the multitask semantic segmentation module is defined as:
L = γ1 · L_sem + γ2 · L_ga
wherein γ1 and γ2 are adjustable parameters.
7. The RGB image and depth image-based seven-degree-of-freedom capture pose detection method of claim 1, wherein in the step 3, the depth image is complemented using a depth image complementing module;
the PENet algorithm adopts a two-channel framework, and a similar encoder-decoder network is constructed by using a deep convolution neural network and a deconvolution mode, wherein one channel takes color information as dominant input to obtain a color dominant depth map; and the other channel uses the original depth image as a dominant input, combines the color dominant depth map to obtain a depth dominant depth map, then fuses the obtained color dominant depth map and the depth dominant depth map in a weighting mode to obtain a preliminary dense depth image, and finally refines the dense depth image by using DA-CSPN + + to obtain a final complemented dense depth image.
8. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 1, wherein in the step 4, the grab coordinate system where the feasible grab point is located is calculated using the rotation matrix detection module;
sampling K nearest neighbor points near the feasible grabbing point by using a K nearest neighbor algorithm to form a point set, fitting a plane nearest to the point set, obtaining a normal direction of the feasible grabbing point according to the fitted plane, cutting a curved surface where the feasible grabbing point is located by using the plane passing through the normal direction, wherein the curved surface and the plane can be intersected to form a curve, the curve has a curvature at the feasible grabbing point, and selecting a direction with the maximum curvature and the minimum curvature at the feasible grabbing point in different curves as two main curvature directions of the feasible grabbing point.
9. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 2, wherein in the step 5, the grab depth and the grab width of the feasible grab point are determined using a grab depth and grab width detection module;
taking the z value of the feasible grabbing point as an interval center, sampling two sides of the interval center by adopting a heuristic algorithm, and judging whether different z and w meet the following conditions: 1) the grabber does not collide with the scene point cloud before closing; 2) the closed area of the gripper needs to contain a centre of grip.
10. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 1, wherein in the step 6 and the step 7, the targeted grab gesture is determined using a grab classification and assignment module;
and using the PointNet as an encoder to encode the generated information of the points in the grabbing closed region of the grabbing candidates, classifying by using a full connection layer, and filtering the grabbing candidates which are infeasible to obtain the final grabbing gesture set.
CN202111418398.7A 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image Pending CN114140418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111418398.7A CN114140418A (en) 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111418398.7A CN114140418A (en) 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Publications (1)

Publication Number Publication Date
CN114140418A true CN114140418A (en) 2022-03-04

Family

ID=80388548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111418398.7A Pending CN114140418A (en) 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Country Status (1)

Country Link
CN (1) CN114140418A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147488A (en) * 2022-07-06 2022-10-04 湖南大学 Workpiece pose estimation method based on intensive prediction and grasping system
CN115187781A (en) * 2022-07-12 2022-10-14 北京信息科技大学 Six-degree-of-freedom grabbing detection algorithm based on semantic segmentation network
CN115187781B (en) * 2022-07-12 2023-05-30 北京信息科技大学 Six-degree-of-freedom grabbing detection method based on semantic segmentation network
CN115797332A (en) * 2023-01-29 2023-03-14 高视科技(苏州)有限公司 Target object grabbing method and device based on example segmentation
CN115797332B (en) * 2023-01-29 2023-05-30 高视科技(苏州)股份有限公司 Object grabbing method and device based on instance segmentation
CN117656083A (en) * 2024-01-31 2024-03-08 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment
CN117656083B (en) * 2024-01-31 2024-04-30 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment

Similar Documents

Publication Publication Date Title
CN114140418A (en) Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image
CN108280856B (en) Unknown object grabbing pose estimation method based on mixed information input network model
Cohen et al. Inference of human postures by classification of 3D human body shape
CN111695562B (en) Autonomous robot grabbing method based on convolutional neural network
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
Qian et al. Grasp pose detection with affordance-based task constraint learning in single-view point clouds
CN109048918B (en) Visual guide method for wheelchair mechanical arm robot
CN113065546A (en) Target pose estimation method and system based on attention mechanism and Hough voting
Ni et al. A new approach based on two-stream cnns for novel objects grasping in clutter
CN112465903A (en) 6DOF object attitude estimation method based on deep learning point cloud matching
Cao et al. Residual squeeze-and-excitation network with multi-scale spatial pyramid module for fast robotic grasping detection
CN114782347A (en) Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network
CN112288809B (en) Robot grabbing detection method for multi-object complex scene
Abouelnaga et al. Distillpose: Lightweight camera localization using auxiliary learning
Yu et al. Object recognition and robot grasping technology based on RGB-D data
CN114998573B (en) Grabbing pose detection method based on RGB-D feature depth fusion
CN110852272A (en) Pedestrian detection method
Ouyang et al. Robot grasp with multi-object detection based on RGB-D image
CN114049318A (en) Multi-mode fusion feature-based grabbing pose detection method
Nguyen et al. Bin-picking solution for industrial robots integrating a 2D vision system
Wang et al. 3D hand gesture recognition based on Polar Rotation Feature and Linear Discriminant Analysis
Wu et al. Object Pose Estimation with Point Cloud Data for Robot Grasping
Geng et al. A Novel Real-time Grasping Method Cobimbed with YOLO and GDFCN
Asif et al. Model-free segmentation and grasp selection of unknown stacked objects
Wu et al. Real-Time Pixel-Wise Grasp Detection Based on RGB-D Feature Dense Fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination