CN114140418A - Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image - Google Patents

Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Info

Publication number
CN114140418A
CN114140418A (application number CN202111418398.7A)
Authority
CN
China
Prior art keywords
grabbing
image
feasible
depth
grab
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111418398.7A
Other languages
Chinese (zh)
Inventor
孙景慷
张克勤
裘焱枫
杨根科
褚健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Original Assignee
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University filed Critical Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority to CN202111418398.7A priority Critical patent/CN114140418A/en
Publication of CN114140418A publication Critical patent/CN114140418A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image, relating to the field of computer vision, which comprises the following steps: step 1, converting the depth image into point cloud data and projecting it to obtain a three-channel X-Y-Z image; step 2, encoding the information of the RGB image and the X-Y-Z image with ResNet-50 to obtain a target segmentation result and a feasible-grab semantic segmentation result; step 3, completing the depth image to obtain a dense point cloud; step 4, calculating the normal vector and the two principal curvature directions of each feasible grabbing point from the feasible grabbing points and the dense point cloud to form a grabbing coordinate system; step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points to generate a plurality of grabbing candidates, each grabbing candidate corresponding to one grabbing closed region; step 6, inputting the points in each grabbing closed region into PointNet and filtering the grabbing candidates to obtain a final grabbing posture set; and step 7, projecting the grabbing candidates onto the target to generate the final targeted grabbing posture.

Description

Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image
Technical Field
The invention relates to the field of computer vision, in particular to a seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image.
Background
Robust robotic arm grasping is a fundamental requirement for robots in most industrial scenarios and in daily life. A complete grasping pipeline is divided into two parts: grabbing detection and path planning. Grabbing detection uses scene information obtained from sensors such as a monocular camera, a depth camera or a binocular camera to generate the six-degree-of-freedom posture that the end of the mechanical arm is required to reach; this six-degree-of-freedom posture refers to the position and the coordinate system to be reached by the center of the end of the mechanical arm. Path planning determines how to plan a motion path of the mechanical arm in the workspace for the six-degree-of-freedom posture generated by grabbing detection, so that the mechanical arm does not collide with the scene and its motion dynamics constraints are satisfied.
In recent years, with the development of deep learning, vision-based mechanical arm grabbing detection algorithms have developed rapidly. Vision-based mechanical arm grabbing detection schemes can be divided into three broad categories (see Fang H S, Wang C, Gou M, et al., GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping, CVPR 2020). The first category of detection schemes takes as input the RGB image captured by a monocular camera sensor and then detects a feasible grabbing frame in the 2D image, the grabbing frame containing the grabbing position and an angle representing the in-plane rotation. Such algorithms limit grabbing to be vertical to the desktop, which seriously limits the grabbing degrees of freedom and can make stacked objects in a cluttered scene difficult to grab. The second category of detection schemes detects the six-degree-of-freedom pose transformation of the object and uses it to transform the corresponding grabbing of the object from a reference coordinate system into the current coordinate system (see Zhao W, Zhang S, Guan Z, et al.). The problem with such algorithms is that they can only be used to grab targets that already exist in the data set; for a new target, 3D modeling is needed first and the grabbing postures must be marked manually, so the cost of obtaining the data set is too high. The third category of detection schemes takes point cloud data as input and uses the geometric and semantic information of the point cloud in 3D space to directly obtain, in a single-stage or two-stage manner, the six-degree-of-freedom posture that the end of the mechanical arm needs to reach (see Liang H, Ma X, Li S, et al., PointNetGPD: Detecting Grasp Configurations from Point Sets, ICRA 2019). The advantage of such methods is that the trained model generalizes well and can produce unrestricted grabbing postures, but in most cases these detection schemes take only unstable point cloud data as input and cannot carry out targeted grabbing.
Existing six-degree-of-freedom grabbing posture detection schemes are few, and none of them applies RGB data to overcome the instability of point cloud data and to generate object-oriented grabbing. In the patent application entitled "A robot grasping detection method based on multi-class object segmentation" (Chinese Patent No. CN112861667A), RGB images are used for image segmentation and semantic recognition, and a grasping rectangular frame containing an in-plane rotation angle is generated for the segmented object. In "Robot grasping pose estimation method based on object recognition deep learning model" (Chinese Patent No. 01810803444), Li Mingyang et al. fuse two-dimensional and three-dimensional visual information to obtain the point cloud of the target object, and then estimate the pose of the target object by registering this point cloud against an object point cloud template in a template library. Qian et al., in the patent application "A robot grabbing detection scheme based on instance segmentation under single-view point cloud" (Chinese Patent No. CN110363815A), use the RGB image to perform target segmentation, map the segmented target pixel set into a point cloud, generate an initial grabbing coordinate system at randomly sampled points according to the geometric structure of the original point cloud data, and finally generate the final six-degree-of-freedom grabbing pose through translation and filtering.
With the continuous development of deep learning, the role of RGB data in posture detection is gradually being mined. RGB data can be used to predict points on an image that carry specific semantic information, such as key points on the human body (see Sun K, Xiao B, Liu D, et al., Deep High-Resolution Representation Learning for Human Pose Estimation, CVPR 2019), and can also be used to predict a grab rotation matrix for each point on the image (see Gou M, Fang H S, Zhu Z, et al., RGB Matters: Learning 7-DoF Grasp Poses on Monocular RGBD Images, ICRA 2021).
Therefore, for the capture detection part, those skilled in the art are dedicated to developing a seven-degree-of-freedom capture pose detection method based on RGB images and depth images, overcoming the defects of point cloud data instability and incapability of performing targeted capture in the point cloud-based capture method, and improving the accuracy and stability of the generated capture pose.
Disclosure of Invention
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to solve the problems of instability of point cloud data and lack of pertinence of generated capture in the existing point cloud-based six-degree-of-freedom capture detection method, so that the accuracy and stability of the finally generated capture pose are improved.
In order to achieve the above object, the present invention provides an improved grab detection method using RGB data, which generates a robust seven-degree-of-freedom grab pose for a parallel two-finger grab based on monocular RGB data and depth data for a grab detection section. Compared with the six-degree-of-freedom posture, the seven-degree-of-freedom grabbing posture increases the grabbing width.
The invention provides a seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image, which comprises the following steps of:
step 1, converting a depth image into point cloud data, and projecting coordinates of the point cloud data into a two-dimensional image to obtain a three-channel coordinate image X-Y-Z image;
step 2, using ResNet-50 to encode the information of the RGB image and the X-Y-Z image, then using a target segmentation decoding network and a feasible capture semantic segmentation decoding network to simultaneously decode the encoded information to obtain a target segmentation result and a feasible capture semantic segmentation result of each pixel in the image, and obtaining a feasible capture point from the feasible capture semantic segmentation result;
step 3, completing the depth image by using a PENet and the RGB image to obtain a completed dense depth image and further obtain dense point cloud;
step 4, calculating normal vectors and two principal curvature directions of the feasible grabbing points by using the feasible grabbing points obtained from the feasible grabbing semantic segmentation result in the step 2 and the dense point cloud obtained in the step 3 to form a grabbing coordinate system;
step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points by using a heuristic algorithm to obtain a plurality of grabbing candidates, wherein the grabbing depth of the grabbing candidates is the maximum, and the grabbing width of the grabbing candidates is the minimum; each grabbing candidate corresponds to one grabbing closed area;
step 6, inputting points in the grabbing closed area of the grabbing candidates into PointNet, and filtering out infeasible grabbing candidates to obtain a final grabbing gesture set;
and 7: and combining the target segmentation result in the step 2, projecting the grabbing candidates in the grabbing gesture set onto the target of the target segmentation result, and generating a targeted grabbing gesture.
Further, the grabber adopted in the method is a two-finger parallel grabber, and the seven-degree-of-freedom grabbing parameters are expressed as: g = (x, y, z, α, β, γ, w), where (x, y, z) denotes the position of the gripper in the world coordinate system, (α, β, γ) denotes the rotation of the gripper coordinate system around the x, y, z axes of the world coordinate system, and w denotes the tip width of the gripper.
Further, in the step 1, the specific method of projecting the coordinates of the point cloud data to the two-dimensional image to obtain the X-Y-Z image is as follows:
X = (u - u0) · D / fx, Y = (v - v0) · D / fy, Z = D
wherein D represents the depth image (the depth value at pixel (u, v)), and u0, v0, fx, fy represent the camera intrinsic parameters.
Further, in the step 2, a multitask semantic segmentation module is used for carrying out pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image; the object segmentation is used for detecting the class of the pixel, and the feasible grabbing semantic segmentation is used for detecting whether the pixel is suitable to be used as a grabbing center;
the target segmentation decoding network and the feasible capture semantic segmentation decoding network are composed of intensive up-sampling convolution modules with different layer numbers;
the loss function of the target division decoding network uses an improved cross-entropy loss function LsemDefined as:
Figure BDA0003375888040000032
wherein N represents the total number of pixels of the image; n is a radical ofcRepresenting a category total; w is acRepresents the weight of the category c in all categories and has the calculation formula of
Figure BDA0003375888040000033
TcIndicates the total number of pixels with class truth value of c, TcFor balancing the case of an unbalanced number of classes;
Figure BDA0003375888040000034
values of 0 or 1, 0 being indicative of the classThe class truth value is different from the class truth value corresponding to the pixel, and the fact that the class truth value is 1 means that the class truth value is the same as the class truth value corresponding to the pixel;
Figure BDA0003375888040000041
confidence score representing that pixel x belongs to list c, using
Figure BDA0003375888040000042
The difficulty degrees of different samples are balanced, the loss weight of the sample with higher confidence score is reduced, and gamma is an adjustable parameter.
Furthermore, the feasible-grab semantic segmentation decoding network is a two-class network and adopts a common cross-entropy loss function L_ga, specifically defined as:
L_ga = -(1/N) Σ_x [ α · x_g · log p(x) + β · (1 - x_g) · log(1 - p(x)) ]
where N represents the total number of pixels of the image, x_g ∈ {0, 1} represents the class truth value of pixel x, and p(x) represents the predicted confidence score that pixel x is graspable; α is set to 1 and β is set to 0.1 so that the loss of pixels labeled as graspable points occupies a larger weight.
Further, the loss function L of the multitask semantic segmentation module is defined as:
L = γ1 · L_sem + γ2 · L_ga
wherein γ1 and γ2 are adjustable parameters.
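For illustration only, a possible PyTorch rendering of the two losses and their weighted sum is sketched below; it is not the patented implementation. The class-weight vector w_c is passed in precomputed from T_c (its exact formula is not reproduced here), and the default focal exponent and all tensor shapes are assumptions.

```python
# Illustrative sketch only (not the patented code): a possible PyTorch
# rendering of L_sem, L_ga and their weighted sum. The class weights w_c
# are supplied precomputed from T_c; their exact formula, the default
# gamma, and all tensor shapes are assumptions.
import torch
import torch.nn.functional as F


def weighted_focal_ce(logits, target, class_weights, gamma=2.0):
    """L_sem: class-weighted focal cross-entropy.
    logits: (B, C, H, W) raw scores, target: (B, H, W) integer class labels,
    class_weights: (C,) tensor of w_c values."""
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()
    log_p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_c*(x)
    p_t = p.gather(1, target.unsqueeze(1)).squeeze(1)          # p_c*(x)
    w = class_weights[target]                                  # w_c per pixel
    return -(w * (1.0 - p_t) ** gamma * log_p_t).mean()


def weighted_bce(logits, target, alpha=1.0, beta=0.1):
    """L_ga: cross-entropy with a larger weight on graspable pixels.
    logits, target: (B, H, W), target values in {0, 1} (float)."""
    p = torch.sigmoid(logits)
    loss = -(alpha * target * torch.log(p + 1e-8)
             + beta * (1.0 - target) * torch.log(1.0 - p + 1e-8))
    return loss.mean()


def multitask_loss(sem_logits, sem_target, ga_logits, ga_target,
                   class_weights, gamma1=1.0, gamma2=1.0):
    """L = gamma1 * L_sem + gamma2 * L_ga."""
    return (gamma1 * weighted_focal_ce(sem_logits, sem_target, class_weights)
            + gamma2 * weighted_bce(ga_logits, ga_target))
```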
Further, in the step 3, a depth image completion module is used to complete the depth image;
the PENet algorithm adopts a two-channel framework, and a similar encoder-decoder network is constructed by using a deep convolution neural network and a deconvolution mode, wherein one channel takes color information as dominant input to obtain a color dominant depth map; and the other channel uses the original depth image as a dominant input, combines the color dominant depth map to obtain a depth dominant depth map, then fuses the obtained color dominant depth map and the depth dominant depth map in a weighting mode to obtain a preliminary dense depth image, and finally refines the dense depth image by using DA-CSPN + + to obtain a final complemented dense depth image.
Further, in the step 4, the rotation matrix detection module is used for calculating the grabbing coordinate system where the feasible grabbing point is located;
sampling K nearest neighbor points near the feasible grabbing point by using a K nearest neighbor algorithm to form a point set, fitting a plane nearest to the point set, obtaining a normal direction of the feasible grabbing point according to the fitted plane, cutting a curved surface where the feasible grabbing point is located by using the plane passing through the normal direction, wherein the curved surface and the plane can be intersected to form a curve, the curve has a curvature at the feasible grabbing point, and selecting a direction with the maximum curvature and the minimum curvature at the feasible grabbing point in different curves as two main curvature directions of the feasible grabbing point.
Further, in the step 5, a grasp depth and grasp width detection module is used to determine the grasp depth and grasp width of the feasible grasp point;
taking the z value of the feasible grabbing point as an interval center, sampling two sides of the interval center by adopting a heuristic algorithm, and judging whether different z and w meet the following conditions: 1) the grabber does not collide with the scene point cloud before closing; 2) the closed area of the gripper needs to contain a centre of grip.
Further, in said step 6 and said step 7, determining said targeted grasp gesture using a grasp classification and assignment module;
and using the PointNet as an encoder to encode the generated information of the points in the grabbing closed region of the grabbing candidates, classifying by using a full connection layer, and filtering the grabbing candidates which are infeasible to obtain the final grabbing gesture set.
The improved capture detection method using RGB data provided by the invention at least has the following technical effects:
1. The existing six-degree-of-freedom grabbing detection methods simply use point cloud data as input and obtain candidates for feasible grabbing points by randomly sampling the point cloud, which has two problems: first, the point cloud data obtained by a depth camera is noisy and is sparse at some thin edges of an object, so noise points are sampled and grabs cannot be generated at the thin edges of the object; second, the distribution of feasible grabbing points in the scene is not uniform, so random sampling results in a large number of invalid operations and unnecessary computation overhead. Moreover, most existing six-degree-of-freedom grabbing detection methods generate non-targeted grabs, and a grabbing posture may even be generated on a background region that merely satisfies the grabbing space requirement, so they cannot be applied to scenes with high stability requirements. The technical scheme provided by the invention provides a multitask semantic segmentation module that combines RGB data and point cloud data to obtain a pixel-by-pixel class label and a graspable/non-graspable label, which are used respectively for the subsequent generation of targeted grabs and of grabbing postures, thereby overcoming the blindness of random sampling and the instability of the point cloud and facilitating the formation of the final targeted grab;
2. The technical scheme provided by the invention introduces a depth image completion algorithm into the six-degree-of-freedom grabbing detection algorithm, which facilitates the generation of a grabbing coordinate system at thin edges of an object that are not captured by the sensor, significantly alleviates the instability of point cloud data in six-degree-of-freedom grabbing detection methods, and enables a stable and targeted seven-degree-of-freedom grabbing posture to be detected.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is an overall flow diagram of a preferred embodiment of the present invention;
FIG. 2 is an overall framework of a multitask semantic segmentation module in the embodiment shown in FIG. 1;
FIG. 3 is a diagram of exemplary pertinence grabbing of original scene, object segmentation, feasible grabbing semantic segmentation and generation in the embodiment shown in FIG. 1.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
The existing six-degree-of-freedom grabbing detection method simply uses point cloud data as input, and candidates of feasible grabbing points are obtained by randomly sampling the point cloud data. This method has two problems: firstly, because point cloud data obtained by a depth camera has more noise and is sparser at some thin edges of an object, noise points are sampled and grabbing cannot be generated at the thin edges of the object; secondly, the distribution of feasible capture points is not uniform in a scene, and a random sampling mode can cause a large amount of invalid operations, thereby causing unnecessary operation expenditure. Meanwhile, most of the existing six-degree-of-freedom grabbing detection methods generate non-pertinence grabbing, and even a grabbing gesture may be generated at a background meeting the grabbing space requirement, so that the existing six-degree-of-freedom grabbing detection methods cannot be applied to scenes with higher stability requirements.
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to solve the problems of instability of point cloud data and lack of pertinence of generated capture in the existing point cloud-based six-degree-of-freedom capture detection method, so that the accuracy and stability of the finally generated capture pose are improved.
In order to achieve the above object, the present invention provides an improved grab detection method using RGB data, which generates a robust seven-degree-of-freedom grab pose for a parallel two-finger grab based on monocular RGB data and depth data for a grab detection section. Compared with the six-degree-of-freedom posture, the seven-degree-of-freedom grabbing posture increases the grabbing width. Specifically, RGB images and depth images of a scene are obtained firstly, coded information is processed by using ResNet-50, target segmentation and feasible capture semantic segmentation are performed through two decoding networks, and the depth images are complemented by using the RGB images, so that dense point clouds are obtained. Generating normal vectors and two principal curvature directions of feasible grabbing points in the dense point cloud to serve as a grabbing coordinate system, then sampling the grabbing depth and width in a heuristic mode, and reserving the grabbing posture with the largest grabbing depth and the smallest grabbing width to serve as a grabbing candidate with seven degrees of freedom. And finally, classifying the grabbing candidates by using PointNet, filtering infeasible grabbing to obtain a final feasible seven-degree-of-freedom grabbing posture, and projecting the grabbing posture to a corresponding target according to a grabbing center to obtain a targeted grabbing posture.
As shown in fig. 1, the method comprises the following steps after obtaining the RGB image and the depth image in advance:
step 1, converting a depth image into point cloud data, and projecting coordinates of the point cloud data into a two-dimensional image to obtain a three-channel coordinate image X-Y-Z image;
step 2, using ResNet-50 to encode the information of RGB image and X-Y-Z image, then using target segmentation decoding network and feasible capture semantic segmentation decoding network to decode the encoded information at the same time, obtaining the target segmentation result and feasible capture semantic segmentation result of each pixel in the image, and obtaining feasible capture point from the feasible capture semantic segmentation result;
step 3, completing the depth image by using a PENet and an RGB image to obtain a completed dense depth image and further obtain dense point cloud;
step 4, calculating normal vectors and two principal curvature directions of the feasible grabbing points by using the feasible grabbing points obtained from the feasible grabbing semantic segmentation result in the step 2 and the dense point cloud obtained in the step 3 to form a grabbing coordinate system;
step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points by using a heuristic algorithm to obtain a plurality of grabbing candidates, wherein the grabbing depth of the grabbing candidates is the maximum, and the grabbing width of the grabbing candidates is the minimum; each grabbing candidate corresponds to one grabbing closed area;
step 6, inputting points in the grabbing closed area of the grabbing candidates into PointNet, filtering out infeasible grabbing candidates, and obtaining a final grabbing gesture set;
and 7: and (3) combining the target segmentation result in the step (2), projecting the grabbing candidates in the grabbing gesture set to the target of the target segmentation result, and generating a targeted grabbing gesture.
Specifically, the grabber adopted in the method is a two-finger parallel grabber, and the seven-degree-of-freedom grabbing parameters are expressed as: g = (x, y, z, α, β, γ, w), where (x, y, z) denotes the position of the gripper in the world coordinate system, (α, β, γ) denotes the rotation of the gripper coordinate system around the x, y, z axes of the world coordinate system, and w denotes the tip width of the gripper.
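As a purely illustrative aid, the seven-degree-of-freedom grasp g = (x, y, z, α, β, γ, w) could be held in a small container such as the hypothetical Grasp7DoF class below; the class name and the extrinsic 'xyz' Euler convention (rotations about the fixed world axes) are assumptions made for this sketch.

```python
# Hypothetical container for a 7-DoF grasp (illustration only). The class
# name and the extrinsic 'xyz' Euler convention are assumptions.
from dataclasses import dataclass

import numpy as np
from scipy.spatial.transform import Rotation


@dataclass
class Grasp7DoF:
    x: float        # gripper position in the world frame
    y: float
    z: float
    alpha: float    # rotation about the world x axis (radians)
    beta: float     # rotation about the world y axis
    gamma: float    # rotation about the world z axis
    width: float    # tip opening width w of the two-finger gripper

    def rotation_matrix(self) -> np.ndarray:
        """3x3 rotation of the gripper frame with respect to the world frame."""
        return Rotation.from_euler(
            "xyz", [self.alpha, self.beta, self.gamma]).as_matrix()

    def pose_matrix(self) -> np.ndarray:
        """4x4 homogeneous pose combining the rotation and the translation."""
        T = np.eye(4)
        T[:3, :3] = self.rotation_matrix()
        T[:3, 3] = [self.x, self.y, self.z]
        return T
```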
Because the RGB image has abundant semantic information and texture information in a two-dimensional space, and the point cloud data has semantic information and space geometric information in a three-dimensional space, the two are combined to perform a semantic segmentation task.
In step 1, the specific method for projecting the coordinates of the point cloud data to the two-dimensional image to obtain the X-Y-Z image comprises the following steps:
X = (u - u0) · D / fx, Y = (v - v0) · D / fy, Z = D
where D denotes the depth image (the depth value at pixel (u, v)), and u0, v0, fx, fy denote the camera intrinsic parameters.
And then using the obtained (R, G, B, X, Y, Z) six-channel image as the input image of the step 2.
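A minimal sketch of this step is given below, assuming a metric depth map and the pinhole model above; the function name and the channel stacking order are illustrative, not taken from the patent.

```python
# Minimal sketch of step 1: back-project a metric depth image into an
# X-Y-Z coordinate image with the pinhole model given above, then stack it
# with the RGB image. Function and variable names are illustrative.
import numpy as np


def depth_to_xyz_image(depth, fx, fy, u0, v0):
    """depth: (H, W) array of depth values D; returns an (H, W, 3) X-Y-Z image."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - u0) * depth / fx
    y = (v - v0) * depth / fy
    z = depth
    return np.stack([x, y, z], axis=-1)


# Six-channel input for step 2: concatenate RGB and X-Y-Z along the channels.
# rgbxyz = np.concatenate([rgb.astype(np.float32), xyz], axis=-1)  # (H, W, 6)
```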
In step 2, a multitask semantic segmentation module is used for carrying out pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image (namely (R, G, B, X, Y, Z) six-channel image); the object segmentation is used for detecting the class to which the pixel belongs, and the feasible capture semantic segmentation is used for detecting whether the pixel is suitable to be used as a capture center.
The target segmentation decoding network and the feasible capture semantic segmentation decoding network are composed of intensive upsampling convolution modules with different layer numbers;
loss function usage of target-partition decoding network improved cross-entropy loss function LsemDefined as:
Figure BDA0003375888040000072
wherein N represents the total number of pixels of the image; n is a radical ofcRepresenting a category total; w is acRepresents the weight of the category c in all categories and has the calculation formula of
Figure BDA0003375888040000073
TcIndicates the total number of pixels with class truth value of c, TcFor balancing the case of an unbalanced number of classes;
Figure BDA0003375888040000074
the value is 0 or 1, the value of 0 indicates that the class is different from the class truth value corresponding to the pixel, and the value of 1 indicates that the class is the same as the class truth value corresponding to the pixel;
Figure BDA0003375888040000075
confidence score representing that pixel x belongs to list c, using
Figure BDA0003375888040000076
The difficulty degrees of different samples are balanced, the loss weight of the sample with higher confidence score is reduced, and gamma is an adjustable parameter.
The feasible capture semantic segmentation decoding network is a two-class network and adopts a common cross-entropy loss function L_ga, specifically defined as:
L_ga = -(1/N) Σ_x [ α · x_g · log p(x) + β · (1 - x_g) · log(1 - p(x)) ]
where N represents the total number of pixels of the image, x_g ∈ {0, 1} represents the class truth value of pixel x, and p(x) represents the predicted confidence score that pixel x is graspable; α is set to 1 and β is set to 0.1 so that the loss of pixels labeled as graspable points occupies a larger weight.
The loss function L of the multitask semantic segmentation module is defined as:
L = γ1 · L_sem + γ2 · L_ga
where γ1 and γ2 are adjustable parameters.
The training data of the multitask semantic segmentation module is from a GraspNet-1Billion data set. The dataset contains object segmentation labels, and to obtain feasible capture semantic segmentation labels, the capture center of the 6DOF capture pose in the dataset may be projected into the 2D image.
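A possible realization of this label projection is sketched below, assuming the 6DOF grasp centers have already been transformed into the camera frame and that a single pixel is marked per center (the footprint of the label is not specified in the text).

```python
# Possible realization of the label projection: project each 6-DOF grasp
# centre (X, Y, Z), assumed to be in the camera frame, onto the image plane
# and mark that pixel as graspable. One pixel per centre is an assumption.
import numpy as np


def grasp_centers_to_label(centers_cam, h, w, fx, fy, u0, v0):
    """centers_cam: (M, 3) grasp centres; returns an (h, w) 0/1 label map."""
    label = np.zeros((h, w), dtype=np.uint8)
    X, Y, Z = centers_cam[:, 0], centers_cam[:, 1], centers_cam[:, 2]
    u = np.round(fx * X / Z + u0).astype(int)
    v = np.round(fy * Y / Z + v0).astype(int)
    valid = (Z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    label[v[valid], u[valid]] = 1  # pixels marked as feasible grasp centres
    return label
```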
As shown in fig. 2, it is an overall framework of the multitask semantic segmentation module in step 2. In this step, a multitask semantic segmentation module is used to perform pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image. The object segmentation is used for detecting the class to which the pixel belongs, and the feasible capture semantic segmentation is used for detecting whether the pixel is suitable to be used as a capture center. Specifically, information of an RGB image and an X-Y-Z image is coded by using ResNet-50, and then a target segmentation result and a feasible capture semantic segmentation result of each pixel are obtained by decoding the coded information by using a target segmentation decoding network and a feasible capture semantic segmentation decoding network. The target segmentation decoding network and the feasible capture semantic segmentation decoding network both adopt dense upsampling convolutional networks, but the layer number is different.
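The following sketch mirrors the framework of fig. 2 at a very coarse level: a ResNet-50 encoder over the six-channel input and two decoding heads. The head depths, channel counts and the use of PixelShuffle as a stand-in for the dense upsampling convolution modules are assumptions, not the patented architecture.

```python
# Coarse sketch of the fig. 2 framework (assumed details, not the patent's
# exact network): ResNet-50 encoder over the six-channel RGB + X-Y-Z input
# and two decoding heads for target segmentation and graspability.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class MultitaskSegNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)
        # Accept six input channels (R, G, B, X, Y, Z) instead of three.
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Drop the average-pooling and fully connected layers; the encoder
        # outputs a (B, 2048, H/32, W/32) feature map.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.seg_head = self._upsample_head(2048, num_classes)  # target segmentation
        self.grasp_head = self._upsample_head(2048, 2)          # graspable or not

    @staticmethod
    def _upsample_head(in_ch: int, out_ch: int) -> nn.Module:
        # Predict out_ch * 32^2 channels, then rearrange them into a
        # full-resolution map (dense-upsampling-style decoding).
        return nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, out_ch * 32 * 32, 1),
            nn.PixelShuffle(32),
        )

    def forward(self, rgbxyz: torch.Tensor):
        feat = self.encoder(rgbxyz)
        return self.seg_head(feat), self.grasp_head(feat)
```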
Specifically, in step 3, the depth image is complemented using a depth image complementing module.
The PENet algorithm adopts a two-branch framework in which encoder-decoder-style networks are constructed with deep convolutional and deconvolutional layers. One branch takes the color information as the dominant input to obtain a color-dominant depth map; the other branch takes the original depth image as the dominant input and combines it with the color-dominant depth map to obtain a depth-dominant depth map. The two depth maps are then fused by weighting to obtain a preliminary dense depth image, and finally the dense depth image is refined with DA-CSPN++ to obtain the final completed dense depth image.
Specifically, in step 4, a rotation matrix detection module is used to calculate a grabbing coordinate system where the feasible grabbing points are located.
K nearest neighbor points near the feasible grabbing point are sampled with the K-Nearest-Neighbor (KNN) algorithm to form a point set, and the plane that best fits this point set is computed; the normal direction of the feasible grabbing point is obtained from the fitted plane. The curved surface on which the feasible grabbing point lies is then cut by planes passing through this normal direction; each cutting plane intersects the surface in a curve that has a curvature at the feasible grabbing point, and the directions of maximum and minimum curvature at the feasible grabbing point among the different curves are selected as the two principal curvature directions of the feasible grabbing point.
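A simplified stand-in for this step is sketched below using local PCA: the smallest-variance eigenvector gives the fitted-plane normal, and the two remaining eigenvectors, which span the tangent plane, are used in place of the principal-curvature directions obtained by the curve-cutting construction described above. This is an approximation chosen for the sketch, not a literal implementation of the patented procedure.

```python
# Simplified stand-in: local PCA over the K nearest neighbours. The tangent
# eigenvectors only approximate the principal-curvature directions.
import numpy as np
from scipy.spatial import cKDTree


def grasp_frame(point, cloud, k=30):
    """Return a 3x3 rotation matrix [d1, d2, normal] at a feasible grasp point."""
    tree = cKDTree(cloud)
    _, idx = tree.query(point, k=k)
    nbrs = cloud[idx]
    centered = nbrs - nbrs.mean(axis=0)
    # Eigen-decomposition of the local covariance; eigenvalues ascending.
    eigval, eigvec = np.linalg.eigh(centered.T @ centered)
    normal = eigvec[:, 0]   # smallest variance: fitted-plane normal
    d2 = eigvec[:, 1]       # tangent direction (stand-in for min curvature)
    d1 = eigvec[:, 2]       # tangent direction (stand-in for max curvature)
    return np.stack([d1, d2, normal], axis=1)
```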
Specifically, in step 5, the grab depth and grab width of the feasible grab point are determined using the grab depth and grab width detection module.
Taking the z value of the feasible grabbing point as an interval center, sampling two sides of the interval center by adopting a heuristic algorithm, and judging whether different z and w meet the following conditions: 1) the grabber does not collide with the scene point cloud before closing; 2) the closed area of the gripper needs to contain the centre of grip.
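A hedged sketch of this heuristic search is given below, with the gripper approximated by thin finger slabs in the grasp frame; the sampling ranges, the finger and hand dimensions, and the exact form of the two checks are assumptions made for the example. Because candidates are visited from largest depth to smallest width, the first valid hit keeps the maximum grasp depth and minimum grasp width.

```python
# Hedged sketch of the z/w heuristic search; geometry and thresholds are
# assumptions, not the patented collision and closure tests.
import numpy as np


def sample_depth_width(cloud_local, depths, widths,
                       finger_thickness=0.01, hand_height=0.02):
    """cloud_local: (M, 3) scene points in the grasp frame, grasp centre at the
    origin, closing direction along x, approach direction along -z."""
    for z in sorted(depths, reverse=True):          # prefer larger depth
        for w in sorted(widths):                    # prefer smaller width
            in_hand = (np.abs(cloud_local[:, 1]) < hand_height / 2) \
                      & (cloud_local[:, 2] > -z) & (cloud_local[:, 2] < 0)
            # 1) the fingers (slabs at x = +/- w/2) must not hit the scene
            #    before closing.
            fingers = np.abs(np.abs(cloud_local[:, 0]) - w / 2) < finger_thickness / 2
            if np.any(fingers & in_hand):
                continue
            # 2) the closing region must enclose points around the grasp centre.
            closing = (np.abs(cloud_local[:, 0]) < w / 2) & in_hand
            if not np.any(closing):
                continue
            return z, w, closing
    return None
```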
Specifically, in step 6 and step 7, the targeted grab pose is determined using a grab classification and assignment module.
And using PointNet as an encoder to encode information of points in the grabbing closed region of the generated grabbing candidates, classifying by using a full-connection layer, and filtering infeasible grabbing candidates to obtain a final grabbing posture set.
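A minimal PointNet-style classifier consistent with this description (shared per-point MLP, global max pooling, fully connected head) is sketched below; all layer sizes are assumptions, since the text only names PointNet and a fully connected classification layer.

```python
# Minimal PointNet-style grasp-candidate classifier; layer sizes are assumed.
import torch
import torch.nn as nn


class GraspCandidateClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(     # shared MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 128, 1), nn.ReLU(inplace=True),
            nn.Conv1d(128, 256, 1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(    # fully connected head
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2),              # feasible / infeasible
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        """pts: (B, 3, N) points inside each candidate's grabbing closed region."""
        feat = self.point_mlp(pts).max(dim=2).values  # global max pooling
        return self.classifier(feat)


# Usage sketch: keep only the candidates predicted as feasible.
# scores = GraspCandidateClassifier()(candidate_points)  # (B, 2)
# keep = scores.argmax(dim=1) == 1
```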
As shown in fig. 3, exemplary images of the original scene, the target segmentation, the feasible-grab semantic segmentation and the generated targeted grab in the method shown in fig. 1 are given: the upper-left image in fig. 3 is the original scene, the upper-right image is the target segmentation, the lower-left image is the feasible-grab semantic segmentation, and the lower-right image is the generated targeted grab.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A seven-degree-of-freedom grabbing posture detection method based on an RGB image and a depth image is characterized by comprising the following steps of:
step 1, converting a depth image into point cloud data, and projecting coordinates of the point cloud data into a two-dimensional image to obtain a three-channel coordinate image X-Y-Z image;
step 2, using ResNet-50 to encode the information of the RGB image and the X-Y-Z image, then using a target segmentation decoding network and a feasible capture semantic segmentation decoding network to simultaneously decode the encoded information to obtain a target segmentation result and a feasible capture semantic segmentation result of each pixel in the image, and obtaining a feasible capture point from the feasible capture semantic segmentation result;
step 3, completing the depth image by using a PENet and the RGB image to obtain a completed dense depth image and further obtain dense point cloud;
step 4, calculating normal vectors and two principal curvature directions of the feasible grabbing points by using the feasible grabbing points obtained from the feasible grabbing semantic segmentation result in the step 2 and the dense point cloud obtained in the step 3 to form a grabbing coordinate system;
step 5, sampling the grabbing depth and grabbing width of the feasible grabbing points by using a heuristic algorithm to obtain a plurality of grabbing candidates, wherein the grabbing depth of the grabbing candidates is the maximum, and the grabbing width of the grabbing candidates is the minimum; each grabbing candidate corresponds to one grabbing closed area;
step 6, inputting points in the grabbing closed area of the grabbing candidates into PointNet, and filtering out infeasible grabbing candidates to obtain a final grabbing gesture set;
and 7: and combining the target segmentation result in the step 2, projecting the grabbing candidates in the grabbing gesture set onto the target of the target segmentation result, and generating a targeted grabbing gesture.
2. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 1, wherein the grabber used in the method is a two-finger parallel grabber, and the seven-degree-of-freedom grab parameters are expressed as: g = (x, y, z, α, β, γ, w), where (x, y, z) denotes the position of the gripper in the world coordinate system, (α, β, γ) denotes the rotation of the gripper coordinate system around the x, y, z axes of the world coordinate system, and w denotes the tip width of the gripper.
3. The method for detecting a capturing pose with seven degrees of freedom based on an RGB image and a depth image as claimed in claim 1, wherein the step 1 of projecting the coordinates of the point cloud data to the two-dimensional image to obtain the X-Y-Z image comprises:
X = (u - u0) · D / fx, Y = (v - v0) · D / fy, Z = D
wherein D represents the depth image, and u0, v0, fx, fy represent the camera intrinsic parameters.
4. The RGB image and depth image-based seven-degree-of-freedom capture pose detection method of claim 1, wherein in the step 2, a multitask semantic segmentation module is used to perform pixel-by-pixel target segmentation and feasible capture semantic segmentation on the RGB image and the X-Y-Z image; the object segmentation is used for detecting the class of the pixel, and the feasible grabbing semantic segmentation is used for detecting whether the pixel is suitable to be used as a grabbing center;
the target segmentation decoding network and the feasible capture semantic segmentation decoding network are composed of intensive up-sampling convolution modules with different layer numbers;
the loss function of the target division decoding network uses an improved cross-entropy loss function LsemDefined as:
Figure FDA0003375888030000021
wherein N represents the total number of pixels of the image; n is a radical ofcRepresenting a category total; w is acRepresents the weight of the category c in all categories and has the calculation formula of
Figure FDA0003375888030000022
TcIndicates the total number of pixels with class truth value of c, TcFor balancing the case of an unbalanced number of classes;
Figure FDA0003375888030000023
the value is 0 or 1, the value of 0 indicates that the class is different from the class truth value corresponding to the pixel, and the value of 1 indicates that the class is the same as the class truth value corresponding to the pixel;
Figure FDA0003375888030000024
confidence score representing that pixel x belongs to list c, using
Figure FDA0003375888030000025
The difficulty degrees of different samples are balanced, the loss weight of the sample with higher confidence score is reduced, and gamma is an adjustable parameter.
5. The RGB image and depth image based seven-degree-of-freedom grab gesture detection method of claim 4, wherein the feasible grab semantic segmentation decoding network is a two-class network using a common cross-entropy loss function L_ga, specifically defined as:
L_ga = -(1/N) Σ_x [ α · x_g · log p(x) + β · (1 - x_g) · log(1 - p(x)) ]
where N represents the total number of pixels of the image, x_g ∈ {0, 1} represents the class truth value of pixel x, and p(x) represents the predicted confidence score that pixel x is graspable; α is set to 1 and β is set to 0.1 so that the loss of pixels labeled as graspable points occupies a larger weight.
6. The method of seven-degree-of-freedom grabbing posture detection based on an RGB image and a depth image as claimed in claim 5, wherein the loss function L of the multitask semantic segmentation module is defined as:
L = γ1 · L_sem + γ2 · L_ga
wherein γ1 and γ2 are adjustable parameters.
7. The RGB image and depth image-based seven-degree-of-freedom capture pose detection method of claim 1, wherein in the step 3, the depth image is complemented using a depth image complementing module;
the PENet algorithm adopts a two-channel framework, and a similar encoder-decoder network is constructed by using a deep convolution neural network and a deconvolution mode, wherein one channel takes color information as dominant input to obtain a color dominant depth map; and the other channel uses the original depth image as a dominant input, combines the color dominant depth map to obtain a depth dominant depth map, then fuses the obtained color dominant depth map and the depth dominant depth map in a weighting mode to obtain a preliminary dense depth image, and finally refines the dense depth image by using DA-CSPN + + to obtain a final complemented dense depth image.
8. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 1, wherein in the step 4, the grab coordinate system where the feasible grab point is located is calculated using the rotation matrix detection module;
sampling K nearest neighbor points near the feasible grabbing point by using a K nearest neighbor algorithm to form a point set, fitting a plane nearest to the point set, obtaining a normal direction of the feasible grabbing point according to the fitted plane, cutting a curved surface where the feasible grabbing point is located by using the plane passing through the normal direction, wherein the curved surface and the plane can be intersected to form a curve, the curve has a curvature at the feasible grabbing point, and selecting a direction with the maximum curvature and the minimum curvature at the feasible grabbing point in different curves as two main curvature directions of the feasible grabbing point.
9. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 2, wherein in the step 5, the grab depth and the grab width of the feasible grab point are determined using a grab depth and grab width detection module;
taking the z value of the feasible grabbing point as an interval center, sampling two sides of the interval center by adopting a heuristic algorithm, and judging whether different z and w meet the following conditions: 1) the grabber does not collide with the scene point cloud before closing; 2) the closed area of the gripper needs to contain a centre of grip.
10. The RGB image and depth image-based seven-degree-of-freedom grab gesture detection method of claim 1, wherein in the step 6 and the step 7, the targeted grab gesture is determined using a grab classification and assignment module;
and using the PointNet as an encoder to encode the generated information of the points in the grabbing closed region of the grabbing candidates, classifying by using a full connection layer, and filtering the grabbing candidates which are infeasible to obtain the final grabbing gesture set.
CN202111418398.7A 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image Pending CN114140418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111418398.7A CN114140418A (en) 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111418398.7A CN114140418A (en) 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Publications (1)

Publication Number Publication Date
CN114140418A true CN114140418A (en) 2022-03-04

Family

ID=80388548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111418398.7A Pending CN114140418A (en) 2021-11-26 2021-11-26 Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image

Country Status (1)

Country Link
CN (1) CN114140418A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147488A (en) * 2022-07-06 2022-10-04 湖南大学 Workpiece pose estimation method based on intensive prediction and grasping system
CN115187781A (en) * 2022-07-12 2022-10-14 北京信息科技大学 Six-degree-of-freedom grabbing detection algorithm based on semantic segmentation network
CN115187781B (en) * 2022-07-12 2023-05-30 北京信息科技大学 Six-degree-of-freedom grabbing detection method based on semantic segmentation network
CN115797332A (en) * 2023-01-29 2023-03-14 高视科技(苏州)有限公司 Target object grabbing method and device based on example segmentation
CN115797332B (en) * 2023-01-29 2023-05-30 高视科技(苏州)股份有限公司 Object grabbing method and device based on instance segmentation
CN117656083A (en) * 2024-01-31 2024-03-08 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment
CN117656083B (en) * 2024-01-31 2024-04-30 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment

Similar Documents

Publication Publication Date Title
CN114140418A (en) Seven-degree-of-freedom grabbing posture detection method based on RGB image and depth image
CN108280856B (en) Unknown object grabbing pose estimation method based on mixed information input network model
Cohen et al. Inference of human postures by classification of 3D human body shape
CN111695562B (en) Autonomous robot grabbing method based on convolutional neural network
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
Qian et al. Grasp pose detection with affordance-based task constraint learning in single-view point clouds
CN109048918B (en) Visual guide method for wheelchair mechanical arm robot
CN113065546A (en) Target pose estimation method and system based on attention mechanism and Hough voting
Ni et al. A new approach based on two-stream cnns for novel objects grasping in clutter
CN112465903A (en) 6DOF object attitude estimation method based on deep learning point cloud matching
Cao et al. Residual squeeze-and-excitation network with multi-scale spatial pyramid module for fast robotic grasping detection
CN114782347A (en) Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network
CN112288809B (en) Robot grabbing detection method for multi-object complex scene
Abouelnaga et al. Distillpose: Lightweight camera localization using auxiliary learning
Yu et al. Object recognition and robot grasping technology based on RGB-D data
CN114998573B (en) Grabbing pose detection method based on RGB-D feature depth fusion
CN110852272A (en) Pedestrian detection method
Ouyang et al. Robot grasp with multi-object detection based on RGB-D image
CN114049318A (en) Multi-mode fusion feature-based grabbing pose detection method
Nguyen et al. Bin-picking solution for industrial robots integrating a 2D vision system
Wang et al. 3D hand gesture recognition based on Polar Rotation Feature and Linear Discriminant Analysis
Wu et al. Object Pose Estimation with Point Cloud Data for Robot Grasping
Geng et al. A Novel Real-time Grasping Method Cobimbed with YOLO and GDFCN
Asif et al. Model-free segmentation and grasp selection of unknown stacked objects
Wu et al. Real-Time Pixel-Wise Grasp Detection Based on RGB-D Feature Dense Fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination