CN116277030A - Model-free grabbing planning method and system based on depth vision - Google Patents

Model-free grabbing planning method and system based on depth vision

Info

Publication number
CN116277030A
Authority
CN
China
Prior art keywords
pose
grabbing
desktop
point cloud
image
Legal status
Pending
Application number
CN202310480343.1A
Other languages
Chinese (zh)
Inventor
Peng Gang (彭刚)
Guan Shangbin (关尚宾)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model-free grabbing planning method and system based on depth vision. The system comprises an image acquisition module, a pose generation module, a processor and a track planning module. The image acquisition module is used for acquiring an RGB image, a depth image and a point cloud; the pose generation module comprises an image-based pose generation module and a point cloud-based pose generation module; the processor is used for comparing the HSV color space of the object with the HSV color space of the desktop, inputting the grabbing pose generated by the point cloud-based pose generation module into the track planning module when the HSV color space of the object is within the HSV color space range of the desktop, and inputting the grabbing pose generated by the image-based pose generation module into the track planning module when it is not; the track planning module is used for controlling the mechanical arm to move to the grabbing pose and execute the grabbing operation. The invention can be used in complex grabbing scenes without building a model, and has a high grabbing success rate when the color of the object is similar to that of the desktop.

Description

Model-free grabbing planning method and system based on depth vision
Technical Field
The invention belongs to the technical field of robot grabbing, and particularly relates to a model-free grabbing planning method and system based on depth vision.
Background
Object grabbing is a common operation in robot work. In a mechanical arm grabbing task based on object models, multiple objects of different types in a scene usually need to be grabbed. A 3D model of each object must be manually produced by a professional using certain technical means, which is costly, and it is difficult to obtain 3D models of all objects. The conventional model-based grabbing method, however, requires the 3D model of an object as input and is difficult to adapt to complex grabbing scenes with many objects.
In addition, the idea of grabbing a target through color segmentation is to segment by the color difference between the object and the desktop. When the object color is similar to the desktop, an object of a similar color is difficult to perceive in the RGB image, and the segmentation is biased, which causes the grab to fail.
Therefore, the prior art has the technical problems that the model is difficult to build, the model is difficult to adapt to complex grabbing scenes with more objects, and the grabbing success rate is low when the colors of the objects are similar to the colors of the desktop.
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides a model-free grabbing planning method and system based on depth vision, which solve the technical problems that the prior art is difficult to establish a model, difficult to adapt to complex grabbing scenes with more objects and has low grabbing success rate when the colors of the objects are similar to the colors of a desktop.
To achieve the above object, according to one aspect of the present invention, there is provided a model-free grasp planning system based on depth vision, including: the system comprises an image acquisition module, a pose generation module, a processor and a track planning module;
the image acquisition module is used for acquiring an RGB image, a depth image and a point cloud of the object to be grabbed and of the desktop where the object to be grabbed is located;
the pose generation module comprises an image-based pose generation module and a point cloud-based pose generation module;
the image-based pose generation module is used for acquiring an RGB image and a depth image, removing desktop pixels in the RGB image to obtain a pixel area of an object, calculating the minimum circumscribed rectangle of the pixel area of the object to obtain a 2D pixel position of the object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under a world coordinate system by combining camera parameter information;
the pose generation module based on the point cloud is used for acquiring the point cloud, removing the desktop point cloud after downsampling the point cloud, clustering the rest object point clouds to form independent point clouds, calculating the minimum outsourcing rectangular box of the independent point clouds, and generating the grabbing pose under the world coordinate system through the minimum outsourcing rectangular box and camera parameter information;
The processor is used for comparing the HSV color space of the object to be grabbed with the HSV color space of the desktop where the object is located, inputting the grabbing pose generated by the pose generation module based on the point cloud into the track planning module when the HSV color space of the object is in the HSV color space range of the desktop, and inputting the grabbing pose generated by the pose generation module based on the image into the track planning module when the HSV color space of the object is not in the HSV color space range of the desktop;
and the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
Further, the image-based pose generation module includes:
the minimum circumscribed rectangle forming module is used for graying the pixel area of the object to obtain a pixel point set of the object, setting an initial angle to enable the rectangle to surround the pixel point set of the object, rotating the rectangle, calculating the rectangular area under each rotating angle, and taking the minimum area rectangle as the minimum circumscribed rectangle.
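As an illustration only (not the patent's own implementation), the rotating-rectangle search described above can be sketched in Python roughly as follows; the function name and step size are hypothetical, and OpenCV's cv2.minAreaRect offers an equivalent, more efficient computation:

import numpy as np

def min_area_rect(points, step_deg=1.0):
    # Brute-force minimum-area enclosing rectangle of an (N, 2) pixel point set.
    # Returns (center_uv, (width, height), angle_deg).
    pts = np.asarray(points, dtype=np.float64)
    best = None
    for ang in np.arange(0.0, 90.0, step_deg):
        c, s = np.cos(np.radians(ang)), np.sin(np.radians(ang))
        rot = pts @ np.array([[c, -s], [s, c]])      # rotate points so the candidate rectangle is axis-aligned
        lo, hi = rot.min(axis=0), rot.max(axis=0)
        w, h = hi - lo
        if best is None or w * h < best[0]:
            center = ((lo + hi) / 2.0) @ np.array([[c, s], [-s, c]])   # rotate the center back to the pixel frame
            best = (w * h, center, (w, h), ang)
    _, center, size, angle = best
    return center, size, angle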
Further, the image-based pose generation module further includes:
the capturing pose generation module is used for acquiring the coordinates of the center point of the minimum circumscribed rectangle in the pixel coordinate system; calculating, in the pixel coordinate system, the coordinates of the two end points where the perpendicular through the center point meets the two short sides of the minimum circumscribed rectangle; acquiring the depth values of the center point and the two end points from the depth map; converting the pixel coordinates and depth value of the center point into its coordinates in the camera coordinate system through the camera intrinsics, and likewise converting the pixel coordinates and depth values of the two end points into their coordinates in the camera coordinate system; calculating the angle between the projection of the vector between the two end points on the X-O-Y plane and the X axis to obtain the rotation angle of the grabbing pose about the Z axis of the world coordinate system; and calculating the coordinates of the grabbing center in the world coordinate system from the coordinates of the center point in the camera coordinate system, so that the grabbing center coordinates and the rotation angle about the Z axis of the world coordinate system together form the grabbing pose.
Further, the image-based pose generation module further includes:
the pixel segmentation module is used for describing the RGB image as an RGB color space, eliminating the pixel area of the desktop in the RGB color space, reserving the pixel area in the RGB color space of the object, or converting the RGB color space into an HSV color space, eliminating the pixel area of the desktop in the HSV color space, and reserving the pixel area of the object in the HSV color space.
Further, the pose generation module based on the point cloud comprises:
the point cloud segmentation module is used for equally dividing the three-dimensional space formed by the point cloud into a plurality of cubes, retaining the center point of each cube to obtain a downsampled point cloud, randomly selecting N points from the downsampled point cloud as inliers, fitting the inliers into an initial plane, traversing all outliers other than the inliers in the downsampled point cloud, adding an outlier to the inlier set if its distance from the initial plane is smaller than a threshold T, and, after multiple iterations, taking the inlier set with the most points as the largest plane point cloud; the largest plane point cloud is removed, and the remaining outliers are the object point cloud.
Further, the pose generating module based on the point cloud further comprises:
The minimum outsourcing rectangular box forming module is used for dividing the object point cloud into a plurality of independent point clouds through Euclidean clustering, each independent point cloud corresponding to one object; for each independent point cloud, calculating the coordinate mean and the covariance matrix of the point cloud data, wherein the coordinate mean is the centroid of the point cloud; forming a rotation transformation matrix from the eigenvectors of the covariance matrix; and mapping the point cloud data into the coordinate system formed by the rotation transformation matrix and the translation vector corresponding to the centroid, to generate an OBB rectangular box serving as the minimum outsourcing rectangular box.
Further, the pose generating module based on the point cloud further comprises:
the grabbing pose generation module is used for combining the minimum outsourcing rectangular box with camera parameter information, solving a rotation matrix from the OBB main axis coordinate system to the camera coordinate system, converting a translation vector obtained by the OBB main axis coordinate system to the rotation matrix of the camera coordinate system and the centroid coordinate of the point cloud into a rotation matrix and a translation vector under the world coordinate system, combining Euler angles solved by the rotation matrix under the world coordinate system and the translation vector under the world coordinate system to form grabbing poses.
According to another aspect of the present invention, there is provided a model-free grab planning system based on depth vision, comprising: the system comprises an image acquisition module, a pose generation module and a track planning module;
The image acquisition module is used for acquiring an RGB image and a depth image of the object to be grabbed and of the desktop where the object to be grabbed is located;
the pose generation module is used for acquiring an RGB image and a depth image, removing desktop pixels in the RGB image to obtain a remaining pixel area, calculating the minimum circumscribed rectangle of the remaining pixel area to obtain a 2D pixel position of an object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under a world coordinate system by combining camera parameter information;
and the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
According to another aspect of the present invention, there is provided a model-free grab planning system based on depth vision, comprising: the system comprises an image acquisition module, a pose generation module and a track planning module;
the image acquisition module is used for acquiring a point cloud of the object to be grabbed and of the desktop where the object to be grabbed is located;
the pose generation module is used for acquiring point clouds, removing desktop point clouds after downsampling the point clouds, clustering the rest object point clouds to form independent point clouds, calculating a minimum outsourcing rectangular box of the independent point clouds, and generating a grabbing pose under a world coordinate system through the minimum outsourcing rectangular box and camera parameter information;
And the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
According to another aspect of the present invention, there is provided a model-free grab planning method, including:
collecting an RGB image, a depth image and a point cloud of a desktop where an object to be grabbed is located;
when the HSV color space of the object is not in the HSV color space range of the desktop, eliminating desktop pixels in the RGB image to obtain a remaining pixel area, calculating the minimum circumscribed rectangle of the remaining pixel area to obtain the 2D pixel position of the object to be grabbed in the RGB image, mapping the 2D pixel position into a depth image, and generating a grabbing pose under a world coordinate system by combining camera inside and outside parameter information;
when the HSV color space of the object is in the HSV color space range of the desktop, eliminating the desktop point cloud after downsampling the point cloud, then clustering the rest object point clouds to form independent point clouds, calculating a minimum outsourcing rectangular box of the independent point clouds, and generating a grabbing pose under a world coordinate system through the minimum outsourcing rectangular box and camera inside and outside parameter information;
and the mechanical arm moves to the grabbing pose, and grabbing operation is performed.
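For illustration, the selection logic of the method above can be sketched as follows; the helper callables and the use of the median object color are assumptions, not details taken from the patent:

import numpy as np
import cv2

def object_hsv_within_desktop_range(obj_bgr_pixels, desktop_hsv_lo, desktop_hsv_hi):
    # obj_bgr_pixels: (N, 3) uint8 BGR samples taken from the object region (assumed input).
    hsv = cv2.cvtColor(obj_bgr_pixels.reshape(-1, 1, 3), cv2.COLOR_BGR2HSV).reshape(-1, 3)
    med = np.median(hsv, axis=0)                       # robust summary of the object color
    return bool(np.all(med >= desktop_hsv_lo) and np.all(med <= desktop_hsv_hi))

def plan_grasp(rgb, depth, cloud, obj_bgr_pixels, desktop_hsv_lo, desktop_hsv_hi,
               pose_from_rgbd, pose_from_point_cloud, execute):
    # pose_from_rgbd / pose_from_point_cloud / execute are hypothetical callables
    # standing in for the two pose-generation branches and the track planning module.
    if object_hsv_within_desktop_range(obj_bgr_pixels, desktop_hsv_lo, desktop_hsv_hi):
        pose = pose_from_point_cloud(cloud)            # object color similar to the desktop
    else:
        pose = pose_from_rgbd(rgb, depth)              # object color distinct from the desktop
    execute(pose)                                      # move to the grabbing pose and grab
    return pose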
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
(1) The system uses the processor to judge whether the colors of the object and the desktop are similar. When the HSV color space of the object is within the HSV color space range of the desktop, the object and the desktop are similar in color, and the grabbing pose generated by the point cloud-based pose generation module is used to execute the grabbing task: experiments show that the pose obtained by the point cloud-based pose generation module can be well used for grabbing planning tasks with unknown object models, reaching an average grabbing success rate of 83.3%, and it can adapt to objects whose color is similar to the desktop background, still reaching a grabbing success rate of 82.9% in that case. When the HSV color space of the object is not within the HSV color space range of the desktop, the colors of the object and the desktop are dissimilar, and the grabbing pose generated by the image-based pose generation module is used to execute the grabbing task: experimental results show that the image-based pose generation module of the invention can be well used for grabbing planning tasks with unknown object models without taking a 3D model of the object as input, reaching an average grabbing success rate of 84.6%. The invention can adapt well to complex grabbing scenes with many objects without building a model, and has a high grabbing success rate when the colors of the objects are similar to the desktop. By selecting grabbing poses generated in different ways according to the change of the grabbing scene, the invention improves efficiency while improving grabbing accuracy in different scenes.
(2) The invention accurately acquires the minimum area rectangle which can completely surround the target outline in a rotating rectangle mode. The minimum circumscribed rectangle of the object can be accurately found, so that the accurate grabbing pose can be formed, and the grabbing success rate is improved. Because the tail end of the mechanical arm usually uses the parallel two-finger gripper to grip an object, in the gripping pose, the center of the parallel two-finger gripper corresponds to the three-dimensional coordinate of the center point of the minimum circumscribed rectangle under the world coordinate system, and in order to ensure that the parallel two-finger gripper can perform gripping operation to the maximum extent, the gripping direction should be vertically downward along the short sides of the minimum circumscribed rectangle, so that the two end points of the center point perpendicular to the two short sides of the minimum circumscribed rectangle are used for calculating the rotation angle of the gripping pose along the Z axis of the world coordinate system. When the points on the pixel coordinate system are mapped to the camera coordinate system, since the RGB map is registered with the depth map, the depth value corresponding to the pixel can be obtained through the depth map.
(3) The color information of the desktop in the scene is relatively fixed, so that the desktop pixels can be segmented by adopting a threshold value in a color space, object pixels can be obtained by separation, the desktop pixels can be segmented by adopting RGB or HSV, and the maximum advantage of the RGB color space is that the method is suitable for a hardware display system, and is visual and easy to understand; the HSV color space better describes the way humans observe colors, the hue H and saturation S are closely tied to the way humans feel colors, and brightness changes do not affect the hue and saturation components of an image. Because HSV can describe the perception of human eyes to colors more intuitively, the object with a certain specific color is easier to track than RGB color space, and therefore, when the object is segmented and grabbed by the invention, the HSV space is preferably selected.
(4) The scene point cloud data volume obtained through the RGB-D camera is large, a plurality of redundant point clouds exist, if the redundant point clouds are not processed and are directly used as the input of an algorithm frame, huge burden and waste are caused on computing resources, and the real-time performance is poor, so that the point clouds are required to be subjected to the preprocessing of downsampling. After the preprocessed point cloud is obtained, the point cloud of the desktop is required to be segmented, and the desktop can be segmented through the depth point cloud without being influenced by the colors of the point cloud, so that the method can be better suitable for desktops with different colors and is suitable for scenes with similar colors of objects and desktops. The invention uses a voxel filtering mode to obtain the downsampling point cloud with minimum density and most sufficient information quantity within the error precision allowable range of the algorithm. The method reduces the number of the point clouds through downsampling, and simultaneously saves the shape characteristics of the point clouds as much as possible.
(5) When the inlier set with the largest number of points is regarded as the desktop, the invention establishes a constraint to prevent the fitted maximum plane from not being the desktop. After the desktop is separated, the remaining point cloud is the target object point cloud, which needs to be divided into an independent point cloud for each target object. There are two common bounding rectangles, the OBB box (Oriented Bounding Box) and the AABB box (Axis-Aligned Bounding Box); the OBB box fits the object more closely than the AABB box, and therefore the invention adopts the OBB rectangle.
(6) The position and pose generation module in the system eliminates desktop pixels in the RGB image to obtain the rest pixel areas, calculates the minimum circumscribed rectangle of the rest pixel areas to obtain the 2D pixel position of the object to be grabbed in the RGB image, maps the 2D pixel position to the depth image, generates the grabbing pose under the world coordinate system by combining camera parameter information, does not need a 3D model of the object as input based on the RGB-D model pose generation technology, can be better used for grabbing planning tasks of unknown object models, adapts to complex grabbing scenes with more objects, has high grabbing success rate, and greatly shortens the time for generating grabbing poses.
(7) The pose generation module in the system separates the plane point cloud of the desktop, calculates the minimum outsourcing rectangular box of the target object and acquires the pose of the target object, the pose of the target object acquired by the model-free pose generation technology based on the point cloud can be better used for the grabbing planning task of an unknown object model, the system adapts to complex grabbing scenes of more objects, the grabbing success rate is high, the system can adapt to objects with similar colors as the desktop background, and the system still has high grabbing success rate when the colors are similar to the desktop background.
Drawings
FIG. 1 is a schematic diagram of a model-free grabbing planning system based on depth vision according to an embodiment of the present invention;
FIG. 2 is a schematic view of a 4-DOF pose provided by an embodiment of the present invention;
fig. 3 is a top view of the grabbing pose provided by an embodiment of the present invention;
FIG. 4 (a) is a schematic view of a voxel grid according to an embodiment of the present invention;
fig. 4 (b) is a voxel diagram provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
As shown in fig. 1, a model-free grabbing planning system based on depth vision includes: the system comprises an image acquisition module, a pose generation module, a processor and a track planning module;
the image acquisition module is used for acquiring an RGB image, a depth image and a point cloud of the object to be grabbed and of the desktop where the object to be grabbed is located;
The pose generation module comprises an image-based pose generation module and a point cloud-based pose generation module;
the image-based pose generation module is used for acquiring an RGB image and a depth image, removing desktop pixels in the RGB image to obtain a pixel area of an object, calculating the minimum circumscribed rectangle of the pixel area of the object to obtain a 2D pixel position of the object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under a world coordinate system by combining camera parameter information;
the pose generation module based on the point cloud is used for acquiring the point cloud, removing the desktop point cloud after downsampling the point cloud, clustering the rest object point clouds to form independent point clouds, calculating the minimum outsourcing rectangular box of the independent point clouds, and generating the grabbing pose under the world coordinate system through the minimum outsourcing rectangular box and camera parameter information;
the processor is used for comparing the HSV color space of the object to be grabbed with the HSV color space of the desktop where the object is located, inputting the grabbing pose generated by the pose generation module based on the point cloud into the track planning module when the HSV color space of the object is in the HSV color space range of the desktop, and inputting the grabbing pose generated by the pose generation module based on the image into the track planning module when the HSV color space of the object is not in the HSV color space range of the desktop;
And the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
Example 1
The mechanical arm end effector of the model-free grabbing planning system based on depth vision is a typical parallel two-finger gripper.
To clearly describe the grabbing operation, 4 parameters are typically used to describe a parallel two-finger gripper: gripper closing area width hand_depth, gripper closing area height hand_height, gripper closing area length hand_width, gripper finger thickness finger_width.
In embodiment 1 of the present invention, the parallel two-finger gripper parameters (hand_depth, hand_height, hand_width, finger_width) take fixed values, and the maximum width of a graspable object is:
obj_width_max = hand_width − 2 · finger_width = 70 mm
The object set grabbed in the real-environment grabbing task consists of common everyday objects such as fruits, biscuits and boxes. Most of these objects have no established 3D model, and such models are difficult to obtain; a small number of objects are similar in color to the desktop. The goal of the grabbing task is to grab all the objects on the desktop and collect them in a designated area.
The mechanical arm first moves to an initial pose and captures scene information, including the RGB image and depth information, through the image acquisition module; the visual information is then input into the pose generation module to obtain the grabbing pose of the target object. The invention specifies that the grabbing pose always points vertically downward, i.e., a 4-DOF grabbing pose is calculated. Finally, the track planning module controls the mechanical arm to move to the grabbing pose and execute the grabbing operation. In the image acquisition module, the mechanical arm first moves to an acquisition pose, acquires the RGB image and the depth map simultaneously, and generates the point cloud through the camera SDK. In the pose generation module, the RGB image and the depth information are used to calculate the pose of the grabbing target and generate the grabbing pose. In the track planning module, the RRTConnect algorithm is used: the mechanical arm is first moved in joint space to the position directly above the 4-DOF grabbing pose along its z-axis, and the end of the mechanical arm is then moved linearly toward the object along the z-axis of the grabbing pose by a distance L; at this point the parallel two-finger gripper closes to clamp the object, and the object is moved to the designated place to complete the grabbing operation.
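A minimal sketch of this execution flow, written against a purely hypothetical arm/gripper interface (the method names below are assumptions, not an actual robot API):

def execute_grasp(arm, gripper, grasp_pose, approach_offset=0.10, place_pose=None):
    # grasp_pose: (x, y, z, yaw) 4-DOF pose in the world frame; approach_offset in metres (assumed).
    x, y, z, yaw = grasp_pose
    arm.move_joint_space((x, y, z + approach_offset, yaw))   # planned move to a point above the target
    arm.move_linear((x, y, z, yaw))                          # straight-line descent along the grasp z-axis
    gripper.close()                                          # clamp the object
    if place_pose is not None:
        arm.move_joint_space(place_pose)                     # carry the object to the designated area
        gripper.open()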
As shown in fig. 2, the 4-DOF grabbing pose may be described as
Pose = (x, y, z, φ)
where P(x, y, z) is the coordinate point of the grabbing pose in the world coordinate system and φ is the rotation angle about the z-axis of the target pose.
When the mechanical arm performs grabbing planning, the planning may fail due to phenomena such as grabbing empty air or grabbing with deviation caused by errors in the grabbing pose, so an index is needed to judge whether a grab is successful; a commonly used criterion is the force-closure model.
With a two-finger gripper, dynamic balance is achieved when the forces applied by the gripper counteract the forces on the object in other directions, and during grasping only 2 contact points remain between the gripper and the object. According to the Nguyen theorem, the force-closure condition can be described as follows: if the line connecting the two contact points between the fingers and the target object lies inside the friction cones, the force-closure condition is satisfied and the grabbing plan succeeds; otherwise, the force-closure condition is not satisfied and the grabbing plan fails.
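As a rough sketch of this check (assuming point contacts with known inward-pointing normals and friction coefficient mu; this is a common formulation, not text from the patent):

import numpy as np

def force_closure_two_contacts(p1, n1, p2, n2, mu):
    # The line joining the two contact points must lie inside both friction cones.
    half_angle = np.arctan(mu)                     # friction cone half-angle
    d = p2 - p1
    d = d / np.linalg.norm(d)
    n1 = n1 / np.linalg.norm(n1)
    n2 = n2 / np.linalg.norm(n2)
    a1 = np.arccos(np.clip(np.dot(d, n1), -1.0, 1.0))    # angle at contact 1
    a2 = np.arccos(np.clip(np.dot(-d, n2), -1.0, 1.0))   # angle at contact 2
    return bool(a1 <= half_angle and a2 <= half_angle)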
The pose generation module comprises: an image-based pose generation module and a point cloud-based pose generation module.
The processor is used for comparing the object to be grabbed with the HSV color space of the desktop where the object is located, inputting the grabbing pose generated by the pose generation module based on the point cloud into the track planning module when the HSV color space of the object is in the HSV color space range of the desktop, and inputting the grabbing pose generated by the pose generation module based on the image into the track planning module when the HSV color space of the object is not in the HSV color space range of the desktop.
In the image-based pose generation module, on the basis of camera calibration, the RGB image and depth map of the current scene are obtained, the desktop pixels are separated using prior knowledge of the known desktop color, and the remaining pixel regions above a certain scale are regarded as grabbing targets; the 2D pixel position of each target object in the RGB image is obtained by calculating the minimum circumscribed rectangle of each pixel region; the 2D position is mapped into the depth map, and the pose of the object in the world coordinate system is finally generated from the depth information and the camera intrinsic and extrinsic parameters.
RGB is the most common color model for hardware-oriented display devices (e.g., PC displays, mobile terminal displays, etc.). According to the human eye structure, all colors can be regarded as superposition combinations of different proportions of 3 basic colors, namely three primary colors of chromatic light, namely red (R), green (G) and blue (B).
Normalizing the RGB color cube to a unit cube maps the RGB values from [0, 255] to the interval [0, 1]. The origin O(0, 0, 0) of the color-space coordinate system is black, the vertex W(1, 1, 1) farthest from the origin corresponds to white, and the gray values from black to white are distributed along the body diagonal from O to W. According to the RGB color space model, each color can be expressed as brightness values of 0-255 on the three primary-color channels, and the variations of the 3 color channels superimpose to form 16777216 (i.e., 256³) colors.
The greatest advantage of the RGB color space is that it is suitable for a hardware display system, intuitive and easy to understand, but when describing colors, the three primary colors of color light are highly correlated among 3 components, and the uniformity is poor, for example, the brightness of the same color changes, and all three components change correspondingly.
The HSV color space is a perception-based color model that describes a color by 3 attributes: H (Hue), S (Saturation) and V (Value), specifically:
a) Hue: the kind of color, determined by the wavelength of light reflected by or transmitted through the object, e.g., red or green;
b) Value (brightness): how bright the color is, e.g., dark red versus bright red;
c) Saturation: how deep or light the color is, e.g., deep red versus light red.
The HSV color space better describes the way humans observe colors, the hue H and saturation S are closely connected to the way humans feel colors, and brightness changes do not affect the hue and saturation components of an image, the model of which corresponds to a cone in a cylindrical coordinate system.
In the cone-shaped HSV color space, the apex of the cone has value V = 0 and the base has V = 1, with V increasing linearly along the cone generatrix from the apex; the hue H is determined by the rotation angle about the cone axis, with red at 0°, green at 120° and blue at 240°, and each color differing from its complementary color by 180°; the saturation S is S = 0 on the cone axis and S = 1 on the cone surface, increasing linearly from inside to outside.
Since colors exist objectively in reality, different color spaces are only descriptions of different angles, so that the parameters of the RGB color space and the HSV color space have unique corresponding relations and can be mutually converted through the following transformation relations:
a) Conversion from RGB to HSV color space:
h =
  0°,                                  if max = min
  60° · (g − b) / (max − min),         if max = r and g ≥ b
  60° · (g − b) / (max − min) + 360°,  if max = r and g < b
  60° · (b − r) / (max − min) + 120°,  if max = g
  60° · (r − g) / (max − min) + 240°,  if max = b
s = 0 if max = 0, otherwise s = (max − min) / max
v = max
where r, g and b are the components of the corresponding channels in the RGB color space, max = max(r, g, b) and min = min(r, g, b), respectively.
b) Conversion from HSV to RGB color space:
First, intermediate variables are calculated:
h_i = floor(h / 60°) mod 6
f = h / 60° − h_i
p = v · (1 − s)
q = v · (1 − f · s)
t = v · (1 − (1 − f) · s)
where h, s and v are the values of the three HSV channels, respectively, and mod is the remainder operator. The components of the three RGB channels obtained from the intermediate variables are:
(r, g, b) = (v, t, p), (q, v, p), (p, v, t), (p, q, v), (t, p, v), (v, p, q) for h_i = 0, 1, 2, 3, 4, 5, respectively.
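A minimal Python sketch of the two conversions above (channel values assumed normalized to [0, 1], hue in degrees):

def rgb_to_hsv(r, g, b):
    mx, mn = max(r, g, b), min(r, g, b)
    d = mx - mn
    if d == 0:
        h = 0.0
    elif mx == r:
        h = (60.0 * (g - b) / d) % 360.0     # covers both the g >= b and g < b branches
    elif mx == g:
        h = 60.0 * (b - r) / d + 120.0
    else:
        h = 60.0 * (r - g) / d + 240.0
    s = 0.0 if mx == 0 else d / mx
    return h, s, mx

def hsv_to_rgb(h, s, v):
    hi = int(h // 60) % 6                    # intermediate variables as in the text
    f = h / 60.0 - int(h // 60)
    p, q, t = v * (1 - s), v * (1 - f * s), v * (1 - (1 - f) * s)
    return [(v, t, p), (q, v, p), (p, v, t), (p, q, v), (t, p, v), (v, p, q)][hi]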
as the HSV can more intuitively describe the perception of the human eyes to the color, the object with a certain specific color is easier to track than the RGB color space, and therefore, the embodiment of the invention adopts the HSV space when the object is split and grabbed.
Based on reference data, the HSV values of some representative colors are shown in Table 1:
TABLE 1 HSV color space threshold for typical colors
In Table 1, only red has two hue ranges, 0-10 and 156-180; every other color has a single hue range.
Although the desktop color is close to white, it is an atypical color, so the invention designs Qt-based visualization software for segmenting a desktop of a specified color. During segmentation, an RGB image is first collected as a sample; the software's sliders are then dragged to adjust the channel values of the HSV space until the desktop color is segmented, which yields the HSV color space interval belonging to the desktop.
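For illustration, desktop removal by HSV thresholding might look as follows; the near-white interval used here is a placeholder, since the actual thresholds are obtained with the Qt slider tool described above:

import cv2
import numpy as np

def remove_desktop_pixels(bgr, desktop_lo=(0, 0, 180), desktop_hi=(180, 40, 255)):
    # bgr: H x W x 3 uint8 image; the threshold interval is a hypothetical near-white range.
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    desktop_mask = cv2.inRange(hsv, np.array(desktop_lo, np.uint8), np.array(desktop_hi, np.uint8))
    object_mask = cv2.bitwise_not(desktop_mask)          # keep everything that is not desktop
    return cv2.bitwise_and(bgr, bgr, mask=object_mask), object_mask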
After the desktop pixels have been removed from the RGB image, only the pixels of the target objects remain. After the RGB image is converted to grayscale, several pixel point sets exist in the resulting gray image; the point sets of the target objects are distinguished by the Moore-Neighbor contour tracing algorithm, whose basic idea is to find all connected pixel regions above a certain scale in the binary image, extract the contour formed by these connected pixels and return it, so that the point set enclosed by each contour corresponds to the pixel point set of one object. After the object pixel point sets are distinguished, the minimum circumscribed rectangle of each object pixel point set can be obtained, and then the pixel position of the object in the RGB image. The minimum circumscribed rectangle is the rectangle of smallest area that completely encloses the target contour; when solving it, an initial angle is first set so that a rectangle just encloses the object pixel point set, the rectangle is then rotated (0-90°) in the coordinate system with a certain step length, the enclosed rectangle area at each rotation angle is calculated, the rectangle with the smallest area is found, and its center point P(u_o, v_o) and rotation angle θ are solved.
After the minimum circumscribed rectangle TR of the object is obtained from the RGB image, the coordinates of the TR center point in the pixel coordinate system, P(u_o, v_o), and the horizontal angle θ between TR and the RGB image can be obtained. In the grabbing pose, the center of the parallel two-finger gripper corresponds to the three-dimensional position of P in the world coordinate system; to ensure that the parallel two-finger gripper can perform the grabbing operation to the maximum extent, the grabbing direction should be vertically downward along the short side of the minimum circumscribed rectangle, as shown in fig. 3.
From P(u_o, v_o) and θ, the coordinates of the two end points of the short side in the pixel coordinate system are calculated as A(u_A, v_A) and B(u_B, v_B). When points in the pixel coordinate system are mapped to the camera coordinate system, since the RGB image of the vision sensor is registered with the depth image, the depth value z corresponding to a pixel can be obtained from the depth image. For a point (u, v) in the pixel coordinate system with depth value z, its coordinates (x, y, z) in the camera coordinate system can be solved by:
x = (u − u_0) · z / f_x
y = (v − v_0) · z / f_y
where f_x and f_y are the focal lengths of the camera and (u_0, v_0) are the principal point coordinates, all camera intrinsics; intrinsic quantities carry the subscript 0, while the rectangle center P(u_o, v_o) carries the subscript o.
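A direct Python transcription of this back-projection (a sketch; symbol names follow the text):

import numpy as np

def pixel_to_camera(u, v, z, fx, fy, u0, v0):
    # Back-project pixel (u, v) with registered depth z into the camera frame.
    x = (u - u0) * z / fx
    y = (v - v0) * z / fy
    return np.array([x, y, z])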
By this calculation, the coordinates of P(u_o, v_o), A(u_A, v_A) and B(u_B, v_B) in the camera coordinate system are obtained as P_P(x_P, y_P, z_P), P_A(x_A, y_A, z_A) and P_B(x_B, y_B, z_B), respectively. The coordinates of the three points P, A and B are then transformed into the world coordinate system through the camera extrinsics. The 4-DOF Pose consists of the grabbing center P_o(x_o, y_o, z_o) in the world coordinate system and the rotation angle φ of the pose about the z-axis of the world coordinate system, which is the angle between the projection of the vector from A to B onto the X-O-Y plane and the x-axis. This angle can be calculated by:
φ = atan2(y_B − y_A, x_B − x_A)
Wherein atan2 is defined as follows:
Figure BDA0004206977770000151
It follows that the atan2 function can compute an angle in [−π, π] from the input coordinate values x and y and determine the quadrant of the angle from the signs of x and y, whereas the inverse trigonometric function arctan generally yields two solutions or none, so the atan2 function is more stable than arctan.
The length L of the short side of the minimum bounding rectangle TR can be expressed as the norm of the projection of the three-dimensional vector from A to B onto the X-O-Y plane:
L = √((x_B − x_A)² + (y_B − y_A)²)
To ensure that the object can be grasped, the maximum graspable width obj_width_max of the opened gripper must be no smaller than the width of the object, i.e. the length L of the short side of the minimum bounding rectangle TR. If obj_width_max ≥ L, the grabbing planning can continue; if obj_width_max < L, the grasping condition is not met and the grab is abandoned.
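A sketch combining the yaw formula above with this width check (argument names and the use of world-frame coordinates of A and B, as well as metric units, are assumptions):

import numpy as np

def grasp_yaw_and_width(A_w, B_w, obj_width_max=0.070):
    # A_w, B_w: 3D coordinates of the short-side end points; 0.070 m corresponds to the 70 mm limit.
    dx, dy = B_w[0] - A_w[0], B_w[1] - A_w[1]
    yaw = np.arctan2(dy, dx)            # angle of the X-O-Y projection of AB with the x-axis, in [-pi, pi]
    L = np.hypot(dx, dy)                # short-side length = norm of the projection
    return yaw, L, bool(L <= obj_width_max)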
The point cloud-based pose generation module uses an Intel RealSense D435i camera; besides the depth map, the depth visual information acquired by the camera can be converted into a depth point cloud through the camera SDK, and the spatial pose of the grabbing target is calculated from this depth point cloud. The scene point cloud obtained through the RGB-D camera contains a large amount of data and many redundant points; if it is used directly as the input of the algorithm framework without processing, it places a heavy burden on computing resources and the real-time performance is poor, so the point cloud must first be downsampled. After the preprocessed point cloud is obtained, the desktop point cloud must be segmented out; a typical characteristic of the desktop in the grabbing task is that it is the largest plane in the field of view, and segmenting the desktop from the depth point cloud is unaffected by color, so the method adapts better to desktops of different colors and to scenes in which the object and desktop colors are similar. After the desktop point cloud is separated, the point clouds of the remaining objects are clustered so that each grabbing target becomes an independent point cloud; finally the minimum outsourcing rectangular box of each independent point cloud is calculated, and the pose of the target object is solved from the parameters of the rectangular box, thereby obtaining the grabbing pose.
Since the point cloud obtained through the RGB-D camera SDK is a dense point cloud with a large amount of data, downsampling is required. The method of the invention uses Voxel Grid (voxel) filtering, as shown in fig. 4(a) and fig. 4(b). Its core idea is to divide the three-dimensional space into many small cubes with side length v, where Δx = Δy = Δz = v; among all the points inside a cube, only the point P_center nearest to the cube center is retained and the other points are filtered out, so that, within the error tolerance of the algorithm, the downsampled point cloud has the lowest density while carrying the most information. Downsampling reduces the number of points while preserving the shape characteristics of the point cloud as much as possible.
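As a sketch of this voxel-grid downsampling (keeping, per occupied voxel, the point nearest the voxel center; not the patent's code):

import numpy as np

def voxel_downsample(points, voxel_size):
    pts = np.asarray(points, dtype=np.float64)
    idx = np.floor(pts / voxel_size).astype(np.int64)     # voxel index of each point
    centers = (idx + 0.5) * voxel_size                    # center of the voxel each point falls in
    dist = np.linalg.norm(pts - centers, axis=1)
    keep = {}
    for i, key in enumerate(map(tuple, idx)):
        if key not in keep or dist[i] < dist[keep[key]]:
            keep[key] = i                                  # keep the point closest to the voxel center
    return pts[sorted(keep.values())]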
After the input dense point cloud is downsampled, voxel filtering greatly reduces the redundant points of the largest plane (the desktop) in the point cloud, while the largest plane after filtering is still the desktop, so desktop segmentation is not affected; for the grabbing targets, since each object has a certain volume, voxel filtering still retains the key points of the object and has little influence on target recognition. A disadvantage of voxel filtering, however, is that an object that is small enough (smaller than the edge length of the voxel grid) may be filtered out, so the voxel grid size must be chosen according to the size of the objects to be grabbed. The invention treats the voxel grid size as an adjustable hyperparameter: when the grabbing targets are generally small, a smaller voxel grid is used so that small targets are not filtered out, at the cost of fewer points being filtered, slower processing and worse real-time performance; when the targets are generally large, a larger voxel grid size can be used, which significantly reduces the amount of computation and improves grabbing real-time performance.
After voxel-filter downsampling, the desktop can be segmented out as the background. In the grabbing task the desktop is the largest plane in the field of view and can be fitted and segmented with the RANSAC algorithm. The method assumes that the desktop can be described spatially by a single set of model parameters; the points that conform to this model are inliers, and the points that do not fit the model are outliers. For the input point cloud data, the steps of the RANSAC algorithm are as follows:
1) Randomly select N points as inliers and fit them to the specified model;
2) Substitute the remaining points into the fitted model, judge whether each of them belongs to the inlier group, and record the number of inliers;
3) Specify the number of iterations N, repeat steps 1)-2) N times, and take the model with the largest number of inliers as the result.
According to the steps of the RANSAC algorithm, when the desktop is segmented in the grabbing task, 3 points of the point cloud are randomly selected as inliers in each iteration, an initial plane is generated from these 3 inliers, all outliers other than the inliers are traversed, and an outlier is added to the inlier set if its distance from the initial plane is smaller than the threshold T. After N iterations, the inlier set with the most points over the whole process is the point cloud of the largest plane (i.e., the desktop); these inliers are removed, and the remaining outlier set is the point cloud of the grabbing targets. Here T and N are adjustable hyperparameters.
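A minimal sketch of this RANSAC plane segmentation (the distance threshold and iteration count are hypothetical defaults):

import numpy as np

def ransac_plane_segment(points, dist_thresh=0.01, iters=200, seed=None):
    # Returns (A, B, C, D) with A^2 + B^2 + C^2 = 1 and a boolean inlier mask.
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=np.float64)
    best_plane, best_mask = None, None
    for _ in range(iters):
        p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]   # 3 random inlier candidates
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:                               # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -np.dot(n, p0)
        mask = np.abs(pts @ n + d) < dist_thresh                   # points within T of the candidate plane
        if best_mask is None or mask.sum() > best_mask.sum():
            best_plane, best_mask = (n[0], n[1], n[2], d), mask
    return best_plane, best_mask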
When the inlier set with the largest number of points is regarded as the desktop, a pose constraint is established in order to prevent the fitted maximum plane from not actually being the desktop. The maximum plane fitted by the RANSAC algorithm is written in the normalized form Ax + By + Cz + D = 0, where A² + B² + C² = 1; the plane normal vector is then transformed into the world coordinate system to obtain n_w. From the prior knowledge that the desktop normal is always perpendicular to the ground, the fitted plane normal should be parallel to the z-axis of the world coordinate system, or, within the error tolerance, the angle between n_w and the world z-axis z should be less than a certain angle θ; the embodiment of the invention takes θ = 5°. The cosine of the angle between n_w and z is
c = |n_w · z| / (|n_w| · |z|)
In the practical setting of the invention the angle between n_w and z lies in [0°, 90°], and the cosine function decreases as the angle increases on this interval, so the constraint is c ≥ cos 5° = 0.99619469809.
After the desktop is separated, the remaining point cloud is the target object point cloud, which needs to be divided into an independent point cloud for each target object; a common point cloud clustering algorithm for this is Euclidean clustering. Its core idea is that points whose Euclidean distance is smaller than a certain threshold are assigned to the same cluster. The specific process is: select an unprocessed point; if it has not been classified, create a new cluster with it, search for the surrounding points whose Euclidean distance is smaller than the threshold T (its "neighbor points"), and add them to the cluster; if it has already been classified, only its neighbor points need to be found; the newly added points are then processed iteratively until no new points are added, which yields all points of the cluster. The final point cloud is divided into several point sets by cluster, each point set represents one target object, and the pose of the point set is the pose of the target object.
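A sketch of this region-growing Euclidean clustering using a k-d tree (the minimum cluster size is an assumed filter, not a value from the text):

import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points, radius, min_size=30):
    pts = np.asarray(points, dtype=np.float64)
    tree = cKDTree(pts)
    labels = np.full(len(pts), -1, dtype=int)
    current = 0
    for seed in range(len(pts)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        frontier = [seed]
        while frontier:
            i = frontier.pop()
            for j in tree.query_ball_point(pts[i], radius):   # neighbor points within the threshold
                if labels[j] == -1:
                    labels[j] = current
                    frontier.append(j)
        current += 1
    clusters = [pts[labels == c] for c in range(current)]
    return [c for c in clusters if len(c) >= min_size]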
After the object point cloud is divided into several clusters by Euclidean clustering, the independent point cloud corresponding to each object is irregular, so its pose is relatively difficult to obtain directly; a rectangular bounding box of the object point cloud is therefore computed first. There are two common bounding boxes, the OBB box (Oriented Bounding Box) and the AABB box (Axis-Aligned Bounding Box); the OBB box fits the object more closely than the AABB box, so the method adopts the OBB rectangle. The main steps are as follows:
(1) Take a single object point cloud distinguished by Euclidean clustering as input, traverse it to obtain the coordinates of each point, and calculate the coordinate mean and covariance matrix of the point cloud data, where the coordinate mean is the centroid P(x, y, z) of the point cloud;
(2) Solving eigenvectors and eigenvalues of the covariance matrix, and forming the eigenvectors into a rotation transformation matrix Rot;
(3) And mapping the point cloud data into a coordinate system formed by the Rot rotation transformation and the translation transformation corresponding to the centroid P (x, y, z), and generating the OBB rectangular box.
Through these steps, the rotation matrix Rot from the OBB principal axis coordinate system to the camera coordinate system is obtained, and the translation vector Trans(x, y, z) is obtained from the point cloud centroid P(x, y, z); the two are combined into a transformation matrix T, which is converted into the world coordinate system to obtain T_w, whose rotation part is Rot_w and whose translation vector is Trans(x_w, y_w, z_w). The Euler angles are then solved from the rotation matrix Rot_w in T_w to obtain the pose of the target object, and finally the 4-DOF grabbing pose, always pointing vertically downward toward the desktop, is obtained as Pose = (x_w, y_w, z_w, φ).
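A sketch of these steps and of the final pose extraction (PCA-based OBB axes; the yaw extraction below assumes a Z-Y-X Euler convention, which is an assumption rather than a detail stated in the text):

import numpy as np

def obb_and_grasp_pose(cluster_points, T_cam_to_world):
    # cluster_points: one object's point cloud in the camera frame.
    # T_cam_to_world: 4x4 homogeneous extrinsic matrix from camera to world (from calibration).
    pts = np.asarray(cluster_points, dtype=np.float64)
    centroid = pts.mean(axis=0)                         # coordinate mean = centroid P
    cov = np.cov((pts - centroid).T)                    # covariance matrix of the cluster
    _, eigvecs = np.linalg.eigh(cov)                    # eigenvectors form the OBB principal axes
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = eigvecs, centroid             # OBB pose (Rot, Trans) in the camera frame
    T_w = T_cam_to_world @ T                            # transform into the world frame
    x_w, y_w, z_w = T_w[:3, 3]
    yaw = np.arctan2(T_w[1, 0], T_w[0, 0])              # z-axis rotation angle extracted from Rot_w
    return T_w, (x_w, y_w, z_w, yaw)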
Example 2
The RGB-D based method comprises the following steps:
collecting RGB images and depth images of an object to be grabbed and a desktop where the object to be grabbed is located;
obtaining an RGB image and a depth image, removing desktop pixels in the RGB image to obtain a remaining pixel area, calculating the minimum circumscribed rectangle of the remaining pixel area to obtain a 2D pixel position of an object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under a world coordinate system by combining camera parameter information;
and controlling the mechanical arm to move to the grabbing pose, and executing grabbing operation.
Example 3
The method based on the point cloud comprises the following steps:
collecting point clouds of an object to be grabbed and a desktop where the object to be grabbed is located;
after downsampling point clouds, eliminating desktop point clouds, clustering the rest object point clouds to form independent point clouds, calculating a minimum outsourcing rectangular box of the independent point clouds, and generating a grabbing pose under a world coordinate system through the minimum outsourcing rectangular box and camera parameter information;
and controlling the mechanical arm to move to the grabbing pose, and executing grabbing operation.
Example 4
Grabbing experiments are carried out using the RGB-D-based pose generation method and the point cloud-based pose generation method of the invention, as well as the pose estimation methods DOPE and PointNetGPD used in the prior art, and the experimental results are analyzed in detail.
Most of the objects in the object set of the grabbing experiment do not build a 3D model, and a small part of the objects are similar to the background color of the desktop. When the contrast experiment of grabbing tasks is designed, the following 4 groups of contrast experiment scenes are set:
1) Scene 1:5 known model objects;
2) Scene 2:3 known models+2 unknown model objects;
3) Scene 3:5 unknown model objects;
4) Scene 4:5 unknown model objects, including 2 objects with similar colors to the desktop background;
the method of the invention is compared with the existing pose estimation methods DOPE and PointNetGPD. Each group of experiments are repeated for 20 times, each task is grabbed for 7 times, the evaluation index is the grabbing success rate, the grabbing success rate/total grabbing time is calculated, in addition, as the movement time of the mechanical arm is determined by the actual pose of the object, the real-time performance of the average time-consuming evaluation algorithm for generating the grabbing pose is only used as a control variable. The experimental results are shown in table 2.
Table 2 results of the grasp comparison experiment
* Note: the RGB-D-based method requires acquiring one image sample; this time has been averaged over each experiment.
From the experimental results of table 2, the analysis was made as follows: 1) The DOPE has higher grabbing success rate for the object with the known 3D model when acquiring the pose of the target object, but the grabbing success rate is obviously reduced along with the increase of the object with the unknown model, if the object with the known 3D model is not in the scene, the method is completely invalid, and the pose of the object with the unknown model cannot be acquired; 2) The PointNetGPD can acquire the 6-DOF capturing pose of the target object when the 3D model of the object is unknown, but the method needs to take a long time when calculating the optimal 6-DOF pose; 3) According to the method, the pose generation method based on RGB-D is used for grabbing, a 3D model of an object is not required to be input, the object with an unknown model in a scene can be grabbed, the average grabbing success rate is 84.6%, the grabbing success rate is improved by 49.8% compared with the DOPE grabbing success rate, the grabbing success rate is reduced by 3.0% compared with the PointNetGPD, and because the method only calculates 4-DOF grabbing poses, the time consumption is reduced by 73.1% compared with PointNetGPD, the grabbing pose generation speed is greatly improved, and the method is more suitable for grabbing scenes with high real-time performance; 4) According to the invention, the capturing is limited by the precision of the depth sensor by using the pose generation method based on the point cloud, compared with the method based on the RGB-D, the average capturing success rate is reduced by 1.5%, the pose acquisition time is increased by 1.1 times, but in a scene containing 2 objects similar to the desktop background color, the capturing success rate of the method based on the point cloud is improved by 11.7% compared with the method based on the RGB-D, and the average time consumption is reduced by 43.6% compared with the PointNetGPD.
Embodiment 1 includes two grabbing pose generation modules, whereas embodiments 2 and 3 each use only one pose generation method. The RGB-D-based model-free pose generation method of embodiment 2 obtains the 4-DOF pose of the target object from the pixel information of the segmented object. Experimental results show that this method requires no 3D model of the object as input, is well suited to grasp planning for objects with unknown models, and achieves an average grabbing success rate of 84.6%. The point-cloud-based model-free pose generation method of embodiment 3 removes the point cloud of the desktop on which the object is located, computes the minimum outsourcing rectangular box of the target object, and obtains its 4-DOF pose. Experimental results show that the target object pose obtained in this way is also well suited to grasp planning for objects with unknown models, achieves an average grabbing success rate of 83.3%, and can handle objects whose colors are similar to the desktop background, still reaching a grabbing success rate of 82.9% in that case. Embodiment 1 contains the two modules corresponding to these two methods and selects the appropriate module according to the actual situation to compute the grabbing pose that guides the mechanical arm. Experimental results show that the proposed method performs well when grabbing objects with unknown models in a real environment, with high grabbing efficiency and real-time performance, and that the point-cloud-based method can handle objects whose colors are similar to the desktop background. By selecting different grabbing pose generation methods according to the grabbing scene, the method improves grabbing accuracy in different scenes while maintaining efficiency.
It will be readily appreciated by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention; any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A model-free grabbing planning system based on depth vision, comprising: an image acquisition module, a pose generation module, a processor and a track planning module;
the image acquisition module is used for acquiring an RGB image, a depth image and a point cloud of an object to be grabbed and of the desktop where the object to be grabbed is located;
the pose generation module comprises an image-based pose generation module and a point cloud-based pose generation module;
the image-based pose generation module is used for acquiring an RGB image and a depth image, removing desktop pixels in the RGB image to obtain a pixel area of an object, calculating the minimum circumscribed rectangle of the pixel area of the object to obtain a 2D pixel position of the object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under a world coordinate system by combining camera parameter information;
the pose generation module based on the point cloud is used for acquiring the point cloud, eliminating the desktop point cloud, clustering the rest object point clouds to form independent point clouds, calculating a minimum outsourcing rectangular box of the independent point clouds, and generating a grabbing pose under a world coordinate system through the minimum outsourcing rectangular box and camera parameter information;
The processor is used for comparing the HSV color space of the object to be grabbed with the HSV color space of the desktop where the object is located, inputting the grabbing pose generated by the pose generation module based on the point cloud into the track planning module when the HSV color space of the object is in the HSV color space range of the desktop, and inputting the grabbing pose generated by the pose generation module based on the image into the track planning module when the HSV color space of the object is not in the HSV color space range of the desktop;
and the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
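For illustration only (this sketch is not part of the claim), the processor's color-based selection between the two pose generation modules could look roughly as follows in Python; the desktop HSV bounds, the median-based comparison and the module interfaces are assumptions rather than details taken from the disclosure:

```python
import cv2
import numpy as np

def object_hsv_within_desktop_range(object_pixels_bgr, desktop_hsv_low, desktop_hsv_high):
    """Return True when the median HSV value of the object's pixels lies inside
    the desktop's HSV range, i.e. the object color is similar to the desktop."""
    hsv = cv2.cvtColor(object_pixels_bgr.reshape(-1, 1, 3), cv2.COLOR_BGR2HSV).reshape(-1, 3)
    median_hsv = np.median(hsv, axis=0)
    return bool(np.all(median_hsv >= desktop_hsv_low) and np.all(median_hsv <= desktop_hsv_high))

def route_pose_generation(object_pixels_bgr, desktop_hsv_low, desktop_hsv_high,
                          image_based_module, point_cloud_based_module):
    """Select the point-cloud-based module when colors are similar,
    otherwise the image-based module (module objects are hypothetical)."""
    if object_hsv_within_desktop_range(object_pixels_bgr, desktop_hsv_low, desktop_hsv_high):
        return point_cloud_based_module.generate_grasp_pose()
    return image_based_module.generate_grasp_pose()
```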
2. The model-free grabbing planning system based on depth vision of claim 1, wherein the image-based pose generation module comprises:
the minimum circumscribed rectangle forming module is used for graying the pixel area of the object to obtain a pixel point set of the object, setting an initial angle so that a rectangle encloses the pixel point set of the object, rotating the rectangle, calculating the rectangle area at each rotation angle, and taking the rectangle with the minimum area as the minimum circumscribed rectangle.
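As an illustrative sketch of the rotating-rectangle search described in claim 2, OpenCV's minAreaRect computes an equivalent minimum-area circumscribed rectangle; using it here is an assumption, not a statement about how the disclosed module is implemented:

```python
import cv2
import numpy as np

def min_circumscribed_rect(object_region_bgr):
    """Grayscale the object pixel area and return its minimum-area rotated
    circumscribed rectangle as ((cx, cy), (w, h), angle)."""
    gray = cv2.cvtColor(object_region_bgr, cv2.COLOR_BGR2GRAY)
    # Non-zero pixels form the object's pixel point set (background assumed black).
    points = cv2.findNonZero((gray > 0).astype(np.uint8))
    # minAreaRect sweeps rotation angles and keeps the smallest enclosing rectangle.
    return cv2.minAreaRect(points)
```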
3. The model-free grabbing planning system based on depth vision of claim 2, wherein the image-based pose generation module further comprises:
the grabbing pose generation module is used for acquiring the coordinates, in the pixel coordinate system, of the center point of the minimum circumscribed rectangle, calculating the coordinates, in the pixel coordinate system, of the two endpoints at which the perpendicular from the center point to the two short sides meets the minimum circumscribed rectangle, obtaining the depth values of the center point and of the two endpoints from the depth map, converting the pixel coordinates and depth value of the center point into coordinates of the center point in the camera coordinate system through the camera intrinsic parameters, converting the pixel coordinates and depth values of the two endpoints into coordinates of the two endpoints in the camera coordinate system through the camera intrinsic parameters, calculating the included angle between the X axis and the projection, onto the X-O-Y plane, of the vector formed by the two endpoints to obtain the rotation angle of the grabbing pose about the Z axis of the world coordinate system, calculating the coordinates of the grabbing center in the world coordinate system from the coordinates of the center point in the camera coordinate system, and forming the grabbing pose from the coordinates of the grabbing center in the world coordinate system and the rotation angle of the grabbing pose about the Z axis of the world coordinate system.
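A minimal sketch of the back-projection and grasp-angle computation in claim 3 is given below, assuming a pinhole camera model with intrinsics fx, fy, cx, cy and depth values in meters; the camera-to-world transform is omitted for brevity:

```python
import numpy as np

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with its depth value into the camera coordinate system."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def grasp_rotation_about_z(endpoint1_cam, endpoint2_cam):
    """Angle between the X axis and the projection, onto the X-O-Y plane,
    of the vector joining the two short-side endpoints."""
    v = endpoint2_cam - endpoint1_cam
    return float(np.arctan2(v[1], v[0]))
```

The grasp center in the world coordinate system would then be obtained by applying the camera extrinsics (e.g. a hand-eye calibration result) to the back-projected center point, which is not shown here.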
4. The model-free grabbing planning system based on depth vision of claim 3, wherein the image-based pose generation module further comprises:
the pixel segmentation module is used for representing the RGB image in an RGB color space, eliminating the desktop pixel area in the RGB color space and retaining the object pixel area in the RGB color space; or converting the RGB color space into an HSV color space, eliminating the desktop pixel area in the HSV color space and retaining the object pixel area in the HSV color space.
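A possible HSV-based pixel segmentation corresponding to claim 4 is sketched below with OpenCV; the desktop HSV bounds are tuning parameters and are assumptions here:

```python
import cv2
import numpy as np

def remove_desktop_pixels(bgr_image, desktop_hsv_low, desktop_hsv_high):
    """Convert the image to HSV, mask out pixels inside the desktop's HSV
    range, and keep the remaining object pixels."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    desktop_mask = cv2.inRange(hsv, np.array(desktop_hsv_low), np.array(desktop_hsv_high))
    object_mask = cv2.bitwise_not(desktop_mask)
    return cv2.bitwise_and(bgr_image, bgr_image, mask=object_mask)
```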
5. The model-free grabbing planning system based on depth vision of claim 1, wherein the point cloud based pose generation module comprises:
the point cloud segmentation module is used for equally dividing the three-dimensional space formed by the point cloud into a plurality of cubes and retaining the center point of each cube to obtain a down-sampled point cloud, randomly selecting N points from the down-sampled point cloud as inlier points and fitting the inlier points to an initial plane, traversing all outlier points other than the inlier points in the down-sampled point cloud and adding an outlier point to the inlier point set if its distance to the initial plane is smaller than a threshold value T, and, after multiple iterations, taking the plane point cloud corresponding to the largest inlier point set as the largest plane point cloud and eliminating it, the remaining outlier points being the object point cloud.
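For illustration, the down-sampling and RANSAC plane removal of claim 5 can be approximated with Open3D as follows; the voxel size, distance threshold T and iteration count are assumed values, and Open3D keeps the centroid of each voxel rather than its geometric center:

```python
import open3d as o3d

def remove_desktop_plane(pcd, voxel_size=0.005, dist_threshold=0.01, iterations=1000):
    """Down-sample the cloud, fit the dominant plane with RANSAC, and return
    the remaining (outlier) points, which correspond to the objects."""
    down = pcd.voxel_down_sample(voxel_size=voxel_size)
    plane_model, inlier_idx = down.segment_plane(
        distance_threshold=dist_threshold, ransac_n=3, num_iterations=iterations)
    return down.select_by_index(inlier_idx, invert=True)
```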
6. The model-free grabbing planning system based on depth vision of claim 5, wherein the point cloud based pose generation module further comprises:
the minimum outsourcing rectangular box forming module is used for dividing the object point cloud into a plurality of independent point clouds through Euclidean clustering, each independent point cloud corresponding to one object, calculating, for each independent point cloud, the coordinate mean value and the covariance matrix of the point cloud data, wherein the coordinate mean value is the centroid of the point cloud, forming a rotation transformation matrix from the eigenvectors of the covariance matrix, mapping the point cloud data into the coordinate system defined by the rotation transformation matrix and the translation vector corresponding to the centroid, and generating an OBB rectangular box as the minimum outsourcing rectangular box.
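An illustrative sketch of the clustering and OBB construction in claim 6 follows; DBSCAN with a small radius is used here as a stand-in for Euclidean clustering, and the eps and min_points values are assumptions:

```python
import numpy as np
import open3d as o3d

def cluster_and_obb(object_pcd, eps=0.02, min_points=50):
    """Split the object point cloud into independent clusters and compute, for
    each, a PCA-based oriented bounding box (centroid, axes, extents)."""
    labels = np.array(object_pcd.cluster_dbscan(eps=eps, min_points=min_points))
    pts = np.asarray(object_pcd.points)
    boxes = []
    for k in range(labels.max() + 1):
        cluster = pts[labels == k]
        centroid = cluster.mean(axis=0)                  # coordinate mean = centroid
        cov = np.cov((cluster - centroid).T)             # covariance matrix
        _, eigvecs = np.linalg.eigh(cov)                 # eigenvectors form the rotation
        local = (cluster - centroid) @ eigvecs           # map into the principal-axis frame
        extents = local.max(axis=0) - local.min(axis=0)  # OBB side lengths
        boxes.append((centroid, eigvecs, extents))
    return boxes
```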
7. The model-free grabbing planning system based on depth vision of claim 6, wherein the point cloud based pose generation module further comprises:
the grabbing pose generation module is used for combining the minimum outsourcing rectangular box with the camera parameter information, solving the rotation matrix from the OBB principal-axis coordinate system to the camera coordinate system, converting this rotation matrix and the translation vector given by the centroid coordinates of the point cloud into a rotation matrix and a translation vector in the world coordinate system, and combining the Euler angles solved from the rotation matrix in the world coordinate system with the translation vector in the world coordinate system to form the grabbing pose.
8. A model-free grabbing planning system based on depth vision, comprising: an image acquisition module, a pose generation module and a track planning module;
the image acquisition module is used for acquiring an RGB image and a depth image of an object to be grabbed and of the desktop where the object to be grabbed is located;
the pose generation module is used for acquiring an RGB image and a depth image, removing desktop pixels in the RGB image to obtain a remaining pixel area, calculating the minimum circumscribed rectangle of the remaining pixel area to obtain a 2D pixel position of an object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under a world coordinate system by combining camera parameter information;
And the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
9. A model-free grabbing planning system, comprising: an image acquisition module, a pose generation module and a track planning module;
the image acquisition module is used for acquiring a point cloud of an object to be grabbed and of the desktop where the object to be grabbed is located;
the pose generation module is used for acquiring point clouds, eliminating desktop point clouds, clustering the rest object point clouds to form independent point clouds, calculating a minimum outsourcing rectangular box of the independent point clouds, and generating a grabbing pose under a world coordinate system through the minimum outsourcing rectangular box and camera parameter information;
and the track planning module is used for controlling the mechanical arm to move to the grabbing pose and executing grabbing operation.
10. A model-free grabbing planning method based on depth vision, characterized by comprising the following steps:
collecting an RGB image, a depth image and a point cloud of an object to be grabbed and of the desktop where the object is located;
when the HSV color space of the object is not in the HSV color space range of the desktop, eliminating desktop pixels in the RGB image to obtain a remaining pixel area, calculating the minimum circumscribed rectangle of the remaining pixel area to obtain the 2D pixel position of the object to be grabbed in the RGB image, mapping the 2D pixel position into the depth image, and generating a grabbing pose under the world coordinate system by combining camera intrinsic and extrinsic parameter information;
when the HSV color space of the object is in the HSV color space range of the desktop, eliminating the desktop point cloud, clustering the remaining object point clouds to form independent point clouds, calculating the minimum outsourcing rectangular box of each independent point cloud, and generating a grabbing pose under the world coordinate system through the minimum outsourcing rectangular box and the camera intrinsic and extrinsic parameter information;
and controlling the mechanical arm to move to the grabbing pose and perform the grabbing operation.
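Finally, an illustrative end-to-end sketch of the method of claim 10, reusing the hypothetical helper functions from the earlier sketches; the arm interface and the image-branch helper pixel_rect_to_world are placeholders, not disclosed APIs:

```python
import numpy as np

def model_free_grasp_planning(rgb, depth, pcd, desktop_hsv_low, desktop_hsv_high,
                              pixel_rect_to_world, T_world_cam, arm):
    """Route between the image-based and point-cloud-based pipelines by color
    similarity, then move the arm to the grasp pose and grasp (sketch only)."""
    objects_bgr = remove_desktop_pixels(rgb, desktop_hsv_low, desktop_hsv_high)
    object_pixels = objects_bgr[np.any(objects_bgr > 0, axis=2)]
    if object_hsv_within_desktop_range(object_pixels, desktop_hsv_low, desktop_hsv_high):
        # Colors similar to the desktop: rely on geometry. Remove the plane,
        # cluster, and derive the pose from the first cluster's OBB.
        centroid, axes, _ = cluster_and_obb(remove_desktop_plane(pcd))[0]
        position, euler = obb_to_grasp_pose(axes, centroid, T_world_cam)
    else:
        # Colors distinct: RGB-D branch. Fit the minimum circumscribed rectangle
        # and back-project its center and endpoints into the world frame.
        rect = min_circumscribed_rect(objects_bgr)
        position, euler = pixel_rect_to_world(rect, depth)  # hypothetical helper
    arm.move_to(position, euler)   # trajectory planning to the grasp pose
    arm.close_gripper()            # execute the grabbing operation
```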
CN202310480343.1A 2023-04-28 2023-04-28 Model-free grabbing planning method and system based on depth vision Pending CN116277030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310480343.1A CN116277030A (en) 2023-04-28 2023-04-28 Model-free grabbing planning method and system based on depth vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310480343.1A CN116277030A (en) 2023-04-28 2023-04-28 Model-free grabbing planning method and system based on depth vision

Publications (1)

Publication Number Publication Date
CN116277030A true CN116277030A (en) 2023-06-23

Family

ID=86797966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310480343.1A Pending CN116277030A (en) 2023-04-28 2023-04-28 Model-free grabbing planning method and system based on depth vision

Country Status (1)

Country Link
CN (1) CN116277030A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422763A (en) * 2023-12-19 2024-01-19 商飞智能技术有限公司 Method and device for positioning polishing area and planning polishing track on surface of die
CN117422763B (en) * 2023-12-19 2024-05-31 商飞智能技术有限公司 Method and device for positioning polishing area and planning polishing track on surface of die

Similar Documents

Publication Publication Date Title
CN112070818B (en) Robot disordered grabbing method and system based on machine vision and storage medium
JP6681729B2 (en) Method for determining 3D pose of object and 3D location of landmark point of object, and system for determining 3D pose of object and 3D location of landmark of object
CN107953329B (en) Object recognition and attitude estimation method and device and mechanical arm grabbing system
TWI394087B (en) Method and apparatus for tracking target object
CN110349207B (en) Visual positioning method in complex environment
TWI395145B (en) Hand gesture recognition system and method
CN104463108A (en) Monocular real-time target recognition and pose measurement method
CN108010036A (en) A kind of object symmetry axis detection method based on RGB-D cameras
CN108573221A (en) A kind of robot target part conspicuousness detection method of view-based access control model
JP2011198349A (en) Method and apparatus for processing information
CN103268480A (en) System and method for visual tracking
CN108573231B (en) Human body behavior identification method of depth motion map generated based on motion history point cloud
CN111598172B (en) Dynamic target grabbing gesture rapid detection method based on heterogeneous depth network fusion
CN101945257A (en) Synthesis method for extracting chassis image of vehicle based on monitoring video content
Guo Research of hand positioning and gesture recognition based on binocular vision
CN115816460B (en) Mechanical arm grabbing method based on deep learning target detection and image segmentation
CN112560704B (en) Visual identification method and system for multi-feature fusion
CN105975906B (en) A kind of PCA static gesture identification methods based on area features
CN116277030A (en) Model-free grabbing planning method and system based on depth vision
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
CN110021029A (en) A kind of real-time dynamic registration method and storage medium suitable for RGBD-SLAM
Changhui et al. Overlapped fruit recognition for citrus harvesting robot in natural scenes
CN113723389B (en) Pillar insulator positioning method and device
Han et al. Target positioning method in binocular vision manipulator control based on improved canny operator
CN113724329A (en) Object attitude estimation method, system and medium fusing plane and stereo information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination