CN115861999A - Robot grabbing detection method based on multi-mode visual information fusion - Google Patents

Robot grabbing detection method based on multi-mode visual information fusion

Info

Publication number
CN115861999A
Authority
CN
China
Prior art keywords
point cloud
grabbing
target
point
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211212605.8A
Other languages
Chinese (zh)
Inventor
高剑
郭靖伟
陈依民
李宇丰
张昊哲
杨旭博
张福斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211212605.8A
Publication of CN115861999A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a robot grabbing detection method based on multi-modal visual information fusion, which comprises the following steps: acquiring an RGB image and a depth image with a depth camera; detecting the circumscribed rectangular frame of the target object with the deep learning YOLO network; obtaining the bounding box of the target object by combining the depth image; segmenting and extracting the point cloud of the target object; processing the point cloud by successively performing downsampling, point cloud filtering and point cloud clustering; computing the centroid of the target point cloud and its principal component directions with the PCA algorithm; randomly and uniformly sampling the target point cloud to generate candidate grabbing poses; encoding the point cloud inside each grasp candidate into a multi-channel image and predicting its score with a convolutional neural network; and fusing global and local point cloud information to select, by weighted summation, the highest-quality grabbing pose as the execution pose. The method makes full use of the color image, the depth image and the global and local information of the target object point cloud, and improves the interaction capability of the mechanical arm with its environment.

Description

Robot grabbing detection method based on multi-mode visual information fusion
Technical Field
The invention belongs to the technical field of robot target object grabbing, and particularly relates to a grabbing attitude detection method based on image and point cloud visual information fusion.
Background
With the development of artificial intelligence, robots are widely used in industry and in the service sector. Robotic grabbing plays an important role because it relieves humans of a variety of repetitive pick-and-place tasks. Compared with conventional schemes that grab a fixed object or grab at a fixed position, enabling a robot to autonomously and accurately grab a specified object is of great significance. The key challenges of autonomous operation are knowing which objects are present in the scene and detecting grabbing poses from different kinds of visual information. Therefore, target recognition and grasp pose estimation are active research topics in the field of robotics.
Deep learning is an important technical approach to target recognition and grasp pose estimation. YOLO, a real-time single-stage object detection network, is widely used in robot vision. For 6-DoF grasp pose detection, traditional methods generate grasp poses from the object pose of a known CAD model, but an accurate model of the object is difficult to obtain in practice. Current model-free grasp pose detection pipelines first sample grasp candidates and then evaluate them. For example, point-cloud-based grasp pose generation networks such as the one in (A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, "Grasp Pose Detection in Point Clouds," The International Journal of Robotics Research (IJRR), 2017) judge grasp quality directly from the point cloud enclosed by the gripper. This greatly reduces the amount of data and generalizes well, but because the global point cloud information of the object is not fully exploited, stable grasps cannot always be generated, and because the category of the object to be grasped is unknown, the method cannot be directly applied to complex scenes. Fused extraction of image and point cloud visual features is therefore key to autonomous robot operation.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a robot grabbing detection method based on multi-modal visual information fusion that can generate stable grabbing poses for a specified target object in a complex environment, makes full use of the color image, the depth image and the global and local information of the target object point cloud, and improves the interaction capability of the mechanical arm with its environment.
Technical scheme
A robot grabbing detection method based on multi-modal visual information fusion is characterized by comprising the following steps:
step 1: acquiring an RGB image and a depth image containing a target object through a depth camera;
step 2: detecting a circumscribed rectangular frame of the target object in the RGB image by using a deep learning YOLO network;
step 3: acquiring the three-dimensional coordinates of the bounding box of the target object in the camera coordinate system by using the depth information corresponding to the rectangular frame;
step 4: segmenting and extracting a point cloud set of the target object according to the corner point information of the target object bounding box;
step 5: performing voxel downsampling on the point cloud obtained by segmentation, filtering the point cloud, and clustering the point cloud to obtain an ideal point cloud of the target object;
step 6: calculating the centroid of the target ideal point cloud and calculating the normal vector direction of the point cloud by using the PCA (principal component analysis) algorithm as the reference direction of the grabbing approach, the centroid and this reference direction serving as the global information of the target object point cloud;
step 7: randomly and uniformly sampling the target point cloud to generate candidate grabbing poses, and expanding them into more grabbing poses through rotation and translation;
step 8: encoding the point cloud inside each grasp candidate into a compressed multi-channel image, and calculating the score of each grasp candidate by using a convolutional neural network;
step 9: calculating the Euclidean distance between the center point of each grasp candidate and the centroid of the object point cloud and the included angle between the grabbing approach direction and the grabbing approach reference direction, and selecting the highest-quality grasp as the pose for the robot to grab the target by calculating the weighted sum of the grasp candidate score, the Euclidean distance score and the included angle score, thereby effectively fusing the global and local point cloud information of the target object.
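To make the data flow across these nine steps concrete, the sketch below strings them together as a single Python driver; every helper name is hypothetical glue for illustration, not an API from the patent or from any specific library:

```python
def detect_grasp_pose(rgb, depth, camera_intrinsics):
    """Illustrative end-to-end flow of the nine steps (all helpers are hypothetical)."""
    box_2d = detect_target_box_yolo(rgb)                              # step 2: 2-D rectangle + class
    bbox_3d = lift_box_with_depth(box_2d, depth, camera_intrinsics)   # step 3: 3-D bounding box
    cloud = extract_target_cloud(depth, camera_intrinsics, bbox_3d)   # step 4: segment target points
    cloud = downsample_filter_cluster(cloud)                          # step 5: ideal target cloud
    centroid, approach_ref = global_cloud_info(cloud)                 # step 6: global information
    candidates = sample_grasp_candidates(cloud)                       # step 7: candidate poses
    scores = [cnn_score(encode_grasp_images(cloud, g)) for g in candidates]  # step 8: CNN scoring
    return select_best_grasp(candidates, scores, centroid, approach_ref)     # step 9: fused ranking
```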
The invention further adopts the technical scheme that: the camera in the step 1 is a depth camera or a binocular camera, and can be fixed at the tail end of the mechanical arm to move along with the mechanical arm or can be completely fixed.
The further technical scheme of the invention is as follows: the deep learning YOLO network in step 2 is obtained by training a YOLO target object detection model using, as labels, the object class of each object rectangular frame in the RGB image together with the ratios of the box center coordinates and of the box width and height to the image width and height; the pixel coordinates corresponding to the corner points of the target bounding box are obtained from its output.
The further technical scheme of the invention is as follows: step 3 specifically comprises: based on the target rectangular frame in the RGB image, the geometric information of the bounding box of the target object in the camera coordinate system and its corner coordinates in the camera coordinate system are calculated by fusing the maximum and minimum values of the depth information of the region of the depth image where the rectangular frame is located.
The further technical scheme of the invention is as follows: the point cloud segmentation and extraction in step 4 uses the bounding box information, i.e. the coordinate range of the target object along the x, y and z axes of the camera coordinate system, enlarged by an amplification factor λ so that all points of the target object are retained; a point cloud pass-through filter is then used to segment and extract the target point cloud.
The further technical scheme of the invention is as follows: the processing of the segmented point cloud in step 5 comprises a point cloud voxel downsampling operation to reduce the data volume of the point cloud, point cloud filtering, which may use radius filtering or statistical filtering, to remove outliers, and a point cloud clustering operation that keeps the largest cluster as the target object point cloud.
The further technical scheme of the invention is as follows: the global information of the target object point cloud C in step 6 comprises the centroid $P_{cen}(x_{cen}, y_{cen}, z_{cen})$ and the grabbing approach direction $v_3$, where the point cloud C contains n points and is represented as:

$C = (c_1, c_2, \ldots, c_n)$

The centroid $P_{cen}$ is the mean of the coordinates of all points in the point cloud set:

$P_{cen} = \frac{1}{n} \sum_{i=1}^{n} c_i$

The grabbing approach direction is obtained by computing, with the PCA algorithm, the eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of the point cloud covariance matrix S and the corresponding eigenvectors $v_1, v_2, v_3$, where $\lambda_1 \ge \lambda_2 \ge \lambda_3$; the direction of minimum variance is the normal vector direction of the target point cloud and serves as the reference direction for approaching the object, i.e. $v_3$. The point cloud covariance is computed as:

$S = \frac{1}{n} \sum_{i=1}^{n} (c_i - P_{cen})(c_i - P_{cen})^{T}$

The further technical scheme of the invention is as follows: in step 7, the target point cloud is randomly sampled to establish grabbing coordinate systems; each sampling point is taken as a sphere center, its nearest surrounding points form the local point cloud set of that sampling point, and the PCA algorithm is used to compute the homogeneous coordinate matrix of this point cloud set, thereby establishing the grabbing coordinate system of the point;

to generate more grabbing candidate poses, each grasp is rotated about the z axis and translated along the y axis of its current grabbing coordinate system, and is also translated along the x axis until the x coordinate reaches the critical value at which the grasp pose does not collide with the target point cloud, thereby expanding the set of candidate poses;

a grabbing candidate pose must satisfy two conditions:

(1) the region occupied by the grasp pose itself does not contain the target point cloud, i.e. the jaws of the manipulator do not collide with the target point cloud;

(2) the closed region of the grasp pose contains target points, i.e. part of the target point cloud lies between the jaws;

the generated grasp candidate poses are screened according to these conditions and invalid candidates are removed.

The further technical scheme of the invention is as follows: the input of the convolutional neural network used to predict the grasp candidate score in step 8 is obtained by projecting the point cloud inside each generated valid grasp candidate along the three coordinate axes of its grabbing coordinate system; along each direction the points are compressed into a depth map according to their coordinate values, i.e. only the depth information along the current coordinate axis is retained, and at the same time the normal vector of each point of the point cloud surface in that direction is mapped into an RGB-style normal feature map; the generated depth feature maps and normal feature maps extract the local surface features and geometric features of the point cloud and together serve as the input of the convolutional neural network.

The further technical scheme of the invention is as follows: the Euclidean distance d between the grasp candidate center point $p_g(x_g, y_g, z_g)$ and the object point cloud centroid $p_{cen}(x_{cen}, y_{cen}, z_{cen})$ in step 9 is computed as:

$d = \sqrt{(x_g - x_{cen})^2 + (y_g - y_{cen})^2 + (z_g - z_{cen})^2}$

where $d_{max}$ is a reference value for the maximum distance between a grasp candidate center point and the object point cloud centroid;

the included angle between the grabbing approach direction and the grabbing approach reference direction in step 9 is computed from the grabbing approach direction vector $v_{approach}(x_a, y_a, z_a)$ and the reference direction vector $v_3(x_b, y_b, z_b)$, both unit vectors:

$\cos\alpha = |v_{approach} \cdot v_3| = |x_a x_b + y_a y_b + z_a z_b|$

$\alpha = \arccos\left(|v_{approach} \cdot v_3|\right)$

so that α lies in the range [0°, 90°];

the final grasp quality score S in step 9 is obtained by normalization followed by weighting:

[equation image: S combines the normalized network score $g/g_{max}$, the normalized distance $d/d_{max}$ and the normalized angle $\alpha/90°$ with the weights γ, λ and η]

where γ, λ and η are the weighting coefficients of the network score, the distance and the angle respectively, each in the range [0, 1), and $g_{max}$ is the maximum reference value of the score predicted by the grasp network.
Advantageous effects
According to the robot grabbing detection method based on multi-mode visual information fusion, stable grabbing posture generation of a specified target object in a complex environment can be achieved, the color image, the depth image and global and local information of a target object point cloud are fully utilized, and the interaction capacity of a mechanical arm and the environment is improved.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a general flow chart of the disclosed method.
FIG. 2 is a diagram showing the results of detection of a YOLO target.
Fig. 3 is a schematic diagram of a PCA algorithm for calculating a grabbing coordinate system of a target point cloud.
Fig. 4 is a schematic diagram of a grabbing coordinate system of the sampling points.
Fig. 5 is a point cloud (characteristic) code diagram inside the clamping jaw: (a) a depth profile; and (b) a normal characteristic diagram.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, which is an overall flow chart in the method of the present invention, the present invention discloses a robot grasping detection method based on multi-modal visual information fusion, which mainly comprises nine steps: step 1, acquiring an RGB image and a depth image through a depth camera; step 2, detecting an external rectangular frame of the target object by utilizing the deep learning YOLO; step 3, acquiring a bounding box of the target object by combining the depth image; step 4, segmenting and extracting point clouds of the target object; step 5, processing the point cloud, and successively performing downsampling, point cloud filtering and point cloud clustering operations; step 6, calculating the mass center of the target point cloud and calculating the principal component direction by utilizing a PCA algorithm; step 7, randomly and uniformly sampling the target point cloud to generate candidate grabbing postures; step 8, encoding the grasping candidate internal point cloud into a multi-channel image, and predicting the score by using a convolutional neural network; and 9, fusing the global point cloud information and the local point cloud information, and selecting the grabbing attitude with the highest quality as an execution pose through weighted summation.
The implementation of the invention requires a depth camera or a binocular camera and a GPU; the specific implementation uses a notebook with a GeForce 1060 GPU and a RealSense D435i depth camera.
The method disclosed by the invention specifically comprises the following steps:
Step 1: acquire an RGB image and a depth image containing the target object with a depth camera; the camera is a depth camera or a binocular camera, and it can either be fixed at the end of the mechanical arm to move with the arm or be completely fixed.
Step 2: based on the RGB image, detect the target object with the deep learning YOLO detector, as shown in FIG. 2, and output the information of the circumscribed rectangular frame of the target object.
Specifically, the YOLO network outputs the ratios $p_x, p_y$ of the horizontal and vertical pixel coordinates of the box center to the width and height of the input image, and the ratios $p_w, p_h$ of the width and height of the rectangular frame to the width and height of the input image, where the input is a fixed-size image of width w and height h. From this information the pixel coordinates of the four corner points of the rectangular frame, $p_{tl}(x_1, y_1)$, $p_{tr}(x_2, y_2)$, $p_{bl}(x_3, y_3)$, $p_{br}(x_4, y_4)$, are calculated as:

$x_1 = (p_x - p_w/2)\,w, \quad y_1 = (p_y - p_h/2)\,h$

$x_2 = (p_x + p_w/2)\,w, \quad y_2 = (p_y - p_h/2)\,h$

$x_3 = (p_x - p_w/2)\,w, \quad y_3 = (p_y + p_h/2)\,h$

$x_4 = (p_x + p_w/2)\,w, \quad y_4 = (p_y + p_h/2)\,h$
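A minimal Python sketch of this corner decoding, assuming the normalized YOLO outputs px, py, pw, ph and a fixed input size w x h as described above (the function name is illustrative):

```python
def yolo_box_to_corners(px, py, pw, ph, w, h):
    """Convert a normalized YOLO box (center/size ratios) to pixel corner coordinates.

    px, py: box center divided by image width/height
    pw, ph: box width/height divided by image width/height
    w, h:   input image width and height in pixels
    Returns (top-left, top-right, bottom-left, bottom-right) pixel coordinates.
    """
    x_min = (px - pw / 2.0) * w
    x_max = (px + pw / 2.0) * w
    y_min = (py - ph / 2.0) * h
    y_max = (py + ph / 2.0) * h
    p_tl = (x_min, y_min)
    p_tr = (x_max, y_min)
    p_bl = (x_min, y_max)
    p_br = (x_max, y_max)
    return p_tl, p_tr, p_bl, p_br
```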
step 3, searching the maximum value z of the depth information in the range of the target circumscribed rectangle frame in the depth image max With a minimum value z min Therefore, three-dimensional coordinates of eight corner points of the bounding box of the target object under the camera coordinate system are comprehensively calculated.
Step 4, segmenting and extracting a point cloud set of the target object according to the angular point information of the target object bounding box;
specifically, according to the coordinate range of a target object in a camera coordinate system along three axes of x, y and z, an amplification factor lambda is added to ensure that all point clouds of the target object are reserved, and then a point cloud condition filter is used for segmenting and extracting a target point cloud set.
Step 5: perform voxel downsampling on the point cloud, filter the point cloud to remove outliers, and cluster the point cloud to obtain an ideal point cloud of the target object. The voxel downsampling operation reduces the data volume of the point cloud; the filtering may use radius filtering or statistical filtering to remove outliers; and the clustering operation keeps the largest cluster as the target object point cloud.
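As one possible realization of step 5 (the patent does not name a point cloud library), the open-source Open3D package provides voxel downsampling, statistical outlier removal and DBSCAN clustering; all parameter values below are placeholders:

```python
import numpy as np
import open3d as o3d

def clean_target_cloud(points, voxel=0.005, nb_neighbors=20, std_ratio=2.0,
                       eps=0.02, min_points=10):
    """Voxel downsample, remove outliers, and keep the largest DBSCAN cluster."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd = pcd.voxel_down_sample(voxel_size=voxel)                  # reduce data volume
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=nb_neighbors,
                                            std_ratio=std_ratio)  # drop outliers
    labels = np.asarray(pcd.cluster_dbscan(eps=eps, min_points=min_points))
    if labels.size == 0 or labels.max() < 0:
        return np.asarray(pcd.points)          # no cluster found, return filtered cloud
    largest = np.argmax(np.bincount(labels[labels >= 0]))
    return np.asarray(pcd.points)[labels == largest]
```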
step 6, calculating the centroid of the target ideal point cloud and calculating a homogeneous coordinate matrix of the point cloud by using a PCA (principal component analysis) algorithm, taking the direction with the minimum variance as a target point cloud grabbing approaching direction, and simultaneously establishing a reference coordinate system of the target point cloud as shown in FIG. 3, wherein the centroid and the grabbing approaching direction are taken as global information of the target point cloud;
the global information of the target object point cloud C comprises a centroid P cen (x cen ,y cen ,z cen ) And a grasping approach direction v 3 And wherein the point cloud C contains n points, represented as:
C=(c 1 ,c 2 ,...,c n )
center of mass P cen The calculation formula is the average value of coordinate values of all points in the point cloud set:
Figure BDA0003873423290000075
calculating the characteristic value lambda of the point cloud covariance matrix S through PCA algorithm in the grabbing approaching direction 123 (wherein λ) 1 ≥λ 2 ≥λ 3 ) And corresponding feature vector v 1 ,v 2 ,v 3 And (4) obtaining. The direction with the minimum variance is the normal vector direction of the target point cloud and is used as the reference direction (namely v) for capturing the approaching object 3 ). The point cloud covariance calculation method comprises the following steps:
Figure BDA0003873423290000081
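A compact NumPy sketch of this global-information computation, returning the centroid and the minimum-variance eigenvector that serves as the approach reference direction:

```python
import numpy as np

def global_cloud_info(points):
    """Return the centroid and the approach reference direction v3 of a cloud.

    points: (n, 3) target point cloud C = (c_1, ..., c_n)
    v3 is the eigenvector of the covariance matrix with the smallest
    eigenvalue, i.e. the minimum-variance (normal-like) direction.
    """
    centroid = points.mean(axis=0)                      # P_cen
    centered = points - centroid
    cov = centered.T @ centered / points.shape[0]       # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    v3 = eigvecs[:, 0]                                  # smallest-variance direction
    return centroid, v3 / np.linalg.norm(v3)
```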
and 7, randomly and uniformly sampling the target point cloud to generate candidate grabbing postures, and expanding more grabbing postures through rotation and translation.
Each sampling point is taken as a sphere center, its nearest surrounding points form the local point cloud set of the sampling point, and the PCA algorithm is used to compute the homogeneous coordinate matrix of this point cloud set, thereby establishing the grabbing coordinate system of the point.

Specifically, each sampling point $p_{sample}(x_s, y_s, z_s)$ is taken as the sphere center and its nearest surrounding points form the local point cloud set $C_{sample}$; the PCA algorithm is applied to the covariance of this local point cloud set to obtain the eigenvalues $\lambda_{s1}, \lambda_{s2}, \lambda_{s3}$ and eigenvectors $v_{s1}, v_{s2}, v_{s3}$, from which the grabbing coordinate system $g(p_{sample}, v_{s1}, v_{s2}, v_{s3})$ of the point is established, as shown in FIG. 4.
To generate more grabbing candidate poses, each grasp is rotated about the z axis and translated along the y axis of its current grabbing coordinate system, and is also continuously translated along the x axis until the x coordinate reaches the critical value at which the grasp pose does not collide with the target point cloud, thereby expanding the set of candidate poses.
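A hedged NumPy sketch of the grasp-frame construction described above, building g(p_sample, v_s1, v_s2, v_s3) at one sampled point from its k nearest neighbours (the value of k and the plain k-nearest-neighbour search are assumptions; the patent only states that the nearest surrounding points are used):

```python
import numpy as np

def local_grasp_frame(points, sample_idx, k=30):
    """Build a grasp coordinate system at one sampled point via local PCA.

    The k nearest neighbours of the sample form its local set C_sample;
    the eigenvectors of their covariance, sorted by decreasing eigenvalue,
    give the axes v_s1, v_s2, v_s3 of the grasp frame.
    Returns a 4x4 homogeneous matrix with the sample point as origin.
    """
    p = points[sample_idx]
    d2 = np.sum((points - p) ** 2, axis=1)
    neigh = points[np.argsort(d2)[:k]]                 # local point set C_sample
    cov = np.cov(neigh.T)
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    axes = eigvecs[:, ::-1]                            # v_s1, v_s2, v_s3 by descending variance
    T = np.eye(4)
    T[:3, :3] = axes
    T[:3, 3] = p
    return T
```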
A grabbing candidate pose must satisfy two conditions:
(1) the region occupied by the grasp pose itself does not contain the target point cloud, i.e. the jaws of the manipulator do not collide with the target point cloud;
(2) the closed region of the grasp pose contains target points, i.e. part of the target point cloud lies between the jaws.
The generated grasp candidate poses are screened according to these conditions and invalid candidates are removed, yielding a grasp candidate set G that contains n grasp poses g.
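The two screening conditions can be sketched as simple box tests in the grasp coordinate frame; the closing-region and jaw dimensions below are illustrative placeholders, not values taken from the patent:

```python
import numpy as np

def is_valid_grasp(points, T_grasp,
                   close_w=0.08, close_h=0.02, close_d=0.04, jaw_t=0.01):
    """Check the two validity conditions for one grasp candidate.

    points:  (N, 3) target point cloud in the camera frame
    T_grasp: 4x4 grasp pose (grasp frame -> camera frame)
    (1) no target point inside either jaw volume  -> no collision
    (2) at least one target point inside the closing region between the jaws
    All gripper dimensions here are assumed example values.
    """
    # express the cloud in the grasp frame
    R, t = T_grasp[:3, :3], T_grasp[:3, 3]
    local = (points - t) @ R
    x, y, z = local[:, 0], local[:, 1], local[:, 2]
    in_depth = (np.abs(x) < close_d / 2) & (np.abs(z) < close_h / 2)
    in_closing = in_depth & (np.abs(y) < close_w / 2)
    in_jaws = in_depth & (np.abs(y) >= close_w / 2) & (np.abs(y) < close_w / 2 + jaw_t)
    return (not in_jaws.any()) and in_closing.any()
```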
The input of the convolutional neural network used to predict the grasp candidate score in step 8 is obtained by projecting the point cloud inside each generated valid grasp candidate along the three coordinate axes of its grabbing coordinate system; along each direction the points are compressed into a depth map according to their coordinate values, i.e. only the depth information along the current coordinate axis is retained, and at the same time the normal vector of each point of the point cloud surface in that direction is mapped into an RGB-style normal feature map. The generated depth feature maps and normal feature maps extract the local surface features and geometric features of the point cloud and together serve as the input of the convolutional neural network, as shown in FIG. 5.
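A simplified sketch of this encoding for a single projection axis (grid resolution, physical extent and the closest-point-per-pixel rule are assumptions; the full method produces such maps along all three axes of the grasp frame):

```python
import numpy as np

def encode_grasp_image(local_pts, local_normals, size=60, extent=0.1):
    """Encode the point cloud inside one grasp candidate as a 4-channel image
    for a single projection axis (here: project along the grasp-frame z axis).

    local_pts:     (N, 3) points expressed in the grasp coordinate frame
    local_normals: (N, 3) unit surface normals of those points
    Returns an array of shape (size, size, 4): depth + normal (nx, ny, nz).
    """
    img = np.zeros((size, size, 4), dtype=np.float32)
    depth = np.full((size, size), np.inf, dtype=np.float32)
    # map x, y coordinates into pixel indices of a size x size grid
    uv = ((local_pts[:, :2] + extent / 2) / extent * (size - 1)).astype(int)
    valid = np.all((uv >= 0) & (uv < size), axis=1)
    for (u, v), z, n in zip(uv[valid], local_pts[valid, 2], local_normals[valid]):
        if z < depth[v, u]:                  # keep the closest point per pixel
            depth[v, u] = z
            img[v, u, 0] = z                 # depth channel
            img[v, u, 1:] = n                # RGB-style normal channels
    return img
```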
The Euclidean distance d between the grasp candidate center point and the object point cloud centroid in step 9 is computed as:

$d = \sqrt{(x_g - x_{cen})^2 + (y_g - y_{cen})^2 + (z_g - z_{cen})^2}$

where $d_{max}$ is a reference value for the maximum distance between a grasp candidate center point and the object point cloud centroid.
The included angle between the grabbing approach direction and the PCA principal direction in step 9 is computed from the grabbing approach direction vector $A(x_a, y_a, z_a)$ and the PCA principal component direction vector $B(x_b, y_b, z_b)$, both unit vectors:

$\cos\alpha = |A \cdot B| = |x_a x_b + y_a y_b + z_a z_b|$

$\alpha = \arccos(|A \cdot B|)$

so that α lies in the range [0°, 90°].
The final grasp quality score S in step 9 is obtained by normalization followed by weighting:

[equation image: S combines the normalized network score $g/g_{max}$, the normalized distance $d/d_{max}$ and the normalized angle $\alpha/90°$ with the weights γ, λ and η]

where γ, λ and η are the weighting coefficients of the network score, the distance and the angle respectively, each in the range [0, 1), and $g_{max}$ is the maximum reference value of the score predicted by the grasp network.
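Because the exact weighting formula appears in the source only as an equation image, the sketch below assumes one natural reading of "normalize then weight", in which a higher network score, a smaller distance to the centroid and a smaller approach angle all increase the fused score S; the default weights are placeholders:

```python
import numpy as np

def fused_grasp_score(g, d, alpha_deg, g_max, d_max,
                      gamma=0.5, lam=0.3, eta=0.2):
    """Fuse the CNN score with the global-cloud terms (assumed formulation).

    g:          network-predicted grasp quality score
    d:          Euclidean distance from grasp center to cloud centroid
    alpha_deg:  angle between approach and reference directions, in [0, 90]
    g_max, d_max: normalization reference values
    gamma, lam, eta: weights for score, distance and angle terms, in [0, 1)
    """
    s_net = np.clip(g / g_max, 0.0, 1.0)
    s_dist = 1.0 - np.clip(d / d_max, 0.0, 1.0)         # closer to centroid is better
    s_ang = 1.0 - np.clip(alpha_deg / 90.0, 0.0, 1.0)   # smaller angle is better
    return gamma * s_net + lam * s_dist + eta * s_ang

# the candidate with the highest fused score is executed, e.g.:
# best = max(candidates, key=lambda c: fused_grasp_score(c.g, c.d, c.alpha, g_max, d_max))
```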
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims (10)

1. A robot grabbing detection method based on multi-modal visual information fusion is characterized by comprising the following steps:
step 1: acquiring an RGB image and a depth image containing a target object through a depth camera;
step 2: detecting a circumscribed rectangular frame of the target object in the RGB image by using a deep learning YOLO network;
step 3: acquiring the three-dimensional coordinates of the bounding box of the target object in the camera coordinate system by using the depth information corresponding to the rectangular frame;
step 4: segmenting and extracting a point cloud set of the target object according to the corner point information of the target object bounding box;
step 5: performing voxel downsampling on the point cloud obtained by segmentation, filtering the point cloud, and clustering the point cloud to obtain an ideal point cloud of the target object;
step 6: calculating the centroid of the target ideal point cloud and calculating the normal vector direction of the point cloud by using the PCA (principal component analysis) algorithm as the reference direction of the grabbing approach, the centroid and this reference direction serving as the global information of the target object point cloud;
step 7: randomly and uniformly sampling the target point cloud to generate candidate grabbing poses, and expanding them into more grabbing poses through rotation and translation;
step 8: encoding the point cloud inside each grasp candidate into a compressed multi-channel image, and calculating the score of each grasp candidate by using a convolutional neural network;
step 9: calculating the Euclidean distance between the center point of each grasp candidate and the centroid of the object point cloud and the included angle between the grabbing approach direction and the grabbing approach reference direction, and selecting the highest-quality grasp as the pose for the robot to grab the target by calculating the weighted sum of the grasp candidate score, the Euclidean distance score and the included angle score, thereby effectively fusing the global and local point cloud information of the target object.
2. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the camera in the step 1 is a depth camera or a binocular camera, and can be fixed at the tail end of the mechanical arm to move along with the mechanical arm or can be completely fixed.
3. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the deep learning YOLO network in step 2 is obtained by training a YOLO target object detection model using, as labels, the object class of each object rectangular frame in the RGB image together with the ratios of the box center coordinates and of the box width and height to the image width and height; the pixel coordinates corresponding to the corner points of the target bounding box are obtained from its output.
4. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the step 3 specifically comprises the following steps: based on a target rectangular frame in the RGB image, the geometric information of the bounding box of the target object in the camera coordinate system and the corner point coordinates in the camera coordinate system are calculated by fusing the maximum value and the minimum value of the depth information of the region where the rectangular frame is located in the depth image.
5. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the point cloud segmentation and extraction in step 4 uses the bounding box information, i.e. the coordinate range of the target object along the x, y and z axes of the camera coordinate system, enlarged by an amplification factor λ so that all points of the target object are retained; a point cloud pass-through filter is then used to segment and extract the target point cloud.
6. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the processing of the segmented point cloud in step 5 comprises a point cloud voxel downsampling operation to reduce the data volume of the point cloud, point cloud filtering, which may use radius filtering or statistical filtering, to remove outliers, and a point cloud clustering operation that keeps the largest cluster as the target object point cloud.
7. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the global information of the target object point cloud C in step 6 comprises the centroid $P_{cen}(x_{cen}, y_{cen}, z_{cen})$ and the grabbing approach direction $v_3$, where the point cloud C contains n points and is represented as:

$C = (c_1, c_2, \ldots, c_n)$

the centroid $P_{cen}$ is the mean of the coordinates of all points in the point cloud set:

$P_{cen} = \frac{1}{n} \sum_{i=1}^{n} c_i$

the grabbing approach direction is obtained by computing, with the PCA algorithm, the eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of the point cloud covariance matrix S and the corresponding eigenvectors $v_1, v_2, v_3$, where $\lambda_1 \ge \lambda_2 \ge \lambda_3$; the direction of minimum variance is the normal vector direction of the target point cloud and serves as the reference direction for approaching the object, i.e. $v_3$; the point cloud covariance is computed as:

$S = \frac{1}{n} \sum_{i=1}^{n} (c_i - P_{cen})(c_i - P_{cen})^{T}$
8. the robot grasping detection method based on the multi-modal visual information fusion as claimed in claim 1, wherein: step 7, randomly sampling the target point cloud to establish a grabbing coordinate system, wherein each sampling point is taken as a sphere center, a plurality of points with the nearest peripheral distance form a point cloud set of the sampling points, and a PCA algorithm is utilized to calculate a homogeneous coordinate matrix of the point cloud set, so that the grabbing coordinate system of the point is established;
in order to generate more grabbing candidate postures, grabbing and rotating around a z axis are performed by adding a translation amount along a y axis of a current grabbing coordinate system, and meanwhile, continuously performing translation along an x axis to enable an x coordinate value to meet a critical value that the grabbing postures do not collide with a target point cloud, so that the grabbing candidate postures are expanded;
grabbing candidate poses requires two conditions to be satisfied:
(1) The position of the grabbing pose is located so that the target point cloud is not contained, namely the clamping jaw of the mechanical arm does not collide with the target point cloud;
(2) Target point clouds are contained in the grabbing pose closed area, namely the target point clouds are contained in the clamping jaw;
and screening the generated capture candidate postures according to the conditions, and removing invalid capture candidates.
9. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the input of the convolutional neural network used to predict the grasp candidate score in step 8 is obtained by projecting the point cloud inside each generated valid grasp candidate along the three coordinate axes of its grabbing coordinate system; along each direction the points are compressed into a depth map according to their coordinate values, i.e. only the depth information along the current coordinate axis is retained, and at the same time the normal vector of each point of the point cloud surface in that direction is mapped into an RGB-style normal feature map; the generated depth feature maps and normal feature maps extract the local surface features and geometric features of the point cloud and together serve as the input of the convolutional neural network.
10. The robot grabbing detection method based on multi-modal visual information fusion according to claim 1, characterized in that: the Euclidean distance d between the grasp candidate center point $p_g(x_g, y_g, z_g)$ and the object point cloud centroid $p_{cen}(x_{cen}, y_{cen}, z_{cen})$ in step 9 is computed as:

$d = \sqrt{(x_g - x_{cen})^2 + (y_g - y_{cen})^2 + (z_g - z_{cen})^2}$

where $d_{max}$ is a reference value for the maximum distance between a grasp candidate center point and the object point cloud centroid;

the included angle between the grabbing approach direction and the grabbing approach reference direction in step 9 is computed from the grabbing approach direction vector $v_{approach}(x_a, y_a, z_a)$ and the reference direction vector $v_3(x_b, y_b, z_b)$, both unit vectors:

$\cos\alpha = |v_{approach} \cdot v_3| = |x_a x_b + y_a y_b + z_a z_b|$

$\alpha = \arccos\left(|v_{approach} \cdot v_3|\right)$

so that α lies in the range [0°, 90°];

the final grasp quality score S in step 9 is obtained by normalization followed by weighting:

[equation image: S combines the normalized network score $g/g_{max}$, the normalized distance $d/d_{max}$ and the normalized angle $\alpha/90°$ with the weights γ, λ and η]

where γ, λ and η are the weighting coefficients of the network score, the distance and the angle respectively, each in the range [0, 1), and $g_{max}$ is the maximum reference value of the score predicted by the grasp network.
CN202211212605.8A 2022-09-30 2022-09-30 Robot grabbing detection method based on multi-mode visual information fusion Pending CN115861999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211212605.8A CN115861999A (en) 2022-09-30 2022-09-30 Robot grabbing detection method based on multi-mode visual information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211212605.8A CN115861999A (en) 2022-09-30 2022-09-30 Robot grabbing detection method based on multi-mode visual information fusion

Publications (1)

Publication Number Publication Date
CN115861999A true CN115861999A (en) 2023-03-28

Family

ID=85661304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211212605.8A Pending CN115861999A (en) 2022-09-30 2022-09-30 Robot grabbing detection method based on multi-mode visual information fusion

Country Status (1)

Country Link
CN (1) CN115861999A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116330306A (en) * 2023-05-31 2023-06-27 之江实验室 Object grabbing method and device, storage medium and electronic equipment
CN116330306B (en) * 2023-05-31 2023-08-15 之江实验室 Object grabbing method and device, storage medium and electronic equipment
CN116572253A (en) * 2023-06-29 2023-08-11 深圳技术大学 Grabbing control method and device for test tube
CN116572253B (en) * 2023-06-29 2024-02-20 深圳技术大学 Grabbing control method and device for test tube
CN116977328A (en) * 2023-09-19 2023-10-31 中科海拓(无锡)科技有限公司 Image quality evaluation method in active vision of vehicle bottom robot
CN116977328B (en) * 2023-09-19 2023-12-19 中科海拓(无锡)科技有限公司 Image quality evaluation method in active vision of vehicle bottom robot


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination