CN115631401A - Robot autonomous grabbing skill learning system and method based on visual perception - Google Patents

Robot autonomous grabbing skill learning system and method based on visual perception

Info

Publication number
CN115631401A
Authority
CN
China
Prior art keywords
robot
grabbing
neural network
network architecture
point
Prior art date
Legal status
Pending
Application number
CN202211652001.5A
Other languages
Chinese (zh)
Inventor
吴鸿敏
鄢武
徐智浩
周雪峰
谷世超
Current Assignee
Institute of Intelligent Manufacturing of Guangdong Academy of Sciences
Original Assignee
Institute of Intelligent Manufacturing of Guangdong Academy of Sciences
Priority date
Filing date
Publication date
Application filed by Institute of Intelligent Manufacturing of Guangdong Academy of Sciences
Priority to CN202211652001.5A priority Critical patent/CN115631401A/en
Publication of CN115631401A publication Critical patent/CN115631401A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a robot autonomous grabbing skill learning system and method based on visual perception. The system comprises a data processing module for acquiring images and marking, in each acquired image, the positions that can be grabbed by the clamp and the positions that cannot, to obtain marked images; a model training module for building a lightweight generative convolutional neural network architecture, performing supervised learning on the marked images with the built architecture to obtain the optimal target grabbing position and posture, and storing the optimal network parameters; a model deployment module for loading the stored optimal network parameters, reading in images acquired by the camera, and running inference on the read images with the built neural network architecture to obtain the robot control quantities; and a motion planning module for planning a collision-free trajectory among the start point, the grabbing point and the end point of the robot according to the robot control quantities. The invention effectively improves the grabbing efficiency of the robot.

Description

Robot autonomous grabbing skill learning system and method based on visual perception
Technical Field
The invention relates to the field of robots, and in particular to a robot autonomous grabbing skill learning system and method based on visual perception.
Background
In recent years, the development of robot technology has gradually been alleviating problems in China such as labor-intensive manual work, an accelerating aging of the population, and enterprises' difficulty in recruiting workers. Robot grabbing, in which target objects are taken one by one out of a pile of unordered objects, is a key link in automation scenarios such as logistics sorting, machine-tool loading and unloading, and palletizing; it reduces the workload of workers, improves working efficiency, and allows 24-hour continuous operation. Robot grabbing mainly comprises three key subtasks: object identification and positioning, grabbing pose generation, and motion planning. These subtasks build on one another layer by layer and constitute the essential process for realizing an autonomous robot grabbing task. The object identification and positioning task takes a picture with a camera and acquires the position information of the target from the picture; the grabbing pose generation task determines the direction and posture of the target in three-dimensional space and then selects the optimal grabbing point, so as to avoid grabbing failures caused by grabbing points that the robot cannot execute; and the motion planning part controls the robot or the end effector to move to the corresponding position while avoiding collisions and singular configurations of the robot joints, completing the grabbing task.
Because autonomous robot grabbing comprises three very challenging subtasks, good results can be obtained only through the cooperation of engineers from different fields. This is inefficient, increases labor costs for the enterprises concerned, and hinders the automation of related products.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a robot autonomous grabbing skill learning system and method based on visual perception, in which the robot grabbing control quantities are obtained directly from a three-dimensional vision picture, effectively improving grabbing efficiency.
To achieve this purpose, the technical scheme of the invention is as follows:
in a first aspect, the present invention provides a vision-aware robot autonomous grasping skill learning system, including:
the data processing module is used for acquiring images and marking positions which can be clamped and grabbed and positions which can not be clamped and grabbed in each acquired image to obtain marked images;
the model training module is used for building a lightweight generative convolutional neural network architecture, performing supervised learning on the marked image by using the built lightweight generative convolutional neural network architecture to obtain the optimal target grabbing position and posture, and storing the optimal network parameters;
the model deployment module is used for loading the stored optimal network parameters, reading in images acquired by the camera, and reasoning the read images by using the built neural network architecture to obtain the robot control quantity;
and the motion planning module is used for planning the collision-free track among the starting point, the grabbing point and the terminal point of the robot according to the control quantity of the robot.
Further, in the data processing module, the image acquisition comprises acquiring pictures published on the internet and pictures shot by a local three-dimensional camera;
the pictures shot by the local three-dimensional camera are acquired in the following manner:
a color image, a depth image and a point cloud image of the object are shot locally against a simple background, and a background image without the target object is retained; the camera viewing angle is fixed while the placement of the objects is changed so that each group of objects is captured in three different postures, the data are enhanced and checked, and pictures that are unfavorable for processing are re-shot and enhanced in time.
Further, the lightweight generative convolutional neural network architecture comprises:
three convolutional layers, respectively: a 9x9 convolutional layer with 32 filters and step size 3; a 5x5 convolutional layer with 32 filters and step size 2; and a 3x3 convolutional layer with 8 filters and step size 2;
three transposed convolutional layers, respectively: a 3x3 transposed convolutional layer with 8 filters and step size 2; a 3x3 transposed convolutional layer with 16 filters and step size 2; and a 9x9 transposed convolutional layer with 32 filters and step size 3.
Further, the lightweight generative convolutional neural network architecture performs a mapping from picture to grab target pose:
M: I → g
where I ∈ R^(H×W) is the picture matrix with H rows and W columns; g = (p, φ, w, q) is the grab pose; p = (u, v) ∈ Z² is the target position, Z being the set of integers; φ is the target attitude angle; w is the opening width of the clamping jaws; and q is the expected probability of success of the current grab.
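As an illustration only (the patent itself defines no data structures), the grab pose produced by this mapping can be read as a small record; the sketch below uses hypothetical names and assumes pixel coordinates and an angle in radians:

```python
from dataclasses import dataclass

@dataclass
class GrabPose:
    """Grab pose g = (p, phi, w, q) predicted from an H x W picture matrix."""
    u: int          # row of the target position p = (u, v), in pixels (integer set Z)
    v: int          # column of the target position, in pixels
    phi: float      # target attitude angle of the end clamp, in radians
    width: float    # opening width of the clamping jaws
    quality: float  # expected probability of success of the current grab, in [0, 1]

# The grab that is finally executed is the candidate with the highest quality value.
best = GrabPose(u=120, v=245, phi=0.35, width=42.0, quality=0.91)
```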
Further, the motion planning module plans collision-free tracks between a starting point, a grabbing point and an end point of the robot according to the robot control quantity and by using a collision detection algorithm based on a model describing the robot and the obstacle by a hierarchical envelope box.
In a second aspect, the invention provides a robot autonomous grasping skill learning method based on visual perception, which includes:
and (3) data processing: collecting images, and marking positions which can be clamped and grabbed and positions which can not be clamped and grabbed in each collected image to obtain marked images;
model training: building a lightweight generative convolutional neural network architecture, performing supervised learning on the marked image by using the built lightweight generative convolutional neural network architecture to obtain the optimal target grabbing position and posture, and storing the optimal network parameters;
model deployment step: loading the stored optimal network parameters, reading in images acquired by a camera, and reasoning the read images by using the built neural network architecture to obtain robot control quantity;
and (3) movement planning step: and planning a collision-free track between the starting point, the grabbing point and the terminal point of the robot according to the control quantity of the robot.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a vision perception robot autonomous grabbing skill learning system and method based on advanced technologies such as artificial intelligence, machine vision, robots and the like, designs a light neural network framework special for complex object grabbing tasks, can directly obtain multiple control quantities required by robot grabbing through three-dimensional vision pictures, and greatly improves grabbing efficiency.
Drawings
Fig. 1 is a general framework diagram of a vision-aware robot autonomous grasping skill learning system provided in embodiment 1 of the present invention;
fig. 2 is a flowchart of a specific working principle of the vision-aware robot autonomous grasping skill learning system according to embodiment 1 of the present invention;
FIG. 3 is a labeling result of positive and negative examples in the data processing module;
FIG. 4 is a schematic diagram of a lightweight convolutional neural network architecture;
FIG. 5 is a result of network identification for deployment;
fig. 6 shows the results of building a hierarchical envelope box and planning a collision-free trajectory.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1:
object pose estimation is a key problem for a robot grabbing task. Compared with the traditional plane vision, the three-dimensional vision integrates an active structured light or grating emitter, is insensitive to the change of illumination, has the same three-dimensional information as the real world due to the imaging characteristic not being a plane any more, has richer characteristics, and does not have the phenomenon of 'big or small in size'. The traditional object identification and pose estimation method mainly extracts artificially designed points, lines, edges and other features, such as algorithms of two-dimensional feature description operators SIFT, SURF, ORB, hough and three-dimensional feature operators FPFH, SHOT and the like. However, the artificial features are easily interfered by dynamic targets and illumination, and still need to be debugged in real scenes, which results in extremely low efficiency and poor universality, and is easily influenced by uncertainties such as external interference, task change, complex object structure, environmental noise, robot errors, sensing errors and the like in an unstructured dynamic environment, and thus, the requirements of practical application are difficult to meet. In recent years, with the deep fusion of deep learning and machine vision, the deep neural network has been used for visual feature extraction and learning, which has better adaptability and generalization capability, and has gained wide attention.
Grabbing pose generation mainly comprises analytical methods and empirical methods (data sampling methods). An analytical method simulates the robot's clamping jaws and the grabbing target with a real physical model or in a simulation environment, analyses the closed grabbing behaviour on the target, and converts it into an optimization problem to be solved. Analytical methods need to establish a grabbing model in advance and determine an optimization objective and an optimization function; the computation is heavy and the up-front workload is large, so they are not suitable for industrial scenes with many target types and fast changeovers. Unlike analytical methods, empirical methods rely on classifying, ranking and then selecting candidate grasps sampled from an image or point cloud according to a particular index. Empirical methods need to process the data in advance and have low search efficiency; in most cases the operations of identifying the target and extracting grasp candidate points are separated, so the computation takes from several seconds to tens of seconds, the efficiency depends on the search algorithm, and the speed is slow. Therefore, the prior art is rarely used for closed-loop grabbing execution; even in a static environment, grabbing can succeed only with accurate camera calibration and accurate robot control, which makes it difficult to popularize and apply.
The robot autonomous grabbing skill learning is an important research aspect crossing the fields of artificial intelligence, machine vision, robots and the like, the robot is enabled to have autonomous grabbing skills through a deep neural network and visual perception, control quantities such as grabbing points, grabbing poses, grabbing angles and the like of objects which are seen or not seen are obtained, grabbing tasks of the robot are standardized and automated, and teaching-free and rapid deployment of a robot system is achieved.
The invention provides a vision-aware robot autonomous grabbing skill learning system based on advanced technologies such as artificial intelligence, machine vision, robots and the like, designs a lightweight neural network framework special for complex object grabbing tasks, can directly obtain multiple control quantities required by robot grabbing through three-dimensional vision pictures, and greatly improves grabbing efficiency.
Specifically, referring to fig. 1, the vision-aware robot autonomous grasping skill learning system provided in this embodiment mainly includes a data processing module, a model training module, a model deployment module, and a motion planning module.
The specific operation principle of each module is described in detail below with reference to fig. 2:
the data processing module is used for collecting images, and marking positions which can be clamped and grabbed and positions which can not be clamped and grabbed in each collected image to obtain marked images.
Because training the deep learning network in the model training step described below requires a large amount of grabbing image data, in this embodiment image acquisition combines pictures published on the internet with pictures taken by a local three-dimensional camera. A color image, a depth image and a point cloud image of the object are shot locally against a simple background, and a background image without the target object is retained to facilitate subsequent background-difference processing for extracting the target. In this embodiment the viewing angle is fixed and the placement of the objects is changed, so that each group of objects is captured in three different postures; the data are then enhanced with image processing techniques and manually checked, and pictures that are unfavorable for processing are re-shot or enhanced in time. As shown in fig. 3, software is used to mark the grabbing positions in the color image and the depth image: rectangular boxes mark the positions of the end clamp, and these positions are divided into two major categories, positions that can be clamped (positive examples) and positions that cannot be clamped (negative examples), each of which is marked separately in every image. The pixel values of the four vertices of each rectangle are stored and written to a text file for use in the subsequent model training; finally, the annotated data are partitioned into a training set and a test set at 80% and 20%, respectively. A rough illustration of this labelling format and split is sketched below.
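The sketch assumes a simple whitespace-separated text format; the patent does not specify the exact file layout, so the field order and helper names here are illustrative:

```python
import random

def write_labels(path, rectangles):
    """Write one line per labelled grasp rectangle: the pixel values of the four
    vertices followed by 1 (can be clamped, positive) or 0 (cannot, negative)."""
    with open(path, "w") as f:
        for vertices, is_positive in rectangles:       # vertices: four (u, v) pixel pairs
            flat = " ".join(f"{u} {v}" for u, v in vertices)
            f.write(f"{flat} {1 if is_positive else 0}\n")

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the annotated samples and split them 80% / 20% into training and test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```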
The model training module is used for building a lightweight generative convolutional neural network architecture, the built lightweight generative convolutional neural network architecture is used for performing supervised learning on the marked image to obtain the optimal target grabbing position and posture, and the optimal network parameters are stored.
Specifically, as shown in fig. 4, the lightweight generative convolutional neural network architecture includes three convolutional layers, respectively: a 9x9 convolutional layer with 32 filters and step size 3; a 5x5 convolutional layer with 32 filters and step size 2; and a 3x3 convolutional layer with 8 filters and step size 2; and three transposed convolutional layers, respectively: a 3x3 transposed convolutional layer with 8 filters and step size 2; a 3x3 transposed convolutional layer with 16 filters and step size 2; and a 9x9 transposed convolutional layer with 32 filters and step size 3. By adopting this generative convolutional neural network architecture, the robot grabbing control quantities can be obtained directly from the depth map, effectively improving grabbing efficiency.
The specific principle is as follows. A grab in the plane is defined as g = (p, φ, w, q), where p = (u, v) represents the grab point of the object in the image, φ represents the rotation angle of the end clamp about the z-axis in the plane, and w represents the opening width of the clamp; because the size of the object is available from the three-dimensional camera, this width information can be obtained directly from the depth map. Finally, q represents the expected probability of success of the current grab. The proposed model architecture uses a deep neural network to accomplish a mapping from a picture to a grab target pose: M: I → g, where I ∈ R^(H×W) is the picture matrix with H rows and W columns; g = (p, φ, w, q) is the grab pose; p = (u, v) ∈ Z² is the target position, Z being the set of integers; φ is the target attitude angle; w is the opening width of the clamping jaws; and q is the expected probability of success of the current grab (the grab quality). The relation formed by the four quantities of the grab pose can also be regarded as a multidimensional matrix, so what the network actually learns is the optimal mapping between the two matrices. The model training module first initializes the model parameters, inputs the labelled picture data, performs multiple training iterations on the network while computing the loss function, adjusts the network weights and learning rate with the back-propagation (BP) algorithm over multiple rounds of training, and finally saves the optimal network parameters according to the training results.
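The layer sequence above maps directly to code; the following PyTorch sketch follows the stated kernel sizes, filter counts and step sizes, while the paddings, the three 1x1 output heads (quality, angle, width) and the 300x300 input resolution mentioned in the comments are assumptions, not details given in the patent:

```python
import torch
import torch.nn as nn

class LightweightGraspNet(nn.Module):
    """Sketch of the lightweight generative CNN: three convolutional layers followed
    by three transposed convolutional layers, with assumed per-pixel output heads."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=9, stride=3, padding=4), nn.ReLU(),   # 9x9, 32 filters, step 3
            nn.Conv2d(32, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),  # 5x5, 32 filters, step 2
            nn.Conv2d(32, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 3x3, 8 filters, step 2
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 8, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # 3x3, 8 filters
            nn.ConvTranspose2d(8, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),  # 3x3, 16 filters
            nn.ConvTranspose2d(16, 32, 9, stride=3, padding=3), nn.ReLU(),                   # 9x9, 32 filters
        )
        # Assumed output heads: one map each for grab quality q, attitude angle phi, jaw width w.
        self.quality = nn.Conv2d(32, 1, kernel_size=1)
        self.angle = nn.Conv2d(32, 1, kernel_size=1)
        self.width = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, depth):                       # depth: (B, 1, H, W) depth image
        feat = self.decoder(self.encoder(depth))    # with a 300x300 input, feat is again 300x300
        return self.quality(feat), self.angle(feat), self.width(feat)

# Training sketch (assumed): supervised regression of the three maps against the labelled grasps,
# e.g. loss = mse(q_pred, q_gt) + mse(phi_pred, phi_gt) + mse(w_pred, w_gt), optimized with BP.
```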
In actual application, the model deployment module directly loads the stored optimal network parameters, reads in the photos collected by the camera, and performs inference with the built lightweight generative convolutional neural network to obtain control quantities such as the robot grabbing point, grabbing pose, grabbing angle and grabbing quality; it then transfers the pixel coordinates into the robot coordinate system according to the hand-eye calibration result and outputs the joint angles through which the robot must rotate to reach the target. The recognition results of the deployed network are shown in fig. 5.
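A hedged sketch of this deployment step, under assumed interfaces: the camera read-out, the intrinsics and the hand-eye matrix are placeholders rather than an API defined by the patent, and the inverse kinematics that yields the joint angles is left to the robot controller; only the forward pass, the best-pixel selection and the pixel-to-base-frame transfer are shown:

```python
import numpy as np
import torch

def select_grab(net, depth_image, T_cam_to_base, intrinsics):
    """depth_image: (H, W) float array from the 3D camera; T_cam_to_base: 4x4
    hand-eye calibration result; intrinsics: (fx, fy, cx, cy)."""
    net.eval()
    with torch.no_grad():
        inp = torch.from_numpy(depth_image).float()[None, None]   # (1, 1, H, W)
        q_map, phi_map, w_map = net(inp)
    q = q_map[0, 0].numpy()
    v, u = np.unravel_index(int(np.argmax(q)), q.shape)            # pixel with highest grab quality
    z = float(depth_image[v, u])                                   # depth at that pixel
    fx, fy, cx, cy = intrinsics
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    p_base = T_cam_to_base @ p_cam                                 # grab point in the robot base frame
    return p_base[:3], float(phi_map[0, 0, v, u]), float(w_map[0, 0, v, u]), float(q[v, u])
```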
For the robot grabbing skill, after the object has been identified and the optimal grabbing point judged, the grabbing point, the placing point and so on still need to be planned so that the robot operates safely. For this purpose, the motion planning module establishes envelopes of the main target objects in the scene according to the robot control quantities, using a collision detection algorithm that describes the robot and the obstacles with a hierarchical envelope box (OBB) model, and plans collision-free trajectories between the start point, the grabbing point and the end point, as shown in fig. 6; this mainly involves collision detection, path search, path smoothing and robot action execution.
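Beyond naming hierarchical envelope boxes (OBB), the patent does not disclose the collision-detection algorithm itself; as one plausible primitive, a standard separating-axis overlap test between two oriented boxes is sketched below (centres, 3x3 rotation matrices and half-extents are assumed inputs), while path search, path smoothing and execution are left out:

```python
import numpy as np

def obb_overlap(c1, R1, e1, c2, R2, e2):
    """Separating-axis test for two oriented bounding boxes given as (centre,
    3x3 rotation matrix whose columns are the box axes, half-extents).
    Returns True if the boxes intersect; used as the primitive of a
    hierarchical envelope-box collision check."""
    c1, c2, e1, e2 = map(np.asarray, (c1, c2, e1, e2))
    R = R1.T @ R2                      # box 2 axes expressed in box 1's frame
    t = R1.T @ (c2 - c1)               # centre offset in box 1's frame
    absR = np.abs(R) + 1e-9            # epsilon guards near-parallel axes
    for i in range(3):                 # test the three axes of box 1
        if abs(t[i]) > e1[i] + e2 @ absR[i]:
            return False
    for j in range(3):                 # test the three axes of box 2
        if abs(t @ R[:, j]) > e1 @ absR[:, j] + e2[j]:
            return False
    for i in range(3):                 # test the nine cross-product axes
        for j in range(3):
            ra = e1[(i + 1) % 3] * absR[(i + 2) % 3, j] + e1[(i + 2) % 3] * absR[(i + 1) % 3, j]
            rb = e2[(j + 1) % 3] * absR[i, (j + 2) % 3] + e2[(j + 2) % 3] * absR[i, (j + 1) % 3]
            if abs(t[(i + 2) % 3] * R[(i + 1) % 3, j] - t[(i + 1) % 3] * R[(i + 2) % 3, j]) > ra + rb:
                return False
    return True

# Example: two unit cubes whose centres are 1.5 apart along x do not intersect.
I3 = np.eye(3)
print(obb_overlap(np.zeros(3), I3, np.full(3, 0.5), np.array([1.5, 0.0, 0.0]), I3, np.full(3, 0.5)))  # False
```

In such a scheme, each robot link and each obstacle would be wrapped in a hierarchy of these boxes, and a candidate trajectory through the start point, grabbing point and end point is accepted only if no box pair overlaps at any sampled configuration.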
In summary, compared with the prior art, the invention has the following technical advantages:
(1) A lightweight generative convolutional neural network architecture dedicated to the autonomous learning of grabbing control quantities for complex objects is provided; taking only a depth image as input, it automatically yields control quantities such as the robot grabbing point, grabbing pose, grabbing angle and grabbing quality, improving robot programming and deployment efficiency;
(2) A vision-perception-based robot autonomous grabbing skill learning system is built, integrating the four key modules of data processing, model training, model deployment and motion planning through advanced technologies such as machine vision and robot skill learning, and meeting the application requirements of complex object grabbing in unstructured environments.
Example 2:
the embodiment provides a robot autonomous grasping skill learning method based on visual perception, which comprises the following steps:
and (3) data processing: collecting images, and marking positions which can be clamped and grabbed and positions which can not be clamped and grabbed in each collected image to obtain marked images;
model training: building a lightweight generative convolutional neural network architecture, performing supervised learning on the marked image by using the built lightweight generative convolutional neural network architecture to obtain the optimal target grabbing position and posture, and storing the optimal network parameters;
model deployment: loading the stored optimal network parameters, reading in images acquired by a camera, and reasoning the read images by using the built neural network architecture to obtain robot control quantity;
and (3) movement planning step: and planning a collision-free track between the starting point, the grabbing point and the end point of the robot according to the control quantity of the robot.
The specific principle and flow of the above steps are the same as the working principle of each module in the above embodiment 1, and are not described again in this embodiment.
The above embodiments are only for illustrating the technical concept and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention accordingly, and not to limit the protection scope of the present invention accordingly. All equivalent changes or modifications made in accordance with the spirit of the present disclosure are intended to be covered by the scope of the present disclosure.

Claims (10)

1. A vision-aware robotic autonomous grasping skill learning system, comprising:
the data processing module is used for acquiring images and marking positions which can be clamped and grabbed and positions which can not be clamped and grabbed in each acquired image to obtain marked images;
the model training module is used for building a lightweight generative convolutional neural network architecture, performing supervised learning on the marked image by using the built lightweight generative convolutional neural network architecture to obtain the optimal target grabbing position and posture, and storing the optimal network parameters;
the model deployment module is used for loading the stored optimal network parameters, reading in images acquired by the camera, and reasoning the read images by using the built neural network architecture to obtain the robot control quantity;
and the motion planning module is used for planning a collision-free track among the starting point, the grabbing point and the terminal point of the robot according to the control quantity of the robot.
2. The vision-aware robotic autonomous grasping skill learning system according to claim 1, wherein in the data processing module, the image acquisition includes acquiring pictures published on the internet and pictures taken by a local three-dimensional camera;
the pictures shot by the local three-dimensional camera are acquired in the following manner:
a color image, a depth image and a point cloud image of the object are shot locally against a simple background, and a background image without the target object is retained; the camera viewing angle is fixed while the placement of the objects is changed so that each group of objects is captured in three different postures, the data are enhanced and checked, and pictures that are unfavorable for processing are re-shot and enhanced in time.
3. The vision-aware robotic autonomous grasping skill learning system of claim 1, wherein the lightweight generative convolutional neural network architecture comprises:
three convolutional layers, respectively: a 9x9 convolutional layer with 32 filters and step size 3; a 5x5 convolutional layer with 32 filters and step size 2; and a 3x3 convolutional layer with 8 filters and step size 2;
three transposed convolutional layers, respectively: a 3x3 transposed convolutional layer with 8 filters and step size 2; a 3x3 transposed convolutional layer with 16 filters and step size 2; and a 9x9 transposed convolutional layer with 32 filters and step size 3.
4. The vision-aware robotic autonomous grasping skill learning system of claim 1 or 3, wherein the lightweight generative convolutional neural network architecture performs a mapping from picture to grab target pose:
M: I → g
where I ∈ R^(H×W) is the picture matrix with H rows and W columns; g = (p, φ, w, q) is the grab pose; p = (u, v) ∈ Z² is the target position, Z being the set of integers; φ is the target attitude angle; w is the opening width of the clamping jaws; and q is the expected probability of success of the current grab.
5. The vision-aware robot autonomous grasping skill learning system according to claim 1, wherein the motion planning module plans collision-free trajectories between a robot start point-grasp point-end point according to robot control quantities and using a collision detection algorithm based on a model describing the robot and the obstacle with a hierarchical envelope box.
6. A robot autonomous grasping skill learning method based on visual perception is characterized by comprising the following steps:
and (3) data processing: collecting images, and marking positions which can be clamped and grabbed and positions which can not be clamped and grabbed in each collected image to obtain marked images;
model training: building a lightweight generative convolutional neural network architecture, performing supervised learning on the marked image by using the built lightweight generative convolutional neural network architecture to obtain the optimal target grabbing position and posture, and storing the optimal network parameters;
model deployment step: loading the stored optimal network parameters, reading in images acquired by a camera, and reasoning the read images by using the built neural network architecture to obtain the robot control quantity;
and (3) movement planning step: and planning a collision-free track between the starting point, the grabbing point and the terminal point of the robot according to the control quantity of the robot.
7. The vision-aware robot autonomous grasping skill learning method according to claim 6, wherein in the data processing step, collecting images comprises acquiring pictures published on the internet and pictures taken by a local three-dimensional camera;
the pictures shot by the local three-dimensional camera are acquired in the following manner:
a color image, a depth image and a point cloud image of the object are shot locally against a simple background, and a background image without the target object is retained; the camera viewing angle is fixed while the placement of the objects is changed so that each group of objects is captured in three different postures, the data are enhanced and checked, and pictures that are unfavorable for processing are re-shot and enhanced in time.
8. The vision-aware robot autonomous grasping skill learning method of claim 6, wherein the lightweight generative convolutional neural network architecture comprises:
three convolutional layers, respectively: a 9x9 convolutional layer with 32 filters and step size 3; a 5x5 convolutional layer with 32 filters and step size 2; and a 3x3 convolutional layer with 8 filters and step size 2;
three transposed convolutional layers, respectively: a 3x3 transposed convolutional layer with 8 filters and step size 2; a 3x3 transposed convolutional layer with 16 filters and step size 2; and a 9x9 transposed convolutional layer with 32 filters and step size 3.
9. The vision-aware robot autonomous grasping skill learning method of claim 6 or 8, wherein the lightweight generative convolutional neural network architecture performs a mapping from picture to grab target pose:
M: I → g
where I ∈ R^(H×W) is the picture matrix with H rows and W columns; g = (p, φ, w, q) is the grab pose; p = (u, v) ∈ Z² is the target position, Z being the set of integers; φ is the target attitude angle; w is the opening width of the clamping jaws; and q is the expected probability of success of the current grab.
10. The vision-aware robot autonomous grasping skill learning method according to claim 6, wherein in the motion planning step, collision-free trajectories between a robot start point, a grasping point, and an end point are planned in accordance with a robot control amount and using a collision detection algorithm based on a model describing a robot and an obstacle by a hierarchical envelope box.
CN202211652001.5A 2022-12-22 2022-12-22 Robot autonomous grabbing skill learning system and method based on visual perception Pending CN115631401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211652001.5A CN115631401A (en) 2022-12-22 2022-12-22 Robot autonomous grabbing skill learning system and method based on visual perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211652001.5A CN115631401A (en) 2022-12-22 2022-12-22 Robot autonomous grabbing skill learning system and method based on visual perception

Publications (1)

Publication Number Publication Date
CN115631401A true CN115631401A (en) 2023-01-20

Family

ID=84910690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211652001.5A Pending CN115631401A (en) 2022-12-22 2022-12-22 Robot autonomous grabbing skill learning system and method based on visual perception

Country Status (1)

Country Link
CN (1) CN115631401A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120294509A1 (en) * 2011-05-16 2012-11-22 Seiko Epson Corporation Robot control system, robot system and program
CN102922521A (en) * 2012-08-07 2013-02-13 中国科学技术大学 Mechanical arm system based on stereo visual serving and real-time calibrating method thereof
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN110334701A (en) * 2019-07-11 2019-10-15 郑州轻工业学院 Collecting method based on deep learning and multi-vision visual under the twin environment of number
CN111360862A (en) * 2020-02-29 2020-07-03 华南理工大学 Method for generating optimal grabbing pose based on convolutional neural network
US20210312629A1 (en) * 2020-04-07 2021-10-07 Shanghai United Imaging Intelligence Co., Ltd. Methods, systems and apparatus for processing medical chest images
CN112365004A (en) * 2020-11-27 2021-02-12 广东省科学院智能制造研究所 Robot autonomous anomaly restoration skill learning method and system
CN114723775A (en) * 2021-01-04 2022-07-08 广州中国科学院先进技术研究所 Robot grabbing system and method based on small sample learning
CN114332209A (en) * 2021-12-30 2022-04-12 华中科技大学 Grabbing pose detection method and device based on lightweight convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA Qianqian et al.: "Research on robot grasping detection based on lightweight convolutional neural networks" (轻量级卷积神经网络的机器人抓取检测研究) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117549317A (en) * 2024-01-12 2024-02-13 深圳威洛博机器人有限公司 Robot grabbing and positioning method and system
CN117549317B (en) * 2024-01-12 2024-04-02 深圳威洛博机器人有限公司 Robot grabbing and positioning method and system

Similar Documents

Publication Publication Date Title
CN114912287B (en) Robot autonomous grabbing simulation system and method based on target 6D pose estimation
Huang et al. A case study of cyber-physical system design: Autonomous pick-and-place robot
CN112621765B (en) Automatic equipment assembly control method and device based on manipulator
CN111331607B (en) Automatic grabbing and stacking method and system based on mechanical arm
JP2022187984A (en) Grasping device using modularized neural network
CN115631401A (en) Robot autonomous grabbing skill learning system and method based on visual perception
CN112947458A (en) Robot accurate grabbing method based on multi-mode information and computer readable medium
CN114131603B (en) Deep reinforcement learning robot grabbing method based on perception enhancement and scene migration
CN113762159B (en) Target grabbing detection method and system based on directional arrow model
JP2022187983A (en) Network modularization to learn high dimensional robot tasks
Liu et al. A novel camera fusion method based on switching scheme and occlusion-aware object detection for real-time robotic grasping
Deng et al. A human–robot collaboration method using a pose estimation network for robot learning of assembly manipulation trajectories from demonstration videos
CN117340929A (en) Flexible clamping jaw grabbing and disposing device and method based on three-dimensional point cloud data
CN112975957A (en) Target extraction method, system, robot and storage medium
Liu et al. Visual servoing with deep learning and data augmentation for robotic manipulation
Sebbata et al. An adaptive robotic grasping with a 2-finger gripper based on deep learning network
Zheng et al. An intelligent robot sorting system by deep learning on RGB-D image
Grün et al. Evaluation of domain randomization techniques for transfer learning
Chowdhury et al. Comparison of neural network-based pose estimation approaches for mobile manipulation
Papon et al. Martian fetch: Finding and retrieving sample-tubes on the surface of mars
Hao et al. Programming by visual demonstration for pick-and-place tasks using robot skills
Fu et al. Robotic arm intelligent grasping system for garbage recycling
TWI788253B (en) Adaptive mobile manipulation apparatus and method
Wang et al. 3D pose estimation for robotic grasping using deep convolution neural network
Sun et al. Precise grabbing of overlapping objects system based on end-to-end deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20230120