CN111360862A - Method for generating optimal grabbing pose based on convolutional neural network - Google Patents

Method for generating optimal grabbing pose based on convolutional neural network

Info

Publication number
CN111360862A
Authority
CN
China
Prior art keywords
neural network
grabbing
network model
pose
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010134999.4A
Other languages
Chinese (zh)
Other versions
CN111360862B (en)
Inventor
庞剑坤
魏武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010134999.4A
Publication of CN111360862A
Application granted
Publication of CN111360862B
Active legal-status Current
Anticipated expiration legal-status

Links

Images

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J15/00 Gripping heads and other end effectors
    • B25J15/08 Gripping heads and other end effectors having finger members
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J19/00 Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 Sensing devices
    • B25J19/021 Optical sensing devices
    • B25J19/023 Optical sensing devices including video camera means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a convolutional neural network-based method for generating an optimal grabbing pose, which comprises the following steps: S1, setting parameters for representing the grabbing quality in the grabbing process; S2, constructing a convolutional neural network model; S3, training the neural network model on the Cornell Grasping dataset; and S4, inputting the object depth map acquired by the camera into the trained neural network model and calculating the grabbing parameters, which are used to drive the mechanical arm to grab. The algorithm for generating the optimal grabbing pose based on the convolutional neural network model can quickly obtain the optimal grabbing pose of an object from only the depth information of the object, is simple, and can be widely popularized in fields such as mechanical arm visual grabbing and dynamic tracking.

Description

Method for generating optimal grabbing pose based on convolutional neural network
Technical Field
The invention relates to the field of mechanical arm visual grabbing, in particular to a method for generating an optimal grabbing pose based on a convolutional neural network.
Background
In recent years, with the rapid development of computer vision, combining a mechanical arm with vision to give it greater environment-perception capability has gradually become a research hotspot. If a mechanical arm is to grab an object, it must first obtain the specific position of the object through a camera (sensor) and then find the optimal grabbing pose for that object through an internal evaluation algorithm. Two processes are involved: confirming the type of the object, and screening out the optimal grabbing pose according to the state of the object. If the computer (algorithm) has not seen such an object before, it is much harder to generate the optimal grabbing pose for the unseen object. To address this problem, the University of California, Berkeley proposed a convolutional neural network algorithm in "Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics". The algorithm achieves a high grabbing success rate on general objects; unfortunately, it contains too many network parameters (on the order of millions), its inference speed is relatively low, and it is difficult to reproduce on ordinary machines, so it faces challenges in practical popularization.
Disclosure of Invention
The invention provides a convolutional neural network-based method for generating an optimal grabbing pose. It mainly addresses the lack, in current grabbing algorithms, of a way to quickly generate the optimal grabbing pose for unseen objects, and realizes a series of work on dataset processing, network training and model optimization. The method can quickly obtain the optimal grabbing pose of an object from only the object's depth information; the model is simple and its training parameters are far fewer than those of other networks. It has excellent generalization capability for recognizing and generating grabbing poses for objects in daily life, including unseen objects, and can be widely popularized in fields such as mechanical arm visual grabbing and dynamic tracking, while the success rate of recognizing and generating the grabbing pose exceeds 90%.
The invention is realized by at least one of the following technical schemes.
A convolutional neural network-based method for generating an optimal grabbing pose comprises the following steps:
S1, setting parameters for representing the grabbing quality in the grabbing process;
S2, constructing a convolutional neural network model;
S3, training the neural network model on the Cornell Grasping dataset;
and S4, inputting the object depth map acquired by the camera into the trained neural network model and calculating the grabbing pose parameters, wherein the grabbing pose parameters are used for driving the mechanical arm to grab.
Further, the parameters in step S1 include G, Q, Φ and W, where G represents the set of grabbing parameters in each grab, with one value at each pixel:
G = (Φ, W, Q) ∈ ℝ^(3×H×W)
for a given 2.5D depth map I ∈ ℝ^(H×W), where H represents the height of the depth map, W represents the width of the depth map, the H and W parameters are obtained from the camera, and ℝ denotes the real-valued space of the corresponding dimension;
Q represents the quality of each grab and is a scalar in (0, 1); the closer Q is to 1, the higher the grabbing quality;
Φ denotes the rotation angle required for the jaw to reach the ideal position during each grab, with Φ ∈ [-π/2, π/2]; the ideal position is the position of the optimal grabbing rectangle set in the dataset, and the rotation angle refers to the rotation of the grabbing rectangle relative to the horizontal;
W indicates the width to which the jaws need to be opened during grabbing to ensure the object is completely grasped.
Further, the Cornell Grasping dataset of step S3 provides 1035 pictures of 280 different objects, each with an RGB image, depth information, and annotations of the best grabbing rectangle used to grab the object, including the size of the rectangle and the three-dimensional position of its center point.
Further, the structure of the neural network model comprises the following network layers: the first layer contains 9 × 9 convolution kernels and 32 filters with a stride of 3; the second layer contains 5 × 5 convolution kernels and 16 filters with a stride of 2; the third layer contains 3 × 3 convolution kernels and 8 filters with a stride of 2; the fourth, fifth and sixth layers are deconvolution layers whose purpose is to keep the resolution of the input and output consistent: the fourth layer is a deconvolution layer containing 3 × 3 convolution kernels and 8 filters with a stride of 2, the fifth layer is a deconvolution layer containing 3 × 3 convolution kernels and 16 filters with a stride of 2, and the sixth layer is a deconvolution layer containing 9 × 9 convolution kernels and 32 filters with a stride of 3.
Further, the loss function of the neural network model adopts the L2 loss, which serves as a measure for evaluating the performance of the network; the neural network is used to approximate the complex mapping M: I → G, and the calculation of the parameters of the neural network model includes:
M(I) = (Q, Φ, W)
G = (Φ, W, Q)
M_θ(I) = (Q_θ, Φ_θ, W_θ) ≈ M(I)
where M(I) denotes the mapping formed by the theoretically optimal grabbing pose parameters for the input depth image I, M_θ(I) denotes the mapping formed by the actual grabbing pose parameters obtained from the neural network model, and Q_θ, Φ_θ, W_θ are estimates of Q_T, Φ_T, W_T, which represent the grabbing parameters of all objects in the whole network; the depth information I_T in the dataset is input into the neural network model for training to obtain the optimal grabbing pose G_T, and thus the loss function is defined as:
λ(G_T, M_θ(I_T)) = ||G_T - M_θ(I_T)||²
Further, the grabbing pose parameters of step S4 include the grabbing quality Q_T, the rotation angle Φ_T and the jaw opening width W_T, which are calculated as follows:
Grabbing quality Q_T: when an object is to be grabbed, the depth information of the object acquired by the Intel RealSense SR300 camera is input into the model trained in step S3 and compared with the information in the model; pixels whose depth information is consistent are set to 1 and inconsistent pixels are set to 0, the numbers of 1s and 0s over all pixels are then counted, and the grabbing quality value Q_T of the object is calculated.
Rotation angle Φ_T: the range of values is [-π/2, π/2], and the unique true value Φ_T is obtained from sin(2Φ_T) and cos(2Φ_T):
Φ_T = (1/2) arctan(sin(2Φ_T) / cos(2Φ_T))
Jaw opening width W_T: obtained by adding 1 cm to 2 cm to the width of the object; the width of the object is obtained from the depth information of the object, which is acquired by the Intel RealSense SR300 camera.
Further, the equation used by the neural network model to calculate the grabbing pose is:
M_θ(I) = (Q_θ, Φ_θ, W_θ)
where I is the input depth image and Q_θ, Φ_θ, W_θ are estimates of Q_T, Φ_T, W_T, which represent the grabbing parameters of all objects in the neural network model.
Further, the training process of the convolutional neural network model comprises the following steps:
(1) initializing a weight value by the convolutional neural network model;
(2) selecting 80% of the Cornell Grasping dataset as the training set of the network model, inputting the depth information data of the training set into the convolutional neural network model, and obtaining an output value through propagation through the convolution layers and deconvolution layers;
(3) solving the error between the output value of the network model and the target value, namely the value of the loss function;
(4) when the error is larger than the expected value, propagating the error back into the network model and obtaining the error of each network layer in turn; the errors of the network layers together make up the total error of the network;
(5) updating the weights according to the obtained errors and returning to step (2); when the error is equal to or less than the expected value, the training ends.
Compared with the prior art, the invention has the advantages and beneficial effects that:
1. The method solves the problem that current grabbing algorithms lack a way to quickly generate the optimal grabbing pose for unseen objects, and realizes a series of work on dataset processing, network training and model optimization. It can quickly obtain the optimal grabbing pose of an object from only the object's depth information; the model is simple, runs efficiently, and its trained parameters are far fewer than those of other networks.
2. For objects in daily life, including unseen objects, the success rate of recognizing and generating the grabbing pose exceeds 90%; the method has excellent generalization capability, is simple to deploy, and can be widely popularized in fields such as mechanical arm visual grabbing and dynamic tracking.
Drawings
FIG. 1 is a schematic diagram of a neural network model structure according to the present embodiment;
FIG. 2 is a flowchart illustrating a method for generating an optimal capture pose based on a convolutional neural network according to an embodiment;
FIG. 3 is a schematic diagram illustrating calculation of the rotation angle and the opening width of the clamping jaw in the process of grabbing a target object according to the embodiment;
FIG. 4 is a hierarchy diagram of a network architecture in an embodiment of the present invention;
In the figure: 1 is grabbing rectangle a; 2 is grabbing rectangle b; 3 is the target object to be grabbed.
Detailed Description
The working principle and working process of the present invention will be further explained in detail with reference to the accompanying drawings.
A convolutional neural network-based method for generating an optimal grabbing pose comprises the following steps:
S1, as shown in FIG. 2, setting the parameters for representing the grabbing quality in the grabbing process, including G, Q, Φ and W, where G represents the set of grabbing parameters in each grab, with one value at each pixel:
G = (Φ, W, Q) ∈ ℝ^(3×H×W)
for a given 2.5D depth map I ∈ ℝ^(H×W), where H represents the height of the depth map, W represents the width of the depth map, the H and W parameters are obtained from the camera, and ℝ denotes the real-valued space of the corresponding dimension;
Q represents the quality of each grab and is a scalar in (0, 1); the closer Q is to 1, the higher the grabbing quality;
Φ denotes the rotation angle required for the jaw to reach the ideal position during each grab, with Φ ∈ [-π/2, π/2]; the ideal position is the position of the optimal grabbing rectangle set in the dataset, the rotation angle refers to the rotation of the rectangle relative to the horizontal, and the specific numerical values are also included in the dataset;
W indicates the width to which the jaws need to be opened during grabbing to ensure the object is completely grasped.
S2, constructing the convolutional neural network model. As shown in FIG. 4, the structure of the neural network model comprises the following network layers: the first layer contains 9 × 9 convolution kernels and 32 filters with a stride of 3; the second layer contains 5 × 5 convolution kernels and 16 filters with a stride of 2; the third layer contains 3 × 3 convolution kernels and 8 filters with a stride of 2; the fourth to sixth layers are deconvolution layers whose purpose is to keep the resolution of the input and output consistent: the fourth layer is a deconvolution layer containing 3 × 3 convolution kernels and 8 filters with a stride of 2, the fifth layer is a deconvolution layer containing 3 × 3 convolution kernels and 16 filters with a stride of 2, and the sixth layer is a deconvolution layer containing 9 × 9 convolution kernels and 32 filters with a stride of 3.
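The following is a minimal PyTorch sketch of this six-layer conv/deconv structure. The kernel sizes, filter counts and strides follow the text above; the padding values, ReLU activations and the 1 × 1 output heads producing the Q, cos(2Φ), sin(2Φ) and W maps are assumptions added so that a 300 × 300 depth image maps to per-pixel grasp maps of the same resolution, and are not stated in the patent.

```python
import torch
import torch.nn as nn

class GraspNet(nn.Module):
    """Sketch of the six-layer conv/deconv model of step S2 (padding and output heads assumed)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=9, stride=3, padding=3), nn.ReLU(),   # layer 1: 9x9, 32 filters, stride 3
            nn.Conv2d(32, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),  # layer 2: 5x5, 16 filters, stride 2
            nn.Conv2d(16, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # layer 3: 3x3, 8 filters, stride 2
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 8, kernel_size=3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # layer 4
            nn.ConvTranspose2d(8, 16, kernel_size=3, stride=2, padding=1, output_padding=1), nn.ReLU(),  # layer 5
            nn.ConvTranspose2d(16, 32, kernel_size=9, stride=3, padding=3), nn.ReLU(),                   # layer 6
        )
        # Assumed output heads: one map each for Q, cos(2*Phi), sin(2*Phi) and the jaw width W.
        self.q_head = nn.Conv2d(32, 1, kernel_size=1)
        self.cos_head = nn.Conv2d(32, 1, kernel_size=1)
        self.sin_head = nn.Conv2d(32, 1, kernel_size=1)
        self.w_head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, depth):                       # depth: (N, 1, 300, 300)
        x = self.decoder(self.encoder(depth))       # resolution restored to 300 x 300
        return self.q_head(x), self.cos_head(x), self.sin_head(x), self.w_head(x)
```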
The loss function of the neural network model adopts the L2 loss, which serves as a measure for evaluating network performance; the neural network model is used to approximate the complex mapping M: I → G, and the calculation of the parameters of the neural network model includes:
M(I) = (Q, Φ, W)
G = (Φ, W, Q)
M_θ(I) = (Q_θ, Φ_θ, W_θ) ≈ M(I)
M(I) denotes the mapping formed by the theoretically optimal grabbing pose parameters for the input depth image I; M_θ(I) denotes the mapping formed by the actual grabbing pose parameters obtained from the neural network model; Q_θ, Φ_θ, W_θ are estimates of Q_T, Φ_T, W_T, which represent the grabbing parameters of all objects in the whole network. The depth information I_T in the dataset is used as the network input for training to obtain the ideal optimal grabbing pose G_T, where G_T corresponds to I_T; the loss function is thus defined as:
λ(G_T, M_θ(I_T)) = ||G_T - M_θ(I_T)||²
where λ(G_T, M_θ(I_T)) denotes the L2-norm loss function, i.e., the least-squares error.
S3, the Cornell Grasping dataset (the Cornell University open-source grasping dataset, http://pr.cs.cornell.edu/grasping/rect_data/data.php) is used to train the model. This open-source dataset, built for research at Cornell University, contains 1035 pictures (RGB images and depth information) of 280 different objects, each taken from a different direction or at a different angle. Each picture is accompanied by manually annotated data for the best grabbing rectangle, including the size of the rectangle, the three-dimensional position of its center point, and the rotation angle of the rectangle relative to the horizontal. This dataset provides the important parameters regarding grabbing. The depth information in the dataset is input into the neural network model for training; the trained model has good generalization capability, the model is compact, and the number of network parameters to be trained is greatly reduced. 80% of the Cornell Grasping dataset is selected as the training set for the network model and 20% is kept as the evaluation set.
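The patent does not spell out how an annotated grabbing rectangle is turned into per-pixel training targets; the sketch below shows one plausible rasterization, assumed purely for illustration, in which pixels near the rectangle centre receive Q = 1 together with the annotated angle and width, and the angle is encoded as cos(2Φ) and sin(2Φ) to match the decoding used in step S4.

```python
import numpy as np

def rectangle_to_target_maps(center_xy, angle, width_px, h=300, w=300, radius=5):
    """Hedged sketch: rasterize one annotated Cornell grabbing rectangle into the
    per-pixel targets (Q_T, Phi_T, W_T). Filling a small disc of `radius` pixels
    around the rectangle centre is an assumption made for illustration only."""
    q_map = np.zeros((h, w), np.float32)
    ang_map = np.zeros((h, w), np.float32)
    wid_map = np.zeros((h, w), np.float32)

    cx, cy = center_xy
    ys, xs = np.ogrid[:h, :w]
    near_center = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2

    q_map[near_center] = 1.0          # best grabbing quality at the annotated grasp
    ang_map[near_center] = angle      # rotation relative to the horizontal, in [-pi/2, pi/2]
    wid_map[near_center] = width_px   # required jaw opening (kept in pixels here)
    return q_map, ang_map, wid_map

def encode_angle(ang_map):
    """Encode the angle map as cos(2*Phi) and sin(2*Phi), matching the decoding in step S4."""
    return np.cos(2.0 * ang_map), np.sin(2.0 * ang_map)
```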
The training process comprises the following steps:
(1) initializing a weight value by the convolutional neural network model;
(2) inputting the depth information data of the training set and obtaining an output value through propagation through the convolution layers and deconvolution layers;
(3) solving the error between the output value of the network model and the target value, namely the value of the loss function;
(4) when the error is larger than the expected value, propagating the error back into the network model and obtaining the error of each network layer in turn; the errors of the network layers together make up the total error of the network;
(5) updating the weights according to the obtained errors and returning to step (2); when the error is equal to or less than the expected value, the training ends.
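A hedged sketch of training steps (1) to (5) is given below, assuming the PyTorch model sketched earlier and a data loader yielding depth images with their target maps: the forward pass produces the output maps, the L2 (least-squares) loss λ(G_T, M_θ(I_T)) is computed against the targets, the error is back-propagated, and the weights are updated. The Adam optimizer, learning rate and fixed epoch count are assumptions; the patent instead stops training once the error falls to the expected value.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=50, lr=1e-3, device="cpu"):
    """Hedged sketch of the training procedure in steps (1)-(5)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)    # assumed optimizer and learning rate
    model.to(device).train()                                   # step (1): weights already initialized
    for _ in range(epochs):
        for depth, q_t, cos_t, sin_t, w_t in loader:           # 80% training split of the dataset
            depth = depth.to(device)
            q, cos2p, sin2p, wid = model(depth)                # step (2): forward propagation
            loss = (F.mse_loss(q, q_t.to(device))              # step (3): L2 loss lambda(G_T, M_theta(I_T))
                    + F.mse_loss(cos2p, cos_t.to(device))
                    + F.mse_loss(sin2p, sin_t.to(device))
                    + F.mse_loss(wid, w_t.to(device)))
            optimizer.zero_grad()
            loss.backward()                                    # step (4): propagate the error back through the layers
            optimizer.step()                                   # step (5): update the weights
    return model
```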
S4, inputting the object depth map acquired by the camera into the trained neural network model and calculating the grabbing pose parameters, which are used to drive a mechanical arm to grab; the mechanical arm may be a UR5 arm. As shown in FIG. 1 and FIG. 4, a 300 × 300 depth image is input into the trained neural network model, and the optimal grabbing pose G_θ is obtained.
The model obtained after step S3 has extracted the G, Q, Φ information in the Cornell Grasping dataset; for a newly input depth image, its grabbing quality Q_T, rotation angle Φ_T and jaw opening width W_T are then calculated as follows:
Grabbing quality Q_T: when an object is to be grabbed, the depth information of the object acquired from the Intel RealSense SR300 camera is input into the trained model and compared with the information in the model; pixels whose depth information is consistent are set to 1 and inconsistent pixels are set to 0, the numbers of 1s and 0s over all pixels are then counted, and the grabbing quality value Q_T of the object is calculated.
Rotation angle Φ_T: the range of values is [-π/2, π/2], and the unique true value Φ_T is obtained from sin(2Φ_T) and cos(2Φ_T):
Φ_T = (1/2) arctan(sin(2Φ_T) / cos(2Φ_T))
Jaw opening width W_T: obtained by adding 1 cm to 2 cm to the width of the object; the width of the object is obtained from its depth information, which is acquired by the Intel RealSense SR300 camera. As shown in FIG. 3, 3 is an irregular target object with a thin top and a thick bottom, 1 is grabbing rectangle a, and 2 is grabbing rectangle b.
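A hedged sketch of this inference step follows, assuming the network sketched earlier outputs per-pixel Q, cos(2Φ), sin(2Φ) and W maps: the pixel with the highest quality is selected (an assumption, since the patent only states that Q_T, Φ_T and W_T are computed from the output), the angle is recovered from sin(2Φ) and cos(2Φ) as in the formula above, and the predicted width would still need conversion to metres plus the 1 cm to 2 cm margin described in the text.

```python
import torch

def best_grasp_from_depth(model, depth_300x300):
    """Hedged sketch of step S4: run one 300 x 300 depth image through the trained
    model and read out the highest-quality grasp hypothesis."""
    model.eval()
    with torch.no_grad():
        q, cos2p, sin2p, wid = model(depth_300x300.view(1, 1, 300, 300))

    q = q.squeeze()
    flat_idx = torch.argmax(q)                       # pixel with the best grabbing quality Q_T
    row, col = divmod(flat_idx.item(), q.shape[1])

    # Unique angle in [-pi/2, pi/2] recovered from sin(2*Phi) and cos(2*Phi),
    # equivalent to Phi_T = (1/2) * arctan(sin(2*Phi_T) / cos(2*Phi_T)).
    phi_t = 0.5 * torch.atan2(sin2p.squeeze()[row, col],
                              cos2p.squeeze()[row, col]).item()
    w_t = wid.squeeze()[row, col].item()             # predicted jaw opening width W_T

    return (row, col), q[row, col].item(), phi_t, w_t
```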
The method can be applied to industrial mechanical arms or to mechanical arms used in scientific research. In industry, where a mechanical arm needs to grab materials or goods, the method can quickly generate the optimal grabbing pose of the materials or goods and send the pose information to the mechanical arm, so that it can quickly grab the specified items. In scientific research, the convolutional neural network model provided by the method has few parameters, the trained model is simple, and its generalization capability is strong, so it has reference value for research on mechanical arm visual grabbing.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method for generating an optimal grabbing pose based on a convolutional neural network is characterized by comprising the following steps:
S1, setting parameters for representing the grabbing quality in the grabbing process;
S2, constructing a convolutional neural network model;
S3, training the neural network model on the Cornell Grasping dataset;
and S4, inputting the object depth map acquired by the camera into the trained neural network model and calculating grabbing pose parameters, wherein the grabbing pose parameters are used for driving the mechanical arm to grab.
2. The convolutional neural network-based method of generating an optimal grab pose of claim 1, wherein: the parameters in step S1 include G, Q, Φ and W, where G represents the set of grabbing parameters in each grab, with one value at each pixel:
G = (Φ, W, Q) ∈ ℝ^(3×H×W)
for a given 2.5D depth map I ∈ ℝ^(H×W), where H represents the height of the depth map, W represents the width of the depth map, the H and W parameters are obtained from the camera, and ℝ denotes the real-valued space of the corresponding dimension;
Q represents the quality of each grab and is a scalar in (0, 1); the closer Q is to 1, the higher the grabbing quality;
Φ denotes the rotation angle required for the jaw to reach the ideal position during each grab, with Φ ∈ [-π/2, π/2]; the ideal position is the position of the optimal grabbing rectangle set in the dataset, and the rotation angle refers to the rotation of the grabbing rectangle relative to the horizontal;
W indicates the width to which the jaws need to be opened during grabbing to ensure the object is completely grasped.
3. The convolutional neural network-based method of generating an optimal grab pose of claim 1, wherein: the Cornell Grasping dataset of step S3 provides 1035 pictures of 280 different objects, each with an RGB image, depth information, and annotations of the best grabbing rectangle used to grab the object, including the size of the rectangle and the three-dimensional position of its center point.
4. The convolutional neural network-based method of generating an optimal grab pose according to claim 1, wherein: the structure of the neural network model comprises the following network layers: the first layer contains 9 × 9 convolution kernels and 32 filters with a stride of 3; the second layer contains 5 × 5 convolution kernels and 16 filters with a stride of 2; the third layer contains 3 × 3 convolution kernels and 8 filters with a stride of 2; the fourth, fifth and sixth layers are deconvolution layers whose purpose is to keep the resolution of the input and output consistent: the fourth layer is a deconvolution layer containing 3 × 3 convolution kernels and 8 filters with a stride of 2, the fifth layer is a deconvolution layer containing 3 × 3 convolution kernels and 16 filters with a stride of 2, and the sixth layer is a deconvolution layer containing 9 × 9 convolution kernels and 32 filters with a stride of 3.
5. The convolutional neural network-based method of generating an optimal grab pose of claim 1, wherein: the loss function of the neural network model adopts the L2 loss, which serves as a measure for evaluating network performance; the neural network is used to approximate the complex mapping M: I → G, and the calculation includes:
M(I) = (Q, Φ, W)
G = (Φ, W, Q)
M_θ(I) = (Q_θ, Φ_θ, W_θ) ≈ M(I)
where M(I) denotes the mapping formed by the theoretically optimal grabbing pose parameters for the input depth image I, M_θ(I) denotes the mapping formed by the actual grabbing pose parameters obtained from the neural network model, and Q_θ, Φ_θ, W_θ are estimates of Q_T, Φ_T, W_T, which represent the grabbing parameters of all objects in the whole network; the depth information I_T in the dataset is input into the neural network model for training to obtain the optimal grabbing pose G_T, and thus the loss function is defined as:
λ(G_T, M_θ(I_T)) = ||G_T - M_θ(I_T)||²
6. The convolutional neural network-based method of generating an optimal grab pose of claim 1, wherein: the grabbing pose parameters of step S4 include the grabbing quality Q_T, the rotation angle Φ_T and the jaw opening width W_T, which are calculated as follows:
Grabbing quality Q_T: when an object is to be grabbed, the depth information of the object acquired by the Intel RealSense SR300 camera is input into the model trained in step S3 and compared with the information in the model; pixels whose depth information is consistent are set to 1 and inconsistent pixels are set to 0, the numbers of 1s and 0s over all pixels are then counted, and the grabbing quality value Q_T of the object is calculated.
Rotation angle Φ_T: the range of values is [-π/2, π/2], and the unique true value Φ_T is obtained from sin(2Φ_T) and cos(2Φ_T):
Φ_T = (1/2) arctan(sin(2Φ_T) / cos(2Φ_T))
Jaw opening width W_T: obtained by adding 1 cm to 2 cm to the width of the object; the width of the object is obtained from the depth information of the object, which is acquired by the Intel RealSense SR300 camera.
7. The convolutional neural network-based method of generating an optimal grab pose of claim 1, wherein: the equation used by the neural network model to calculate the grabbing pose is:
M_θ(I) = (Q_θ, Φ_θ, W_θ)
where I is the input depth image and Q_θ, Φ_θ, W_θ are estimates of Q_T, Φ_T, W_T, which represent the grabbing parameters of all objects in the neural network model.
8. The convolutional neural network-based method of generating an optimal grab pose of claim 1, wherein: the training process of the convolutional neural network model comprises the following steps:
(1) initializing a weight value by the convolutional neural network model;
(2) selecting 80% of the Cornell Grasping dataset as the training set of the network model, inputting the depth information data of the training set into the convolutional neural network model, and obtaining an output value through propagation through the convolution layers and deconvolution layers;
(3) solving the error between the output value of the network model and the target value, namely the value of the loss function;
(4) when the error is larger than the expected value, propagating the error back into the network model and obtaining the error of each network layer in turn; the errors of the network layers together make up the total error of the network;
(5) updating the weights according to the obtained errors and returning to step (2); when the error is equal to or less than the expected value, the training ends.
CN202010134999.4A 2020-02-29 2020-02-29 Method for generating optimal grabbing pose based on convolutional neural network Active CN111360862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010134999.4A CN111360862B (en) 2020-02-29 2020-02-29 Method for generating optimal grabbing pose based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010134999.4A CN111360862B (en) 2020-02-29 2020-02-29 Method for generating optimal grabbing pose based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN111360862A true CN111360862A (en) 2020-07-03
CN111360862B CN111360862B (en) 2023-03-24

Family

ID=71200241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010134999.4A Active CN111360862B (en) 2020-02-29 2020-02-29 Method for generating optimal grabbing pose based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111360862B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112297013A (en) * 2020-11-11 2021-02-02 浙江大学 Robot intelligent grabbing method based on digital twin and deep neural network
CN113268055A (en) * 2021-04-07 2021-08-17 北京拓疆者智能科技有限公司 Obstacle avoidance control method and device for engineering vehicle and mechanical equipment
CN113327295A (en) * 2021-06-18 2021-08-31 华南理工大学 Robot rapid grabbing method based on cascade full convolution neural network
CN113326666A (en) * 2021-07-15 2021-08-31 浙江大学 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching
CN113799138A (en) * 2021-10-09 2021-12-17 中山大学 Mechanical arm grabbing method for generating convolutional neural network based on grabbing
CN115631401A (en) * 2022-12-22 2023-01-20 广东省科学院智能制造研究所 Robot autonomous grabbing skill learning system and method based on visual perception
CN117621145A (en) * 2023-12-01 2024-03-01 安徽大学 Fruit maturity detects flexible arm system based on FPGA

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170252924A1 (en) * 2016-03-03 2017-09-07 Google Inc. Deep machine learning methods and apparatus for robotic grasping
US10089575B1 (en) * 2015-05-27 2018-10-02 X Development Llc Determining grasping parameters for grasping of an object by a robot grasping end effector
CN109702741A (en) * 2018-12-26 2019-05-03 中国科学院电子学研究所 Mechanical arm visual grasping system and method based on self-supervisory learning neural network
CN110298886A (en) * 2019-07-01 2019-10-01 中国科学技术大学 A kind of Dextrous Hand Grasp Planning method based on level Four convolutional neural networks
CN110378325A (en) * 2019-06-20 2019-10-25 西北工业大学 A kind of object pose recognition methods during robot crawl
CN110509273A (en) * 2019-08-16 2019-11-29 天津职业技术师范大学(中国职业培训指导教师进修中心) The robot mechanical arm of view-based access control model deep learning feature detects and grasping means

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089575B1 (en) * 2015-05-27 2018-10-02 X Development Llc Determining grasping parameters for grasping of an object by a robot grasping end effector
US20170252924A1 (en) * 2016-03-03 2017-09-07 Google Inc. Deep machine learning methods and apparatus for robotic grasping
CN109702741A (en) * 2018-12-26 2019-05-03 中国科学院电子学研究所 Mechanical arm visual grasping system and method based on self-supervisory learning neural network
CN110378325A (en) * 2019-06-20 2019-10-25 西北工业大学 A kind of object pose recognition methods during robot crawl
CN110298886A (en) * 2019-07-01 2019-10-01 中国科学技术大学 A kind of Dextrous Hand Grasp Planning method based on level Four convolutional neural networks
CN110509273A (en) * 2019-08-16 2019-11-29 天津职业技术师范大学(中国职业培训指导教师进修中心) The robot mechanical arm of view-based access control model deep learning feature detects and grasping means

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112297013A (en) * 2020-11-11 2021-02-02 浙江大学 Robot intelligent grabbing method based on digital twin and deep neural network
CN113268055A (en) * 2021-04-07 2021-08-17 北京拓疆者智能科技有限公司 Obstacle avoidance control method and device for engineering vehicle and mechanical equipment
CN113327295A (en) * 2021-06-18 2021-08-31 华南理工大学 Robot rapid grabbing method based on cascade full convolution neural network
CN113326666A (en) * 2021-07-15 2021-08-31 浙江大学 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching
CN113326666B (en) * 2021-07-15 2022-05-03 浙江大学 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching
CN113799138A (en) * 2021-10-09 2021-12-17 中山大学 Mechanical arm grabbing method for generating convolutional neural network based on grabbing
CN115631401A (en) * 2022-12-22 2023-01-20 广东省科学院智能制造研究所 Robot autonomous grabbing skill learning system and method based on visual perception
CN117621145A (en) * 2023-12-01 2024-03-01 安徽大学 Fruit maturity detects flexible arm system based on FPGA

Also Published As

Publication number Publication date
CN111360862B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN111360862B (en) Method for generating optimal grabbing pose based on convolutional neural network
CN113524194B (en) Target grabbing method of robot vision grabbing system based on multi-mode feature deep learning
CN109934864B (en) Residual error network deep learning method for mechanical arm grabbing pose estimation
CN108010078B (en) Object grabbing detection method based on three-level convolutional neural network
CN107953329B (en) Object recognition and attitude estimation method and device and mechanical arm grabbing system
CN112297013B (en) Robot intelligent grabbing method based on digital twin and deep neural network
CN111251295B (en) Visual mechanical arm grabbing method and device applied to parameterized parts
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN113409384B (en) Pose estimation method and system of target object and robot
JP2019056966A (en) Information processing device, image recognition method and image recognition program
CN109584298B (en) Robot-oriented autonomous object picking task online self-learning method
CN112605983B (en) Mechanical arm pushing and grabbing system suitable for intensive environment
CN110378325B (en) Target pose identification method in robot grabbing process
CN111151463A (en) Mechanical arm sorting and grabbing system and method based on 3D vision
CN108748149B (en) Non-calibration mechanical arm grabbing method based on deep learning in complex environment
CN111331607B (en) Automatic grabbing and stacking method and system based on mechanical arm
WO2019059343A1 (en) Workpiece information processing device and recognition method of workpiece
CN109508707B (en) Monocular vision-based grabbing point acquisition method for stably grabbing object by robot
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
Mittrapiyanumic et al. Calculating the 3d-pose of rigid-objects using active appearance models
Inoue et al. Transfer learning from synthetic to real images using variational autoencoders for robotic applications
CN113327295A (en) Robot rapid grabbing method based on cascade full convolution neural network
CN112288809B (en) Robot grabbing detection method for multi-object complex scene
CN113538576A (en) Grabbing method and device based on double-arm robot and double-arm robot
CN113436293B (en) Intelligent captured image generation method based on condition generation type countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant