CN115578460B - Robot grabbing method and system based on multi-mode feature extraction and dense prediction - Google Patents
Robot grabbing method and system based on multi-mode feature extraction and dense prediction
- Publication number
- CN115578460B (application CN202211407718.3A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- pixel
- network
- dense
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J19/00—Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
- B25J19/02—Sensing devices
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1602—Programme controls characterised by the control system, structure, architecture
- B25J9/161—Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1661—Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Automation & Control Theory (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a robot grabbing method and system based on multi-modal feature extraction and dense prediction. A scene color image and a depth image are obtained, a scene three-dimensional point cloud and adaptive convolution receptive fields of different scales are calculated from the depth image, and a surface normal vector image is obtained from the scene three-dimensional point cloud. A multi-modal feature extraction and dense prediction network is constructed and used to process the scene color image and the surface normal vector image, yielding dense three-dimensional attitude information and three-dimensional position information predicted for each type of object; from these, the three-dimensional attitude and three-dimensional position of the corresponding object are calculated, together forming its three-dimensional pose, which is sent to the robot grabbing system to complete the grabbing task for the corresponding object in the scene. The disclosed method fuses multi-modal color and depth data, retains two-dimensional plane features and depth information during feature extraction, has a simple structure and high prediction precision, and is suitable for robot grabbing tasks in complex scenes.
Description
Technical Field
The invention relates to the fields of robot three-dimensional vision, object three-dimensional pose estimation and grabbing applications, and in particular to a robot grabbing method and system based on multi-modal feature extraction and dense prediction.
Background
Robot grabbing is an important task in industrial automation, used to replace manual work in complex and repetitive production operations such as part loading, assembly, sorting and handling. To complete a grabbing task accurately, the robot must identify the target object in the working scene with its vision system, accurately estimate the object's three-dimensional pose, and then perform the grabbing operation with its motion control system. Parts in industrial scenes are typically of many types and varied shapes, have poor surface texture, are randomly placed, and are viewed under uneven illumination, all of which poses great challenges to the robot's visual identification and object three-dimensional pose estimation.
In recent years, with the development of sensor technology, small, low-cost three-dimensional cameras have come into wide use. Compared with a two-dimensional camera, such cameras provide additional scene depth and object surface geometry information, enriching the scene image data and improving the target identification and pose estimation accuracy of vision algorithms. Two modes of three-dimensional image processing are currently dominant: in the first, the scene depth image is appended as an extra channel to the three color channels to form a four-channel image, on which feature extraction and further processing are performed; in the second, the color image and depth image are converted into a scene three-dimensional point cloud, and feature extraction, target identification and the like are completed with point cloud processing methods. Among related algorithms, the traditional approach is template matching, which searches the scene data for the best matching position of a pre-defined template of the target object in order to identify the object and estimate its pose; however, template computation depends on manual design, is strongly affected by noise, illumination and texture characteristics, and has poor robustness.
In recent years, thanks to the development of deep learning, image processing methods based on convolutional neural networks have been widely applied with markedly improved results. DenseFusion, a leading method for object three-dimensional pose estimation, combines the two three-dimensional data processing modes: color image information is processed by a two-dimensional convolutional network while the point cloud converted from the depth image is processed by a point cloud network, and the features of the different dimensions are then fused, yielding a significant performance gain. However, converting the image data from a two-dimensional image to a serialized point cloud loses the two-dimensional structure of the scene and impairs feature extraction; moreover, the color image and the depth image quantize different physical information, so robust features cannot be obtained by simple dimensional fusion.
Therefore, how to solve the problems of feature extraction and information fusion across images of different dimensions and characteristics in three-dimensional vision, and how to design a regression model for the pose parameters of the target object that meets the requirement of high-precision robot grabbing, has become a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a robot grabbing method and system based on multi-modal feature extraction and dense prediction which, by adopting robot three-dimensional vision technology, effectively meet the pose estimation requirements of weakly textured, complex and diverse parts in industrial scenes.
In order to solve the technical problems, the invention provides a robot grasping method based on multi-modal feature extraction and dense prediction, which comprises the following steps:
s1, acquiring a color image and a depth image of a robot under a multi-class object grabbing scene;
s2, calculating scene three-dimensional point clouds and adaptive convolution receptive fields of different scales from the depth image, and obtaining a surface normal vector image according to the scene three-dimensional point clouds;
s3, constructing a multi-mode feature extraction and dense prediction network by combining self-adaptive convolution receptive fields of different scales, inputting a preset training set into the network for training to obtain the trained multi-mode feature extraction and dense prediction network, calculating a total loss value of the network according to a preset loss function, and reversely propagating and updating network parameters of the network to obtain an updated multi-mode feature extraction and dense prediction network;
s4, processing the scene color image and the surface normal vector image through the updated multi-modal feature extraction and dense prediction network to obtain dense three-dimensional attitude information and dense three-dimensional position information predicted for each type of object;
and S5, calculating the three-dimensional attitude of the corresponding object according to the predicted dense three-dimensional attitude information of each type of object, calculating the three-dimensional position of the corresponding object according to the predicted dense three-dimensional position information of each type of object, forming the three-dimensional pose of the corresponding object from the three-dimensional attitude and the three-dimensional position, and sending the three-dimensional pose to a robot grabbing system to complete the grabbing task of the corresponding object in the scene.
Preferably, the multi-mode feature extraction and dense prediction network in S3 includes a multi-mode feature extraction network and three regression branch networks, the multi-mode feature extraction network is configured to perform feature extraction and feature fusion from the scene color image and the surface normal vector image to obtain multi-mode features, and the three regression branch networks are configured to respectively predict multi-class semantic information, three-dimensional posture information, and three-dimensional position information of the pixel-by-pixel target object from the multi-mode features.
Preferably, the multi-modal feature extraction network comprises a first convolution network, a second convolution network and a multi-scale feature fusion module, wherein the first convolution network extracts multi-scale color convolution features from a scene color image under the guidance of adaptive convolution receptive fields of different scales, the second convolution network extracts multi-scale normal vector convolution features from a surface normal vector image under the guidance of adaptive convolution receptive fields of different scales, and the multi-scale feature fusion module fuses the multi-scale color convolution features and the multi-scale normal vector convolution features to obtain the multi-modal features.
Preferably, the first convolution network and the second convolution network respectively use ResNet-18 as a backbone network, a third layer and subsequent convolution layers of the backbone network are abandoned, adaptive deep convolution receptive fields of different scales are used for replacing an original conventional convolution receptive field of the network, the multi-scale feature fusion module comprises a first sub-module and a second sub-module, the first sub-module is used for performing multi-mode convolution feature fusion on color convolution features and normal vector convolution features of the same scale in different scales to obtain multi-mode features of different scales, and the second sub-module performs up-sampling and scale information fusion on the obtained multi-mode features of different scales by adopting a feature pyramid structure to obtain the scene pixel-by-pixel multi-mode features.
Preferably, the three regression branch networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional attitude prediction network and a pixel-by-pixel three-dimensional position prediction network respectively, the pixel-by-pixel semantic prediction network performs dense pixel-by-pixel semantic information prediction on the input multi-modal features to obtain pixel-by-pixel multi-class semantic information, the pixel-by-pixel three-dimensional attitude prediction network performs dense pixel-by-pixel three-dimensional attitude prediction on the input multi-modal features to obtain pixel-by-pixel three-dimensional attitude information, and the pixel-by-pixel three-dimensional position prediction network performs dense pixel-by-pixel three-dimensional position prediction on the input multi-modal features to obtain pixel-by-pixel three-dimensional position information.
Preferably, the scene three-dimensional point cloud is calculated from the depth image in S2, and the specific formula is as follows:
x = (u − c_x)·d / f_x, y = (v − c_y)·d / f_y, z = d
In the formula, (x, y, z) are the three-dimensional point cloud coordinates, f_x, f_y, c_x and c_y are the camera intrinsic parameters, u and v are the coordinates of the depth image, and d is the depth value of the depth image;
in S2, self-adaptive convolution receptive fields of different scales are obtained by calculation from the depth image, and the specific formula is as follows:
R_a^s(p) = R^s(p) + Δp(p)
In the formula, R_a^s(p) is the adaptive depth convolution receptive field of scale s corresponding to pixel p, R^s(p) is the conventional convolution receptive field corresponding to pixel p, and Δp(p) is the offset of the position of pixel p;
in S2, a surface normal vector image is obtained according to the scene three-dimensional point cloud, and the specific formula is as follows:
In the formula, N is the surface normal vector image, P = {p_i | i = 1, …, M} is the set of all three-dimensional points in the scene, M is the number of points, and n_i is the three-dimensional unit normal vector of the i-th point.
Preferably, the predicted dense three-dimensional posture information of each type of object in S4 specifically includes:
R_p = (q_1, q_2, q_3, q_4)
In the formula, R_p is the three-dimensional attitude of the object at pixel p, expressed in quaternion form, and q_i is the i-th value of the quaternion;
the predicted dense three-dimensional position information of each type of object in the S4 specifically comprises the following steps:
wherein t_p is the three-dimensional position of the object at pixel p, and Δt_p is the three-dimensional position offset at pixel p, representing the unitized three-dimensional offset from the 3D point P_p of pixel p to the three-dimensional position of the object.
Preferably, in S5, the three-dimensional posture of the corresponding object is calculated according to the predicted dense three-dimensional posture information of each type of object, and the specific formula is as follows:
R_obj = (1 / N_obj) · Σ_p R_p
wherein R_obj is the three-dimensional attitude of the object of category obj, and N_obj is the number of dense predictions corresponding to the category obj object;
And S5, calculating to obtain the three-dimensional position of the corresponding object according to the predicted dense three-dimensional position information of each type of object, wherein the specific formula is as follows:
t_obj = (1 / N_obj) · Σ_p t_p
In the formula, t_obj is the three-dimensional position of the object of category obj, t_p is a predicted dense three-dimensional position, and N_obj is the number of dense predictions corresponding to the category obj object.
Preferably, the loss function preset in S3 is specifically:
L = λ_1·L_sem + λ_2·L_R + λ_3·L_t
In the formula, L is the total loss of the network; λ_1, λ_2 and λ_3 are the weight factors of the semantic prediction branch, the three-dimensional attitude prediction branch and the three-dimensional position prediction branch, respectively; L_sem is the loss function of the semantic prediction network, for which a cross-entropy loss function is adopted; L_R is the loss function of the three-dimensional attitude prediction network and L_t is the loss function of the three-dimensional position prediction network; the attitude loss compares the predicted values and ground-truth values of the three-dimensional attitude prediction network, and the position loss compares the predicted values and ground-truth values of the three-dimensional position prediction network; C is the number of object classes in the scene, and N is the number of dense predictions corresponding to each class of objects.
A robot grabbing system adopts the robot grabbing method based on multi-modal feature extraction and dense prediction to grab a target object in a scene, and comprises a robot pose calculation module, a communication module, a grabbing module and an image acquisition module,
the image acquisition module is used for acquiring color images and depth images under multi-class object scenes in real time and sending the color images and the depth images to the pose calculation module;
the pose calculation module calculates the pose of the target object by adopting the robot grabbing method based on multi-modal feature extraction and dense prediction and sends the pose to the grabbing module through the communication module;
the grabbing module receives the 6D pose information of the target object and grabs the target object.
According to the robot grasping method and system based on multi-modal feature extraction and dense prediction, the scene depth image is converted into a scene three-dimensional point cloud, and multi-modal feature extraction is carried out on the scene three-dimensional point cloud and the scene color image together.
Drawings
FIG. 1 is a flowchart of a robot grasping method based on multi-modal feature extraction and dense prediction according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-modal feature extraction and dense prediction network in an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is further described in detail below with reference to the accompanying drawings.
In one embodiment, the robot grasping method based on multi-modal feature extraction and dense prediction specifically includes:
s1, acquiring a color image and a depth image of a robot under a multi-class object grabbing scene;
s2, calculating scene three-dimensional point clouds and adaptive convolution receptive fields of different scales from the depth image, and obtaining a surface normal vector image according to the scene three-dimensional point clouds;
s3, constructing a multi-modal feature extraction and dense prediction network by combining self-adaptive convolution receptive fields of different scales, inputting a preset training set into the network for training to obtain the trained multi-modal feature extraction and dense prediction network, calculating the total loss value of the network according to a preset loss function, and back-propagating to update the network parameters, thereby obtaining the updated multi-modal feature extraction and dense prediction network;
s4, processing the scene color image and the surface normal vector image through the updated multi-modal feature extraction and dense prediction network to obtain dense three-dimensional attitude information and dense three-dimensional position information predicted for each type of object;
and S5, calculating the three-dimensional attitude of the corresponding object according to the predicted dense three-dimensional attitude information of each type of object, calculating the three-dimensional position of the corresponding object according to the predicted dense three-dimensional position information of each type of object, forming the three-dimensional pose of the corresponding object from the three-dimensional attitude and the three-dimensional position, and sending the three-dimensional pose to a robot grabbing system to complete the grabbing task of the corresponding object in the scene.
Specifically, referring to FIG. 1, FIG. 1 is a flowchart of a robot grasping method based on multi-modal feature extraction and dense prediction according to an embodiment of the present invention; FIG. 2 is a schematic structural diagram of a multi-modal feature extraction and dense prediction network in an embodiment of the present invention.
A robot grabbing method based on multi-modal feature extraction and dense prediction first obtains a color image and a depth image of the robot's multi-class object grabbing scene. A scene three-dimensional point cloud and adaptive convolution receptive fields of different scales are then calculated from the depth image, and a surface normal vector image (the same as the normal vector image in FIG. 2) is computed from the scene three-dimensional point cloud data. Next, a multi-modal feature extraction and dense prediction network is constructed in combination with the adaptive convolution receptive fields of different scales and trained. The trained network then processes the scene color image and the surface normal vector image to obtain the dense three-dimensional attitude information and dense three-dimensional position information predicted for each type of object. For the predicted dense three-dimensional attitude information of each type of object, the three-dimensional attitude of the corresponding object is calculated by averaging; for the predicted dense three-dimensional position information of each type of object, the three-dimensional position offset of each pixel is calculated according to the diameter of the object and added to the corresponding three-dimensional point cloud to obtain the dense three-dimensional positions of the object, from which the three-dimensional position of the corresponding object is calculated by averaging. The three-dimensional attitude and the three-dimensional position together form the three-dimensional pose of the corresponding object. Finally, the three-dimensional pose is sent to the robot grabbing system to complete the grabbing task for the corresponding object in the scene.
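Read as a whole, the flow above amounts to a short acquisition, inference and decoding loop. The following sketch is only an illustrative outline of that loop; every imported helper (load_rgbd, compute_point_cloud, compute_normals, compute_adaptive_receptive_fields, load_trained_network, decode_object_poses, send_pose_to_robot) is a hypothetical placeholder introduced here for clarity and is not part of the patent.

```python
# Hypothetical helpers; names and signatures are assumptions for illustration only.
from grasp_pipeline import (load_rgbd, compute_point_cloud, compute_normals,
                            compute_adaptive_receptive_fields, load_trained_network,
                            decode_object_poses, send_pose_to_robot)

# S1: acquire a color image and a depth image of the multi-class grabbing scene
color, depth, intrinsics = load_rgbd(camera_id=0)

# S2: scene point cloud, surface normal image, and depth-adaptive receptive fields
points = compute_point_cloud(depth, intrinsics)              # (H, W, 3)
normals = compute_normals(points)                            # (H, W, 3)
fields = compute_adaptive_receptive_fields(depth, intrinsics, scales=(4, 8))

# S3 (offline): the multi-modal feature extraction and dense prediction network
# is assumed to have been trained beforehand; here it is only loaded.
net = load_trained_network("weights.pth")

# S4: dense per-pixel semantics, 3D attitude (quaternion) and 3D position offsets
semantics, dense_quats, dense_offsets = net(color, normals, fields)

# S5: aggregate dense predictions into one 6D pose per object and send to the robot
for obj_id, pose in decode_object_poses(semantics, dense_quats, dense_offsets, points):
    send_pose_to_robot(obj_id, pose)
```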
In one embodiment, the multi-modal feature extraction and dense prediction network in S3 includes a multi-modal feature extraction network and three regression branch networks, the multi-modal feature extraction network is configured to perform feature extraction and feature fusion from the scene color image and the surface normal vector image to obtain multi-modal features, and the three regression branch networks are configured to respectively predict multi-category semantic information, three-dimensional posture information, and three-dimensional position information of the pixel-by-pixel target object from the multi-modal features.
In one embodiment, the multi-modal feature extraction network comprises a first convolution network, a second convolution network and a multi-scale feature fusion module, wherein the first convolution network extracts multi-scale color convolution features from a scene color image under the guidance of adaptive convolution receptive fields with different scales, the second convolution network extracts multi-scale normal vector convolution features from a surface normal vector image under the guidance of adaptive convolution receptive fields with different scales, and the multi-scale feature fusion module fuses the multi-scale color convolution features and the multi-scale normal vector convolution features to obtain the multi-modal features.
In one embodiment, the first convolution network and the second convolution network respectively use ResNet-18 as a main network, a third layer and a subsequent convolution layer of the main network are abandoned, adaptive depth convolution receptive fields with different scales are used for replacing an original conventional convolution receptive field of the network, the multi-scale feature fusion module comprises a first sub-module and a second sub-module, the first sub-module is used for performing multi-mode convolution feature fusion on color convolution features and normal vector convolution features with the same scale in different scales to obtain multi-mode features with different scales, and the second sub-module performs up-sampling and scale information fusion on the obtained multi-mode features with different scales by adopting a feature pyramid structure to obtain multi-mode features with pixel-by-pixel scenes.
Specifically, the first convolutional network and the second convolutional network are two identical convolutional neural networks, both using ResNet-18 as the backbone network, discarding the third layer and subsequent convolutional layers of the backbone (i.e. discarding the 1/16-scale layer and subsequent convolutional layers), and replacing the original conventional convolution receptive field of the network with the adaptive depth convolution receptive field to complete convolution guidance; convolution features are then extracted from the scene color image and the surface normal vector image respectively to obtain their multi-scale convolution feature layers. The multi-scale feature fusion module comprises two sub-modules: the first sub-module performs multi-modal convolution feature fusion on the color convolution features and normal vector convolution features of the same scale at each scale to obtain multi-modal features of different scales, and the second sub-module performs up-sampling and scale-information fusion on the obtained multi-modal convolution features of different scales using a feature pyramid structure to obtain the pixel-by-pixel multi-modal features of the scene.
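A minimal PyTorch sketch of such a two-stream extractor is given below, as an illustration rather than the patented implementation: each stream keeps only the ResNet-18 stem plus layer1 (1/4 scale) and layer2 (1/8 scale), same-scale color and normal-vector features are fused by concatenation and a 1×1 convolution, and a small feature-pyramid step upsamples and merges the scales into per-pixel features. For brevity the depth-guided adaptive receptive fields are not modeled here; conventional convolutions are used in their place, and all channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class TruncatedResNet18(nn.Module):
    """ResNet-18 stem + layer1/layer2 only; the 1/16-scale layer3 and beyond are discarded."""
    def __init__(self):
        super().__init__()
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1 = r.layer1   # 1/4 scale, 64 channels
        self.layer2 = r.layer2   # 1/8 scale, 128 channels

    def forward(self, x):
        x = self.stem(x)
        f4 = self.layer1(x)
        f8 = self.layer2(f4)
        return f4, f8


class MultiModalFeatureExtractor(nn.Module):
    """Two-stream (color / surface normal) extractor with per-scale fusion and FPN-style merging."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.color_net = TruncatedResNet18()
        self.normal_net = TruncatedResNet18()
        self.fuse4 = nn.Conv2d(64 + 64, out_ch, kernel_size=1)    # fuse same-scale features (1/4)
        self.fuse8 = nn.Conv2d(128 + 128, out_ch, kernel_size=1)  # fuse same-scale features (1/8)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, color, normals):
        c4, c8 = self.color_net(color)
        n4, n8 = self.normal_net(normals)
        m4 = self.fuse4(torch.cat([c4, n4], dim=1))               # multi-modal feature, 1/4 scale
        m8 = self.fuse8(torch.cat([c8, n8], dim=1))               # multi-modal feature, 1/8 scale
        # feature-pyramid step: upsample the coarse scale and merge it with the finer one
        m4 = m4 + F.interpolate(m8, size=m4.shape[-2:], mode="bilinear", align_corners=False)
        m4 = self.smooth(m4)
        # upsample back to input resolution to obtain per-pixel multi-modal features
        return F.interpolate(m4, size=color.shape[-2:], mode="bilinear", align_corners=False)
```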
In one embodiment, the three regression branch networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional posture prediction network and a pixel-by-pixel three-dimensional position prediction network respectively, the pixel-by-pixel semantic prediction network performs dense pixel-by-pixel semantic information prediction on input multi-modal characteristics to obtain pixel-by-pixel multi-category semantic information, the pixel-by-pixel three-dimensional posture prediction network performs dense pixel-by-pixel three-dimensional posture prediction on the input multi-modal characteristics to obtain pixel-by-pixel three-dimensional posture information, and the pixel-by-pixel three-dimensional position prediction network performs dense pixel-by-pixel three-dimensional position prediction on the input multi-modal characteristics to obtain pixel-by-pixel three-dimensional position information.
Specifically, the three regression networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional attitude prediction network and a pixel-by-pixel three-dimensional position prediction network respectively, wherein the pixel-by-pixel semantic prediction network performs dense pixel-by-pixel semantic information prediction on the input pixel-by-pixel multi-modal characteristics to obtain pixel-by-pixel multi-category semantic information; the pixel-by-pixel three-dimensional attitude prediction network carries out dense pixel-by-pixel three-dimensional attitude prediction on the input pixel-by-pixel multi-modal characteristics to obtain pixel-by-pixel three-dimensional attitude information; and the pixel-by-pixel three-dimensional position prediction network carries out dense pixel-by-pixel three-dimensional position prediction on the input pixel-by-pixel multi-modal characteristics to obtain pixel-by-pixel three-dimensional position information. In addition, the predicted pixel-by-pixel multi-class semantic information is utilized to cut out the dense three-dimensional attitude information and the dense three-dimensional position information corresponding to each class of objects from the pixel-by-pixel three-dimensional attitude and position information.
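The three regression branches can be sketched as light convolutional heads on top of the per-pixel multi-modal features. The channel layout below (a single quaternion and a single offset per pixel, class-agnostic) is an assumption for illustration; the patent text leaves the exact output parameterization open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def head(in_ch, out_ch):
    # small shared structure used by each dense prediction branch
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )


class DensePredictionHeads(nn.Module):
    def __init__(self, in_ch=64, num_classes=10):
        super().__init__()
        self.semantic = head(in_ch, num_classes + 1)  # per-pixel class logits (+ background)
        self.attitude = head(in_ch, 4)                # per-pixel quaternion (q1..q4)
        self.position = head(in_ch, 3)                # per-pixel unitized 3D position offset

    def forward(self, feats):
        logits = self.semantic(feats)
        quat = F.normalize(self.attitude(feats), dim=1)  # keep quaternions on the unit sphere
        offset = self.position(feats)
        return logits, quat, offset
```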
In one embodiment, the scene three-dimensional point cloud is calculated from the depth image in S2, and the specific formula is as follows:
x = (u − c_x)·d / f_x, y = (v − c_y)·d / f_y, z = d
In the formula, (x, y, z) are the three-dimensional point cloud coordinates, f_x, f_y, c_x and c_y are the camera intrinsic parameters, u and v are the coordinates of the depth image, and d is the depth value of the depth image;
in S2, self-adaptive convolution receptive fields of different scales are obtained by calculation from the depth image, and the specific formula is as follows:
R_a^s(p) = R^s(p) + Δp(p)
In the formula, R_a^s(p) is the adaptive depth convolution receptive field of scale s corresponding to pixel p, R^s(p) is the conventional convolution receptive field corresponding to pixel p, and Δp(p) is the offset of the position of pixel p;
in S2, a surface normal vector image is obtained according to the scene three-dimensional point cloud, and the specific formula is as follows:
In the formula, N is the surface normal vector image, P = {p_i | i = 1, …, M} is the set of all three-dimensional points in the scene, M is the number of points, and n_i is the three-dimensional unit normal vector of the i-th point.
Specifically, the scene three-dimensional point cloud and the surface normal vector image are calculated from the depth image. The scene three-dimensional point cloud P is obtained by P(u, v) = ((u − c_x)·d / f_x, (v − c_y)·d / f_y, d), where f_x, f_y, c_x and c_y are the camera intrinsic parameters, u and v are the coordinates of the depth image, and d is the depth value at the depth image coordinate (u, v);
the surface normal vector image N is obtained from the scene three-dimensional point cloud, where P = {p_i | i ∈ ℕ, i ≤ M} is the set of all three-dimensional points in the scene, M is the number of points, n_i is the three-dimensional unit normal vector of the i-th point, and ℕ is the set of natural numbers;
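A numpy sketch of these two quantities follows. The back-projection implements the pinhole formula above; the normal estimation uses the cross product of central differences of the organized point cloud, which is a lightweight stand-in for the neighborhood plane fitting described later in this embodiment.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image of shape (H, W) into an organized (H, W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # u: column index, v: row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def point_cloud_to_normals(points):
    """Per-pixel unit surface normals from an organized point cloud.

    The cross product of central differences along the image axes is used here as a
    simple substitute for fitting a plane to each pixel's 3D neighborhood.
    """
    dx = np.gradient(points, axis=1)                 # derivative along image columns
    dy = np.gradient(points, axis=0)                 # derivative along image rows
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```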
computing adaptive depth convolution receptive fields of different scales from depth imagesWhich can be expressed as being greater than or equal to the conventional convolution field>On the basis of which a bias is added>The concrete formula is as follows:
in the formula (I), the compound is shown in the specification,for a pixel position +>Is pixel->Corresponding conventional convolution field,. Or>Is pixel->Corresponding adaptive depth convolution field of different scale->Is a pixel>The offset of the corresponding position is such that,,/>,/>is the convolution kernel size, is asserted>、/>Is the convolutional network feature layer size.
1) Calculate the normal vector n_p of the 3D plane corresponding to the 2D neighborhood plane in which pixel p lies. Using the camera intrinsic parameters and the depth values d located at the depth image coordinates (u, v), the 3D points of the neighborhood of pixel p are reconstructed; the 3D plane passing through all of these 3D points is then extracted, and its normal vector n_p is the desired vector. According to n_p, a 3D plane is constructed and, based on the camera intrinsic parameters and the pinhole camera projection principle, projected onto the 2D image plane at pixel p; the 3D mesh in the 3D plane is described by grid coefficients a and b and a scale factor.
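The following numpy sketch illustrates one plausible realization of this geometric construction for a single pixel: a regular k×k grid is laid out on the local tangent plane, scaled by a metric spacing, projected back through the pinhole model, and compared with the regular image-plane grid to obtain the offset Δp. The grid spacing and the choice of tangent directions are assumptions, since the patent text does not fully specify them.

```python
import numpy as np

def adaptive_receptive_field(p3d, normal, fx, fy, cx, cy, k=3, scale=0.01):
    """Offsets (k*k, 2) of a depth-adaptive k x k sampling grid at one pixel.

    p3d    : 3D point of the pixel (from the scene point cloud)
    normal : estimated unit normal of the local tangent plane at that pixel
    scale  : metric spacing of the grid on the tangent plane (assumed value)
    """
    # two orthonormal directions spanning the tangent plane
    ref = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(normal, ref); e1 /= np.linalg.norm(e1)
    e2 = np.cross(normal, e1)

    # regular (a, b) grid coefficients of the 3D mesh on the tangent plane
    r = (k - 1) // 2
    a, b = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    grid3d = p3d + scale * (a[..., None] * e1 + b[..., None] * e2)   # (k, k, 3)

    # pinhole projection of the 3D grid back onto the image plane
    u = fx * grid3d[..., 0] / grid3d[..., 2] + cx
    v = fy * grid3d[..., 1] / grid3d[..., 2] + cy

    # conventional receptive field of the same pixel (regular unit-spaced pixel grid)
    u0 = fx * p3d[0] / p3d[2] + cx
    v0 = fy * p3d[1] / p3d[2] + cy
    du = u - (u0 + a)                     # offset relative to the regular grid, x direction
    dv = v - (v0 + b)                     # offset relative to the regular grid, y direction
    return np.stack([dv, du], axis=-1).reshape(-1, 2)
```

Stacked over all pixels into a (2·k·k, H, W) tensor, such offsets could drive a deformable convolution (for example torchvision.ops.deform_conv2d); the exact offset ordering expected by that operator should be checked against its documentation.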
In one embodiment, the predicted dense three-dimensional posture information of each type of object in S4 specifically includes:
R_p = (q_1, q_2, q_3, q_4)
In the formula, R_p is the three-dimensional attitude of the object at pixel p, expressed in quaternion form, and q_i is the i-th value of the quaternion;
the predicted dense three-dimensional position information of each type of object in the S4 specifically comprises the following steps:
wherein t_p is the three-dimensional position of the object at pixel p, and Δt_p is the three-dimensional position offset at pixel p, representing the unitized three-dimensional offset from the 3D point P_p of pixel p to the three-dimensional position of the object.
In one embodiment, in S5, the three-dimensional posture of the corresponding object is calculated according to the predicted dense three-dimensional posture information of each type of object, and the specific formula is as follows:
R_obj = (1 / N_obj) · Σ_p R_p
wherein R_obj is the three-dimensional attitude of the object of category obj, and N_obj is the number of dense predictions corresponding to the category obj object;
and S5, calculating to obtain the three-dimensional position of the corresponding object according to the predicted dense three-dimensional position information of each type of object, wherein the specific formula is as follows:
t_obj = (1 / N_obj) · Σ_p t_p
In the formula, t_obj is the three-dimensional position of the object of category obj, t_p is a predicted dense three-dimensional position, and N_obj is the number of dense predictions corresponding to the category obj object.
Specifically, the three-dimensional attitude of the target object is obtained by averaging the densely predicted three-dimensional attitudes of the object.
When calculating the three-dimensional position of the target object, firstly, calculating the predicted dense three-dimensional position information of each type of object, specifically expressed as:
in the formula (I), the compound is shown in the specification,is pixel->Is located at a three-dimensional position, and>indicating position +>Is pixel->Is shifted in three-dimensional position, i.e. pixel->Is at the 3D point of the corresponding object>Three-dimensional position distant to an object>The unitized three-dimensional offset of (1) is specifically:
wherein the content of the first and second substances,the three-dimensional model diameter corresponding to the object of the category obj;
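A numpy sketch of this aggregation step under the averaging scheme described above; quaternion averaging is simplified to a normalized arithmetic mean, which is adequate when the dense predictions of one object are close together.

```python
import numpy as np

def decode_object_pose(mask, quats, offsets, points, diameter):
    """Aggregate dense per-pixel predictions of one object class into a single 6D pose.

    mask     : (H, W) boolean mask of pixels predicted as this object class
    quats    : (H, W, 4) per-pixel quaternion predictions
    offsets  : (H, W, 3) per-pixel unitized position offsets
    points   : (H, W, 3) scene three-dimensional point cloud
    diameter : diameter of the object's three-dimensional model
    """
    q = quats[mask]                                   # (N_obj, 4) dense attitude predictions
    q_mean = q.mean(axis=0)                           # simplified mean quaternion
    q_mean /= np.linalg.norm(q_mean)                  # renormalize to a unit quaternion

    # per-pixel 3D positions: point cloud plus offset rescaled by the model diameter
    t = points[mask] + offsets[mask] * diameter       # (N_obj, 3)
    t_mean = t.mean(axis=0)
    return q_mean, t_mean
```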
In one embodiment, the loss function preset in S3 is specifically:
L = λ_1·L_sem + λ_2·L_R + λ_3·L_t
In the formula, L is the total loss of the network; λ_1, λ_2 and λ_3 are the weight factors of the semantic prediction branch, the three-dimensional attitude prediction branch and the three-dimensional position prediction branch, respectively; L_sem is the loss function of the semantic prediction network, for which a cross-entropy loss function is adopted; L_R is the loss function of the three-dimensional attitude prediction network and L_t is the loss function of the three-dimensional position prediction network; the attitude loss compares the predicted values and ground-truth values of the three-dimensional attitude prediction network, and the position loss compares the predicted values and ground-truth values of the three-dimensional position prediction network; C is the number of object classes in the scene, and N is the number of dense predictions corresponding to each class of objects.
Specifically, a multi-modal feature extraction and dense prediction network is built, the network consists of a multi-modal feature extraction network and three regression branch networks, and the multi-modal feature extraction network consists of two identical convolution networks and a multi-scale feature fusion module. The method comprises the steps of training a built multi-modal feature extraction and dense prediction network by using a training data set, supervising network learning by using provided scene color and depth images and semantic masks and three-dimensional pose true values of all target objects to obtain optimal weight parameters, presetting a loss function for each regression branch, calculating a total loss value of the network according to the preset loss function, reversely propagating and updating network parameters of the network, and obtaining an updated multi-modal feature extraction and dense prediction network.
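A hedged PyTorch sketch of the weighted training objective: cross-entropy for the semantic branch as stated, and a plain L1 distance as a placeholder metric for the attitude and position branches, whose exact form the text does not spell out. In practice the attitude and position terms would be evaluated only on pixels belonging to each object.

```python
import torch
import torch.nn.functional as F

def total_loss(sem_logits, sem_gt, quat_pred, quat_gt, pos_pred, pos_gt,
               w_sem=1.0, w_pose=1.0, w_pos=1.0):
    """Weighted sum of the three branch losses; weights and branch metrics are assumptions."""
    l_sem = F.cross_entropy(sem_logits, sem_gt)   # dense semantic branch (per-pixel class labels)
    l_pose = F.l1_loss(quat_pred, quat_gt)        # dense 3D attitude branch (placeholder metric)
    l_pos = F.l1_loss(pos_pred, pos_gt)           # dense 3D position branch (placeholder metric)
    return w_sem * l_sem + w_pose * l_pose + w_pos * l_pos
```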
A robot grasping system adopts the robot grasping method based on multi-modal feature extraction and dense prediction to grasp a target object in a scene, and comprises a robot pose calculation module, a communication module, a grabbing module and an image acquisition module,
the image acquisition module is used for acquiring color images and depth images under multi-class object scenes in real time and sending the color images and the depth images to the pose calculation module;
the pose calculation module calculates the pose of the target object by adopting the robot grabbing method based on multi-modal feature extraction and dense prediction and sends the pose to the grabbing module through the communication module;
the grabbing module receives the 6D pose information of the target object and grabs the target object.
The robot grabbing method and system based on multi-modal feature extraction and dense prediction first acquire the scene color and depth images; they then calculate the scene three-dimensional point cloud, the surface normal vector image and adaptive convolution receptive fields of different scales from the depth image, construct a multi-modal feature extraction and dense prediction network, train and update the network, and process the scene color image and the surface normal vector image with the updated network to obtain the predicted dense three-dimensional attitude information and dense three-dimensional position information of each type of object. The three-dimensional attitude of the corresponding object is calculated from the predicted dense three-dimensional attitude information of each type of object, and the three-dimensional position of the corresponding object is calculated from the predicted dense three-dimensional position information of each type of object; the three-dimensional attitude and the three-dimensional position jointly form the three-dimensional pose of the corresponding object, which is sent to the robot grabbing system to complete the grabbing task for the corresponding object in the scene. The method integrates multi-modal color and depth data, adopts a two-dimensional convolution structure, retains two-dimensional plane features and depth information during feature extraction, has a simple structure and high prediction precision, and is suitable for robot grabbing tasks in complex scenes.
It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (9)
1. The robot grasping method based on multi-modal feature extraction and dense prediction is characterized by comprising the following steps of:
s1, acquiring a color image and a depth image of a robot under a multi-class object grabbing scene;
s2, calculating scene three-dimensional point clouds and adaptive convolution receptive fields with different scales from the depth images, and obtaining a surface normal vector image according to the scene three-dimensional point clouds;
s3, constructing a multi-modal feature extraction and dense prediction network by combining self-adaptive convolution receptive fields of different scales, inputting a preset training set into the network for training to obtain the trained multi-modal feature extraction and dense prediction network, calculating a total loss value of the network according to a preset loss function, and reversely propagating and updating network parameters of the network to obtain an updated multi-modal feature extraction and dense prediction network;
s4, processing the scene color image and the surface normal vector image through the updated multi-modal feature extraction and dense prediction network to obtain dense three-dimensional attitude information and dense three-dimensional position information predicted by each type of object;
s5, calculating to obtain the three-dimensional attitude of the corresponding object according to the predicted dense three-dimensional attitude information of each type of object, calculating to obtain the three-dimensional position of the corresponding object according to the predicted dense three-dimensional position information of each type of object, wherein the three-dimensional attitude and the three-dimensional position jointly form the three-dimensional pose of the corresponding object, and sending the three-dimensional pose to a robot grabbing system to complete the grabbing task of the corresponding object in the scene;
the multi-mode feature extraction and dense prediction network in the S3 comprises a multi-mode feature extraction network and three regression branch networks, wherein the multi-mode feature extraction network is used for performing feature extraction and feature fusion from the scene color image and the surface normal vector image to obtain multi-mode features, and the three regression branch networks are used for respectively predicting multi-class semantic information, three-dimensional attitude information and three-dimensional position information of the pixel-by-pixel target object from the multi-mode features.
2. The robot grasping method based on multi-modal feature extraction and dense prediction as claimed in claim 1, wherein the multi-modal feature extraction network includes a first convolution network, a second convolution network and a multi-scale feature fusion module, wherein the first convolution network extracts multi-scale color convolution features from the scene color image under guidance of adaptive convolution receptive fields of different scales, the second convolution network extracts multi-scale normal vector convolution features from the surface normal vector image under guidance of adaptive convolution receptive fields of different scales, and the multi-scale feature fusion module fuses the multi-scale color convolution features and the multi-scale normal vector convolution features to obtain multi-modal features.
3. The robot grasping method based on multi-modal feature extraction and dense prediction as claimed in claim 2, wherein the first convolution network and the second convolution network respectively use ResNet-18 as a backbone network, a third layer and subsequent convolution layers of the backbone network are discarded, and an original conventional convolution receptive field of the network is replaced by an adaptive deep convolution receptive field of different scales, the multi-scale feature fusion module includes a first sub-module and a second sub-module, the first sub-module is used for performing multi-modal convolution feature fusion on color convolution features and normal vector convolution features of the same scale in different scales to obtain multi-modal features of different scales, and the second sub-module performs up-sampling and scale information fusion on the obtained multi-modal features of different scales by using a feature pyramid structure to obtain scene pixel-by-pixel multi-modal features.
4. The robot grasping method based on multimodal feature extraction and dense prediction as claimed in claim 2, wherein the three regression branch networks are a pixel-by-pixel semantic prediction network, a pixel-by-pixel three-dimensional pose prediction network and a pixel-by-pixel three-dimensional position prediction network, respectively, the pixel-by-pixel semantic prediction network performs dense pixel-by-pixel semantic information prediction on the input multimodal features to obtain pixel-by-pixel multi-class semantic information, the pixel-by-pixel three-dimensional pose prediction network performs dense pixel-by-pixel three-dimensional pose prediction on the input multimodal features to obtain pixel-by-pixel three-dimensional pose information, and the pixel-by-pixel three-dimensional position prediction network performs dense pixel-by-pixel three-dimensional position prediction on the input multimodal features to obtain pixel-by-pixel three-dimensional position information.
5. The robot grasping method based on the multi-modal feature extraction and the dense prediction as claimed in claim 1, wherein the scene three-dimensional point cloud is calculated from the depth image in S2, and the specific formula is as follows:
x = (u − c_x)·d / f_x, y = (v − c_y)·d / f_y, z = d, wherein (x, y, z) are the coordinates of the three-dimensional point cloud, f_x, f_y, c_x and c_y are the camera intrinsic parameters, u and v are the coordinates of the depth image, and d is the depth of the depth image;
in the step S2, the adaptive convolution receptive fields of different scales are calculated from the depth image, and the specific formula is as follows:
R_a^s(p) = R^s(p) + Δp(p), wherein R_a^s(p) is the adaptive depth convolution receptive field of scale s corresponding to pixel p, R^s(p) is the conventional convolution receptive field corresponding to pixel p, and Δp(p) is the offset of the position of pixel p;
in the S2, a surface normal vector image is obtained according to the scene three-dimensional point cloud, and the specific formula is as follows:
6. The robot grasping method based on the multi-modal feature extraction and the dense prediction as claimed in claim 5, wherein the dense three-dimensional posture information predicted for each type of object in the S4 is specifically:
R_p = (q_1, q_2, q_3, q_4), wherein R_p is the three-dimensional attitude of the object at pixel p, expressed in quaternion form, and q_i is the i-th value of the quaternion;
the predicted dense three-dimensional position information of each type of object in the step S4 specifically includes:
wherein t_p is the three-dimensional position of the object at pixel p, and Δt_p is the three-dimensional position offset at pixel p, representing the unitized three-dimensional offset from the 3D point P_p of pixel p to the three-dimensional position of the object.
7. The robot grasping method based on the multi-modal feature extraction and the dense prediction as recited in claim 6, wherein the three-dimensional pose of the corresponding object is calculated in S5 according to the predicted dense three-dimensional pose information of each type of object, and the specific formula is as follows:
R_obj = (1 / N_obj) · Σ_p R_p, wherein R_obj is the three-dimensional attitude of the object of category obj, and N_obj is the number of dense predictions corresponding to the category obj object;
and in the S5, according to the predicted dense three-dimensional position information of each type of object, calculating to obtain the three-dimensional position of the corresponding object, wherein the specific formula is as follows:
8. The robot grasping method based on the multi-modal feature extraction and the dense prediction as claimed in claim 1, wherein the loss function preset in S3 is specifically:
L = λ_1·L_sem + λ_2·L_R + λ_3·L_t, wherein L is the total loss of the network; λ_1, λ_2 and λ_3 are the weight factors of the semantic prediction branch, the three-dimensional attitude prediction branch and the three-dimensional position prediction branch, respectively; L_sem is the loss function of the semantic prediction network, for which a cross-entropy loss function is adopted; L_R is the loss function of the three-dimensional attitude prediction network and L_t is the loss function of the three-dimensional position prediction network; the attitude loss compares the predicted values and ground-truth values of the three-dimensional attitude prediction network, and the position loss compares the predicted values and ground-truth values of the three-dimensional position prediction network; C is the number of object classes in the scene, and N is the number of dense predictions corresponding to each class of objects.
9. A robot grabbing system for grabbing a target object in a scene by using the robot grabbing method based on multi-modal feature extraction and dense prediction as claimed in any one of claims 1 to 8, the system comprises a robot pose calculation module, a communication module, a grabbing module and an image acquisition module,
the image acquisition module is used for acquiring color images and depth images under multi-class object scenes in real time and sending the color images and the depth images to the pose calculation module;
the pose calculation module calculates the pose of the target object by adopting the method according to any one of claims 1 to 8 and sends the pose to the grabbing module through the communication module;
the grabbing module receives the 6D pose information of the target object and grabs the target object.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211407718.3A CN115578460B (en) | 2022-11-10 | 2022-11-10 | Robot grabbing method and system based on multi-mode feature extraction and dense prediction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211407718.3A CN115578460B (en) | 2022-11-10 | 2022-11-10 | Robot grabbing method and system based on multi-mode feature extraction and dense prediction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115578460A CN115578460A (en) | 2023-01-06 |
CN115578460B true CN115578460B (en) | 2023-04-18 |
Family
ID=84588865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211407718.3A Active CN115578460B (en) | 2022-11-10 | 2022-11-10 | Robot grabbing method and system based on multi-mode feature extraction and dense prediction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115578460B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116494253B (en) * | 2023-06-27 | 2023-09-19 | 北京迁移科技有限公司 | Target object grabbing pose acquisition method and robot grabbing system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110363815A (en) * | 2019-05-05 | 2019-10-22 | 东南大学 | The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method |
WO2020130085A1 (en) * | 2018-12-21 | 2020-06-25 | 株式会社日立製作所 | Three-dimensional position/attitude recognition device and method |
CN115082885A (en) * | 2022-06-27 | 2022-09-20 | 深圳见得空间科技有限公司 | Point cloud target detection method, device, equipment and storage medium |
CN115256377A (en) * | 2022-07-12 | 2022-11-01 | 同济大学 | Robot grabbing method and device based on multi-source information fusion |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018051704A (en) * | 2016-09-29 | 2018-04-05 | セイコーエプソン株式会社 | Robot control device, robot, and robot system |
US11701771B2 (en) * | 2019-05-15 | 2023-07-18 | Nvidia Corporation | Grasp generation using a variational autoencoder |
CN113658254B (en) * | 2021-07-28 | 2022-08-02 | 深圳市神州云海智能科技有限公司 | Method and device for processing multi-modal data and robot |
CN114998573B (en) * | 2022-04-22 | 2024-05-14 | 北京航空航天大学 | Grabbing pose detection method based on RGB-D feature depth fusion |
CN114663514B (en) * | 2022-05-25 | 2022-08-23 | 浙江大学计算机创新技术研究院 | Object 6D attitude estimation method based on multi-mode dense fusion network |
CN115147488B (en) * | 2022-07-06 | 2024-06-18 | 湖南大学 | Workpiece pose estimation method and grabbing system based on dense prediction |
- 2022-11-10: application CN202211407718.3A filed in China; patent CN115578460B, status Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020130085A1 (en) * | 2018-12-21 | 2020-06-25 | 株式会社日立製作所 | Three-dimensional position/attitude recognition device and method |
CN110363815A (en) * | 2019-05-05 | 2019-10-22 | 东南大学 | The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method |
CN115082885A (en) * | 2022-06-27 | 2022-09-20 | 深圳见得空间科技有限公司 | Point cloud target detection method, device, equipment and storage medium |
CN115256377A (en) * | 2022-07-12 | 2022-11-01 | 同济大学 | Robot grabbing method and device based on multi-source information fusion |
Also Published As
Publication number | Publication date |
---|---|
CN115578460A (en) | 2023-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112270249B (en) | Target pose estimation method integrating RGB-D visual characteristics | |
CN109344882B (en) | Convolutional neural network-based robot control target pose identification method | |
CN109255813B (en) | Man-machine cooperation oriented hand-held object pose real-time detection method | |
CN106826833B (en) | Autonomous navigation robot system based on 3D (three-dimensional) stereoscopic perception technology | |
CN100407798C (en) | Three-dimensional geometric mode building system and method | |
CN113450408B (en) | Irregular object pose estimation method and device based on depth camera | |
CN111899301A (en) | Workpiece 6D pose estimation method based on deep learning | |
CN110852182B (en) | Depth video human body behavior recognition method based on three-dimensional space time sequence modeling | |
CN110176032B (en) | Three-dimensional reconstruction method and device | |
CN112836734A (en) | Heterogeneous data fusion method and device and storage medium | |
CN109325995B (en) | Low-resolution multi-view hand reconstruction method based on hand parameter model | |
CN111998862B (en) | BNN-based dense binocular SLAM method | |
CN113409384A (en) | Pose estimation method and system of target object and robot | |
JP2018119833A (en) | Information processing device, system, estimation method, computer program, and storage medium | |
CN112767467B (en) | Double-image depth estimation method based on self-supervision deep learning | |
CN109318227B (en) | Dice-throwing method based on humanoid robot and humanoid robot | |
CN115147488B (en) | Workpiece pose estimation method and grabbing system based on dense prediction | |
CN115578460B (en) | Robot grabbing method and system based on multi-mode feature extraction and dense prediction | |
CN114782628A (en) | Indoor real-time three-dimensional reconstruction method based on depth camera | |
CN114882109A (en) | Robot grabbing detection method and system for sheltering and disordered scenes | |
CN112750198A (en) | Dense correspondence prediction method based on non-rigid point cloud | |
CN112215861A (en) | Football detection method and device, computer readable storage medium and robot | |
CN115032648A (en) | Three-dimensional target identification and positioning method based on laser radar dense point cloud | |
CN111531546B (en) | Robot pose estimation method, device, equipment and storage medium | |
CN112950786A (en) | Vehicle three-dimensional reconstruction method based on neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |