CN113935368B - Method for recognizing, positioning and grabbing planar objects in scattered stacking state

Method for recognizing, positioning and grabbing planar objects in scattered stacking state

Info

Publication number
CN113935368B
Authority
CN
China
Prior art keywords
scene
image block
semantic segmentation
plane
point cloud
Prior art date
Legal status
Active
Application number
CN202111190937.6A
Other languages
Chinese (zh)
Other versions
CN113935368A (en)
Inventor
陈志勇
曾德财
李振汉
黄全杰
黄泽麟
Current Assignee
Quanzhou Bingdian Technology Co ltd
Fuzhou University
Original Assignee
Quanzhou Bingdian Technology Co ltd
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Quanzhou Bingdian Technology Co ltd and Fuzhou University
Priority to CN202111190937.6A
Publication of CN113935368A
Application granted
Publication of CN113935368B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for identifying, positioning and grabbing planar objects in a scattered stacking state. Data acquisition with a binocular structured light camera, local image block interception based on a sliding window and data enhancement provide the training and testing samples needed by the convolutional neural networks involved in the subsequent deep learning, and three convolutional neural networks are built and trained on these samples. The resulting deep learning networks identify all objects to be grabbed in the visual scene and provide reasonable grabbing-area matching recommendations for the three-dimensional point cloud registration used in conventional suction-cup gripper pose estimation. In addition, the method judges in the three-dimensional point cloud environment whether each graspable plane is occluded, which reduces false recognition of objects whose surfaces are partly occluded, finally determines the priority grabbing order of the objects to be grabbed, and ensures the grabbing success rate of the suction-cup gripper.

Description

Method for recognizing, positioning and grabbing planar objects in scattered stacking state
Technical Field
The invention belongs to the technical field of object identification and positioning, and particularly relates to an identification, positioning and grabbing method for a planar object which is in a scattered stacking state.
Background
The level of intelligence in manufacturing continues to rise. Compared with a fully manual process, letting robots replace manual work in part of the production procedure effectively improves production efficiency and reduces production cost. Intelligent sorting is one of the important directions for raising this level of intelligence, and whether the objects to be grabbed are identified and positioned accurately directly affects the grabbing success rate of the robot, making it a key link of intelligent sorting.
In early robotic sorting operations, the objects to be grasped had to be transferred to designated positions in an orderly, one-by-one and accurate manner, and the grasping was taught manually by technicians. A robot working in this mode has a low level of intelligence: it merely replays pre-taught paths and cannot accurately identify and grasp scattered, stacked objects. To raise the intelligence of the robot, machine vision is used to provide it with external information, so that it can accurately identify and position objects in arbitrary poses and grasp them effectively. In engineering practice, thin objects with a graspable plane are usually picked up by pneumatic adsorption with a suction-cup gripper, and three-dimensional point cloud matching based on machine vision is widely used to estimate the gripper pose and thereby confirm the grasping pose. Point cloud registration in a complex scene in fact registers several local point clouds of the scene against a point cloud template obtained in advance, and the difficulty lies in obtaining suitable local point clouds for registration. Local point clouds are usually selected with three-dimensional local descriptors such as SHOT, Spin Image and FPFH: feature points and their features are generated on the template, features are extracted from the scene point cloud with the same local descriptor, feature points are searched in the scene according to the extracted features, and local point clouds are then extracted around these feature points for registration. However, extracting feature points suitable for point cloud registration in a scattered stacking scene usually requires repeated manual tests to tune the local descriptors and their parameters. This series of operations demands considerable skill from field personnel and is time-consuming and labor-intensive.
With the recent large increase in computing power, deep learning, one of the data-driven methods, has attracted much attention. It is now gradually applied in many fields; in image processing, convolutional neural networks mainly perform classification, segmentation and target detection. During training, a convolutional neural network automatically extracts features from the dataset and learns higher-level abstract representations, which avoids the laborious manual design of feature detection operators and achieves higher recognition accuracy than traditional algorithms. It is therefore necessary to fully combine machine vision with deep learning to provide the robot with an intelligent identification and positioning method for objects with graspable planes in a scattered stacking state, so as to improve the efficiency and intelligence of sorting operations.
Disclosure of Invention
The invention aims to provide a recognition, positioning and grabbing method, based on machine vision and deep learning, for a class of objects with graspable planes in a scattered stacking state. Data acquisition with a binocular structured light camera, local image block interception based on a sliding window and data enhancement provide the training and testing samples needed by the convolutional neural networks involved in the subsequent deep learning. On these samples, three convolutional neural networks are built and trained: an image block classification network PATCHCATENET that identifies whether each local image block contains an unoccluded graspable plane, a semantic segmentation network PATCHSEGNET that segments the unoccluded graspable plane within each local image block, and a semantic segmentation network SCENESEGNET that identifies all objects to be grabbed in the scene image. The resulting deep learning networks identify all objects to be grabbed in the visual scene and provide reasonable grabbing-area matching recommendations for the three-dimensional point cloud registration used in conventional suction-cup gripper pose estimation. In addition, the method judges in the three-dimensional point cloud environment whether each graspable plane is occluded, which reduces false recognition of objects whose surfaces are partly occluded, finally determines the priority grabbing order of the objects to be grabbed, and ensures the grabbing success rate of the suction-cup gripper.
The invention adopts the following technical scheme:
A method for identifying, positioning and grabbing planar objects in a random stacking state, characterized in that: training and testing samples are obtained through data acquisition with a binocular structured light camera, local image block interception based on a sliding window and data enhancement; on these samples, three convolutional neural networks are built and trained, namely an image block classification network PATCHCATENET that identifies whether each local image block contains an unoccluded graspable plane, a semantic segmentation network PATCHSEGNET that segments the unoccluded graspable plane within each local image block, and a semantic segmentation network SCENESEGNET that identifies all objects to be grabbed in the scene image; all objects to be grabbed in the visual scene are identified and grabbing-area matching recommendations are provided for the three-dimensional point cloud registration involved in suction-cup gripper pose estimation; whether each graspable plane is occluded is judged in the three-dimensional point cloud environment, and the priority grabbing order of the objects to be grabbed is finally determined.
Further, the acquisition and labeling of the sample image comprises the following steps:
step S001: the binocular structured light camera is fixedly arranged right above the object placement plane to be grabbed, and a scene gray level map and a scene depth map are acquired through the binocular structured light camera;
step S002: the scene gray map is obtained from step S001;
step S003: the scene depth map is obtained from step S001;
Step S004: in the scene gray map, semantic segmentation labeling is performed on all objects to be grabbed in the scene, and instance segmentation labeling is performed on the unoccluded graspable planes in the scene.
It should be emphasized that the step numbers used in the present invention do not limit the execution order of the steps. As can be seen from Fig. 1 of the specification, steps S002 and S003 are in fact parallel steps whose order may be adjusted as needed; step S004 follows step S002 and has no direct dependence on step S003, so the relative order of steps S003 and S004 has no practical technical significance. Performing these steps in any order that achieves the same effect, according to the common knowledge or practice of persons skilled in the art, is a technical equivalent of the present invention and falls within its scope of protection.
Further, the training process of the image block classification network PATCHCATENET includes the following steps:
step S005: a dataset of the image block classification network PATCHCATENET is generated in the following manner:
setting a square sliding window of fixed size according to the size of the graspable plane of the object to be grabbed in the image, wherein the window size is the patch size and the window covers the graspable plane and its surrounding information; the step length of the sliding window is set to the patch stride;
the PATCHCATENET dataset is generated on the depth map: first, the minimum bounding rectangle containing the semantic segmentation is cropped from the depth map according to the semantic segmentation label of step S004; then, starting from the upper-left corner of the cropped rectangular region, image blocks are cut out from left to right and from top to bottom according to the set sliding window size and step length, until the lower-right corner is reached; each cropped image block is classified as follows (a code sketch of this procedure is given after step S009): if the image block intersects several of the instance segmentation labels of step S004, the percentage P_i (i = 1, 2, …, n, where n is the number of instance segmentations contained in the image block) of each instance segmentation area that falls inside the image block is calculated in turn according to the instance segmentation labels; the largest percentage P_max = Max(P_1, P_2, …, P_n) is selected and compared with a preset graspable threshold P_threshold; if P_max ≥ P_threshold, the category of the image block is judged positive, otherwise negative;
Step S006: an offline data enhancement mode is adopted before training the image block classification network PATCHCATENET to obtain enough and balanced positive and negative examples of the network, and the enhancement mode is as follows:
Different stacking modes of the objects and different values of the graspable threshold P_threshold often lead to an unbalanced PATCHCATENET dataset, typically with few positive examples and many negative examples. To alleviate this imbalance, the image blocks can be augmented by rotation, and the sampling period of positive and negative samples over the rotation angles is controlled so as to adjust their ratio. Because rotating an already cropped depth map image block does not supply sufficient surrounding information about the graspable plane, the invention instead rotates the whole picture about the geometric centroid of the scene depth map and then re-crops it, which directly enlarges the dataset and preserves more of the information contained in the rotated image blocks;
step S007: building an image block classification network PATCHCATENET;
Step S008: training the image block classification network PATCHCATENET constructed in the step S007 by using the image block classification network PATCHCATENET dataset generated in the step S006;
Step S009: the trained image block classification network PATCHCATENET is obtained from step S008.
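As an illustration of steps S005 and S006, the sliding-window cropping and positive/negative labeling can be sketched as follows. This is a minimal NumPy sketch under assumed data layouts (one boolean mask per unoccluded graspable-plane instance, a depth map already cropped to the bounding rectangle); the function and variable names are illustrative, not part of the patent.

```python
import numpy as np

def label_patches(depth_crop, instance_masks, patch_size=280, patch_stride=56, p_threshold=0.95):
    """Slide a square window over the cropped depth map (step S005) and label each
    image block positive if it covers at least p_threshold of some unoccluded
    graspable-plane instance, negative otherwise.

    depth_crop     : (H, W) depth map cropped to the minimum bounding rectangle of the labels
    instance_masks : list of (H, W) boolean masks, one per unoccluded graspable plane
    """
    samples = []
    h, w = depth_crop.shape
    for top in range(0, h - patch_size + 1, patch_stride):
        for left in range(0, w - patch_size + 1, patch_stride):
            window = np.zeros((h, w), dtype=bool)
            window[top:top + patch_size, left:left + patch_size] = True
            # P_i: fraction of each instance segmentation that lies inside the window
            ratios = [(window & m).sum() / max(m.sum(), 1) for m in instance_masks]
            p_max = max(ratios) if ratios else 0.0
            label = 1 if p_max >= p_threshold else 0          # positive / negative example
            best = int(np.argmax(ratios)) if ratios else -1   # index of the P_max instance
            patch = depth_crop[top:top + patch_size, left:left + patch_size]
            samples.append((patch, label, best, (top, left)))
    return samples
```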
Further, the training process of the semantic segmentation network PATCHSEGNET includes the following steps:
Step S015: the image block semantic segmentation network PATCHSEGNET dataset is generated in the following manner:
depth map image blocks are obtained in the manner of step S005; for image blocks whose category label is positive, a semantic segmentation label is generated, namely the part of the instance segmentation label corresponding to P_max that lies inside the image block; negative examples carry no semantic segmentation label;
Step S016: offline data enhancement is adopted before training of the image block semantic segmentation network PATCHSEGNET in the following manner:
the scene depth maps are rotated and re-cropped in the manner of step S006, and the image blocks and their corresponding semantic segmentation labels are obtained as in step S015;
step S017: building an image block semantic segmentation network PATCHSEGNET;
Step S018: training the image block semantic segmentation network PATCHSEGNET constructed in step S017 with the PATCHSEGNET dataset generated in step S016;
Step S019: the trained image block semantic segmentation network PATCHSEGNET is obtained from step S018.
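Continuing the assumptions of the previous sketch, the label generation of step S015 reduces to cropping the P_max instance mask to the window; a hypothetical helper:

```python
def patch_seg_label(instance_masks, best, top, left, patch_size=280):
    """Step S015 in sketch form: the semantic label of a positive image block is the
    P_max instance mask restricted to the window; negative blocks carry no label."""
    return instance_masks[best][top:top + patch_size, left:left + patch_size].astype('uint8')
```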
Further, the training process of the semantic segmentation network SCENESEGNET includes the following steps:
Step S025: generating a data set of a scene semantic segmentation network SCENESEGNET by adopting the scene gray level map and semantic segmentation labels of all objects to be grabbed in the scene;
step S026: offline data enhancement is applied to the dataset as needed before training the scene semantic segmentation network SCENESEGNET;
Step S027: constructing a scene semantic segmentation network SCENESEGNET;
step S028: training the scene semantic segmentation network SCENESEGNET constructed in the step S027 by utilizing the scene semantic segmentation network SCENESEGNET data set generated in the step S026;
step S029: the trained scene semantic segmentation network SCENESEGNET is obtained from step S028.
Further, the method for identifying and positioning the object to be grabbed in the scene comprises the following steps:
Step S100: shooting a single object to be grabbed placed on a platform by utilizing a binocular structured light camera fixedly installed right above the object to be grabbed placing platform in a scene to obtain a scene point cloud containing the single object to be grabbed;
Step S200: deleting redundant point clouds in the scene point clouds obtained in the step S100, reserving point clouds of a grabbing plane of an object to be grabbed, so as to establish a three-dimensional point cloud template under a scene coordinate system, and selecting characteristic points one by one in the three-dimensional point cloud template, so that polygons formed by the characteristic points connected in sequence can surround the outer contour of the grabbing plane; the point cloud templates and the feature points are obtained under a scene coordinate system;
step S300: shooting all objects to be grabbed which are placed on a platform and are in a scattered stacking state by using a binocular structured light camera fixedly installed right above the object to be grabbed placing platform in a scene;
step S400: the scene gray map is obtained from step S300;
Step S500: the scene depth map is obtained from step S300;
step S600: the scene three-dimensional point cloud is obtained from step S300;
step S700: the scene gray map obtained in step S400 and the scene depth map obtained in step S500 are fed into the network prediction module, which uses the three deep learning networks trained in the earlier stage for prediction and outputs the final scene depth map instance segmentation;
Step S800: in the scene three-dimensional point cloud obtained in step S600, an image block of patch size centered on the centroid of each instance segmentation obtained in step S700 is selected as a local occlusion judgment area; the scene point cloud corresponding to each such area is an occlusion judgment local point cloud of the scene;
Step S900: in the scene three-dimensional point cloud obtained in step S600, the local area corresponding to each instance segmentation obtained in step S700 is selected as a local area to be registered; the scene point cloud corresponding to each such area is a local point cloud to be registered of the scene;
Step S1000: the template point cloud and feature points obtained in step S200, the occlusion judgment local point clouds of the scene obtained in step S800 and the local point clouds to be registered obtained in step S900 are fed into the point cloud occlusion judgment module, which outputs a grabbing recommendation for the objects to be grabbed in the scene;
Step S1100: the object to be grabbed with the highest recommendation priority is grabbed, and the process then returns to step S300 (the overall loop is sketched in code below).
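The online loop of steps S300 to S1100 can be summarized as follows. Every callable here is a hypothetical stand-in for the corresponding module of the method (camera acquisition, the network prediction module of step S700, the point cloud occlusion judgment module of step S1000, and the gripper); none of these interfaces is defined by the patent.

```python
def grasping_loop(capture, predict, occlusion_rank, grasp):
    """One grabbing round over steps S300-S1100 (all interfaces assumed)."""
    while True:
        gray, depth, cloud = capture()              # S300-S600: gray map, depth map, point cloud
        instances = predict(gray, depth)            # S700: final depth-map instance segmentation
        if not instances:                           # S720: nothing left to grab, end the round
            return
        ranking = occlusion_rank(cloud, instances)  # S800-S1000: grabbing recommendation
        grasp(ranking[0])                           # S1100: grab the highest-priority object
```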
Further, the network prediction module used in step S700 takes the scene gray map obtained in step S400 and the scene depth map obtained in step S500 as input and outputs the final scene depth map instance segmentation (a sketch of the mask restoration and score update is given after step S790), and specifically includes the following steps:
step S710: taking the scene gray level map obtained in the step S400 as input, and inputting a scene semantic segmentation network SCENESEGNET for reasoning to obtain semantic segmentation of all objects to be grabbed in the scene;
Step S720: judging whether objects to be grabbed exist in the scene by semantic segmentation of all the objects to be grabbed in the scene output by the step S710, if so, performing the step S730, and if not, ending the round of grabbing;
Step S730: according to semantic segmentation of all objects to be grabbed in the scene output in the step S710, slightly expanding a rectangular frame based on a minimum circumscribed rectangular frame without rotation of the semantic segmentation, and taking the rectangular frame as a range to intercept the scene depth map obtained in the step S500; acquiring an image block on the intercepted depth map in a mode of intercepting the image block in the step S005;
Step S740: taking the image block obtained in the step S730 as input, inputting the input image block into an image block classification network PATCHCATENET prediction module, and outputting an image block with a category score greater than or equal to a preset category score threshold value Cate threshold;
step S750: taking the image block output in the step S740 as input, and inputting the image block semantic segmentation network PATCHSEGNET for reasoning; obtaining the grabbed plane semantic segmentation of each image block which is not blocked;
Step S760: restoring the output semantic segmentation result to the scene depth map according to the position of the image block input in the step S750 in the scene depth map, wherein different semantic segmentations obtained after the restoration of different image blocks form a group of semantic segmentations with the size of the scene depth map and the number of PATCHSEGNET input image blocks as example segmentations of one scene depth map;
step S770: updating the class score of each PATCHSEGNET input image block using non-maximum suppression by the example segmentation result;
Step S780: judging whether the updated class score is greater than or equal to a preset non-maximum suppressed class score threshold NMS threshold, if so, reserving the instance segmentation, and if not, discarding the instance segmentation;
step S790: and obtaining final scene depth map instance segmentation according to the reserved result of the step S780.
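A minimal NumPy sketch of the mask restoration of step S760 and the non-maximum suppression of steps S770-S780; the IoU criterion and the rule of zeroing the scores of suppressed masks are assumptions, since the text does not spell out the exact update rule.

```python
import numpy as np

def restore_to_scene(patch_mask, top, left, scene_shape):
    """Step S760: place a patch-level segmentation back at its position in the scene depth map."""
    canvas = np.zeros(scene_shape, dtype=bool)
    h, w = patch_mask.shape
    canvas[top:top + h, left:left + w] = patch_mask.astype(bool)
    return canvas

def nms_update_scores(masks, scores, iou_threshold=0.5):
    """Steps S770-S780: mask-IoU non-maximum suppression over the restored segmentations;
    suppressed masks get a zero score and are then dropped by the NMS_threshold test."""
    updated = np.asarray(scores, dtype=float).copy()
    order = np.argsort(updated)[::-1]                     # high score first
    for i, a in enumerate(order):
        if updated[a] == 0.0:
            continue
        for b in order[i + 1:]:
            inter = np.logical_and(masks[a], masks[b]).sum()
            union = np.logical_or(masks[a], masks[b]).sum()
            if union > 0 and inter / union > iou_threshold:
                updated[b] = 0.0                          # suppress the lower-scored duplicate
    return updated                                        # keep masks with updated >= NMS_threshold
```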
Further, the image block classification network PATCHCATENET prediction module used in step S740 takes depth map image blocks as input and outputs the image blocks whose category score is greater than or equal to the preset category score threshold, and specifically includes the following steps:
Step S741: the depth map image blocks obtained in step S730 are fed into the image block classification network PATCHCATENET for inference, yielding for each image block a score for containing an unoccluded graspable plane;
Step S742: it is judged block by block whether this score is greater than or equal to the preset category score threshold Cate_threshold; if so, the image block is kept; if not, it is discarded;
Step S743: the image blocks whose category score is greater than or equal to the preset category score threshold are obtained from the results kept in step S742.
Further, the point cloud occlusion judgment module used in step S1000 takes as input the occlusion judgment local point clouds of the scene, the graspable plane template and its feature points, and the local point clouds to be registered of the scene, and outputs a grabbing recommendation for the objects to be grabbed in the scene (a sketch of the geometric test is given after step S1007); the specific steps are as follows:
Step S1001: the local point cloud to be registered obtained in step S900 is registered with the graspable plane template obtained in step S200 to obtain the transformation matrix from the graspable plane template to the local point cloud, and the graspable plane template and feature points obtained in step S200 are transformed into the scene with this matrix;
Step S1002: a graspable plane is fitted from the graspable plane template transformed into the scene, giving its plane equation parameters, and the occlusion judgment local point cloud of the scene obtained in step S800 and the feature points transformed into the scene in step S1001 are projected onto the fitted plane;
Step S1003: for each point projected onto the fitted plane, it is judged whether, before projection, the point was near the fitted plane (i.e. a point on the graspable plane in the scene) or below the plane; if not, step S1004 is performed; if so, the point is discarded;
Step S1004: it is judged whether the points remaining from step S1003, i.e. the points above the graspable plane, lie inside the polygon formed by the feature points on the fitted plane; if so, step S1005 is performed; if not, the point is discarded;
Step S1005: recording the number of points output in step S1004;
step S1006: the instance segmentations are sorted according to the number of points above the plane recorded in step S1005;
Step S1007: the grabbing recommendation for the objects to be grabbed in the scene is obtained from the instance segmentation ordering of step S1006.
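A sketch of the geometric core of steps S1002-S1005 under a number of assumptions: the 4x4 transform T is taken as already obtained from the registration of step S1001 (for example by ICP), the plane is fitted by SVD, the normal is assumed to point towards the camera, the near-plane tolerance is illustrative, and matplotlib's point-in-polygon test stands in for whatever test the implementation actually uses.

```python
import numpy as np
from matplotlib.path import Path

def points_above_graspable_plane(template_pts, feature_pts, occlusion_pts, T, near_tol=2.0):
    """Count the occlusion-judgment points that lie above the fitted graspable plane
    and inside the polygon of the feature points (steps S1002-S1005).
    template_pts, feature_pts, occlusion_pts: (N, 3) arrays; T: 4x4 transform from S1001."""
    def apply(T, pts):
        return pts @ T[:3, :3].T + T[:3, 3]

    tmpl = apply(T, template_pts)                 # template transformed into the scene
    feats = apply(T, feature_pts)

    # S1002: fit a plane to the transformed template by SVD
    centroid = tmpl.mean(axis=0)
    _, _, vt = np.linalg.svd(tmpl - centroid)
    normal = vt[2]                                # direction of least variance = plane normal
    if normal[2] < 0:                             # orient towards the camera (+z); an assumption
        normal = -normal

    # S1003: keep only points clearly above the plane (not on it, not below it)
    dist = (occlusion_pts - centroid) @ normal
    above = occlusion_pts[dist > near_tol]
    if len(above) == 0:
        return 0

    # S1004: project into the plane's own 2D frame and test against the feature-point polygon
    u, v = vt[0], vt[1]                           # in-plane basis from the same SVD
    polygon = np.c_[(feats - centroid) @ u, (feats - centroid) @ v]
    points2d = np.c_[(above - centroid) @ u, (above - centroid) @ v]
    inside = Path(polygon).contains_points(points2d)
    return int(inside.sum())                      # S1005: the recorded count

# S1006-S1007: instances with fewer points above their plane are less occluded and are
# recommended first (the sorting direction is an assumption, the text only says "sorted").
```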
Further, the image block classification network PATCHCATENET employs ResNet; the image block semantic segmentation network PATCHSEGNET and the scene semantic segmentation network SCENESEGNET employ a semantic segmentation network Unet.
Compared with the prior art, the invention and its preferred schemes provide an identification and positioning method based on machine vision and deep learning for a class of objects with graspable planes in a scattered stacking state. The method accurately identifies and positions the graspable planes of objects stacked at random and automatically recommends a registration area for point cloud registration, which avoids the tedious manual operations that various registration methods otherwise repeat for every mixed scene and improves the registration success rate. In addition, the method judges in the three-dimensional point cloud environment whether each graspable plane is occluded, which reduces false recognition of objects whose surfaces are partly occluded, finally determines the priority grabbing order of the objects to be grabbed, and ensures the grabbing success rate of the suction-cup gripper.
Drawings
The invention is described in further detail below with reference to the attached drawings and detailed description:
Fig. 1 is a flow chart of the early-stage network training in an embodiment of the present invention.
Fig. 2 is the general flow chart of an embodiment of the present invention.
Fig. 3 is a flow chart of the network prediction module within the general flow chart of an embodiment of the invention.
Fig. 4 is a flow chart of the image block classification network PATCHCATENET prediction module within the network prediction module in an embodiment of the present invention.
Fig. 5 is a flow chart of the point cloud occlusion judgment module within the general flow chart of an embodiment of the present invention.
Fig. 6 is an example of gray map annotation in an embodiment of the present invention.
Fig. 7 shows the sliding window size selection in an embodiment of the invention.
Fig. 8 shows the minimum bounding rectangle (without rotation) of the semantic segmentation labels of a scene depth map in an embodiment of the present invention.
Fig. 9 shows the rectangular region of a scene depth map cropped according to the semantic segmentation labels in an embodiment of the present invention.
Fig. 10 shows a sliding window whose category is a positive example in an embodiment of the present invention.
Fig. 11 shows a sliding window whose category is a negative example in an embodiment of the invention.
Fig. 12 is a schematic diagram of the network architecture using Unet in an embodiment of the present invention.
Fig. 13 shows the feature points marked on the template in an embodiment of the invention.
Fig. 14 shows a graspable plane recognition result in an embodiment of the present invention.
Detailed Description
In order to make the features and advantages of the present patent more comprehensible, embodiments accompanied with figures are described in detail below:
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
In this embodiment, scattered stacked gears with a graspable plane are taken as the example. A binocular structured light camera is used to collect 267 groups of gear gray maps, depth maps and corresponding three-dimensional point clouds in different scattered stacking states; 224 groups are used to generate the training set and 43 groups the test set. The relevant parameter settings used in the invention are shown in Table 1.
TABLE 1 Parameter settings
Sliding window size patch size: 280
Sliding window step patch stride: 56
Graspable threshold P_threshold: 95%
Category score threshold Cate_threshold: 40%
Post-suppression category score threshold NMS_threshold: 40%
As shown in Figs. 1 to 14, the implementation of the method for identifying, positioning and grabbing planar objects in a random stacking state first covers dataset generation and the construction and training of the three convolutional neural networks, with the following specific steps:
Step S001: and fixedly mounting the binocular structured light camera right above the gear placing plane, and collecting a scene gray level map and a scene depth map through the binocular structured light camera.
Step S002: a scene gray map is obtained from step S001.
Step S003: a scene depth map is obtained from step S001.
Step S004: in the field Jing Huidu diagram, semantic segmentation labeling is carried out on all gears to be grabbed in a scene, and instance segmentation labeling is carried out on the grabbed planes which are not blocked in the scene. In the field Jing Huidu diagram (as shown in fig. 6), the whole gear to be grabbed is subjected to semantic segmentation labeling, and the grabbed plane is subjected to instance segmentation labeling.
Step S005: a dataset of the image block classification network PATCHCATENET is generated in the following manner: setting a square sliding window with a fixed size according to the size of a grippable plane of the gear under an image, wherein the window size is a patch size and comprises the grippable plane and peripheral information thereof; the step size of the sliding window is set to PATCH STRIDE.
In this example, the setting of the sliding window is shown in fig. 7, and the box is the set sliding window; considering that the smallest circumscribed square of the gears is about 224×224, the patch size is set to 280×280, and the patch stride is set to 1/5 of the patch size, i.e., 56.
The PATCHCATENET dataset is generated on the depth map: the minimum bounding rectangle (without rotation) containing the semantic segmentation is cropped from the depth map according to the semantic segmentation label of step S004. Fig. 8 shows an example of such a minimum bounding rectangle on the depth map, and Fig. 9 shows the rectangular region cropped according to the semantic segmentation label of Fig. 8.
Image blocks are cut from the cropped rectangular region of the depth map from left to right and from top to bottom, starting at its upper-left corner, according to the set sliding window size and step length, until the lower-right corner is reached. Each cropped image block is classified as follows: if the image block intersects several of the instance segmentations of step S004, the percentage P_i (i = 1, 2, …, n, where n is the number of instance segmentations contained in the image block) of each instance segmentation area lying inside the image block is calculated in turn according to the instance segmentation labels; the largest percentage P_max = Max(P_1, P_2, …, P_n) is selected and compared with the preset graspable threshold P_threshold; if P_max ≥ P_threshold, the category of the image block is judged positive, otherwise negative.
As shown in Figs. 10 and 11, the square frame is the sliding window, the light-colored area near the frame is the visualized instance segmentation label, the text gives P_i and its value, and the graspable threshold P_threshold of this example is set to 95%. At the sliding window position of Fig. 10, the image block satisfies P_max > P_threshold with P_max = P_1 = 100%, so it is judged a positive example; at the sliding window position of Fig. 11, the image block satisfies P_max < P_threshold with P_max = P_1 = 75.13%, so it is judged a negative example.
Step S006: an offline data enhancement mode is adopted before training the image block classification network PATCHCATENET to obtain enough and balanced positive and negative examples of the network, and the enhancement mode is as follows:
Different stacking modes of the gears and different values of the graspable threshold P_threshold often lead to an unbalanced PATCHCATENET dataset, typically with few positive examples and many negative examples. To alleviate this imbalance, the image blocks can be augmented by rotation, and the sampling period of positive and negative samples over the rotation angles is controlled so as to adjust their ratio. Because rotating an already cropped depth map image block does not supply sufficient surrounding information about the graspable plane, the invention rotates the whole picture about the geometric centroid of the scene depth map and then re-crops it, which directly enlarges the dataset and preserves more of the information contained in the rotated image blocks.
In this embodiment, the centers of the 224 training set scene depth maps are taken as rotation centers; positive-example views are collected every 30 degrees anticlockwise from the home position (0 degrees), 12 views in total, generating 18726 positive image blocks; negative-example views are collected every 180 degrees, 2 views in total, generating 24229 negative image blocks, which adjusts the positive-to-negative ratio of the PATCHCATENET training set. The 42955 image blocks are used for PATCHCATENET training. The test set image blocks are not augmented; the 43 test set scene depth maps yield 225 positive and 2321 negative image blocks for PATCHCATENET testing.
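The rotate-then-recrop enhancement can be sketched with OpenCV as below; the border handling and nearest-neighbour interpolation (chosen so that depth values are not blended across object edges) are assumptions, and the re-cropping and re-labeling reuse the sliding-window procedure of step S005.

```python
import cv2
import numpy as np

def rotated_views(scene_depth, angles_deg):
    """Step S006 enhancement: rotate the whole scene depth map about its geometric
    center and yield each rotated view, from which patches are re-cropped so that
    the rotated image blocks keep their full surrounding context."""
    h, w = scene_depth.shape
    center = (w / 2.0, h / 2.0)
    for angle in angles_deg:                       # positive angle = anticlockwise
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        yield cv2.warpAffine(scene_depth, M, (w, h), flags=cv2.INTER_NEAREST,
                             borderMode=cv2.BORDER_CONSTANT, borderValue=0)

# This embodiment: positive examples from 12 views (every 30 degrees), negative examples
# from 2 views (every 180 degrees); each view is re-cropped and re-labeled with the
# sliding window of step S005, with the annotation masks rotated in the same way.
```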
Step S007: an image block classification network PATCHCATENET is built. The present example uses a classification network ResNet as PATCHCATENET, where the network inputs are set to 224 x 224 according to the size trained on ImageNet, and the image block size of the present example is 280 x 280, thus requiring downsampling to 224 x 224 before inputting into the network. Because judging whether the unoccluded grippable plane is contained or not is a classification problem, the number of output channels of the last layer is 1, and the activation function used by the last layer is sigmold functions.
Step S008: the image block classification network PATCHCATENET constructed in step S007 is trained using the image block classification network PATCHCATENET dataset generated in step S006.
Online data enhancement during PATCHCATENET training in this example: random horizontal flipping and random vertical flipping. Parameter update strategy during PATCHCATENET training: the weights are randomly initialized, the maximum number of parameter update iterations is 9000, and the learning rate is updated with stochastic gradient descent using a preset learning rate of 0.01, momentum of 0.9 and weight decay of 0.0001. The learning rate increases linearly from 0.001 times the preset learning rate to the preset learning rate over the first 1000 iterations, drops to 0.1 times the preset learning rate at 6000 iterations, and to 0.01 times the preset learning rate at 8000 iterations.
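The warmup-plus-step schedule can be expressed, for instance, with a per-iteration LambdaLR in PyTorch; treating the schedule as per iteration (rather than per epoch) follows the 9000-iteration budget stated above, and the helper name is an assumption.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model):
    """SGD with the warmup + step learning-rate schedule of step S008."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0001)

    def lr_factor(it):
        if it < 1000:                       # linear warmup from 0.001x to 1x of the preset rate
            return 0.001 + (1.0 - 0.001) * it / 1000.0
        if it < 6000:
            return 1.0
        if it < 8000:
            return 0.1                      # 0.1x of the preset rate from iteration 6000
        return 0.01                         # 0.01x of the preset rate from iteration 8000

    return opt, LambdaLR(opt, lr_lambda=lr_factor)

# optimizer, scheduler = make_optimizer(patchcatenet); call scheduler.step() once per
# iteration, up to the 9000-iteration limit.
```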
Step S009: the trained tile classification network PATCHCATENET is obtained from step S008.
Step S015: the image block semantic segmentation network PATCHSEGNET dataset is generated in the following manner:
Depth map image blocks are obtained in the manner of step S005; for image blocks whose category label is positive, a semantic segmentation label is generated, namely the part of the instance segmentation label corresponding to P_max that lies inside the image block; negative examples carry no semantic segmentation label. As shown in Fig. 10, the image block category is a positive example and its semantic segmentation label is the light-colored central part; as shown in Fig. 11, the image block category is a negative example and no semantic segmentation label is made.
Step S016: offline data enhancement is adopted before training of the image block semantic segmentation network PATCHSEGNET in the following manner:
The scene depth maps are rotated and re-cropped in the manner of step S006, and the image blocks and their corresponding semantic segmentation labels are obtained as in step S015. The 18726 image blocks whose category is positive are used for PATCHSEGNET training; the test set image blocks are not augmented, and the 43 test set scene depth maps yield 225 positive image blocks for PATCHSEGNET testing.
Step S017: an image block semantic segmentation network PATCHSEGNET is built. The present example uses semantic segmentation network Unet as PATCHSEGNET, and uses 224 x 224 tiles as PATCHSEGNET input since tile prediction is used as in PATCHCATENET. Since the original Unet is subjected to 2 times of downsampling for 4 times, the image with 224×224 size becomes 14×14 after 2 times of downsampling for 4 times, so that one downsampling layer with the original Unet is omitted, the resolution can be increased to 28×28, meanwhile, for convenience of upsampling, the convolution kernel parameter padding is set to 1, so that the size of the feature map on the same layer is unchanged, as shown in fig. 12, S in the figure represents the size of an input image, rectangles represent feature maps (solid rectangle represents feature maps operated sequentially, dashed rectangle represents feature maps connected in a jumping manner, solid rectangle and dashed rectangle are combined together to represent a splicing operation in a channel dimension), the lower numbers represent feature map channels, solid arrow to the right represents convolution layer, normalization layer and activation layer operation with the convolution kernel size of 3×3, solid arrow to the down represents pooling layer operation with 2 times of downsampling, solid arrow to the up represents 2 times of upsampling operation, dashed arrow to the right represents jump connection operation, input is single-channel image, and the number of output semantic division channels is 1.
Step S018: the image block semantic segmentation network PATCHSEGNET constructed in step S017 is trained using the image block semantic segmentation network PATCHCATENET dataset generated in step S016. The data enhancement and parameter update strategy during training of this example is the same as step S008.
Step S019: the trained image block semantic segmentation network PATCHSEGNET is obtained from step S018.
Step S025: the dataset of the scene semantic segmentation network SCENESEGNET is generated from the scene gray maps and the semantic segmentation labels of all gears to be grabbed in the scene. The 267 collected scene gray maps and their semantic segmentation labels are used: 224 scene gray maps generate the training set and 43 generate the test set.
Step S026: offline data enhancement is applied to the dataset as needed before training the scene semantic segmentation network SCENESEGNET. Before SCENESEGNET training, the scene gray maps are enhanced offline: with the centers of the 224 training set scene gray maps as rotation centers, rotations of 90, 180 and 270 degrees anticlockwise from the home position (0 degrees) are applied, giving 4 views in total and 896 scene gray maps for SCENESEGNET training; the test set scene gray maps are not augmented and the 43 scene gray maps are kept for testing.
Step S027: a scene semantic segmentation network SCENESEGNET is built. The present example employs a network of the same structure as PATCHSEGNET as SCENESEGNET. Because the camera has overlarge vision and contains too many non-working areas, the working areas are preset during training and reasoning, the gray level images in the range are used as objects for training and reasoning, the gray level images of the cut-out working areas are downsampled to 512 multiplied by 512 and input into a network for training, and the semantic segmentation result is obtained and then upsampled to the original size.
Step S028: the scene semantic segmentation network SCENESEGNET constructed in step S027 is trained using the scene semantic segmentation network SCENESEGNET dataset generated in step S026. On-line data enhancement mode during training of this example SCENESEGNET: the random rotation angle range is + -45 degrees, gaussian noise, random horizontal overturn and random vertical overturn are carried out, and the parameter updating strategy is the same as that of the step S008.
Step S029: the trained scene semantic segmentation network SCENESEGNET is obtained from step S028. The invention provides a method for identifying and positioning gears to be grabbed in a scene, which comprises the following specific steps:
Step S100: a single gear placed on the platform is photographed with the binocular structured light camera fixedly installed right above the gear placement platform, obtaining a scene point cloud containing the single gear.
Step S200: the redundant points in the scene point cloud obtained in step S100 are deleted and only the points of the gear's graspable plane are kept, establishing a three-dimensional point cloud template in the scene coordinate system; feature points are then selected one by one on the template so that the polygon formed by connecting them in sequence encloses the outer contour of the graspable plane. Both the point cloud template and the feature points are expressed in the scene coordinate system. As shown in Fig. 13, the gray points are the point cloud template of the graspable plane and the points at the corners of the gear are the labeled feature points.
Step S300: all gears placed on the platform in a scattered stacking state are photographed with the binocular structured light camera fixedly installed right above the gear placement platform.
Step S400: a scene gray map is obtained from step S300.
Step S500: a scene depth map is obtained from step S300.
Step S600: the scene three-dimensional point cloud is obtained from step S300.
Step S700: the scene gray map obtained in step S400 and the scene depth map obtained in step S500 are fed into the network prediction module, which uses the three deep learning networks trained in the earlier stage for prediction and outputs the final scene depth map instance segmentation.
Step S800: in the three-dimensional point cloud of the scene obtained in step S600, an image block with a size of patch size (patch size=280 in this example) centered on each example division centroid obtained in step S700 is selected as each local occlusion judgment area, and the scene point cloud corresponding to each area is an occlusion judgment local point cloud in one scene.
Step S900: in the three-dimensional point cloud of the scene obtained in step S600, a local area corresponding to each of the example partitions obtained in step S700 is selected as each local area to be registered, and the scene point cloud corresponding to each area is the local point cloud to be registered in one scene.
Step S1000: the template point cloud and feature points obtained in step S200, the occlusion judgment local point clouds of the scene obtained in step S800 and the local point clouds to be registered obtained in step S900 are fed into the point cloud occlusion judgment module, which outputs the grabbing recommendation for the gears in the scene.
Step S1100: the gear with the highest recommendation priority is grabbed, and the process then returns to step S300.
The network prediction module used in step S700 of the present invention takes the scene gray map obtained in step S400 and the scene depth map obtained in step S500 as input and outputs the final scene depth map instance segmentation; the steps are as follows:
Step S710: the scene gray map of step S400 is fed into the scene semantic segmentation network SCENESEGNET obtained in step S029 for inference, yielding the semantic segmentation of all gears in the scene. In this example, because the camera field of view is too large and contains too many non-working areas, the scene gray map obtained in step S400 is first cropped to the working area set in step S027; the cropped gray map is downsampled to 512×512, fed into SCENESEGNET for inference, and the result is upsampled back to the size before downsampling, giving the semantic segmentation of all gears in the scene.
Step S720: the semantic segmentation of all gears output in step S710 is used to judge whether there are gears in the scene; if so, step S730 is performed; if not, this round of grabbing ends.
Step S730: according to the semantic segmentation of all gears output in step S710, the non-rotated minimum bounding rectangle of the segmentation is slightly enlarged and used as the range for cropping the scene depth map obtained in step S500. Image blocks are cut from the cropped depth map in the manner of step S005, and the 280×280 depth map image blocks are downsampled to 224×224.
Step S740: the image blocks obtained in step S730 are fed into the image block classification network PATCHCATENET prediction module, which outputs the image blocks whose category score is greater than or equal to the category score threshold Cate_threshold, set to 40% in this example.
Step S750: the image blocks output in step S740 are fed into the image block semantic segmentation network PATCHSEGNET obtained in step S019 for inference; the semantic segmentation results are upsampled to the set image block size (280×280 in this example), giving the unoccluded graspable-plane semantic segmentation of each image block.
Step S760: the output semantic segmentation results are restored to the scene depth map according to the positions of the image blocks input in step S750; the segmentations restored from the different image blocks form a group of segmentations, each the size of the scene depth map and as numerous as the PATCHSEGNET input image blocks, which serve as the instance segmentations of the scene depth map.
Step S770: the category score of each PATCHSEGNET input image block is updated by non-maximum suppression over the instance segmentation results.
Step S780: it is judged whether the updated category score is greater than or equal to the preset post-suppression category score threshold NMS_threshold, set to 40% in this example; if so, the instance segmentation is kept; if not, it is discarded.
Step S790: the final scene depth map instance segmentation is obtained from the results kept in step S780.
The image block classification network PATCHCATENET prediction module used in step S740 of the present invention takes depth map image blocks as input and outputs the image blocks whose category score is greater than or equal to the preset category score threshold; the steps are as follows:
Step S741: the depth map image blocks obtained in step S730 are fed into the image block classification network PATCHCATENET obtained in step S009 for inference, yielding for each image block a score for containing an unoccluded graspable plane.
Step S742: it is judged block by block whether this score is greater than or equal to the preset category score threshold Cate_threshold (40% in this example); if so, the image block is kept; if not, it is discarded.
Step S743: the image blocks whose category score is greater than or equal to the preset category score threshold are obtained from the results kept in step S742.
The point cloud occlusion judgment module used in step S1000 takes as input the occlusion judgment local point clouds of the scene, the graspable plane template and its feature points, and the local point clouds to be registered of the scene, and outputs the gear grabbing recommendation for the scene; the steps are as follows:
Step S1001: registering the local point cloud to be registered in the scene obtained in the step S900 with the grippable plane template obtained in the step S200 to obtain a transformation matrix from the gear grippable plane template to the local point cloud to be registered, and transforming the grippable plane template and the feature points obtained in the step S200 into the scene by using the transformation matrix.
Step S1002: a graspable plane is fitted from the graspable plane template transformed into the scene to obtain its plane equation parameters, and the occlusion-judgment local point cloud in the scene obtained in step S800, together with the feature points transformed into the scene in step S1001, is projected onto the fitted plane.
Step S1003: it is judged whether each point projected onto the fitted plane was, before projection, near the fitted plane (i.e. a point on the graspable plane itself in the scene) or below the plane; if not, step S1004 is performed, and if so, the point is discarded.
Step S1004: it is judged whether each point remaining after step S1003, i.e. each point located above the graspable plane, lies inside the polygon formed by the feature points on the fitted plane; if so, step S1005 is performed, and if not, the point is discarded.
Step S1005: the number of points output from step S1004 is recorded.
Step S1006: the instance segmentations are sorted according to the number of above-plane points recorded in step S1005.
Step S1007: the gear grasping recommendation for the scene is obtained according to the instance segmentation ordering of step S1006.
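Steps S1002-S1007 can be sketched as follows; the registration output of step S1001 is assumed available, and the near-plane tolerance, the camera-axis convention (camera looking along +z, so "above the plane" means toward the camera) and all function names are assumptions rather than values from the patent.

    import numpy as np

    def fit_plane(points):
        # Least-squares plane through 3D points: returns (centroid, unit normal).
        centroid = points.mean(axis=0)
        _, _, vt = np.linalg.svd(points - centroid)
        normal = vt[-1]
        return centroid, normal / np.linalg.norm(normal)

    def point_in_polygon(pt, poly):
        # Ray-casting point-in-polygon test in 2D plane coordinates.
        x, y = pt
        inside = False
        for i in range(len(poly)):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % len(poly)]
            if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
        return inside

    def count_occluding_points(template_in_scene, feature_pts_in_scene,
                               occlusion_cloud, near_tol=1.0):
        # Steps S1002-S1005 sketch: fit the graspable plane from the template
        # transformed into the scene, then count occlusion-judgment points that
        # lie above that plane and inside the feature-point polygon.
        centroid, normal = fit_plane(template_in_scene)
        if normal[2] > 0:            # assumed camera looks along +z; flip so that
            normal = -normal         # a positive signed distance means "above" the plane
        # Build a 2D basis (u, v) inside the plane for the polygon test.
        u = np.cross(normal, [1.0, 0.0, 0.0])
        if np.linalg.norm(u) < 1e-6:
            u = np.cross(normal, [0.0, 1.0, 0.0])
        u /= np.linalg.norm(u)
        v = np.cross(normal, u)
        poly2d = [((p - centroid) @ u, (p - centroid) @ v) for p in feature_pts_in_scene]

        count = 0
        for p in occlusion_cloud:
            height = (p - centroid) @ normal      # signed distance to the fitted plane
            if height <= near_tol:                # on or below the plane: not occluding
                continue
            pt2d = ((p - centroid) @ u, (p - centroid) @ v)
            if point_in_polygon(pt2d, poly2d):    # above the plane and over the template
                count += 1
        return count

    # Steps S1006-S1007: instances with fewer occluding points are recommended first
    # (the ordering direction is an assumption consistent with the occlusion test).
    # ranking = sorted(instances, key=lambda k: counts[k])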
In this example, 30 rounds of grasping tests were carried out. Each round started with 14-18 gears randomly stacked, and in every cycle the gear ranked first in the recommendation was grasped, giving 464 grasp-recognition operations in total. In 4 of them no graspable gear was identified, and in 1 the occlusion was misjudged because the occluded part was too small, so the accuracy is (464 - 5)/464 ≈ 98.92%. One recognition result is shown in Fig. 14, where the numbers 1, 2, 3 and 4 on the masks follow the recommendation order.
The present invention is not limited to the above preferred embodiment; any other method for recognizing, positioning and grasping planar objects in a scattered stacking state obtained under the teaching of the present invention, and all equivalent changes and modifications made within the scope of the present invention, shall fall within the protection of the present invention.

Claims (1)

1. A method for identifying, positioning and grabbing a planar object in a random stacking state, characterized in that: training and testing samples are obtained through data acquisition with a binocular structured light camera, local image block interception based on a sliding window, and data enhancement processing; based on these samples, three convolutional neural networks are built and trained, namely an image block classification network PATCHCATENET for identifying whether each local image block area contains an unoccluded graspable plane, a semantic segmentation network PATCHSEGNET for segmenting the unoccluded graspable plane in each local image block area, and a semantic segmentation network SCENESEGNET for identifying all objects to be grabbed in the scene image; recognition of all objects to be grabbed in the visual scene is thereby realized, and grasping-area matching recommendations are provided for the three-dimensional point cloud registration involved in estimating the pose of the suction-cup gripper; whether the graspable plane is occluded is judged in the three-dimensional point cloud environment, and the priority grasping order of the objects to be grabbed is finally determined;
the acquisition and labeling of the sample image comprises the following steps:
step S001: the binocular structured light camera is fixedly arranged right above the object placement plane to be grabbed, and a scene gray level map and a scene depth map are acquired through the binocular structured light camera;
step S002: obtaining the scene gray level map from step S001;
step S003: obtaining a scene depth map from step S001;
Step S004: in the scene gray level map, performing semantic segmentation labeling on all objects to be grabbed in the scene, and performing instance segmentation labeling on the unoccluded graspable planes in the scene;
The training process of the image block classification network PATCHCATENET includes the following steps:
step S005: a dataset of the image block classification network PATCHCATENET is generated in the following manner:
setting a square sliding window of fixed size according to the size, in the image, of the graspable plane of the object to be grabbed, the window size being patch size and large enough to contain the graspable plane and its surrounding information; setting the step length of the sliding window to patch stride;
the PATCHCATENET dataset is generated on the depth map: firstly, a minimum circumscribed rectangular area containing the semantic segmentation is cut out of the depth map according to the semantic segmentation labels of step S004; then, starting from the upper left corner of the cut-out rectangular area, image blocks are cut out from left to right and from top to bottom according to the set sliding window size and step length, until the lower right corner is reached; each time an image block is cut out, it is subjected to the following classification judgment: if the image block has an intersection with the instance segmentation labels of step S004, the percentage P_i of the intersection of the image block with each instance segmentation, taken relative to that instance segmentation area, is calculated in turn from the instance segmentation labels, i = 1, 2, ..., n, where n is the number of instance segmentations contained in the image block; the largest percentage P_max = Max(P_1, P_2, ..., P_n) is selected and compared with a preset graspable threshold P_threshold; if P_max ≥ P_threshold, the class of the image block is determined as positive, otherwise as negative (see the sketch following the claims);
Step S006: an offline data enhancement mode is adopted before training the image block classification network PATCHCATENET to obtain enough and balanced positive and negative examples of the network, and the enhancement mode is as follows:
Taking the geometric centroid of the scene depth map as a base point, rotating and cutting the picture, so as to directly expand the data set, and further obtaining more information contained in the rotated image block;
step S007: building an image block classification network PATCHCATENET;
Step S008: training the image block classification network PATCHCATENET constructed in the step S007 by using the image block classification network PATCHCATENET dataset generated in the step S006;
Step S009: step S008 is used to obtain a trained image block classification network PATCHCATENET;
the training process of the semantic segmentation network PATCHSEGNET includes the following steps:
Step S015: the image block semantic segmentation network PATCHSEGNET dataset is generated in the following manner:
depth map image blocks are obtained in the manner of step S005; for image blocks whose class label is positive, semantic segmentation labels are generated by marking the part of the instance segmentation label corresponding to P_max inside the image block; negative examples have no semantic segmentation labels;
Step S016: offline data enhancement is adopted before training of the image block semantic segmentation network PATCHSEGNET in the following manner:
Performing rotary re-clipping on the scene depth map in a mode of step S006, and acquiring image blocks and corresponding semantic segmentation labels in accordance with step S015;
step S017: building an image block semantic segmentation network PATCHSEGNET;
Step S018: training the image block semantic segmentation network PATCHSEGNET constructed in the step S017 by utilizing the image block semantic segmentation network PATCHSEGNET data set generated in the step S016;
Step S019: step S018 is used for obtaining a trained image block semantic segmentation network PATCHSEGNET;
the training process of the semantic segmentation network SCENESEGNET includes the following steps:
Step S025: generating a data set of a scene semantic segmentation network SCENESEGNET by adopting the scene gray level map and semantic segmentation labels of all objects to be grabbed in the scene;
step S026: applying offline data enhancement to the dataset as needed before training the scene semantic segmentation network SCENESEGNET;
Step S027: constructing a scene semantic segmentation network SCENESEGNET;
step S028: training the scene semantic segmentation network SCENESEGNET constructed in the step S027 by utilizing the scene semantic segmentation network SCENESEGNET data set generated in the step S026;
step S029: obtaining a trained scene semantic segmentation network SCENESEGNET from step S028;
the method comprises the following steps:
Step S100: shooting a single object to be grabbed placed on a platform by utilizing a binocular structured light camera fixedly installed right above the object to be grabbed placing platform in a scene to obtain a scene point cloud containing the single object to be grabbed;
Step S200: deleting redundant point clouds in the scene point clouds obtained in the step S100, reserving point clouds of a grabbing plane of an object to be grabbed, so as to establish a three-dimensional point cloud template under a scene coordinate system, and selecting characteristic points one by one in the three-dimensional point cloud template, so that polygons formed by the characteristic points connected in sequence can surround the outer contour of the grabbing plane; the point cloud templates and the feature points are obtained under a scene coordinate system;
step S300: shooting all objects to be grabbed which are placed on a platform and are in a scattered stacking state by using a binocular structured light camera fixedly installed right above the object to be grabbed placing platform in a scene;
step S400: step S300, obtaining a scene gray level map;
Step S500: obtaining a scene depth map from step S300;
step S600: step S300, obtaining a scene three-dimensional point cloud;
step S700: the scene gray level map obtained in the step S400 and the scene depth map obtained in the step S500 are used as inputs, a network prediction module is input, three deep learning networks obtained by earlier training are used for prediction, and final scene depth map instance segmentation is output;
Step S800: selecting each example segmentation centroid obtained in the step S700 as a center from the three-dimensional point clouds of the scene obtained in the step S600, wherein the image block with the size of patch size is used as each local occlusion judgment area, and the scene point cloud corresponding to each area is an occlusion judgment local point cloud in one scene;
Step S900: selecting a local area corresponding to each example segmentation obtained in the step S700 as each local area to be registered in the three-dimensional point cloud of the scene obtained in the step S600, wherein the scene point cloud corresponding to each area is the local point cloud to be registered in one scene;
Step S1000: taking the template point cloud and the characteristic point cloud obtained in the step S200, the shielding judgment local point cloud in the scene obtained in the step S800 and the local point cloud to be registered in the scene obtained in the step S900 as inputs, and inputting a point cloud shielding judgment module to output a capturing recommendation of an object to be captured in the scene;
step S1100: the object to be grabbed with the highest recommended priority is grabbed, and then the step S300 is returned;
in the network prediction module used in step S700, the scene gray level map obtained in step S400 and the scene depth map obtained in step S500 are input, and the final scene depth map instance segmentation is output, which specifically includes the following steps:
step S710: taking the scene gray level map obtained in the step S400 as input, and inputting a scene semantic segmentation network SCENESEGNET for reasoning to obtain semantic segmentation of all objects to be grabbed in the scene;
Step S720: judging whether objects to be grabbed exist in the scene by semantic segmentation of all the objects to be grabbed in the scene output by the step S710, if so, performing the step S730, and if not, ending the round of grabbing;
Step S730: according to semantic segmentation of all objects to be grabbed in the scene output in the step S710, slightly expanding a rectangular frame based on a minimum circumscribed rectangular frame without rotation of the semantic segmentation, and taking the rectangular frame as a range to intercept the scene depth map obtained in the step S500; acquiring an image block on the intercepted depth map;
Step S740: taking the image block obtained in the step S730 as input, inputting the input image block into an image block classification network PATCHCATENET prediction module, and outputting an image block with a category score greater than or equal to a preset category score threshold value Cate threshold;
step S750: taking the image block output in the step S740 as input, and inputting the image block semantic segmentation network PATCHSEGNET for reasoning; obtaining the grabbed plane semantic segmentation of each image block which is not blocked;
Step S760: restoring the output semantic segmentation result to the scene depth map according to the position of the image block input in the step S750 in the scene depth map, wherein different semantic segmentations obtained after the restoration of different image blocks form a group of semantic segmentations with the size of the scene depth map and the number of PATCHSEGNET input image blocks as example segmentations of one scene depth map;
step S770: updating the class score of each PATCHSEGNET input image block using non-maximum suppression by the example segmentation result;
Step S780: judging whether the updated class score is greater than or equal to a preset non-maximum suppressed class score threshold NMS threshold, if so, reserving the instance segmentation, and if not, discarding the instance segmentation;
step S790: obtaining final scene depth map instance segmentation according to the reserved result of the step S780;
the image block classification network PATCHCATENET prediction module used in step S740 inputs the image block as a depth map image block, outputs an image block with a category score greater than or equal to a preset category score threshold, and specifically comprises the following steps:
Step S741: inputting the depth map image blocks obtained in the step S730 into an image block classification network PATCHCATENET for reasoning to obtain a grabbing plane score of each image block which is not blocked;
step S742: judging whether the image blocks contain the unoccluded grippable plane score which is larger than or equal to a preset category score threshold value Cate threshold one by one, if so, reserving the image blocks, and if not, discarding the image blocks;
Step S743: obtaining an image block with a category score greater than or equal to a preset category score threshold according to the reserved result in the step S742;
The point cloud occlusion judgment module used in step S1000 takes as input the occlusion-judgment local point clouds in the scene, the graspable plane template and feature points, and the local point clouds to be registered in the scene, and outputs a grasping recommendation for the objects to be grasped in the scene, the specific steps being as follows:
Step S1001: registering the local point cloud to be registered in the scene obtained in the step S900 with the grabbing plane template obtained in the step S200 to obtain a transformation matrix from the grabbing plane template to the local point cloud to be registered, and transforming the grabbing plane template and the characteristic points obtained in the step S200 into the scene by using the transformation matrix;
Step S1002: fitting a graspable plane according to the graspable plane template transformed to the scene to obtain plane equation parameters of the graspable plane, and projecting the shielding judgment local point cloud in the scene obtained in the step S800 and the characteristic points transformed to the scene in the step S1001 to the fitting plane;
step S1003: judging whether each point projected onto the fitting plane was, before projection, near the fitting plane, namely a point on the graspable plane in the scene, or below the plane; if not, performing step S1004, and if so, discarding the point;
Step S1004: judging whether the remaining points in the step S1003 are in the polygon formed by the characteristic points on the fitting plane, namely, the points above the grippable plane, if so, performing the step S1005, and if not, discarding the points;
Step S1005: recording the number of points output in step S1004;
step S1006: sorting the instance divisions according to the number of points above the plane recorded in step S1005;
step S1007: obtaining object grabbing recommendation to be grabbed in a scene according to the example segmentation sequencing of the step S1006;
The image block classification network PATCHCATENET employs ResNet; the image block semantic segmentation network PATCHSEGNET and the scene semantic segmentation network SCENESEGNET employ a semantic segmentation network Unet.
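For the sliding-window labeling rule of step S005 referred to in the claim above, a minimal sketch follows; the instance masks are assumed to be scene-sized boolean arrays, and the threshold value and function name are illustrative assumptions.

    import numpy as np

    def label_patch(block_slice, instance_masks, p_threshold=0.8):
        # Step S005 labeling sketch: P_i is the fraction of instance i's area that
        # falls inside the block; the block is positive when P_max >= P_threshold
        # (the threshold value used here is an assumption).
        ratios = []
        for inst in instance_masks:              # unoccluded-graspable-plane instances
            area = inst.sum()
            if area == 0:
                continue
            ratios.append(inst[block_slice].sum() / area)
        if not ratios:
            return "negative", 0.0
        p_max = max(ratios)
        return ("positive" if p_max >= p_threshold else "negative"), p_max

    # Example: label, p_max = label_patch(np.s_[y:y + 280, x:x + 280], instance_masks)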
CN202111190937.6A 2021-10-13 2021-10-13 Method for recognizing, positioning and grabbing planar objects in scattered stacking state Active CN113935368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111190937.6A CN113935368B (en) 2021-10-13 2021-10-13 Method for recognizing, positioning and grabbing planar objects in scattered stacking state

Publications (2)

Publication Number Publication Date
CN113935368A CN113935368A (en) 2022-01-14
CN113935368B true CN113935368B (en) 2024-06-07

Family

ID=79279045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111190937.6A Active CN113935368B (en) 2021-10-13 2021-10-13 Method for recognizing, positioning and grabbing planar objects in scattered stacking state

Country Status (1)

Country Link
CN (1) CN113935368B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114888790B (en) * 2022-04-18 2023-10-24 金陵科技学院 Space coordinate locating method based on bulk three-dimensional feature distribution

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
WO2019153245A1 (en) * 2018-02-09 2019-08-15 Baidu.Com Times Technology (Beijing) Co., Ltd. Systems and methods for deep localization and segmentation with 3d semantic map
CN110322512A (en) * 2019-06-28 2019-10-11 中国科学院自动化研究所 In conjunction with the segmentation of small sample example and three-dimensional matched object pose estimation method
CN111209915A (en) * 2019-12-25 2020-05-29 上海航天控制技术研究所 Three-dimensional image synchronous identification and segmentation method based on deep learning
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN113052109A (en) * 2021-04-01 2021-06-29 西安建筑科技大学 3D target detection system and 3D target detection method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Six-dimensional pose estimation method for object point clouds based on deep learning; Li Shaofei et al.; Computer Engineering; 2021-08-31; Vol. 47, No. 8; pp. 217-223 *

Also Published As

Publication number Publication date
CN113935368A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
US11144787B2 (en) Object location method, device and storage medium based on image segmentation
JP6681729B2 (en) Method for determining 3D pose of object and 3D location of landmark point of object, and system for determining 3D pose of object and 3D location of landmark of object
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111027547A (en) Automatic detection method for multi-scale polymorphic target in two-dimensional image
JP6912215B2 (en) Detection method and detection program to detect the posture of an object
CN112164115B (en) Object pose recognition method and device and computer storage medium
CN111178170B (en) Gesture recognition method and electronic equipment
CN109508707B (en) Monocular vision-based grabbing point acquisition method for stably grabbing object by robot
CN113313703A (en) Unmanned aerial vehicle power transmission line inspection method based on deep learning image recognition
JP2021163503A (en) Three-dimensional pose estimation by two-dimensional camera
CN113221956B (en) Target identification method and device based on improved multi-scale depth model
CN112861785B (en) Instance segmentation and image restoration-based pedestrian re-identification method with shielding function
CN113935368B (en) Method for recognizing, positioning and grabbing planar objects in scattered stacking state
JP2022047508A (en) Three-dimensional detection of multiple transparent objects
CN113496524A (en) Feature detection through deep learning and vector field estimation
CN114120067A (en) Object identification method, device, equipment and medium
CN115082498A (en) Robot grabbing pose estimation method, device, equipment and storage medium
CN111127556A (en) Target object identification and pose estimation method and device based on 3D vision
CN113538576A (en) Grabbing method and device based on double-arm robot and double-arm robot
CN117541652A (en) Dynamic SLAM method based on depth LK optical flow method and D-PROSAC sampling strategy
JP2021163502A (en) Three-dimensional pose estimation by multiple two-dimensional cameras
CN114723775A (en) Robot grabbing system and method based on small sample learning
JP2001143072A (en) Object shape identifying apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant