CN114882214A - Method for predicting object grabbing sequence from image based on deep learning - Google Patents

Method for predicting object grabbing sequence from image based on deep learning

Info

Publication number
CN114882214A
Authority
CN
China
Prior art keywords
feature
sequence
objects
grabbing
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210344226.8A
Other languages
Chinese (zh)
Inventor
林梓尧
贾奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cross Dimension Shenzhen Intelligent Digital Technology Co ltd
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210344226.8A priority Critical patent/CN114882214A/en
Publication of CN114882214A publication Critical patent/CN114882214A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for predicting an object grabbing order from an image based on deep learning, which comprises the following steps: 1) acquiring a picture of an unordered grabbing scene; 2) detecting the detection frames and segmentation masks of all objects to be grabbed in the picture with a deep segmentation network; 3) pooling the feature map regions corresponding to the different segmentation masks into feature vectors of equal length, and pooling the global feature map into a global feature vector; 4) concatenating the feature vector of each segmentation mask with the global feature vector to form the feature vector of each object, and feeding these object features, in arbitrary order, into a dedicated recurrent neural network that outputs the grabbing order of the objects. The method can predict a reasonable grabbing order in complex stacked-object scenes, which is crucial in industrial settings: it accelerates robotic grabbing and reduces collisions.

Description

Method for predicting object grabbing sequence from image based on deep learning
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a method for predicting an object grabbing order from images based on deep learning.
Background
With the transition from conventional manufacturing to intelligent manufacturing, industry and computer vision research have increasingly focused on how AI technology can be used to build intelligent functions. To gradually replace tedious, customization-heavy links in traditional manufacturing with AI technology, many enterprises turn to computer vision. In industrial production, static tasks such as defect detection and volume measurement are already solved with visual AI. For industrial tasks that involve robotic grabbing, such as unordered picking and loading/unloading, several methods already attempt to recover an object's grasping pose with AI and feed it to the robot as input.
In real-world scenarios, however, the robot's interaction with the environment does not depend on a single target object alone; it is also influenced by other target instances and other scene objects. In a stacked scene, for example, an upper object may need to be grabbed before the object underneath it. To reduce collisions of the mechanical arm, existing methods compute collision-free paths with collision-avoidance algorithms. Inferring a reasonable object grabbing order from the scene picture with vision techniques both reduces collisions during actual grabbing and accelerates the whole grabbing process.
In the prior art, "Grasp Planning Based on Scene Approximation in Unrestrained Environment" plans grasps for scenes composed of basic geometric bodies so that grasping avoids collisions with scene objects. The grasp plan is obtained by ranking the grasp scores of the objects; this approach is limited and not applicable to arbitrary objects.
Disclosure of Invention
Aiming at the problem of predicting a reasonable object grabbing order when objects are stacked, the invention provides a method for predicting the object grabbing order in stacked grabbing scenes, and also provides a means of generating training data to support end-to-end training of the algorithm.
The invention is realized by at least one of the following technical schemes.
A method of predicting an object grabbing order from an image based on deep learning, comprising the steps of:
step 1, detecting all foreground objects in an image using a segmentation network, outputting a segmentation mask for every foreground object, and retaining the global feature map of the image and the object feature map preceding the mask output;
step 2, using each object's segmentation mask to cut the feature map at the mask position out of the object feature map and pooling it into the object's local feature vector; pooling the global feature map into a global feature vector; and concatenating the global feature vector to each object's local feature vector to obtain the object feature of each object;
step 3, using a recurrent neural network as an encoder and feeding the object feature vectors of all objects into the encoder in sequence to obtain a fixed-length feature vector;
step 4, taking the feature vector encoded in step 3 as the hidden feature, randomly generating an input vector, and inputting both into a grabbing order predictor; at each step the grabbing order predictor accepts a fixed-length input vector and the hidden feature obtained in the previous step and outputs an index, which points to one feature in the object feature sequence; the object corresponding to that feature is the object predicted to be grabbed at the current step; the number of prediction steps equals the number of detected objects, and the predicted index sequence is finally the grabbing order of the objects.
Further, the segmentation network comprises a two-class classifier that separates foreground objects from background objects.
Further, the step 2 comprises the following steps:
21. use the segmentation network to detect the masks Mask_i, i ∈ {1, 2, …, N}, of all foreground objects, where N is the number of objects detected by the segmentation network in the current picture; use each foreground object's mask to mask the feature layer preceding the predicted object mask, pool the masked features, and then use a linear network to convert the number of feature channels into a fixed-length local object feature f_i^local, thereby generating a local feature vector for every object;
22. directly pool the global feature layer with the most complete resolution, and use another linear network to convert the pooled features into the scene's global feature f_global;
23. concatenate the local feature and global feature of an object into the object feature f_i^obj = [f_i^local; f_global].
Further, at each cycle the encoder takes one object feature as input and outputs a corresponding hidden feature h_i^enc; the last encoded hidden feature h_N^enc serves as the feature encoding of the object feature sequence: h_N^enc = Encoder(f_1^obj, f_2^obj, …, f_N^obj), where h_N^enc is the hidden feature output by the last encoding step, f_i^obj is an object feature, N is the total number of objects, and h_N^enc is therefore the result of encoding the features of all objects.
Further, step 4 comprises the steps of:
41. use an LSTM recurrent neural network as the grabbing order predictor; take the feature vector encoded in step 3 as the first hidden feature h_1^dec, randomly generate a first input vector x_1 ∈ R^m, where m is the fixed input feature length, and feed the hidden feature and the input vector into the grabbing order predictor;
at each step the grabbing order predictor accepts one hidden feature h_j^dec and one input vector x_j, and outputs an output vector o_j, where j indicates that the current, j-th cycle is predicting the j-th grabbing target and x_j is the feature corresponding to the grabbing target predicted in the previous cycle; j = 1 denotes the first prediction step, in which case x_1 is a randomly generated vector used as input;
42. for the feature vector output at each step of the grabbing order predictor, compute an index into the object feature sequence using the PointerNet mechanism, and take the object corresponding to that index as the object grabbed at this step;
43. repeat steps 41 to 42 h times, where h is the number of detected objects, to obtain an index sequence of length h; this index sequence is the object grabbing order.
Further, the labeled data is a large batch of data with grabbing-order labels generated automatically by simulation and rendering; the grabbing-order labels are generated with a heuristic algorithm whose specific steps are as follows (a code sketch of this procedure is given after the list):
51. starting the construction of a scene, randomly importing n objects into a simulator, copying m instances of each object, and randomly generating the number of the objects and instance data in each scene construction;
52. dividing a p multiplied by p grid by taking the world center of the simulator as an origin, wherein the size of each square of the grid is the average diameter of an introduced object plus a fixed constant d;
53. place objects starting from the edge of the grid: each time, randomly select one instance from the object instances, place it at the center of the corresponding grid square, lift it along the z axis, and then apply a random translation in the xy plane; a texture is randomly assigned to the instance each time;
54. repeat step 53 until the grid is fully occupied, then continue with a (p-2) × (p-2) grid with the same center and the same square size;
56. repeat steps 53-54 until three layers have been placed; if the number of instances is insufficient or other conditions are not met, stop the scene construction and enter the next stage; the final scene has the shape of a suspended pyramid;
57. take the world coordinate origin as the center of a sphere, generate a hemispherical surface in the positive z direction, and uniformly sample o positions on this surface at which to place a virtual camera;
58. render the sampled camera positions one by one, with the camera always facing the world origin; randomly perturb the lighting and object surface materials at every rendering, then render an image;
59. and filtering out objects with the shielding ratio exceeding a set ratio.
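The following is a minimal Python sketch of the placement heuristic above. The simulator handle sim and its methods copy_instance and drop are hypothetical placeholders, since no specific simulator API is named in this description; the texture and camera randomization steps are omitted.

import random

def build_stacked_scene(sim, object_models, p, avg_diameter, d=0.05,
                        lift_range=(0.05, 0.10), jitter=0.02, layers=3):
    # object_models: the n randomly imported objects; m copies are made of each (steps 51-52).
    m = random.randint(1, 10)
    instances = [sim.copy_instance(obj) for obj in object_models for _ in range(m)]
    random.shuffle(instances)
    cell = avg_diameter + d                      # grid square = average diameter + constant d
    placed = []                                  # placement order later yields the grab-order labels
    for layer in range(layers):                  # p x p, then (p-2) x (p-2), ... with the same center
        side = p - 2 * layer
        if side <= 0:
            break
        half = (side - 1) / 2.0
        for i in range(side):
            for j in range(side):
                if not instances:                # abort when the instances run out
                    return placed
                inst = instances.pop()
                x = (i - half) * cell + random.uniform(-jitter, jitter)
                y = (j - half) * cell + random.uniform(-jitter, jitter)
                z = random.uniform(*lift_range) + layer * avg_diameter
                sim.drop(inst, x, y, z)          # hypothetical call: lift the instance and let it fall
                placed.append(inst)
    return placed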
Alternatively, the labeled data is obtained by capturing RGB images of a stacked scene from multiple angles with a camera, detecting the object frames with a trained segmentation network, and labeling the grabbing order of the objects manually.
Furthermore, images synthesized by simulation and rendering are used as labeled data; data synthesized in this way is labeled automatically and needs no additional manual annotation.
Furthermore, the segmentation network produces a feature map of the image while obtaining the segmentation masks of the objects in the image, and it detects all foreground objects in the image; the PointerNet network used in the grabbing-order prediction stage performs cyclic prediction over any number of objects and is not limited to a fixed number. The invention can therefore effectively predict the grabbing order of any number of objects to be grabbed in the scene.
Further, when extracting object features, the image features of the region where the object is located are used as the object's features, and a global feature is concatenated to them, so that the resulting features carry both local and global information and provide sufficient information for the subsequent grabbing-order prediction.
Compared with the prior art, the invention has the following beneficial effects:
the invention is based on a deep learning method, uses global information and local information to construct the relation between an object and a scene, and infers a reasonable object grabbing sequence from the relation. The object grabbing of the stacked scene is performed by utilizing the grabbing sequence, so that the collision can be reduced, and the grabbing process is accelerated. The invention is not limited to simple basic geometries, but applies to any object. Meanwhile, the invention uses an end-to-end algorithm, does not need to carry out more searches and simplifies the whole capturing process.
Drawings
FIG. 1 is a flowchart of the method for predicting an object grabbing order from an image based on deep learning according to the present invention;
FIG. 2 is an architecture diagram of the method for predicting an object grabbing order from an image based on deep learning according to the present invention;
FIG. 3 is a side view of a scene constructed in the present embodiment;
FIG. 4 is a schematic diagram of a picture generated in the present embodiment.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A method of predicting an object grabbing order from an image based on deep learning, comprising the steps of:
step 1, foreground object detection and segmentation mask prediction: all target objects, i.e. foreground objects, are treated as one class, and irrelevant background objects as another class. A segmentation network detects all foreground objects in the image and outputs a segmentation mask for each of them, while the global feature map F_global of the image and the object feature map F_obj preceding the mask output are retained.
A segmentation network is used to separate foreground objects from background objects. The segmentation network is adapted from an existing segmentation network, mainly by changing the training procedure: the invention only needs to detect all foreground objects of interest, without distinguishing concrete object classes, so the classification head of the segmentation network is turned into a two-class classifier that separates foreground from background objects. The adapted segmentation network can be trained on existing segmentation datasets; the only difference is that all foreground-like objects in the dataset are assigned to the foreground class and background-like objects to the background class, i.e. the number of object classes in the training dataset is 2.
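As an illustration of the two-class setup described above (a sketch under stated assumptions, not the patented code), an existing Mask R-CNN such as the torchvision implementation could be re-headed for foreground/background as follows; the helper name build_binary_maskrcnn is illustrative.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_binary_maskrcnn(num_classes: int = 2):
    # num_classes = 2 means background (0) and foreground (1).
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box classification head with a two-class head.
    in_feat = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
    # Replace the mask head accordingly.
    in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)
    return model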
Step 2, generation of foreground object feature vectors: after the segmentation masks of all foreground objects and the feature maps of the whole image are obtained, each object's segmentation mask is used to cut the feature map at the mask position out of the object feature map F_obj and pool it into the object's local feature vector; the global feature map F_global is feature-pooled into a global feature vector, which is concatenated to every object's local feature vector to obtain the object feature f_i^obj of each object.
Feature vector generation for foreground objects thus combines local and global features. The specific steps are:
21. use the segmentation network to detect the masks Mask_i, i ∈ {1, 2, …, N}, of all foreground objects, where N is the number of objects the segmentation network detects in the image; use each mask to mask the feature layer preceding the predicted object mask, pool the masked features, and convert the number of feature channels with one linear network layer into a fixed-length local object feature f_i^local, thereby generating a local feature vector for every object;
22. directly pool the global feature layer with the most complete resolution, and use another linear network to convert the pooled features into the scene's fixed-length global feature f_global;
23. concatenate the local feature and global feature of an object into the object feature f_i^obj = [f_i^local; f_global].
Adding the global image feature to the object's local feature lets the grabbing-order prediction network better capture the object's position relative to the scene and to the other objects.
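A minimal sketch of step 2 follows, assuming masked average pooling for the local features, max pooling for the global feature, and the 256/512 dimensions used in the embodiments below; the module name and the exact pooling choices are illustrative, not prescribed by this description.

import torch
import torch.nn as nn

class ObjectFeatureBuilder(nn.Module):
    # Builds per-object features by concatenating mask-pooled local features
    # with a pooled global scene feature.
    def __init__(self, obj_channels: int, global_channels: int, dim: int = 256):
        super().__init__()
        self.local_proj = nn.Linear(obj_channels, dim)
        self.global_proj = nn.Linear(global_channels, dim)

    def forward(self, obj_fmap, global_fmap, masks):
        # obj_fmap: (C_o, H, W) feature map preceding the mask output
        # global_fmap: (C_g, H', W') full-resolution global feature map
        # masks: (N, H, W) binary (0/1 float) segmentation masks of the N detected objects
        g = global_fmap.flatten(1).max(dim=1).values            # (C_g,) max pooling
        f_global = self.global_proj(g)                           # (dim,)
        feats = []
        for m in masks:
            area = m.sum().clamp(min=1)
            pooled = (obj_fmap * m).flatten(1).sum(dim=1) / area # masked average pooling, (C_o,)
            f_local = self.local_proj(pooled)                    # (dim,)
            feats.append(torch.cat([f_local, f_global]))         # (2*dim,) per object
        return torch.stack(feats)                                # (N, 2*dim) object feature sequence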
Step 3, feature encoding: a recurrent neural network is used as the encoder, and the object feature vectors of all objects are fed into the encoding network in sequence to obtain a fixed-length feature vector. The object feature sequence is unordered, so the order in which features are fed into the encoding network does not matter. A recurrent neural network is chosen as the encoder because different scenes contain different numbers of foreground object instances, and a recurrent network can accommodate a varying number of objects.
The feature encoder in step 3 encodes the object feature sequence with a recurrent neural network. Step 2 yields the object feature sequence (f_1^obj, f_2^obj, …, f_N^obj), i.e. the sequence of object features of all objects. At each step the feature encoder takes one object feature as input and outputs a hidden feature h_i^enc. The last encoded hidden feature h_N^enc serves as the feature encoding of the object feature sequence: h_N^enc = Encoder(f_1^obj, f_2^obj, …, f_N^obj), where h_N^enc is the hidden feature output by the last encoding step and N is the total number of objects; h_N^enc is therefore the result of encoding the features of all objects.
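A minimal sketch of the step-3 encoder, assuming a single-layer LSTM as in the embodiments; the hidden size of 512 is an assumption chosen here to match the 512-dimensional object features.

import torch
import torch.nn as nn

class ObjectSequenceEncoder(nn.Module):
    # Encodes a variable-length, unordered set of object features with a single-layer LSTM
    # and returns the last hidden state as the fixed-length code h_N^enc.
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, feat_dim); the order is irrelevant, so a random permutation is used.
        perm = torch.randperm(obj_feats.size(0))
        seq = obj_feats[perm].unsqueeze(0)       # (1, N, feat_dim)
        _, (h_n, _) = self.lstm(seq)             # h_n: (1, 1, hidden_dim)
        return h_n.squeeze(0).squeeze(0)         # (hidden_dim,) fixed-length code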
Step 4, grabbing order prediction: the feature vector encoded in step 3 is taken as the first hidden feature, a first input vector is randomly generated, and both are input into the grabbing order predictor, which is also a recurrent neural network. At each step the grabbing order predictor accepts a fixed-length input feature vector and the hidden feature vector obtained in the previous step, selects one object from the object feature sequence, and outputs its index in the object feature sequence as the object selected at the current step. The number of prediction steps equals the number of detected objects, so an index sequence over the objects is finally predicted, and this sequence is the grabbing order of the objects. The specific steps are:
41. a recurrent neural network is used as the grabbing order predictor. At each step the predictor accepts one hidden feature h_j^dec and one input vector x_j, where j indicates that the current, j-th cycle is predicting the j-th grabbing target and x_j is the feature corresponding to the grabbing target predicted in the previous cycle; when j = 1, i.e. at the very first prediction step, x_1 is a randomly generated vector used as input. At each step the predictor also outputs an output vector o_j.
More specifically, in order to be able to output one index step by step from the object feature sequence, the present embodiment uses PointerNet as a grab order predictor.
42. the sequence feature obtained after the feature encoding of step 3 is used as the first hidden feature h_1^dec of the object grabbing order predictor, and a simple vector is generated as the input feature x_1 ∈ R^m of the first step, where m is the fixed input feature length.
43. for the feature vector output at each step of the predictor, an index is computed from the object feature sequence with the PointerNet mechanism, in a manner similar to an attention mechanism, and the object corresponding to the index is taken as the object grabbed at this step. PointerNet is adopted as the grabbing-order prediction network because 1) the number of objects is variable and different scene images contain different numbers of objects, and 2) the output of the network is a discrete number at each step, representing the object that should be grabbed at that step.
44. The above steps are cycled h times, h being the number of detected objects. Thus, an index sequence with the length h is obtained, and the index sequence is the object grabbing sequence.
Step 5, training the segmentation network and the recurrent neural networks with automatically synthesized labeled data: mass training uses data produced by combined simulation and rendering, and real data is used for fine-tuning. To reduce the cost of manual annotation, the invention also provides a method for automatically generating large batches of training data with grabbing-order labels from simulation-rendered synthetic data. A simulation engine builds virtual unordered stacked grabbing scenes in which objects are placed so that they gradually overlap, so the grabbing-order labels can be acquired automatically; a renderer then renders large amounts of training data with grabbing orders from many viewpoints while randomizing lighting, materials, and similar parameters. Additional data is collected from real scenes with an RGB camera and annotated manually to obtain grabbing orders that match human intuition. The base network trained on synthetic data is then fine-tuned with the manually labeled real training data so that it generalizes better to real scene pictures.
Training data is acquired from three sources. The first is to automatically generate large batches of data with grabbing-order labels by simulation and rendering. The grabbing-order labels are generated with a simple heuristic algorithm, whose specific steps are:
51. start the construction of a scene.
52. randomly import n objects into the simulator and copy m instances of each object; the number of objects and the instance data are randomly generated each time a scene is constructed.
53. divide a p × p grid with the world center of the simulator as the origin; the size of each grid square is the average diameter of the imported objects plus a fixed constant d.
54. objects are placed from left to right and top to bottom: each time one instance is randomly selected from the object instances, placed at the center of the corresponding grid square, and lifted a certain distance along the z axis, for example 5 cm; this distance is a parameter to be tuned and is generally sampled randomly from 5-10 cm. The instance is then translated randomly within a certain radius in the xy plane, and a texture is randomly assigned to it each time.
55. repeat step 54 until the grid is fully occupied; then continue with a (p-2) × (p-2) grid with the same center and the same square size.
56. repeat steps 54-55 until three layers have been placed. If the number of instances is insufficient or other conditions are not met, the scene construction is aborted and the next phase is entered. The resulting scene has the shape of a suspended pyramid. Fig. 3 shows a side view of one scene constructed in this embodiment.
57. the world coordinate origin is taken as the center of a sphere and a certain distance as the radius; this radius generally corresponds to the distance between the camera and the target scene in the actual application, so the camera-to-scene-center distance of the application can be taken as the midpoint, extended by plus or minus 0.5 meter, and the final radius sampled randomly from this range. A hemispherical surface is generated in the positive z direction, and o positions are uniformly sampled on the surface at which to place a virtual camera.
58. the sampled camera positions are rendered one by one, with the camera always facing the world origin; the lighting and the object surface materials are randomly perturbed at every rendering, after which an image is rendered. For example, a random number in [0, 1] is sampled before each rendering, and when the number is smaller than 0.5, i.e. with probability one half, a material is randomly selected from a material package and applied to the object surface through the simulator. Material packages can be downloaded directly from the web, for example the cctexture texture package.
59. objects that are occluded beyond a certain proportion are filtered out. By projecting each object separately, its complete segmentation mask Mask_full is obtained, and the area ratio between the segmentation mask visible on the object in the actual rendering and the complete mask is computed:
p_mask = Area(Mask_visib) / Area(Mask_full)
where Area() is the area occupied by a segmentation mask and Mask_visib is the visible segmentation mask of the object in the actual rendered image, i.e. with the parts occluded by other objects removed. The occlusion threshold can be set to 0.5: when p_mask < 0.5 the object is removed because it is occluded too much by other objects.
Because the objects are placed by successive free fall, an object placed later can be assumed to be grabbed earlier, which automatically yields the grabbing-order labels. The bounding box and segmentation mask of each object can likewise be computed from the renderer output. Data obtained in this way can therefore be used to train the entire neural network of the method.
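A sketch of the occlusion filtering and automatic grabbing-order labeling described above; the dictionary keys placement_index, mask_visible, and mask_full are illustrative names introduced here, not part of this description.

import numpy as np

def occlusion_ratio(mask_visible: np.ndarray, mask_full: np.ndarray) -> float:
    # p_mask = Area(Mask_visib) / Area(Mask_full): fraction of the object still visible.
    full = int(mask_full.astype(bool).sum())
    return float(mask_visible.astype(bool).sum()) / max(full, 1)

def label_scene(objects, threshold: float = 0.5):
    # objects: list of dicts with 'placement_index', 'mask_visible', 'mask_full' (HxW boolean arrays).
    kept = [o for o in objects
            if occlusion_ratio(o["mask_visible"], o["mask_full"]) >= threshold]
    # Later-placed objects rest on top, so they are grabbed first.
    kept.sort(key=lambda o: o["placement_index"], reverse=True)
    for rank, o in enumerate(kept):
        o["grab_order"] = rank
    return kept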
Another way to acquire training data is to use a real camera to acquire RGB images of a stacked scene at multiple angles, detect an object frame of an object using a segmentation network trained in advance, and label the grasping order of the object by a manual labeling way.
Example 2
As shown in fig. 1, the method for predicting the object grabbing sequence from the image based on the deep learning of the present invention includes the following steps:
s1, foreground object detection and segmentation mask generation: this step detects the masks of all foreground objects from the image, and in the process, obtains the feature map of the objects and the global feature map of the scene image.
Specifically, any prior-art object segmentation neural network can be used to generate the segmentation masks; this embodiment adopts MaskRCNN as the segmentation network to predict the segmentation masks and generate the feature maps. The segmentation network in fig. 2 is illustrated with the MaskRCNN framework.
A characteristic of the method is that it does not classify objects into specific categories but divides all objects in the image into a scene (foreground) object class and a background object class. The background class mainly comprises the tabletop, the floor, the bin in which objects are placed, and the like. Using an existing segmentation neural network in this method therefore only requires setting the number of output classes of the segmentation network to 2, and adjusting the class ID of each object during training to foreground (1) or background (0).
S2, generating local features and global features of the object, connecting the local features and the global features to generate the object features: in order to enable the capture sequence prediction network to perceive the relative position relation of each object relative to other objects and scenes, the method fuses the global features and the local features of the objects in the process of generating the object features.
Specifically, for the local features of an object, the feature map preceding the object segmentation mask output is used as the feature source; the object's segmentation mask is laid over it and the mask-covered part is extracted. This step is illustrated at label 1 in fig. 2. The mask-covered feature area is feature-pooled, and the pooled feature vector is then mapped into a fixed-length feature space with a one-dimensional convolution or a linear layer. This embodiment maps the features to a feature space of dimension 256, i.e. the local object feature f_i^local ∈ R^256. Feature spaces of different dimensions may suit scenes of different complexity, so 256 is not a strictly required number here; other values are possible.
The global feature of the image is illustrated at label 2 in fig. 2. Global pooling is performed directly on the feature map of the image with the most complete resolution; common pooling operations such as mean pooling and max pooling may be used, and this embodiment uses max pooling. The pooled global feature is mapped to a fixed-length feature space, again of dimension 256 in this embodiment, i.e. f_global ∈ R^256.
The local feature f_i^local of each object is concatenated with the global feature to obtain the object feature f_i^obj = [f_i^local; f_global]. In this embodiment the object feature is a feature vector of length 512, where f_global is the global feature vector obtained in the previous step.
S3, the object feature sequence is feature-encoded with an encoder to generate the combined feature of the object feature sequence: the previous steps yield the object feature sequence (f_1^obj, f_2^obj, …, f_N^obj), where N is the number of detected foreground objects. The encoder encodes this feature sequence into one feature vector that fuses local and global information.
Further, the encoder here uses a recurrent neural network, both to accommodate different numbers of objects and because a recurrent network encodes sequence-like content well. Variants of various recurrent neural networks may be used as the encoder; this embodiment uses a single-layer LSTM network. The object feature sequence is in fact unordered, so at each cycle step of the encoder a feature vector that has not yet been selected is chosen at random from the feature sequence as the encoder input of the current step. This is repeated N times, N being the number of objects, and the hidden feature output at the N-th step is used as the combined feature of the object feature sequence.
S4, predicting the object grabbing order: in order to pick the objects to be grabbed step by step from the object feature sequence and finally recover an object grabbing order, the grabbing order predictor uses a special recurrent neural network; more specifically, this embodiment uses PointerNet as the order predictor. At each cycle step the predictor receives one hidden feature h_j^dec and one input feature x_j, and the output feature o_j of each step is combined with the object feature sequence (f_1^obj, …, f_N^obj) to compute an object selection probability distribution vector, from which the index with the highest probability is selected as the object output of this step. The probability distribution vector is computed as:
1. u_j^i = v^T tanh(W_1 f_i^obj + W_2 o_j), i ∈ {1, …, N}
2. p_j = softmax(u_j), p_j ∈ R^N
In this calculation, v^T, W_1 and W_2 are all learnable network parameters. u_j^i is the correlation coefficient between the feature output at the j-th predictor step and the i-th object feature; each step of the predictor output computes correlation coefficients with all object features, i.e. with the whole object feature sequence, giving the correlation coefficient vector u_j. p_j is the probability distribution computed from the correlation coefficient vector with the softmax function, and p_j^i, the i-th value of p_j, is the predicted probability of grabbing object i at the j-th cycle step. N is the number of objects in the image, so p_j is a vector of length N, p_j ∈ R^N. The index with the highest probability in p_j is selected as the object of this step, and the index is added to the pool of already selected indexes so that later cycles do not select it again.
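A minimal sketch of the pointer-style selection above; the attention dimension of 256, the bias-free linear layers, and the explicit masking of already selected indexes are illustrative implementation choices, with the standard PointerNet scoring form assumed.

import torch
import torch.nn as nn

class PointerSelector(nn.Module):
    # Per-step scoring: u_j[i] = v^T tanh(W1 f_i^obj + W2 o_j), p_j = softmax(u_j).
    def __init__(self, obj_dim: int, out_dim: int, att_dim: int = 256):
        super().__init__()
        self.W1 = nn.Linear(obj_dim, att_dim, bias=False)
        self.W2 = nn.Linear(out_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, obj_feats, o_j, selected):
        # obj_feats: (N, obj_dim); o_j: (out_dim,); selected: list of already chosen indexes
        u_j = self.v(torch.tanh(self.W1(obj_feats) + self.W2(o_j))).squeeze(-1)  # (N,)
        if selected:                                # forbid re-selecting an object
            mask = torch.zeros_like(u_j, dtype=torch.bool)
            mask[torch.tensor(selected)] = True
            u_j = u_j.masked_fill(mask, float("-inf"))
        p_j = torch.softmax(u_j, dim=0)             # probability distribution over the N objects
        return int(torch.argmax(p_j)), p_j          # index grabbed at this step, and p_j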
S5, network training on large-scale data with automatically synthesized labeled data: existing datasets containing an object grabbing order are small, so obtaining sufficient training data is also part of the invention. Manual labeling is a common way of acquiring data; in this embodiment, the steps of acquiring data by manual annotation are as follows.
a) within a working range, usually a material bin or a worktable, 5-20 instances of 2-3 objects are placed at random; images are captured by a camera from different angles above the scene, and 10-30 images can be acquired for each scene.
b) foreground and background object detection and segmentation are performed with a segmentation network trained on other datasets.
c) the more accurate detection frames are screened manually, and the segmentation masks are optimized manually at the same time.
d) the grabbing order of the detected objects is labeled using human expert knowledge.
Further, the method uses a simulation and rendering mode to generate larger-scale training data, and the specific steps are as follows
1) start the construction of a scene.
2) randomly import n objects into the simulator and copy m instances of each object; the number of objects and the instance data are randomly generated each time a scene is constructed. In this embodiment n is sampled from 2-5 and m from 1-10.
3) divide a p × p grid with the world center of the simulator as the origin; in this embodiment p is randomly chosen from 3-7 each time. The size of each grid square is the average diameter of the imported objects plus a fixed constant d, set to 5 cm in this embodiment.
4) objects are placed from left to right and top to bottom: each time one instance is randomly selected from the object instances, placed at the center of the corresponding grid square, and lifted along the z axis by a distance sampled randomly from 3-8 cm; a random translation within a certain radius is then applied in the xy plane, and a texture is randomly assigned to the instance each time.
5) Repeating step 4) until the grid is fully placed. The grid of (p-2) × (p-2) is then continued with the same center and the same size.
6) Repeating steps 4) -5) until three layers are placed. If the number of instances is insufficient or other conditions are not met, the scene build is aborted and the next phase is entered. The resulting scene exhibits a shape of a suspended pyramid pattern. A side view of one constructed scene in this embodiment is given in fig. 3.
7) the world coordinate origin is used as the center of a sphere and a certain distance as the radius; a hemispherical surface is generated in the positive z direction, and o positions are uniformly sampled on the surface at which to place a virtual camera. Here o is not a constant value and may generally be 50 to 200.
8) the sampled camera positions are rendered one by one, with the camera always facing the world origin; the lighting and the object surface materials are randomly perturbed at every rendering, after which an image is rendered.
9) objects that are occluded beyond a certain proportion are filtered out; in this embodiment, objects occluded by more than 50% are removed. Fig. 4 gives an example of a generated picture; the number on each box represents the grabbing order of the object in that box.
10) because the objects are placed by successive free fall, an object placed later can be assumed to be grabbed earlier, which automatically labels the grabbing order. The bounding box and segmentation mask of each object can likewise be computed from the renderer output, so data obtained in this way can be used to train the entire neural network of the method.
Example 3
As shown in fig. 1, the method for predicting the object grabbing sequence from the image based on the deep learning of the present invention includes the following steps:
s1, foreground object detection and segmentation mask generation: this step detects the masks of all foreground objects from the image, and in the process, obtains the feature map of the objects and the global feature map of the scene image.
Specifically, the segmentation masks may be generated with any prior-art object segmentation neural network; this embodiment adopts the MaskRCNN object segmentation method, and the segmentation network in fig. 2 is illustrated with the MaskRCNN framework.
S2, a camera is set up and a number of real scenes are constructed, with images acquired from different camera positions for each scene. Approximately 10-30 scenes are constructed, each with 25-50 real images taken from different angles. Labeling is performed with a labeling tool such as LabelMe. The annotation types include: foreground/background object classes, foreground/background object detection frames, foreground/background object segmentation masks, and the foreground object grabbing order.
The parameters of the backbone of a MaskRCNN pre-trained on other datasets, for example a MaskRCNN trained on the COCO dataset, are fixed, and the class output of the network is changed to 2 classes. The segmentation network is then retrained with the labeled data.
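As a usage illustration of this fine-tuning setup, reusing the illustrative build_binary_maskrcnn helper sketched earlier in this description, the COCO-pretrained backbone can be frozen and only the re-initialized two-class heads trained; the optimizer settings below are arbitrary examples, not values given in this description.

import torch

model = build_binary_maskrcnn()                    # two-class Mask R-CNN from the earlier sketch
for p in model.backbone.parameters():              # keep the COCO-pretrained trunk fixed
    p.requires_grad = False
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9)   # illustrative settings
# The 2-class box classifier and mask predictor heads are then retrained on the labeled data.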
S3, generating object local features and global features, connecting the two features to generate object features: in order to enable the capture sequence prediction network to perceive the relative position relation of each object relative to other objects and scenes, the method fuses the global features and the local features of the objects in the process of generating the object features.
Specifically, for the local features of an object, the feature map preceding the object segmentation mask output is used as the feature source; the object's segmentation mask is laid over it and the mask-covered part is extracted. This step is illustrated at label 1 in fig. 2. The mask-covered feature area is feature-pooled, and the pooled feature vector is then mapped into a fixed-length feature space with a one-dimensional convolution or a linear layer. This embodiment maps the features to a feature space of dimension 256, i.e. the local object feature f_i^local ∈ R^256. Feature spaces of different dimensions may suit scenes of different complexity, so 256 is not a strictly required number here; other values are possible.
The global feature of the image is illustrated at label 2 in fig. 2. Global pooling is performed directly on the feature map of the image with the most complete resolution; common pooling operations such as mean pooling and max pooling may be used, and this embodiment uses max pooling. The pooled global feature is mapped to a fixed-length feature space, again of dimension 256 in this embodiment, i.e. f_global ∈ R^256.
The local feature f_i^local of each object is concatenated with the global feature to obtain the object feature f_i^obj = [f_i^local; f_global]. In this embodiment the object feature is a feature vector of length 512, where f_global is the global feature vector obtained in the previous step.
S4, the object feature sequence is feature-encoded with an encoder to generate the combined feature of the object feature sequence: the previous steps yield the object feature sequence (f_1^obj, f_2^obj, …, f_N^obj), where N is the number of detected foreground objects. The encoder encodes this feature sequence into one feature vector that fuses local and global information.
Further, the encoder here uses a recurrent neural network, both to accommodate different numbers of objects and because a recurrent network encodes sequence-like content well. Variants of the recurrent neural network may be used as the encoder; this embodiment uses a single-layer LSTM recurrent neural network. The object feature sequence is in fact unordered, so at each cycle step of the encoder a feature vector that has not yet been selected is chosen at random from the feature sequence as the encoder input of the current step. This is repeated N times, N being the number of objects, and the hidden feature output at the N-th step is used as the combined feature of the object feature sequence.
S5, predicting the object grabbing order: in order to pick the objects to be grabbed step by step from the object feature sequence and finally recover an object grabbing order, the grabbing order predictor uses a special recurrent neural network; more specifically, this embodiment uses PointerNet as the order predictor. At each cycle step the predictor receives one hidden feature h_j^dec and one input feature x_j, and the output feature o_j of each step is combined with the object feature sequence (f_1^obj, …, f_N^obj) to compute an object selection probability distribution vector, from which the index with the highest probability is selected as the object output of this step. The probability distribution vector is computed as:
u_j^i = v^T tanh(W_1 f_i^obj + W_2 o_j), i ∈ {1, …, N}
p_j = softmax(u_j), p_j ∈ R^N
In this calculation, v^T, W_1 and W_2 are all learnable network parameters. u_j^i is the correlation coefficient between the feature output at the j-th predictor step and the i-th object feature; each step of the predictor output computes correlation coefficients with all object features, i.e. with the whole object feature sequence, giving the correlation coefficient vector u_j. p_j is the probability distribution computed from the correlation coefficient vector with the softmax function, and p_j^i, the i-th value of p_j, is the predicted probability of grabbing object i at the j-th cycle step. N is the number of objects in the image, so p_j is a vector of length N, p_j ∈ R^N. The index with the highest probability in p_j is selected as the object of this step, and the index is added to the pool of already selected indexes so that later cycles do not select it again.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A method for predicting an object capture order from an image based on deep learning is characterized in that: the method comprises the following steps:
step 1, detecting all foreground objects in an image by using a segmentation network, simultaneously outputting segmentation masks for all foreground objects, and reserving a global feature map of the image and an object feature map before outputting the masks;
step 2, cutting out a feature map of a mask position from the object feature map by using a segmentation mask of the object, pooling the feature map to obtain a local feature vector of the object, pooling the global feature map to obtain a global feature vector, and connecting the global feature vector to the local feature vector of each object to obtain the object feature of each object;
step 3, using a cyclic neural network as an encoder, and sequentially sending object feature vector sequences of all objects into the encoder to finally obtain a feature vector with a fixed length;
step 4, taking the feature vector encoded in step 3 as a hidden feature and randomly generating an input vector, inputting both into a grabbing order predictor; each step of the grabbing order predictor receives a fixed-length input vector and the hidden feature obtained in the previous step and outputs an index, which points to one feature in the object feature sequence; the object corresponding to that feature is the object predicted to be grabbed at the current step; the number of cyclic prediction steps is the number of detected objects, and the predicted index sequence is finally the grabbing order of the objects.
2. The method of claim 1, wherein the method for predicting the object grabbing sequence from the image based on the deep learning is characterized in that: the segmentation network includes a two-class classifier for separating foreground objects from background objects.
3. The method of claim 1, wherein the method for predicting the object grabbing sequence from the image based on the deep learning is characterized in that: the step 2 comprises the following steps:
21. use the segmentation network to detect the masks Mask_i, i ∈ {1, 2, …, N}, of all foreground objects, where N is the number of objects detected by the segmentation network in the current picture; use the foreground object's mask to mask the feature layer preceding the predicted object mask, pool the masked features, and then use a linear network to convert the number of feature channels into a fixed-length local object feature f_i^local, thereby generating a local feature vector for every object;
22. directly pool the global feature layer with the most complete resolution, and use another linear network to convert the pooled features into the scene's global feature f_global;
23. concatenate the local feature and global feature of an object into the object feature f_i^obj = [f_i^local; f_global].
4. The method of claim 1, wherein the method for predicting the object grabbing sequence from the image based on the deep learning is characterized in that: each time the encoder cycles, one object feature is used as input and a hidden feature h_i^enc is correspondingly output; the last encoded hidden feature h_N^enc serves as the feature encoding of the object feature sequence: h_N^enc = Encoder(f_1^obj, f_2^obj, …, f_N^obj), wherein h_N^enc is the hidden feature output by the last encoding step, f_i^obj is an object feature, N is the total number of objects, and h_N^enc is the result of encoding the features of all objects.
5. The method of claim 1, wherein the method for predicting the object grabbing sequence from the image based on the deep learning is characterized in that: step 4 comprises the following steps:
41. use an LSTM recurrent neural network as the grabbing order predictor; take the feature vector encoded in step 3 as the first hidden feature h_1^dec, randomly generate a first input vector x_1 ∈ R^m, where m is the fixed input feature length, and input the hidden feature and the input vector into the grabbing order predictor;
at each step the grabbing order predictor accepts one hidden feature h_j^dec and one input vector x_j and outputs an output vector o_j, where j indicates that the current, j-th cycle is predicting the j-th grabbing target and x_j is the feature corresponding to the grabbing target predicted in the previous cycle; j = 1 denotes the first prediction step, in which case x_1 is a randomly generated vector used as input;
42. for the feature vector output at each step of the grabbing order predictor, compute an index into the object feature sequence using the PointerNet mechanism, and take the object corresponding to that index as the object grabbed at this step;
43. repeat steps 41-42 h times, where h is the number of detected objects, to obtain an index sequence of length h; this index sequence is the object grabbing order.
6. The method for predicting an object grabbing sequence from an image based on deep learning according to claim 1, characterized in that: the labeled data is a large batch of data with grabbing order labels generated automatically by means of simulation and rendering, and a heuristic algorithm is used to generate the grabbing order labels, with the following specific steps:
51. Start scene construction: randomly import n objects into the simulator and copy m instances of each object; the number of objects and the instance data are generated randomly at each scene construction;
52. Divide a p × p grid with the world center of the simulator as the origin, where the side length of each grid cell is the average diameter of the imported objects plus a fixed constant d;
53. Place an object along the edge of the grid: each time, randomly select one instance from the object instances, place it at the center of the corresponding grid cell, lift it along the z-axis, and then randomly translate it in the xy-plane; a texture is randomly assigned to the instance each time;
54. Repeat step 53 until the grid is fully occupied, and then continue by dividing a (p-2) × (p-2) grid with the same center and the same cell size;
56. Repeat steps 53-54 until three layers have been placed; if the number of instances is insufficient or another condition is not met, stop scene construction and enter the next stage; the finally generated scene has the shape of a suspended pyramid;
57. With the world coordinate origin as the center of a sphere, generate a hemisphere in the positive z direction, and uniformly sample positions on the hemisphere at which to place a virtual camera;
58. Render from the sampled camera positions one by one, with the camera facing the world origin each time; in each rendering, randomly perturb the lighting and the object surface materials, and then render an image;
59. Filter out objects whose occlusion ratio exceeds a set ratio.
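As a rough illustration of the layered grid placement in steps 52-56, the sketch below computes candidate placement centers for shrinking p × p, (p-2) × (p-2), ... grids and lifts each layer along the z-axis with a small random xy jitter; simulator import, texture assignment, stopping conditions, and rendering are omitted, and the function name, constants, and jitter range are all assumptions.

```python
import random

def pyramid_placement_centers(p: int, cell_size: float, layers: int = 3, lift: float = 0.2):
    """Yield (x, y, z) candidate centers for a layered, shrinking grid.
    Layer k uses a (p - 2k) x (p - 2k) grid with the same center and cell size,
    lifted higher along z; the random xy offset loosely mimics step 53."""
    for k in range(layers):
        n = p - 2 * k
        if n <= 0:
            break
        half = (n - 1) / 2.0
        z = (k + 1) * lift                       # each layer is lifted further along z
        for i in range(n):
            for j in range(n):
                x = (i - half) * cell_size + random.uniform(-0.1, 0.1) * cell_size
                y = (j - half) * cell_size + random.uniform(-0.1, 0.1) * cell_size
                yield (x, y, z)

# example: candidate centers for a 5 x 5 base grid with 0.3 m cells
# positions = list(pyramid_placement_centers(p=5, cell_size=0.3))
```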
7. The method for predicting an object grabbing sequence from an image based on deep learning according to claim 1, characterized in that: the labeled data is obtained by collecting RGB images of stacked scenes from multiple angles with a camera, detecting the object frames with a trained segmentation network, and labeling the grabbing order of the objects by manual annotation.
8. The method for predicting an object grabbing sequence from an image based on deep learning according to claim 1, characterized in that: the labeled data uses images synthesized by simulation rendering as the data, and the data synthesized by simulation rendering is labeled automatically.
9. The method for predicting an object grabbing sequence from an image based on deep learning according to claim 1, characterized in that: the segmentation network generates a feature map of the image while obtaining the segmentation masks of the objects in the image and can detect all foreground objects in the image; in the grabbing sequence prediction stage, the PointerNet network can perform cyclic prediction over an arbitrary number of objects.
10. The method for predicting an object grabbing sequence from an image based on deep learning according to claim 1, characterized in that: when feature extraction is performed on an object, the image features of the area where the object is located are taken as the features of the object.
CN202210344226.8A 2022-04-02 2022-04-02 Method for predicting object grabbing sequence from image based on deep learning Pending CN114882214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210344226.8A CN114882214A (en) 2022-04-02 2022-04-02 Method for predicting object grabbing sequence from image based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210344226.8A CN114882214A (en) 2022-04-02 2022-04-02 Method for predicting object grabbing sequence from image based on deep learning

Publications (1)

Publication Number Publication Date
CN114882214A (en) 2022-08-09

Family

ID=82669635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210344226.8A Pending CN114882214A (en) 2022-04-02 2022-04-02 Method for predicting object grabbing sequence from image based on deep learning

Country Status (1)

Country Link
CN (1) CN114882214A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116184892A (en) * 2023-01-19 2023-05-30 盐城工学院 AI identification control method and system for robot object taking
CN116184892B (en) * 2023-01-19 2024-02-06 盐城工学院 AI identification control method and system for robot object taking

Similar Documents

Publication Publication Date Title
CN110837778B (en) Traffic police command gesture recognition method based on skeleton joint point sequence
CN108491880B (en) Object classification and pose estimation method based on neural network
CN114627360B (en) Substation equipment defect identification method based on cascade detection model
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN110532897A (en) The method and apparatus of components image recognition
US11475589B2 (en) 3D pose estimation by a 2D camera
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN112288809B (en) Robot grabbing detection method for multi-object complex scene
CN110969660A (en) Robot feeding system based on three-dimensional stereoscopic vision and point cloud depth learning
CN115147488B (en) Workpiece pose estimation method and grabbing system based on dense prediction
CN115937774A (en) Security inspection contraband detection method based on feature fusion and semantic interaction
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
US11554496B2 (en) Feature detection by deep learning and vector field estimation
CN114549507A (en) Method for detecting fabric defects by improving Scaled-YOLOv4
CN114119753A (en) Transparent object 6D attitude estimation method facing mechanical arm grabbing
CN112613478A (en) Data active selection method for robot grabbing
CN114882214A (en) Method for predicting object grabbing sequence from image based on deep learning
CN113681552B (en) Five-dimensional grabbing method for robot hybrid object based on cascade neural network
CN113139432B (en) Industrial packaging behavior identification method based on human skeleton and partial image
Shah et al. Detection of different types of blood cells: A comparative analysis
CN113496526A (en) 3D gesture detection by multiple 2D cameras
CN115937492B (en) Feature recognition-based infrared image recognition method for power transformation equipment
CN116664843A (en) Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN111401203A (en) Target identification method based on multi-dimensional image fusion
CN116071299A (en) Insulator RTV spraying defect detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20240228
Address after: 510641 Industrial Building, Wushan South China University of Technology, Tianhe District, Guangzhou City, Guangdong Province
Applicant after: Guangzhou South China University of Technology Asset Management Co.,Ltd.
Country or region after: China
Address before: 510640 No. five, 381 mountain road, Guangzhou, Guangdong, Tianhe District
Applicant before: SOUTH CHINA University OF TECHNOLOGY
Country or region before: China
TA01 Transfer of patent application right
Effective date of registration: 20240410
Address after: 518057, Building 4, 512, Software Industry Base, No. 19, 17, and 18 Haitian Road, Binhai Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province
Applicant after: Cross dimension (Shenzhen) Intelligent Digital Technology Co.,Ltd.
Country or region after: China
Address before: 510641 Industrial Building, Wushan South China University of Technology, Tianhe District, Guangzhou City, Guangdong Province
Applicant before: Guangzhou South China University of Technology Asset Management Co.,Ltd.
Country or region before: China