CN111191650A - Object positioning method and system based on RGB-D image visual saliency - Google Patents

Object positioning method and system based on RGB-D image visual saliency

Info

Publication number
CN111191650A
Authority
CN
China
Prior art keywords
rgb
image
visual saliency
saliency
salient
Prior art date
Legal status
Granted
Application number
CN202010003692.0A
Other languages
Chinese (zh)
Other versions
CN111191650B (en)
Inventor
王松涛
靳薇
曲寒冰
李彬
Current Assignee
BEIJING INSTITUTE OF NEW TECHNOLOGY APPLICATIONS
Original Assignee
BEIJING INSTITUTE OF NEW TECHNOLOGY APPLICATIONS
Priority date
Filing date
Publication date
Application filed by BEIJING INSTITUTE OF NEW TECHNOLOGY APPLICATIONS filed Critical BEIJING INSTITUTE OF NEW TECHNOLOGY APPLICATIONS
Publication of CN111191650A publication Critical patent/CN111191650A/en
Application granted granted Critical
Publication of CN111191650B publication Critical patent/CN111191650B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J19/00 - Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 - Sensing devices
    • B25J19/021 - Optical sensing devices
    • B25J19/023 - Optical sensing devices including video camera means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)
  • Manipulator (AREA)

Abstract

An article positioning method and system based on RGB-D image visual saliency. The system mainly comprises a camera, a mechanical arm and an operating platform; the articles to be grasped are stacked on the operating platform, the mechanical arm is a UR5 mechanical arm, and the operating platform is a horizontal panel. During system initialization, the camera calibrates the operating platform, providing a reference plane for the mechanical arm to position and grasp articles. First, the camera acquires an RGB-D image of the operating-platform scene; then a visual saliency map, namely the RGB-D image saliency map, is computed from the RGB-D image; finally, articles are positioned based on the visual saliency map and manipulator operation information is provided. A pixel-level visual saliency map and salient-object position information can be generated simultaneously, supporting multiple manipulator operations.

Description

Object positioning method and system based on RGB-D image visual saliency
Technical Field
The invention relates to the field of computer vision target positioning, and in particular to an article positioning method and system based on RGB-D image visual saliency.
Background
When the scene is complex, especially when various items are scattered about, quickly positioning items with a vision-based mechanical arm is a challenging task. Whether the mechanical arm successfully grasps a scene object depends on the order in which objects are grasped; that is, the type and position of the object best suited for grasping in the current scene must be determined.
Mechanical-arm grasping based on visual perception is suitable for general article-stacking scenes such as logistics warehouses; it can replace manual sorting and enable fully automatic, intelligent logistics management in unmanned factories, unmanned warehouses and the like.
Currently, mechanical-arm article-grasping application systems generally collect scene visual information with an RGB-D camera. A feedback map (affordance map) is computed from the RGB-D image, and suitable operation points are located from it. If the feedback map contains no suitable points, a deep reinforcement learning strategy is adopted to actively try to change the spatial distribution of the scene objects, and the process repeats until the feedback map contains suitable points; as a result, the grasping success rate is not high.
When the scene is complex, objects overlap, stack and occlude one another, and the feedback-map method cannot identify an optimal positioning point; the placement of objects in the scene must be actively disturbed, i.e. a method based on the feedback map and reinforcement learning is adopted. However, because the consequences of active intervention must be assessed by reinforcement learning, the risk may become uncontrolled, i.e. the system may fall into an ineffective dead loop. Therefore, object positioning based on a feedback map with a reinforcement learning mechanism suffers from complexity, uncontrollability, high computational cost and other drawbacks.
Therefore, how to achieve fast positioning and grasping by a mechanical arm in a complex scene with many disorderly distributed articles, i.e. how to devise a new, fast, convenient and flexible method for mechanical-arm positioning and grasping with low computational cost, has become a technical problem to be solved urgently.
Disclosure of Invention
To develop a flexible and effective positioning method and system that completes rapid scene object positioning, the invention provides a positioning method that rapidly analyzes scenes based on visual saliency, simulating the human visual attention mechanism. Visual-saliency-based analysis can accomplish specific visual tasks using prior knowledge, rules and the like. The human visual attention mechanism can quickly browse a scene according to degree of saliency; the first region or target noticed is often related to one's own experience and a specific purpose, as well as to the relative saliency of that region or target in the scene, so how to realize object positioning based on visual saliency is a technical difficulty that must be overcome. To solve this technical problem, the invention computes visual saliency by sorting scene articles based on semantic information, and uses the visual saliency value as the basis for ordering mechanical-arm grasping.
To solve the above technical problem, according to an aspect of the present invention, there is provided an article positioning method based on RGB-D image visual saliency, comprising the following steps:
acquiring an RGB-D image of an operation platform scene by a camera;
secondly, calculating a visual saliency map based on the RGB-D image, namely the RGB-D image saliency map;
thirdly, positioning the articles based on the visual saliency map and providing mechanical arm article operation information;
In step two, visual saliency detection is performed on the RGB-D image, and the visual saliency map is computed as shown in equation (1):
p(zs|IRGB-D) = p(zs|xc,xd) = p(zs,xc,xd)/p(xc,xd) (1)
wherein p(zs|IRGB-D) represents the visual saliency of the current scene, i.e. the visual saliency map, and is defined as the probability p(zs|xc,xd) of whether a pixel of the RGB-D image is salient; IRGB-D represents the RGB-D image; xc and xd represent the RGB-image and depth-image salient features respectively, each extracted with a CNN; p(zs,xc,xd) represents the joint probability distribution and p(xc,xd) the salient-feature probability distribution; the visualization is rendered as a temperature map in which larger saliency values appear warmer and smaller values colder;
based on the RGB-D image saliency map, the salient object position is estimated as shown in equation (6):
p(O,zs|IRGB-D) = p(IRGB-D|O,zs)p(O,zs)/p(IRGB-D) (6)
wherein O represents the salient object position coordinates and zs represents the visual saliency of the salient object; p(O,zs|IRGB-D) represents the joint distribution of salient objects and visual saliency, p(IRGB-D|O,zs) represents the distribution of the RGB-D image saliency map given the target coordinates and visual saliency, p(O,zs) represents the joint distribution of targets and visual saliency, and p(IRGB-D) represents the RGB-D image feature distribution.
Preferably, when zs is given, O and IRGB-D are conditionally independent, which gives equation (7):
p(IRGB-D|O,zs)=p(IRGB-D|zs) (7)
when the posterior probability of visual saliency in the target region is used as the constraint condition for the salient target, and the image feature distribution is unchanged, equation (6) is approximately transformed into equation (8):
p(O,zs|IRGB-D)∝p(zs|IRGB-D)L(O)C(O,zs) (8)
wherein L(O) represents the target detection region and C(O,zs) represents the constraint, defined in the form:
C(O,zs) = 1 if zs(bj) = 1, and C(O,zs) = 0 otherwise
wherein bj denotes a detected target region and zs(bj) = 1 indicates that the region is visually salient; and L(O) is obtained by detecting target regions in the RGB image with an object detection algorithm, namely the Faster R-CNN algorithm.
Preferably, the camera is a Kinect camera.
Preferably, the manipulator used has two operating functions, suction and clamping; when the scene is complex, i.e. articles are stacked together and severely occluded, a pixel-level saliency map is generated, supporting the manipulator 'suck' operation; when the scene allows a rectangular target detection region to be provided, a salient target rectangle can be obtained from the saliency map, supporting the manipulator 'clamp' operation.
Preferably, an operation stop threshold is set on the saliency value; manipulator operations are driven sequentially in descending order of visual saliency value until the saliency value falls below the threshold, at which point the visual saliency of the scene's RGB-D image is recomputed.
To solve the above technical problem, according to another aspect of the present invention, there is provided an article positioning system based on RGB-D image visual saliency using the method of claim 1, comprising: a camera, a mechanical arm and an operating platform, wherein the articles to be grasped are stacked on the operating platform, the mechanical arm is a UR5 mechanical arm, and the operating platform is a horizontal panel; when the system is initialized, the camera calibrates the operating platform, providing a reference plane for the mechanical arm to position articles and for the manipulator to grasp them.
Preferably, when zs is given, O and IRGB-D are conditionally independent, which gives equation (7):
p(IRGB-D|O,zs)=p(IRGB-D|zs) (7)
when the posterior probability of visual saliency in the target region is used as the constraint condition for the salient target, and the image feature distribution is unchanged, equation (6) is approximately transformed into equation (8):
p(O,zs|IRGB-D)∝p(zs|IRGB-D)L(O)C(O,zs) (8)
wherein L(O) represents the target detection region and C(O,zs) represents the constraint, defined in the form:
C(O,zs) = 1 if zs(bj) = 1, and C(O,zs) = 0 otherwise
wherein bj denotes a detected target region and zs(bj) = 1 indicates that the region is visually salient; and L(O) is obtained by detecting target regions in the RGB image with an object detection algorithm, namely the Faster R-CNN algorithm.
Preferably, the camera is a Kinect camera.
Preferably, the manipulator used has two operating functions, suction and clamping; when the scene is complex, i.e. articles are stacked together and severely occluded, a pixel-level saliency map is generated, supporting the manipulator 'suck' operation; when the scene allows a rectangular target detection region to be provided, a salient target rectangle can be obtained from the saliency map, supporting the manipulator 'clamp' operation.
Preferably, an operation stop threshold is set on the saliency value; manipulator operations are driven sequentially in descending order of visual saliency value until the saliency value falls below the threshold, at which point the visual saliency of the scene's RGB-D image is recomputed.
The invention has the beneficial effects that:
1. The object positioning method based on RGB-D image visual saliency uses visual saliency as the basis for the mechanical arm's article-selection decision, avoiding the complexity of training a deep reinforcement learning strategy.
2. The pixel-level visual saliency map and salient-target position information can be generated simultaneously, supporting multiple manipulator operations and overcoming the limited interpretability of the decision basis.
3. The problem of learning an article positioning order strategy is simplified, and the article positioning order criterion has universality and generality.
4. Article selection based on visual saliency only requires sorting the visual saliency values to determine priority, without special training for a particular scene. Visual saliency reflects the attention order of scene articles and serves as the basis for the mechanical arm to select and position articles. Only scene objects need to be detected; no reinforcement learning for specific scenes is required.
5. The proposed estimation method accurately estimates the salient target position, provides a basis for the subsequent sequential positioning of articles, improves system operating efficiency, increases the operation success rate and effectively reduces operation time.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention. The above and other objects, features and advantages of the present invention will become more apparent from the detailed description of the embodiments of the present invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a robotic arm item positioning system;
FIG. 2 is a block diagram of a method for positioning items based on visual saliency facing the operation of a robotic arm;
FIG. 3 is a diagram of item operation priority ordering based on the saliency map, provided for robotic arm operation;
FIG. 4 is a drawing of the test experiment "suck" operation;
FIG. 5 is a diagram of the test experiment "clamp" operation.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
As shown in fig. 1, the mechanical-arm article positioning system based on RGB-D image visual saliency includes a Kinect camera, a mechanical arm, a manipulator and an operating platform. The articles to be grasped are stacked on the operating platform, the mechanical arm is a UR5 mechanical arm, and the operating platform is a horizontal panel; when the system is initialized, the Kinect camera calibrates the operating platform, providing a reference plane for the mechanical arm to position articles and for the manipulator to grasp them. First, an RGB-D image of the operating-platform scene is acquired by the Kinect camera. Then, a visual saliency map is computed from the RGB-D image. Articles are positioned based on the visual saliency map, and manipulator operation information is provided. The manipulator used has two operating functions, suction and clamping. In the specific operating process, the executed flow is as follows: when the scene is complex, i.e. articles are stacked together and severely occluded, a pixel-level saliency map is generated, supporting the manipulator 'suck' operation; when the scene allows a rectangular target detection region to be provided, a salient target rectangle can be obtained from the saliency map, supporting the manipulator 'clamp' operation. Finally, an operation stop threshold is set on the saliency value; the manipulator is driven sequentially in descending order of visual saliency value until the saliency value falls below the threshold, then the visual saliency of the scene's RGB-D image is recomputed and the above steps are repeated.
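The workflow above reduces to an acquire-compute-operate loop. The following is a minimal sketch of that loop, assuming illustrative helper and driver names (capture_rgbd, compute_saliency_map, locate_salient_objects, suck, clamp) that are not part of any published API:

```python
def run_positioning_loop(camera, arm, stop_thresh=0.3):
    """Saliency-driven positioning loop sketched from the described workflow.

    `camera` and `arm` stand in for the Kinect and UR5/manipulator drivers;
    all helper names used here are illustrative assumptions.
    """
    while True:
        rgb, depth = camera.capture_rgbd()                 # acquire RGB-D image
        sal_map = compute_saliency_map(rgb, depth)         # RGB-D saliency map
        targets = locate_salient_objects(sal_map, rgb)     # positions + saliency
        targets.sort(key=lambda t: t.saliency, reverse=True)
        if not targets or targets[0].saliency < stop_thresh:
            break                                          # nothing salient enough
        for t in targets:
            if t.saliency < stop_thresh:
                break                                      # re-image the scene
            if t.box is None:
                arm.suck(t.pixel_position)                 # pixel-level map -> "suck"
            else:
                arm.clamp(t.box)                           # target rectangle -> "clamp"
        # loop: re-acquire the RGB-D image and recompute visual saliency
```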
In this way, scene analysis generates visual saliency at different scales, enabling multiple mechanical-arm operations, so the grasping system is suitable for fast article positioning and operating tasks in different scenes.
FIG. 2 is a block diagram of the salient article positioning method oriented to mechanical-arm operation. A Kinect device collects the RGB-D image, visual saliency detection is performed on the RGB-D image based on a DMNB (mixed-membership naive Bayes) model, and a scene saliency map is computed, i.e. the saliency map is obtained from the RGB-D image. The order of scene article operations is then sorted by saliency value, as shown in FIG. 3.
To calculate the visual saliency of an RGB-D image, a binary random variable zs is defined to represent whether a pixel of the RGB-D image is salient, as shown in equation (1):
p(zs|IRGB-D) = p(zs|xc,xd) = p(zs,xc,xd)/p(xc,xd) (1)
wherein p(zs|IRGB-D) represents the visual saliency of the current scene, i.e. the saliency map, and is defined as the probability p(zs|xc,xd) of whether a pixel of the RGB-D image is salient; the visualization is rendered as a temperature map in which larger saliency values appear warmer and smaller values colder; IRGB-D represents the RGB-D image; xc and xd represent the RGB-image and depth-image salient features respectively, each extracted with a CNN; p(zs,xc,xd) represents the joint probability distribution, and p(xc,xd) represents the salient-feature probability distribution.
Expanding equation (1) using Bayes' theorem gives equation (2):
p(zs|xc,xd) = p(xc,xd|zs)p(zs)/p(xc,xd) (2)
Since xc and xd are conditionally independent given the hidden variable zs, equation (3) holds:
p(xc,xd|zs)=p(xc|zs)p(xd|zs) (3)
Combining equation (3), equation (2) is transformed into equation (4):
p(zs|xc,xd) = p(zs)p(xc|zs)p(xd|zs)/p(xc,xd) (4)
wherein p(zs) represents the prior distribution, p(xc|zs) and p(xd|zs) represent the visual saliency distributions based on the color feature and the depth feature, and p(xc,xd) represents the salient-feature probability distribution, which is dropped for computational efficiency. Finally, the saliency value is calculated by equation (5):
p(zs|xc,xd)∝p(zs)p(xc|zs)p(xd|zs) (5)
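As a concrete illustration of equation (5), the sketch below fuses per-pixel RGB and depth evidence with a naive-Bayes rule; it assumes the class-conditional likelihood maps have already been produced by the two CNN branches and the DMNB model (those stages are outside the sketch), and all array names are illustrative:

```python
import numpy as np

def fuse_saliency(p_xc_sal, p_xc_bg, p_xd_sal, p_xd_bg, prior_salient=0.5):
    """Per-pixel naive-Bayes fusion of RGB and depth saliency evidence.

    Implements p(zs=1|xc,xd) proportional to p(zs=1) p(xc|zs=1) p(xd|zs=1)
    from equation (5), normalised against the non-salient hypothesis zs=0.
    Inputs are HxW arrays of per-pixel likelihoods (illustrative names).
    """
    num = prior_salient * p_xc_sal * p_xd_sal
    den = num + (1.0 - prior_salient) * p_xc_bg * p_xd_bg
    return num / np.clip(den, 1e-12, None)   # HxW saliency map in [0, 1]
```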
(1) Salient object position estimation based on the RGB-D image saliency map
To obtain effective visual saliency values whose ordering determines priority, the invention provides a salient target estimation method, shown in equation (6):
p(O,zs|IRGB-D) = p(IRGB-D|O,zs)p(O,zs)/p(IRGB-D) (6)
wherein O represents the salient object position coordinates and zs represents the visual saliency of the salient object; p(O,zs|IRGB-D) represents the joint distribution of salient objects and visual saliency, p(IRGB-D|O,zs) represents the distribution of the RGB-D image saliency map given the target coordinates and visual saliency, p(O,zs) represents the joint distribution of targets and visual saliency, and p(IRGB-D) represents the RGB-D image feature distribution.
Using this estimation method, the salient target position is estimated accurately, providing a basis for the subsequent sequential positioning of articles, improving system operating efficiency, increasing the operation success rate and effectively reducing operation time.
When zs is given, O and IRGB-D are conditionally independent, which gives equation (7):
p(IRGB-D|O,zs)=p(IRGB-D|zs) (7)
When the posterior probability of visual saliency in the target region is used as the constraint condition for the salient target, and the image feature distribution is unchanged, equation (6) is, for computational efficiency, approximately transformed into equation (8):
p(O,zs|IRGB-D)∝p(zs|IRGB-D)L(O)C(O,zs) (8)
wherein L(O) represents the target detection region and C(O,zs) represents the constraint, defined in the form:
C(O,zs) = 1 if zs(bj) = 1, and C(O,zs) = 0 otherwise
wherein bj denotes a detected target region and zs(bj) = 1 indicates that the region is visually salient. L(O) is obtained by detecting target regions in the RGB image with an object detection algorithm, namely the Faster R-CNN algorithm.
With the non-maximum suppression algorithm resolving duplicate detections of the same region, the bounding rectangle of a salient object can be located when articles are sparsely distributed in the scene, which is suitable for the manipulator's grasping ('clamp') operation.
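For completeness, a compact greedy non-maximum suppression sketch over the scored boxes, under the same illustrative box layout as above:

```python
def non_max_suppression(scored_boxes, iou_thresh=0.5):
    """Greedy NMS over (box, score) pairs so each object is reported once."""
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for box, score in sorted(scored_boxes, key=lambda t: t[1], reverse=True):
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept
```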
In order to verify the effectiveness of the object positioning method based on the visual saliency of the RGB-D image, the following test experiments are carried out:
Forty different objects were selected to construct different scenes, and grasping was performed with the manipulator shown in fig. 1; the grasping experiments are shown in fig. 4 and fig. 5, where fig. 4 shows the 'suck' operation and fig. 5 the 'clamp' operation.
If a conventional feedback map is used and the article corresponding to its maximum cannot be manipulated by the mechanical arm, the robot will repeat this failed operation, since the environment and the feedback map are unchanged. Therefore, if the manipulator fails three times on the same object, the test operation is defined as a failure; the test is defined as successful if the first 10 objects in the scene are successfully operated on by the robot. On this basis, three indexes are defined (a small sketch of their computation follows the list):
(1) the average number of successful grasps per test scenario;
(2) 'suck' operation success rate, defined as the number of successfully grasped objects divided by the number of lifting operations;
(3) test success rate, defined as the number of successful tests divided by the number of tests.
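The three indexes can be computed directly from logged trials. A small sketch, assuming each test is recorded as a dict with illustrative keys "grasped", "lift_attempts" and "success":

```python
def summarize_tests(tests):
    """Compute the three reported indexes from a list of logged test runs."""
    n = len(tests)
    avg_grasped = sum(t["grasped"] for t in tests) / n               # index (1)
    suck_rate = (sum(t["grasped"] for t in tests)
                 / sum(t["lift_attempts"] for t in tests))           # index (2)
    test_rate = sum(1 for t in tests if t["success"]) / n            # index (3)
    return avg_grasped, suck_rate, test_rate

# Example call: summarize_tests([{"grasped": 9, "lift_attempts": 12, "success": True}])
```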
Table 1 records all test results for 20 different scenarios. The experiments show that, after active optimization of the feedback map with reinforcement learning, the suction success rate and test success rate improve somewhat over the plain feedback-map method, but the time complexity increases greatly. The visual-saliency-based method solves object positioning from the perspective of visual saliency without relying on reinforcement learning; it greatly improves the success rate without adding excessive time complexity.
It follows that when relying solely on static feedback maps to obtain operational decisions, failures are likely to occur in cluttered scenarios.
When feedback maps are combined with active detection optimization to improve the success rate, the system keeps nearby items sparsely placed by actively disturbing the item distribution.
The present method introduces salient object position estimation, can automatically detect whether scene objects are sparse, and avoids failed operations, so the grasping success rate is greatly improved. Meanwhile, for situations in which articles are severely overlapped and occluded, the method can output a pixel-level saliency map, giving it stronger adaptability to the scene. A more reliable decision can therefore be obtained.
TABLE 1 System article location test results
The technical scheme of the invention solves the problems of complexity and of the limited interpretability of the decision basis when positioning and grasping articles with machine vision in conventional mechanical-arm operation. The problem of learning an article positioning order strategy is simplified, and the article positioning order criterion has universality and generality. The object positioning method based on RGB-D image visual saliency uses visual saliency as the basis for the mechanical arm's article-selection decision, avoiding the complexity of training a deep reinforcement learning strategy. The method obtains pixel-level and target-level saliency maps, supports multiple manipulator operations, and overcomes the limitation of the interpretation basis.
Strategy training based on deep reinforcement learning requires a large amount of labeled video data and specific hardware support, and yields a strategy that is optimal only for a specific scene. The visual-saliency-based article selection method only requires sorting the visual saliency values to determine priority, without special training for specific scenes.
Scene perception outputs multi-scale positioning information, supporting multiple manipulator positioning requirements. The original feedback-map-based method outputs only pixel-level information and generates no target-level information. The invention computes visual saliency with multiple output modes, suitable for both pixel-level and target-level positioning. Each pixel in the image corresponds to a visual saliency value; the visual saliency map is suitable for the manipulator suction operation, and segmenting the image based on the per-pixel saliency values yields object contours, which is suitable for the manipulator clamping operation.
Mechanical-arm object positioning and grasping can also be based on 6D object pose estimation, but when the scene is complex and objects are stacked and severely occluded, pose estimation is unreliable. Moreover, whether an object's pose suits the manipulator's operation type must be quantified, which requires on-site debugging of the actual scene.
The method and the system for positioning the articles based on the visual saliency of the RGB-D images can be used in the fields of unmanned sorting of warehouses, service robots and the like.
Visual saliency reflects the attention order of scene articles and serves as the basis for the mechanical arm to select and position articles. Only scene objects need to be detected; no reinforcement learning for specific scenes is required. In addition, the invention can simultaneously generate a pixel-level visual saliency map and salient-target position information, supporting multiple manipulator operations.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings, but it should be understood by those skilled in the art that the above embodiments are only for clearly illustrating the present invention, and not for limiting the scope of the present invention, and it is apparent that the scope of the present invention is not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. An article positioning method based on RGB-D image visual saliency is characterized by comprising the following steps:
acquiring an RGB-D image of an operation platform scene by the camera;
secondly, calculating a visual saliency map based on the RGB-D image, namely the RGB-D image saliency map;
thirdly, positioning the articles based on the visual saliency map and providing mechanical arm article operation information;
In step two, visual saliency detection is performed on the RGB-D image, and the visual saliency map is computed as shown in equation (1):
p(zs|IRGB-D) = p(zs|xc,xd) = p(zs,xc,xd)/p(xc,xd) (1)
wherein p(zs|IRGB-D) represents the visual saliency of the current scene, i.e. the visual saliency map, and is defined as the probability p(zs|xc,xd) of whether a pixel of the RGB-D image is salient; IRGB-D represents the RGB-D image; xc and xd represent the RGB-image and depth-image salient features respectively, each extracted with a CNN; p(zs,xc,xd) represents the joint probability distribution and p(xc,xd) the salient-feature probability distribution; the visualization is rendered as a temperature map in which larger saliency values appear warmer and smaller values colder;
based on the RGB-D image saliency map, the salient object position is estimated as shown in equation (6):
p(O,zs|IRGB-D) = p(IRGB-D|O,zs)p(O,zs)/p(IRGB-D) (6)
wherein O represents the salient object position coordinates and zs represents the visual saliency of the salient object; p(O,zs|IRGB-D) represents the joint distribution of salient objects and visual saliency, p(IRGB-D|O,zs) represents the distribution of the RGB-D image saliency map given the target coordinates and visual saliency, p(O,zs) represents the joint distribution of targets and visual saliency, and p(IRGB-D) represents the RGB-D image feature distribution.
2. The RGB-D image visual saliency-based item positioning method of claim 1,
when zs is given, O and IRGB-D are conditionally independent, which gives equation (7):
p(IRGB-D|O,zs)=p(IRGB-D|zs) (7)
when the posterior probability of visual saliency in the target region is used as the constraint condition for the salient target, and the image feature distribution is unchanged, equation (6) is approximately transformed into equation (8):
p(O,zs|IRGB-D)∝p(zs|IRGB-D)L(O)C(O,zs) (8)
wherein L(O) represents the target detection region and C(O,zs) represents the constraint, defined in the form:
C(O,zs) = 1 if zs(bj) = 1, and C(O,zs) = 0 otherwise
wherein bj denotes a detected target region and zs(bj) = 1 indicates that the region is visually salient; and L(O) is obtained by detecting target regions in the RGB image with an object detection algorithm, namely the Faster R-CNN algorithm.
3. The RGB-D image visual saliency-based item positioning method of claim 1,
the camera is a Kinect camera.
4. The RGB-D image visual saliency-based item positioning method of claim 1,
the manipulator used has two operating functions, suction and clamping; when the scene is complex, i.e. articles are stacked together and severely occluded, a pixel-level saliency map is generated, supporting the manipulator 'suck' operation; when the scene allows a rectangular target detection region to be provided, a salient target rectangle can be obtained from the saliency map, supporting the manipulator 'clamp' operation.
5. The RGB-D image visual saliency-based item positioning method of claim 1,
an operation stop threshold is set on the saliency value; manipulator operations are driven sequentially in descending order of visual saliency value until the saliency value falls below the threshold, at which point the visual saliency of the scene's RGB-D image is recomputed.
6. An article positioning system based on RGB-D image visual saliency employing the method of claim 1, comprising: a camera, a mechanical arm and an operating platform, wherein the articles to be grasped are stacked on the operating platform, the mechanical arm is a UR5 mechanical arm, and the operating platform is a horizontal panel; when the system is initialized, the camera calibrates the operating platform, providing a reference plane for the mechanical arm to position articles and grasp them.
7. An article positioning system based on RGB-D image visual saliency as claimed in claim 6,
when zs is given, O and IRGB-D are conditionally independent, which gives equation (7):
p(IRGB-D|O,zs)=p(IRGB-D|zs) (7)
when the posterior probability of visual saliency in the target region is used as the constraint condition for the salient target, and the image feature distribution is unchanged, equation (6) is approximately transformed into equation (8):
p(O,zs|IRGB-D)∝p(zs|IRGB-D)L(O)C(O,zs) (8)
wherein L(O) represents the target detection region and C(O,zs) represents the constraint, defined in the form:
C(O,zs) = 1 if zs(bj) = 1, and C(O,zs) = 0 otherwise
wherein bj denotes a detected target region and zs(bj) = 1 indicates that the region is visually salient; and L(O) is obtained by detecting target regions in the RGB image with an object detection algorithm, namely the Faster R-CNN algorithm.
8. An article positioning system based on RGB-D image visual saliency as claimed in claim 6,
the camera is a Kinect camera.
9. An article positioning system based on RGB-D image visual saliency as claimed in claim 6,
the manipulator used has two operating functions, suction and clamping; when the scene is complex, i.e. articles are stacked together and severely occluded, a pixel-level saliency map is generated, supporting the manipulator 'suck' operation; when the scene allows a rectangular target detection region to be provided, a salient target rectangle can be obtained from the saliency map, supporting the manipulator 'clamp' operation.
10. An article positioning system based on RGB-D image visual saliency as claimed in claim 6,
an operation stop threshold is set on the saliency value; manipulator operations are driven sequentially in descending order of visual saliency value until the saliency value falls below the threshold, at which point the visual saliency of the scene's RGB-D image is recomputed.
CN202010003692.0A 2019-12-30 2020-01-02 Article positioning method and system based on RGB-D image visual saliency Active CN111191650B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911402160 2019-12-30
CN2019114021608 2019-12-30

Publications (2)

Publication Number Publication Date
CN111191650A true CN111191650A (en) 2020-05-22
CN111191650B CN111191650B (en) 2023-07-21

Family

ID=70709757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010003692.0A Active CN111191650B (en) 2019-12-30 2020-01-02 Article positioning method and system based on RGB-D image visual saliency

Country Status (1)

Country Link
CN (1) CN111191650B (en)



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169989A1 (en) * 2008-11-13 2015-06-18 Google Inc. Foreground object detection from multiple images
US20150117783A1 (en) * 2013-10-24 2015-04-30 Adobe Systems Incorporated Iterative saliency map estimation
CN103679740A (en) * 2013-12-30 2014-03-26 中国科学院自动化研究所 ROI (Region of Interest) extraction method of ground target of unmanned aerial vehicle
CN103824284A (en) * 2014-01-26 2014-05-28 中山大学 Key frame extraction method based on visual attention model and system
US20150310303A1 (en) * 2014-04-29 2015-10-29 International Business Machines Corporation Extracting salient features from video using a neurosynaptic system
CN104408733A (en) * 2014-12-11 2015-03-11 武汉大学 Object random walk-based visual saliency detection method and system for remote sensing image
US20160180188A1 (en) * 2014-12-19 2016-06-23 Beijing University Of Technology Method for detecting salient region of stereoscopic image
CN105389550A (en) * 2015-10-29 2016-03-09 北京航空航天大学 Remote sensing target detection method based on sparse guidance and significant drive
US20180285683A1 (en) * 2017-03-30 2018-10-04 Beihang University Methods and apparatus for image salient object detection
CN106997478A (en) * 2017-04-13 2017-08-01 安徽大学 RGB-D image salient target detection method based on salient center prior
CN107992874A (en) * 2017-12-20 2018-05-04 武汉大学 Image well-marked target method for extracting region and system based on iteration rarefaction representation
CN108846416A (en) * 2018-05-23 2018-11-20 北京市新技术应用研究所 The extraction process method and system of specific image
CN109146925A (en) * 2018-08-23 2019-01-04 郑州航空工业管理学院 Conspicuousness object detection method under a kind of dynamic scene
CN109740613A (en) * 2018-11-08 2019-05-10 深圳市华成工业控制有限公司 A kind of Visual servoing control method based on Feature-Shift and prediction
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GERMÁN M. GARCÍA: "Saliency-based object discovery on RGB-D data with a late-fusion approach" *
JIANHUA ZHANG: "Objectness ranking by uniform Bayesian model with multimodal and global cues" *
夏辰: "Research on reconstruction-based bottom-up visual attention models" *
杜杰: "Research on salient object detection based on regional feature fusion" *
王松涛: "Research on visual saliency detection methods for RGB-D images based on feature fusion" *
黄子超: "Research on salient object detection methods guided by prior fusion and features" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022037389A1 (en) * 2020-08-18 2022-02-24 维数谷智能科技(嘉兴)有限公司 Reference plane-based high-precision method and system for estimating multi-degree-of-freedom attitude of object
CN112077842A (en) * 2020-08-21 2020-12-15 上海明略人工智能(集团)有限公司 Clamping method, clamping system and storage medium
CN112223288A (en) * 2020-10-09 2021-01-15 南开大学 Visual fusion service robot control method
CN112223288B (en) * 2020-10-09 2021-09-14 南开大学 Visual fusion service robot control method
CN113222003A (en) * 2021-05-08 2021-08-06 北方工业大学 RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method and system
CN113222003B (en) * 2021-05-08 2023-08-01 北方工业大学 Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D

Also Published As

Publication number Publication date
CN111191650B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111191650A (en) Object positioning method and system based on RGB-D image visual saliency
US10124489B2 (en) Locating, separating, and picking boxes with a sensor-guided robot
US11209265B2 (en) Imager for detecting visual light and projected patterns
JP7352260B2 (en) Robot system with automatic object detection mechanism and its operating method
Schwarz et al. Fast object learning and dual-arm coordination for cluttered stowing, picking, and packing
US9802317B1 (en) Methods and systems for remote perception assistance to facilitate robotic object manipulation
US9649767B2 (en) Methods and systems for distributing remote assistance to facilitate robotic object manipulation
WO2020034872A1 (en) Target acquisition method and device, and computer readable storage medium
EP3186777B1 (en) Combination of stereo and structured-light processing
US9259844B2 (en) Vision-guided electromagnetic robotic system
US9205558B1 (en) Multiple suction cup control
US20230260071A1 (en) Multicamera image processing
CN111571581B (en) Computerized system and method for locating grabbing positions and tracks using image views
JP7377627B2 (en) Object detection device, object grasping system, object detection method, and object detection program
CN113538459A (en) Multi-mode grabbing obstacle avoidance detection optimization method based on drop point area detection
Xu et al. A vision-guided robot manipulator for surgical instrument singulation in a cluttered environment
US20210001488A1 (en) Silverware processing systems and methods
EP4249178A1 (en) Detecting empty workspaces for robotic material handling
WO2023092519A1 (en) Grabbing control method and apparatus, and electronic device and storage medium
Su et al. Pose-Aware Placement of Objects with Semantic Labels-Brandname-based Affordance Prediction and Cooperative Dual-Arm Active Manipulation
WO2023073780A1 (en) Device for generating learning data, method for generating learning data, and machine learning device and machine learning method using learning data
WO2024053150A1 (en) Picking system
US20230364787A1 (en) Automated handling systems and methods
Piao et al. Robotic tidy-up tasks using point cloud-based pose estimation
CN117726896A (en) Computer-implemented method for generating (training) images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant