CN115205371A - Device and method for locating a region of an object from a camera image of the object - Google Patents

Device and method for locating a region of an object from a camera image of the object

Info

Publication number
CN115205371A
Authority
CN
China
Prior art keywords: groups, descriptor, image, match, matches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210373602.6A
Other languages
Chinese (zh)
Inventor
A·G·库普奇克
P·C·席林格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Publication of CN115205371A publication Critical patent/CN115205371A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06N 20/00: Machine learning
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/09: Supervised learning
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/10024: Color image
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/10048: Infrared image
    • G06T 2207/10132: Ultrasound image
    • G06T 2207/20072: Graph-based image processing
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06V 2201/06: Recognition of objects for industrial automation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

According to various embodiments, a device and a method for locating regions of objects from camera images of these objects are described. For each reference descriptor assigned to a region to be located, matches are searched for in the camera image, and the matches for different reference descriptors are grouped as follows: starting from groups each consisting of a single match, groups are successively merged by combining a group of matches of a first subset of the reference descriptors with that group of matches of a second subset of the reference descriptors for which the relative orientation of the matches contained in it, together with the matches from the group of the first subset, best agrees with the reference for the relative orientation. The matches of the resulting groups are output as the located positions for the respective objects.

Description

Device and method for locating a region of an object from a camera image of the object
Technical Field
The present disclosure relates to devices and methods for locating portions of objects from camera images of the objects.
Background
In order to enable flexible production or processing of objects by robots, it is desirable for a robot to be able to handle an object regardless of the pose in which the object is placed into the robot's workspace. The robot should therefore be able to recognize which parts of an object are located at which positions, so that it can, for example, grasp the object at the correct location, for instance in order to fasten it to another object or to weld it in its current position. This means that the robot should be able to recognize the pose (position and orientation) of an object, or also a region of the object such as a barcode, from one or more images taken by a camera mounted on the robot. One approach to this is to determine a descriptor, that is to say a point (vector) in a predefined descriptor space, for each part of the object, i.e. for each pixel of the object represented in the camera image plane, wherein a machine learning model is trained to assign the same descriptor to the same part of the object regardless of the object's current pose and thus to recognize the topology of the object in the image, so that it is known, for example, which corner of the object lies at which position in the image. Knowing the camera pose, the pose of the object or the position of a region of the object in three-dimensional space can then be inferred. Such descriptors have the theoretical property that the same part of the object is always assigned the same descriptor. However, this uniqueness only holds in one direction, since a particular descriptor may appear multiple times in a descriptor image generated for a camera image, for example when several objects of the same object type are visible in the camera image. In this case, each descriptor occurrence must be assigned to the correct object (or to the correct instance of a region that appears multiple times in the camera image). Methods are therefore desirable which allow descriptors appearing in the descriptor image for a camera image to be assigned to the correct object, or to the correct region on one or more objects, in order to locate a region on an object by means of the camera image, for example for robot control.
Disclosure of Invention
According to various embodiments, a method for locating parts of objects from camera images of the objects is provided, the method having: specifying parts to be located for the object type of the objects; determining a reference for the relative orientation of the parts to be located; training, for the object type, a machine learning model to map camera images onto descriptor images, wherein each camera image shows an object of the object type and the descriptor image onto which a camera image is to be mapped has, at each image position at which the camera image shows a part of the object, the descriptor of that part of the object; specifying, for each part to be located, the descriptor at that part of the object as the reference descriptor for the part to be located; receiving a camera image of one or more objects of the object type; mapping the camera image onto a descriptor image by means of the trained machine learning model; determining, for each reference descriptor, matches between descriptors of the descriptor image and the reference descriptor; grouping the matches for different reference descriptors into groups by: starting from groups each consisting of one match,
successively merging groups by combining a group of matches of a first subset of the reference descriptors with that group of matches of a second subset of the reference descriptors for which the relative orientation of the matches contained in it, together with the matches from the group of the first subset of the reference descriptors, best agrees with the reference for that relative orientation; and
outputting the matches of the groups as the located parts of the respective objects.
The above method enables descriptors that appear multiple times in the descriptor image generated for a camera image to be assigned correctly to the corresponding object (or target instance, which may also be a region or part of an object). This makes it possible, for several instances of the same target, for example several objects or several identical regions on an object, to locate the parts belonging to each instance, for example to determine the poses of several identical objects in the camera image or the positions in three-dimensional space of several barcodes on one or more objects visible in the camera image.
Various embodiments are described below.
Embodiment 1 is the above-described method for locating parts of objects from camera images of the objects.
Embodiment 2 is the method of embodiment 1, wherein the relative orientation of the parts to be located is a relative orientation in three-dimensional space.
This opens up many possibilities for defining the relative orientation, such as spatial angles, angles between straight lines, lengths of distances between 3D positions, etc. Using the orientation in three-dimensional space (and not only in the camera image plane) also ensures that assignment errors which manifest themselves only in orientation deviations perpendicular to the camera image plane are avoided.
Embodiment 3 is the method of embodiment 1, wherein the relative orientation comprises pairwise distances in three-dimensional space between the parts to be located or the located parts.
This allows a simple determination of the relative orientation (e.g. by depth information or solution to the PnP (Perspective-n-Point) problem) and depends on all three spatial dimensions (not only on the distance in the camera image plane).
Embodiment 4 is the method of any one of embodiments 1 to 3, having: after each merging of two groups, removing the groups that contain a match contained in one of the two merged groups.
This speeds up the method considerably, but in unfavorable ("adversarial") situations it can lead to a loss of quality if incomplete groups are removed, since it may only become apparent which group provides the better relative orientation (closer to the reference) once the group is complete.
Embodiment 5 is the method of any one of embodiments 1 to 3, having: after each merging of two groups, if the combined group resulting from the merge contains a match for each reference descriptor, removing the groups that contain a match contained in one of the two merged groups.
This speeds up the method without loss of quality, since groups are only removed once a complete group has been formed.
Embodiment 6 is the method of any of embodiments 1 to 5, wherein the outputting comprises repeating the following steps until no groups remain or until a specified minimum group size is reached:
outputting, among all groups having the largest number of matches, the group whose matches have the best relative orientation; and removing the output group and all groups containing at least one match that the output group also contains.
In particular, intermediate results can be output, and results can also be output if some groups cannot be combined into a complete group, for example because a part to be located is occluded in the camera image.
Embodiment 7 is the method of any one of embodiments 1 to 6, having: organizing the groups in a graph, wherein each group is assigned to a node and edges run between nodes of groups that are assigned matches for different subsets of the reference descriptors, wherein the weight of an edge between two groups indicates how well the relative orientation of the matches of the combined groups agrees with the reference for that relative orientation.
This provides an efficient data structure for implementing the method. One possibility for defining such a graph is to construct an adjacency matrix whose entries are set according to the edge weights, together with a list defining the assignment of groups to matrix indices. Another possibility is to manage the graph by means of a priority list ("priority queue" or "priority heap") whose entries correspond to the edges of the graph and which is ordered by edge weight.
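As an illustration of the priority-queue possibility, a minimal sketch in Python follows; the use of the heapq module, the tuple layout and the function names are assumptions chosen for this example and are not taken from the embodiments themselves.

```python
import heapq

# Candidate merges are kept as (edge_weight, group_a_id, group_b_id) tuples;
# heapq keeps the tuple with the smallest edge weight (the best merge) at the front.
merge_queue = []

def push_candidate(weight, group_a_id, group_b_id):
    """Register a possible merge of two groups together with its edge weight."""
    heapq.heappush(merge_queue, (weight, group_a_id, group_b_id))

def pop_best_candidate():
    """Return the candidate merge whose combined group deviates least from the reference."""
    return heapq.heappop(merge_queue)
```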
Embodiment 8 is a method for controlling a robot, the method having the steps of:
locating parts of an object to be processed by the robot according to any one of embodiments 1 to 7;
determining a pose of the object from the located parts and controlling the robot according to the determined pose;
and/or
determining a region of the object from the located parts and controlling the robot according to the determined region.
Embodiment 9 is a software or hardware agent, in particular a robot, having: a camera which is set up to provide a camera image of the object; and a control device, which is set up to carry out the method according to one of embodiments 1 to 8.
Embodiment 10 is a software or hardware agent according to embodiment 9, having at least one actuator, wherein the control device is set up to control the at least one actuator using the localized area.
Embodiment 11 is a computer program comprising instructions which, when executed by a processor, cause: the processor performs the method according to any one of embodiments 1 to 8.
Embodiment 12 is a computer readable medium storing instructions that when executed by a processor cause: the processor performs the method according to any one of embodiments 1 to 8.
Drawings
In the drawings, like reference numerals generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various aspects are described with reference to the following drawings:
FIG. 1 shows a robot;
FIG. 2 illustrates training of a neural network in accordance with an embodiment;
FIG. 3 illustrates determination of an object pose or a grasp pose, according to an embodiment;
FIG. 4 illustrates an example of a problem in the location of a part when reference descriptors appear at multiple locations in a camera image;
FIG. 5 illustrates an example of grouping matches between descriptors and reference descriptors;
FIG. 6 shows a flow chart of a method for locating a region of an object from a camera image of the object.
Detailed Description
The following detailed description refers to the accompanying drawings that illustrate, by way of illustration, specific details and aspects of the disclosure in which the invention may be practiced. Other aspects may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present invention. The various aspects of the disclosure are not necessarily mutually exclusive, as some aspects of the disclosure may be combined with one or more other aspects of the disclosure to form new aspects.
Various examples are described in more detail below.
Fig. 1 shows a robot 100.
The robot 100 comprises a robot arm 101, such as an industrial robot arm for handling or mounting a workpiece (or one or more other objects). The robotic arm 101 comprises manipulators 102, 103, 104 and a base (or support) 105 by means of which these manipulators 102, 103, 104 are supported. The term "manipulator" relates to movable members of robotic arm 101 whose manipulation enables physical interaction with the environment, for example, to perform a task. For the control, the robot 100 comprises a (robot) control device 106, which is designed to interact with the environment in accordance with a control program. The last member 104 of the manipulators 102, 103, 104 (which is furthest from the base 105) is also referred to as an end effector 104 and may contain one or more tools, such as a welding torch, a gripper, a painting tool, and the like.
The other manipulators 102, 103 (which are closer to the base 105) may form a positioning device, so that, together with the end effector 104, a robot arm 101 with the end effector 104 at its end is provided. The robot arm 101 is a mechanical arm that can provide functions similar to those of a human arm (possibly with a tool at its end).
The robotic arm 101 may contain link elements 107, 108, 109 that connect the manipulators 102, 103, 104 to each other and to the base 105. Link elements 107, 108, 109 may have one or more links that may respectively provide rotatable (that is, rotational) and/or translational (that is, displacement) movement of the associated manipulators relative to one another. The movements of the manipulators 102, 103, 104 may be initiated by means of actuators, which are controlled by the control device 106.
The term "actuator" may be understood as a component configured to affect a mechanism or process in response to its actuation. The actuator may implement the command created by the control device 106 (so-called activation) as a mechanical movement. The actuator, for example an electromechanical converter, can be designed as: in response to its actuation, electrical energy is converted to mechanical energy.
The term "control means" may be understood as any type of logic implementing entity, which may for example comprise a circuit and/or a processor, which is/are capable of executing software, firmware or a combination thereof stored in a storage medium, and which may for example issue instructions to an actuator in the present example. The control means may be configured, for example, by program code (e.g. software) to control the operation of the system, i.e. the robot in the present example.
In the present example, the control device 106 includes a memory 111 that stores code and data and one or more processors 110, the processors 110 controlling the robotic arm 101 based on the code and data. According to various embodiments, the control device 106 controls the robotic arm 101 based on the machine learning model 112 stored in the memory 111.
According to various embodiments, the machine learning model 112 is designed and trained to enable the robot 100 to recognize, from camera images, a pick-up pose of an object 113 that is placed, for example, into the workspace of the robot arm 101, for instance for picking objects out of a bin ("bin picking"). This means that the robot 100 identifies how it can pick up the object 113, that is, how it must orient its end effector 104 and where it must move the end effector in order to pick up (e.g. grasp) the object 113. A pick-up pose is understood here to contain sufficient information for picking up, that is to say information about the orientation and position of the object 113 from which it can be determined how the object 113 can be grasped. The pick-up pose does not necessarily need to contain complete orientation information about the object 113, since, for example, if the object 113 has a rotationally symmetric part for grasping, it may not matter how this rotationally symmetric part is rotated about its axis of rotation.
The robot 100 may be equipped with one or more cameras 114 that enable it to take images of its workspace. The camera 114 is, for example, attached to the robot arm 101, so that the robot can take images of the object 113 from different viewing angles by moving the robot arm 101 around.
An example of a machine learning model 112 for object recognition is a dense object network. The dense object network maps an image (e.g., an RGB image provided by camera 114) onto a descriptor space image having some selected dimension D. However, other machine learning models 112 may also be used, particularly such machine learning models that do not necessarily generate a "dense" feature map, but rather assign descriptors only to particular points (e.g., corners) of the object.
In the method used according to various embodiments for identifying an object and its pose, a 3D model of the object is assumed to be known, for example a CAD (Computer-Aided Design) model, which is typically available for industrial assembly or machining tasks. A non-linear dimensionality-reduction technique can be used to compute optimal target images for the input images used to train the neural network. Thus, according to various embodiments, supervised training of the neural network is used. It is also possible to take RGBD images (RGB + depth information) of the object and to determine a 3D model of the object from them. Alternatively, unsupervised training can be performed, in which the machine learning model learns the descriptors for the parts of the object itself.
For supervised training, according to one embodiment, data collection is first performed in order to generate training data for training the machine learning model 112. In particular, registered RGB (red-green-blue) images are collected, where a registered image is an RGB image with known intrinsic and extrinsic camera parameters. In a real-world scene, for example, a robot-mounted camera 114 (e.g. a camera mounted at the robot's wrist) is used to scan the object while the robot (e.g. the robot arm 101) moves around. In a simulated scene, photorealistically rendered RGB images with known object poses are used.
After the RGB images are collected, the target images of the RGB images are rendered for supervised training of the neural network.
Assume that the pose of each object in world coordinates is known in each collected RGB image. This is not complicated for simulating a scene, but requires manual adjustment to the scene in the real world, e.g. placing objects in predefined positions. RGBD (RGB + depth information) images may also be used in order to determine the position of the object.
With this information and using vertex descriptor computation techniques, descriptor images (i.e., training output images, also referred to as target images or Ground-Truth images) are rendered for each RGB image (i.e., training input image), such as described below.
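As an illustration of this rendering step, the following is a minimal sketch that projects object vertices with known descriptors into the image plane using a pinhole camera model; the variable names, the nearest-pixel rasterization and the omission of visibility handling are assumptions made for the example, not the patent's rendering procedure.

```python
import numpy as np

def render_target_image(vertices_world, vertex_descriptors, K, T_cam_world, height, width, d):
    """Project vertices (with known descriptors) into the image plane to obtain a target image.

    vertices_world: (N, 3) vertex positions in world coordinates
    vertex_descriptors: (N, d) descriptor assigned to each vertex
    K: (3, 3) intrinsic camera matrix, T_cam_world: (4, 4) world-to-camera transform
    """
    target = np.zeros((height, width, d), dtype=np.float32)

    # Transform the vertices into the camera frame.
    homog = np.hstack([vertices_world, np.ones((len(vertices_world), 1))])
    cam = (T_cam_world @ homog.T).T[:, :3]

    # Keep only vertices in front of the camera and project with the pinhole model.
    in_front = cam[:, 2] > 0
    cam, desc = cam[in_front], vertex_descriptors[in_front]
    pix = (K @ cam.T).T
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)

    # Write each vertex descriptor at its pixel (z-buffering / occlusion handling omitted for brevity).
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    target[v[valid], u[valid]] = desc[valid]
    return target
```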
If a target image is generated for each RGB image, that is pairs of RGB images and target images are formed, these pairs of training input images and associated target images may be used as training data for training the neural network, as shown in fig. 2.
Fig. 2 illustrates training of a neural network 200 in accordance with an embodiment.
The neural network 200 is a fully convolutional network that maps an h × w × 3 tensor (the input image) onto an h × w × D tensor (the output image).
The fully convolutional network comprises a stack 204 of convolutional layers followed by pooling layers, upsampling layers 205 and skip connections 206 that combine the outputs of different layers.
For training, the neural network 200 receives a training input image 201 and outputs an output image 202 with pixel values in descriptor space (e.g. color components in terms of descriptor vector components). A training loss is calculated between the output image 202 and the target image 203 associated with the training input image. This can be done for a batch of training input images; the training loss is averaged over these training input images and used to train the weights of the neural network 200 by stochastic gradient descent. The training loss calculated between the output image 202 and the target image 203 is, for example, an L2 loss (minimizing the pixel-wise squared error between the target image 203 and the output image 202).
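A minimal training-loop sketch along these lines is shown below; it assumes a PyTorch-style fully convolutional model and a data loader yielding (training input image, target image) pairs, and is only an illustration of the described supervised training, not the implementation used here.

```python
import torch
import torch.nn as nn

def train_descriptor_network(model, dataloader, epochs=10, lr=1e-3, device="cuda"):
    """Supervised training of a fully convolutional descriptor network with a pixel-wise L2 loss."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # pixel-wise least-squares error between output image and target image

    for _ in range(epochs):
        for rgb_batch, target_batch in dataloader:
            rgb_batch = rgb_batch.to(device)        # (B, 3, H, W) training input images 201
            target_batch = target_batch.to(device)  # (B, D, H, W) rendered target descriptor images 203

            output = model(rgb_batch)               # (B, D, H, W) output descriptor images 202
            loss = loss_fn(output, target_batch)    # averaged over the batch

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```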
The training input image 201 shows the object and the target image as well as the output image contain vectors in descriptor space. The vectors in descriptor space may be mapped onto colors such that the output image 202 (and the target image 203) resemble a heat map of the object.
The vectors in the descriptor space (also called (dense) descriptors) are d-dimensional vectors (d is, e.g., 1, 2 or 3) that are assigned to each pixel in the respective image (e.g. to each pixel of the input image 201, assuming that the input image 201 and the output image 202 have the same dimensions). The dense descriptors implicitly encode the surface topology of the object shown in the input image 201, invariantly with respect to the object's pose or the camera position.
Given a 3D model of an object, optimal and unique descriptors for each vertex of the object's 3D model can be determined analytically. According to various embodiments, the target images for the registered RGB images are generated using these optimal descriptors (or estimates of these descriptors determined by optimization), which leads to fully supervised training of the neural network 200. In addition, the descriptor space becomes interpretable and optimal regardless of the selected descriptor dimension d.
If now a machine learning model 112, e.g. a neural network 200, is trained for mapping camera images of the object 113 onto descriptor images, the following may be done in order to determine the pick-up pose of the object 113 in an unknown orientation.
First, a plurality of reference points p_i, i = 1, …, N, on the object 113 are selected and the descriptors of these reference points (referred to herein as reference descriptors) are determined. These reference points are the parts of the object to be located in later ("new") camera images, and the reference descriptors form the reference set of descriptors. The selection can be carried out by taking a camera image of the object 113, selecting reference pixels (u_i, v_i) on the object (and thereby correspondingly the reference points of the object) and mapping the camera image onto a descriptor image by means of the neural network 200. The descriptors at the positions in the descriptor image given by the positions of the reference pixels can then be taken as the descriptors of the reference points, that is to say the descriptors of the reference points are d_i = I_d(u_i, v_i), where I_d = f(I; θ) is the descriptor image, f is the mapping (from camera image to descriptor image) implemented by the neural network, I is the camera image and θ are the weights of the machine learning model 200.
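As an illustration of this reference-descriptor extraction, the following minimal sketch assumes that `model` is a callable mapping an (H, W, 3) camera image to an (H, W, D) descriptor image; the function and variable names are chosen for the example.

```python
import numpy as np

def extract_reference_descriptors(model, reference_image, reference_pixels):
    """Read the reference descriptors d_i = I_d(u_i, v_i) from the descriptor image of a reference camera image.

    model: callable mapping an (H, W, 3) camera image to an (H, W, D) descriptor image I_d = f(I; theta)
    reference_pixels: list of (u_i, v_i) pixel coordinates of the selected reference points p_i
    """
    descriptor_image = model(reference_image)  # I_d = f(I; theta), shape (H, W, D)
    return np.stack([descriptor_image[v, u] for (u, v) in reference_pixels])  # shape (N, D)
```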
If the object 113 is now in an unknown orientation, a new camera image I_new is taken and the associated descriptor image I_d,new = f(I_new; θ) is determined by means of the machine learning model. In this new descriptor image, descriptors that match the reference descriptors (within a certain tolerance) are then searched for.
From the positions (u_i, v_i) of the descriptors thus found in the descriptor image I_d,new, and thereby correspondingly in the new camera image I_new, the positions of the corresponding parts in three-dimensional space are determined. For example, a depth image is taken together with the camera image I_new (or the camera image I_new has a depth channel, e.g. it is an RGBD image), so that the three-dimensional position of the i-th part p_i to be located can be determined from (u_i, v_i) (by projecting the depth value at position (u_i, v_i) into the corresponding workspace coordinate system).
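The following sketch illustrates these two steps, matching descriptors within a tolerance and back-projecting the matched pixels with their depth values; the pinhole camera model, the tolerance value and all names are assumptions for the example (in particular, no clustering of adjacent matching pixels is performed).

```python
import numpy as np

def find_matches(descriptor_image, reference_descriptors, tolerance=0.05):
    """For each reference descriptor, return the pixel positions (u, v) whose descriptor
    matches it within the given tolerance."""
    matches_per_reference = []
    for ref in reference_descriptors:
        distances = np.linalg.norm(descriptor_image - ref, axis=-1)  # (H, W) descriptor distances
        v_idx, u_idx = np.nonzero(distances < tolerance)
        matches_per_reference.append(list(zip(u_idx.tolist(), v_idx.tolist())))
    return matches_per_reference

def pixel_to_3d(u, v, depth_image, K, T_world_cam):
    """Back-project a pixel (u, v) with its depth value into the workspace (world) coordinate system.

    K: (3, 3) intrinsic camera matrix, T_world_cam: (4, 4) pose of the camera in the workspace frame."""
    z = depth_image[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    point_cam = np.array([x, y, z, 1.0])
    return (T_world_cam @ point_cam)[:3]
```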
If the positions of a plurality of reference points in space are known, the pick-up pose can be determined from them, as shown in fig. 3.
For example, the positions of two reference points p_1 and p_2 on the object 300 are determined in space and linearly combined, for example averaged, to specify an anchor point 304. To define the gripping direction, a first axis 301 through the anchor point 304 is defined with the direction from p_1 to p_2; optionally, a second axis 302 through the anchor point 304 is defined, for example in the direction of the z-axis of the camera 114 or in the direction of an axis of the workspace coordinate system. A third axis 303 through the anchor point 304 can be calculated as the vector product of the direction vector of the first axis 301 and the direction vector of the second axis 302. These three axes 301 to 303 and the anchor point 304 define the pick-up pose of the object 300. The robot can then be controlled, for example, so that it grips the handle of the object 300 that extends in the direction of the first axis. The reference points p_1 and p_2 are chosen, for example, so that they lie, as shown, along the handle, i.e. along an elongated portion of the object suitable for grasping.
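A minimal sketch of this pose construction from two located reference points is given below; the choice of the camera/workspace z-axis as the default second axis and the function name are assumptions for the example.

```python
import numpy as np

def pick_up_pose_from_two_points(p1, p2, second_axis_dir=np.array([0.0, 0.0, 1.0])):
    """Build an anchor point and three axes (a pick-up pose) from two located reference points in 3D.

    p1, p2: 3D positions of the reference points (e.g. along an elongated, graspable part)
    second_axis_dir: direction of the second axis, e.g. the z-axis of the camera or workspace frame
    """
    anchor = 0.5 * (p1 + p2)                      # anchor point 304: average of the two points

    axis1 = p2 - p1                               # first axis 301: direction from p1 to p2
    axis1 = axis1 / np.linalg.norm(axis1)

    axis2 = second_axis_dir / np.linalg.norm(second_axis_dir)  # second axis 302 (assumed direction)

    axis3 = np.cross(axis1, axis2)                # third axis 303: vector product of axes 301 and 302
    axis3 = axis3 / np.linalg.norm(axis3)

    return anchor, axis1, axis2, axis3
```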
Similarly, three or more reference points can be arranged on a gripping surface of the object, so that from the positions of these reference points the full 6D pick-up pose of the object can be determined, or also the orientation of the gripping surface at which the object can be picked up (grasped or sucked up).
Here, it should be noted that the gripper does not necessarily have to have a pincer shape; it may, for example, also have a suction device in order to grip the object on a suitable surface and thereby pick it up. In order to bring the suction device to the correct position, it may be desirable in such a case, for example, to determine a pick-up pose that describes the direction and position of a surface of the object suitable for suction. This can be achieved, for example, by determining an anchor point and a surface normal vector at the anchor point.
More than three reference points can also be used to determine the pick-up pose, for example in order to reduce errors by averaging.
Similarly to the pick-up pose, the control device 106 can also determine a region on the object 300, such as the bounding box of a barcode arranged on the object 300, in order to identify the object. The control device 106 can then control the robot arm 101, for example, such that it brings the object 300 in front of the camera so that the camera can read the barcode.
As described above, the dense object network assigns to a (e.g. RGB) camera image of an object, taken from an arbitrary perspective, a descriptor image that assigns a multi-dimensional descriptor to each pixel (or each location) of the input camera image. Descriptors have the theoretical property that a particular point on the surface of an object is always associated with the same descriptor, regardless of the perspective. This property can be used in various applications, for example to identify or locate particular parts on an object, or regions on the surface of a target object, by identifying multiple descriptors, for example the descriptors of the corner points of the region or the multiple descriptors of a pick-up pose as described above with reference to fig. 3. Using additional depth information, i.e. RGBD input data (RGB image + depth information) instead of only the RGB image, the determined points can be projected into 3D space in order to completely define such a region.
As described above, by finding a descriptor matching the reference descriptor in the descriptor image for the camera image showing the object, a part on the object can be found.
However, there are difficulties here: the reference descriptor may appear multiple times in the descriptor image for the camera image.
Fig. 4 shows an example of a problem in the positioning of a part when reference descriptors appear at a plurality of positions of a camera image.
In all three examples 401, 402, 403, it is assumed that a rectangular area on the object (e.g. the bounding box of a barcode) is intended to be found.
In the first example 401, the reference descriptor appears at a second location 406 (e.g., due to a false positive descriptor determination) in addition to the first location 404 (within the object 405). As a result, the rectangular area 407 is likely to be determined erroneously so that the rectangular area is partially outside the object 405.
In the second example 402, there are two instances 408, 409 of the object type, so that correspondingly all descriptors are also present more than once. The rectangular region 410 may then be determined erroneously such that it includes parts of both objects 408, 409.
In the third example 403, the situation is similar to the second example 402, but the object and descriptor positions are additionally such that there is a group of descriptor positions whose distances are similar to the distances between the correct (that is to say correctly assigned to one object) descriptor positions. Here, too, the rectangular region 411 may be determined erroneously such that it spans parts of two objects. The example of fig. 4 can be regarded as an "adversarial" situation.
As can be seen from the examples of fig. 4, a region on an object is located incorrectly if the positions at which descriptors match the reference descriptors (referred to herein as matches) are grouped incorrectly, that is to say grouped such that the grouped positions do not belong to the same target instance (the same object or the same region on an object, e.g. the same barcode).
Thus, according to various embodiments, a method for grouping matches is provided that is capable of generating groups of matches that belong to the same target instance. In each such group, each match belongs to a different reference descriptor; that is to say, each group contains, for each reference descriptor, at most one match between that reference descriptor and a descriptor. There may be groups that do not have a match for every reference descriptor (e.g. because the corresponding part of the object is not visible in the current camera image or because a match was not found (false negative)).
By grouping the matches into groups that each belong to one target instance, the step from individual descriptors to target instances is made explicit.
This approach is advantageous even if it is sufficient to find only one target instance, because it increases robustness in the case where a reference descriptor is found multiple times (as in the first example 401). In this case, it ensures that the best match is selected rather than an arbitrary one.
In the following, examples are described in which matches are grouped into groups.
First, a metric for the relative orientation of positions (and thereby also of matches, since each match has a position in the camera image and thereby, for example via depth information, also an assigned position in three-dimensional space) is defined. The metric comprises, for example, the pairwise distances between positions in three-dimensional space. A metric value can thus be assigned to any group of positions (or matches). (In the case of pairwise distances, the metric value is a vector of pairwise distances whose dimension depends on the number of matches.)
The metric is evaluated for the parts of the object to be located in order to obtain the reference for the relative orientation; this reference is specified, for example, together with the reference points p_i on the object.
The metric can be evaluated for a group of matches, and the quality of a group can be determined by comparing the group's metric value with the reference, for example by taking the Euclidean norm of the difference between the two metric values or metric vectors. (If a group has fewer matches than there are reference points and the metric is a vector of pairwise distances, the comparison is made with the corresponding sub-vector of the reference vector.)
The smaller the difference between the group's metric value and the reference, the more similar the relative orientation of the group's matches is to that of the reference points, and the higher the quality of the group.
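The following sketch illustrates such a pairwise-distance metric and the resulting group quality; the handling of incomplete groups via the corresponding sub-vector and all names are assumptions for the example.

```python
import numpy as np
from itertools import combinations

def pairwise_distance_metric(points_3d):
    """Metric for the relative orientation of a set of 3D positions: the vector of pairwise distances."""
    return np.array([np.linalg.norm(np.asarray(a) - np.asarray(b)) for a, b in combinations(points_3d, 2)])

def group_quality(group_points_3d, group_reference_indices, reference_points_3d):
    """Deviation of a group's relative orientation from the reference (smaller = higher quality).

    group_points_3d[k] is the 3D position of the group's match for reference descriptor
    group_reference_indices[k], so that the comparison uses the corresponding sub-vector
    of the reference metric."""
    group_metric = pairwise_distance_metric(group_points_3d)
    reference_subset = [reference_points_3d[i] for i in group_reference_indices]
    reference_metric = pairwise_distance_metric(reference_subset)
    return float(np.linalg.norm(group_metric - reference_metric))  # Euclidean norm of the difference
```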
When the control device 106 now receives a camera image, it maps it onto a descriptor image by means of the machine learning model and determines the matches with the reference descriptors. The control device then groups these matches into groups.
According to various embodiments, this is done using a graph representation in which the nodes correspond to groups. A weighted edge between two nodes (and thus between two groups) indicates that the two groups can be combined into one group, where the weight of the edge indicates how large the deviation of the resulting combined group's metric from the reference is.
The control device 106 starts by generating a graph as follows:
(A1) Each match is treated as a separate group, and a node is added to the graph for each such group.
(A2) Matches for different reference descriptors are connected by edges, that is to say there is no edge between matches for the same reference descriptor. The edge weights are set by evaluating the metric for the group containing the two matches connected by the edge and taking the difference with respect to the reference metric (applying a norm to the difference if the metric is a vector with several components).
Subsequently, the control device 106 repeatedly updates the graph as follows:
(B1) Find the pair of groups that can be combined into the highest-quality group, that is to say find the edge with the smallest weight in the graph;
(B2) Combine the found pair of groups by:
(B2a) adding a new node for the combined group,
(B2b) adding a new edge between this node and every other node whose group contains only matches for reference descriptors for which the combined group does not contain a match, and
(B2c) setting the edge weights of these new edges (analogously to (A2)).
At any point in time during the method, the control device can output the best result so far for the assignment of matches to target instances, based on the current graph, as follows:
(C1) Sort the groups (primarily) by decreasing number of matches and (secondarily) by decreasing quality;
(C2) Select the first group in the list, output it as part of the result and remove it from the list;
(C3) Remove all remaining groups that contain at least one match that the group selected in (C2) also contains;
(C4) Repeat (C1)-(C3) for the remaining groups until no groups remain or until the specified minimum group size is reached.
To improve efficiency, the update steps (B1)-(B2c) can be extended as follows.
First, a pruning step can be added after (B2): if the combined group contains a match for every reference descriptor, that is to say is complete, all nodes (groups) containing a match contained in one of the two merged groups are removed from the graph. It can be shown that these matches cannot be combined into one (or more) further groups of better quality. The pruning therefore has no disadvantage in terms of quality and can always be applied.
Second, the update can be performed "greedily" by carrying out the pruning described above after every combination step (B2), regardless of the number of matches grouped (i.e. even if the combined group is incomplete). This reduces the computational complexity considerably (linear instead of exponential growth), but if target instances are close to each other (e.g. closer than the size of the region described by the parts to be located), the resulting groups may be erroneous.
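To make the grouping procedure concrete, the following is a self-contained sketch of the greedy variant (graph-style initialization, merging with pruning, and ordered output of the groups); the data representation, the use of pairwise 3D distances as the metric and all function names are assumptions chosen for this illustration rather than the implementation of the embodiments.

```python
import numpy as np
from itertools import combinations

def metric(points):
    """Relative-orientation metric of a set of 3D points: the vector of pairwise distances."""
    return np.array([np.linalg.norm(np.asarray(a) - np.asarray(b)) for a, b in combinations(points, 2)])

def deviation(group, reference_points):
    """Deviation of a group's metric from the reference metric (smaller = higher quality).

    group: dict mapping reference-descriptor index -> 3D position (as an (x, y, z) tuple) of the match."""
    idx = sorted(group)
    if len(idx) < 2:
        return 0.0
    group_metric = metric([group[i] for i in idx])
    reference_metric = metric([reference_points[i] for i in idx])
    return float(np.linalg.norm(group_metric - reference_metric))

def group_matches(matches, reference_points):
    """Greedily merge matches into groups, ideally one group per target instance.

    matches: list of (reference_descriptor_index, position) pairs found in the camera image,
             with position given as an (x, y, z) tuple.
    reference_points: list of 3D reference positions, indexed by reference-descriptor index.
    Returns a list of groups, each a dict mapping reference-descriptor index -> 3D position."""
    n_refs = len(reference_points)
    # (A1) Start with one group per match.
    groups = [{ref_idx: pos} for ref_idx, pos in matches]

    while True:
        # (B1) Find the pair of groups whose combined group deviates least from the reference.
        best = None
        for a, b in combinations(range(len(groups)), 2):
            # Only groups covering disjoint sets of reference descriptors may be merged.
            if set(groups[a]) & set(groups[b]):
                continue
            merged = {**groups[a], **groups[b]}
            dev = deviation(merged, reference_points)
            if best is None or dev < best[0]:
                best = (dev, a, b, merged)
        if best is None:
            break  # no pair of groups can be merged any more

        _, a, b, merged = best
        # (B2) Replace the pair by the combined group; greedy pruning: remove every group
        # that contains a match contained in one of the two merged groups.
        used_matches = set(groups[a].items()) | set(groups[b].items())
        groups = [g for g in groups if not (set(g.items()) & used_matches)]
        groups.append(merged)

        if all(len(g) == n_refs for g in groups):
            break  # every remaining group is complete

    # (C1) Order the output by decreasing number of matches, then by increasing deviation (decreasing quality).
    groups.sort(key=lambda g: (-len(g), deviation(g, reference_points)))
    return groups
```

With the four reference points of fig. 5, for example, the matches found on the two object instances would be merged into two complete groups, each defining one region 510.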
Fig. 5 shows an example of grouping matches.
First, four reference points 502 on the object 501 are selected in a first camera image 503 (e.g. manually by a user). The control device 106 determines the associated reference descriptors by mapping the camera image 503 onto a descriptor image by means of the machine learning model 112.
The control device 106 then receives a second camera image 504, which contains two instances 505 of the same object type. The control device maps the second camera image 504 onto a descriptor image and finds the matches with the reference descriptors. In fig. 5, matches for the same reference descriptor are numbered from 1 to 2 or from 1 to 3.
Next, the control device 106 constructs a graph 506 according to (A1) and (A2). It then repeatedly updates the graph by forming larger and eventually complete combined groups and by pruning, until only separate groups remain. These groups define the determined assignment, and the control device locates the region 510 on each object instance and, for example, controls the robot arm accordingly.
In summary, according to various embodiments, a method as illustrated in fig. 6 is provided.
Fig. 6 shows a flow chart 600 of a method for locating parts of objects from camera images of the objects.
In 601, the parts to be located are specified for the object type of the objects.
In 602, a reference for the relative orientation of the parts to be located is determined.
In 603, for the object type, a machine learning model is trained to map camera images onto descriptor images, wherein each camera image shows an object of the object type and the descriptor image onto which a camera image is to be mapped has, at each image position at which the camera image shows a part of the object, the descriptor of that part of the object.
In 604, for each part to be located, the descriptor at that part of the object is specified as the reference descriptor for the part to be located.
In 605, a camera image of one or more objects of the object type is received.
In 606, the camera image is mapped onto a descriptor image by means of the trained machine learning model.
In 607, for each reference descriptor, matches between descriptors of the descriptor image and the reference descriptor are determined.
In 608, the matches for different reference descriptors are grouped into groups by: starting from groups each consisting of one match,
successively merging groups by combining a group of matches of a first subset of the reference descriptors with that group of matches of a second subset of the reference descriptors for which the relative orientation of the matches contained in it, together with the matches from the group of the first subset of the reference descriptors, best agrees with the reference for that relative orientation.
In 609, the matches of the groups are output as the located parts of the respective objects.
In other words, according to various embodiments, the matches of descriptors with different reference descriptors are grouped such that the relative orientation of these matches (that is to say of the positions in the descriptor image of the descriptors that match these reference descriptors) agrees as well as possible with the specified reference for the relative orientation. For example, the shape and size of the bounding box of a barcode can be specified, and the matches are then grouped such that each group defines a region consistent with the specified bounding box. As already mentioned above, the grouping need not be carried to completion; intermediate results can already be output in the case of partial grouping (that is to say, the control device can also control the robot arm using intermediate results). In the case of an intermediate result, a group defines, for example, an edge of the bounding box, but not yet the complete bounding box.
Objects are instances of an object type, i.e., all objects have the same shape specified by the object type, for example. For example, these objects are members having a specific shape. However, as long as the topology of these objects is the same, there may also be differences in shape. For example, the object type may be "shoes", the location to be located may be an edge point of the tongue, and the objects may be different shoes.
The machine learning model is, for example, a neural network. Other machine learning models that are correspondingly trained may be used.
According to various embodiments, the machine learning model assigns descriptors to pixels of the object (within the image plane of the respective camera image). This can be seen as an indirect encoding of the surface topology of the object. This connection between the descriptors and the surface topology can be made explicit by rendering, in which the descriptors are mapped onto the image plane. It should be noted that descriptors on the surfaces of the object model, that is to say at points other than the vertices, can be determined by interpolation. If, for example, a surface is given by three vertices of the object model with corresponding descriptors y1, y2, y3, the descriptor y at any point of that surface can be calculated as a weighted sum y = w1·y1 + w2·y2 + w3·y3 of these values. In other words, the descriptors are interpolated over these vertices.
To generate an image pair of the training data for the machine learning model, an image (e.g. an RGB image) showing the object (or several objects) with a known 3D model (e.g. a CAD model) and a known pose (in a global, that is to say world, coordinate system) is mapped onto a (dense) descriptor image. This descriptor image is optimal in the sense that it is generated by searching for descriptors that minimize the deviation of geometric properties (in particular the proximity of points of the object) between the object model and its representation (embedding) in descriptor space. In practice, since the search is limited to a certain search space, the theoretically optimal solution of this minimization is usually not found; instead, an estimate of the minimum is determined within the constraints of the practical application (available computational accuracy, maximum number of iterations, etc.).
Each training data image pair comprises a training input image of the object and a target image generated by projecting the descriptors of the vertices visible in the training input image onto a training input image plane in accordance with the pose of the object in the training input image. The images are used together with their associated target images for supervised training of the machine learning model.
The machine learning model is thus trained to recognize well-defined features of an object (or of several objects). By evaluating the machine learning model, this information can be used in real time for various applications in robot control, such as predicting object grasping poses for assembly. It should be noted that the supervised training approach enables symmetry information to be encoded explicitly.
The method of fig. 6 may be performed by one or more computers comprising one or more data processing units. The term "data processing unit" may be understood as any type of entity capable of processing data or signals. For example, the data or signals may be processed in accordance with at least one (that is, one or more than one) specific function performed by the data processing unit. The data processing unit may contain or be formed from analog circuitry, digital circuitry, mixed signal circuitry, logic circuitry, a microprocessor, a microcontroller, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a programmable gate array (FPGA), an integrated circuit, or any combination thereof. Any other way of implementing the respective functions described in more detail below may also be understood as a data processing unit or a logic circuit arrangement. It will be readily understood that one or more of the method steps detailed herein may be implemented (e.g., realized) by a data processing unit by means of one or more specific functions performed by the data processing unit.
Various embodiments may receive and use sensor signals from various sensors, such as cameras (e.g. RGB or RGB-D), video sensors, radar sensors, LiDAR sensors, ultrasound sensors, thermal imaging sensors, and so forth, for example in order to obtain sensor data showing objects. Embodiments may be used for the autonomous control of a robot, for example a robot manipulator, in order to accomplish various manipulation tasks in various scenarios. In particular, the embodiments are applicable to the control and monitoring of the execution of manipulation tasks, for example in assembly lines.
Although specific embodiments are illustrated and described herein, one of ordinary skill in the art will recognize that: the particular embodiments shown and described may be substituted for a wide variety of alternate and/or equivalent implementations without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Accordingly, it is intended that this invention be limited only by the claims and the equivalents thereof.

Claims (12)

1. A method for locating parts of objects from camera images of the objects, the method having:
specifying parts to be located for the object type of the objects;
determining a reference for the relative orientation of the parts to be located;
training, for the object type, a machine learning model to map camera images onto descriptor images, wherein each camera image shows an object of the object type and the descriptor image onto which a camera image is to be mapped has, at each image position at which the camera image shows a part of the object, the descriptor of that part of the object;
specifying, for each part to be located, the descriptor at that part of the object as the reference descriptor for the part to be located;
receiving camera images of one or more objects of the object type;
mapping the camera image onto a descriptor image by means of the trained machine learning model;
determining, for each reference descriptor, matches between descriptors of the descriptor image and the reference descriptor;
grouping the matches for different reference descriptors into groups by: starting from groups each consisting of one match,
successively merging groups by combining a group of matches of a first subset of the reference descriptors with that group of matches of a second subset of the reference descriptors for which the relative orientation of the matches contained in it, together with the matches from the group of the first subset of the reference descriptors, best agrees with the reference for that relative orientation; and
outputting the matches of the groups as the located parts of the respective objects.
2. The method according to claim 1, wherein the relative orientation of the parts to be located is a relative orientation in three-dimensional space.
3. The method according to claim 1, wherein the relative orientation comprises pairwise distances in three-dimensional space between the parts to be located or the located parts.
4. The method according to any one of claims 1 to 3, having: after each merging of two groups, removing the groups that contain a match contained in one of the two merged groups.
5. The method according to any one of claims 1 to 3, having: after each merging of two groups, if the combined group resulting from the merge contains a match for each reference descriptor, removing the groups that contain a match contained in one of the two merged groups.
6. The method according to any one of claims 1 to 5, wherein the outputting comprises repeating the following steps until no groups remain or until a specified minimum group size is reached:
outputting, among all groups having the largest number of matches, the group whose matches have the best relative orientation; and removing the output group and all groups containing at least one match that the output group also contains.
7. The method according to any one of claims 1 to 6, having: organizing the groups in a graph, wherein each group is assigned to a node and edges run between nodes of groups that are assigned matches for different subsets of the reference descriptors, wherein the weight of an edge between two groups indicates how well the relative orientation of the matches of the combined groups agrees with the reference for that relative orientation.
8. A method for controlling a robot, the method having the steps of:
locating parts of an object to be processed by the robot according to any one of claims 1 to 7;
determining a pose of the object from the located parts and controlling the robot according to the determined pose;
and/or
determining a region of the object from the located parts and controlling the robot according to the determined region.
9. A software or hardware agent, in particular a robot, having the following components:
a camera set up to provide a camera image of an object;
a control device which is set up to carry out the method according to any one of claims 1 to 8.
10. Software or hardware agent according to claim 9, having at least one actuator, wherein the control device is set up for controlling the at least one actuator using the localized area.
11. A computer program comprising instructions that when executed by a processor cause: the processor performs the method of any one of claims 1 to 8.
12. A computer-readable medium storing instructions that when executed by a processor cause: the processor performs the method of any one of claims 1 to 8.
CN202210373602.6A 2021-04-12 2022-04-11 Device and method for locating a region of an object from a camera image of the object Pending CN115205371A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102021109036.9A DE102021109036A1 (en) 2021-04-12 2021-04-12 DEVICE AND METHOD FOR LOCATING LOCATIONS OF OBJECTS FROM CAMERA IMAGES OF THE OBJECTS
DE102021109036.9 2021-04-12

Publications (1)

Publication Number Publication Date
CN115205371A true CN115205371A (en) 2022-10-18

Family

ID=83361800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210373602.6A Pending CN115205371A (en) 2021-04-12 2022-04-11 Device and method for locating a region of an object from a camera image of the object

Country Status (2)

Country Link
CN (1) CN115205371A (en)
DE (1) DE102021109036A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024125920A1 (en) * 2022-12-13 2024-06-20 Kuka Deutschland Gmbh Calibrating a gripper controller
DE102022213555A1 (en) 2022-12-13 2024-06-13 Kuka Deutschland Gmbh Object position detection with automated feature extraction and/or feature assignment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8908913B2 (en) 2011-12-19 2014-12-09 Mitsubishi Electric Research Laboratories, Inc. Voting-based pose estimation for 3D sensors
US9875427B2 (en) 2015-07-28 2018-01-23 GM Global Technology Operations LLC Method for object localization and pose estimation for an object of interest

Also Published As

Publication number Publication date
DE102021109036A1 (en) 2022-10-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination