CN113223091B - Three-dimensional target detection method, three-dimensional target capture device and electronic equipment

Info

Publication number
CN113223091B
CN113223091B CN202110473106.3A CN202110473106A
Authority
CN
China
Prior art keywords
point cloud
camera
world
coordinate system
target
Prior art date
Legal status
Active
Application number
CN202110473106.3A
Other languages
Chinese (zh)
Other versions
CN113223091A (en)
Inventor
刘亦芃
杜国光
赵开勇
Current Assignee
Cloudminds Shanghai Robotics Co Ltd
Original Assignee
Cloudminds Shanghai Robotics Co Ltd
Priority date
Filing date
Publication date
Application filed by Cloudminds Shanghai Robotics Co Ltd filed Critical Cloudminds Shanghai Robotics Co Ltd
Priority to CN202110473106.3A priority Critical patent/CN113223091B/en
Publication of CN113223091A publication Critical patent/CN113223091A/en
Priority to PCT/CN2021/143443 priority patent/WO2022227678A1/en
Application granted granted Critical
Publication of CN113223091B publication Critical patent/CN113223091B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to the technical field of computer vision, and discloses a three-dimensional target detection method, a three-dimensional target grabbing device and electronic equipment. The method comprises the following steps: acquiring a depth image containing a target recognition object; generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, wherein the camera point cloud is a point cloud under a camera coordinate system; converting the camera point cloud into a world point cloud, wherein the world point cloud is a point cloud under a world coordinate system; carrying out target detection on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under a world coordinate system; and generating the bounding box of the target identification object in the camera coordinate system according to the bounding box of the target identification object in the world coordinate system. In this way, the detection quality of the three-dimensional target is improved.

Description

Three-dimensional target detection method, three-dimensional target capture device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a three-dimensional target detection method, a three-dimensional target grabbing device and electronic equipment.
Background
Three-dimensional target detection refers to a technique for detecting the three-dimensional space coordinates of an object. In the field of automatic driving, a vehicle can be controlled through three-dimensional target detection so as to avoid collisions; in the field of service robots, objects can be accurately grasped through three-dimensional target detection.
Three-dimensional target detection generally outputs bounding boxes, categories, and corresponding confidences of target identifiers according to input point cloud data. However, in the related art, when detecting a three-dimensional object, it is generally necessary to obtain external parameters of a camera, and convert point cloud data in a camera coordinate system into point cloud data in a world coordinate system according to the external parameters of the camera. When the external parameters of the camera cannot be acquired, the accuracy of detecting the three-dimensional target by the related technology is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a three-dimensional target detection method, a three-dimensional target capture device, and an electronic device, so as to solve the problem in the prior art that the three-dimensional target detection accuracy is low.
According to an aspect of an embodiment of the present invention, there is provided a three-dimensional target detection method, including:
acquiring a depth image containing a target recognition object;
generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, wherein the camera point cloud is a point cloud under a camera coordinate system;
converting the camera point cloud into a world point cloud, wherein the world point cloud is a point cloud under a world coordinate system;
carrying out target detection on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under a world coordinate system;
and generating the bounding box of the target identification object in the camera coordinate system according to the bounding box of the target identification object in the world coordinate system.
In an alternative approach, the converting the camera point cloud to a world point cloud comprises:
registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from a camera coordinate system to a world coordinate system;
converting the camera point cloud to a world point cloud according to the transformation matrix.
In an alternative manner, the registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from a camera coordinate system to a world coordinate system includes:
respectively calculating the mean values of the camera point clouds in three dimensions;
constructing a homogeneous transformation matrix according to the mean value, and setting the homogeneous transformation matrix as an initial value of an iterative closest point algorithm;
and generating a transformation matrix from a camera coordinate system to a world coordinate system according to the iterative closest point algorithm and the plane point cloud vertical to the gravity axis.
In an alternative approach, the converting the camera point cloud to a world point cloud according to the transformation matrix comprises:
determining a rotation matrix corresponding to the transformation matrix;
if the rotation angle corresponding to the rotation matrix is larger than 90 degrees, generating a world point cloud according to the rotation matrix and the camera point cloud;
and if the rotation angle corresponding to the rotation matrix is not more than 90 degrees, generating the world point cloud according to the supplementary angle rotation amount corresponding to the rotation matrix and the camera point cloud.
In an optional manner, the method further comprises:
constructing a point cloud data training set, wherein the point cloud data training set comprises a plurality of groups of world point cloud data and label information corresponding to each group of world point cloud data;
and training a preset target recognition algorithm by using the point cloud data training set to generate the target recognition model.
In an alternative approach, the constructing a training set of point cloud data includes:
constructing a three-dimensional model library, wherein the three-dimensional model library comprises three-dimensional models of a plurality of identification objects;
after each identification object is placed in a world coordinate system, calculating an initial value of a bounding box of each identification object;
carrying out simulation arrangement on each identification object, and calculating a bounding box simulation value of each identification object at a simulation position;
randomly generating a camera view angle, and rendering based on the camera view angle to generate camera point cloud data of each identification object;
converting the camera point cloud data of each identification object into corresponding world point cloud data;
and adding label information to the corresponding world point cloud data.
According to another aspect of the embodiments of the present invention, there is provided a three-dimensional object capturing method which includes the three-dimensional object detection method described above and further includes:
determining the space position of the target recognition object according to the bounding box of the target recognition object under the camera coordinate system;
and generating a grabbing instruction according to the space position, so that the grabber grabs the target recognition object according to the grabbing instruction.
According to another aspect of the embodiments of the present invention, there is provided a three-dimensional object detecting apparatus, the apparatus including:
the acquisition module is used for acquiring a depth image containing a target recognition object;
the first generation module is used for generating a camera point cloud corresponding to the depth image according to the depth image and camera internal parameters, and the camera point cloud is a point cloud under a camera coordinate system;
the conversion module is used for converting the camera point cloud into a world point cloud, and the world point cloud is a point cloud under a world coordinate system;
the second generation module is used for carrying out target detection on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under a world coordinate system;
and the third generation module is used for generating the bounding box of the target recognition object in the camera coordinate system according to the bounding box of the target recognition object in the world coordinate system.
According to another aspect of the embodiments of the present invention, there is provided an electronic device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the three-dimensional object detection method or the three-dimensional object grabbing method.
According to another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium, where at least one executable instruction is stored, and when the executable instruction is executed on an electronic device, the electronic device executes the operations of the three-dimensional object detection method or the three-dimensional object capture method.
According to the embodiment of the invention, the camera point cloud corresponding to the depth image can be generated from the depth image and the camera internal parameters; after the camera point cloud is converted into the world point cloud, target detection can be performed on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under the world coordinate system; furthermore, the bounding box of the target identification object in the camera coordinate system can be generated according to the bounding box of the target identification object in the world coordinate system, so that the detection of the target identification object is completed. Therefore, the method and the device can generate the bounding box of the target identification object in the camera coordinate system based on the camera point cloud even when the camera external parameters are not acquired, and can thereby improve the detection accuracy of the target identification object.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 shows a schematic flow diagram of a three-dimensional target detection method provided in an embodiment of the present invention;
fig. 2 (a) is a schematic diagram illustrating an identifier placement scene and a corresponding simulated camera position according to an embodiment of the present invention;
FIG. 2 (b) is a schematic diagram showing the rendering effect of the camera in FIG. 2 (a);
fig. 3 (a) is a schematic diagram illustrating another identifier placement scene and a corresponding simulated camera position according to an embodiment of the present invention;
FIG. 3 (b) is a schematic diagram showing rendering effects of the camera in FIG. 3 (a);
fig. 4 is a schematic flow chart illustrating a three-dimensional object capture method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a three-dimensional object detection apparatus provided in an embodiment of the present invention;
fig. 6 shows a schematic structural diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
Fig. 1 shows a flow chart of a three-dimensional object detection method according to an embodiment of the invention, which is performed by an electronic device. The memory of the electronic device is used for storing at least one executable instruction, and the executable instruction causes the processor of the electronic device to execute the operation of the three-dimensional object detection method. The electronic device may be a robot, a car, a computer or other terminal device. As shown in fig. 1, the method comprises the steps of:
step 110: and acquiring a depth image containing the target recognition object.
The depth image may be an RGBD image, i.e., an RGB color image with an associated depth channel. The target identification object in the depth image is the identification object on which target detection needs to be performed, and may be, for example, a cup, a bottle or a piece of fruit. Generally, a depth image including the target recognition object can be acquired by shooting a scene containing the target recognition object with a depth camera.
Step 120: and generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, wherein the camera point cloud is a point cloud under a camera coordinate system.
The camera point cloud corresponding to the depth image is generated according to the depth image and the camera internal parameters, the camera point cloud being a point cloud in the camera coordinate system. The camera internal parameters are parameters related to the characteristics of the camera that captures the depth image, and generally include the focal length, pixel size and the like of the camera.
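By way of illustration only, the back-projection of a depth image into a camera point cloud may be sketched as follows (a minimal Python/NumPy sketch; the parameter names fx, fy, cx, cy for the internal parameters and the depth scale of 1000 are assumptions for illustration, not values prescribed by this embodiment):

```python
import numpy as np

def depth_to_camera_point_cloud(depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project an H x W depth image (raw sensor units) into an N x 3
    camera-coordinate point cloud using pinhole camera internal parameters."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates
    z = depth.astype(np.float64) / depth_scale        # depth in metres (assumed scale)
    x = (u - cx) * z / fx                             # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                             # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                   # discard pixels with no depth reading
```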
Step 130: and converting the camera point cloud into a world point cloud, wherein the world point cloud is a point cloud under a world coordinate system.
In an alternative manner, the camera point cloud and the preset plane point cloud may be registered to generate a transformation matrix from the camera coordinate system to the world coordinate system, and the camera point cloud is converted into the world point cloud according to the transformation matrix. In order to obtain a transformation matrix, the mean values of the camera point cloud in three dimensions can be respectively calculated, a homogeneous transformation matrix is constructed according to the mean values, the homogeneous transformation matrix is set as an initial value of an iterative closest point algorithm, and the transformation matrix from a camera coordinate system to a world coordinate system is generated according to the iterative closest point algorithm and the plane point cloud perpendicular to the gravity axis.
For example, the mean values x̄, ȳ and z̄ of the camera point cloud in each dimension of the three-dimensional space are first calculated, and a homogeneous transformation matrix built from these means, for example the pure translation

T₀ = [[1, 0, 0, −x̄], [0, 1, 0, −ȳ], [0, 0, 1, −z̄], [0, 0, 0, 1]],

is constructed and used as the initial value of the iterative closest point algorithm; a plane point cloud perpendicular to the gravity axis (z axis) of the world coordinate system is then generated, the transformation matrix from the camera point cloud to the plane point cloud is obtained, and the camera point cloud is converted into the world point cloud through the transformation matrix.
In an optional manner, when the camera point cloud is converted into the world point cloud according to the transformation matrix, the rotation matrix corresponding to the transformation matrix is first determined; if the rotation angle corresponding to the rotation matrix is greater than 90 degrees, the world point cloud is generated according to the rotation matrix and the camera point cloud, and if the rotation angle corresponding to the rotation matrix is not greater than 90 degrees, the world point cloud is generated according to the supplementary angle rotation amount corresponding to the rotation matrix and the camera point cloud. For example, if the rotation angle does not exceed 90 degrees, the difference between 180 degrees and the rotation angle is taken as the rotation angle of the rotation matrix.
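The following minimal sketch illustrates one possible reading of this judgment, measuring the rotation angle as the angle by which the gravity (z) axis is rotated and applying a 180-degree flip in the supplementary-angle case; the flip-about-the-x-axis construction is an assumption for illustration rather than the exact construction of this embodiment:

```python
import numpy as np

def camera_to_world(camera_points, transform):
    """Apply the registration transform to an N x 3 camera point cloud; if the
    gravity (z) axis was rotated by no more than 90 degrees, apply an additional
    180-degree flip about the x axis so that the -z axis of the result points
    along gravity (one interpretation of the supplementary-angle rule)."""
    R, t = transform[:3, :3], transform[:3, 3]
    z = np.array([0.0, 0.0, 1.0])
    angle = np.degrees(np.arccos(np.clip(np.dot(R @ z, z), -1.0, 1.0)))
    world = camera_points @ R.T + t                  # rigid transform of each point
    if angle <= 90.0:
        flip = np.diag([1.0, -1.0, -1.0])            # 180-degree rotation about the x axis
        world = world @ flip
    return world
```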
Step 140: and carrying out target detection on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under a world coordinate system.
The target detection can be carried out on the world point cloud according to a preset target recognition model so as to generate a bounding box of the target identification object under the world coordinate system. A bounding box is the minimum enclosing volume of a set of discrete points; bounding-box algorithms solve for an optimal bounding space for the set, the basic idea being to approximately replace a complex geometric object with a slightly larger geometry of simple shape (the bounding box). The bounding box of the target identification object may be, for example, an axis-aligned bounding box (AABB), a bounding sphere, an oriented bounding box (OBB) or a fixed-direction hull (FDH). Before target detection is performed on the world point cloud according to the preset target recognition model, a target recognition algorithm can be trained based on deep learning to generate the target recognition model. The training process of the target recognition algorithm is described in detail below.
Before training a target recognition algorithm, a point cloud data training set needs to be constructed, wherein the point cloud data training set comprises a plurality of groups of world point cloud data and label information corresponding to each group of world point cloud data. A preset target recognition algorithm is then trained by using the point cloud data training set to generate the target recognition model. In one embodiment of the invention, the target recognition model may be, for example, a VoteNet network (a three-dimensional target detection network). VoteNet is an end-to-end 3D object detection network based on the combination of a deep point set network and Hough voting.
In an alternative way, a point cloud data training set can be constructed by:
and constructing a three-dimensional model library, wherein the three-dimensional model library comprises a plurality of three-dimensional models of the identification objects, and each identification object is arranged to a world coordinate system (the x axis is rightward, the y axis is forward, and the z axis is upward), so that the long axis corresponds to the y axis, the width corresponds to the x axis, and the height corresponds to the z axis when the object is vertically placed. The bounding box for each recognition can then be calculated using principal component analysis. Further, a recognition object placing scene for simulation is constructed, each recognition object is placed at the simulation position under the placing scene, and a bounding box of each recognition object at the simulation position is calculated. If the placing position comprises a plurality of identification objects, collision detection can be carried out to ensure that the identification objects do not collide. The placing position is a space position of each identification object in a preset space range under a world coordinate system, after the identification objects are placed to the world coordinate system, the initial position of the identification objects is determined, and the placing position of the identification objects is determined through a translation matrix and a rotation matrix, wherein the rotation matrix is a rotation matrix around a z axis. Further, a plurality of camera view angles can be randomly generated, the world point cloud data can be rendered based on each camera view angle, camera point cloud data of each identification object corresponding to the camera view angle can be generated, and identification object categories corresponding to the camera point cloud data, the mass center, the length, the width, the height of the corresponding bounding box and the rotation angle around the z axis can be stored.
Fig. 2 (a) shows a schematic diagram of an object placement scene and a corresponding simulated camera position provided by an embodiment of the present invention, and fig. 2 (b) shows a schematic diagram of a rendering effect of the camera in fig. 2 (a); in the object placement scene in fig. 2 (a), a camera view angle is randomly generated, and the object point cloud data in the world coordinate system is rendered based on the camera view angle, so that the rendering effect in fig. 2 (b) can be obtained. Similarly, fig. 3 (a) shows a schematic diagram of another object placing scene and a corresponding simulated camera position provided by the embodiment of the present invention, and fig. 3 (b) shows a schematic diagram of a rendering effect of the camera in fig. 3 (a); in the object placement scene in fig. 3 (a), a camera view angle is randomly generated, and the object point cloud data in the world coordinate system is rendered based on the camera view angle, so that the rendering effect in fig. 3 (b) can be obtained. It should be noted that, for any object placement scene, multiple camera view angles may be randomly generated, and the world point cloud of the identification object is rendered based on each camera view angle, so as to obtain the camera point cloud under the corresponding camera view angle.
The following describes a process of calculating a bounding box of a recognition object by a principal component analysis method using a formula.
Suppose M is a 3 × n matrix representing the coordinates of the point cloud in three-dimensional space, where n is the number of points. Let mean(M) denote the matrix formed by the mean values of M in the three dimensions, i.e. mean(M) is also a 3 × n matrix in which the elements of each row are equal, every element of a row being the mean of matrix M in the corresponding dimension. Define the centred point cloud

M̄ = M − mean(M)

and calculate its covariance matrix, for example

Corr = (1/n) · M̄ M̄ᵀ,

and find the eigenvalues A and eigenvectors V of Corr such that Corr V = V A. Further, the column vectors of the eigenvector matrix V are rearranged to obtain eigenvector matrices V′ corresponding to the 6 different placement modes of the identification object.

Further, by calculating M′ = V′M, the corrected point clouds M′ of the identification object in the 6 different placement states can be obtained. After M′ is translated to the origin, i.e. M′ = M′ − mean(M′), the bounding box B of the corrected point cloud M′ can be calculated, for example as the pair of extreme corners

B = [xmin, ymin, zmin; xmax, ymax, zmax]ᵀ,

where xmin, ymin and zmin are the minimum values of the corrected point cloud M′ in the x-axis, y-axis and z-axis directions respectively, and xmax, ymax and zmax are the maximum values of the corrected point cloud M′ in the x-axis, y-axis and z-axis directions respectively.

Using a rotation matrix about the z axis

R = [[cos θ, −sin θ, 0], [sin θ, cos θ, 0], [0, 0, 1]]

and a translation matrix t = [t_x, t_y, t_z]ᵀ, the corrected point cloud M′ may be randomly placed and then updated by M′ = RM′ + t, where θ is the rotation angle of the corrected point cloud M′ about the z axis, and t_x, t_y and t_z are the translations of the corrected point cloud M′ along the x axis, y axis and z axis respectively.
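For illustration, the principal-component-analysis bounding box and the random placement described above can be sketched as follows (a NumPy sketch that keeps a single axis ordering instead of enumerating the 6 placement modes; the function names are illustrative):

```python
import numpy as np

def pca_bounding_box(M):
    """M: 3 x n matrix of point coordinates. Project the points onto their
    principal axes, translate them to the origin, and return the corrected
    point cloud together with its bounding-box extrema."""
    M_bar = M - M.mean(axis=1, keepdims=True)                 # M_bar = M - mean(M)
    corr = (M_bar @ M_bar.T) / M.shape[1]                     # covariance matrix Corr
    _, V = np.linalg.eigh(corr)                               # eigenvectors of Corr (Corr V = V A)
    M_prime = V.T @ M                                         # corrected point cloud M' = V' M
    M_prime = M_prime - M_prime.mean(axis=1, keepdims=True)   # translate to the origin
    mins, maxs = M_prime.min(axis=1), M_prime.max(axis=1)     # (xmin, ymin, zmin), (xmax, ymax, zmax)
    return M_prime, mins, maxs

def place_randomly(M_prime, theta, t):
    """Rotate the corrected point cloud by theta about the z axis and translate by t."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return R @ M_prime + np.asarray(t, dtype=float).reshape(3, 1)
```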
The following describes a process of randomly generating a camera view angle and rendering a point cloud under a world coordinate system based on the camera view angle through a formula.
A position matrix C_P = [x_p, y_p, z_p]ᵀ, a front orientation matrix C_f = [x_f, y_f, z_f]ᵀ and an upper orientation matrix C_t = [x_t, y_t, z_t]ᵀ of the virtual camera can be set; the left direction matrix of the camera is then C_l = [y_t·z_f − z_t·y_f, z_t·x_f − x_t·z_f, x_t·y_f − y_t·x_f]ᵀ, i.e. the cross product C_t × C_f. The camera view angle of the virtual camera at the corresponding position can be determined from the front direction matrix, the upper direction matrix and the left direction matrix. Let T_C be the homogeneous transformation matrix of the camera coordinate system relative to the world coordinate system; then

T_C = [R_C, C_P; 0 0 0, 1],

where [R_C | C_P] is the external parameter matrix of the camera and R_C is the transformation matrix for the direction of the camera coordinate system relative to the world coordinate system. R_C can be obtained by solving the linear equations relating the camera axes to the world axes, for example R_C = [C_l, C_t, C_f] with the camera axes taken column by column as the left, upper and front directions. Further, by inverting T_C, the homogeneous transformation matrix of the world coordinate system relative to the camera coordinate system is obtained:

T_C⁻¹ = [R_Cᵀ, −R_Cᵀ·C_P; 0 0 0, 1].

Further, the camera point cloud coordinates M_C of the recognized object are

M_C = T_C⁻¹ · M_W,

where M_W denotes the world point cloud coordinates of the recognized object expressed in homogeneous form.
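A sketch of the virtual-camera transformation under the axis convention assumed above (camera axes taken column-wise as the left, upper and front directions; this convention and the function name are illustrative assumptions):

```python
import numpy as np

def world_to_camera_points(world_points, cam_pos, cam_front, cam_up):
    """Transform an N x 3 world point cloud into the coordinate system of a
    virtual camera defined by its position, front direction and up direction."""
    f = np.asarray(cam_front, dtype=float)
    f = f / np.linalg.norm(f)
    u = np.asarray(cam_up, dtype=float)
    u = u / np.linalg.norm(u)
    l = np.cross(u, f)                       # left direction C_l = C_t x C_f
    R_c = np.stack([l, u, f], axis=1)        # camera axes as columns (assumed left, up, front)
    T_c = np.eye(4)
    T_c[:3, :3] = R_c                        # T_C: camera frame relative to world frame
    T_c[:3, 3] = cam_pos
    T_wc = np.linalg.inv(T_c)                # world frame relative to camera frame
    homog = np.hstack([world_points, np.ones((len(world_points), 1))])
    return (homog @ T_wc.T)[:, :3]           # M_C = T_C^{-1} * M_W
```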
The embodiment of the invention chooses to train a VoteNet network to obtain the target recognition model. VoteNet only predicts rotation about a single axis relatively well, so before training the VoteNet network based on deep learning, the camera point cloud of the identification object needs to be transformed into the world point cloud, that is, so that the direction of gravity is aligned with the −z axis. Further, the camera point cloud of the identification object may be converted into the world point cloud of the identification object based on the iterative closest point algorithm. The process of converting the camera point cloud of an identification object into a world point cloud is described below.
In an alternative mode, the mean values x̄, ȳ and z̄ of the camera point cloud of the identification object in each dimension of the three-dimensional space are first calculated; based on these per-dimension means, a homogeneous transformation matrix T₀ (constructed in the same way as in step 130 above) is built and used as the initial value of the iterative closest point algorithm. Because the background desktop accounts for a large part of the scene in which the identification object is placed, and the proportion of point cloud corresponding to the background desktop is therefore large, a plane point cloud perpendicular to the z axis is generated; plane registration can then be performed with the iterative closest point algorithm, and the transformation matrix from the camera point cloud of the identification object to the plane point cloud is calculated. The transformation matrix comprises a translation matrix and a rotation matrix, and the rotation angle corresponding to the rotation matrix can further be determined.
Note that, since a top view (plan view) is the default when the robot grasps, the rotation angle of the (0, 0, 1)ᵀ vector should exceed 90 degrees; if the rotation angle of the (0, 0, 1)ᵀ vector does not exceed 90 degrees, the difference between 180 degrees and that rotation angle is taken as the rotation angle of the rotation matrix. Finally, the camera point cloud is converted into the world point cloud through the rotation matrix, that is, the −z axis is made consistent with the direction of gravity.
The camera point cloud data at the camera view angles corresponding to each placement position are converted into world point cloud data, and label information is added to the world point cloud data to construct the point cloud data training set. The label information may include, for example, the category of the corresponding identification object, and the centroid, length, width, height and rotation angle about the z axis of the bounding box at the corresponding simulation position.
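Purely for illustration, one group of world point cloud data with its label information might be organized as follows (the field names are assumptions and are not prescribed by this embodiment):

```python
def make_training_sample(world_points, category, centroid, length, width, height, yaw):
    """Bundle one group of world point cloud data with its label information:
    the identification object category and the bounding box's centroid, length,
    width, height and rotation angle about the z axis."""
    return {
        "points": world_points,              # N x 3 world-coordinate point cloud
        "label": {
            "category": category,
            "centroid": centroid,            # (x, y, z) of the bounding-box centre
            "size": (length, width, height),
            "yaw": yaw,                      # rotation about the z axis (radians assumed)
        },
    }
```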
The VoteNet network takes the world point cloud as input and outputs the 3D bounding box, confidence and category of the target recognition object in the actual placement scene. Detecting a three-dimensional target through the VoteNet network requires only the coordinate information of the world point cloud, does not depend strongly on the density of the world point cloud, and generalizes well. Although VoteNet has achieved good results in 3D object detection for indoor scenes, those results were obtained on real data of large indoor objects. In this specification, VoteNet is applied to simulation data: it is trained with the simulation data and then used to detect the world point cloud obtained from really captured data. Since the geometric characteristics of the simulation data and of the really captured data do not differ greatly, the embodiment of the invention is well feasible.
The following describes the training of the VoteNet network based on the point cloud data training set.
When the VoteNet network is trained, a 2.5D point cloud in the simulation scene is first constructed at a similar density, shooting is then performed by the virtual camera, world point cloud data are generated from the captured camera point cloud data, and the label information of each group of world point cloud data is obtained automatically, so that the training speed of the target recognition model can be improved. The world point cloud data containing the label information are input into the VoteNet network for training, and the total number of training rounds is determined according to the amount of point cloud data. After the VoteNet network has been trained, three-dimensional target detection is performed on the world point cloud processed by the iterative closest point algorithm, and the 3D bounding box, confidence and identification object category of the target identification object corresponding to the camera point cloud data can be obtained.
Step 150: and generating the bounding box of the target identification object in the camera coordinate system according to the bounding box of the target identification object in the world coordinate system.
The bounding box of the target identification object in the world coordinate system can be converted into the bounding box of the target identification object in the camera coordinate system according to the rotation matrix. Further, the bounding box matrix of the target identification object in the camera coordinate system can be obtained by right-multiplying the bounding box matrix of the target identification object in the world coordinate system by the rotation matrix.
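A minimal sketch of this conversion, treating the bounding box as an 8 × 3 matrix of corner points and right-multiplying it by the rotation matrix obtained during registration (the omission of any translation component is a simplifying assumption):

```python
import numpy as np

def bbox_world_to_camera(bbox_corners_world, rotation_cam_to_world):
    """Map an 8 x 3 matrix of bounding-box corners from the world coordinate
    system back to the camera coordinate system by right-multiplying it with
    the camera-to-world rotation matrix (equivalent to applying its transpose
    to each corner); any translation component is omitted in this sketch."""
    return np.asarray(bbox_corners_world) @ np.asarray(rotation_cam_to_world)
```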
According to the embodiment of the invention, the camera point cloud corresponding to the depth image can be generated from the depth image and the camera internal parameters; after the camera point cloud is converted into the world point cloud, target detection can be performed on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under the world coordinate system; furthermore, the bounding box of the target identification object in the camera coordinate system can be generated according to the bounding box of the target identification object in the world coordinate system, so that the detection of the target identification object is completed. Therefore, the method and the device can generate the bounding box of the target identification object in the camera coordinate system based on the camera point cloud even when the camera external parameters are not acquired, and can thereby improve the detection accuracy of the target identification object.
Fig. 4 is a flowchart illustrating a three-dimensional object capture method according to another embodiment of the invention, which is performed by an electronic device. The memory of the electronic device is used for storing at least one executable instruction, and the executable instruction enables the processor of the electronic device to execute the operation of the three-dimensional object grabbing method. As shown in fig. 4, the method comprises the steps of:
step 210: and determining the space position of the target recognition object according to the bounding box of the target recognition object under the camera coordinate system.
The space position of the target recognition object can be determined according to the bounding box of the target recognition object under the camera coordinate system. The spatial position of the target recognition object comprises the spatial coordinates of the target recognition object and the rotation angle of the target recognition object in the three-dimensional space.
Step 220: and generating a grabbing instruction according to the space position, so that the grabber grabs the target recognition object according to the grabbing instruction.
The grabbing instruction can be generated according to the space position of the target recognition object, and the grabbing instruction is sent to a grabber used for grabbing the target recognition object. The grabber can determine the grabbing path of the target identification object according to the grabbing instruction, and grab the target identification object according to the grabbing path.
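By way of example only, a spatial position for grasping could be derived from the bounding box in the camera coordinate system as sketched below (a rough illustration with assumed conventions, not the exact computation of this embodiment):

```python
import numpy as np

def spatial_position_from_bbox(bbox_corners_cam):
    """Derive a simple spatial position for grasping from an 8 x 3 matrix of
    bounding-box corners in camera coordinates: the box centre plus an in-plane
    rotation angle estimated from one box edge."""
    corners = np.asarray(bbox_corners_cam, dtype=float)
    center = corners.mean(axis=0)                     # spatial coordinates of the target
    edge = corners[1] - corners[0]                    # one edge of the bounding box
    yaw = np.arctan2(edge[1], edge[0])                # rotation of the target in the x-y plane
    return center, yaw
```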
According to the embodiment of the invention, the bounding box of the target identification object under the camera coordinate system is generated based on the camera point cloud, the spatial position of the target identification object is determined according to the bounding box of the target identification object under the camera coordinate system, and the grabbing instruction is generated according to the spatial position, so that the grabber can accurately grab the target identification object according to the grabbing instruction.
Fig. 5 shows a schematic structural diagram of a three-dimensional target detection device according to an embodiment of the invention. As shown in fig. 5, the apparatus 300 includes: an acquisition module 310, a first generation module 320, a conversion module 330, a second generation module 340, and a third generation module 350.
The acquiring module 310 is configured to acquire a depth image including a target recognition object;
a first generating module 320, configured to generate a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud under a camera coordinate system;
a conversion module 330, configured to convert the camera point cloud into a world point cloud, where the world point cloud is a point cloud under a world coordinate system;
the second generating module 340 is configured to perform target detection on the world point cloud according to a preset target identification model, so as to generate a bounding box of the target identifier in a world coordinate system;
a third generating module 350, configured to generate a bounding box of the target identifier in the camera coordinate system according to the bounding box of the target identifier in the world coordinate system.
In an alternative manner, the conversion module 330 is configured to:
registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from a camera coordinate system to a world coordinate system;
converting the camera point cloud to a world point cloud according to the transformation matrix.
In an alternative manner, the conversion module 330 is configured to:
respectively calculating the mean values of the camera point clouds in three dimensions;
constructing a homogeneous transformation matrix according to the mean value, and setting the homogeneous transformation matrix as an initial value of an iterative closest point algorithm;
and generating a transformation matrix from a camera coordinate system to a world coordinate system according to the iterative closest point algorithm and the plane point cloud vertical to the gravity axis.
In an alternative, the conversion module 330 is configured to:
determining a rotation matrix corresponding to the transformation matrix;
if the rotation angle corresponding to the rotation matrix is larger than 90 degrees, generating a world point cloud according to the rotation matrix and the camera point cloud;
and if the rotation angle corresponding to the rotation matrix is not more than 90 degrees, generating the world point cloud according to the supplementary angle rotation amount corresponding to the rotation matrix and the camera point cloud.
In an optional manner, the apparatus 300 further comprises a training module for:
constructing a point cloud data training set, wherein the point cloud data training set comprises a plurality of groups of world point cloud data and label information corresponding to each group of world point cloud data;
and training a preset target recognition algorithm by using the point cloud data training set to generate the target recognition model.
In an alternative, the training module is configured to:
constructing a three-dimensional model library, wherein the three-dimensional model library comprises three-dimensional models of a plurality of identification objects;
after each identification object is placed in a world coordinate system, calculating an initial value of a bounding box of each identification object;
carrying out simulation arrangement on each identification object, and calculating a bounding box simulation value of each identification object at a simulation position;
randomly generating a camera view angle, and rendering based on the camera view angle to generate camera point cloud data of each identification object;
converting the camera point cloud data of each identification object into corresponding world point cloud data;
and adding label information to the corresponding world point cloud data.
According to the embodiment of the invention, the camera point cloud corresponding to the depth image can be generated from the depth image and the camera internal parameters; after the camera point cloud is converted into the world point cloud, target detection can be performed on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under the world coordinate system; furthermore, the bounding box of the target identification object in the camera coordinate system can be generated according to the bounding box of the target identification object in the world coordinate system, so that the detection of the target identification object is completed. Therefore, the method and the device can generate the bounding box of the target identification object in the camera coordinate system based on the camera point cloud even when the camera external parameters are not acquired, and can thereby improve the detection accuracy of the target identification object.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and a specific implementation of the electronic device is not limited in the specific embodiment of the present invention.
As shown in fig. 6, the electronic device may include: a processor (processor) 402, a Communications Interface 404, a memory 406, and a Communications bus 408.
Wherein: the processor 402, communication interface 404, and memory 406 communicate with each other via a communication bus 408. A communication interface 404 for communicating with network elements of other devices, such as clients or other servers. The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps in the embodiment of the three-dimensional object detection method described above.
In particular, program 410 may include program code comprising computer-executable instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The electronic device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 406 for storing a program 410. Memory 406 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 410 may specifically be invoked by the processor 402 to cause the electronic device to perform the following operations:
acquiring a depth image containing a target recognition object;
generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, wherein the camera point cloud is a point cloud under a camera coordinate system;
converting the camera point cloud into a world point cloud, wherein the world point cloud is a point cloud under a world coordinate system;
carrying out target detection on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under a world coordinate system;
and generating the bounding box of the target identification object in the camera coordinate system according to the bounding box of the target identification object in the world coordinate system.
In an alternative, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:
registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from a camera coordinate system to a world coordinate system;
and converting the camera point cloud into a world point cloud according to the transformation matrix.
In an alternative, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:
respectively calculating the mean values of the camera point clouds in three dimensions;
constructing a homogeneous transformation matrix according to the mean value, and setting the homogeneous transformation matrix as an initial value of an iterative closest point algorithm;
and generating a transformation matrix from a camera coordinate system to a world coordinate system according to the iterative closest point algorithm and the plane point cloud vertical to the gravity axis.
In an alternative, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:
determining a rotation matrix corresponding to the transformation matrix;
if the rotation angle corresponding to the rotation matrix is larger than 90 degrees, generating a world point cloud according to the rotation matrix and the camera point cloud;
and if the rotation angle corresponding to the rotation matrix is not more than 90 degrees, generating the world point cloud according to the supplementary angle rotation amount corresponding to the rotation matrix and the camera point cloud.
In an alternative, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:
constructing a point cloud data training set, wherein the point cloud data training set comprises a plurality of groups of world point cloud data and label information corresponding to each group of world point cloud data;
and training a preset target recognition algorithm by using the point cloud data training set to generate the target recognition model.
In an alternative, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:
constructing a three-dimensional model library, wherein the three-dimensional model library comprises three-dimensional models of a plurality of identification objects;
after each identification object is placed in a world coordinate system, calculating an initial value of a bounding box of each identification object;
carrying out simulation arrangement on each identification object, and calculating a bounding box simulation value of each identification object at a simulation position;
randomly generating a camera view angle, and rendering based on the camera view angle to generate camera point cloud data of each identification object;
converting the camera point cloud data of each identification object into corresponding world point cloud data;
and adding label information to the corresponding world point cloud data.
In an alternative, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:
determining the space position of the target recognition object according to the bounding box of the target recognition object under the camera coordinate system;
and generating a grabbing instruction according to the space position, so that the grabber grabs the target identification object according to the grabbing instruction.
According to the embodiment of the invention, the camera point cloud corresponding to the depth image can be generated from the depth image and the camera internal parameters; after the camera point cloud is converted into the world point cloud, target detection can be performed on the world point cloud according to a preset target identification model so as to generate a bounding box of the target identification object under the world coordinate system; furthermore, the bounding box of the target identification object in the camera coordinate system can be generated according to the bounding box of the target identification object in the world coordinate system, so that the detection of the target identification object is completed. Therefore, the method and the device can generate the bounding box of the target identification object in the camera coordinate system based on the camera point cloud even when the camera external parameters are not acquired, and can thereby improve the detection accuracy of the target identification object.
An embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores at least one executable instruction, and when the executable instruction is executed on an electronic device, the electronic device is enabled to execute a three-dimensional object detection method in any method embodiment described above.
The embodiment of the invention provides a three-dimensional target detection device, which is used for executing the three-dimensional target detection method.
Embodiments of the present invention provide a computer program, which can be called by a processor to enable an electronic device to execute a three-dimensional object detection method in any of the above method embodiments.
Embodiments of the present invention provide a computer program product, the computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when run on a computer, cause the computer to perform the three-dimensional object detection method in any of the above method embodiments.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limited to the order of execution unless otherwise specified.

Claims (10)

1. A method for three-dimensional object detection, the method comprising:
acquiring a depth image containing a target recognition object;
generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, wherein the camera point cloud is a point cloud under a camera coordinate system;
converting the camera point cloud into a world point cloud, wherein the world point cloud is a point cloud under a world coordinate system;
performing target detection on the world point cloud according to a preset target identification model to generate a bounding box of the target identification object under a world coordinate system;
and generating the bounding box of the target recognition object in the camera coordinate system according to the bounding box of the target recognition object in the world coordinate system.
2. The method of claim 1, wherein the converting the camera point cloud to a world point cloud comprises:
registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from a camera coordinate system to a world coordinate system;
and converting the camera point cloud into a world point cloud according to the transformation matrix.
3. The method of claim 2, wherein registering the camera point cloud with a pre-defined planar point cloud to generate a transformation matrix of a camera coordinate system to a world coordinate system comprises:
respectively calculating the mean values of the camera point clouds in three dimensions;
constructing a homogeneous transformation matrix according to the mean value, and setting the homogeneous transformation matrix as an initial value of an iterative closest point algorithm;
and generating a transformation matrix from a camera coordinate system to a world coordinate system according to the iterative closest point algorithm and the plane point cloud vertical to the gravity axis.
4. The method of claim 2 or 3, wherein converting the camera point cloud into a world point cloud according to the transformation matrix comprises:
determining a rotation matrix corresponding to the transformation matrix;
if the rotation angle corresponding to the rotation matrix is greater than 90 degrees, generating the world point cloud according to the rotation matrix and the camera point cloud;
and if the rotation angle corresponding to the rotation matrix is not greater than 90 degrees, generating the world point cloud according to the residual-angle rotation amount corresponding to the rotation matrix and the camera point cloud.
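The rotation-angle branch of claim 4 can be sketched as follows. Reading the "residual-angle rotation amount" as the remaining rotation up to 180 degrees about the same axis is an assumption made only for this illustration; the claim itself does not fix that interpretation.

```python
import numpy as np

def camera_to_world_points(camera_points, transform):
    """Convert camera points to world points using the rotation branch of claim 4.

    transform: 4x4 camera-to-world matrix; camera_points: (N, 3) array.
    NOTE: the 'residual angle' branch below is an assumed interpretation.
    """
    R, t = transform[:3, :3], transform[:3, 3]
    # Rotation angle of R recovered from its trace.
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

    if np.degrees(theta) > 90.0:
        R_used = R
    else:
        # Assumed interpretation: rotate by the remaining angle (pi - theta)
        # about the same rotation axis, rebuilt with Rodrigues' formula.
        axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
        axis = axis / (np.linalg.norm(axis) + 1e-12)
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])
        residual = np.pi - theta
        R_used = np.eye(3) + np.sin(residual) * K + (1.0 - np.cos(residual)) * (K @ K)

    return camera_points @ R_used.T + t
```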
5. The method of claim 1, further comprising:
constructing a point cloud data training set, wherein the point cloud data training set comprises a plurality of groups of world point cloud data and label information corresponding to each group of world point cloud data;
and training a preset target recognition algorithm by using the point cloud data training set to generate the target recognition model.
6. The method of claim 5, wherein constructing the point cloud data training set comprises:
constructing a three-dimensional model library, wherein the three-dimensional model library comprises three-dimensional models of a plurality of recognition objects;
calculating an initial bounding-box value for each recognition object after the recognition object is placed in the world coordinate system;
performing simulated placement of each recognition object, and calculating a simulated bounding-box value for each recognition object at its simulated position;
randomly generating a camera view angle, and rendering based on the camera view angle to generate camera point cloud data of each recognition object;
converting the camera point cloud data of each recognition object into corresponding world point cloud data;
and adding label information to the corresponding world point cloud data.
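For claim 6, the sketch below shows one way the initial bounding-box value and the label for a simulated placement could be computed. The axis-aligned box and the 4x4 object-pose representation are assumptions of this illustration, not requirements of the claim.

```python
import numpy as np

def initial_bounding_box(model_points):
    """Initial bounding-box value for a recognition object placed in the world
    coordinate system: centre and size of the axis-aligned box over its model points."""
    lo, hi = model_points.min(axis=0), model_points.max(axis=0)
    return (lo + hi) / 2.0, hi - lo

def simulated_label(center, size, object_pose):
    """Label information for one simulated placement.

    object_pose: assumed 4x4 pose of the object in the world frame produced by the
    simulated arrangement; the label keeps the transformed centre and the box size.
    """
    new_center = object_pose[:3, :3] @ center + object_pose[:3, 3]
    return {"center": new_center.tolist(), "size": size.tolist()}
```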
7. A three-dimensional object grasping method comprising the three-dimensional object detection method according to any one of claims 1 to 6, the three-dimensional object grasping method further comprising:
determining the spatial position of the target recognition object according to the bounding box of the target recognition object in the camera coordinate system;
and generating a grasping instruction according to the spatial position, so that a gripper grasps the target recognition object according to the grasping instruction.
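For claim 7, a minimal sketch of recovering the spatial position of the target recognition object from its bounding box is given below. Mapping the box corners back to the camera coordinate system and taking their centroid is one simple choice; an actual grasping instruction would typically also encode an approach direction.

```python
import numpy as np

def bbox_corners_in_camera(corners_world, camera_to_world):
    """Map the eight bounding-box corners from the world coordinate system back to
    the camera coordinate system using the inverse of the camera-to-world transform."""
    world_to_camera = np.linalg.inv(camera_to_world)
    return corners_world @ world_to_camera[:3, :3].T + world_to_camera[:3, 3]

def grasp_position(corners_camera):
    """Spatial position of the target in the camera frame: the centroid of the box corners."""
    return corners_camera.mean(axis=0)
```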
8. A three-dimensional object detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a depth image containing a target recognition object;
the first generation module is used for generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, wherein the camera point cloud is a point cloud in the camera coordinate system;
the conversion module is used for converting the camera point cloud into a world point cloud, wherein the world point cloud is a point cloud in the world coordinate system;
the second generation module is used for performing target detection on the world point cloud according to a preset target recognition model to generate a bounding box of the target recognition object in the world coordinate system;
and the third generation module is used for generating the bounding box of the target recognition object in the camera coordinate system according to the bounding box of the target recognition object in the world coordinate system.
9. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the three-dimensional object detection method according to any one of claims 1-6 or the three-dimensional object grasping method according to claim 7.
10. A computer-readable storage medium having stored therein at least one executable instruction which, when run on an electronic device, causes the electronic device to perform the operations of the three-dimensional object detection method according to any one of claims 1 to 6 or the three-dimensional object grasping method according to claim 7.
CN202110473106.3A 2021-04-29 2021-04-29 Three-dimensional target detection method, three-dimensional target capture device and electronic equipment Active CN113223091B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110473106.3A CN113223091B (en) 2021-04-29 2021-04-29 Three-dimensional target detection method, three-dimensional target capture device and electronic equipment
PCT/CN2021/143443 WO2022227678A1 (en) 2021-04-29 2021-12-30 Three-dimensional target detection method and grabbing method, apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110473106.3A CN113223091B (en) 2021-04-29 2021-04-29 Three-dimensional target detection method, three-dimensional target capture device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113223091A (en) 2021-08-06
CN113223091B (en) 2023-01-24

Family

ID=77090035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110473106.3A Active CN113223091B (en) 2021-04-29 2021-04-29 Three-dimensional target detection method, three-dimensional target capture device and electronic equipment

Country Status (2)

Country Link
CN (1) CN113223091B (en)
WO (1) WO2022227678A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223091B (en) * 2021-04-29 2023-01-24 达闼机器人股份有限公司 Three-dimensional target detection method, three-dimensional target capture device and electronic equipment
CN115222799B (en) * 2021-08-12 2023-04-11 达闼机器人股份有限公司 Method and device for acquiring image gravity direction, electronic equipment and storage medium
CN113689351B (en) * 2021-08-24 2023-10-10 北京石油化工学院 Dangerous chemical storage monitoring method, device and equipment based on depth camera
CN114627239B (en) * 2022-03-04 2024-04-30 北京百度网讯科技有限公司 Bounding box generation method, device, equipment and storage medium
CN114754779B (en) * 2022-04-27 2023-02-14 镁佳(北京)科技有限公司 Positioning and mapping method and device and electronic equipment
CN114643588B (en) * 2022-05-19 2022-08-05 睿驰(深圳)智能有限公司 Control method, system and medium for autonomous mobile disinfection robot
CN115272791B (en) * 2022-07-22 2023-05-26 仲恺农业工程学院 YoloV 5-based multi-target detection and positioning method for tea leaves
CN116330306B (en) * 2023-05-31 2023-08-15 之江实验室 Object grabbing method and device, storage medium and electronic equipment
CN117689678B (en) * 2024-02-04 2024-05-28 法奥意威(苏州)机器人***有限公司 Workpiece weld joint identification method, device, equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106707293B (en) * 2016-12-01 2019-10-29 百度在线网络技术(北京)有限公司 Obstacle recognition method and device for vehicle
CN108986161B (en) * 2018-06-19 2020-11-10 亮风台(上海)信息科技有限公司 Three-dimensional space coordinate estimation method, device, terminal and storage medium
CN109614889B (en) * 2018-11-23 2020-09-18 华为技术有限公司 Object detection method, related device and computer storage medium
CN109784333B (en) * 2019-01-22 2021-09-28 中国科学院自动化研究所 Three-dimensional target detection method and system based on point cloud weighted channel characteristics
CN110344621B (en) * 2019-06-13 2020-05-26 武汉大学 Wheel point cloud detection method for intelligent garage
CN112446227A (en) * 2019-08-12 2021-03-05 阿里巴巴集团控股有限公司 Object detection method, device and equipment
CN111950426A (en) * 2020-08-06 2020-11-17 东软睿驰汽车技术(沈阳)有限公司 Target detection method and device and delivery vehicle
CN111986232B (en) * 2020-08-13 2021-09-14 上海高仙自动化科技发展有限公司 Target object detection method, target object detection device, robot and storage medium
CN112200851B (en) * 2020-12-09 2021-02-26 北京云测信息技术有限公司 Point cloud-based target detection method and device and electronic equipment thereof
CN113223091B (en) * 2021-04-29 2023-01-24 达闼机器人股份有限公司 Three-dimensional target detection method, three-dimensional target capture device and electronic equipment

Also Published As

Publication number Publication date
WO2022227678A1 (en) 2022-11-03
CN113223091A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113223091B (en) Three-dimensional target detection method, three-dimensional target capture device and electronic equipment
CN107063228B (en) Target attitude calculation method based on binocular vision
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
CN111443359B (en) Positioning method, device and equipment
JP4865557B2 (en) Computer vision system for classification and spatial localization of bounded 3D objects
Song et al. CAD-based pose estimation design for random bin picking using a RGB-D camera
CN111738261A (en) Pose estimation and correction-based disordered target grabbing method for single-image robot
CN110378325B (en) Target pose identification method in robot grabbing process
CN112070782B (en) Method, device, computer readable medium and electronic equipment for identifying scene contour
CN113052109A (en) 3D target detection system and 3D target detection method thereof
CN113927597B (en) Robot connecting piece six-degree-of-freedom pose estimation system based on deep learning
CN112509036B (en) Pose estimation network training and positioning method, device, equipment and storage medium
CN112669385A (en) Industrial robot workpiece identification and pose estimation method based on three-dimensional point cloud characteristics
CN111754579A (en) Method and device for determining external parameters of multi-view camera
CN111079565A (en) Construction method and identification method of view two-dimensional posture template and positioning and grabbing system
CN114092553A (en) Disordered grabbing attitude estimation method based on FPFH (fast Fourier transform and inductively coupled plasma) and ICP (inductively coupled plasma) improved algorithm
CN114310892A (en) Object grabbing method, device and equipment based on point cloud data collision detection
CN113313824A (en) Three-dimensional semantic map construction method
CN116309882A (en) Tray detection and positioning method and system for unmanned forklift application
CN114897974B (en) Target object space positioning method, system, storage medium and computer equipment
CN115284279A (en) Mechanical arm grabbing method and device based on aliasing workpiece and readable medium
CN114972495A (en) Grabbing method and device for object with pure plane structure and computing equipment
CN114049444B (en) 3D scene generation method and device
Du et al. Pose Measurement Method of Non-cooperative Targets Based on Semantic Segmentation
CN109903336A (en) Across the visual field estimation method of attitude of flight vehicle and device based on local feature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 201111 Building 8, No. 207, Zhongqing Road, Minhang District, Shanghai
Applicant after: Dayu robot Co.,Ltd.
Address before: 200000 second floor, building 2, no.1508, Kunyang Road, Minhang District, Shanghai
Applicant before: Dalu Robot Co.,Ltd.
GR01 Patent grant