WO2022116423A1 - Object pose estimation method, apparatus, electronic device, and computer storage medium - Google Patents

Object pose estimation method, apparatus, electronic device, and computer storage medium

Info

Publication number
WO2022116423A1
Authority
WO
WIPO (PCT)
Prior art keywords
target object
loss value
point set
point
visibility
Prior art date
Application number
PCT/CN2021/083083
Other languages
English (en)
French (fr)
Inventor
王健宗
李泽远
朱星华
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022116423A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/10: Segmentation; Edge detection
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to an object pose estimation method, apparatus, electronic device, and computer-readable storage medium.
  • the grasping and sorting tasks of industrial robotic arms mainly rely on the pose estimation of the objects to be grasped.
  • the pose estimation methods of objects mainly use point-by-point teaching or 2D visual perception methods.
  • the point-by-point teaching method is complex and time-consuming, and the 2D visual perception method will lead to inaccurate pose estimation of objects due to the cluttered placement of objects and the occlusion between objects.
  • An object pose estimation method provided by this application includes:
  • A scene depth map of a target object is acquired by a preset camera device, and a three-dimensional point cloud of the scene depth map is calculated according to the pixels in the scene depth map;
  • Target points in the three-dimensional point cloud are extracted by a pre-built deep learning network to obtain a target object point set;
  • A visibility loss value of the target object is calculated according to the three-dimensional point cloud and the target object point set;
  • Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
  • Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object;
  • The pose of the target object is calculated according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  • the present application also provides a device for estimating the pose of a target object, the device comprising:
  • a three-dimensional point cloud acquisition module configured to obtain a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
  • a target object point set extraction module configured to extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
  • a visibility loss value calculation module configured to calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
  • a key point loss value calculation module configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set;
  • a semantic loss value calculation module configured to perform semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object; and
  • the pose calculation module is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the multi-task joint model obtained by pre-training.
  • the present application also provides an electronic device, the electronic device comprising:
  • a memory storing at least one computer program; and
  • a processor that executes the computer program stored in the memory to implement the object pose estimation method described below:
  • acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
  • extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
  • calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
  • performing Hough voting on the target object point set to obtain a key point set, and calculating the key point loss value of the target object according to the key point set;
  • performing semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object;
  • calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  • the present application also provides a computer-readable storage medium, including a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program; when the computer program is executed by a processor, the object pose estimation method described below is implemented:
  • acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
  • extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
  • calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
  • performing Hough voting on the target object point set to obtain a key point set, and calculating the key point loss value of the target object according to the key point set;
  • performing semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object;
  • calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  • FIG. 1 is a schematic flowchart of an object pose estimation method provided by an embodiment of the present application
  • FIG. 2 is a schematic block diagram of an object pose estimation apparatus provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device for implementing the object pose estimation method provided by an embodiment of the present application;
  • the embodiments of the present application provide a method for estimating the pose of an object.
  • the execution subject of the object pose estimation method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server, a terminal, and the like.
  • the object pose estimation method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the object pose estimation method includes:
  • S1. Use a preset camera device to acquire a scene depth map of a target object, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map.
  • the camera device may be a 3D camera
  • the target object may be a target object to be grasped by a manipulator.
  • the scene depth image is also called a range image, and refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value.
  • the scene depth map can be calculated as point cloud data after coordinate transformation.
  • the scene depth map may be stored in a blockchain node.
  • the 3D point cloud of the scene depth map can be calculated from the pixel points in the scene depth map through the formula given in the detailed description, where:
  • x, y, z are the coordinates of the point in the three-dimensional point cloud;
  • u, v are the row and column where the pixel point is located in the scene depth map;
  • c x and c y are the two-dimensional coordinates of the pixel point in the scene depth map;
  • f x , f y , and d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
  • the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene of the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
  • the pre-built deep learning network is a convolutional neural network including a convolution layer, a pooling layer, and a fully connected layer.
  • the convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud, and the pooling layer compresses the data obtained by feature extraction, simplifies the computational complexity, and extracts main feature data.
  • the fully connected layer concatenates all the data obtained by feature extraction to obtain the feature point set.
  • the deep learning network further includes a classifier. Specifically, the classifier uses a given category to learn classification rules using known training data, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
  • extracting target points in the three-dimensional point cloud by using the deep learning network to obtain a target object point set includes:
  • extracting a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
  • classifying the feature point set into a target point set and a non-target point set by using the classifier in the deep learning network, and extracting the target point set to obtain the target object point set.
  • visibility is the degree to which a target object can be seen by normal eyesight. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss value. Heavily occluded objects are not the objects that the robotic arm should prioritize for grasping, because they are most likely at the bottom of the pile and do not provide enough information for pose estimation. In order to reduce the interference caused by these objects, this embodiment of the application calculates the visibility loss value of each object.
  • the actual visibility of the target object is calculated as the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud, and the visibility loss value of the target object is obtained by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object, where:
  • N i represents the number of points in the target object point set of target object i;
  • N max represents the number of points in the largest point set among the objects contained in the 3D point cloud.
  • performing Hough voting on the target object point set to obtain a key point set including:
  • the target object sampling point set is obtained by sampling the target object point set, and the Euclidean distance offset of the target object sampling point is calculated to obtain the offset;
  • Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
  • the key point set is divided into a common key point set and a central key point, and the key point loss value L kps of the key point set is calculated by a point-wise feature regression algorithm, with L kps = γ 1 L kp + γ 2 L c , where:
  • L kp represents the loss of common key points
  • N is the number of points in the target object point set
  • M is the number of common key points
  • L c represents the loss of the center key point
  • Δx i is the actual offset from the common key point to the center key point
  • ⁇ 1 is the weight of the loss of the common key point
  • ⁇ 2 is the weight of the loss of the center key point.
  • the semantic segmentation calculates the semantic loss L s of the target object from the pixel points of the scene depth map using the formula L s = -α(1-q i ) γ log(q i ), where:
  • α represents the balance parameter of the camera device;
  • γ represents the focus parameter of the camera device;
  • q i represents the confidence that the i-th pixel in the scene depth map belongs to the foreground or the background.
  • the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
  • the final loss value of the target object is calculated with the multi-task joint model L mt = μ 1 L kps + μ 2 L s + μ 3 L v , where:
  • L kps represents the key point loss value;
  • L s represents the semantic loss value;
  • L v represents the visibility loss value;
  • μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model.
  • the embodiment of the present application calculates the three-dimensional point cloud of the scene depth map by acquiring the scene depth map of the target object, uses the deep learning network to extract the target object point set from the three-dimensional point cloud, calculates the visibility loss value, the key point loss value, and the semantic loss value of the target object according to the three-dimensional point cloud and the target object point set, and finally obtains the pose of the target object according to the visibility loss value, the key point loss value, and the semantic loss value.
  • the object pose estimation method proposed in the embodiment of the present application performs pose estimation on the target object according to the loss of visibility, key points, and semantics, and therefore, the accuracy of the object pose estimation can be improved.
  • Referring to FIG. 2, it is a schematic block diagram of the object pose estimation apparatus of the present application.
  • the object pose estimation apparatus 100 described in this application may be installed in an electronic device.
  • the object pose estimation apparatus may include a three-dimensional point cloud acquisition module 101, a target object point set extraction module 102, a visibility loss value calculation module 103, a key point loss value calculation module 104, a semantic loss value calculation module 105, and a pose calculation module 106.
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the three-dimensional point cloud acquiring module 101 is configured to acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map.
  • the camera device may be a 3D camera
  • the target object may be a target object to be grasped by a manipulator.
  • the scene depth image (depth image) is also called a range image, and refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value.
  • the scene depth map can be calculated as point cloud data after coordinate transformation.
  • the 3D point cloud of the scene depth map can be calculated from the pixel points in the scene depth map through the formula given in the detailed description, where:
  • x, y, z are the coordinates of the point in the three-dimensional point cloud;
  • u, v are the row and column where the pixel point is located in the scene depth map;
  • c x and c y are the two-dimensional coordinates of the pixel point in the scene depth map;
  • f x , f y , and d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
  • the target object point set extraction module 102 uses a pre-built deep learning network to extract target points in the three-dimensional point cloud to obtain a target object point set.
  • the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene of the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
  • the pre-built deep learning network is a convolutional neural network, including a convolution layer, a pooling layer, and a fully connected layer.
  • the convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud, and the pooling layer compresses the data obtained by feature extraction, simplifies the computational complexity, and extracts main feature data.
  • the fully connected layer is:
  • the feature point set is obtained by concatenating all the data obtained by feature extraction.
  • the deep learning network further includes a classifier. Specifically, the classifier uses a given category to learn classification rules using known training data, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
  • the target object point set extraction module 102 is specifically configured to:
  • extract a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network; and
  • classify the feature point set into a target point set and a non-target object point set by using the classifier in the deep learning network, and extract the target object point set therefrom.
  • the visibility loss value calculation module 103 is configured to calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set.
  • visibility is the degree to which a target object can be seen by normal eyesight. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss value. Heavily occluded objects are not the objects that the robotic arm should prioritize for grasping, because they are most likely at the bottom of the pile and do not provide enough information for pose estimation. In order to reduce the interference caused by these objects, this embodiment of the application calculates the visibility loss value of each object.
  • the visibility loss value calculation module 103 is specifically configured to:
  • calculate the actual visibility of the target object as the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud; and
  • obtain the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object, where:
  • N i represents the number of points in the target object point set of target object i;
  • N max represents the number of points in the largest point set among the objects contained in the 3D point cloud.
  • the key point loss value calculation module 104 is configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set.
  • performing Hough voting on the target object point set to obtain a key point set including:
  • the target object sampling point set is obtained by sampling the target object point set, and the Euclidean distance offset of the target object sampling point is calculated to obtain the offset;
  • Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
  • the key point set is divided into a common key point set and a central key point, and the key point loss value L kps of the key point set is calculated by a point-wise feature regression algorithm, with L kps = γ 1 L kp + γ 2 L c , where:
  • L kp represents the loss of common key points
  • N is the number of points in the target object point set
  • M is the number of common key points
  • L c represents the loss of the center key point
  • ⁇ x i is the actual offset from the common key point to the center key point
  • ⁇ 1 is the weight of the loss of the common key point
  • ⁇ 2 is the weight of the loss of the center key point.
  • the semantic loss value calculation module 105 is configured to perform semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object.
  • the semantic segmentation calculates the semantic loss L s of the target object from the pixel points of the scene depth map using the formula L s = -α(1-q i ) γ log(q i ), where:
  • α represents the balance parameter of the camera device;
  • γ represents the focus parameter of the camera device;
  • q i represents the confidence that the i-th pixel in the scene depth map belongs to the foreground or the background.
  • the pose calculation module 106 is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the multi-task joint model obtained by pre-training.
  • the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
  • the pose calculation module 106 uses the following multi-task joint model to calculate the final loss value of the target object: L mt = μ 1 L kps + μ 2 L s + μ 3 L v , where:
  • L kps represents the key point loss value;
  • L s represents the semantic loss value;
  • L v represents the visibility loss value;
  • μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model.
  • the embodiment of the present application further adjusts the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
  • the pose calculation module 106 sends the pose of the target object to a pre-built robotic arm, and uses the robotic arm to perform the target object grasping task.
  • Referring to FIG. 3, it is a schematic structural diagram of an electronic device implementing the object pose estimation method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an object pose estimation program 12.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium may be volatile or non-volatile.
  • the readable storage medium includes a flash memory, a mobile hard disk, a multimedia card, a card-type memory (eg, SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the object pose estimation program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control core (Control Unit) of the electronic device, and uses various interfaces and lines to connect the various components of the entire electronic device; by running or executing the programs or modules stored in the memory 11 (for example, executing the object pose estimation program, etc.) and calling the data stored in the memory 11, it performs various functions of the electronic device 1 and processes data.
  • the bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and the like.
  • FIG. 3 only shows an electronic device with some components. Those skilled in the art will understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the object pose estimation program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs, and when run in the processor 10, it can realize:
  • acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
  • extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
  • calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
  • performing Hough voting on the target object point set to obtain a key point set, and calculating the key point loss value of the target object according to the key point set;
  • performing semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object;
  • calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), and the like.
  • the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like; the storage data area may store data created according to the use of blockchain nodes, and the like.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database and is a chain of data blocks generated in association with each other using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A target object pose estimation method, comprising: obtaining a three-dimensional point cloud from a scene depth map of a target object; extracting a target object point set from the three-dimensional point cloud; calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set; calculating a key point loss value of the target object by performing Hough voting on the target object point set; performing semantic segmentation on the pixels of the scene depth map to obtain a semantic loss value of the target object; and calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a multi-task joint model. A target object pose estimation apparatus, device, and storage medium are also provided. The method also relates to blockchain technology, and the scene depth map can be stored in a blockchain node. The method can accurately analyze the pose of a target object to be grasped, so as to improve the grasping accuracy of a robotic arm.

Description

Object pose estimation method, apparatus, electronic device, and computer storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 1, 2020, with application number 202011385260.7 and entitled "Object pose estimation method, apparatus, electronic device, and computer storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the technical field of artificial intelligence, and in particular to an object pose estimation method, apparatus, electronic device, and computer-readable storage medium.
BACKGROUND
The inventors have realized that, with the continuous development of robotic arms in the industrial field and the in-depth application of intelligent vision systems, robotic arms equipped with intelligent vision systems have begun to undertake complex tasks such as intelligent sorting and flexible manufacturing, and have become industrial machines that save human resources.
The grasping and sorting tasks of industrial robotic arms mainly rely on pose estimation of the objects to be grasped. At present, object pose estimation mainly uses point-by-point teaching or 2D visual perception methods. However, in an industrial environment, the point-by-point teaching method is complex and time-consuming, and the 2D visual perception method leads to inaccurate object pose estimation because objects are placed in a cluttered manner and occlude one another.
SUMMARY
An object pose estimation method provided by this application includes:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
This application also provides a target object pose estimation apparatus, the apparatus including:
a three-dimensional point cloud acquisition module, configured to acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
a target object point set extraction module, configured to extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
a visibility loss value calculation module, configured to calculate a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
a key point loss value calculation module, configured to perform Hough voting on the target object point set to obtain a key point set, and calculate a key point loss value of the target object according to the key point set;
a semantic loss value calculation module, configured to perform semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
a pose calculation module, configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
This application also provides an electronic device, the electronic device including:
a memory storing at least one computer program; and
a processor that executes the computer program stored in the memory to implement the object pose estimation method described below:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
This application also provides a computer-readable storage medium, including a storage data area and a storage program area, the storage data area storing created data and the storage program area storing a computer program; wherein, when the computer program is executed by a processor, the object pose estimation method described below is implemented:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of an object pose estimation method provided by an embodiment of this application;
FIG. 2 is a schematic block diagram of an object pose estimation apparatus provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing the object pose estimation method provided by an embodiment of this application;
The realization of the objectives, functional features, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
DETAILED DESCRIPTION
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
An embodiment of this application provides an object pose estimation method. The execution subject of the object pose estimation method includes, but is not limited to, at least one of the electronic devices that can be configured to execute the method provided by the embodiments of this application, such as a server, a terminal, and the like. In other words, the object pose estimation method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to FIG. 1, it is a schematic flowchart of an object pose estimation method provided by an embodiment of this application. In this embodiment, the object pose estimation method includes:
S1. Acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map.
In this embodiment of this application, the camera device may be a 3D camera, and the target object may be a target object to be grasped by a manipulator. The scene depth image (depth image), also called a range image, refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value. The scene depth map can be converted into point cloud data after coordinate transformation.
In one embodiment of this application, the scene depth map may be stored in a blockchain node.
In detail, this embodiment of this application may calculate the three-dimensional point cloud of the scene depth map from the pixel points in the scene depth map by the following formula:
[Formula image: PCTCN2021083083-appb-000001]
where x, y, z are the coordinates of a point in the three-dimensional point cloud; u, v are the row and column where the pixel point is located in the scene depth map; c x and c y are the two-dimensional coordinates of the pixel point in the scene depth map; and f x , f y , d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
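The formula itself is only reproduced as an image in the original text, but the variables defined around it describe a standard pinhole back-projection from a depth pixel to a 3D point. The Python sketch below is a minimal, hedged illustration of that conversion, assuming d is the depth value stored at pixel (u, v), that (c x, c y, f x, f y) are the usual camera intrinsics, and that u indexes columns and v indexes rows; the exact formula in the patent drawing may differ.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W) into an N x 3 point cloud.

    Conventional pinhole model: for a pixel at column u, row v with depth d,
        x = (u - cx) * d / fx,  y = (v - cy) * d / fy,  z = d.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth reading

# Example usage with a synthetic 4x4 depth map and made-up intrinsics.
if __name__ == "__main__":
    depth = np.full((4, 4), 0.8)
    cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=2.0, cy=2.0)
    print(cloud.shape)  # (16, 3)
```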
S2. Extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set.
As can be seen from the above description, the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene containing the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
In this embodiment of this application, the pre-built deep learning network is a convolutional neural network including a convolution layer, a pooling layer, and a fully connected layer. The convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud; the pooling layer compresses the data obtained by feature extraction, simplifies the computational complexity, and extracts the main feature data; and the fully connected layer concatenates all the data obtained by feature extraction to obtain a feature point set. Further, in this embodiment of this application, the deep learning network also includes a classifier. In detail, the classifier learns classification rules from known training data for the given categories, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
In detail, extracting the target points in the three-dimensional point cloud by using the deep learning network to obtain the target object point set includes (a structural sketch follows this list):
extracting a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
classifying the feature point set into a target point set and a non-target point set by using the classifier in the deep learning network, and extracting the target point set to obtain the target object point set.
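As a rough illustration of the network structure described above (convolution, pooling, and fully connected layers followed by a point-wise classifier), the PyTorch sketch below shows one possible reading. The layer widths, the 1D-convolution formulation, the `TargetPointExtractor` name, and the global max-pooling step are assumptions made for illustration and are not specified by the patent.

```python
import torch
import torch.nn as nn

class TargetPointExtractor(nn.Module):
    """Minimal point-feature network: per-point convolutions, a global
    max-pool, a fully connected fusion layer, and a binary classifier that
    labels every point as target (1) or non-target (0)."""

    def __init__(self, feature_dim=64):
        super().__init__()
        self.point_conv = nn.Sequential(          # "convolution layer"
            nn.Conv1d(3, feature_dim, 1), nn.ReLU(),
            nn.Conv1d(feature_dim, feature_dim, 1), nn.ReLU(),
        )
        self.fuse = nn.Linear(2 * feature_dim, feature_dim)  # "fully connected layer"
        self.classifier = nn.Conv1d(feature_dim, 2, 1)       # "classifier"

    def forward(self, points):                       # points: (B, N, 3)
        x = self.point_conv(points.transpose(1, 2))  # (B, C, N)
        pooled = torch.max(x, dim=2, keepdim=True).values   # "pooling layer"
        fused = torch.cat([x, pooled.expand_as(x)], dim=1)  # local + global features
        fused = torch.relu(self.fuse(fused.transpose(1, 2))).transpose(1, 2)
        return self.classifier(fused)                # (B, 2, N) per-point logits

def extract_target_point_set(points, logits):
    """Keep the points whose predicted class is 'target' (class 1)."""
    labels = logits.argmax(dim=1)                    # (B, N)
    return [pts[mask.bool()] for pts, mask in zip(points, labels)]

# Usage sketch: one cloud of 1024 points.
net = TargetPointExtractor()
cloud = torch.rand(1, 1024, 3)
target_points = extract_target_point_set(cloud, net(cloud))
```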
S3. Calculate a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set.
It can be understood that visibility is the degree to which the target object can be seen by normal eyesight. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss value. Heavily occluded objects are not the objects that the robotic arm should prioritize for grasping, because they are most likely located at the bottom of the pile and do not provide enough information for pose estimation. In order to reduce the interference caused by these objects, this embodiment of this application needs to calculate the visibility loss value of the objects.
One embodiment of this application may calculate the visibility loss value of the target object by the following method:
calculating the actual visibility of the target object according to the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud;
obtaining the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
That is:
[Formula image: PCTCN2021083083-appb-000002]
[Formula image: PCTCN2021083083-appb-000003]
where N i represents the number of points in the target object point set of target object i, N max represents the number of points in the largest point set among the objects contained in the three-dimensional point cloud, and the symbol shown in formula image PCTCN2021083083-appb-000004 represents the predicted visibility of target object i, that is, the maximum visibility of target object i without any occlusion.
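The two formulas above are only given as images, but the surrounding text describes the actual visibility as the ratio N i / N max and the visibility loss as a weighted difference between actual and predicted visibility. The sketch below follows that reading; the scalar `weight` and the use of an absolute difference are assumptions, since the exact expression is not reproduced in the text.

```python
def visibility_loss(num_points_i, num_points_max, predicted_visibility, weight=1.0):
    """Visibility loss for one target object.

    actual_visibility = N_i / N_max  (visible points vs. largest object in the cloud)
    loss              = weight * |actual_visibility - predicted_visibility|
    """
    actual_visibility = num_points_i / num_points_max
    return weight * abs(actual_visibility - predicted_visibility)

# Example: an object with 300 visible points when the largest object in the
# cloud has 1200, and whose unoccluded (predicted) visibility would be 1.0.
print(visibility_loss(300, 1200, predicted_visibility=1.0))  # 0.75
```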
S4. Perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the key point set.
In detail, performing Hough voting on the target object point set to obtain the key point set includes (a voting sketch follows this list):
sampling the target object point set to obtain a target object sampling point set, and calculating the Euclidean distance offsets of the target object sampling points to obtain offsets;
voting according to the offsets, and taking the set of points whose number of votes exceeds a preset threshold as the key point set.
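The voting step can be pictured as each sampled point casting a vote for the location its offset points to, with locations that collect enough votes kept as key points. The sketch below is one hedged reading of that procedure, assuming the offsets have already been predicted for every sampled point and that votes are accumulated on a coarse voxel grid; the sampling strategy, voxel size, and vote threshold are illustrative only.

```python
import numpy as np
from collections import Counter

def hough_vote_keypoints(sample_points, predicted_offsets, voxel=0.01, min_votes=20):
    """Accumulate votes cast by sampled points and keep well-supported key points.

    Each sampled point votes for (point + predicted offset); votes are binned
    into voxels of side `voxel`, and voxel centres receiving at least
    `min_votes` votes are returned as the key point set.
    """
    votes = sample_points + predicted_offsets            # (N, 3) voted locations
    bins = Counter(map(tuple, np.floor(votes / voxel).astype(int)))
    keypoints = [
        (np.array(b) + 0.5) * voxel                      # voxel centre
        for b, count in bins.items() if count >= min_votes
    ]
    return np.array(keypoints)

# Example: 500 sampled points, all voting for one of two locations.
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3)) * 0.002
offsets = np.vstack([np.tile([0.1, 0.0, 0.0], (250, 1)),
                     np.tile([0.0, 0.1, 0.0], (250, 1))]) - pts
print(hough_vote_keypoints(pts, offsets).shape)  # (2, 3)
```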
Further, based on the property that there is one and only one center key point and that it is not affected by occlusion, this embodiment of this application divides the key point set into a common key point set and a center key point, and uses the following formulas and a point-wise feature regression algorithm to calculate the key point loss value L kps of the key point set:
[Formula image: PCTCN2021083083-appb-000005]
[Formula image: PCTCN2021083083-appb-000006]
L kps = γ 1 L kp + γ 2 L c
where L kp represents the common key point loss, N is the number of points in the target object point set, M is the number of common key points, the symbol in formula image PCTCN2021083083-appb-000007 represents the actual position offset of the target object point set, the symbol in formula image PCTCN2021083083-appb-000008 represents the predicted position offset of the target object point set, L c represents the center key point loss, Δx i is the actual offset from a common key point to the center key point, the symbol in formula image PCTCN2021083083-appb-000009 is the predicted offset from a common key point to the center key point, γ 1 is the weight of the common key point loss, and γ 2 is the weight of the center key point loss.
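The per-term formulas for L kp and L c are only given as images, but the variable list suggests they average the distance between actual and predicted offsets over the common key points and over the offsets to the single center key point, respectively, with the visible combination L kps = γ 1 L kp + γ 2 L c. The sketch below follows that reading; the use of an L1 distance and the averaging convention are assumptions.

```python
import torch

def keypoint_loss(pred_offsets, gt_offsets, pred_center_offsets, gt_center_offsets,
                  gamma1=1.0, gamma2=1.0):
    """Point-wise feature regression loss for key points.

    pred_offsets / gt_offsets:               (N, M, 3) offsets from the N object
                                             points to the M common key points.
    pred_center_offsets / gt_center_offsets: (M, 3) offsets from the common key
                                             points to the single center key point.
    Returns L_kps = gamma1 * L_kp + gamma2 * L_c.
    """
    l_kp = (pred_offsets - gt_offsets).abs().sum(dim=-1).mean()               # common key points
    l_c = (pred_center_offsets - gt_center_offsets).abs().sum(dim=-1).mean()  # center key point
    return gamma1 * l_kp + gamma2 * l_c

# Example with random tensors: N = 1024 points, M = 8 common key points.
pred = torch.rand(1024, 8, 3); gt = torch.rand(1024, 8, 3)
pred_c = torch.rand(8, 3); gt_c = torch.rand(8, 3)
print(keypoint_loss(pred, gt, pred_c, gt_c).item())
```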
S5. Perform semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value.
In detail, the semantic segmentation calculates the semantic loss L s of the target object from the pixel points of the scene depth map by the following formula:
L s = -α(1-q i ) γ log(q i )
where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q i represents the confidence that the i-th pixel point in the scene depth map belongs to the foreground or the background.
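The semantic loss above has the form of a focal loss applied to the per-pixel foreground/background confidence. The sketch below is a minimal implementation of exactly that expression; treating α and γ as fixed scalar hyperparameters (rather than quantities read from the camera device) and averaging over pixels are assumptions made here for illustration.

```python
import torch

def semantic_loss(q, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal-style semantic loss L_s = -alpha * (1 - q_i)**gamma * log(q_i).

    q: tensor of per-pixel confidences that each pixel belongs to its true
       class (foreground or background), values in (0, 1).
    Returns the mean loss over all pixels.
    """
    q = q.clamp(min=eps, max=1.0 - eps)          # numerical safety for log()
    return (-alpha * (1.0 - q) ** gamma * torch.log(q)).mean()

# Example: confident pixels contribute little, uncertain pixels dominate.
q = torch.tensor([0.95, 0.60, 0.10])
print(semantic_loss(q).item())
```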
S6. Calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
In detail, in this embodiment of this application, the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
This embodiment of this application calculates the final loss value L mt of the target object by the following multi-task joint model:
L mt = μ 1 L kps + μ 2 L s + μ 3 L v
where L kps represents the key point loss value, L s represents the semantic loss, L v represents the visibility loss value, and μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model.
The predicted rotation matrix and the predicted translation matrix of the target object are adjusted according to the final loss value to obtain the pose of the target object.
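Putting the three branch losses together, the multi-task joint model is a weighted sum whose value is then used to adjust the predicted rotation and translation. The sketch below shows that combination with illustrative weight values; in practice the branch losses would be tensors produced by the network, and the specific weights are learned during training rather than set by hand.

```python
import torch

def multitask_loss(l_kps, l_s, l_v, mu=(1.0, 1.0, 1.0)):
    """Multi-task joint model: L_mt = mu1 * L_kps + mu2 * L_s + mu3 * L_v."""
    return mu[0] * l_kps + mu[1] * l_s + mu[2] * l_v

# Example: combine the three branch losses; the resulting scalar is what gets
# back-propagated to adjust the predicted rotation and translation matrices.
l_kps = torch.tensor(0.40)   # key point loss value
l_s   = torch.tensor(0.20)   # semantic loss value
l_v   = torch.tensor(0.10)   # visibility loss value
print(multitask_loss(l_kps, l_s, l_v, mu=(1.0, 0.5, 0.5)).item())  # 0.55
```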
This embodiment of this application calculates the three-dimensional point cloud of the scene depth map by acquiring the scene depth map of the target object, extracts the target object point set from the three-dimensional point cloud by using a deep learning network, calculates the visibility loss value, the key point loss value, and the semantic loss value of the target object according to the three-dimensional point cloud and the target object point set, and finally obtains the pose of the target object according to the visibility loss value, the key point loss value, and the semantic loss value. The object pose estimation method proposed in this embodiment of this application estimates the pose of the target object from three kinds of loss (visibility, key points, and semantics), and can therefore improve the accuracy of object pose estimation.
As shown in FIG. 2, it is a schematic block diagram of the object pose estimation apparatus of this application.
The object pose estimation apparatus 100 described in this application may be installed in an electronic device. According to the implemented functions, the object pose estimation apparatus may include a three-dimensional point cloud acquisition module 101, a target object point set extraction module 102, a visibility loss value calculation module 103, a key point loss value calculation module 104, a semantic loss value calculation module 105, and a pose calculation module 106. The modules described in this application may also be referred to as units, and refer to a series of computer program segments that can be executed by the processor of the electronic device and can complete fixed functions, and that are stored in the memory of the electronic device.
In this embodiment, the functions of each module/unit are as follows:
The three-dimensional point cloud acquisition module 101 is configured to acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map.
In this embodiment of this application, the camera device may be a 3D camera, and the target object may be a target object to be grasped by a manipulator. The scene depth image (depth image), also called a range image, refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value. The scene depth map can be converted into point cloud data after coordinate transformation. In detail, this embodiment of this application may calculate the three-dimensional point cloud of the scene depth map from the pixel points in the scene depth map by the following formula:
[Formula image: PCTCN2021083083-appb-000010]
where x, y, z are the coordinates of a point in the three-dimensional point cloud; u, v are the row and column where the pixel point is located in the scene depth map; c x and c y are the two-dimensional coordinates of the pixel point in the scene depth map; and f x , f y , d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
The target object point set extraction module 102 extracts target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set.
As can be seen from the above description, the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene containing the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
In this embodiment of this application, the pre-built deep learning network is a convolutional neural network, including a convolution layer, a pooling layer, and a fully connected layer. The convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud; the pooling layer compresses the data obtained by feature extraction, simplifies the computational complexity, and extracts the main feature data; and the fully connected layer concatenates all the data obtained by feature extraction to obtain a feature point set. Further, in this embodiment of this application, the deep learning network also includes a classifier. In detail, the classifier learns classification rules from known training data for the given categories, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
In detail, in this embodiment of this application, the target object point set extraction module 102 is specifically configured to:
extract a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
classify the feature point set into a target point set and a non-target object point set by using the classifier in the deep learning network, and extract the target object point set therefrom.
The visibility loss value calculation module 103 is configured to calculate a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set.
It can be understood that visibility is the degree to which the target object can be seen by normal eyesight. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss value. Heavily occluded objects are not the objects that the robotic arm should prioritize for grasping, because they are most likely located at the bottom of the pile and do not provide enough information for pose estimation. In order to reduce the interference caused by these objects, this embodiment of this application needs to calculate the visibility loss value of the objects.
In one embodiment of this application, the visibility loss value calculation module 103 is specifically configured to:
calculate the actual visibility of the target object according to the ratio of the number of points in the target object point set of the target object to the number of points in the largest point set among all target objects contained in the three-dimensional point cloud;
obtain the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
That is:
[Formula image: PCTCN2021083083-appb-000011]
[Formula image: PCTCN2021083083-appb-000012]
where N i represents the number of points in the target object point set of target object i, N max represents the number of points in the largest point set among the objects contained in the three-dimensional point cloud, and the symbol shown in formula image PCTCN2021083083-appb-000013 represents the predicted visibility of target object i, that is, the maximum visibility of target object i without any occlusion.
The key point loss value calculation module 104 is configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set.
In detail, performing Hough voting on the target object point set to obtain the key point set includes:
sampling the target object point set to obtain a target object sampling point set, and calculating the Euclidean distance offsets of the target object sampling points to obtain offsets;
voting according to the offsets, and taking the set of points whose number of votes exceeds a preset threshold as the key point set.
Further, based on the property that there is one and only one center key point and that it is not affected by occlusion, this embodiment of this application divides the key point set into a common key point set and a center key point, and uses the following formulas and a point-wise feature regression algorithm to calculate the key point loss value L kps of the key point set:
[Formula image: PCTCN2021083083-appb-000014]
[Formula image: PCTCN2021083083-appb-000015]
L kps = γ 1 L kp + γ 2 L c
where L kp represents the common key point loss, N is the number of points in the target object point set, M is the number of common key points, the symbol in formula image PCTCN2021083083-appb-000016 represents the actual position offset of the target object point set, the symbol in formula image PCTCN2021083083-appb-000017 represents the predicted position offset of the target object point set, L c represents the center key point loss, Δx i is the actual offset from a common key point to the center key point, the symbol in formula image PCTCN2021083083-appb-000018 is the predicted offset from a common key point to the center key point, γ 1 is the weight of the common key point loss, and γ 2 is the weight of the center key point loss.
The semantic loss value calculation module 105 is configured to perform semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object.
In detail, the semantic segmentation calculates the semantic loss L s of the target object from the pixel points of the scene depth map by the following formula:
L s = -α(1-q i ) γ log(q i )
where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q i represents the confidence that the i-th pixel point in the scene depth map belongs to the foreground or the background.
The pose calculation module 106 is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
In detail, in this embodiment of this application, the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
In detail, the pose calculation module 106 calculates the final loss value L mt of the target object by the following multi-task joint model:
L mt = μ 1 L kps + μ 2 L s + μ 3 L v
where L kps represents the key point loss value, L s represents the semantic loss, L v represents the visibility loss value, and μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model;
This embodiment of this application further adjusts the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
Further, the pose calculation module 106 sends the pose of the target object to a pre-built robotic arm, and uses the robotic arm to perform the target object grasping task.
As shown in FIG. 3, it is a schematic structural diagram of an electronic device implementing the object pose estimation method of this application.
The electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an object pose estimation program 12.
The memory 11 includes at least one type of readable storage medium, and the readable storage medium may be volatile or non-volatile. Specifically, the readable storage medium includes a flash memory, a mobile hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example, a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, for example, a plug-in mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the electronic device 1. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the object pose estimation program 12, but also to temporarily store data that has been output or will be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control core (Control Unit) of the electronic device, and uses various interfaces and lines to connect the various components of the entire electronic device; by running or executing the programs or modules stored in the memory 11 (for example, executing the object pose estimation program) and calling the data stored in the memory 11, it performs various functions of the electronic device 1 and processes data.
The bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and the like.
FIG. 3 only shows an electronic device with some components. Those skilled in the art will understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which will not be repeated here.
Further, the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may also include a user interface, which may be a display (Display) or an input unit (such as a keyboard (Keyboard)); optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be appropriately called a display screen or a display unit, and is used to display the information processed in the electronic device 1 and to display a visualized user interface.
It should be understood that the embodiments are for illustration only and do not limit the scope of the patent application to this structure.
The object pose estimation program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs, and when run in the processor 10, it can realize:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. Specifically, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), and the like.
Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like; the storage data area may store data created according to the use of blockchain nodes, and the like.
In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the modules is only a division by logical function, and there may be other division manners in actual implementation.
The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units may be implemented either in the form of hardware or in the form of hardware plus software functional modules.
For those skilled in the art, it is obvious that this application is not limited to the details of the above exemplary embodiments, and this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application.
Therefore, the embodiments should be regarded as exemplary and non-limiting from whatever point of view, and the scope of this application is defined by the appended claims rather than by the above description; it is therefore intended that all changes falling within the meaning and scope of the equivalent elements of the claims be included in this application. Any reference signs in the claims shall not be regarded as limiting the claims involved.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. Blockchain is essentially a decentralized database and is a chain of data blocks generated in association with each other using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, the application service layer, and so on.
In addition, it is obvious that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses stated in the system claims may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not limiting. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application can be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

  1. An object pose estimation method, wherein the method comprises:
    acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
    extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
    performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
    calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  2. The object pose estimation method according to claim 1, wherein calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set comprises:
    calculating the actual visibility of the target object according to the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud;
    obtaining the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  3. The object pose estimation method according to claim 1, wherein extracting the target points in the three-dimensional point cloud by using the pre-built deep learning network to obtain the target object point set comprises:
    extracting a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
    classifying the feature point set into a target point set and a non-target point set by using the classifier in the deep learning network, and extracting the target point set therefrom to obtain the target object point set.
  4. The object pose estimation method according to claim 1, wherein performing Hough voting on the target object point set to obtain the key point set comprises:
    sampling the target object point set to obtain a sampling point set, and calculating the Euclidean distance offsets between the sampling points to obtain offsets;
    voting according to the offsets, and taking the set of points whose number of votes exceeds a preset threshold as the key point set.
  5. The object pose estimation method according to claim 1, wherein performing semantic segmentation on the pixel points of the scene depth map to obtain the semantic loss value of the target object comprises:
    calculating the semantic loss L s of the target object by the following formula:
    L s = -α(1-q i ) γ log(q i )
    where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q i represents the confidence that the i-th pixel point in the scene depth map belongs to the foreground or the background.
  6. The object pose estimation method according to any one of claims 1 to 5, wherein calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the pre-trained multi-task joint model comprises:
    calculating the final loss value L mt of the target object by the following multi-task joint model:
    L mt = μ 1 L kps + μ 2 L s + μ 3 L v
    where L kps represents the key point loss value, L s represents the semantic loss, L v represents the visibility loss value, and μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model;
    adjusting the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
  7. The object pose estimation method according to any one of claims 1 to 5, wherein, after performing the multi-task joint training on the target points to obtain the pose of the target object, the method further comprises:
    sending the pose of the target object to a pre-built robotic arm, and using the robotic arm to perform the grasping task of the target object.
  8. An object pose estimation apparatus, wherein the apparatus comprises:
    a three-dimensional point cloud acquisition module, configured to acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
    a target object point set extraction module, configured to extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    a visibility loss value calculation module, configured to calculate a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    a key point loss value calculation module, configured to perform Hough voting on the target object point set to obtain a key point set, and calculate a key point loss value of the target object according to the key point set;
    a semantic loss value calculation module, configured to perform semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
    a pose calculation module, configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  9. An electronic device, wherein the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores computer program instructions executable by the at least one processor, and the computer program instructions are executed by the at least one processor to enable the at least one processor to perform the object pose estimation method described below:
    acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
    extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
    performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
    calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  10. The electronic device according to claim 9, wherein calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set comprises:
    calculating the actual visibility of the target object according to the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud;
    obtaining the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  11. The electronic device according to claim 9, wherein extracting the target points in the three-dimensional point cloud by using the pre-built deep learning network to obtain the target object point set comprises:
    extracting a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
    classifying the feature point set into a target point set and a non-target point set by using the classifier in the deep learning network, and extracting the target point set therefrom to obtain the target object point set.
  12. The electronic device according to claim 9, wherein performing Hough voting on the target object point set to obtain the key point set comprises:
    sampling the target object point set to obtain a sampling point set, and calculating the Euclidean distance offsets between the sampling points to obtain offsets;
    voting according to the offsets, and taking the set of points whose number of votes exceeds a preset threshold as the key point set.
  13. The electronic device according to claim 9, wherein performing semantic segmentation on the pixel points of the scene depth map to obtain the semantic loss value of the target object comprises:
    calculating the semantic loss L s of the target object by the following formula:
    L s = -α(1-q i ) γ log(q i )
    where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q i represents the confidence that the i-th pixel point in the scene depth map belongs to the foreground or the background.
  14. The electronic device according to any one of claims 9 to 13, wherein calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the pre-trained multi-task joint model comprises:
    calculating the final loss value L mt of the target object by the following multi-task joint model:
    L mt = μ 1 L kps + μ 2 L s + μ 3 L v
    where L kps represents the key point loss value, L s represents the semantic loss, L v represents the visibility loss value, and μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model;
    adjusting the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
  15. A computer-readable storage medium, including a storage data area and a storage program area, the storage data area storing created data and the storage program area storing a computer program; wherein, when the computer program is executed by a processor, the object pose estimation method described below is implemented:
    acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
    extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
    performing semantic segmentation on the pixel points of the scene depth map to obtain a semantic loss value of the target object;
    calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a pre-trained multi-task joint model.
  16. The computer-readable storage medium according to claim 15, wherein calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set comprises:
    calculating the actual visibility of the target object according to the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud;
    obtaining the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  17. The computer-readable storage medium according to claim 15, wherein extracting the target points in the three-dimensional point cloud by using the pre-built deep learning network to obtain the target object point set comprises:
    extracting a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
    classifying the feature point set into a target point set and a non-target point set by using the classifier in the deep learning network, and extracting the target point set therefrom to obtain the target object point set.
  18. The computer-readable storage medium according to claim 15, wherein performing Hough voting on the target object point set to obtain the key point set comprises:
    sampling the target object point set to obtain a sampling point set, and calculating the Euclidean distance offsets between the sampling points to obtain offsets;
    voting according to the offsets, and taking the set of points whose number of votes exceeds a preset threshold as the key point set.
  19. The computer-readable storage medium according to claim 15, wherein performing semantic segmentation on the pixel points of the scene depth map to obtain the semantic loss value of the target object comprises:
    calculating the semantic loss L s of the target object by the following formula:
    L s = -α(1-q i ) γ log(q i )
    where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q i represents the confidence that the i-th pixel point in the scene depth map belongs to the foreground or the background.
  20. The computer-readable storage medium according to any one of claims 15 to 19, wherein calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the pre-trained multi-task joint model comprises:
    calculating the final loss value L mt of the target object by the following multi-task joint model:
    L mt = μ 1 L kps + μ 2 L s + μ 3 L v
    where L kps represents the key point loss value, L s represents the semantic loss, L v represents the visibility loss value, and μ 1 , μ 2 , μ 3 represent the weight values obtained after training the multi-task joint model;
    adjusting the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
PCT/CN2021/083083 2020-12-01 2021-03-25 Object pose estimation method, apparatus, electronic device, and computer storage medium WO2022116423A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011385260.7 2020-12-01
CN202011385260.7A CN112446919B (zh) 2020-12-01 2020-12-01 Object pose estimation method, apparatus, electronic device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022116423A1 (zh)

Family

ID=74740242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083083 WO2022116423A1 (zh) 2020-12-01 2021-03-25 Object pose estimation method, apparatus, electronic device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN112446919B (zh)
WO (1) WO2022116423A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446919B (zh) * 2020-12-01 2024-05-28 平安科技(深圳)有限公司 Object pose estimation method and apparatus, electronic device, and computer storage medium
CN113012291B (zh) * 2021-04-01 2022-11-25 清华大学 Method and apparatus for reconstructing a three-dimensional object model based on manipulator parameters
CN113095205B (zh) * 2021-04-07 2022-07-12 北京航空航天大学 Point cloud target detection method based on improved Hough voting
CN113469947B (zh) * 2021-06-08 2022-08-05 智洋创新科技股份有限公司 Method suitable for various terrains for measuring the clearance distance between hidden hazards and power transmission conductors
CN114399421A (zh) * 2021-11-19 2022-04-26 腾讯科技(成都)有限公司 Storage method, apparatus, device, and storage medium for three-dimensional model visibility data
CN115482279A (zh) * 2022-09-01 2022-12-16 北京有竹居网络技术有限公司 Object pose estimation method, apparatus, medium, and device


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3065100B1 (fr) * 2017-04-06 2019-04-12 B<>Com Pose estimation method, and associated device, system, and computer program
JP2018189510A (ja) * 2017-05-08 2018-11-29 株式会社マイクロ・テクニカ Method and apparatus for estimating the position and orientation of a three-dimensional object
CN108961339B (zh) * 2018-07-20 2020-10-20 深圳辰视智能科技有限公司 Deep learning-based point cloud object pose estimation method, apparatus, and device
CN111489394B (zh) * 2020-03-16 2023-04-21 华南理工大学 Object pose estimation model training method, system, apparatus, and medium
CN111968129B (zh) * 2020-07-15 2023-11-07 上海交通大学 Semantic-aware simultaneous localization and mapping system and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330375A1 (en) * 2015-02-04 2017-11-16 Huawei Technologies Co., Ltd. Data Processing Method and Apparatus
CN107066935A (zh) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Deep learning-based hand pose estimation method and apparatus
CN108665537A (zh) * 2018-05-15 2018-10-16 清华大学 Three-dimensional reconstruction method and system jointly optimizing human body posture and appearance models
CN111160280A (zh) * 2019-12-31 2020-05-15 芜湖哈特机器人产业技术研究院有限公司 RGBD camera-based target object recognition and positioning method and mobile robot
CN112446919A (zh) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Object pose estimation method, apparatus, electronic device, and computer storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147488A (zh) * 2022-07-06 2022-10-04 湖南大学 Workpiece pose estimation method and grasping system based on dense prediction
CN115546216A (zh) * 2022-12-02 2022-12-30 深圳海星智驾科技有限公司 Pallet detection method, apparatus, device, and storage medium
CN115546216B (zh) * 2022-12-02 2023-03-31 深圳海星智驾科技有限公司 Pallet detection method, apparatus, device, and storage medium
CN115797565A (zh) * 2022-12-20 2023-03-14 北京百度网讯科技有限公司 Three-dimensional reconstruction model training method, three-dimensional reconstruction method, apparatus, and electronic device
CN115797565B (zh) * 2022-12-20 2023-10-27 北京百度网讯科技有限公司 Three-dimensional reconstruction model training method, three-dimensional reconstruction method, apparatus, and electronic device
CN116630394A (zh) * 2023-07-25 2023-08-22 山东中科先进技术有限公司 Multi-modal target object pose estimation method and system constrained by three-dimensional modeling
CN116630394B (zh) * 2023-07-25 2023-10-20 山东中科先进技术有限公司 Multi-modal target object pose estimation method and system constrained by three-dimensional modeling
CN117226854A (zh) * 2023-11-13 2023-12-15 之江实验室 Method, apparatus, storage medium, and electronic device for executing a gripping task
CN117226854B (zh) * 2023-11-13 2024-02-02 之江实验室 Method, apparatus, storage medium, and electronic device for executing a gripping task
CN117788730A (zh) * 2023-12-08 2024-03-29 中交机电工程局有限公司 Semantic point cloud map construction method

Also Published As

Publication number Publication date
CN112446919B (zh) 2024-05-28
CN112446919A (zh) 2021-03-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899467

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899467

Country of ref document: EP

Kind code of ref document: A1