WO2018075053A1 - Object pose based on matching 2.5D depth information and 3D information - Google Patents

Object pose based on matching 2.5D depth information and 3D information

Info

Publication number
WO2018075053A1
WO2018075053A1 PCT/US2016/058014 US2016058014W WO2018075053A1 WO 2018075053 A1 WO2018075053 A1 WO 2018075053A1 US 2016058014 W US2016058014 W US 2016058014W WO 2018075053 A1 WO2018075053 A1 WO 2018075053A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
depth
information
pose
matching
Prior art date
Application number
PCT/US2016/058014
Other languages
English (en)
Inventor
Terrence Chen
Jan Ernst
Stefan Kluckner
Kai Ma
Vivek Kumar Singh
Ziyan Wu
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft
Priority to PCT/US2016/058014
Publication of WO2018075053A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Definitions

  • the present embodiments relate to finding a pose of an object relative to a camera.
  • One approach compares the camera image to images in a database.
  • the images in the database are of the object from different poses.
  • the comparison finds the closest image of the database to the camera image, providing the pose.
  • This approach suffers from having to create a large database of the images, which is very cost intensive.
  • Depth information may be available for imaging.
  • RGBD data is used for gaming to recognize position of a person.
  • Cameras capable of collecting depth information may become more prevalent.
  • metadata from corresponding 3D representations of the imaged objects may be added.
  • the coordinate systems of the camera and the 3D information are aligned, but the comparison of RGBD data to 3D data may not be efficient for determining pose.
  • systems, methods and computer readable media are provided for determining pose and/or matching depth information with 3D information.
  • Orthographic projections of the 3D information are created from different viewpoints.
  • the orthographic projections may be represented as 2.5D data structures, allowing matching with 2.5D data from a depth sensor of a camera.
  • Using orthographic projections may be invariant to camera parameters and allow simple scaling for matching.
  • the matching provides the pose, allowing transfer of metadata from the 3D information to the camera images.
  • A system is provided for matching depth information to 3D information.
  • An RGBD sensor is provided for sensing 2.5D data representing an area of an object facing the depth sensor and depth from the depth sensor to the object for each location of the area.
  • a memory is configured to store first orthographic projections of the 3D information.
  • the 3D information represents the object in three dimensions.
  • the first orthographic projections are generated from different view directions relative to the object.
  • An image processor is configured to convert the 2.5D data into a second orthographic projection of the object, to match the second orthographic projection with at least one of the first orthographic projections, and to transfer an object label of the 3D information to a coordinate system of the depth sensor based on the match.
  • a display is configured to display an image from the 2.5D data augmented with the object label from the 3D information.
  • A method is provided for matching depth information to 3D data.
  • a first representation of an orthographic projection from depth data of a depth sensor on a mobile device is matched with a second representation of an orthographic projection from the 3D data.
  • a pose of an object relative to the depth sensor is determined based on the first representation being matched to the second representation.
  • An image on a display of the mobile device is augmented with label data from the 3D data, the augmenting spatially positioning the label data using the pose.
  • A method is provided for determining a camera viewpoint. Measurements of distance from a camera to an object are acquired. The measurements are orthographically projected. The orthographic projection of the measurements is compared with orthographic projections from different views of the object.
  • the camera viewpoint of the camera relative to the object is determined based on the comparing.
  • Figure 1 is a block diagram of one embodiment of a system for matching depth information to 3D information.
  • Figure 2 illustrates one embodiment of a method for determining pose of an object relative to a camera.
  • Figure 3 is a flow chart diagram of one embodiment of a method for matching depth information to 3D information and/or determining pose.
  • 2.5D depth images are matched to computer assisted design (CAD), medical volume imaging, geographic information system (GIS), building design, or other 3D information.
  • Individually acquired depth images (e.g., acquired by depth sensing devices) are matched with 3D data, such as CAD, GIS, engineering, or segmented medical data, with interactive response times.
  • the match estimates correspondence between depth data and 3D data.
  • the pose of the depth data with respect to the 3D data is recovered based on the match.
  • Pose information supports fusion of the involved modalities within a common metric coordinate system. Fusion may enable the linkage of data between involved modalities since the information is spatially related (e.g., 2D/3D correspondences on image, object and pixel level are known).
  • orthographic projections of the 3D data from potential viewpoints are created and/or stored. These orthographic projections may be represented as 2.5D data structures or depth structures. The orthographic projections may be compared with the 2.5D data or depth structure of the 2.5D data for matching. Using these orthographic projections provides invariance to camera parameters and provides scale handling enabled by down or upscaling.
  • representations for CAD, GIS, engineering data, building design, or medical data are metric formats.
  • Depth (e.g., RGBD) sensing devices provide metric 2.5D measurements for matching.
  • an indexing system for the orthographic projections may be used. Potentially relevant projections to the current scene observation are found based on the indexing. The indexing uses a histogram of depths. Rather than comparing spatial distribution of depths, histograms are compared, at least initially. Potential projections are used within a refined search strategy based on filtering with an orthographic projection. Knowledge of scale and the computed response map determines the final pose.
  • Mobile sensing devices for 2.5D or 3D may be standard.
  • the estimated spatial correspondence enables the determination of the camera viewpoint for such mobile devices. Having available the viewpoint (i.e., the camera pose with respect to the synthetic data or object) provides a basis for linking image data with synthetic or scan data (i.e., 3D data).
  • the 3D data may include metadata so that the spatial linking allows augmenting the mobile device image with the metadata. For example, spare part or organ specific information may be presented on a camera image at the appropriate location.
  • the spatial linking may provide an automated solution to initialization of registration of the mobile data with the synthetic data for tracking in augmented reality.
  • Figure 1 shows one embodiment of a system for matching depth information to 3D information, pose determination, and/or aligning coordinate systems.
  • the mobile device 10 captures 2.5D data of the object 14.
  • the depth information of the 2.5D data is compared to orthographic projections from different viewpoints.
  • the orthographic projections are generated from 3D data representing the object. The comparison may be simplified by using histograms of the depths from the orthographic projections.
  • the pose is determined from the pose or poses of the best or sufficiently matching orthographic projections.
  • the system includes a mobile device 10 with a camera and depth sensor 12 for viewing an object 14, an image processor 16, a memory 18, and a display 20.
  • the image processor 16, memory 18, and/or display 20 may be remote from the mobile device 10.
  • the mobile device 10 connects to a server with one or more communications networks.
  • the mobile device 10 connects wirelessly but directly to a computer.
  • the image processor 16, memory 18, and/or display 20 are part of the mobile device 10 such that the matching and augmentation are performed locally or by the mobile device 10.
  • a database separate from the memory 18 is provided for storing orthographic projections or histograms of orthographic projections of the object from different viewpoints.
  • other mobile devices 10 and/or cameras and depth sensors 12 are provided for communicating depth data to the image processor 16 as a query for pose.
  • a user input device such as a touch screen, keyboard, buttons, sliders, touch pad, mouse, and/or trackball, is provided for interacting with the mobile device 10 and/or the image processor 16.
  • the object 14 is a physical object.
  • the object 14 may be a single part or may be a collection of multiple parts, such as an assembly (e.g., machine, train bogie, manufacturing or assembly line, consumer product (e.g., keyboard or fan), buildings, or any other assembly of parts).
  • the parts may be separable or fixed to each other.
  • the parts are of a same or different material, size, and/or shape.
  • the parts are connected together into the assembled configuration or separated to be assembled.
  • One or more parts of an overall assembly may or may not be missing.
  • One or more parts may themselves be assemblies (e.g., a sub-assembly).
  • the object 14 is a patient or animal.
  • the object 14 may be part of a patient or animal, such as the head, torso, or one or more limbs. Any physical object may be used.
  • the object 14 is represented by 3D data.
  • a building, an assembly, or a manufactured object 14 is represented by computer assisted design (CAD) data and/or engineering data.
  • the 3D data is defined by segments parameterized by size, shape, and/or length. Other 3D data parameterization may be used, such as a mesh or interconnected triangles.
  • a patient or inanimate object is scanned with a medical scanner, such as a computed tomography, magnetic resonance, ultrasound, positron emission tomography, or single photon emission computed tomography system. The scan provides a 3D representation or volume of the patient or inanimate object.
  • the 3D data from the scan is voxels.
  • the 3D data may include one or more labels.
  • the labels are information other than the geometry of the physical object.
  • the labels may be part information (e.g., part number, available options, manufacturer, recall notice, performance information, use instructions, assembly/disassembly instructions, cost, and/or availability).
  • the labels may be other information, such as shipping date for an assembly.
  • the label may merely identify the object 14 or part of the object 14.
  • the labels may be organ identification, lesion identification, derived parameters for part of the patient (e.g., volume of a heart chamber, elasticity of tissue, size of a lesion, scan parameters, or operational information).
  • a physician or automated process may add labels to a pre-operative scan and/or labels are incorporated from a reference source.
  • the label is a fit model or geometry.
  • the mobile device 10 is a cellular phone, tablet computer, the camera and depth sensor 12, navigation computer, virtual reality headgear, glasses, or another device that may be carried or worn by a user.
  • the mobile device 10 operates on batteries rather than relying on a cord for power. In alternative embodiments, a cord is provided for power.
  • the mobile device 10 may be sized to be held in one hand, but may be operated with two hands. For a worn device, the mobile device 10 is sized to avoid interfering with movement by the wearer.
  • the camera and depth sensor 12 is a red, green, blue, depth (RGBD) sensor or other sensor for capturing an image and distance to locations in the image.
  • the camera and depth sensor 12 is a pair of cameras that use parallax to determine depth for each pixel captured in an image.
  • lidar, structured light, or time-of-flight sensors are used to determine the depth.
  • the camera portion may be a charge coupled device (CCD), digital camera, or other device for capturing light over an area of the object 14.
  • the camera and depth sensor 12 is a perspective camera or an orthographic 3D camera.
  • the camera portion captures an image of an area of the object 14 from a viewpoint of the camera and depth sensor 12 relative to the object 14.
  • the depth sensor portion determines a distance of each location in the area to the camera and depth sensor 12. For example, a depth from the camera and depth sensor 12 to each pixel or groups of pixels is captured. Due to the shape and/or position of the object 14, different pixels or locations in the area may have different distances to a center or corresponding cell of the camera and depth sensor 12.
  • the 2.5D data represents a surface of the object 14 viewable from the RGBD sensor and depths to parts of the object 14.
  • the surface or area portion (e.g., RGB) is a photograph.
  • the camera and depth sensor 12 connects (e.g., wirelessly, over a cable, via a trace) with an interface of the image processor 16.
  • Wi-Fi, Bluetooth, or other wireless connection protocol may be used.
  • a wired connection is used, such as being connected through a back plane or printed circuit board routing.
  • a user captures 2.5D data of the object 14 from a given orientation relative to the object 14 with the camera and depth sensor 12.
  • the 2.5D data includes a photograph (2D) of the area and depth measurements (2.5D) from the camera and depth sensor 12 to the locations represented in the area.
  • the distance from the camera and depth sensor 12 to obscured portions (e.g., back side) of the object 14 are not captured or measured, so the 2.5D data is different than a 3D representation of the object 14.
  • 2.5D images may be seen as a projection of 3D data onto a defined image plane. Each pixel in 2.5D images corresponds to a depth measurement and light intensity.
  • the mapped and visible surface may be recovered from the 2.5D data.
  • RGB information is typically available and provides visual scene observation.
  • the 2.5D may be captured at any orientation or in one of multiple defined or instructed orientations.
  • the 2.5D data is communicated to the image processor 16.
  • Upon arrival of the 2.5D data (e.g., photograph and depth measurements) or a stream of 2.5D data (video and depth), the image processor 16 returns one or more labels, geometry, or other information from the 3D data to be added to a display of an image or images (e.g., photograph or video) from the 2.5D data.
  • the arrival of 2.5D data and return of labels occurs in real-time (e.g., within 1 second or within 3 seconds), but may take longer.
  • a photograph or video is taken.
  • Label information for one or more parts in the photograph is returned, such as providing smart data, and displayed with the photograph. Due to the short response time to provide the label, the operator may be able to use the smart data to assist in maintenance, ordering, diagnosis, or other process. For more complex objects 14, the user may be able to select a region of interest for more detailed identification or information.
  • the image processor 16 interacts with the operator to provide annotations for the photograph from the 3D data.
  • the depth cues are used rather than relying on the more data intensive processing of texture or the photograph portion.
  • the depth cue is used as a supporting modality to estimate correspondence between the current view of the mobile device 10 and the 3D data.
  • the camera and depth sensor 12 provide the 2.5D data, and the memory 18 provides the 3D data and/or information derived from the 3D data.
  • the memory 18 is a database, a graphics processing memory, a video random access memory, a random access memory, system memory, cache memory, hard drive, optical media, magnetic media, flash drive, buffer, combinations thereof, or other now known or later developed memory device for storing data or video information.
  • the memory 18 is part of the mobile device 10, part of a computer associated with the image processor 16, part of a database, part of another system, a picture archival memory, and/or a standalone device.
  • the memory 18 is configured by a processor to store, such as being formatted to store.
  • the 3D information includes surfaces of the object 14 not in view of the camera and depth sensor 12 when sensing the 2.5D data.
  • 3D CAD is typically represented in 3D space by using XYZ coordinates (vertices).
  • CAD data is clean and complete (watertight) and does not include noise.
  • CAD data is generally planned and represented in metric scale.
  • Engineering or GIS data may also include little or no noise.
  • the 3D data may be voxels representing intensity of response from each location. The voxel intensities may include noise.
  • segmentation is performed so that a mesh or other 3D surface of an organ or part is provided.
  • the memory 18 stores the 3D data and/or information derived from the 3D data.
  • the memory 18 stores orthographic projections of the 3D information.
  • An orthographic projection is a projection of the 3D data as if viewed from a given direction.
  • the orthographic projection provides a distance from the viewable part of the object to a parallel viewing plane.
  • the camera center of the orthographic projection may point to a point of interest (e.g. the center of gravity of the observed object).
  • a plurality of orthographic projections from different view directions relative to the object 14 are generated and stored.
  • a 2.5D image database is created based on the 3D data.
  • the database may be enriched with "real world" data acquisitions, such as measurements, models or images used to remove or reduce noise in the 3D data.
  • the 3D CAD model or 3D data is used to render or generate synthetic orthographic views from any potential viewpoint a user or operator may look at the object 14 in a real scene. Where the object 14 is not viewable from certain directions, then those viewpoints may not be used.
  • the strategy for creating synthetic views may be random or may be based on planned sampling the 3D space (e.g., on sphere depending on the potential acquisition scenario during the matching procedure). Any number of viewpoints and corresponding distribution of viewpoints about the object 14 may be used.
  • the orthographic projections from the 3D data represent the object from the different view directions.
  • the orthographic projections stored in the memory 18 are normalized to a given pixel size. In order to be invariant to camera parameters, the synthetic database is created based on orthographic projections where each resulting pixel in the 2.5D orthographic projections (i.e., depth information) provides a same size (e.g., 1 pixel corresponds to a metric area, such as 1 pixel maps to 1x1 inch in real space).
  • the projections are not normalized, but instead a pixel area is calculated and stored with each projection.
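  • As an illustration of this database-creation step, the following sketch renders orthographic depth maps of a 3D point cloud from sampled view directions, normalized so that each pixel covers a fixed metric area. It assumes NumPy; the function name, pixel size, and resolution are illustrative choices, not values from this publication.

```python
import numpy as np

def orthographic_depth_map(points, view_dir, pixel_size=0.01, resolution=256):
    """Orthographically project 3D points (N x 3, metric units) along view_dir.

    Each output pixel covers pixel_size x pixel_size in metric units, so the
    projection is independent of any camera focal length (illustrative sketch).
    """
    points = np.asarray(points, dtype=float)
    view_dir = np.asarray(view_dir, dtype=float)
    view_dir = view_dir / np.linalg.norm(view_dir)

    # Build an orthonormal viewing frame whose third axis is the view direction.
    up = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(up, view_dir)) > 0.9:               # avoid a degenerate 'up'
        up = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(up, view_dir)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(view_dir, x_axis)
    R = np.stack([x_axis, y_axis, view_dir])          # world -> view rotation

    view = (points - points.mean(axis=0)) @ R.T       # center on the object
    u = np.round(view[:, 0] / pixel_size).astype(int) + resolution // 2
    v = np.round(view[:, 1] / pixel_size).astype(int) + resolution // 2
    depth = view[:, 2] - view[:, 2].min()             # depth along the view axis

    depth_map = np.full((resolution, resolution), np.inf)
    inside = (u >= 0) & (u < resolution) & (v >= 0) & (v < resolution)
    for ui, vi, d in zip(u[inside], v[inside], depth[inside]):
        depth_map[vi, ui] = min(depth_map[vi, ui], d)  # keep the nearest surface
    return depth_map

# Sample view directions around the object and store one projection per view.
rng = np.random.default_rng(0)
cloud = rng.uniform(-0.5, 0.5, size=(5000, 3))        # stand-in for CAD/scan points
views = rng.normal(size=(50, 3))
database = [orthographic_depth_map(cloud, v) for v in views]
```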
  • Instead of or in addition to the orthographic projections themselves, other representations of the orthographic projections are stored.
  • the orthographic projections are indexed, such as with histograms of depth.
  • the orthographic projections are used to create a representative dataset that can be used for indexing.
  • the efficient indexing system allows filtering for potentially similar views based on the set of created 2.5D views and reduction of the search space during the matching procedure.
  • each orthographic projection is mapped to a histogram representation.
  • the histogram is binned by depths, so reflects a distribution of depths without using the photographic or texture information. The spatial distribution of the depths is not used.
  • an indexing system inspired by Bag-of-Words concepts is applied to the orthographic projections and their histogram representations.
  • a histogram driven quantization of the depth is used due to efficiency during generation. Due to restricted depth ranges, a quantization of depth values into specified number of bins is used.
  • noisy measurements can be filtered in advance, such as by low-pass, mean, or other non-linear filtering of the depths over the area prior to binning.
  • the orthographic projections are mapped to a descriptor using a neural network or deep learnt classifier.
  • the pre-filtering for similar views is only done based on a compact histogram representation for each generated view. Histogram representations over depth measurements also overcome the problem of missing data since normalized 1 D distributions may be generated for image regions with holes. For robustness, a histogram representation in a coarse to fine concept (i.e. creating a spatial pyramid of histograms) may be used.
  • the memory 18 also stores labels and coordinates for the labels.
  • the label information may be stored separately from the 3D data. Rather than store the 3D data, the orthographic projections and/or histograms and label information are stored.
  • the memory 18 or other memory is alternatively or additionally a non-transitory computer readable storage medium storing data representing instructions executable by the programmed image processor 16 for matching 2.5D and 3D data to determine pose.
  • the instructions for implementing the processes, methods and/or techniques discussed herein are provided on non- transitory computer-readable storage media or memories, such as a cache, buffer, RAM, removable media, hard drive or other computer readable storage media.
  • Non-transitory computer readable storage media include various types of volatile and nonvolatile storage media.
  • the functions, acts or tasks illustrated in the figures or described herein are executed in response to one or more sets of instructions stored in or on computer readable storage media.
  • processing strategies may include multiprocessing, multitasking, parallel processing, and the like.
  • the instructions are stored on a removable media device for reading by local or remote systems.
  • the instructions are stored in a remote location for transfer through a computer network or over telephone lines.
  • the instructions are stored within a given computer, CPU, GPU, or system.
  • the image processor 16 is a general processor, central processing unit, control processor, graphics processor, digital signal processor, three-dimensional rendering processor, server, application specific integrated circuit, field programmable gate array, digital circuit, analog circuit, or combinations thereof.
  • the image processor 16 is for searching an index, matching orthographic projections, aligning coordinates, and/or augmenting displayed images.
  • the image processor 16 is a single device or multiple devices operating in serial, parallel, or separately.
  • the image processor 16 may be a main processor of a computer, such as a laptop or desktop computer, or may be a processor for handling some tasks in a larger system, such as in the mobile device 10.
  • the processor 16 is configured by instructions, design, hardware, and/or software to perform the acts discussed herein.
  • the image processor 16 is configured to relate or link 3D data (e.g., engineering data) with the real world 2.5D data (e.g., data captured by the camera and depth sensor 12). To relate, the pose of the object 14 relative to the camera and depth sensor 12 is determined. Using orthographic projections, the image processor 16 outputs labels specific to pixels or locations of the object displayed in a photograph or video from the 2.5D data. The pose is determined by matching the 2.5D data to one or more of the orthographic projections generated from the 3D data.
  • the 2.5D data is converted to or used as an orthographic projection of the object 14.
  • the depth measurements are extracted from the 2.5D data, resulting in the orthographic projection.
  • the depth measurements provide for depth as a function of location in an area.
  • the depth stream of the observed scene (e.g., the 2.5D data) provides depth measurements that may be compared with the depth measurements of the orthographic projections from the 3D data.
  • the orthographic projection from the 2.5D data is scaled to the pixel size used for the orthographic projections from the 3D data.
  • the size of the pixels scales to be the same so that the depth measurements correspond to the same pixel or area size.
  • the focal length and/or other parameters of the camera and depth sensor 12 as well as the measured depth at the center of the area are used to scale.
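  • A sketch of this conversion, assuming NumPy and a pinhole model with known intrinsics (fx, fy, cx, cy): the perspective depth image is back-projected to metric camera coordinates and rebinned on a grid with the same metric pixel size as the database projections. The parameter names are illustrative.

```python
import numpy as np

def perspective_to_orthographic(depth_img, fx, fy, cx, cy,
                                pixel_size=0.01, resolution=256):
    """Back-project a perspective depth image and rebin it orthographically so
    that each output pixel covers the same metric area as the database views."""
    h, w = depth_img.shape
    rows, cols = np.mgrid[0:h, 0:w]
    z = depth_img
    valid = np.isfinite(z) & (z > 0)

    # Pinhole back-projection to metric camera coordinates.
    x = (cols[valid] - cx) * z[valid] / fx
    y = (rows[valid] - cy) * z[valid] / fy
    z = z[valid]

    # Orthographic rebinning: drop the perspective divide, keep metric x/y.
    c = np.round(x / pixel_size).astype(int) + resolution // 2
    r = np.round(y / pixel_size).astype(int) + resolution // 2
    ortho = np.full((resolution, resolution), np.inf)
    inside = (c >= 0) & (c < resolution) & (r >= 0) & (r < resolution)
    for ri, ci, d in zip(r[inside], c[inside], z[inside]):
        ortho[ri, ci] = min(ortho[ri, ci], d)          # keep the nearest measurement
    finite = ortho[np.isfinite(ortho)]
    return ortho - (finite.min() if finite.size else 0.0)  # depth relative to nearest point
```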
  • the orthographic projection from the 2.5D data is matched with one or more orthographic projections from the 3D data.
  • the orthographic projections are matched.
  • the orthographic projection from the 2.5D data is compared to any number of the other orthographic projections. A normalized cross-correlation, minimum sum of absolute differences, or other measure of similarity may be used to match based on the comparison.
  • comparisons may be efficiently computed on architectures for parallelization, such as multi-core processors and/or graphics processing units.
  • a threshold or other criterion may be used to determine sufficiency of the match.
  • One or more matches may be found. Alternatively, a best one or other number of matches are found.
  • the histograms may be matched.
  • the orthographic projection from the 2.5D data is converted to a histogram (2.5D depth histogram), and the orthographic projections from the 3D data are converted to histograms (3D depth histograms).
  • the 2.5D depth histogram is then matched to the index of depth histograms for the 3D data.
  • the same or different types of matching discussed above may be used. For example, normalized cross correlation based on box filters or filtering in the Fourier domain finds similar views based on the histograms.
  • the 3D depth histograms sufficiently matching with the 2.5D depth histogram are found.
  • the depth histograms are used to match the orthographic projections.
  • the histograms of the index are an intermediate image representation for finding the most similar views to the current measurement.
  • the orthographic projection quantized into a histogram based representation may enable a quick search for potentially similar views in the database. By comparing depth histograms, the amount of image processing is more limited. The spatial distribution of the depths is removed.
  • the search for matches may follow any pattern, such as testing each possible 3D depth histogram.
  • a tree structure or search based on feedback from results is used.
  • the 3D depth histograms are clustered based on similarity so that branches or groupings may be ruled out, avoiding comparison with all of the depth histograms. For example, a tree structure using L1/L2 norms or approximated metrics are used.
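  • The two-stage search can be sketched as follows, assuming NumPy and database entries shaped like the query projection: a depth-histogram comparison prefilters the views, and normalized cross-correlation of the shortlisted orthographic depth maps produces the response map. A tree or clustered index could replace the linear histogram scan; names and the L1 histogram distance are illustrative.

```python
import numpy as np

def _hist(depth_map, bins=32, max_depth=2.0):
    h, _ = np.histogram(depth_map[np.isfinite(depth_map)],
                        bins=bins, range=(0.0, max_depth))
    return h / max(h.sum(), 1)

def _ncc(a, b):
    a = np.where(np.isfinite(a), a, 0.0).ravel()
    b = np.where(np.isfinite(b), b, 0.0).ravel()
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def match_query(query_ortho, database, top_k=5):
    """Stage 1: rank database views by depth-histogram similarity (L1).
    Stage 2: re-score the shortlist by normalized cross-correlation of the
    orthographic depth maps themselves, giving a response map over views."""
    q_hist = _hist(query_ortho)
    hist_scores = [-np.abs(_hist(view) - q_hist).sum() for view in database]
    shortlist = np.argsort(hist_scores)[::-1][:top_k]
    response = {int(i): _ncc(query_ortho, database[i]) for i in shortlist}
    best = max(response, key=response.get)
    return best, response
```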
  • the image processor 16 is configured to determine a pose of the object 14 relative to the camera and depth sensor 12 based on poses of the object 14 in the 3D orthographic projections. Where a best match is found, the pose of the orthographic projection or histogram derived from the 3D data is determined as the pose of the object 14 relative to the camera and depth sensor 12. By using orthographic projections as the basis for matching, the pose is determined with respect to natural camera characteristics. By using histograms derived from the orthographic projections, the comparison of the orthographic projections may be more rapidly performed.
  • the pose or poses from the matches are further refined to determine a more exact pose.
  • a refined filter strategy may result in more accurate pose recovery.
  • a plurality of matches is found. These matches are of similar views or poses.
  • the similar views are fed into the refined filtering strategy where the 3D-based orthographic images are matched to the current template or 2.5D-based orthographic image.
  • the comparison of histograms is performed as discussed above. This filtering concept or comparisons of histograms provides a response map encoding potential viewpoints.
  • the response map is a chart or other representation of the measures of similarity.
  • a voting scheme is applied to the most similar views.
  • Any voting scheme may be used, such as selecting the most similar, interpolating, Hough space voting, or another redundancy-based selection criterion.
  • the voting results in a pose.
  • the pose is the pose of the selected (i.e., matching) orthographic projection or a pose averaged or combined from the plurality of selected (i.e., matching) orthographic projections.
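  • One simple realization of such a vote, assuming NumPy and that each candidate view is described by a unit view direction and a match score: the directions are combined with similarity weights and renormalized. A winner-takes-all or Hough-style vote could be substituted; the numbers below are only an example.

```python
import numpy as np

def vote_pose(view_dirs, scores):
    """Combine candidate viewpoints by a similarity-weighted average of their
    unit view directions, renormalized to a unit vector (illustrative sketch)."""
    view_dirs = np.asarray(view_dirs, dtype=float)
    weights = np.clip(np.asarray(scores, dtype=float), 0.0, None)
    if weights.sum() == 0:
        weights = np.ones_like(weights)                # fall back to a plain mean
    mean_dir = (view_dirs * weights[:, None]).sum(axis=0)
    return mean_dir / np.linalg.norm(mean_dir)

# Example: three similar candidate views and their match scores.
candidates = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.995], [0.0, 0.1, 0.995]]
print(vote_pose(candidates, [0.9, 0.8, 0.7]))
```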
  • the image processor 16 may further refine the pose of the object relative to the camera and depth sensor 12.
  • the refinement uses the features in the orthographic projections rather than the histogram.
  • the 3D-based orthographic projection or projections closest to the determined pose are image processed for one or more features, such as contours, edges, T- junctions, lines, curves, and/or other shapes.
  • the features from the 2.5D data are compared to features from the 3D data itself.
  • the 2.5D orthographic projection is also image processed for the same features.
  • the features are then matched. Different adjustments of the pose or orientation are made, resulting in alteration of the features for the 3D orthographic projection.
  • the spatial distribution of the features is compared to the features for the matching 2.5D orthographic projection.
  • the alteration to the pose providing the best match of the features is the refined pose estimate.
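  • The refinement can be illustrated with a reduced, in-plane version of this search, assuming NumPy and SciPy: small rotations and shifts of the matched orthographic depth map are tested, and the adjustment whose depth-edge map best correlates with the query's edges is kept. The full approach perturbs the 3D viewpoint and re-renders the projection; this 2D stand-in only conveys the idea, and the step sizes are illustrative.

```python
import numpy as np
from scipy import ndimage

def edge_map(depth_map):
    d = np.where(np.isfinite(depth_map), depth_map, 0.0)
    return np.hypot(ndimage.sobel(d, axis=1), ndimage.sobel(d, axis=0))

def refine_in_plane(query_ortho, matched_ortho,
                    angles=np.linspace(-10.0, 10.0, 9), shifts=range(-6, 7, 2)):
    """Grid-search small in-plane rotations/translations of the matched view and
    keep the adjustment whose depth edges best correlate with the query's."""
    q = edge_map(query_ortho).ravel()
    q = q - q.mean()
    matched = np.where(np.isfinite(matched_ortho), matched_ortho, 0.0)
    best, best_score = (0.0, 0, 0), -np.inf
    for angle in angles:
        rotated = ndimage.rotate(matched, angle, reshape=False, order=1)
        for dy in shifts:
            for dx in shifts:
                candidate = np.roll(np.roll(rotated, dy, axis=0), dx, axis=1)
                e = edge_map(candidate).ravel()
                e = e - e.mean()
                denom = np.linalg.norm(e) * np.linalg.norm(q)
                score = e @ q / denom if denom > 0 else -np.inf
                if score > best_score:
                    best_score, best = score, (angle, dy, dx)
    return best, best_score
```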
  • Coordinates in the 3D data may be related to or transformed to coordinates in the 2.5D data.
  • the coordinate systems are aligned, and/or a transform between the coordinate systems is provided.
  • the image processor 16 is configured to transfer an object label of the 3D data and corresponding source to a coordinate system of the camera and depth sensor 12 based on the match.
  • the recovered pose of the mobile device 10 with respect to the 3D data enables the exchange of information, such as the labels from the 3D data. For example, part information or an annotation from the 3D data is transferred to the 2.5D data.
  • the object label is specific to one or more coordinates. Using the aligned coordinate systems or the transform between the coordinate systems, the location of the label relative to the 2.5D data is determined. Labels at any level of detail of the object 14 may be transferred. The transfer is onto an image generated based on the 2.5D data. For example, the image processor 16 transfers a graphical overlay or text to be displayed at a specific location in an image rendered from or based on the photographic portion of the 2.5D data.
  • the display 20 is a monitor, LCD, projector with a screen, plasma display, CRT, printer, touch screen, virtual reality viewer, or other now known or later developed device for outputting visual information.
  • the display 86 receives images, graphics, text, quantities, or other information from the camera and depth sensor 12, memory 18, and/or image processor 16.
  • the display 20 is configured by a display plane or memory. Image content is loaded into the display plane, and then the display 20 displays the image.
  • the display 20 is configured to display an image from the 2.5D data augmented with the object label from the 3D information.
  • a photograph or video of the object 14 is displayed on the mobile device 10.
  • the photograph or video includes a graphic, highlighting, brightness adjustment, or other modification indicating further information from the object label or labels.
  • text or a link is displayed on or over part of the object 14.
  • the image is an augmented reality image with the object label being the augmentation rendered onto the image. Images from the 2.5D data are displayed in real-time with augmentation from the 3D data rendered onto the photograph or video.
  • the display 20 displays the augmented image.
  • the camera and depth sensor 12 may be a depth sensor used to determine the user's current viewpoint.
  • the proposed approach enables the transfer of object labels into the coordinate system of the observed scene and vice versa.
  • Annotations or part information for part of the object 14 are transferred and rendered as an augmentation.
  • the proposed approach may be used for initialization of a real time tracking system during augmented reality processing. Tracking may use other processes once the initial spatial relationship or pose is determined.
  • the object label being transferred is part information for part of the object as an assembly.
  • the part may be automatically identified.
  • the user takes a picture of the object 14.
  • the returned label for a given part identifies the part, such as by matching CAD data to photograph data. Individual spare parts are identified on-the-fly from the CAD data.
  • a user takes screenshots of a real assembly.
  • the system identifies the position of the operator with respect to the assembly using the database.
  • the CAD information may be overlaid onto the real object or image of the real object using rendering. Metadata may be exchanged between the 2.5D data and CAD.
  • the object label is an annotation or other medical information.
  • a scan of a patient is performed to acquire the 3D data. This scan is aligned with the 2.5D data from the camera and depth sensor 12.
  • Depth information enables the registration of a mobile device viewing the patient to the medical volume data of the patient. This registration information can be used to trigger a cinematic render engine for overlaying.
  • Annotations or other medical information added to or included in the 3D scan data are overlaid on a photograph or video of the patient. Skin or clothes segmentation or other image processing may be used to isolate information of interest in the 3D data for rendering onto the photograph.
  • the augmentation is overlaid into the viewpoint of the physician.
  • Figure 2 illustrates one embodiment of the system of Figure 1.
  • a processor 40 generates orthographic projections 44 from 3D data 42 and/or an index 46 of the orthographic projections (e.g., histograms organized or not by similarity or clustering).
  • the 3D data-based orthographic projections 44 provide depth information for each of different orientations relative to the object.
  • the index 46 and/or the orthographic projections 44 are stored in a database 48.
  • the database 48 is populated at any time, such as days, weeks, or years prior to use for determining pose.
  • the sensor 50 captures 2.5D data and converts the 2.5D data into an orthographic projection 52.
  • the conversion may be selection of just the depth information.
  • the orthographic projection 52 is matched against the index 46 as stored in the database 48, and one or more matching orthographic projections 54 are found.
  • the pose is determined from the matches.
  • the pose may be filtered by a pose filter 56.
  • the pose filter 56 refines the pose using the index, orthographic projections, and/or 3D data stored in the database 48.
  • Figure 3 shows one embodiment of a method for matching depth information to 3D data.
  • a camera viewpoint relative to an object is determined.
  • the method is implemented by the system of Figure 1, the system of Figure 2, or another system.
  • an RGBD sensor performs act 22.
  • An image processor performs acts 24, 26, 28, and 30.
  • the image processor and display perform act 32.
  • a smart phone or tablet performs all of the acts, such as using a perspective camera, image processor, and touch screen.
  • the method is performed in the order shown, but other orders may be used. Additional, different, or fewer acts may be provided. For example, acts for capturing 3D data, generating orthographic projections for different viewpoints from the 3D data, and indexing the orthographic projections (e.g., creating histograms of depth) are provided. As another example, act 24 is not performed where the distance measurements are used as the orthographic projection. In yet another example, acts 30 and/or 32 are not performed. In another example, acts 22 and 24 are not performed, but instead an already acquired orthographic projection is used in act 26.
  • a camera acquires measurements of distance from a camera to an object.
  • the camera includes a depth sensor for measuring the distances.
  • the distance is from the camera to each of multiple locations on the object, such as locations represented by individual pixels or groups of pixels.
  • the depth measurements are acquired with a depth sensor without the camera function.
  • the user orients the camera at the object of interest from a given viewpoint. Any range to the object within the effective range of the depth measurements may be used.
  • the user may activate augmented reality to initiate the remaining acts. Alternatively, the user activates an application for performing the remaining acts, activates the camera, or transmits a photograph or video with depth measurements to an application or service.
  • an image processor orthographically projects the measurements of depth. Where the camera captures both pixel (e.g., photograph) and depth measurements, extracting or using the depth measurements provides the orthographic projection. Where just depth measurements are acquired, the measurements are used as the orthographic projection.
  • the orthographic projection may be compressed into a different representation.
  • the depth measurements are binned into a histogram.
  • the histogram has any range of depths and/or number of bins.
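  • A query-side sketch of this binning, assuming NumPy and SciPy: the sensor depths are optionally median filtered (the noise pre-filtering mentioned above) and then quantized into a normalized histogram with an arbitrary bin count and depth range.

```python
import numpy as np
from scipy import ndimage

def query_depth_histogram(depth_img, bins=32, max_depth=2.0, smooth=True):
    """Bin the sensor's depth measurements into a normalized histogram; optional
    median filtering suppresses sensor noise before binning (illustrative)."""
    d = depth_img.astype(float)
    if smooth:
        d = ndimage.median_filter(d, size=3)
    valid = d[np.isfinite(d) & (d > 0)]
    hist, _ = np.histogram(valid, bins=bins, range=(0.0, max_depth))
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(float)
```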
  • the image processor matches a representation of the orthographic projection from depth data of the depth sensor to one or more other representations of orthographic projections.
  • the image processor may be on a mobile device with the camera or may be remote from the mobile device and camera.
  • the representations of the orthographic projections are the orthographic projections themselves or a compression of the orthographic projections. For example, depth histograms are matched.
  • the representation from the depth measurements is compared in act 28 to representations in a database.
  • the representations in the database are created from 3D data of the object.
  • the database may or may not include the 3D data, such as three-dimensional engineering data.
  • the database may include reference representations as orthographic projections from the 3D data and/or index information (e.g., histograms).
  • the reference representations in the database are of the same object for which depth measurements were acquired in act 22, but from different viewpoints (e.g., orientations and/or scales) and a different data source.
  • the references are generated from orthographic projections from known viewpoints, and the resulting database of reference representations are in a same coordinate system as the 3D data.
  • the pose relative to the object in each of the references is known. Any number of references and corresponding viewpoints may be provided, such as tens, hundreds, or thousands.
  • Each representation in the database has a different pose relative to the object, so the comparison attempts to find representations with a same or similar pose as the pose of the camera to the object.
  • the matching representation or representations from the database are found.
  • the matches are found by comparing the orthographical projection of the measurements with orthographic projections from different views of the object. This comparison may be of the histograms and/or of the orthographic projections themselves.
  • the reference representations are searched to locate a match.
  • the search finds a reference most similar to the query representation. Any measure of visual similarity may be used. For example, cross-correlation or sum of absolute differences is used.
  • More than one match may be found. Alternatively, only a single match is found.
  • the matching representation is found based on a threshold, such as a correlation or similarity threshold. Alternatively, other criterion or criteria may be used, such as finding the two, three, or more best matches.
  • The representation from the depth measurements is matched with a viewpoint or viewpoints of reference orthographic projections. For a query set of depth measurements, a ranked list of similar viewpoints is generated by comparing representations of the orthographic projections in the database. The matching determines the references with corresponding viewpoints (e.g., orientations) of the object most similar to the viewpoint of the query depth measurements. A ranked list or response map of similar views from the database is determined.
  • the image processor determines a pose of an object relative to the depth sensor. This pose corresponds to the pose of the depth sensor relative to the reference 3D data.
  • the camera viewpoint relative to the object is determined based on the comparisons and resulting matches. The determination is based on the representation from the depth measurements being matched to the representation of the orthographic projection from the 3D data.
  • the orientation of the object relative to the depth sensor is calculated. Six or another number of degrees of freedom of the pose are determined.
  • an accurate alignment of the representation from the depth measurements with respect to the 3D data is determined. The result is a viewpoint of the depth sensor and corresponding mobile device or user view to the object and corresponding 3D representation of the object.
  • the poses of the matches may be combined. For example, the average orientation or an interpolation from multiple orientations is calculated. Alternatively, a voting scheme may be used.
  • the viewpoint for the best or top ranked matches may not be the same as the viewpoint for the depth measurements. While the pose may be similar, the pose may not be the same. This coarse alignment may be sufficient.
  • the pose may be refined. More accurate registration is performed using the spatial distribution of the depths.
  • three or more points in the orthographic projection from the depth measurements are also located in the orthographic projection from the 3D data or in the 3D data of the matching representation or representations.
  • the points may correspond to features distinguishable by depth variation, such as ridges or junctions.
  • the resulting features are compared for similarity to the features from the depth measurements. Any step size, search pattern, and/or stop criterion may be used.
  • the refined or adjusted pose resulting in the best matching features is calculated as the final or refined pose.
  • the image processor, using a display, augments an image.
  • the augmentation is of an actual view of the object, such as by projecting the augmentation on a semi-transparent screen between the viewer and the object.
  • the augmentation is of an image displayed on a display, such as augmenting a photograph or video.
  • the display of the mobile device is augmented.
  • the image processor identifies information for the augmentation.
  • the information is an object label.
  • the label may identify a piece or part of the object, identify the object, provide non-geometric information, and/or provide a graphic of geometry for the object.
  • the label has a position relative to the object, as represented by a location in the 3D data. Which of several labels to use may be determined based on user interaction, such as the user selecting a part of the object of interest and the label for that part being added.
  • the augmentation is a graphic, text, highlight, rendered image, or other addition to the view or another image. Any augmentation may be used.
  • the augmentation is a graphic or information positioned adjacent to or to appear to interact with the object rather than being over the object. The pose controls the positioning and/or the interaction.
  • the pose determined in act 30 indicates the position of the label location relative to the viewpoint by the camera.
  • Each or any pixel of the image from the camera may be related to a given part of the object using the pose.
  • the label is transferred to the image.
  • the label transfer converts labeled surfaces of the 3D data into annotated image regions of the viewed area in the photograph.
  • the label transfer is performed by a look-up function. For each pixel from the photograph, the corresponding 3D point on the 3D data in the determined pose is found. The label for that 3D point is transferred to the two-dimensional location (e.g., pixel). By looking up the label from the 3D data, the part of the assembly or object shown at that location is identified or annotated.
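  • A sketch of such a look-up, assuming NumPy, a labeled 3D point set, the recovered pose as a rotation R and translation t into the camera frame, and pinhole intrinsics: each labeled point is projected into the image and, per pixel, the label of the nearest point is kept. All names are illustrative.

```python
import numpy as np

def transfer_labels(points, labels, R, t, fx, fy, cx, cy, image_shape):
    """Depth-buffered label look-up: project labeled 3D points into the camera
    image using the recovered pose and keep, per pixel, the nearest point's label."""
    h, w = image_shape
    cam = np.asarray(points, dtype=float) @ np.asarray(R, dtype=float).T
    cam = cam + np.asarray(t, dtype=float)
    labels = np.asarray(labels)
    in_front = cam[:, 2] > 0
    cam, labels = cam[in_front], labels[in_front]

    u = np.round(fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = np.round(fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    label_img = np.full((h, w), -1, dtype=int)          # -1 marks "no label"
    zbuf = np.full((h, w), np.inf)
    for ui, vi, z, lab in zip(u[inside], v[inside], cam[inside, 2], labels[inside]):
        if z < zbuf[vi, ui]:                            # nearest surface wins
            zbuf[vi, ui], label_img[vi, ui] = z, lab
    return label_img
```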
  • the 3D data is rendered.
  • the rendering uses the determined pose or viewpoint.
  • a surface rendering is used, such as an on-the-fly rendering with OpenGL or other rendering engine or language.
  • the rendering may use only the visible surfaces. Alternatively, obstructed surfaces may be represented in the rendering. Only a sub-set of surfaces, such as one surface, may be rendered, such as based on user selection.
  • the object label is created as a mesh or rendering from the 3D data.
  • the rendered pixels map to the pixels of the photograph. By combining the intensities from the 3D rendering and the photograph for each pixel, the augmentation is added. Any function for combining may be used.
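  • The per-pixel combination can be as simple as an alpha blend of the photograph with the rendering at the recovered pose, masked to the rendered surface. A hypothetical sketch assuming NumPy arrays of matching size:

```python
import numpy as np

def overlay_render(photo, render, mask, alpha=0.5):
    """Blend a rendering of the 3D data (posed at the recovered viewpoint) onto
    the photograph; 'mask' marks pixels covered by the rendered surface and
    alpha controls how strongly the augmentation shows through."""
    photo = photo.astype(float)
    render = render.astype(float)
    m = (mask.astype(float) * alpha)[..., None]          # broadcast over color channels
    return ((1.0 - m) * photo + m * render).astype(np.uint8)
```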
  • 2.5D depth images are matched to 3D data, such as CAD data.
  • the correspondence between the 2.5D depth images and the 3D data is estimated.
  • the matching recovers the pose for the 2.5D depth data with respect to the 3D data.
  • the pose supports fusion of the involved modalities within a common metric coordinate system. Fusion enables the linkage of spatially related data within involved modalities since the 2D/3D correspondences on image, object and pixel level are found.
  • orthographic projections of the 3D data are generated from potential viewpoints. These orthographic projections may be represented as 2.5D data structures. Using these orthographic projections normalized to a pixel size is invariant to camera parameters.
  • The 3D data (e.g., CAD, GIS, engineering, or medical data) and the depth sensing devices (e.g., RGBD sensors) provide metric measurements.
  • the orthographic representations may be normalized for pixel size, which is derived from sensor specifications.
  • an indexing system finds potentially relevant projections to the current scene observation.
  • the indexing uses depth histograms, reducing computation costs during pose estimation.
  • Potential 2.5D projections are used within a filter using the orthographic projections.
  • Knowledge of scale and the computed response map from comparison determines the final pose of the camera.
  • the mapping from perspective to orthographic data is not needed. Instead, the orthographic projection is provided directly as an output of the camera.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

Matching (26) of depth information with 3D information determines (30) the pose of an object (14) relative to a camera (12). Orthographic projections of the 3D information are created (24) from different viewpoints. The orthographic projections may be represented as 2.5D data structures, allowing matching (26) with 2.5D data from a depth sensor (12) of the camera (12). Using the orthographic projections may be invariant to the parameters of the camera (12) and may allow simple scaling for the matching (26). The matching (26) provides (30) the pose, allowing metadata from the 3D information to be transferred onto the images from the camera (12).
PCT/US2016/058014 2016-10-21 2016-10-21 Object pose based on matching 2.5D depth information and 3D information WO2018075053A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2016/058014 WO2018075053A1 (fr) 2016-10-21 2016-10-21 Object pose based on matching 2.5D depth information and 3D information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2016/058014 WO2018075053A1 (fr) 2016-10-21 2016-10-21 Object pose based on matching 2.5D depth information and 3D information

Publications (1)

Publication Number Publication Date
WO2018075053A1 true WO2018075053A1 (fr) 2018-04-26

Family

ID=57227142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/058014 WO2018075053A1 (fr) 2016-10-21 2016-10-21 Object pose based on matching 2.5D depth information and 3D information

Country Status (1)

Country Link
WO (1) WO2018075053A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1986153A2 (fr) * 2007-04-23 2008-10-29 Mitsubishi Electric Corporation Procédé et système pour déterminer les poses d'objets à partir d'une gamme d'images
US20120076361A1 (en) * 2009-06-03 2012-03-29 Hironobu Fujiyoshi Object detection device
EP2927844A1 (fr) * 2014-04-03 2015-10-07 Airbus DS GmbH Estimation de position et de pose des objects 3d
US20160275079A1 (en) * 2015-03-17 2016-09-22 Siemens Aktiengesellschaft Part Identification using a Photograph and Engineering Data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PAUL WOHLHART ET AL: "Learning descriptors for object recognition and 3D pose estimation", 13 April 2015 (2015-04-13), XP055383098, Retrieved from the Internet <URL:https://arxiv.org/pdf/1502.05908.pdf> [retrieved on 20170620] *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10799206B2 (en) 2018-09-28 2020-10-13 General Electric Company System and method for calibrating an imaging system
CN112913230B (zh) * 2018-10-23 2023-09-12 皇家飞利浦有限公司 图像生成装置及其方法
CN112913230A (zh) * 2018-10-23 2021-06-04 皇家飞利浦有限公司 图像生成装置及其方法
CN110135323A (zh) * 2019-05-09 2019-08-16 北京四维图新科技股份有限公司 图像标注方法、装置、***及存储介质
US20200375546A1 (en) * 2019-06-03 2020-12-03 General Electric Company Machine-guided imaging techniques
US10881353B2 (en) 2019-06-03 2021-01-05 General Electric Company Machine-guided imaging techniques
CN110704736A (zh) * 2019-09-29 2020-01-17 北京幻想纵横网络技术有限公司 一种信息发布及展示方法、装置
CN110704736B (zh) * 2019-09-29 2021-02-09 北京幻想纵横网络技术有限公司 一种信息发布及展示方法、装置
CN110782489B (zh) * 2019-10-21 2022-09-30 科大讯飞股份有限公司 影像数据的匹配方法、装置、设备及计算机可读存储介质
CN110782489A (zh) * 2019-10-21 2020-02-11 科大讯飞股份有限公司 影像数据的匹配方法、装置、设备及计算机可读存储介质
US11317884B2 (en) 2019-12-20 2022-05-03 GE Precision Healthcare LLC Methods and systems for mammography and biopsy workflow optimization
CN113808282A (zh) * 2021-08-26 2021-12-17 交通运输部水运科学研究所 一种多通航要素数据融合方法
CN113808282B (zh) * 2021-08-26 2023-09-26 交通运输部水运科学研究所 一种多通航要素数据融合方法
US11673054B2 (en) 2021-09-07 2023-06-13 Snap Inc. Controlling AR games on fashion items
US11900506B2 (en) 2021-09-09 2024-02-13 Snap Inc. Controlling interactive fashion based on facial expressions
US11734866B2 (en) 2021-09-13 2023-08-22 Snap Inc. Controlling interactive fashion based on voice
WO2023055825A1 (fr) * 2021-09-30 2023-04-06 Snap Inc. Suivi de vêtement pour le haut du corps 3d
US11636662B2 (en) 2021-09-30 2023-04-25 Snap Inc. Body normal network light and rendering control
US11983826B2 (en) 2021-09-30 2024-05-14 Snap Inc. 3D upper garment tracking
US11651572B2 (en) 2021-10-11 2023-05-16 Snap Inc. Light and rendering of garments

Similar Documents

Publication Publication Date Title
WO2018075053A1 (fr) Object pose based on matching 2.5D depth information and 3D information
US20200267371A1 (en) Handheld portable optical scanner and method of using
US10311648B2 (en) Systems and methods for scanning three-dimensional objects
Rogez et al. Mocap-guided data augmentation for 3d pose estimation in the wild
US10073848B2 (en) Part identification using a photograph and engineering data
KR101007276B1 (ko) 3차원 안면 인식
JP7128708B2 (ja) 機械学習用の訓練データの効率的な収集のための拡張現実を使用したシステム及び方法
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
Hoppe et al. Online Feedback for Structure-from-Motion Image Acquisition.
CN108053476B (zh) 一种基于分段三维重建的人体参数测量***及方法
US20170330375A1 (en) Data Processing Method and Apparatus
WO2019035155A1 (fr) Image processing system, image processing method, and program
US20130095920A1 (en) Generating free viewpoint video using stereo imaging
WO2017029488A2 (fr) Methods of generating personalized 3D head models or 3D body models
WO2016029939A1 (fr) Method and system for determining at least one image feature in at least one image
WO2014172484A1 (fr) Handheld portable optical scanner and method of using
JP5795250B2 (ja) 被写体姿勢推定装置および映像描画装置
Rogez et al. Image-based synthesis for deep 3D human pose estimation
JP7379065B2 (ja) 情報処理装置、情報処理方法、およびプログラム
Ruchay et al. Accuracy analysis of 3D object reconstruction using RGB-D sensor
Zhang et al. Dense 3D facial reconstruction from a single depth image in unconstrained environment
Kang et al. Progressive 3D model acquisition with a commodity hand-held camera
Khan et al. A review of benchmark datasets and training loss functions in neural depth estimation
Aliakbarpour et al. Multi-sensor 3D volumetric reconstruction using CUDA
Wang et al. Im2fit: Fast 3d model fitting and anthropometrics using single consumer depth camera and synthetic data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16790834

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16790834

Country of ref document: EP

Kind code of ref document: A1