CN113137916B - Automatic measurement based on object classification - Google Patents

Automatic measurement based on object classification

Info

Publication number
CN113137916B
CN113137916B CN202110053465.3A CN202110053465A
Authority
CN
China
Prior art keywords
bounding box
class
neural network
data
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110053465.3A
Other languages
Chinese (zh)
Other versions
CN113137916A (en)
Inventor
A·贾因
A·萨恩卡
单琦
A·达·威加
S·V·乔希
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/148,965 (US11574485B2)
Application filed by Apple Inc
Publication of CN113137916A
Application granted
Publication of CN113137916B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01B MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B 11/00 Measuring arrangements characterised by the use of optical techniques
    • G01B 11/02 Measuring arrangements characterised by the use of optical techniques for measuring length, width or thickness
    • G01B 11/06 Measuring arrangements characterised by the use of optical techniques for measuring thickness, e.g. of sheet material
    • G01B 11/0608 Height gauges
    • G01B 11/08 Measuring arrangements characterised by the use of optical techniques for measuring diameters
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to automatic measurement based on object classification. Various implementations disclosed herein include apparatuses, systems, and methods that acquire a three-dimensional (3D) representation of a physical environment generated based on depth data and light intensity image data, generate a 3D bounding box corresponding to an object in the physical environment based on the 3D representation, classify the object based on the 3D bounding box and 3D semantic data, and display measurements of the object, wherein the measurements of the object are determined using one of a plurality of class-specific neural networks selected based on the classification of the object.

Description

Automatic measurement based on object classification
Technical Field
The present disclosure relates generally to generating a geometric representation of an object in a physical environment, and in particular to a system and method of generating a geometric representation based on information detected in a physical environment.
Background
Object detection and accurate measurement of objects play an important role in designing, understanding, and reshaping indoor spaces and generating accurate reconstructions. There are many obstacles to providing a computer-based system that automatically generates object measurements based on sensor data. The acquired sensor data (e.g., image and depth data) about the physical environment may be incomplete or insufficient to provide accurate measurements. As another example, image and depth data often lack semantic information, and measurements generated without such data may lack accuracy. The prior art does not allow for automatic, accurate, and efficient generation of object measurements using a mobile device (e.g., based on photographs, videos, or other sensor data captured while a user walks through a room). Furthermore, the prior art may not provide sufficiently accurate and effective measurements in real time (e.g., live measurement during scanning).
Disclosure of Invention
Various implementations disclosed herein include devices, systems, and methods for generating measurements using three-dimensional (3D) representations of physical environments. A 3D representation of the physical environment may be generated based on sensor data, such as image and depth sensor data. In some implementations, semantically labeled 3D representations of physical environments are used to facilitate object detection and the generation of object measurements. Some implementations perform semantic segmentation and labeling of 3D point clouds of a physical environment.
According to some implementations, measurements of objects (e.g., furniture, appliances, etc.) detected in a physical environment may be generated using a variety of different techniques. In some implementations, the object is measured by first generating a 3D bounding box of the object based on the depth data, refining the bounding box using the neural networks and refinement algorithms described herein, and acquiring measurements based on the refined bounding box and the 3D data points associated with that bounding box. In some implementations, the object is measured using machine learning techniques (e.g., neural networks) that produce different types of measurements for different object types. For example, different types of measurements may include the seat height of a chair, the display diameter of a television, the table diameter of a round table, the table length of a rectangular table, etc.
Some implementations disclosed herein may realize various advantages by measuring objects using multiple class-specific machine learning models (e.g., class-specific neural networks). In some implementations, multiple machine learning models are trained to determine different measurements for different object classes. For example, one model may be trained and used to determine measurements of chair type objects (e.g., to determine seat height, armrest length, etc.), and another model may be trained and used to determine measurements of television type objects (e.g., to determine diagonal screen size, maximum television depth, etc.). Such class-specific measurements may provide more information than the simple length, width, and height of the bounding box that may be identified for each object.
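As a non-limiting illustration of this class-specific approach, the following Python sketch dispatches a classified object's 3D points to a per-class measurement routine. The class names, measurement keys, and simple geometric heuristics are assumptions for illustration only; the implementations described herein use trained class-specific neural networks rather than these hand-written rules.

```python
# Illustrative only: a hand-written stand-in for class-specific measurement models.
import numpy as np

def measure_chair(points: np.ndarray) -> dict:
    # Hypothetical chair measurements derived from the object's 3D points.
    z = points[:, 2]
    return {"overall_height": float(z.max() - z.min()),
            "seat_height": float(np.percentile(z, 40) - z.min())}

def measure_television(points: np.ndarray) -> dict:
    # Hypothetical television measurement: diagonal of the screen face.
    width = points[:, 0].max() - points[:, 0].min()
    height = points[:, 2].max() - points[:, 2].min()
    return {"diagonal": float(np.hypot(width, height))}

# One measurer per object class; a real system would register one trained
# class-specific neural network per class instead.
MEASUREMENT_MODELS = {"chair": measure_chair, "television": measure_television}

def measure_object(object_class: str, points: np.ndarray) -> dict:
    """Select and run the measurer that matches the object's classification."""
    return MEASUREMENT_MODELS[object_class](points)

if __name__ == "__main__":
    demo_points = np.random.rand(500, 3)  # points inside a refined bounding box
    print(measure_object("chair", demo_points))
```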
In some implementations, automated measurement techniques use slices or horizontal planes to identify surfaces (e.g., seat tops) and use those detected surfaces to provide measurements (e.g., seat-to-ground distances).
Some implementations relate to an exemplary method of providing measurements for objects within a physical environment. An exemplary method involves acquiring a 3D representation of a physical environment generated based on depth data and light intensity image data. For example, a 3D point cloud may be generated based on depth camera information received simultaneously with the light intensity image. In some implementations, the 3D representation may be associated with 3D semantic data. For example, the algorithm may perform semantic segmentation and labeling of 3D point cloud points.
The example method also involves generating a 3D bounding box corresponding to the object in the physical environment based on the 3D representation. For example, a 3D bounding box may provide the location, pose (e.g., orientation and position), and shape of each piece of furniture and each appliance in a room or portion of a room. The 3D bounding box may be refined using dilation and cutting techniques. In some implementations, generating the refined 3D bounding box includes generating a proposed 3D bounding box of the object using the first neural network, and generating the refined 3D bounding box by expanding the proposed 3D bounding box based on a bounding box expansion scale (e.g., expanding the bounding box by 10%), identifying features of the object within the expanded proposed bounding box using the second neural network, and refining the proposed 3D bounding box based on the identified features. In some implementations, the first neural network generates the proposed bounding box based on 3D semantic data associated with the object. In some implementations, the second neural network identifies features of the object based on 3D semantic data associated with the object. In some implementations, the third neural network is trained to refine the accuracy of the identified features from the second neural network based on 3D semantic data associated with the object and light intensity image data (e.g., RGB data) acquired during the scanning process, and to output a further refined 3D bounding box based on the refined accuracy of the identified features.
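The expand-then-refine step can be pictured with the following sketch, in which a proposed axis-aligned box is dilated by an assumed 10% scale and then refit to the points found inside it. The simple point-membership test stands in for the second neural network's feature identification and is not the trained network described above.

```python
# Illustrative expand-and-refit of a proposed axis-aligned 3D bounding box.
import numpy as np

def expand_box(box_min, box_max, scale=0.10):
    """Dilate the box about its center by the given expansion scale (10% here)."""
    center = (box_min + box_max) / 2.0
    half = (box_max - box_min) / 2.0 * (1.0 + scale)
    return center - half, center + half

def refine_box(points, box_min, box_max, scale=0.10):
    """Refit the box to the points inside the dilated proposal."""
    lo, hi = expand_box(np.asarray(box_min), np.asarray(box_max), scale)
    inside = np.all((points >= lo) & (points <= hi), axis=1)  # stand-in for feature identification
    object_points = points[inside]
    return object_points.min(axis=0), object_points.max(axis=0)

points = np.random.rand(1000, 3)
proposed = (np.array([0.2, 0.2, 0.0]), np.array([0.7, 0.7, 0.9]))
print(refine_box(points, *proposed))
```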
The example method also involves classifying the object based on the 3D bounding box and the 3D semantic data. For example, a class name or label is provided for each generated 3D bounding box. In some implementations, classifying the object based on the 3D bounding box and the 3D semantic data includes determining a class of the 3D bounding box based on the 3D semantic data using an object classification neural network, and classifying the object corresponding to the 3D bounding box based on the classification of the 3D bounding box. In some implementations, a neural network specific to a first class is trained to determine specific points on a first object class (e.g., chair) for measuring objects in the first class. For example, the armrest length and seat height of the chair may be determined. In some implementations, a neural network specific to a second class is trained to determine specific points on a second object class (e.g., table) for measuring objects in the second class, wherein the second object class is different from the first object class. For example, the table height and a tabletop size specific to a circular or rectangular tabletop may be determined. The measurements of objects in the second object class differ from the measurements of objects in the first object class. For example, a chair may have more, or at least different, measurements than a table or television.
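For intuition only, the sketch below assigns a class to a 3D bounding box by a majority vote over the semantic labels of the points it contains. This simple vote is a stand-in for the object classification neural network described above, not the network itself, and the label names are assumptions.

```python
# Illustrative stand-in for classifying a 3D bounding box from 3D semantic data.
from collections import Counter
import numpy as np

def classify_box(points, labels, box_min, box_max):
    """Return the most common semantic label among points inside the box."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    votes = Counter(np.asarray(labels)[inside])
    return votes.most_common(1)[0][0] if votes else None

points = np.random.rand(200, 3)
labels = np.random.choice(["chair", "floor", "wall"], size=200, p=[0.6, 0.3, 0.1])
print(classify_box(points, labels, np.zeros(3), np.ones(3)))
```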
The example method also involves displaying a measurement of the object (e.g., armrest length, seat height, television diameter, etc.), wherein the measurement of the object is determined using one of a plurality of class-specific neural networks selected based on the classification of the object. For example, a first network is trained to determine a particular point on a chair for chair measurements, and a second network is trained to determine a different point on a table for table measurements. In use, a user may scan a room using a device (e.g., a smart phone), and the processes described herein will identify an object (e.g., a chair) and provide specific measurements of the object (e.g., chair height, seat height, base width, etc.). In some implementations, the measurement results may be automatically displayed on a user device overlaying the object or in close proximity to the object. In some implementations, the measurement results may be provided after some type of user interaction with the identified object. For example, a transparent bounding box surrounding the object may be shown to the user, and the user may select or click on the bounding box, and the measurement results will then be displayed.
In some implementations, the 3D representation is associated with 3D semantic data comprising a 3D point cloud, the 3D point cloud comprising semantic tags associated with at least a portion of 3D points within the 3D point cloud. Additionally, in some implementations, the semantic tags identify walls, wall attributes (e.g., doors and windows), objects, and classifications of objects in the physical environment.
Some implementations of the present disclosure relate to an exemplary method of providing measurement data for an object within a physical environment. An exemplary method first involves acquiring a 3D representation of a physical environment generated based on depth data and light intensity image data. For example, a 3D point cloud may be generated based on depth camera information received simultaneously with the image. In some implementations, the 3D representation is associated with 3D semantic data. For example, algorithms may be used for semantic segmentation and labeling of 3D point clouds of indoor scenes.
The example method also involves generating a 3D bounding box corresponding to the object in the physical environment based on the 3D representation. For example, a 3D bounding box may provide the position, pose (e.g., orientation and position), and shape of each piece of furniture and each appliance in a room. The bounding box may be refined using expansion and cutting techniques. In some implementations, generating the refined 3D bounding box includes generating a proposed 3D bounding box of the object using the first neural network, and generating the refined 3D bounding box by expanding the proposed 3D bounding box based on a bounding box expansion scale (e.g., expanding the bounding box by 10%), identifying features of the object within the expanded proposed 3D bounding box using the second neural network, and refining the proposed 3D bounding box based on the identified features. In some implementations, the first neural network generates the proposed 3D bounding box based on 3D semantic data associated with the object. In some implementations, the second neural network identifies features of the object based on 3D semantic data associated with the object. In some implementations, the third neural network is trained to refine the accuracy of the identified features from the second neural network based on 3D semantic data associated with the object and light intensity image data (e.g., RGB data) acquired during the scanning process, and to output a further refined 3D bounding box based on the refined accuracy of the identified features.
The example method also involves determining a class of the object based on the 3D semantic data. For example, class names or labels are provided for the generated 3D bounding boxes. In some implementations, classifying the object based on the 3D semantic data includes determining a class of the object based on the 3D semantic data using an object classification neural network, and classifying a 3D bounding box corresponding to the object based on the determined class of the object. In some implementations, a neural network specific to a first class is trained to determine specific points on a first object class (e.g., chair) for measuring objects in the first class. For example, the armrest length and seat height of a chair may be determined. In some implementations, a neural network specific to a second class is trained to determine specific points on a second object class (e.g., table) for measuring objects in the second class, wherein the second object class is different from the first object class. For example, for a table, the table height and a tabletop size specific to a circular or rectangular tabletop may be determined. The measurements of objects in the second object class differ from the measurements of objects in the first object class. For example, a chair may have more, or at least different, measurements than a table or television.
The example method also involves determining a location of a surface of the object based on the class of the object. The location is determined by identifying planes within the 3D bounding box that have semantics in the 3D semantic data that meet the surface criteria of the object. For example, a number of chair voxels are identified within a horizontal plane indicating that the plane is the seating surface of a chair type object.
The example method also involves providing a measurement of the object (e.g., seat height, etc.). The measurement of the object is determined based on the position of the surface of the object. For example, measurements from the seat surface to the floor may be collected to provide a seat height measurement. For example, a user may scan a room using a device (e.g., a smart phone), and the processes described herein will identify an object (e.g., a chair) and provide a measurement of the identified object (e.g., seat height, etc.). In some implementations, the measurement results may be automatically displayed on a user device overlaying the object or in close proximity to the object. In some implementations, the measurement results may be provided after some type of user interaction with the identified object. For example, a transparent bounding box surrounding the object may be shown to the user, and the user may select or click on the bounding box, and the measurement results will then be displayed.
In some implementations, identifying a plane within the bounding box includes identifying that a plurality of 3D data points (e.g., voxels) are within a particular plane (e.g., horizontal), indicating that the plane is a surface of a particular feature (e.g., chair seat) of a type of object.
In some implementations, a number of 3D data points within a particular plane is determined based on a comparison to a data point plane threshold to indicate that the plane is a surface of a particular feature of a type of object. For example, the plane threshold is a particular number of data points. In some implementations, the plane threshold is a percentage of data points compared to other data points that are semantically labeled. For example, if 30% or more of the points are on the same horizontal plane (i.e., the same height level), it may be determined that the detected horizontal plane is the seat of the chair. In some implementations, different threshold percentages may be used for other object classifications. For example, a table will have a higher percentage of data points on the same horizontal plane. In some implementations, different detected planes may be used and compared to determine different features of the identified object.
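A minimal sketch of this slice-based surface detection follows: points labeled as belonging to the object are binned by height, and a horizontal slice holding at least a threshold fraction of the points (30% here, matching the example above) is treated as the seat surface, whose distance to the floor gives the seat height. The bin size and the exact thresholding scheme are illustrative assumptions.

```python
# Illustrative horizontal-slice detection for a chair seat and the resulting seat height.
import numpy as np

def seat_height(chair_points, floor_height=0.0, bin_size=0.02, min_fraction=0.30):
    """Return the seat-to-floor distance, or None if no dominant horizontal slice exists."""
    heights = chair_points[:, 2]
    bins = np.floor((heights - heights.min()) / bin_size).astype(int)
    counts = np.bincount(bins)
    best = int(counts.argmax())
    if counts[best] < min_fraction * len(heights):
        return None  # fewer than 30% of the labeled points share one height slice
    seat_z = heights.min() + (best + 0.5) * bin_size
    return float(seat_z - floor_height)

chair = np.random.rand(300, 3) * np.array([0.5, 0.5, 0.9])  # synthetic chair points
chair[:120, 2] = 0.45                                        # many points at seat level
print(seat_height(chair))
```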
In some implementations, the 3D representation is associated with 3D semantic data comprising a 3D point cloud, the 3D point cloud comprising semantic tags associated with at least a portion of 3D points within the 3D point cloud. Additionally, in some implementations, the semantic tags identify walls, wall attributes (e.g., doors and windows), objects, and classifications of objects in the physical environment.
According to some implementations, an apparatus includes one or more processors, non-transitory memory, and one or more programs; the one or more programs are stored in the non-transitory memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing or causing performance of any of the methods described herein. According to some implementations, a non-transitory computer-readable storage medium has instructions stored therein, which when executed by one or more processors of a device, cause the device to perform or cause to perform any of the methods described herein. According to some implementations, an apparatus includes: one or more processors, non-transitory memory, and means for performing or causing performance of any one of the methods described herein.
Drawings
So that the present disclosure may be understood by those of ordinary skill in the art, a more detailed description may be had by reference to aspects of some illustrative implementations, some of which are shown in the accompanying drawings.
FIG. 1 is a block diagram of an exemplary operating environment, according to some implementations.
FIG. 2 is a block diagram of an exemplary server according to some implementations.
Fig. 3 is a block diagram of an exemplary device according to some implementations.
FIG. 4 is a system flow diagram of an exemplary generation of semantic 3D representations using three-dimensional (3D) data based on depth and light intensity image information and semantic segmentation, according to some implementations.
FIG. 5 is a flowchart representation of an exemplary method of generating and displaying measurements of an object determined using class-specific neural networks based on a 3D representation of a physical environment, according to some implementations.
FIGS. 6A-6B are system flow diagrams of exemplary generation of measurements of an object determined using one of a plurality of class-specific neural networks based on a 3D representation of the physical environment, according to some implementations.
FIG. 7 is a flow chart representation of an exemplary method of generating and providing measurements of an object based on surface location determination, according to some implementations.
FIG. 8 is a system flow diagram illustrating exemplary generation of measurement results for an object based on surface location determination, according to some implementations.
The various features shown in the drawings may not be drawn to scale according to common practice. Accordingly, the dimensions of the various features may be arbitrarily expanded or reduced for clarity. In addition, some figures may not depict all of the components of a given system, method, or apparatus. Finally, like reference numerals may be used to refer to like features throughout the specification and drawings.
Detailed Description
Numerous details are described to provide a thorough understanding of the example implementations shown in the drawings. However, the drawings illustrate only some example aspects of the disclosure and therefore should not be considered limiting. It will be understood by those of ordinary skill in the art that other effective aspects and/or variations do not include all of the specific details described herein. Moreover, well-known systems, methods, components, devices, and circuits have not been described in detail so as not to obscure the more pertinent aspects of the exemplary implementations described herein.
FIG. 1 is a block diagram of an exemplary operating environment 100, according to some implementations. In this example, the exemplary operating environment 100 illustrates an exemplary physical environment 105 including walls 130, 132, 134, a chair 140, a table 142, a door 150, and a television 152. While pertinent features are shown, those of ordinary skill in the art will recognize from this disclosure that various other features are not shown for the sake of brevity and so as not to obscure more pertinent aspects of the exemplary implementations disclosed herein. To this end, operating environment 100 includes, as a non-limiting example, a server 110 and a device 120. In an exemplary implementation, the operating environment 100 does not include a server 110, and the methods described herein are performed on the device 120.
In some implementations, the server 110 is configured to manage and coordinate user experiences. In some implementations, the server 110 includes suitable combinations of software, firmware, and/or hardware. Server 110 is described in more detail below with reference to fig. 2. In some implementations, the server 110 is a computing device that is in a local or remote location relative to the physical environment 105. In one example, server 110 is a local server located within physical environment 105. In another example, server 110 is a remote server (e.g., cloud server, central server, etc.) located outside of physical environment 105. In some implementations, the server 110 is communicatively coupled with the device 120 via one or more wired or wireless communication channels (e.g., Bluetooth, IEEE 802.11x, IEEE 802.16x, IEEE 802.3x, etc.).
In some implementations, the device 120 is configured to present an environment to a user. In some implementations, the device 120 includes suitable combinations of software, firmware, and/or hardware. The device 120 is described in more detail below with reference to fig. 3. In some implementations, the functionality of the server 110 is provided by and/or combined with the device 120.
In some implementations, the device 120 is a handheld electronic device (e.g., a smart phone or tablet computer) configured to present content to a user. In some implementations, the user wears the device 120 on his/her head. As such, the device 120 may include one or more displays provided for displaying content. For example, the device 120 may enclose a field of view of the user. In some implementations, the device 120 is replaced with a chamber, housing, or compartment configured to present content, wherein the device 120 is not worn or held by a user.
Fig. 2 is a block diagram of an example of a server 110 according to some implementations. While certain specific features are shown, those of ordinary skill in the art will appreciate from the disclosure that various other features are not shown for brevity and so as not to obscure more pertinent aspects of the implementations disclosed herein. To this end, as a non-limiting example, in some implementations, the server 110 includes one or more processing units 202 (e.g., microprocessors, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), Central Processing Units (CPUs), processing cores, etc.), one or more input/output (I/O) devices 206, one or more communication interfaces 208 (e.g., Universal Serial Bus (USB), FIREWIRE, THUNDERBOLT, IEEE 802.3x, IEEE 802.11x, IEEE 802.16x, Global System for Mobile communications (GSM), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Global Positioning System (GPS), Infrared (IR), BLUETOOTH, ZIGBEE, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 210, memory 220, and one or more communication buses 204 for interconnecting these components and various other components.
In some implementations, the one or more communication buses 204 include circuitry that interconnects the system components and controls communication between the system components. In some implementations, the one or more I/O devices 206 include at least one of a keyboard, a mouse, a touch pad, a joystick, one or more microphones, one or more speakers, one or more image sensors, one or more displays, and the like.
Memory 220 includes high-speed random access memory such as Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Double Data Rate Random Access Memory (DDR RAM), or other random access solid state memory devices. In some implementations, the memory 220 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 220 optionally includes one or more storage devices located remotely from the one or more processing units 202. Memory 220 includes a non-transitory computer-readable storage medium. In some implementations, the memory 220 or a non-transitory computer readable storage medium of the memory 220 stores the following programs, modules, and data structures, or a subset thereof, including an optional operating system 230 and one or more application programs 240.
Operating system 230 includes processes for handling various basic system services and for performing hardware-related tasks. In some implementations, the application 240 is configured to manage and coordinate one or more experiences of one or more users (e.g., a single experience of one or more users, or multiple experiences of a respective group of one or more users).
The application 240 includes a 3D representation unit 242, an object detection unit 244, and a measurement unit 246. The 3D representation unit 242, the object detection unit 244 and the measurement unit 246 may be combined into a single application or unit or divided into one or more additional applications or units.
The 3D representation unit 242 is configured with instructions executable by the processor to obtain image data (e.g., light intensity data, depth data, etc.) and integrate (e.g., fuse) the image data using one or more of the techniques disclosed herein. For example, the 3D representation unit 242 fuses the RGB image from the light intensity camera with sparse depth maps from the depth camera (e.g., time-of-flight sensor) and other sources of physical environmental information to output a dense depth point cloud of information. In addition, the 3D representation unit 242 is configured with instructions executable by the processor to obtain light intensity image data (e.g., RGB) and perform a semantic segmentation algorithm to assign semantic tags to features identified in the image data and generate semantic image data (e.g., RGB-S) using one or more of the techniques disclosed herein. The 3D representation unit 242 is further configured with instructions executable by the processor to obtain light intensity image data (e.g., RGB) and depth image data, and generate a semantic 3D representation (e.g., a 3D point cloud with associated semantic tags) using one or more of the techniques disclosed herein. In some implementations, the 3D representation unit 242 includes separate units, such as an integration unit for generating 3D point cloud data, a semantic unit for semantic segmentation based on light intensity data (e.g., RGB-S), and a semantic 3D unit for generating a semantic 3D representation, as further described herein with reference to fig. 4.
The object detection unit 244 is configured with instructions executable by the processor to generate and display measurements of objects, determined using one of a plurality of class-specific neural networks, based on a 3D representation of the physical environment (e.g., 3D point cloud, 3D mesh reconstruction, semantic 3D point cloud, etc.) using one or more of the techniques disclosed herein. For example, the object detection unit 244 acquires a sequence of light intensity images from a light intensity camera (e.g., a real-time camera feed), a semantic 3D representation (e.g., a semantic 3D point cloud) generated by the 3D representation unit 242, and other sources of physical environmental information (e.g., camera positioning information from the camera's SLAM system). The object detection unit 244 may identify objects (e.g., furniture, appliances, etc.) in the sequence of light intensity images based on the semantic 3D representation, generate a bounding box for each identified object, and perform post-processing using the fine-tuning neural network techniques further disclosed herein.
In some implementations, the object detection unit 244 includes separate units, such as an object detection neural network unit for identifying objects and generating proposed bounding boxes, an associated post-processing unit for fine-tuning the bounding box of each identified object, and an object classification neural network for classifying each type of object, as further discussed herein with reference to fig. 6 and 8.
The measurement unit 246 is configured with instructions executable by the processor to generate measurement data based on the 3D representation of the identified object (e.g., 3D point cloud, 3D mesh reconstruction, semantic 3D point cloud, etc.) using one or more techniques disclosed herein. For example, the measurement unit 246 obtains data associated with the bounding box (e.g., classified and refined bounding box) of the identified object from the object detection unit 244. The measurement unit 246 is configured with instructions executable by the processor to generate and provide measurement data based on the 3D representation of the identified object using one or more processes further disclosed herein with reference to fig. 6 and 8.
In some implementations, the measurement unit 246 includes multiple machine learning units for each particular type of object. For example, a class 1 neural network for chairs, a class 2 neural network for desks, a class 3 neural network for televisions, etc. Multiple machine learning units may be trained on different subsets of objects such that measurement unit 246 may provide different types of measurements for each subset of objects (e.g., diameter of a circular table versus length and width of a rectangular table). The measurement unit 246 is configured with instructions executable by the processor to generate and provide measurement data based on the 3D representation of the identified objects for each subset of the neural network for each object class using one or more processes further disclosed herein with reference to fig. 6.
In some implementations, the measurement unit 246 includes a plane detection unit to identify planes within the bounding box that have semantics satisfying the surface criteria of the object in the 3D semantic data. For example, a number of chair voxels are identified within a horizontal plane indicating that the plane is the seating surface of a chair type object. The measurement unit 246 is configured with instructions executable by the processor to generate and provide measurement data based on the 3D representation of the identified object and the particular plane detection using one or more processes further disclosed herein with reference to fig. 8.
While these elements are shown as residing on a single device (e.g., server 110), it should be understood that in other implementations, any combination of elements may reside in a single computing device. Furthermore, FIG. 2 is a functional description of various features that are more fully utilized in a particular implementation, as opposed to the structural schematic of the implementations described herein. As will be appreciated by one of ordinary skill in the art, the individually displayed items may be combined and some items may be separated. For example, some of the functional blocks shown separately in fig. 2 may be implemented in a single block, and the various functions of a single functional block may be implemented by one or more functional blocks in various implementations. The actual number of modules and the division of particular functions, as well as how features are allocated among them, will vary depending upon the particular implementation, and in some implementations, depend in part on the particular combination of hardware, software, and/or firmware selected for a particular implementation.
Fig. 3 is a block diagram of an example of a device 120 according to some implementations. While certain specific features are shown, those of ordinary skill in the art will appreciate from the disclosure that various other features are not shown for brevity and so as not to obscure more pertinent aspects of the implementations disclosed herein. To this end, as a non-limiting example, in some implementations, the device 120 includes one or more processing units 302 (e.g., microprocessors, ASIC, FPGA, GPU, CPU, processing cores, etc.), one or more input/output (I/O) devices and sensors 306, one or more communication interfaces 308 (e.g., USB, FIREWIRE, THUNDERBOLT, IEEE 802.3x, IEEE 802.11x, IEEE 802.16x, GSM, CDMA, TDMA, GPS, IR, BLUETOOTH, ZIGBEE, SPI, I2C, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 310, one or more AR/VR displays 312, one or more internally and/or externally facing image sensors 314, a memory 320, and one or more communication buses 304 for interconnecting these components and various other components.
In some implementations, one or more of the communication buses 304 include circuitry that interconnects and controls communications between system components. In some implementations, the one or more I/O devices and sensors 306 include at least one of: an Inertial Measurement Unit (IMU), an accelerometer, a magnetometer, a gyroscope, a thermometer, one or more physiological sensors (e.g., blood pressure monitor, heart rate monitor, blood oxygen sensor, blood glucose sensor, etc.), one or more microphones, one or more speakers, a haptic engine, one or more depth sensors (e.g., structured light, time of flight, etc.), and the like.
In some implementations, the one or more displays 312 are configured to present an experience to the user. In some implementations, one or more of the displays 312 correspond to holographic, Digital Light Processing (DLP), Liquid Crystal Displays (LCD), Liquid Crystal on Silicon (LCoS), Organic Light Emitting Field Effect Transistors (OLET), Organic Light Emitting Diodes (OLED), Surface Conduction Electron Emitter Displays (SED), Field Emission Displays (FED), Quantum Dot Light Emitting Diodes (QD-LED), Microelectromechanical Systems (MEMS), and/or similar display types. In some implementations, the one or more displays 312 correspond to diffractive, reflective, polarizing, holographic, etc. waveguide displays. For example, device 120 includes a single display. As another example, device 120 includes a display for each eye of the user.
In some implementations, the one or more image sensor systems 314 are configured to acquire image data corresponding to at least a portion of the physical environment 105. For example, the one or more image sensor systems 314 include one or more RGB cameras (e.g., with Complementary Metal Oxide Semiconductor (CMOS) image sensors or Charge Coupled Device (CCD) image sensors), monochrome cameras, IR cameras, event based cameras, and the like. In various implementations, the one or more image sensor systems 314 also include an illumination source, such as a flash, that emits light. In some implementations, the one or more image sensor systems 314 also include an on-camera Image Signal Processor (ISP) configured to perform a plurality of processing operations on the image data, including at least a portion of the processes and techniques described herein.
Memory 320 includes high-speed random access memory such as DRAM, SRAM, DDR RAM or other random access solid state memory devices. In some implementations, the memory 320 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memory 320 optionally includes one or more storage devices located remotely from the one or more processing units 302. Memory 320 includes a non-transitory computer-readable storage medium. In some implementations, the memory 320 or a non-transitory computer readable storage medium of the memory 320 stores the following programs, modules, and data structures, or a subset thereof, including an optional operating system 330 and one or more application programs 340.
Operating system 330 includes processes for handling various basic system services and for performing hardware-related tasks. In some implementations, the application 340 is configured to manage and coordinate one or more experiences of one or more users (e.g., a single experience of one or more users, or multiple experiences of a respective group of one or more users). The application 340 includes a 3D representation unit 342, an object detection unit 344, and a measurement unit 346. The 3D representation unit 342, the object detection unit 344 and the measurement unit 346 may be combined into a single application or unit or divided into one or more additional applications or units.
The 3D representation unit 342 is configured with instructions executable by the processor to obtain image data (e.g., light intensity data, depth data, etc.) and integrate (e.g., fuse) the image data using one or more of the techniques disclosed herein. For example, the 3D representation unit 342 fuses the RGB image from the light intensity camera with sparse depth maps from the depth camera (e.g., time-of-flight sensor) and other sources of physical environmental information to output a dense depth point cloud of information. In addition, the 3D representation unit 342 is configured with instructions executable by the processor to obtain light intensity image data (e.g., RGB) and perform a semantic segmentation algorithm to assign semantic tags to features identified in the image data and generate semantic image data (e.g., RGB-S) using one or more of the techniques disclosed herein. The 3D representation unit 342 is further configured with instructions executable by the processor to obtain light intensity image data (e.g., RGB) and depth image data, and generate a semantic 3D representation (e.g., a 3D point cloud with associated semantic tags) using one or more of the techniques disclosed herein. In some implementations, the 3D representation unit 342 includes separate units, such as an integration unit for generating 3D point cloud data, a semantic unit for semantic segmentation based on light intensity data (e.g., RGB-S), and a semantic 3D unit for generating a semantic 3D representation, as further described herein with reference to fig. 4.
The object detection unit 344 is configured with instructions executable by the processor to generate and display measurements of objects, determined using one of a plurality of class-specific neural networks, based on a 3D representation of the physical environment (e.g., 3D point cloud, 3D mesh reconstruction, semantic 3D point cloud, etc.) using one or more of the techniques disclosed herein. For example, the object detection unit 344 acquires a sequence of light intensity images from a light intensity camera (e.g., a real-time camera feed), a semantic 3D representation (e.g., a semantic 3D point cloud) generated by the 3D representation unit 342, and other sources of physical environmental information (e.g., camera positioning information from the camera's SLAM system). The object detection unit 344 may identify objects (e.g., furniture, appliances, etc.) in the sequence of light intensity images based on the semantic 3D representation, generate bounding boxes for each identified object, and perform post-processing using the fine-tuning neural network techniques further disclosed herein.
In some implementations, the object detection unit 344 includes separate units, such as an object detection neural network unit for identifying objects and generating proposed bounding boxes, an associated post-processing unit for fine-tuning the bounding box of each identified object, and an object classification neural network for classifying each type of object, as further discussed herein with reference to fig. 6 and 8.
The measurement unit 346 is configured with instructions executable by the processor to generate measurement data based on the 3D representation of the identified object (e.g., 3D point cloud, 3D mesh reconstruction, semantic 3D point cloud, etc.) using one or more techniques disclosed herein. For example, the measurement unit 346 acquires data associated with a bounding box (e.g., a classified and refined bounding box) of the identified object from the object detection unit 344. The measurement unit 346 is configured with instructions executable by the processor to generate and provide measurement data based on the 3D representation of the identified object using one or more processes further disclosed herein with reference to fig. 6 and 8.
In some implementations, the measurement unit 346 includes multiple machine learning units for each particular type of object. For example, a class 1 neural network for chairs, a class 2 neural network for desks, a class 3 neural network for televisions, etc. Multiple machine learning units may be trained on different subsets of objects such that measurement unit 346 may provide different types of measurements for each subset of objects (e.g., diameter of a circular table versus length and width of a rectangular table). The measurement unit 346 is configured with instructions executable by the processor to generate and provide measurement data based on the 3D representation of the identified objects for each subset of the neural network for each object class using one or more processes further disclosed herein with reference to fig. 6.
In some implementations, the measurement unit 346 includes a plane detection unit to identify planes within the bounding box that have semantics satisfying the surface criteria of the object in the 3D semantic data. For example, a number of chair voxels are identified within a horizontal plane indicating that the plane is the seating surface of a chair type object. The measurement unit 346 is configured with instructions executable by the processor to generate and provide measurement data based on the 3D representation of the identified object and the particular plane detection using one or more processes further disclosed herein with reference to fig. 8.
While these elements are shown as residing on a single device (e.g., device 120), it should be understood that in other implementations, any combination of elements may reside in a single computing device. Furthermore, FIG. 3 is a functional description of various features that are more fully utilized in a particular implementation, as opposed to the structural schematic of the implementations described herein. As will be appreciated by one of ordinary skill in the art, the individually displayed items may be combined and some items may be separated. For example, some of the functional blocks (e.g., application 340) shown separately in fig. 3 may be implemented in a single module, and the various functions of the single functional block may be implemented by one or more functional blocks in various implementations. The actual number of modules and the division of particular functions, as well as how features are allocated among them, will vary depending upon the particular implementation, and in some implementations, depend in part on the particular combination of hardware, software, and/or firmware selected for a particular implementation.
FIG. 4 is a system flow diagram of an exemplary environment 400 in which the system may generate a semantic 3D representation using 3D data and semantic segmentation data based on depth and light intensity image information detected in a physical environment. In some implementations, the system flow of the exemplary environment 400 is performed on a device (e.g., the server 110 or the device 120 of fig. 1-3) such as a mobile device, a desktop computer, a laptop computer, or a server device. The system flow of the exemplary environment 400 may be displayed on a device (e.g., the device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the system flow of the exemplary environment 400 is performed on processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the system flow of the exemplary environment 400 is performed on a processor executing code stored in a non-transitory computer readable medium (e.g., memory).
The system flow of the exemplary environment 400 captures image data of a physical environment (e.g., the physical environment 105 of fig. 1), and a 3D representation unit 410 (e.g., the 3D representation unit 242 of fig. 2 and/or the 3D representation unit 342 of fig. 3) generates a semantic 3D representation 445 representing surfaces in the 3D environment using a 3D point cloud with associated semantic tags. In some implementations, the semantic 3D representation 445 is a 3D reconstruction mesh generated by a meshing algorithm based on depth information detected in the physical environment, which is integrated (e.g., fused) to reconstruct the physical environment. A meshing algorithm (e.g., a dual marching cubes meshing algorithm, a Poisson meshing algorithm, a tetrahedral meshing algorithm, etc.) may be used to generate a mesh representing the room (e.g., physical environment 105) and/or objects within the room (e.g., walls 130, doors 150, televisions 152, chairs 140, tables 142, etc.). In some implementations, for mesh-based 3D reconstruction, a voxel hashing approach is used to effectively reduce the amount of memory used in the reconstruction process, in which 3D space is divided into blocks of voxels that are referenced by a hash table using their 3D locations as keys. The voxel blocks are allocated only around the object surface, freeing up memory that would otherwise be used to store empty space. Voxel hashing is also faster than competing methods, such as octree-based approaches. Furthermore, it supports data flow between the GPU, where memory is typically limited, and the CPU, where memory is more abundant.
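A minimal sketch of the voxel hashing idea, under an assumed block size and payload, is shown below: only blocks that contain observed surface points are allocated, and each block is looked up by hashing its integer 3D block coordinate.

```python
# Illustrative voxel-block hash table keyed by integer 3D block coordinates.
import numpy as np

BLOCK_SIZE = 0.08  # assumed edge length of a voxel block, in meters

def block_key(point):
    """Integer 3D block coordinate used as the hash-table key."""
    return tuple(np.floor(np.asarray(point) / BLOCK_SIZE).astype(int))

def integrate(points, table=None):
    """Allocate voxel blocks only around observed surface points."""
    table = {} if table is None else table
    for p in points:
        # Payload here is just the raw points; a real system would store TSDF voxels.
        table.setdefault(block_key(p), []).append(np.asarray(p))
    return table

surface_points = np.random.rand(1000, 3) * 4.0  # synthetic surface samples in a 4 m room
blocks = integrate(surface_points)
print(f"{len(blocks)} occupied voxel blocks allocated")
```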
In one exemplary implementation, environment 400 includes an image composition pipeline that collects or acquires data of a physical environment (e.g., image data from an image source). The exemplary environment 400 is an example of acquiring image data (e.g., light intensity data and depth data) for a plurality of image frames. The image sources may include a depth camera 402 that collects depth data 404 of the physical environment, and a light intensity camera 406 (e.g., an RGB camera) that collects light intensity image data 408 (e.g., a sequence of RGB image frames).
The 3D representation unit 410 includes an integration unit 420 configured with instructions executable by a processor to obtain image data (e.g., light intensity data 408, depth data 404, etc.) and integrate (e.g., fuse) the image data using one or more known techniques. For example, the image integration unit 420 receives the depth image data 404 and the intensity image data 408 from image sources (e.g., the light intensity camera 406 and the depth camera 402), and integrates the image data and generates 3D data 422. The 3D data 422 may include dense 3D point clouds 424 (e.g., incomplete depth maps and camera poses of a plurality of image frames surrounding the object) sent to a semantic 3D unit 440. Different sizes of gray points in the 3D point cloud 424 represent different depth values detected within the depth data. For example, the image integration unit 420 fuses the RGB image from the light intensity camera with sparse depth maps from the depth camera (e.g., time-of-flight sensor) and other sources of physical environmental information to output a dense depth point cloud of information. The 3D data 422 may also be voxelized, as represented by a voxelized 3D point cloud 426, where different shadows on each voxel represent different depth values.
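The integration step can be illustrated with a minimal back-projection from a single depth frame to world-space points, assuming a pinhole camera model and a known camera-to-world pose; the actual pipeline fuses many frames and additional sensor sources, and all parameter values below are assumptions.

```python
# Illustrative back-projection of one depth image into a world-space 3D point cloud.
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a depth image (meters) using pinhole intrinsics and a 4x4 pose."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    return (cam_to_world @ pts_cam.T).T[:, :3]

depth = np.full((120, 160), 1.5)   # toy depth frame: a flat surface 1.5 m away
pose = np.eye(4)                   # identity camera-to-world pose
cloud = depth_to_points(depth, fx=100.0, fy=100.0, cx=80.0, cy=60.0, cam_to_world=pose)
print(cloud.shape)
```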
The 3D representation unit 410 further includes a semantic unit 430 configured with instructions executable by the processor to obtain light intensity image data (e.g., light intensity data 408) and semantically segment wall structures (walls, doors, windows, etc.) and object types (e.g., table, teapot, chair, vase, etc.) using one or more known techniques. For example, the semantic unit 430 receives the intensity image data 408 from an image source (e.g., the light intensity camera 406) and generates semantic segmentation data 432 (e.g., RGB-S data). For example, semantic segmentation 434 illustrates a semantically tagged image of physical environment 105 in FIG. 1. In some implementations, the semantic unit 430 uses a machine learning model, where the semantic segmentation model may be configured to identify semantic tags for pixels or voxels of the image data. In some implementations, the machine learning model is a neural network (e.g., an artificial neural network), a decision tree, a support vector machine, a Bayesian network, or the like.
The 3D representation unit 410 further includes a semantic 3D unit 440 configured with instructions executable by the processor to obtain 3D data 422 (e.g., 3D point cloud data 424) from the integration unit 420 and semantic segmentation data 432 (e.g., RGB-S data) from the semantic unit 430, and generate a semantic 3D representation 445 using one or more techniques. For example, the semantic 3D unit 440 generates a semantically tagged 3D point cloud 447 by combining the 3D point cloud data 424 and the semantic segmentation 434 using a semantic 3D algorithm that fuses the 3D data and semantic tags. In some implementations, each semantic tag includes a confidence value. For example, a particular point may be marked as an object (e.g., a table), and the data point will include x, y, z coordinates and a confidence value as a decimal value (e.g., 0.9 to represent 90% confidence that the semantic tag correctly classifies the particular data point). In some implementations, a 3D reconstruction mesh may be generated as the semantic 3D representation 445.
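As a small illustration of the data produced at this stage, the sketch below pairs each 3D point with a semantic label and a confidence value, following the (x, y, z, label, confidence) example given above; the record layout and field names are otherwise assumptions.

```python
# Illustrative record for a semantically labeled 3D point with a confidence value.
from dataclasses import dataclass

@dataclass
class SemanticPoint:
    x: float
    y: float
    z: float
    label: str         # e.g., "table", "chair", "wall"
    confidence: float  # e.g., 0.9 means 90% confidence in the label

def fuse(points, labels, confidences):
    """Pair each 3D point with its per-point semantic label and confidence."""
    return [SemanticPoint(x, y, z, lab, conf)
            for (x, y, z), lab, conf in zip(points, labels, confidences)]

cloud = fuse([(0.1, 0.2, 0.45), (1.3, 0.9, 0.0)], ["table", "floor"], [0.9, 0.97])
print(cloud[0])
```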
FIG. 5 is a flowchart representation of an exemplary method 500 of providing measurement data for objects within a physical environment, according to some implementations. In some implementations, the method 500 is performed by a device (e.g., the server 110 or the device 120 of fig. 1-3) such as a mobile device, a desktop computer, a laptop computer, or a server device. The method 500 may be displayed on a device (e.g., the device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the method 500 is performed by processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the method 500 is performed by a processor executing code stored in a non-transitory computer readable medium (e.g., memory). Referring to fig. 6, an object measurement data creation process of a method 500 is shown.
At block 502, the method 500 obtains a 3D representation of a physical environment generated based on depth data and light intensity image data. For example, a user captures video while walking within a room to capture images of different portions of the room from multiple perspectives. The depth data may include pixel depth values from a viewpoint and sensor position and orientation data. In some implementations, depth data is acquired using one or more depth cameras. For example, one or more depth cameras may acquire depth based on Structured Light (SL), Passive Stereo (PS), Active Stereo (AS), Time of Flight (ToF), and the like. Various techniques may be applied to acquire depth image data that assigns a depth value to each portion of the image (e.g., at the pixel level). For example, voxel data (e.g., a grid pattern on a 3D grid, with values for length, width, and depth) may also contain multiple scalar values, such as opacity, color, and density. In some implementations, depth data is obtained from a sensor or from a 3D model of the image content. Some or all of the image content may be based on a real environment, such as the physical environment 105 surrounding the rendering device 120. The image sensor may capture an image of the physical environment 105 for inclusion in the image and depth information about the physical environment 105. In some implementations, a depth sensor (e.g., depth camera 402) on device 120 determines a depth value for a voxel that is determined based on an image captured by an image sensor on device 120. The physical environment 105 around the user may be 3D modeled based on one or more values (e.g., 3D point cloud 424), and the depth of objects depicted in subsequent images of the physical environment may be determined based on the model and camera location information (e.g., SLAM information).
At block 504, the method 500 generates a 3D bounding box corresponding to an object in the physical environment based on the 3D representation. For example, a 3D bounding box may provide the position, pose (e.g., orientation and position), and shape of each piece of furniture and each appliance in a room. The bounding box may be refined using RGB data and a novel multi-network fine-tuning technique (e.g., neural-network-based machine learning models for fine-tuning). The bounding box may also be refined using expansion and cropping techniques. In some implementations, generating the refined 3D bounding box includes generating a proposed 3D bounding box of the object using a first neural network, and generating the refined 3D bounding box by expanding the proposed 3D bounding box based on a bounding box expansion scale (e.g., expanding the bounding box by 10%), identifying features of the object within the expanded proposed bounding box using a second neural network, and refining the bounding box based on the identified features. In some implementations, the first neural network generates the proposed 3D bounding box based on 3D semantic data associated with the object. In some implementations, the second neural network identifies features of the object based on 3D semantic data associated with the object. In some implementations, a third neural network is trained to refine an accuracy of the identified features from the second neural network based on 3D semantic data associated with the object and light intensity image data acquired during the scanning process, and to output a further refined 3D bounding box based on the refined accuracy of the identified features from the second neural network.
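The expansion-and-refinement step described above can be illustrated with the following Python sketch, which expands a proposed axis-aligned box by a fixed scale (the 10% example above), crops the point cloud to the expanded box, and tightens the box around the points a stand-in feature network keeps; the helper names and the stand-in network are assumptions for illustration.

import numpy as np

def expand_box(box_min, box_max, scale=0.10):
    """Expand an axis-aligned 3D box by a fixed ratio (e.g., 10%)."""
    extent = box_max - box_min
    return box_min - scale * extent / 2, box_max + scale * extent / 2

def crop_points(points, box_min, box_max):
    """Keep only the points that fall inside the (expanded) box."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[inside]

def refine_box(points, proposed_min, proposed_max, feature_net):
    """Expand the proposal, crop the cloud, and tighten the box around object points."""
    exp_min, exp_max = expand_box(proposed_min, proposed_max)
    crop = crop_points(points, exp_min, exp_max)
    keep = feature_net(crop)                     # boolean mask of object points
    obj = crop[keep]
    return obj.min(axis=0), obj.max(axis=0)      # tightened axis-aligned box

# Stand-in "second network": here it simply keeps every cropped point.
pts = np.random.rand(1000, 3)
new_min, new_max = refine_box(pts, np.array([0.2, 0.2, 0.2]),
                              np.array([0.8, 0.8, 0.8]),
                              feature_net=lambda c: np.ones(len(c), dtype=bool))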
At block 506, the method 500 classifies the object based on the 3D bounding box and the 3D semantic data. For example, a class name or label is provided for each generated 3D bounding box. In some implementations, classifying the object based on the 3D semantic data includes determining a class of the object based on the 3D semantic data using an object classification neural network, and classifying a 3D bounding box corresponding to the object based on the determined class of the object. In some implementations, a neural network specific to a first class is trained to determine specific points on a first object class (e.g., chair) for measuring objects in the first class (e.g., the armrest length and the seat height of a chair). In some implementations, a neural network specific to a second class is trained to determine specific points on a second object class (e.g., table) for measuring objects in the second class (e.g., table height and table top dimensions specific to a circular or rectangular table top), wherein the second object class is different from the first object class. The measurements of objects in the second object class are different from the measurements of objects in the first object class. For example, a chair may involve more, or at least different, measurements than a table or television.
At block 508, the method 500 displays a measurement of the object (e.g., armrest length, seat height, television diameter, etc.), wherein the measurement data of the object is determined using one of a plurality of class-specific neural networks selected based on the classification of the object. For example, a first network is trained to determine a particular point on a chair for chair measurements, and a second network is trained to determine a different point on a table for table measurements. In use, a user may scan a room using a device (e.g., a smart phone), and the processes described herein will identify an object (e.g., a chair) and provide specific measurements of the object (e.g., chair height, seat height, base width, etc.). In some implementations, the measurement results may be automatically displayed on a user device overlaying the object or in close proximity to the object. In some implementations, the measurement results may be provided after some type of user interaction with the identified object. For example, a transparent bounding box surrounding the object may be shown to the user, and the user may select or click on the bounding box, and the measurement results will then be displayed.
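A minimal sketch of selecting one of a plurality of class-specific measurement networks based on the object's classification is shown below; the class names, routine names, and returned values are hypothetical placeholders, not measurements produced by the described networks.

# Hypothetical class-specific measurement routines keyed by the predicted class.
def measure_chair(points):
    # Placeholder values; a trained network would derive these from the points.
    return {"seat_height": 0.45, "overall_height": 0.90}

def measure_table(points):
    # Placeholder values; a trained network would derive these from the points.
    return {"table_height": 0.75, "top_width": 1.20}

CLASS_SPECIFIC_NETS = {"chair": measure_chair, "table": measure_table}

def measurements_for(object_class, object_points):
    """Pick the measurement routine that matches the object's class."""
    net = CLASS_SPECIFIC_NETS.get(object_class)
    if net is None:
        return {}                # fall back to plain bounding-box extents
    return net(object_points)

print(measurements_for("chair", object_points=None))  # placeholders ignore the points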
According to some implementations, the 3D bounding box is a refined 3D bounding box, and the method 500 further involves generating a proposed 3D bounding box using a first neural network, and generating a refined 3D bounding box of the object by identifying features of the object (e.g., low precision/high recall to generate features of the object) using a second neural network and refining the proposed 3D bounding box (e.g., high precision/low recall to refine an accuracy of the generated features and output the refined bounding box) using a third neural network based on the identified features. In some implementations, the first neural network generates the proposed 3D bounding box based on the 3D representation associated with the object. In some implementations, the second neural network identifies features of the object based on a 3D representation associated with the object and light intensity image data (e.g., RGB data) obtained during the scanning process. In some implementations, the third neural network is trained to refine an accuracy of the identified features from the second neural network based on 3D semantic data associated with the object and light intensity image data acquired during the scanning process, and to output a further refined 3D bounding box based on the refined accuracy of the identified features from the second neural network. In some implementations, the 3D bounding box provides position information, pose information (e.g., orientation and position information), and shape information for objects in the physical environment.
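The three-network cascade described above (proposal, high-recall feature identification, high-precision refinement) might be chained as in the following illustrative Python sketch; the callable names and signatures are assumptions rather than the actual network interfaces.

def bounding_box_cascade(semantic_cloud, rgb_frames,
                         proposal_net, feature_net, refine_net):
    """Chain the three networks described above into a single refinement pass."""
    proposed_box = proposal_net(semantic_cloud)                       # coarse proposal
    features = feature_net(semantic_cloud, proposed_box)              # high recall
    refined_box = refine_net(features, semantic_cloud, rgb_frames)    # high precision
    return refined_box

# Toy usage with stand-in callables in place of trained networks.
box = bounding_box_cascade(
    semantic_cloud=None, rgb_frames=None,
    proposal_net=lambda cloud: "proposed",
    feature_net=lambda cloud, box: "features",
    refine_net=lambda feats, cloud, rgb: "refined")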
In use, for process 500, a user may scan a room using a device (e.g., a smart phone), and the process described herein will identify an object (e.g., a chair) and provide one or more particular measurements of the object (e.g., chair height, seat height, base width, etc.). In some implementations, the measurement results may be automatically displayed on a user device overlaying the object or in close proximity to the object. In some implementations, the measurement results may be provided after some type of user interaction with the identified object. For example, a transparent bounding box surrounding the object may be shown to the user, and the user may select or click on the bounding box, and the measurement results will then be displayed.
Fig. 6A-6B are system flowcharts of an exemplary environment 600 in which a system may generate and provide measurement data for objects within a physical environment based on a 3D representation of the physical environment (e.g., a 3D point cloud, a 3D mesh reconstruction, a semantic 3D point cloud, etc.). In some implementations, the system flow of the exemplary environment 600 is performed on a device (e.g., the server 110 or the device 120 of fig. 1-3) such as a mobile device, a desktop computer, a laptop computer, or a server device. The system flow of the exemplary environment 600 may be performed on a device (e.g., device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the system flow of exemplary environment 600 is performed on processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the system flow of the exemplary environment 600 is performed on a processor executing code stored in a non-transitory computer readable medium (e.g., memory).
In the system flow of the exemplary environment 600, the object detection unit 610 (e.g., object detection unit 244 of fig. 2 and/or object detection unit 344 of fig. 3) acquires a semantic 3D representation (e.g., semantic 3D representation 445) from the semantic 3D unit 440, along with other sources of physical environment information (e.g., camera positioning information). Some implementations of the present disclosure may include a SLAM system, or the like. The SLAM system may include a multi-dimensional (e.g., 3D) laser scanning and ranging system that is GPS independent and provides real-time simultaneous localization and mapping. The SLAM system may generate and manage data for a very accurate point cloud generated by reflections of the laser scan from objects in the environment. Over time, the movement of any point in the point cloud is accurately tracked, so that the SLAM system can use the points in the point cloud as reference points for position, maintaining accurate knowledge of its position and orientation as it travels through the environment.
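For illustration, the following Python sketch shows how per-frame camera-space points could be placed into a common world frame using camera poses such as those a SLAM system provides; the pose format (rotation matrix plus translation) and the function names are assumptions.

import numpy as np

def to_world(points_cam, rotation, translation):
    """Transform camera-space points into the world frame using a camera pose."""
    return points_cam @ rotation.T + translation

def accumulate(frames):
    """Merge per-frame point clouds into one world-space cloud.

    frames: iterable of (points (N, 3), rotation (3, 3), translation (3,)) tuples.
    """
    return np.concatenate([to_world(p, R, t) for p, R, t in frames], axis=0)

# Two toy frames: identity pose, then a pose translated 1 m along z.
f1 = (np.random.rand(10, 3), np.eye(3), np.zeros(3))
f2 = (np.random.rand(10, 3), np.eye(3), np.array([0.0, 0.0, 1.0]))
world_cloud = accumulate([f1, f2])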
The object detection unit 610 includes an object detection neural network 620, an object fine-tuning unit 630, and an object classification neural network 640. The system flow of the exemplary environment 600 begins when the object detection unit 610 receives a semantic 3D representation (e.g., semantic 3D representation 445) at the object detection neural network 620, which generates proposed bounding boxes 625a, 625b, and 625c for the identified objects (e.g., table 142, chair 140, and television 152, respectively), the semantic 3D representation including 3D data (e.g., semantic tags at the pixel or voxel level) for the identified objects. The proposed bounding boxes 625a, 625b, and 625c are then refined by the object fine-tuning unit 630. The object fine-tuning unit 630 obtains the semantic 3D representation 445 data and the proposed bounding boxes 625a, 625b, and 625c and generates refined bounding boxes. The bounding boxes may be refined using expansion and cropping techniques. In some implementations, generating the refined 3D bounding box includes generating a proposed 3D bounding box of the object using a first neural network, and generating the refined 3D bounding box by expanding the proposed 3D bounding box based on a bounding box expansion scale (e.g., expanding the bounding box by 10%), identifying features of the object within the expanded proposed bounding box using a second neural network, and refining the bounding box based on the identified features. In some implementations, the first neural network generates the proposed bounding box based on 3D semantic data associated with the object. In some implementations, the second neural network identifies features of the object based on 3D semantic data associated with the object. In some implementations, a third neural network is trained to refine the accuracy of the identified features from the second neural network based on 3D semantic data associated with the object and light intensity image data (e.g., RGB data) obtained during the scanning process, and to output a refined bounding box. The object fine-tuning unit 630 outputs refined bounding boxes 635a, 635b, and 635c (e.g., for table 142, chair 140, and television 152, respectively). As shown in FIG. 6A, the refined bounding boxes 635a, 635b, and 635c are more accurate than the bounding boxes 625a, 625b, and 625c, respectively (e.g., the refined bounding box edges are closer to the surface of the object than the initially proposed bounding box).
The refined bounding boxes 635a, 635b, and 635c are then sent to an object classification neural network 640, which classifies the objects and associated bounding boxes based on the 3D semantic data. In some implementations, the object classification neural network 640 classifies each object into a subclass. For example, the 3D semantic data marks a particular point at the pixel or voxel level as a chair, and the classification neural network 640 may be trained to determine the type of chair or the features that the chair has (e.g., armrests, number of legs, square or circular seats, etc.). Then, the object classification neural network 640 outputs the classified objects and bounding box data to the semantic measurement unit 650, as further shown in fig. 6B.
The semantic measurement unit 650 (e.g., measurement unit 246 of fig. 2 and/or measurement unit 346 of fig. 3) obtains the classified bounding boxes and the semantic 3D data (e.g., semantic 3D representation 445) and generates measurements of the objects (e.g., armrest length, seat height, television diameter, etc.). Measurement data for an object is determined using one of a plurality of class-specific neural networks: the object-class 1 neural network 652A, the object-class 2 neural network 652B, the object-class 3 neural network 652C, through the object-class N neural network 652N (generally referred to herein as the object-class neural networks 652). For example, a first network (e.g., the object-class 1 neural network 652A) is trained to determine particular points on a table for table measurements. For example, table 142 is an exemplary coffee table, and specific measurements such as height, depth, base width, and table top width may be determined and provided for it. Alternatively, table 602 is an exemplary round table, such that the table top diameter would be an alternative measurement that would not be applicable to table 142. Other tables may require additional measurements (e.g., a pedestal table with multiple feet or legs at the center of the base). A second network (e.g., the object-class 2 neural network 652B) is trained to determine different specific points on a chair for chair measurements based on the subclass of chair. For example, the chair 140 is an exemplary dining or kitchen chair without armrests, and specific measurements such as seat height, base depth, base width, seat depth, total chair height, and seat back height may be determined and provided. Alternatively, the chair 604 is an exemplary office chair having one center post and five feet or legs, such that the leg length from the center post may be an alternative measurement that would not be applicable to the chair 140. Alternatively, the chair 606 is an exemplary armchair (e.g., a chair with armrests), such that measurements regarding the armrests of the chair (e.g., armrest length, armrest height, etc.) may be alternative measurements that would not be applicable to the chair 140. Other chair types may require additional measurements. A third network (e.g., the object-class 3 neural network 652C) is trained to determine different specific points on a television (e.g., television 152) for television measurements based on the subclass of television. For example, a television screen is typically measured diagonally, so the object-class 3 neural network 652C is trained to detect and provide the diagonal measurement in addition to standard depth, length, and height measurements (e.g., typical bounding box measurement information).
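As an illustrative sketch (not the trained networks described above), the following Python example shows how subclass-appropriate table measurements might be derived from an object's labeled points, reporting a top diameter for a round table and top width and depth for a rectangular one; the 5 cm table-top slice and the toy point cloud are assumptions.

import numpy as np

def table_measurements(points, subclass):
    """Return subclass-appropriate table measurements from the object's 3D points."""
    z = points[:, 2]
    height = z.max() - z.min()
    top = points[z > z.max() - 0.05]               # points near the table top
    if subclass == "round":
        # Diameter approximated by the largest horizontal extent of the top.
        diameter = float(max(np.ptp(top[:, 0]), np.ptp(top[:, 1])))
        return {"height": float(height), "top_diameter": diameter}
    return {"height": float(height),
            "top_width": float(np.ptp(top[:, 0])),
            "top_depth": float(np.ptp(top[:, 1]))}

pts = np.random.rand(500, 3) * [1.0, 1.0, 0.75]    # toy table-shaped cloud
print(table_measurements(pts, subclass="rectangular"))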
The semantic measurement unit 650 also includes an unclassified object unit 660 that provides measurements for objects that have bounding boxes but no class-specific neural network to generate object-specific measurements. For each received bounding box 665a, 665b, 665c, etc., the semantic measurement unit 650 will provide x, y, and z measurements (e.g., height, length, and width). For example, some objects, such as small appliances (e.g., a toaster), may not require measurements more specific than those provided by the bounding box.
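A minimal sketch of the generic fallback measurement, which reports only the height, length, and width implied by the bounding box extents, is shown below; the function name and the example extents are illustrative assumptions.

import numpy as np

def bounding_box_measurements(box_min, box_max):
    """Generic fallback: length, width, and height from the box extents alone."""
    extent = np.asarray(box_max) - np.asarray(box_min)
    return {"length": float(extent[0]),
            "width": float(extent[1]),
            "height": float(extent[2])}

print(bounding_box_measurements([0.0, 0.0, 0.0], [0.3, 0.25, 0.2]))  # e.g., a toaster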
The number of object class-specific networks is not limited to the examples provided herein and may include any number of object classes for which specific measurements may be required. For example, each type of appliance, furniture, or other object (e.g., a light, a gaming machine, etc.) present in a physical environment may require a particular object-class neural network 652. In addition, each object-class neural network 652 may be trained to identify any number of measurements and is not limited to the measurements shown in FIG. 6B.
In use, a user may scan a room using a device (e.g., a smart phone), and the process described herein with respect to the system flow of the example environment 600 will identify an object (e.g., a chair) and provide one or more particular measurements of the object (e.g., chair height, seat height, base width, etc.). In some implementations, the measurement data may be automatically displayed on a user device overlaying the object or in close proximity to the object. In some implementations, the measurement data may be provided after some type of user interaction with the identified object. For example, a transparent bounding box surrounding the object may be shown to the user, and the user may select or click on the bounding box, and the measurement results will then be displayed.
FIG. 7 is a flow chart representation of an exemplary method 700 of providing measurement data for an object within a physical environment. In some implementations, the method 700 is performed by a device (e.g., the server 110 or the device 120 of fig. 1-3) such as a mobile device, a desktop computer, a laptop computer, or a server device. The method 700 may be performed on a device (e.g., the device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the method 700 is performed by processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the method 700 is performed by a processor executing code stored in a non-transitory computer readable medium (e.g., memory). Referring to fig. 8, an object measurement data creation process of the method 700 is shown.
At block 702, the method 700 obtains a 3D representation of a physical environment generated based on depth data and light intensity image data. For example, a user captures video while walking within a room to capture images of different portions of the room from multiple perspectives. The depth data may include pixel depth values from a viewpoint as well as sensor position and orientation data. In some implementations, the depth data is acquired using one or more depth cameras. For example, one or more depth cameras may acquire depth based on structured light (SL), passive stereo (PS), active stereo (AS), time of flight (ToF), and the like. Various techniques may be applied to acquire depth image data that assigns a depth value to each portion of the image (e.g., at the pixel level). For example, voxel data (e.g., values on a 3D grid, with length, width, and depth) may also contain multiple scalar values, such as opacity, color, and density. In some implementations, depth data is obtained from a sensor or from a 3D model of the image content. Some or all of the image content may be based on a real environment, such as the physical environment 105 around the device 120. An image sensor may capture images of the physical environment 105 for inclusion in the 3D representation, along with depth information about the physical environment 105. In some implementations, a depth sensor (e.g., depth camera 402) on device 120 determines depth values for voxels that are determined based on images captured by an image sensor on device 120. The physical environment 105 around the user may be 3D modeled (e.g., as the 3D point cloud 424), and the depth of objects depicted in subsequent images of the physical environment may be determined based on the model and camera location information (e.g., SLAM information).
At block 704, the method 700 generates a 3D bounding box corresponding to an object in the physical environment based on the 3D representation. For example, a 3D bounding box may provide the location, pose (e.g., orientation and position), and shape of a particular piece of furniture or appliance in a room. The bounding box may be refined using RGB data and a novel multi-network fine-tuning technique (e.g., neural-network-based machine learning models for fine-tuning). The bounding box may also be refined using expansion and cropping techniques. In some implementations, generating the refined 3D bounding box includes generating a proposed 3D bounding box of the object using a first neural network, and generating the refined 3D bounding box by expanding the proposed 3D bounding box based on a bounding box expansion scale (e.g., expanding the bounding box by 10%), identifying features of the object within the expanded proposed bounding box using a second neural network, and refining the bounding box based on the identified features. In some implementations, the first neural network generates the proposed bounding box based on 3D semantic data associated with the object. In some implementations, the second neural network identifies features of the object based on 3D semantic data associated with the object. In some implementations, a third neural network is trained to refine the accuracy of the identified features from the second neural network based on 3D semantic data associated with the object and light intensity image data (e.g., RGB data) obtained during the scanning process, and to output a refined bounding box.
At block 706, the method 700 determines a class of the object based on the 3D semantic data. For example, a class name or label is provided for each generated 3D bounding box. For data points labeled as chairs in 3D semantic data, the 3D bounding box will be labeled as a chair. In some implementations, classifying the object based on the 3D semantic data includes determining a class of the 3D bounding box based on the 3D semantic data using an object classification neural network, and classifying the object corresponding to the 3D bounding box based on the classification of the 3D bounding box.
At block 708, the method 700 determines a location of a surface of the object based on the class of the object. The location is determined by identifying planes within the 3D bounding box that have semantics in the 3D semantic data that meet surface criteria for the object. For example, a number of chair voxels identified within a horizontal plane indicate that the horizontal plane is a seating surface of a chair-type object.
At block 710, the method 700 provides a measurement of an object. A measurement of the object is determined based on the location of the surface of the object determined at block 708. For example, measurements from the seat surface to the floor may be collected to provide a seat height measurement. For example, a user may scan a room using a device (e.g., a smart phone), and the processes described herein will identify an object (e.g., a chair) and provide a measurement of the identified object (e.g., seat height, etc.).
In some implementations, identifying a plane within the 3D bounding box includes identifying, based on a number of 3D data points (e.g., voxels) lying within a plane (e.g., a horizontal plane), that the plane is a surface of a particular feature of the object type (e.g., a chair seat). In some implementations, a surface of a particular feature of the object type (e.g., a chair seat) is identified based on the number of 3D data points (e.g., voxels) within a plane (e.g., a horizontal plane) as determined by comparison to a data point plane threshold. For example, the data point plane threshold may be a particular number of data points. In some implementations, the plane threshold is a percentage of data points relative to the other semantically labeled data points. For example, if 30% or more of the points lie on the same horizontal plane (i.e., at the same height level), it may be determined that the detected horizontal plane is the seat of the chair. In some implementations, different threshold percentages may be used for other object classifications. For example, a table would have a higher percentage of data points on the same horizontal plane. In some implementations, different detected planes may be used and compared to determine different features of the identified object.
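The percentage-based plane test described above might look like the following Python sketch, which bins the z-heights of the object's labeled points and accepts a horizontal plane when a single bin holds at least the threshold fraction (30% in the example above); the bin size and the toy point cloud are assumptions.

import numpy as np

def find_seat_plane(chair_points, bin_size=0.02, min_fraction=0.30):
    """Find a horizontal plane holding at least `min_fraction` of the points.

    chair_points: (N, 3) points already labeled as "chair" inside the box.
    Returns the plane height (z) or None if no height bin passes the threshold.
    """
    z = chair_points[:, 2]
    bins = np.arange(z.min(), z.max() + bin_size, bin_size)
    counts, edges = np.histogram(z, bins=bins)
    best = counts.argmax()
    if counts[best] / len(z) >= min_fraction:
        return float((edges[best] + edges[best + 1]) / 2)
    return None

# Toy cloud: 40% of the points clustered around z = 0.45 (a seat-like plane).
z = np.concatenate([np.full(400, 0.45), np.random.uniform(0, 0.9, 600)])
pts = np.column_stack([np.random.rand(1000), np.random.rand(1000), z])
print(find_seat_plane(pts))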
In use, a user may scan a room using a device (e.g., a smart phone), and the process described herein for method 700 will identify an object (e.g., a chair) and provide specific measurements of the object (e.g., chair height, seat height, base width, etc.). In some implementations, the measurement results may be automatically displayed on a user device overlaying the object or in close proximity to the object. In some implementations, the measurement results may be provided after some type of user interaction with the identified object. For example, a transparent bounding box surrounding the object may be shown to the user, and the user may select or click on the bounding box, and the measurement results will then be displayed.
FIG. 8 is a system flow diagram of an exemplary environment 800 in which a system may generate and provide measurement data for objects within a physical environment based on a 3D representation of the physical environment (e.g., 3D point cloud, 3D mesh reconstruction, semantic 3D point cloud, etc.). In some implementations, the system flow of the example environment 800 may be performed on a device (e.g., the device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the system flow of the exemplary environment 800 is performed on processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the system flow of the exemplary environment 800 is performed on a processor executing code stored in a non-transitory computer readable medium (e.g., memory).
In the system flow of the exemplary environment 800, the object detection unit 810 (e.g., object detection unit 244 of fig. 2 and/or object detection unit 344 of fig. 3) acquires a semantic 3D representation (e.g., semantic 3D representation 445) from the semantic 3D unit 440, along with other sources of physical environment information (e.g., camera positioning information). The object detection unit 810 includes an object detection neural network 820, an object fine-tuning unit 830, and an object classification neural network 840.
The system flow of the exemplary environment 800 begins when the object detection unit 810 receives a semantic 3D representation (e.g., semantic 3D representation 445) at the object detection neural network 820, which generates a proposed bounding box 825 for an identified object (e.g., chair 140), the semantic 3D representation including 3D data (e.g., semantic labels at the pixel or voxel level) for the identified object. The proposed bounding box 825 is then refined by the object fine-tuning unit 830. The object fine-tuning unit 830 obtains the semantic 3D representation 445 data and the proposed bounding box 825 and generates a refined bounding box 835. The bounding box may be refined using the expansion and cropping techniques described herein. The object fine-tuning unit 830 outputs the refined bounding box 835 associated with the chair 140. As shown in fig. 8, the refined bounding box 835 is more accurate (e.g., closer to the surface of the chair 140) than the bounding box 825.
The refined bounding box 835 is then sent to an object classification neural network 840, which classifies the object and associated bounding box based on the 3D semantic data. In some implementations, the object classification neural network 840 classifies each object into a subclass. For example, the 3D semantic data marks a particular point at the pixel or voxel level as a chair, and the classification neural network 840 may be trained to determine the type of chair or the features that the chair has (e.g., armrests, number of legs, square or circular seats, etc.). Then, the object classification neural network 840 outputs the classified objects and bounding box data to the semantic measurement unit 850.
The semantic measurement unit 850 (e.g., measurement unit 246 of fig. 2 and/or measurement unit 346 of fig. 3) obtains the classified bounding box and semantic 3D data (e.g., semantic 3D representation 445) and generates measurements of the object (e.g., seat height, television diameter, etc.). Measurement data for the object is determined by determining a position of a surface of the object based on the class of the object. The position is determined by identifying planes within the bounding box that have semantics in the 3D semantic data satisfying the surface criteria of the object. For example, a number of chair pixels or voxels identified within a horizontal plane indicate that the plane may be a seating surface of a chair-type object. As shown in fig. 8, a number of semantically labeled 3D points of the chair 140 lie on a horizontal plane 852 (e.g., the x, y coordinates vary across the horizontal plane 852, but each point has a similar z-height). The semantic measurement unit 850 determines that the horizontal plane 852 is the seat of the chair 140 based on the global coordinates of the 3D semantic data, rather than by training an additional neural network model to determine the chair seat using the techniques described herein (e.g., the object-class neural networks 652 in fig. 6).
The semantic measurement unit 850 may also provide measurements of the object based on the associated bounding box information. The semantic measurement unit 850 will provide x, y, and z measurements (e.g., height, length, and width) for each received bounding box. For example, the refined bounding box 835 will provide the overall height, base width, and base depth for the chair 140.
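For illustration, the following Python sketch combines a detected seat plane with the refined bounding box to produce chair measurements such as seat height, overall height, base width, and base depth; the function name and the numeric values are assumptions rather than outputs of the described units.

import numpy as np

def chair_measurements(box_min, box_max, seat_plane_z, floor_z=None):
    """Combine the detected seat plane with the refined box to measure a chair.

    floor_z defaults to the bottom of the refined bounding box.
    """
    if floor_z is None:
        floor_z = box_min[2]
    return {"seat_height": float(seat_plane_z - floor_z),
            "overall_height": float(box_max[2] - box_min[2]),
            "base_width": float(box_max[0] - box_min[0]),
            "base_depth": float(box_max[1] - box_min[1])}

print(chair_measurements(np.array([0.0, 0.0, 0.0]),
                         np.array([0.45, 0.50, 0.92]), seat_plane_z=0.46))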
In some implementations, the image synthesis pipeline may include virtual content (e.g., a virtual box placed on table 135 in fig. 1) generated for an extended reality (XR) environment. In some implementations, operating systems 230, 330 include built-in XR functionality, including, for example, an XR environment application or a viewer configured to be invoked from one or more applications 240, 340 to display an XR environment within a user interface. For example, the system described herein may include an XR unit configured with instructions executable by a processor to provide an XR environment that includes a depiction of a physical environment, including real physical objects and virtual content. The XR unit may generate virtual depth data (e.g., depth images of the virtual content) and virtual intensity data (e.g., light intensity images (e.g., RGB) of the virtual content). For example, one of application 240 for server 110 or application 340 for device 120 may include an XR unit configured with instructions executable by a processor to provide an XR environment that includes a depiction of a physical environment including real or virtual objects. For example, virtual objects may be located based on the detection, tracking, and representation of objects in 3D space relative to each other, based on stored 3D models of the real objects and virtual objects, using one or more of the techniques disclosed herein.
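As a loose illustration of merging virtual depth and intensity data with their real counterparts, the following Python sketch performs a per-pixel depth test and shows virtual pixels only where the virtual content is closer than the real scene; this compositing rule and the toy frames are assumptions, not the described image synthesis pipeline.

import numpy as np

def composite(real_rgb, real_depth, virtual_rgb, virtual_depth):
    """Per-pixel depth test: show virtual content where it is closer than the real scene."""
    virtual_wins = virtual_depth < real_depth
    out = real_rgb.copy()
    out[virtual_wins] = virtual_rgb[virtual_wins]
    return out

# Toy 2x2 frames: the virtual layer is closer only in the top-left pixel.
real_rgb = np.zeros((2, 2, 3), dtype=np.uint8)
virtual_rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
real_depth = np.full((2, 2), 2.0)
virtual_depth = np.array([[1.0, 3.0], [3.0, 3.0]])
print(composite(real_rgb, real_depth, virtual_rgb, virtual_depth)[0, 0])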
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, it will be understood by those skilled in the art that the claimed subject matter may be practiced without these specific details. In other instances, methods, devices, or systems known by those of ordinary skill have not been described in detail so as not to obscure the claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," or "identifying" or the like, refer to the action and processes of a computing device, such as one or more computers or similar electronic computing devices, that manipulate or transform data represented as physical, electronic, or magnetic quantities within the computing platform's memory, registers, or other information storage device, transmission device, or display device.
The one or more systems discussed herein are not limited to any particular hardware architecture or configuration. The computing device may include any suitable arrangement of components that provide results conditioned on one or more inputs. Suitable computing devices include multi-purpose microprocessor-based computer systems that access stored software that programs or configures the computing system from a general-purpose computing device to a special-purpose computing device that implements one or more implementations of the subject invention. The teachings contained herein may be implemented in software for programming or configuring a computing device using any suitable programming, scripting, or other type of language or combination of languages.
Implementations of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the above examples may be varied, e.g., the blocks may be reordered, combined, and/or divided into sub-blocks. Some blocks or processes may be performed in parallel.
The use of "adapted" or "configured to" herein is meant to be an open and inclusive language that does not exclude devices adapted or configured to perform additional tasks or steps. In addition, the use of "based on" is intended to be open and inclusive in that a process, step, calculation, or other action "based on" one or more of the stated conditions or values may be based on additional conditions or beyond the stated values in practice. Headings, lists, and numbers included herein are for ease of explanation only and are not intended to be limiting.
It will also be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first node may be referred to as a second node, and similarly, a second node may be referred to as a first node, without changing the meaning of the description, so long as all occurrences of "first node" are renamed consistently and all occurrences of "second node" are renamed consistently. The first node and the second node are both nodes, but they are not the same node.
The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of this specification and the appended claims, the singular forms "a," "an," and "the" are intended to cover the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term "if" may be interpreted to mean "when the prerequisite is true" or "in response to a determination" or "upon a determination" or "in response to detecting" that the prerequisite is true, depending on the context. Similarly, the phrase "if it is determined that the prerequisite is true" or "if it is true" or "when it is true" is interpreted to mean "when it is determined that the prerequisite is true" or "in response to a determination" or "upon determination" that the prerequisite is true or "when it is detected that the prerequisite is true" or "in response to detection that the prerequisite is true", depending on the context.
The foregoing description and summary of the invention should be understood to be in every respect illustrative and exemplary, but not limiting, and the scope of the invention disclosed herein is to be determined not by the detailed description of illustrative implementations, but by the full breadth permitted by the patent laws. It is to be understood that the specific implementations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.

Claims (18)

1. A method, the method comprising:
at an electronic device having a processor:
obtaining a 3D representation of a physical environment generated based on depth data and light intensity image data, wherein the 3D representation is associated with 3D semantic data;
generating a 3D bounding box corresponding to an object in the physical environment based on the 3D representation, wherein the 3D bounding box is a refined 3D bounding box, wherein generating the refined 3D bounding box comprises:
generating a proposed 3D bounding box of the object using a first neural network; and
generating the refined 3D bounding box by:
expanding the proposed 3D bounding box based on a bounding box expansion scale,
identifying features of the object of the expanded proposed 3D bounding box using a second neural network, and
refining the proposed 3D bounding box based on the identified features;
classifying the object based on the 3D bounding box and the 3D semantic data; and
displaying a measurement of the object, the measurement of the object determined using one of a plurality of class-specific neural networks selected based on the classification of the object.
2. The method of claim 1, wherein classifying the object based on the 3D bounding box and the 3D semantic data comprises:
determining a class of the 3D bounding box using an object classification neural network based on the 3D semantic data; and
classifying the object corresponding to the 3D bounding box based on the determined class of the 3D bounding box.
3. The method of claim 1, wherein a neural network specific to a first class is trained to determine specific points on a first object class for measuring objects in the first class.
4. A method according to claim 3, wherein a neural network specific to a second class is trained to determine specific points on a second object class for measuring objects in a second class, wherein the second object class is different from the first object class.
5. The method of claim 4, wherein the measurement of the object in the second object classification is different from the measurement of the object in the first object classification.
6. The method of claim 1, wherein the first neural network generates a proposed 3D bounding box based on the 3D semantic data associated with the object.
7. The method of claim 1, wherein the second neural network identifies the features of the object based on the 3D semantic data associated with the object.
8. The method of claim 1, wherein the third neural network is trained to:
refining an accuracy of the identified features from the second neural network based on the 3D semantic data and the light intensity image data associated with the object; and
a further refined 3D bounding box is output based on the refinement accuracy of the identified features from the second neural network.
9. The method of claim 1, wherein the 3D bounding box provides a position, orientation, and shape of the identified object.
10. The method of claim 1, wherein the 3D representation comprises a 3D point cloud and the associated 3D semantic data comprises semantic tags associated with at least a portion of 3D points within the 3D point cloud.
11. The method of claim 10, wherein the semantic tags identify walls, wall attributes, objects, and classifications of the objects of the physical environment.
12. An apparatus, the apparatus comprising:
a non-transitory computer readable storage medium; and
one or more processors coupled to the non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium includes program instructions that, when executed on the one or more processors, cause the apparatus to perform operations comprising:
obtaining a 3D representation of a physical environment generated based on depth data and light intensity image data, wherein the 3D representation is associated with 3D semantic data;
generating a 3D bounding box corresponding to an object in the physical environment based on the 3D representation, wherein the 3D bounding box is a refined 3D bounding box, wherein generating the refined 3D bounding box comprises:
generating a proposed 3D bounding box of the object using a first neural network; and
generating the refined 3D bounding box by:
expanding the proposed 3D bounding box based on a bounding box expansion scale,
identifying features of the object of the expanded proposed 3D bounding box using a second neural network, and
refining the proposed 3D bounding box based on the identified features;
classifying the object based on the 3D bounding box and the 3D semantic data; and
displaying a measurement of the object, the measurement of the object determined using one of a plurality of class-specific neural networks selected based on the classification of the object.
13. The apparatus of claim 12, wherein classifying the object based on the 3D bounding box and the 3D semantic data comprises:
determining a class of the 3D bounding box using an object classification neural network based on the 3D semantic data; and
classifying the object corresponding to the 3D bounding box based on the determined class of the 3D bounding box.
14. The apparatus of claim 12, wherein a neural network specific to a first class is trained to determine specific points on a first object class for measuring objects in the first class.
15. The apparatus of claim 14, wherein a second class-specific neural network is trained to determine specific points on a second object class for measuring objects in a second class, wherein the second object class is different from the first object class.
16. The apparatus of claim 15, wherein the measurement of an object in the second object classification is different from the measurement of an object in the first object classification.
17. The apparatus of claim 12, wherein the first neural network generates a proposed 3D bounding box based on the 3D semantic data associated with the object.
18. A non-transitory computer-readable storage medium storing program instructions that are executable on a computer to perform operations comprising:
obtaining a 3D representation of a physical environment generated based on depth data and light intensity image data, wherein the 3D representation is associated with 3D semantic data;
generating a 3D bounding box corresponding to an object in the physical environment based on the 3D representation, wherein the 3D bounding box is a refined 3D bounding box, wherein generating the refined 3D bounding box comprises:
generating a proposed 3D bounding box of the object using a first neural network; and
generating the refined 3D bounding box by:
expanding the proposed 3D bounding box based on a bounding box expansion scale,
identifying features of the object of the expanded proposed 3D bounding box using a second neural network, and
refining the proposed 3D bounding box based on the identified features;
classifying the object based on the 3D bounding box and the 3D semantic data; and
displaying a measurement of the object, the measurement of the object determined using one of a plurality of class-specific neural networks selected based on the classification of the object.
CN202110053465.3A 2020-01-17 2021-01-15 Automatic measurement based on object classification Active CN113137916B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202062962494P 2020-01-17 2020-01-17
US62/962,494 2020-01-17
US17/148,965 US11574485B2 (en) 2020-01-17 2021-01-14 Automatic measurements based on object classification
US17/148,965 2021-01-14

Publications (2)

Publication Number Publication Date
CN113137916A CN113137916A (en) 2021-07-20
CN113137916B true CN113137916B (en) 2023-07-11

Family

ID=76810398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110053465.3A Active CN113137916B (en) 2020-01-17 2021-01-15 Automatic measurement based on object classification

Country Status (1)

Country Link
CN (1) CN113137916B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016037300A1 (en) * 2014-09-10 2016-03-17 Xiaoou Tang Method and system for multi-class object detection
CN108121997A (en) * 2016-11-29 2018-06-05 Sap欧洲公司 Use the object classification in the image data of machine learning model
CN108241870A (en) * 2016-12-23 2018-07-03 赫克斯冈技术中心 For distributing specific class method for distinguishing interested in measurement data
CN109325933A (en) * 2017-07-28 2019-02-12 阿里巴巴集团控股有限公司 A kind of reproduction image-recognizing method and device
KR20190110965A (en) * 2019-09-11 2019-10-01 엘지전자 주식회사 Method and apparatus for enhancing image resolution

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3495771A1 (en) * 2017-12-11 2019-06-12 Hexagon Technology Center GmbH Automated surveying of real world objects
US11189078B2 (en) * 2018-06-20 2021-11-30 Google Llc Automated understanding of three dimensional (3D) scenes for augmented reality applications
US10610130B2 (en) * 2018-06-29 2020-04-07 Intel Corporation Measuring limb range of motion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016037300A1 (en) * 2014-09-10 2016-03-17 Xiaoou Tang Method and system for multi-class object detection
CN106688011A (en) * 2014-09-10 2017-05-17 北京市商汤科技开发有限公司 Method and system for multi-class object detection
CN108121997A (en) * 2016-11-29 2018-06-05 Sap欧洲公司 Use the object classification in the image data of machine learning model
CN108241870A (en) * 2016-12-23 2018-07-03 赫克斯冈技术中心 For distributing specific class method for distinguishing interested in measurement data
CN109325933A (en) * 2017-07-28 2019-02-12 阿里巴巴集团控股有限公司 A kind of reproduction image-recognizing method and device
KR20190110965A (en) * 2019-09-11 2019-10-01 엘지전자 주식회사 Method and apparatus for enhancing image resolution

Also Published As

Publication number Publication date
CN113137916A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
EP3876206B1 (en) Floorplan generation based on room scanning
US20240062488A1 (en) Object centric scanning
US11763479B2 (en) Automatic measurements based on object classification
US11935187B2 (en) Single-pass object scanning
US11328481B2 (en) Multi-resolution voxel meshing
KR20170073623A (en) Fast 3d model fitting and anthropometrics
US11763477B2 (en) Person height estimation
US11727675B2 (en) Object detection with instance detection and general scene understanding
US20220358739A1 (en) Method, device and computer program product for manipulating virtual bounding volumes
US11763478B1 (en) Scan-based measurements
US11640692B1 (en) Excluding objects during 3D model generation
CN113140032A (en) Floor plan generation based on room scanning
CN112561071A (en) Object relationship estimation from 3D semantic mesh
CN113137916B (en) Automatic measurement based on object classification
US11893207B2 (en) Generating a semantic construction of a physical setting
CN113139992A (en) Multi-resolution voxel gridding
US11783558B1 (en) Object correction using scene graphs
US20230290078A1 (en) Communication sessions using object information
CN118244887A (en) User location determination based on object interactions
CN113397526A (en) Human body height estimation
CN112465988A (en) Object detection with instance detection and general scene understanding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant