WO2023063937A1 - Methods and systems for detecting planar regions using predicted depth - Google Patents

Methods and systems for detecting planar regions using predicted depth

Info

Publication number
WO2023063937A1
Authority
WO
WIPO (PCT)
Prior art keywords
plane
image
planes
feature points
subset
Prior art date
Application number
PCT/US2021/054697
Other languages
French (fr)
Inventor
Yuxin MA
Pan JI
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/054697
Publication of WO2023063937A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/579 Depth or shape recovery from multiple images from motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • This application relates generally to image data processing including, but not limited to, methods, systems, and non-transitory computer-readable media for identifying planes in an image based on image data and inertial sensor data.
  • Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation.
  • feature points of a scene are detected and applied to map a three-dimensional (3D) virtual space corresponding to the scene.
  • the feature points can be conveniently and accurately mapped to the 3D virtual space using an optical camera and an Inertial Measurement Unit (IMU) that exist in many mobile devices.
  • SLAM-based map points are tracked across many image frames; however, in areas with poor texture or self-repeating patterns, it is challenging for SLAM to detect high-quality feature points and build the 3D virtual space.
  • a planar area must include a sufficient density of feature points to be accurately detected.
  • In indoor scenes (e.g., office and household living space), detection of approximate plane boundaries is possible, while it is difficult to achieve precise results. It would be beneficial to have a plane detection mechanism for such environments that is more efficient and accurate than the current practice.
  • a method is implemented at an electronic device for identifying planes in an image.
  • the electronic device includes a camera and an inertial measurement unit (IMU).
  • the method includes obtaining a plurality of image frames captured by the camera and a plurality of motion data captured by the IMU, and the electronic device is disposed in a scene having a plurality of planes.
  • the method further includes determining a plurality of feature points in the scene based on the plurality of image frames and the plurality of motion data, generating a depth map from a first image, and generating a plane normal map from the depth map of the first image.
  • the method further includes identifying a subset of the plurality of planes associated with the first image based on the plane normal map and the plurality of feature points of the scene. In some embodiments, the method further includes identifying a plurality of plane normal vectors associated with the plurality of planes existing in the scene based on the plurality of feature points in the scene. The subset of the plurality of planes associated with the first image is identified based on the plane normal map and the plurality of plane normal vectors that are identified based on the plurality of feature points.
  • some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 is a flowchart of a process for processing inertial sensor data and image data of an electronic system using a SLAM module, in accordance with some embodiments.
  • Figure 6 is a simplified flow diagram of a plane identification process, in accordance with some embodiments.
  • Figure 7 is a detailed flow diagram of a plane identification process, in accordance with some embodiments.
  • Figure 8 is a flowchart of a method for identifying planes in an image, in accordance with some embodiments.
  • Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and digital objects.
  • planar regions are detected, tracked, and represented with geometric modeling.
  • Image data and inertial motion data are collected and processed using SLAM techniques, enabling a six-degrees-of-freedom (DOF) tracking system to identify stationary feature points distributed in a scene in many mobile AR systems.
  • the six DOFs include 3D translational movement and 3D rotation.
  • the image data used in SLAM can be used in monocular depth estimation using deep learning techniques. This application is directed to identifying planes in the scene based on the feature points and depth information tracked by the SLAM techniques and deep learning techniques, respectively.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by another client device 104 or the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • the content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100.
  • the AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D.
  • Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras 260, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (Global Positioning System) receiver or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes the IMU 280 for collecting inertial sensor data (also called motion data) of the client device 104 disposed in a scene.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 206 optionally, includes one or more storage devices remotely located from one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 (e.g., applied in a plane detection process 600 in Figure 6) for processing content data using data processing models 248 (e.g., a CNN depth network 718 in Figure 7), thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • Pose determination and prediction module 230 for determining and predicting a device pose of the client device 104 (e.g., an HMD 104D) based on images captured by the camera 260 and motion data captured by the IMU 280, where in some embodiments, the device pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes a SLAM module 232 for applying image data captured by the camera 260 and motion data measured by the IMU 280 to map a scene where the client device 104 is located and identify a device pose of the client device 104 within the scene;
  • Plane identification module 234 that operates to generate a plane normal map from an image jointly with the data processing module 228, combine the feature points identified by the SLAM module 232 and the plane normal map, and identify plane candidates in the image; and
  • One or more databases 238 for storing at least data including one or more of:
    o Device settings 240 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 242 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 244 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 246 for training one or more data processing models 248;
    o Data processing model(s) 248 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 248 include at least a CNN depth network for generating a depth map from an image (e.g., an RGB or grayscale image); and
    o Content data and results 250 that are obtained by and outputted to the
  • the one or more databases 238 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 238 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 248 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 248 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 248 and a data processing module 228 for processing the content data using the data processing model 248.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 248 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 248 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 248, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 248 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 248 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 248 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 248.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 248, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 248 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 248 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • the node output is provided via one or more links 412 to one or more other nodes 420
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
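  • By way of illustration only, the propagation function described above can be sketched as a non-linear activation applied to a linear weighted combination of the node inputs; the ReLU activation and the numeric weights below are arbitrary choices for this example, not values prescribed by this disclosure.

```python
import numpy as np

def node_output(inputs, weights):
    """Propagation function of a node: a non-linear activation applied to a
    linear weighted combination of the node inputs."""
    weighted_sum = np.dot(weights, inputs)       # combine the inputs by weights w1..w4
    return np.maximum(weighted_sum, 0.0)         # ReLU chosen as the activation here

x = np.array([0.2, -1.0, 0.5, 0.8])              # node inputs
w = np.array([0.4, 0.1, -0.3, 0.7])              # weights w1, w2, w3, w4
print(node_output(x, w))                         # approximately 0.39
```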
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 248 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
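  • The minimal sketch below (written with PyTorch purely for illustration) shows how convolutional layers with small receptive areas and max pooling abstract a pre-processed image into feature maps; the layer sizes are arbitrary and do not represent the CNN depth network 718 described later.

```python
import torch
import torch.nn as nn

# Minimal convolutional stack: each node of a convolutional layer sees only a small
# receptive area of the previous layer, and max pooling down-samples the feature maps.
# The layer sizes are arbitrary and purely illustrative.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),  # 5x5 receptive area per output node
    nn.ReLU(),
    nn.MaxPool2d(2),                             # max pooling over 2x2 neighborhoods
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

image = torch.randn(1, 3, 224, 224)              # pre-processed RGB image tensor
feature_map = cnn(image)                         # image abstracted into a feature map
print(feature_map.shape)                         # torch.Size([1, 32, 56, 56])
```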
  • a recurrent neural network is applied in the data processing model 248 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition.
  • the training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • during forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • during backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
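  • A toy example of the training process, with forward propagation, a loss function, backward propagation, and a network bias term, is sketched below; the layer size, activation, and learning rate are illustrative assumptions rather than parameters of this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                  # 8 training items with 4 inputs each
true_w = rng.normal(size=(4, 1))
y = np.tanh(x @ true_w)                      # ground truth for each training item

W = np.zeros((4, 1))                         # weights w of the layer
b = np.zeros((1,))                           # network bias term b

for step in range(2000):
    # Forward propagation: weighted combination plus bias, then activation.
    out = np.tanh(x @ W + b)
    loss = np.mean((out - y) ** 2)           # margin of error (loss function)
    # Backward propagation: adjust weights and bias to decrease the error.
    grad_z = 2.0 * (out - y) * (1.0 - out ** 2) / len(x)
    W -= 0.1 * (x.T @ grad_z)
    b -= 0.1 * grad_z.sum(axis=0)

print(float(loss))                           # decreases toward zero as training converges
```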
  • FIG. 5 is a flowchart of a process 500 for processing inertial sensor data (also called motion data) and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module (e.g., 232 in Figure 2), in accordance with some embodiments.
  • the process 500 includes measurement preprocessing 502, initialization 504, local visual-inertial odometry (VIO) with relocation 506, and global pose graph optimization 508.
  • a camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (510) from the image data.
  • An inertial measurement unit (IMU) 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the camera 260 capturing the image data, and the inertial sensor data are pre-integrated (512) to provide pose data.
  • the image data captured by the camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (514).
  • Vision-only structure from motion (SfM) techniques 514 are applied (516) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the camera 260.
  • a sliding window 518 and associated states from a loop closure 520 are used to optimize (522) a VIO.
  • when the VIO corresponds (524) to a keyframe of a smooth video transition and a corresponding loop is detected (526), features are retrieved (528) and used to generate the associated states from the loop closure 520.
  • a multi-degree-of-freedom (multiDOF) pose graph is optimized (530) based on the states from the loop closure 520, and a keyframe database 532 is updated with the keyframe associated with the VIO.
  • the features that are detected and tracked (510) are used to monitor (534) motion of an object in the image data and estimate image-based poses 536, e.g., according to the image frame rate.
  • the inertial sensor data that are pre-integrated (513) may be propagated (538) based on the motion of the object and used to estimate inertial-based poses 540, e.g., according to the sampling frequency of the IMU 280.
  • the image-based poses 536 and the inertial-based poses 540 are stored in a pose data buffer and used by the SLAM module 232 to estimate and predict poses.
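  • As a simplified illustration of how inertial-based poses can be propagated between image-based pose updates, the sketch below integrates gyroscope and accelerometer samples with a first-order update; gravity compensation and sensor biases are omitted, and all names and values are illustrative assumptions.

```python
import numpy as np

def propagate_pose(position, velocity, rotation, gyro, accel, dt):
    """Propagate an inertial-based pose by one IMU sample.

    rotation is a 3x3 world-from-body matrix; gyro (rad/s) and accel (m/s^2) are
    body-frame measurements. Gravity compensation and biases are omitted here."""
    # First-order (small-angle) update of the rotation from the angular rate.
    wx, wy, wz = gyro * dt
    skew = np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])
    rotation = rotation @ (np.eye(3) + skew)
    # Integrate acceleration (rotated into the world frame) into velocity/position.
    accel_world = rotation @ accel
    position = position + velocity * dt + 0.5 * accel_world * dt ** 2
    velocity = velocity + accel_world * dt
    return position, velocity, rotation

# Propagate at the IMU rate (e.g., 1000 Hz) between 30 Hz image-based poses.
p, v, R = np.zeros(3), np.zeros(3), np.eye(3)
for _ in range(33):                          # roughly one camera frame of IMU samples
    p, v, R = propagate_pose(p, v, R,
                             gyro=np.array([0.0, 0.0, 0.1]),
                             accel=np.array([0.1, 0.0, 0.0]),
                             dt=0.001)
print(p)                                     # inertial-based position estimate
```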
  • the SLAM module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 536 to estimate and predict more poses.
  • SLAM high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280.
  • the imaging sensors (e.g., camera 260, lidars) provide image data desirable for pose estimation, and oftentimes operate at a low frequency (e.g., 30 frames per second) and with a large latency (e.g., 30 millisecond).
  • the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., ⁇ 0.1 millisecond).
  • Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement that occurs after the image is rendered.
  • ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames.
  • relevant image data and inertial sensor data are stored locally such that they can be synchronized and used for pose estimation/predication.
  • the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally very convenient for use.
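  • The buffering and synchronization described above is sketched below in Python (rather than the C++ STL containers mentioned), matching each image frame with the inertial samples recorded up to its timestamp; the container layout and timestamp values are illustrative assumptions.

```python
from collections import deque

class SensorBuffer:
    """Buffer image frames and inertial samples so they can be synchronized."""

    def __init__(self):
        self.images = deque()            # (timestamp, image) pairs, oldest first
        self.imu = deque()               # (timestamp, sample) pairs, oldest first

    def push_image(self, t, image):
        self.images.append((t, image))

    def push_imu(self, t, sample):
        self.imu.append((t, sample))

    def pop_synchronized(self):
        """Return the oldest image together with all IMU samples up to its timestamp."""
        if not self.images:
            return None
        t_img, image = self.images.popleft()
        samples = []
        while self.imu and self.imu[0][0] <= t_img:
            samples.append(self.imu.popleft()[1])
        return t_img, image, samples

buf = SensorBuffer()
for k in range(34):                      # 1 kHz inertial samples, timestamps in microseconds
    buf.push_imu(k * 1000, {"gyro": (0.0, 0.0, 0.0)})
buf.push_image(33000, "frame_0")         # ~30 FPS camera frame
t, image, samples = buf.pop_synchronized()
print(t, image, len(samples))            # 33000 frame_0 34
```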
  • FIG. 6 is a simplified flow diagram of a plane identification process 600, in accordance with some embodiments.
  • the process 600 is implemented at an electronic system (e.g., a data processing system 200), e.g., including an HMD 104D, another client device 104, and a server 102.
  • the HMD 104D has a camera 260 configured to collect image data and an IMU 280 configured to collect motion data (also called inertial sensor data), and is disposed in a scene.
  • This scene has a plurality of planes (e.g., wall, table top, window, door).
  • the process 600 is implemented at the HMD 104D directly.
  • the image and motion data are provided to a distinct client device 104 (e.g., a mobile device 104C) coupled to the HMD 104D, e.g., via a wired or wireless link, and the process 600 is implemented at the client device 104.
  • the image and motion data are provided to a server 102, and the process 600 is implemented at the server 102.
  • the plane identification process 600 includes a plane detection stage 600A, a composition and validation stage 600B, and a plane tracking stage 600C.
  • the electronic system obtains a plurality of image frames 602 captured by the camera 260 of the HMD 104D and a plurality of motion data 604 captured by the IMU 280 of the HMD 104D.
  • a plurality of feature points 606 (also called map points) are determined based on the plurality of image frames 602 and the plurality of motion data 604 using a SLAM module 232.
  • the plurality of feature points 606 map the scene to a 3D virtual space. Each plane of the scene corresponds to a subset of the plurality of feature points 606.
  • In the composition and validation stage 600B, the electronic system generates a depth map 608 from a first image 602A, e.g., using a convolutional neural network (CNN) depth network.
  • the plurality of image frames 602 includes the first image 602A.
  • the plurality of image frames 602 does not include the first image 602A.
  • the first image 602A temporally follows the plurality of image frames 602.
  • a plane normal map 610 is generated from the depth map 608 of the first image 602A.
  • the depth map 608 optionally has the same resolution as, or a smaller resolution than, the first image 602A, and each element of the depth map 608 has a depth value corresponding to one pixel or a set of adjacent pixels of the first image 602A.
  • the depth map 608 corresponds to a confidence map having the same resolution as the depth map 608, and each element of the confidence map indicates a confidence level of a validity of the depth value at a corresponding element of the depth map 608.
  • the plane normal map 610 optionally has the same resolution as the depth map 608, and each element of the plane normal map 610 includes a vector representing a normal direction of a plane containing the corresponding one pixel or set of adjacent pixels of the first image 602A.
  • the electronic system identifies one or more plane candidates 612 (i.e., a subset of the plurality of planes) associated with the first image 602A based on the plane normal map 610 and the plurality of feature points 606 of the scene.
  • a plurality of planes 614 existing in the scene are associated with a plurality of plane normal vectors 616 that represent normal directions of the planes 614.
  • the plane normal vectors 616 (also called plane normal directions) are identified based on the plurality of feature points 606 in the scene.
  • the subset of the plurality of planes (i.e., the plane candidate(s) 612) associated with the first image 602A is identified based on the plane normal map 610 and the plurality of plane normal vectors 616 that are identified based on the plurality of feature points 606.
  • the subset of the plurality of feature points has a limited number of feature points (15 feature points) that define a boundary of the first plane 614A up to a boundary resolution that is lower than a threshold resolution.
  • the plurality of feature points 606 indicate an existence of the first plane 614A but cannot identify a size or a boundary of the first plane 614A in the scene. Rather, the plane normal map 610 provides information of normal directions at all or a subset of pixels of the first image 602A, thereby indicating where the boundary of the first plane 614A is located.
  • each of the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614) associated with the first image 602A is defined by a plane boundary and a plane orientation.
  • the plane boundary of the respective plane candidate 612 is determined in the scene from the plane normal map 610 of the first image 602A, and the plane orientation of the respective plane candidate 612 is determined in the scene from a respective plane normal vector 616.
  • the plurality of planes 614 and the plurality of feature points 606 are tracked for the scene.
  • a subset of feature points 606A are identified for the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614) associated with the first image 602A.
  • the subset of feature points 606A are compared (618) with the plurality of feature points 606.
  • the electronic system updates the plurality of planes 614 of the scene with the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614) associated with the first image 602A.
  • the subset of feature points 606A is entirely enclosed in the plurality of feature points 606, and the one or more plane candidates 612 are included in the plurality of planes 614.
  • the subset of feature points 606A is partially or entirely excluded from the plurality of feature points 606, and the one or more plane candidates 612 are partially or entirely excluded from the plurality of planes 614.
  • the plurality of planes 614 are updated (620) to include the one or more plane candidates 612, and plane parameters of the one or more plane candidates 612 are merged with those of the plurality of planes 614.
  • the first image 602A includes a single RGB image 602A, and a CNN monocular depth prediction network is applied to determine the depth map 608 with the single RGB image 602A.
  • the depth map 608 includes boundary information and dense and smooth surface normals of a subset of the plurality of planes 614.
  • the plurality of feature points 606 obtained by the SLAM module 232 provide a higher accuracy level of depth information than the depth map 608, and can complement the CNN-based depth map 608 to achieve a high plane parameter accuracy and mitigate a tracking drift.
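  • One common way to combine the two sources, shown here only as an illustration and not necessarily the fusion claimed in this disclosure, is to rescale the CNN-predicted depth using the metric depths of SLAM map points projected into the image.

```python
import numpy as np

def align_depth_scale(cnn_depth, map_point_uv, map_point_depth):
    """Rescale a relative CNN depth map with sparse, metric SLAM map point depths.

    cnn_depth: HxW predicted depth map; map_point_uv: Nx2 integer pixel coordinates
    of the projected map points; map_point_depth: N metric depths from SLAM."""
    u, v = map_point_uv[:, 0], map_point_uv[:, 1]
    predicted = cnn_depth[v, u]                        # sample CNN depth at the map points
    scale = np.median(map_point_depth / predicted)     # robust per-image scale factor
    return cnn_depth * scale

cnn_depth = np.full((4, 6), 2.0)                       # toy relative depth map
uv = np.array([[1, 1], [4, 2]])                        # projected map point pixels (u, v)
metric = np.array([3.0, 3.2])                          # SLAM depths at those pixels
print(align_depth_scale(cnn_depth, uv, metric)[0, 0])  # 2.0 rescaled to about 3.1
```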
  • a plane detection pipeline includes at least three stages: plane detection 600A, composition and validation 600B, and plane tracking 600C.
  • In the plane detection stage 600A, the presence of planes in the scene is detected via the feature points 606 mapped by SLAM.
  • An advantage of map point plane detection is that map points are sparse, and plane detection demands a limited amount of computational resources.
  • Plane detection 600A is simplified to only detect points aligned in the same normal direction instead of all of the plane parameters, further reducing computation resources.
  • the corresponding RGB image 602A is fed to a depth prediction network to estimate the depth map 608.
  • the depth map 608 is used to validate planes 614, estimate plane parameters, and build plane boundaries.
  • FIG. 7 is a detailed flow diagram of a plane identification process 700, in accordance with some embodiments.
  • the plane identification process 700 includes a plane detection stage 600A, a composition and validation stage 600B, and a plane tracking stage 600C.
  • the electronic system detects a presence of a plane 614. Estimation of plane parameters is deferred until the composition and validation stage 600B. The presence of a single plane 614 is detected by determining that a set of adjacent map points 606 faces the same direction.
  • the map points 606 identified via SLAM in the first image 602A are projected (702) onto a two-dimensional (2D) image plane. Delaunay triangulation is performed on the 2D points to form (704) a mesh structure including a 2D mesh 706.
  • the 2D mesh 706 is back-projected into a 3D mesh 708 using the camera's intrinsic projective geometry. Smoothing techniques 710 (e.g., Laplacian smoothing) are applied to the 3D mesh 708.
  • a normal direction 616 (also called a normal vector) of each map point 606 is estimated from the edges connecting the map point with its first-ring neighbors. This is accomplished by computing the surface normals of the triangles formed around the current map point 606 and averaging these surface normals.
  • The presence of vertical planes is detected (714) by filtering the map points 606 by the normal directions 616 perpendicular to a gravity direction 716 and applying a one-point RANSAC algorithm to find a large ratio of map points sharing a similar normal direction 616.
  • The presence of horizontal planes is detected by filtering for the normal directions 616 parallel to the gravity direction 716.
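  • A sketch of the map point normal estimation and gravity-based plane direction detection described above is given below, using scipy's Delaunay triangulation; the triangulation library, the camera-facing orientation rule, and the angular thresholds are illustrative assumptions, and the one-point RANSAC step is omitted.

```python
import numpy as np
from scipy.spatial import Delaunay

def map_point_normals(points_3d, points_2d):
    """Per-point normal directions from a mesh built on the projected map points."""
    tri = Delaunay(points_2d)                           # 2D mesh of the projected points
    normals = np.zeros_like(points_3d)
    for a, b, c in tri.simplices:                       # triangles of the back-projected 3D mesh
        n = np.cross(points_3d[b] - points_3d[a], points_3d[c] - points_3d[a])
        if np.dot(n, points_3d[a]) > 0:                 # orient each normal toward the camera origin
            n = -n
        normals[[a, b, c]] += n / (np.linalg.norm(n) + 1e-9)
    return normals / (np.linalg.norm(normals, axis=1, keepdims=True) + 1e-9)

def split_by_gravity(normals, gravity, angle_deg=10.0):
    """Map points on horizontal planes (normal parallel to gravity) and on
    vertical planes (normal perpendicular to gravity)."""
    cos_t = np.abs(normals @ gravity)                   # gravity is a unit vector
    horizontal = cos_t > np.cos(np.radians(angle_deg))
    vertical = cos_t < np.sin(np.radians(angle_deg))
    return horizontal, vertical

# Toy example: four map points on a floor-like plane in front of the camera.
pts3d = np.array([[-0.5, 1.0, 2.0], [0.5, 1.0, 2.0], [-0.5, 1.0, 3.0], [0.5, 1.0, 3.0]])
pts2d = pts3d[:, :2] / pts3d[:, 2:]                     # pinhole projection, unit focal length
normals = map_point_normals(pts3d, pts2d)
print(split_by_gravity(normals, gravity=np.array([0.0, 1.0, 0.0])))
# -> all four points flagged as horizontal, none as vertical
```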
  • a mesh model 706 or 708 of a 3D virtual space is created (704) based on the plurality of feature points 606.
  • the plurality of planes 614 are identified in the 3D virtual space based on the mesh model 706 or 708.
  • the electronic device determines the plurality of plane normal vectors 616 for the plurality of planes 614.
  • the plurality of feature points 606 are identified in the 3D virtual space corresponding to the scene based on the plurality of image frames 602 and the plurality of motion data using a simultaneous localization and mapping (SLAM) method.
  • the electronic device identifies a plurality of device poses 712 of the camera 260 associated with the plurality of image frames 602 using the SLAM method.
  • Each device pose 712 includes a device position and a device orientation of the camera 260.
  • the gravity direction 716 associated with the 3D virtual space is determined based on the camera poses 712, and the plurality of planes 614 are detected in the 3D virtual space based on the gravity direction 716.
  • an RGB image 602A is passed through a depth prediction network 718 to obtain a depth map 608.
  • a normal map 610 is determined from the depth map by computing horizontal and vertical depth gradient at each element of the depth map 608.
  • Given the potential plane normal directions 616, the electronic system generates one or more plane masks 720 based on the normal map 610 for each plane normal direction 616.
  • The one or more plane masks 720 identify regions where the surface normal direction from the normal map 610 aligns with a detected plane normal direction 616 (e.g., with a deviation smaller than a direction deviation threshold).
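  • The normal map computation and plane mask generation described above can be sketched as follows; the pinhole back-projection, the gradient-based normal estimate, and the deviation threshold are illustrative assumptions rather than the exact formulation of this disclosure.

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    """Per-pixel surface normals from the horizontal and vertical gradients of the
    back-projected depth map (pinhole intrinsics fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    points = np.dstack(((u - cx) * depth / fx, (v - cy) * depth / fy, depth))
    dx = np.gradient(points, axis=1)                    # horizontal 3D gradient
    dy = np.gradient(points, axis=0)                    # vertical 3D gradient
    n = np.cross(dx, dy)
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-9)

def plane_mask(normal_map, plane_normal, max_deg=15.0):
    """Mask of pixels whose normal aligns with a detected plane normal direction
    within an angular deviation threshold."""
    cos_t = np.abs(normal_map @ plane_normal)
    return cos_t > np.cos(np.radians(max_deg))

depth = np.full((120, 160), 2.0)                        # toy depth map of a fronto-parallel plane
nmap = normals_from_depth(depth, fx=150.0, fy=150.0, cx=80.0, cy=60.0)
mask = plane_mask(nmap, plane_normal=np.array([0.0, 0.0, 1.0]))
print(mask.mean())                                      # 1.0: every pixel joins this plane's mask
```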
  • the depth map 608 produces locally smooth depth values with a relatively unreliable depth scale, thereby affecting the normal map 610.
  • the map points 606 are grouped based on the normal directions 616.
  • the normal map 610 is sampled at the corresponding map points 606 that are projected to 2D coordinates. Instead of filtering the normal map 610 with an average of the map point normals, the electronic system averages these samples.
  • the detected plane directions are thereby mapped to values in the normal map 610, making the masking process robust to depth map scale errors.
  • For each plane mask 720, the electronic system segments the respective plane mask 720 into large regions 724 via Connected Component Analysis (CCA) 722.
  • Map points 606 are identified for each connected large region 724 by projecting the map points 606 to 2D (e.g., by reusing projection results 702).
  • Each of plane candidates 612 includes a set of pixels and map points 606.
  • 3D plane parameters are determined from the map points 606 for each plane candidate 612.
  • a map point 606 is associated with a plane candidate 612 as an inlier if its distance to the plane candidate 612 is smaller than a preset threshold, and only those plane candidates with a sufficient number of inliers are kept as valid candidates.
  • the electronic system stores the pixels of each plane candidate 612 into a 2D occupancy grid having an x-y axis plane that coincides with the plane candidate 612.
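  • The plane parameter estimation, inlier test, and occupancy grid described above are sketched below; the least-squares plane fit, the distance threshold, and the grid cell size are illustrative assumptions.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a region's map points: returns a unit normal n
    and offset d such that n . x + d = 0."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                                       # direction of least variance
    return normal, -normal @ centroid

def plane_inliers(points, normal, d, max_dist=0.03):
    """Inlier test: keep map points whose distance to the plane candidate is
    smaller than a preset threshold (0.03 m here, purely illustrative)."""
    return np.abs(points @ normal + d) < max_dist

def occupancy_grid(points, normal, cell=0.10):
    """Occupied cells of a 2D grid whose x-y axes lie in the plane candidate."""
    axis_u = np.cross(normal, [1.0, 0.0, 0.0])            # first in-plane axis
    if np.linalg.norm(axis_u) < 1e-6:
        axis_u = np.cross(normal, [0.0, 1.0, 0.0])
    axis_u /= np.linalg.norm(axis_u)
    axis_v = np.cross(normal, axis_u)                     # second in-plane axis
    uv = np.stack([points @ axis_u, points @ axis_v], axis=1)
    return set(map(tuple, np.floor(uv / cell).astype(int)))

region_pts = np.array([[0, 0, 0.0], [1, 0, 0.01], [0, 1, -0.01], [1, 1, 0.0]])
n, d = fit_plane(region_pts)
test_pts = np.vstack([region_pts, [[0.5, 0.5, 0.9]]])     # last point is off the plane
print(plane_inliers(test_pts, n, d))                      # [ True  True  True  True False]
print(len(occupancy_grid(region_pts, n)))                 # number of occupied grid cells
```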
  • the plane candidates 612 are not limited to the first image 602A and are tracked in a global coordinate system.
  • Each map point 606 associated with the plane candidate 612 becomes part of the global map points.
  • the plurality of planes 614 of the scene are represented with a set of plane masks 720.
  • Each plane mask 720 corresponds to a respective plane 614 and has a plurality of elements.
  • Each of a first subset of elements in the respective plane mask 720 has a predefined value (e.g., “1”) indicating that the respective element corresponds to the respective plane captured by a respective set of pixels in the first image 602A.
  • each plane mask 720 is segmented into one or more regions. Each region of the respective plane mask 720 is associated with a set of feature points 606 mapped in the scene and a subset of pixels of the first image 602A.
  • One or more plane parameters are determined based on the set of feature points 606 of each region of the respective plane mask 720. Additionally, in some embodiments, for a first plane mask 720A, a distance is determined between the first plane mask 720A and each of the set of feature points of the one or more regions. For each feature point, it is determined whether the distance is less than a threshold distance. If the distance between a feature point and the first plane mask 720A is greater than the threshold distance, the feature point is an outlier and is excluded from the set of feature points associated with the regions of the first plane mask 720A. It is further determined how many feature points in total are associated with the one or more regions of the first plane mask 720A. In accordance with a determination that the total number of feature points is greater than a threshold point number, the electronic system confirms that the first plane mask 720A and the associated plane candidate 612 correspond to a valid plane 614 in the scene.
  • plane candidates 612 are generated at each image frame 602 and subsequently tracked in the global coordinate system. Plane candidates 612 that are consistently tracked across a number of image frames (e.g., more than 5 consecutive image frames) are considered to be stable planes. Each plane’s boundary and parameters that are stored in association with the scene are updated by merging with those of the plane candidates 612 detected for each new image frame 602. Such a merge operation is conducted when a plane candidate 612 shares a number of map points with a plane 614 that has been stored in association with the scene.
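  • The tracking and merging behavior described above can be sketched as follows; the shared-map-point and frame-count thresholds are illustrative, and the consecutiveness check is omitted for brevity.

```python
def merge_plane_candidates(stored_planes, candidates, min_shared=10, min_frames=5):
    """Merge per-frame plane candidates into globally tracked planes.

    Each plane is a dict with 'points' (a set of map point ids), 'normal', and
    'frames' (the number of frames in which it has been observed)."""
    for cand in candidates:
        for plane in stored_planes:
            shared = len(plane["points"] & cand["points"])
            if shared >= min_shared:                     # candidate matches a stored plane
                plane["points"] |= cand["points"]        # merge boundary support
                plane["frames"] += 1
                break
        else:                                            # no match: start tracking a new plane
            stored_planes.append(dict(cand, frames=1))
    # Planes tracked across enough image frames are considered stable.
    return [p for p in stored_planes if p["frames"] >= min_frames]

planes = []
for frame in range(6):                                   # successive frames of detections
    candidate = {"points": set(range(frame, frame + 20)), "normal": (0.0, 0.0, 1.0)}
    stable = merge_plane_candidates(planes, [candidate])
print(len(planes), len(stable))                          # 1 tracked plane, 1 stable plane
```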
  • a subset of feature points 606 are identified to correspond to the subset of the plurality of planes 614 (i.e., the one or more plane candidates 612) associated with the first image 602A.
  • the subset of feature points are compared (618) with the plurality of feature points 606 that have been stored in association with a 3D virtual space of the scene.
  • the electronic system updates the plurality of planes 614 of the scene with the identified subset of the plurality of planes (i.e., the one or more plane candidates 612) associated with the first image 602A. For example, a new plane is added to the plurality of planes 614.
  • a normal direction of an existing plane 614 is adjusted.
  • the first plane 614A is identified in a predefined number of (e.g., 5 or more) consecutive image frames 602 including the first plane 614A, and the first plane 614A is determined as a stable plane.
  • the plane detection processes 600 and 700 have the advantages of high compatibility, high fidelity, and high accuracy. Specifically, the plane detection processes 600 and 700 use images and motion data applied in SLAM, and do not require additional sensors or computing hardware.
  • the electronic system provides dense 3D reconstruction-like plane boundary output without the dense depth input that can only be provided by a depth camera. Fusion with SLAM map points adds the scale accuracy of the SLAM backend to the CNN-predicted depth.
  • The processes 600 and 700 may also be applied when a depth camera image is used in place of the depth map 608 generated by the CNN depth network 718.
  • A depth camera provides highly accurate depth image data, and the corresponding plane candidates can still benefit from SLAM tracking (e.g., in the plane detection stage 600A) and backend optimization (e.g., in the plane tracking stage 600C). Additionally, in some embodiments, map point based plane detection is applied jointly with plane parameter estimation and convex hull generation.
  • The CNN depth network 718 is applied to generate the depth map 608 and, based on the depth map 608, plane boundaries of the plane candidates 612 are accurately detected on texture regions that have a limited quality and where feature points 606 cannot be applied to identify the plane boundaries accurately.
  • the processes 600 and 700 do not need to use a depth camera, and rely on the existing camera 260 and the IMU 280.
  • The plane boundaries have a high level of detail on the textured regions. It is noted that in the plane detection processes 600 and 700, deep learning techniques are applied and that an increased activity level is observed in CPU, GPU, or DSP usage. In some situations, remote procedure calls are used to interact with neural network computing libraries.
  • Figure 8 is a flowchart of an example method 800 for identifying planes in an image, in accordance with some embodiments.
  • the method 800 is described as being implemented by an electronic system (e.g., a data processing system 200 including an HMD 104D, a mobile device 104C, a server 102, or a combination thereof).
  • Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.
  • The electronic system obtains (802) a plurality of image frames 602 captured by a camera 260 and a plurality of motion data 604 captured by an IMU 280.
  • An electronic device (e.g., the HMD 104D) includes the camera 260 and the IMU 280, and is disposed in a scene having a plurality of planes 614.
  • the camera 260 includes (804) one of an RGB camera and a greyscale camera, and each of the plurality of image frames 602 is one of an RGB image and a greyscale image.
  • the electronic system determines (806) a plurality of feature points 606 in the scene based on the plurality of image frames 602 and the plurality of motion data 604.
  • the electronic system generates (808) a depth map 608 from a first image 602A.
  • the plurality of image frames 602 includes the first image 602A.
  • the plurality of image frames 602 does not include the first image 602A.
  • the depth map 608 is generated from the first image 602A using a convolutional neural network.
  • the electronic system generates (810) a plane normal map 610 from the depth map 608 of the first image 602A.
  • The electronic system identifies (812) a subset of the plurality of planes 614 (i.e., one or more plane candidates 612) associated with the first image 602A based on the plane normal map 610 and the plurality of feature points 606 of the scene.
  • the electronic system identifies (814) a plurality of plane normal vectors 616 associated with the plurality of planes 614 existing in the scene based on the plurality of feature points 606 in the scene.
  • the subset of the plurality of planes 614 associated with the first image 602A is identified based on the plane normal map 610 and the plurality of plane normal vectors 616 that are identified based on the plurality of feature points 606. Further, in some embodiments, each of the subset of the plurality of planes 614 is defined by a plane boundary and a plane orientation. For each of the subset of the plurality of planes 614 (i.e., each plane candidate 612), the electronic system determines (816) the plane boundary of the respective plane candidate 612 in the scene from the plane normal map 610 of the first image 602A, and determines (818) the plane orientation of the respective plane candidate 612 in the scene from a respective plane normal vector 616.
  • The electronic system creates a mesh model 708 of the 3D virtual space based on the plurality of feature points 606, detects the plurality of planes 614 in the 3D virtual space based on the mesh model 708, and determines the plurality of plane normal vectors 616 for the plurality of planes 614.
  • the plurality of feature points 606 are identified in the 3D virtual space corresponding to the scene based on the plurality of image frames 602 and the plurality of motion data 604 using a simultaneous localization and mapping (SLAM) method, e.g., implemented by a SLAM module 232 in Figure 2.
  • the electronic system identifies a plurality of device poses 712 of the camera 260 associated with the plurality of image frames 602 using the SLAM method, and identifies a gravity direction 716 associated with the 3D virtual space, and the plurality of planes 614 are detected in the 3D virtual space based on the gravity direction 716.
  • the electronic device identifies the subset of the plurality of planes 614 (i.e., the one or more plane candidates 612) associated with the first image 602A, and represents the subset of the plurality of planes 614 with a set of plane masks 720.
  • Each plane mask 720 corresponds to a respective plane 614 (or 612) and has a plurality of elements, and each of a first subset of elements in the respective plane mask 720 has a predefined value indicating that the respective element corresponds to the respective plane as captured by a respective set of pixels in the first image 602A.
  • The electronic system segments the respective plane mask 720 into one or more regions, e.g., via connected component analysis (CCA), and associates each region of the respective plane mask 720 with a set of feature points 606 mapped in the scene and a subset of pixels of the first image 602A.
  • the electronic device determines one or more plane parameters based on the set of feature points 606 of each region of the respective plane mask 720.
  • The electronic system determines that a distance between the first plane mask 720A and each of the set of feature points of the one or more regions is less than a threshold distance and that a total number of feature points are associated with the one or more regions of the first plane mask 720A. In accordance with a determination that the total number of feature points is greater than a threshold point number, the electronic system confirms that the first plane mask 720A (i.e., a corresponding plane candidate 612) corresponds to a valid plane 614 in the scene.
  • The electronic device identifies (820) a subset of feature points 606 corresponding to the subset of the plurality of planes 614 (i.e., the one or more plane candidates 612) associated with the first image 602A.
  • the electronic device compares (822) the subset of feature points with the plurality of feature points 606 corresponding to the 3D virtual space of the scene.
  • the electronic device updates (824) the plurality of planes 614 of the scene with the identified subset of the plurality of planes associated with the first image 602A.
  • The first image 602A includes a first plane 614A.
  • the electronic system determines (826) that the first plane 614A is a stable plane.
  • the information of the candidate planes 612 is applied to recognize a local environment associated with each image and therefore localize a camera of the electronic device (e.g., the HMD) in the scene.
  • the candidate planes 612 can also be used to update mapping of the scene, e.g., in the plane tracking stage 600C. Additionally, such plane information is used to render virtual objects in some embodiments.
  • The term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
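
The segmentation and validation described in the items above can be illustrated with a short sketch. The following Python code is a minimal, illustrative example rather than the claimed implementation: it splits a binary plane mask 720 into regions with connected component analysis, fits a plane to the feature points 606 whose pixels fall in each region, discards outliers by a point-to-plane distance threshold, and accepts the candidate when enough inlier points remain. The thresholds, the least-squares plane fit, and the assumption that each feature point carries its pixel location are assumptions made for this sketch.

```python
import numpy as np
from scipy import ndimage

def segment_mask(mask):
    """Split a binary plane mask into connected regions (connected component analysis)."""
    labels, num_regions = ndimage.label(mask)
    return [labels == i for i in range(1, num_regions + 1)]

def fit_plane(points):
    """Least-squares plane fit; returns a unit normal n and offset d with n.x + d = 0."""
    centroid = points.mean(axis=0)
    _, _, vh = np.linalg.svd(points - centroid)
    normal = vh[-1]                                      # direction of smallest variance
    return normal, -normal.dot(centroid)

def validate_plane(mask, points_3d, pixels, dist_thresh=0.05, min_points=10):
    """points_3d: (N, 3) map points; pixels: (N, 2) integer (row, col) image locations."""
    total_inliers = 0
    for region in segment_mask(mask):
        in_region = region[pixels[:, 0], pixels[:, 1]]   # feature points falling in this region
        pts = points_3d[in_region]
        if len(pts) < 3:
            continue
        normal, d = fit_plane(pts)
        dist = np.abs(pts @ normal + d)                  # point-to-plane distances
        total_inliers += int((dist < dist_thresh).sum()) # drop outliers beyond the threshold
    return total_inliers >= min_points
```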

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This application is directed to identifying planes in an image. An electronic device includes a camera and an inertia motion unit (IMU), and obtains a plurality of image frames captured by the camera and a plurality of motion data captured by the IMU. The electronic device is disposed in a scene having a plurality of planes. A plurality of feature points is determined in the scene based on the plurality of image frames and the plurality of motion data. The electronic device generates a depth map from a first image, and a plane normal map from the depth map of the first image. A subset of the plurality of planes associated with the first image is identified based on the plane normal map and the plurality of feature points of the scene.

Description

Methods and Systems for Detecting Planar Regions using Predicted Depth
TECHNICAL FIELD
[0001] This application relates generally to image data processing including, but not limited to, methods, systems, and non-transitory computer-readable media for identifying planes in an image based on image data and inertial sensor data.
BACKGROUND
[0002] Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, feature points of a scene are detected and applied to map a three-dimensional (3D) virtual space corresponding to the scene. The feature points can be conveniently and accurately mapped for the 3D virtual space using an optical camera and an Inertial Measurement Unit (IMU) that exist in many mobile devices. These feature points are also applied in state-of-the-art mobile AR frameworks to provide plane detection as one of the basic functions. However, such SLAM-based map points are tracked across many image frames, and areas with poor texture and self-repeating patterns are challenging for SLAM to detect high-quality feature points and build the 3D virtual space. A planar area must include a sufficient density of feature points to be accurately detected. Indoor scenes (e.g., office and household living space) tend to have homogeneously colored walls and homogeneously textured floors. Detection of approximate plane boundaries in such indoor scenes is possible, while it is difficult to achieve precise results. It would be beneficial to have a more efficient and accurate plane detection mechanism for such environments than the current practice.
SUMMARY
[0003] Accordingly, there is a need for an efficient and accurate plane detection mechanism for detecting planes in an image captured in a scene using both SLAM feature points and depth information that are extracted from the image. Various implementations of this application take advantage of the dense nature of the depth information of the image to provide detailed plane boundary information while leveraging a spatial accuracy of the SLAM feature points. The depth information is predicted from the image using deep learning techniques and provides boundary details of planes detected in the image. The SLAM feature points (also called map points) have a limited density that is insufficient to define plane boundaries accurately in some situations; however, these feature points can provide accurate plane orientation information (e.g., information of a normal direction of a plane). By these means, the depth information is applied jointly with the information of SLAM feature points to provide highly-detailed plane boundaries, detect planes in poor texture areas, and provide robust plane detection results.
[0004] In one aspect, a method is implemented at an electronic device for identifying planes in an image. The electronic device includes a camera and an inertia motion unit (IMU). The method includes obtaining a plurality of image frames captured by the camera and a plurality of motion data captured by the IMU, and the electronic device is disposed in a scene having a plurality of planes. The method further includes determining a plurality of feature points in the scene based on the plurality of image frames and the plurality of motion data, generating a depth map from a first image, and generating a plane normal map from the depth map of the first image. The method further includes identifying a subset of the plurality of planes associated with the first image based on the plane normal map and the plurality of feature points of the scene. In some embodiments, the method further includes identifying a plurality of plane normal vectors associated with the plurality of planes existing in the scene based on the plurality of feature points in the scene. The subset of the plurality of planes associated with the first image is identified based on the plane normal map and the plurality of plane normal vectors that are identified based on the plurality of feature points.
[0005] In another aspect, some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0006] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures. [0008] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0009] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
[0010] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0011] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0012] Figure 5 is a flowchart of a process for processing inertial sensor data and image data of an electronic system using a SLAM module, in accordance with some embodiments.
[0013] Figure 6 is a simplified flow diagram of a plane identification process, in accordance with some embodiments.
[0014] Figure 7 is a detailed flow diagram of a plane identification process, in accordance with some embodiments.
[0015] Figure 8 is a flowchart of a method for identifying planes in an image, in accordance with some embodiments.
[0016] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0017] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0018] Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and digital objects. In extended reality (e.g., AR), planar regions are detected, tracked, and represented with geometric modeling. Image data and inertial motion data are collected and processed using SLAM techniques, and enable a six degrees of freedom (DOF) tracking system to identify stationary feature points distributed in a scene in many mobile AR systems. The six DOFs include 3D translational movement and 3D rotation. Additionally, the image data used in SLAM can be used in monocular depth estimation of depth information using deep learning techniques. This application is directed to identifying planes in the scene based on the feature points and depth information tracked by the SLAM techniques and deep learning techniques, respectively.
[0019] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by another client device 104 or the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0020] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
[0021] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0022] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0023] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), rending virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102 A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally. [0024] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. 
The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
[0025] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
[0026] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras 260, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104. In some embodiments, the client device 104 includes the IMU 280 for collecting inertial sensor data (also called motion data) of the client device 104 disposed in a scene. [0027] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 (e.g., applied in a plane detection process 600 in Figure 6) for processing content data using data processing models 228 (e.g., a CNN depth network 718 in Figure 7), thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
• Pose determination and prediction module 230 for determining and predicting a device pose of the client device 104 (e.g., an HMD 104D) based on images captured by the camera 260 and motion data captured by the IMU 280, where in some embodiments, the device pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes an SLAM module 232 for applying image data captured by the camera 260 and motion data measured by the IMU 280 to map a scene where the client device 104 is located and identify a device pose of the client device 104 within the scene;
• Plane identification module 234 that operates to generate a plane normal map from an image jointly with the data processing module 228, combine the feature points identified by the SLAM module 232 and the plane normal map, and identify plane candidates in the image; and
• One or more databases 238 for storing at least data including one or more of o Device settings 240 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104; o User account information 242 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings; o Network parameters 244 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name; o Training data 246 for training one or more data processing models 248; o Data processing model(s) 248 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 includes at least a CNN depth network for generating a depth map from an image (e.g., an RGB or grayscale image); and o Content data and results 250 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data include one or more of historic inertial sensor data 252, historic image data 254, historic pose data 256, and feature points and plane data 258.
[0028] Optionally, the one or more databases 238 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 238 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
[0029] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0030] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 248 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 248 and a data processing module 228 for processing the content data using the data processing model 248. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 248 to the client device 104.
[0031] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 248 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, so is a data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 248, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 248 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 248 is provided to the data processing module 228 to process the content data.
[0032] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0033] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 248 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 248. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0034] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 248, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 248 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 248 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
[0035] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layers 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
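As a concrete, purely illustrative example of the max pooling described above, the following Python/NumPy snippet reduces a 4x4 feature map to a 2x2 map by keeping the maximum value of each non-overlapping 2x2 block; the array values and pooling window size are assumptions made for this sketch only.

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)             # a toy 4x4 feature map
# Group the map into non-overlapping 2x2 blocks and keep the maximum of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                                 # the 2x2 pooled feature map
```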
[0036] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 248 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis. [0037] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 248 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0038] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
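A toy NumPy sketch of the forward and backward propagation described in this paragraph is shown below for a single fully connected layer with a sigmoid activation and a bias term b; the random data, learning rate, and mean-squared-error loss are illustrative assumptions rather than the training configuration of the disclosed models.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))             # 8 training items with 4 input features each
y = rng.normal(size=(8, 1))             # desired outputs (ground truth)
w = 0.1 * rng.normal(size=(4, 1))       # weights of one fully connected layer
b = np.zeros((1, 1))                    # network bias term b
lr = 0.1                                # learning rate

for step in range(200):
    # Forward propagation: weighted combination of inputs, bias term, then activation.
    out = sigmoid(x @ w + b)
    loss = np.mean((out - y) ** 2)      # margin of error of the output (loss function)
    # Backward propagation: adjust the weights to decrease the error.
    grad_out = 2.0 * (out - y) / y.size
    grad_z = grad_out * out * (1.0 - out)
    w -= lr * (x.T @ grad_z)
    b -= lr * grad_z.sum(axis=0, keepdims=True)
```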
[0039] Figure 5 is a flowchart of a process 500 for processing inertial sensor data (also called motion data) and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module (e.g., 232 in Figure 2), in accordance with some embodiments. The process 500 includes measurement preprocessing 502, initialization 504, local visual-inertial odometry (VIO) with relocation 506, and global pose graph optimization 508. In measurement preprocessing 502, a camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (510) from the image data. An inertial measurement unit (IMU) 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the camera 260 capturing the image data, and the inertial sensor data are pre-integrated (512) to provide pose data. In initialization 504, the image data captured by the camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (514). Vision-only structure from motion (SfM) techniques 514 are applied (516) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the camera 260.
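For illustration only, the pre-integration of inertial sensor data mentioned above can be approximated by the simplified dead-reckoning sketch below, which integrates gyroscope and accelerometer samples between two image frames; a real VIO backend uses on-manifold pre-integration with bias and noise terms, which this sketch omits, and the gravity constant, sample period, and function names are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

GRAVITY = np.array([0.0, 0.0, -9.81])   # assumed world-frame gravity (m/s^2)

def integrate_imu(samples, dt, R0, v0, p0):
    """samples: iterable of (gyro [rad/s], accel [m/s^2]) pairs; dt: sample period in seconds."""
    R, v, p = R0, np.asarray(v0, float).copy(), np.asarray(p0, float).copy()
    for gyro, accel in samples:
        a_world = R.apply(np.asarray(accel, float)) + GRAVITY       # body acceleration in world frame
        p = p + v * dt + 0.5 * a_world * dt ** 2                    # integrate position
        v = v + a_world * dt                                        # integrate velocity
        R = R * Rotation.from_rotvec(np.asarray(gyro, float) * dt)  # integrate orientation
    return R, v, p

# Example: a device at rest measures +9.81 m/s^2 along its z axis and no rotation.
R, v, p = integrate_imu([(np.zeros(3), np.array([0.0, 0.0, 9.81]))] * 100,
                        dt=0.001, R0=Rotation.identity(), v0=np.zeros(3), p0=np.zeros(3))
```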
[0040] After initialization 504 and in relocation 506, a sliding window 518 and associated states from a loop closure 520 are used to optimize (522) a VIO. When the VIO corresponds (524) to a keyframe of a smooth video transition and a corresponding loop is detected (526), features are retrieved (528) and used to generate the associated states from the loop closure 520. In global pose graph optimization 508, a multi-degree-of-freedom (multiDOF) pose graph is optimized (530) based on the states from the loop closure 520, and a keyframe database 532 is updated with the keyframe associated with the VIO.
[0041] Additionally, the features that are detected and tracked (510) are used to monitor (534) motion of an object in the image data and estimate image-based poses 536, e.g., according to the image frame rate. In some embodiments, the inertial sensor data that are pre-integrated (513) may be propagated (538) based on the motion of the object and used to estimate inertial-based poses 540, e.g., according to the sampling frequency of the IMU 280. The image-based poses 536 and the inertial-based poses 540 are stored in a pose data buffer and used by the SLAM module 232 to estimate and predict poses. Alternatively, in some embodiments, the SLAM module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 536 to estimate and predict more poses.
[0042] In SLAM, high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280. The imaging sensors (e.g., camera 260, lidars) provide image data desirable for pose estimation, and oftentimes operate at a low frequency (e.g., 30 frames per second) and with a large latency (e.g., 30 millisecond). Conversely, the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond). Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement that occurs after the image is rendered. ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames. In both SLAM and ATW, relevant image data and inertial sensor data are stored locally such that they can be synchronized and used for pose estimation/prediction. In some embodiments, the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally very convenient for use. The image and inertial sensor data are stored in the STL containers with their time stamps, and the timestamps are used for data search, data insertion, and data organization. [0043] Figure 6 is a simplified flow diagram of a plane identification process 600, in accordance with some embodiments. The process 600 is implemented at an electronic system (e.g., a data processing system 200), e.g., including an HMD 104D, another client device 104, and a server 102. The HMD 104D has a camera 260 configured to collect image data and an IMU 280 configured to collect motion data (also called inertial sensor data), and is disposed in a scene. This scene has a plurality of planes (e.g., wall, table top, window, door). In some embodiments, the process 600 is implemented at the HMD 104D directly. Alternatively, in some embodiments, the image and motion data are provided to a distinct client device 104 (e.g., a mobile device 104C) coupled to the HMD 104D, e.g., via a wired or wireless link, and the process 600 is implemented at the client device 104. Alternatively, in some embodiments, the image and motion data are provided to a server 102, and the process 600 is implemented at the server 102. The plane identification process 600 includes a plane detection stage 600A, a composition and validation stage 600B, and a plane tracking stage 600C.
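As a minimal illustration of storing samples together with their timestamps and using the timestamps for data search and synchronization, as described above, the following Python sketch keeps a sorted buffer and returns the stored sample nearest to a query time; the container layout and timestamp units are assumptions, and a production system would typically rely on the C++ STL containers named above instead.

```python
import bisect

class TimedBuffer:
    """Stores samples together with their timestamps, kept sorted for fast search."""

    def __init__(self):
        self.timestamps = []   # sorted timestamps (e.g., nanoseconds)
        self.samples = []      # sample i was recorded at timestamps[i]

    def insert(self, t, sample):
        i = bisect.bisect_left(self.timestamps, t)    # keep the buffer sorted by time
        self.timestamps.insert(i, t)
        self.samples.insert(i, sample)

    def nearest(self, t):
        """Return the sample whose timestamp is closest to t (e.g., pairing an image with IMU data)."""
        if not self.samples:
            return None
        i = bisect.bisect_left(self.timestamps, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.samples)]
        best = min(candidates, key=lambda j: abs(self.timestamps[j] - t))
        return self.samples[best]
```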
[0044] In the plane detection stage 600A, the electronic system obtains a plurality of image frames 602 captured by the camera 260 of the HMD 104D and a plurality of motion data 604 captured by the IMU 280 of the HMD 104D. A plurality of feature points 606 (also called map points) are determined based on the plurality of image frames 602 and the plurality of motion data 604 using a SLAM module 232. The plurality of feature points 606 map the scene to a 3D virtual space. Each plane of the scene corresponds to a subset of the plurality of feature points 606.
[0045] In the composition and validation stage 600B, the electronic system generates a depth map 608 from a first image 602A, e.g., using a convolutional neural network (CNN) depth neural network. Optionally, the plurality of image frames 602 includes the first image 602A. Optionally, the plurality of image frames 602 does not include the first image 602A. For example, the first image 602A temporally follows the plurality of image frames 602. A plane normal map 610 is generated from the depth map 608 of the first image 602A. The depth map 608 optionally has the same resolution as or a smaller resolution than the first image 602A, and each element of the depth map 608 has a depth value corresponding to one pixel or a set of adjacent pixels of the first image 602A. In some embodiments, the depth map 608 corresponds to a confidence map having the same resolution as the depth map 608, and each element of the confidence map indicates a confidence level of a validity of the depth value at a corresponding element of the depth map 608. The plane normal map 610 optionally has the same resolution as the depth map 608, and each element of the plane normal map 610 includes a vector representing a normal direction of a plane containing the corresponding one pixel or set of adjacent pixels of the first image 602A.
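A minimal sketch of deriving a plane normal map like 610 from a depth map like 608 is shown below; it assumes a pinhole camera model with intrinsics fx, fy, cx, cy and metric depth, and estimates a unit normal per pixel from cross products of local 3D differences. The exact normal estimation used by the disclosed system may differ.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project each pixel of the depth map to a 3D point in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)            # (H, W, 3) point map

def normal_map(depth, fx, fy, cx, cy):
    """Estimate a unit surface normal per pixel from local 3D differences."""
    pts = depth_to_points(depth, fx, fy, cx, cy)
    dx = np.gradient(pts, axis=1)                      # change along image columns
    dy = np.gradient(pts, axis=0)                      # change along image rows
    n = np.cross(dx, dy)                               # normal of the local surface patch
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9
    return n                                           # (H, W, 3); sign may need flipping toward the camera
```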
[0046] The electronic system identifies one or more plane candidates 612 (i.e., a subset of the plurality of planes) associated with the first image 602A based on the plane normal map 610 and the plurality of feature points 606 of the scene. In some embodiments, a plurality of planes 614 existing in the scene are associated with a plurality of plane normal vectors 616 that represent normal directions of the planes 614. The plane normal vectors 616 (also called plane normal directions) are identified based on the plurality of feature points 606 in the scene. The subset of the plurality of planes (i.e., the plane candidate(s) 612) associated with the first image 602A is identified based on the plane normal map 610 and the plurality of plane normal vectors 616 that are identified based on the plurality of feature points 606. In some embodiments, for a first plane 614A of the scene, the subset of the plurality of feature points has a limited number of feature points (15 feature points) that define a boundary of the first plane 614A up to a boundary resolution that is lower than a threshold resolution.
Given the sparsity of the feature points associated with the first plane 614A, the plurality of feature points 606 indicate an existence of the first plane 614A but cannot identify a size or a boundary of the first plane 614A in the scene. Rather, the plane normal map 610 provides information of normal directions at all or a subset of pixels of the first image 602A, thereby indicating where the boundary of the first plane 614A is located.
[0047] Specifically, in some embodiments, each of the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614) associated with the first image 602A is defined by a plane boundary and a plane orientation. For each of the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614), the plane boundary of the respective plane candidate 612 is determined in the scene from the plane normal map 610 of the first image 602A, and the plane orientation of the respective plane candidate 612 is determined in the scene from a respective plane normal vector 616.
[0048] During the course of mapping the 3D virtual space of the scene, the plurality of planes 614 and the plurality of feature points 606 are tracked for the scene. In the plane tracking stage 600C, a subset of feature points 606A are identified for the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614) associated with the first image 602A. The subset of feature points 606A are compared (618) with the plurality of feature points 606. In accordance with a comparison of feature points, the electronic system updates the plurality of planes 614 of the scene with the one or more plane candidates 612 (i.e., the subset of the plurality of planes 614) associated with the first image 602A. In some embodiments, the subset of feature points 606A is entirely enclosed in the plurality of feature points 606, and the one or more plane candidates 612 are included in the plurality of planes 614. Alternatively, in some embodiments, the subset of feature points 606A is partially or entirely excluded from the plurality of feature points 606, and the one or more plane candidates 612 are partially or entirely excluded from the plurality of planes 614. The plurality of planes 614 are updated (620) to include the one or more plane candidates 612, and plane parameters of the one or more plane candidates 612 are merged with those of the plurality of planes 614.
[0049] In some embodiments, the first image 602A includes a single RGB image 602A, and a CNN monocular depth prediction network is applied to determine the depth map 608 from the single RGB image 602A. The depth map 608 provides boundary information and dense, smooth surface normals of a subset of the plurality of planes 614. The plurality of feature points 606 obtained by the SLAM module 232 provide a higher accuracy level of depth information than the depth map 608, and can complement the CNN-based depth map 608 to achieve a high plane parameter accuracy and mitigate tracking drift.
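One common way to exploit this complementarity is to rescale the scale-ambiguous CNN depth so that it agrees with the metric SLAM depths at the projected map points. The C++ sketch below shows a median-ratio alignment under that assumption; it illustrates the general idea rather than the specific fusion used by the disclosed system, and the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A sparse SLAM map point projected into the depth map, with its metric depth.
struct SparseDepthSample {
  int x = 0, y = 0;          // element coordinates in the depth map
  float metric_depth = 0.f;  // depth recovered by the SLAM backend
};

// Rescale a CNN-predicted depth map (arbitrary scale) so that it matches the
// metric SLAM depths in a median sense; returns the applied scale factor.
float AlignDepthScale(std::vector<float>& depth, int width,
                      const std::vector<SparseDepthSample>& sparse) {
  std::vector<float> ratios;
  ratios.reserve(sparse.size());
  for (const auto& s : sparse) {
    const float predicted = depth[static_cast<std::size_t>(s.y) * width + s.x];
    if (predicted > 0.f && s.metric_depth > 0.f) {
      ratios.push_back(s.metric_depth / predicted);
    }
  }
  if (ratios.empty()) return 1.f;  // nothing to align against
  std::nth_element(ratios.begin(), ratios.begin() + ratios.size() / 2, ratios.end());
  const float scale = ratios[ratios.size() / 2];  // median ratio
  for (float& d : depth) d *= scale;
  return scale;
}
```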
[0050] In various embodiments of this application, a plane detection pipeline includes at least three stages: plane detection 600A, composition and validation 600B, and plane tracking 600C. The presence of planes in the scene is detected via the feature points 606 mapped by SLAM. One benefit of map point plane detection is that map points are sparse, and plane detection demands a limited amount of computational resources. Plane detection 600A is simplified to only detect points aligned in the same normal direction instead of estimating all of the plane parameters, further reducing computational resources. Once the plane normal directions 616 are detected, the corresponding RGB image 602A is fed to a depth prediction network to estimate the depth map 608. The depth map 608 is used to validate planes 614, estimate plane parameters, and build plane boundaries. The planes 614 are tracked from frame to frame by common map points 606, and plane boundaries are updated in a region growing fashion.

[0051] Figure 7 is a detailed flow diagram of a plane identification process 700, in accordance with some embodiments. The plane identification process 700 includes a plane detection stage 600A, a composition and validation stage 600B, and a plane tracking stage 600C. In the plane detection stage 600A, the electronic system detects a presence of a plane 614. Estimation of plane parameters is deferred until the composition and validation stage 600B. The presence of a single plane 614 is detected by determining that a set of adjacent map points 606 faces the same direction. The map points 606 identified via SLAM in the first image 602A are projected (702) onto a two-dimensional (2D) image plane. Delaunay triangulation is performed on the 2D points to form (704) a mesh structure including a 2D mesh 706. The 2D mesh 706 is back projected into a 3D mesh 708 using the camera's intrinsic projective geometry. A smoothing technique 710 (e.g., Laplacian smoothing) is applied to the 3D mesh 708. A normal direction 616 (also called a normal vector) of each map point 606 is estimated from the edges connecting the map point with its first-ring neighbors. This is accomplished by computing the surface normals of the triangles formed by the current map point 606 and its first-ring neighbors and averaging these surface normals. Presence of vertical planes is detected (714) by filtering the map points 606 for normal directions 616 perpendicular to a gravity direction 716 and applying a one-point RANSAC algorithm to find a large ratio of map points sharing a similar normal direction 616. Presence of horizontal planes is detected by filtering for normal directions 616 parallel to the gravity direction 716. By these means, the presence of the plurality of planes 614 is promptly detected in the plane detection stage 600A.
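As an illustration of the per-point normal estimation and gravity-based classification just described, the following C++ sketch averages first-ring triangle normals and labels a normal as belonging to a horizontal or vertical plane. The mesh construction (Delaunay triangulation, back-projection, Laplacian smoothing) and the one-point RANSAC grouping are omitted, and the angular thresholds are assumptions rather than values from the disclosure.

```cpp
#include <array>
#include <cmath>
#include <utility>
#include <vector>

using Vec3 = std::array<float, 3>;

static Vec3 Sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
static float Dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static Vec3 Cross(const Vec3& a, const Vec3& b) {
  return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static Vec3 Normalize(Vec3 v) {
  const float n = std::sqrt(Dot(v, v));
  if (n > 0.f) { v[0] /= n; v[1] /= n; v[2] /= n; }
  return v;
}

// Estimate a map point's normal by averaging the surface normals of the
// first-ring triangles (p, a, b) formed with its neighbor pairs in the mesh.
Vec3 MapPointNormal(const Vec3& p,
                    const std::vector<std::pair<Vec3, Vec3>>& first_ring_edges) {
  Vec3 sum = {0.f, 0.f, 0.f};
  for (const auto& [a, b] : first_ring_edges) {
    const Vec3 n = Normalize(Cross(Sub(a, p), Sub(b, p)));  // triangle normal
    sum[0] += n[0]; sum[1] += n[1]; sum[2] += n[2];
  }
  return Normalize(sum);
}

// Classify a map point normal against the gravity direction: a horizontal
// plane has its normal nearly parallel to gravity, a vertical plane nearly
// perpendicular. The 10-degree tolerances are illustrative assumptions.
enum class PlaneClass { kHorizontal, kVertical, kOther };

PlaneClass ClassifyByGravity(const Vec3& normal, const Vec3& gravity_dir) {
  constexpr float kPi = 3.14159265f;
  const float c = std::fabs(Dot(Normalize(normal), Normalize(gravity_dir)));
  if (c > std::cos(10.f * kPi / 180.f)) return PlaneClass::kHorizontal;
  if (c < std::cos(80.f * kPi / 180.f)) return PlaneClass::kVertical;
  return PlaneClass::kOther;
}
```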
[0052] More specifically, in some embodiments, a mesh model 706 or 708 of a 3D virtual space is created (704) based on the plurality of feature points 606. The plurality of planes 614 are identified in the 3D virtual space based on the mesh model 706 or 708. The electronic device determines the plurality of plane normal vectors 616 for the plurality of planes 614. The plurality of feature points 606 are identified in the 3D virtual space corresponding to the scene based on the plurality of image frames 602 and the plurality of motion data using a simultaneous localization and mapping (SLAM) method. Further, in some embodiments, the electronic device identifies a plurality of device poses 712 of the camera 260 associated with the plurality of image frames 602 using the SLAM method. Each device pose 712 includes a device position and a device orientation of the camera 260. The gravity direction 716 associated with the 3D virtual space is determined based on the camera poses 712, and the plurality of planes 614 are detected in the 3D virtual space based on the gravity direction 716.
[0053] In the composition and validation stage 600B, an RGB image 602A is passed through a depth prediction network 718 to obtain a depth map 608. A normal map 610 is determined from the depth map 608 by computing horizontal and vertical depth gradients at each element of the depth map 608. Given the potential plane normal directions 616, the electronic system generates one or more plane masks 720 based on the normal map 610 for each plane normal direction 616. The one or more plane masks 720 identify regions where the surface normal direction from the normal map 610 aligns with a detected plane normal direction 616 (e.g., with a deviation smaller than a direction deviation threshold). In some embodiments, the depth map 608 produces locally smooth depth values with a relatively unreliable depth scale, thereby affecting the normal map 610. The map points 606 are grouped based on the normal directions 616. When the plane masks 720 are generated based on plane directions, the normal map 610 is sampled at the corresponding map points 606 that are projected to 2D coordinates. Instead of filtering the normal map 610 with an average of the map point normals, the electronic system averages these samples. The detected plane directions are thereby mapped to values in the normal map 610, making the masking process robust to depth map scale errors.
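A minimal C++ sketch of the two operations described above — deriving a normal map from gradients of back-projected depth and thresholding it into a plane mask — is given below. The pinhole intrinsics, the central-difference scheme, and the angular threshold are assumptions for illustration; the plane normal passed to PlaneMask is assumed to be unit length.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec3 = std::array<float, 3>;

// Back-project element (x, y) with depth d into camera coordinates using
// assumed pinhole intrinsics (fx, fy, cx, cy).
static Vec3 BackProject(float x, float y, float d, float fx, float fy, float cx, float cy) {
  return {(x - cx) / fx * d, (y - cy) / fy * d, d};
}

// Compute a per-element normal map by crossing the horizontal and vertical
// gradients of the back-projected 3D points; border elements stay zero.
std::vector<Vec3> NormalMapFromDepth(const std::vector<float>& depth, int w, int h,
                                     float fx, float fy, float cx, float cy) {
  std::vector<Vec3> normals(static_cast<std::size_t>(w) * h, Vec3{0.f, 0.f, 0.f});
  for (int y = 1; y + 1 < h; ++y) {
    for (int x = 1; x + 1 < w; ++x) {
      const Vec3 px0 = BackProject(x - 1.f, y, depth[y * w + x - 1], fx, fy, cx, cy);
      const Vec3 px1 = BackProject(x + 1.f, y, depth[y * w + x + 1], fx, fy, cx, cy);
      const Vec3 py0 = BackProject(x, y - 1.f, depth[(y - 1) * w + x], fx, fy, cx, cy);
      const Vec3 py1 = BackProject(x, y + 1.f, depth[(y + 1) * w + x], fx, fy, cx, cy);
      const Vec3 dx = {px1[0] - px0[0], px1[1] - px0[1], px1[2] - px0[2]};
      const Vec3 dy = {py1[0] - py0[0], py1[1] - py0[1], py1[2] - py0[2]};
      Vec3 n = {dx[1] * dy[2] - dx[2] * dy[1],
                dx[2] * dy[0] - dx[0] * dy[2],
                dx[0] * dy[1] - dx[1] * dy[0]};
      const float len = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
      if (len > 0.f) { n[0] /= len; n[1] /= len; n[2] /= len; }
      normals[static_cast<std::size_t>(y) * w + x] = n;
    }
  }
  return normals;
}

// Mark elements whose normal deviates from the detected (unit) plane normal
// direction by less than an angular threshold.
std::vector<std::uint8_t> PlaneMask(const std::vector<Vec3>& normals,
                                    const Vec3& plane_normal, float max_angle_rad) {
  const float cos_thresh = std::cos(max_angle_rad);
  std::vector<std::uint8_t> mask(normals.size(), 0);
  for (std::size_t i = 0; i < normals.size(); ++i) {
    const Vec3& n = normals[i];
    const float c = std::fabs(n[0] * plane_normal[0] + n[1] * plane_normal[1] +
                              n[2] * plane_normal[2]);
    mask[i] = (c >= cos_thresh) ? 1 : 0;
  }
  return mask;
}
```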
[0054] For each plane mask 720, the electronic system segments the respective plane mask 720 into large regions 724 via Connected Component Analysis (CCA) 722. Map points 606 are identified for each connected large region 724 by projecting the map points 606 to 2D (e.g., by reusing the projection results 702). Each plane candidate 612 includes a set of pixels and map points 606. 3D plane parameters are determined from the map points 606 for each plane candidate 612. In a validation operation 726, a map point 606 is associated with a plane candidate 612 as an inlier if its distance to the plane candidate 612 is smaller than a preset threshold, and only those plane candidates with a sufficient number of inliers are kept as valid candidates. In some embodiments, the electronic system stores the pixels of each plane candidate 612 into a 2D occupancy grid having an x-y axis plane that coincides with the plane candidate 612. At this point, the plane candidates 612 are not limited to the first image 602A and are tracked in a global coordinate system. Each map point 606 associated with the plane candidate 612 becomes part of the global map points.
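The validation operation can be illustrated with a short C++ sketch that estimates the plane offset along a detected (unit) normal direction and counts inlier map points; the distance and inlier-count thresholds are placeholders, not values specified by the disclosure.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;

// Plane in Hessian normal form: dot(normal, p) = offset.
struct Plane { Vec3 normal; float offset; };

// Given a detected unit plane normal and the map points associated with one
// connected region of a plane mask, estimate the plane offset and keep the
// candidate only if enough points are inliers.
bool ValidatePlaneCandidate(const Vec3& unit_normal, const std::vector<Vec3>& map_points,
                            float distance_threshold, std::size_t min_inliers, Plane* out) {
  if (map_points.empty()) return false;

  // Offset estimated as the mean projection of the points onto the normal.
  float offset = 0.f;
  for (const Vec3& p : map_points) {
    offset += unit_normal[0] * p[0] + unit_normal[1] * p[1] + unit_normal[2] * p[2];
  }
  offset /= static_cast<float>(map_points.size());

  // Count points whose point-to-plane distance is below the threshold.
  std::size_t inliers = 0;
  for (const Vec3& p : map_points) {
    const float d = unit_normal[0] * p[0] + unit_normal[1] * p[1] +
                    unit_normal[2] * p[2] - offset;
    if (std::fabs(d) < distance_threshold) ++inliers;
  }
  if (inliers < min_inliers) return false;
  if (out) *out = Plane{unit_normal, offset};
  return true;
}
```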
[0055] Stated another way, in some embodiments, the plurality of planes 614 of the scene are represented with a set of plane masks 720. Each plane mask 720 corresponds to a respective plane 614 and has a plurality of elements. Each of a first subset of elements in the respective plane mask 720 has a predefined value (e.g., “1”) indicating that the respective element corresponds to the respective plane captured by a respective set of pixels in the first image 602A. Further, in some embodiments, each plane mask 720 is segmented into one or more regions. Each region of the respective plane mask 720 is associated with a set of feature points 606 mapped in the scene and a subset of pixels of the first image 602A. One or more plane parameters are determined based on the set of feature points 606 of each region of the respective plane mask 720. Additionally, in some embodiments, for a first plane mask 720A, a distance is determined between the first plane mask 720A and each of the set of feature points of the one or more regions. For each feature point, the distance is required to be less than a threshold distance. If the distance between a feature point and the first plane mask 720A is greater than the threshold distance, the feature point is an outlier and is excluded from the set of feature points associated with the regions of the first plane mask 720A. A total number of feature points associated with the one or more regions of the first plane mask 720A is further determined. In accordance with a determination that the total number of feature points is greater than a threshold point number, the electronic system confirms that the first plane mask 720A and an associated plane candidate 612 correspond to a valid plane 614 in the scene.
[0056] In the plane tracking stage 600C, plane candidates 612 are generated at each image frame 602 and subsequently tracked in the global coordinate system. Plane candidates 612 that are consistently tracked across a number of image frames (e.g., more than 5 consecutive image frames) are considered to be stable planes. Each plane's boundary and parameters that are stored in association with the scene are updated by merging with those of the plane candidates 612 detected for each new image frame 602. Such a merge operation is conducted when a plane candidate 612 shares a number of map points with a plane 614 that has been stored in association with the scene.
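The tracking bookkeeping described here can be sketched in C++ as follows: a per-frame candidate is merged into an existing tracked plane when it shares enough map points, and a plane is flagged as stable once it has been observed in enough consecutive frames. The struct fields and the minimum-shared-points threshold are illustrative assumptions; the five-frame stability figure follows the example given above.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// A tracked plane in the global map; only the bookkeeping relevant to tracking
// is sketched here (boundary and parameter storage omitted).
struct TrackedPlane {
  std::unordered_set<int64_t> map_point_ids;  // ids of map points associated with the plane
  int observed_frames = 0;                    // frames in which the plane has been found
  bool stable = false;
};

// Merge a per-frame plane candidate into the tracked planes when it shares
// enough map points with an existing plane; otherwise start a new track.
void TrackCandidate(const std::unordered_set<int64_t>& candidate_points,
                    std::vector<TrackedPlane>& planes,
                    std::size_t min_shared_points = 3, int stable_after_frames = 5) {
  for (TrackedPlane& plane : planes) {
    std::size_t shared = 0;
    for (int64_t id : candidate_points) shared += plane.map_point_ids.count(id);
    if (shared >= min_shared_points) {
      plane.map_point_ids.insert(candidate_points.begin(), candidate_points.end());
      plane.observed_frames += 1;
      plane.stable = plane.stable || plane.observed_frames >= stable_after_frames;
      return;  // merged into an existing tracked plane
    }
  }
  TrackedPlane fresh;
  fresh.map_point_ids = candidate_points;
  fresh.observed_frames = 1;
  planes.push_back(std::move(fresh));
}
```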
[0057] Specifically, in some embodiments, a subset of feature points 606 are identified to correspond to the subset of the plurality of planes 614 (i.e., the one or more plane candidates 612) associated with the first image 602A. The subset of feature points are compared (618) with the plurality of feature points 606 that have been stored in association with a 3D virtual space of the scene. In accordance with a comparison of feature points, the electronic system updates the plurality of planes 614 of the scene with the identified subset of the plurality of planes (i.e., the one or more plane candidates 612) associated with the first image 602A. For example, a new plane is added to the plurality of planes 614. In another example, a normal direction of an existing plane 614 is adjusted. Additionally, in some embodiments, the first plane 614A is identified in a predefined number of (e.g., 5 or more) consecutive image frames 602 including the first plane 614A, and the first plane 614A is determined as a stable plane.
[0058] The plane detection processes 600 and 700 have the advantages of high compatibility, high fidelity, and high accuracy. Specifically, the plane detection processes 600 and 700 use images and motion data applied in SLAM, and do not require additional sensors or computing hardware. The electronic system provides dense 3D reconstruction-like plane boundary output without the dense depth input that would otherwise only be provided by a depth camera. Fusion with SLAM map points adds the scale accuracy of the SLAM backend to the CNN-predicted depth. In some embodiments, the processes 600 and 700 are used when a depth camera image is used in place of the depth map 608 generated from the CNN depth network 718. A depth camera provides highly accurate depth image data, and the corresponding plane candidates can still benefit from SLAM tracking (e.g., in the plane detection stage 600A) and backend optimization (e.g., in the plane tracking stage 600C). Additionally, in some embodiments, map point based plane detection is applied jointly with plane parameter estimation and convex hull generation.
[0059] When the CNN depth network 718 is applied to generate the depth map 608, plane boundaries of the plane candidates 612 are accurately detected on texture regions that have a limited quality and where feature points 606 cannot be applied to identify the plane boundaries accurately. The processes 600 and 700 do not need to use a depth camera, and rely on the existing camera 260 and the IMU 280. The plane boundaries have a high level of detail on the textured regions. It is noted that in the plane detection processes 600 and 700, deep learning techniques are applied and an increased activity level is observed in CPU, GPU, or DSP usage. In some situations, remote procedure calls are used to interact with neural network computing libraries.
[0060] Figure 8 is a flowchart of an example method 800 for identifying planes in an image, in accordance with some embodiments. For convenience, the method 800 is described as being implemented by an electronic system (e.g., a data processing system 200 including an HMD 104D, a mobile device 104C, a server 102, or a combination thereof). Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.

[0061] The electronic system obtains (802) a plurality of image frames 602 captured by a camera 260 and a plurality of motion data 604 captured by an IMU 280. An electronic device (e.g., the HMD 104D) includes the camera 260 and the IMU 280, and is disposed in a scene having a plurality of planes 614. The camera 260 includes (804) one of an RGB camera and a greyscale camera, and each of the plurality of image frames 602 is one of an RGB image and a greyscale image. The electronic system determines (806) a plurality of feature points 606 in the scene based on the plurality of image frames 602 and the plurality of motion data 604. The electronic system generates (808) a depth map 608 from a first image 602A. In some embodiments, the plurality of image frames 602 includes the first image 602A. Alternatively, the plurality of image frames 602 does not include the first image 602A. In some embodiments, the depth map 608 is generated from the first image 602A using a convolutional neural network.
[0062] The electronic system generates (810) a plane normal map 610 from the depth map 608 of the first image 602A. The electronic system identifies (812) a subset of the plurality of planes 614 (i.e., one or more plane candidates 612) associated with the first image 602A based on the plane normal map 610 and the plurality of feature points 606 of the scene. In some embodiments, the electronic system identifies (814) a plurality of plane normal vectors 616 associated with the plurality of planes 614 existing in the scene based on the plurality of feature points 606 in the scene. The subset of the plurality of planes 614 associated with the first image 602A is identified based on the plane normal map 610 and the plurality of plane normal vectors 616 that are identified based on the plurality of feature points 606. Further, in some embodiments, each of the subset of the plurality of planes 614 is defined by a plane boundary and a plane orientation. For each of the subset of the plurality of planes 614 (i.e., each plane candidate 612), the electronic system determines (816) the plane boundary of the respective plane candidate 612 in the scene from the plane normal map 610 of the first image 602A, and determines (818) the plane orientation of the respective plane candidate 612 in the scene from a respective plane normal vector 616.
[0063] Additionally, in some embodiments, when the plurality of plane normal vectors 616 associated with the plurality of planes 614 are identified, the electronic system creates a mesh model 708 of the 3D virtual space based on the plurality of feature points 606, detects the plurality of planes 614 in the 3D virtual space based on the mesh model 708, and determines the plurality of plane normal vectors 616 for the plurality of planes 614. The plurality of feature points 606 are identified in the 3D virtual space corresponding to the scene based on the plurality of image frames 602 and the plurality of motion data 604 using a simultaneous localization and mapping (SLAM) method, e.g., implemented by a SLAM module 232 in Figure 2. Further, in some embodiments, the electronic system identifies a plurality of device poses 712 of the camera 260 associated with the plurality of image frames 602 using the SLAM method, and identifies a gravity direction 716 associated with the 3D virtual space, and the plurality of planes 614 are detected in the 3D virtual space based on the gravity direction 716.
[0064] In some embodiments, the electronic device identifies the subset of the plurality of planes 614 (i.e., the one or more plane candidates 612) associated with the first image 602A, and represents the subset of the plurality of planes 614 with a set of plane masks 720. Each plane mask 720 corresponds to a respective plane 614 (or 612) and has a plurality of elements, and each of a first subset of elements in the respective plane mask 720 has a predefined value indicating that the respective element corresponds to the respective plane captured by a respective set of pixels in the first image 602A. Further, in some embodiments, for each plane mask 720, the electronic system segments the respective plane mask 720 into one or more regions, e.g., via connected component analysis (CCA), and associates each region of the respective plane mask 720 with a set of feature points 606 mapped in the scene and a subset of pixels of the first image 602A. The electronic device determines one or more plane parameters based on the set of feature points 606 of each region of the respective plane mask 720. Additionally, in some embodiments, for a first plane mask 720A, the electronic system determines that a distance of the first plane mask 720A with each of the set of feature points of the one or more regions is less than a threshold distance and that a total number of feature points are associated with the one or more regions of the first plane mask. In accordance with a determination that the total number of feature points is greater than a threshold point number, the electronic system confirms that the first plane mask 720A (i.e., a corresponding plane candidate 612) corresponds to a valid plane 614 in the scene.
[0065] In some embodiments, the electronic device identifies (820) a subset of feature points 606 corresponding to the subset of the plurality of planes 614 (i.e., the one or more plane candidates 612) associated with the first image 602A. The electronic device compares (822) the subset of feature points with the plurality of feature points 606 corresponding to the 3D virtual space of the scene. In accordance with a comparison of feature points, the electronic device updates (824) the plurality of planes 614 of the scene with the identified subset of the plurality of planes associated with the first image 602A.
[0066] In some embodiments, the first image 602A includes a first plane 614A. In accordance with a determination that the first plane 614A is identified in a predefined number of consecutive image frames 602 including the first plane 614A, the electronic system determines (826) that the first plane 614A is a stable plane.
[0067] In various embodiments of this application, the information of the candidate planes 612 is applied to recognize a local environment associated with each image and therefore localize a camera of the electronic device (e.g., the HMD) in the scene. The candidate planes 612 can also be used to update mapping of the scene, e.g., in the plane tracking stage 600C. Additionally, such plane information is used to render virtual objects in some embodiments.
[0068] It should be understood that the particular order in which the operations in Figure 8 have been described are merely exemplary and are not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to detect a plane as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 6 and 7 are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here.
[0069] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0070] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0071] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[0072] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. A plane identification method, wherein an electronic device includes a camera and an inertia motion unit (IMU), the method comprising: obtaining a plurality of image frames captured by the camera and a plurality of motion data captured by the IMU, wherein the electronic device is disposed in a scene having a plurality of planes; determining a plurality of feature points in the scene based on the plurality of image frames and the plurality of motion data; generating a depth map from a first image; generating a plane normal map from the depth map of the first image; and identifying a subset of the plurality of planes associated with the first image based on the plane normal map and the plurality of feature points of the scene.
2. The method of claim 1, further comprising: identifying a plurality of plane normal vectors associated with the plurality of planes existing in the scene based on the plurality of feature points in the scene; wherein the subset of the plurality of planes associated with the first image is identified based on the plane normal map and the plurality of plane normal vectors that are identified based on the plurality of feature points.
3. The method of claim 2, wherein each of the subset of the plurality of planes is defined by a plane boundary and a plane orientation, identifying the subset of the plurality of planes associated with the first image further comprising, for each of the subset of the plurality of planes: determining the plane boundary of the respective plane in the scene from the plane normal map of the first image; and determining the plane orientation of the respective plane in the scene from a respective plane normal vector.
4. The method of any of claims 2 or 3, identifying the plurality of plane normal vectors associated with the plurality of planes further comprising: creating a mesh model of the 3D virtual space based on the plurality of feature points; detecting the plurality of planes in the 3D virtual space based on the mesh model; and determining the plurality of plane normal vectors for the plurality of planes;
wherein the plurality of feature points are identified in the 3D virtual space corresponding to the scene based on the plurality of image frames and the plurality of motion data using a simultaneous localization and mapping (SLAM) method.
5. The method of claim 4, further comprising: identifying a plurality of device poses of the camera associated with the plurality of image frames using the SLAM method; and identifying a gravity direction associated with the 3D virtual space, wherein the plurality of planes are detected in the 3D virtual space based on the gravity direction.
6. The method of any of the preceding claims, wherein identifying the subset of the plurality of planes associated with the first image further comprises: representing the subset of the plurality of planes with a set of plane masks, each plane mask corresponding to a respective plane and having a plurality of elements, each of a first subset of elements in the respective plane mask having a predefined value indicating that the respective element corresponds to the respective plane captured by a respective set of pixels in the first image.
7. The method of claim 6, wherein identifying the subset of the plurality of planes associated with the first image further comprises, for each plane mask: segmenting the respective plane mask to one or more regions; associating each region of the respective plane mask with a set of feature points mapped in the scene and a subset of pixels of the first image; and determining one or more plane parameters based on the set of feature points of each region of the respective plane mask.
8. The method of claim 7, wherein identifying the subset of the plurality of planes associated with the first image further comprises, for a first plane mask: for each region, determining that a distance of the first plane mask with each of the set of feature points is less than a threshold distance; determining that a total number of feature points are associated with the one or more regions of the first plane mask; and in accordance with a determination that the total number of feature points is greater than a threshold point number, confirming that the first plane mask corresponds to a valid plane in the scene.
9. The method of any of the preceding claims, further comprising: identifying a subset of feature points corresponding to the subset of the plurality of planes associated with the first image; comparing the subset of feature points with the plurality of feature points; and in accordance with a comparison of feature points, updating the plurality of planes of the scene with the identified subset of the plurality of planes associated with the first image.
10. The method of any of the preceding claims, wherein the subset of the plurality of planes associated with the first image includes a first plane, further comprising: in accordance with a determination that the first plane is identified in a predefined number of consecutive image frames including the first plane, determining that the first plane is a stable plane.
11. The method of any of the preceding claims, wherein the depth map is generated from the first image using a convolutional neural network.
12. The method of any of the preceding claims, wherein the camera includes one of an RGB camera and a greyscale camera, and each of the plurality of image frames is one of an RGB image and a greyscale image.
13. An electronic system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-12.
14. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-12.
Priority Application (1): PCT/US2021/054697, filed 2021-10-13 — Methods and systems for detecting planar regions using predicted depth

Publication (1): WO2023063937A1, published 2023-04-20

Family ID: 85988791

Legal Events

NENP: Non-entry into the national phase — Ref country code: DE