WO2024096819A1 - System and method for map building and map- and image-based object control - Google Patents

System and method for map building and map- and image-based object control

Info

Publication number
WO2024096819A1
Authority
WO
WIPO (PCT)
Prior art keywords
map
image
area
level
vehicle
Prior art date
Application number
PCT/SG2023/050726
Other languages
French (fr)
Inventor
Shaojun CAI
Original Assignee
National University Of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2024096819A1 publication Critical patent/WO2024096819A1/en


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/20Control system inputs
    • G05D1/24Arrangements for determining position or orientation
    • G05D1/243Means capturing signals occurring naturally from the environment, e.g. ambient optical, acoustic, gravitational or magnetic signals
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30Map- or contour-matching
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/20Control system inputs
    • G05D1/24Arrangements for determining position or orientation
    • G05D1/246Arrangements for determining position or orientation using environment maps, e.g. simultaneous localisation and mapping [SLAM]
    • G05D1/2462Arrangements for determining position or orientation using environment maps, e.g. simultaneous localisation and mapping [SLAM] using feature-based mapping
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D2107/00Specific environments of the controlled vehicles
    • G05D2107/60Open buildings, e.g. offices, hospitals, shopping areas or universities
    • G05D2107/63Offices, universities or schools
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D2109/00Types of controlled vehicles
    • G05D2109/10Land vehicles
    • G05D2109/12Land vehicles with legs
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D2111/00Details of signals used for control of position, course, altitude or attitude of land, water, air or space vehicles
    • G05D2111/10Optical signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis


Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Navigation (AREA)

Abstract

Disclosed are systems and methods for building maps of an area and navigating by those maps. The methods include forming a multi-level map, in which a first level is formed using images of the area that are aligned based on image features appearing in two or more of the images, with co-visibility and traversability determined for each location at which a said image was captured. A second level of the map is formed by defining clusters of the locations based on the co-visibility and traversability of the locations, and assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster. The map is then uploaded to a control system for controlling an object through the area.

Description

System and Method For Map Building And Map- And Image-Based Object Control
Technical Field
The present invention relates, in general terms, to systems for building maps of areas through which robots and autonomous vehicles can be controlled. The present invention also relates to navigating a robot or vehicle through an area using such a map, and to methods relating to building maps and navigation control.
Background
Guide dogs navigate in complex and large-scale environments to help visually impaired individuals become more independent and mobile. For example, a typical use case of guide dogs is to lead the user from their home to a nearby shopping mall. However, unlike a real guide dog, it is possible to program a robot to perform specific tasks in environments it has never seen before.
Robotic mapping is a discipline related to computer vision and cartography. Most current robotic mapping and navigation applications rely on expensive LiDAR sensors and Simultaneous Localization and Mapping (SLAM). These methods may work reasonably well in small-scale environments, such as for service robots in restaurants or home robot vacuums, but they do not suffice for the complex environments frequently encountered in guiding applications.
Using currently available methods for mapping such as SLAM, a device can simultaneously localize (locate itself in a map) and map (create a virtual map of the location). However, complex environments such as rooms or street blocks connected by narrow passages with doors and intersections require a more complex solution.
It would be desirable to alleviate at least one of the above-described problems with existing robotic mapping and navigation systems.
Summary
Described herein is a new topological map representation system. The map can be divided into two levels. In the low (first) level, images are collected from robot on-board cameras and those images are connected based on co-visibility and traversability. In the high (second) level, a clustering algorithm groups low-level locations (at which images were captured) into a set of topological nodes. A local controller is used to travel from one topological node to the next. Fields of application include navigation for robotic systems and autonomous cars.
Disclosed is a system for building a map of an area, the system comprising: memory; a vehicle comprising an image capture device; an image analysis engine; at least one processor; and a control system for controlling the vehicle, wherein the memory stores instructions that, when executed by the at least one processor, cause the at least one processor to: form a first level of the map by: controlling the vehicle, using the control system, to navigate through the area; capturing images of the area using the image capture device, each image corresponding to a location in the area; aligning the images based on image features that appear in two or more of the images, and determining co-visibility and traversability for each location at which a said image was captured; form a second level of the map by: defining clusters of the locations based on the co-visibility and traversability of the locations; assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster; and upload the map to the control system.
Also disclosed is a navigation control system for navigating an object through an area, comprising: an input device; a controller comprising memory, the memory storing a map comprising: a first level formed from images captured at locations in the area, the images being aligned based on image features appearing in two or more of the images and a co-visibility and traversability for each location; a second level formed by defining clusters of the locations based on the co-visibility and traversability of the locations, and assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster; an image capture device; a path planning module; and at least one processor, the at least one processor being programmed to: receive a further image from the image capture device, the further image being captured at a current location of the object; localise the object in the map based on image features in the further image and the first level; receive a destination from the input device; and use the path planning module to determine a path from the current location to the destination via one or more of the switch nodes.
Also disclosed is a method for building a map of an area, the method comprising: forming a first level of the map by: controlling a vehicle to navigate through the area; capturing images of the area using an image capture device on the vehicle, each image corresponding to a location in the area; aligning the images based on image features that appear in two or more of the images, and determining co-visibility and traversability for each location at which a said image was captured; forming a second level of the map by: defining clusters of the locations based on the co-visibility and traversability of the locations; assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster; and uploading the map to a control system for controlling an object through the area.
Also disclosed is a method for navigating an object through an area using a map built according to the method described above, the method comprising: receiving a further image captured at a current location of the object; localising the object in the map based on image features in the further image and the first level; receiving a destination; and using a path planning module to determine a path from the current location to the destination via one or more of the switch nodes.
Advantageously, the map is a topometric map representation of the area. This removes the need for accurate map reconstruction in traditional simultaneous localisation and mapping (SLAM)-based robot systems.
Advantageously, the systems use a learning-based local controller. Such a controller can be generalised to different environments without the need for customisation.
Brief description of the drawings
Embodiments of the present disclosure will now be described, by way of nonlimiting example, with reference to the drawings in which:
Figure 1 illustrates the overall topometric map construction and usage;
Figure 2 illustrates progressive steps in building a topometric map in accordance with an embodiment of the invention;
Figure 3 illustrates local sensorimotor control in accordance with an embodiment of the invention;
Figure 4 is a visualisation of map clustering results in accordance with an embodiment of the invention;
Figure 5 illustrates a block diagram of a system for building a map of an area in accordance with an embodiment of the invention;
Figure 6 illustrates a flowchart for a method for building a map of an area according to an embodiment of the invention;
Figure 7 illustrates a block diagram of a navigation control system in accordance with an embodiment of the invention; and
Figure 8 illustrates a flowchart for a method for navigating an object through an area according to an embodiment of the invention.
Detailed description
Described below are systems and methods for building maps of areas through which robots and autonomous vehicles can be controlled. Also described is the process of navigating a robot or vehicle through an area using such a map, and methods relating to map building and navigation control by tackling the limitations of traditional SLAM methods using abstraction. The abstract topology of environments is captured with just a few traversals. The abstraction typically contains a set of sub-spaces, such as rooms and street blocks, that are connected by narrow passages, such as doors and intersections.
Hereinafter, the system and method for map building and map- and image-based object control according to the present invention will be described in detail with reference to Figure 1 to Figure 8 according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention, and it is envisioned that modifications may be made without departing from the scope of the appended claims.
As used hereinafter, the term "co-visibility" denotes the number of places that a point can see and the number of places from which the point can be seen, and the term "traversability" denotes whether or not a trajectory can be traversed directly between two points that can see each other.
Thus, described are methods for building a map of an area. The methods broadly comprise forming a two-level map and outputting the map. The first level of the map is a low-level graph, and the second level of the map is a high-level graph. The low-level graph is formed by controlling a vehicle to navigate through the area over which the map is to be generated and capturing images of the area using an image capture device, such as a camera, on the vehicle. Each image corresponds to a location in the area, such as the location from which the image was captured. The images are then aligned based on image features that appear in two or more of the images - i.e. co-registration of image features. The co-visibility and traversability for each location is then determined. The high-level graph can then be formed by defining clusters of the locations ascertained during generation of the low-level graph, based on the co-visibility and traversability of the locations. Switch nodes are then assigned to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster. This accounts for, for example, cases where a direct line of sight exists between two points but a robot travelling along that direct line will impact an object disposed slightly laterally (to the side) of the line of sight. In such circumstances the co-visibility of those points will be lower than for the allocated switching node through which the robot navigates between clusters. The switching node, having higher co-visibility, has greater space on either side of the line of sight to the switching node of the cluster to which the robot is to navigate. Once the low- and high-level graphs are generated, the resulting map can be uploaded to a control system for controlling an object, such as the robot, through the area.
With reference to the navigation flowchart 100 of Figure 1, a low-level (metric) graph is generated. The low-level graph 102, denoted as Gl = (Vl, El), is directly constructed from the pre-exploration or exploration sequence 104. During the exploration sequence 104 the robot (or at least an image capture device) moves through the area or environment, capturing images 106 from multiple locations. The low-level graph 102 is constructed to include nodes each denoted as a place node Vi (corresponding to a location at which the corresponding image was captured) with an associated red, green, blue and depth (RGB-D) panoramic image Ii. Place node Vi represents a previously visited place (i.e. a location) in the environment. An edge (or link) in the graph indicates that the two views have sufficient similarity, and that the vehicle or mapping agent (e.g. robot) is able to travel from one node to the other through a simple path - e.g. a linear path. The similarity between two images is measured by the number of matched Scale-Invariant Feature Transform (SIFT) features between the two images. The similarity is sufficient for creation of an edge if the number of matched SIFT features (or image features determined by an alternative image analysis model) is above a predetermined threshold - that threshold may be set by a user or may be empirically determined based on past traversals between nodes, made by an object such as a robot or vehicle. Then, determination of a simple path between the two images is done by comparing the ratio of the geodesic distance to the Euclidean distance between the two images. A simple path exists if the ratio is below 1.2.
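By way of illustration, the edge-creation logic described above can be sketched in Python as follows. This is a minimal sketch, not the disclosed implementation: the helper names (match_sift, geodesic_fn), the match threshold value and the use of OpenCV's SIFT are assumptions.

```python
# Illustrative sketch of low-level graph construction, assuming OpenCV SIFT
# and networkx. `geodesic_fn` (e.g. path length along the exploration
# trajectory) and the match threshold are assumed values, not disclosed ones.
import itertools

import cv2
import networkx as nx
import numpy as np

SIFT_MATCH_THRESHOLD = 50   # assumed; the text leaves the threshold user-set
SIMPLE_PATH_RATIO = 1.2     # geodesic/Euclidean ratio below which a simple path exists

def match_sift(img_a, img_b):
    """Count SIFT matches between two panoramas (Lowe's ratio test)."""
    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    pairs = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance)

def build_low_level_graph(places, geodesic_fn):
    """places: list of dicts with 'image' (grayscale panorama) and 'pos' (x, y)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(places)))
    for i, j in itertools.combinations(range(len(places)), 2):
        if match_sift(places[i]["image"], places[j]["image"]) < SIFT_MATCH_THRESHOLD:
            continue  # insufficient co-visibility for an edge
        euclid = float(np.linalg.norm(np.subtract(places[i]["pos"], places[j]["pos"])))
        if geodesic_fn(i, j) / max(euclid, 1e-6) < SIMPLE_PATH_RATIO:
            g.add_edge(i, j)  # a simple (near-linear) path exists
    return g
```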
After construction of the low-level graph, a high-level graph 108 is constructed. To achieve this, normalized cut 110 is performed iteratively on the low-level graph Gl. Each normalised cut cuts through bottlenecks corresponding to the narrow passages. A narrow passage may be a passage that has a line of sight to a destination location, but has obstacles off the line of sight that would impact the vehicle if the vehicle were to traverse along the line of sight.
The cuts result in a high-level graph Gh = (C, Eh) that contains a number of clusters 112. Each cluster Ci corresponds to a sub-space that is termed a "context". Due to the definitions in the low-level graph 102, the place nodes in each context share similar visual information (co-visibility) and permit travel between each other (traversability). Based on the graph, a switch node that connects two neighboring contexts is defined. For an edge between two neighboring contexts Ci and Cj, the switch node is defined as the place node in Ci that has the most low-level edges to nodes in Cj. For each context, six place nodes are randomly sampled, and these nodes are defined as the local map of the context. Downsampling place nodes reduces redundancy and improves the computational efficiency of localization.
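The clustering and switch-node assignment could be approximated as below. This sketch substitutes scikit-learn's spectral clustering (a normalized-cut style partition with a fixed cluster count) for the iterative normalized cut described above; the fixed n_contexts and all function names are illustrative assumptions.

```python
# Approximation of the high-level graph step: a normalized-cut style spectral
# partition of the low-level graph, switch nodes chosen as the place node with
# the most cross-context low-level edges, and six nodes sampled per context.
# Fixing `n_contexts` up front differs from the iterative cut in the text.
import random

import networkx as nx
from sklearn.cluster import SpectralClustering

def build_high_level_graph(g_low, n_contexts):
    adj = nx.to_numpy_array(g_low)
    labels = SpectralClustering(
        n_clusters=n_contexts, affinity="precomputed", assign_labels="discretize"
    ).fit_predict(adj)

    g_high = nx.Graph()
    g_high.add_nodes_from(range(n_contexts))
    switch_nodes = {}
    for ci in range(n_contexts):
        for cj in range(n_contexts):
            if ci == cj:
                continue
            # For every place node in context ci, count low-level edges into cj.
            cross = {v: sum(1 for u in g_low.neighbors(v) if labels[u] == cj)
                     for v in g_low.nodes if labels[v] == ci}
            node, n_edges = max(cross.items(), key=lambda kv: kv[1], default=(None, 0))
            if n_edges > 0:
                g_high.add_edge(ci, cj)
                switch_nodes[(ci, cj)] = node  # place node in ci bordering cj

    # Local maps: six randomly sampled place nodes per context.
    local_maps = {c: random.sample([v for v in g_low.nodes if labels[v] == c],
                                   k=min(6, int((labels == c).sum())))
                  for c in range(n_contexts)}
    return g_high, switch_nodes, local_maps
```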
The two-level graph serves multiple purposes. Firstly, the mapping agent can use the most relevant information (i.e., place nodes in its immediate context) for localization, which improves efficiency and accuracy. Secondly, switch nodes signal the need for context change and serve as goals that guide local navigation. Thirdly, the map acts as an abstraction that decouples low-level control from the environment, enabling the controller to learn robust obstacle avoidance skills that are orthogonal to global environments.
Figure 1 illustrates this concept. The leftmost inset shows panoramas 106 taken at locations in the area over which the topometric map is to be built. The centre inset illustrates edges connecting co-visible ones of the locations between which a simple path is defined. The rightmost inset illustrates the switch nodes generated for each cluster, that connect various regions or sub-areas of the map - e.g. different rooms or other areas within which there is essentially little or no difficulty navigating.
Figure 2 illustrates the use of the map generated in accordance with Figure 1. A path planning module (see 714 in Figure 7) uses the map generated in accordance with Figure 1 to produce a control sequence for controlling an autonomous vehicle or robot through the area. To achieve this, the navigation control system (see 700 in Figure 7) obtains the map 200 and receives a further image from the image capture device. The further image will be captured at a current location of the object (i.e. the autonomous vehicle or robot). The object is localised, per step 202, in the map based on image features in the further image and the first level map. Localisation involves identifying the cluster in which the object is located, per 204, and computing the coordinate of the object in the cluster, per 206. A destination is received from an input device, the destination being the location in the area to which the robot must navigate. The path planning module then determines a path, per 208, from the current location, reflected in the current cluster and current coordinates, to the destination via one or more of the switch nodes. The path may be the shortest topological path between the current position and the destination. Where the path is not calculated by a processor in the object, the path generated by the path planning module can be uploaded to the object and a controller can then control the object to move from the current location to the destination along the path, per 210.
Localisation can involve any appropriate method. In the present embodiment, localisation involves context and metric localisation. Context localization is formulated as a place recognition problem. Pre-trained AlexNet, or some other feature recognition framework, is used to extract features from visual observations, and cosine similarity between features is used as a distance measure. To find the current context of the agent, the place node that is most similar to the current agent observation is identified among local maps of all contexts. The context of the most similar place node is the context of the agent.
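A minimal sketch of this place-recognition step, assuming torchvision's pre-trained AlexNet as the feature extractor and pre-computed features for each local-map node, might look as follows; the layout of local_maps is an assumption.

```python
# Context localisation as place recognition: AlexNet convolutional features
# compared by cosine similarity. `local_maps` maps each context id to a list
# of (node_id, unit-norm feature) pairs; this layout is an assumption.
import torch
import torch.nn.functional as F
from torchvision import models

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

def extract_features(batch):
    """batch: normalised (1, 3, 224, 224) RGB tensor."""
    with torch.no_grad():
        feats = alexnet.features(batch)
    return F.normalize(feats.flatten(1), dim=1)  # unit vector, so dot = cosine

def localise_context(observation, local_maps):
    obs = extract_features(observation)
    ctx, node, _ = max(
        ((c, n, float(obs @ feat.T)) for c, nodes in local_maps.items()
         for n, feat in nodes),
        key=lambda t: t[2],
    )
    return ctx, node  # context of the most similar place node, and that node
```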
To perform metric localisation, the metric pose of the current observation is calculated with respect to the switch node with the help of the local map. The object is thus localised based on features in the image corresponding to the current location of the object, and the path between that current location and the location of the nearest switching node (or the switching node in the same cluster as the object's current location) is calculated. Relatedly, SIFT feature matching is performed between the current observation and the features of the nodes in the local map. The matched map features are unprojected into 3D space and transformed into the coordinate frame of the switch node. Then, a perspective-n-point (PnP) algorithm with RANSAC is performed to estimate the metric pose between the current observation and the switch node. If there are insufficient inlier matches, the localization fails and the previously successfully estimated pose is used instead.
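The metric-localisation step can be illustrated with OpenCV's PnP-with-RANSAC solver. In this sketch, the correspondence arrays, the camera intrinsics and the inlier threshold are assumptions.

```python
# Metric localisation sketch: estimate the camera pose relative to the switch
# node from 2D-3D correspondences, falling back to the last successful pose
# when inliers are insufficient. The inlier threshold (10) is assumed.
import cv2
import numpy as np

def metric_localise(points_3d, points_2d, camera_matrix, last_pose):
    """points_3d: (N, 3) map points in the switch node's frame; points_2d: (N, 2)."""
    if len(points_3d) < 4:
        return last_pose  # PnP needs at least four correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        camera_matrix, None,  # None: assume undistorted images
    )
    if not ok or inliers is None or len(inliers) < 10:
        return last_pose  # localisation failed; reuse the previous estimate
    rotation, _ = cv2.Rodrigues(rvec)  # rotation matrix from Rodrigues vector
    return rotation, tvec
```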
Localisation of the object on the map affords path planning and subsequent control of the object to the destination. For global path planning, a weighted A-star (A*) search is used to find a path from the initial pose (current location) to the goal (destination) on the high-level map in the path planning module. The resultant path passes through a sequence of switch nodes Si, Si+1, ..., Sn. For the controller, the immediate next switch node Si is used as the local navigation target.
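As a sketch, the weighted A* search could be run with networkx, inflating a Euclidean heuristic between per-context coordinates; the coords layout, the edge "weight" attribute and the inflation factor are assumptions.

```python
# Weighted A* over the high-level graph: the heuristic is inflated by
# eps >= 1, trading optimality for search speed. `coords` holds an assumed
# metric position per context (e.g. the position of its switch node).
import networkx as nx
import numpy as np

def plan_switch_sequence(g_high, coords, start_ctx, goal_ctx, eps=1.5):
    def heuristic(a, b):
        return eps * float(np.linalg.norm(np.subtract(coords[a], coords[b])))
    return nx.astar_path(g_high, start_ctx, goal_ctx,
                         heuristic=heuristic, weight="weight")
```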
As illustrated in Figure 3, the controller takes as input the current depth observation Id 300 and a relative pose Psi that contains the relative distance t and angle (direction vector 302) from the object's current position to the next switch node Si. The output of the controller is a discrete action, instructing the object to move along a predetermined trajectory to the switching node Si. As a pre-processing step, the distance is discretised with bins of 0.75 m and the angle with bins of 5 degrees. Other bin sizes can be used where the object is a different size, affecting traversability, where gaps through which the object must navigate are of a different size to those of the present area, or where the area includes obstacles.
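The stated bin sizes suggest a pre-processing helper along these lines; the bin origins and rounding convention are assumed details.

```python
# Discretise the relative pose into the stated bins: 0.75 m for distance,
# 5 degrees for angle. The rounding convention is an assumed detail.
def discretise_pose(distance_m, angle_deg, dist_bin=0.75, angle_bin=5.0):
    return int(distance_m // dist_bin), int(round(angle_deg / angle_bin))
```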
The controller 304 is parameterized with a deep neural network (e.g. a convolutional neural network (CNN)) and trained with imitation learning. Based on the direction vector 302 and the depth 300, when processed through the CNN 306, a navigation instruction 308 is issued by which the object can autonomously move to the switching node - for subsequent movements, the instruction will generally direct the object to the next switching node, or the destination, along the path to the destination. The learned policy can zero-shot transfer to the real world.
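A minimal PyTorch sketch of such a controller is given below. The layer sizes and the four-way action set are assumptions, and training would be behaviour cloning against expert actions rather than the specific training regime used in the disclosure.

```python
# Local controller sketch: a small CNN over the depth image, fused with the
# discretised relative pose, emitting logits over discrete actions. Trained
# with imitation learning, e.g. cross-entropy against expert demonstrations.
import torch
import torch.nn as nn

class LocalController(nn.Module):
    def __init__(self, n_actions=4):  # assumed action set, e.g. forward/left/right/stop
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 + 2, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, depth, rel_pose):
        """depth: (B, 1, H, W) depth image; rel_pose: (B, 2) distance/angle bins."""
        z = self.encoder(depth)
        return self.head(torch.cat([z, rel_pose], dim=1))

# Behaviour-cloning step (sketch):
#   loss = nn.CrossEntropyLoss()(controller(depth, rel_pose), expert_actions)
```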
Figure 4 shows the visualization of map clustering results. In Figure 4, image (a), a low-level map is shown in a condition prior to down-sampling (i.e. clustering and allocation of switching nodes), where each dot is a place node (location at which an image was captured). In Figure 4, image (b), the down-sampled low-level map is shown, where each color denotes one context. Notably, very little of the metric map is covered by switching nodes since many locations are suboptimal for traversal through the area.
The main navigation results are shown in Table I. In short, the system of embodiments of the present invention has a success rate of 82.0% and 66.3% in the standard and curved setting respectively, significantly outperforming baselines.
Path Type   Model                      Succ   SPL
standard    NoMetric                   57.9   50.1
standard    NoTopo-EndGoal             69.9   62.8
standard    NoTopo-PlaceGoal           78.0   66.2
standard    NoTopo-EndGoal-GTPose      87.6   86.0
standard    NoTopo-PlaceGoal-GTPose    90.0   74.9
standard    Ours-GTPose                89.5   87.7
standard    Ours                       82.0   77.9
curved      NoMetric                   47.1   32.9
curved      NoTopo-EndGoal             48.0   38.1
curved      NoTopo-PlaceGoal           45.0   36.1
curved      NoTopo-EndGoal-GTPose      65.3   56.2
curved      NoTopo-PlaceGoal-GTPose    68.3   50.2
curved      Ours-GTPose                80.0   68.5
curved      Ours                       66.3   53.9
TABLE I: Navigation performance in the Habitat simulator.
The metric information remains highly relevant. In the absence of metric information, the baseline navigation system has a much lower success rate compared to the present methodology. This is because it estimates directly the direction vector between the current observation and the switch node. This performs particularly poorly in environments where the switch nodes are spread out. For example, in a long corridor where the switch nodes are only at the two ends, the direction estimation is inherently impractical due to the limited overlap between the two images.
Similarly, high-level topological information is important. In the absence of topological information, the baseline experiences a significant drop in success rate from the present methodology. This is because it estimates a global vector pointing directly towards the goal and thus lacks global topology information.
Figure 5 illustrates a block diagram of a system for building a map of an area 500 in accordance with an embodiment of the invention. According to Figure 5, the system comprises memory 502, a vehicle 504 comprising an image capture device 506, an image analysis engine 516, at least one processor 508 and a control system 510 for controlling the vehicle. The memory 502 comprises instructions executable by the at least one processor 508 to cause the at least one processor to perform at least a part of the method of the embodiments described herein. The instructions cause the at least one processor 508 to form a first level of the map 512 before then forming a second level of the map 514 and outputting the map to a control system.
Some methods used herein are embodied by Figure 6, in which the method 600 for building a map of an area involves the at least one processor 508 forming a first level of the map (step 602). This is achieved by controlling the vehicle, using the control system, to navigate through the area (604). While doing so, the vehicle captures images of the area using the image capture device, with each image corresponding to a location in the area (606). The images are then aligned based on image features appearing in two or more of the images as discussed above. Once aligned, the co-visibility and traversability can be determined for each location at which one of the images was captured (608). As discussed previously, traversability is determined by checking for a simple path between two images, by comparing the ratio of the geodesic distance to the Euclidean distance between them. A simple path exists if the ratio is below 1.2.
Further, the at least one processor 508 forms a second level of the map (step 610) by defining clusters of the locations based on the co-visibility and traversability of the locations (612). This is achieved by performing normalized cuts (defined, e.g., in https://ieeexplore.ieee.org/abstract/document/868688) iteratively on the low-level graph, each cut cutting through bottlenecks corresponding to the narrow passages. Further, a switch node is assigned to each cluster, with each switch node corresponding to a passage for the vehicle to traverse the respective cluster (614). The switch node may be the node with the highest covisibility and/or traversability, where traversability may be favoured over covisibility, particularly for undulating maps with step changes in floor height that do not affect covisibility. The switch node may be located at a location at which an image was captured during generation of the low-level graph, or may result from interpolation between two or more locations in the low-level graph where those locations have the highest traversability and covisibility of the nodes in the relevant cluster. Finally, the processor 508 uploads the map to the control system (616) for controlling an object through the area.
Figure 7 illustrates a block diagram of a navigation control system 700 in accordance with an embodiment of the invention. According to Figure 7, the system comprises an input device 716 for requesting or supplying a destination, a controller 706 comprising memory 708, in which the memory stores a first level graph or map 710 and a second level graph or map 712. The system further comprises an image capture device 702, a path planning module 714 and at least one processor 704. The at least one processor 704 is programmed to perform at least a part of the method of the embodiments described herein.
Some methods used herein are embodied by Figure 8, in which the method 800 involves the at least one processor being programmed to receive a further image from the image capture device, the further image being captured at a current location of the object (step 802). Then, the at least one processor localises the object in the map based on image features in the further image and the first level of the map (804). A destination from the input device is then received by the at least one processor in step 806. Finally, the at least one processor uses the path planning module to determine a path from the current location to the destination via one or more of the switch nodes (808).
As an object navigates through the environment or area, based on the map, control may be maintained by the object supplying, and the navigation control system receiving, a sequence of images (e.g. a video feed) from the image capture device of the object as the object moves through the environment or area. From these images, the navigation control system calculates, from each image in the sequence, a depth observation and relative pose to the switch node that is next along the path. The navigation control system then controls the object using discrete actions to repeatedly align the object with the path.
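The closed loop described above might be organised as follows; the frame interface, arrival radius and helper names are illustrative assumptions rather than disclosed details.

```python
# Closed-loop control sketch: each frame yields a depth observation and a
# relative pose to the next switch node, which the controller maps to a
# discrete action; the loop advances to the next switch node on arrival.
def navigation_loop(frames, switch_sequence, controller, compute_relative_pose):
    idx = 0
    for frame in frames:  # frame: assumed object exposing a .depth image
        rel = compute_relative_pose(frame, switch_sequence[idx])
        if rel.distance < 0.5:  # assumed arrival radius in metres
            idx += 1
            if idx >= len(switch_sequence):
                return  # destination reached
            continue
        yield controller(frame.depth, rel)  # discrete action realigning with the path
```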
The present methodologies have been evaluated using a quadruped robot in a school of computing which contains both indoor and outdoor environments. The methods disclosed herein can navigate robustly at places with limited textures where SLAM-based methods fail most of the time.
While SLAM is used in some prior art methodologies, it has its limitations. Firstly, SLAM is sensitive to motion blur. Unlike common wheeled robots, quadruped robots unavoidably have more vibrations due to their locomotion. The vibration causes motion blur in the camera views, especially during rapid turns, which causes feature matching to fail. It is found that SLAM is not able to successfully map past some intersections. Secondly, Real-Time Appearance Based Mapping (RTAB-Map) almost certainly fails during self-rotations, which happen when the robot reaches the end of a path or a corridor and needs to take a 180-degree turn back.
Due to difficulty in feature matching, RTAB-Map can only localize in feature-rich environments. Localisation can be brittle in environments with fewer distinctive image features by which to localise the object. Furthermore, the system cannot recover once the localization is lost, thus yielding poor performance in large-scale navigation.
The projected 2D occupancy grid map produced using RTAB-Map contains spurious obstacles, i.e., dots along the corridor at locations that are not actually occupied. This can result from inaccurate point clouds generated during the mapping process, e.g., where the depth sensor readings are noisy. When the occupancy map is too noisy, the planner is unable to find a feasible path to the goal, which causes navigation to fail.
The methodologies presented herein outperform RTAB-Map SLAM in all tested routes. In particular, when the routes cover featureless environments, e.g., the indoor COM1 building of the National University of Singapore, RTAB-Map could not even successfully map the environment, causing the baseline to fail completely. In contrast, the present methodologies do not require reconstruction or metric maps and are thus free from such failures. This is made possible by the data-driven local controller, which learns robust obstacle avoidance skills from data so that the map does not need to capture the notion of free space.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims
1. A system for building a map of an area, the system comprising: memory; a vehicle comprising an image capture device; an image analysis engine; at least one processor; and a control system for controlling the vehicle, wherein the memory stores instructions that, when executed by the at least one processor, cause the at least one processor to: form a first level of the map by: controlling the vehicle, using the control system, to navigate through the area; capturing images of the area using the image capture device, each image corresponding to a location in the area; aligning the images based on image features that appear in two or more of the images, and determining co-visibility and traversability for each location at which a said image was captured; form a second level of the map by: defining clusters of the locations based on the co-visibility and traversability of the locations; assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster; and upload the map to the control system for controlling an object through the area.
2. The system of claim 1, wherein each location in the first level of the map is a node with an associated panoramic image captured by the image capture device at the location.
3. The system of claim 2, wherein the at least one processor assembles the nodes into a graph, wherein an edge between two said nodes in the graph indicates that the vehicle can navigate between the locations corresponding to the two nodes along a simple path.
4. The system of claim 3, wherein a simple path comprises a path with a ratio of geodesic distance to Euclidean distance of less than 1.2.
5. The system of any one of claims 1 to 4, wherein defining clusters comprises defining a context for a sub-space of the area, the sub-space comprising locations that share image features and between which the vehicle can directly traverse.
6. The system of any one of claims 1 to 5, wherein assigning a switch node to each cluster comprises assigning the switch node for each cluster based on a location with a highest number of edges to a neighbouring cluster.
7. A navigation control system for navigating an object through an area, comprising: an input device; a controller comprising memory, the memory storing a map comprising: a first level formed from images captured at locations in the area, the images being aligned based on image features appearing in two or more of the images and a co-visibility and traversability for each location; a second level formed by defining clusters of the locations based on the co-visibility and traversability of the locations, and assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster; an image capture device; a path planning module; and at least one processor, the at least one processor being programmed to: receive a further image from the image capture device, the further image being captured at a current location of the object; localise the object in the map based on image features in the further image and the first level; receive a destination from the input device; and use the path planning module to determine a path from the current location to the destination via one or more of the switch nodes.
8. The navigation control system of claim 7, further comprising the object, the controller controlling the object from the current location to the destination along the path.
9. The navigation control system of claim 7 or 8, wherein localising the vehicle comprises performing context localisation and metric localisation.
10. The navigation control system of claim 9, wherein performing context localisation is based on cosine similarity between features in the further image and image features of the locations in the first level.
11. The navigation control system of claim 9 or 10, wherein performing metric localisation comprises estimating a pose between the current location and the switch node of a cluster by solving a perspective-n-point (PnP) problem based on the image features.
12. The navigation control system of any one of claims 7 to 11, wherein the path planning module determines the path using a weighted A-star search of the switch nodes.
13. The navigation control system of claim 8, wherein the controller: receives a sequence of images from the image capture device; calculates, from each image in the sequence, a depth observation and relative pose to the switch node that is next along the path; and controls the object using discrete actions to repeatedly align the object with the path.
14. A method for building a map of an area, the method comprising: forming a first level of the map by: controlling a vehicle to navigate through the area; capturing images of the area using an image capture device on the vehicle, each image corresponding to a location in the area; aligning the images based on image features that appear in two or more of the images, and determining co-visibility and traversability for each location at which a said image was captured; forming a second level of the map by: defining clusters of the locations based on the co-visibility and traversability of the locations; assigning a switch node to each cluster, each switch node corresponding to a passage for the vehicle to traverse the respective cluster; and uploading the map to a control system for controlling an object through the area.
15. The method of claim 14, wherein each location in the first level of the map is a node with an associated panoramic image captured by the image capture device at the location, the method comprising assembling the nodes into a graph, wherein an edge between two said nodes in the graph indicates that the vehicle can navigate between the locations corresponding to the two nodes along a simple path.
16. The method of claim 14 or 15, wherein defining clusters comprises defining a context for a sub-space of the area, the sub-space comprising locations that share image features and between which the vehicle can directly traverse.
17. The method of any one of claims 14 to 16, wherein assigning a switch node to each cluster comprises assigning the switch node for each cluster based on a location with a highest number of edges to a neighbouring cluster.
18. A method for navigating an object through an area using a map built according to the method of any one of claims 14 to 17, comprising: receiving a further image captured at a current location of the object; localising the object in the map based on image features in the further image and the first level; receiving a destination; and using the path planning module to determine a path from the current location to the destination via one or more of the switch nodes.
19. The method of claim 18, further comprising controlling the object from the current location to the destination along the path.
20. The method of claim 18 or 19, wherein localising the vehicle comprises performing context localisation based on cosine similarity between features in the further image and image features of the locations in the first level, and performing metric localisation by estimating a pose between the current location and the switch node of a cluster by solving a perspective-n-point (PnP) problem based on the image features.
PCT/SG2023/050726 2022-11-02 2023-11-01 System and method for map building and map- and image-based object control WO2024096819A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202251590R 2022-11-02

Publications (1)

Publication Number Publication Date
WO2024096819A1 true WO2024096819A1 (en) 2024-05-10

Family

ID=90931697

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2023/050726 WO2024096819A1 (en) 2022-11-02 2023-11-01 System and method for map building and map- and image-based object control

Country Status (1)

Country Link
WO (1) WO2024096819A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220028156A1 (en) * 2019-10-28 2022-01-27 Zillow, Inc. Generating Floor Maps For Buildings From Automated Analysis Of Visual Data Of The Buildings' Interiors
US20220164493A1 (en) * 2019-08-28 2022-05-26 Zillow, Inc. Automated Tools For Generating Mapping Information For Buildings
US20220196409A1 (en) * 2020-12-23 2022-06-23 Here Global B.V. Method, apparatus, and computer program product for establishing three-dimensional correspondences between images
US20220196405A1 (en) * 2020-12-17 2022-06-23 Adobe Inc. Systems for Generating Indications of Traversable Paths



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23886448

Country of ref document: EP

Kind code of ref document: A1