WO2023165718A1 - Apparatus and methods for visual localization with compact implicit map representation - Google Patents

Apparatus and methods for visual localization with compact implicit map representation

Info

Publication number
WO2023165718A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose
anchors
map
image
camera
Application number
PCT/EP2022/058974
Other languages
French (fr)
Inventor
Arthur MOREAU
Nathan PIASCO
Dzmitry Tsishkou
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023165718A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30Map- or contour-matching



Abstract

The present disclosure refers to a method of localizing a mobile apparatus in an area of interest, comprising the steps of: capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image; determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose; performing iterations comprising the steps of selecting a pool of pose anchors from a map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and identifying a number of pose anchors with highest similarity scores; wherein an initial iteration is performed based on an initial predefined pool of pose anchors, and in each subsequent iteration the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and estimating the current camera pose based on the pose anchors identified in the iterations. The present disclosure further refers to a corresponding mobile apparatus.

Description

Apparatus and methods for visual localization with compact implicit map representation
TECHNICAL FIELD
The present disclosure relates to a method of localizing a mobile apparatus in an area of interest and a corresponding mobile apparatus.
BACKGROUND
The disclosure addresses the relocalization problem of a mobile platform in a known environment using images, i.e. recovering the precise 6 or 3 Degrees of Freedom (DoF) of a mobile platform within a map from an image taken from its visual sensor. It is widely used in mobile robotics, advanced driver assistance systems (ADAS), autonomous driving and augmented reality systems.
Visual relocalization systems can use different types of deep learning based algorithms. One approach consists of storing dense representations of the environment content, enabling camera pose estimation with geometric reasoning, at the cost of a high computational load and a heavy memory footprint. Other approaches bypass this problem by direct regression of the camera pose, resulting in lower accuracy.
SUMMARY
In view of the above, it is an objective underlying the present disclosure to overcome at least some of the disadvantages indicated above.
The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, a method of localizing a mobile apparatus in an area of interest is provided. The method comprises the steps of capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image; determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose; performing iterations comprising the steps (i) - (iv) as follows: (i) selecting a pool of pose anchors from a map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; (ii) generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; (iii) comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and (iv) identifying a number of pose anchors with highest similarity scores; wherein an initial iteration is performed based on an initial predefined pool of pose anchors, and in each subsequent iteration the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and the method comprises the further step of estimating the current camera pose based on the pose anchors identified in the iterations.
Initially, the current camera pose is unknown. In fact, the current camera pose is to be determined by the method according to the present disclosure. The camera pose may include coordinate values and one or more orientation/angle values. The present disclosure involves an implicit map representation that enables compression of map-specific content into a lightweight representation, such that localization in large environments can be performed efficiently. The accuracy of the method is not bounded by the density of reference poses. The terms localization/localizing and re-localization are used synonymously in the present disclosure.
The iterations may be performed a number of times until the further step of estimating the current camera pose is performed. This number of iterations may be predefined or predetermined, and may be based on a precision criterion for the camera pose, for example, or the number of iterations may be determined during the iterations based on a convergence criterion of the pose anchors in the sequence of iterations.
According to an implementation, the pre-trained image encoder may comprise a set of predetermined parameters and the map representation may comprise a set of predetermined parameters.
According to an implementation, the predetermined parameters may include weights of a neural network and are optionally provided in the form of respective parameter vectors.
According to an implementation, in each iteration the region of the map representation used to select new anchor poses, based on the anchors identified in the previous iteration, may be decreased; in particular, in each iteration new anchors closest to the anchors identified in the previous iteration may be selected to refine the pose estimate.
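A minimal PyTorch sketch of this shrinking-region proposal step is given below. The function name, the uniform perturbation model and the halving of the radius are illustrative assumptions, not specifics of the disclosure.

```python
import torch

def propose_anchors(best: torch.Tensor, n_new: int, radius: float) -> torch.Tensor:
    """Sample a new pool of pose anchors around the best anchors kept so far.

    best:   (k, pose_dim) anchors identified in the previous iteration.
    radius: current half-width of the sampling region; the caller decreases
            it (e.g. halves it) after every iteration, so that later pools
            concentrate around the emerging pose estimate.
    """
    idx = torch.randint(0, len(best), (n_new,))              # pick base anchors
    offsets = (torch.rand(n_new, best.shape[1]) - 0.5) * 2 * radius
    return best[idx] + offsets                               # perturbed copies
```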
According to an implementation, the similarity score may be based on a measure of similarity using the image signature and the respective generated map signature.
According to an implementation, the method may comprise an initial step of receiving, from a server, the map representation of the area of interest.
According to an implementation, the method may comprise a further step of receiving, from the server, a further map representation of a further area of interest when the mobile apparatus moves towards or into that further area of interest.
According to an implementation, each map representation may have been previously obtained by performing the steps of obtaining training data in the area of interest using respective cameras of one or more mobile devices moving in the area of interest, the training data comprising image data and camera pose data; transmitting the obtained training data to a remote computing device, such as the server or a cloud computing device; and using the training data to train the map representation. This has the advantage that the reference image database can grow continually, improving the accuracy while keeping a fixed-size memory footprint.
According to an implementation, the image encoder of the mobile apparatus may be pre-trained once by performing the steps of providing reference images and corresponding reference camera poses; and training the image encoder by feeding the image encoder with the reference images and adjusting parameters of the image encoder by comparing an output of the image encoder with the reference camera poses.
According to an implementation, training the image encoder and training the map representation may be performed jointly, in particular at least partially using the same images and camera poses.
According to an implementation, the step of estimating the current pose of the camera based on the pose anchors identified in the iterations may comprise selecting the pose with the maximum score, or computing an average or a weighted average of the pose anchors.
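For illustration, a minimal sketch of these two estimation variants follows. The softmax weighting is an assumption; note also that a plain weighted average is only valid for the translational components, while rotational components would require averaging on the rotation manifold (e.g. via quaternions), which is omitted here.

```python
import torch

def estimate_pose(anchors: torch.Tensor, scores: torch.Tensor,
                  mode: str = "weighted") -> torch.Tensor:
    """anchors: (N, pose_dim) retained pose anchors; scores: (N,) similarities."""
    if mode == "max":
        return anchors[scores.argmax()]              # pose with maximum score
    w = torch.softmax(scores, dim=0)                 # assumed weighting scheme
    return (w.unsqueeze(1) * anchors).sum(dim=0)     # score-weighted average
```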
According to a second aspect, a mobile apparatus is provided. The mobile apparatus comprises a camera for capturing an image, the camera having a current camera pose when capturing the image; a pre-trained image encoder for determining an image signature based on the image, the image signature being a representation of the current camera pose; a memory for storing a map representation of the area of interest; and processing circuitry configured to perform iterations comprising the steps (i) - (iv) as follows: (i) selecting a pool of pose anchors from the map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; (ii) generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; (iii) comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and (iv) identifying a number of pose anchors with highest similarity scores; wherein the processing circuitry is further configured to perform an initial iteration based on an initial predefined pool of pose anchors, and in each subsequent iteration to perform the step of selecting the pool of pose anchors based on the pose anchors identified in the previous iteration; and estimate the current camera pose based on the pose anchors identified in the iterations.
The iterations may be performed a number of times until the further step of estimating the current camera pose is performed.
The explanations and advantages provided above for the method according to the first aspect and its implementations apply correspondingly to the mobile apparatus according to the second aspect and its implementations. In order to avoid repetition, these are omitted here and in the following.
According to an implementation, the pre-trained image encoder may comprise a set of predetermined parameters and the map representation may comprise a set of predetermined parameters.
According to an implementation, the predetermined parameters may include weights of a neural network and may be provided in the form of respective parameter vectors.
According to an implementation, in each iteration the region of the map representation used to select new anchor poses, based on the anchors identified in the previous iteration, may be decreased; in particular, in each iteration new anchors closest to the anchors identified in the previous iteration may be selected to refine the pose estimate.
According to an implementation, the similarity score may be based on a measure of similarity using the image signature and the respective generated map signature.
According to an implementation, the mobile apparatus may comprise a receiver configured to receive, from a server, the map representation of the area of interest.
According to an implementation, the receiver may be further configured to receive, from the server, a further map representation of a further area of interest when the mobile apparatus moves towards or into that further area of interest.
According to an implementation, the processing circuitry may be configured to estimate the current pose of the camera based on the pose anchors identified in the iterations by selecting the pose with the maximum score or by computing an average or a weighted average of the pose anchors.
According to a third aspect, a system is provided. The system comprises one or more mobile devices, each having a camera for capturing images in an area of interest; a localization device for obtaining a respective camera pose corresponding to the captured images; a transmitter for transmitting training data comprising image data of the captured images and camera pose data of the obtained camera poses; and a remote computing device, such as a server or a cloud computing device, for receiving the transmitted training data, and for training a map representation of the area of interest using the training data.
According to an implementation, the remote computing device may be configured to transmit the map representation of the area of interest to a mobile apparatus.
According to a fourth aspect, a computer program is provided. The computer program comprises instructions which, when the program is executed by a computer, cause the computer to carry out the method according to the first aspect or any implementation thereof.
According to a fifth aspect, a computer-readable medium is provided. The computer-readable medium comprises instructions which, when executed by a computer, cause the computer to carry out the method according to the first aspect or any implementation thereof.
According to the disclosure, a compact learned representation of the environment enables real-time localization with high accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:
Figure 1 illustrates the localization solution for mobile platforms.
Figure 2 illustrates a localization process.
Figure 3 illustrates a training process.
Figure 4 illustrates a computational workflow.
Figure 5 illustrates localization on multiple maps.
Figure 6 illustrates localization on a new map (map adaptation).
Figure 7 illustrates a computational workflow for multi-maps and map adaptation.
Figure 8 illustrates discrete and continuous implicit map representation.
Figure 9 illustrates a general method of localizing a mobile apparatus in an area of interest.
Figure 10 illustrates a mobile apparatus according to the present disclosure.
DETAILED DESCRIPTION OF THE EMBODIMENTS
According to the present disclosure, the relocalization solution for a mobile apparatus (mobile platform, vehicle, mobile robot, etc.) consists of running a learning-based visual localization algorithm on an embedded computing device, as described in Figure 1. After data collection in a target environment, a map is built and used to train a deep learning based system which is able to relocalize accurately and efficiently within the map. This is achieved by an implicit map representation that replaces traditional point clouds or image databases as the environment representation. This new formulation enables fast computation, a low memory footprint, and the ability to deploy in multiple areas with minimal scene-specific training.
Technical problem(s) which are overcome by the invention
Localization systems for autonomous driving need to be deployed at city scale or country scale. The best algorithms that solve the camera pose estimation problem store a large amount of information about the 3D environment of the target area in memory. In the context of very large environments, this prevents real-time deployment of the algorithm on embedded devices of the prior art. The present disclosure stores a very compact representation of the surrounding environment in memory, enabling large-scale deployment in multiple areas on embedded devices, and real-time processing and localization by the mobile apparatus.
The present disclosure uses camera-only devices to perform localization, which makes the method cheaper and more scalable compared to LIDAR-based localization solutions.
Fixed-memory learning-based localization methods suffer from poor localization performance. By "inverting" the localization process and not directly regressing the camera pose from the image (see Figure 2), the solution according to the present disclosure improves localization performance over a learned localization baseline with a comparable memory footprint and processing load.
Key points (core of the disclosure):
• Separated image and map learned representations: using an implicit scene representation to store information about the target environment, instead of an explicit representation such as 3D point clouds. In other words, the map is represented by a neural network that connects map coordinates to a latent code. This facilitates embedded deployment at large scale thanks to a lightweight memory footprint, and provides a continuous representation instead of a discrete one.
• Iterative refinement of the camera pose: the camera pose prediction is obtained by iteratively comparing the image representation with pose candidates representations which are sampled in a hierarchical process.
• Multi map system & new map adaptation: this solution can be deployed in multiple target areas with a single neural network and new maps can be integrated into a deployed system in a fast process.
• Compact map representation: the present solution compresses map-specific content into a few MBs, enabling fast transfer with the cloud during navigation between different areas.
• Continual learning and self-improvement from crowd-sourced data: crowdsourced data obtained from the system users can be used to track temporal modifications of the environment and continuously improve the localization accuracy.
Detailed technical description of the present disclosure
Embodiment 1: localization on a map
Data collection: Before deployment, visual data in the area of interest must be recorded and stored, in order to build the map and train the localization algorithm. This can be done by a fleet of vehicles deployed for the purpose or by gathering crowd-sourced data. During deployment, crowd-sourced images from system users can be collected for tracking modifications in the map and improving the localization accuracy.
Localization process: The relocalization algorithm takes an RGB image as input and outputs a camera pose with 3 or 6 degrees of freedom (3 translations and 3 rotations in SE(3) for the 3D case, or 2 translations and 1 rotation in SE(2) for the 2D case). It is trained with the image database captured in the area of interest and labeled with camera poses computed during the mapping step. The main processing steps and computing modules are described below (see Figure 2 and Table 1):
1. First, the input image is encoded by a neural network, named the Image encoder, and a compact intermediate representation of the image is obtained, named the image signature. The image encoder can be pre-trained on a larger database of images.
2. Then, from a pool of pose anchors, map signatures are computed, which are representations of camera poses in the map of interest. These map signatures are produced by the implicit map/scene representation, i.e. a module with learnable parameters that provides higher-dimensional representations of poses in the target area. For the first iteration of the localization algorithm, pose anchors can be chosen at random, uniformly distributed among all the training poses, or sampled in a predefined regular grid.
3. Image signatures and map signatures are compared by a Matching module. The Matching module is defined as a computing unit that predicts a similarity score between image and map signatures. It can either be a learnable module or based on hand-crafted heuristics.
4. Poses with the highest similarity scores are stored in memory.
5. Based on the similarity scores of the pose anchors, a Candidates proposer selects a new pool of pose anchors that will be evaluated as described in steps 2 and 3.
6. After repeating the process described by steps 2, 3, 4 and 5 a given number of times, the stored poses with the highest scores are used to provide a localization estimate, by selecting the pose with the maximum score or by computing an average or a weighted average.
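The sketch below ties steps 1 to 6 together in PyTorch. It is a minimal sketch assuming trained modules `image_encoder` and `map_net` (the implicit map representation) and a cosine-similarity matcher; the module names, the top-k size, the number of iterations and the resampling scheme are illustrative assumptions rather than specifics of the disclosure.

```python
import torch
import torch.nn.functional as F

def cosine_matcher(image_sig, map_sigs):              # (1, D), (N, D) -> (N,)
    return F.cosine_similarity(image_sig, map_sigs, dim=1)

@torch.no_grad()
def localize(image, image_encoder, map_net, init_anchors,
             n_iters: int = 4, k: int = 16, radius: float = 50.0):
    image_sig = image_encoder(image.unsqueeze(0))     # step 1: image signature
    anchors = init_anchors                            # (N, pose_dim) anchor pool
    n = len(init_anchors)
    kept_anchors, kept_scores = [], []
    for _ in range(n_iters):
        map_sigs = map_net(anchors)                   # step 2: map signatures
        scores = cosine_matcher(image_sig, map_sigs)  # step 3: matching
        top = torch.topk(scores, k)                   # step 4: store best poses
        kept_anchors.append(anchors[top.indices])
        kept_scores.append(top.values)
        # Step 5: candidates proposer, resampling around the best anchors
        # within a region that shrinks at every iteration.
        base = anchors[top.indices][torch.randint(0, k, (n,))]
        anchors = base + (torch.rand_like(base) - 0.5) * 2 * radius
        radius *= 0.5
    # Step 6: weighted average of the stored poses (translations only; see text).
    w = torch.softmax(torch.cat(kept_scores), dim=0)
    return (w.unsqueeze(1) * torch.cat(kept_anchors)).sum(dim=0)
```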
Table 1: Technical details on the localization pipeline
In the following, more details are given about the key part of the invention, the implicit learned map representation.
Implicit learned map representation: from any camera pose input (= pose anchor), a neural network outputs a map signature. There are multiple possible architectural design choices for this network: a simple multi-layer perceptron ensures fast computation, whereas approaches based on feature aggregation along camera rays could ensure the 3D consistency of the learned signatures. Multi-dimensional pose embeddings, such as positional encoding, are also considered in order to better capture small variations in the pose space. During training, the learned map representation is randomly initialized and optimized to reduce the localization error. The idea is to learn a mapping between camera poses in the target area and the visual content observable from each viewpoint. During online localization, the optimized representation is loaded into the localization module.
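A minimal sketch of the multi-layer-perceptron variant with a positional encoding of the pose is shown below; the layer sizes, the number of frequencies and the signature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """NeRF-style sin/cos embedding of each pose dimension at several frequencies."""
    def __init__(self, n_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)

    def forward(self, pose):                  # (N, pose_dim)
        x = pose.unsqueeze(-1) * self.freqs   # (N, pose_dim, n_freqs)
        enc = torch.cat([x.sin(), x.cos()], dim=-1)
        return enc.flatten(1)                 # (N, pose_dim * 2 * n_freqs)

class ImplicitMap(nn.Module):
    """Maps a continuous camera pose to a D-dimensional map signature."""
    def __init__(self, pose_dim: int = 3, sig_dim: int = 256, n_freqs: int = 8):
        super().__init__()
        self.enc = PositionalEncoding(n_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim * 2 * n_freqs, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, sig_dim))

    def forward(self, pose):                  # (N, pose_dim) -> (N, sig_dim)
        return self.mlp(self.enc(pose))
```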
Training procedure: the trainable modules of the method are shown in Figure 3. Camera poses are used as the only source of supervision; the reference poses are obtained in the offline mapping process. Both the image encoder and the implicit map representation are trained jointly. For a given image with a corresponding camera pose, an ideal target score is computed that corresponds to an ideal output of the localization pipeline. Target scores are defined using the distance between pose candidates and the reference pose. The system learns to minimize score errors on training samples. For instance, the ideal target score can be designed as a 6D Laplacian kernel centered at the camera position. During training, the loss function between the ideal score and the similarity score output by the localization pipeline as described earlier is computed. A loss is computed at each refinement level, with anchors manually selected close to the target to speed up the training.
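The following sketch illustrates this supervision scheme: a Laplacian kernel over the pose distance defines the ideal score of each anchor, and an L1 loss compares it to the predicted similarity. The kernel bandwidth and the choice of L1 are assumptions, not details from the disclosure.

```python
import torch
import torch.nn.functional as F

def target_scores(anchors: torch.Tensor, ref_pose: torch.Tensor,
                  bandwidth: float = 1.0) -> torch.Tensor:
    """Ideal scores in (0, 1]: a Laplacian kernel centered at the reference pose."""
    dist = torch.norm(anchors - ref_pose, p=1, dim=1)   # (N,) pose distances
    return torch.exp(-dist / bandwidth)

def localization_loss(pred_scores, anchors, ref_pose):
    return F.l1_loss(pred_scores, target_scores(anchors, ref_pose))
```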
Workflow and devices: the computational workflow of the invention is described in Figure 4. It can be divided into 4 components:
1. A fleet of vehicles is equipped with various sensors (cameras, LIDAR, IMU, etc.) in order to map a target area.
2. The mapping vehicles are deployed in the target area to record data. The data are stored internally or transferred to a remote server or a cloud.
3. A remote device, a cloud, or a High Performance Computing device processes the collected data in two steps: first, by mapping the area of interest to generate the training data (= images with camera poses as labels), and second, by training the localization model (neural network weights) following the training procedure described earlier.
4. During online localization, the neural network weights are transferred to the computing device through the cloud. Images coming in real time from the cameras are processed by the localization algorithm on the embedded device, providing camera pose estimates at a high framerate.
Advantages of embodiment 1
The implicit map representation enables compression of map-specific content into a lightweight representation, enabling efficient relocalization in large environments. In contrast with prior art image retrieval methods, the accuracy of the present methods is not bounded by the density of reference poses, and the continual growth of the reference image database improves the accuracy while keeping a fixed-size memory footprint.
Embodiment 2: multi-map and map adaptation
In the perspective of large-scale deployment of autonomous systems, city-scale or country-scale maps are needed. In this scenario, most existing visual localization approaches are limited by their accuracy and the amount of memory storage needed. The common approach divides the area of interest into multiple maps and trains a specific localization algorithm for each map. In contrast, the present solution is adapted by design for operation in any environment with the same neural network, provided that the learned map representation has been trained and loaded into the computing device. The compactness of this module (~1 MB) allows easy transfer between the mobile platform and the cloud.
Multi-map training: The present localization system can be trained simultaneously on multiple areas of interest. The image encoder is shared between all maps, whereas each area of interest is attached to a specific compact learned map representation (see Figure 5 and the sketch below). Another important consideration for scaling up map-based autonomous systems is the deployment time in a new area. A system operating in an environment which is continuously growing needs to be able to adapt quickly to new environments. A technology able to operate autonomously in an area of interest a few minutes after data collection would facilitate large-scale deployment. Once the present multi-map localization system is deployed, new maps can be integrated into the framework in a small fraction of the time needed for the entire training.
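A minimal sketch of this parameter organization follows: one shared encoder, plus one small per-area map representation (reusing the `ImplicitMap` class from the earlier sketch). The class name and map identifiers are illustrative assumptions.

```python
import torch.nn as nn

class MultiMapLocalizer(nn.Module):
    """Shared image encoder plus one compact map representation per area."""
    def __init__(self, image_encoder: nn.Module, map_ids, sig_dim: int = 256):
        super().__init__()
        self.image_encoder = image_encoder          # shared across all maps
        self.maps = nn.ModuleDict({                 # map-specific parameters
            map_id: ImplicitMap(pose_dim=3, sig_dim=sig_dim)
            for map_id in map_ids})

    def map_signatures(self, map_id: str, anchors):
        return self.maps[map_id](anchors)           # select the active map
```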
New map adaptation: after data collection and mapping in the target environment, a new learned map representation can be trained directly to fit an already trained multi-map localization algorithm. The image encoder is not optimized during the new map adaptation training process. As a result, learning only the small number of parameters of the learned map representation is a very fast process (see Figure 6).
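As a sketch, under the same assumed names as the earlier snippets (`ImplicitMap`, `localization_loss`), adaptation reduces to freezing the shared encoder and optimizing only the new map's parameters; the matcher, optimizer settings and data loader are hypothetical.

```python
import torch

def adapt_to_new_map(image_encoder, matcher, new_map_batches,
                     pose_dim: int = 3, sig_dim: int = 256, lr: float = 1e-3):
    """Train only a fresh map representation against a frozen shared encoder."""
    new_map = ImplicitMap(pose_dim=pose_dim, sig_dim=sig_dim)  # randomly initialized
    image_encoder.requires_grad_(False)             # shared encoder stays frozen
    image_encoder.eval()
    # Only the small map-specific parameter set enters the optimizer.
    opt = torch.optim.Adam(new_map.parameters(), lr=lr)
    for image, ref_pose, anchors in new_map_batches:  # hypothetical data loader
        scores = matcher(image_encoder(image), new_map(anchors))
        loss = localization_loss(scores, anchors, ref_pose)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_map
```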
Advantages of embodiment 2
Multi-map and new map adaptation mechanisms enable city/country-scale deployment of the localization service thanks to the compactness of the map-specific content, which enables fast transfer with the cloud during mobile platform navigation. Using a multi-map system instead of several independent single-map systems reduces the computational cost of the training step and improves accuracy thanks to transfer learning.
Alternative implementation of the inventive solution: discrete and continuous implicit representation
The core of the present disclosure is the implicit map representation module. It is defined as a map-specific learnable module that connects a camera pose in the area of interest to a map signature (i.e. a higher-dimensional latent vector). In the preferred embodiment, the implicit map representation is a learnable neural network that outputs a map signature for every continuous input pose.
Another formulation of such an implicit learned map representation is an array of spatially arranged learnable vectors. In other words, the map is discretized across its dimensions into a finite number of map cells, to each of which a signature is attached, see Figure 8. The signatures are directly learned with backpropagation and stored in memory. The main benefit is a very compact representation, and the signatures can be accessed without additional computation. However, the precision is limited by the resolution of the discretization, which must be kept coarse in order to keep the representation compact. Discrete vectors could be interpolated to obtain representations at an arbitrary resolution.
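A minimal sketch of this discrete variant in 2D follows, using bilinear interpolation (`grid_sample`) so that signatures can be read out at arbitrary continuous positions; the grid size, spatial extent and initialization scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteMap(nn.Module):
    """A 2D grid of learnable signature vectors, interpolated bilinearly."""
    def __init__(self, cells: int = 64, sig_dim: int = 256, extent: float = 100.0):
        super().__init__()
        self.extent = extent                   # half-width of the map in meters
        # (1, sig_dim, cells, cells): one learnable signature per map cell.
        self.grid = nn.Parameter(torch.randn(1, sig_dim, cells, cells) * 0.01)

    def forward(self, xy):                     # (N, 2) positions in meters
        # Normalize positions to [-1, 1] as required by grid_sample.
        coords = (xy / self.extent).view(1, -1, 1, 2)
        sig = F.grid_sample(self.grid, coords, align_corners=True)
        return sig.squeeze(-1).squeeze(0).t()  # (N, sig_dim)
```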
General aspects of the present disclosure
Figure 9 illustrates a general method of localizing a mobile apparatus in an area of interest according to the present disclosure, covering the embodiments as described above. The general method comprises the steps:
910: capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image;
920: determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose;
930: performing iterations comprising the steps of:
931: selecting a pool of pose anchors from a map representation of the area of interest, each pose anchor corresponding to a candidate camera pose;
932: generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose;
933: comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and
934: identifying a number of pose anchors with highest similarity scores;
940: an initial iteration is performed based on an initial predefined pool of pose anchors, and in each subsequent iteration the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and
950: estimating the current camera pose based on the pose anchors identified in the iterations.
Step 934 may comprise storing of at least a part of the pose anchors with the highest similarity scores for the final pose estimation in step 950.
The iterations may be performed a number of times until the further step of estimating the current camera pose is performed. This number of iterations can be predefined or predetermined based on a precision criterion for the camera pose, or the number may be determined during the iterations based on a convergence criterion of the pose anchors in the sequence of iterations.
Figure 10 illustrates a mobile apparatus 1000 according to the present disclosure. The mobile apparatus 1000 comprises a camera 1010 for capturing an image, the camera 1010 having a current camera pose when capturing the image; a pre-trained image encoder 1020 for determining an image signature based on the image, the image signature being a representation of the current camera pose; a memory 1030 for storing a map representation of the area of interest; and processing circuitry 1040 configured to perform iterations comprising the steps (i) - (iv) as follows: (i) selecting a pool of pose anchors from the map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; (ii) generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; (iii) comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and (iv) identifying a number of pose anchors with highest similarity scores; wherein the processing circuitry 1040 is further configured to perform an initial iteration based on an initial predefined pool of pose anchors, and in each subsequent iteration to perform the step of selecting the pool of pose anchors based on the pose anchors identified in the previous iteration; and estimate the current camera pose based on the pose anchors identified in the iterations.
The mobile apparatus 1000 is configured to perform the method as described in Figure 9.
Application scenarios
• The present disclosure/system mainly targets autonomous driving applications. Vehicles are equipped with a computing device and cameras and make use of the localization service to ensure precise and safe navigation. The system can first be deployed in a limited area, which can be continuously enlarged by collecting data in new areas. Data recorded in user vehicles is used to improve the system's accuracy over time.
• Autonomous mobile robots can be equipped with our system in order to navigate in their environments. Applications include transport of goods in warehouses, charging robots operating in parking areas, or domestic robots.
• Augmented reality systems can benefit from the present disclosure/system because they need a precise real-time localization ability. Applications include assistance systems for staff performing maintenance and repair of complex equipment, the tourism industry, or public safety (software that provides instructions in emergency situations).
Beneficial effects and advantages of the invention
• According to the present disclosure, the accuracy of real-time relocalization algorithms can be improved.
• Training the model is one order of magnitude faster than competitors.
• Separated map and image representations enable efficient adaptation to a new area of interest. As a result, scaling up to city-scale areas can be done faster.
• The obtained distribution of scores provides information about model confidence, enabling uncertainty quantification based on the disparity of the final poses, which is crucial for sensor fusion.
• The compact map representation of the method enables easy transfer through the cloud.
The present disclosure is generally defined by the claims.

Claims

1. A method of localizing a mobile apparatus in an area of interest, comprising the steps of: capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image; determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose; performing iterations comprising the steps of: selecting a pool of pose anchors from a map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and identifying a number of pose anchors with highest similarity scores; wherein an initial iteration is performed based on an initial predefined pool of pose anchors, and in each subsequent iteration the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and estimating the current camera pose based on the pose anchors identified in the iterations.
2. The method according to claim 1, wherein the pre-trained image encoder comprises a set of predetermined parameters and the map representation comprises a set of predetermined parameters.
3. The method of claim 2, wherein the predetermined parameters include weights of a neural network and are provided in the form of respective parameter vectors.
4. The method according to any one of claims 1 to 3, wherein in each iteration a region in the map representation used to select new anchor poses based on anchors identified in the previous iteration is decreased, in particular in each iteration new anchors closest to the identified anchors in the previous iteration are selected to refine the pose estimate.
5. The method according to any one of claims 1 to 4, wherein the similarity score is based on a measure of similarity using the image signature and the respective generated map signature.
6. The method according to any one of claims 1 to 5, further comprising an initial step of receiving, from a server, the map representation of the area of interest.
7. The method according to claim 6, further comprising a step of receiving, from the server, a further map representation of a further area of interest when the mobile apparatus moves towards or into the further area of interest.
8. The method according to any one of claims 1 to 7, wherein each map representation has been previously obtained by performing the steps of: obtaining training data in the area of interest using respective cameras of one or more mobile devices moving in the area of interest, the training data comprising image data and camera pose data; transmitting the obtained training data to a remote computing device, such as the server or a cloud computing device; and using the training data to train the map representation.
9. The method according to any one of claims 1 to 8, wherein the image encoder of the mobile apparatus is pre-trained once by performing the steps of: providing reference images and corresponding reference camera poses; and training the image encoder by feeding the image encoder with the reference images and adjusting parameters of the image encoder by comparing an output of the image encoder with the reference camera poses.
10. The method according to claims 8 and 9, wherein training the image encoder and training the map representation is performed jointly, in particular at least partially using the same images and camera poses.
11. The method according to any one of claims 1 to 10, wherein the step of estimating the current pose of the camera based on the pose anchors identified in the iterations comprises selecting the pose with a maximum score or computing an average or a weighted average of the pose anchors.
12. A mobile apparatus, comprising: a camera for capturing an image, the camera having a current camera pose when capturing the image; a pre-trained image encoder for determining an image signature based on the image, the image signature being a representation of the current camera pose; a memory for storing a map representation of the area of interest; and processing circuitry configured to perform iterations comprising the steps of: selecting a pool of pose anchors from the map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and identifying a number of pose anchors with highest similarity scores; wherein the processing circuitry is further configured to: perform an initial iteration based on an initial predefined pool of pose anchors, and in each subsequent iteration to perform the step of selecting the pool of pose anchors based on the pose anchors identified in the previous iteration; and estimate the current camera pose based on the pose anchors identified in the iterations.
13. The mobile apparatus according to claim 12, wherein the pre-trained image encoder comprises a set of predetermined parameters and the map representation comprises a set of predetermined parameters.
14. The mobile apparatus of claim 13, wherein the predetermined parameters include weights of a neural network and are provided in the form of respective parameter vectors.
15. The mobile apparatus according to any one of claims 12 to 14, wherein in each iteration a region in the map representation used to select new anchor poses based on anchors identified in the previous iteration is decreased, in particular in each iteration new anchors closest to the identified anchors in the previous iteration are selected to refine the pose estimate.
16. The mobile apparatus according to any one of claims 12 to 15, wherein the similarity score is based on a measure of similarity using the image signature and the respective generated map signature.
17. The mobile apparatus according to any one of claims 12 to 16, wherein the mobile apparatus comprises a receiver configured to receive, from a server, the map representation of the area of interest.
18. The mobile apparatus according to claim 17, wherein the receiver is further configured to receive, from the server, a further map representation of a further area of interest when the mobile apparatus moves towards or into the further area of interest.
19. The mobile apparatus according to any one of claims 12 to 18, wherein the processing circuitry is configured to estimate the current pose of the camera based on the pose anchors identified in the final iteration by selecting the pose with a maximum score or by computing an average or a weighted average of the pose anchors.
20. A system comprising: one or more mobile devices, each having a camera for capturing images in an area of interest; a localization device for obtaining a respective camera pose corresponding to the captured images; a transmitter for transmitting training data comprising image data of the captured images and camera pose data of the obtained camera poses; and a remote computing device, such as the server or a cloud computing device, for receiving the transmitted training data, and for training a map representation of the area of interest using the training data.
21. The system of claim 20, wherein the remote computing device is configured to transmit the map representation of the area of interest to a mobile apparatus.
22. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to any one of claims 1 to 11.
23. A computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to any one of claims 1 to 11.
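Purely for illustration of claims 9 and 10, joint training of the image encoder and the map representation could look like the following sketch; the PyTorch API and the contrastive (InfoNCE-style) objective that matches each image signature to the map signature of its ground-truth pose are assumptions of this example, not details recited in the claims.

import torch
import torch.nn.functional as F

def training_step(images, poses, image_encoder, map_model, optimizer, tau=0.07):
    # Image and map signatures are produced by two separate modules
    # that are optimised jointly on the same (image, pose) pairs.
    image_sigs = F.normalize(image_encoder(images), dim=-1)    # (B, D)
    map_sigs = F.normalize(map_model(poses), dim=-1)           # (B, D)
    logits = image_sigs @ map_sigs.T / tau                     # (B, B) similarity matrix
    targets = torch.arange(len(images), device=logits.device)  # matching pairs on the diagonal
    loss = F.cross_entropy(logits, targets)                    # gradients flow to both modules
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()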
PCT/EP2022/058974 2022-03-04 2022-04-05 Apparatus and methods for visual localization with compact implicit map representation WO2023165718A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP2022055529 2022-03-04
EPPCT/EP2022/055529 2022-03-04

Publications (1)

Publication Number Publication Date
WO2023165718A1

Family

ID=81579442

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/058974 WO2023165718A1 (en) 2022-03-04 2022-04-05 Apparatus and methods for visual localization with compact implicit map representation

Country Status (1)

Country Link
WO (1) WO2023165718A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180304891A1 (en) * 2015-07-29 2018-10-25 Volkswagen Aktiengesellschaft Determining arrangement information for a vehicle

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22720699

Country of ref document: EP

Kind code of ref document: A1