Contemporary smartphones use various localization methods based on GPS, cellular networks and Wi-Fi networks. However, none of the methods available today is able to reliably and accurately determine a user's location inside buildings.
Normally, no infrastructure supporting localization is available inside buildings. Similarly, smartphones may not be equipped with specialized localization hardware.
With recent advances in content-based image retrieval (CBIR), fast visual localization of mobile devices becomes feasible. Accordingly, the visual information made available through a phone's camera is used for location estimation. By comparing the features visible in the image taken by the camera to geo-tagged reference images recorded previously during a mapping run, the location of the camera can be determined.
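The comparison of a camera image with geo-tagged reference images can be illustrated by the following minimal sketch. It is not the method of any cited reference. The two-dimensional toy descriptors, the distance threshold `max_dist`, and the function names `match_count` and `estimate_location` are assumptions made for illustration only; real feature descriptors have many more dimensions.

```python
import math

def match_count(query, reference, max_dist=0.5):
    """Count query descriptors whose nearest reference descriptor lies within max_dist."""
    count = 0
    for q in query:
        nearest = min(math.dist(q, r) for r in reference)
        if nearest <= max_dist:
            count += 1
    return count

def estimate_location(query, database):
    """Return the geo-tag of the reference image with the most matching descriptors."""
    best = max(database, key=lambda entry: match_count(query, entry["descriptors"]))
    return best["position"]

# Hypothetical database: each reference image stores a map position and toy descriptors.
database = [
    {"position": (0.0, 0.0), "descriptors": [(0.1, 0.9), (0.8, 0.2)]},
    {"position": (5.0, 0.0), "descriptors": [(0.5, 0.5), (0.9, 0.9)]},
]
query = [(0.12, 0.88), (0.79, 0.25)]

location = estimate_location(query, database)
print(location)
```

In this sketch the estimated location is simply the stored position of the best-matching reference image, so the estimate can never be finer than the set of positions recorded during the mapping run.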
Utilizing video recordings of a mobile device as a visual fingerprint of the environment and matching them to a geo-referenced database provides pose information in a very natural way. Hence, location-based services (LBS) can be provided without complex infrastructure in areas where the accuracy and availability of GPS are limited. This is particularly interesting for indoor environments, where traditional localization methods such as GPS are unavailable.
However, the application of CBIR to mobile location recognition implies several challenges. The complex 3D shape of the environment results in occlusions, overlaps, shadows, reflections, etc., which require a robust description of the scene. Bag-of-Features based image representations are able to fulfill these requirements; however, they require a very large number of reference images in order to be useful for localization.
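A Bag-of-Features representation can be sketched as follows, under simplifying assumptions: each local descriptor is assigned to its nearest entry of a small visual vocabulary, and an image is then described by the histogram of these assignments. The three-word vocabulary, the toy descriptors, and the choice of cosine similarity are illustrative assumptions and are not taken from the cited references.

```python
import math
from collections import Counter

def quantize(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor."""
    return min(range(len(vocabulary)), key=lambda i: math.dist(descriptor, vocabulary[i]))

def bag_of_features(descriptors, vocabulary):
    """Build a histogram of visual-word occurrences for one image."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(vocabulary))]

def cosine_similarity(a, b):
    """Cosine of the angle between two histograms (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary of three visual words in a 2D descriptor space.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

reference = bag_of_features([(0.1, 0.0), (0.9, 0.1), (0.95, 0.0)], vocabulary)
# Query of the same scene in which one feature is occluded.
query = bag_of_features([(0.0, 0.1), (1.0, 0.1)], vocabulary)
# Image of an unrelated scene.
other = bag_of_features([(0.0, 0.9)], vocabulary)

print(cosine_similarity(query, reference))
print(cosine_similarity(query, other))
```

Because the histogram discards the spatial arrangement of the features, a partially occluded query still scores highly against its reference image in this toy example, while the unrelated image scores zero.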
Vision-based localization systems make use of local image features, organized in a searchable index using content-based image retrieval (CBIR) methods. Once trained on a set of reference images, CBIR systems are able to rapidly identify images similar in appearance to a query image. However, when applied to the problem of visual localization, two major problems surface:
Limited accuracy: In order to provide reference images for the image retrieval system, the environment needs to be mapped, i.e., images have to be captured at various locations and orientations, and corresponding map coordinates have to be stored. This is commonly achieved by a mapping trolley, which automatically captures images and acquires a 3D point cloud model as it is moved through the environment. Although automated to a large degree, mapping buildings on a large scale is a time-consuming and tedious endeavour, and it is impossible to capture images at every combination of location and orientation that might occur during localization. In practice, images are captured along a single trajectory only, drastically limiting the resolution of position and orientation estimates as returned by the image retrieval process.
Perspective distortion: The limited affine and perspective invariance of feature descriptors is a severe problem, as a location can be recognized only if a reference image with a pose similar enough to the query image exists. There has been extensive work on improving the robustness of feature descriptors under perspective distortion. However, robustness is gained at the expense of distinctiveness, hence such approaches tend to increase recall only, but not precision.
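The accuracy limitation can be made concrete with a small calculation. If the retrieval process can only return the position of a stored reference image, the position error is at least the distance from the true position to the nearest reference position. The corridor layout, the 1 m spacing, and the user position below are assumed values chosen purely for illustration.

```python
import math

def best_case_error(true_position, reference_positions):
    """Smallest achievable position error when only reference positions can be returned."""
    return min(math.dist(true_position, p) for p in reference_positions)

# Assumed mapping run: one image per metre along a straight 10 m trajectory.
trajectory = [(float(x), 0.0) for x in range(0, 11)]

# Assumed user position: 1.5 m to the side of the mapped trajectory.
user = (4.4, 1.5)

error = best_case_error(user, trajectory)
print(round(error, 3))
```

Even with a perfect retrieval result, the error in this example exceeds the lateral offset from the trajectory, and the orientation estimate is likewise restricted to the orientations captured during mapping.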
It is known to apply content-based image retrieval approaches for location recognition in textured outdoor environments [1, 2, 10, 11]. Indoor environments, however, are more challenging, as only few distinctive features are available and perspective distortion is more pronounced, especially in narrow corridors.
Attempts to address perspective distortions are described in [3] and [7]. However, these methods are computationally expensive or do not have to deal with complex geometric variations.
Further, it is known to determine information on the 3D structure of an environment, e.g. via laser scans, and to use such information to generate locally orthogonal projections. A combination of conventional perspective images with orthogonal projections of building facades, intended to increase invariance with respect to the viewpoint, is described in [2]. Increasing feature invariance, however, generally deteriorates distinctiveness, which is particularly unfavourable in texture-poor indoor environments.
From [13] it is known to generate viewpoint invariant patches (VIP) to improve robustness with respect to 3D camera motion.
The generation of synthetic views is described in [5]. However, the approach described in [5] may be insufficient in the case of sparse reference imagery. Further, occlusions are not handled by this approach, which is of particular importance in indoor environments where obstacles and walls restrict visibility.
From [1] it is known to generate orthogonal projections of building facades. Query images are normalized to surface-parallel views after analyzing them for vanishing points. However, this approach, too, is computationally expensive.
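For illustration, a vanishing point can be located as the intersection of two image lines that are parallel in the scene, using cross products in homogeneous coordinates. The following is a generic sketch of that standard construction and is not necessarily the procedure used in [1]; the point coordinates and function names are assumptions.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def vanishing_point(line_a, line_b):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(line_a, line_b)
    if abs(w) < 1e-12:
        return None  # intersection lies at infinity
    return (x / w, y / w)

# Two assumed image edges of a corridor wall that converge towards the right.
top_edge = line_through((0.0, 0.0), (4.0, 2.0))     # y = x / 2
bottom_edge = line_through((0.0, 6.0), (4.0, 4.0))  # y = 6 - x / 2

vp = vanishing_point(top_edge, bottom_edge)
print(vp)
```

In practice many line segments have to be detected and grouped before a stable vanishing point is obtained, which contributes to the processing cost noted above.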