Information about the location, orientation, and context of a mobile device is of central importance for future multimedia applications and location-based services (LBS). With the widespread adoption of modern camera phones equipped with powerful processors, inertial measurement units, compasses, and assisted GPS receivers, the variety of location- and context-based services has increased significantly in recent years. These include, for instance, the search for points of interest in the vicinity, geotagging and retrieval of user-generated media, targeted advertising, navigation systems, and social applications.
While satellite navigation systems can provide sufficient positioning accuracy, they require a clear view to at least four satellites, which limits their applicability to outdoor scenarios with few obstacles. Unfortunately, most interesting LBS could be provided in densely populated environments, which include urban canyons and indoor scenarios. In urban canyons, positioning errors may be caused by multipath effects, which are even more severe if the user is traveling on the sidewalks and not in the middle of the street.
As GPS is virtually unavailable in indoor environments and its localization accuracy in urban canyons is insufficient, alternative positioning mechanisms that can complement the available systems are required.
One approach is to use images recorded on the mobile device as a visual fingerprint of the environment and to match them against an existing georeferenced database like Google Street View or Microsoft Street-Side views. In contrast to WiFi-based indoor localization systems, this requires no infrastructure that grows in complexity with the size of the environment. Further, LBS not only rely on precise location and orientation information to determine the user's actual field of view but also benefit from information on its content, like exhibits, store names, trademarks, etc., which can be derived from the images the user is intentionally recording. Ideally, the pose information from visual localization is fused with all other available sensor data providing location or orientation, such as GPS, IMU, WiFi, or Cell-IDs.
The main challenge for visual localization is to rapidly and accurately search for images related to the current recording in a large georeferenced database. This task is known as Content Based Image Retrieval (CBIR). Objects recorded at different sizes and poses and against varying backgrounds have to be distinctively described and efficiently retrieved from a database. The application of CBIR to location recognition complicates these requirements.
In particular, images captured with a mobile device are used to retrieve the spatially closest image from a georeferenced dataset. This could, for instance, include the 360° panoramic images from Google Street View, which can be fetched from the web. Typically, only sparse reference data can be assumed. For instance, Street View panoramas are available online with varying inter-panorama distances, typically in the range of 12 to 17 m. Such wide baselines make it difficult to associate the query view with the reference views. Whereas distant buildings can be well associated among the views, close objects like the train station or the tree are difficult to match even for a human observer. The description of distinct objects is complicated by the three-dimensional structure of the environment and the resulting occlusions and overlaps. Further, different lighting conditions between the query and database image, which cause shadows and reflections, can change the visual appearance of the scene. Also, both query and database images typically contain dynamic objects, like cars or pedestrians, which lead to significant differences between matching views. As advertisements or even buildings change over time and the seasons alter the appearance dramatically, a dynamic update process for the database is required. Due to the properties of mobile device cameras, query images are typically affected by motion blur and provide a limited field of view, which makes it difficult to match them against high-resolution panoramas. Additionally, limitations on processing power, battery capacity, and network performance require low-complexity approaches on the mobile device and efficient communication including data compression.
Finally, very low retrieval times are an essential prerequisite for most LBS due to the rapidly changing field of view of the mobile device caused by user motion and constantly changing user attention.
Whilst different image retrieval algorithms are known, the major bottleneck is the communication delay introduced by feature uploading. Including network delay, communication timeouts, and the retrieval itself, the delay until the client receives results from the server may be too long for some location-based services due to user motion and dynamically changing user attention.
Accordingly, in one embodiment the present invention aims to address two central challenges of mobile visual location recognition, namely the complex retrieval task and the communication delay.
Moreover, in order to achieve the required low query time, tree-based bag-of-features (BOF) approaches are typically used, which quantize image descriptors into visual words.
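The quantization of descriptors into visual words can be illustrated with a minimal sketch. It assumes a small, hand-built vocabulary tree over two-dimensional toy descriptors; the names `Node`, `quantize`, `bag_of_features`, and `TOY_TREE` are hypothetical and chosen only for this illustration, and real descriptors such as SIFT or SURF have far more dimensions and a tree trained by clustering. Each descriptor descends the tree greedily toward the closest centroid, and the leaf it reaches is its visual word.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class Node:
    """One node of a (hypothetical) vocabulary tree."""
    centroid: Vector
    children: List["Node"] = field(default_factory=list)
    word_id: int = -1  # meaningful only for leaves


def quantize(nodes: List[Node], descriptor: Vector) -> int:
    """Greedy descent: at each level follow the child with the closest centroid."""
    while True:
        best = min(nodes, key=lambda n: squared_distance(n.centroid, descriptor))
        if not best.children:
            return best.word_id
        nodes = best.children


def bag_of_features(nodes: List[Node], descriptors: List[Vector]) -> Counter:
    """Histogram of visual words for one image."""
    return Counter(quantize(nodes, d) for d in descriptors)


# Assumed toy vocabulary: two branches with two leaves (visual words) each.
TOY_TREE = [
    Node((1.0, 1.0), [Node((0.0, 0.0), word_id=0), Node((2.0, 2.0), word_id=1)]),
    Node((9.0, 9.0), [Node((8.0, 8.0), word_id=2), Node((10.0, 10.0), word_id=3)]),
]

# Assumed toy descriptors of a single query image.
query_descriptors = [(0.1, 0.2), (2.1, 1.9), (9.6, 10.2), (0.3, 0.0)]
histogram = bag_of_features(TOY_TREE, query_descriptors)
```

In this sketch an image is represented only by its word histogram, so two images can be compared without keeping their original descriptors. The number of centroid comparisons per descriptor depends on the branching factor and depth of the tree and not on the number of database images.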
The retrieval of images or image sequences in large databases has been studied extensively during the last decades. Object retrieval and location recognition are among the best-known applications in this field. While image retrieval results can be efficiently improved via Bayesian filtering in location recognition scenarios, the requirements on the query time are very stringent. A typical example would be an online service providing location information based on image recordings from mobile devices and a geo-tagged reference image database like Google Street View. In this scenario, images are typically reduced in dimensionality on the mobile device by extracting robust features like SIFT or SURF. The extracted features are sent to a server, which has to compute the position estimate within a few milliseconds to meet the stringent real-time requirements of mobile location recognition. The ability to rapidly estimate the absolute location is essential to continuously limit the temporally increasing uncertainty of the user's position and thus the computational complexity.
In feature-based retrieval approaches, the similarity of images is typically determined by a score based on the count of matching high-dimensional feature descriptors. To avoid a query time that scales linearly with the number of database images, efficient indexing structures like the popular kd-tree are typically used. These trees perform an approximate k-nearest neighbor search to achieve query times lower than those of a linear search for dimensions higher than 10. However, backtracking through neighboring leaves, which is required to achieve reasonable retrieval accuracy, accounts for a significant percentage of the overall query time. Further, in these approaches the descriptors of every image have to be stored, so that the database size increases linearly with the number of images.
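The kd-tree matching described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the indexing structure of any particular system: the function names (`build_kdtree`, `nearest`, `score_images`), the two-dimensional toy descriptors, the image identifiers, and the distance threshold are all hypothetical. Every database descriptor is stored in the tree together with the identifier of its image, and each query descriptor votes for the image that owns its nearest neighbor.

```python
from collections import Counter


def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (descriptor, image_id) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, backtrack=True, best=None):
    """Return (squared_distance, image_id) of the closest stored descriptor.

    With backtrack=False only the branch containing the query is visited,
    which is faster but may miss the true nearest neighbor.
    """
    if node is None:
        return best
    vec, image_id = node["point"]
    dist = sum((a - b) ** 2 for a, b in zip(vec, query))
    if best is None or dist < best[0]:
        best = (dist, image_id)
    diff = query[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, backtrack, best)
    # Backtracking: the far side can only help if the splitting plane
    # is closer to the query than the best match found so far.
    if backtrack and diff * diff < best[0]:
        best = nearest(far, query, backtrack, best)
    return best


def score_images(tree, query_descriptors, max_sq_dist, backtrack=True):
    """Score each database image by its count of matching descriptors."""
    votes = Counter()
    for q in query_descriptors:
        dist, image_id = nearest(tree, q, backtrack)
        if dist <= max_sq_dist:
            votes[image_id] += 1
    return votes


# Assumed toy database: every descriptor of every image is stored.
database = [
    ((1.0, 1.0), "pano_a"), ((2.0, 8.0), "pano_a"), ((3.0, 3.0), "pano_a"),
    ((8.0, 2.0), "pano_b"), ((9.0, 9.0), "pano_b"), ((7.0, 6.0), "pano_b"),
]
tree = build_kdtree(database)
queries = [(1.2, 0.9), (2.1, 7.8), (8.8, 9.1)]
votes = score_images(tree, queries, max_sq_dist=0.5)
```

The `backtrack` flag makes the trade-off from the text explicit: visiting neighboring branches restores accuracy at the cost of additional distance computations. The sketch also shows why storage grows with the database, since the tree holds one entry per descriptor of every image.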
Accordingly, in another embodiment the present invention aims to address the challenge of rapid location recognition, and in particular to provide a location recognition system that reduces data processing time and expenses as well as data storage requirements.