Localizing where a photo or video was taken is a key problem in computer vision, with a broad range of applications in consumer photography, augmented reality, photo editing, autonomous and human navigation, and forensics. Information about camera location can also aid in other vision tasks, such as estimating the illumination of a photograph and scene understanding. With the rapid growth of online photo sharing sites as well as the creation of more structured image collections such as Google's Street View, increasingly any new photo can in principle be localized with respect to this growing set of existing imagery.
Several problem areas are of recent interest in computer vision, including landmark recognition and localization as well as localization from point clouds. With respect to landmark recognition and localization, the problem of “where was this photo taken?” can be answered in several ways. Some techniques approach the problem as one of classification into one of a predefined set of place names, e.g., “Eiffel Tower” or “Empire State Building”. Another approach is to create a database of localized imagery, then formulate the problem as one of image retrieval, after which the query image can be associated with the location of the retrieved images. The interpretation of the results of such methods varies from technique to technique. In what is known as im2gps, the locations of arbitrary images such as “forests” and “deserts” have been characterized with a rough probability distribution over the surface of the Earth with confidences on the order of hundreds of kilometers. In other related work, human travel priors are used to improve performance for sequences of images, but the resulting locations remain fairly coarse. Other work seeks to localize urban images more precisely, often by matching to databases of street-side imagery using Bag-of-Words (BoW) retrieval techniques.
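By way of illustration only, the retrieval formulation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular prior work: each image is assumed to be already summarized as a BoW histogram of visual-word counts, cosine similarity is assumed as the ranking score, and the function names and the toy database are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty histogram matches nothing
    return dot / (norm_a * norm_b)


def localize_by_retrieval(query_hist, database):
    """Return the (lat, lon) of the most similar database image.

    `database` is a list of (histogram, (lat, lon)) pairs.
    """
    best = max(database, key=lambda entry: cosine_similarity(query_hist, entry[0]))
    return best[1]


# Hypothetical toy database: two geotagged images, four visual words.
database = [
    ([5, 0, 1, 0], (48.8584, 2.2945)),
    ([0, 4, 0, 3], (40.7484, -73.9857)),
]

location = localize_by_retrieval([4, 0, 2, 0], database)
```

The query inherits the location of the retrieved image, so the precision of the result is bounded by how densely the database covers the scene.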
With respect to localization from point clouds, results of structure from motion (SfM) techniques are leveraged. For example, certain work uses SfM reconstructions to generate a set of “virtual” images that cover a scene, then indexes these as documents using BoW methods. Direct 2D-to-3D approaches have recently been used to establish correspondence between the reconstructed 3D model and the query image without going through an intermediate image retrieval step. While inverse matching from 3D points to image features can sometimes find correct matches very quickly through search prioritization, its success depends on finding them early in the search process. Thus, it becomes less effective as the size of the model grows. Certain other work follows the more conventional forward matching from image features to 3D points, but uses search prioritization to avoid considering every image feature and hence improves matching speed. However, the accuracy of camera pose estimation may decrease since this approach yields a smaller set of matches. Moreover, the set of matches obtained in this way is often noisy enough that RANdom SAmple Consensus (RANSAC) needs to be run for up to one minute.
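To illustrate why a noisy match set makes RANSAC slow, the standard estimate of the number of random samples needed to draw at least one all-inlier sample can be sketched as follows. This is a minimal sketch and not taken from the works discussed above: the function name, the default confidence of 0.99, and the minimal sample size of three correspondences (as for a calibrated three-point pose solver) are assumptions.

```python
import math


def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Samples needed so that, with probability `confidence`,
    at least one sample consists only of inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # Probability that one random minimal sample is outlier-free.
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1
    # Solve (1 - p_good_sample)^n <= 1 - confidence for n.
    # log1p keeps the result stable when p_good_sample is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_good_sample))


# Assumed sample size of 3; the inlier ratios are illustrative only.
clean = ransac_iterations(0.5, 3)
noisy = ransac_iterations(0.1, 3)
```

Because the per-sample success probability falls with the inlier ratio raised to the power of the sample size, a drop in the inlier ratio from one half to one tenth raises the required number of samples by roughly two orders of magnitude.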
In many applications of digital photos, one wants to know exactly where on the Earth's surface a given photograph was taken, and in which direction the camera was looking. At times one has Global Positioning System (GPS) information associated with the photo, which gives an approximate camera location, but at times one wants much more accurate information than GPS provides. In other situations, GPS information is unavailable for a photo.
Clearly, there is a demand for a system and methods for determining where a photograph was taken by estimating camera pose using a world-wide point cloud database to which the photograph is matched. The present invention satisfies this demand.