Digital cameras have become commonplace, and advances in technology have made it easy for a single person to take thousands of photographs and store all of them on a hard drive. At the same time, it has become much easier to share photographs with others, whether by posting them on a personal web site or by making them available to a community of enthusiasts using a photo-sharing service. As a result, anyone can have access to millions of photographs through the Internet. Sorting through and browsing such huge numbers of photographs, however, is a challenge. Large collections of photographs, whether belonging to a single person or contributed by thousands of people, also create exciting opportunities for enhancing the browsing experience by gathering information across multiple photographs.
Some photo-sharing services, such as FLICKR®, available at www.flickr.com, allow users to tag photos with keywords, and provide a text search interface for finding photos. However, tags alone often lack the level of specificity required for fine-grained searches, and can rarely be used to organize the results of a search effectively. For example, searching for “Notre Dame” in FLICKR® results in a list of thousands of photographs, sorted either by date or by other users' interest in each photo. Within this list, photographs of both the inside and the outside of Notre Dame cathedral in Paris are interspersed with photographs taken in and around the University of Notre Dame. Finding a photograph showing a particular object, for instance, the door of the cathedral, amounts to inspecting each image in the list. Searching for both “Notre Dame” and “door” reduces the results to a manageable number, but almost certainly excludes relevant images whose owners simply omitted the tag “door.”
The computer vision community has conducted work on recovering camera parameters and scene geometry from sets of images. The work of Brown and Lowe [2005] and of Schaffalitzky and Zisserman [2002] involves application of automatic structure from motion to unordered data sets. A more specific line of research focuses on reconstructing architecture from multiple photographs, using semi-automatic or fully automatic methods. The semi-automatic Facade system of Debevec, et al. [1996] has been used to create compelling fly-throughs of architectural scenes from photographs. Werner and Zisserman [2002] developed an automatic system for reconstructing architecture, but it was only demonstrated on small sets of photographs.
Techniques have been developed for visualizing or searching through large sets of images based on a measure of image similarity; histogram distances such as the Earth Mover's Distance [Rubner et al. 1998] are often used. A similarity score gives a basis for performing tasks such as creating spatial layouts of sets of images or finding images that are similar to a given image, but often the score is computed in a way that is agnostic to the objects in the scene (for instance, the score might just compare the distributions of colors in two images). Therefore, these methods are most suitable for organizing images of classes of objects, such as mountains or sunsets.
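As an illustration of such a histogram-based similarity score, the following minimal sketch computes a one-dimensional Earth Mover's Distance between the intensity histograms of two images. It is a simplified special case and not the general formulation of Rubner et al.: it assumes grayscale intensities in the range 0 to 255, histograms normalized to sum to one, and a ground distance of one unit between adjacent bins. The function names `histogram` and `emd_1d` and the sample pixel values are hypothetical and chosen only for this example.

```python
def histogram(pixels, bins=8, max_value=256):
    """Return a normalized intensity histogram (assumed range: 0..max_value-1)."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    total = len(pixels)
    return [count / total for count in counts]


def emd_1d(p, q):
    """1D Earth Mover's Distance between two normalized histograms.

    For unit spacing between adjacent bins, the minimum transport cost
    equals the sum of absolute differences of the cumulative histograms.
    """
    carried = 0.0  # mass that still has to be moved to later bins
    work = 0.0     # total cost accumulated so far
    for p_i, q_i in zip(p, q):
        carried += p_i - q_i
        work += abs(carried)
    return work


# Two small "images" given as flat lists of grayscale intensities.
image_a = [200, 210, 220, 230]
image_b = [205, 215, 225, 235]

# Same pixels as image_a in a different spatial arrangement.
image_a_shuffled = list(reversed(image_a))

score_ab = emd_1d(histogram(image_a), histogram(image_b))
score_shuffled = emd_1d(histogram(image_a), histogram(image_a_shuffled))
```

Because the histogram discards pixel positions, `score_shuffled` is zero even though the two pixel arrangements differ, which is the sense in which such a score is agnostic to the objects in the scene.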
Finally, several tools have been developed for organizing large sets of images contributed by a community of photographers. One such tool, the World-Wide Media eXchange (WWMX), allows users to contribute photographs and provide geo-location information by using a GPS receiver or by dragging and dropping photos onto a map. However, the location information may not be very accurate, and the browsing interface of WWMX is limited to an overhead map view. Other photo-sharing tools, such as FLICKR®, do not explicitly use location information to organize users' photographs, although FLICKR® supports tools such as “Mappr” for annotating photos with location, and it is possible to link images in FLICKR® to external mapping tools such as GOOGLE® Earth.
The following references are relevant to the description of the invention.
ARYA, S., MOUNT, D. M., NETANYAHU, N. S., SILVERMAN, R., AND WU, A. Y. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM 45, 6, 891-923.
BROWN, M., AND LOWE, D. G. 2005. Unsupervised 3D object recognition and reconstruction in unordered datasets. In International Conference on 3D Imaging and Modeling.
CANNY, J. 1986. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8, 6, 679-698.
DEBEVEC, P. E., TAYLOR, C. J., AND MALIK, J. 1996. Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In SIGGRAPH '96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM Press, New York, N.Y., USA, 11-20.
Flickr. http://www.flickr.com.
HARTLEY, R. I., AND ZISSERMAN, A. 2004. Multiple View Geometry in Computer Vision, second ed. Cambridge University Press, ISBN: 0521540518.
JOHANSSON, B., AND CIPOLLA, R. 2002. A system for automatic pose-estimation from a single image in a city scene. In IASTED Int. Conf. Signal Processing, Pattern Recognition and Applications.
LOURAKIS, M. I., AND ARGYROS, A. A. 2004. The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Tech. Rep. 340, Institute of Computer Science - FORTH, Heraklion, Crete, Greece, Aug. Available from http://www.ics.forth.gr/~lourakis/sba.
MIKOLAJCZYK, K., AND SCHMID, C. 2005. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence 27, 10, 1615-1630.
RUBNER, Y., TOMASI, C., AND GUIBAS, L. J. 1998. A metric for distributions with applications to image databases. In Intl Conf. on Computer Vision (ICCV), 59-66.
SCHAFFALITZKY, F., AND ZISSERMAN, A. 2002. Multi-view matching for unordered image sets, or “How do I organize my holiday snaps?” In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, vol. 1, 414-431.
SUTHERLAND, I. E. 1964. Sketchpad: a man-machine graphical communication system. In DAC '64: Proceedings of the SHARE design automation workshop, ACM Press, New York, N.Y., USA, 6.329-6.346.
SZELISKI, R. 2005. Image alignment and stitching: A tutorial. Tech. Rep. MSR-TR-2004-92, Microsoft Research.
WERNER, T., AND ZISSERMAN, A. 2002. New techniques for automated architecture reconstruction from photographs. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, vol. 2, 541-555.
WWMX. World-Wide Media eXchange. http://www.wwmx.org.
YEH, T., TOLLMAR, K., AND DARRELL, T. 2004. Searching the web with mobile images for location recognition. In CVPR (2), 76-81.