Large image collections can be used for many tasks, such as computer vision, geometry reconstruction, and data mining. Although some algorithms perform such tasks effectively, improved results can be obtained by relying on the density of image data in such collections. The web as a whole contains billions of images and is an ideal resource for such endeavors. Recent algorithms have performed similarity-based retrieval on large image collections. While this is useful, further benefits can be derived by discovering the underlying structure of an image collection through computationally demanding tasks such as unsupervised appearance-based image clustering. Applications of clustering include improving image search relevance, facilitating content filtering, and generating useful web image statistics. Also, caching image clusters can yield significant runtime savings for various applications. However, large-scale clustering is challenging in terms of both accuracy and computational cost.
Traditional algorithms for data clustering do not scale well to large image collections. In particular, iterative algorithms (e.g., k-means, hierarchical clustering) and probabilistic models exhibit poor scaling. Further, some traditional algorithms may require the number of clusters to be determined beforehand, which is difficult for large collections. The scale of an image dataset may also lead to a preference for certain platforms. In particular, datacenter platforms and programming frameworks such as MapReduce and DryadLINQ may be desirable, yet iterative algorithms may not adapt well to such platforms.
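The scaling concern with iterative algorithms can be illustrated with a minimal sketch of plain k-means (Lloyd's algorithm). This is a toy illustration only, not the technique discussed in this document: the function name `kmeans`, the first-k seeding, and the six sample points are all assumptions made for the example. It shows two properties noted above: the number of clusters `k` must be supplied up front, and every iteration makes a full pass over the data.

```python
import math

def kmeans(points, k, max_iters=100):
    """Plain Lloyd's k-means; returns (centroids, labels, passes)."""
    # Assumption: seed with the first k points so the sketch is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    passes = 0
    for _ in range(max_iters):
        passes += 1
        # Assignment step: touches every point once per iteration.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels, passes

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, passes = kmeans(points, k=2)
```

Even on this six-point set, the loop needs more than one full pass before the assignments settle. In a MapReduce-style framework each such pass would typically be a separate job over the entire collection, which is one reason iterative algorithms may fit such platforms poorly.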
Scalable techniques for image clustering are discussed below.