The increasing popularity of image applications enables users to make use of the cameras in their mobile devices when performing various tasks. An image application may enable users to capture images with the cameras of their mobile devices and then submit those images as queries to perform a search. Typically, the image application evaluates candidate images for similarity to the submitted query in order to retrieve images that are relevant to the query.
An image may be represented by a vector such as that employed in a bag-of-features (BoF) model, in which image features are treated analogously to words in a document. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a sparse vector of occurrence counts of a vocabulary of local image features.
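As a minimal illustration of the document-side analogy, the sketch below builds a sparse histogram of word counts over a small vocabulary. The function name `bag_of_words`, the toy sentence, and the vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return a sparse histogram {word: count} over the given vocabulary.

    Words outside the vocabulary are ignored, and vocabulary words that
    never occur are simply absent from the result (sparse representation).
    """
    vocab = set(vocabulary)
    return Counter(token for token in tokens if token in vocab)


# Toy example (illustrative data only).
doc = "the cat sat on the mat".split()
hist = bag_of_words(doc, ["cat", "dog", "mat", "the"])
```

Only the words that actually occur are stored, which is what makes the vector sparse when the vocabulary is large.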
To achieve this, the features of an image are first detected. After feature detection, the image is abstracted as a set of local patches. Feature representation methods deal with how to represent the patches as numerical vectors, which are called feature descriptors. One of the best-known descriptors is the scale-invariant feature transform (SIFT), which converts each patch to a 128-dimensional vector. After this step, the image is a collection of vectors of the same dimension (128 for SIFT), where the order of the vectors is of no importance. The final step for the BoF model is to convert the vector-represented patches to “codewords” (analogous to words in text documents), which also produces a “codebook” (analogous to a word dictionary). A codeword can be considered a representative of several similar patches. One simple way to obtain codewords is to cluster all the vectors, for example with k-means; the codewords are then defined as the centers of the learned clusters. The number of clusters is the codebook size (analogous to the size of the word dictionary).
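The quantization step described above can be sketched as follows, assuming a codebook has already been learned. Each descriptor is mapped to its nearest codeword, and the image is summarized by the resulting counts. The helper names (`nearest_codeword`, `bof_histogram`) and the 2-dimensional toy data are assumptions made for illustration; real SIFT descriptors would be 128-dimensional.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (ties: lowest index)."""
    return min(range(len(codebook)),
               key=lambda j: squared_distance(descriptor, codebook[j]))


def bof_histogram(descriptors, codebook):
    """Sparse histogram {codeword index: count} for one image."""
    return Counter(nearest_codeword(d, codebook) for d in descriptors)


# Toy 2-D data standing in for descriptors and a learned codebook.
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 11.0), (0.2, -0.3), (8.5, 9.0)]
histogram = bof_histogram(descriptors, codebook)
```

Because the histogram discards the order of the descriptors, two images with the same local features in different positions receive the same representation.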
K-means has been widely used in computer vision and machine learning for clustering and vector quantization. In image retrieval and recognition, it is often used to learn the codebook for the popular bag-of-features model.
The standard k-means algorithm, Lloyd's algorithm, is an iterative refinement approach that greedily minimizes the sum of squared distances between each point and its assigned cluster center. Each iteration consists of two steps, an assignment step and an update step. The assignment step finds the nearest cluster for each point by checking the distance between the point and each cluster center; the update step recomputes each cluster center as the mean of the points currently assigned to it. When clustering n points into k clusters, the assignment step costs O(nk) distance computations per iteration. For applications with large nk, the assignment step in exact k-means becomes prohibitively expensive.
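The two steps can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration rather than a production implementation: the function name `lloyd_kmeans`, its parameters, the random initialization from the input points, and the toy data are all assumptions. The nested loop in the assignment step is where the n × k distance evaluations per iteration arise.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct input points chosen at random.
    centers = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    dim = len(points[0])
    for _ in range(iterations):
        # Assignment step: each of the n points is compared with all k centers.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers would not change
        assignment = new_assignment
        # Update step: each center becomes the mean of its assigned points.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p, j in zip(points, assignment):
            counts[j] += 1
            for d, x in enumerate(p):
                sums[j][d] += x
        centers = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
            for j in range(k)  # an empty cluster keeps its previous center
        ]
    return centers, assignment


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd_kmeans(points, k=2)
```

Keeping the previous center for an empty cluster is one simple convention among several; other implementations may reinitialize such a center instead.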
In large-scale image retrieval, it is advantageous to learn a large codebook containing one million or more entries, which requires clustering tens or even hundreds of millions of high-dimensional feature descriptors into one million or more clusters. Another emerging application of large-scale clustering is to organize a large corpus of web images for various purposes such as web image browsing/exploring. Thus, efficient clustering of large data sets is desired.