As mobile devices grow in popularity, patch-based image retrieval allows a user to photograph the current surroundings with a camera-equipped mobile telephone or other device, transmit the photograph to a server as a query, and receive a corresponding GPS location and/or other location information. Additional location-related information, such as shopping information, restaurant reviews and so forth, may be returned to the user as part of the query results.
To determine the location corresponding to a photograph, images are indexed offline for use by the server, using a patch-based scene recognition model. However, to ensure sufficient coverage of a large area such as a city, enormous amounts of data need to be used. This means that the scene recognition model has to be constructed and maintained efficiently in a large-scale scenario.
In this technology, local descriptors of scenes are quantized by hierarchical k-means clustering to generate a vocabulary tree, which produces “visual words” (quantized clusters of SIFT features) to represent each image as a Bag-of-Words (BoW) vector. In retrieval, the similarity of images is evaluated by the cosine distance between their BoW vectors. While this system works to a reasonable extent, the scene dataset requires a substantial amount of updating and extending, which is computationally expensive given the enormous amounts of data being maintained and accessed.
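The pipeline described above (hierarchical k-means, visual words, BoW vectors, cosine comparison) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of any particular system: the descriptors are toy 2-D points rather than 128-dimensional SIFT features, and the function names (`build_tree`, `quantize`, `bow`, `cosine_similarity`), the branching factor, and the tree depth are all hypothetical choices made for the example. Cosine similarity is computed here; the corresponding cosine distance is one minus that value.

```python
import math
import random
from collections import Counter

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to descriptor p."""
    return min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))

def mean(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def build_tree(points, branch, depth):
    """Hierarchical k-means: split the descriptors into `branch` clusters,
    then recurse into each cluster until `depth` levels are built."""
    if depth == 0 or len(points) < branch:
        return None
    centroids = kmeans(points, branch)
    groups = [[] for _ in range(branch)]
    for p in points:
        groups[nearest(p, centroids)].append(p)
    children = [build_tree(g, branch, depth - 1) for g in groups]
    return {"centroids": centroids, "children": children}

def quantize(tree, descriptor):
    """Map a descriptor to a visual word: the root-to-leaf path in the tree."""
    path, node = [], tree
    while node is not None:
        i = nearest(descriptor, node["centroids"])
        path.append(i)
        node = node["children"][i]
    return tuple(path)

def bow(tree, descriptors):
    """Sparse Bag-of-Words vector: visual word -> occurrence count."""
    return Counter(quantize(tree, d) for d in descriptors)

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy data (assumption): each "image" is a set of noisy 2-D descriptors.
rng = random.Random(1)

def toy_image(centers, n=20):
    return [tuple(c + rng.gauss(0, 0.05) for c in center)
            for center in centers for _ in range(n)]

image_a = toy_image([(0, 0), (1, 1)])
image_b = toy_image([(0, 0), (1, 1)])   # same "scene" as image_a
image_c = toy_image([(5, 5), (6, 0)])   # a different "scene"

tree = build_tree(image_a + image_b + image_c, branch=2, depth=2)
sim_ab = cosine_similarity(bow(tree, image_a), bow(tree, image_b))
sim_ac = cosine_similarity(bow(tree, image_a), bow(tree, image_c))
```

Note that the tree is built once from all descriptors in the dataset. Adding new images with previously unseen content would, in this naive sketch, mean re-running `build_tree` and recomputing every BoW vector, which gives a sense of why updating and extending a large scene dataset is costly.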