The present application relates to systems and methods for image retrieval.
Retrieval of visually similar images from large databases is becoming important for many commercial applications. In one exemplary application, query images are captured by phone cameras and compared against a database with millions of original digital copies with a single image for each object. This scenario presents some unique challenges: the digital copies may appear quite different from their physical counterparts, especially because of lighting, reflections, motion and out-of-focus blur, not to mention significant viewpoint variations.
In terms of methodologies and features, recent large-scale image retrieval algorithms may be categorized into two lines: 1) compact hashing of global features; and 2) efficient indexing of local features by a vocabulary tree. Global features such as GIST features or color histograms delineate the holistic contents of images, which can be compactly indexed by binary codes or hashing functions. Thus, the retrieval is very efficient in both computation and memory usage, although it is unable to attend to the details of images. In the other line of work, images are represented by a bag of local invariant features which are quantized into visual words by a huge vocabulary tree. This vocabulary tree based retrieval is very capable of finding near-duplicate images, i.e., images of the same objects or scenes captured under different conditions, at the cost of the memory required for the inverted indexes of a large number of visual words.
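By way of illustration only, the first line of work (compact binary codes over global features) may be sketched as follows. The sketch assumes random-hyperplane hashing, which is only one of many possible hashing functions and is not named in the text above; the function names `make_hyperplanes`, `binary_code` and `hamming` are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed hashing scheme, for illustration)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(feature, hyperplanes):
    """Pack the sign of each projection of a global feature into one integer."""
    code = 0
    for plane in hyperplanes:
        dot = sum(w * x for w, x in zip(plane, feature))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(code_a, code_b):
    """Number of differing bits between two binary codes."""
    return bin(code_a ^ code_b).count("1")
```

Each image is then stored as a single integer of `n_bits` bits, and candidates are compared by Hamming distance, which is why this line of work is inexpensive in both computation and memory.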
In the large hierarchical vocabulary tree, local features are encoded into a bag-of-words (BoW) histogram with millions of visual words. This histogram is so sparse that inverted index files are well suited to implement the indexing and searching efficiently. Visual words are conventionally weighted by the TF-IDF (term frequency-inverse document frequency), where the IDF reflects their discriminative abilities in database images and the TF indicates their importance in a query image. Only the feature descriptors, without the scale and orientation, are used in this method.
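The inverted-index search with TF-IDF weighting described above may be sketched as follows. This is a minimal illustration under stated assumptions: visual words are given as integer identifiers already produced by a vocabulary tree, the IDF is taken as log(N / n_w), and scores are normalized as a cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(db_words):
    """db_words maps image_id -> list of visual word ids found in that image."""
    index = defaultdict(dict)  # word -> {image_id: term count}
    for image_id, words in db_words.items():
        for word, count in Counter(words).items():
            index[word][image_id] = count
    n_images = len(db_words)
    # Assumed IDF form: log(N / number of images containing the word).
    idf = {w: math.log(n_images / len(p)) for w, p in index.items()}
    # L2 norm of each database image's TF-IDF vector, used for normalization.
    sq_norms = defaultdict(float)
    for word, postings in index.items():
        for image_id, count in postings.items():
            sq_norms[image_id] += (count * idf[word]) ** 2
    norms = {i: math.sqrt(v) for i, v in sq_norms.items()}
    return index, idf, norms


def search(index, idf, norms, query_words):
    """Rank database images by cosine similarity of TF-IDF vectors."""
    scores = defaultdict(float)
    q_sq_norm = 0.0
    for word, q_count in Counter(query_words).items():
        if word not in index:
            continue  # word never seen in the database
        q_weight = q_count * idf[word]
        q_sq_norm += q_weight ** 2
        # Only images listed under this word are touched (sparse access).
        for image_id, d_count in index[word].items():
            scores[image_id] += q_weight * d_count * idf[word]
    q_norm = math.sqrt(q_sq_norm)
    ranked = []
    for image_id, s in scores.items():
        denom = q_norm * norms.get(image_id, 0.0)
        if denom > 0:
            ranked.append((image_id, s / denom))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Because only the postings of the words present in the query are visited, the cost of a search grows with the number of matching entries rather than with the size of the vocabulary.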
In the vocabulary tree based image retrieval, since images are essentially represented by a bag of orderless visual words, the geometric relations of the local features or their spatial layout are largely ignored. Therefore, a post-retrieval re-ranking procedure is often employed to re-order the retrieved candidate images by verifying their geometrical consistency against the query image in order to further improve the retrieval precision. Usually, in the geometrical re-ranking, the local feature descriptors of two images are first matched reliably using conventional methods; then a RANSAC procedure can be employed to fit a global affine transform. The candidate images are re-ranked according to the number of RANSAC inliers or the fitting errors. This conventional re-ranking approach faces two issues. First, this procedure is generally computationally intensive because it operates on the high dimensional descriptors. The running time could be even longer than that of the retrieval itself. Second, the assumption of a global affine transform between two images may not hold, e.g., for images of a 3D object taken from different view angles.
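The conventional geometric re-ranking described above may be sketched as follows. This is a simplified illustration, assuming that descriptor matching has already produced a list of point correspondences `((x, y), (u, v))`; the function names, the iteration count and the pixel tolerance are hypothetical choices, and the affine model is fitted exactly from three sampled correspondences by Cramer's rule.

```python
import random


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_affine(src, dst):
    """Exact affine (a, b, c, d, e, f) from 3 correspondences, or None if collinear."""
    base = [[x, y, 1.0] for x, y in src]
    d = det3(base)
    if abs(d) < 1e-9:
        return None  # degenerate sample: the three source points are collinear
    params = []
    for axis in (0, 1):  # solve for the u row first, then the v row
        rhs = [p[axis] for p in dst]
        for col in range(3):
            m = [row[:] for row in base]
            for r in range(3):
                m[r][col] = rhs[r]
            params.append(det3(m) / d)
    return tuple(params)


def count_inliers(params, matches, tol):
    """Count matches whose transformed source lies within tol of the target."""
    a, b, c, d, e, f = params
    n = 0
    for (x, y), (u, v) in matches:
        du = a * x + b * y + c - u
        dv = d * x + e * y + f - v
        if du * du + dv * dv <= tol * tol:
            n += 1
    return n


def ransac_inliers(matches, iterations=200, tol=3.0, seed=0):
    """Best inlier count over random 3-point affine hypotheses."""
    if len(matches) < 3:
        return 0
    rng = random.Random(seed)
    best = 0
    for _ in range(iterations):
        sample = rng.sample(matches, 3)
        params = fit_affine([s for s, _ in sample], [t for _, t in sample])
        if params is not None:
            best = max(best, count_inliers(params, matches, tol))
    return best
```

Candidate images would then be sorted by the value returned by `ransac_inliers`. The sketch also makes the second issue visible: a single set of six affine parameters must explain every match, which cannot be expected for a 3D object seen from different view angles.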