Conventional visual search methods are based on a bag-of-features framework. Such a framework typically first extracts local region descriptors, such as Scale Invariant Feature Transform (SIFT) descriptors, from each image in a database. It then quantizes these high-dimensional local descriptors into discrete visual words using a visual vocabulary. The most common way to construct a visual vocabulary is to cluster the descriptors extracted from a set of training images, e.g., by K-means clustering, and to treat each cluster as a visual word described by the interest point at its center. The quantization step assigns each local descriptor to the visual word in the vocabulary that is closest to it in terms of Euclidean (l2) distance, using approximate search techniques such as KD-tree search and Locality Sensitive Hashing. After quantization, each image is represented by a frequency histogram of its bag of visual words. However, visual descriptor quantization causes two serious and unavoidable problems: mismatch and semantic gap.
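The pipeline described above (vocabulary construction by clustering, quantization to the nearest visual word, and histogram representation) can be illustrated with a minimal sketch. It makes several assumptions that are not part of the described methods: descriptors are plain lists of floats rather than real SIFT output, the function names (`build_vocabulary`, `quantize`, `bag_of_words_histogram`) are hypothetical, and the nearest-word search is an exact brute-force scan standing in for the approximate KD-tree or Locality Sensitive Hashing search.

```python
import math
import random
from collections import Counter


def l2_distance(a, b):
    """Euclidean (l2) distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(descriptor, vocabulary):
    """Return the index of the visual word (cluster center) closest to the descriptor.

    Assumption: exact brute-force search, used here in place of an
    approximate KD-tree or LSH lookup to keep the sketch short.
    """
    return min(range(len(vocabulary)),
               key=lambda k: l2_distance(descriptor, vocabulary[k]))


def build_vocabulary(descriptors, num_words, iterations=10, seed=0):
    """Plain K-means (Lloyd's algorithm); each final center is one visual word."""
    rng = random.Random(seed)
    # Initialize the centers with randomly chosen training descriptors.
    centers = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[quantize(d, centers)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bag_of_words_histogram(image_descriptors, vocabulary):
    """Represent one image as a frequency histogram over the visual words."""
    counts = Counter(quantize(d, vocabulary) for d in image_descriptors)
    return [counts.get(k, 0) for k in range(len(vocabulary))]
```

As a usage example, with a hand-made two-word vocabulary `[[0.0, 0.0], [10.0, 10.0]]`, the call `bag_of_words_histogram([[1.0, 1.0], [9.0, 8.0], [0.0, 2.0]], vocabulary)` returns `[2, 1]`: two descriptors fall nearest the first word and one nearest the second.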
Mismatch arises because a single visual word is too coarse to distinguish descriptors extracted from semantically different objects; this is termed the polysemy phenomenon. The polysemy phenomenon is particularly prominent when the visual vocabulary is small. By contrast, semantic gap means that several different visual words describe visual objects that are semantically the same, which leads to the synonymy phenomenon. The synonymy phenomenon is especially prevalent when a large visual vocabulary is adopted, and usually results in poor recall in visual search and in computer vision applications such as object retrieval, object recognition, and object categorization. One of the reasons behind the synonymy phenomenon is a lack of robust local descriptor algorithms. Local descriptors such as SIFT are sensitive to small disturbances, e.g., changes in viewpoint, scale, blur, and capturing conditions. Such disturbances cause SIFT to output unstable and noisy feature values.
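The dependence of the two phenomena on vocabulary size can be seen in a toy one-dimensional sketch. Everything in it is an illustrative assumption: real descriptors are high-dimensional, real vocabularies come from clustering rather than an even grid, and the descriptor values and helper names (`grid_vocabulary`, `nearest_word`) are hypothetical.

```python
def grid_vocabulary(num_words):
    """Hypothetical 1-D vocabulary: num_words evenly spaced centers on [0, 1]."""
    return [(k + 0.5) / num_words for k in range(num_words)]


def nearest_word(value, vocabulary):
    """Index of the visual word whose center is closest to a scalar descriptor."""
    return min(range(len(vocabulary)), key=lambda k: abs(value - vocabulary[k]))


# Assumed toy descriptors: object A, the same object A after a small
# disturbance (e.g., a slight viewpoint change), and a different object B.
object_a = 0.11
object_a_disturbed = 0.13
object_b = 0.41

small = grid_vocabulary(2)   # coarse vocabulary
large = grid_vocabulary(50)  # fine vocabulary

# Small vocabulary: A and B collapse onto one word (polysemy).
polysemy = nearest_word(object_a, small) == nearest_word(object_b, small)

# Large vocabulary: the two views of A land on different words (synonymy).
synonymy = nearest_word(object_a, large) != nearest_word(object_a_disturbed, large)

print(polysemy, synonymy)  # True True
```

In this sketch the coarse vocabulary cannot separate the two objects, while the fine vocabulary separates them but also splits two views of the same object, which is the trade-off described above.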