In many applications, such as web-based search engine image queries, brute-force methods that compare entire query images against entire stored images to find appropriate matches are prohibitively inefficient and computationally expensive. For this reason, current frameworks for two-dimensional image searching instead process the query image and stored database images to find matches using various feature detection schemes that quantize such images as “bags of visual features.”
In content-based image retrieval, an image may be represented as a bag of visual features—that is, an image file can be viewed as a “bag” (an unsorted container) filled with visual features such as edges, corners, blobs, and so forth. These individual features found in an image may correspond to individual features in an index of known visual features, akin to how individual words extracted from a novel may correspond to individual words found in a dictionary. Of course, an individual feature from an image may not have a perfect visual match in the database—just like a handwritten word may not look like its equivalent typeset word in a dictionary—but analogous to how the human eye and mind can match a handwritten word to a typeset word, visual features can be closely matched to corresponding entries in a visual feature database using a combination of techniques for feature detection, feature description, and feature book generation (described below). Once the visual features of an image have been identified and counted—again, like the words in a novel being individually identified and tallied—the image as a whole can be quantified based on, for example, a histogram representation of its independent visual features that is then compared to other images (with their own representations of their independent visual features) to identify those images with sufficiently similar histogram representations. In this way, a bag of visual features corresponding to an image may serve as the basic element for processing that image in a content-based retrieval context.
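The bag-of-visual-features representation described above can be illustrated with a minimal Python sketch. It is not taken from any particular system: the two-dimensional toy descriptors, the three-entry `codebook` (standing in for the index of known visual features), and the helper names `nearest_word`, `bag_of_features`, and `cosine_similarity` are all hypothetical. Nearest-neighbor assignment and cosine similarity are only one possible choice for matching features and comparing histograms.

```python
import math
from collections import Counter


def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[i])),
    )


def bag_of_features(descriptors, codebook):
    """Normalized histogram counting how often each visual word occurs in an image."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(len(codebook))]


def cosine_similarity(h1, h2):
    """Cosine similarity between two histograms (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


# Hypothetical codebook of three visual words and toy descriptors for three images.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.9)]
similar = [(0.0, 0.1), (1.0, 0.0), (0.8, 0.0), (0.1, 1.0)]
different = [(0.0, 1.1), (0.1, 0.9), (-0.1, 1.0), (0.0, 0.8)]

h_query = bag_of_features(query, codebook)
h_similar = bag_of_features(similar, codebook)
h_different = bag_of_features(different, codebook)

print(h_query)
print(cosine_similarity(h_query, h_similar))
print(cosine_similarity(h_query, h_different))
```

In this sketch no descriptor matches a codebook entry exactly, yet each is still assigned to its closest visual word, which mirrors the handwritten-versus-typeset analogy. The image whose histogram is closest to the query's receives the higher similarity score.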
However, with regard to image retrieval and recognition, many visual features detected in an image are unreliable or irrelevant to the objects of interest in that image, and including such features can reduce retrieval performance. This is particularly true when the image contains an object of interest that is of paramount importance to image matching, which is often the case in web-based image searching. For example, given an image of a prominent foreground object such as a building, the features of a tree in the background may in fact degrade the performance of image recognition and retrieval. Similarly, some visual features from highly textured regions may not be repeatable (that is, such features may change with small disturbances due to camera viewpoint or image ‘noise’), and thus these visual features would hinder, not help, the image recognition and retrieval process. While some common approaches utilize simple weighting schemes based on visual word counts, such as term frequency-inverse document frequency (TF-IDF), these approaches do not effectively reduce the impact of irrelevant or unreliable image features.
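To make the count-based weighting concrete, the following Python sketch applies one common TF-IDF variant (term frequency normalized by image length, inverse document frequency as the natural log of the image count over the document frequency) to visual-word counts. The function name `tf_idf` and the toy count matrix are hypothetical, and other TF-IDF formulations exist. The point of the sketch is that the weights are computed from counts alone, with no notion of whether a visual word comes from the object of interest or from the background.

```python
import math


def tf_idf(histograms):
    """Weight raw visual-word counts by term frequency times inverse document frequency."""
    n_images = len(histograms)
    n_words = len(histograms[0])
    # Document frequency: number of images in which each visual word occurs at least once.
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    # Words that occur in every image get an IDF of log(1) = 0.
    idf = [math.log(n_images / df[w]) if df[w] else 0.0 for w in range(n_words)]
    weighted = []
    for h in histograms:
        total = sum(h)
        weighted.append(
            [(h[w] / total if total else 0.0) * idf[w] for w in range(n_words)]
        )
    return weighted


# Hypothetical counts: rows are images, columns are visual words.
counts = [
    [4, 1, 0],
    [2, 0, 3],
    [1, 0, 0],
]
weights = tf_idf(counts)
for row in weights:
    print(row)
```

In this toy data, visual word 0 occurs in every image and is weighted to zero, while word 2 receives a high weight in the second image simply because it is frequent there and rare elsewhere. Nothing in the computation checks whether word 2 is repeatable or belongs to the foreground object.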