Large numbers of data objects (e.g., text documents or image files) can be stored in databases or other storage structures. One task that can be performed on such stored data objects is similarity-based retrieval, in which one or more data objects similar to one or more other data objects are identified. Examples of applications that utilize similarity-based retrieval include similarity-based image retrieval (where one or more images that are similar to a query image are retrieved) and copyright violation detection (to detect whether use or copying of a particular data object is authorized).
Traditionally, a data object (e.g., a text document or an image file) can be represented by a set of one or more feature vectors, where a feature vector captures aspects of the data object being represented. For example, a feature vector for a text document can be a bag (collection) of certain words that appear in the text document. An image file can be associated with a relatively large number of feature vectors (e.g., hundreds of feature vectors), where each feature vector of the image file represents a different aspect of the image file.
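As an illustration of the bag-of-words representation mentioned above, the following is a minimal sketch in Python. The function name `bag_of_words`, the fixed `vocabulary` list, and the simple lowercase tokenization are assumptions made for this example only; the text does not prescribe how the words are selected or counted.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a feature vector of word counts for `text`.

    Hypothetical helper: each position in the result corresponds to one
    word in `vocabulary` (an assumed, fixed list of words of interest).
    """
    # Assumed tokenization: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the text.
    return [counts[word] for word in vocabulary]


vocabulary = ["image", "query", "retrieval", "similar"]
vector = bag_of_words("Retrieval of an image similar to a query image.", vocabulary)
print(vector)  # [2, 1, 1, 1]
```

An image file would typically need a different feature extractor and, as noted above, may yield hundreds of such vectors rather than one.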
Two data objects are deemed to be similar if the one or more feature vectors of a first data object are similar to the one or more feature vectors of a second data object, according to some measure (e.g., a cosine similarity measure). However, comparing a relatively large number of vectors for the purpose of performing similarity-based retrieval can be computationally expensive and inefficient.
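The cost of such a comparison can be illustrated with a brute-force sketch. The cosine similarity formula is standard; the function `count_similar_pairs`, the `threshold` value of 0.9, and the choice to count matching pairs are hypothetical and are used here only to show how the number of comparisons grows with the number of feature vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def count_similar_pairs(vectors_a, vectors_b, threshold=0.9):
    """Compare every vector of object A with every vector of object B.

    Returns (matches, comparisons). The threshold is an assumed value.
    """
    matches = 0
    comparisons = 0
    for u in vectors_a:
        for v in vectors_b:
            comparisons += 1
            if cosine_similarity(u, v) >= threshold:
                matches += 1
    return matches, comparisons


object_a = [[1, 0], [0, 1]]
object_b = [[1, 0], [1, 1], [0, 2]]
print(count_similar_pairs(object_a, object_b))  # (2, 6)
```

In this sketch, two objects with m and n feature vectors require m × n similarity computations, so two images with a few hundred feature vectors each would already need tens of thousands of computations for a single pairwise comparison, before any search across a stored collection.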