Electronic search for digital objects on a computer, over a network, or over the Internet presently relies on textual search techniques. However, textual search techniques have limited applicability to digital objects that do not consist of text, such as still images, videos, audio files and multimedia objects in general.
Efforts to search on non-textual digital objects have included search on attributes intrinsic to an image, such as content-based image recognition. Such approaches may make use of global attributes, such as color histograms, or local attributes, such as object recognition. Search for non-textual digital objects on intrinsic attributes has met with mixed success and is often supplemented by other search techniques. Because the intrinsic attributes of an image are unchanging, improving search accuracy may require associating additional data with the non-textual digital object.
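As an illustration of a global intrinsic attribute, the following minimal sketch computes a coarse color histogram and compares two histograms by histogram intersection. It is not drawn from any particular system: the function names, the bin count, and the toy pixel lists are assumptions chosen only for this example.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram over quantized (r, g, b) values."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_key: n / total for bin_key, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

# Hypothetical toy "images", each a flat list of RGB pixels.
sky_a = [(100, 150, 250)] * 90 + [(90, 90, 90)] * 10
sky_b = [(110, 160, 240)] * 85 + [(80, 80, 80)] * 15
grass = [(30, 200, 40)] * 100

sim_near = histogram_intersection(color_histogram(sky_a), color_histogram(sky_b))
sim_far = histogram_intersection(color_histogram(sky_a), color_histogram(grass))
```

Because the histogram is computed from the pixels alone, its value for a given image never changes, which is why accuracy gains tend to come from data associated with the object rather than from the attribute itself.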
Associating textual data with digital objects allows text search techniques to be leveraged on otherwise non-textual digital objects. Textual data may be associated by various techniques such as overloading file names, adding meta-tags, and associating links to data stores containing meta-tags. Searching for digital objects by searching their metadata, alone or in combination with other search techniques, has yielded improved results.
One difficulty with associating text metadata with digital objects is that near-duplicates of objects either create redundant records or disperse tags. A near-duplicate is a digital object that stores similar data, but has slight differences in attributes not of interest to searching users. For example, suppose a data store holds two photos of the Eiffel Tower, one taken at 12:00 noon and the other at 11:00 AM, under similar lighting conditions and from similar angles. In effect, the two photos are duplicates of each other. The photos are not exact duplicates, because of the small variances in lighting and angle, but both clearly represent the Eiffel Tower and show similar features of it.
As near-duplicates, the two photos of the Eiffel Tower may be considered redundant. At best, both photos will have tags with the name of the Eiffel Tower. From that perspective, it might be better to keep the best photo and eliminate the near-duplicate in order to eliminate redundancy. However, over time, some users will add tags to the first photo and others will add tags to the second photo. Thus the first photo may be tagged with “Paris, France” and the second photo may be tagged with “1889 World's Fair”. Here, because of the existence of near-duplicates in the data store, the tags for a photo of the Eiffel Tower have been dispersed. A query for the 1889 World's Fair will obtain the second Eiffel Tower photo but not the first, and a query for Paris, France, will obtain the first Eiffel Tower photo but not the second.
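The tag dispersal described above can be reproduced with a small sketch. The object identifiers, the in-memory dictionary standing in for a data store, and the search function are all hypothetical, chosen only to mirror the Eiffel Tower example.

```python
# Hypothetical tag store: each near-duplicate photo carries its own tag set.
tag_store = {
    "eiffel_photo_1": {"Eiffel Tower", "Paris, France"},
    "eiffel_photo_2": {"Eiffel Tower", "1889 World's Fair"},
}

def search(store, tag):
    """Return the identifiers of all objects carrying the exact tag."""
    return sorted(obj for obj, tags in store.items() if tag in tags)

paris_hits = search(tag_store, "Paris, France")      # first photo only
fair_hits = search(tag_store, "1889 World's Fair")   # second photo only
tower_hits = search(tag_store, "Eiffel Tower")       # both photos
```

Only the tag shared by both records retrieves both photos; each user-added tag reaches just the one record it was attached to, even though it describes the subject of both.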
For these and other related reasons, near-duplicates are not only presently disfavored, but are also often removed from digital object data stores. However, it may be impractical to remove near-duplicates from a data store. The photos may be dispersed over several stores or over the Internet, where a user would not have privileges to delete digital objects.