Many search engine services, such as Google and Overture, allow users to search for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to them. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
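The keyword-to-page mapping and query matching described above can be illustrated with a minimal sketch. It assumes the keywords of each page have already been extracted, and it ranks pages only by the number of query words they match; the function names, the example URLs, and this ranking rule are hypothetical and stand in for whatever relevance, popularity, or importance measure a real service might use.

```python
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to the set of page URLs that list it."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return index


def search(index, query):
    """Return page URLs ordered by how many distinct query words they match."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest match count first; ties are broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical pages with keywords already extracted by a crawler.
pages = {
    "example.com/tigers": ["tiger", "animal", "habitat"],
    "example.com/zoo": ["animal", "zoo"],
    "example.com/cars": ["car", "engine"],
}
index = build_keyword_index(pages)
results = search(index, "tiger animal")
```

In this sketch, the page that matches both query words is listed ahead of the page that matches only one, and pages that match neither word are omitted.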
The Internet is being increasingly used to search for and view images (e.g., photographs). To support this use, commercial search engine services have located and indexed over 1 billion images since 2005. The indexing techniques for images of web pages typically work in a similar manner to the indexing of web pages. Once a search engine service identifies an image, it attempts to identify keywords related to that image from text surrounding the image on the web page that contains the image or from text surrounding links on other web pages that reference the web page that contains the image. The search engine service then creates a mapping from those keywords to the image. A user can then submit a textual query when searching for an image. For example, a user who is interested in locating images relating to a tiger may submit the query “tiger animal.” The search engine service may then search the keyword indexes for the keywords “tiger” and “animal” to locate the related images. The search engine service displays a thumbnail of each related image as the search result. Since many web pages may contain different copies of the same image, a search result may include many duplicate images. To improve the user's experience, a search engine service may want to identify and remove, or at least group together, duplicate images. Unfortunately, current techniques for detecting duplicate images are typically too slow to be performed in real time or are too inaccurate to be particularly useful.
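The image indexing and duplicate grouping described above can be sketched as follows. The sketch assumes each image comes with its surrounding text and raw bytes, treats every word of that text as a keyword, and requires an image to match every query word; all of these are simplifying assumptions. Duplicates are grouped by an exact SHA-256 hash of the image bytes, which is only a stand-in: it catches byte-identical copies but not resized or recompressed ones, so it illustrates the accuracy limitation noted above rather than a solution to it. All names and data are hypothetical.

```python
import hashlib
from collections import defaultdict


def index_images(images):
    """Map each word of an image's surrounding text to that image's ID."""
    index = defaultdict(set)
    for image_id, (surrounding_text, _data) in images.items():
        for word in surrounding_text.lower().split():
            index[word].add(image_id)
    return index


def find_images(index, query):
    """Return the IDs of images whose keywords include every query word."""
    words = query.lower().split()
    if not words:
        return set()
    matches = set(index.get(words[0], ()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


def group_exact_duplicates(images, image_ids):
    """Group image IDs whose raw bytes are identical (exact duplicates only)."""
    groups = defaultdict(list)
    for image_id in sorted(image_ids):
        _text, data = images[image_id]
        groups[hashlib.sha256(data).hexdigest()].append(image_id)
    return sorted(groups.values())


# Hypothetical images: (surrounding text, raw image bytes).
images = {
    "a.jpg": ("tiger animal in the wild", b"\x01\x02\x03"),
    "b.jpg": ("a tiger is a large animal", b"\x01\x02\x03"),
    "c.jpg": ("animal tiger cub", b"\x09\x09"),
    "d.jpg": ("sports car", b"\x05"),
}
index = index_images(images)
matches = find_images(index, "tiger animal")
groups = group_exact_duplicates(images, matches)
```

Here the two byte-identical images fall into one group that a search result could display as a single thumbnail, while the third matching image remains in a group of its own.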
The rapid and accurate identification of duplicate images would be useful in many applications other than a search engine service. For example, the owner of a copyright in an image may want to crawl the web searching for duplicate images in an attempt to identify copyright violations. Indeed, an organization that sells electronic images may have millions of images for sale. Such an organization may periodically want to crawl the web to check for unauthorized copies of its millions of images. The speed and accuracy of duplicate image detection are very important to such an organization.