Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (also referred to as a “query”) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of base web pages to identify all web pages that are accessible through those base web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may generate a relevance score to indicate how related the information of the web page may be to the search request. The search engine service then displays to the user links to those web pages in an order that is based on their relevance.
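The keyword-to-page mapping and relevance ordering described above can be illustrated with a minimal sketch. It is not the implementation of any particular search engine service. The function names `build_keyword_index` and `search`, the example URLs, and the scoring rule (counting how many distinct query terms a page matches) are all assumptions made for illustration; a real service would use a far richer relevance score.

```python
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to the set of page URLs that contain it.

    `pages` is a hypothetical stand-in for crawler output:
    a dict of URL -> list of keywords extracted from that page.
    """
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return index


def search(index, query):
    """Return URLs ordered by the number of distinct query terms matched."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for url in index.get(term, ()):
            scores[url] += 1
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical crawl result used only to exercise the sketch.
pages = {
    "http://example.com/a": ["tiger", "animal", "zoo"],
    "http://example.com/b": ["tiger", "golf"],
    "http://example.com/c": ["zoo", "tickets"],
}
index = build_keyword_index(pages)
results = search(index, "tiger zoo")
```

In this sketch, the page that matches both query terms is listed first, and pages that match only one term follow it.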
Several search engine services also provide for searching for images that are available on the Internet. These image search engines typically generate a mapping of keywords to images by crawling the web in much the same way as described above for mapping keywords to web pages. An image search engine service can identify keywords based on the text of the web pages that contain the images. An image search engine may also gather keywords from metadata associated with images of web-based image forums, which are an increasingly popular mechanism for people to publish their photographs and other images. An image forum allows users to upload their photographs and requires the users to provide associated metadata such as title, camera setting, category, and description. The image forums typically allow reviewers to rate each of the uploaded images and thus have ratings on the quality of the images. Regardless of how the mappings are generated, an image search engine service receives an image query and uses the mapping to find images that are related to the image query. An image search engine service may identify thousands of images that are related to an image query and present thumbnails of the related images. To help a user view the images, an image search engine service may order the thumbnails based on the relevance of the images to the image query. An image search engine service may also limit the number of images that are provided to a few hundred of the most relevant images so as not to overwhelm the viewer.
Unfortunately, the relevance determination may not be particularly accurate because image queries may be ambiguous (e.g., “tiger” may represent the animal or the golfer), the keywords derived from web pages may not be closely related to an image of the web page (e.g., a web page can contain many unrelated images), and so on. To help a user view the thousands of images, an image search engine service could cluster a search result based on the content of the images and present the clusters, rather than individual images, to the user. Such clustering techniques include content-based techniques and link-based techniques. The content-based techniques use low-level visual information to identify related images. There are, however, disadvantages to content-based clustering. Content-based clustering is computationally expensive and cannot be practically performed in real time when an image search result contains thousands of images. Moreover, if the clustering is limited to a few hundred of what are thought to be the most relevant images, some very relevant images may be missed because of the difficulties in assessing relevance. The link-based techniques typically assume that images on the same web page are likely to be related and that images on web pages that are each linked to by the same web page are related. Because this assumption often does not hold, however, unrelated images are often clustered together. It is also difficult for either clustering technique to automatically identify meaningful names for a cluster of images. As a result, a user may not be able to effectively identify relevant clusters.
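The weakness of the same-page assumption can be seen in a small sketch of link-based grouping. The sketch covers only the first half of the assumption (images on the same web page are related) and omits the case of pages linked to by a common page. The function name `cluster_by_page`, the file names, and the URLs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cluster_by_page(image_to_page):
    """Group images by the web page that contains them.

    This encodes the assumption that images on the same page are related;
    it does not look at image content at all.
    """
    clusters = defaultdict(list)
    for image, page in image_to_page.items():
        clusters[page].append(image)
    return dict(clusters)


# Hypothetical search result: each image and the page it was found on.
image_to_page = {
    "tiger_cub.jpg": "http://example.com/zoo",
    "ticket_booth.jpg": "http://example.com/zoo",
    "tiger_woods.jpg": "http://example.com/golf",
}
clusters = cluster_by_page(image_to_page)
```

Here the unrelated ticket booth image is placed in the same cluster as the tiger cub image simply because both appear on the same page, and the only available cluster label is the page URL rather than a meaningful name.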