This specification relates to data processing and information retrieval.
The Internet enables access to a wide variety of resources. Images, video files, audio files, web pages for particular subjects, book articles, and news articles are examples of resources that are accessible over the Internet. A search system can identify resources in response to a user query that includes one or more search terms or phrases. Search systems generally identify and score resources based, at least in part, on their relevance to the query, and the search results can be ordered for presentation according to these scores.
The relevance of a resource to a user query can be determined, in part, based on the textual content of the resource or textual content associated with the resource. For example, text included in the content of a resource can be compared to the query to determine a relevance score indicative of the relevance of the resource to the query. In turn, the resources can be ordered, at least in part, based on the relevance scores.
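As an illustration only, the text-based scoring and ordering described above can be sketched as follows. The scoring rule (the fraction of query terms found in the resource text) and the names `tokenize`, `relevance_score`, and `rank_resources` are assumptions made for this example; this specification does not prescribe a particular scoring function.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance_score(query, resource_text):
    """Hypothetical score: fraction of query terms present in the resource text."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    resource_terms = set(tokenize(resource_text))
    return len(query_terms & resource_terms) / len(query_terms)


def rank_resources(query, resources):
    """Return (resource_id, score) pairs ordered from most to least relevant.

    `resources` maps a resource identifier to the text included in, or
    associated with, that resource.
    """
    scored = [(rid, relevance_score(query, text)) for rid, text in resources.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For instance, given resources whose texts are "Football scores and news", "Cooking recipes", and "College football schedule", calling `rank_resources("football schedule", ...)` would place the third resource first, because it is the only one containing both query terms.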
Images are example resources that can be identified as relevant to a query based on textual content associated with the image. Text appearing with an image on a web page can be used to classify the image and/or compute a relevance score that is indicative of the relevance of the image to a search query. For example, an image that appears on a web page with the text “football” may be identified as relevant to the queries “football” and/or “sports.” In turn, the image can be referenced in search results for these queries. While images can be identified as relevant to a query based on text that is associated with the images, the images that are presented to a user in response to a particular query may include near-duplicate images. For example, the same image or a slight variation of an image may appear on many different web pages, such that each instance of the image (or slight variation) may be identified as a separate image that is responsive to the search query.
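The near-duplicate situation described above can be illustrated with a short sketch. It assumes a deliberately simple matching rule (a query term appearing in the text that accompanies the image); the `ImageInstance` type, its fields, and the function names are hypothetical and serve only to show how copies of one picture on different pages surface as separate results.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageInstance:
    """One occurrence of an image on a web page (hypothetical structure)."""
    image_url: str   # address of this copy of the image
    page_text: str   # text appearing with the image on its web page


def matches(query, instance):
    """True if the single-term query appears in the text accompanying the image."""
    terms = re.findall(r"[a-z0-9]+", instance.page_text.lower())
    return query.lower() in terms


def image_results(query, instances):
    """Return the URL of every image instance whose page text matches the query.

    No attempt is made to recognize that two instances depict the same
    picture, so each copy is reported as a separate result.
    """
    return [inst.image_url for inst in instances if matches(query, inst)]
```

If two of three instances are copies of one photograph hosted on different pages that both mention "football", a query for "football" returns both URLs, which is the redundancy noted in the preceding paragraph.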