The present disclosure relates to data processing, and in particular to image classification.
Users can locate images that are available on the Internet by submitting a search query to a search engine. The search query can be a text query that includes words describing the image subject matter that the user is attempting to locate. The search engine identifies images that correspond to the subject matter and provides image search results that include references to the identified images. The images can be identified, for example, based on labels that are associated with the images and/or text appearing near the images on the webpages with which the images are presented.
The identified images can be, for example, images that are presented with webpages. Many different categories of webpages can include images that are identified in response to a search query. For example, images are provided with webpages such as weblogs (“blogs”), social networking pages, and newsgroups, which can be published by many different individuals. Within a single domain (e.g., www.example.com) there can be thousands of webpages, many of which have different individual authors.
Due to the large number of different authors creating webpages located in the same domain, it can be difficult to classify images provided through the domain as belonging to a common topic. For example, within a single blog domain, users may publish blogs directed to topics ranging from sports, to politics, to parenting advice, or even explicit (e.g., pornographic) topics. Thus, if each image available through a common domain is classified as belonging to a common topic, the images may not be accurately classified.
Labels associated with the image and/or text appearing near the image can inaccurately describe the subject matter of the image or be ambiguous among different topics. For example, an image of Babe Ruth appearing in a blog may be associated with the text “The Babe.” While this text is relevant to the image, the text alone is ambiguous: an image associated with it could be of Babe Ruth, an actor playing Babe Ruth, a pig featured in a movie titled “Babe,” or even explicit content.
Providing images that are less relevant to the topic of a user query can reduce the quality of image search results. This is particularly true when images including explicit content (e.g., pornography) are referenced in search results responsive to a query that is not directed to the explicit content. For example, an explicit image included in search results responsive to the search query “Babe Movie” can substantially reduce the quality of the search results for a user who is searching for images of Babe Ruth.
The quality of image search results can be enhanced when images are accurately classified, such that images that are not relevant to the user query can be filtered or otherwise suppressed.
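By way of illustration only, the following minimal Python sketch shows how label-based matching alone can return images from unrelated topics, and how a per-image classification can be used to suppress them. The `Image` structure, the `topic` field, the `matches` and `search` functions, and the example URLs and topic names are all assumptions made for this sketch and do not describe any particular search system.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    url: str
    # Labels or nearby text associated with the image (hypothetical field).
    labels: set[str] = field(default_factory=set)
    # Topic assigned by some image classifier (hypothetical field).
    topic: str = "unknown"


def matches(image: Image, query: str) -> bool:
    """Return True if any query word appears among the image's labels."""
    words = set(query.lower().split())
    return bool(words & {label.lower() for label in image.labels})


def search(images: list[Image], query: str, blocked_topics: set[str]) -> list[Image]:
    """Match on text, then suppress images classified under a blocked topic."""
    return [
        img for img in images
        if matches(img, query) and img.topic not in blocked_topics
    ]


# Three images that all carry the ambiguous label "babe".
images = [
    Image("http://www.example.com/a.jpg", {"babe", "ruth"}, topic="sports"),
    Image("http://www.example.com/b.jpg", {"babe", "pig"}, topic="movies"),
    Image("http://www.example.com/c.jpg", {"babe"}, topic="explicit"),
]

# Text matching alone returns all three images, including the explicit one.
unfiltered = search(images, "Babe Movie", blocked_topics=set())

# With a per-image classification, the explicit image can be suppressed.
filtered = search(images, "Babe Movie", blocked_topics={"explicit"})
```

In this sketch the filter is only as good as the `topic` value assigned to each image, which is why accurate classification at the image level, rather than at the domain level, matters.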