1. Field of the Invention
Systems and methods consistent with the principles of the invention relate generally to information searching and, more particularly, to using image duplicates to assign labels to images for use in image searching.
2. Description of Related Art
Existing information searching systems use received search queries to search data and retrieve the specific information that corresponds to those queries. Such information searching systems may search information stored locally, or in distributed locations. The World Wide Web (“web”) is one example of information stored in distributed locations. The web contains a vast amount of information, but locating a desired portion of that information can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly.
Search engines attempt to return hyperlinks to web documents in which a user is interested. Generally, search engines base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to provide links to high quality, relevant results to the user based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are “hits” and are returned to the user.
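By way of illustration only, the term matching described above may be sketched as follows. The function names, the example URLs, and the choice to require every query term to appear in a document are assumptions made for this sketch; they are not taken from any particular search engine.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_hits(query, corpus):
    """Return the URLs of pre-stored documents that contain every query term.

    Assumption: a document is a "hit" only if it contains all of the terms.
    """
    terms = set(tokenize(query))
    if not terms:
        return []
    return [url for url, text in corpus.items() if terms <= set(tokenize(text))]


# Hypothetical corpus of pre-stored web documents, keyed by URL.
corpus = {
    "http://example.com/a": "Used cars for sale",
    "http://example.com/b": "Gardening tips for spring",
}

hits = find_hits("cars", corpus)
```

In this sketch, a query on "cars" returns only the first document, because the second document does not contain that term.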
Documents that include digital images may be searched using existing search engine technology. Existing search engines employ keyword searching to select which images to return as search results based on labels associated with the images. For example, if a user queries on “cars,” the search engine searches a corpus of image documents for images that have the label “cars” associated with them. This label may have been automatically assigned to the image based on the surrounding text of the document in which the image is located. For example, the following surrounding text may be used to assign labels to an image: 1) the filename of the image; 2) the anchor text associated with the image; 3) the caption associated with the image; and 4) the title of the document.
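By way of illustration only, label assignment from the four kinds of surrounding text listed above, followed by keyword searching over the resulting labels, may be sketched as follows. The `ImageOccurrence` record, the function names, and the word-splitting rule are hypothetical and are assumed here solely for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class ImageOccurrence:
    """One copy of an image as it appears in one document (hypothetical record)."""
    filename: str
    anchor_text: str
    caption: str
    document_title: str


def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def assign_labels(occurrence):
    """Collect labels from the filename, anchor text, caption, and document title.

    Assumption: every word of the surrounding text becomes a label, so a file
    extension such as "jpg" is kept as a label too.
    """
    labels = set()
    for text in (occurrence.filename, occurrence.anchor_text,
                 occurrence.caption, occurrence.document_title):
        labels.update(words(text))
    return labels


def search_images(query, occurrences):
    """Return the occurrences whose labels include every word of the query."""
    terms = set(words(query))
    if not terms:
        return []
    return [occ for occ in occurrences if terms <= assign_labels(occ)]


occurrence = ImageOccurrence(
    filename="red_cars.jpg",
    anchor_text="see photo",
    caption="Two vehicles",
    document_title="Auto show",
)
matches = search_images("cars", [occurrence])
```

Here the label "cars" comes only from the filename, so a query on "cars" matches the image, while a query on a word that appears in none of the four sources does not.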
Unfortunately, labels assigned using such surrounding text may be incomplete, since only a small portion of the surrounding text may be relevant to the image, and since different documents may focus on different parts of the image when describing it. For example, multiple news articles might each contain a copy of a picture of a company's headquarters, but with different captions, such as “headquarters,” “corporate campus,” “IPO,” “stock dividends,” or “earnings report.” All of these words are associated in some way with the image of the corporate headquarters, but each is associated with only one copy of the image. This can lead to less than ideal image searching, because it would be desirable to associate all of the words with each of the duplicate images.
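By way of illustration only, the shortcoming in the headquarters example may be sketched as follows. The URLs, the `image_id` field marking the copies as duplicates, and the function names are hypothetical; how duplicates would be recognized is outside the scope of this sketch, and each caption is treated as a single lowercase label for simplicity.

```python
# Hypothetical data: five news articles each carry a copy of the same picture,
# and each copy is labeled only with the caption from its own article.
copies = [
    {"url": "http://example.com/news1", "image_id": "hq", "labels": {"headquarters"}},
    {"url": "http://example.com/news2", "image_id": "hq", "labels": {"corporate campus"}},
    {"url": "http://example.com/news3", "image_id": "hq", "labels": {"ipo"}},
    {"url": "http://example.com/news4", "image_id": "hq", "labels": {"stock dividends"}},
    {"url": "http://example.com/news5", "image_id": "hq", "labels": {"earnings report"}},
]


def search_copies(term, copies):
    """Return the URLs of the copies whose own labels include the term."""
    return [c["url"] for c in copies if term.lower() in c["labels"]]


def pooled_labels(copies):
    """Group the labels of all copies by image_id, showing the desired label set."""
    pooled = {}
    for c in copies:
        pooled.setdefault(c["image_id"], set()).update(c["labels"])
    return pooled


found = search_copies("IPO", copies)
desired = pooled_labels(copies)["hq"]
```

With per-copy labels, a query on "IPO" finds only one of the five copies, whereas the pooled set contains all five captions that would ideally be associated with every copy.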