The World Wide Web (the “Web”) provides a breadth and depth of information to users. Typically, a user accesses portions of the information by visiting a Web site. Due to the rapid growth of the Web and of the number of Web sites accessible via the Web, it is often difficult for a user looking for information about a particular topic to determine whether a Web site exists that contains such information, which Web site to go to, or what the Uniform Resource Locator (URL) is for a Web site of interest.
Because users wish to find Web sites relevant to their topics of interest, some Web sites provide search engines or other capabilities that allow users to provide one or more search terms or keywords. For example, the Web site provided by iWon, Inc., of Irvington, N.Y., USA, provides a search capability on the home page of its Web site at www.iwon.com. Besides searching for text, users also search for images on Web sites. Once a user enters one or more image search terms or keywords, the search engine provides search results based on the search terms or keywords. Such search results include a set of one or more images from Web sites corresponding to the search terms or keywords. Typically, the search engine provides a set of image thumbnails that users can select to see larger versions of the images, as well as to connect to the Web pages on which the images are located.
When a user searches for an image, a search engine typically displays an image search result containing multiple duplicate or near-duplicate images. Duplicates and near duplicates of images abound on the Web because users often copy and paste popular images, e.g., the Mona Lisa, from one Web site to another. Users may also scan in and place images, such as music album covers, on Web sites. Further, the same image can be found on one or more Web sites in multiple formats, such as raster image formats (RIFs), Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG), and so on. Because multiple duplicates or near duplicates of any given image exist on the Web, when a user uses a search engine to search for the given image, the duplicates or near duplicates appear in the search result display.
The abundance of duplicate and near duplicate images in a search engine result list is problematic in that it can be frustrating for a user looking for images. For instance, the user may have to click through several pages of redundant image search results displayed by the search engine before finding the image the user was looking for. The search engine also requires tremendous resources, such as processing power and storage, to store and search through the large number of redundant images.
Some techniques exist for finding exact replicas of images in an image search result. These techniques typically use a Message Digest 5 (MD5) hashing technique to determine whether two images are exact binary equals of each other. These techniques are flawed in that even a small change to an image produces a different hash value, so that two very similar, albeit not identical, images are both presented in a set of image search results. For instance, two images may be near duplicates when they differ in size, color, chroma channels, luminance, background, texture, or storage format, or when one is a cropped version of the other, an edited version of the other, or has some text superimposed on it. More generally, two images may be near duplicates when one is derived from the other through one or more transformations.
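By way of illustration only, the limitation of hash-based exact matching can be sketched as follows. The function names and the byte strings standing in for image files are hypothetical and are not taken from any particular search engine; the sketch assumes only Python's standard hashlib module.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the MD5 digest of raw file bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def is_exact_duplicate(a: bytes, b: bytes) -> bool:
    """True only when the two byte strings hash to the same MD5 digest."""
    return md5_digest(a) == md5_digest(b)


# Hypothetical stand-ins for the bytes of three image files.
original = bytes([10, 20, 30, 40] * 16)
exact_copy = bytes(original)

tweaked = bytearray(original)
tweaked[0] ^= 1  # flip a single bit in the first byte
near_duplicate = bytes(tweaked)

print(is_exact_duplicate(original, exact_copy))
print(is_exact_duplicate(original, near_duplicate))
```

In this sketch the first comparison reports a match and the second does not, even though the two inputs differ by a single bit. A hash comparison of this kind gives no indication of how close two images are.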
Another method of determining similarity in images is to compare the images pixel by pixel. However, this method is also very limited in its use. For instance, the method is useless when comparing copies of an image stored in different storage formats. Using different storage formats not only yields different file formats, but also results in changes in the pixels themselves. Most popular formats perform a lossy (destructive) compression that alters the content of the picture, such that the decompressed picture differs pixel by pixel from the original one. For instance, GIF reduces the number of colors in the image to at most 256, while JPEG alters the content itself and introduces artifacts that, although hardly visible to the eye, alter the pixel content of the original uncompressed picture. Thus, a pixel-by-pixel comparison would fail to identify such images as similar.
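The failure of a strict pixel-by-pixel comparison can likewise be sketched, again for illustration only. Here an image is modeled as a hypothetical list of rows of (red, green, blue) tuples, and the quantize function is an assumed, simplified stand-in for the color reduction a limited-palette format imposes; it is not an implementation of GIF or JPEG encoding.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def pixels_equal(a: Image, b: Image) -> bool:
    """Strict pixel-by-pixel equality: same dimensions and same values."""
    return a == b


def count_differing_pixels(a: Image, b: Image) -> int:
    """Count positions whose pixel values differ (sizes must match)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images have different dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def quantize(image: Image, step: int = 51) -> Image:
    """Snap each channel to the nearest multiple of `step`.

    With step=51 each channel takes one of 6 values, so the result uses
    at most 216 colors: a crude stand-in for a limited color palette.
    """
    def snap(c: int) -> int:
        return min(255, ((c + step // 2) // step) * step)

    return [[(snap(r), snap(g), snap(b)) for r, g, b in row] for row in image]


# Hypothetical 16x16 gradient image and two derived versions of it.
original = [[(x * 16, y * 16, 128) for x in range(16)] for y in range(16)]
same = [row[:] for row in original]
palette_version = quantize(original)

print(pixels_equal(original, same))
print(pixels_equal(original, palette_version))
print(count_differing_pixels(original, palette_version))
```

In this sketch the untouched copy compares equal, while the palette-reduced version does not, although it depicts the same gradient. Counting differing pixels shows how many positions changed, but says nothing about whether a viewer would consider the two images the same.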
Detecting whether images are near duplicates is very difficult, particularly in large collections of documents, such as the Web. Thus, despite the state of the art in Web sites and image search engines, there remains a need for a method and apparatus for determining similarity in images for a large-scale image search.