The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Near duplicate images are images that appear identical to the human eye but that do not have identical data representations. Various data processing techniques, such as scaling, downsampling, clipping, rotating, and color processing, can generate near duplicate images. For example, an original image may be copied and the copy modified by performing color processing on the copy. The modified copy of the image may appear visually identical to the original image, but have a different data representation because of the color processing applied to the copy.
Various issues have arisen relating to near duplicate images on the Internet. In the context of Internet searching, it is not uncommon for the results of a search to include near duplicate images. One reason for this is that most search engines identify matching images based upon keyword matching. That is, keywords contained in a search query are compared to keywords associated with images. An image having an associated keyword that matches a keyword contained in a query is determined to be a match for that query and is included in the search results. When an image is copied and the copy modified to create a near duplicate image, the near duplicate image may have the same associated keywords as the original image. In this situation, both the original image and the modified near duplicate image are included in the search results. From the perspective of both the search engine host and end users, it is desirable to exclude near duplicate images from search results.
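The keyword matching described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions: the index contents, the file names, and the function keyword_search are hypothetical and are not taken from any particular search engine. The sketch shows how an original image and a near duplicate that carry the same associated keywords are both returned for the same query.

```python
# Hypothetical index mapping each image to its associated keywords.
# The second entry stands for a color-adjusted copy of the first, which
# kept the same keywords as the original (an assumption for illustration).
IMAGE_KEYWORDS = {
    "sunset_original.jpg": {"sunset", "beach", "ocean"},
    "sunset_color_adjusted.jpg": {"sunset", "beach", "ocean"},
    "city_skyline.jpg": {"city", "skyline", "night"},
}


def keyword_search(query, index):
    """Return the IDs of images having at least one keyword found in the query."""
    terms = set(query.lower().split())
    # An image matches when its keyword set shares any term with the query.
    return sorted(image_id for image_id, keywords in index.items() if terms & keywords)


# Both the original and its near duplicate match the same query.
print(keyword_search("beach sunset", IMAGE_KEYWORDS))
```

Because matching considers only keywords, nothing in this sketch can distinguish the original image from its modified copy.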
Although approaches exist and have been employed to detect duplicate images, using these approaches to detect near duplicate images has proven to be ineffective. One such approach involves comparing pixel information at fixed locations in two images. While this approach may be useful in detecting exact duplicate images, it has significant limitations when used to detect near duplicate images. The approach is effective only in narrow cases, such as when a copy of an image is cropped and none of the pixels being compared lie in the portion that has been removed. In this situation, the comparison of pixel information would correctly identify the images as near duplicate images. On the other hand, this approach is not useful when the modifications include slight changes in color, scaling, or rotation. When a copy of an image is modified in this manner, a comparison of pixel information would indicate that the original image and the modified copy are not near duplicate images. In a search engine application, this would result in both the original and the modified copy being included in search results as different images, even though the modified copy is a near duplicate of the original image because it appears visually identical to the original image.
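The behavior of the fixed-location comparison described above can be illustrated with a minimal Python sketch. The sketch makes several assumptions for illustration: images are modeled as small grids of grayscale values, and the function pixels_match_at and the sample locations are hypothetical. The cropped copy removes only the last row and column, so the sampled locations remain in the retained portion and keep their coordinates.

```python
def pixels_match_at(image_a, image_b, locations):
    """Return True if both images hold identical values at every (row, col) location."""
    for row, col in locations:
        try:
            if image_a[row][col] != image_b[row][col]:
                return False
        except IndexError:
            # The location falls outside one of the images.
            return False
    return True


# Hypothetical 4x4 grayscale image, one integer per pixel.
original = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]

# Cropped copy: last row and last column removed, remaining pixels unchanged.
cropped = [row[:3] for row in original[:3]]

# Color-processed copy: every pixel brightened by one level, a change
# assumed here to be imperceptible to a viewer.
brightened = [[min(255, value + 1) for value in row] for row in original]

# Fixed locations chosen for comparison (an assumption for illustration).
SAMPLE_LOCATIONS = [(0, 0), (1, 1), (2, 2)]

print(pixels_match_at(original, cropped, SAMPLE_LOCATIONS))     # sampled pixels survive the crop
print(pixels_match_at(original, brightened, SAMPLE_LOCATIONS))  # slight color change breaks the match
```

In this sketch the cropped copy is reported as matching the original, while the slightly brightened copy is reported as different, which mirrors the limitation described above.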
Based upon the foregoing, an approach for detecting near duplicate images that does not suffer from limitations of prior approaches is highly desirable.