Large collections of documents typically include many documents that are identical or nearly identical to one another. Determining whether two digitally-encoded documents are bit-for-bit identical is easy (using hashing techniques, for example). Quickly identifying documents that are roughly or effectively identical, however, is a more challenging and, in many contexts, a more useful task.
The World Wide Web is an extremely large set of documents. The Web has grown exponentially since its birth, and Web indexes currently include approximately five billion web pages (the static Web being estimated at twenty billion pages), a significant portion of which are duplicates and near-duplicates. Applications such as web crawlers and search engines benefit from the capacity to detect near-duplicates. For example, it may be desirable to have such applications ignore most duplicates and near-duplicates, or to filter the results of a query so that similar documents are grouped together.