Large collections of documents typically include many documents that are identical or nearly identical to one another. Determining whether two digitally encoded documents are bit-for-bit identical is straightforward, for example by comparing hash values of their contents. Quickly identifying documents that are roughly or effectively identical, however, is a more challenging and, in many contexts, more useful task.
For example, the World Wide Web is an extremely large set of documents, and has grown exponentially since its birth. Indexed Web corpora currently include approximately five billion to 120 billion web pages, a significant portion (roughly a third, in most surveys) of which are duplicates and near-duplicates. Applications such as web crawlers and search engines benefit from the capacity to efficiently detect, and often suppress, many near-duplicates.
One method for detecting duplicate or near-duplicate documents uses sketches. A sketch is a compact approximation of a document that may be made up of samples of the document. The Jaccard similarity, or the weighted Jaccard similarity, of two documents may be estimated by comparing the sketches of the documents position by position for equality. If the elements of a sketch are unbiased, or only slightly biased, similarity estimators, each element matches the corresponding element of the other sketch with probability related to the Jaccard value, so the fraction of matching positions serves as an estimate of the similarity. While comparing documents using sketches is fast, current methods for generating accurate and reliable sketches are computationally expensive.
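As an illustration of the position-by-position comparison described above, the following is a minimal sketch using MinHash, one well-known sketching scheme for unweighted Jaccard similarity. It is not taken from this document and is not necessarily the method it goes on to describe. The function names, the three-word shingle size, the use of seeded BLAKE2b as a stand-in for a family of independent hash functions, and the sketch length are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, item):
    """Hypothetical helper: 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_hashes=256):
    """Build a sketch: for each seed, keep the minimum hash over all items."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Compare sketches position by position; the match rate estimates Jaccard."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the noisy river bank"

set_a, set_b = shingles(doc_a), shingles(doc_b)
estimate = estimate_jaccard(minhash_sketch(set_a), minhash_sketch(set_b))
print(f"exact={exact_jaccard(set_a, set_b):.3f} estimate={estimate:.3f}")
```

Under this scheme each sketch position matches with probability equal to the Jaccard similarity of the two shingle sets, so a longer sketch gives a tighter estimate. The cost noted above is also visible here: building a sketch evaluates every hash function on every shingle, whereas comparing two finished sketches takes only one pass over their positions.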