Information retrieval systems, such as search engines, run queries against an index of documents generated from a document corpus (e.g., the World Wide Web). The document corpus may contain groups of documents whose members have similar content. For example, webpages from the same domain may have much text in common and/or use the same HTML code for their formatting. As another example, the document corpus may have documents that are exactly or almost the same with respect to content and may differ only in their timestamps and Uniform Resource Locators (URLs). Eliminating these duplicates or near-duplicates can help conserve storage space.
A typical strategy regarding duplicates or near-duplicates is to eliminate all but one copy of the duplicates or near-duplicates. Alternatively, one of the duplicates or near-duplicates is identified as the representative or canonical instance of the document, and only that one copy of the document is indexed. As a result, the other copies or versions of the document are not accessible via the index. While these strategies help conserve storage space, they also have some drawbacks, particularly in the context of a webpage retrieval system. First, if the duplicates all have different URLs, then elimination of the duplicates may hinder retrieval of the stored copy when the requested URL corresponds to an eliminated duplicate. Second, these strategies make the retrieval system susceptible to page hijacking. Third, these strategies are difficult to apply in practice to near-duplicates because of the difficulty in finding the optimal threshold degree of duplication for a document to be eliminated.
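The first and last drawbacks can be illustrated with a minimal sketch. It assumes, purely for illustration, that near-duplicates are detected by Jaccard similarity over word 3-shingles and that the first document seen in each group is kept; the function names, the example URLs, and the threshold values are hypothetical and are not taken from any particular retrieval system.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (assumed similarity feature)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_index(corpus, threshold):
    """Index only the first document of each duplicate or near-duplicate group.

    corpus is a list of (url, text) pairs. A document is dropped when its
    similarity to an already kept document meets or exceeds the threshold.
    """
    index = {}  # url -> text, for kept documents only
    kept = []   # (url, shingle set) of kept documents
    for url, text in corpus:
        doc_shingles = shingles(text)
        if any(jaccard(doc_shingles, s) >= threshold for _, s in kept):
            continue  # eliminated: this URL never reaches the index
        kept.append((url, doc_shingles))
        index[url] = text
    return index


corpus = [
    ("http://a.example/page", "the quick brown fox jumps over the lazy dog"),
    ("http://b.example/copy", "the quick brown fox jumps over the lazy dog"),
    ("http://c.example/near", "the quick brown fox jumps over the lazy cat"),
]

strict = build_index(corpus, threshold=0.8)
loose = build_index(corpus, threshold=0.7)
```

In this sketch a lookup of `http://b.example/copy` fails in both indexes, even though an identical copy is stored under another URL. The near-duplicate at `http://c.example/near` is kept at a threshold of 0.8 and eliminated at 0.7, which shows how sensitive the outcome is to the chosen threshold.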