In various applications, a need exists to identify documents that are duplicates or near-duplicates of one or more other documents in the same document set. In the document review process commonly associated with litigation, for example, identifying and removing duplicate documents can save many hours of attorney time. Techniques that identify and remove exact duplicates within a document database are well known. More recently, techniques for identifying “near-duplicate” documents have been developed. “Near-duplicate” documents are documents that are not (or may not be) exact copies, but have, at most, relatively small differences. Grouping documents that are near-duplicates can help a human reviewer (e.g., an attorney performing document review) to more quickly digest the mass of information that may be present in a large collection of documents. For instance, a large number of documents (e.g., documents held in a company's SharePoint repository) may be reviewed more efficiently in one sitting, with the reviewer skimming over or ignoring portions that he or she has just reviewed in connection with another document in the same document group.
Near-duplicate grouping may be performed based on the conceptual relatedness of two documents (“conceptual near-duplicate” grouping, e.g., using clustering algorithms), or based on the similarity of the words and word order within two documents (“textual near-duplicate” grouping). In at least some scenarios, textual near-duplicate grouping may be faster than conceptual near-duplicate grouping, and/or may provide more predictable results. Moreover, textual near-duplicate grouping generally produces groups of documents that look very much alike textually, making it simpler to see their relative similarity. To determine the level of similarity between two documents with 100% confidence, a word-by-word comparison generally must be performed. Unfortunately, even if many documents can initially be ruled out based on a required similarity threshold and various document characteristics (e.g., document word counts), the word-by-word comparisons for those documents that cannot be ruled out may require unacceptably large amounts of time and/or processing resources. Moreover, while various heuristic techniques, such as determining how many words two documents have in common, have been used to identify textual near-duplicates more quickly, such techniques can be subject to a relatively high rate of false positives. To address these problems, processing techniques capable of identifying textual near-duplicates more accurately and/or more quickly are needed.
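By way of illustration only, the trade-off described above may be sketched in Python as follows. The sketch is not drawn from any particular prior technique; the function names, the whitespace tokenizer, the Jaccard-style shared-word score, and the use of the standard library's `difflib.SequenceMatcher` as the word-by-word comparison are all assumptions made for the example. It shows a word-count pre-filter that can rule out a pair without comparing it, and a shared-word heuristic that scores two documents as identical even though their word order differs completely.

```python
from difflib import SequenceMatcher


def words(text):
    """Split text into lowercase words (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


def max_possible_ratio(words_a, words_b):
    """Upper bound on the word-by-word ratio, derived from word counts alone.

    SequenceMatcher.ratio() is 2*M / (len(a) + len(b)), and the number of
    matched words M can never exceed the length of the shorter document.
    """
    total = len(words_a) + len(words_b)
    if total == 0:
        return 1.0
    return 2 * min(len(words_a), len(words_b)) / total


def shared_word_score(words_a, words_b):
    """Heuristic: fraction of distinct words the documents share (order ignored)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def word_sequence_ratio(words_a, words_b):
    """Word-by-word comparison that takes word order into account."""
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


# Hypothetical documents: doc_b contains the words of doc_a in reverse order.
doc_a = words("payment is due before delivery and not after inspection")
doc_b = list(reversed(doc_a))
doc_c = words("payment is due")

threshold = 0.8  # assumed similarity threshold for the example

# The word-count bound rules out (doc_a, doc_c) without a full comparison.
print("bound a/c:", max_possible_ratio(doc_a, doc_c))

# The shared-word heuristic cannot tell doc_a and doc_b apart ...
print("shared-word score a/b:", shared_word_score(doc_a, doc_b))

# ... but the word-by-word comparison shows they are far from near-duplicates.
print("word-by-word ratio a/b:", word_sequence_ratio(doc_a, doc_b))
```

In this sketch the pre-filter is inexpensive because it uses only word counts, whereas the word-by-word comparison must examine the full text of every pair that survives the filter, which is the cost the preceding paragraph identifies.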