Some enterprises desire, or can be required, to manage all digital information, including e-mails, such that it is not prematurely destroyed or lost and is easily discoverable (for example, to provide the correct information to a court or to outside companies in response to a lawsuit). By way of example, electronic discovery (e-discovery) techniques in the legal community can concern the detailed analysis of e-mails gathered in response to a lawsuit. Existing approaches, however, rely primarily on manual processes, which are labor-intensive and costly.
Existing approaches can also include a brute-force comparison of all sub-strings of a pair of documents. However, such an approach is computationally prohibitive, because the number of sub-strings of a document grows quadratically with its length and the comparison must be repeated for every pair of documents. Further, existing approaches have focused primarily on finding near duplicates, but not on dynamic detection of near duplicates or on an online version of the problem.
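By way of illustration only, the cost of the brute-force comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular existing system: the function names and the use of a Jaccard-style overlap as the similarity score are hypothetical choices made for the example.

```python
def all_substrings(text):
    """Return the set of every non-empty contiguous sub-string of text."""
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def brute_force_overlap(doc_a, doc_b):
    """Hypothetical similarity score: Jaccard overlap of the two sub-string sets."""
    subs_a = all_substrings(doc_a)
    subs_b = all_substrings(doc_b)
    if not subs_a and not subs_b:
        return 1.0  # two empty documents are treated as identical
    return len(subs_a & subs_b) / len(subs_a | subs_b)


def substring_count(n):
    """Number of (start, end) sub-string positions in a text of length n."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    print(brute_force_overlap("dear team", "dear teams"))
    print(substring_count(10_000))
```

Under these assumptions, a single 10,000-character e-mail already has 50,005,000 sub-string positions, and the enumeration must be repeated for each pair of documents, which is why the approach does not scale to a large collection.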