In litigation proceedings, as well as for other functions, very large numbers of documents must often be reviewed. Certain organizational methods for arranging documents exist in the art. Emails are a particular type of document that is useful to review in an organized structure, which helps make sense of the proceedings and reduces the number of documents that need to be read.
The need to detect near-duplicate documents arises in many applications, litigation being a typical example. In litigation, one of the parties often initiates discovery proceedings, which force the rival party to reveal all the documents at its disposal that pertain to the legal dispute. To meet the provisions of the discovery procedure, the disclosing party hands over large volumes of documents, sometimes in order to duly meet the full disclosure stipulations and in other cases as a tactical measure to flood the other party with documents, thereby causing the receiving party to incur considerable legal expenses in the tedious task of determining which documents are relevant to the dispute under consideration. In many cases, many of the disclosed documents are similar to one another. Preliminary knowledge that groups and/or flags documents that are similar to one another would streamline the screening process since, for example, if a certain document is classified as irrelevant, then all the documents similar to it are probably irrelevant as well. There are numerous other applications for determining near-duplicate documents, sometimes from among a very large archive of documents (possibly on the order of millions of documents or more).
Emails are a common type of document examined in litigation procedures. When emails are collected from the accounts of various users in a company, there is likely to be a degree of duplication across users. Duplication may occur because the same email is sent to a number of recipients at once, or for other reasons. In addition, emails are frequently near duplicates of one another.
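The distinction drawn above between exact duplicates and near duplicates, and the idea of grouping similar documents so that one relevance decision can cover a whole group, can be illustrated with a minimal sketch. The sketch is not taken from the text: it assumes word-level shingling with Jaccard similarity as the similarity measure, and the function names, the shingle size `k=3`, and the `threshold=0.5` value are all hypothetical choices made for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_key(text):
    """Hash of the normalized text; equal keys indicate exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences (assumed similarity features)."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_similar(docs, threshold=0.5):
    """Greedy grouping: a document joins the first group whose
    representative (the group's first document) is similar enough."""
    groups = []  # each entry: (representative shingles, member ids)
    for doc_id, text in docs.items():
        s = shingles(text)
        for rep, members in groups:
            if jaccard(rep, s) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing group matched, so this document starts a new one.
            groups.append((s, [doc_id]))
    return [members for _, members in groups]
```

Under these assumptions, a reviewer who marks one member of a group as irrelevant could treat the remaining members as candidates for the same label. Comparing every document against every group representative does not scale to archives of millions of documents, so this is only an illustration of the grouping idea, not a practical method.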