The need to detect near-duplicate documents arises in many applications. A typical, yet not exclusive, example is litigation proceedings. In such proceedings, one or both of the rival parties initiate discovery proceedings, which force the opposing party to reveal all the documents at its disposal that pertain to the legal dispute.
In order to meet the provisions of the discovery procedure, the disclosing party hands over piles of documents, sometimes in order to duly meet the full disclosure stipulations, or in certain other cases, as a tactical measure to flood the other party with a large number of documents, thereby causing the receiving party to incur considerable legal expenses in the tedious task of determining which documents are relevant to the dispute under consideration. In many cases, many of the disclosed documents are similar to each other. Preliminary knowledge that groups and/or flags documents that are similar to one another would streamline the screening process, since, for example, if a certain document is classified as irrelevant, then probably all the documents that are similar thereto can also be deemed irrelevant. There are numerous other applications for determining near-duplicate documents, sometimes from among a very large archive of documents (possibly on the order of, e.g., millions of documents or more).