The need to detect near duplicate documents arises in many applications. A typical, yet not exclusive, example is litigation, where one or both of the rival parties initiate discovery proceedings that force the other party to reveal all the documents at its disposal that pertain to the legal dispute.
To meet the requirements of the discovery procedure, the disclosing party submits piles of documents, sometimes in order to duly meet the full disclosure stipulations and, in certain other cases, as a tactical measure to flood the other party with a large number of documents, thereby causing the receiving party to incur considerable legal expenses in the tedious task of determining which documents are relevant to the dispute under consideration. In many cases, many of the disclosed documents are similar to each other. Preliminary knowledge that groups and/or flags documents that are similar to one another would streamline the screening process since, for example, if a certain document is classified as irrelevant, then all the documents that are similar thereto can probably also be deemed irrelevant. There are numerous other applications for determining near duplicate documents, sometimes from among a very large archive of documents (possibly on the order of, e.g., millions of documents or more).
As is well known, there exist documents which are contaminated with “noise”, i.e. errors introduced into the original documents. Examples of “noise” are errors introduced by translation tools such as Optical Character Recognition (OCR) software, errors introduced due to network transmission problems, errors introduced due to damaged storage media, etc.
Turning, for example, to translation tools such as OCR software, these are error prone, with the error rate depending on the quality of the input, the language, the quality of the OCR software, and so forth.
Noisy documents are, for instance, text files that were generated by digitizing or scanning paper (or similar media) and then using translation tools (such as Optical Character Recognition, OCR) to translate the digitized image into text files (also referred to as translated documents).
In the context of the invention, OCR includes computer software (and/or an implementation in another form) designed to translate images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text, or to translate pictures of characters into a standard encoding scheme representing them (e.g. ASCII or Unicode).
Focusing, for example, on translated documents, finding near duplicates among them is more difficult because the OCR errors add “noise”. There is a need to discern between changes that were introduced by noise (say, OCR errors), which should be ignored, and true differences between the documents, which should be taken into account in order to determine to what extent the documents are identical.
In accordance with another example, due to translation errors, true duplicate documents (i.e. two identical source documents) are, in many cases, not likely to remain identical after digitizing and OCR are applied thereto.
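The effect described above can be illustrated with a minimal Python sketch. It is not the method of the invention: the sample strings, the function names `fingerprint`, `similarity` and `is_near_duplicate`, and the 0.9 threshold are all hypothetical, and the standard-library `difflib.SequenceMatcher` is used merely as a stand-in similarity measure. The sketch takes two simulated OCR outputs of the same source sentence, each with different character-level errors, and shows that an exact comparison (here, a SHA-256 digest) treats them as different documents, while a character-level similarity score remains high.

```python
import hashlib
from difflib import SequenceMatcher

# Hypothetical OCR outputs of the same source sentence:
# "The parties agree to settle the dispute under the terms set out below."
# Each scan carries different simulated OCR errors (assumed, not real OCR output).
SCAN_A = "The parties agree to settle the dlspute under the terms set out be1ow."
SCAN_B = "The part1es agree to settIe the dispute under the terrns set out below."


def fingerprint(text: str) -> str:
    """Exact-match fingerprint: any single-character change alters the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], using difflib as a stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag two texts as near duplicates; the threshold is an assumed value."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    print("exact duplicates:", fingerprint(SCAN_A) == fingerprint(SCAN_B))
    print("similarity score:", round(similarity(SCAN_A, SCAN_B), 3))
    print("near duplicates: ", is_near_duplicate(SCAN_A, SCAN_B))
```

A character-level score of this kind cannot by itself tell whether a small change is OCR noise or a true difference between two documents, which is the distinction the preceding paragraphs call for.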
There is thus a need in the art to provide a tool for identifying near duplicate documents among noisy documents.
There is a further need in the art to provide a tool for identifying true duplicate documents among noisy documents.