This application is directed to document content detection, and more particularly to finding duplicate and near-duplicate document content in a large collection of documents. In addition, this application provides a method for automatically highlighting the differences and/or similarities in a manner that is visually compelling and easy for a human observer to see.
One area where the present concepts can be applied is in the review of a large number of documents, such as in the context of litigation discovery. A typical litigation case may involve millions of documents containing multiple copies and versions of revised document content. The litigation documents may include electronic and scanned hardcopy versions obtained from multiple sources and computers. Scanned hardcopy documents may additionally contain handwritten comments and annotations that may be relevant to the particular litigation case. For example, a person's initials or a handwritten margin note may serve as an indication that he or she has read the document and thus was aware of its content at the time. In a typical litigation case, a limited period of time is allocated to the legal discovery team to sift through the millions of documents and find the key documents containing information relevant to the case at hand.
Part of the problem in such a review is that the documents are not typically organized in a manner that facilitates the search for relevant information. Duplicate and near-duplicate documents of similar content may be interspersed with many other unrelated documents. Since the sought-after information may contain handwritten text and/or annotations for which OCR (Optical Character Recognition) is unreliable, the large quantity of documents must be manually inspected by a legal discovery team, which is a costly, time-consuming, and error-prone process.