1. Field of the Invention
The present invention generally relates to information management systems and methods, and more particularly to a method and system for detecting duplicates and near-duplicates in electronic documents and/or content.
2. Discussion of the Background
Email collections, electronically stored documents, and the like, can include duplicate and near-duplicate messages and documents. These collections can be found on the Internet, in corporate Intranets, in other networks, stand-alone systems and also on off-line stored information carriers, such as CD-ROM, DVD, Write Once Read Many (WORM), Backup Tape, etc. For example, duplicates and near-duplicates can form 50% or more of the size of a collection.
These duplicates and near-duplicates are created for many reasons, such as the creation of slightly different versions of a document, different formats of a document (e.g., creation of a PDF from a Word file, a text, HTML or RTF version of a document or email, an EML version of an MSG, etc.), forwarded or copied and blind-copied emails, backups (e.g., tape, CD-ROM, DVD, Internet backups, Application Service Provider (ASP) backups, hosted archives, software service provider backups, etc.), and copies to different devices (e.g., other computers, hand-held and other mobile devices, PDAs, etc.). Although such documents may differ in their binary format, in their file properties (e.g., file name, file creation date, file access date, file modification date, file size, file access properties, etc.), and in their document properties (e.g., title, author, date, routing, receiving time, category, custom properties, etc., which sometimes number over 100 for certain electronic objects), the actual textual content of such objects is often the same or nearly the same.
Accordingly, such duplicates and near-duplicates create huge problems in applications where large volumes of electronic data have to be searched and reviewed by humans, such as during electronic discovery (e-discovery), law enforcement activities, fraud investigations, security activities, intelligence activities, due diligence, mergers and acquisitions, business intelligence activities, historical research, contract management, project management, human resource management, and the like. For example, when there are a large number of duplicate or near-duplicate documents, it takes longer to find the latest version of a given document. In addition, there is a significant risk that an old version will be found and used. Further, translating duplicate and near-duplicate documents can be very expensive and time consuming. For example, if 50% of the documents are exact or near-duplicates, then the human review of such documents (e.g., often done by specialized lawyers, scarce investigators or intelligence analysts, etc.) may not only cost twice as much, but will also cause undesirable delays, the missing of deadlines, and the like, which can oftentimes break a deal. Therefore, exact and near-duplicates must be removed, or at least be detected and optionally moved to the background, in order to increase document processing efficiency.
Further, the removal of exact-duplicates and near-duplicates reduces storage requirements and the resources needed to build indexes, run text-analytics (e.g., concept extraction, text-mining, optical character recognition, machine translation, speech recognition, document property extraction, file property extraction, language recognition, etc.), and process such documents.
Detecting exact-duplicates can be done reliably by using so-called hashing techniques. Such techniques can employ a combination of the document textual content, and/or properties or binary content, that is hashed with an MD5, SHA-1 or other suitable hashing algorithm. If two documents are exactly the same, or if they have exactly the same document properties, then the resulting hash values also must be exactly the same. Conversely, a difference of one character or even one bit in a given document will produce a very different hash value. In addition, a nearly identical hash value does not guarantee that two documents are similar. In fact, this often means that the documents actually are completely different. Therefore, hashing cannot be used reliably to identify near-duplicates.
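By way of illustration only, the behavior described above can be sketched in Python using the standard `hashlib` module. The helper names `content_hash` and `differing_hex_digits`, and the sample strings, are hypothetical and are not part of any described embodiment; SHA-1 is chosen here merely as one of the hashing algorithms named above.

```python
import hashlib


def content_hash(text: str, algorithm: str = "sha1") -> str:
    """Return the hex digest of the UTF-8 encoded text (hypothetical helper)."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(x != y for x, y in zip(a, b))


original = "The quick brown fox jumps over the lazy dog"
exact_copy = "The quick brown fox jumps over the lazy dog"
near_copy = "The quick brown fox jumps over the lazy dog."  # one extra character

h_original = content_hash(original)
h_exact = content_hash(exact_copy)
h_near = content_hash(near_copy)

# Identical content always yields an identical hash value.
print("exact duplicate detected:", h_original == h_exact)

# A one-character change yields an unrelated hash value, so comparing
# digests says nothing about how similar the two texts are.
print("near-duplicate detected: ", h_original == h_near)
print("hex digits that differ:  ", differing_hex_digits(h_original, h_near), "of", len(h_original))
```

In this sketch the exact copy is matched by its hash value, whereas the near-duplicate is not, and the number of differing digest characters gives no indication that the two texts differ by only one character.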
There are a number of algorithms and methods that do allow near-duplicate detection. Typically, such techniques are based on a comparison of a document, or a sample of a document, with all other documents. Such algorithms can be based on clustering techniques and typically are non-linear in both time and space, which means that if there are N documents, (N×N) calculations and an (N×N) memory will be required. For example, assume that 100 documents will take 10,000 calculation cycles to de-duplicate. For the next 100 documents, one needs 30,000 extra calculations, and the next 100 documents will require 50,000 more cycles, etc. For example, if there are 100,000 documents, the de-duplication of the last 100 documents alone can take 19,990,000 cycles, and the collection as a whole can take 10,000,000,000 cycles. Since email and hard disk collections can include many millions of documents, such de-duplication processing is computationally unacceptable.
Accordingly, there is a need for a near-duplicate detection algorithm that is linear in time and space. This means that, given N documents, on the order of N calculations and an N-sized memory can be employed. In this case, a collection of 100,000 documents will require only on the order of 100,000 calculations in total to de-duplicate. In addition, many conventional algorithms only support English and do not support other languages, require significant training, are not accurate enough, and do not allow for easily understandable user control of the outcome, for example, via the setting of precision and recall values, and measures of similarity.
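The difference between the (N×N) cost model and the linear cost model discussed above can be illustrated with a short Python calculation. This is a simplified sketch that assumes exactly N×N cycles for pairwise comparison and exactly N cycles for a linear method; the function names are hypothetical, and real algorithms would carry constant factors that are ignored here.

```python
def pairwise_cost(n: int) -> int:
    """Assumed cost model: comparing every document with every other takes n * n cycles."""
    return n * n


def incremental_pairwise_cost(total: int, batch: int) -> int:
    """Extra cycles needed to add the last `batch` documents to reach `total` documents."""
    return pairwise_cost(total) - pairwise_cost(total - batch)


def linear_cost(n: int) -> int:
    """Assumed cost model for a linear method: one cycle per document."""
    return n


for total in (100, 200, 300, 100_000):
    print(
        f"{total:>7,} documents: "
        f"last 100 cost {incremental_pairwise_cost(total, 100):>12,} cycles pairwise, "
        f"{pairwise_cost(total):>14,} cycles pairwise in total, "
        f"{linear_cost(total):>7,} cycles linear in total"
    )
```

Under these assumptions, each additional batch of 100 documents costs more than the previous batch in the pairwise model, whereas in the linear model every batch of 100 documents costs the same 100 cycles.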
In view of the foregoing, there is a need for a system and method that allows the detecting of duplicate and near-duplicate emails (e.g., properties, email body and attachments) and electronic documents or other electronic content (e.g., referred to as objects), the tagging of such potentially duplicate and near-duplicate objects, and the automatic removal or visualization of the duplicate and near-duplicate objects when an object is presented to an end user through a computer system, and the like.