1. Technical Field
The present disclosure generally relates to the field of electronic document indexing, and more particularly, to the parallelization of document indexing in the area of electronic discovery.
2. Description of the Related Art
Document indexing is one of the earliest phases in the electronic discovery lifecycle. It aims to identify and extract all office documents, emails, archives and other unstructured documents from the collected electronic evidence pertinent to a legal case. For each item extracted, both the text contained in the item and the item's metadata must be extracted so that the item can be searched. The text is stored in a specialized text database, which facilitates fast keyword searching over very large data sets. Keyword searching, in combination with metadata-specific searches, forms the basis for filtering a very large data set into a more relevant subset that is then packaged for manual review or further analysis.
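By way of illustration only, the following minimal sketch shows how extracted text might be stored for keyword searching and combined with a metadata filter. It assumes a simple in-memory inverted index; the names build_index, search, items and metadata are hypothetical and are not taken from any particular text database product.

```python
import re
from collections import defaultdict

def build_index(items):
    """Map each lowercase token to the set of item ids whose text contains it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(item_id)
    return index

def search(index, metadata, keyword, **filters):
    """Return sorted ids of items that contain the keyword and match every metadata filter."""
    hits = index.get(keyword.lower(), set())
    return sorted(
        item_id for item_id in hits
        if all(metadata[item_id].get(key) == value for key, value in filters.items())
    )

# Hypothetical extracted text and metadata for three items.
items = {
    "doc1": "Quarterly merger report",
    "mail1": "Re: merger timeline",
    "doc2": "Lunch menu",
}
metadata = {
    "doc1": {"type": "office"},
    "mail1": {"type": "email"},
    "doc2": {"type": "office"},
}

index = build_index(items)
all_hits = search(index, metadata, "merger")
email_hits = search(index, metadata, "merger", type="email")
```

In this sketch, the keyword lookup narrows the data set to the items containing the term, and the metadata filter narrows it further, which mirrors the filtering into a more relevant subset described above.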
Document indexing of electronic data is traditionally performed on a single machine. Because electronic data is highly unstructured and hierarchical, a document indexing case could consist of a directory containing millions of office documents, a single Exchange database file containing millions of email messages, or disk images of machines under investigation. For example, a zip file can contain office documents, an email message can contain attachments, an Outlook PST file can contain email messages and a disk image can contain files of any type. Performing the indexing on a single machine presents a problem, since electronic discovery cases are growing rapidly in size and there is a fundamental limit on how fast a single machine can index data.
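The hierarchical nature of such data can be illustrated with a short sketch, offered as an example only. It assumes that zip archives are the sole container type and uses the Python standard library zipfile module; the helper names walk_items and make_zip and the sample file names are hypothetical. Other containers mentioned above, such as PST files or disk images, would require their own parsers and are not covered.

```python
import io
import zipfile

def walk_items(name, data, depth=0):
    """Yield (depth, name) for an item and, if it is a zip archive, for every item nested in it."""
    yield depth, name
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for member in archive.namelist():
                # Each member may itself be a container, so recurse into it.
                yield from walk_items(member, archive.read(member), depth + 1)

def make_zip(files):
    """Build an in-memory zip archive from a mapping of member name to bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member, content in files.items():
            archive.writestr(member, content)
    return buffer.getvalue()

# Hypothetical evidence: a zip containing a document and another zip.
inner = make_zip({"memo.txt": b"hello"})
outer = make_zip({"report.txt": b"q3", "attachments.zip": inner})

found = list(walk_items("evidence.zip", outer))
```

A single top-level item therefore expands into a tree of items whose size is not known until each container has been opened, which is one reason the total amount of work in an indexing case is difficult to predict in advance.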