1. Technical Field
The invention relates to electronic data discovery. More particularly, the invention relates to the evaluation of the processing status, i.e. the full text indexing and archive extraction status, of digital content collected for electronic data discovery purposes. Still more particularly, the invention relates to a method and apparatus for providing collection transparency information to an end user to achieve a guaranteed quality document search and production in electronic data.
2. Description of the Prior Art
Electronic discovery, also referred to as e-discovery or EDiscovery, concerns discovery in civil litigation, as well as in tax, government investigation, and criminal proceedings, that deals with information in electronic form. In this context, electronic form is the representation of information as binary numbers. Electronic information is different from paper information because of its intangible form, volume, transience, and persistence. Such information is typically stored in a content repository. Also, electronic information is usually accompanied by metadata, which is rarely present in paper information. Electronic discovery poses new challenges and opportunities for attorneys, their clients, technical advisors, and the courts, as electronic information is collected, reviewed, and produced. Electronic discovery is the subject of amendments to the Federal Rules of Civil Procedure that became effective on Dec. 1, 2006. In particular, Rules 16 and 26 are of interest to electronic discovery.
Examples of the types of data included in e-discovery include e-mail, instant messaging chats, Microsoft Office files, accounting databases, CAD/CAM files, Web sites, and any other electronically-stored information that could be relevant evidence in a lawsuit. Also included in e-discovery is raw data, which forensic investigators can review for hidden evidence. The original file format is known as the native format. Litigators may review material from e-discovery in one of several formats: printed paper, native files, or TIFF images.
Content Repository
A typical content repository, i.e. content storage, has certain problems that impair search results and that may cause problems in EDiscovery.
Uncertainty with File Indexing Status
Usually, the indexing status of a content repository is estimated in one of the following ways:
    Optimistic: The system ignores the fact that some files may not be available in the search results. Even high-end content management systems, such as Documentum (see, for example, http://software.emc.com/), use this approach.
    Pessimistic: The content is considered non-indexed, and the system does not allow the user to search it until a certain long period of time has passed after content insertion or update, to make sure that the indexing engine has enough time to index the content.
Some systems try to go beyond these two approaches by warning the user about which files are still in the indexing queue.
The optimistic approach is entirely unsafe when it comes to importing or indexing very large files. For example, in Oracle 9i it takes up to several minutes to index a very large document, and it takes several seconds to put large files into the indexing queue. This makes the optimistic approach undesirable for EDiscovery. With either approach, a failure to index files causes incorrect search results.
None of the applications on the market implements a comprehensive processing status solution that combines information on indexability, indexing status, and container extraction status, e.g. the status of opening such files as ZIP files.
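By way of illustration only, the difference between the two estimates can be sketched in Python. All names in the sketch (`Doc`, `optimistic_searchable`, `pessimistic_searchable`, `misreported`, and the `grace` period) are hypothetical and do not correspond to the interface of any product mentioned above. The sketch assumes only that each document records its insertion time and whether the indexing engine has actually indexed it.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    inserted_at: float  # seconds on an arbitrary clock (hypothetical field)
    indexed: bool       # whether the engine has actually indexed the document


def optimistic_searchable(docs, now):
    # Optimistic estimate: every document is assumed searchable at once.
    return [d.name for d in docs]


def pessimistic_searchable(docs, now, grace=600.0):
    # Pessimistic estimate: a document is assumed searchable only after
    # a fixed grace period has elapsed since its insertion.
    return [d.name for d in docs if now - d.inserted_at >= grace]


def misreported(docs, assumed):
    # Documents an estimate treats as searchable although they are not indexed.
    return [d.name for d in docs if d.name in assumed and not d.indexed]


docs = [
    Doc("memo.txt", inserted_at=0.0, indexed=True),
    Doc("huge.pdf", inserted_at=900.0, indexed=False),   # still in the queue
    Doc("corrupt.doc", inserted_at=0.0, indexed=False),  # indexing failed
]
now = 1000.0

print(misreported(docs, optimistic_searchable(docs, now)))
print(misreported(docs, pessimistic_searchable(docs, now)))
```

In this sketch the optimistic estimate misreports both the queued file and the failed file, while the pessimistic estimate still misreports the failed file, because waiting does not help when indexing never succeeds.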
An EDiscovery Management Application (EMA) is a content management system responsible for managing collections and holds, which communicates collection and hold requests to data sources, and which collects content from data sources (see related U.S. patent application Ser. No. 11/963,383, filed on Dec. 21, 2007, the entirety of which is incorporated herein by this reference thereto). Some files collected into an EMA content repository during the EDiscovery process must undergo full text indexing to allow their contents to become searchable by the end user. However, the following limitations, which are inherent in any indexing process, should be noted:
    It takes time to perform indexing. During this time, files that have not been indexed yet cannot be found through a full text search.
    Indexing may fail for some files. As a result, the user may not be able to find these files through a full text search.
    Some files may not be indexable because they do not contain text information or because the indexing engine is unable to index these files.
Uncertainty with Container Extraction Status
Extracting files from container files, such as ZIP, CAB, WAR, RAR, and EAR file archives, PST and NSF email archives, email message MSG files, and others, collected into an EMA content repository during the collection process creates even more uncertainty when it comes to understanding the processing status of files in the content repository. For example, the following limitations should be noted:
    First, it takes time for an EMA to explode the container, i.e. extract files from the container into the content repository. Until the container is exploded, the files cannot be found through any type of search because the content repository does not know about their existence.
    Second, the extraction may fail for a multitude of reasons, such as an inability to extract files from a password protected or corrupt archive. A user should be able to distinguish between a container that does not contain files and that is therefore of no interest from an EDiscovery perspective, and a container that failed to explode but that may contain files that are of interest for EDiscovery.
    Finally, files inside a container might become indexable only after they have been extracted from the container. This generates additional delay in file indexing and may result in a user being unable to perform a full text search against files uploaded inside the container.
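The indexing and extraction states discussed above can be pictured, purely as an illustrative sketch, with the following Python data model. The type and function names (`IndexStatus`, `ExtractionStatus`, `CollectedFile`, `may_hide_content`, `is_empty_container`) are hypothetical and are introduced here only for illustration. The sketch assumes that a file which is not a container carries no extraction status at all.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class IndexStatus(Enum):
    PENDING = "pending"              # not indexed yet
    INDEXED = "indexed"              # findable through full text search
    FAILED = "failed"                # the indexing engine reported an error
    NOT_INDEXABLE = "not indexable"  # no text, or an unsupported format


class ExtractionStatus(Enum):
    PENDING = "pending"      # container not exploded yet
    EXTRACTED = "extracted"  # container exploded successfully
    FAILED = "failed"        # e.g. password protected or corrupt archive


@dataclass
class CollectedFile:
    name: str
    index_status: IndexStatus
    extraction_status: Optional[ExtractionStatus] = None  # None: not a container
    children: List["CollectedFile"] = field(default_factory=list)


def may_hide_content(f: CollectedFile) -> bool:
    # A container that is unexploded, or that failed to explode, may hold
    # files whose existence is unknown to the repository.
    return f.extraction_status in (ExtractionStatus.PENDING, ExtractionStatus.FAILED)


def is_empty_container(f: CollectedFile) -> bool:
    # A successfully exploded container with no files is of no further interest.
    return f.extraction_status is ExtractionStatus.EXTRACTED and not f.children


empty = CollectedFile("empty.zip", IndexStatus.NOT_INDEXABLE, ExtractionStatus.EXTRACTED)
locked = CollectedFile("locked.zip", IndexStatus.NOT_INDEXABLE, ExtractionStatus.FAILED)
memo = CollectedFile("memo.txt", IndexStatus.INDEXED)
```

Keeping the two statuses separate allows an empty container (`empty`) to be told apart from one that failed to explode (`locked`), which is the distinction the second limitation calls for.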
In EDiscovery, failing to find and produce files may result in substantial litigation risks and penalties. This is why it is very important to understand the indexing and extraction status of content collection in EDiscovery precisely. For example, the failure of a defendant to locate an email message that was saved by the plaintiff may be treated by the court as negligent misconduct or an attempt to hide evidence, and may result in heavy penalties.
Users Need to Access Processing Status Information in a User-Friendly Form
Both file indexing and container extraction status information should be available to a user performing the file search, to allow the user to understand the processing status of the collected content and to make decisions on the completeness of file search results. Also, because the overall size of the collection may be huge, processing status information must be tailored to the subset of data that the user tries to query when a search is performed. Finally, the user should know which files may contain the information specified in the query criteria even though the collection repository cannot search these files. There should therefore be a way for the user to browse and manually view files that failed to index, are not indexable, or have not been indexed yet, as well as containers that failed to explode or have not been exploded yet.
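As a minimal sketch of what status information tailored to a query subset could look like, the following Python fragment summarizes processing status only for the files that fall within the scope of a search, and lists the files in that scope that cannot be searched and would need manual review. The status labels, the `custodian` field, and the function `scoped_status` are all hypothetical and serve only to illustrate the requirement stated above.

```python
from collections import Counter

# Hypothetical labels for files that a full text search cannot cover.
UNSEARCHABLE = {
    "index pending",
    "index failed",
    "not indexable",
    "extraction pending",
    "extraction failed",
}


def scoped_status(files, in_scope):
    """Summarize processing status for only the files within the query scope."""
    subset = [f for f in files if in_scope(f)]
    counts = Counter(f["status"] for f in subset)
    # Files in scope that the search cannot cover and that need manual review.
    manual_review = [f["path"] for f in subset if f["status"] in UNSEARCHABLE]
    return counts, manual_review


files = [
    {"path": "smith/mail.pst", "custodian": "smith", "status": "extraction failed"},
    {"path": "smith/plan.doc", "custodian": "smith", "status": "indexed"},
    {"path": "smith/scan.tif", "custodian": "smith", "status": "not indexable"},
    {"path": "jones/notes.txt", "custodian": "jones", "status": "index pending"},
]

# Scope the report to one custodian (a hypothetical query criterion).
counts, manual_review = scoped_status(files, lambda f: f["custodian"] == "smith")
print(dict(counts))
print(manual_review)
```

Restricting the summary to the query scope keeps the report small for a huge collection: the pending file of the other custodian does not appear in it.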
In this context, it would be advantageous to provide collection transparency information to an end user to achieve a guaranteed quality document search and production in electronic data.