This application addresses an invention known as “Best Match First”™ that substantially improves both the efficiency and the quality of processing very large collections of data in the context of electronic discovery and related applications, including but not limited to investigation and compliance.
The volume of electronic information in both personal and corporate data stores is increasing rapidly. The data in question includes electronic mail (e-mail) messages, word-processed and text documents, database files, contact management tools, and calendars. As a result, it is safe to assume that the size of data presented in litigation or for investigation will also continue to increase substantially. This in turn has led to an ever-increasing demand to efficiently process and subsequently review data sets that are increasingly in the 10+ terabyte range. While many automated categorization and other techniques are currently used to prioritize which data should be reviewed first, none takes into account that data must be processed before it can be reviewed. With large-scale collections, therefore, a good job of prioritizing the review cannot be done unless the processing is first effectively prioritized.
In normal usage in electronic discovery, and indeed in the field of Information Retrieval generally, data is de-duplicated, text is extracted from non-ASCII formats, and then an inverted word frequency index is built, all in one uninterrupted start-to-finish multi-stage process. This is because text extraction (where there are non-ASCII documents) and the construction of an inverted index are necessary in order to make the collection of documents searchable. However, when dealing with very large amounts of data, the process may be stopped prior to indexing for purposes of prioritizing items or types of items to be processed. Less is known about a document before the indexing step than after it. Nevertheless, the indexing step alone can take many days or even weeks when the data is large enough and/or when the hardware and network resources available to process it are limited and do not permit substantial parallelization of effort. There is therefore a great benefit in making prioritization decisions based on the partial information that is available at an earlier stage in the process. Furthermore, because the method herein described is highly iterative in nature, the accuracy of these prioritization decisions will increase as data continues to be processed.
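By way of illustration only, the following minimal Python sketch shows one way the multi-stage process described above could be interrupted before indexing so that documents are indexed in priority order. It is not the claimed implementation. The document layout (`id` and `raw` fields), the function names, and the term-counting score in `cheap_score` are all assumptions introduced for this example, and text extraction is taken to have already happened.

```python
import hashlib
import heapq
from collections import defaultdict


def deduplicate(docs):
    """Keep the first document seen for each distinct content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["raw"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def cheap_score(doc, priority_terms):
    """Hypothetical pre-index score: occurrences of priority terms in the text."""
    return sum(1 for word in doc["raw"].lower().split() if word in priority_terms)


def prioritized_order(docs, priority_terms):
    """Yield documents best-scoring first, before any index exists."""
    # heapq is a min-heap, so scores are negated; the position i breaks ties
    # and guarantees that the dicts themselves are never compared.
    heap = [(-cheap_score(d, priority_terms), i, d) for i, d in enumerate(docs)]
    heapq.heapify(heap)
    while heap:
        _, _, doc = heapq.heappop(heap)
        yield doc


def build_index(ordered_docs):
    """Build an inverted word frequency index: word -> {doc id: count}."""
    index = defaultdict(dict)
    for doc in ordered_docs:
        for word in doc["raw"].lower().split():
            index[word][doc["id"]] = index[word].get(doc["id"], 0) + 1
    return index


docs = [
    {"id": "a", "raw": "lunch menu for friday"},
    {"id": "b", "raw": "merger contract draft"},
    {"id": "c", "raw": "lunch menu for friday"},  # exact duplicate of "a"
    {"id": "d", "raw": "contract"},
]
unique = deduplicate(docs)
ordered = list(prioritized_order(unique, {"merger", "contract"}))
order = [d["id"] for d in ordered]
index = build_index(ordered)
```

In this sketch the expensive step, `build_index`, consumes documents in the order chosen by the inexpensive scoring pass, so the most promising items would become searchable first. An iterative variant could recompute the scores as more data is processed.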
In one embodiment of the invention, the properties of each document, including its metadata and extracted text, are placed in a special-purpose Lightweight File System. One of the properties of this file system is that it can be very easily scanned for specific tokens. The time taken to dump the raw and extracted data into the file system, as well as to perform a targeted scan of it, represents only a small fraction of the time that it would take to index this data. The data must still be indexed, and for many applications the data is subject to many kinds of post-processing after indexing. This is because the index contains many important derived properties of documents that are necessary for many Information Retrieval applications, such as similarity assessments and complex search operators. But such sophisticated information is not necessary to make prioritization decisions of value. (Note that some variations of the method could likewise break after the indexing stage is complete, but prior to any subsequent post-processing steps such as discussion building as described in U.S. Pat. No. 7,143,091.)
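The dump-and-scan idea can be sketched as follows. The actual layout of the Lightweight File System is not specified here, so this example assumes, purely for illustration, a flat file holding one JSON record per document. The record fields (`id`, `metadata`, `text`) and the function names `dump_records` and `scan_for_tokens` are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

TOKEN_RE = re.compile(r"[a-z0-9]+")


def dump_records(records, path):
    """Write each document's metadata and extracted text as one JSON line."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")


def scan_for_tokens(path, tokens):
    """Stream the store once; return {doc id: hit count} for matching documents."""
    wanted = {t.lower() for t in tokens}
    hits = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            # Metadata values are assumed to be plain strings in this sketch.
            haystack = " ".join([rec["text"], *rec["metadata"].values()]).lower()
            count = sum(1 for tok in TOKEN_RE.findall(haystack) if tok in wanted)
            if count:
                hits[rec["id"]] = count
    return hits


records = [
    {
        "id": "msg-1",
        "metadata": {"from": "alice@example.com", "subject": "Quarterly forecast"},
        "text": "Revised forecast attached.",
    },
    {
        "id": "msg-2",
        "metadata": {"from": "bob@example.com", "subject": "Lunch"},
        "text": "Pizza on Friday?",
    },
]

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "store.jsonl"
    dump_records(records, store)
    hits = scan_for_tokens(store, ["forecast"])
```

A single sequential pass of this kind builds no index and needs only the raw metadata and extracted text, which is consistent with the observation above that a targeted scan costs far less than indexing. The hit counts could then feed whatever prioritization rule an embodiment applies.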