Today, many companies and individuals rely on software applications to conduct their daily activities. These applications include email, word processing applications, internet browsing applications, financial software applications, sales applications, and/or many other types of applications. Software is typically used by individuals to perform a variety of tasks and can involve vast amounts of data being generated, exchanged, manipulated, stored, etc. Periodically, data is subject to electronic discovery and can be requested for review, analysis, etc., such as during a governmental investigation, a lawsuit, etc.
The data is typically received by way of a data dump and can be stored in a memory location. Typically, the amounts of data that are dumped in response to requests from investigators can measure in the hundreds of terabytes and can include hundreds of millions of emails, documents, etc. Searching and/or analyzing such vast amounts of data is a highly difficult and extremely time-consuming task. As part of the analysis, the investigators may need to perform a plain search of the data using known or random keywords, determine which data is similar to other data, track the lifecycle of data, etc. Most conventional solutions are either not capable of performing all of these tasks or perform them very slowly. This may be unacceptable to the investigators or to those who may be seeking to obtain results of the investigation in an expedited manner. Thus, there is a need to provide a data indexing system that can reduce the amount of data that needs to be analyzed for the purposes of determining similar documents, performing keyword searches, ascertaining the lifecycle of data, and/or performing any other analysis.