Business Intelligence (BI) repositories are likely to contain millions of documents. Data from these repositories can be retrieved via a search engine. The search engine typically includes two operations, namely searching and indexing. Indexing is a process of exploring the BI repository of documents in real time, extracting content from the documents, and storing the documents in an index. The documents stored in the index are available for searching upon a user's search request. The indexing process is time-consuming and resource-intensive. For instance, indexing a BI repository with a million documents may take a few days or even weeks.
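The two operations described above can be illustrated with a minimal sketch. It assumes the index is a simple in-memory inverted index that maps each term to the identifiers of the documents containing it; the source does not specify the index structure, and the names `extract_content`, `index_repository`, and `search` are hypothetical.

```python
import re
from collections import defaultdict


def extract_content(text: str) -> list[str]:
    """Extract lowercase alphanumeric terms from a document's text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_repository(repository: dict[str, str]) -> dict[str, set[str]]:
    """Explore every document in the repository and store it in an index.

    Assumption: the index is an inverted index (term -> document ids).
    """
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in repository.items():
        for term in extract_content(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of indexed documents that contain every query term."""
    terms = extract_content(query)
    if not terms:
        return set()
    # Start from the documents matching the first term, then narrow down.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

In this sketch only documents that have already passed through `index_repository` can be returned by `search`, which mirrors the point that a document becomes searchable only after it has been indexed.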
In practice, resources such as processor time and memory available for the indexing process are limited. The resources required for indexing a document increase exponentially with the size of the document. Indexing large BI documents may overload a system, resulting in out-of-memory conditions, slowing down system processes, and possibly crashing system applications. Therefore, indexing large documents increases the indexing time and significantly reduces the number of documents that are available for the user's search request, thereby reducing indexing throughput.