Many applications, such as search engines, create an index of content to allow fast resolution of various types of queries about the indexed content. For many of these applications, the resulting index is usually too large to fit into volatile memory. As a result, the content index must be stored in persistent storage, such as a disk or other permanent storage device, as it is built. To do so, small individual sub-indexes are typically first produced in volatile memory and then persisted in permanent storage. Each of these sub-indexes provides a content index for part of the content to be indexed. After each sub-index is created in volatile memory, it may be merged with a single index persisted in permanent storage to build a master index. To allow queries to be resolved while documents are still being indexed, each small sub-index may be persisted in non-modifiable form, such as a read-only file.
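The conventional build-then-merge process described above can be illustrated with a minimal sketch. It assumes a simple inverted index that maps each term to a sorted list of document identifiers, and it uses an ordinary dictionary as a stand-in for the master index persisted in permanent storage. The function names (build_sub_index, merge_into_master, index_in_batches) and the whitespace tokenization are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_sub_index(docs):
    """Build a small in-memory inverted index: term -> sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def merge_into_master(master, sub_index):
    """Return a new master index that also contains the sub-index postings.

    The dict here stands in for an index persisted in permanent storage
    (an assumption of this sketch); every existing posting list is copied,
    so the work grows with the size of the master index.
    """
    merged = {term: list(ids) for term, ids in master.items()}
    for term, ids in sub_index.items():
        merged[term] = sorted(set(merged.get(term, [])) | set(ids))
    return merged


def index_in_batches(batches):
    """Index one batch at a time, pausing to merge before the next batch."""
    master = {}
    for docs in batches:
        sub_index = build_sub_index(docs)               # built in volatile memory
        master = merge_into_master(master, sub_index)   # indexing waits here
    return master


batches = [
    {1: "fast index", 2: "index merge"},
    {3: "merge cost"},
]
master = index_in_batches(batches)
```

In this sketch no new batch is indexed until the preceding merge returns, which mirrors the sequential, single-threaded character of the process.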
Several problems arise from building a content index in this manner. First, the process is inefficient for applications that periodically index a vast amount of content. As the master index grows with each merge, it becomes increasingly expensive to add each additional sub-index created in volatile memory, and indexing performance degrades correspondingly. Second, indexing of content is interrupted while such merging occurs. Some applications, such as a search engine for the World Wide Web, index content continuously; such an application must wait until merging is finished before indexing can resume. Third, this sequential process does not scale beyond a single CPU. As the amount of stored content grows, multiprocessor systems need to be leveraged efficiently to index it.
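The growth in merge cost can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that every sub-index contributes the same number of postings and that each merge rewrites the entire master index; neither assumption comes from a particular implementation. The function name merge_costs and the figures used are hypothetical.

```python
def merge_costs(num_sub_indexes, postings_per_sub_index):
    """Postings written by each successive merge.

    Assumes (for illustration only) that each merge rewrites the whole
    master index, so the k-th merge writes k * postings_per_sub_index.
    """
    costs = []
    master_size = 0
    for _ in range(num_sub_indexes):
        master_size += postings_per_sub_index
        costs.append(master_size)
    return costs


costs = merge_costs(5, 1000)
print(costs)       # [1000, 2000, 3000, 4000, 5000]
print(sum(costs))  # 15000
```

Under these assumptions, merging k sub-indexes of s postings each writes s * k * (k + 1) / 2 postings in total, so the total merge work grows quadratically with the number of sub-indexes while the content indexed grows only linearly.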
What is needed is a way to build a content index that offers better performance than the process of indexing part of the content and then interrupting indexing to merge a sub-index with a master index. Such a system and method should support continuous indexing of content for applications, such as a search engine, that continuously index content. Additionally, such a system should scale as the number of CPUs used by the system increases. Moreover, such a system and method should promote efficient query performance as the index is built.