1. Field
Embodiments of the invention relate to index partition maintenance over monotonically addressed document sequences.
2. Description of the Related Art
In the current state of the art, text indexing systems are implemented as inverted lists using standard underlying file system storage. Such text indexing systems typically provide adequate performance for a corpus on the order of a million documents, depending on factors such as document size (i.e., average number of tokens per document), the distribution of words that typically occur within the document corpus, and a host of other factors. A token may be described as a term (e.g., word, number, sequence of logograms, or other contiguous string of symbols) appearing in a document. When, however, an attempt is made to scale up such text indexing systems to contain a corpus on the order of billions of documents, a series of capacity and performance problems occurs.
First, the text indexing system runs into typical file system limits and capacity problems, where it is virtually impossible to sustain a single text index larger than the underlying file system. Typical low cost file systems are directly implemented over Just a Bunch of Disks (JBOD) or one or more spindles (disks). Transparent, storage-scalable file systems exist; however, they entail higher costs, more indirect management, and, typically, limited scalability with respect to the number of participating machines. Also, such a choice may not be feasible in some installations because the added software virtualization layers cause further I/O performance problems: the text indexing implementations in the field involve a high number of file system metadata changes, which such file systems generally handle poorly.
Second, the I/O profile associated with current text indexing systems directly affects the create (i.e., insert or ingest) velocity of the overlying applications using the index whenever the inverted list implementation within the text index undergoes a hardening operation called an index merge operation. Creation of a document at the text index layers may be described as processing of the document such that the document is inserted or created and indexed within the full text indexing system. In such an operation, current text indexing systems perform a sequential read and sequential write of almost the entire index, causing serious dips and stalls in the performance of the creation pipeline of the overlying application using the text index. Current product offerings of text indexing systems exhibit another stall, called the optimize problem, which essentially stalls the application until the entire inverted list is recreated using the old instance of the inverted lists. This is typically a long duration event that stalls the creation pipeline of the overlying application.
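By way of illustration only, the cost of an index merge operation may be sketched with a simplified model in which each index segment is a mapping from a term to a posting list. The function name, the segment contents, and the read/write counters below are hypothetical and are not taken from any particular product; the sketch only shows that merging a small, newly created segment into a large one reads and rewrites every posting of both.

```python
def merge_segments(segments):
    """Merge term -> posting-list segments, counting postings read and written."""
    merged = {}
    postings_read = 0
    for segment in segments:
        for term, doc_ids in segment.items():
            # Every posting of every input segment is read once.
            postings_read += len(doc_ids)
            merged.setdefault(term, []).extend(doc_ids)
    for term in merged:
        merged[term] = sorted(set(merged[term]))
    # Every posting of the merged result is written out again.
    postings_written = sum(len(ids) for ids in merged.values())
    return merged, postings_read, postings_written


# Hypothetical segments: one large existing segment and one small new one.
large_segment = {"index": [1, 2, 3], "merge": [1, 3]}
small_segment = {"index": [4], "stall": [4]}

merged, postings_read, postings_written = merge_segments(
    [large_segment, small_segment]
)
```

Although the new segment contributes only two postings in this example, all seven postings are read and all seven are written, which mirrors the near-complete sequential read and write described above.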
Third, another class of problems includes the term distribution problem. This problem involves the distribution of words within the document corpus being stored within the text index, which is sometimes referred to as the term dictionary of the document corpus. It is altogether possible that simply attempting to activate and open the text index with the current product offerings could consume all the memory resources of the hosting system simply to load the first level term index/dictionary into memory. In some cases, loading could be virtually impossible for indexes that have very large term distributions, demanding that the index be split and managed as a single index with a single virtual index view.
Fourth, on the search side, performance can degrade due to very large term dictionaries.
For example, with reference to a conventional index, there are inherent limits on which persistent file structures can actually be hosted in the text indexing systems at runtime. Certain structures, such as the first level term index file, at some point cannot be managed properly in memory because of the finite memory available to the JAVA™ Virtual Machine (JVM) heap. JAVA is a trademark of Sun Microsystems in the United States and/or other countries. Also, a conventional index may be hosted in a directory and inherently must lie within the storage limits of an underlying physical file system. This implies that the file system storage limits determine the maximum size of the index. A single conventional index has to lie within certain optimal limits in the posting lists to have reasonable search performance, assuming that the term distribution reaches a certain steady state at some point in the life cycle of the file system. A single conventional index has a peak creation rate tied to the underlying performance of the file system and storage and to the available Central Processing Unit (CPU).
Thus, as described, there are a number of problems associated with single very large full text indexes. Operationally, such indexes could exceed the file system capacity limits, which causes problems. Performance and throughput can also be seriously affected with such single very large indexes, both when inserting new documents and when performing a search or query. For example, dips and stalls in response times are known to occur when merge operations or index optimization are performed internally to compact and maintain the index.
In conclusion, there is a need for transparently and optimally partitioning and managing text indexes with a single virtual view to an application that utilizes the text indexes.