In the last decade, after many years of focusing on fast retrieval times, the importance of index maintenance has increased due, at least in part, to the dramatic increase in the volume of data. For example, some Web companies track click streams on the Internet. Simply storing click streams in a database is not sufficient. Indexing those click streams is important in order to be able to efficiently query the data. There are numerous other applications that require high data rates into a storage system along with efficient queryability, i.e., using indexes.
A B-tree is a popular index structure. A B-tree typically comprises a root node, multiple branch nodes, and multiple leaf blocks that are referenced by the branch nodes. B-trees are generally efficient data structures to query. However, in terms of maintenance, B-trees exhibit numerous problems. Whenever a new row is added to an indexed object (e.g., a table), a corresponding B-tree is updated, which typically requires at least two disk I/O operations—one read disk I/O operation and one write disk I/O operation. A disk I/O operation is referred to hereinafter simply as a “disk I/O.”
Additionally, a single indexed object typically has numerous B-trees “generated on” the indexed object. For example, an Employee table may include multiple columns (e.g., SSN, Last Name, First Name, Department, Salary) that each have a corresponding B-tree. Because only one B-tree on an indexed object tends to have clustering (locality), updates to keys in the other B-trees typically incur random disk I/Os across the leaf blocks of those B-trees. B-tree updates thus become a significant limiting factor for overall database performance because each update operation on a table results in updating all B-trees on the table. For example, if a table is associated with ten B-trees, then 1,000 updates on the table require approximately 20,000 random disk I/Os (1,000 updates, times ten B-trees, times roughly two disk I/Os per B-tree update).
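The arithmetic behind the 20,000 figure can be illustrated with a minimal sketch. The function name and parameters below are hypothetical and are used only for illustration; the default of two disk I/Os per B-tree update (one read and one write) follows the estimate given above, and actual costs depend on caching and tree depth.

```python
def estimate_random_disk_ios(num_updates, num_btrees, ios_per_btree_update=2):
    """Rough estimate of random disk I/Os for a batch of table updates.

    Assumes every update touches every B-tree on the table and that each
    B-tree update costs one read and one write (two disk I/Os) by default.
    """
    return num_updates * num_btrees * ios_per_btree_update


# Ten B-trees on the table, 1,000 updates to the table.
total = estimate_random_disk_ios(num_updates=1_000, num_btrees=10)
print(total)
```

Under these assumptions, the cost grows linearly with both the number of updates and the number of B-trees on the table, which is why adding indexes to a heavily updated table is expensive.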
Many users (whether individuals or organizations) desire to have real-time indexing. A real-time index is an index that is updated in conjunction with, or immediately after, an addition to or deletion from an object (e.g., a table) upon which the index is based (referred to herein as an “indexed object”). Thus, a real-time index is one that is immediately updated to reflect changes to the indexed object. Users typically do not want an index that is only current as of last week or even as of yesterday.
Thus, there are at least two issues with real-time indexing: storing changed data in real time and querying the changed data in real time. One proposal to handle a significant number of updates is to store the updates separately from an index. Periodically, such as during off-peak hours, the index is updated in a single (large) batch operation. The off-peak hours are referred to as a “batch window.” However, such an index is not current. In order to query current data, the separate store must also be queried, and such separate stores are not efficiently queryable.
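The trade-off of the batch approach can be sketched as follows. This is a simplified illustration only, not an implementation of any particular system: the class and method names are hypothetical, a sorted list stands in for the B-tree, and an unordered list stands in for the separate store of pending updates.

```python
import bisect


class DeferredIndex:
    """Hypothetical sketch: a sorted index plus an unordered store of pending keys."""

    def __init__(self):
        self.sorted_keys = []  # stands in for the B-tree; supports binary search
        self.pending = []      # separate store; append-only and unordered

    def insert(self, key):
        # Cheap write path: no index maintenance happens at insert time.
        self.pending.append(key)

    def merge_batch(self):
        # Batch window: fold all pending keys into the index in one operation.
        self.sorted_keys = sorted(self.sorted_keys + self.pending)
        self.pending = []

    def contains_indexed_only(self, key):
        # Fast binary search, but blind to keys inserted since the last merge.
        i = bisect.bisect_left(self.sorted_keys, key)
        return i < len(self.sorted_keys) and self.sorted_keys[i] == key

    def contains_current(self, key):
        # A current answer also needs a linear scan of the separate store.
        return self.contains_indexed_only(key) or key in self.pending
```

In this sketch, a key inserted after the last merge is invisible to `contains_indexed_only` until `merge_batch` runs, while `contains_current` returns the right answer only by scanning the pending list, whose cost grows with the number of unmerged updates.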
To compound the problem, “batch windows” are disappearing as businesses become global in nature. Updates to indexed objects are roughly constant throughout a given day. Also, users are increasingly accustomed to services being online all the time. Therefore, temporarily disabling an index, even during a short batch window, is not advisable for a business that seeks to attract and maintain a committed customer base.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.