Indexes are data structures that provide relatively fast access to a set of data based on key values. FIG. 1 illustrates an exemplary table 100 with a corresponding B-tree index 102. The table 100 has a name column 104 and a gender column 106. The values in the name column 104 are used as the key of the B-tree index 102. The B-tree index 102 includes branch nodes and leaf nodes.
Branch nodes contain (1) pointers to other nodes and (2) other data that indicates the range of values associated with the nodes to which the pointers point. The range of values associated with a node are referred to herein as "boundary values". For example, node 108 contains pointers 110 and 112. Node 108 also stores the letter "M" to indicate that the boundary values for the names stored in the nodes attached to node 108 begin with the letters "A" through "L" are stored in the nodes attached to pointer 110 while the names that begin with the letters "M" through "Z" are stored in the nodes attached to pointer 112.
The leaf nodes of B-tree index 102 store key values and pointers to the rows of table 100 that correspond to the key values. For example, leaf node 114 contains three entries. The first entry stores the key value "KATE" in the name column and a pointer to the row in table 100 that has the key value "KATE". The second entry of leaf node 114 stores the key value "KRIS" and a pointer to the row in table 100 that has the key value "KRIS". The third entry of leaf node 114 stores the key value "LANE" and a pointer to the row in table 100 that contains the key value "LANE ".
As new items are inserted into the base data container (e.g. table 100) associated with index 102, new entries that correspond to the new data items are added to the index 102. For example, if a record where the data for column 1 is "ANGIE" and the data for column 2 is "F" were added to table 100, a corresponding index entry would be added to leaf node 116 of B-tree 102. The new index entry would include the key value "ANGIE" and a pointer to the new row added to table 100.
FIG. 2 illustrates a system that includes a node 204 and a disk 200. Node 204 generally represents one of possibly many processing units that have access to the one or more disks that contain the database in which table 100 is stored. Node 204 may be, for example, a networked workstation or a cluster of processors and memory components within a multi-processing machine.
Log files keep track of changes made to a disk block in a database for purpose of recovery or reconstruction of transactions. A redo log file is one such log file. A redo log file contains redo log records. A redo log record contains information necessary to reconstruct, or redo changes made by operations such as INSERT, DELETE and UPDATE. Redo log records are generated for each change made to a copy of a disk block stored in a buffer cache 206. The redo log records are temporarily stored in a redo log buffer 208. Just before a transaction commits, all of the redo log records in the redo log buffer 208 associated with the transaction are written to a persistent redo log file 240. In this way, if a database system fails, then the database system can reapply the redo log records stored in persistent redo log file 240 to recover changes made by committed transactions before the failure.
Before an entry may be added to an index, the portion of the index into which the entry is to be added is typically loaded into the dynamic memory of the node that is inserting the entry. For example, assume that a statement 211 in a transaction 210 executing in node 204 specifies the insertion of a row containing the data "ANGIE, F" into table 100. Assume also that disk block 202 stores leaf node 116 of a B-tree index 102. To insert the appropriate entry into index 102, disk block 202 is loaded into buffer cache 206 of node 204. In the illustration, the cached copy of the block is shown as 202'.
Once loaded into the buffer cache 206, the copy 202' of disk block 202 is updated with the appropriate index entry for "ANGIE". A redo log record is written to the redo log buffers 208. Once the transaction 210 is complete, the redo log records for each statement in the transaction are written to persistent redo log file 240. After the redo log buffers 208 are written to persistent redo log file 240, the transaction 210 can commit. At any time, the updated copy 202' of disk block 202 may be stored back to disk 200.
Typically, the closer key values are to each other in the order used by the index, the more likely the index entries for the key values will be stored in the same portion of the index. For example, in the B-tree index depicted in FIG. 1, index entries for "KEN", and "KENT" and "KENNETH" would all be stored in leaf node 114. Consequently, there is a high likelihood that index entries for data items with consecutive key values will be stored in the same portion of an index structure.
One approach to performing index maintenance operations is to immediately update the index 102 as each change is applied to a table associated with the index. According to this approach, an update to index 102 is applied immediately after a corresponding row of table 100 is updated. For example, if records "ANGIE, F", "STACEY, F" and "ART, M" are inserted into table 100, index 102 would need to be updated with three separate update operations.
More specifically, after the record "ANGIE, F" is added to table 100, leaf node 116 of index 102 would be updated with an index entry for the new row. Assume that disk block 202 of disk 200, depicted in FIG. 2, contains leaf node 116. To perform the index update, disk block 202 would be copied from disk 200 and copied as disk block 202' in buffer cache 206. An index entry with the key value "ANGIE" and a pointer to the corresponding row "ANGIE, F" in table 100 would be inserted into disk block 202', and a redo log record would be added to the redo log buffers 208 to record the change made to disk block 202.
Next, the record "STACEY, F" would be inserted into table 100 and index 102 would again be updated. Assume now that disk block 203 of disk 200, depicted in FIG. 2, contains leaf node 120. To perform the index update, disk block 203 would be copied from disk 200 and written as disk block 203' in buffer cache 206. An index entry with the key value "STACEY" and a pointer to the corresponding row "STACEY, F" in table 100 would inserted into disk block 203, and a redo log record would be added to the redo log buffers 208 to record the change made to disk block 203.
Finally record "ART, M" would be inserted into table 100 and index 102 would again be updated. As mentioned above, disk block 202, depicted in FIG. 2, contains leaf node 116. If the disk block 202 had been written back to disk 200 subsequent to the insertion of the index entry for ANGIE, then disk block 202 would have to again be copied from disk 200 and written as disk block 202' in buffer cache 206. Key value "ART" and a pointer to the corresponding row "ART, M" in table 100 would be inserted into leaf node 116 of index 102 and another redo log record would be added to the redo log buffers 208 to record the change made to disk block 202. At any point during or after the index update operation, the redo log records within redo log buffers 208 could be written to the persistent redo log file 240 in disk 200.
The per-update approach to performing index maintenance described above is expensive because the overhead is high. There is a significant amount of I/O as a result of copying the disk block from disk 200 to buffer cache 206 and then from buffer cache 206 back to disk 200. There is also a significant amount of instruction cycles associated with the per-update approach to index maintenance. In the event of a system failure or rolling forward an operation, the haphazard arrangement of the redo log records of the redo log file 240 further exacerbates the I/O problem and ties up system resources.
In general, the number of rows updated by a given statement 211 that references a table increases as the size of the table increases. The larger the number of rows that are updated, the greater the overhead associated with updating indexes and generating and storing the redo log records for the updates to the indexes. Clearly, if a large number of updates to an index were to be made in a relatively short amount of time, the per-update approach can be particularly inefficient: the same disk blocks may be repeatedly swapped back and forth between the buffer cache 206 and disk 200. Further, the more frequently that disk blocks, such as 202 or 203, are updated, the more redo log records will be created. Many of the redo log records may represent changes to the same disk block. Each of these events cause a considerable amount of I/O and a drain on system resources. Thus, there is a need for a more efficient method for updating indexes in a database system.