Databases use a variety of indexes to speed up user access to underlying data. One common technique is to use B-trees for indexes. Generally, a B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic time. In B-trees, internal nodes can have a variable number of child nodes within some pre-defined range. B-trees have advantages over alternative implementations. For example, the number of accesses required to find data in a B-tree can be significantly fewer than when the data is stored in other ways. The fact that fewer accesses are required, in turn, increases overall index performance when stored as a B-tree.
A well-known concurrency limitation issue for B-trees has been termed as “right-growing” indexes. Right-growing an index occurs when monotonically increasing values are inserted concurrently into the index. Such values occur commonly in many applications, e.g., when inserting an ordered list of timestamps or sequence numbers. What happens when monotonically increasing values are inserted in a column that is indexed with a B-tree is that the database server attempts to insert the updated values in the index the order that the database server receives the values. As a result, the insertions always occur at the rightmost leaf block. This produces contention for the leaf block and leads to frequent reorganizations when the right most leaf fills up and must be split.
To illustrate, consider FIG. 1. FIG. 1 illustrates an exemplary table 100 with a corresponding B-Tree index 102. The table 100 has a name column 104 and a gender column 106. The value in the name column 104 is used as the key of the B-Tree index 102. The B-Tree index 102 includes branch nodes and leaf nodes.
Branch nodes contain pointers to other nodes and data that indicate the range of values associated with the nodes to which they point. For example, node 108 contains pointers 110 and 112. Node 108 also stores the letter “M” to indicate that names that begin with the letters “A” through “L” are stored in the nodes attached to pointer 110 while the names that begin with the letters “M” through “Z” are stored in the nodes attached to pointer 112.
The leaf nodes of B-Tree 102 store key values and pointers to the rows of table 100 that correspond to the key values. For example, leaf node 114 contains three entries. The first entry stores the key value “KARL” and a pointer to the row in table 100 that contains the value “KARL” in the name column. The second entry of leaf node 114 stores the key value “KRIS” and a pointer to the row in table 100 that has the key value “KRIS”. The third entry of leaf node 114 stores the key value “LANE” and a pointer to the row in table 100 that contains the key value “LANE”.
As new data items are inserted into the base data, new entries that correspond to the new data items are added to the index. For example, if a record where the data for column 1 is “ANGIE” and the data for column 2 is “F” was added to table 100, a corresponding index entry would be added to leaf node 116 of B-Tree 102. The new index entry would include the key value “ANGIE” and a pointer to the new row added to table 100.
FIG. 2 illustrates a system that includes two nodes 204 and 214 and a disk 200. Nodes 204 and 214 generally represent processing units that have access to the one or more disks that contain the database in which table 100 is stored. Nodes 204 and 214 may be, for example, networked workstations or clusters of processors and memory components within a multi-processing machine.
Before an entry may be added to an index, the portion of the index into which the entry is to be added must be loaded into the dynamic memory of the node that is inserting the entry. For example, assume that a transaction 210 executing in node 204 specifies the insertion of a row containing the data “ANGIE, F” into table 100. Assume also that disk block 202 stores leaf node 116 of a B-Tree index 102. To insert the appropriate index entry into index 102, disk block 202 is loaded into buffer cache 206 of node 204. In illustration, the loaded copy of the block is shown as 202′. The copy 202′ of disk block 202 that is stored in buffer cache 206 is updated with the appropriate index entry for “ANGIE”. At a later time, the updated copy 202′ of disk block 202 is stored back to disk 200.
Typically, the closer key values are to each other in the order used by the index, the more likely the index entries for the key values will be stored in the same portion of the index. For example, index entries for “KEN”, “KENT” and “KENNETH” would all be stored in leaf node 114. Consequently, there is a high likelihood that index entries for data items with consecutive key values will be stored in the same portion of an index structure.
Under many conditions, data is entered into a database in such a way that consecutive entries have consecutive key values. For example, records may be keyed into a database system in alphabetic or numeric order. Even records that do not initially have an order with respect to each other may be assigned key values based on the order in which they arrive. For example, one way to assign a unique identifier to each piece of e-mail in an e-mail system is to assign each e-mail a strictly increasing number based on the order in which the e-mail is received.
When consecutively inserted data items have consecutive key values, the new index entries for the new data items are inserted into the same portion of the associated index. For the purposes of explanation, the portion of an index into which new entries will be inserted is referred to as the “target portion” of the index. For example, while rows that contain names that begin with the letters “A” through “C” are being added to table 100, leaf node 116 will be the target portion of index 102. During the insertion process, the rate at which the target portion is accessed will be relatively high, while the rate at which other portions of the index are accessed will be relatively low.
When only one node (e.g. node 204) is being used to insert data into table 100, the fact that one portion of index 102 is heavily accessed may not have adverse effects on the efficiency of the insertion process. For example, while rows with names beginning with the letters “A” through “C” are being added to table 100, block 202 will remain loaded in buffer cache 206. However, when two or more nodes are used to insert data into table 100, the fact that one portion of index 102 is heavily accessed by both nodes may lead to significant problems.
Specifically, each node must update the most recent version of block 202 to insert an index entry into leaf node 116. Therefore, if the version 202′ of block 202 in that is located in buffer cache 206 has been updated by node 204, the updated version 202′ of block 202 must be written to disk 200 and loaded into buffer cache 216 before node 214 may insert an entry into leaf node 116.
The updated version 202′ of block 202 that resides in buffer cache 216 would then have to be written to disk and loaded into buffer cache 206 before node 204 could insert a subsequent entry into leaf node 116. The transfer of data from the buffer cache of one node to the buffer cache of another node involves a significant amount of overhead, including multiple I/O operations and lock-related communications.
Based on the foregoing, it is clear B-trees suffer from bottleneck problems. This problem is particularly acute on multi-node clustered databases (RAC) where the successively ordered inserts could originate from different nodes, requiring a large amount of inter-node communication for the contended leaf blocks.
As a partial cure to this problem, some databases have begun to use reverse key indexes. A reverse key index does not index a column on the key value itself. Instead it reverses the bits of the key value and stores the data based on the reverse of the key value. Thus, a database converts sequences of values that may be vying for the same block into a set of values that look like random data. That data is then dispersed across multiple blocks. A technique for implementing a reverse key index is described in U.S. Pat. No. 5,956,705, issued Sep. 21, 1999.
Using a reverse key index, index entries associated with the consecutive key values can be inserted into unrelated portions of the index. For example, in reverse key indexing the transformation operation involves reversing the key value. For example, characters in a text string might be reversed. Thus, the key words KEN, KENT and KENNETH would be converted to NEK, TNEK and HTENNEK, respectively. Because the transformed key values begin with the letters “N”, “T” and “H”, index entries for the key values would be inserted into different portions of an index.
Then, when the index is used to process a query, the key values in the query are reversed as well. For example, if a query requires the retrieval of all rows containing the name “KEN”, the word “KEN” in the query is transformed into “NEK” and the index 102 is traversed to find the leaf node that would contain the key word “NEK”. The database server then follow the pointers associated with any index entries for the key work “NEK” to determine the location of the rows that contain the name “KEN”.
Although reverse key indexes solve some of the problems associated with right-growing indexes, they also introduce their own set of problems.
For example, one of the main problems with reverse key indexes is that queries on a reverse key index have limited usefulness. In other words, a query on a reverse key index generally has to be unique and specific to be effective. This means range scan queries and other relational order type queries (e.g., less than, greater than queries) cannot be performed efficiently. The reason is that reverse key indexes lose the notion of ordering in the leaf values. Thus, to perform a range scan, the database system would have to traverse every item in the index to determine if it falls within the specified range (e.g., since there is no ordering).
Another approach to the right-growing problem is to use a hash-partitioned indexed. A hash-partitioned index in which each partition is represented as a B-Tree can be used to reduce right growing leaf contention. However, hash-partitioned indexes have the same inherent problems associated with the reverse key indexes. Values are spread across multiple locations. Thus, to perform a range scans, a database system must scan every partition of the index. As a result, scan performance is not as good as is the case with a single index (e.g., because in a single index scan positioning needs to be done just once). In addition, hash-partitioned indexes have manageability problems, since they require maintaining separate schema objects for each partition.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.