A B-tree is a tree structure which stores data, and allows operations to find, delete, insert, and browse the data. Each data record stored in a B-tree has an associated key. In order to be used for a B-tree, these keys must be orderable according to a predetermined function. For example, the keys may be numeric, in which case the ordering may be from least to greatest. As another example, the keys may be names, in which case the ordering may be alphabetical.
A B-tree is height-balanced, so all leaves are at the same level of the tree. Insertions and deletions of records to the B-tree are managed so that the height-balanced property of the B-tree is maintained. The insertion of a new data record may require the split of a node into two nodes; a deletion may require the deletion of a node. Insertion and deletion procedures must maintain the properties of the B-tree (e.g. height balance) in order to ensure that they result in valid B-trees.
Each B-tree leaf contains one or more of the stored records in one of a disjoint set of ranges of key values, while each index node (non-leaf node) of a B-tree provides access to a range of key values stored in one or more adjacent key ranges contained in data nodes. Each index node of the B-tree stores, for each of its child nodes, an ordered pair consisting of a key value within the range and a pointer to the child node. The key values break the range of key values represented by the node into sub-ranges, and the pointers point to a leaf within the sub-range (if the index node is one level above the leaf level) or to an index node corresponding to that sub-range.
FIG. 1 is a block diagram of an exemplary subtree in a B-tree data structure. As shown in FIG. 1, a sub-tree of a B-tree contains leaves 1010 storing records with the keys shown in those leaves 1010. Leaf nodes are also known as data nodes. Index node 1000 corresponds to the range between 21 and 133. Index node 1000 contains three ordered pairs (index pairs). The first ordered pair contains the key value 21 and first pointer 1020 of index node 1000, which points to index node 1025. A second ordered pair contains the key value 49 and second pointer 1030. This indicates that the pointer in the first ordered pair should be followed to reach any record with a key greater than or equal to 21 (the key value in the first pair) and less than 49 (the key value in the second pair). The key value in the second ordered pair, along with the key value of 93 in the third ordered pair, indicates that any record with a key greater than or equal to 49 and less than 93 will be found in the sub-tree whose root is index node 1035. The third ordered pair, containing third pointer 1040, indicates that any record with a key greater than or equal to 93 will be found in the sub-tree whose root is index node 1045.
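The way the ordered pairs of index node 1000 direct a lookup can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `child_for` and the representation of a pointer as a string naming the target node are assumptions made for the example, not part of the described structure.

```python
from bisect import bisect_right

# Ordered pairs of index node 1000 as described for FIG. 1: (key value, child).
# Each child is named by the reference numeral of the node the pointer reaches.
INDEX_NODE_1000 = [(21, "node 1025"), (49, "node 1035"), (93, "node 1045")]

def child_for(pairs, key):
    """Return the child whose sub-range contains key (hypothetical helper).

    Pair i covers keys k with pairs[i] key <= k < pairs[i + 1] key; the last
    pair covers every key at or above its own key value.
    """
    keys = [k for k, _ in pairs]
    i = bisect_right(keys, key) - 1  # rightmost pair whose key value <= key
    if i < 0:
        raise KeyError(f"{key} is below the range of this node")
    return pairs[i][1]

print(child_for(INDEX_NODE_1000, 30))   # key in 21 <= v < 49
print(child_for(INDEX_NODE_1000, 49))   # key in 49 <= v < 93
print(child_for(INDEX_NODE_1000, 113))  # key at or above 93
```

Because the pairs are kept in key order, a binary search over the key values is enough to pick the pointer to follow.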
It can be seen that an index node will have as many ordered pairs of <key, pointer> as it has child nodes. The range represented by each index node need not be explicitly stored in the index node. In the sub-tree of FIG. 1, node 1035 corresponds to the range of key values v where 49≦v<93, though this range is not explicitly stored in node 1035 in the example. Any search for key values in the range 49≦v<93, though, will reach node 1035. In addition to being height-balanced, another B-tree constraint concerns the number of nodes which can exist below a given node, which is determined by the order assigned to the B-tree. When an additional node is being added below a parent node which already has the maximum number of nodes, the result would violate this constraint. In practice, the order of a B-tree is determined dynamically, when a node of the tree fills up. In this case, a node split occurs, as described below.
To search a B-tree for a record, the search begins at the root node and follows pointers from node to node based on the key value for the record being sought, descending down the tree, until either the key is located or the search fails because a leaf node is reached which does not contain a record with the key being searched for. For example, if the record with key value 113 is being sought, when index node 1000 is reached, the key values are consulted. Since the key value being sought is greater than the key value in the rightmost pair in node 1000, the pointer 1040 from that pair is followed. Node 1045 is reached. When the key values are consulted, it can be seen that pointer 1048 should be followed to find any record with a key value 109≦v≦122. This pointer 1048 leads to the appropriate leaf from leaves 1010 which contains the record for the specified key value. If a record was searched for with a key value of 112, the search would end in the same location, but because no record is found with that key value in the leaf node, the search would return an unsuccessful result.
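The descent described above can be sketched as follows. The class names `Index` and `Leaf`, the function `search`, and the contents of every node other than the key values 21, 49, 93, 109 and 122 named in the text are assumptions made for illustration; the figure itself may hold different records.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        # Map each key to a placeholder record (hypothetical contents).
        self.records = {k: f"record {k}" for k in keys}

class Index:
    def __init__(self, pairs):
        self.keys = [k for k, _ in pairs]          # key values, ascending
        self.children = [c for _, c in pairs]      # pointer for each key value

def search(node, key):
    """Descend from node to a leaf; return the record, or None if absent."""
    while isinstance(node, Index):
        # Follow the rightmost pair whose key value is <= the search key.
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node.records.get(key)

# A small tree shaped like the sub-tree of FIG. 1 (leaf contents are assumed).
node_1025 = Index([(21, Leaf([21, 30]))])
node_1035 = Index([(49, Leaf([49, 70]))])
node_1045 = Index([(93, Leaf([93, 101])),
                   (109, Leaf([109, 113])),
                   (122, Leaf([122, 130]))])
node_1000 = Index([(21, node_1025), (49, node_1035), (93, node_1045)])

print(search(node_1000, 113))  # found in the leaf reached from node 1045
print(search(node_1000, 112))  # same leaf is reached, but no such record
```

Both lookups end in the same leaf; only the final check inside the leaf distinguishes a successful search from an unsuccessful one.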
When a node has the maximum number of key values (when there is not sufficient space for any additional index term or data record), if a new key value must be inserted into the range covered by the node, the node will be split. In order to ensure that concurrent accesses are not reading data from the node during the split, it is necessary to deny concurrent access to the node being changed. Because two nodes will now hold the information previously held by the node being split, an additional link is necessary in the parent node of the node being split. Concurrent accesses to that parent node must therefore be denied while the parent is updated. If the addition of a new key value and pointer in the parent node will overfill the parent node, the parent node will be split as well. It can be seen that node insertions may cause splits recursively up the B-tree. This may require that a node high in the tree be locked while nodes much further down in the tree are being split, and while the split slowly propagates its way up to the locked node. This greatly impairs concurrent access to the tree. The necessity for a number of locks or latches to prevent concurrent accesses to nodes being changed slows access to the information stored in the B-tree by limiting concurrent access.
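How far a split propagates depends only on how full each node on the path from the leaf to the root is. The following simplified model, which tracks entry counts rather than actual nodes, illustrates this; the function name `levels_split` and the capacity of four entries per node are assumptions chosen for the example.

```python
def levels_split(path_fill, capacity):
    """Return the levels that must split when one entry is added at the leaf.

    path_fill lists the number of entries in each node on the path, starting
    with the leaf (level 0) and ending with the root. A node that is already
    at capacity splits and posts one new entry to its parent.
    """
    split = []
    for level, count in enumerate(path_fill):
        if count + 1 <= capacity:
            break              # the node has room, so propagation stops here
        split.append(level)    # the node overflows and must be split
    return split

# Leaf and its parent are full, the grandparent has room: two levels split.
print(levels_split([4, 4, 2], capacity=4))
# Every node on the path is full: the split reaches the root.
print(levels_split([4, 4, 4], capacity=4))
```

In the standard B-tree scheme described above, every level returned by such a computation, plus the first ancestor that absorbs the new entry, would need to be protected from concurrent access for the duration of the operation.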
A Blink-tree is a modification of the B-tree which addresses this issue. FIG. 2 is a block diagram of an exemplary subtree in a Blink-tree data structure. Each non-leaf node contains an additional ordered pair, a side pair, including a side key value and a pointer (termed the “side pointer”) which points to the next node at the same level of the tree as the current node. The side pointer of the rightmost node on a level is a null pointer. Thus, as shown in FIG. 2, the subtree of a B-tree shown in FIG. 1 may be converted into a subtree of a Blink-tree with the addition of side pairs 1107, 1127, 1137, and 1147. The side pointer from side pair 1147, because it is the side pointer of a rightmost node on a level, is null. The side pointer from side pair 1107 is also shown as null; this could indicate that node 1000 is the root node or that it is the rightmost node on a level. The side key value indicates the lowest value found in the next node at the same level of the tree. Therefore, the range of values in a node may be seen by examining the index term for the node in its parent node (which is the lower bound and is included in the range) and the side key value (which is the upper bound but is not included in the range). The purpose of the side pointer is to provide an additional method for reaching a node. Each leaf node also contains a side pointer which points at the next leaf node, such as side pointer 1117.
One benefit of using these side pointers is to enable highly concurrent operation by allowing splits to occur with each atomic action of the split involving only one level of the tree. With Blink-trees, in order for a split to occur on a full node, the contents of the full node are divided (one atomic action), and a new index term is posted to the parent (second atomic action). This avoids the situation in which multiple levels of the tree are involved in a single atomic action. Suppose a search is being performed for a key value in the range of a node that has just been split, with the lefthand node of the new pair replacing the node which has been split. The tree can be traversed to find the data even if no index term for the righthand node has yet been inserted into the parent. In such a case, the parent node will point to the lefthand node, and if the data is not found in the lefthand node, the side pointer of the lefthand node provides access to the righthand node. Thus a node split need not be a single atomic operation with the parent and child nodes both inaccessible until the split is completed.
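A minimal sketch of such a search is shown below, for the situation in which a leaf has been split but the index term for the righthand node has not yet been posted. The `Node` class, its field names, the key values, and the choice to store a side key value in leaves as well as in index nodes are assumptions made for the example; no latching is modeled.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, side_key=None, side=None):
        self.keys = keys          # index key values, or record keys in a leaf
        self.children = children  # None for a leaf, else one child per key
        self.side_key = side_key  # lowest key in the right sibling, or None
        self.side = side          # side pointer to the right sibling, or None

def search(node, key):
    """Return True if key is stored under node, following side pointers."""
    while True:
        # The side key is an exclusive upper bound: move right while the
        # search key lies at or beyond it.
        while node.side is not None and key >= node.side_key:
            node = node.side
        if node.children is None:
            return key in node.keys
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]

# A full leaf has been divided into `left` and `right`, but the parent still
# holds only the index term for `left`.
right = Node([70, 80])
left = Node([49, 55], side_key=70, side=right)
parent = Node([49], children=[left])

print(search(parent, 70))  # reached through the side pointer of `left`
print(search(parent, 60))  # within the range of `left`, but not stored
```

The parent is never consulted about the righthand node; the side key value in the lefthand node alone tells the search that it must move right.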
In B-trees and Blink-trees, latches are used in order to provide mutual exclusion when a node split or node deletion is occurring. A latch is a low-cost, usually short-duration lock, one which does not include deadlock control. Hence, it is not necessary to access a lock manager in order to acquire or release a latch. Latches are therefore more lightweight than locks; they typically require only tens of instructions, not hundreds like locks. They prevent access of incorrect or outdated data during concurrent access of the data structure by allowing only an updater holding the latch to use the resource that has been latched.
Because no deadlock control exists for latches, a partial ordering is imposed on latches. The holder of a latch on a parent node may request the latch for a child node of that parent node. Latches can propagate downward. However, the holder of a latch on a child node cannot request the latch for the parent without first releasing its latch on the child; latches do not propagate upwards. In this way, the deadlock situation in which the holder of a latch for parent node A is requesting a latch for child node B at the same time that the holder of a latch for child node B requests a latch for parent node A is avoided. In a standard B-tree, the latch must be maintained for the node being updated, and for the parent of that node (and possibly for multiple ancestors up the tree, even perhaps to the root), so the pointers and key values in the parent can be modified to reflect the change. If the latch is not maintained for the parent, the tree can become inconsistent. The latches must typically be maintained for all the nodes on the path to a leaf node that may need to be updated because of a node split at the leaf.
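The downward-only ordering can be illustrated with a small sketch that refuses any request violating it. The classes `Latch`, `LatchHolder` and `LatchOrderError` are hypothetical, and comparing tree depths is a simplification of the parent/child ordering described above: a request is allowed only for a node strictly deeper than every latch already held.

```python
import threading

class LatchOrderError(RuntimeError):
    """Raised when a latch is requested against the imposed ordering."""

class Latch:
    def __init__(self, name, depth):
        self.name = name
        self.depth = depth              # 0 = root; larger values are lower
        self._lock = threading.Lock()   # lightweight mutual exclusion

class LatchHolder:
    def __init__(self):
        self.held = []                  # latches held, in acquisition order

    def acquire(self, latch):
        # Latches propagate downward only: refuse a request for a node that
        # is not strictly below the deepest latch already held.
        if self.held and latch.depth <= self.held[-1].depth:
            raise LatchOrderError(
                f"cannot request {latch.name} while holding {self.held[-1].name}")
        latch._lock.acquire()
        self.held.append(latch)

    def release(self, latch):
        self.held.remove(latch)
        latch._lock.release()

parent_latch = Latch("parent node A", depth=1)
child_latch = Latch("child node B", depth=2)

updater = LatchHolder()
updater.acquire(child_latch)
try:
    updater.acquire(parent_latch)       # upward request is refused
except LatchOrderError as err:
    print(err)
updater.release(child_latch)            # after release, the parent is allowed
updater.acquire(parent_latch)
updater.release(parent_latch)
```

Since no holder can ever wait for a latch above one it already holds, the circular wait between parent node A and child node B described above cannot arise.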
In a Blink-tree, however, a latch is not required on the parent node (and any further ancestors) while the child node is being split. As described above, once the child node has been latched for the node split, the parent latch need not be held during the child node split, nor during the interval in which the new nodes have been created but the parent node for these new nodes has not yet been updated. A node split therefore need not be an atomic operation that includes posting the index term to the parent, but can be divided into two parts (“half splits”). In the first “half split” a child node is split by moving some data from an old node to a new node and setting up a side link from the old node to the new node. After such a “half split” the Blink-tree will be well formed. A subsequent second “half split” posts an index term to the parent node.
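The two half splits can be sketched as two separate functions. This is a single-threaded illustration under stated assumptions: the `Node` class, the function names `half_split` and `post_index_term`, the choice to divide a node at its midpoint, and the key values are all invented for the example, and no latching or re-verification is modeled.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # index key values, or record keys in a leaf
        self.children = children  # None for a leaf, else one child per key
        self.side_key = None      # lowest key in the right sibling, or None
        self.side = None          # side pointer to the right sibling, or None

def half_split(node):
    """First half split: move the upper half of node into a new sibling."""
    mid = len(node.keys) // 2
    right = Node(node.keys[mid:],
                 None if node.children is None else node.children[mid:])
    # The new node inherits the old side pair, then the old node links to it.
    right.side_key, right.side = node.side_key, node.side
    node.keys = node.keys[:mid]
    if node.children is not None:
        node.children = node.children[:mid]
    node.side_key, node.side = right.keys[0], right
    return right

def post_index_term(parent, left, right):
    """Second half split: add the <key, pointer> pair for right to parent."""
    i = parent.children.index(left)
    parent.keys.insert(i + 1, right.keys[0])
    parent.children.insert(i + 1, right)

leaf = Node([49, 55, 70, 80])
parent = Node([49], children=[leaf])

new_node = half_split(leaf)
# The tree is well formed here: new_node is reachable through leaf.side.
print(leaf.keys, leaf.side_key, new_node.keys, parent.keys)

post_index_term(parent, leaf, new_node)
print(parent.keys)
```

Between the two calls the parent is untouched, which is the interval during which a search must rely on the side pointer to reach the new node.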
However, there is a risk that several changes (node deletes, described below, and splits) will occur, and that when the parent node is changed to reflect the new child node, that child node will no longer exist. Guarding against this requires that the existence of the child node be re-verified, which requires re-visiting the left-hand (originally full) node and ensuring that the side pointer for that node still references the right-hand (new) node. Additionally, when a node split occurs, the path to the node being split is remembered. There is a risk that when the key value and pointer for the split are to be added to the remembered parent node, that parent node no longer exists because it may have been deleted. Guarding against this requires a tree re-traversal, which is resource intensive. Thus, the prior art methods of Blink-tree node splitting incur extra execution costs, which in turn limit concurrency and throughput, and increase the complexity of the implementation.