A tree is a common way of storing data in a computer so that the data can be used efficiently. A well-designed tree allows a variety of operations to be performed while using as few resources as possible, namely the amount of time it takes to execute an operation and the required memory space. A tree, like a family tree, is so named because a physical representation of it looks like a tree, even though the tree is generally shown upside down compared with a real tree, with the root at the top and the leaves at the bottom.
A tree utilizes a set of linked “nodes” to store the data. A node may contain one or more data objects, a condition, and/or represent a separate tree of its own. Each node in a tree has zero or more “child” nodes, which are nodes connected to it and below it in the tree. A node that has a child is referred to as the child's “parent” node. A node has at most one parent. The topmost node in a tree is called the “root” node. Nodes at the bottommost level of the tree are called “leaf” nodes. Nodes in between the root and leaves are called internal nodes or referred to simply as child nodes and referenced by their level (like generations in a family tree, counting from either the root or the leaves).
FIG. 1A shows a graphical representation of a simple tree. At the uppermost position is root node 102. Root node 102 contains data “D” and has two child nodes, nodes 104 and 106, which respectively contain data “B” and “F.” In this example, child nodes 104 and 106 make up the first level of child nodes (the first level below the root node). Node 104 has two child nodes 108 and 110, which respectively contain data “A” and “C.” Thus node 104 is the child node of root node 102 and the parent node of child nodes 108 and 110. Similarly, node 106 has two child nodes 112 and 114, which respectively contain data “E” and “G.” Nodes 108, 110, 112, and 114 make up the second level of child nodes.
FIG. 1B shows a graphical representation of the tree implemented as an array, that is, a list of indexed elements. The representation of an array in FIG. 1B makes the visual relationships between data a little more difficult to grasp than the tree in FIG. 1A, but it serves as a better visual example of how the tree can be implemented in the context of storing the data in memory. In this example, each element 120-132 is equivalent to a node in the tree of FIG. 1A. By using an indexed list, each element can contain the data for each node as well as “pointers” to its child nodes. The pointers are the addresses/indexes of the child nodes, so the list may be accessed following the tree structure rather than only sequentially from left to right.
Root node 102 is now stored in array element 120 which uses the index “0.” Element 120 contains the data “D” for the root node as well as pointers to the first level child nodes 104 and 106 which are represented as elements 122 and 124. The pointer from element 120 (root node) to element 124 (child node) allows a computer to skip past element 122 and go directly to element 124, using its index “2,” when traversing the tree, shown in FIG. 1A, to the right. Similarly, node 106 is stored in array element 124 which contains data “F” and has pointers to element 130 with index “5” and element 132 with index “6.”
In order to retrieve data “G” from a tree without knowing its exact location, a computer can start at the root node 102, traverse to the child node on the right 106, and again to another child node on the right 114. In terms of the array, element 120 will point to the index for element 124, and element 124 will point to the index for element 132. At most, any data will be two steps away from the root node in this tree. If the list in the array was searched sequentially from left to right, it would take six steps to reach “G.” As the amount of data grows, so do the savings in steps and thus computational time.
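The array traversal described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the tuple layout (data, left index, right index), the function names, and the placement of “A” and “C” at indexes 3 and 4 are assumptions made for the example; only indexes 0, 2, 5, and 6 are stated in connection with FIG. 1B.

```python
# Array-backed tree modeled on FIG. 1B. Each element holds its data plus the
# indexes of its left and right children (None where there is no child).
# The (data, left, right) layout is an assumption made for illustration.
ELEMENTS = [
    ("D", 1, 2),        # index 0: root node 102 (element 120)
    ("B", 3, 4),        # index 1: node 104 (element 122)
    ("F", 5, 6),        # index 2: node 106 (element 124)
    ("A", None, None),  # index 3: node 108 (assumed position)
    ("C", None, None),  # index 4: node 110 (assumed position)
    ("E", None, None),  # index 5: node 112 (element 130)
    ("G", None, None),  # index 6: node 114 (element 132)
]


def follow(elements, directions):
    """Start at index 0 and follow 'L'/'R' child pointers.

    Returns the data reached and the number of pointer hops taken.
    """
    index = 0
    steps = 0
    for direction in directions:
        _, left, right = elements[index]
        index = left if direction == "L" else right
        steps += 1
    return elements[index][0], steps


def sequential_steps(elements, target):
    """Count the moves needed to reach target scanning left to right from index 0."""
    for index, (data, _, _) in enumerate(elements):
        if data == target:
            return index
    raise KeyError(target)


print(follow(ELEMENTS, "RR"))            # ('G', 2)
print(sequential_steps(ELEMENTS, "G"))   # 6
```

Following the two right-hand pointers reaches “G” in two steps, whereas the sequential scan needs six, matching the counts given above.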
A common type of tree structure is a “B-tree.” The format of data within a B-tree is based upon a global order, constraints on the amount of data in each node, and the number of child nodes each parent may have. B-trees are also required to remain balanced: all of the leaf nodes are on the same level. FIG. 1A shows a simplified version of a B-tree. The global order for FIG. 1A's B-tree is alphabetic order. The leftmost node contains “A” and as you move right, the data progresses alphabetically to “G” in the rightmost node. Each parent node lies between (contains a median value of) its child nodes in terms of the global order: parent node 104 contains “B,” which comes after “A” in child node 108 on the left and before “C” in child node 110 on the right. The tree in FIG. 1A is also balanced: all leaf nodes 108, 110, 112, and 114 are on the second level of child nodes.
B-trees also operate under the assumption that there is a meaningful separation between data objects. Letters of the alphabet each have their own distinct value and are easily separated from one another. In the example above of searching the tree in FIG. 1A for “G” without knowing its location, a separation between each letter is required. Starting again at the root node 102, “D” is found. If the global order is alphabetic order, letters that come before “D” are going to be found in child nodes to the left and letters that come after “D” are going to be found in child nodes to the right. “G” comes after “D,” so we traverse to the right child node 106, where “F” is found. Again, letters before “F” will be to the left and letters after “F” will be to the right. Finally, we traverse to the child node 114 on the right, and “G” is found.
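The ordered search just described can be sketched as follows. The sketch models the simplified tree of FIG. 1A, in which each node holds a single letter and at most two children; a general B-tree node may hold several ordered data objects and more children. The class name Node and the function name search are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    data: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# The simplified tree of FIG. 1A, ordered alphabetically.
TREE = Node(
    "D",
    Node("B", Node("A"), Node("C")),
    Node("F", Node("E"), Node("G")),
)


def search(root: Optional[Node], target: str) -> List[str]:
    """Return the data values visited, in order, while locating target.

    At each node the global (alphabetic) order decides the direction:
    earlier letters are to the left, later letters are to the right.
    """
    path = []
    node = root
    while node is not None:
        path.append(node.data)
        if target == node.data:
            return path
        node = node.left if target < node.data else node.right
    raise KeyError(target)


print(search(TREE, "G"))  # ['D', 'F', 'G']
```

The search never needs to inspect the left half of the tree, because the comparison against “D” already rules it out.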
Considerable study and research has been expended in the design of systems to store, manage, and manipulate multidimensional/spatial data (hereinafter “spatial data”). Spatial data is a naturally occurring feature of numerous application spaces, and since it frequently involves extremely large datasets, index-based access methods for spatial data have been extensively studied. Despite this research, the trees and other methods for manipulating data in common use exhibit poor scalability across a wide range of environments, thereby limiting applicability to relatively narrow problem spaces.
Classic B-tree data structures owe much of their scalability to the assumption that there is a meaningful global order to the dataset in a single dimension and that a natural partition exists between any arbitrary set of records such that it is trivial to distribute the records across the nodes of the tree while preserving the global order. A problematic feature of many spatial datasets is that there may be no natural partitions between records, the probability of which increases as the number of records increases. Furthermore, multidimensional spatial datasets tend to be very resistant to the notion of having a global order in a single dimension. Numerous proposals for globally ordering spatial data for the purposes of storing it in B-tree data structures have been made, none of which generalize well in practice due to the necessary semantic lossiness of dimension reduction. Consequently, current spatial data structures tend to preserve the dimensionality of the data in their representations to preserve generality, but do so using strategies that adversely affect scalability.
The primary strategy used for indexing spatial data is that typified by the R-tree and its derivatives. R-trees split coordinate space with hierarchically nested, overlapping bounding rectangles. Each non-leaf node within the tree has a variable number of entries, up to some pre-defined limit. Each entry represents a way to identify a child node and a bounding rectangle that contains all of the entries within that child node. The actual data objects (or pointers to them) are stored in leaf nodes along with the bounding boxes for those data objects.
Unlike B-trees, where logically adjacent nodes never overlap, R-trees solve the problem of a spatial dataset having no natural partitions by allowing logically adjacent nodes in the tree to contain overlapping data. If a spatial data record straddles the bounds of two nodes, one of the two nodes is selected and adjusted so that the node's bounds logically contain that data object. As a natural consequence, the bounds of the two nodes that the spatial data record straddled now overlap each other.
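The way in which adjusting one node's bounds produces overlap can be sketched as follows. The coordinates are hypothetical and are not taken from any figure, the class name Rect and its methods are assumptions made for the example, and the choice of which node to enlarge is made arbitrarily here; practical R-tree variants apply a selection heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, other: "Rect") -> bool:
        """True when the two rectangles share interior area."""
        return (self.xmin < other.xmax and other.xmin < self.xmax
                and self.ymin < other.ymax and other.ymin < self.ymax)

    def enlarged_to_contain(self, other: "Rect") -> "Rect":
        """Smallest rectangle that contains both self and other."""
        return Rect(min(self.xmin, other.xmin), min(self.ymin, other.ymin),
                    max(self.xmax, other.xmax), max(self.ymax, other.ymax))


# Hypothetical bounds of two logically adjacent nodes that touch at x = 4.
node_a = Rect(0, 0, 4, 4)
node_b = Rect(4, 0, 8, 4)

# Hypothetical data record that straddles the shared edge.
record = Rect(3, 1, 5, 2)

print(node_a.overlaps(node_b))   # False: the nodes only touch

# Node A is (arbitrarily) selected and its bounds are adjusted to contain the record.
grown_a = node_a.enlarged_to_contain(record)
print(grown_a)                   # Rect(xmin=0, ymin=0, xmax=5, ymax=4)
print(grown_a.overlaps(node_b))  # True: the two nodes now overlap
```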
FIGS. 2A and 2B show graphical representations of sample spatial data objects and their organization within a simple R-tree. In this example, each node of the R-tree in FIG. 2B, 250-262, may contain up to three entries. Bounding rectangles 202 and 204 are the highest in the hierarchy and the combination of bounding rectangles 202 and 204 contains all of the bounding rectangles and data; thus they are stored in root node 250. Entry 202 within root node 250 contains a way to identify a child node 252 and the bounding rectangle 202 shown in FIG. 2A, which contains the entries of child node 252: 206 and 208. Entry 204 within root node 250 contains a way to identify a child node 254 and the bounding rectangle 204 shown in FIG. 2A, which contains the entries of child node 254: 210 and 212. Entry 206 within node 252 contains a way to identify a child node 256 and the bounding rectangle 206 shown in FIG. 2A, which contains the entries of child node 256: 214 and 216. This structure continues in this manner throughout the first level of child nodes. The second level of child nodes, 256-262, all contain the actual data objects and their respective bounding rectangles 214-230.
While this strategy works reasonably well for handling arbitrary sets of spatial objects, it has multiple significant scalability drawbacks. First, searching the tree for a single object may require traversing multiple branches of the tree in cases where nodes overlap, and the probability of node overlap increases as datasets grow larger. Second, the amount of CPU time required to service a query is very sensitive to the size distribution of spatial data records in the indexed spatial data set; a small number of atypically large geometries can substantially increase the number of geometries that must be evaluated when servicing an average query. Third, update concurrency and write scalability tend to be relatively poor because logically adjacent nodes can overlap with independently variable bounds. When an object is added to, deleted from, or modified within a tree, it will affect the size of the bounding rectangles and the structure of the tree, as each node is limited to a predetermined number of entries. For example, in FIGS. 2A and 2B, if a data object were inserted within bounding rectangle 210, data in leaf node 260 would need to be rearranged, as leaf node 260 already contains the maximum number of entries. An update of this type will require propagating the changes through the tree, which can “lock” other updates out of large portions of the tree. This both prevents high-concurrency update techniques common in B-trees (e.g., Lehman/Yao B-trees) and can potentially cause excessive locking in the upper nodes of the tree structure as node bounds are modified. Despite these limitations on scalability, R-tree based algorithms are the most common general-purpose indexed access methods for spatial data in use.
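The first drawback, that a search for a single object may have to descend into more than one branch, can be sketched as follows. The two-level tree, its coordinates, and the names RNode, intersects, and search are hypothetical and chosen only for illustration; they do not reproduce the objects of FIGS. 2A and 2B.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True when rectangles a and b touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class RNode:
    name: str
    bounds: Rect
    children: List["RNode"] = field(default_factory=list)          # internal entries
    objects: List[Tuple[str, Rect]] = field(default_factory=list)  # leaf entries


def search(node: RNode, query: Rect, visited: List[str]) -> List[str]:
    """Return names of objects intersecting query; record every node visited."""
    visited.append(node.name)
    hits = [name for name, rect in node.objects if intersects(rect, query)]
    for child in node.children:
        # Every child whose bounds intersect the query must be descended into.
        if intersects(child.bounds, query):
            hits.extend(search(child, query, visited))
    return hits


# Hypothetical leaves whose bounds overlap in the strip 4 <= x <= 5.
left = RNode("left", (0, 0, 5, 4),
             objects=[("a", (0, 0, 1, 1)), ("b", (3, 1, 5, 2))])
right = RNode("right", (4, 0, 8, 4),
              objects=[("c", (6, 2, 8, 4)), ("d", (4, 3, 5, 4))])
root = RNode("root", (0, 0, 8, 4), children=[left, right])

visited: List[str] = []
print(search(root, (4.5, 1.5, 4.5, 1.5), visited), visited)
# ['b'] ['root', 'left', 'right']

visited = []
print(search(root, (7, 3, 7, 3), visited), visited)
# ['c'] ['root', 'right']
```

The first query point lies in the overlapping strip, so both leaves are examined even though only one of them holds a matching object; the second query point lies outside the overlap and only one branch is visited.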
Another important strategy used for indexing spatial data is that typified by the Quad-tree algorithm. Quad-trees are most often used to partition two-dimensional space by recursively subdividing it into four quadrants or regions, decomposing coordinate space into smaller “buckets.” Each bucket has a maximum capacity and when capacity is reached, the bucket is partitioned again. Like the R-tree, leaf nodes contain the actual data objects while each internal node only contains its defined bounds and a way to identify child nodes that represent partitions of its bounds. Unlike the R-tree, Quad-trees preserve the strict logical adjacency of individual nodes in the tree by recursively partitioning the coordinate space into ever-smaller nodes as they become full and replicating spatial data records across every node they logically intersect.
FIGS. 3A and 3B show graphical representations of sample spatial data objects (similar to FIG. 2A) and their organization within a simple Quad-tree. Bounding rectangle 300 of FIG. 3A contains the entire coordinate space and thus corresponds to the root node 300 in FIG. 3B. In this example, the limit on data objects bounded per rectangle/node is three. As there are more than three data objects within the bounds of 300, it is partitioned into four bounding rectangles 302, 304, 306, and 308. Bounding rectangle 308 contains (at least partially) more than three data objects (324, 328, 330, 332, and 334), so it too is partitioned into four bounding rectangles 310, 312, 314, and 316.
In FIG. 3B, root node 300 and internal node 308 contain their respective bounding boxes and pointers to (or other identification of) their respective child nodes/partitioned quadrants. Leaf nodes 302, 304, 306, 310, 312, 314, and 316 all contain their bounding boxes and the actual data contained (wholly or partially) by those bounds. Data object 322 is an example of data that overlaps the partition and is partially contained by two leaf nodes. As a result, a copy of data object 322 is stored in each of nodes 302 and 306. Similarly, data object 332 straddles two partitions and is partially contained by bounding rectangles 310, 312, 314, and 316; thus a copy of data object 332 is stored in each of nodes 310, 312, 314, and 316.
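The partition-and-replicate behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: the coordinates and object names are hypothetical and do not reproduce FIGS. 3A and 3B, the names QuadNode, CAPACITY, and MAX_DEPTH are invented for the example, and the depth limit is an added safeguard against unbounded splitting rather than part of the description above. Only the capacity of three is taken from the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
CAPACITY = 3    # per-node limit, as in the example above
MAX_DEPTH = 8   # assumed safeguard so heavily overlapping data cannot split forever


def overlaps(a: Rect, b: Rect) -> bool:
    """True when rectangles a and b share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


@dataclass
class QuadNode:
    bounds: Rect
    depth: int = 0
    objects: Dict[str, Rect] = field(default_factory=dict)    # used by leaf nodes
    children: List["QuadNode"] = field(default_factory=list)  # used by internal nodes

    def insert(self, name: str, rect: Rect) -> None:
        if not overlaps(self.bounds, rect):
            return
        if self.children:
            # Internal node: offer the object to every quadrant it intersects.
            for child in self.children:
                child.insert(name, rect)
            return
        self.objects[name] = rect
        if len(self.objects) > CAPACITY and self.depth < MAX_DEPTH:
            self._split()

    def _split(self) -> None:
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        d = self.depth + 1
        self.children = [QuadNode((x0, y0, xm, ym), d), QuadNode((xm, y0, x1, ym), d),
                         QuadNode((x0, ym, xm, y1), d), QuadNode((xm, ym, x1, y1), d)]
        pending, self.objects = self.objects, {}
        # Redistribute; an object straddling a partition is copied into each quadrant.
        for name, rect in pending.items():
            for child in self.children:
                child.insert(name, rect)

    def copies(self, name: str) -> int:
        """Number of leaf nodes that hold a copy of the named object."""
        if self.children:
            return sum(child.copies(name) for child in self.children)
        return 1 if name in self.objects else 0


root = QuadNode((0, 0, 8, 8))
root.insert("p", (1, 1, 2, 2))
root.insert("q", (5, 1, 6, 2))
root.insert("r", (1, 5, 2, 6))
root.insert("s", (3, 3, 5, 5))  # straddles the center point (4, 4)

print(root.copies("p"))  # 1
print(root.copies("s"))  # 4
```

The fourth insertion exceeds the capacity of the root, which is partitioned into four quadrants; the object that straddles the center is then held by all four leaf nodes.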
While trees that use a replication strategy, such as Quad-trees, tend to have moderately good update concurrency and a CPU efficiency that is much closer to the performance of a B-tree than an R-tree, the potential for replicating single spatial data records across very large numbers of nodes as the dataset grows larger is so pathological that it is generally considered a very poor choice for indexing non-point geometries. The pathological replication of data objects can be seen in relation to FIGS. 3A and 3B. The addition of only a few more data objects in bounding box 308 would require bounding boxes 310-316 to each be partitioned. While still dealing with a small dataset, data object 332 would be copied into eight different nodes. As a consequence of this pathological replication, Quad-trees (and similar trees that replicate overlapping data) are used almost exclusively for the narrow case of geometry datasets that are guaranteed to never require replication, e.g., data composed entirely of points, such as raster graphics.
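The growth in replication can be illustrated with a short calculation. The sketch below assumes a worst case in which the coordinate space has been subdivided uniformly to a given depth (a Quad-tree normally subdivides only where buckets fill), and counts how many of the resulting cells a single rectangle intersects. The function name cells_intersected and the coordinates are hypothetical.

```python
import math


def cells_intersected(rect, extent, depth):
    """Count cells of a uniform 2**depth by 2**depth grid over extent
    whose interior is overlapped by rect (rect must have nonzero area)."""
    x0, y0, x1, y1 = rect
    ex0, ey0, ex1, ey1 = extent
    n = 2 ** depth
    cell_w = (ex1 - ex0) / n
    cell_h = (ey1 - ey0) / n
    first_col = math.floor((x0 - ex0) / cell_w)
    last_col = math.ceil((x1 - ex0) / cell_w) - 1
    first_row = math.floor((y0 - ey0) / cell_h)
    last_row = math.ceil((y1 - ey0) / cell_h) - 1
    return (last_col - first_col + 1) * (last_row - first_row + 1)


# A hypothetical 2 x 2 object centered in an 8 x 8 coordinate space.
counts = [cells_intersected((3, 3, 5, 5), (0, 0, 8, 8), d) for d in range(6)]
print(counts)  # [1, 4, 4, 4, 16, 64]
```

Once the cells become smaller than the object, each further level of subdivision multiplies the number of copies by four in this worst case.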
Almost all current spatial indexing methods can be categorized as a derivative of one of these two basic strategies, and share a number of features such as the utilization of balanced tree structures and the focus on position in the coordinate space as the sole organizing feature. Some recently proposed methods attempt to mitigate the impact of large geometries on the performance of spatial data structures, both for R-trees and Quad-tree variants (see Hanan Samet, ACM Computing Surveys (CSUR), Volume 36, Issue 2 (June 2004) pages 159-217), but with the primary result being better multi-resolution feature extraction rather than improved general scalability.