A list of references is set forth at the close of the present disclosure; individual references from the list are cited by bracketed abbreviations (e.g., [FK99], [TN91]).
Current database or repository systems use two main approaches for storing XML documents. The first approach maps an XML document to a relational table where each row represents an edge in the document's (abstract) XML tree [FK99]. XML processing on stored documents is implemented via existing relational operators. The second approach, Native XML Storage, views the document as an XML tree. It partitions the XML tree into distinct records containing disjoint connected subtrees [KM00]. Records are then stored in disk pages, either in an unparsed, textual form, or using some internal representation.
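As an illustration of the first approach, the following minimal Python sketch flattens an XML document into one row per edge of its tree. The function name `xml_to_edge_rows`, the row layout `(parent_id, child_id, label, text)`, and the pre-order node numbering are assumptions made for this example; they are not taken from [FK99]. Attributes are ignored, and the root is recorded with a parent of `None` to stand for the edge from a virtual document node.

```python
import xml.etree.ElementTree as ET


def xml_to_edge_rows(xml_text):
    """Return one (parent_id, child_id, label, text) row per edge of the XML tree.

    Node ids are assigned in document (pre-order) order, starting at 0.
    """
    root = ET.fromstring(xml_text)
    rows = []
    next_id = [0]  # mutable counter shared with the nested function

    def visit(elem, parent_id):
        node_id = next_id[0]
        next_id[0] += 1
        text = (elem.text or "").strip()
        rows.append((parent_id, node_id, elem.tag, text))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return rows


if __name__ == "__main__":
    for row in xml_to_edge_rows("<book><title>XML</title><year>2003</year></book>"):
        print(row)
```

With rows of this kind, a path step such as parent-to-child navigation becomes a self-join of the table on `child_id = parent_id`, which is how existing relational operators can be reused.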
For database systems that support operators for navigating between data items via link (pointer) traversal, clustering of related data items is beneficial when physically laying out data. Here, two nodes are related if they are connected via a link, and examining one of them is likely to soon lead to examining the other. For example, data clustering has been shown to be effective for hierarchical databases such as IBM's IMS [Sch77], and for object-oriented databases (OODBs) [TN91, TN92, GKKM93].
XML documents are often processed using query languages such as XSLT, XPath, or XQuery [WC3]. These languages use the XPath navigational operators for traversing paths of abstract XML trees. In native XML storage systems, this results in navigations across stored XML records that are similar to those in hierarchical or object-oriented databases. In practice, such traversals are often aided by path indexes that reduce the number of path navigations across stored records. However, path indexes cannot completely eliminate such path traversals, because XML query patterns are often complex and it is expensive to maintain multiple path indexes that cover the entire XML document. Furthermore, many processing schemes employ deferred access and may need additional traversals for revalidation, because the data may have been updated in the meantime. Finally, generic physical layout schemes can scatter XML nodes across many records. For such physical layouts, conventional I/O optimization techniques like prefetching or buffering may not be effective for complex XML queries. Therefore, it is beneficial, especially for large XML documents, to cluster together related XML nodes and store them in the same disk page.
As disk pages have finite capacity, not all related XML nodes can fit in a single disk page. Therefore, one needs to decide how to assign related XML nodes to disk pages. A key difference between clustering objects in OODBs and clustering related XML nodes is that in OODBs, the size of an object is known a priori from its class specification, whereas for XML documents, in the absence of an XML schema, node sizes are known only after the nodes are parsed. Even when an XML schema is available, the sizes of text nodes are known only at runtime.
The problem of assigning related XML nodes to disk pages may be viewed as a tree partitioning problem. The tree to be partitioned is a clustering tree, namely an XML tree augmented with node and edge weights. Roughly, the edge weights model the XML navigational behavior (higher edge weights mean that the connected XML nodes are more strongly “related”). Node weights are the (text) sizes of the XML nodes. The problem is to partition the set of nodes of the clustering tree into node-disjoint subsets (called clusters) so that each cluster induces a connected subtree, each cluster fits into a disk page, and the total of the intra-cluster edges' weights (called the partition's value) is maximized. (Equivalently, the total weight of the inter-cluster edges, called the partition's cost, is minimized.) Intuitively, a higher value partition implies fewer disk accesses.
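The definitions above can be made concrete with a small Python sketch that checks whether a candidate partition is feasible and computes its value and cost. The function name `evaluate_partition` and the dictionary-based tree encoding are assumptions chosen for illustration only. The connectivity test relies on the fact that, in a rooted tree, a set of nodes induces a connected subtree exactly when one and only one of its nodes has no parent inside the set.

```python
def evaluate_partition(parent, node_weight, edge_weight, cluster_of, page_size):
    """Return (feasible, value, cost) for a candidate partition of a clustering tree.

    parent:      dict node -> parent node (the root maps to None)
    node_weight: dict node -> size of the node
    edge_weight: dict child -> weight of the edge (parent[child], child)
    cluster_of:  dict node -> cluster identifier
    page_size:   capacity of one disk page
    """
    sizes = {}  # cluster id -> total node weight
    tops = {}   # cluster id -> number of nodes whose parent lies outside the cluster
    value = 0   # total weight of intra-cluster edges
    cost = 0    # total weight of inter-cluster edges

    for node, cluster in cluster_of.items():
        sizes[cluster] = sizes.get(cluster, 0) + node_weight[node]
        p = parent[node]
        if p is not None and cluster_of[p] == cluster:
            value += edge_weight[node]
        else:
            tops[cluster] = tops.get(cluster, 0) + 1
            if p is not None:
                cost += edge_weight[node]

    connected = all(count == 1 for count in tops.values())
    fits = all(size <= page_size for size in sizes.values())
    return connected and fits, value, cost


if __name__ == "__main__":
    parent = {"a": None, "b": "a", "c": "a", "d": "b"}
    node_weight = {"a": 2, "b": 2, "c": 3, "d": 1}
    edge_weight = {"b": 5, "c": 1, "d": 4}
    clusters = {"a": 0, "b": 0, "d": 0, "c": 1}
    print(evaluate_partition(parent, node_weight, edge_weight, clusters, page_size=5))
```

Because every edge is either intra-cluster or inter-cluster, value plus cost always equals the total edge weight of the tree, which is why maximizing the value and minimizing the cost are the same problem.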
A widely accepted approach to the tree partitioning problem has been Lukes' dynamic programming-based tree partitioning method [Luk74]. This method operates on a tree in a bottom-up manner and uses an iterative procedure that splits increasingly larger subtrees into a collection of partitions. Each partition is a set of node-disjoint clusters, each cluster satisfying the total weight constraint (i.e., disk page size). For each subtree, and for each feasible cluster weight, Lukes' method finds a partition with the maximum value. The method completes when the subtree being partitioned is the entire tree itself. The final winning partition is a partition of the entire tree with the maximum value. The method of Lukes, however, suffers from excessive memory usage and running time, both of which are artifacts of dynamic programming. Thus, there is a need to improve upon Lukes' method.
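The following Python sketch illustrates the general shape of such a dynamic program; it is a simplified reading of the description above rather than a reproduction of [Luk74]. It assumes positive integer node weights, returns only the maximum partition value (not the clusters themselves), and uses recursion for the bottom-up traversal, so very deep trees would need an explicit stack. The name `lukes_best_value` and the input encoding are hypothetical.

```python
def lukes_best_value(children, node_weight, edge_weight, root, capacity):
    """Return the maximum partition value, or None if some node exceeds capacity.

    children:    dict node -> list of child nodes (leaves may be absent)
    node_weight: dict node -> positive integer size
    edge_weight: dict child -> weight of the edge from its parent
    capacity:    maximum total node weight of one cluster (page size)
    """

    def solve(v):
        # table[w] = best value for v's subtree when v's own cluster weighs exactly w
        if node_weight[v] > capacity:
            return {}
        table = {node_weight[v]: 0}
        for c in children.get(v, []):
            child_table = solve(c)
            if not child_table:
                return {}  # some descendant cannot fit in any page
            best_child = max(child_table.values())
            merged = {}
            for w1, val1 in table.items():
                # Option 1: cut edge (v, c); the child keeps its best partition.
                cand = val1 + best_child
                if cand > merged.get(w1, float("-inf")):
                    merged[w1] = cand
                # Option 2: keep edge (v, c); the two root clusters are joined.
                for w2, val2 in child_table.items():
                    w = w1 + w2
                    if w <= capacity:
                        cand = val1 + val2 + edge_weight[c]
                        if cand > merged.get(w, float("-inf")):
                            merged[w] = cand
            table = merged
        return table

    final = solve(root)
    return max(final.values()) if final else None


if __name__ == "__main__":
    children = {"a": ["b", "c"], "b": ["d"]}
    node_weight = {"a": 2, "b": 2, "c": 3, "d": 1}
    edge_weight = {"b": 5, "c": 1, "d": 4}
    print(lukes_best_value(children, node_weight, edge_weight, "a", capacity=5))
```

Even in this reduced form, each node carries a table with up to one entry per feasible cluster weight, and combining a node with a child takes time proportional to the product of the two table sizes. Recovering the winning partition itself would additionally require storing a partition alongside every table entry, which is where the memory pressure noted above arises.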