The present invention relates to a data storage system, such as a hard disk, that stores database-type data in pages, to incremental physical clustering of data onto such pages, and, more particularly but not exclusively, to such clustering in native XML databases and other repository systems that store XML documents in a native format.
Current database or repository systems use two main approaches for storing XML documents. The first approach maps an XML document to a relational table in which each row represents an edge in the document's XML tree. Existing relational operators are then used to traverse the stored XML documents. The second approach, native XML storage, views the XML document as a tree. The entire XML tree is partitioned into distinct records, each containing a connected subtree that is disjoint from the others. These records are stored on disk pages, either in an unparsed, textual form or in some internal representation.
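The first approach can be illustrated with a minimal sketch. The function name `xml_to_edge_rows`, the row layout (parent id, node id, tag, text), and the sample document are assumptions made for illustration only; they are not taken from any particular system. The document root has no incoming edge, so it is given a row whose parent id is `None`.

```python
import xml.etree.ElementTree as ET


def xml_to_edge_rows(xml_text):
    """Flatten an XML document into rows, one per parent-child edge.

    Each row is (parent_id, node_id, tag, text). Node ids are assigned
    in document order. The root gets a row with parent_id None, since
    it has no incoming edge.
    """
    root = ET.fromstring(xml_text)
    rows = []
    next_id = 0
    stack = [(None, root)]
    while stack:
        parent_id, elem = stack.pop()
        node_id = next_id
        next_id += 1
        text = (elem.text or "").strip()
        rows.append((parent_id, node_id, elem.tag, text))
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(elem)):
            stack.append((node_id, child))
    return rows


if __name__ == "__main__":
    # Hypothetical sample document.
    doc = "<tours><tour><name>Ski</name></tour><tour><name>Spa</name></tour></tours>"
    for row in xml_to_edge_rows(doc):
        print(row)
```

In such a layout, following one step from a parent to its children corresponds to a self-join of the table on the parent id column, which is why existing relational operators suffice for traversal.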
In native XML database systems, document processing is dominated by path-dependent navigational XPath queries, which are aided by path indices that reduce the number of navigation steps across stored XML records. Thus, disk-resident XPath processors employ a mixed, i.e., part navigational, part indexed, processing model. Because the navigational part still follows edges between stored records, clustering related data items onto the same disk page is beneficial. Here, two document nodes are related if they are connected via an edge and examining one of them is likely to lead soon to examining the other. Data clustering has been shown to be beneficial for hierarchical databases, such as IBM's IMS, and for object-oriented databases (OODBs).
A practical algorithm, called XC, clusters XML documents using a tree partitioning approach. XC uses XML (which usually means XPath) navigational behavior, recorded as edge weights, to direct its document partitioning. XC is based on Lukes' tree partitioning algorithm (see below), but in contrast to Lukes' algorithm, which is an exact algorithm, XC is an approximate algorithm. That is, XC trades off partitioning precision for time and space. This enables XC to exhibit linear-time behavior without significant degradation in partitioning quality relative to the exact optimal solution. However, clustering based only on navigational behavior as encoded in the parent-child edge weights is not sufficient. It misses the fact that the children of a parent are often accessed successively. This means that, to reduce the number of page faults, affinity among sibling nodes should also be taken into account.
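The idea of weight-directed tree partitioning can be illustrated with a minimal sketch. The code below is neither XC nor Lukes' algorithm; it is a simple greedy bottom-up heuristic written only to show how edge weights and a page capacity interact. The function name `greedy_partition`, the input layout, and the sample tree are hypothetical, and the sketch assumes that every single node fits within one page.

```python
def greedy_partition(children, size, root, capacity):
    """Greedy bottom-up tree partitioning (illustrative heuristic only).

    children: dict mapping node -> list of (child, edge_weight)
    size:     dict mapping node -> storage size of the node
    capacity: maximum total size of one cluster (one page)

    Returns (clusters, cut_weight), where clusters is a list of node
    lists and cut_weight is the total weight of edges whose endpoints
    were placed in different clusters.
    Assumes size[n] <= capacity for every node n.
    """
    clusters = []
    cut_weight = 0

    def visit(node):
        nonlocal cut_weight
        members = [node]
        total = size[node]
        # Partition every child subtree before deciding about this node.
        results = [(w, visit(c)) for c, w in children.get(node, [])]
        # Try to keep the most heavily traversed edges inside the cluster.
        for w, (c_members, c_total) in sorted(results, key=lambda r: -r[0]):
            if total + c_total <= capacity:
                members.extend(c_members)
                total += c_total
            else:
                # Child cluster does not fit: close it and cut the edge.
                clusters.append(c_members)
                cut_weight += w
        return members, total

    root_members, _ = visit(root)
    clusters.append(root_members)
    return clusters, cut_weight


if __name__ == "__main__":
    # Hypothetical tree: a has children b and c; b has children d and e.
    tree = {"a": [("b", 5), ("c", 1)], "b": [("d", 4), ("e", 2)]}
    sizes = {n: 1 for n in "abcde"}
    print(greedy_partition(tree, sizes, "a", capacity=4))
```

With a capacity of 4 on the sample tree, the heuristic cuts only the lightest edge. With a capacity of 3 it cuts the heaviest edge (weight 5), although a partition with cut weight 3 exists, which shows why an exact algorithm such as Lukes', or a carefully designed approximation, is needed in practice. The recursion also makes the sketch unsuitable for very deep documents.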
XS, an extended version of the XC algorithm, clusters an XML document taking into account navigational affinity among sibling nodes. Kanne and Moerkotte also present algorithms for partitioning XML documents by using sibling edges. However, their algorithms do not take workload information into account.
Many data repository systems have evolving workloads and access patterns. Consider, for example, a data repository that contains tour information. The access pattern to such data changes during the course of the year. In the winter, information about ski vacations is more relevant than information about trek vacations, while in the summer most people will look for a seashore or lakeside vacation rather than a ski vacation. A spa vacation, by contrast, is probably attractive year round.
In fact, the workload of an XML document may change significantly during operation. This leads to changes in navigation behavior, which in turn necessitate data rearrangement. There is therefore a need for a system that is able to adjust the document data placement to changing access patterns while maintaining data placement quality. This has to be done efficiently in terms of both time and space.
The naive solution of full reclustering, upon each change in access pattern, is impractical as it requires reading and clustering the entire document, and these are complex and slow operations.