The present invention relates to a device and method for incremental clustering of indexed XML data in a paged data storage system and, more particularly, but not exclusively, to such clustering where a basic tree structure co-exists with indexing.
We consider the problem of partitioning an XML document for storage on disk. Currently there are two main approaches for storing XML documents. The first approach maps an XML document to a relational table in which each row represents an edge in the document's XML tree. Existing relational operators are then used to traverse the stored XML documents. The second approach, native XML storage, stores an XML document as a tree. The entire XML tree is partitioned into distinct records containing disjoint connected subtrees. These records are stored on disk pages, either in an unparsed, textual form, or using some internal representation. In native XML database systems, document processing is dominated by path-dependent navigational XPath queries, which are aided by path indices that reduce the number of navigation steps across stored XML records. Thus, disk-resident XPath processors employ a mixed, i.e., part navigational, part indexed, processing model. Because each navigation step that crosses from one record to another may require reading a further disk page, smart clustering of related data items is beneficial. Here, two document nodes are related if they are connected via an edge and examining one of them is likely to lead to examining the other.
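The first, relational approach can be illustrated with a minimal Python sketch. The function name `edge_rows`, the preorder node numbering and the three-column row layout are assumptions made for illustration only; they are not taken from any particular system.

```python
import xml.etree.ElementTree as ET

def edge_rows(xml_text):
    """Return one (parent_id, child_id, child_tag) row per edge of the XML tree.

    Node ids are assigned in document (preorder) order, with the root as 0.
    This row layout is a hypothetical one chosen for illustration.
    """
    root = ET.fromstring(xml_text)
    rows = []
    counter = [0]  # last node id handed out

    def visit(elem, elem_id):
        for child in elem:
            counter[0] += 1
            child_id = counter[0]
            rows.append((elem_id, child_id, child.tag))
            visit(child, child_id)

    visit(root, 0)
    return rows

rows = edge_rows("<a><b><c/></b><d/></a>")
# rows == [(0, 1, 'b'), (1, 2, 'c'), (0, 3, 'd')]
```

Traversing from a node to its children then amounts to selecting the rows whose first column equals that node's id, which is the kind of operation that existing relational operators provide.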
Data clustering has been shown to be beneficial for hierarchical databases such as IBM's IMS, and for object-oriented databases (OODBs).
An algorithm, XC, that clusters XML documents using a tree partitioning approach is presented in R. Bordawekar and O. Shmueli, "Flexible workload-aware clustering of XML documents," in XSym, pages 204-218, 2004. XC uses XML navigational behavior, recorded as edge weights, to direct its document partitioning. This behavior may be determined by XPath processing or some other processing methodology.
XC is based on Lukes' tree partitioning algorithm. However, performing clustering based only on navigational behavior as encoded in the parent-child edge weights is not sufficient, because it misses the fact that the children of a parent are often accessed successively. This means that, to reduce the number of page faults, affinity among sibling nodes should also be taken into account. XS, an extended version of the XC algorithm, clusters an XML document taking into account navigational affinity among sibling nodes. Kanne and Moerkotte present algorithms for partitioning XML documents that also use sibling edges. However, their algorithms do not take workload information into account.
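The effect of sibling affinity can be seen in a small Python sketch. It does not reproduce XC, XS or Lukes' algorithm; it only scores a given node-to-page assignment by the total weight of the edges that cross page boundaries, which is assumed here as a simple stand-in for expected page faults. The function name `cut_weight`, the node names and all weights are hypothetical.

```python
def cut_weight(page_of, weighted_edges):
    """Sum the weights of edges whose two endpoints are stored on different pages."""
    return sum(w for (u, v), w in weighted_edges.items()
               if page_of[u] != page_of[v])

# Hypothetical document: root r with children a, b, c; each page holds two nodes.
parent_child = {("r", "a"): 2, ("r", "b"): 1, ("r", "c"): 1}
sibling = {("a", "b"): 4}  # a and b are often visited one after the other
all_edges = {**parent_child, **sibling}

layout_1 = {"r": 0, "a": 0, "b": 1, "c": 1}  # pages {r, a} and {b, c}
layout_2 = {"r": 0, "c": 0, "a": 1, "b": 1}  # pages {r, c} and {a, b}

# Parent-child weights alone favor layout_1 (2 versus 3).
pc_only = (cut_weight(layout_1, parent_child), cut_weight(layout_2, parent_child))
# Adding the sibling edge reverses the preference (6 versus 3).
with_siblings = (cut_weight(layout_1, all_edges), cut_weight(layout_2, all_edges))
```

In this example a partitioner that sees only parent-child weights would separate a from b, whereas one that also sees the sibling edge would keep them on the same page.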
The workload, namely which queries are issued and with what frequency and importance, plays an important role because the data that is accessed is frequently determined by the workload. Hence, it is important for the physical organization to match the workload. However, the workload may change, which means that ideally the physical organization of the data needs to change as well. A practical algorithm, PIXSAR, which is based on XS, is presented in U.S. Provisional Patent Application No. 61/054,249 whose priority is claimed herewith. PIXSAR incrementally clusters XML documents while taking into account navigational affinity among sibling nodes. It makes decisions on the fly, and selectively reclusters parts of the augmented document tree that experience significant changes in access behavior.
The main parameters used by PIXSAR are the radius, which determines the pages that are to be reclustered (intuitively, this parameter reflects the maximum distance of pages that are affected by a change in a document), and the sensitivity of the reclustering trigger.
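One possible reading of the radius parameter is sketched below in Python. It is an assumption made for illustration, not PIXSAR's actual procedure: pages are treated as nodes of a graph, two pages are linked when a document edge crosses between them, and the pages selected for reclustering are those at most `radius` links away from the page where the change occurred. The names `pages_within_radius` and `page_links` are hypothetical.

```python
from collections import deque

def pages_within_radius(page_links, start, radius):
    """Return the pages reachable from `start` in at most `radius` links.

    page_links maps a page id to the ids of the pages it is linked to.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distance[page] == radius:
            continue  # do not expand beyond the radius
        for neighbor in page_links.get(page, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[page] + 1
                queue.append(neighbor)
    return set(distance)

# Hypothetical chain of four pages: 0 - 1 - 2 - 3
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
affected = pages_within_radius(links, 1, 1)  # {0, 1, 2}
```

Under this reading, a larger radius reclusters more pages per trigger and a smaller radius keeps each reclustering step local.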
However, in addition to the augmented XML tree, there are also indices. As noted above, most database and repository systems use path indices that reduce the number of navigation steps across stored XML nodes, so that disk-resident XPath processors employ a mixed, i.e., part navigational, part indexed, processing model.
The kind of index we consider is based on an XPath expression and consists of index entries pointing to XML target nodes. Using such index entries, one jumps directly to the target nodes. Often, target XML nodes are accessed in temporal proximity and hence, for paging reasons, it is beneficial to store them on the same disk page. In other cases, such temporal proximity is absent and hence co-storing is not optimal.
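A minimal Python sketch of such an index follows. It uses the limited XPath support of the standard `xml.etree.ElementTree` module; the sample document, the helper names `build_index` and `pages_touched`, and the two page layouts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<lib>"
    "<book><title>A</title></book>"
    "<book><title>B</title></book>"
    "<cd><title>C</title></cd>"
    "</lib>"
)

def build_index(root, path):
    """Return the index entries for `path`: direct references to the target nodes."""
    return root.findall(path)

def pages_touched(targets, page_of):
    """Return the set of pages that must be read to visit every target node."""
    return {page_of[node] for node in targets}

titles = build_index(doc, "./book/title")  # entries for the two book titles

# Two hypothetical layouts for the indexed targets.
co_stored = {titles[0]: 0, titles[1]: 0}   # both targets on page 0
scattered = {titles[0]: 0, titles[1]: 1}   # targets on different pages
```

If the targets are visited together, the co-stored layout needs one page read where the scattered layout needs two; if they are rarely visited together, the page space used to co-store them might be better spent on other related nodes.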
A problem arises, however, in that indexing imposes a tree structure on the document that is different from the one defined by the parent-child and sibling relationships between nodes. While PIXSAR can carry out reclustering based on a tree structure and the edges within that tree structure that define relationships between nodes, it is not clear how it should proceed when two or more competing tree structures are used together to access the same data.
It is noted that in the known art, notwithstanding the presence of indexing, all the problems are explored on the basic tree. This is true of Lukes' algorithm, the XC algorithm and the original NATIX algorithms. Over time, the solutions were broadened to a tree augmented with sibling edges, as in the XS algorithm and the newer NATIX algorithms. However, there is no teaching of what to do in the case of competing trees superimposed on the same data.