Structure searches or containment searches, as well as value searches, are typically required in tree-structured document query processing, such as in Extensible Markup Language (XML) query processing. Searching structural or containment relationships, specifically parent-child or ancestor-descendant relationships, within a tree-structured XML document, is critical to answering many general queries against the document.
For example, in an XML document containing one or more phone call records, a containment query such as “//phone-call//Asian-News” is intended to find all the phone call records discussing “Asian News.” However, finding all the containment relationships that exist in a tree-structured document is very time consuming. A straightforward solution would require traversing the entire document tree, which is not always practical for a large tree-structured document. Hence, it is very important to have an efficient method for processing containment queries. Structural joins, or containment joins, are “set-at-a-time” operations that find all occurrences of the ancestor-descendant relationship between two different element sets in a tree-structured document.
In order for structural joins to work, each element in the tree-structured document is assumed to be labeled with a pair of numbers (start, end). These two numbers can represent the start and end positions of the element in the document; see, e.g., C. Zhang et al., “On Supporting Containment Queries in Relational Database Management Systems,” Proceedings of ACM SIGMOD 2001. However, in general, they need not be absolute positions. They can be relative positions, so long as the interval represents the region of an element occurrence in the document. Hence, the (start, end) intervals are also called region-encoded intervals. Inverted lists can be built on all the elements, with each list containing all the region-encoded intervals of an element in the document. The region-encoded interval labeling of elements and the creation of inverted lists need only be done once for each tree-structured document.
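By way of illustration only, the labeling and inverted-list construction described above might be sketched as follows. The sketch assumes relative positions drawn from a simple counter incremented on entering and leaving each element during a depth-first traversal; the function name build_inverted_lists and the sample document are hypothetical and are not taken from any of the cited references.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_inverted_lists(xml_text):
    """Label every element with a (start, end) region and group regions by tag.

    Positions are relative: a counter is incremented once when an element
    is entered and once when it is left, so a parent's region always
    encloses the regions of its descendants.
    """
    root = ET.fromstring(xml_text)
    inverted = defaultdict(list)
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        start = counter          # position on entering the element
        for child in elem:
            visit(child)
        counter += 1             # position on leaving the element
        inverted[elem.tag].append((start, counter))

    visit(root)
    return dict(inverted)


# Hypothetical sample document with two phone call records.
doc = (
    "<log>"
    "<phone-call><topic><Asian-News/></topic></phone-call>"
    "<phone-call><topic/></phone-call>"
    "</log>"
)
lists = build_inverted_lists(doc)
print(lists["phone-call"])   # [(2, 7), (8, 11)]
print(lists["Asian-News"])   # [(4, 5)]
```

In this sketch each inverted list is keyed by element name, so the two lists printed above could serve directly as the ancestor and descendant inputs of a structural join for the query “//phone-call//Asian-News.”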
It is known that changes or updates may occur to a tree-structured document. When updates occur, element re-labeling might be needed because the positions of elements may change as a result. However, the invention does not focus on element re-labeling. Rather, the invention focuses on techniques for performing structural joins between two element sets. Each element in the set is represented as an interval.
The structural relationship between two element nodes can be determined from their region-encoded intervals, where each element is assigned a pair of numbers (start, end) based on its position in the XML document tree. With such a region-encoding scheme, the following holds: For any two distinct elements u and v, (1) the region of u is either completely before or completely after the region of v, or (2) the region of u either completely contains the region of v or is completely contained by it. In other words, if there is any overlap between two intervals, the overlap is complete containment.
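The property just stated can be expressed as a small classification routine. The following is a minimal sketch, assuming each region is a (start, end) tuple with start less than end; the function name relationship and the returned strings are hypothetical labels chosen for illustration.

```python
def relationship(u, v):
    """Classify region u relative to region v under a region encoding.

    Returns one of "before", "after", "contains", "contained".
    A partial overlap cannot occur between two distinct elements of a
    validly region-encoded tree, so it is reported as an error.
    """
    u_start, u_end = u
    v_start, v_end = v
    if u_end < v_start:
        return "before"        # u lies completely before v
    if v_end < u_start:
        return "after"         # u lies completely after v
    if u_start < v_start and v_end < u_end:
        return "contains"      # u is an ancestor of v
    if v_start < u_start and u_end < v_end:
        return "contained"     # u is a descendant of v
    raise ValueError("partial overlap: not a valid region encoding")


print(relationship((2, 7), (4, 5)))    # contains
print(relationship((2, 7), (8, 11)))   # before
```

Only the two comparisons in the "contains" branch are needed to decide an ancestor-descendant relationship, which is why region encoding allows containment to be tested without traversing the document tree.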
A structural join finds all occurrences of a structural relationship between two element sets in a document, where each element is represented as an interval with two numbers. More formally, given two input lists, AList of potential ancestors (or parents) and DList of potential descendants (or children), where each element in the lists is at least of the format (start, end), a structural join reports all pairs (a, d), where a ∈ AList and d ∈ DList, such that a.start < d.start < d.end < a.end. In other words, a structural join reports all pairs (a, d), where a ∈ AList and d ∈ DList, such that interval a contains interval d.
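The definition above can be made concrete with a brute-force baseline. The following sketch is a plain nested-loop join written only to illustrate the join condition; it is not the technique of the invention nor any of the approaches cited herein, and the function name naive_structural_join is hypothetical. It compares every pair, so its cost grows with the product of the two list sizes.

```python
def naive_structural_join(alist, dlist):
    """Report all pairs (a, d) such that interval a contains interval d.

    alist and dlist are lists of (start, end) tuples. Neither list has
    to be sorted or indexed, at the price of comparing every pair.
    """
    result = []
    for a in alist:
        for d in dlist:
            # Join condition: a.start < d.start < d.end < a.end
            if a[0] < d[0] < d[1] < a[1]:
                result.append((a, d))
    return result


# Hypothetical inputs: two candidate ancestors, three candidate descendants.
ancestors = [(8, 11), (2, 7)]
descendants = [(9, 10), (4, 5), (13, 14)]
print(naive_structural_join(ancestors, descendants))
# [((8, 11), (9, 10)), ((2, 7), (4, 5))]
```

The descendant (13, 14) appears in no output pair, which illustrates the kind of comparison that more efficient approaches attempt to skip.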
There are existing approaches for performing structural joins with two input interval lists. Among them are: (a) C. Zhang et al., “On Supporting Containment Queries in Relational Database Management Systems,” Proceedings of ACM SIGMOD 2001; (b) D. Srivastava et al., “Structural Joins: A Primitive for Efficient XML Query Pattern Matching,” Proceedings of IEEE International Conference on Data Engineering, 2002; (c) S.-Y. Chien et al., “Efficient Structural Joins on Indexed XML Documents,” Proceedings of VLDB, 2002; and (d) H. Jiang et al., “XR-tree: Indexing XML Data for Efficient Structural Joins,” Proceedings of IEEE International Conference on Data Engineering, 2003.
Most of the existing approaches assume either that both element lists are sorted or that both element lists have indexes built on them. The goal is to skip unnecessary interval comparisons. In the XR-Tree approach, each input element list has an XR-Tree index and both element lists are sorted. The XR-Tree is a rather complex balanced-tree index structure. It maintains in each of its internal nodes a stab list, containing all elements stabbed by at least one key in the node. The focus is to skip elements that will not result in a joined output pair. However, the requirements of sorting the two input lists and maintaining two complex XR-Trees, one for each list, have significant drawbacks. First, sorting the two input lists can take a lot of time. Second, it is rather costly to construct two XR-Tree indexes, making it infeasible to build the indexes on-the-fly. Hence, the XR-Tree indexes must be pre-built offline. Offline index building has a clear disadvantage: because of storage constraints, not all elements in an XML database can be indexed. These drawbacks are particularly severe when the input lists are large.
Recently, a perfect binary tree encoding approach has also been proposed to perform structural joins without requiring sorted or indexed input lists; see, e.g., W. Wang et al., “PBiTree: Coding and Efficient Processing of Containment Joins,” Proceedings of IEEE ICDE 2003. Rather than performing structural joins on two interval lists, the PBiTree approach first embeds an XML document data tree into a perfect binary tree and assigns proper labels from the binary tree to each of the elements in the XML document. By so doing, it transforms the problem of interval joins (or θ-joins) into equi-joins. Then, the approach relies on traditional database equi-join operations to perform the final joins. However, because database operations usually involve many disk input/output (I/O) operations, this approach can still be inefficient.
Hence, a need is recognized to perform efficient structural joins of two interval lists which are neither sorted nor pre-indexed.