The following prior art references are cited numerically in this application:
[1] S. Al-Khalifa, H. Jagadish, N. Koudas, J. Patel, D. Srivastava, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. Proceedings of the 18th International Conference on Data Engineering (ICDE 2002), p. 141.
[2] J. Bremer and M. Gertz. On distributing XML repositories. International Workshop on Web and Databases (WebDB), Jun. 12-13, 2003, San Diego, Calif.
[3] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, p. 310-321.
[4] S. Chien, Z. Vagena, D. Zhang, V. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. Proceedings of the 28th International Conference on Very Large Databases (VLDB), p. 263-274, 2002.
[5] World Wide Web Consortium. XQuery 1.0: An XML query language, August 2001. <http://www.w3.org/TR/xquery/>
[6] Reference deleted
[7] Marcus Fontoura, Jason Zien, Eugene Shekita, Sridhar Rajagopalan, and Andreas Neumann. High performance index build algorithms for intranet search engines. Proceedings of the 30th International Conference on Very Large Databases (VLDB), p. 245-256, 2004.
[8] H. Garcia-Molina, J. Ullman, and J. Widom. Database System Implementation. Prentice Hall, 2000.
[9] K. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. Proceedings of the 23rd International Conference on Very Large Databases (VLDB), p. 436-445, 1997.
[10] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[11] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, p. 16-27.
[12] H. Jiang, H. Lu, W. Wang, and B. C. Ooi. XR-tree: Indexing XML data for efficient structural join. Proceedings of the 19th International Conference on Data Engineering (ICDE 2003), p. 253-263.
[13] H. Jiang, W. Wang, and H. Lu. Efficient processing of XML twig queries with or-predicates. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, p. 59-70.
[14] H. Jiang, W. Wang, H. Lu, and J. Yu. Holistic twig joins on indexed XML documents. Proceedings of the 29th International Conference on Very Large Databases (VLDB), p. 273-284, 2003.
[15] R. Kaushik, P. Bohannon, J. Naughton, and H. F. Korth. Covering indexes for branching path queries. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, p. 133-144.
[16] R. Kaushik, P. Bohannon, J. Naughton, and P. Shenoy. Updates for structure indexes. Proceedings of the 28th International Conference on Very Large Databases (VLDB), p. 239-250, 2002.
[17] R. Kaushik, R. Krishnamurthy, J. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, p. 779-790.
[18] T. Milo and D. Suciu. Index structures for path expressions. Proceedings of the 7th International Conference on Database Theory, 1999, p. 277-295.
[19] M. Olson, K. Bostic, and M. Seltzer. Berkeley DB. Proceedings of the USENIX 1999 Annual Technical Conference, June 1999.
[20] I. Tatarinov, S. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang. Storing and querying ordered XML using a relational DBMS. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, p. 204-215.
[21] H. Wang, S. Park, W. Fan, and P. Yu. ViST: A dynamic index method for querying XML data by tree structures. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, p. 110-121.
[22] XMark: The XML benchmark project. <http://monetdb.cwi.nl/xml/index.html>
[23] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems. ACM SIGMOD Record, v. 30, issue 2, June 2001, p. 425-436.
In recent years, XML has become the standard format for data exchange across business applications. Its widespread use has sparked a large amount of research, focused on providing efficient query processing over large XML repositories. Processing XML queries has proven to be challenging, due to the semi-structured nature of the data and the flexible query capabilities offered by languages such as XQuery [5].
XML queries often include both value and structural constraints. For example, the XQuery expression:
//article//section[
//title contains(‘Query Processing’) AND
//figure//caption contains(‘XML’)]
returns all article sections that are titled “Query Processing” and have a figure containing the caption “XML”. We can represent this query with the node-labeled tree shown in FIG. 1. Nodes are labeled with element tags and text values. In this query, the structural predicate spans five elements and multiple text values in a complex twig pattern.
Structural joins are a core operation for any XML query processor and typically account for the bulk of the query processing cost [1]. As a result, a large body of work has focused on efficient algorithms to process binary structural joins [1, 4, 23] and, more recently, holistic path/twig joins [3, 14]. These algorithms are all index-based: they rely on an inverted index for positional information about elements, and they use cursors to access the inverted index.
XML data is commonly modeled by a tree structure, where nodes represent elements, attributes and text data, and parent-child edges represent nesting between elements. Elements and text values are associated with a position in the document. Most existing XML query processing algorithms rely on begin/end/level positional encoding (or BEL), which represents each element with a tuple (begin, end, level) based on its position in the tree. A less-used alternative is Dewey encoding (e.g., [11, 20]), defined as follows: if we assign to each element a value that is the element's order among its siblings, then the Dewey location of element e is the vector of values of all elements on the path from the root to e, inclusive. With both BEL and Dewey encoding, structural relationships between two elements can be easily determined given their positions [20]. FIG. 2 illustrates both encodings over a sample XML document.
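As a minimal sketch of how structural relationships can be read off positions, the following assumes that a BEL ancestor's (begin, end) interval strictly contains the interval of each descendant, and that a Dewey position is a tuple of sibling orders from the root. The type and function names are hypothetical and are not taken from the cited references.

```python
from typing import NamedTuple, Tuple


class BEL(NamedTuple):
    """Begin/end/level position of an element (illustrative type)."""
    begin: int
    end: int
    level: int


def bel_is_ancestor(a: BEL, d: BEL) -> bool:
    # Assumption: an ancestor's interval strictly contains the descendant's.
    return a.begin < d.begin and d.end < a.end


def bel_is_parent(a: BEL, d: BEL) -> bool:
    # A parent is an ancestor exactly one level above.
    return bel_is_ancestor(a, d) and a.level + 1 == d.level


Dewey = Tuple[int, ...]


def dewey_is_ancestor(a: Dewey, d: Dewey) -> bool:
    # An ancestor's Dewey position is a proper prefix of the descendant's.
    return len(a) < len(d) and d[:len(a)] == a


def dewey_is_parent(a: Dewey, d: Dewey) -> bool:
    # A parent's position is the descendant's position minus its last value.
    return len(a) + 1 == len(d) and d[:-1] == a
```

With either encoding, the test needs only the two positions, not a traversal of the document tree.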
Structural predicates can also be viewed as a tree, where the label of each “query node” is defined by the element tag or text value represented by the node. Path queries (e.g., “//a//b//c”) and binary structural predicates (e.g., “a//b”) are degenerate cases of the general twig pattern of structural predicates. An XML database is simply a collection of XML documents. As stated in [3], matching a structural predicate against an XML database means finding all distinct occurrences of the tree pattern in the database. A match for a pattern Q over database D is a mapping from nodes in Q to nodes in D such that both structural and value-based predicates are satisfied. The answer to Q, where Q has n nodes, can be represented as an n-ary relation where each tuple (d1, d2, . . . , dn) consists of the database node IDs that identify a distinct match of Q in D.
Inverted Indices
Index-based approaches to evaluating structural queries in XML (e.g., [4, 12, 14]) are based on an index over the positions of all elements in the database. By far the most common implementation of this index is an inverted index [10], which is frequently used in information retrieval and XML systems alike (e.g., [1, 3, 4]).
Briefly, an inverted index consists of one posting list per distinct token in the dataset, where a token may represent a text value or element tag. Each posting list is a sorted list of postings with format (Pos,Data). There is one posting per occurrence of the token in the dataset. Pos represents the position of the element occurrence, and Data holds some user-defined data, which for now we assume is empty. The list is sorted by Pos. Stepping through each posting in a list will provide us with the positions of every element (or text value) with a given tag in the dataset, in order of appearance. As with most IR systems, we assume each posting list is indexed, typically with a B-tree, such that searching for a particular position in the posting list is efficient.
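The structure described above can be sketched as follows. This is an illustration only: positions are simplified to single integers, Data is left empty (None), and a binary search over the sorted list stands in for the B-tree. The names build_inverted_index and seek are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# A posting is (Pos, Data); Data is empty (None) in this sketch.
Posting = Tuple[int, Optional[object]]


def build_inverted_index(
    occurrences: Iterable[Tuple[str, int]]
) -> Dict[str, List[Posting]]:
    """Build one posting list per distinct token, sorted by position."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for token, pos in occurrences:
        index[token].append((pos, None))  # one posting per occurrence
    for postings in index.values():
        postings.sort(key=lambda posting: posting[0])
    return dict(index)


def seek(postings: List[Posting], pos: int) -> int:
    """Return the index of the first posting whose Pos is >= pos.

    Binary search stands in for the B-tree lookup. The probe (pos,) sorts
    before any posting (pos, data), so Data values are never compared.
    """
    return bisect.bisect_left(postings, (pos,))
```

Stepping through index["section"] in this sketch yields the positions of every "section" occurrence in order of appearance, and seek( ) finds a position without scanning the list from the start.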
Each node q in a twig pattern is associated with an element tag or text value; hence each node is associated with exactly one posting list in the inverted index. To process a structural predicate, we retrieve and scan one posting list per node. For example, to process the query shown in FIG. 1, we need eight posting lists: one for each of the five query nodes representing an element, and three for the query nodes representing text values (the ‘Query’ and ‘Processing’ text values each have their own posting list). We call the current position of the scan operator the cursor over the posting list. In particular, we will use Cq to denote the cursor over the posting list of query node q.
Performing a structural join involves moving these cursors in a coordinated way to meet the ancestor-descendant and parent-child constraints imposed by the query. Three basic operations over the cursors are required for the majority of index-based structural join algorithms [14]:
    advance( ): advances the cursor to the next position in the posting list.
    fwdBeyond(Position p): advances the cursor to the first element whose position is greater than or equal to p.
    fwdToAnc(Position p): advances the cursor to the first ancestor of p at or following the current cursor position, and returns TRUE. If no such ancestor exists, it stops at the first element e that has a greater position than p, and returns FALSE.
To maintain optimality bounds of existing algorithms, cursors may only move forwards, never backwards. Note that indices (e.g., a B-tree) are required over posting lists to efficiently implement fwdBeyond( ) and fwdToAnc( ); again, such indices over the posting lists are common.
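The three cursor operations can be sketched as below, assuming BEL positions ordered by their begin value. The class and the Python method names (fwd_beyond, fwd_to_anc) are hypothetical renderings of the operations above. For brevity fwd_to_anc scans linearly, whereas an actual implementation would use an index over the posting list to skip ahead.

```python
import bisect
from typing import List, NamedTuple, Optional


class BEL(NamedTuple):
    begin: int
    end: int
    level: int


class Cursor:
    """Forward-only cursor over a posting list sorted by begin position."""

    def __init__(self, postings: List[BEL]) -> None:
        self.postings = postings
        self.begins = [p.begin for p in postings]  # search keys
        self.i = 0

    def current(self) -> Optional[BEL]:
        # None signals that the cursor has run off the end of the list.
        return self.postings[self.i] if self.i < len(self.postings) else None

    def advance(self) -> None:
        if self.i < len(self.postings):
            self.i += 1

    def fwd_beyond(self, begin: int) -> None:
        # Jump to the first posting with begin >= the target; max() keeps
        # the cursor from ever moving backwards.
        self.i = max(self.i, bisect.bisect_left(self.begins, begin))

    def fwd_to_anc(self, p: BEL) -> bool:
        # Linear scan for clarity only (see the note in the lead-in).
        while self.i < len(self.postings):
            e = self.postings[self.i]
            if e.begin > p.begin:
                return False  # stopped at the first element past p
            if e.begin < p.begin and e.end > p.end:
                return True   # e's interval contains p: e is an ancestor
            self.i += 1
        return False
```

For example, with nested "section" postings (2, 9, 1) and (3, 6, 2) followed by (10, 15, 1), repeated calls to fwd_to_anc for an element at (4, 5, 3), each followed by advance( ), visit both enclosing sections and then stop at (10, 15, 1) with a FALSE result.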
Path Index
A path index is a structural summary of the XML dataset (e.g., [9, 15]). In its simplest conceptual form, a path index is a list of (path, pathID) entries, where there exists one entry per unique path in the dataset. Each path is assigned a unique path ID (or PID). FIG. 3 shows the path index for the dataset modeled in FIG. 2.
We say a PID qualifies for a given pattern if the associated path matches this pattern. Over the path index, we define the function GetQualifyingIDs: path pattern → {PID}, which maps a path pattern to the set of qualifying PIDs. For example, over the path index in FIG. 3, a call to GetQualifyingIDs(“//R//B”) will return the set {3, 6, 7}. The actual implementation of GetQualifyingIDs( ) is fairly straightforward and is described in [17].
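One simple way to realize such a function, shown here only as a sketch and not as the implementation of [17], is to translate the path pattern into a regular expression and test it against every path in the index. The path index below is hypothetical (it is not the index of FIG. 3), and patterns are assumed to contain only the "/" and "//" axes with plain tag names.

```python
import re
from typing import Dict, Set


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a path pattern such as "//R//B" into a regex over paths."""
    pieces = []
    # The capturing group keeps the separators in the split result.
    for part in re.split(r"(//|/)", pattern):
        if part == "//":
            # Descendant axis: any number of intermediate steps.
            pieces.append(r"/(?:[^/]+/)*")
        elif part == "/":
            # Child axis: exactly one step.
            pieces.append("/")
        elif part:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def get_qualifying_ids(path_index: Dict[str, int], pattern: str) -> Set[int]:
    """Return the PIDs of all root-to-node paths that match the pattern."""
    regex = pattern_to_regex(pattern)
    return {pid for path, pid in path_index.items() if regex.fullmatch(path)}


# Hypothetical path index: one (path, PID) entry per unique path.
PATH_INDEX = {"/R": 1, "/R/A": 2, "/R/A/B": 3, "/R/B": 4, "/R/B/C": 5}
```

Over this hypothetical index, get_qualifying_ids(PATH_INDEX, "//R//B") selects the PIDs of "/R/A/B" and "/R/B". Because a path index has one entry per unique path rather than per element, scanning it in full is usually cheap relative to scanning the posting lists.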
Given a path index, every position in the inverted index is now associated with a PID. Again, we refer readers to [17] for a discussion on how to integrate PIDs into the inverted index and use them during query processing. In brief, every posting in the index contains the PID of the corresponding element. Integration of PIDs into the index incurs an overhead on index size and build time, addressed below.
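To illustrate why storing a PID in each posting is useful, the sketch below filters a posting list down to the elements whose path qualifies for a pattern. It assumes postings of the form (Pos, PID) with integer positions; the function name is hypothetical, and the integration described in [17] may differ.

```python
from typing import Iterable, Iterator, Set, Tuple

# Assumed posting layout for this sketch: (Pos, PID).
Posting = Tuple[int, int]


def filter_by_pid(
    postings: Iterable[Posting], qualifying: Set[int]
) -> Iterator[Posting]:
    """Yield, in posting-list order, only postings whose PID qualifies."""
    for pos, pid in postings:
        if pid in qualifying:
            yield (pos, pid)
```

Postings whose PID falls outside the qualifying set can then be skipped without any positional comparison against other posting lists.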
Ancestor Information
With ancestor information, we can efficiently obtain the ancestors of any given element. There are many possible approaches to augmenting the index with ancestor information. One elegant approach is to use Dewey position encoding, rather than the popular BEL encoding. As illustrated in FIG. 2, the Dewey positions of all ancestors of an element are encoded in the prefixes of that element's position. In contrast, although BEL encoding allows us to easily determine whether a given element is an ancestor of another given element, it does not allow us to immediately produce the positions of all ancestors given a single element. We note that other encodings such as [2] also provide ancestor information; we use Dewey encoding for its relative simplicity and popularity compared to these other approaches.
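The prefix property can be made concrete with a short sketch, assuming a Dewey position is represented as a tuple of integers whose first value belongs to the root. The function name is hypothetical.

```python
from typing import List, Tuple

Dewey = Tuple[int, ...]


def dewey_ancestors(pos: Dewey) -> List[Dewey]:
    """Return the positions of all ancestors of pos, from the root down.

    Every proper, non-empty prefix of a Dewey position is the position of
    an ancestor, so no index lookup is needed to enumerate them.
    """
    return [pos[:k] for k in range(1, len(pos))]
```

For an element at (1, 2, 3), this produces (1,) and (1, 2). A BEL tuple alone carries no comparable information, which is the contrast drawn above.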
The problem of exploiting indices to enhance XML join algorithms has been studied in [4, 12, 14, 17, 21]. Reference [21] presents the ViST index structure and algorithms for twig join processing via subsequence matching.
References [4, 12] use indices over posting lists to speed up processing of binary structural joins. They use B-trees to speed up the location of descendants of a given element, and [12] uses a specialized XR-tree to speed up the location of ancestors. However, the specialized XR-tree index structure provides only partial ancestor information: given an element tag T and an element e, it returns all ancestors of e with the tag T.
Reference [14] presents an improved holistic twig join algorithm over [3] that exploits indices (such as B-trees).
Reference [17] introduced the problem of integrating inverted indices and path indices to answer XML joins.