“Extensible Markup Language” (XML) is a textual notation for a class of data objects called “XML Documents” and partially describes a class of computer programs that process them. A characteristic of XML documents is that they use a hierarchical structure to organize information within the documents. This hierarchical structure may be represented using a rooted-tree data structure with nodes representing the “elements” of the XML document. Element nodes have a tag name, may be associated with named attributes, and may have relationships to other nodes in the tree, such as “parent” and “child” relationships. In addition, element nodes may contain data in various forms (specifically text, comments, and special “processing instructions”).
XML Document Trees.
An XML document can be represented as a labeled tree whose nodes represent the structural components of the document—elements, text, attributes, comments, and processing instructions. Element and attribute nodes have labels derived from the corresponding tags in the document and there may be more than one node in the document with the same label. Parent-child edges in the tree represent the inclusion of the child component in its parent element, where the scope of an element is bounded by its start and end tags. The tree corresponding to an XML document is rooted at a virtual element, called the root, which represents the document itself. Hereinafter, XML documents will be discussed in terms of their tree representations. One can define an arbitrary order on the nodes of a tree. One such order might be based on a left-to-right depth-first traversal of the tree, which, for a tree representation of an XML document, corresponds to the document order. The memory footprint of an XML document can be large. XML processors may not be able to handle large documents due to the memory requirement of storing the entire document. As a result, in processing XML, reducing the memory overhead of an XML document is of great importance.
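As an illustration only, the correspondence between a left-to-right depth-first traversal and document order can be sketched in Python with the standard xml.etree.ElementTree module. The sample document and the function name document_order are assumptions made for this example; the sketch covers element nodes only, since this library stores text on the elements rather than as separate nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modeled on a small book catalog.
SAMPLE = (
    "<catalog>"
    "<book><title>compilers</title></book>"
    "<book><title>algorithms</title></book>"
    "</catalog>"
)

def document_order(root):
    """Return the elements under root in left-to-right depth-first order."""
    ordered = []
    stack = [root]
    while stack:
        node = stack.pop()
        ordered.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(list(node)))
    return ordered

root = ET.fromstring(SAMPLE)
tags = [element.tag for element in document_order(root)]
```

Each start tag appears in the document in the same order in which the traversal visits the corresponding element, which is why this traversal order coincides with document order.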
XPath.
“XML Path Language” (XPath) is a query language for creating an expression that selects nodes of data from an XML document. XPath is used to address XML data using path notation to navigate through the hierarchical structure of an XML document. XPath queries allow applications to determine if a given node matches a pattern, including patterns involving its location in the XML document hierarchy.
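A minimal sketch of path-based selection and pattern matching follows, assuming Python's xml.etree.ElementTree module, which supports only a limited subset of XPath. The sample document and the helper name matches are hypothetical and introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = (
    "<catalog>"
    "<book><title>compilers</title></book>"
    "<book><title>algorithms</title></book>"
    "</catalog>"
)

root = ET.fromstring(SAMPLE)  # root is the <catalog> element

# Path notation: <title> children of <book> children of the context element.
titles = root.findall("./book/title")

# Descendant shorthand: every <title> below the context element, at any depth.
all_titles = root.findall(".//title")

def matches(node, pattern, context):
    """Return True if node is among the nodes that pattern selects from context."""
    return any(node is hit for hit in context.findall(pattern))

title_texts = [t.text for t in titles]
```

Here matches(titles[0], ".//title", root) would report that the first title element matches the pattern, whereas the catalog element itself would not.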
XPath has been widely accepted in many environments, especially in database environments. Given the importance of XPath as a mechanism for querying and navigating data, it is important that the evaluation of XPath expressions on XML documents be as efficient as possible.
XPath Axes.
Given an order on a tree, we can define the notions of forward and backward relations on the tree. A relation R is a forward relation if, whenever two nodes x and y are related by R, x precedes y in the order on the tree. Similarly, a relation R is a backward relation if, whenever x is related to y by R, x follows y in the order on the tree. For example, assuming the document order for a tree representation of an XML document, the child and descendant relations are both forward relations, whereas the parent and ancestor relations are both backward relations.
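The definitions of forward and backward relations can be checked mechanically on a small tree. The sketch below assumes Python's xml.etree.ElementTree module and a hypothetical four-element document. It represents each relation as a list of (x, y) pairs, where y is reached from x along the corresponding axis, so that the pair (p, c) in the child relation means that c is a child of p.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample tree: a has children b and d; b has child c.
SAMPLE = "<a><b><c/></b><d/></a>"
root = ET.fromstring(SAMPLE)

# Position of each element in document order (Element.iter() is pre-order).
position = {node: i for i, node in enumerate(root.iter())}

# Each relation is a list of (x, y) pairs, y being reachable from x via the axis.
child = [(p, c) for p in root.iter() for c in p]
descendant = [(a, d) for a in root.iter() for d in a.iter() if d is not a]
parent = [(c, p) for p, c in child]
ancestor = [(d, a) for a, d in descendant]

def is_forward(relation):
    """True if x precedes y in document order for every related pair."""
    return all(position[x] < position[y] for x, y in relation)

def is_backward(relation):
    """True if x follows y in document order for every related pair."""
    return all(position[x] > position[y] for x, y in relation)
```

On this tree, is_forward holds for child and descendant and is_backward holds for parent and ancestor, in agreement with the classification above.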
An XPath expression over the tree representation of an XML document is evaluated in terms of a context node. The context node is a node in the tree representation of the document, and is well known to those of skill in the art of XML. If the context node is the root node of the document, the XPath expression is said to be an absolute XPath expression; otherwise, it is known as a relative XPath expression. Starting at a context node, an XPath expression specifies the axis to search and the conditions that the results should satisfy. For example, assume that the context node is an element node c in the tree representation of an XML document. The XPath expression “descendant::x” specifies a descendant axis, where searching begins at the context node, and produces a sequence of all element nodes that are descendants of the node c and are labeled “x”. One can combine XPath expressions to form larger XPath expressions. For example, the XPath expression “descendant::x/ancestor::y” specifies that, starting from the context node c, all element nodes that are descendants of c with label x are found, and, for each such node, all ancestor nodes with label y are found.
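The evaluation of “descendant::x/ancestor::y” from a context node c can be sketched as follows. This is a minimal illustration, not a general XPath evaluator: it assumes Python's xml.etree.ElementTree module, which does not itself support the ancestor axis, so a parent map is built by hand. The sample document, the id attributes, and all function names are hypothetical. The sketch also assumes that duplicate results are removed and that results are returned in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the context node <c> sits below one <y> and contains another.
SAMPLE = "<y id='1'><c><y id='2'><x/></y><x/><z><x/></z></c></y>"

def build_parent_map(root):
    """Map each element to its parent; ElementTree keeps no parent pointers."""
    return {child: parent for parent in root.iter() for child in parent}

def descendant(node, label):
    """descendant::label, i.e. matching elements strictly below node."""
    return [d for d in node.iter(label) if d is not node]

def ancestor(node, label, parents):
    """ancestor::label, i.e. matching elements on the path from node up to the root."""
    found = []
    current = parents.get(node)
    while current is not None:
        if current.tag == label:
            found.append(current)
        current = parents.get(current)
    return found

def descendant_then_ancestor(context, x, y, root):
    """Evaluate descendant::x/ancestor::y starting from context."""
    parents = build_parent_map(root)
    position = {n: i for i, n in enumerate(root.iter())}
    hits = set()
    for d in descendant(context, x):
        hits.update(ancestor(d, y, parents))
    # Return the distinct results in document order.
    return sorted(hits, key=position.__getitem__)

root = ET.fromstring(SAMPLE)
c = root.find("c")
ids = [n.get("id") for n in descendant_then_ancestor(c, "x", "y", root)]
```

In this example the result includes the outer y element, which lies outside the subtree rooted at c. A backward axis can therefore reach nodes that a purely downward traversal from the context node never visits.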
XML Processing.
In traditional XML processing, a tree representation of an XML document that is to be processed is built in memory. When the document is large, this construction of the tree representation, for example, as an instance of the familiar Document Object Model (DOM), may be prohibitively expensive in both time and memory. For large documents, XML processing may fail due to the large memory requirements of the document. In main-memory XML processors, one of the primary sources of overhead is the cost of constructing and manipulating main-memory representations of XML documents.
The cost of construction of an in-memory data model instance of an XML document can be reduced significantly if only those portions that are relevant to the processing are instantiated. This insight is the basis for projection, an optimization introduced by Marian and Simeon. See Marian and Simeon, Projecting XML Documents, Proceedings of the 29th VLDB Conference, Berlin, Germany (2003). Given a set of XPath expressions and an XML document, a projected document is constructed such that the result of the execution of the set of XPath expressions on the projected document is the same as that of the execution of the set on the original document. For example, FIG. 1 depicts the tree representation of an XML document. The root node of the tree is root 100, and a “catalog” node 110 is the next node. One branch begins at node 120 and another at node 160. The “book” nodes 130 and 170 begin another pair of sub-branches: the branch under “book” 130 includes “title” 140 and “compilers” 150, and the branch under “title” 180 includes the “algorithms” node 190. The boxes in the figure with thick borders denote the projection of the document with respect to the XPath expression “//Title”, which selects for the projection all elements with the tag “title” in the document, in this case, elements 140 and 180. The projected document is usually substantially smaller than the original document. As a result, the in-memory construction time is lower than it would be otherwise. Moreover, as a side effect, the smaller size of the projected document results in lower query evaluation times on the projected document than a similar evaluation on the original document.
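The defining property of a projected document, namely that the query returns the same result on the projection as on the original, can be illustrated with a simplified sketch. The code below is not the technique of Marian and Simeon: it projects an already-built tree rather than projecting during parsing, and it handles only a single descendant query for one tag. It assumes Python's xml.etree.ElementTree module, and the sample document (including the “category” and “price” elements) and the function names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical catalog document; "category" and "price" are invented for the example.
SAMPLE = (
    "<catalog>"
    "<category><book><title>compilers</title><price>50</price></book></category>"
    "<category><book><title>algorithms</title><price>60</price></book></category>"
    "</catalog>"
)

def project(node, label):
    """Return a pruned copy of node that keeps only the elements labeled
    `label` (with their subtrees) and the ancestors leading to them.
    Returns None when nothing under node needs to be kept."""
    if node.tag == label:
        return copy.deepcopy(node)  # keep the selected element and its content
    kept = [p for p in (project(child, label) for child in node) if p is not None]
    if not kept:
        return None
    clone = ET.Element(node.tag, dict(node.attrib))
    clone.extend(kept)
    return clone

def select(root, label):
    """Text of every element labeled `label`, in document order."""
    return [e.text for e in root.iter(label)]

original = ET.fromstring(SAMPLE)
projected = project(original, "title")

original_size = len(list(original.iter()))
projected_size = len(list(projected.iter()))
```

In this sketch the “price” elements are dropped from the projected tree, so the projected tree has fewer nodes, while select returns the same titles for both trees.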
The drawback to current techniques of projection is that they cannot handle complex XPath expressions. Current techniques are only defined for queries using child and descendant axes—other XPath axes such as parent and ancestor are not supported by the current schemes. Therefore, a need exists to overcome the problems with the prior art as discussed above, and particularly, for a way to project XML documents efficiently when XPath expressions contain axes other than child and descendant.