1. Field of the Invention
The present invention generally relates to the querying of hierarchical documents, for example XML documents queried using the XPath language. More specifically, the invention determines whether pre-computed information stored in structures such as materialized views and indexes can accelerate query processing.
2. Description of the Related Art
XML is becoming increasingly popular for representing and exchanging large amounts of data, particularly with respect to the internet. XML documents are one type of self-describing structured document. Consequently, there is a pressing need to persistently store and efficiently query these documents. To address this need, W3C has proposed an XML query language, XQuery [30]. Simultaneously, ANSI and ISO are defining SQL/XML, a new part of the SQL standard that extends SQL to handle XML data [26]. SQL/XML defines XML as a new SQL data type, and provides a set of functions to create, manipulate, and query XML data.
XPath [29] is a widely accepted W3C standard language for navigating and extracting fragments of XML documents, for both the XQuery and the SQL/XML query languages, and is employed by other XML-related technologies such as XSLT and XPointer. The XPath 2.0 data model is based on the notion of sequences, which are ordered collections of zero or more items. An item may be one of the seven types of nodes [31], or a simple value, as defined in the XML Schema [28] specification. XPath 2.0 is designed to be embedded in a host language such as XQuery, XSLT or SQL/XML. It is a reference-based language. Hence, subsequent expressions on the results of an XPath expression may traverse the document in both the reverse and forward directions.
The fundamental constructs of this language are path expressions, which are used to locate nodes in an XML tree (in this application, we only consider XML in the form of trees, and do not address idrefs). An XPath expression consists of a sequence of steps, and each step has an axis, a node test, and possibly some predicates, which can include conjunction and disjunction and can be arbitrarily complex. XPath defines a full set of axes for traversing XML trees in the forward (child, descendant, attribute, etc.) or reverse (parent, ancestor, etc.) direction. A node test is either a node kind test or a name test, and must be true for the node to be selected. A predicate can be a conjunction or a disjunction of comparison predicates, nested path expressions, or any other complex expression. The result of a path expression is a sequence of node references.
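The anatomy of a path expression described above (axis, node test, optional predicate) can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module, which implements only a small XPath 1.0 subset rather than XPath 2.0; the sample document and variable names are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
doc = ET.fromstring(
    "<catalog>"
    "<book year='2002'><title>A</title><price>30</price></book>"
    "<book year='1999'><title>B</title></book>"
    "<journal><title>C</title></journal>"
    "</catalog>"
)

# Forward child axis with a name test: book children of the root.
books = doc.findall("./book")

# Forward descendant axis: title elements at any depth.
titles = doc.findall(".//title")

# Predicate containing a nested path: books that have a price child.
priced = doc.findall("./book[price]")

# Comparison predicate on an attribute.
from_2002 = doc.findall("./book[@year='2002']")

# Reverse (parent) step: parents of all price elements.
price_parents = doc.findall(".//price/..")

print([b.get("year") for b in books])
print([t.text for t in titles])
print([b.findtext("title") for b in priced])
print([b.findtext("title") for b in from_2002])
print([p.tag for p in price_parents])
```

Each call returns a list of references to nodes in the original tree, not copies, which loosely mirrors the reference-based result of a path expression.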
SQL/XML [10, 26, 25] is a new extension to the SQL standard that introduces a new SQL data type called XML, together with a set of functions to create, search, manipulate, and extract values of this type. A valid instance of the XML data type can be a well-formed XML document with its prolog, an XML element node, textual content, or a forest of element nodes. SQL/XML defines several functions to create XML data, as well as to publish relational data as XML. It also provides mechanisms to check and enforce DTD or schema compliance.
SQL/XML [25] has also outlined a set of functions to query and manipulate XML data, to extract XML values, and to convert XML values into text. XMLContains is a scalar boolean function which takes an XML value and an XPath expression as input, and returns true if the result of the XPath expression when executed on the input XML value is a non-empty sequence, and returns false otherwise. Similarly, XMLExtract is a scalar function that takes the same set of arguments as XMLContains, but returns an XML value which is the result of the input XPath expression when executed on the input XML value. The XMLExtract function actually extracts the result and creates a new XML data value. The result of an SQL/XML query is an instance of the SQL data model, which now allows XML as a valid data type.
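The semantics of the two functions described above can be mimicked outside a database with a short sketch. The Python functions xml_contains and xml_extract below are hypothetical stand-ins, not the SQL/XML functions themselves; they assume the XML value arrives as text, that paths are written relative to the root element, and that the limited XPath subset of xml.etree.ElementTree suffices.

```python
import xml.etree.ElementTree as ET


def xml_contains(xml_text, path):
    """Return True if the path selects a non-empty sequence of nodes."""
    root = ET.fromstring(xml_text)
    return len(root.findall(path)) > 0


def xml_extract(xml_text, path):
    """Return a new XML value (here, serialized text) holding the result.

    The result may be a forest of elements, so the selected nodes are
    serialized one after another without a wrapping element.
    """
    root = ET.fromstring(xml_text)
    return "".join(
        ET.tostring(node, encoding="unicode") for node in root.findall(path)
    )


# Hypothetical input value, invented for this illustration.
sample = "<a><b>1</b><c><b>2</b></c></a>"

print(xml_contains(sample, ".//b"))
print(xml_contains(sample, ".//d"))
print(xml_extract(sample, ".//b"))
```

The first function only tests for a non-empty result, whereas the second materializes a new value, which is the distinction the text draws between XMLContains and XMLExtract.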
In addition to XPath's utility for querying XML data, XPath expressions are also used to describe XML indexes. Accessing XML data often requires complicated navigation into the document, resulting in computationally expensive query processing. As a result, optimization of XPath expressions is vital to efficient processing of XML queries. Regular path queries are the main building blocks of XPath expressions. Rewriting and optimization of regular path queries have been studied in [14, 4, 5]. However, this work considers only linear paths, and hence those techniques are not applicable to complex XPath expressions involving nested predicates and branching.
Several recent papers explored indexing XPath expressions over XML data [24, 23, 17, 8, 12, 20, 15]. Much of this work assumed that every node is indexed, and ignored index maintenance costs. Some previous indexing work directly addressed indexing patterns. For example, in [19] the T-index was defined with a non-branching path expression and a matching algorithm was proposed, which was subsequently extended by [18].
In [18], “Containment and Equivalence for an XPath Fragment”, ACM PODS 2002, Jun. 3-6, 2002, Madison, Wis., pp. 65-76, which is hereby incorporated by reference, authors Gerome Miklau and Dan Suciu note that optimization of XPath expressions can be accomplished using an algorithm for containment. In other words, if it is known from the XPath expression describing a document fragment that the fragment contains the data required to answer a query, that knowledge can be exploited to avoid expensive navigation and query processing. For example, if a stored document fragment is described by the XPath expression P=/a//b, then this fragment can be used to answer the query Q=/a/b, because every node selected by Q is also selected by P, but it cannot be used to answer the query Q=/a//d. XPath query containment has also been studied by [9, 21].
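The example with P=/a//b can be checked mechanically with a minimal sketch. It is an illustrative assumption rather than the algorithm of [18]: it handles only linear paths built from child (/) and descendant (//) steps with plain name tests, and it tests whether the steps of P can be mapped, in order, onto steps of Q. The function names parse and contains are hypothetical.

```python
import re
from functools import lru_cache


def parse(path):
    """Split a linear path such as '/a//b' into (axis, name) steps."""
    steps = re.findall(r"(//|/)([A-Za-z_][\w.-]*)", path)
    if "".join(axis + name for axis, name in steps) != path:
        raise ValueError("unsupported path: " + path)
    return steps


def contains(p_path, q_path):
    """Return True if every node selected by Q is also selected by P."""
    p, q = parse(p_path), parse(q_path)

    @lru_cache(maxsize=None)
    def embed(i, j):
        # Steps p[i:] remain to be mapped; q[j] is the first unused Q step.
        if i == len(p):
            # The last step of P must land on the last step of Q.
            return j == len(q)
        axis, name = p[i]
        if axis == "/":
            # A child step must map to the very next Q step, itself a child.
            candidates = [j] if j < len(q) and q[j][0] == "/" else []
        else:
            # A descendant step may skip over any number of Q steps.
            candidates = range(j, len(q))
        return any(q[k][1] == name and embed(i + 1, k + 1) for k in candidates)

    return embed(0, 0)


print(contains("/a//b", "/a/b"))
print(contains("/a//b", "/a//d"))
print(contains("/a/b", "/a//b"))
```

The first call corresponds to the case where the fragment described by P can answer Q, and the second to the case where it cannot. Wildcards, predicates, and other axes are deliberately out of scope here.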
Miklau and Suciu found that even for XP{//,*,[ ]}, which is a subset of XPath containing descendant edges, star nodes, and branching, query containment is coNP-complete. They proposed a representation for XPath expressions and a sound but incomplete algorithm which uses tree mappings. They do not distinguish between next steps and predicates, and their representation is unfortunately not able to express disjunction, comparison predicates, or any axis other than child or descendant.
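A tree-mapping test of the kind referred to above can be sketched as follows. This is a simplified illustration under stated assumptions, not the cited authors' implementation: patterns are treated as boolean (no distinguished output node), the Node class and function names are hypothetical, and a successful mapping from P into Q is taken as sufficient evidence that Q is contained in P, while a failed mapping proves nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                  # element name or "*"
    edges: list = field(default_factory=list)   # list of (axis, Node)


def descendants(q):
    """Yield every pattern node strictly below q."""
    for _, child in q.edges:
        yield child
        yield from descendants(child)


def maps_to(p, q):
    """Return True if the subtree of P rooted at p maps onto q in Q."""
    # A star in P matches any label; a name in P needs the same name in Q.
    if p.label != "*" and p.label != q.label:
        return False
    for axis, p_child in p.edges:
        if axis == "/":
            # Child edges must map to child edges.
            targets = [c for a, c in q.edges if a == "/"]
        else:
            # Descendant edges may map to any downward path in Q.
            targets = descendants(q)
        if not any(maps_to(p_child, t) for t in targets):
            return False
    return True


# P = a[.//b] and Q = a[c/b]: a mapping exists, so Q is contained in P.
p1 = Node("a", [("//", Node("b"))])
q1 = Node("a", [("/", Node("c", [("/", Node("b"))]))])
print(maps_to(p1, q1))

# P = a[b][c] and Q = a[b]: the c branch of P has no image in Q.
p2 = Node("a", [("/", Node("b")), ("/", Node("c"))])
q2 = Node("a", [("/", Node("b"))])
print(maps_to(p2, q2))
```

Because branches and next steps are both stored as edges, the sketch also reflects the observation that this representation does not distinguish between next steps and predicates.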
Neven [21] showed that adding disjunction does not increase the computational complexity of the containment problem, but did not provide any algorithm for deciding containment. Neven also proved that even with a very simple form of negation, the problem becomes undecidable.
An improved method of exploiting information regarding XPath expression containment is therefore needed to more efficiently query XML documents.