In recent years, database systems that allow storage and querying of eXtensible Markup Language data (“XML data”) have been developed. Though there are many evolving standards for querying XML, most of them include some variation of XPath. XPath is a language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the document's logical structure or hierarchy. The portion of an XML document identified by an XPath “path expression” is the portion that resides, within the structure of the XML document, at the end of any path that matches the path expression.
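For illustration only, the following minimal sketch shows a path expression being evaluated against a small document. It assumes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath; the document content and element names are hypothetical and do not come from any system described herein.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
DOC = """<PurchaseOrder>
  <Reference>PO-1</Reference>
  <LineItems>
    <LineItem><Part>bolt</Part></LineItem>
    <LineItem><Part>nut</Part></LineItem>
  </LineItems>
</PurchaseOrder>"""

root = ET.fromstring(DOC)

# ElementTree evaluates paths relative to the root element, so the absolute
# path /PurchaseOrder/LineItems/LineItem/Part is written relative to
# <PurchaseOrder> here.
parts = [elem.text for elem in root.findall("./LineItems/LineItem/Part")]

print(parts)  # ['bolt', 'nut']
```

The two Part elements are the portions of the document that reside at the end of a path matching the path expression.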
A query that uses a path expression to identify one or more specific pieces of XML data is referred to herein as a path-based query. The process of determining which XML data corresponds to the path designated in a path-based query is referred to as “evaluating” the path expression.
Unfortunately, even database systems that have built-in support for storing XML data are usually not optimized for handling path-based queries, and the query performance of such database systems leaves much to be desired. In specific cases where an XML schema definition is available, the structure and data types used in XML instance documents may be used to optimize path-based queries. However, in cases where an XML schema definition is not available, and the documents to be searched do not conform to any schema, there are no efficient techniques for path-based querying.
Some database systems may use ad-hoc mechanisms to satisfy path-based queries that are run against documents where the schema of the documents is not known. For example, a database system may satisfy a path-based query by performing a full scan of all stored XML documents. While a full scan of all documents can be used to satisfy all path-based queries, the implementation would be very slow due to the lack of indexes.
Another way to satisfy path-based queries involves the use of text keywords. Specifically, many database systems support text indexes, and these could be used to satisfy certain path expressions. However, this technique can only satisfy a small subset of path-based queries and, in particular, cannot support path-based querying in general.
Consequently, XML indexes that can be used to quickly evaluate a path-based query have been developed. An example of such an XML index is described in U.S. patent application Ser. No. 10/884,311, entitled “INDEX FOR ACCESSING XML DATA”, filed by Sivasankaran Chandrasekar et al., on Jul. 2, 2004, the entire contents of which are hereby incorporated by reference for all purposes as if fully set forth herein. XML indexes enable XML data to be searched without a sequential scan of all of the XML data. XML indexes are typically built using all available XML data in the database.
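The contrast between a sequential scan and an index lookup can be sketched as follows. This is a toy illustration only and is not the index structure of the referenced application: it assumes that each node is identified by a simple absolute path of element names, and the names walk, full_scan, build_path_index and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def walk(elem, path):
    """Yield (simple absolute path, element) for elem and all descendants."""
    yield path, elem
    for child in elem:
        yield from walk(child, path + "/" + child.tag)


def full_scan(docs, query_path):
    """Answer a query by parsing and traversing every stored document."""
    hits = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for path, elem in walk(root, "/" + root.tag):
            if path == query_path:
                hits.append((doc_id, elem.text))
    return hits


def build_path_index(docs):
    """Build a map from path to (document id, text) entries, once, up front."""
    index = defaultdict(list)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for path, elem in walk(root, "/" + root.tag):
            index[path].append((doc_id, elem.text))
    return index


# Hypothetical stored documents.
DOCS = {
    1: "<po><ref>A1</ref><item>bolt</item></po>",
    2: "<po><ref>B2</ref></po>",
}

index = build_path_index(DOCS)
scan_hits = full_scan(DOCS, "/po/ref")   # traverses every document
index_hits = index.get("/po/ref", [])    # single dictionary lookup
```

Both approaches return the same entries. The scan repeats the traversal for every query, whereas the index pays the traversal cost once and, in this sketch, stores an entry for every path in every document.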
However, XML documents that are being stored in a database may include several paths that will never be used in a path-based query. For example, document-oriented XML data may include formatting elements that will typically not be used in path-based queries. Therefore, any XML path-based index that indexes all paths in XML documents stored in a database will needlessly include data that will not be used. As more paths are indexed, and the index grows, execution of queries that use such an index is likely to become slower.
U.S. patent application Ser. Nos. 11/059,665 and 11/401,613 describe how only those path expressions that are more likely to be the subject of a path-based query are indexed. Such indexes are referred to herein as path-subsetted indexes. A path-subsetted index thus indexes a proper (or strict) subset of the XML nodes in a document. Path-subsetted indexes are defined in at least two ways. In the case of an INCLUDE path-subsetted XML index, the subset of XML nodes to be indexed is specified using a set of one or more path expressions. All XML nodes that fall within the sub-tree rooted at any node matching one of the path expressions in the set are indexed.
An EXCLUDE path-subsetted XML index is defined in a similar fashion. An EXCLUDE path-subsetted XML index is defined by specifying a set of one or more path expressions. The index does not index any XML node that is within the sub-tree rooted at any node matching any of the path expressions in the set.
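The INCLUDE and EXCLUDE definitions can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not taken from the referenced applications: the subsetting path expressions are simple absolute paths of element names (no wildcards, predicates, or descendant axes), nodes are identified by such paths, and the names node_paths, in_subtree and indexed_paths are hypothetical.

```python
import xml.etree.ElementTree as ET


def node_paths(root):
    """Yield the simple absolute path of every element in the document."""
    stack = [(root, "/" + root.tag)]
    while stack:
        elem, path = stack.pop()
        yield path
        for child in elem:
            stack.append((child, path + "/" + child.tag))


def in_subtree(path, subset_path):
    """True if path is the node matching subset_path or lies beneath it."""
    return path == subset_path or path.startswith(subset_path + "/")


def indexed_paths(root, subset_paths, mode):
    """Return the set of node paths that the index would contain."""
    if mode not in ("INCLUDE", "EXCLUDE"):
        raise ValueError("mode must be 'INCLUDE' or 'EXCLUDE'")
    result = set()
    for path in node_paths(root):
        covered = any(in_subtree(path, p) for p in subset_paths)
        # INCLUDE keeps covered nodes; EXCLUDE keeps uncovered nodes.
        if (mode == "INCLUDE") == covered:
            result.add(path)
    return result


# Hypothetical document-oriented sample with a formatting element <b>.
root = ET.fromstring(
    "<doc><title>t</title><body><b>x</b><para>y</para></body></doc>"
)
included = indexed_paths(root, ["/doc/body/para"], "INCLUDE")
excluded = indexed_paths(root, ["/doc/body/b"], "EXCLUDE")
```

In this sketch, included contains only /doc/body/para, while excluded contains every node path except /doc/body/b.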
Hereinafter, path expressions that correspond to indexed nodes are referred to as “subsetted paths.”
An XML index is typically used for node identification as well as fragment extraction. Node identification is the process of identifying nodes matching certain criteria (e.g., nodes whose corresponding path expression is equal to a subsetted path). Fragment extraction is the process of constructing document fragments. Because fragment extraction requires namespace patching, an INCLUDE path-subsetted XML index may also index all nodes in a subsetted path from the document root to any indexed XML node.
A path expression in a query may not be “satisfiable” by a path-subsetted XML index. A path expression is “satisfiable” by a path-subsetted XML index if all XML nodes that match the path expression are indexed in the path-subsetted XML index.
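Under a deliberately simplified model, the satisfiability test can be sketched as a prefix check. The sketch assumes that both the query path and the subsetting paths are simple absolute paths of element names; wildcards, predicates, and descendant axes, which make the test considerably harder in practice, are out of scope. The names covers and is_satisfiable are hypothetical.

```python
def covers(subset_path, query_path):
    """True if every node matching query_path lies in the sub-tree rooted
    at a node matching subset_path (simple paths only)."""
    return query_path == subset_path or query_path.startswith(subset_path + "/")


def is_satisfiable(query_path, subset_paths, mode):
    """Decide whether all nodes matching query_path are indexed."""
    covered = any(covers(p, query_path) for p in subset_paths)
    if mode == "INCLUDE":
        # Indexed only if the query falls under some included sub-tree.
        return covered
    if mode == "EXCLUDE":
        # Indexed only if the query falls under no excluded sub-tree.
        return not covered
    raise ValueError("mode must be 'INCLUDE' or 'EXCLUDE'")


include_paths = ["/po/items"]   # hypothetical INCLUDE path set
exclude_paths = ["/po/notes"]   # hypothetical EXCLUDE path set

ok_include = is_satisfiable("/po/items/item/part", include_paths, "INCLUDE")
not_include = is_satisfiable("/po/ref", include_paths, "INCLUDE")
ok_exclude = is_satisfiable("/po/ref", exclude_paths, "EXCLUDE")
not_exclude = is_satisfiable("/po/notes/b", exclude_paths, "EXCLUDE")
```

Comparing whole path steps, rather than raw string prefixes, prevents a subsetting path such as /po/items from being treated as covering an unrelated path such as /po/itemsArchive.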
With a path-subsetted index, new documents that are being added to a database may be quickly and efficiently parsed such that only path expressions that match a “path subsetting” rule are added to the index. In addition, an incoming path-based query may be quickly and efficiently examined to determine whether the specified path could be satisfied by an index before attempting to evaluate the path expression using the index.
However, database systems typically normalize received path-based queries before the database server determines whether an index may be used to process the path-based queries. During a typical normalization phase, a complex path expression is decomposed into multiple “mini” path expressions. The database system then determines whether each of the “mini” path expressions is “satisfiable” by an index. If any one of the “mini” path expressions is not satisfiable by an index, then an index is not used to retrieve data that satisfies any of the “mini” path expressions.
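The all-or-nothing behavior described above can be sketched as follows. The sketch does not model the decomposition itself, whose details are not specified here; it takes an already decomposed list of “mini” path expressions as input and assumes a hypothetical INCLUDE path-subsetted index over simple absolute paths. The names satisfiable_by_include_index and can_use_index are hypothetical.

```python
def satisfiable_by_include_index(mini_path, include_paths):
    """Simplified test: the mini path is, or lies beneath, an included path."""
    return any(
        mini_path == p or mini_path.startswith(p + "/") for p in include_paths
    )


def can_use_index(mini_paths, include_paths):
    """All-or-nothing rule: one unsatisfiable mini path disables the index
    for every mini path of the normalized query."""
    return all(satisfiable_by_include_index(m, include_paths) for m in mini_paths)


include_paths = ["/po/items"]  # hypothetical INCLUDE path set

# Hypothetical results of normalizing two different queries.
query_a = ["/po/items/item", "/po/items/item/price"]
query_b = ["/po/items/item", "/po/ref"]

use_index_a = can_use_index(query_a, include_paths)
use_index_b = can_use_index(query_b, include_paths)
```

For query_b, the index is not used at all, even though /po/items/item is satisfiable on its own.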
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.