In recent years, database systems that allow storage and querying of Extensible Markup Language data (“XML data”) have been developed. Though there are many evolving standards for querying XML, all of them include some variation of XPath. XPath is a language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the document's logical structure or hierarchy. The portion of an XML document identified by an XPath “path expression” is the portion that resides, within the structure of the XML document, at the end of any path that matches the path expression.
A query that uses a path expression to identify one or more specific pieces of XML data is referred to herein as a path-based query. The process of determining which XML data corresponds to the path designated in a path-based query is referred to as “evaluating” the path expression.
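As a brief illustration of evaluating a path expression, the following sketch uses Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The sample document, its element names, and the path are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
DOC = """<PurchaseOrder>
  <Reference>PO-1</Reference>
  <LineItems>
    <LineItem><Part>nut</Part></LineItem>
    <LineItem><Part>bolt</Part></LineItem>
  </LineItems>
</PurchaseOrder>"""

root = ET.fromstring(DOC)

# ElementTree evaluates paths relative to the root element, so the absolute
# path /PurchaseOrder/LineItems/LineItem/Part is written relative to root.
matches = root.findall("./LineItems/LineItem/Part")

# Each match is the node at the end of a path that matches the expression.
parts = [node.text for node in matches]
print(parts)
```

Here the path expression identifies both `Part` elements, because each resides at the end of a path through the document hierarchy that matches the expression.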
Unfortunately, even database systems that have built-in support for storing XML data are usually not optimized to handle path-based queries, and the query performance of such database systems leaves much to be desired. In specific cases where an XML schema definition is available, the structure and data types used in XML instance documents may be used to optimize XPath queries. However, in cases where an XML schema definition is not available, and the documents to be searched do not conform to any schema, there are no efficient techniques for path-based querying.
Some database systems may use ad-hoc mechanisms to satisfy XPath queries that are run against documents where the schema of the documents is not known. For example, a database system may satisfy an XPath query by performing a full scan of all stored XML documents. While a full scan of all documents can be used to satisfy all XPath queries, the implementation would be very slow due to the lack of indexes.
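The cost of the full-scan approach can be seen in a minimal sketch. The function `full_scan`, the in-memory `documents` store, and the restriction to simple absolute paths such as `/order/id` are assumptions made for illustration; they do not describe any particular database system.

```python
import xml.etree.ElementTree as ET

def full_scan(documents, path):
    """Evaluate a simple absolute path (e.g. /order/id) by parsing every document.

    Returns the matching (doc_id, text) pairs and the number of documents parsed.
    """
    root_tag, _, rest = path.lstrip("/").partition("/")
    hits = []
    parsed = 0
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)  # parsed whether or not it can match
        parsed += 1
        if root.tag != root_tag:
            continue
        nodes = root.findall(rest) if rest else [root]
        hits.extend((doc_id, node.text) for node in nodes)
    return hits, parsed

# Hypothetical document store keyed by document id.
documents = {
    "a": "<order><id>1</id></order>",
    "b": "<note><to>x</to></note>",
    "c": "<order><id>2</id></order>",
}

hits, parsed = full_scan(documents, "/order/id")
print(hits, parsed)
```

Every stored document is parsed on every query, including documents that cannot contain the requested path, so the work grows with the size of the whole collection rather than with the number of matches.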
Another way to satisfy XPath queries involves the use of text keywords. Specifically, many database systems support text indexes, and these could be used to satisfy certain XPath queries. However, this technique can satisfy only a small subset of XPath queries and, in particular, cannot satisfy path-based queries.
None of these mechanisms fulfill the need for a quick and efficient process of evaluating path-based queries of XML documents. An XML index that can be used to quickly evaluate a path-based query is needed.
Database indexes enable data to be searched without a sequential scan of all of the data. Indexes are typically built using all available data in the database. However, XML documents that are being stored in a database may include several paths that will never be used in a path-based query. For example, document-oriented XML data may include formatting elements that will typically not be used in path-based queries. Therefore, any XML path-based index that indexes all paths in XML documents stored in a database will needlessly include data that will not be used. As more paths are indexed, and the index grows, execution of queries that use such an index is likely to become slower.
It would be beneficial to be able to selectively index only those paths that are more likely to be the subject of a path-based query when building a path-based XML index. In particular, there is a need to quickly and efficiently parse new documents that are being added to the database such that only paths that match a “path subsetting” rule are added to the index. In addition, there is a need to quickly and efficiently check to see if an incoming path-based query could be satisfied by an index before attempting to evaluate the path expression using the index.
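The two needs described above can be sketched as follows. This is a hypothetical illustration only: the rule format (a list of simple rooted paths, each covering itself and its descendants), the dictionary-based index, and the names `is_covered`, `index_document`, and `lookup` are all assumptions, not a described implementation.

```python
import xml.etree.ElementTree as ET

def is_covered(path, include_rules):
    """True if path equals an include rule or lies beneath one (assumed rule format)."""
    return any(path == rule or path.startswith(rule + "/") for rule in include_rules)

def iter_paths(elem, prefix=""):
    """Yield (rooted_path, element) for elem and all of its descendants."""
    path = f"{prefix}/{elem.tag}"
    yield path, elem
    for child in elem:
        yield from iter_paths(child, path)

def index_document(index, doc_id, xml_text, include_rules):
    """Add only the paths that match a path subsetting rule to the index."""
    root = ET.fromstring(xml_text)
    for path, elem in iter_paths(root):
        if is_covered(path, include_rules):
            index.setdefault(path, []).append((doc_id, (elem.text or "").strip()))

def lookup(index, path, include_rules):
    """Return index entries for path, or None if the index cannot satisfy it."""
    if not is_covered(path, include_rules):
        return None  # caller must fall back to another evaluation strategy
    return index.get(path, [])

# Hypothetical rule and document: index metadata, skip formatting markup.
rules = ["/article/meta"]
doc = "<article><meta><author>Ann</author></meta><body><b>hi</b></body></article>"
index = {}
index_document(index, 1, doc, rules)

print(sorted(index))
print(lookup(index, "/article/meta/author", rules))
print(lookup(index, "/article/body/b", rules))
```

In this sketch the same coverage check serves both purposes: it filters paths while a new document is parsed, and it tells the caller, before any index access, whether an incoming path falls outside the indexed subset.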
Based on the foregoing, there is a clear need for a system and method for managing an XML index by specifying paths for inclusion in the index, as well as a system and method for determining whether a given path expression is a path that is indexed by the index.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.