Structured documents have nested structures, i.e., structures that define hierarchical relationships between elements of a document. Documents written in Extensible Markup Language (XML) are structured documents. Typically, a structured document can be represented by a data model comprising a plurality of hierarchical nodes. The term “node” is used in the Document Object Model (DOM) sense, the DOM being a standard XML construct well known to those skilled in the art. In that construct, each node corresponds to an element of the XML document. Each node of the XML document can be described by a path that defines the hierarchical relationship between the node and its parent node(s). Every path begins at a root node corresponding to a root element and follows the hierarchical structure defined by the XML document. Throughout this description, the term “node” is used interchangeably with the term “element.”
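By way of illustration only, the relationship between nodes and paths can be sketched in Python using the standard xml.etree.ElementTree module. The purchase-order document and the helper name node_paths are hypothetical and are not taken from any particular DBMS; the sketch simply lists, for each element, the path from the root element down to that element.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase-order document, used only for illustration.
DOC = (
    "<order>"
    "<customer><name>Acme</name></customer>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "</order>"
)


def node_paths(root):
    """Return the root-to-node path of every element, in document order."""
    paths = []

    def walk(elem, prefix):
        # A node's path is its parent's path plus its own element name.
        path = prefix + "/" + elem.tag
        paths.append(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths


paths = node_paths(ET.fromstring(DOC))
for p in paths:
    print(p)
```

Every path produced by the sketch begins at the root element, here the hypothetical order element, and each nested element extends the path of its parent by one step.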
XML supports user-defined tags for customized descriptions of nested document structures and associated semantics. Accordingly, XML allows a user to design a customized markup language for many classes of structured documents. For example, a business can easily model a complex structure of a document, such as a purchase order, in an XML document and send the document for further processing to its business partners. This ability to define custom tags provides tremendous flexibility to users designing their documents.
As more and more business applications create and use structured documents, the challenge is to store, search, and retrieve these documents. Database management systems (DBMS) are available that are configured to receive and store structured documents in their native format. For example, EMC Documentum xDB, developed by EMC Corporation, is a high-performance and scalable native XML DBMS that can store and manage structured documents in their native format, i.e., as a nested data model. Moreover, the XML DBMS can allow database structures to be easily modified to adapt to changing information requirements.
In addition to receiving and storing structured documents, the XML DBMS is also configured to process a search query and to retrieve the document(s) satisfying the query. To facilitate efficient searching, data in the structured documents is usually indexed and stored in an index. A typical index for an XML DBMS is based on a path-value model that includes a single specified XML path and an attribute key. For example, a path-value index can be defined by a single path and a sequence of keys, which can be elements or attributes, together with sub-paths to specific elements. A path-value index must be explicitly defined for every key and for every sub-path to a key, down to the element or key. Moreover, composite path-value indexes, i.e., varying combinations of single indexes, must also be explicitly defined.
Typically, each path-value index is represented as a separate B-tree index with separate keys stored in the index along with separate node pointers stored at the leaf level. A node pointer points to a document that includes the defined path and value. When the path-value index is a composite path-value index, the order in which the path-value indexes are listed affects the manner in which the values are stored in the B-tree index.
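The effect of key order in a composite path-value index can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular DBMS: a sorted Python list searched with the standard bisect module stands in for the B-tree, the document identifier stands in for the node pointer, and the documents, element names, and function names (build_index, lookup_prefix) are hypothetical.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical library of documents, used only for illustration.
DOCS = {
    "doc1": "<order><customer>Acme</customer><status>open</status></order>",
    "doc2": "<order><customer>Bolt</customer><status>open</status></order>",
    "doc3": "<order><customer>Acme</customer><status>closed</status></order>",
}


def build_index(docs, root_tag, key_tags):
    """Build a sorted list of (key tuple, document id) entries.

    The sorted list is a stand-in for a B-tree; the document id plays
    the role of the node pointer stored at the leaf level.
    """
    entries = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if root.tag != root_tag:
            continue
        key = tuple(root.findtext(tag) for tag in key_tags)
        # Only documents that contain every key element are indexed.
        if None not in key:
            entries.append((key, doc_id))
    entries.sort()
    return entries


def lookup_prefix(index, prefix):
    """Return ids of documents whose leading key values equal `prefix`."""
    n = len(prefix)
    pos = bisect.bisect_left(index, (prefix,))
    hits = []
    while pos < len(index) and index[pos][0][:n] == prefix:
        hits.append(index[pos][1])
        pos += 1
    return hits


# The same two keys, listed in different orders, yield different orderings.
by_customer = build_index(DOCS, "order", ["customer", "status"])
by_status = build_index(DOCS, "order", ["status", "customer"])

print(lookup_prefix(by_customer, ("Acme",)))  # search on the leading key
print(lookup_prefix(by_status, ("open",)))    # search on the leading key
print(lookup_prefix(by_customer, ("open",)))  # status is not the leading key
```

In this sketch, a search on a status value finds nothing in the index whose leading key is the customer element, because the entries are ordered by customer first. Serving both kinds of search therefore requires both composite indexes to be defined.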
Path-value indexes present several challenges that can burden the DBMS and a database administrator. Presently, the database administrator or developer must examine characteristics of the data in a library, manually create path-value indexes, test queries, and create additional path-value indexes as needed. Path-value indexes are inflexible because many of them must be defined to service a range of queries that include different combinations of keys, i.e., elements and/or values. In addition, as the number of indexes increases, system overhead increases. Moreover, for a path-value index to be used, all of its keys must be explicitly defined in the index. Because all elements must be listed in the definition, new elements introduced by a user, e.g., when the user defines a custom tag, cannot be indexed unless the database administrator defines a new path-value index corresponding to the new element.
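The administrative burden described above can be illustrated with a short Python sketch. The index catalog, the index names, and the function usable_indexes are hypothetical, and the matching rule is a simplifying assumption made for this sketch: an index can serve a query only if every element on which the query filters is a key of that index.

```python
# Hypothetical index catalog: index name -> (path, ordered key elements).
INDEX_DEFINITIONS = {
    "idx_customer": ("/order", ("customer",)),
    "idx_customer_status": ("/order", ("customer", "status")),
}


def usable_indexes(definitions, path, query_keys):
    """Return the names of indexes on `path` that define every queried key.

    Simplifying assumption for this sketch: an index can serve a query
    only if all elements the query filters on are keys of that index.
    """
    wanted = set(query_keys)
    return sorted(
        name
        for name, (index_path, keys) in definitions.items()
        if index_path == path and wanted <= set(keys)
    )


# A query on an element listed in existing definitions finds an index.
before_known = usable_indexes(INDEX_DEFINITIONS, "/order", ["customer"])

# A user introduces a custom <priority> tag; no definition lists it.
before_new = usable_indexes(INDEX_DEFINITIONS, "/order", ["priority"])

# Only after a new index is explicitly defined can the element be served.
extended = dict(INDEX_DEFINITIONS)
extended["idx_priority"] = ("/order", ("priority",))
after_new = usable_indexes(extended, "/order", ["priority"])

print(before_known, before_new, after_new)
```

Under these assumptions, the custom priority element is served by no index until an additional definition is added to the catalog, and each such addition increases the number of indexes that the system must maintain.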