Large collections of XML (eXtensible Markup Language) documents are increasingly prevalent in the enterprise. Information about the structure of specific types of XML documents may be specified in documents referred to as “XML schemas”. For example, the XML schema for a particular type of XML document may specify the names for the data items (tags) contained in that particular type of XML document, the hierarchical relationship between the data items contained in that type of XML document, data types of the data items contained in that particular type of XML document, etc.
XML elements are delimited by a start tag and a corresponding end tag. For example, in the following XML fragment, <Author> is a start tag and </Author> is the corresponding end tag; together they delimit an element.
<book>My book
     <publication publisher="Doubleday" date="January"></publication>
     <Author>Mark Berry</Author>
     <Author>Jane Murray</Author>
</book>
The data between the element start and end tags is referred to as the element's content. An element's content may include values and other elements. In the case of the first Author element, the content of the element is the text data value Mark Berry. In the case of the book element, the content includes the text data value My book, the element publication, and two Author elements. A data value may comprise one or more text words. An individual word may be used as a searchable keyword. For example, “Berry” may be a keyword that is searched for independently of the keyword “Mark”, even though both are part of the same text data value. An element is herein referred to by its element name. For example, the element delimited by the start and end tags <publication> and </publication> is referred to as publication.
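The distinction between an element's content and the individual keywords within a text data value can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module, and the helper name keywords is hypothetical; neither is prescribed by the description above.

```python
import xml.etree.ElementTree as ET

# The example fragment from the text, with plain ASCII quotes.
FRAGMENT = """<book>My book
  <publication publisher="Doubleday" date="January"></publication>
  <Author>Mark Berry</Author>
  <Author>Jane Murray</Author>
</book>"""


def keywords(element):
    """Split an element's own text data value into individual words.

    Hypothetical helper: whitespace splitting is an assumed, simplistic
    notion of what counts as a "word".
    """
    return (element.text or "").split()


book = ET.fromstring(FRAGMENT)
authors = book.findall("Author")

# Text data value of each Author element.
author_values = [author.text for author in authors]

# Each word of a value can serve as a separate searchable keyword.
book_keywords = keywords(book)
first_author_keywords = keywords(authors[0])
```

Here "Mark" and "Berry" end up as separate entries in first_author_keywords even though they belong to the same text data value.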
Node Tree Model
An important standard for XML is the XQuery 1.0 and XPath 2.0 Data Model (see W3C Working Draft 9 Jul. 2004, which is incorporated herein by reference). One aspect of this model is that an XML document is represented by a hierarchy of nodes that reflects the hierarchical nature of the XML document. A hierarchy of nodes is composed of nodes at multiple levels. The nodes at each level are each linked to one or more nodes at a different level. Each node at a level below the top level is a child node of one or more of the parent nodes at the level above. Nodes at the same level are sibling nodes. In a tree hierarchy or node tree, each child node has only one parent node, but a parent node may have multiple child nodes. In a tree hierarchy, a node that has no parent node linked to it is the root node, and a node that has no child nodes linked to it is a leaf node. A tree hierarchy has a single root node.
In a node tree that represents an XML document, a node can correspond to an element. Each child node of that node corresponds to an attribute of the element or to another element contained in the element. The node may be associated with a name. For example, the name of the node representing the element book is book. For a node representing the attribute publisher, the name of the node is publisher.
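The node tree terminology above (root, leaf, parent, child, sibling) can be made concrete with a minimal sketch. The Node class and its method names are hypothetical and chosen only for illustration; the tree is built by hand to mirror the book fragment, with attributes modeled as child nodes of their element.

```python
class Node:
    """A named node in a tree hierarchy (illustrative, not a standard API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # at most one parent in a tree hierarchy
        self.children = []        # a parent may have multiple children
        if parent is not None:
            parent.children.append(self)

    def is_root(self):
        # The root node has no parent node linked to it.
        return self.parent is None

    def is_leaf(self):
        # A leaf node has no child nodes linked to it.
        return not self.children

    def siblings(self):
        # Other children of the same parent; the root has none.
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


# Node tree for the book fragment.
book = Node("book")
publication = Node("publication", book)
publisher = Node("publisher", publication)   # attribute node
date = Node("date", publication)             # attribute node
author_1 = Node("Author", book)
author_2 = Node("Author", book)

sibling_names = [n.name for n in publication.siblings()]
```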
For convenience of expression, elements and other parts of an XML document are referred to as nodes within a tree of nodes that represents the document. Thus, a node representing an element may be referred to by the element name, and a node value may be referred to as the element value. For example, referring to ‘My book’ as the value of the node with the name book is just a convenient way of expressing that the value of the element associated with node book is My book. The name of an element, attribute, or node is also referred to herein as a tag name.
The path for a node in an XML document is the series of nodes, starting from an ancestor node in the XML document, that is traversed to arrive at that particular node further down in the hierarchy. For example, the path from the root of the XML document to node publication is represented by ‘/book/publication’.
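As a rough illustration of how root-to-node paths such as ‘/book/publication’ can be derived, the following sketch walks a parsed document and records the path of every node. It assumes Python's standard xml.etree.ElementTree module; the function name node_paths is hypothetical, and writing attribute nodes with a leading ‘@’ is an assumption borrowed from XPath notation rather than something stated above.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<book>My book
  <publication publisher="Doubleday" date="January"></publication>
  <Author>Mark Berry</Author>
  <Author>Jane Murray</Author>
</book>"""


def node_paths(element, parent_path=""):
    """Return the root-to-node path of an element and all its descendants."""
    path = f"{parent_path}/{element.tag}"
    paths = [path]
    # Attributes are treated as child nodes of the element.
    for attribute_name in element.attrib:
        paths.append(f"{path}/@{attribute_name}")
    # Recurse into child elements, extending the current path.
    for child in element:
        paths.extend(node_paths(child, path))
    return paths


PATHS = node_paths(ET.fromstring(FRAGMENT))
```

Note that both Author elements share the path /book/Author; a path alone does not distinguish sibling nodes that have the same name.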
Path Expressions
XML documents may be searched by using an XML query language such as XQuery/XPath. XML Query Language (“XQuery”) and XML Path Language (“XPath”) are important standards for querying data in XML documents. The primary syntactic construct in XPath is an expression, which is evaluated to yield an object. XPath expressions are described in Section 3 (“Expressions”) of “XML Path Language (XPath)” (version 1.0). A path is a location of a node within an XML document hierarchy; a path expression is a representation (a way of expressing or specifying) that location. Constructing a path expression may require that the user know the structure of the document. Thus, when the collection of XML documents does not have a schema that expresses their structure, or there is not one common schema to which all XML documents in the collection conform, it can be difficult to formulate an XQuery/XPath query to find information in those documents.
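The dependence of a path expression on knowledge of the document structure can be seen in a small sketch. It assumes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath and evaluates expressions relative to the element on which they are called, so the absolute path /book/Author is written here as ./Author against the book root. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<book>My book
  <publication publisher="Doubleday" date="January"></publication>
  <Author>Mark Berry</Author>
  <Author>Jane Murray</Author>
</book>"""

doc = ET.fromstring(FRAGMENT)

# Path expression selecting every Author child of the root element.
author_names = [a.text for a in doc.findall("./Author")]

# Path expression with a predicate on an attribute value.
publishers = [
    p.get("publisher")
    for p in doc.findall("./publication[@date='January']")
]

# A guess at a tag name that the document does not contain matches nothing,
# which is the difficulty when the structure is not known in advance.
missing = doc.findall("./Title")
```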
Order Key
An order key is a compressed representation of a node's hierarchical position and ordering within an XML document. The order key may be represented using a Dewey-type value. The order key of a node may be created by appending a value to the order key of the node's immediate parent, where the appended value indicates the order, among the children of the parent node, of that particular child node. The following description refers to the hierarchy shown in FIG. 1A. The order of peer nodes (having the same parent) is read from left to right. The left-most node is position 1. In the example, node F is a child of a node C that is a child of a node A. F has the order key “1.2.3”. The final “3” in the order key indicates that the node F is the third child of its parent node C. Similarly, the “2” indicates that node C is the second child of node A. The leading “1” indicates that node A is the root node (i.e., has no parent).
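The construction of Dewey-type order keys can be sketched as follows. FIG. 1A is not reproduced here, so the tree below is an assumption: only nodes A, C, and F and their positions come from the description, while nodes B, D, and E are hypothetical placeholders for the remaining siblings. The function names are likewise hypothetical.

```python
def assign_order_keys(children_of, root):
    """Assign a Dewey-type order key to every node reachable from root.

    children_of maps a node name to its children in left-to-right order.
    """
    keys = {}

    def visit(node, key):
        keys[node] = key
        # Position among siblings starts at 1 for the left-most child.
        for position, child in enumerate(children_of.get(node, []), start=1):
            visit(child, f"{key}.{position}")

    visit(root, "1")  # the root node receives the leading "1"
    return keys


def parent_key(order_key):
    """Return the parent's order key, or None for the root node."""
    prefix, _, _ = order_key.rpartition(".")
    return prefix or None


# Assumed hierarchy: C is the second child of A, F is the third child of C.
TREE = {"A": ["B", "C"], "C": ["D", "E", "F"]}
ORDER_KEYS = assign_order_keys(TREE, "A")
```

Because each key embeds the key of the parent as a prefix, the hierarchical position of a node can be recovered from the key alone, without walking the tree.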
Approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.