The number of businesses exchanging information electronically is proliferating. Businesses that exchange information have recognized the need for a common standard for representing data. Extensible Markup Language (“XML”) is rapidly becoming such a common standard.
XML describes and provides structure to a body of data, such as a file or data packet. The XML standard provides for tags that delimit a body of data into sections referred to as XML elements. Each XML element may contain one or more name-value pairs referred to as attributes.
FIG. 1 shows XML document 101, which is provided to illustrate XML and XML elements. XML elements are delimited by a start tag and a corresponding end tag. For example, XML document 101 contains the start tag <A> (corresponding to <A id=“0”>) and the end tag </A> to delimit an element, and <D> and </D> to delimit another element. The data between an element's start tag and end tag is referred to as the element's content.
An element is herein referred to by its start tag. For example, the element delimited by the start and end tags <A> and </A> is referred to as the A element.
Element content may contain various types of data, which may include attributes, other elements, and text data. Attributes of an element are represented by attribute name-value pairs. An attribute name-value pair specifies the attribute's name and value. For example, A contains the attribute name-value pair id=‘0’, specifying an attribute name of id and an attribute value of the string literal ‘0’.
The elements A and C contain one or more elements. Specifically, A contains elements C and D, and C contains element D. An element that is contained by another element is referred to as a descendant of that element. Thus, C and D are descendants of A.
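The containment relationships described above can be sketched in a few lines of Python. This is only an illustration: FIG. 1 is not reproduced here, so the DOC string is a hypothetical stand-in for XML document 101 built from the elements named in the text, and the descendants helper is a name introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for XML document 101 (FIG. 1 is not reproduced here).
DOC = '<A id="0"><C><D>some text</D></C></A>'

def descendants(element):
    """Return the tags of every element contained, directly or indirectly, in `element`."""
    tags = []
    for child in element:
        tags.append(child.tag)
        tags.extend(descendants(child))
    return tags

root = ET.fromstring(DOC)
a_attributes = dict(root.attrib)           # attribute name-value pairs of A
a_descendants = descendants(root)          # elements contained in A
c_descendants = descendants(root.find("C"))  # elements contained in C
```

Under this assumed document, A carries the single attribute id, C and D are descendants of A, and D is the only descendant of C.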
An XML document, such as XML document 101, is an example of an information hierarchy. An information hierarchy is a body of data items that are hierarchically related. In an XML document, the hierarchically related data items include elements and element attributes. By defining an element that contains attributes and descendant elements, an XML document defines a hierarchical tree relationship between the element, its descendant elements, and its attributes.
Because an XML document is an information hierarchy, each item contained therein may be located by following a “path” through the hierarchy to the item. Within an XML document, the path to an element begins at the root of the tree and proceeds down the hierarchy of elements until it arrives at the element of interest. For example, the path to D consists of elements A and C, in that order.
A convenient way to identify and locate a specific item of information stored in an information hierarchy is through the use of a “pathname”. A pathname is a concise way of uniquely identifying an item based on the path through the hierarchy to the item. A pathname is composed of a sequence of names. In the context of an XML document, the names in a pathname are the names of elements or element attributes. For example, ‘/A/C/D’ identifies element D.
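As a rough illustration of how pathnames follow from the hierarchy, the sketch below enumerates a pathname for every element and attribute of a small document. The DOC string is again a hypothetical stand-in for XML document 101, the pathnames function is introduced only for this example, and the ‘@’ prefix for attribute names is borrowed from the XPATH notation discussed later in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for XML document 101 (FIG. 1 is not reproduced here).
DOC = '<A id="0"><C><D>some text</D></C></A>'

def pathnames(element, prefix=""):
    """List a pathname for `element`, its attributes, and everything below it."""
    path = f"{prefix}/{element.tag}"
    result = [path]
    for name in element.attrib:
        result.append(f"{path}/@{name}")       # attribute of this element
    for child in element:
        result.extend(pathnames(child, path))  # descend one level
    return result

paths = pathnames(ET.fromstring(DOC))
```

Note that a pathname built this way identifies an item uniquely only when sibling elements have distinct names, which holds for the assumed document.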
XML Storage Mechanisms
Various types of hierarchical storage mechanisms are used to store XML documents. One type stores an XML document as a text file in a file system.
Other types of hierarchical storage mechanisms store the parts of an XML document in a relational or object-relational database system. For example, an entire XML document may be stored in a blob (binary large object), or the parts of an XML document may be stored in different rows in one or more relational tables, each row containing one or more parts of an XML document. An XML document may also be stored as a hierarchy of objects in an object-relational database; each object is an instance of an object class and stores one or more elements of an XML document. The object class defines, for example, the structure of an element, and includes references or pointers to objects representing the immediate descendants of the element.
Storing XML documents in a database system has many advantages. Database systems are well suited for storing large amounts of information. Queries may be used to retrieve data that matches complex search criteria. The data may be easily and efficiently retrieved from a relational database system. However, database systems are not configured to retrieve efficiently, if at all, data for queries that request data identified by the data's location within an information hierarchy.
One way for a query to identify the requested data's location within an information hierarchy is through the use of a string that conforms to the standard prescribed in the document XML Path Language (XPATH), version 1.0 (W3C Recommendation 16 November 1999). The XPATH standard defines a syntax and semantics for addressing parts of a document. For example, the query “/A/B” requests the subtree descending from a child of A with the element name B. The query “/A/B/@id” requests the attribute id of a child of A with the element name B.
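The two example queries can be approximated with Python's standard xml.etree.ElementTree module, as a hedged sketch only. ElementTree implements a limited subset of XPATH and does not accept absolute paths on an element, so “/A/B” is expressed here as the relative path “B” evaluated at the root element A, and the attribute step is emulated with get. The document below is hypothetical and is not XML document 101.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one B directly under A, another B nested inside C.
DOC = '<A><B id="1"><X/></B><C><B id="2"/></C></A>'

root = ET.fromstring(DOC)  # root is the A element

# Approximates "/A/B": B elements that are children of the root A.
children_named_b = root.findall("B")

# Approximates "/A/B/@id": the id attribute of each such B.
ids_of_children = [b.get("id") for b in children_named_b]

# For contrast, ".//B" matches B at any depth below A.
ids_at_any_depth = [b.get("id") for b in root.findall(".//B")]
```

In this assumed document the first query matches only the B element directly under A, while the any-depth form also matches the B nested inside C.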
A query that requests data based on a position within a hierarchy is referred to herein as a hierarchical query. A hierarchical query that uses a string that conforms to XPATH to identify the location within a hierarchy of the requested data is referred to herein as an XPATH query. The process of retrieving the data requested by a hierarchical query is referred to herein as hierarchical retrieval.
One approach to hierarchical retrieval is to retrieve all the rows that store part of an XML document, construct an in-memory representation of the complete XML document, and then search and traverse the tree to get the requested data. XML documents can be quite large. The processing required to build an in-memory representation of a large XML document can be expensive and the amount of memory needed to store the in-memory representation can easily exceed available memory resources on a computer.
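A minimal sketch of this approach follows, assuming a hypothetical row layout of (row_id, parent_row_id, tag, attributes, text) held in a Python list rather than a real database. The build_tree and retrieve functions are names introduced for illustration, and retrieve handles only simple slash-separated paths. The point of the sketch is that every row is materialized before any traversal begins.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows: (row_id, parent_row_id, tag, attributes, text).
ROWS = [
    (1, None, "A", {"id": "0"}, None),
    (2, 1, "C", {}, None),
    (3, 2, "D", {}, "some text"),
]

def build_tree(rows):
    """Materialize every row as an element and link it to its parent."""
    nodes = {}
    root = None
    for row_id, parent_id, tag, attrs, text in rows:
        node = ET.Element(tag, attrs)
        node.text = text
        nodes[row_id] = node
        if parent_id is None:
            root = node
        else:
            nodes[parent_id].append(node)  # assumes parents are listed first
    return root

def retrieve(root, path):
    """Walk a simple path such as '/A/C/D' or '/A/@id' over the in-memory tree."""
    steps = path.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    current = [root]
    for step in steps[1:]:
        if step.startswith("@"):
            name = step[1:]
            return [e.get(name) for e in current if name in e.attrib]
        current = [c for e in current for c in e if c.tag == step]
    return current

tree = build_tree(ROWS)            # the whole document is built first
found = retrieve(tree, "/A/C/D")   # only then is the path followed
```

Even though the query touches three elements, build_tree visits every row of the document, which is the cost the paragraph above describes.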
When XML documents are stored as a set of objects in an object-relational database system, another approach can be used for hierarchical retrieval. The objects used to store the XML document can be traversed by following the references or pointers defining the hierarchical relationship between the objects. The advantage of this approach is that not all the objects used to represent an XML document need to be loaded into memory; only objects that are traversed need be loaded.
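The sketch below illustrates reference-following traversal under several assumptions: a Python dictionary named STORE stands in for the object-relational database, each reference carries the tag of the object it points to, and the elements E and F are added to the hypothetical document only to show objects that are never loaded. None of these names come from an actual database product.

```python
from dataclasses import dataclass, field

@dataclass
class ElementObject:
    tag: str
    # References to immediate descendants as (tag, object_id) pairs.
    child_refs: list = field(default_factory=list)

# Hypothetical object store keyed by object identifier.
STORE = {
    1: ElementObject("A", [("C", 2), ("E", 4)]),
    2: ElementObject("C", [("D", 3)]),
    3: ElementObject("D"),
    4: ElementObject("E", [("F", 5)]),
    5: ElementObject("F"),
}

loaded = []  # records which objects were brought into memory

def load(object_id):
    loaded.append(object_id)
    return STORE[object_id]

def traverse(root_id, path):
    """Follow references along a simple path, loading only visited objects."""
    steps = path.strip("/").split("/")
    obj = load(root_id)
    if obj.tag != steps[0]:
        return None
    for step in steps[1:]:
        # Simplification: follow only the first child with a matching tag.
        next_id = next((oid for tag, oid in obj.child_refs if tag == step), None)
        if next_id is None:
            return None
        obj = load(next_id)
    return obj

target = traverse(1, "/A/C/D")
```

After the traversal, loaded holds only the identifiers of A, C, and D; the objects for E and F are never touched.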
A disadvantage of this approach stems from the fact that object-relational database systems are limited in the number of object classes they can effectively handle. Representing an XML document with objects requires defining object classes for each type of element in the XML document. An object-relational database might have to be configured to define many object classes for many XML documents and store very many objects. Generally, an object-relational database system can efficiently and effectively handle only up to a threshold number of object types, a threshold that can be easily exceeded when using an object-relational database system to store XML documents.
Yet another approach is to store the data that defines each parent-child relationship in the hierarchy of an XML document, and use the data to determine which data to return for an XPATH query. For example, a table stores an XML document, each row of the table storing the content of an element. The table includes a column called parent, which stores a primary key identifier identifying the row representing the parent of an element. To retrieve data specified by an XPATH query, a query that conforms to the Structured Query Language (“SQL”) can be formulated using, for example, a connect-by clause to identify the requested data.
The connect-by clause allows a user to issue SQL queries that request data based on the data's location within an information hierarchy. The data is returned by a relational database system in a way that reflects the hierarchical organization. The connect-by clause is used to specify the condition that defines the hierarchical relationship, which, in the current example, is the hierarchical relationship defined by parent and the primary key identifier. The disadvantage of the approach is that it requires many join operations, especially when the query requests data based on a hierarchical location that includes many levels.
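The join cost can be made concrete with a small sketch using Python's sqlite3 module. SQLite has no connect-by clause, so this example substitutes explicit self-joins on the parent column, one per level of the path; it is not the connect-by formulation itself. The table name xml_element, its columns other than parent, and the path_query function are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE xml_element ("
    "id INTEGER PRIMARY KEY, parent INTEGER, name TEXT, content TEXT)"
)
# Hypothetical rows for a document shaped like <A><C><D>some text</D></C></A>.
conn.executemany(
    "INSERT INTO xml_element VALUES (?, ?, ?, ?)",
    [(1, None, "A", None), (2, 1, "C", None), (3, 2, "D", "some text")],
)

def path_query(path):
    """Build a self-join query that resolves a simple path such as '/A/C/D'."""
    names = path.strip("/").split("/")
    last = len(names) - 1
    sql = f"SELECT e{last}.id, e{last}.content FROM xml_element e0"
    for i in range(1, len(names)):
        # One additional join for every level below the root.
        sql += (
            f" JOIN xml_element e{i}"
            f" ON e{i}.parent = e{i - 1}.id AND e{i}.name = ?"
        )
    sql += " WHERE e0.parent IS NULL AND e0.name = ?"
    # Placeholders appear in the joins first, then in the WHERE clause.
    params = names[1:] + names[:1]
    return sql, params

sql, params = path_query("/A/C/D")
rows = conn.execute(sql, params).fetchall()
join_count = sql.count(" JOIN ")
```

For the three-level path ‘/A/C/D’ the generated statement contains two joins, and each further level in the path adds another.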
Based on the foregoing, it is clearly desirable to devise an approach for organizing and storing XML data, or any form of hierarchical data, that allows portions of an information hierarchy to be accessed more efficiently.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.