Many database systems now support storage and querying of eXtensible Markup Language data (“XML data”). XML data can take the form of a collection of XML documents. The collection of XML documents may be stored in an aggregate form in a database system. In the aggregate form, the XML documents can be stored as LOBs (“Large Binary Objects”) in such a base structure as a LOB table. An example of one such XML document may be as follows:
po.xml<purchaseOrder>  <reference>GHB3567</reference>  <shipAddrCity>Oakland</shipAddrCity>  <billAddrCity>Berkeley</billAddrCity>  <lineItem itemNo = 34>   <itemName>CPU</itemName>   <itemQuantity>3</itemQuantity>   <itemUnitPrice>123.45</itemUnitPrice>  </lineItem> <lineItem itemNo = 68>   <itemName>Memory</itemName>   <itemQuantity>3</itemQuantity>   <itemUnitPrice>23.45</itemUnitPrice>  </lineItem></purchaseOrder>
Once the collection of the XML documents is stored in the database system, queries can be run against the database to retrieve XML documents or fragments thereof. SQL language has been expanded to support XQuery type of queries against XML data in a database system. For example, given a database system that stores the collection of XML documents including po.xml., a query such as Q1 below can be submitted to the database system:
Q1Select extract(value(v), “/purchaseOrder/lineItem”)from PurchaseOrder vwhere existsnode(value(v), “//lineItem[ItemQuantity > 40 and ItemUnitPrice > 39.95]”) =1
Assming a LOB table that stores the collection of XML documents is named PurchaseOrder, this query Q1 finds XML documents (a subset of the collection of XML documents) stored in the PurchaseOrder table each of which has at least one lineItem with unit price more than 39.95 and quantity more than 40. This query returns all lineItem fragments (even though only one satisfies the predicate expressions in the “WHERE” clause) out of each of the XML documents found.
Given Q1, the database system can prepare a first query plan in which each record (that either contains or maps to a XML document) of the PurchaseOrder is retrieved and subsequently searched based on XQuery expressions associated with the query Q1. Since the XQuery related search is directly performed on the XML documents retrieved from the database system, the database system provides little value-added services other than storage and document retrievals. Furthermore, since every XML document must be retrieved from the database system, committed to memory and processed to evaluate XQuery expressions, this first query plan of Q1 would be quite slow.
As an alternative, the collection of XML documents can be stored in a shredded form in the database system. In this form, base structures in the database system can be defined in such a way as to capture hierarchical relationships among nodes in a XML document. Accordingly, when an XML document is submitted to the database system for storage, it is shredded into node values. The node values shredded are then stored in their respective base structures (e.g., a column in a base table).
When the collection of XML documents is stored in the shredded form, or object relationally, Q1 can be efficiently executed to find out all the qualified XML documents. For example, a (nested) table lineItem may be setup to capture all the sub-nodes under the node “lineItem” contained in the XML documents. Furthermore, the PurchaseOrder table in the shredded form may store a foreign key column “setid”, which corresponds to a primary key column “nested_table_id” in the (nested) lineItem table. The “nested_table_id” column contains a unique identifier associated with each set of lineItem sub-nodes contained in the collection of XML documents.
One can create B-tree numeric index(es) on ItemQuantity and ItemUnitPrice columns of the lineItem table. The database system can prepare a second query plan that take advantage of the index created, as represented by a rewritten query Q2 as follows:
Q2Select (select xmlagg(xmlelement(“lineItem”,xmlattributes(itemno as “itemNo”), xmlforest(itemName,itemQuantity, itemUnitPrice))) From lineItem li2 Where li2.nested_table_id = v.setid)from PurchaseOrder v, lineItem liwhere exists( select 1 from li where li.nested_table_id = v.setid and  li.ItemQuantity > 40 and li.ItemUnitPrice >  39.95);
This query accesses the lineItem table to find all lineItem sets stored that match the value predicate expressions with respect to ItemQuantity and ItemUnitPrice. Thus, selection of the XML documents that match the predicate expressions does not have to be performed directly on the XML documents, but rather performed on a much smaller set of data that stores in the lineItem table. Furthermore, because there is a value index on ItemQuantity and ItemUnitPrice columns of the lineItemTable, the table access on the lineItemTable does not require a full table scan. Compared with the first query plan, this second query plan would be much faster.
A disadvantage of this shredded storage approach is that it requires a priori (e.g., prior to compilation of a query) knowledge about a XML schema that describes the collection of XML documents. If, however, the XML schema did not exist or were unknown, the database system would not be able to define properly a set of base structures that would reflect all data types and structural relationships the collection of XML documents could embody.
Even if the schema did exist and were known, still not all the data types or structural relationships in the collection of XML documents would be useful in many queries. For example, in cases where a user is only interested in a limited number of nodes in a collection of XML documents, a corresponding XML schema probably defines too many extraneous data types and too many complex, extraneous structural relationships about which very few queries care. Shredding a collection of XML documents and storing resultant node values entails wasted efforts with little advantage in return.
XML Path Indexes can speed up XQuery types of queries that contain XPath expressions. Under this approach, instead of accessing the collection directly, a query accesses an XML Path Index that is associated with a collection of XML documents. An XML Path Index is a table, separate from the base structures storing the collection of XML documents. The XML Path Index logically indexes the collection of XML documents. In the present example, where the collection of XML documents includes po.xml, such a table may comprise a plurality of columns as indicated in Table 1.
TABLE 1XML Path Index Table (PO_PATH_IDX_TAB) for PurchaseOrderDocIdPathValueOrderedKeyXmlLocatorDid1/PurchaseOrder1Ptr1Did1/PurchaseOrder/referenceGHB35671.1Ptr2Did1/PurchaseOrder/ShipAddrCityOakland1.2Ptr3Did1/PurchaseOrder/BillAddrCityBerkeley1.3Ptr4Did1/PurchaseOrder/lineItem1.4Ptr5Did1/PurchaseOrder/lineItem/@itemNo341.4.1Ptr6Did1/PurchaseOrder/lineItem/itemNameCPU1.4.2Ptr7Did1/PurchaseOrder/lineItem/itemQuantity31.4.3Ptr8Did1/PurchaseOrder/lineItem/itemUnitPrice123.451.4.4Ptr9Did1/PurchaseOrder/lineItem1.5Ptr10Did1/PurchaseOrder/lineItem/@itemNo681.5.1Ptr11Did1/PurchaseOrder/lineItem/itemNameMemory1.5.2Ptr12Did1/PurchaseOrder/lineItem/itemQuantity31.5.3Ptr13Did1/PurchaseOrder/lineItem/itemUnitPrice23.451.5.4Ptr14Did2. . .. . .. . .. . .
Each row of the XML Path Index table represents a node contained in a XML document that is associated with a document identifier (DocId), e.g., “Did1”. A complete set of rows in the XML Path Index table having a particular value of the Doc ID represents a logical hierarchical representation of nodes contained in a XML document that is associated with the particular value of the Doc ID. Specifically, the DocId column as shown stores a unique identifier for each of the XML documents in the collection. The Path column stores a path identifier for a node in a XML document. The Value column stores a value for the node if any. This column may contain a null value if the node does not have a value. The OrderedKey column stores hierarchical order information as represented using a Dewey-type value. Specifically, the OrderedKey of a node can be created by appending a value to the OrderedKey of the node's immediate parent, where the appended value indicates the position, among the children of the parent node, of that particular child node. The XmlLocator column stores logical pointers each of which points to fragments in the associated XML document which fragments correspond to the node.
Given the XML Path Index table, the database system can prepare a third query plan that takes advantage of the XML Path Index table created, as represented by a rewritten query Q3 as follows:
Q3select (select XMLAgg(MKINI(P3.XmlLocator)) from PO_PATH_INDX_TAB p3 where p3.DocId = v.DocId  and p3.path = “/PurchaserOrder/lineItem” order by p3.orderedkey)from purchaseOrder vwhereexists( select 1 from PO_PATH_INDX_TAB p1, PO_PATH_INDX_TAB p2 where p1.path =“/PurchaseOrder/lineItem/itemQuantity”  and to_number(p1.value) > 40 and p2.path =  ““/PurchaseOrder/lineItem/itemUnitPrice” and  to_number(p2.value) > 23.45 and  parent_key(p1.OrderedKey)=  parent_key(p2.OrderedKey) and p1.DocId =  p2.DocId and v.DocId = p1.DocId);
Here MKINI( ) in the query Q3 is a function or macro that maps a logical pointer of a row stored in the XmlLocator column to a location in a XML document that contains a node associated with the row.
Using the XML Path Index table to query the desired XML data, the database system avoids a need to retrieve every XML document in the collection of XML documents, or perform ad-hoc XPath-based queries directly on contents of the retrieved XML documents. Thus, many if not all XPath-based queries associated with the original query Q1 can be answered by looking at values contained in the XML Path Index table, instead of directly accessing the base structures that stores XML documents.
However, this approach (using XML Path Index) has its own disadvantages. As shown in the XML Path Index table (Table 1), since each node value (such as @itemNo, itemName, itemQuantity, itemUnitPrice, etc.) is stored as a row in the XML Path Index table PO_PATH_INDX_TAB, Q3 has to do a self-join. In fact, the number of self-joins is proportional to the number of properties under lineItem that are specified in the predicate expressions. Furthermore, even if there is a B-tree numeric value index on the value column of PO_PATH_INDX_TAB, irrelevant data in the XML Path Index table has to be read to determine whether a particular row is of a node of a right type (such as itemQuantity and itemPrice in this example) specified in the predicate expressions before actually evaluating the predicate expressions.
Because of these limitations, the existing techniques are not as efficient in accessing XML documents in a database system as would be desired. As a result, a better mechanism that would improve accessing XML documents in a database system is needed.