Relational Database Management Systems (“DBMS”) typically support a wide range of data types. For example, a DBMS allows users to store and query scalar data type values such as integers, numbers, and strings. Some DBMSs have the added ability to support more complex data types, for instance, the ability to support Extended Markup Language (“XML”) documents and other XML type data. Those DBMSs that include XML support allow users to define tables, or columns in a table, as an XML type (e.g. as XMLType). This added support facilitates the inclusion of more sophisticated data in DBMSs. However, with the addition of complex data types, new techniques had to be created to handle storage and access issues.
I. XML Type Storage Techniques
XML data does not naturally lend itself to conventional physical storage models in a DBMS. Thus, a variety of storage techniques can be used to manage the storage of such data. For example, two models for storing XML type data in a database include storing the data object-relationally or storing the data in aggregate form.
A. Object-Relational Storage
Storing XML type data object-relationally involves defining a complex database representation, defined in terms of data types, handled by the database system, referred to herein as database types, to represent XML documents. Database types include, for example, native database types, such as integer and VARCHAR (“variable length character string”), or object types defined for a database system using a DDL statements (data definition language statements).
Within a database representation of an XML document, various database objects are defined to represent and store elements and various components of an XML document. For example, each element of an XML document may be represented by a column in a table. A given XML document is stored in a row of the table. A particular element value of the XML document is stored in the row's column that represents the element.
Another example of a database representation is an object class defined by an object relational database system. Each object of the object is an instance of an object database system. An object that is an instance of the object class defines, for example, the structure corresponding to an element, and includes references or pointers to objects representing the immediate descendants of the element.
The underlying structures that comprise a database representation (e.g. table, object table, columns, object columns) are referred to as base database objects or structures. When an XML document is submitted to the DBMS for storage, it is shredded into element values, which are stored in corresponding components of the base structures. For example, an XML-based resume document would be shredded into various elements such as name, email, telephone number, work experience, and other elements. After the document is shredded, a new row is created in a resume table and each value from the shredded resume is placed in the column corresponding to the value's element.
In some instances, there may be multiple values for a particular type of element in a document. For example, a resume document has as an element “working history.” Under this particular element type, the applicant has his previous jobs. In other words, the working history element consists of a list of jobs. In such a case, a separate table may be created for that particular type of element and a reference to the separate table is added to the resume table.
To further illustrate, consider, for example, the purchaseOrder document 100 in FIG. 1. The purchaseOrder document 100 is an example of an XML document that may be submitted to a database server. The purchaseOrder document 100 is an XML document that defines a purchase order with the following elements: Reference (reference number), ShipAddrCity (shipping address city), BillAddrCity (billing address city), and LineItems. A LineItem element has additional sub-elements of ItemName, ItemQuantity (e.g., 3 CPUs and 3 Memory chips), and ItemUnitPrice.
Assuming that each element in the purchaseOrder has a corresponding column in the table schema, the document shown in FIG. 1 may be shredded and placed in an object-relational table. The scalar values from the document are placed in the columns based on their element type. Table 2 illustrates an example of an object-relational table for purchaseOrder. Note that the table for purchaseOrder is an XML type table.
TABLE 1Parent table for Purchase OrderPurchaseOrderReferenceShipAddrCityBillAddrCityLineItemRid1GHB3567OaklandBerkleyLineItem TablePtr (e.g. LineItemTable at Rid1 andRid)
In Table 2, each value from the purchaseOrder document 100 is placed in a column based on its element type. For example, reference number “GHB3567” is placed in the reference column, shipping address city “Oakland” is placed in the ShipAddrCity column, etc. At a point during the shredding, the DBMS detects that multiple values are listed for lineItems. Accordingly, a second table is used to store the lineItems. A reference is added to the purchaseOrder table, which points the secondary lineItem table. Table 3 illustrates an example of the lineItem object-relational table for purchaseOrder document 100.
TABLE 2Child table for Purchase OrderLineItemItemNoItemNameItemQuantityItemUnitPriceRid134CPU3123.45Rid268Memory323.45
Unfortunately, many XML documents and other forms of complex data do not conform to any pre-defined schema. For instance, resumes typically include some common features such name, address, and telephone number, but they often also include features that are not necessarily included in every resume. Those less common features may include, for example, items such as hobbies, publications, internship data, research projects, and references. In an object-relational model, when the DBMS attempts to import an XML document that includes data that does not conform to the defined schema, the document (or at least important data in the document) may be ignored, deleted, dumped, etc. To address this problem, developers may attempt to create a column for every different type of element or feature that may possibly be submitted to the DBMS. This strategy, however, is not very practical, because even with their best efforts, database developers may not account for every potential variation on an element name or value type. Moreover, the more columns defined in the table, the more space each new row in the table consumes. When a new row is added, the DBMS typically allocates space for an entire row, whether data is placed in each column or not. This has the potential to waste a lot of space.
As an alternative, aggregate storage techniques, such as CLOBs and BLOBs, may be used to store XML type data.
B. Aggregate Storage (LOB-Based Storage)
In aggregate storage, a complex database representation is not used to represent a XML document. XML documents are stored in aggregate form in a LOB (“large binary object”), for example, CLOBs (Character Large Objects) and BLOBs (Binary Large Objects). A base structure for storing a collection of XML documents stored in aggregate storage form (“aggregate form”) may be, for example, a collection of XML documents stored in a LOB table.
Aggregate storage is useful for storing complex data because it can be used to store data regardless of format and/or schema. For instance, when adding an XML resume to a LOB-based table, the resume is stored as one large chunk of data without performing any parsing or shredding of the content. Tables using LOB-based storage to store complex data typically contain very few actual values extracted from the complex data. Instead, the tables consist of references to the aggregate storage locations. In the resume example, when a new resume is added to LOB-based storage, a row is added to a resume table and includes a reference to point to the aggregate storage location where the new resume was stored.
These aggregate storage techniques avoid the overhead associated with analyzing, parsing, and shredding documents. Moreover, they help save storage space, since LOB-based storage makes better use of the underlying storage. In this way, they behave differently than object-relationally stored data.
The main drawback to aggregate storage techniques is that queries on the data are much more inefficient, time consuming, and resource intensive than object-relational techniques. Part of the reason for that is that the mechanisms that enhance access to object-relationally stored data are unavailable for XML type data stored in aggregate form. Other forms of storage, such as serialized tree storage, object storage, have the same problem.
II. Queries on Complex Data
Once complex data, such as an XML document, has been stored in a DBMS, the stored data may be queried. For example, given the object-relational database illustrated in Tables 1 and 2, a user may issue the following XMLTable SQL query (Query 1) to find out what the customer has ordered in purchaseOrder 100.
Query Example 1select v.ItemNo, v.ItemName, v.ItemUnitPricefrom purchaseOrder, XMLTable(‘//lineItem’ passing object_valueColumnsItemNo number path ‘@itemNo’,ItemName varchar2(40) path ‘itemName’,ItemQuantity number path ‘itemQuantity’,ItemUnitPrice number path ‘itemUnitPrice’) vwhere v.itemQuantity = 3 and (v.itemName like ‘C%’ or v.itemNamelike ‘M%’);
The query is designed to return the ItemNo, ItemName, and ItemUnitPrice of any row that meets the conditions set out in the last line of the query (e.g., that has an itemQuantity equal to 3 and that starts with “C” or “M”). Query Example 1 returns the results table illustrated in Table 3:
TABLE 3Results of QueryItemNoItemNameItemUnitPrice34CPU123.4568Memory23.45
Query Example 1 is executed efficiently because each of the values for ItemNo, ItemName, and ItemUnitPrice is stored in its own separate column in the purchaseOrder table. Performing the query involves the DBMS walking through each specified column to find the requested results. It should be noted that since there is a nested storage table for lineItems in Table 1, the DBMS follows the listed links to extract the values for the itemNo, itemName, and itemUnitPrice.
The XML purchaseOrder document 100 is stored object-relationally, so the entire realm of relational indexing technology and query optimization becomes directly applicable for XML operations. For example, relational-style B-Tree and bitmap indexes can be created on the relational columns.
Now consider what happens when XML is stored in aggregate form. If a similar query was executed on a table referencing aggregate storage, the DBMS would access the purchaseOrder table, find a reference to the aggregate storage location where purchaseOrder 100 is stored, access the aggregate storage location, and analyze the entirety of the stored data to see if the purchaseOrder has an ItemNo, ItemName, and ItemUnitPrice that meet the specified parameters of the query. For a large collection of XML documents stored this way, this process is very inefficient. The query analyzes the whole document rather than just the relevant pieces of information.
In some instances, XML indexes have been developed to improve query performance on data stored on aggregate form or other forms such as tree storage. These indexes use some variations of name-value pair storage model.
Briefly, a typical name-value pair index is organized as a physical path table containing rows corresponding to elements in the indexed documents. Each row of the path table consists of a document identifier, an order key that represents the hierarchical position of the node within the document, a path identifier corresponding to the named path from the root to the element, along with the element value. Table 4 below illustrates a typical index path table for purchaseOrder 100.
TABLE 4Path Table for PurchaseOrderElementDoc IDPath IdentifierValueOrder KeyRid1/PurchaseOrder/ShipAddrCityOakland1.2Ptr toLOBfragmentRid1/PurchaseOrder/BillAddrCityBerkeley1.3Ptr toLOBfragmentRid1/PurchaseOrder/lineItem1.4Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/@itemNo341.4.1Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/itemNameCPU1.4.2Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/itemQuantity31.4.3Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/itemUnitPrice123.451.4.4Ptr toLOBfragmentRid1/PurchaseOrder/lineItem1.5Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/@itemNo681.5.1Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/itemNameMemory1.5.2Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/itemQuantity31.5.3Ptr toLOBfragmentRid1/PurchaseOrder/lineItem/itemUnitPrice23.451.5.4Ptr toLOBfragmentRid2. . .. . .. . .. . .
As seen in the above layout of the path table, all of the elements of stored XML documents (e.g., the scalar values from XML documents, such as itemNo, itemName, itemQuantity, and itemUnitPrice) are stored horizontally as rows. To run a query like Query Example 1, the index is probed to see if it contains the requested scalar values. This query process is more efficient than the brute force technique described above; however, to find the requested values a massive number of rows may need to be scanned. These rows illustrated in Table 4 may index many elements not pertinent to evaluating a query. As a result, a massive amount of irrelevant data is scanned. Thus, even if, a XML index path table is used, the resulting index still suffers from the inefficiency of indexing a lot of rows for irrelevant data.
Thus, there is a need in the art for a solution to efficiently answer XML table queries regardless of the underlying storage technique.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.