Relational Database Management Systems (“DBMS”) typically support a wide range of data types. For example, a DBMS allows users to store and query scalar data type values such as integers, numbers, and strings. Some DBMSs have the added ability to support more complex data types. For instance, some DBMSs have the ability to support Extended Markup Language (“XML”) documents and other XML type data. Those DBMSs that include XML support allow users to define tables, or columns in a table, as an XML type (e.g., as XMLType). This added support facilitates the inclusion of more sophisticated data in DBMSs. However, with the addition of complex data types, new techniques had to be created to handle storage and access issues.
I. XML Type Storage Techniques
XML data does not naturally lend itself to conventional physical storage models in a DBMS. Thus, a variety of storage techniques can be used to manage the storage of such data. For example, two models for storing XML type data in a database include storing the data object-relationally or storing the data in aggregate form.
A. Object-Relational Storage
Storing XML type data object-relationally involves defining a complex database representation, defined in terms of data types, handled by the database system, referred to herein as database types, to represent XML documents. Database types include, for example, native database types, such as integer and VARCHAR (“variable length character string”), or object types defined for a database system using a DDL statements (data definition language statements).
Within a database representation of an XML document, various database objects are defined to represent and store elements and various components of an XML document. For example, each element of an XML document may be represented by a column in a table. A given XML document is stored in a row of the table. A particular element value of the XML document is stored in the row's column that represents the element.
Another example of a database representation is an object class defined by an object relational database system. Each object of the object class is an instance of an object database system. An object that is an instance of the object class defines, for example, the structure corresponding to an element, and includes references or pointers to objects representing the immediate descendants of the element.
The underlying structures that comprise a database representation (e.g. table, object table, columns, object columns) are referred to as base database objects or structures. When an XML document is submitted to the DBMS for storage, the XML document is “shredded” into element values, which are stored in corresponding components of the base structures. For example, an XML-based resume document would be shredded into various elements such as name, email, telephone number, work experience, and other elements. After the document is shredded, a new row is created in a resume table and each value from the shredded resume is placed in the column corresponding to the value's element.
In some instances, there may be multiple values for a particular type of element in a document. For example, a resume document has as an element “working history.” Under this particular element type, the applicant has his previous jobs. In other words, the working history element consists of a list of jobs. In such a case, a separate table may be created for that particular type of element and a reference to the separate table is added to the resume table.
To further illustrate, a purchaseOrder document 100 in FIG. 1 may be considered. PurchaseOrder document 100 is an example of an XML document that may be submitted to a database server. PurchaseOrder document 100 is an XML document that defines a purchase order with the following elements: Reference (reference number), ShipAddrCity (shipping address city), BillAddrCity (billing address city), and LineItems. A LineItem element has additional sub-elements of ItemName, ItemQuantity (e.g., 3 CPUs and 3 Memory chips), and ItemUnitPrice.
If each element in the purchaseOrder has a corresponding column in the table schema, then the document shown in FIG. 1 may be shredded and placed in an object-relational table. The scalar values from the document are placed in the columns based on their element type. Table 2 illustrates an example of an object-relational table for purchaseOrder. The table for purchaseOrder is an XML type table.
TABLE 1Parent table for Purchase OrderPurchaseOrderReferenceShipAddrCityBillAddrCityLineItemRid1GHB3567OaklandBerkleyLineItem TablePtr (e.g. LineItemTable at Rid1 andRid)
In Table 1, each value from the purchaseOrder document 100 is placed in a column based on that value's element type. For example, reference number “GHB3567” is placed in the reference column, shipping address city “Oakland” is placed in the ShipAddrCity column, etc. At a point during the shredding, the DBMS detects that multiple values are listed for lineItems. Accordingly, a second table is used to store the lineItems. A reference is added to the purchaseOrder table, which points the secondary lineItem table. Table 2 illustrates an example of the lineItem object-relational table for purchaseOrder document 100.
TABLE 2Child table for Purchase OrderLineItemItemNoItemNameItemQuantityItemUnitPriceRid134CPU3123.45Rid268Memory323.45
Unfortunately, many XML documents and other forms of complex data do not conform to the pre-defined schema. For instance, resumes typically include some common features such name, address, and telephone number, but they often also include features that are not necessarily included in every resume. Those less common features may include, for example, items such as hobbies, publications, internship data, research projects, and references. In an object-relational model, when the DBMS attempts to import an XML document that includes data that does not conform to the defined schema, the document (or at least important data in the document) may be ignored, deleted, dumped, etc. To address this problem, developers may attempt to create a column for every different type of element or feature that may possibly be submitted to the DBMS. This strategy, however, is not very practical, because even with their best efforts, database developers may not account for every potential variation on an element name or value type. Moreover, as more columns are defined in the table, each new row in the table consumes more space. When a new row is added, the DBMS typically allocates space for an entire row, regardless of whether data is placed in each column of that row. This has the potential to waste a lot of space.
As an alternative, aggregate storage techniques, such as CLOBs and BLOBs, may be used to store XML type data.
B. Aggregate Storage (LOB-Based Storage)
In aggregate storage, a complex database representation is not used to represent a XML document. XML documents are stored in aggregate form in a LOB (“large binary object”). For example, XML documents may be stored as CLOBs (Character Large Objects) and/or BLOBs (Binary Large Objects). A base structure for storing a collection of XML documents stored in aggregate storage form (“aggregate form”) may be, for example, a collection of XML documents stored in a LOB table.
Aggregate storage is useful for storing complex data because aggregate storage can be used to store data regardless of format and/or schema. For instance, when an XML resume is added to a LOB-based table, the resume is stored as one large chunk of data without performing any parsing or shredding of the content. Tables using LOB-based storage to store complex data typically contain very few actual values extracted from the complex data. Instead, the tables consist of references to the aggregate storage locations. In the resume example, when a new resume is added to LOB-based storage, a row is added to a resume table and includes a reference to point to the aggregate storage location where the new resume was stored.
These aggregate storage techniques avoid the overhead associated with analyzing, parsing, and shredding documents. Moreover, these techniques help save storage space, since LOB-based storage makes better use of the underlying storage.
The main drawback to aggregate storage techniques is that queries on the data are much more inefficient, time consuming, and resource intensive than queries made on data stored using object-relational techniques. Part of the reason for this drawback is that the mechanisms that enhance access to object-relationally stored data are unavailable for XML type data stored in aggregate form. Other forms of storage, such as serialized tree storage or object storage, have the same problem.
II. XML Table Index
Based on the foregoing problems and disadvantages, a new XML index structure has been developed, which is described in U.S. patent application Ser. No. 11/394,878, entitled TECHNIQUES OF EFFICIENT XML META-DATA QUERY USING XML TABLE INDEX, filed on Mar. 31, 2006, which is incorporated herein by reference as if fully disclosed herein. This new XML index structure is referred to herein as an “XML table index.”
An XML table index is an index that includes a table that indexes a collection of XML documents. An XML table index may itself be indexed by secondary indexes (e.g., B-tree index). In an embodiment, an XML table index is created for XML documents stored in aggregate storage. An XML table index projects out data from a collection of XML documents into column form instead of row form. For example, rather than storing values for several different elements in the same column, but in separate rows (e.g., as an XML index path table), an XML table index stores values in separate columns, one for each element, with fewer rows.
When the data are projected into column form, query techniques used for optimizing access to object-relationally stored XML type data may also be used for XML type data stored in aggregate form. Moreover, when an XML table index is generated, both the overhead of shredding the XML documents into object-relational format and the need for a well-defined XML schema are avoided. XML data may remain in aggregate form, but with the creation of XML table indexes, queries for frequently requested data run more efficiently. Thus, the principles of object-relational storage may be applied to XML documents stored in aggregate storage.
To generate an XML table index for an example Purchase Order document, a “create index” command is executed. An example “create” statement is illustrated below (Create Statement 1).
CREATE INDEX POIndex ON PurchaseOrder(object_value)IndexType is XDB.XMLTableIndexPARAMETERS(‘XMLTABLE PO_TAB_INDEX “/PurchaseOrder//lineItem”  ColumnsItemNo number path ‘@itemNo’,ItemName varchar2(40) path ‘itemName’,ItemQuantity number path ‘itemQuantity’,ItemUnitPrice number path ‘itemUnitPrice”);