Extensible Markup Language (XML) is quickly becoming the de facto standard for exchanging corporate data via structured documents, whether internally, with business partners, or via public applications across the Internet. In fact, the World Wide Web Consortium (W3C) has endorsed XML as the standard for document and data representation.
Widespread use of XML has led to the storage of XML data in many different ways. XML data exchanged today can be stored in a relational database or some other data format. In this regard, modern relational databases are capable of storing XML data “instances” within their columns, just as if the instances were any other type of data. Each instance will conform to a particular schema, which provides a format for the data supplied by an instance.
With more data passed around as XML, and more systems designed to produce it, developers need a way to provide user access to XML instances that conform to a wide variety of possible schemas. A tool that has been employed to facilitate user access to instances conforming to a variety of schemas is the user-created schema cache. This tool provides a function similar to that of a cache in other settings. The schema cache allows users to identify and store schema namespace Uniform Resource Identifiers (URIs). As a result, the identified schemas are more or less readily accessible to users when they come upon an XML instance that conforms to a schema whose namespace is stored in the cache. If many schema namespace URIs are stored, techniques generally known as schema location are employed to disambiguate between schemas that may have namespace URIs with similar properties. This tool does not, however, help users identify the schema to which any particular instance will conform. Nor does it help in searching for instances conforming to various types of schemas.
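By way of illustration only, the schema cache described above can be sketched as a mapping from namespace URIs to stored schema locations. The following is a minimal sketch in Python, not any particular product's implementation; the namespace URIs, file paths, and the names `schema_cache`, `root_namespace`, and `lookup_schema` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-created cache: namespace URI -> schema location.
# The URIs and paths are illustrative assumptions, not real resources.
schema_cache = {
    "urn:example:book": "schemas/book.xsd",
    "urn:example:dvd": "schemas/dvd.xsd",
}

def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if unqualified."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace-uri}local-name".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None

def lookup_schema(xml_text):
    """Return the cached schema location for an instance, or None on a miss."""
    return schema_cache.get(root_namespace(xml_text))

book = '<book xmlns="urn:example:book"><title>Dune</title></book>'
unknown = '<cd xmlns="urn:example:cd"><title>Kind of Blue</title></cd>'
```

Here `lookup_schema(book)` finds the cached entry, while `lookup_schema(unknown)` returns `None`: the cache only helps for namespaces a user has already stored, and it offers no way to search across stored instances.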
Developers also require ways to query XML sources for instances that conform to various schemas. One of the first tools that could be used to query these XML data sources was called XML Path Language (XPath). XPath was designed to allow navigation within an XML file by forming simple queries of a single file. Since XPath was designed to navigate and query a single XML data source, using XPath effectively to query multiple data sources requires the developer to perform complex XML document merges using XSLT 1.0 or custom programs. The XPath approach is similar to how some companies create data warehouses today—data from multiple sources is pulled together and transformed into an identical format in a central warehouse repository. Managers can then use that repository's tools to query the data.
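The single-document character of an XPath query can be illustrated with a short sketch. It assumes Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath; the catalog content is invented for the example.

```python
import xml.etree.ElementTree as ET

# One XML data source; the content is illustrative only.
catalog = ET.fromstring("""
<catalog>
  <book><title>Dune</title><author>Frank Herbert</author></book>
  <book><title>Emma</title><author>Jane Austen</author></book>
</catalog>
""")

# Navigate to every title under a book element.
titles = [t.text for t in catalog.findall("./book/title")]

# Select only the books whose author child has the given text.
austen = [b.findtext("title")
          for b in catalog.findall("./book[author='Jane Austen']")]
```

Both expressions are evaluated against the one parsed tree. Querying a second source would require parsing it separately and combining the results by hand, which is the merging burden noted above.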
XQuery was designed to solve this problem by allowing complex queries across not only multiple XML documents, but also between XML documents, relational databases, object repositories, and other unstructured documents. Going forward, XPath will focus on navigation capabilities (i.e., linking between documents or accessing a specific portion of a document) in both XQuery and XSLT. This would create a powerful tool to search, aggregate, and present data from disparate sources using a unified query language (XQuery) and a powerful transformation and display formatting language (XSL).
While exciting developments and advances have been made in the realm of querying XML data, there is a need for further advances, especially towards storing, accessing, searching, and retrieving XML data in relational databases in a reliable and flexible manner. As companies try to organize and manage an increasing volume of digital information, database systems are becoming a more critical business requirement. Relational database management systems (RDBMS) are widespread, and many companies organize their business around such a system. There are many commercial providers of relational database systems, including MICROSOFT®, IBM®, ORACLE®, SYBASE®, and others. There are also “open source” relational databases available. Relational databases are used for a multitude of operations, and relational database systems have been custom-tailored to fit every need, from keeping track of the inventory of a small business to running Web sites such as AMAZON.COM®.
Queries of relational databases containing XML are limited, however, by the way that XML data are stored in such databases. As mentioned above, XML data are typically stored as “instances,” each of which conforms to a “schema”. An XML schema provides identification and organization for the data supplied by an XML instance. Specifically, a schema identifies the fields and the relationships between the fields. Because each instance supplies data that is organized according to a specific schema, attempts to mismatch an XML schema and an XML instance will result in computing errors. As a result, XML instances have historically been validated in relational databases according to the schema to which they conform, i.e., currently, a dimension, such as a column, of a database can only be typed according to a single XML schema. Thus, only instances conforming to the particular schema can be, at present, placed in any single column of a relational database. An XML data instance that does not conform to the schema type then results in an error, notifying the developer or system that the XML data instance includes an error.
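The single-schema constraint on a column can be sketched as follows. This is a simplified illustration, not how any database actually validates XML: a comparison of the root element's namespace stands in for full schema validation, and the class `TypedXmlColumn`, the error type, and the namespace URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

class SchemaMismatchError(ValueError):
    """Raised when an instance does not match the column's schema."""

class TypedXmlColumn:
    """Hypothetical column that accepts instances of exactly one schema."""

    def __init__(self, schema_namespace):
        self.schema_namespace = schema_namespace
        self.rows = []

    def insert(self, xml_text):
        root = ET.fromstring(xml_text)
        # Assumption: the root namespace alone identifies the schema.
        # Real validation would check the whole instance against it.
        if root.tag.startswith("{"):
            namespace = root.tag[1:].split("}", 1)[0]
        else:
            namespace = None
        if namespace != self.schema_namespace:
            raise SchemaMismatchError(
                f"expected {self.schema_namespace!r}, got {namespace!r}")
        self.rows.append(xml_text)

books = TypedXmlColumn("urn:example:book")
books.insert('<book xmlns="urn:example:book"><title>Dune</title></book>')

try:
    books.insert('<dvd xmlns="urn:example:dvd"><title>Alien</title></dvd>')
    rejected = False
except SchemaMismatchError:
    rejected = True
```

The book instance is stored, while the DVD instance is rejected with an error, so an instance of a second schema cannot share the column.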
While enforcing the typing of XML instances in relational database columns according to a single schema can be advantageous in a static system, such enforcement creates a barrier for dynamically changing or evolving systems. The requirements that relational databases must satisfy to meet business needs frequently change, and the single schema may no longer be congruent with the way XML data is received, accessed, or searched in the system. Importantly, it also constrains the freedom of users to store XML instances of differing schema types in the same column. For example, consider the situation where a distributor of books and Digital Versatile Disks (DVDs) (such as AMAZON.COM®) wants to use a relational database to store product information. Using existing technologies, it is very likely that book information, or book instances (e.g., Title, Author, Publisher, Copyright, etc.), will conform to one schema while DVD instances (e.g., Title, Director, Actors, Actresses, Copyright, etc.) will conform to another schema. That is, it is likely that two separate database dimensions will be used to represent books and DVDs, one column typed according to a book schema and another column typed according to a DVD schema. Therefore, AMAZON.COM® could not search for both books and DVDs in the same column. Multiple columns will have to be queried, generating greater search complexity, a corresponding increase in computational time and bandwidth, and additional opportunity for user error.
Accordingly, a heretofore unaddressed need exists in the industry to address the aforementioned deficiencies and inadequacies in the art.