XML has emerged as the dominant standard for data representation and exchange over the Internet. Its nested, self-describing structure provides a simple yet flexible means for applications to model and exchange data. For example, a business can easily model complex structures such as purchase orders in XML form and send them for further processing to its business partners. As another example, all of Shakespeare's plays can be marked up and stored as XML documents so that information such as the beginning of a new section, or the names of the speakers, can be semantically captured as XML tags. In fact, there are already many industry proposals to standardize XML document structures for domains as diverse as electronic commerce and real estate.
With a large amount of data represented as XML documents, it becomes necessary to store and query these XML documents. For example, a business that receives XML purchase orders may need to store these purchase orders, and later query them to see which items need to be shipped.
To address the problem of storing and querying XML documents, there has been some work done on building native XML database systems. These database systems are built from scratch for the specific purpose of storing and querying XML documents. This approach suffers from the potential disadvantage that native XML database systems do not harness the sophisticated storage and query capability already supported in existing relational database systems.
To overcome this limitation, techniques have been proposed for storing and querying XML documents using relational database management systems (RDBMSs). However, most of them concentrate on storing and querying XML documents without knowledge of the schema associated with those documents. See, for example, A. Deutsch, M. Fernandez, D. Suciu, “Storing Semi-structured Data with STORED”, Proceedings of the SIGMOD Conference, Philadelphia, Pa., May 1999 and D. Florescu, D. Kossman, “Storing and Querying XML Data using an RDBMS”, IEEE Data Engineering Bulletin, 22(3), pp. 27–34, 1999. There is one known technique that exploits schema information, in the form of Document Type Definitions (DTDs), for storing and querying XML documents. See J. Shanmugasundaram, et al., “Relational Databases for Querying XML Documents: Limitations and Opportunities”, Proceedings of the VLDB Conference, Edinburgh, Scotland, September 1999, which is hereby incorporated by reference and referred to hereafter as VLDB99. DTDs specify the structure of XML documents. See World Wide Web Consortium, “Extensible Markup Language (XML) 1.0 (Second Edition)”, W3C Recommendation, October 2000, www.w3c.org/TR/REC-xml for more information on DTDs.
These techniques generally work by following a specific set of steps. The first step is relational schema generation, where relational tables are created for the purpose of storing XML documents. The next step is XML document shredding, where XML documents are “stored” by shredding them into rows of the tables that were created in the first step. The final step is XML query processing, where XML queries over the “stored” XML documents are converted into SQL queries over the created tables. The SQL query results are then tagged to produce the desired XML result.
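The three steps above can be illustrated with a minimal sketch, assuming a hypothetical purchase-order document and a hand-written relational schema. The element names, table names, and the XML query are illustrative assumptions, not taken from any of the cited techniques, and the translation from the XML query to SQL is done by hand rather than by a general algorithm.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical purchase order; element and attribute names are assumptions.
DOC = """
<PurchaseOrder id="1">
  <Item sku="A1"><Name>Widget</Name><Shipped>false</Shipped></Item>
  <Item sku="B2"><Name>Gadget</Name><Shipped>true</Shipped></Item>
</PurchaseOrder>
"""

conn = sqlite3.connect(":memory:")

# Step 1: relational schema generation (written by hand for this sketch).
conn.executescript("""
CREATE TABLE purchase_order (po_id INTEGER PRIMARY KEY);
CREATE TABLE item (
    sku     TEXT,
    name    TEXT,
    shipped INTEGER,
    po_id   INTEGER REFERENCES purchase_order(po_id)
);
""")

# Step 2: XML document shredding. Each element becomes a row in its table.
root = ET.fromstring(DOC)
po_id = int(root.get("id"))
conn.execute("INSERT INTO purchase_order VALUES (?)", (po_id,))
for item in root.findall("Item"):
    shipped = 1 if item.findtext("Shipped") == "true" else 0
    conn.execute(
        "INSERT INTO item VALUES (?, ?, ?, ?)",
        (item.get("sku"), item.findtext("Name"), shipped, po_id),
    )

# Step 3: XML query processing. The XML query
#   /PurchaseOrder/Item[Shipped='false']/Name
# is translated (by hand here) into SQL over the created tables.
rows = conn.execute(
    "SELECT name FROM item WHERE shipped = 0 ORDER BY sku"
).fetchall()

# The SQL result rows are then tagged to produce the XML result.
result = ET.Element("Result")
for (name,) in rows:
    ET.SubElement(result, "Name").text = name
xml_result = ET.tostring(result, encoding="unicode")
print(xml_result)
```

In this sketch the query "which items still need to be shipped" never touches the original XML text; it is answered entirely by the relational engine, which is the point of the approach.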
A brief introduction to the XML Schema standard is now provided, and the various issues regarding generating relational schemas are discussed. XML Schema is a World Wide Web Consortium (W3C) recommendation for defining the structure and content of XML documents. See “XML Schema Parts 0, 1 and 2”, W3C Candidate Recommendation, October 2000, at www.w3c.org/TR/xmlschema-1, www.w3c.org/TR/xmlschema-2, and www.w3c.org/TR/xmlschema-3 for more information on XML Schema. The XML Schema standard is strictly more expressive than DTDs, and includes several useful features such as typing, inheritance, equivalence classes, and integrity constraints, which are not present in DTDs. Specifically, the main features of the XML Schema specification that distinguish it from DTDs are:
1) XML Schemas are specified in XML syntax.
2) XML Schemas have enhanced data types, as opposed to only character strings supported in DTDs.
3) XML Schemas have support for namespaces.
4) XML Schemas separate the name of an XML element from the name of its type. This is done using the notion of local namespaces. As a result, two distinct XML elements occurring in an XML document can have the same name and different types.
5) XML Schemas support inheritance, so new types can be defined by extending or restricting existing types. An instance of a derived type can occur whenever an instance of its ancestor type can appear in an XML document.
6) XML Schemas allow the creation of equivalence classes among elements. Each equivalence class has an exemplar and any element in the class can replace the exemplar in instance XML documents.
7) XML Schemas support identity constraints. These are more powerful than the IDs and IDREFs supported in DTDs. Identity constraints allow constraints to be specified on any element or attribute, regardless of its type. Constraints can be locally scoped and the constraints can be based on a combination of element and attribute content.
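Features 4 and 5 can be made concrete with a small schema fragment. The following is a sketch only: the type and element names (Address, USAddress, Product, Region, code) are hypothetical, and the fragment assumes the namespace URI of the final XML Schema Recommendation rather than that of the Candidate Recommendation cited above. The Python code simply inspects the fragment to show that one element name maps to two types and that one type is derived from another.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the final XML Schema Recommendation (an assumption here).
XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:complexType name="Address">
    <xs:sequence>
      <xs:element name="street" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <!-- Feature 5: USAddress is derived from Address by extension. -->
  <xs:complexType name="USAddress">
    <xs:complexContent>
      <xs:extension base="Address">
        <xs:sequence>
          <xs:element name="zip" type="xs:integer"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <!-- Feature 4: two local elements named "code" with different types. -->
  <xs:complexType name="Product">
    <xs:sequence><xs:element name="code" type="xs:string"/></xs:sequence>
  </xs:complexType>
  <xs:complexType name="Region">
    <xs:sequence><xs:element name="code" type="xs:integer"/></xs:sequence>
  </xs:complexType>
</xs:schema>
"""

root = ET.fromstring(SCHEMA)
ns = {"xs": XS}

# Map each local element name to the set of types it is declared with.
types_by_name = {}
for ctype in root.findall("xs:complexType", ns):
    for elem in ctype.iter(f"{{{XS}}}element"):
        types_by_name.setdefault(elem.get("name"), set()).add(elem.get("type"))

# Map each derived type to the base type it extends.
bases = {}
for ctype in root.findall("xs:complexType", ns):
    ext = ctype.find("xs:complexContent/xs:extension", ns)
    if ext is not None:
        bases[ctype.get("name")] = ext.get("base")

print(sorted(types_by_name["code"]))
print(bases)
```

Both properties matter for relational schema generation: a table cannot be chosen from the element name alone when the same name carries different types, and a column or table that holds an Address must also be able to hold a USAddress.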
There has been some recent work on storing XML in an RDBMS using XML Schema information. In P. Bohannon et al., “From XML Schema to Relations: A Cost-Based Approach to XML Storage”, IEEE ICDE, 2002, the authors propose a cost-based approach for creating a relational schema using an XML schema, XML statistics and an XML Query workload. While their approach handles the differentiation between elements and types in XML Schema, it does not address (i) simplification of complex XML Schema types, (ii) recursion in the schema, or (iii) inheritance and XML constraints.
In S. Davidson et al., “Propagating XML Constraints to Relations”, IEEE ICDE, 2003 (to appear), the authors propose a framework for refining the relational design of XML storage based on XML key propagation. They use the key information while deciding the relational schema, but do not use the DTD or XML Schema information.
There has also been some work on preserving the semantics of the XML data while storing it in an RDBMS. In D. Lee et al., “Constraints-Preserving Transformation from XML Document Type Definition to Relational Schema”, International Conference on Conceptual Modeling/the Entity Relationship Approach, 2000, semantic constraints are derived from the DTD and are translated into equivalent relational constraints. In Y. Chen et al., “Constraint Preserving XML Storage in Relations”, WebDB, 2002, the authors preserve the semantic information implied by the key/keyref information in the relational schema through the use of constraint relations.
Whatever the precise merits, features, and advantages of the references cited above, none of them achieves or fulfills the purposes of the present invention. A method of storing XML documents in a relational database system by generating relational schemas that exploit the additional features of XML Schema, to answer XML queries efficiently, is therefore needed.