In recent years, XML has increasingly been used to represent various kinds of content, including structured data, semi-structured data, and unstructured documents. In addition, XML documents are being stored and managed within database systems, where the XML data in the documents can be queried.
In many cases, the XML documents stored in a database system can be quite large. It is not uncommon for an XML document to require hundreds of megabytes of storage. Furthermore, the number of XML documents stored in a database system may also be very large, numbering into the millions. In general, database systems are not able to provide efficient support for querying, managing and updating such large collections of large XML documents.
XML documents that are stored and managed in a relational database are typically stored as unstructured serialized data in some form of a LOB (Large Object) datatype. For example, an XML document may be stored in a CLOB (Character LOB) or a BLOB (Binary LOB) column in a relational table. Unfortunately, several problems arise when such tables contain large numbers of large LOBs. In particular, both executing queries against the stored XML data and updating the stored XML data are problematic.
Most known methods for querying XML data include some variation of XPath. XPath is a language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the document's logical structure or hierarchy. The portion of an XML document identified by an XPath “path expression” is the portion that resides, within the structure of the XML document, at the end of any path that matches the path expression.
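As a minimal sketch of how a path expression addresses part of a document, the following Python example uses the standard library's xml.etree.ElementTree module, which supports only a limited subset of XPath. The sample document and its element names (PurchaseOrder, LineItems, LineItem, Part) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
DOC = """
<PurchaseOrder>
  <Reference>PO-1</Reference>
  <LineItems>
    <LineItem><Part>A100</Part><Quantity>2</Quantity></LineItem>
    <LineItem><Part>B200</Part><Quantity>5</Quantity></LineItem>
  </LineItems>
</PurchaseOrder>
"""

root = ET.fromstring(DOC)

# ElementTree paths are relative to the root element, so this path plays the
# role of the XPath expression /PurchaseOrder/LineItems/LineItem/Part.
parts = [elem.text for elem in root.findall("./LineItems/LineItem/Part")]
print(parts)
```

Each element returned sits at the end of a path through the document hierarchy that matches the expression, which is the behavior described above.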
A query that uses a path expression to identify one or more specific pieces of XML data is referred to herein as a path-based query. The process of determining which XML data corresponds to the path designated in a path-based query is referred to as “evaluating” the path expression. To evaluate a path-based query, the database system finds all fragments in all XML documents stored in the database system that match the path expression.
If the schema of the stored documents is not known, a database system may use ad-hoc mechanisms to evaluate path-based queries. For example, a database system may satisfy an XPath query by performing a full scan of all stored XML documents to find all fragments in all documents that match a given XPath expression. While all path-based queries can be evaluated with a full scan of all stored XML documents, such an approach is very slow, even if only a small number of documents actually match the path expression.
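The cost of a full scan can be illustrated with a small sketch. Here a Python dictionary stands in for a table whose column holds serialized XML; the dictionary, the function name full_scan, and the sample documents are all assumptions made for illustration, not part of any particular database system.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a table with a CLOB column: id -> serialized XML.
STORED_DOCS = {
    1: "<order><customer>Ann</customer><item>pen</item></order>",
    2: "<invoice><total>10</total></invoice>",
    3: "<order><item>ink</item><item>pad</item></order>",
}

def full_scan(docs, root_tag, rel_path):
    """Evaluate /root_tag/rel_path by parsing every stored document."""
    matches = []
    parsed = 0
    for doc_id, text in docs.items():
        root = ET.fromstring(text)  # every document is parsed, match or not
        parsed += 1
        if root.tag != root_tag:
            continue  # the work of parsing this document was wasted
        for elem in root.findall(rel_path):
            matches.append((doc_id, elem.text))
    return matches, parsed

# Corresponds to the path expression /order/item.
matches, parsed = full_scan(STORED_DOCS, "order", "./item")
print(matches, parsed)
```

The number of documents parsed always equals the number of documents stored, regardless of how few of them contain a match, which is why this approach scales poorly.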
Database indexes enable data to be searched without a sequential scan of all of the data. However, even with secondary indexes and XML indexes, the performance of a path-based query can be quite poor because the indexes themselves can become very large. For example, when an index is implemented using a B-tree, a large number of entries can cause significant degradation in index performance as the number of levels in the B-tree increases.
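The relationship between the number of index entries and the number of B-tree levels can be approximated with a simplified model. The sketch below assumes every node is completely full and has a fixed fanout of 100 entries per node; both are illustrative assumptions, and real B-tree implementations vary in node size and fill factor.

```python
def btree_levels(num_entries, fanout):
    """Smallest number of levels h such that fanout**h >= num_entries.

    Simplified model: every node is full and holds `fanout` entries.
    """
    levels = 1
    capacity = fanout
    while capacity < num_entries:
        capacity *= fanout
        levels += 1
    return levels

# Assumed fanout of 100; the entry counts are illustrative.
levels = {n: btree_levels(n, 100)
          for n in (10_000, 1_000_000, 100_000_000, 1_000_000_000)}
for n, h in levels.items():
    print(f"{n:>13,} entries -> {h} levels")
```

Under these assumptions each additional level corresponds to one more node visited per lookup, so an index over the fragments of millions of large documents requires more node accesses per probe than an index over a small collection.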
In addition to poor query performance, updating an XML document in a database system can be difficult. For example, when a user updates a small portion of a large XML document that is stored in a database system, typically the entire document needs to be updated with the new values. In addition to significant performance impact, this also generates a large amount of database logging information needed to maintain the transactional property of databases. Furthermore, when a user updates an XML document stored in a CLOB column, the entire document is “locked” until the transaction is committed. That is, no other user is allowed to update the same document during this period of time, even if the other user desires to update a completely different or unrelated portion of the XML document. The constraints imposed by locking severely limit the concurrency of XML-based applications.
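The whole-document update pattern can be sketched as follows. A Python string stands in for the value of a CLOB column, and the function name update_qty and the element names are hypothetical. The sketch does not model logging or locking; it only shows that changing one small value involves parsing and rewriting the entire serialized document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with 1000 items, stored as one serialized value.
items = "".join(f"<item id='{i}'><qty>1</qty></item>" for i in range(1000))
stored = f"<order>{items}</order>"

def update_qty(serialized, item_id, new_qty):
    """Change one quantity by parsing and re-serializing the whole document."""
    root = ET.fromstring(serialized)           # parse the entire document
    node = root.find(f"./item[@id='{item_id}']/qty")
    node.text = str(new_qty)                   # the actual change is tiny
    return ET.tostring(root, encoding="unicode")  # rewrite the entire document

new_stored = update_qty(stored, 500, 7)
print(f"characters rewritten: {len(new_stored)}")
```

Although only a single character of content changes, the value written back is the full document, and a system that stores the document as one value would log and lock it as a unit.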
If an XML document conforms to a known, well-defined schema, techniques to “shred” the document into relational database tables, columns and rows are known. Shredding allows the structure and data types used in XML documents to be exploited to optimize XPath queries, because queries can take advantage of well-known relational database techniques once the data is in relational database tables, rows and columns. In addition, updating XML data in relational database tables is straightforward. However, while XML shredding provides a solution to some of the XML data management issues described above, known shredding techniques have several limitations.
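A minimal sketch of shredding, assuming a small and fully known schema, is shown below using Python's sqlite3 and xml.etree.ElementTree modules. The table names (orders, items), the column layout, and the shred function are hypothetical; a real shredding process would derive the tables from the schema rather than hard-coding them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document conforming to a known schema.
DOC = """<order id="42">
  <customer>Ann</customer>
  <item sku="A100"><qty>2</qty></item>
  <item sku="B200"><qty>5</qty></item>
</order>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT, qty INTEGER)")

def shred(conn, xml_text):
    """Place each part of the document in the table its schema position maps to."""
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    conn.execute("INSERT INTO orders VALUES (?, ?)",
                 (order_id, root.findtext("customer")))
    for item in root.findall("item"):
        conn.execute("INSERT INTO items VALUES (?, ?, ?)",
                     (order_id, item.get("sku"), int(item.findtext("qty"))))

shred(conn, DOC)

# A path-based query such as /order/item[qty > 3]/@sku becomes ordinary SQL.
skus = [row[0] for row in conn.execute("SELECT sku FROM items WHERE qty > 3")]

# An update touches a single row instead of the whole document.
conn.execute("UPDATE items SET qty = 9 WHERE order_id = 42 AND sku = 'A100'")
new_qty = conn.execute(
    "SELECT qty FROM items WHERE order_id = 42 AND sku = 'A100'").fetchone()[0]
print(skus, new_qty)
```

Note that the shred function works only because the element and attribute names are known in advance, and that every part of the document is placed in a table whether or not it is ever queried.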
Known shredding techniques require a well-defined schema. In the absence of a well-defined schema, known shredding processes cannot determine in which tables, rows and columns to place the XML data. In addition, known shredding techniques may not work if the XML documents conform to many different schemas.
Furthermore, the table, row, and column format is rigidly determined by the schema, and placement of data within the tables, rows and columns is inflexible. Generally, all data in an XML document is automatically shredded into tables according to the document's schema. The number of tables and columns is tightly correlated with the complexity of the schema, for example with the number of element definitions it contains. Typically, each section of an XML document is stored in a separate table. In the case of complex schemas, known shredding techniques result in an unmanageably large number of tables with numerous columns. The proliferation of tables presents serious query and update problems.
With known shredding techniques, it is not possible to “shred” only the portions of the XML documents that are more likely to be used in queries. If an XML document is shredded, all XML data in the document is shredded according to the document's schema.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.