1. Field of the Invention
The invention relates to information storage and retrieval systems, methods and articles of manufacture. More particularly, it relates to handling information contained in a markup language document using database tools and techniques.
2. Description of the Related Art
The Extensible Markup Language (XML) is a markup language that uses tags to designate data. XML was created as a data exchange and representation standard that provides techniques for storing complex data structures hierarchically and in a form suitable for exchange over the Internet. An XML document can be a file or a data stream containing nested elements, or nodes, starting with a root node. Other nodes are nested below the root node in a hierarchical fashion, such as in a parent-child relation, and further nodes can be nested below those.
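The nesting described above can be illustrated with a minimal sketch. The document content, the element names (customers, customer, name, order, id) and the helper walk are hypothetical and chosen only for illustration; the sketch uses Python's standard xml.etree.ElementTree module to list each node with its depth below the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: a root node with nodes nested below it.
DOC = """<customers>
  <customer>
    <name>Alice</name>
    <order><id>1</id></order>
    <order><id>2</id></order>
  </customer>
</customers>"""


def walk(node, depth=0):
    """Yield (depth, tag) for a node and, recursively, for its children."""
    yield depth, node.tag
    for child in node:
        yield from walk(child, depth + 1)


root = ET.fromstring(DOC)
outline = list(walk(root))

# Print an indented outline showing the parent-child relations.
for depth, tag in outline:
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one parent-child step away from the root node.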
Methods of integrating XML data with other data generally fall into two groups. In the first, the XML data is copied from its original location and stored persistently in a centralized database. In the second, the XML data is stored persistently outside the centralized database, in one or more external stores, and is brought to the centralized database only in response to specific application requests. There are various tradeoffs between the two approaches. The second approach has certain advantages over the first, including (1) avoiding the need to replicate, in the centralized database, special functionality of the backend sources of the XML data, and (2) returning current data in response to queries, since that data comes directly from the source. However, with both conventional approaches the entire XML schema is mapped to a single table, and accordingly the output from the XML source is flat. When the XML data is flattened into a single table, data values can be repeated in many tuples. For example, when an XML document holding customer names and the orders those customers place is flattened into a single table, each customer name appears with every order associated with that customer, so the name is repeated many times.
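The repetition caused by flattening can be shown with a small sketch. It is an illustration only, not the mapping used by any particular system: the sample document, its element names (customer, name, order, item) and the function flatten are assumptions introduced here. Each nested order produces one tuple in the single flat table, and the customer name is copied into every such tuple.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer/order document used only for illustration.
DOC = """<customers>
  <customer><name>Alice</name>
    <order><item>pen</item></order>
    <order><item>ink</item></order>
    <order><item>pad</item></order>
  </customer>
  <customer><name>Bob</name>
    <order><item>cup</item></order>
  </customer>
</customers>"""


def flatten(root):
    """Map the nested document to one flat table of (name, item) tuples."""
    rows = []
    for customer in root.findall("customer"):
        name = customer.findtext("name")
        for order in customer.findall("order"):
            # The customer name is repeated once per order.
            rows.append((name, order.findtext("item")))
    return rows


flat_rows = flatten(ET.fromstring(DOC))
for row in flat_rows:
    print(row)
```

In this sample the name Alice is stored once in the XML document but appears in three tuples of the flat table, one for each of her orders.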
With either of these conventional approaches, the XML schema is mapped into a flat space before a query operates on the XML information. All the data requested by the query must pass through the database management system, and with the data flattened into a single table this can be a large volume of data because of the repeated information. Such a flat mapping operation can be expensive: it can take a long time to map the data into the flat space, and the process can consume a large amount of memory. Further, the number of operations performed over the XML data values is increased, because those operations must be performed over the repeated data values in the single table. Further still, with the data flattened into a single table, a query optimizer cannot be used to unnest the nested XML elements in a just-in-time manner.
Accordingly, there is a need to extract XML data from a data source into a plurality of tables in a just-in-time manner to reduce the volume of data that must pass through a database management system.