Structured documents are documents that have nested structures. Documents written in Extensible Markup Language (XML) are one example. XML is quickly becoming the standard format for delivering information over the Internet because it allows the user to design a customized markup language for many classes of structured documents. For example, a business can easily model complex structures such as purchase orders in XML format and send them to its business partners for further processing. XML supports user-defined tags for better description of nested document structures and associated semantics, and encourages the separation of document content from browser presentation.
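By way of illustration only, the following minimal Python sketch shows how a purchase order might be modeled as a nested XML document with user-defined tags. The element names, attribute names, and values (`purchaseOrder`, `buyer`, `items`, `item`, `sku`, and so on) and the helper `outline` are hypothetical and are chosen solely for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order; every tag and attribute name is user-defined.
PURCHASE_ORDER = """
<purchaseOrder number="1001">
  <buyer>Example Corp</buyer>
  <items>
    <item sku="A-17"><description>Widget</description><quantity>3</quantity></item>
    <item sku="B-02"><description>Gasket</description><quantity>12</quantity></item>
  </items>
</purchaseOrder>
"""


def outline(element, depth=0):
    """Return the element names of a document, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PURCHASE_ORDER.strip())
print("\n".join(outline(root)))
```

The printed outline shows the nesting of the document. Note that the names `item`, `description`, and `quantity` each appear once per order line, so the same strings recur throughout the document.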
As more and more businesses present and exchange data in XML documents, database management systems (DBMSes) have been developed to store, query, and retrieve these documents, which are typically kept on direct access storage devices (DASD), such as magnetic or optical disk drives, for semi-permanent storage. Some DBMSes, known as relational databases, store and query the documents utilizing relational techniques, while other DBMSes, known as native databases, store the documents in their native formats.
As stated above, one attractive feature of XML is that it allows the user to design a customized markup language for many classes of structured documents. The user can select element and attribute names that are relevant to and descriptive of the particular class of document. While this provides the user with great flexibility, it also presents problems for database processing. Evaluating strings corresponding to element names is costly because the strings are of varying length and the database processor must, at a minimum, perform length checks. Such length checks add cost at runtime and also complicate program coding. In addition, processing variable-length strings complicates memory management in the database. Moreover, in order to store XML documents in their native format in a database, the element and attribute names, along with other strings, must be stored on disk. For a collection of large documents, the amount of disk space required can be quite large, and correspondingly expensive.
To alleviate the storage problem, it is common to compress a document in order to reduce the amount of space required to store it. A typical compression method involves replacing certain strings with numbers and storing the mapping information in a file-specific or document-specific table. The mapping table is stored at the front of the file or document. While this method reduces the size of the document, it also presents several disadvantages in the storage and processing of such documents. First, storing the mapping information in each file or document requires additional disk space for each one. Second, because each compressed file or document is associated with its own mapping table, the number assigned to a string in one compressed file or document does not necessarily correspond to the same string in a different compressed file or document. Accordingly, because the numbers are not consistent throughout the database, the numbers cannot be used for purposes beyond document compression. Instead, the compressed documents must be decompressed before they can be processed. Once a document is decompressed, i.e., once the numbers are replaced with the associated strings, the query processor is still required to evaluate strings.
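The inconsistency described above can be illustrated with a minimal Python sketch of document-specific compression. It is a simplified, hypothetical rendering and does not reproduce any particular compression product: only element names are replaced, numbers are assigned in order of first appearance, and the names `compress`, `decompress`, `doc_a`, and `doc_b` are assumptions made for this example.

```python
import re

# Matches the name in a start tag or end tag, e.g. "<item" or "</item".
TAG_PATTERN = re.compile(r"</?([A-Za-z_][\w.-]*)")


def compress(document):
    """Replace each element name with a number from a document-specific table."""
    table = {}  # element name -> number, valid for this document only

    def substitute(match):
        name = match.group(1)
        number = table.setdefault(name, len(table))
        prefix = match.group(0)[: -len(name)]  # "<" or "</"
        return f"{prefix}{number}"

    body = TAG_PATTERN.sub(substitute, document)
    return table, body


def decompress(table, body):
    """Restore the element names using the table stored with the document."""
    names = {number: name for name, number in table.items()}
    return re.sub(
        r"(</?)(\d+)",
        lambda m: m.group(1) + names[int(m.group(2))],
        body,
    )


doc_a = "<order><item>pen</item><price>2</price></order>"
doc_b = "<invoice><price>2</price><item>pen</item></invoice>"

table_a, body_a = compress(doc_a)
table_b, body_b = compress(doc_b)

print(table_a)  # {'order': 0, 'item': 1, 'price': 2}
print(table_b)  # {'invoice': 0, 'price': 1, 'item': 2}
print(body_a)   # <0><1>pen</1><2>2</2></0>
```

In this sketch the number 1 denotes `item` in the first document but `price` in the second, so a query for `item` elements cannot be evaluated against the numbers directly. Each document must first be passed through `decompress` with its own table, after which the query processor is again comparing strings.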
Accordingly, a need exists for an improved method and system for processing structured documents stored in a database. The method and system should reduce the size of a structured document for storage, while supporting homogeneous document processing. The present invention addresses such a need.