1. Field of the Invention
The present invention is directed to the storage, management and indexing of structured documents, such as documents written in XML (eXtensible Markup Language) format.
2. Description of the Prior Art
By way of background, large-volume transaction-oriented enterprises, such as financial institutions, wholesale distributors, and retail chains, often store large quantities of data in a structured document format, such as XML. A typical structured document organizes the document's data elements and attributes as a set of nodes that are logically ordered in a hierarchical tree structure. Maintaining documents in this way contributes to their portability and ease of use.
Structured documents can be made accessible to an enterprise by associating them with database operations. However, with large volumes of transactions, using databases to work with structured document data presents challenges for performance, scalability, and simplicity. In particular, no single existing solution or implementation addresses two major issues associated with structured document processing, namely, (1) how to efficiently store voluminous structured documents, and (2) how to achieve fast access to arbitrarily selected parts of the structured document hierarchy, such as a node or sets of nodes within a document.
One category of existing XML document storage and management system relies on XML-to-relational mapping. In this case, XML documents are shredded (disassembled) into a set of relational tables managed by a traditional RDBMS (Relational DataBase Management System). Some disadvantages of this approach include:
        XML-to-relational mapping is a complex task. It is difficult to represent XML document hierarchy in the relational database model because the two models are very different.
        Often, the original document needs to be transformed, typically through XSLT (eXtensible Stylesheet Language Transformations), to inject artificial keys that are then used by the RDBMS to preserve the one-to-one and one-to-many relationships among the document's various nodes. This transformation process is resource intensive, and it changes the structure of the original document. Recreating the original document requires a reverse process that removes the injected nodes.
        Disassembly of complex documents may require dozens or even hundreds of relational tables connected by complex referential integrity (RI) rules.
        Once shredded, an XML document cannot be easily reassembled.
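As an illustration of the shredding approach described above, the following minimal sketch disassembles a small XML document into two relational tables and then reassembles it. The document layout, the table names (orders, items), the surrogate key columns (order_key, item_key), and the helper names shred and reassemble are all hypothetical and chosen only for this example; it is not the mapping used by any particular prior art system. Here the surrogate keys are generated on the database side rather than injected into the document through XSLT.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
ORDER_XML = """<order id="A17">
  <customer>Acme</customer>
  <item sku="X1"><qty>2</qty></item>
  <item sku="X2"><qty>5</qty></item>
</order>"""


def shred(xml_text, conn):
    """Disassemble one order document into two relational tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_key INTEGER PRIMARY KEY, id TEXT, customer TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "item_key INTEGER PRIMARY KEY, "
        "order_key INTEGER REFERENCES orders(order_key), "
        "sku TEXT, qty INTEGER)"
    )
    root = ET.fromstring(xml_text)
    cur = conn.execute(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        (root.get("id"), root.findtext("customer")),
    )
    # Artificial key that preserves the one-to-many order/item relationship.
    order_key = cur.lastrowid
    for item in root.findall("item"):
        conn.execute(
            "INSERT INTO items (order_key, sku, qty) VALUES (?, ?, ?)",
            (order_key, item.get("sku"), int(item.findtext("qty"))),
        )
    return order_key


def reassemble(conn, order_key):
    """Rebuild the document tree from the relational rows."""
    order_id, customer = conn.execute(
        "SELECT id, customer FROM orders WHERE order_key = ?", (order_key,)
    ).fetchone()
    root = ET.Element("order", {"id": order_id})
    ET.SubElement(root, "customer").text = customer
    rows = conn.execute(
        "SELECT sku, qty FROM items WHERE order_key = ? ORDER BY item_key",
        (order_key,),
    )
    for sku, qty in rows:
        item = ET.SubElement(root, "item", {"sku": sku})
        ET.SubElement(item, "qty").text = str(qty)
    return root


conn = sqlite3.connect(":memory:")
key = shred(ORDER_XML, conn)
rebuilt = reassemble(conn, key)
```

Even for this two-level document, the mapping needs one table per repeating element, a surrogate key to tie child rows to their parent, and hand-written reassembly logic. Each additional level of nesting would add further tables and joins, which is the complexity the disadvantages above refer to.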
Another category of existing XML document storage and management system stores XML documents in an internal parsed format. There are hybrid (XML/relational) databases as well as native XML databases that implement this approach. In order to facilitate fast access, these systems offer indexing capabilities. Typically, an index is created over one specific XPath (XML Path language) expression that selects just a small subset of nodes in the document. All other nodes are not indexed and thus sequential access is required. Some of the disadvantages of solutions that exist in this area include:
        Parsed XML structures may have a large footprint. The text nodes are stored within the document. Each instance of a document is represented by a separate structure, typically a DOM (Document Object Model)-like structure. Often, the XML documents are stored in a designated column in a table.
        Data modification (insert/delete/update) of existing documents is difficult and very resource intensive because the complex internal representation needs to be modified for each document. Potentially there may be millions of such documents stored in the database.
        Indexes are built over parts of documents. The index structure is separate from the structure where the document is stored, so that redundant data is stored and maintained. Tuning the database performance requires that multiple indexes be built over the same document. Maintaining these indexes is costly in terms of system resources. This technique is also not well suited for systems with ad hoc queries that may involve nodes for which no indexes exist.
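The path-specific indexing described above can be sketched as follows. This is an illustrative assumption rather than the design of any particular database: the sample documents and the helper names build_path_index and sequential_scan are hypothetical, and Python's ElementTree module supports only a limited subset of XPath. The sketch builds an index over a single path and shows that a query on any other path falls back to parsing and scanning every stored document.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored documents, keyed by document identifier.
DOCS = {
    1: "<order><customer>Acme</customer><total>40</total></order>",
    2: "<order><customer>Bolt</customer><total>15</total></order>",
    3: "<order><customer>Acme</customer><total>22</total></order>",
}


def build_path_index(docs, path):
    """Map each text value found at `path` to the set of matching doc ids.

    The index lives in a structure separate from the documents and
    duplicates the indexed values.
    """
    index = {}
    for doc_id, text in docs.items():
        for node in ET.fromstring(text).findall(path):
            index.setdefault(node.text, set()).add(doc_id)
    return index


def sequential_scan(docs, path, value):
    """Fallback for a path with no index: parse and inspect every document."""
    return {
        doc_id
        for doc_id, text in docs.items()
        if any(n.text == value for n in ET.fromstring(text).findall(path))
    }


# One index over one path expression: lookups on "customer" are direct.
customer_index = build_path_index(DOCS, "customer")
acme_docs = customer_index.get("Acme", set())

# An ad hoc query on "total" has no index and must scan all documents.
forty_docs = sequential_scan(DOCS, "total", "40")
```

Making the second query fast would require building and maintaining a second index over "total", which illustrates why tuning such systems tends to multiply indexes over the same documents.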
Accordingly, a need exists for an improved technique for the storage, management and indexing of structured documents. What is required is a solution that allows structured documents to be stored, organized, and searched using minimal storage and processing resources while providing superior query response performance.