1. Field of the Invention
The invention relates to a system and method for processing queries directed to structured documents. In addition, the invention relates to a system and method for processing a set of queries against an extensible markup language (XML) document.
2. Description of the Related Art
Hypertext markup language (HTML) documents have become one of the most common forms of data interchanged over the Internet. HTML provides a document with a mechanism to describe how the document relates to other documents, through hyperlinks. HTML also provides mechanisms for describing how to visually present data, including text formatting, lists, and tables. Many Internet applications require the automated exchange of documents containing data between two or more computers. A common document format that allows for the description of the logical structure and interrelationships of the data within a document is thus required. However, HTML does not provide a general mechanism for an HTML document to express the logical structure and interrelationships of the underlying data represented by the HTML document.
To address this shortcoming, extensible markup language (XML) has been developed. XML provides a mechanism to represent data in a way that retains the logical structure and interrelationship of the underlying data. Thus, an XML document, rather than merely being a human readable representation of data, comprises a database. Moreover, an XML document may be constructed to conform to a document type definition (DTD). A DTD is a formal description of a particular type of document. It sets forth what elements the particular type of document may contain, the structure of the elements, and the interrelationship of the elements. XML documents, particularly those which conform to a well-known or standardized DTD, thus provide a convenient means of data exchange between computer programs in general, and on the Internet in particular.
One typical method of processing XML documents is based on performing queries against the XML documents to locate information within the documents. XPath is a standardized language for expressing XML queries. See, e.g., JOHN W. SIMPSON, XPATH AND XPOINTER (O'Reilly, 2002), herein incorporated by reference in its entirety. An XPath query is a string of characters that represents a hierarchical description of the elements and attributes for which an XML document is to be searched. An XPath query expression includes one or more path components, or subexpressions. The structure of an XML document may be represented by a directed graph or a tree in which the elements of the document are nodes. Thus, the result of an XPath query is generally a set of nodes within the directed graph.
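By way of illustration only, the following minimal sketch shows path-style queries returning sets of nodes from a document tree. It assumes Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The sample document and the names `SAMPLE_XML` and `select_titles` are hypothetical and do not appear in the references discussed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE_XML = """
<library>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
  <journal id="j1"><title>Third</title></journal>
</library>
""".strip()


def select_titles(xml_text, path):
    """Parse the document into a tree and return the text of each matching node."""
    root = ET.fromstring(xml_text)
    # findall() evaluates the path relative to the root element and
    # returns the matching element nodes in document order.
    return [node.text for node in root.findall(path)]


# Two path components: child "book" elements, then their "title" children.
book_titles = select_titles(SAMPLE_XML, "./book/title")

# The "//" step matches "title" elements at any depth below the root.
all_titles = select_titles(SAMPLE_XML, ".//title")

print(book_titles)
print(all_titles)
```

In this sketch, each path component narrows the set of candidate nodes, so the first query selects only the titles of `book` elements while the second selects every `title` element in the tree.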
One model for performing XPath queries is based on the Document Object Model (DOM) standard. Typically, DOM processes an entire XML document to produce a tree representing each of the elements in the document and the interrelationship between those elements. An XPath query can be processed to produce a finite automaton, a form of state machine. The finite automaton processes the graph of the DOM model to find a result for the corresponding XPath query. Both deterministic finite automata (DFA) and nondeterministic finite automata (NFA) may be produced for controlling the processing of DOM models.
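As a rough sketch of this model, and not a description of any particular prior system, the following example builds a DOM tree with Python's standard `xml.dom.minidom` module and walks it under the control of a small deterministic automaton. The automaton here handles only simple child-step paths such as `/a/b/c`. The names `DOC`, `compile_path`, and `run_query` are assumptions introduced for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
DOC = (
    "<library>"
    "<book><title>First</title></book>"
    "<book><title>Second</title></book>"
    "<journal><title>Third</title></journal>"
    "</library>"
)


def compile_path(path):
    """Compile '/a/b/c' into a DFA transition table.

    State i means "i path components matched so far". Seeing the element
    name of component i in state i moves the automaton to state i + 1.
    """
    steps = [step for step in path.split("/") if step]
    table = {(state, name): state + 1 for state, name in enumerate(steps)}
    return table, len(steps)  # the accepting state is len(steps)


def run_query(xml_text, path):
    """Build the full DOM tree, then walk it under control of the DFA."""
    table, accept = compile_path(path)
    doc = minidom.parseString(xml_text)  # the entire document is held in memory
    matches = []

    def visit(node, state):
        next_state = table.get((state, node.tagName))
        if next_state is None:
            return  # no transition: this subtree cannot match
        if next_state == accept:
            matches.append(node)
            return
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                visit(child, next_state)

    visit(doc.documentElement, 0)
    return matches


result_nodes = run_query(DOC, "/library/book/title")
result_text = [node.firstChild.data for node in result_nodes]
print(result_text)
```

Note that the whole tree is constructed before the automaton examines any node, which is the source of the memory cost associated with this model.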
However, for large XML documents, processing using DOM may not be practical due to the memory and related resources required by DOM. For example, due to the overhead of the textual formatting of attributes and elements, XML documents typically consume an amount of memory that is on the order of 10 times greater than the amount of memory necessary to represent the underlying data in a compact binary format. Moreover, a DOM tree of an XML document typically requires an amount of memory that is on the order of 10 times greater than the amount required for the XML document itself. Taken together, a DOM tree may therefore occupy on the order of 100 times the memory needed for the underlying data in a compact binary format. Thus, processing of large XML documents may require disproportionately large amounts of memory.
Moreover, server applications, such as, for example, web servers or email servers, may need to process many large XML documents at once. In these server environments, the large memory requirements of DOM trees also negatively impact processing performance in at least two ways. First, if the amount of physical memory is exhausted, system performance may be slowed as documents are paged out to slower storage, such as disk drives. Second, most modern computer processors operate at peak efficiency only when they are consistently performing operations using data that is in a cache memory. Cache memory is typically much more limited than the physical memory of a server. If a server is concurrently processing several large XML documents using DOM, little of each document may remain in the cache memory. The resulting high level of cache misses while processing XPath queries tends to severely degrade overall system performance in systems processing large XML documents.
Another system and application program interface (API) for processing XML is SAX (Simple API for XML). SAX presents the XML document as a serialized stream of events to be processed using handler functions rather than a DOM tree that is processed using, for example, a DFA. SAX thus requires only a stack, whose memory requirement varies with the depth of the structure of elements in the XML document, rather than a tree, whose memory requirement varies with the larger number of elements in the XML document. However, SAX provides only stream-style sequential access to the contents of a document. Moreover, its event-based structure is more difficult for programmers to use, and applications written to use SAX tend either to perform only simple serial processing or to become complicated and difficult to maintain.
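The stack-based character of SAX processing can be sketched as follows, assuming Python's standard `xml.sax` module. The handler keeps only a stack of open element names, so its memory use follows the nesting depth rather than the total number of elements. The class `PathHandler`, its attributes, and the sample document are hypothetical names chosen for illustration.

```python
import xml.sax

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<library>"
    "<book><title>First</title></book>"
    "<book><title>Second</title></book>"
    "<journal><title>Third</title></journal>"
    "</library>"
)


class PathHandler(xml.sax.ContentHandler):
    """Collect the text of elements whose path from the root equals `path`."""

    def __init__(self, path):
        super().__init__()
        self.target = [step for step in path.split("/") if step]
        self.stack = []       # names of the currently open elements
        self.max_depth = 0    # largest stack size seen during parsing
        self.matches = []
        self._buffer = None   # text fragments of the element being captured

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.max_depth = max(self.max_depth, len(self.stack))
        if self.stack == self.target:
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so fragments are accumulated.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if self._buffer is not None and self.stack == self.target:
            self.matches.append("".join(self._buffer))
            self._buffer = None
        self.stack.pop()


handler = PathHandler("/library/book/title")
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.matches, handler.max_depth)
```

Even for this simple path query, the matching logic is spread across three callbacks that share mutable state, which illustrates why event-based code tends to become harder to maintain as queries grow more complex.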
As XML usage increases, the need for efficient processing of XML queries, including XPath queries, also increases. One solution is to offload processing of XML queries to dedicated content processors. However, the memory requirements of DOM processing and the difficulty of using SAX models have made cost-effective implementation of content processing for XML queries difficult. Thus, simpler yet resource-efficient systems and methods of processing XML documents are needed.