XML (Extensible Markup Language) is rapidly becoming a widely used industry standard for exchanging business data. Various interfaces have been developed for applications to parse and access XML data when XML documents are received. Efficient parsing of XML documents is becoming more important as the size and volume of XML documents increase.
An XML parser takes as input a raw serialized string and performs certain operations on it. Typically, a parser checks the syntactic well-formedness of the XML data, e.g., making sure that start tags have matching end tags and that there are no overlapping elements. Some parsers also implement validation against a Document Type Definition (DTD) or an XML schema to verify the structure and content. The parsing output provides access to the content of the XML document via application programming interfaces (APIs).
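By way of illustration only, the well-formedness check described above can be sketched with the Python standard library. The function name `is_well_formed` and the sample strings are hypothetical and are not taken from any particular parser discussed here; the sketch performs no DTD or schema validation.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (no validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for mismatched, overlapping, or unclosed tags, among others.
        return False


# Hypothetical inputs: one matched, one with overlapping elements, one unclosed.
matched = "<order><item>pen</item></order>"
overlapping = "<order><item>pen</order></item>"
unclosed = "<order><item>pen</item>"

checks = {
    "matched": is_well_formed(matched),
    "overlapping": is_well_formed(overlapping),
    "unclosed": is_well_formed(unclosed),
}
```

Only the first input passes, since the other two violate the tag-matching rules that a parser is expected to enforce.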
One specific type of parser that has been developed is a Document Object Model (DOM) parser. A DOM parser uses a tree-based parsing technique that builds a parse object tree in memory. It allows complete, dynamic access to an entire XML document through an object-oriented API. Because the XML document is represented in memory as an object tree, DOM parsers preserve and allow dynamic access to the XML document structure and content. A DOM parser is capable of supporting XPath, a preferred technique for selecting and retrieving data from XML documents. XPath allows for retrieval of XML data based not only on its content, but also on the XML document structure.
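As a minimal sketch of tree-based access, the following example uses Python's `xml.etree.ElementTree`. This module is an in-memory tree API rather than a W3C DOM implementation, and it supports only a subset of XPath; it is used here as an assumption purely to illustrate selection by structure and by content. The sample document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two top-level stories and one archived story.
DOC = """<news>
  <story section="sports"><title>Cup final</title></story>
  <story section="business"><title>Markets rally</title></story>
  <archive><story section="sports"><title>Old match</title></story></archive>
</news>"""

# The whole document is parsed into an in-memory tree before any query runs.
root = ET.fromstring(DOC)

# Selection by structure: titles of stories that are direct children of the root.
top_level_titles = [t.text for t in root.findall("./story/title")]

# Selection by structure and content: titles of sports stories at any depth.
sports_titles = [t.text for t in root.findall(".//story[@section='sports']/title")]
```

The first query excludes the archived story because of where it sits in the tree, which is the kind of structure-dependent retrieval that requires the hierarchy to be available.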
However, as XML documents become increasingly large, current solutions for parsing XML documents based on DOM parser tree creation and traversal face serious performance issues. Significantly, known DOM parsing solutions require an entire XML document to be parsed at one time, as partial parsing is not possible. In addition, loading the entire document and building the tree structure in memory is computationally expensive, especially for larger documents. In practice, DOM trees have required up to 10 times the memory of the original document. DOM parsers do not perform or scale well when processing large XML documents because of their high memory cost.
In addition, current DOM-based XML data retrieval techniques redundantly traverse the DOM tree when processing multiple XPath expressions, with each expression typically evaluated in its own pass over the tree. This is inefficient, especially for a large XML document with hundreds of XPath expressions. The redundancy can result in scalability issues for a system in which many large XML documents are processed.
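The cost of repeated traversal can be illustrated with a simplified sketch. The helper `find_by_tag` below is hypothetical and stands in for a descendant query such as `//b`; it is not an XPath engine, and real engines may optimize differently. It counts node visits to show that evaluating each query independently walks the same tree again.

```python
import xml.etree.ElementTree as ET


def find_by_tag(root, tag, counter):
    """Walk the entire tree for one tag, counting every node visited."""
    hits = []
    for elem in root.iter():
        counter["visits"] += 1
        if elem.tag == tag:
            hits.append(elem)
    return hits


# Hypothetical five-element document: a, b, c, b, d.
root = ET.fromstring("<a><b><c/></b><b><d/></b></a>")
node_count = sum(1 for _ in root.iter())

counter = {"visits": 0}
wanted = ["b", "c", "d"]  # stand-ins for three separate descendant queries

# Each query performs its own full traversal of the same tree.
results = {tag: find_by_tag(root, tag, counter) for tag in wanted}
```

Under these assumptions the number of visits grows as the product of the query count and the node count, which is the redundancy described above.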
“Streaming” refers to techniques for transferring data such that the data can be processed as a steady and continuous stream, and is an ideal solution for efficiently retrieving data from large documents. A streaming protocol that can handle large documents allows for fast processing as well as scalability.
Streaming-based XML processing techniques that use a fixed amount of memory, such as SAX (Simple API for XML) and StAX (Streaming API for XML), have been developed. SAX and StAX parsers require less memory than DOM parsers, but they do not maintain the hierarchical structure of XML documents. That is, while known SAX and StAX parsers allow pieces of XML documents to be accessed, the structure is lost in the processing. Without the document structure, known SAX and StAX parsers cannot support XPath-based XML data retrieval.
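The following sketch, using Python's `xml.sax` module, illustrates why structure is not available from event-based parsing by default. The parser delivers start, character, and end events one at a time and retains no tree, so the hypothetical `TitleCollector` handler must keep its own stack of open elements to know where a given title sits. The class name and sample document are assumptions made for illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects <title> text together with an application-maintained path."""

    def __init__(self):
        super().__init__()
        self.path = []    # ancestry tracked by the application, not by SAX
        self.titles = []  # (path, text) pairs
        self._buf = []    # character data of the element currently open

    def startElement(self, name, attrs):
        self.path.append(name)
        self._buf = []

    def characters(self, content):
        self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            location = "/" + "/".join(self.path)
            self.titles.append((location, "".join(self._buf).strip()))
        self.path.pop()
        self._buf = []


# Hypothetical document with a top-level story and an archived story.
SAMPLE = (
    b"<news><story><title>Cup final</title></story>"
    b"<archive><story><title>Old match</title></story></archive></news>"
)

handler = TitleCollector()
xml.sax.parseString(SAMPLE, handler)
```

Without the `path` stack, the two titles would be indistinguishable by position, since each event carries only the element name and its attributes.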
The ability to perform XPath-based XML data retrieval can be very important in certain situations. For example, in an industry in which a very large number of documents are handled, such as a news organization, some users may be interested in only a portion of the XML data found in an XML document. Support for XPath-based data retrieval is needed to be able to selectively retrieve XML data of interest.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.