1. Field of the Invention
The invention generally relates to processing mark-up language data, and more particularly to a technique for reducing parsing time of documents by using intra-document indices to improve querying streams of XML data.
2. Description of the Related Art
As XML (extensible mark-up language) continues to gain popularity as a format for storing, sharing, and manipulating data, new tools and systems are being introduced to increase its flexibility. One feature essential to robust XML data processing applications is the ability to query XML data. More specifically, with the growing popularity of streamed applications over networks such as the Internet, facilities for efficiently querying streams of XML data will become increasingly critical.
Most of the XPath and XQuery implementations today process queries by traversing an in-memory representation of the document using the Document Object Model (DOM) interface. In DOM, at any point, the processing can move in any direction in the XML tree from the current node to its children, its parent or any of its siblings. While this makes the implementation easier, the requirement that the whole document be saved in memory is a major drawback of this approach, leading to large memory consumption (decreased concurrency) and high latency (the document needs to be processed before the first answer is produced). In order to overcome these limitations, streamed implementations based on the Simple API for XML (SAX) interface are emerging.
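The contrast between the two interfaces can be illustrated with a minimal sketch using Python's standard library. The sample document and the names `DOC`, `dom_items`, `ItemHandler`, and `sax_items` are hypothetical and chosen only for illustration; they do not correspond to any particular XPath or XQuery implementation.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
DOC = (
    "<orders>"
    "<order id='1'><item>pen</item></order>"
    "<order id='2'><item>ink</item></order>"
    "</orders>"
)


def dom_items(text):
    """DOM style: the whole tree is built in memory before any answer is produced."""
    tree = minidom.parseString(text)
    return [node.firstChild.data for node in tree.getElementsByTagName("item")]


class ItemHandler(xml.sax.ContentHandler):
    """SAX style: react to events as they arrive and keep only the needed state."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.buffer = []
        self.items = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self.in_item:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self.buffer))
            self.in_item = False


def sax_items(text):
    handler = ItemHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.items
```

Both functions return the same item values, but the DOM version holds the entire document tree in memory, whereas the SAX version retains only the text of the element currently being read.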
The recently developed TurboXPath processor, available from International Business Machines, NY, USA, evaluates single-document XQuery queries over streams of XML data using SAX. TurboXPath has been shown to reduce both memory consumption and latency by orders of magnitude. Nevertheless, experiments have demonstrated that XML parsing (producing SAX events from an XML document stream) is responsible for 60 to 95 percent of the overall processing time. One of the reasons for the high parsing overhead is that the parser produces events for all document pieces, regardless of whether they are relevant to processing the query.
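The overhead described above can be made concrete with a small sketch that counts how many element events a SAX parser emits compared with how many concern the elements a query actually needs. This is an illustration under stated assumptions, not the TurboXPath implementation: the sample document, the `CountingHandler` class, and the `count_events` function are hypothetical, and the query is assumed to need only `title` elements.

```python
import xml.sax

# Hypothetical sample document; assume the query needs only <title> elements.
DOC = (
    "<catalog>"
    "<book><title>XML</title><price>10</price><review>long text</review></book>"
    "<book><title>SAX</title><price>12</price><review>more text</review></book>"
    "</catalog>"
)


class CountingHandler(xml.sax.ContentHandler):
    """Counts start/end element events, and how many concern relevant elements."""

    def __init__(self, relevant_names):
        super().__init__()
        self.relevant_names = set(relevant_names)
        self.total_events = 0
        self.relevant_events = 0

    def _count(self, name):
        # Every element event costs parser work, whether or not the query uses it.
        self.total_events += 1
        if name in self.relevant_names:
            self.relevant_events += 1

    def startElement(self, name, attrs):
        self._count(name)

    def endElement(self, name):
        self._count(name)


def count_events(text, relevant_names):
    """Return (total element events, events for relevant elements)."""
    handler = CountingHandler(relevant_names)
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.total_events, handler.relevant_events
```

Calling `count_events(DOC, {"title"})` shows that the parser reports events for every element in the document even though only a small fraction of them bear on the query.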
Conventional parsing techniques aimed at reducing parsing time when processing XQuery queries over streams of XML documents still produce events for all document pieces, regardless of query relevance. U.S. patent application Ser. No. 10/413,244 filed on Apr. 14, 2003, the complete disclosure of which is herein incorporated by reference, describes a technique applicable to querying streams of XML documents. However, while the co-pending application is certainly useful and beneficial for the purposes for which it was intended, its solution involves buffering streamed fragments that meet particular evaluation criteria and constructing these fragments in order to satisfy queries, and is not necessarily aimed at reducing parsing time. Therefore, there is a need for a novel technique that can reduce parsing time in the context of processing XQuery queries over XML documents.