1. Technical Field
Embodiments of the invention relate generally to information processing and more particularly to partial parsing and modification of Extensible Markup Language (XML) documents.
2. Prior Art
XML refers to a World Wide Web Consortium (W3C) standard for creating markup languages that describe the structure and interrelationships of data. XML is not a single, predefined markup language but rather a metalanguage (a language for describing other languages). In the last few years, XML has become the lingua franca of the Internet and the World Wide Web (WWW). It has become the most common mechanism for structured data representation, exchange and storage.
In the aforementioned XML applications, it is critical that the data contained in XML documents be processed. There are several ways in which XML documents can be processed and modified, and in which data can be retrieved therefrom. Several languages such as XPath, XSLT and XQuery allow queries to be performed on XML documents in order to locate information items and to process and modify the documents. XPath refers to a language standardized by W3C for querying XML documents. It treats an XML document as a logically ordered tree of nodes and provides a means to locate and identify XML elements and attributes.
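By way of illustration only, the following minimal sketch shows path-based location of elements and attributes in a tree of nodes. It uses the Python standard library module xml.etree.ElementTree, which supports only a limited subset of XPath. The sample document, its element names and the variable names are assumptions made for this example and do not come from any particular application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for illustration.
CATALOG = """<catalog>
  <book id="b1"><title>XML Basics</title><price>20</price></book>
  <book id="b2"><title>Streaming Parsers</title><price>35</price></book>
</catalog>"""

# The document is parsed into a tree of nodes rooted at <catalog>.
root = ET.fromstring(CATALOG)

# Locate every <title> that is a child of a <book> child of the root.
titles = [t.text for t in root.findall("./book/title")]

# Locate one element by an attribute predicate, then read a child value.
second_book = root.find("./book[@id='b2']")
second_price = second_book.find("price").text

print(titles)
print(second_price)
```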
In traditional approaches to XML modification, a Document Object Model (DOM) is followed. In the DOM approach, the XML document is converted to a tree format with the help of a DOM parser and this DOM tree is stored in memory. While this approach works for smaller documents, it has severe limitations when it comes to processing larger XML documents, especially because the in-memory tree to be stored is usually 7-10 times the size of the original XML document. Thus, in the case of large documents, following the DOM approach imposes constraints in terms of memory, time, cost and application performance. Further, DOM allows modification of an XML document only if a complete in-memory data structure is formed. Thus, a DOM approach to modifying an XML document is limited in instances where, owing to memory limitations, a complete DOM tree cannot be stored in memory.
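The DOM approach described above can be sketched as follows, using the Python standard library module xml.dom.minidom as one example of a DOM parser. The sample document and the helper name set_quantity are hypothetical. Note that the whole document is parsed into a tree before a single value can be changed.

```python
from xml.dom import minidom

# Hypothetical sample document used only for illustration.
ORDER = (
    '<order>'
    '<item sku="A1"><qty>1</qty></item>'
    '<item sku="B2"><qty>5</qty></item>'
    '</order>'
)

def set_quantity(xml_text, sku, new_qty):
    """Change the <qty> of one <item>; the full DOM tree is built first."""
    doc = minidom.parseString(xml_text)  # entire document held in memory
    for item in doc.getElementsByTagName("item"):
        if item.getAttribute("sku") == sku:
            qty = item.getElementsByTagName("qty")[0]
            qty.firstChild.data = str(new_qty)  # edit the text node in place
    # The whole tree is written back out, not only the changed portion.
    return doc.documentElement.toxml()

updated = set_quantity(ORDER, "B2", 7)
print(updated)
```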
In order to address the challenges posed by the DOM approach, alternative approaches such as the Simple API for XML (SAX) were developed. In contrast to the DOM approach, the SAX approach does not require loading the complete XML document into memory. Rather, SAX presents the document as a serialized stream of events. In other words, SAX is event driven and relies on a programmer to specify the events upon the occurrence of which XML processing takes place. However, the SAX approach has its own limitations as well. Because the events are delivered once, in document order, the ability to navigate back and forth within the XML document in order to make modifications is restricted. This is a severe limitation of the SAX approach.
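A minimal sketch of the event-driven style described above is given below, using the Python standard library module xml.sax. The handler class TitleCollector and the sample document are assumptions made for this example. The handler receives each event once, in document order, and must keep its own state because it cannot move backward in the document.

```python
import xml.sax

# Hypothetical sample document used only for illustration.
CATALOG = """<catalog>
  <book id="b1"><title>XML Basics</title></book>
  <book id="b2"><title>Streaming Parsers</title></book>
</catalog>"""

class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

handler = TitleCollector()
xml.sax.parseString(CATALOG.encode("utf-8"), handler)
print(handler.titles)
```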
In the aforementioned approaches, the XML document needs to be parsed in its entirety in order to modify a portion of it, regardless of how minor the modification is. Parsing the entire document, including portions that do not require modification, leads to unnecessary usage of the Central Processing Unit (CPU).
Further, parsing and in-memory representation of an XML document require a significant amount of memory. Especially in the DOM approach, if the XML document which needs to be modified is larger than, say, 100 KB, memory requirements can be significantly large even for a minor modification. Further, modifying an XML document using the DOM API is highly programmatic and requires code changes for every new type of modification. Another problem arises in these approaches because the XML document has to be de-serialized and then serialized. De-serialization is involved in creating the in-memory data structure from the XML document, and serialization is involved in converting the in-memory data structure back to the XML document. Both de-serialization and serialization are costly as well as time consuming.
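The round trip described above can be illustrated with the following sketch, which takes de-serialization to mean converting document text into an in-memory tree and serialization to mean converting the tree back into text, as those terms are conventionally used. It relies on the Python standard library module xml.etree.ElementTree. The helper name change_status and the sample document are hypothetical. Even though only one value changes, every element is parsed on the way in and written on the way out.

```python
import xml.etree.ElementTree as ET

def change_status(xml_text, new_status):
    """Modify one value at the cost of a full parse and a full rewrite."""
    root = ET.fromstring(xml_text)            # de-serialize: text -> tree
    root.find("status").text = new_status     # the only actual change
    return ET.tostring(root, encoding="unicode")  # serialize: tree -> text

# Hypothetical sample document used only for illustration.
doc = "<job><id>42</id><status>open</status></job>"
result = change_status(doc, "done")
print(result)
```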
Hence, there is a need to provide a method and system for parsing and modifying XML documents efficiently.