XML (eXtensible Markup Language) is a standard language used for sharing and storing information across different technology systems. XML uses a plurality of tags to mark up or describe content. XML allows data to be linked to each of the plurality of tags, thereby enabling manipulation and extraction of data for comparison purposes. A typical XML document contains a tree-based data structure that stores data in a structured format. Conventional techniques for comparing XML documents include parsing the XML document, loading it into a memory in the form of a collection data structure such as a tree or a hash, and subsequently performing multiple traversals over the collection data structure that is materialized in the memory. A typical drawback of loading the collection data structure of the whole XML document into the memory is the limited scalability of the application. Further, the size of the collection data structure, and hence the size of the XML document that can be processed in-memory, is limited by the memory available. As a result, in-memory processing of XML documents involving a large amount of data is usually not affordable. Further, even in cases where the XML document is smaller in size, materialization of the entire XML document in the memory results in inefficient resource allocation. Furthermore, multiple traversals through the collection data structure require additional processing capacity and time.
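By way of a non-limiting illustration, the conventional in-memory approach described above may be sketched as follows in Python using the standard `xml.etree.ElementTree` module. The function names `flatten` and `compare_in_memory` are hypothetical, and the sketch assumes that only element paths and element text are compared (attributes are ignored). It is intended only to show why memory use grows with document size: both documents are fully materialized as trees, and each tree is then traversed in full.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def flatten(elem, path=""):
    """Traverse a materialized tree and yield (path, text) pairs."""
    here = f"{path}/{elem.tag}"
    yield here, (elem.text or "").strip()
    for child in elem:
        yield from flatten(child, here)


def compare_in_memory(xml_a, xml_b):
    """Return the positions at which the two flattened documents differ."""
    # Both documents are parsed and held in memory as complete trees.
    tree_a = ET.fromstring(xml_a)
    tree_b = ET.fromstring(xml_b)
    # Each tree is traversed in full, producing a second in-memory collection.
    items_a = list(flatten(tree_a))
    items_b = list(flatten(tree_b))
    # A further pass over both collections identifies the differences.
    return [(a, b) for a, b in zip_longest(items_a, items_b) if a != b]


if __name__ == "__main__":
    print(compare_in_memory("<r><a>1</a><b>2</b></r>",
                            "<r><a>1</a><b>3</b></r>"))
```

In this sketch the peak memory footprint is proportional to the combined size of both documents, which corresponds to the scalability limitation noted above.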
Alternatively, certain algorithms known in the art for comparing two XML documents cause one of the XML documents to be parsed entirely into a tree-based data structure, which is then loaded into a memory. The other XML document is then parsed multiple times to identify the differences between the two XML documents. Parsing an XML document multiple times results in excessive consumption of time as the size of the XML document increases. Hence, when comparing XML documents of larger sizes, the techniques known in the art fall short with respect to efficient utilization of memory and processing capacity.
Hence, there is a need for an alternative method and system for comparing XML documents that provides significant performance gains in terms of memory utilization and processing power. The alternative method must provide these gains when comparing XML documents of larger sizes, in order to enable machines with limited processing power and memory to compare such documents. Thus, a method for parallel parsing and materializing a portion of the XML documents into the memory is proposed.
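By way of a non-limiting illustration, the general idea of parsing two XML documents in parallel while materializing only a portion of each in the memory may be sketched as follows in Python. This sketch is not the proposed method itself, the details of which are not given in this section. It assumes a simple positional comparison of parse events that stops at the first divergence, ignores attributes and tail text, and uses the hypothetical function names `stream_events` and `first_difference`. The call `elem.clear()` releases the text and children of each finished element, although empty element shells remain attached to their parents, so the sketch reduces rather than strictly bounds memory use.

```python
import io
import xml.etree.ElementTree as ET
from itertools import zip_longest


def stream_events(source):
    """Yield parse events one at a time from a file-like object."""
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            yield ("start", elem.tag)
        else:
            # The element text is complete once the "end" event is seen.
            yield ("end", elem.tag, (elem.text or "").strip())
            # Assumption: finished elements are no longer needed, so
            # their text and children are released here.
            elem.clear()


def first_difference(source_a, source_b):
    """Advance both parsers in lockstep and return the first mismatch.

    Returns (index, event_a, event_b), or None if the event streams match.
    """
    pairs = zip_longest(stream_events(source_a), stream_events(source_b))
    for index, (ev_a, ev_b) in enumerate(pairs):
        if ev_a != ev_b:
            return index, ev_a, ev_b
    return None


if __name__ == "__main__":
    doc_a = io.BytesIO(b"<r><a>1</a><b>2</b></r>")
    doc_b = io.BytesIO(b"<r><a>1</a><b>3</b></r>")
    print(first_difference(doc_a, doc_b))
```

In this sketch each document is parsed only once and neither document is traversed after parsing, in contrast to the conventional techniques described above.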