This invention relates to a method and apparatus for a markup language parser. In particular, it relates to a single-direction parser and a method for parsing an Extensible Markup Language (XML) document.
Information technology deals with ever-increasing amounts of data, and more efficient methods for processing large amounts of data are always in demand.
Data can be represented in structural (also known as hierarchical) form using tags in XML documents. There are at least two processing methods for processing structured data in such documents.
Tree-based processing is a first processing method, whereby structured data is extracted and populated as a tree structure in memory. This tree is a close representation of the structured data, as it maintains the structure, the hierarchy, and all relevant information that can be extracted from the XML. The Document Object Model (DOM) is a standard for this way of processing. A common representation of a tree structure represents the elements as records with pointers to children, parents, or both, or as items in an array, with the relationships between them determined by their positions in the array. In general, an element in a tree will not have a pointer to its parent, but this information can be included (expanding the data structure to also include a pointer to the parent) or stored separately. Alternatively, upward links can be included in the child element data, as in a threaded binary tree.
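By way of illustration only, the record-with-pointers representation described above can be sketched in Python. The `Element` record, its `append` method, the `path_to_root` helper, and the sample document are hypothetical names chosen for this sketch and are not taken from any cited publication. The `dom_child_names` function uses the standard library `xml.dom.minidom` module as one example of a DOM implementation.

```python
import xml.dom.minidom
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Element:
    """Hypothetical tree record: pointers to children plus an optional parent pointer."""
    name: str
    children: List["Element"] = field(default_factory=list)
    parent: Optional["Element"] = field(default=None, repr=False)

    def append(self, name: str) -> "Element":
        # Create a child record and link it in both directions.
        child = Element(name, parent=self)
        self.children.append(child)
        return child


def path_to_root(element: Optional[Element]) -> List[str]:
    """Follow the upward (parent) links from an element to the root."""
    names = []
    while element is not None:
        names.append(element.name)
        element = element.parent
    return names


SAMPLE = "<order><item><sku/></item><note/></order>"


def dom_child_names(xml_text: str) -> List[str]:
    """Parse the whole document into a DOM tree and list the root's child elements."""
    document = xml.dom.minidom.parseString(xml_text)
    root = document.documentElement
    return [n.tagName for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
```

In this sketch the entire document is held in memory once parsed, which is the property that makes tree-based processing convenient for navigation and costly for large inputs.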
Event-based processing is a second processing method, whereby XML data is represented by a series of events. Each event represents a small portion of the data. Processing is done either in a push mode, as in the Simple API for XML (SAX), where the parser reads the XML data and calls the client with the events, or in a pull mode, as in the Streaming API for XML (StAX), where the client calls the parser to get the next event from the XML data.
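The difference between the push mode and the pull mode can be sketched as follows. This is an illustrative example only: the `Recorder` class and the `push_events` and `pull_events` functions are hypothetical names. The push variant uses the Python standard library `xml.sax` module. The pull variant uses `xml.etree.ElementTree.iterparse` as a stand-in for a StAX-style interface; it is assumed here to be an adequate analogue because the client drives the iteration, although, unlike a pure pull parser, it also builds elements as it goes.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

SAMPLE = b"<order><item/><note/></order>"


class Recorder(xml.sax.ContentHandler):
    """Client callbacks: in push mode the parser calls these as it reads the data."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def push_events(data: bytes):
    # Push mode: control stays with the parser until the document is consumed.
    handler = Recorder()
    xml.sax.parseString(data, handler)
    return handler.events


def pull_events(data: bytes):
    # Pull mode: the client asks for the next event on each loop iteration.
    events = []
    for kind, element in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        events.append((kind, element.tag))
    return events
```

Both functions are expected to report the same sequence of start and end events for the sample document; only the direction of control differs.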
Both processing methods have limitations. Event-based processing consumes little memory but does not maintain the hierarchical structure of the data, so the client application must relate events to one another in order to reconstruct this hierarchy. Tree-based processing, by default, consumes a large amount of memory because all information in the XML data is populated as a tree in memory. When an XML data set is large and the available memory is low, tree-based processing cannot be used. Several publications have attempted to address this limitation of tree-based processing.
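The burden placed on the client by event-based processing can be illustrated with a minimal sketch in which the client keeps its own stack of open elements in order to know where each event sits in the hierarchy. The `PathTracker` class and the `element_paths` function are hypothetical names used only for this example.

```python
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Relates events to each other by tracking the currently open elements."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of element names that have started but not ended
        self.paths = []   # path from the root for every element seen

    def startElement(self, name, attrs):
        self._open.append(name)
        self.paths.append("/".join(self._open))

    def endElement(self, name):
        self._open.pop()


def element_paths(data: bytes):
    handler = PathTracker()
    xml.sax.parseString(data, handler)
    return handler.paths
```

Memory use in this sketch grows with the nesting depth of the document rather than with its total size, at the cost of the client maintaining the stack itself.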
One publication describes a scalable DOM implementation that reduces memory consumption by making some of the references between tree elements weak, while the client application holds strong references to the elements it needs. If an element is referenced only by weak references, the garbage collector can release that element from memory when the application is running low on memory.
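The general idea of weak references between tree elements can be sketched with the Python standard library `weakref` module. This is not the implementation of the cited publication; the `Node` class, its methods, and the `demo` function are hypothetical, and the sketch assumes a runtime in which an object is reclaimed once no strong reference to it remains (an explicit `gc.collect()` call is included for runtimes that defer collection).

```python
import gc
import weakref


class Node:
    """Hypothetical tree element whose links to its children are weak."""

    def __init__(self, name):
        self.name = name
        self._children = []  # weak references only; they do not keep children alive

    def add_child(self, child):
        self._children.append(weakref.ref(child))

    def live_children(self):
        # A dead weak reference returns None when called; skip those.
        resolved = (ref() for ref in self._children)
        return [child.name for child in resolved if child is not None]


def demo():
    root = Node("order")
    kept = Node("item")      # the client keeps a strong reference to this element
    dropped = Node("note")   # the client will give up its reference to this one
    root.add_child(kept)
    root.add_child(dropped)
    before = root.live_children()
    del dropped              # only the weak reference in root remains
    gc.collect()
    after = root.live_children()
    return before, after
```

A practical implementation would also need a way to reload a released element if the client later asks for it; that aspect is omitted from this sketch.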
One publication describes memory-efficient data processing that analyzes the operations to be performed on a data structure and decides which data should be loaded into memory.
One publication describes a method for loading large XML documents on demand. In this publication, portions of the document are loaded into memory and other portions are stored in a database, in a way that is transparent to the client application.