XML is a known document encoding standard that facilitates the exchange of information between computer systems. Among other things, XML prescribes a standard way of encoding named hierarchical information.
XML documents can be processed in many ways. However, one common processing technique involves extracting XML documents' information content and creating a memory-resident representation of that information. One commonly used model is the DOM or the Document Object Model, which is governed by a W3C® standard. Each component in an XML document is represented by a discrete DOM object.
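By way of illustration only, the memory-resident model described above can be sketched with the minidom module from the Python standard library, used here as a stand-in for a typical DOM implementation. The sample document and variable names are assumptions made for this sketch.

```python
from xml.dom import minidom

# Parsing builds the complete object tree in memory before any access occurs.
doc = minidom.parseString("<order id='7'><item>pen</item><item>ink</item></order>")
root = doc.documentElement

# Each element, attribute, and text run is a discrete DOM object.
items = root.getElementsByTagName("item")
names = [item.firstChild.data for item in items]

# Because the whole tree is resident, nodes can be visited in any order.
last_first = [names[-1], names[0]]
```

The convenience of random access comes at the cost of holding one object per document component, which is the memory behavior discussed in the paragraphs that follow.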
One drawback of typical DOM implementations is that the entire object model must reside in memory. Although this approach works well for smaller XML documents (e.g., that can be loaded in memory or a reasonably sized virtual memory), handling very large documents in this way can become cumbersome. For instance, loading large documents may result in high virtual memory demand and poor system performance. This approach also effectively places an upper limit on the size of any XML document that can be processed, with that limit depending, for example, on the amount of available memory (real and/or virtual). Additionally, for environments that use dynamic allocation (such as, for example, Java-based environments), this situation results in large numbers of discrete heap-resident objects, which can adversely affect the performance of memory recovery for the objects when they are no longer in use (e.g., through garbage collection). Those skilled in the art know that system performance often degrades sharply when document size becomes unmanageable.
Yet another drawback of the memory-resident approach is that it can be very difficult to share a parsed document in a federated system where processes may not have convenient access to shared memory.
Of course, it would be desirable to implement an approach to XML parsing that performs consistently well under varying circumstances such as, for example, simultaneously processing a small number of very large documents, simultaneously processing a large number of small documents, and/or the like. The need to meet these desires becomes yet more important as the system scales up to an Enterprise-class, server-sized system.
The inventor has observed that solutions to the XML memory problem tend to fall into one of three categories, namely, adapting the application to an existing DOM model in some pragmatic way, using some model other than DOM, or implementing an improved DOM implementation that improves on prior implementations.
Pragmatic adaptation to DOM oftentimes includes allocating large amounts of memory to the process and simply tolerating poor performance; designing or redesigning the application to limit the size of the document; and/or resorting to some type of ad-hoc partitioning of the documents. Some products currently available from the assignee of the instant application employ the ad-hoc partitioning approach for processing very large documents. For example, the assignee's Integration Server provides an iterating parser that generally makes it possible to maintain a parsing window that typically avoids loading the entire document, subject to certain constraints of course. As another example, the assignee's Trading Networks decomposes large documents into smaller documents in a very specialized approach. It is noted that adapting an application to use ad-hoc partitioning can be very difficult, depending on the size, complexity, processing requirements, and other features of the application and/or environment. Similarly, obtaining maximum performance using ad-hoc partitioning also can be difficult.
Using models other than DOM typically involves a streaming approach in which information is processed during one pass of the document. A streaming push model, such as SAX, can be very efficient from a performance point of view. Unfortunately, however, such models oftentimes are difficult to program. A streaming pull model, such as the AXIOM™ model used by Axis, is easier to use in many situations, but still does not lend itself well to situations that require non-document-order processing. If it is not possible to process data in document-order, the user generally must enable AXIOM™ caching, which stores processed data in a cache so that it may be subsequently reused. This cache, however, is a memory-resident pool of objects and, as a result, its behavior can still degrade into the DOM-like pattern, depending on the usage pattern. The current Integration Server product requires that the entire model be memory resident in order to convert the XML document into an Integration Server Document, so the streaming approach does not improve native Document processing as greatly as is desirable.
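As a rough illustration of the streaming push style mentioned above, the following sketch uses the xml.sax module from the Python standard library. The handler class, its fields, and the sample document are hypothetical and chosen only for this example.

```python
import xml.sax

class ItemCollector(xml.sax.ContentHandler):
    """Collects the text of each <item> element in a single pass."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.texts = []
        self._buffer = None  # non-None only while inside an <item>

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1
            self._buffer = []

    def characters(self, content):
        # The parser pushes text to the handler; the program must track state itself.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.texts.append("".join(self._buffer))
            self._buffer = None

handler = ItemCollector()
xml.sax.parseString(b"<order><item>pen</item><item>ink</item></order>", handler)
```

No tree is retained, so memory use stays small, but the program must carry its own state between callbacks and cannot revisit earlier content without re-reading the document.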
Another non-DOM implementation is Ximpleware's VTD-XML. See, for example, U.S. Pat. No. 7,761,459, the entire contents of which are hereby incorporated herein by reference. This implementation is believed to completely avoid the creation of DOM objects. It instead uses a location cache (e.g., a Binary File Mask or BFM) to maintain information about the document that resides somewhere else in memory or on disk. The VTD API apparently allows a program to access the document contents from a combination of information in the BFM and the original document without requiring object instantiation. Ximpleware claims that this approach significantly improves performance. Yet there are drawbacks associated with this approach. For example, many third-party products are designed to work with the DOM API. Without the DOM API, this approach is a programming island, requiring custom programming for each application. Moreover, although the '459 patent provides for updates (e.g., adding new components) by allocating empty records in the Binary File Mask, no description is provided for the circumstances under which the empty records become filled, or how performance might be affected by a large number of insertions into the original document.
Other non-DOM approaches include customized applications that do not directly utilize DOM. For example, U.S. Pat. No. 8,131,728 (apparently assigned to IBM®), which is hereby incorporated herein by reference in its entirety, describes a technique for extracting the structural information from an XML document and encoding the structural information as a memory-resident index with indexes into the original source data document. The application, a high-speed utility for loading and unloading a Configuration Management Database, processes the smaller memory-resident index rather than the larger source document. Unfortunately, however, the '728 patent (e.g., at col. 5, lines 7-23) suggests that the index is memory-resident, which effectively limits the ultimate size of the document that can be processed, and/or the number of documents that can be processed concurrently due to the total memory occupied by the index.
Other attempts have been made in the pursuit of an improved DOM implementation. The Apache foundation's Xerces™ DOM parser, for example, is widely used throughout the industry. This product makes use of deferred object instantiation, but unfortunately does not provide a caching mechanism to facilitate processing of documents whose memory model exceeds the amount of available memory.
It is believed that neither Xerces™ nor AXIOM™ provides the ability to process arbitrarily large documents in a random fashion. And while Ximpleware VTD-XML can process arbitrarily large documents, it does so using a non-standard (e.g., non-DOM) API.
None of these approaches explicitly describes a systematic technique for limiting the total memory allocation for the Document processing within a system. More generally, there do not seem to be any apparent facilities for systematic tuning of system performance.
In addition to the above-identified issues with the conventional approaches discussed above, it is believed that none of these approaches addresses the issue of sharing a large parsed document if a session migrates across processor sessions. Furthermore, it is believed that none of these approaches addresses the issues of scalability and predictability for Enterprise-class and/or other large scale servers. There is no express explanation in the above-identified approaches tending to show that shared and/or distributed processing can be accommodated.
Thus, it will be appreciated by those skilled in the art that there is a need for improved techniques for processing large XML documents, e.g., in ways that overcome the above-described and/or other problems.
In certain example embodiments, a system for processing XML documents is provided. Processing resources include at least one processor, a memory, and a non-transitory computer readable storage medium. The processing resources are configured to: parse an XML document into one or more constituent nodes, with the XML document including a plurality of objects representable in accordance with an object model, and with the XML document being parsed without also instantiating the objects therein; store the parsed constituent nodes and associated metadata in one or more partitions; and in response to requests for objects from the XML document from a user program, instantiate only said requested objects from their associated partition(s) in accordance with the object model.
In certain example embodiments, a method of processing large documents is provided. In connection with at least one processor, a large document is parsed into one or more constituent nodes, with the document including a plurality of objects representable in accordance with an object model, and with the document being parsed without also instantiating the objects therein. The parsed constituent nodes and associated metadata are stored in one or more cacheable partitions, with the cacheable partitions being located in a memory and/or a non-transitory backing store. A request from a user program for an object from the document is handled by: identifying the partition(s) in which nodes corresponding to the requested object is/are located, and instantiating only said requested objects from the identified partition(s) in accordance with the object model. The cacheable partitions are structured to include only logical references among and between different nodes.
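The on-demand instantiation described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the Store and Node classes, the record layout (a name plus a list of child identifiers), and the instance counter are all hypothetical.

```python
class Node:
    """Object created only when a user program asks for it."""

    instances = 0  # counts how many objects have actually been instantiated

    def __init__(self, store, node_id):
        Node.instances += 1
        self._store = store      # reference to the backing store, not to other Nodes
        self.node_id = node_id   # logical identifier of the encoded node

    @property
    def name(self):
        return self._store.records[self.node_id][0]

    def children(self):
        # Children are resolved by logical id and instantiated only on request.
        child_ids = self._store.records[self.node_id][1]
        return [Node(self._store, child_id) for child_id in child_ids]

class Store:
    """Hypothetical stand-in for parsed nodes held in partitions."""

    def __init__(self, records):
        self.records = records   # node id -> (name, list of child ids)

    def get(self, node_id):
        return Node(self, node_id)

store = Store({0: ("order", [1, 2]), 1: ("item", []), 2: ("item", [])})
before = Node.instances          # parsing alone created no objects
root = store.get(0)
after_root = Node.instances      # only the requested object exists
kids = root.children()
```

Because each object holds only a logical identifier and a handle to the store, discarding an object does not disturb the stored nodes, and the stored nodes contain no references to instantiated objects.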
In certain example embodiments, there is provided a non-transitory computer readable storage medium tangibly storing instructions that, when executed by at least one processor of a system, perform a method as described herein.
According to certain example embodiments, each said cacheable partition may include a locator array, a properties array, and a character array. The locator array may be configured to identify starting positions of nodes encoded in the properties array. The properties array may be configured to store encoded nodes, as well as, for each said encoded node: metadata including a respective node type, reference(s) to any familial nodes thereof, and offset(s) into the character array for any attribute and/or text value(s) associated therewith.
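One possible reading of the three-array layout is sketched below. The record format (node type, parent index, value count, then an offset and length pair per value) and the node type codes are assumptions made for this illustration and are not taken from the description above.

```python
from dataclasses import dataclass, field

ELEMENT, TEXT = 1, 3  # hypothetical node type codes

@dataclass
class Partition:
    locator: list = field(default_factory=list)     # node index -> start in properties
    properties: list = field(default_factory=list)  # encoded nodes, integers only
    chars: list = field(default_factory=list)       # characters of all text values

    def add_node(self, node_type, parent, values=()):
        """Encode one node and return its index within this partition."""
        self.locator.append(len(self.properties))
        record = [node_type, parent, len(values)]
        for value in values:
            # Store an (offset, length) pair into the character array.
            record += [len(self.chars), len(value)]
            self.chars.extend(value)
        self.properties.extend(record)
        return len(self.locator) - 1

    def read_node(self, index):
        """Decode a node on demand from the three arrays."""
        start = self.locator[index]
        node_type, parent, count = self.properties[start:start + 3]
        values = []
        for i in range(count):
            off, length = self.properties[start + 3 + 2 * i:start + 5 + 2 * i]
            values.append("".join(self.chars[off:off + length]))
        return node_type, parent, values

part = Partition()
root = part.add_node(ELEMENT, -1, ["7"])   # element with one attribute value
text = part.add_node(TEXT, root, ["pen"])  # text node whose parent is the element
```

Since records may differ in length, the locator array is what allows a node index to be resolved to its record without scanning the properties array. All familial references are array indices rather than memory addresses.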
According to certain example embodiments, the XML document may be parsed by executing a pre-parsing initialization process that includes: creating a cacheable document node that corresponds to the starting point for user program access to the XML document; allocating a name dictionary that includes an indexed entry for each unique XML tag name included in the XML document; allocating a namespace dictionary that includes an indexed entry for each unique XML namespace included in the XML document; and allocating a partition table that includes a list of cacheable partitions and an allocation array that allows a node's allocation identifier to be resolved to a specific cacheable partition, each said cacheable partition including metadata from the pre-parsing initialization process.
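The structures allocated during pre-parsing initialization could take a form along the following lines. This is a sketch only: the class names, the dictionary returned by initialize_document, and in particular the format of the allocation array (the first allocation identifier held by each partition, searched by bisection) are assumptions and are not specified in the text.

```python
import bisect

class Dictionary:
    """Assigns a stable integer index to each unique string."""

    def __init__(self):
        self.index_of = {}
        self.entries = []

    def intern(self, text):
        if text not in self.index_of:
            self.index_of[text] = len(self.entries)
            self.entries.append(text)
        return self.index_of[text]

class PartitionTable:
    def __init__(self):
        self.partitions = []  # list of partition identifiers
        self.allocation = []  # allocation[i] = first allocation id in partition i

    def add_partition(self, first_allocation_id):
        self.partitions.append(len(self.partitions))
        self.allocation.append(first_allocation_id)
        return self.partitions[-1]

    def resolve(self, allocation_id):
        """Return the partition that holds the node with this allocation id."""
        return bisect.bisect_right(self.allocation, allocation_id) - 1

def initialize_document():
    """Hypothetical pre-parsing setup returning the structures named in the text."""
    return {
        "document_node": {"type": "document", "allocation_id": 0},
        "names": Dictionary(),
        "namespaces": Dictionary(),
        "partition_table": PartitionTable(),
    }

state = initialize_document()
order_id = state["names"].intern("order")
item_id = state["names"].intern("item")
again_id = state["names"].intern("order")  # repeated tag name reuses its entry
table = state["partition_table"]
table.add_partition(first_allocation_id=0)
table.add_partition(first_allocation_id=100)
```

Storing a tag name once and referring to it by index keeps encoded nodes compact, and resolving an allocation identifier through the table keeps node references independent of where a partition currently resides.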
According to certain example embodiments, the parsing may include: recognizing parsing events of predefined parsing event types within the XML document; creating nodes for the recognized parsing events; adding the created nodes to a current partition while there is sufficient space therein; and creating a new partition when there is insufficient space in the current partition for adding created nodes, updating the partition table, and continuing with the adding by treating the newly created partition as the current partition.
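The parsing loop described above can be approximated as follows, using the xml.sax module from the Python standard library as the event source. The handler class, the node tuples, and the per-partition capacity measured in nodes are assumptions for this sketch; an actual embodiment may measure space differently.

```python
import xml.sax

class PartitioningHandler(xml.sax.ContentHandler):
    """Turns parsing events into nodes and fills partitions in order."""

    def __init__(self, capacity):
        super().__init__()
        self.capacity = capacity  # hypothetical maximum nodes per partition
        self.partitions = [[]]    # stands in for the partition table

    def _add(self, node):
        current = self.partitions[-1]
        if len(current) >= self.capacity:
            # Insufficient space: create a new partition, record it in the
            # table, and treat it as the current partition from here on.
            current = []
            self.partitions.append(current)
        current.append(node)

    def startElement(self, name, attrs):
        self._add(("element", name))

    def characters(self, content):
        if content.strip():
            self._add(("text", content))

handler = PartitioningHandler(capacity=2)
xml.sax.parseString(b"<a><b>x</b><c>y</c></a>", handler)
partitions = handler.partitions
```

With a capacity of two nodes, the five nodes produced from the sample document are spread across three partitions in document order.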
A feature of certain example embodiments is that the partitions may be movable from the memory to the non-transitory computer readable storage medium when memory usage reaches or exceeds a threshold.
Another feature of certain example embodiments is that the partitions may be movable through a caching storage hierarchy of the processing resources without adjusting or encoding memory references therein.
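A simple way to picture partitions moving between memory and a backing store is sketched below. The PartitionCache class, the use of a resident-partition count as a stand-in for a memory usage threshold, the least-recently-used eviction order, and the JSON file format are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import OrderedDict

class PartitionCache:
    def __init__(self, max_resident, directory):
        self.max_resident = max_resident  # stand-in for a memory usage threshold
        self.directory = directory
        self.resident = OrderedDict()     # partition id -> data, oldest first

    def _path(self, pid):
        return os.path.join(self.directory, f"partition-{pid}.json")

    def put(self, pid, data):
        self.resident[pid] = data
        self.resident.move_to_end(pid)
        while len(self.resident) > self.max_resident:
            # Threshold exceeded: write the least recently used partition out.
            old_pid, old_data = self.resident.popitem(last=False)
            with open(self._path(old_pid), "w") as handle:
                json.dump(old_data, handle)

    def get(self, pid):
        if pid in self.resident:
            self.resident.move_to_end(pid)
        else:
            # Reload from the backing store; contents need no adjustment
            # because they hold only integers, not memory references.
            with open(self._path(pid)) as handle:
                self.put(pid, json.load(handle))
        return self.resident[pid]

with tempfile.TemporaryDirectory() as tmp:
    cache = PartitionCache(max_resident=2, directory=tmp)
    for pid in range(3):
        cache.put(pid, [pid, pid + 1])
    spilled = os.path.exists(cache._path(0))  # partition 0 was written to disk
    restored = cache.get(0)                   # and is read back unchanged
    resident_ids = list(cache.resident)
```

Because the partition contents are plain integers that act as logical references, a partition can be written out and read back byte for byte with no pointer fix-up step.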
Another feature of certain example embodiments is that objects instantiated from the partitions, when resident in the memory, may be free from references to other objects in their own respective partitions and any other partitions.
Still another feature of certain example embodiments is that partitions for the parsed document may be removed from the memory and/or the non-transitory computer readable storage medium when the user program no longer includes any references to the document or any objects thereof.
Still another feature of certain example embodiments is that the partitions may include only logical references among and between different nodes.
These aspects, features, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.