1. Technical Field
This disclosure relates generally to processing of documents in a data processing system and more specifically to random update and serialization of XML documents.
2. Description of the Related Art
Many applications need to perform random updates to an extensible markup language (XML) document followed by serialization of the entire document. Applications of this scenario are numerous, because the need to update and serialize XML documents is a core requirement of many service-oriented architecture (SOA) transactions and products. Current processing environments handle streams of data that cannot be reset, such as incoming network data. A capability to perform random access and updates on documents represented in these streams is in very high demand. Furthermore, serialization of such documents is the natural next step that most applications require, and both steps share a need for memory and time efficiency.
A typical need in document processing systems is to dramatically reduce the time spent handling XML documents. Applications and associated products need to update part of a document very efficiently and serialize the resulting document for consumption in other parts of a system. Because of the random update nature of these applications, a bottleneck typically occurs when a whole document is unnecessarily materialized into objects. Generally, existing XML parsing solutions appear to focus on reading the document rather than on managing updates to the content efficiently.
A general solution to update an XML document uses a Document Object Model (DOM), which typically has very poor performance because the complete document is materialized into objects. On the other hand, a general fast serialization solution for XML documents uses Simple API for XML (SAX) or Streaming API for XML (StAX), but neither provides a random update capability. In an example of a current solution, a hybrid representation of materialized and un-materialized data is used, but only sequentially. The solution can only materialize a portion of the document in document order, leaving the rest of the document un-materialized.
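The trade-off between the two general approaches can be illustrated with a minimal sketch using the Python standard library. The sample document and the function names `update_with_dom` and `serialize_with_sax` are assumptions made for illustration and are not part of any solution described here. The DOM path allows an update at any node but builds objects for the whole document first; the SAX path streams events straight to an output without building a tree, and offers no way to go back and change a node.

```python
import io
import xml.dom.minidom
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Illustrative sample document (an assumption, not taken from the disclosure).
SAMPLE = "<order><id>1</id><status>new</status></order>"


def update_with_dom(xml_text, tag, new_value):
    """Random update via DOM: every node is materialized before the update."""
    doc = xml.dom.minidom.parseString(xml_text)  # builds the full object tree
    node = doc.getElementsByTagName(tag)[0]      # any node can be reached
    node.firstChild.data = new_value             # mutate its text content
    return doc.documentElement.toxml()           # serialize the whole tree


def serialize_with_sax(xml_text):
    """Streaming serialization via SAX: no tree is built, no random update."""
    out = io.StringIO()
    handler = XMLGenerator(out, encoding="utf-8")  # writes events as they arrive
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return out.getvalue()


updated = update_with_dom(SAMPLE, "status", "shipped")
streamed = serialize_with_sax(SAMPLE)
```

In this sketch the DOM call pays the cost of materializing `order`, `id`, and `status` even though only `status` changes, which is the bottleneck described above for documents where only a small part is mutated.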
In another example of a current solution, an “inflatable node” is used, which requires references to offsets in the byte array. This means a “wrapper” in the form of the inflatable node must exist for each node. The inflatable node information requires additional memory, thereby adding to the memory requirements of the document.
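A minimal sketch of the idea, under stated assumptions, may help show where the extra memory goes. The class `InflatableNode`, the helper `wrap`, and the sample bytes are hypothetical names chosen for illustration; they do not reproduce any particular existing implementation. Each wrapper stores a start and end offset into the shared byte array and materializes its node only when asked.

```python
from dataclasses import dataclass
from typing import Optional
import xml.dom.minidom


@dataclass
class InflatableNode:
    """Hypothetical wrapper: one instance per node, holding byte offsets."""
    buffer: bytes   # shared, un-materialized document bytes
    start: int      # offset of the node's first byte
    end: int        # offset just past the node's last byte
    _inflated: Optional[xml.dom.minidom.Element] = None

    def inflate(self):
        """Materialize this node on first access and cache the result."""
        if self._inflated is None:
            fragment = self.buffer[self.start:self.end]
            self._inflated = xml.dom.minidom.parseString(fragment).documentElement
        return self._inflated


def wrap(buffer, tag):
    """Locate the first simple <tag>...</tag> span (illustrative helper only)."""
    open_tag = b"<" + tag + b">"
    close_tag = b"</" + tag + b">"
    start = buffer.index(open_tag)
    end = buffer.index(close_tag, start) + len(close_tag)
    return InflatableNode(buffer, start, end)


# Illustrative sample bytes (an assumption, not taken from the disclosure).
DOC = b"<order><id>1</id><status>new</status></order>"
status = wrap(DOC, b"status")
```

Even before any node is inflated, every wrapper already occupies memory for its offsets and bookkeeping, which is the per-node overhead noted above.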
In another example of a current solution, data is always in binary form and a process to update that binary data is provided. However, any update from memory must first be converted into the correct binary format before being applied. The process has a further drawback in that a mutation in one part of the binary data might require changes in other parts of the data stream.
In another example of attempting to meet both requirements of efficiently updating part of a document and serializing the resulting document, a typical solution uses Eclipse™ Modeling Framework (EMF). Although use of EMF is an improvement over DOM, the solution still lacks the optimal random update and serialization that many products require. EMF loads the entire document into memory and therefore typically does not deliver the required performance, especially in scenarios where only small parts of the document are mutated.