1. Field of Invention
This invention relates to methods and systems for processing electronic documents, and, in particular, to methods and systems for serializing and de-serializing electronic documents to support transmission or storage.
2. Discussion of Related Art
The eXtensible Markup Language (XML) can be used to facilitate implementation of integrated programmable World Wide Web (“Web”) based services. Through the exchange of XML-related messages, services can describe their capabilities and allow other services, applications or devices to easily invoke those capabilities. The Simple Object Access Protocol (SOAP) has been developed to further this goal. SOAP is an XML-based mechanism that bridges different object models over the Internet and provides an open mechanism for Web services to communicate with one another.
XML provides a format for describing structured data, and is a markup language that is similar in form to Hyper Text Markup Language (HTML) in that it is a tag-based language. Unlike HTML, however, XML tags are not predefined, permitting greater flexibility than possible with HTML. By providing a facility to define tags and the structural relationship between tags, XML supports the creation of richly structured Web documents.
The XML standard describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
XML “elements” are structural constructs that include a start tag, an end or close tag, and the information or content that is contained between the tags. A “start tag” is formatted as “<tagname>” and an “end tag” is formatted as “</tagname>”.
In an XML document, start and end tags can be nested within other start and end tags. All elements that occur within a particular element have their start and end tags occur before the end tag of that particular element. This defines a tree-like structure. Each element forms a node in this tree, and potentially has “child” or “branch” nodes. The child nodes represent any XML elements that occur between the start and end tags of the “parent” node.
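The nesting described above can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module; the sample document and the walk helper are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

def walk(element, depth=0):
    """Return (depth, tag) pairs for an element and its descendants, in document order."""
    pairs = [(depth, element.tag)]
    for child in element:  # iterating an element yields its direct child elements
        pairs.extend(walk(child, depth + 1))
    return pairs

# Hypothetical sample document: two "book" children nested inside a "library" parent.
doc = "<library><book><title>War and Peace</title></book><book/></library>"
root = ET.fromstring(doc)
tree = walk(root)

for depth, tag in tree:
    print("  " * depth + tag)
```

Each child's start and end tags fall before the end tag of its parent, so the depth values follow the tag nesting directly.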
One exemplary usage of XML is the exchange of data between different entities, such as client and server computers, in the form of requests and responses. A client might generate a request for information or a request for a certain server action, and a server might generate a response to the client that contains the information or confirms whether the certain action has been performed. The contents of these requests and responses are in the form of XML documents, i.e., sequences of characters that comply with the specification of XML.
The SOAP specification defines a uniform way of passing XML-encoded data. It also defines a way to perform remote procedure calls (RPCs) using HTTP as the underlying communication protocol.
A SOAP message is an XML document that includes a mandatory SOAP envelope, an optional SOAP Header, and a mandatory SOAP Body. SOAP provides a protocol specification for invoking methods on servers, services, components and objects. SOAP codifies the existing practice of using XML and HTTP as a method invocation mechanism. The SOAP specification mandates a small number of HTTP headers that facilitate firewall/proxy filtering. The SOAP specification also mandates an XML vocabulary that is used for representing method parameters, return values, and exceptions.
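As a rough sketch of the envelope structure described above, the following builds a message with a mandatory Body and an optional Header. It assumes the SOAP 1.1 envelope namespace and Python's standard xml.etree.ElementTree module; build_envelope, GetPrice and Title are hypothetical names chosen for illustration, not part of the SOAP specification.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
ET.register_namespace("soap", SOAP_NS)

def build_envelope(body_payload, header_payload=None):
    """Wrap a payload element in a SOAP Envelope; the Header is emitted only if supplied."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    if header_payload is not None:
        header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
        header.append(header_payload)
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(body_payload)
    return envelope

# Hypothetical request payload representing a method call and one parameter.
request = ET.Element("GetPrice")
ET.SubElement(request, "Title").text = "War and Peace"

envelope = build_envelope(request)
message = ET.tostring(envelope, encoding="unicode")
print(message)
```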
SOAP provides an open, extensible way for applications to communicate using XML-based messages over the Web, regardless of what operating system, object model or language particular applications may use. SOAP facilitates universal communication by defining a simple, extensible message format in standard XML and thereby providing a way to send that XML message over HTTP.
An “XML infoset” is an abstract representation of an XML document (described at, for example, http://www.w3.org/TR/2004/REC-xml-infoset-20040204). The infoset of an XML document, which is made up of information items, can be viewed as the information content of the XML document, without restriction on the document's format.
An example infoset follows. The root element of the example infoset “Book” contains one attribute called “Price.” The “Price” attribute has a value of “35”. The root element also contains one contents node of type Text having a value of “War and Peace.” The XML standard (described at, for example, http://www.w3.org/TR/REC-xml/) specifies how to serialize an infoset as text. For example, the example infoset can be serialized as follows:
        <Book Price="35">War and Peace</Book>
For transmission or storage, this textual XML is typically encoded into bytes that represent the corresponding text. Some text conversion standards include ASCII, Unicode, UTF-8 and UTF-16. For example, the above textual XML document could be transmitted via ASCII encoding, as follows:
        1st byte transmitted: 60 (ASCII code for ‘<’)
        2nd byte transmitted: 66 (ASCII code for ‘B’)
        3rd byte transmitted: 111 (ASCII code for ‘o’)
        4th byte transmitted: 111 (ASCII code for ‘o’)
        5th byte transmitted: 107 (ASCII code for ‘k’)
        Etc.
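The byte values listed above can be checked with a minimal sketch, assuming Python's built-in str.encode; the variable names are illustrative only.

```python
# Textual XML for the example infoset.
text = '<Book Price="35">War and Peace</Book>'

# ASCII encoding maps each character to exactly one byte.
encoded = text.encode("ascii")
first_five = list(encoded[:5])

for position, value in enumerate(first_five, start=1):
    print(f"byte {position}: {value} (ASCII code for {chr(value)!r})")
```

Because ASCII uses one byte per character, the encoded form is as long as the text itself; a two-byte encoding such as UTF-16 would double that size for the same document.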
Thus, typically, an in-memory representation of an XML infoset is serialized into a textual XML string; then, the characters of the textual string are encoded into corresponding bytes for transmission. In the reverse process, the received bytes are decoded into the corresponding textual XML string, which is de-serialized and stored to provide an in-memory representation of the XML infoset.
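The two-stage process and its reverse might be sketched as follows. This assumes Python's standard xml.etree.ElementTree module as the in-memory representation and UTF-8 as the byte encoding; both are illustrative choices, not requirements of the process described.

```python
import xml.etree.ElementTree as ET

# In-memory representation of the example infoset.
book = ET.Element("Book", {"Price": "35"})
book.text = "War and Peace"

# Sending side: serialize to a textual XML string, then encode the characters as bytes.
text = ET.tostring(book, encoding="unicode")
wire = text.encode("utf-8")

# Receiving side: decode the bytes to text, then de-serialize into an in-memory tree.
received_text = wire.decode("utf-8")
restored = ET.fromstring(received_text)

print(restored.tag, restored.attrib, restored.text)
```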
The in-memory representation of an XML infoset exists logically, but need not exist physically. That is, information items associated with the infoset need not exist in any physical location prior to serialization.
For example, an object-oriented language-based program can include code to serialize and/or de-serialize XML documents. Object-oriented code to serialize the above example could look like:
        XmlWriter.WriteStartElement("Book");
        XmlWriter.WriteAttribute("Price", someDatabase.LookUpPriceForBook("WarAndPeace"));
        XmlWriter.WriteElementContents("War and Peace");
        XmlWriter.WriteEndElement();
The “XmlWriter” object produces the bytes representing the textual XML document:
        <Book Price="35">War and Peace</Book>
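A runnable analogue of this writer-style serialization is sketched below, assuming Python's standard xml.sax.saxutils.XMLGenerator in place of the writer object in the text. The look_up_price function is a hypothetical stand-in for the database lookup. The price is fetched only at the moment it is written, so the infoset never exists as a complete structure in memory.

```python
import io
from xml.sax.saxutils import XMLGenerator

def look_up_price(book_key):
    """Hypothetical stand-in for a database lookup of a book's price."""
    return {"WarAndPeace": "35"}[book_key]

buffer = io.StringIO()
writer = XMLGenerator(buffer, encoding="utf-8")

# Emit the element piece by piece; attributes are supplied with the start tag.
writer.startElement("Book", {"Price": look_up_price("WarAndPeace")})
writer.characters("War and Peace")
writer.endElement("Book")

output = buffer.getvalue()
print(output)
```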
The XML standard affords relatively easy serialization of XML information items and yields human-readable textual serialized documents. These documents, however, can be verbose and inefficient to process.