The eXtensible Markup Language, otherwise known as XML, has become a standard for inter-application communication. XML messages passing between applications contain tags with self-describing text. The self-describing text allows these messages to be understandable not only to the applications, but also to humans reading an XML document. XML is currently used to define standards for exchanging information in various industries. These document standards are available in various forms.
Several XML-based communication protocols exist, such as the Simple Object Access Protocol (SOAP) and the ebXML protocol. The ebXML protocol is an open XML-based infrastructure that enables the global use of electronic business information. SOAP is a lightweight XML protocol that can provide both synchronous and asynchronous mechanisms for sending requests between applications. These XML documents are usually transported over a lower-level network protocol, such as TCP/IP.
XML documents need to be well-formed and, where a schema applies, valid. An XML document is considered to be “well-formed” if it conforms to the syntax rules of the XML standard. An XML document is considered valid if it also complies with a particular schema. At the core of XML document processing is an XML parser, which checks that a document is well-formed and/or valid.
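The well-formedness check described above can be sketched with the JAXP SAX parser that ships with Java. This is a minimal illustration, not a definitive implementation: the class name `WellFormedCheck` and the method `isWellFormed` are hypothetical, and the sketch checks only well-formedness, since checking validity would additionally require a schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events; its fatalError()
            // rethrows the parser's exception on a well-formedness error.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser rejected the document.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a property of the document: parser setup or I/O failed.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));   // properly nested
        System.out.println(isWellFormed("<a><b></a>"));    // <b> is never closed
    }
}
```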
The processing of XML has become a standard function in many computing environments. When parsing XML, it is necessary to get data from the XML file and transform the data such that the data can be handled by a Java application or other application running the parser. Efficient XML processing is fundamental to the server. As more and more documents become XML-based, more and more traffic on the server will be in XML. The latest push into web services (with SOAP as the transport) has also highlighted the fundamental need for fast XML processing. Web services use XML over HTTP as the transport for remote procedure calls. These calls cannot be done in a timely manner if the XML parser is slow. There are primarily two standard approaches for processing XML: (1) SAX, or Simple API for XML, and (2) DOM, or Document Object Model. Each approach has its benefits and drawbacks, although SAX presently has more momentum as an XML processing API.
SAX is an event-based API for parsing XML documents, presenting a document as a serialized event stream. An API, or application programming interface, provides a defined method for developing and utilizing applications. With SAX, a Java application can work with any XML parser, as long as the parser has a SAX driver available. In SAX, an event is generated every time a piece of the XML document is processed. That event is sent to a document handler, which is an object that implements the various SAX handler APIs. Handlers can receive callbacks during the processing of an XML document. Some of the main benefits of this style of XML document processing are that it is efficient, flexible, and relatively low level. It is also possible to change handlers during the processing of an XML document, allowing the use of different handlers for different sections of a document.
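To make the event stream concrete, the following sketch registers a handler with the JAXP SAX parser and records each callback in the order it arrives. The class name `SaxEventLog`, the method `events`, and the sample document are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: logs SAX callbacks as a flat, ordered list of events.
final class SaxEventLog {

    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several callbacks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints one line per event, in document order.
        events("<order><item>pen</item></order>").forEach(System.out::println);
    }
}
```

The handler never sees the document as a whole. It sees only the current event, which is what keeps this style low level and memory efficient.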
One drawback to using a SAX API is that a programmer must keep track of the current state of the document in the code each time an XML document is processed. This may be an unacceptable amount of overhead for XML processing, and may further lead to convoluted document processing code.
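The state-tracking burden can be illustrated with a handler that must remember where it is in the document in order to tell one `name` element from another. This is a sketch under assumed names: `ItemNameHandler`, `itemNames`, and the `order`/`item`/`name` vocabulary are hypothetical and appear nowhere in the text above.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of <name> elements whose parent is <item>.
final class ItemNameHandler extends DefaultHandler {
    // State the programmer must maintain by hand: open elements and pending text.
    private final Deque<String> openElements = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Accumulate, because text may arrive in more than one callback.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
        // After the pop, the top of the stack is the parent of the closed element.
        if ("name".equals(qName) && "item".equals(openElements.peek())) {
            names.add(text.toString().trim());
        }
        text.setLength(0);
    }

    static List<String> itemNames(String xml) {
        ItemNameHandler handler = new ItemNameHandler();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return handler.names;
    }

    public static void main(String[] args) {
        String xml = "<order><name>Acme</name>"
                + "<item><name>pen</name></item>"
                + "<item><name>ink</name></item></order>";
        System.out.println(itemNames(xml)); // the order-level name is skipped
    }
}
```

Even for this small task, the handler carries a stack and a text buffer whose only purpose is to reconstruct context that the parser already knew.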
Another problem with SAX is that it follows a push model: the parser drives the processing and sends each event to the user as it occurs. Events cannot be requested by the user as they are needed.
DOM, the other standard approach, requires loading an entire XML document into memory and provides a programmer with APIs to be used in manipulating an in-memory tree structure. DOM is a “tree-based” API, as opposed to the event-based SAX. DOM is referred to as “tree-based” because it utilizes a logical structure based on nodes for “branching” through a document. At first glance, DOM might seem like a preferred approach to parsing for an application developer, as the developer does not have to write specific parsing code. This perceived simplicity comes at a price, however, in that performance takes a significant hit. Even for very large documents, the entire document must still be read into memory before any action can be taken based on the data. DOM can also be restrictive in how it loads data into memory. A programmer must use a DOM tree as the base for handling the XML in the document. This can be too restrictive for most application needs. For example, most application server deployment descriptors need to be bound to specific Java classes and not DOM trees.
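For comparison, a minimal DOM sketch using the JAXP `DocumentBuilder` is shown here. The class name `DomItems`, the method `items`, and the `item`/`sku` vocabulary are assumptions made for illustration. Note that the whole document is parsed into a tree before the first value can be read, and that the application still has to copy data out of the generic tree into its own types.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example: reads <item sku="..."> elements from an in-memory DOM tree.
final class DomItems {

    static List<String> items(String xml) {
        Document doc;
        try {
            // The entire document is loaded into memory as a tree here.
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        // Only after parsing completes can the application walk the tree.
        NodeList nodes = doc.getElementsByTagName("item");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element item = (Element) nodes.item(i);
            // Copy data out of the generic DOM node into an application value.
            result.add(item.getAttribute("sku") + "=" + item.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<order><item sku=\"p1\">pen</item><item sku=\"i2\">ink</item></order>";
        System.out.println(items(xml));
    }
}
```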