The Extensible Markup Language (XML) is a markup language for describing (marking up) the semantic structure of a document with simple marks. XML allows a user to create custom extensions by defining a grammar and assigning logical meaning to the constituents of the document.
XML documents may be intended to conform to a document type definition (DTD), and software programs can determine whether an XML document does so conform. For example, a DTD may include a grammatical rule to the effect that nodes <TITLE>, <AUTHOR> and <PUBLISHER> appear once in this specific order after a node <BOOK>. Then, it is possible to determine whether a certain XML document accords with the grammatical rule or not.
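The ordering rule in the example above can be illustrated with a minimal sketch. It is not a DTD validator: Python's standard xml.etree.ElementTree module does not validate against a DTD, so the rule is hard-coded here. The names EXPECTED_CHILDREN and follows_book_rule are hypothetical, and the sketch assumes that "after a node <BOOK>" means the three nodes are the children of <BOOK>.

```python
import xml.etree.ElementTree as ET

# Hypothetical hard-coded rule mirroring the example: a BOOK element contains
# TITLE, AUTHOR and PUBLISHER, once each, in exactly this order.
EXPECTED_CHILDREN = ["TITLE", "AUTHOR", "PUBLISHER"]


def follows_book_rule(xml_text):
    """Return True if the root is BOOK and its children match the rule."""
    root = ET.fromstring(xml_text)
    if root.tag != "BOOK":
        return False
    # Iterating over an element yields its direct children in document order.
    return [child.tag for child in root] == EXPECTED_CHILDREN


GOOD = "<BOOK><TITLE>T</TITLE><AUTHOR>A</AUTHOR><PUBLISHER>P</PUBLISHER></BOOK>"
BAD = "<BOOK><AUTHOR>A</AUTHOR><TITLE>T</TITLE><PUBLISHER>P</PUBLISHER></BOOK>"

print(follows_book_rule(GOOD))
print(follows_book_rule(BAD))
```

A real validating parser would read the rule from the DTD itself rather than from a constant in the program.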
To conform to a specific DTD, an XML document expresses its data structure universally by using certain marks, or tags, which may be stored with the XML document. Accordingly, an XML document tends to have a larger file size than the same data stored in many other file formats.
By virtue of its tag-based nature, XML also defines a strict tree structure or hierarchy. XML elements are structural constructs that consist of a start tag, an end or close tag, and the information or content that is contained between the tags. A start tag is formatted as “<tagname>” and an end tag is formatted as “</tagname>”. In an XML document, start and end tags can be nested within other start and end tags. All elements that occur within a particular element must have their start and end tags occur before the end tag of that particular element. It is this requirement that defines the tree-like structure that is characteristic of XML documents. Each element forms a node in this tree, and each node potentially has child or branch nodes. Child nodes represent any XML elements that occur between the start and end tags of a parent node.
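The parent and child relationship described above can be sketched as a short recursive walk over a parsed document. The sample document and the helper name outline are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: NAME is nested inside AUTHOR, which is
# nested inside BOOK.
DOC = "<BOOK><TITLE>XML Basics</TITLE><AUTHOR><NAME>Doe</NAME></AUTHOR></BOOK>"


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all of its descendants."""
    rows = [(depth, element.tag)]
    for child in element:  # direct children, in document order
        rows.extend(outline(child, depth + 1))
    return rows


tree_rows = outline(ET.fromstring(DOC))
for depth, tag in tree_rows:
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one level of nesting of start and end tags in the document.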
A software application may be intended to interact with, or read, a specific XML document. One way to enable this interaction is for such an XML document to first have its content parsed by a software component called a parser. A parser reads an XML document and creates an output that an application, such as a Web browser, can then use, for example, to generate a display. The output that the parser generates is based on the XML document's content and the markup used to describe that content. In some instances, the document is compared to rules specified in its DTD. DTD-conforming XML documents are called valid. Parsers that have the ability to compare a document to its DTD and determine whether the document is valid are called validating parsers. Even if an XML document is not validated, the XML document still may conform to the general rules of document creation established in the XML specification. Documents that obey these general rules are called well-formed.
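The distinction between well-formed and valid documents can be illustrated with a non-validating parser. The following minimal sketch checks well-formedness only; it makes no statement about validity against a DTD. The function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, without any DTD validation."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Raised for violations such as improperly nested or unclosed tags.
        return False
    return True


print(is_well_formed("<a><b></b></a>"))  # properly nested tags
print(is_well_formed("<a><b></a></b>"))  # end tags in the wrong order
```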
There are at least two ways to parse an XML document: using a DOM (Document Object Model) parser and using a SAX (Simple API for XML) parser. The DOM parser reads the entire document into memory and creates the tree-like structure comprising a series of nodes. The tree-like structure is also stored in memory, thereby increasing the memory requirements for this method. Furthermore, creation of the tree-like structure is CPU intensive, as is the subsequent parsing of the data populating the structure.
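As an illustration of DOM-style parsing, the following minimal sketch uses Python's standard xml.dom.minidom module, which builds the whole tree in memory before the application touches it. The sample document and variable names are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document without whitespace between elements, so that
# every child node in the tree is an element.
DOC = "<BOOK><TITLE>XML Basics</TITLE><AUTHOR>Doe</AUTHOR></BOOK>"

# parseString reads the entire document and returns the in-memory tree.
dom = minidom.parseString(DOC)
root = dom.documentElement

# Any node can be reached at any time because the full tree is retained.
title = root.getElementsByTagName("TITLE")[0].firstChild.data

# Navigation also works backwards: from the last child to its predecessor.
previous_tag = root.lastChild.previousSibling.tagName

print(title, previous_tag)
```

The convenience of random access comes at the cost noted above: the whole tree remains resident in memory for as long as the application holds on to it.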
In contrast to the DOM parser, the SAX parser normally does not read the entire document into memory. Instead, the SAX parser reads a section of the XML document into memory and then parses the section. The SAX parser may continue this operation until the entire XML document is parsed. As the XML document is parsed, the SAX parser calls subroutines that are registered to handle the specific type of element the SAX parser encounters in the XML document. Although the SAX parser by design consumes less memory than the DOM parser, there are certain drawbacks that make SAX parsing CPU and I/O intensive. For example, the entire XML document sometimes needs to be read into memory. Also, SAX parsing is unidirectional; previously parsed data cannot be re-read without starting the parsing operation again.
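The event-driven behavior described above can be sketched with Python's standard xml.sax module. The handler class TagRecorder, the sample document, and the 16-byte chunk size are assumptions chosen for illustration; the sketch feeds the document to the parser one section at a time and relies on the registered callback to observe each start tag.

```python
import xml.sax


class TagRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records start tags in the order seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is encountered.
        self.events.append(name)


DOC = b"<BOOK><TITLE>XML Basics</TITLE><AUTHOR>Doe</AUTHOR></BOOK>"
CHUNK_SIZE = 16  # arbitrary small size, to show section-by-section parsing

handler = TagRecorder()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Only one chunk is handed to the parser at a time; no tree is built.
for start in range(0, len(DOC), CHUNK_SIZE):
    parser.feed(DOC[start:start + CHUNK_SIZE])
parser.close()

print(handler.events)
```

Because nothing is retained except what the handler chooses to store, revisiting an earlier element requires parsing the document again, which is the unidirectional limitation noted above.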
A disadvantage of both DOM and SAX parsers is the large library that is used to define the syntax and rules for parsing XML documents. The large library places a greater demand on memory. An associated issue is that most XML parser libraries are designed for dynamic XML documents. A dynamic XML document is one in which all or part of the document content is provided by calls to programs, such as Web services, with other data provided explicitly. Because of the dynamic nature of these XML documents, the XML parser library is highly redundant, which places even more demands on memory.