1. Field of the Invention
The present invention relates to methods and systems for data exchange in an information processing system. In particular, the present invention relates to providing an optimized parser for processing a structured document (e.g., an XML document) in an information processing system.
2. Discussion of the Related Art
XML is a platform-independent text-based document format designed to be used in structured documents maintained in an information processing system. In this description, a platform-independent text-based document format means a text format for defining a document which is independent of the underlying software platform (e.g., the operating system), the underlying hardware platform, or both. XML documents (e.g., forms) have become the favored mechanism for data exchange among application programs sharing data over a network (e.g., the Internet). XML documents have the advantages that (a) the information in an XML document is extensible (i.e., an application program developer can define a document structure using, for example, an XML schema description), and (b) through the XML schema, an application developer can control the range of values that can be accepted for any XML element or attribute in the structured document. For example, in an XML schema-defined form for a pair of shoes, the application program developer may constrain the shoe size attribute accepted by the form to be between 5 and 12. As a result, the form would reject a shoe size of 100 as invalid input.
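The shoe size constraint can be illustrated with a minimal Python sketch. The check below is a hand-written stand-in for what a schema processor would enforce, not an XML schema implementation; the element name `shoe`, the attribute name `size`, and the function name are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Bounds taken from the shoe example in the text; the names are hypothetical.
MIN_SIZE, MAX_SIZE = 5, 12


def shoe_size_is_valid(xml_text):
    """Return True if the root element's size attribute is an integer in range."""
    shoe = ET.fromstring(xml_text)
    raw = shoe.get("size")  # attribute values are always text at this point
    if raw is None:
        return False
    try:
        size = int(raw)
    except ValueError:
        return False  # not an integer at all
    return MIN_SIZE <= size <= MAX_SIZE


print(shoe_size_is_valid('<shoe size="9"/>'))    # a size inside the range
print(shoe_size_is_valid('<shoe size="100"/>'))  # the out-of-range size from the text
```

In a schema-driven system the bounds would be declared once in the schema rather than hard-coded in the application program.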
Because of these advantages, XML is widely used in consumer application programs. However, because XML is designed to be read and edited easily by a human using a word processor as an interface, additional processing overhead is imposed on the application program: the structured data in an XML document must be parsed by the application into representations that the application program can manipulate in the computer. Parsing consumes substantial computational resources, such as CPU cycles and memory bandwidth, because the application program processes the XML elements or attributes one character at a time, in addition to implementing the higher-level processing requirements of the XML schema. A typical XML document can contain a large number of elements and attributes, which are defined in the schema using different data types and constraints. Character-matching is not efficient in existing hardware implementations, such as those based on the IA32 and ARM architectures.
Parsers for documents written in numerous languages have been developed and used throughout the history of computers. For example, the first widely accepted parsers (which also validate) for XML are based on the W3C Document Object Model (DOM). DOM renders the information in an XML document into a tree structure. Thus, a parser based on DOM constructs a “DOM tree” in memory to represent the XML document, as it reads the XML document. The DOM tree is then passed to the application program, which traverses the DOM tree to extract its required information. Constructing a DOM tree in memory is not only time-consuming but also requires a large amount of memory. In fact, the memory occupied by a DOM tree is usually 5-10 times greater than that occupied by the original XML document. One optimization constructs a partial DOM tree in memory as needed to reduce the memory requirement and the processing time.
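The DOM approach described above can be sketched with the `xml.dom.minidom` module of the Python standard library. This is an illustration only; the sample document and the function names are hypothetical. The parser first builds the entire tree in memory, and only then does the application traverse it to extract what it needs.

```python
from xml.dom import minidom

# Hypothetical sample document used only for illustration.
DOCUMENT = '<order id="42"><shoe size="9">Runner</shoe><shoe size="11">Boot</shoe></order>'


def count_nodes(node):
    """Count this node plus all of its descendants in the DOM tree."""
    return 1 + sum(count_nodes(child) for child in node.childNodes)


def shoe_sizes_via_dom(xml_text):
    """Build the full DOM tree, then traverse it for the shoe sizes."""
    dom = minidom.parseString(xml_text)  # whole tree is constructed here
    try:
        return [int(node.getAttribute("size"))
                for node in dom.getElementsByTagName("shoe")]
    finally:
        dom.unlink()  # release the tree once the application is done with it


print(shoe_sizes_via_dom(DOCUMENT))
print(count_nodes(minidom.parseString(DOCUMENT)))
```

Every element and text run becomes a separate node object, which is one reason a DOM tree occupies considerably more memory than the source text.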
Alternatively, an XML document may be parsed based on a streaming model. Parsers using the streaming model include SAX and Pull. Under the streaming model, rather than a parse tree, a parser outputs a continuous stream of XML elements, together with the values of their attributes, as the XML document is parsed. Typically, such a parser reads from the XML document one XML element at a time, and passes to the consuming application the value of the element and the values of its associated attributes. Although a streaming-based parser is efficient in its memory and processing speed requirements, such a parser merely tokenizes a string into segments of text without interpretation. The interpretation of data contained in each text segment is entirely left to the consuming application program. Thus, the burden of XML processing, which is to provide data in an XML document to the application program in a manner that can be readily used by the application program, is shifted from the parser to the consuming application program.
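The streaming model can be sketched with the `xml.sax` module of the Python standard library. The handler class, the function name, and the sample document below are hypothetical. Note that the parser delivers attribute values as uninterpreted text, so the conversion to an integer is left to the consuming application, as the paragraph above describes.

```python
import xml.sax


class ElementCollector(xml.sax.ContentHandler):
    """Records each start-element event as (name, attributes) in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Attribute values arrive as plain strings; nothing is interpreted here.
        self.events.append((name, dict(attrs.items())))


def stream_elements(xml_bytes):
    """Parse the document as a stream and return the recorded events."""
    handler = ElementCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


events = stream_elements(b'<order id="42"><shoe size="9"/><shoe size="11"/></order>')

# Interpretation of the text segments falls to the consuming application.
sizes = [int(attrs["size"]) for name, attrs in events if name == "shoe"]
print(events)
print(sizes)
```

No tree is retained between events, which keeps memory use low but leaves all typing and range checking to the caller.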
A parser may or may not validate an XML document. Validation is the process by which each parsed XML element is compared against its definition in an XML schema (e.g., an XML DTD file). Validation typically requires string pattern-matching as the validation program searches the multiple element definitions in the XML schema. A conventional approach to simplifying validation is to convert the definitions of an XML schema into component models, expressed as a series of Java bean classes. An application program may then check the XML elements using methods provided in the Java bean classes. While schema conversion methods may speed up both the parsing and the validation processes to some limited degree, such conversions do not provide the fast string pattern-matching desired in XML parsing and validating.
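The string-matching character of validation can be sketched as follows. This is a simplified illustration under stated assumptions, not a DTD or XML schema processor: the `DEFINITIONS` table, its contents, and the function names are hypothetical, and each definition lists only the attributes an element is assumed to require.

```python
import xml.etree.ElementTree as ET

# Hypothetical element definitions: (element name, required attribute names).
DEFINITIONS = [
    ("order", {"id"}),
    ("shoe", {"size"}),
]


def find_definition(tag):
    """Search the definitions by string comparison, one entry at a time."""
    for name, required in DEFINITIONS:
        if name == tag:
            return required
    return None


def validate(xml_text):
    """Return a list of error messages; an empty list means the document passed."""
    errors = []
    for element in ET.fromstring(xml_text).iter():  # document order
        required = find_definition(element.tag)
        if required is None:
            errors.append(f"undefined element <{element.tag}>")
            continue
        for attr in sorted(required - set(element.attrib)):
            errors.append(f"<{element.tag}> is missing attribute '{attr}'")
    return errors


print(validate('<order id="1"><shoe size="9"/></order>'))
print(validate('<order><hat/></order>'))
```

Every element triggers a search over the definitions by string comparison, so the cost of validation grows with both the number of elements and the number of definitions.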
As is apparent from the above, XML parsing involves a substantial amount of string-matching, which is the most CPU-intensive operation in XML parsing. Further, the memory requirements of parsing XML elements also lead to a substantial number of inefficient memory allocation and de-allocation operations.