As XML has become more widely accepted, the need to parse XML has increased. Conventionally, there have been various methods to do this. One method is to build a tree of nodes representing the XML data that was parsed. This is known as the Document Object Model (DOM) approach, which may consume significant memory and processing time, particularly when processing large XML documents. Thus, “lower level” XML parsers were developed that provide access to a stream of XML tokens, which facilitates reducing processing times.
In the object oriented world a common model for low-level parsing is the push model parser (e.g., SAX) that parses an entire XML document and pushes substantially all of the parsed XML, associated parsing events and related event data to a parse requestor. This approach suffers from requiring a parse requestor to maintain a complicated state machine, the inability to concurrently interact with multiple XML sources and presenting a parse requestor with undesired XML tokens, which can complicate state machines associated with such parsers. Such state machine complexity may be exacerbated, for example, by the need to maintain state for dual capability parsers that split event level abstractions from element level abstractions.
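The state-keeping burden described above can be illustrated with a minimal sketch, assuming Python's standard xml.sax module as a stand-in for a push model parser. The handler class FirstNameHandler and its fields are hypothetical names chosen for illustration. The requestor wants only the firstname text, yet every event is pushed to it and it must track its own position in the document:

```python
import xml.sax

SAMPLE = b"""<?xml version="1.0"?>
<programmer grade="G7">
<firstname>ashton</firstname>
<lastname>annie</lastname>
<language>C</language>
<language>C#</language>
</programmer>"""


class FirstNameHandler(xml.sax.ContentHandler):
    """Hypothetical parse requestor that only wants <firstname> text."""

    def __init__(self):
        super().__init__()
        self.in_firstname = False  # state the requestor must maintain itself
        self.first_names = []
        self.events_seen = 0       # counts every callback pushed by the parser

    def startElement(self, name, attrs):
        self.events_seen += 1
        if name == "firstname":
            self.in_firstname = True
            self.first_names.append("")

    def endElement(self, name):
        self.events_seen += 1
        if name == "firstname":
            self.in_firstname = False

    def characters(self, content):
        self.events_seen += 1
        # Character data arrives for every element; filter by tracked state.
        if self.in_firstname:
            self.first_names[-1] += content


handler = FirstNameHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.first_names, handler.events_seen)
```

Even in this small case a single boolean of state is needed; tracking nested or repeated elements would require correspondingly more state in the requestor.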
In the non-object oriented world there are simple pull model parsers that may employ, for example, a single function (e.g., GetNextToken( )) that returns a struct containing information about the next token. Such parsers also suffer from the problem of presenting the parse requestor with undesired XML tokens. Furthermore, non-object oriented XML pull model parsers typically do not provide high-level input/output abstractions and suffer from traditional problems associated with non-object oriented code. Thus, there remains a need for an improved object oriented XML parser.
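A simple pull model of this kind can be sketched as follows, assuming Python's standard xml.dom.pulldom module as the token source. The function get_next_token is a hypothetical analogue of GetNextToken( ) and is not taken from any particular parser. Each call returns the next token whether or not the caller wants it:

```python
from xml.dom import pulldom

SAMPLE = """<?xml version="1.0"?>
<programmer grade="G7">
<firstname>ashton</firstname>
<lastname>annie</lastname>
<language>C</language>
<language>C#</language>
</programmer>"""

stream = pulldom.parseString(SAMPLE)


def get_next_token(event_stream):
    """Hypothetical GetNextToken(): return (kind, name_or_text), or None at end."""
    try:
        event, node = next(event_stream)
    except StopIteration:
        return None
    if event == pulldom.CHARACTERS:
        return (event, node.data)
    return (event, node.nodeName)


# The caller pulls tokens one at a time and cannot skip the unwanted ones.
tokens = []
while True:
    token = get_next_token(stream)
    if token is None:
        break
    tokens.append(token)

wanted = [t for t in tokens if t == (pulldom.START_ELEMENT, "firstname")]
print(len(tokens), "tokens pulled;", len(wanted), "start tag of interest")
```

The caller controls when the next token is read, but the single entry point offers no way to request only the tokens of interest.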
XML is a W3C (World Wide Web Consortium) endorsed standard for a document format that provides a generic syntax to mark up data with human-readable tags. Since XML does not have a fixed set of tags and elements, but rather allows users to define such tags (so long as they conform to XML syntax), XML can be considered a meta-markup language for text documents. The markup that is allowed in a particular XML document can be recorded in a document type definition (DTD).
Data is stored in XML documents as strings of text that are surrounded by text markup. A particular unit of data and markup is conventionally referred to as an element. XML defines the syntax for the markup. A simple XML document appears below:
<?xml version="1.0"?>
<programmer grade="G7">
<firstname>ashton</firstname>
<lastname>annie</lastname>
<language>C</language>
<language>C#</language>
</programmer>
In this document, the name “ashton” is data (a.k.a. content), and the tags <firstname> and </firstname> are markup associated with that content. The example document is text and may be edited by conventional text editors and stored in locations including, but not limited to, a text file, a collection of text files, a database record and in memory.
XML documents can be treated as trees comprising a root node and one or more leaf nodes. In the example document, the root element is the programmer element. Furthermore, elements may have parent elements and child elements. In the example document, the programmer element is a parent element that has four child elements: a firstname element, a lastname element, and two language elements. In the example document, the programmer element also has an attribute “grade”. An attribute is a name/value pair that is associated with the start tag of an element. XML documents may contain XML entities including elements, tags, character data, attributes, entity references, CDATA sections, comments, processing instructions, and so on.
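The tree view of the example document can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree module. The variable names are illustrative only:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<programmer grade="G7">
<firstname>ashton</firstname>
<lastname>annie</lastname>
<language>C</language>
<language>C#</language>
</programmer>"""

root = ET.fromstring(SAMPLE)              # root element of the tree

child_tags = [child.tag for child in root]  # child elements, in document order
languages = [child.text for child in root.findall("language")]
grade = root.get("grade")                 # attribute on the root's start tag

print(root.tag, child_tags, languages, grade)
```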
The W3C has codified XML's abstract data model in a specification called the XML Information Set (Infoset). The Infoset describes the logical structure of an XML document in terms of nodes (a.k.a. “information items”) that have properties. Nodes in an XML tree have well-defined sets of properties that may be exposed. For example, an element node has properties including, but not limited to, a namespace name, a local name, a prefix, an unordered set of attributes, and an ordered list of children. The abstract description of an XML document standardizes information that is made available concerning XML documents. Thus, in addition to data that may be stored in an XML node, metadata concerning the node and the tree in which the node resides is available.
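As a rough illustration of these element properties, the following sketch assumes Python's standard xml.dom.minidom module and a namespaced variant of the sample document. The namespace URI urn:example:staff and the prefix p are hypothetical and do not appear in the original sample. The DOM node properties shown only approximate the corresponding Infoset properties:

```python
from xml.dom import minidom

# Hypothetical namespaced variant of the sample document.
DOC = (
    '<p:programmer xmlns:p="urn:example:staff" grade="G7">'
    "<p:firstname>ashton</p:firstname>"
    "<p:language>C</p:language>"
    "</p:programmer>"
)

elem = minidom.parseString(DOC).documentElement

info = {
    "namespace_name": elem.namespaceURI,
    "local_name": elem.localName,
    "prefix": elem.prefix,
    # Namespace declarations are reported as attributes by minidom; skip them.
    "attributes": {
        name: value
        for name, value in elem.attributes.items()
        if not name.startswith("xmlns")
    },
    # Ordered list of child elements.
    "children": [
        child.localName
        for child in elem.childNodes
        if child.nodeType == child.ELEMENT_NODE
    ],
}
print(info)
```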
Programs that try to understand the contents of a document like the sample XML document employ an XML parser to separate the document into individual XML tokens, elements, attributes and so on. Conventional push model parsers may perform well-formedness and validity checking on a parsed XML document. An XML document may be checked to determine whether it is well-formed (conforms to the XML specification) and to determine whether it is valid (conforms to a desired DTD). A DTD includes a list of elements, attributes and entities that an XML document can employ and the contexts in which they may and/or may not be employed. Thus, a DTD facilitates limiting the form of an XML document. A DTD may be located within an XML document, or an external reference to the DTD may be employed to locate the DTD with which an XML document is related. External references are common since it may be desirable to have more than one XML document conform to one DTD.
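The well-formedness half of this check can be sketched with Python's standard xml.etree.ElementTree module, which reports a ParseError for input that is not well-formed. The helper name is_well_formed is hypothetical. Validity checking against a DTD is not shown, since that standard module does not perform it:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = '<programmer grade="G7"><firstname>ashton</firstname></programmer>'
bad = '<programmer grade="G7"><firstname>ashton</programmer>'  # unclosed element

print(is_well_formed(good), is_well_formed(bad))
```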
With XML being employed to store data for such a variety of applications, the need to parse XML for use with such a variety of applications is common. Some conventional parsers may parse and then write more of the parsed output, events associated with the parsing (e.g., encountered elements, encountered attributes, encountered comments, encountered white space, etc.) and information (e.g., state, attributes) associated with the events than a user desires. Such over-parsing parsers suffer from several drawbacks, including, but not limited to, requiring the receiver of the parsed data to maintain a complicated state machine, transforming unneeded data, consuming excessive memory to hold undesired data, events and/or metadata, consuming excessive processor cycles to process such undesired data, events and/or metadata and limiting the flexibility with which the output destination can request parsed data.
As conventional parsers improve, more selective parsing, which reduces the amount of XML parsed, has appeared. However, such parsers may still present the user with non-configurable, non-selectable and thus irrelevant and/or unwanted data, events and/or metadata.
By way of illustration of a drawback of a conventional over-parsing parser, consider a user who desires to see the data associated with the <firstname> tags in the sample XML document listed above. Conventionally, the pieces of the document other than just the desired data would be loaded and parsed, and the user would be required to extract the relevant data from the parsed data. Again, excessive memory and processor cycles have been employed in parsing irrelevant data.
Conventional parsers typically interact with event driven user programs that receive event notifications from the parser along with a set of data concerning the event. One drawback with such conventional systems is that event notifications may require unnecessary processing by a user program that may only be interested in a subset of events. Furthermore, simple pull model parsers may only provide a single pull method that will non-selectively provide the next XML token in an XML data source, regardless of whether the user desires such token, which forces the user to handle an irrelevant (to the user) token, event, data and/or metadata. Further still, user programs that interact with such event producing parsers may be required to maintain complicated state machines in order to interact with the conventional parser.
Since conventional parsers typically interact with event driven user programs that are required to maintain complicated state machines concerning the progress of the parsing, it is typical that such conventional parsers only interact with a single XML data source. Thus, flexibility in processing parsed data is limited in such conventional systems.