The use of hierarchical mark-up languages for structuring and describing data has found wide acceptance in the computer industry. An example of a mark-up language is XML.
Data structured using a hierarchical mark-up language is composed of tree nodes. Tree nodes are delimited by a pair of corresponding start and end tags, which not only delimit the tree node, but also specify the name of the tree node. For example, in the following structured data fragment,
<ZIPCODE><CODE>95125</CODE> <CITY>SAN JOSE</CITY><STATE>CA</STATE></ZIPCODE>
the start tag <ZIPCODE> and the end tag </ZIPCODE> delimit a tree node having name ZIPCODE.
The data between the corresponding tags is referred to as the tree node's content. A tree node's content can either be a scalar value (e.g. integer, text string), or one or more other tree nodes.
A tree node that contains another tree node is referred to herein as a structured tree node. The contained tree nodes are referred to herein as descendant tree nodes. A structured tree node thus forms a hierarchy of tree nodes with multiple levels, the structured tree node being at the top level. A tree node at each level is linked to one or more tree nodes at a different level. Any given tree node at a level below the top level is a child tree node of a parent tree node at the level above the given tree node. Tree nodes having the same parent are sibling tree nodes. A parent tree node may have multiple child tree nodes. A tree node that has no parent tree node linked to it is a root tree node, and a tree node that has no child tree nodes linked to it is a leaf tree node. For example, in structured tree node ZIPCODE, tree node ZIPCODE is the root tree node at the top level. Tree nodes CODE, CITY, and STATE are descendant and child tree nodes of ZIPCODE, and with respect to each other, tree nodes CODE, CITY and STATE are sibling tree nodes. Tree nodes CODE, CITY, and STATE are also leaf tree nodes.
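The parent, child, sibling, and leaf relationships described above can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name describe is hypothetical and is not part of the techniques described herein.

```python
import xml.etree.ElementTree as ET

# The structured data fragment from the example above.
FRAGMENT = "<ZIPCODE><CODE>95125</CODE> <CITY>SAN JOSE</CITY><STATE>CA</STATE></ZIPCODE>"


def describe(fragment):
    """Return (root name, child names, leaf names) for an XML fragment.

    Hypothetical helper, for illustration only.
    """
    root = ET.fromstring(fragment)
    # Iterating an element yields its direct children (its child tree nodes).
    children = [child.tag for child in root]
    # A leaf tree node has no child tree nodes of its own.
    leaves = [node.tag for node in root.iter() if len(node) == 0]
    return root.tag, children, leaves


root_name, child_names, leaf_names = describe(FRAGMENT)
print(root_name)    # name of the root tree node
print(child_names)  # sibling tree nodes under the root
print(leaf_names)   # tree nodes with no children
```

Because CODE, CITY, and STATE contain only scalar values, the list of children and the list of leaves are the same in this fragment.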
In a tree of tree nodes that represents an XML document, a tree node corresponds to an element, and the child tree nodes of that tree node correspond to the elements contained in the element. For convenience of expression, elements and other parts of an XML document are referred to as tree nodes within a tree of tree nodes that represents the document.
A structured tree node within a structured tree node may be referred to as a subtree of the structured tree node. In XML, a structured tree node corresponds to a complex element. XML documents may contain large complex elements and subtrees. A subtree may also be referred to herein as a subdocument.
Namespace
An advantage of a mark-up language is that the tags used to structure a document may be given names that describe, in human-readable terms, the tags' intended content. However, the same name may be used in many applications and contexts, and the semantics of the name may vary between them.
Namespaces allow a name to be reused without ambiguity. A name is referred to by qualifying it with a namespace, so that the same name in two different namespaces can be told apart.
In XML, an XML namespace may have a namespace name, such as a uniform resource identifier (URI). The namespace name may be bound to an alias, and the alias is used as a proxy for the namespace name, and may be used to qualify names of elements in an XML document. A namespace is declared for an element, and the scope of the namespace is the element and the element's descendants.
The following XML document DOC is used to illustrate namespaces.
<DOC....>
  <ACCT:CUSTOMER xmlns:ACCT="HTTP://WWW.MY.COM/ACCT-REV10">
    <ACCT:NAME>CORPORATION</ACCT:NAME>
    <ACCT:ORDER ACCT:REF="5566"/>
    <ACCT:STATUS>INVOICE</ACCT:STATUS>
  </ACCT:CUSTOMER>
  <FUL:CUSTOMER xmlns:FUL="HTTP://WWW.YOUR.COM/FUL">
    <FUL:NAME>CORPORATION</FUL:NAME>
    <FUL:ORDER FUL:REF="A98756"/>
    <FUL:STATUS>SHIPPED</FUL:STATUS>
  </FUL:CUSTOMER>
</DOC>
The element ACCT:CUSTOMER declares the namespace HTTP://WWW.MY.COM/ACCT-REV10 through use of the XML reserved keyword xmlns, and declares ACCT as a prefix or alias for the namespace. This prefix is used to qualify the names of elements ACCT:CUSTOMER, ACCT:NAME, ACCT:ORDER, and ACCT:STATUS. The scope of the ACCT namespace is element ACCT:CUSTOMER and its descendants; it does not extend to element FUL:CUSTOMER.
The element FUL:CUSTOMER declares the namespace HTTP://WWW.YOUR.COM/FUL, and declares FUL as a prefix or alias for the namespace. This prefix is used to qualify the names of elements FUL:CUSTOMER, FUL:NAME, FUL:ORDER, and FUL:STATUS.
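The effect of qualifying names with a namespace can be sketched as follows, assuming Python's standard xml.etree.ElementTree module. The sketch uses document DOC with the elided part of the DOC start tag omitted and with straight quotation marks, so that the text is well-formed; the lookup prefixes a and f are arbitrary, hypothetical aliases chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Document DOC from above; the elided part of the <DOC....> tag is omitted.
DOC = """<DOC>
  <ACCT:CUSTOMER xmlns:ACCT="HTTP://WWW.MY.COM/ACCT-REV10">
    <ACCT:NAME>CORPORATION</ACCT:NAME>
    <ACCT:ORDER ACCT:REF="5566"/>
    <ACCT:STATUS>INVOICE</ACCT:STATUS>
  </ACCT:CUSTOMER>
  <FUL:CUSTOMER xmlns:FUL="HTTP://WWW.YOUR.COM/FUL">
    <FUL:NAME>CORPORATION</FUL:NAME>
    <FUL:ORDER FUL:REF="A98756"/>
    <FUL:STATUS>SHIPPED</FUL:STATUS>
  </FUL:CUSTOMER>
</DOC>"""

ACCT_NS = "HTTP://WWW.MY.COM/ACCT-REV10"
FUL_NS = "HTTP://WWW.YOUR.COM/FUL"

# The prefixes used for lookup need not match the prefixes in the document:
# an alias is only a proxy for the namespace name.
ALIASES = {"a": ACCT_NS, "f": FUL_NS}

root = ET.fromstring(DOC)

# The parser expands each prefix into "{namespace name}local name".
first_customer_tag = root[0].tag

# The same local name STATUS resolves to different elements per namespace.
acct_status = root.find("a:CUSTOMER/a:STATUS", ALIASES).text
ful_status = root.find("f:CUSTOMER/f:STATUS", ALIASES).text

# Attribute names are qualified the same way.
acct_ref = root.find("a:CUSTOMER/a:ORDER", ALIASES).get(f"{{{ACCT_NS}}}REF")

print(first_customer_tag)
print(acct_status, ful_status, acct_ref)
```

Both CUSTOMER elements share the local names NAME, ORDER, and STATUS, yet each lookup returns only the element in the requested namespace.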
Parsing XML Documents
An XML parser is a software component that takes XML data as input and generates data representing the structure and/or content of that XML data. There are at least two types of XML parsers: a streaming-event parser and a DOM (Document Object Model) parser.
A DOM parser uses tree-traversal-based parsing to build an object tree in memory representing the XML document. The object tree is referred to herein as a DOM. The DOM allows complete, dynamic access to an entire XML document through an object-oriented API. Because the XML document is represented in memory as an object tree, DOM parsers preserve and allow dynamic access to the XML document structure and content.
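As an illustration of DOM-style access, the following minimal sketch assumes Python's standard xml.dom.minidom module as the DOM parser and reuses the ZIPCODE fragment from earlier in this section. It is one possible DOM implementation, not a statement about any particular parser.

```python
from xml.dom import minidom

FRAGMENT = "<ZIPCODE><CODE>95125</CODE> <CITY>SAN JOSE</CITY><STATE>CA</STATE></ZIPCODE>"

# The whole document is parsed into an in-memory object tree (the DOM).
dom = minidom.parseString(FRAGMENT)
root = dom.documentElement

# Any part of the document can be reached directly through the object API.
city = root.getElementsByTagName("CITY")[0]
city_text = city.firstChild.data

# From any tree node, the tree can be walked upward or sideways.
parent_name = city.parentNode.tagName
child_names = [
    node.tagName
    for node in root.childNodes
    if node.nodeType == node.ELEMENT_NODE  # skip whitespace text nodes
]

print(city_text, parent_name, child_names)
```

The entire tree stays in memory after parsing, which is what makes this kind of random access possible and also what makes DOM parsing costly for large documents.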
A streaming-event parser may also use tree-traversal-based parsing. However, rather than building an object tree, the streaming-event parser generates “parsing” events as it traverses an XML document. Example events include encountering the beginning of an element or the end of an element.
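The event stream produced by a streaming-event parser can be sketched as follows, assuming Python's standard xml.sax module as the parser. The class name EventRecorder is hypothetical; it simply records each begin-element and end-element event in the order encountered.

```python
import xml.sax

FRAGMENT = "<ZIPCODE><CODE>95125</CODE> <CITY>SAN JOSE</CITY><STATE>CA</STATE></ZIPCODE>"


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records parsing events as (kind, name)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called when the parser encounters the beginning of an element.
        self.events.append(("start", name))

    def endElement(self, name):
        # Called when the parser encounters the end of an element.
        self.events.append(("end", name))


recorder = EventRecorder()
xml.sax.parseString(FRAGMENT.encode("utf-8"), recorder)

for kind, name in recorder.events:
    print(kind, name)
```

No object tree is retained here; the only state kept is whatever the handler chooses to record.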
Large datasets are frequently packaged as large XML documents, and parsing such documents is time-consuming. Described herein are techniques for parsing large XML documents more quickly.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.