Extensible Markup Language (XML) document has become the data representation format of choice to exchange data between applications or user agents on the Internet. This document is usually wholly or partially read in and parsed by a computer program and stored in memory as a tree of hierarchical nodes with each node holding the data used by the application. This in-memory tree is called a Document Object Model (DOM). The parent-child and sibling relationship between the tree nodes in the XML document is preserved in the DOM. The position of a node in the tree may be as important as the data that the node holds. A node in this tree can be addressed and identified in two ways: by a unique identifier or by a path expression starting from a known node, typically the top root node of the tree. The unique identifier of a node is set by an external agent, e.g., by the application that created the original XML document or the DOM. The path expression does not depend on an external agent; it is a built-in characteristic of the tree-like data structure that by convention a computer program uses to navigate around the tree. An example of the path expression is the fully qualified name of the node, e.g., ‘/book/chapter[2]’ that points to the 2nd chapter of a book.
The addressing of a node in a tree-like data structure is important when the tree is being shared among user agents running in a distributed computing environment across a network like the Internet. Client-server use configuration is a common runtime topology where the data is centrally located on the server and with user agents running on client computers that communicate operations on the data to the server. The client computer usually has a full or a partial representation of the data locally cached, either in memory or in persistent storage. XML document and its derivatives such as the HyperText Markup Language (HTML) are commonly used to represent the data that the server sends to a client computer. An example of this topology is a web browser running on a client computer communicating with a web application running on a server. The web server sends a HTML document to the browser, and the browser parses it and creates a DOM tree in memory. This DOM tree contains not only the content but also style and formatting information that tell the browser how to render the document for display. There is a one-to-one mapping between what is being displayed in the browser and the region of sub-tree of the DOM tree. So when user types an input or click within the browser, he or she “interacts” with the nodes of the DOM tree, like updating the content of a node, or creating a new node, or deleting an existing node. In many cases, the browser user agent needs to communicate to the server of these user operations and/or changes to the DOM tree. In order for it to do so, the user agent needs to determine the address of a node whose content is changed and send the address to the server along with changed content. Typically, all nodes of interest have a marker assigned to them. Alternatively, the user agent can generate a fully qualified name, which is a simple form of path expression like XPath, to the node by starting from the node and walk across and up the tree to the root node.
The assignment of unique identifier to a node is done either statically by the designer of the tree data or dynamically as the tree is being created. An example of the former is HTML form found on a web page used to capture user inputs. An example of the latter is the collaborative editing of a XML document, where all editable nodes are dynamically assigned a unique identifier by the server, before it is sent to the client computers. These dynamically generated identifiers are usually not found in the original XML document. They are generated to facilitate the communication of operations on the tree data structure between the client user agents and the server, which has the master copy of the data.
Assigning a unique identifier to all nodes of interests in a tree data structure can be considered wasteful and undesirable under certain circumstances; for example, when the number of nodes is very large in tens and hundreds of thousands and only a very small fraction of the nodes are likely to change by the user agents. In XML, a special attribute name call “Id” is defined to hold this unique identifier, and the DOM API has a special method call getElementById( ) that retrieves the Element node whose attribute “Id” matches the one requested. A common practice in the implementation of DOM is to use a list or a hash map to hold the name-value pair of “Id” name and the referenced node, so the more nodes with unique identifier, the larger this list or hash map, which consume greater amount of memory resource. Additionally, there could be performance and data integrity issues when the user agents and the server have to coordinate and maintain all these unique identifiers in sync with each other.
Alternatively, one can use path expression to represent the address of a node and bypasses some of problems with using unique identifiers, however, this approach has its own problem because it is usually expressed as a single absolute path from the root node, and because if the underlying tree is changed from the time the path is generated to the time this path is applied on the tree to retrieve the node, then the retrieved node may not match to the one that is addressed. The reason is if the path contains relative array index, and if the array of child nodes that these relative indexes reference changes because a new child node is added or an existing child node is removed in the array, then the path becomes incorrect. For example, a simple path expression that addresses sentence #3 in paragraph #9 in chapter 2 of a book may look like this
a. /book/chapter[2]/paragraph[9]/sentence[3]
and it remains valid as long as the positions of the chapter, paragraph, and sentence reference by the relative indexes 2, 9 and 3, respectively, remain unchanged in the tree. However, this simple path expression becomes incorrect when the “book” undergoes structure changes like: a new chapter is added before chapter 2, or paragraph #8 is removed from chapter 2, or a new sentence is inserted as sentence #2 in paragraph #9 in chapter 2. This simple path expression becomes incorrect because it no longer is pointing to the original sentence #3, and when this path expression is applied to the changed tree, it will retrieve a sentence #3 that is incorrect comparing to what it was suppose retrieve before the tree change. This problem is more severe for highly nested multi-level trees, and the path expression becomes more sensitive to minor changes to the structure of the tree.
XPath and XPointer are open standard specifications that define the syntax and the path expression to navigate and select nodes or node-sets in an XML document. For a given node, there are many path expressions that can be used to represent the address of that node. These specifications do not define which one of them makes a good address for a node. In most cases, the fully qualified name of the node is used to address it. The specification also does not take into account the schema of the XML document or the use history and pattern of the XML document that can be used to produce optimal addressing for a given node.
Known methods deal with arbitrarily large SGML/HTML/XML documents, and formatting or subsetting them effectively without serial processing. This is a key enabler for dealing with non-trivial sized documents, regardless of medium. Other known systems function as batch formatters. In those batch formatters, seeing page N takes time at least proportional to N. Other widely used known techniques use two ways to identify a node: by its Element ID (EID), which is created for each node as it is being instantiated in memory; and by its fully qualified name.
Dom4j is an open source program that is used to navigate and query a data tree. One of the method on a Element node is getUniquePath( ), which returns a unique path from the root node to the Element node in question. This unique path is same as the fully qualified name and always starts from the root node.