This disclosure relates generally to information and data management in a data processing system and, more specifically, to manipulating a parse tree structure and efficiently performing node order comparisons within the parse tree in support of a tree order enforcing expression language.
Data formats such as extensible markup language (XML) or JavaScript® Object Notation (JSON) are typically syntactically parsed into a general tree data structure containing a logical node for each pertinent syntactic component of the data. Regardless of the data format, this parse tree data structure is referred to as a document object model (DOM). Each node of a DOM typically contains information about the syntactic component being represented, such as an XML element tag name or content value, as well as index or pointer values that bind the DOM node into the tree structure, including an indicator of the parent, preceding sibling and next sibling, a child list, and possibly a separate attribute list. The document order of a DOM corresponds to a visitation order of DOM nodes resulting from a depth first traversal of the DOM tree. A depth first traversal, also known as a pre-order traversal, is a traversal of a tree structure in which a node is deemed visited or processed before any of its child nodes are visited or processed.
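By way of illustration only, the node structure and document order described above can be sketched as follows. This is a minimal sketch, not an implementation of any particular DOM API; the names `Node`, `append`, `document_order` and `build_sample` are hypothetical, and the node carries only a name, a parent link and a child list rather than the full set of sibling and attribute links mentioned above.

```python
class Node:
    """Hypothetical DOM node: a name plus the links binding it into the tree."""

    def __init__(self, name):
        self.name = name
        self.parent = None      # indicator of the parent node
        self.children = []      # ordered child list

    def append(self, child):
        """Attach child as the last child of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def document_order(root):
    """Yield nodes in pre-order: each node is visited before its children."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is popped (visited) first.
        stack.extend(reversed(node.children))


def build_sample():
    """Build the assumed sample tree a(b(c, d), e)."""
    root = Node("a")
    b = root.append(Node("b"))
    b.append(Node("c"))
    b.append(Node("d"))
    root.append(Node("e"))
    return root
```

For the assumed sample tree, the traversal visits the nodes in the order a, b, c, d, e, which is the document order of that tree.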
Creating a DOM for data enables querying or mutation of the data using an application programming interface (API) to the DOM. A tree addressing scheme is used to indicate a particular node or a set of nodes in the DOM. For example, an expression in the XML path language, XPath (a query language for selecting nodes from an XML document), can be used to obtain an XML node or set of nodes, and a dotted JavaScript notation expression can be used to obtain a JSON object. Typically, during the execution of either expression and production of a respective result, the referenced nodes are navigated in an organized manner relative to the DOM structure. For example, each XPath location step produces a set of nodes, in document order, before proceeding to a next location step. A key factor in placing nodes in document order is a DOM node comparator that determines which of two given DOM nodes is earlier in document order.
Once a DOM node (or set of nodes) is obtained, both informational and structural mutations can then be performed using the API, including changing tag names or content values (informational mutation) or performing insert and delete operations on a DOM node or nodes (structural mutation, or structural manipulation of the parse tree data structure).
Given two distinct DOM tree nodes, node DX and node DY, the DOM tree traversal comparison method first traverses the parent links of node DX and node DY to find the closest common ancestor, referred to in this example as node A. If either node DX or node DY is itself node A, then that node is the earlier node in document order. Otherwise, the child nodes CX and CY of the closest common ancestor A are obtained, where node CX is the root node of the DOM sub-tree containing node DX, and node CY is the root node of the DOM sub-tree containing node DY. When node CX is earlier than node CY in the child list of node A, node DX is the earlier node in document order; otherwise node DY is the earlier node in document order.
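The comparison just described can be sketched as follows. This is a hedged illustration under stated assumptions: the names `Node`, `ancestor_path` and `earlier_in_document_order` are hypothetical, and for simplicity the sketch collects the full parent chain of each node up to the root instead of stopping at the closest common ancestor, so its cost is not identical to the path length discussed in this disclosure.

```python
class Node:
    """Hypothetical DOM node with a parent link and an ordered child list."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def ancestor_path(node):
    """Return [root, ..., node] by following parent links."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    path.reverse()
    return path


def earlier_in_document_order(dx, dy):
    """Return whichever of two distinct nodes of one tree is earlier."""
    px, py = ancestor_path(dx), ancestor_path(dy)
    # Advance while both paths agree; i stops at the first divergence.
    i = 0
    while i < len(px) and i < len(py) and px[i] is py[i]:
        i += 1
    if i == 0:
        raise ValueError("nodes do not share a root")
    if i == len(px):
        return dx            # dx is the closest common ancestor A
    if i == len(py):
        return dy            # dy is the closest common ancestor A
    a, cx, cy = px[i - 1], px[i], py[i]
    # Scan the child list of A; whichever sub-tree root appears first wins.
    for child in a.children:
        if child is cx:
            return dx
        if child is cy:
            return dy
    raise ValueError("child list is inconsistent with parent links")
```

The final loop is where a large child list of node A makes the comparison expensive, since the scan is linear in the number of children preceding node CX or node CY.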
An advantage of the DOM tree traversal comparison method is that the method places no encumbrance on insert and delete operations, which normally have an O(1) cost for structural manipulations of the DOM. However, a disadvantage of the method is that the comparison can require O(n) time, where n is the traversal length of the tree path (node DX, . . . node CX, . . . , node CY, . . . node DY) that excludes node A. The comparison operation most typically becomes expensive in a DOM when node A has a large number of children, in which case there can be an O(n) distance between node CX and node CY in the child list of node A.
To mitigate this disadvantage, a common practice is to use a node index method. The node index method performs a depth first search operation to associate a depth-first index (DFI) with each DOM node. The node index method has an O(N) cost, where N is the number of DOM nodes visited and indexed, but the advantage is that once the indexing operation is performed, all subsequent comparison operations have a very fast O(1) cost: the DFIs of node DX and node DY are compared, and the node with the lesser DFI is determined to be the earlier node. The disadvantage is that this efficiency lasts only until a next insert operation occurs, which alters the DOM structure. Since the newly inserted node or nodes do not have an associated DFI value, a common practice is to mark the whole node index map as stale and revert to using the DOM tree traversal comparison method.
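The node index method and its stale flag can be sketched as follows. This is a minimal sketch under stated assumptions: the names `Node`, `NodeIndex`, `reindex`, `insert` and `earlier` are hypothetical, the index is held in a dictionary keyed by node identity, and when the index is stale the comparison simply returns `None` to signal that the caller should fall back to the DOM tree traversal comparison method.

```python
class Node:
    """Hypothetical DOM node with a parent link and an ordered child list."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


class NodeIndex:
    """Maps each node to a depth-first index (DFI); stale after any insert."""

    def __init__(self, root):
        self.root = root
        self.dfi = {}
        self.stale = True        # nothing has been indexed yet

    def reindex(self):
        """O(N) depth first pass assigning increasing DFIs in document order."""
        self.dfi.clear()
        stack = [self.root]
        counter = 0
        while stack:
            node = stack.pop()
            self.dfi[node] = counter
            counter += 1
            stack.extend(reversed(node.children))
        self.stale = False

    def insert(self, parent, position, child):
        """O(1)-style structural insert; the new node has no DFI, so mark stale."""
        child.parent = parent
        parent.children.insert(position, child)
        self.stale = True

    def earlier(self, dx, dy):
        """O(1) comparison when fresh; None when stale (caller must fall back)."""
        if self.stale:
            return None
        return dx if self.dfi[dx] < self.dfi[dy] else dy
```

Calling `reindex` after every `insert` keeps `earlier` at O(1) but charges each insert an O(N) pass, whereas calling it once after a batch of inserts leaves comparisons within the batch on the slower fallback path.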
To mitigate this problem, a depth first search can be performed after each mutation sequence (for example, a script) to re-index the nodes, which clears the stale flag and restores the previously efficient node comparison operation. However, mutation scripts containing few structural mutations in relation to the number of informational mutations tend to run faster when re-indexing is performed after each insert operation. On the other hand, re-indexing after each insert operation places an O(N) worst case cost on each insert operation, so mutation scripts containing many structural manipulations of the parse tree data structure tend to run much slower due to re-indexing after each insert operation.
In a further proposed solution, a 2010 paper in the Journal of Information and Data Management, DeweyIDs—The Key to Fine-Grained Management of XML Documents, focuses on optimizing XML document storage and retrieval in XML databases. The DeweyID proposal is derived from the Dewey decimal system of organizing library books. The DeweyID is a single key comprised of a variable number of integer index values that help achieve efficient B*-tree operations relative to prior XML database systems. While appropriate for XML databases and B*-tree operations, the DeweyID proposal is not appropriate for implementing efficient node comparison and structural mutation of an in-memory DOM.