1. Field of Invention
This invention relates generally to data processing systems, and specifically to techniques for storing and retrieving data elements with reference to their place in a hierarchical structure.
2. Prior Art
In the field of computer systems engineering, a great deal of attention and effort is given to the definition and development of so-called “abstract data structures.” When dealing with a problem which requires manipulation of data, computer engineers typically take great care to select an abstract data structure most appropriate for the data to be handled, such that expenditure of computer resources is minimized in the execution of programs written to manage the given problem.
A number of classic abstract data structures have been defined which have proven useful and efficient when applied to a broad range of computing problems. Among these is the tree structure, which is useful for organizing information that is naturally expressible as a hierarchy of data nodes; in a tree structure, each individual data node is identified as having a place in a hierarchy such that at most one other node is known as its “parent” occupying a superior position in the hierarchy, and an unrestricted number of other nodes are identified as its “children,” each of which is placed at an inferior position.
Distinct from the tree structure are data structures generally known as arrays, lists and linked lists. These structures organize data elements serially, so that one element is understood to be placed in a position “before” or “after” another in a defined sequence. The sequential nature of such structures may be exploited for fast searching and sorting, and when placed on a persistent storage device may be composed formally as a database, or a set of databases known as a relational database management system. Databases and relational database systems have proved very successful in providing fast and efficient access to sequentially structured data.
A great deal of research in the field of computer engineering has sought to combine the descriptive utility of the tree data structure with the speed and efficiency of the search and manipulation algorithms associated with arrays, lists and database systems. Solutions found in the prior art have been only partially effective, and attempts to improve on traditional methods typically trade one form of inefficiency for another.
The simplest solution in the prior art is to incorporate hierarchical information into a sequential data structure by assigning a unique ID to each node in the sequential structure, then including in or associating with the node an identifiable value corresponding to the unique ID of the node's parent node. This is known in the art as the “adjacency list” approach. The advantages of this approach are the simplicity of the concept and of the algorithms for searching and manipulating nodes as they relate to the hierarchical structure. The chief disadvantage is that the algorithms typically used to search, sort and manipulate data elements within the structure require recursive loops in the logic, which are slow and resource-intensive. For example, to find all the descendants of a particular node (children, children's children, etc.), code must be executed to search the structure for the node's children, then the same code must be executed again for each child found to find its children, and so on, as many times as necessary until all the desired nodes are found. Such recursive algorithms often require impractical amounts of time to complete.
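The recursion described above can be illustrated with a minimal sketch. The sample tree, the name PARENT_OF and both function names are hypothetical and chosen only for illustration; an in-memory dictionary stands in for the sequential structure.

```python
# Hypothetical sample data: node ID -> parent ID (None marks the root).
PARENT_OF = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 4}


def children(node_id, parent_of):
    """Scan the whole structure for nodes whose parent is node_id."""
    return [n for n, p in parent_of.items() if p == node_id]


def descendants(node_id, parent_of):
    """Collect all descendants by repeating the child search for each child found."""
    found = []
    for child in children(node_id, parent_of):
        found.append(child)
        # The same search is executed again for every child: this is the recursion.
        found.extend(descendants(child, parent_of))
    return found


print(sorted(descendants(2, PARENT_OF)))  # [4, 5, 6]
```

Each call to descendants scans the full structure once, so the number of scans grows with the number of nodes in the subtree being searched.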
Another approach in the prior art, which attempts to eliminate recursive algorithms, associates with each node a record of all the node's ancestors (parent, parent's parent, etc.). These records are known in the art as “materialized lists” or “materialized paths.” This approach eliminates the need for recursion in most useful algorithms: for example, to find all of a given node's descendants, one simply does a single search of all materialized paths to find those that contain the given node's unique ID. The chief disadvantage of this approach is that, even if the node IDs themselves are pure numbers, when included in a materialized path they must be delimited so that the place where one ends and another begins can be detected. This is typically done by putting a non-numerical character between each pair of IDs, which means that on the most commonly used computer systems the path itself must be stored as a string data type rather than a numerical data type. Operations on string data types are much slower than those on numerical data types, so that although recursion is eliminated, search and manipulation algorithms can still take an unacceptably long time to complete.
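A minimal sketch of the materialized path approach follows. The sample data, the "/" delimiter and the names PATH_OF and descendants are assumptions made for illustration only. Nodes 12 and 13 are included to show why the delimiter is needed: a plain substring search for "2" would wrongly match the path "1/3/12".

```python
# Hypothetical sample data: node ID -> materialized path of ancestor IDs,
# root first, with "/" as the non-numeric delimiter (an assumed convention).
PATH_OF = {
    1: "",
    2: "1",
    3: "1",
    4: "1/2",
    5: "1/2",
    6: "1/2/4",
    12: "1/3",
    13: "1/3/12",
}


def descendants(node_id, path_of):
    """Single pass over all paths, no recursion; every test is a string operation."""
    target = str(node_id)
    return [n for n, path in path_of.items() if target in path.split("/")]


print(descendants(2, PATH_OF))   # [4, 5, 6]
print(descendants(12, PATH_OF))  # [13]
```

The search is a single scan, but each comparison splits and compares strings rather than integers.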
U.S. Pat. No. 6,480,857 (Chandler) describes a method of delimiting integer node IDs in a materialized path by storing each ID in a separate column in a relational database table. String character delimiters are thus eliminated and fast integer functions can be used to compose necessary algorithms. But this method makes inefficient use of memory allocated as storage space: every row in the table must contain a cell corresponding to each possible level's column, even though the great majority of cells will often be empty. This excess of cells wastes memory and will tend to slow down queries as they are parsed.
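The general idea of one column per hierarchy level can be sketched as follows. This is not the schema of the cited patent; the table name, the column names lvl1 through lvl4 and the fixed depth of four are assumptions made for illustration, using Python's standard sqlite3 module.

```python
import sqlite3

# Assumed layout: one integer column per possible level, four levels at most.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, "
    "lvl1 INTEGER, lvl2 INTEGER, lvl3 INTEGER, lvl4 INTEGER)"
)

# Each row lists the node's ancestors root-first; unused levels stay NULL.
rows = [
    (1, None, None, None, None),
    (2, 1, None, None, None),
    (3, 1, None, None, None),
    (4, 1, 2, None, None),
    (5, 1, 2, None, None),
    (6, 1, 2, 4, None),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)


def descendants(node_id):
    """Integer comparisons only: match node_id against every level column."""
    cur = conn.execute(
        "SELECT id FROM node WHERE ? IN (lvl1, lvl2, lvl3, lvl4) ORDER BY id",
        (node_id,),
    )
    return [r[0] for r in cur]


# Count the level cells that hold no value at all.
empty_cells = sum(value is None for row in rows for value in row[1:])

print(descendants(2))  # [4, 5, 6]
print(empty_cells)     # 15 of the 24 level cells are empty
```

Even in this six-node example most level cells are empty, and the fixed column count caps the depth of the hierarchy.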
U.S. Pat. No. 6,625,615 (Shi et al.) and U.S. Pat. No. 5,467,471 (Bader) each make use of the concept known in the art as “hierarchical genealogical tables.” This approach replaces the need for a large number of columns in a relational database table with the need for a large number of rows in a table separate from the original node data. This separate genealogical table requires a row entry for each ancestor relationship of each node in a hierarchy. For example, if a given node has four ancestors, four rows must be created for it in the genealogical table, each one linking the ID of the given node to the ID of one of its ancestors. This approach eliminates arbitrary limits on hierarchy depth and the need for string-type functions in algorithms. But the genealogical table grows much faster than the node table: each node contributes one row per ancestor, so the total number of rows equals the sum of the depths of all nodes. The large number of rows which accrue in the table for all but small hierarchies will result in large memory storage requirements and slow execution of database queries. The need for a second table separate from the main node data table increases the complexity of the system, and keeping the hierarchical information of the two tables in sync is an error-prone task. In addition, database engineers consider it desirable to minimize the number of write actions to a table necessary to complete a given task, because write actions are generally the slowest and most resource-intensive. Creating a new node with this approach typically requires multiple write actions to the genealogical table, and the inefficiency of node creation grows as the tree increases in depth.
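The row growth and write cost of a genealogical table can be seen in a minimal sketch. It does not reproduce the methods of the cited patents; the sample tree and the names PARENT_OF, build_genealogy, descendants and add_node are hypothetical, and a Python list of (node, ancestor) pairs stands in for the separate table.

```python
# Hypothetical sample data: node ID -> parent ID (None marks the root).
PARENT_OF = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 4}


def build_genealogy(parent_of):
    """Return one (node, ancestor) row for every ancestor relationship."""
    rows = []
    for node in parent_of:
        ancestor = parent_of[node]
        while ancestor is not None:
            rows.append((node, ancestor))
            ancestor = parent_of[ancestor]
    return rows


def descendants(node_id, genealogy):
    """Single scan with integer comparisons only."""
    return [node for node, ancestor in genealogy if ancestor == node_id]


def add_node(node_id, parent_id, parent_of, genealogy):
    """Add a node; return the number of rows written to the genealogical table."""
    parent_of[node_id] = parent_id
    # One row for the parent plus one row for each of the parent's ancestors.
    new_rows = [(node_id, parent_id)]
    new_rows += [(node_id, a) for n, a in genealogy if n == parent_id]
    genealogy.extend(new_rows)
    return len(new_rows)


genealogy = build_genealogy(PARENT_OF)
print(len(genealogy))                         # 9 rows for 6 nodes
print(add_node(7, 6, PARENT_OF, genealogy))   # 4 writes for one new node
print(descendants(2, genealogy))              # [4, 5, 6, 7]
```

The number of writes per new node equals its depth in the tree, and the node table and the genealogical table must both be updated consistently.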
Celko introduced the “nested sets” approach to modeling a hierarchy in a relational database. In this approach each node is assigned two parameters, a low-value integer and a high-value integer. Each child of a given node is assigned integers as parameters such that their values fall between the low value and the high value of the given node's parameters. Thus all the descendants of a given node may be found by searching for all nodes whose low-value integer is greater than the given node's and whose high-value integer is less than the given node's. This approach uses integer data types and does not require excessive rows or columns, and thus searches are fast. The chief disadvantage of this approach is that if a new node is added, the gaps between the high and low value integers of some nodes may have to be widened to make room for it. Depending on how the tree is structured, the parameters for a large number of existing nodes may have to be recalculated, and all the new parameter values must be written to the database table. The large number of write actions required for a typical edit of the hierarchy makes add and delete algorithms extremely slow. Thus the nested sets approach is only considered practical if the hierarchy structure is expected to change only rarely.
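Both the fast descendant search and the costly insertion of nested sets can be sketched briefly. The sample numbering and the names BOUNDS, descendants and add_child are hypothetical; the low and high values were assigned by a depth-first walk of a six-node tree in which node 1 is the root, nodes 2 and 3 are its children, nodes 4 and 5 are children of node 2, and node 6 is a child of node 4.

```python
# Hypothetical sample data: node ID -> (low, high), numbered depth-first.
BOUNDS = {1: (1, 12), 2: (2, 9), 4: (3, 6), 6: (4, 5), 5: (7, 8), 3: (10, 11)}


def descendants(node_id, bounds):
    """Nodes whose low is greater and whose high is less than the given node's."""
    low, high = bounds[node_id]
    return sorted(n for n, (lo, hi) in bounds.items() if lo > low and hi < high)


def add_child(new_id, parent_id, bounds):
    """Insert new_id as the last child of parent_id.

    Returns how many existing nodes had a parameter rewritten to open the gap.
    """
    _, parent_high = bounds[parent_id]
    rewritten = 0
    for n, (lo, hi) in list(bounds.items()):
        # Every value at or beyond the insertion point shifts up by two.
        new_lo = lo + 2 if lo >= parent_high else lo
        new_hi = hi + 2 if hi >= parent_high else hi
        if (new_lo, new_hi) != (lo, hi):
            bounds[n] = (new_lo, new_hi)
            rewritten += 1
    bounds[new_id] = (parent_high, parent_high + 1)
    return rewritten


print(descendants(2, BOUNDS))    # [4, 5, 6]
print(add_child(7, 6, BOUNDS))   # 6: every existing node is rewritten
print(descendants(2, BOUNDS))    # [4, 5, 6, 7]
```

In this example, adding a single leaf forces a write to all six existing nodes, which is the behavior that makes edits expensive.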
Tropashko introduced the concept of “nested intervals” as a means of overcoming the shortcomings of the nested sets approach. This approach makes use of rational numbers instead of integers as high and low value parameters. Since it is always theoretically possible to find, between the values of two rational numbers, two more rational numbers, it is in theory possible to add a node to a hierarchy described by nested intervals without recalculating or rewriting parameters, thus eliminating the chief drawback of the nested sets approach.
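The property that a node can be added without rewriting existing parameters can be sketched with exact rational arithmetic. The halving scheme below is an assumption made for illustration and is not one of Tropashko's encodings; the names child_interval, descendants and intervals are hypothetical, and Python's standard fractions module supplies the rational numbers.

```python
from fractions import Fraction


def child_interval(parent, k):
    """Interval for the k-th child (k = 1, 2, ...) of a node with interval parent.

    Assumed scheme: child k takes the slot between low + width / 2**(k + 1)
    and low + width / 2**k, so there is always room for one more child.
    """
    low, high = parent
    width = high - low
    return (low + width / 2 ** (k + 1), low + width / 2 ** k)


def descendants(node_id, intervals):
    """Same containment test as nested sets, applied to rational bounds."""
    low, high = intervals[node_id]
    return sorted(n for n, (lo, hi) in intervals.items() if lo > low and hi < high)


intervals = {1: (Fraction(0), Fraction(1))}
intervals[2] = child_interval(intervals[1], 1)
intervals[3] = child_interval(intervals[1], 2)
intervals[4] = child_interval(intervals[2], 1)
intervals[5] = child_interval(intervals[2], 2)
intervals[6] = child_interval(intervals[4], 1)

before = dict(intervals)
intervals[7] = child_interval(intervals[2], 3)  # add a third child of node 2

print(descendants(2, intervals))                       # [4, 5, 6, 7]
print(all(intervals[n] == before[n] for n in before))  # True: nothing rewritten
print(intervals[6][1] - intervals[6][0])               # 1/64
```

No existing interval changes when node 7 is added, but the interval widths shrink quickly: node 6, only three levels below the root, already has a width of 1/64.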
The chief challenge of Tropashko's approach is designing an algorithm which efficiently calculates the intervals between the high and low values to be associated with each node, since they must become progressively smaller as nodes are added to the tree, and the limited precision of numeric data types found on the most commonly used computer systems puts a strict bound on how small a stored value can be. Tropashko offers a number of algorithms for calculating the interval, all of which result in exponential decreases in interval size as nodes are added. Thus only a relatively small number of nodes can be described before the precision limits of the computer system are reached. For example, using the most efficient algorithm (“continued fractions”) to store a tree in which each node has ten children, a numeric data type of 32-bit precision could only hold parameters corresponding to six levels' worth of nodes before its capacity ran out.
An alternative algorithm offered by Tropashko (“Farey fractions”) is less efficient in terms of controlling shrinkage of interval size, but it stores the numerator and denominator of the rational numbers calculated rather than the numbers themselves. This allows more nodes to be stored within given precision limits, but consequently requires extensive calculations to reconstitute and compare the intervals as retrieval queries are processed, which slows query execution. Also, since the stored numerators and denominators only express the node intervals indirectly, the index functions available with the most commonly used database systems cannot be meaningfully used. Full table scans are thus typically necessary to execute searches, which further slows query execution. This alternative also requires that iterative algorithms be executed when calculating the parameters for new nodes. These algorithms run more slowly as the hierarchy grows in size, so the time required to add nodes may become unacceptable for large trees.