The present invention relates generally to the field of data structures, and more particularly to a hierarchical data organization using exception lists of indexes to provide enhanced node compression.
Computer processors and associated memory components continue to increase in speed. As hardware approaches physical speed limitations, however, other methods for generating appreciable decreases in data access times are required. Even when such limitations are not a factor, maximizing software efficiency maximizes the efficiency of the hardware platform, extending the capabilities of the hardware/software system as a whole. One method of increasing system efficiency is by providing effective data management, achieved by the appropriate choice of data structure and related storage and retrieval algorithms. For example, various prior art data structures and related storage and retrieval algorithms have been developed for data management including arrays, hashing, binary trees, AVL trees (height-balanced binary trees), b-trees, and skiplists. In each of these prior art data structures and related storage and retrieval algorithms an inherent trade-off has existed between providing faster access times and providing lower memory overhead. For example, an array allows for fast indexing through the calculation of the address of a single array element but requires the pre-allocation of the entire array in memory before a single value is stored, and unused intervals of the array waste memory resources. Alternatively, binary trees, AVL trees, b-trees and skiplists do not require the pre-allocation of memory for the data structure and attempt to minimize allocation of unused memory but exhibit an access time which increases as the population increases.
An array is a prior art data structure which has a simplified structure and allows for rapid access of the stored data. However, memory must be allocated for the entire array and the structure is inflexible. An array value is looked up “positionally”, or “digitally”, by multiplying the index by the size (e.g., number of bytes) allocated to each element of the array and adding the offset of the base address of the array. Typically, a single Central Processing Unit (CPU) cache line fill is required to access the array element and value stored therein. As described and typically implemented, the array is memory inefficient and relatively inflexible. Access, however, is provided as O(1), i.e., independent of the size of the array (ignoring disk swapping).
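The positional lookup described above may be illustrated by the following minimal sketch. The function name `element_address` and the example base address are hypothetical and are chosen only for illustration.

```python
def element_address(base_address: int, index: int, element_size: int) -> int:
    """Return the address of element `index` in a flat array.

    The index is multiplied by the per-element size and added to the base
    address, so the cost does not depend on the size of the array.
    """
    return base_address + index * element_size


# Hypothetical array of 8-byte elements starting at address 0x1000.
address = element_address(0x1000, 5, 8)
```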
Alternatively, other data structures previously mentioned, including binary trees, b-trees, skiplists and hash tables, are available which are more memory efficient but include undesirable features. For example, hashing is used to convert sparse, possibly multi-word indexes (such as strings) into array indexes. The typical hash table is a fixed-size array, and each index into it is the result of a hashing algorithm performed on the original index. However, in order for hashing to be efficient, the hash algorithm must be matched to the indexes which are to be stored. Hash tables also require every data node to contain a copy of (or a pointer to) the original index (key) so that nodes in each synonym chain (or other type of list) can be distinguished. Like an array, use of hashing requires some preallocation of memory, but it is normally a fraction of the memory which must be allocated for a flat array, provided the hash table is well designed, i.e., the characteristics of the data to be stored are well known, well behaved and matched to the hashing algorithm, collision resolution technique and storage structure implemented.
In particular, digital trees, or tries, provide rapid access to data, but are generally memory inefficient. Memory efficiency may be enhanced for handling sparse index sets by keeping tree branches narrow, resulting in a deeper tree and an increase in the average number of memory references, indirections, and cache line fills, all resulting in slower access to data. This latter factor, i.e., maximizing cache efficiency, is often ignored when such structures are discussed yet may be a dominant factor affecting system performance. A trie is a tree of smaller arrays, or branches, where each branch decodes one or more bits of the index. Prior art digital trees have branch nodes that are arrays of simple pointers or addresses. Typically, the size of the pointers or addresses is minimized to improve the memory efficiency of the digital tree.
At the “bottom” of the digital tree, the last branch decodes the last bits of the index, and the element points to some storage specific to the index. The “leaves” of the tree are these memory chunks for specific indexes, which have application-specific structures.
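By way of illustration only, a conventional digital tree of the kind described above may be sketched as follows. The sketch assumes 32-bit indexes decoded 8 bits per level through 256-way branches; these parameters, and all function names, are assumptions made for the example and are not taken from any cited disclosure.

```python
BITS_PER_LEVEL = 8            # assumed: each branch decodes one byte
LEVELS = 4                    # assumed: 32-bit indexes
FANOUT = 1 << BITS_PER_LEVEL  # 256 pointers per uncompressed branch


def digits(index):
    """Split an index into one digit per level, most significant first."""
    return [(index >> shift) & (FANOUT - 1)
            for shift in range(BITS_PER_LEVEL * (LEVELS - 1), -1, -BITS_PER_LEVEL)]


def new_branch():
    """A branch is an array of simple pointers; None is the null pointer."""
    return [None] * FANOUT


def insert(root, index, value):
    """Store `value` under `index`, allocating branches only as needed."""
    node = root
    *upper, last = digits(index)
    for d in upper:
        if node[d] is None:       # empty subexpanse: allocate on demand
            node[d] = new_branch()
        node = node[d]
    node[last] = value            # the last branch points at the leaf storage


def lookup(root, index):
    """Return the stored value, or None if the index is absent."""
    node = root
    *upper, last = digits(index)
    for d in upper:
        node = node[d]
        if node is None:          # null pointer: nothing stored below here
            return None
    return node[last]
```

In this sketch a stored value of `None` cannot be distinguished from an absent index, and every branch occupies 256 slots regardless of its population, which is the memory inefficiency noted above.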
Digital trees have many advantages including not requiring memory to be allocated to branches which have no indexes or zero population (also called an empty subexpanse). In this case the pointer which points to the empty subexpanse is given a unique value and is called a null pointer indicating that it does not represent a valid address value. Additionally, the indexes which are stored in a digital tree are accessible in sorted order which allows identification of neighbors. An “expanse” of a digital tree as used herein is the range of values which could be stored within the digital tree, while the population of the digital tree is the set of values that are actually stored within the tree. Similarly, the expanse of a branch of a digital tree is the range of indexes which could be stored within the branch, and the population of a branch is the number of values (e.g., count) which are actually stored within the branch. (As used herein, the term “population” refers to either the set of indexes or the count of those indexes, the meaning of the term being apparent to those skilled in the art from the context in which the term is used.)
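The distinction between expanse and population may be made concrete with the following sketch, which again assumes 32-bit indexes decoded 8 bits per level. The function names are hypothetical; a branch is identified here by the digits already decoded on the path leading to it.

```python
def branch_expanse(prefix_digits, bits_per_level=8, levels=4):
    """Inclusive (low, high) range of indexes a branch could hold.

    `prefix_digits` are the digits decoded above the branch; an empty
    list denotes the root, whose expanse is the whole index space.
    """
    remaining = (levels - len(prefix_digits)) * bits_per_level
    prefix = 0
    for d in prefix_digits:
        prefix = (prefix << bits_per_level) | d
    low = prefix << remaining
    high = low | ((1 << remaining) - 1)
    return low, high


def branch_population(stored_indexes, prefix_digits, **kwargs):
    """Count of stored indexes that fall within the branch's expanse."""
    low, high = branch_expanse(prefix_digits, **kwargs)
    return sum(1 for i in stored_indexes if low <= i <= high)
```

For example, a branch reached by decoding the digits 0x12 and 0x34 has an expanse of 65,536 possible indexes, while its population is only the number of those indexes actually stored.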
“Adaptive Algorithms for Cache-efficient Trie Search” by Acharya, Zhu and Shen (1999) describes cache-efficient algorithms for trie search. Each of the algorithms uses different data structures, including a partitioned-array, B-tree, hashtable, and vectors, to represent different nodes in a trie. The data structure selected depends on cache characteristics as well as the fanout of the node. The algorithms further adapt to changes in the fanout at a node by dynamically switching the data structure used to represent the node. Finally, the size and the layout of individual data structures are determined based on the size of the symbols in the alphabet as well as characteristics of the cache(s). The publication further includes an evaluation of the performance of the algorithms on real and simulated memory hierarchies.
Other publications known and available to those skilled in the art describing data structures include Fundamentals of Data Structures in Pascal, 4th Edition; Horowitz and Sahni; pp 582-594; The Art of Computer Programming, Volume 3; Knuth; pp 490-492; Algorithms in C; Sedgewick; pp 245-256, 265-271; “Fast Algorithms for Sorting and Searching Strings”; Bentley, Sedgewick; “Ternary Search Trees”; 5871926, INSPEC Abstract Number: C9805-6120-003; Dr Dobb's Journal; “Algorithms for Trie Compaction”, ACM Transactions on Database Systems, 9(2):243-63, 1984; “Routing on longest-matching prefixes”; 5217324, INSPEC Abstract Number: B9605-6150M-005, C9605-5640-006; “Some results on tries with adaptive branching”; 6845525, INSPEC Abstract Number: C2001-03-6120-024; “Fixed-bucket binary storage trees”; 01998027, INSPEC Abstract Number: C83009879; “DISCS and other related data structures”; 03730613, INSPEC Abstract Number: C90064501; and “Dynamical sources in information theory: a general analysis of trie structures”; 6841374, INSPEC Abstract Number: B2001-03-6110-014, C2001-03-6120-023.
An enhanced storage structure is described in U.S. patent application Ser. No. 09/457,164 filed Dec. 8, 1999, currently pending, entitled “A Fast Efficient Adaptive, Hybrid Tree,” (the '164 application) assigned in common with the instant application and hereby incorporated herein by reference in its entirety. The data structure and storage methods described therein provide a self-adapting structure which self-tunes and configures “expanse” based storage nodes to minimize storage requirements and provide efficient, scalable data storage, search and retrieval capabilities. The structure described therein, however, does not take full advantage of certain data distribution situations.
An enhancement to the storage structure described in the '164 application is detailed in U.S. Pat. No. 6,735,595, issued May 11, 2004, entitled “A Data Structure And Storage And Retrieval Method Supporting Ordinality Based Searching and Data Retrieval”, assigned in common with the instant application and hereby incorporated herein by reference in its entirety. This latter application describes a data structure and related data storage and retrieval method which rapidly provides a count of elements stored or referenced by a hierarchical structure of ordered elements (e.g., a tree), access to elements based on their ordinal value in the structure, and identification of the ordinality of elements. In an ordered tree implementation of the structure, a count of indexes present in each subtree is stored, i.e., the cardinality of each subtree is stored either at or associated with a higher level node pointing to that subtree or at or associated with the head node of the subtree. In addition to data structure specific requirements (e.g., creation of a new node, reassignment of pointers, balancing, etc.) data insertion and deletion includes steps of updating affected counts. Again, however, the structure fails to accommodate certain data distribution situations.
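The general idea of storing subtree counts to support ordinal operations may be sketched as follows. This is a simplified illustration under stated assumptions (16-bit indexes, two levels, counts held at the head node of each subtree, children kept in a dictionary); it is not the structure of the referenced patent, and the names `Node`, `rank` and `select` are hypothetical.

```python
LEVELS, BITS = 2, 8   # assumed: 16-bit indexes decoded one byte per level


class Node:
    def __init__(self):
        self.count = 0       # cardinality of the subtree headed by this node
        self.children = {}   # digit -> Node; empty below the last level


def _digits(index):
    return [(index >> (BITS * (LEVELS - 1 - lvl))) & ((1 << BITS) - 1)
            for lvl in range(LEVELS)]


def contains(root, index):
    node = root
    for d in _digits(index):
        node = node.children.get(d)
        if node is None:
            return False
    return True


def insert(root, index):
    """Insert an index, updating every affected count along the path."""
    if contains(root, index):
        return
    node = root
    node.count += 1
    for d in _digits(index):
        node = node.children.setdefault(d, Node())
        node.count += 1


def rank(root, index):
    """Ordinality: number of stored indexes strictly less than `index`."""
    total, node = 0, root
    for d in _digits(index):
        total += sum(c.count for k, c in node.children.items() if k < d)
        node = node.children.get(d)
        if node is None:
            break
    return total


def select(root, ordinal):
    """Return the stored index at 0-based ordinal position `ordinal`."""
    if not 0 <= ordinal < root.count:
        raise IndexError(ordinal)
    node, index = root, 0
    for _ in range(LEVELS):
        for d in sorted(node.children):
            child = node.children[d]
            if ordinal < child.count:   # target lies inside this subtree
                index = (index << BITS) | d
                node = child
                break
            ordinal -= child.count      # skip the whole subtree by its count
    return index
```

Because whole subtrees are skipped by their stored counts, neither operation visits the individual indexes held in the skipped subtrees.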
A further enhancement is described in related U.S. Pat. No. 6,654,760, entitled “SYSTEM AND METHOD OF PROVIDING A CACHE-EFFICIENT, HYBRID, COMPRESSED DIGITAL TREE WITH WIDE DYNAMIC RANGES AND SIMPLE INTERFACE REQUIRING NO CONFIGURATION OR TUNING”, the disclosure of which is hereby incorporated herein by reference. The application describes a system and data structure including a self-modifying data structure based on a digital tree (or “trie”) data structure which is stored in the memory, can be treated as a dynamic array, and is accessed through a root pointer. For an empty tree, this root pointer is null, otherwise it points to the first of a hierarchy of branch nodes of the digital tree. Low-fanout branches are avoided or replaced with alternative structures that are less wasteful of memory while retaining most or all of the performance advantages of a conventional digital tree structure, including index insertion, search, access and deletion performance. Thus, in addition to n-way branches implemented by arrays of n pointers (uncompressed branches), the disclosure describes linear branches for small populations wherein pointers to populated subexpanses are identified in a list arrangement (i.e., linear branches), and, for higher populations, a bit vector identifies populated subexpanses, pointers to the populated subexpanses following the bit vector (i.e., bitmap branches). Similar compression is provided for terminal nodes by providing linear and bitmap leaf structures.
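The two compressed branch forms described above may be illustrated by the following sketch, which assumes 256-way branches and digits supplied in ascending order. The class names and in-memory layout are assumptions made for the example and do not reproduce the layout of the referenced patent.

```python
class LinearBranch:
    """Small populations: a sorted list of populated digits with a
    parallel list of pointers to the corresponding subexpanses."""

    def __init__(self, digits, children):
        self.digits = list(digits)      # ascending, e.g. [3, 17, 200]
        self.children = list(children)  # children[i] belongs to digits[i]

    def child(self, digit):
        for d, c in zip(self.digits, self.children):
            if d == digit:
                return c
            if d > digit:               # sorted list: digit cannot follow
                break
        return None


class BitmapBranch:
    """Higher populations: a 256-bit vector marks populated subexpanses,
    followed by pointers for the populated subexpanses only."""

    def __init__(self, digits, children):
        self.bitmap = 0
        for d in digits:
            self.bitmap |= 1 << d
        self.children = list(children)  # ordered by ascending digit

    def child(self, digit):
        bit = 1 << digit
        if not self.bitmap & bit:
            return None                 # empty subexpanse, no pointer stored
        # The pointer's position equals the number of populated
        # subexpanses that precede `digit`.
        position = bin(self.bitmap & (bit - 1)).count("1")
        return self.children[position]
```

In both forms no storage is spent on null pointers: the linear branch pays one digit per populated subexpanse, while the bitmap branch pays a fixed 256 bits regardless of population.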
The improvements described in this last application reduce or eliminate memory otherwise wasted on null pointers prevalent in sparsely populated and/or wide/shallow digital trees. Further, additional processing time required to effectuate and accommodate the branch modification is minimal, particularly in comparison to processing advantages inherent in reducing the size of the structure so that data fetching from memory is more efficient, capturing more data and fewer null pointers in each CPU cache line fill. Opportunistic reconfiguration of nodes is used to automatically readjust for changing subexpanse population. However, the disclosure fails to address certain data or index distributions that adversely affect data structure storage requirements.
Accordingly, a need exists for techniques and tools to optimize performance characteristics of digital tree and similar structures.
The present invention is directed to an indexing scheme particularly applicable to data structures, such as digital trees, in which compression techniques are implemented to reduce storage requirements for completely filled and highly populated groups of indexes. These groups of indexes may be stored in a variety of data structures to support data access as required to store and retrieve data and to, for example, traverse a hierarchical data structure such as a digital tree. Thus, in the case of the latter, interior branch and terminal leaf nodes include indications of indexes (or portions of indexes) present in subsidiary nodes (branches) or present in the subject node (leaves). The invention addresses full and nearly full populations of indexes by providing respective designations of these conditions so as to avoid individually listing the larger number of valid indexes in favor of listing the smaller number of invalid or missing indexes. In the case of a small number of missing indexes, a “nearly full” designation is supplemented by a listing of the missing indexes preferably in an immediate listing within a branch or, if the list is too large, in an inverse linear leaf node. The invention further encompasses other means of compressing branch and leaf nodes that are particularly applicable to large expanses of indexes so as to minimize node storage requirements while taking into consideration additional processing requirements for node decompression.
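The principle of listing missing indexes rather than valid ones may be sketched as follows. This is an illustrative model only: the class name `InverseLeaf`, its fields, and the use of a sorted Python list are assumptions for the example and do not represent the node layouts of the invention. The sketch further assumes that every listed missing index lies within the expanse.

```python
from bisect import bisect_left


class InverseLeaf:
    """Leaf covering the expanse [low, high] that records only the
    indexes that are absent. An empty exception list denotes a full
    expanse; a short one denotes a nearly full expanse."""

    def __init__(self, low, high, missing=()):
        self.low, self.high = low, high
        self.missing = sorted(missing)  # assumed to lie within [low, high]

    def is_full(self):
        return not self.missing

    def contains(self, index):
        """An index is valid if it is in the expanse and not an exception."""
        if not self.low <= index <= self.high:
            return False
        i = bisect_left(self.missing, index)
        return not (i < len(self.missing) and self.missing[i] == index)

    def population(self):
        """Expanse size less the number of listed exceptions."""
        return (self.high - self.low + 1) - len(self.missing)
```

For a 256-index expanse with two indexes missing, such a leaf records two exceptions rather than 254 valid indexes, at the cost of an inverted membership test.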