The present invention relates generally to the field of data structures, and more particularly to a hierarchical data organization in which the structure of the data organization is dependent on the data stored, with components of the data structure compressed to match the data.
Computer processors and associated memory components continue to increase in speed. As hardware approaches physical speed limitations, however, other methods for generating appreciable decreases in data access times are required. Even when such limitations are not a factor, maximizing software efficiency maximizes the efficiency of the hardware platform, extending the capabilities of the hardware/software system as a whole. One method of increasing system efficiency is by providing effective data management, achieved by the appropriate choice of data structure and related storage and retrieval algorithms. For example, various prior art data structures and related storage and retrieval algorithms have been developed for data management including arrays, hashing, binary trees, AVL trees (height-balanced binary trees), b-trees, and skiplists. In each of these prior art data structures and related storage and retrieval algorithms an inherent trade-off has existed between providing faster access times and providing lower memory overhead. For example, an array allows for fast indexing through the calculation of the address of a single array element but requires the pre-allocation of the entire array in memory before a single value is stored, and unused intervals of the array waste memory resources. Alternatively, binary trees, AVL trees, b-trees and skiplists do not require the pre-allocation of memory for the data structure and attempt to minimize allocation of unused memory but exhibit an access time which increases as the population increases.
An array is a prior art data structure which has a simplified structure and allows for rapid access of the stored data. However, memory must be allocated for the entire array and the structure is inflexible. An array value is looked up “positionally”, or “digitally”, by multiplying the index by the size (e.g., number of bytes) allocated to each element of the array and adding the offset of the base address of the array. Typically, a single Central Processing Unit (CPU) cache line fill is required to access the array element and value stored therein. As described and typically implemented, the array is memory inefficient and relatively inflexible. Access, however, is provided as O(1), i.e., independent of the size of the array (ignoring disk swapping).
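By way of illustration only, the positional lookup described above may be sketched as follows. The function name and the example base address and element size are hypothetical and do not appear in the original disclosure.

```python
def element_address(base_address: int, index: int, element_size: int) -> int:
    """Return the byte address of array element `index`.

    The cost is one multiply and one add regardless of array length,
    which is why access is O(1).
    """
    return base_address + index * element_size


# Hypothetical array of 8-byte elements whose first element is at 0x1000.
address_of_sixth = element_address(0x1000, 5, 8)
```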
Alternatively, other data structures previously mentioned, including binary trees, b-trees, skiplists, linked lists and hash tables, are available which are more memory efficient but include undesirable features. For example, hashing is used to convert sparse, possibly multi-word indexes (such as strings) into array indexes. The typical hash table is a fixed-size array, and each index into it is the result of a hashing algorithm performed on the original index. However, in order for hashing to be efficient, the hash algorithm must be matched to the indexes which are to be stored. Hash tables also require every data node to contain a copy of (or a pointer to) the original index (key) so that nodes in each synonym chain (or other type of list) can be distinguished. Like an array, hashing requires some preallocation of memory, but a well-designed hash table normally needs only a fraction of the memory which must be allocated for a flat array. A hash table is well designed when the characteristics of the data to be stored are well known, well behaved, and matched to the hashing algorithm, collision resolution technique and storage structure implemented.
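The need to keep the original key in every node of a synonym chain may be illustrated by the following minimal sketch. The class name, the bucket count, and the use of Python's built-in `hash` are assumptions made for illustration and are not taken from any cited structure.

```python
class ChainedHashTable:
    """Fixed-size hash table with one synonym chain per bucket."""

    def __init__(self, num_buckets=64):
        # The bucket array is preallocated before any value is stored.
        self.buckets = [[] for _ in range(num_buckets)]

    def _slot(self, key):
        # The hashing algorithm converts the original key to an array index.
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        chain = self.buckets[self._slot(key)]
        for position, (stored_key, _) in enumerate(chain):
            if stored_key == key:
                chain[position] = (key, value)  # overwrite existing entry
                return
        # The key itself is stored so that synonyms can be told apart.
        chain.append((key, value))

    def lookup(self, key):
        for stored_key, value in self.buckets[self._slot(key)]:
            if stored_key == key:
                return value
        return None
```

With a single bucket every key collides, and lookups remain correct only because each node carries its own key.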
In particular, digital trees, or tries, provide rapid access to data, but are generally memory inefficient. Memory efficiency may be enhanced for handling sparse index sets by keeping tree branches narrow, resulting in a deeper tree and an increase in the average number of memory references, indirections, and cache line fills, all resulting in slower access to data. This latter factor, i.e., maximizing cache efficiency, is often ignored when such structures are discussed yet may be a dominant factor affecting system performance. A trie is a tree of smaller arrays, or branches, where each branch decodes one or more bits of the index. Most prior art digital trees have branch nodes that are arrays of simple pointers or addresses. Typically, the size of the pointers or addresses is minimized to improve the memory efficiency of the digital tree.
At the “bottom” of the digital tree, the last branch decodes the last bits of the index, and the element points to some storage specific to the index. The “leaves” of the tree are these memory chunks for specific indexes, which have application-specific structures.
Digital trees have many advantages including not requiring memory to be allocated to branches which have no indexes or zero population (also called an empty subexpanse). In this case the pointer which points to the empty subexpanse is given a unique value and is called a null pointer indicating that it does not represent a valid address value. Additionally, the indexes which are stored in a digital tree are accessible in sorted order which allows identification of neighbors. An “expanse” of a digital tree as used herein is the range of values which could be stored within the digital tree, while the population of the digital tree is the set of values that are actually stored within the tree. Similarly, the expanse of a branch of a digital tree is the range of indexes which could be stored within the branch, and the population of a branch is the number of values (e.g., count) which are actually stored within the branch. (As used herein, the term “population” refers to either the set of indexes or the count of those indexes, the meaning of the term being apparent to those skilled in the art from the context in which the term is used.)
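A conventional digital tree of the kind described above may be sketched as follows. The sketch assumes 32-bit indexes decoded 8 bits per level through 256-way branches of simple pointers; these widths and all names are illustrative assumptions only.

```python
BITS_PER_LEVEL = 8             # assumed: each branch decodes one byte
FANOUT = 1 << BITS_PER_LEVEL   # 256 pointers per branch
LEVELS = 4                     # assumed: 32-bit indexes


def digits(index):
    """Split an index into LEVELS digits, most significant first."""
    return [
        (index >> (BITS_PER_LEVEL * (LEVELS - 1 - level))) & (FANOUT - 1)
        for level in range(LEVELS)
    ]


class DigitalTree:
    def __init__(self):
        self.root = None  # null root pointer for an empty tree

    def insert(self, index, value):
        if self.root is None:
            self.root = [None] * FANOUT
        branch = self.root
        path = digits(index)
        for digit in path[:-1]:
            if branch[digit] is None:
                # Memory is allocated only for populated subexpanses.
                branch[digit] = [None] * FANOUT
            branch = branch[digit]
        # The last branch decodes the last bits and holds the leaf value.
        branch[path[-1]] = value

    def lookup(self, index):
        node = self.root
        for digit in digits(index):
            if node is None:
                return None  # reached a null pointer: index not stored
            node = node[digit]
        return node  # stored value, or None if the slot is empty
```

Storing a single index in this sketch allocates four 256-slot branches, nearly all of whose slots are null, which illustrates the memory inefficiency noted above.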
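The terms expanse and population, and the sorted-order property, may be made concrete with the following sketch. It assumes 16-bit indexes decoded in two 8-bit levels and branches represented as lists with `None` as the null pointer; the function names are hypothetical.

```python
BITS = 8
FANOUT = 1 << BITS


def insert(root, index, levels):
    """Store `index` in a tree of `levels` levels rooted at list `root`."""
    branch = root
    for level in range(levels):
        digit = (index >> (BITS * (levels - 1 - level))) & (FANOUT - 1)
        if level == levels - 1:
            branch[digit] = True
        else:
            if branch[digit] is None:
                branch[digit] = [None] * FANOUT
            branch = branch[digit]


def expanse(prefix, level, levels):
    """Inclusive (low, high) range of indexes a branch at `level` could hold.

    `prefix` is the integer formed by the digits decoded above the branch.
    """
    remaining_bits = BITS * (levels - level)
    low = prefix << remaining_bits
    return low, low + (1 << remaining_bits) - 1


def population(branch, level, levels):
    """Count of indexes actually stored at or below `branch`."""
    if branch is None:
        return 0  # empty subexpanse
    if level == levels - 1:
        return sum(1 for slot in branch if slot is not None)
    return sum(population(child, level + 1, levels) for child in branch)


def sorted_indexes(branch, level, levels, prefix=0):
    """Yield stored indexes in ascending order by walking slots left to right."""
    if branch is None:
        return
    for digit, slot in enumerate(branch):
        if slot is None:
            continue
        index = (prefix << BITS) | digit
        if level == levels - 1:
            yield index
        else:
            yield from sorted_indexes(slot, level + 1, levels, index)
```

In this sketch the branch reached through digit 0x12 has an expanse of 0x1200 through 0x12FF regardless of how many indexes it holds, while its population changes with every insertion.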
“Adaptive Algorithms for Cache-Efficient Trie Search” by Acharya, Zhu and Shen (1999), the disclosure of which is hereby incorporated herein by reference, describes cache-efficient algorithms for trie search. Each of the algorithms uses different data structures, including a partitioned-array, B-tree, hashtable, and vectors, to represent different nodes in a trie. The data structure selected depends on cache characteristics as well as the fanout of the node. The algorithms further adapt to changes in the fanout at a node by dynamically switching the data structure used to represent the node. Finally, the size and the layout of individual data structures are determined based on the size of the symbols in the alphabet as well as characteristics of the cache(s). The publication further includes an evaluation of the performance of the algorithms on real and simulated memory hierarchies.
Other publications known and available to those skilled in the art describing data structures include Fundamentals of Data Structures in Pascal, 4th Edition; Horowitz and Sahni; pp 582-594; The Art of Computer Programming, Volume 3; Knuth; pp 490-492; Algorithms in C; Sedgewick; pp 245-256, 265-271; “Fast Algorithms for Sorting and Searching Strings”; Bentley, Sedgewick; “Ternary Search Trees”; 5871926, INSPEC Abstract Number: C9805-6120-003; Dr Dobb's Journal; “Algorithms for Trie Compaction”, ACM Transactions on Database Systems, 9(2):243-63, 1984; “Routing on longest-matching prefixes”; 5217324, INSPEC Abstract Number: B9605-6150M-005, C9605-5640-006; “Some results on tries with adaptive branching”; 6845525, INSPEC Abstract Number: C2001-03-6120-024; “Fixed-bucket binary storage trees”; 01998027, INSPEC Abstract Number: C83009879; “DISCS and other related data structures”; 03730613, INSPEC Abstract Number: C90064501; and “Dynamical sources in information theory: a general analysis of trie structures”; 6841374, INSPEC Abstract Number: B2001-03-6110-014, C2001-03-6120-023, the disclosures of which are hereby incorporated herein by reference.
An enhanced storage structure is described in U.S. patent application Ser. No. 09/457,164 filed Dec. 8, 1999, entitled “A FAST EFFICIENT ADAPTIVE, HYBRID TREE,” (the '164 application) assigned in common with the instant application and incorporated herein by reference in its entirety. The data structure and storage methods described therein provide a self-adapting structure which self-tunes and configures “expanse” based storage nodes to minimize storage requirements and provide efficient, scalable data storage, search and retrieval capabilities. The structure described therein, however, does not take full advantage of certain sparse data situations.
An enhancement to the storage structure described in the '164 application is detailed in U.S. patent application Ser. No. 09/725,373, filed Nov. 29, 2000, entitled “A DATA STRUCTURE AND STORAGE AND RETRIEVAL METHOD SUPPORTING ORDINALITY BASED SEARCHING AND DATA RETRIEVAL”, assigned in common with the instant application and incorporated herein by reference in its entirety. This latter application describes a data structure and related data storage and retrieval method which rapidly provides a count of elements stored or referenced by a hierarchical structure of ordered elements (e.g., a tree), access to elements based on their ordinal value in the structure, and identification of the ordinality of elements. In an ordered tree implementation of the structure, a count of indexes present in each subtree is stored, i.e., the cardinality of each subtree is stored either at or associated with a higher level node pointing to that subtree or at or associated with the head node of the subtree. In addition to data structure specific requirements (e.g., creation of a new node, reassignment of pointers, balancing, etc.) data insertion and deletion includes steps of updating affected counts. Again, however, the structure fails to take full advantage of certain sparse data situations.
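The use of stored subtree counts for ordinal access may be illustrated by the following sketch, which is not the implementation of the referenced application. It assumes 8-bit indexes decoded 4 bits per level and keeps each count at the head node of its subtree, one of the two placements mentioned above; all names are hypothetical.

```python
BITS = 4
FANOUT = 1 << BITS
LEVELS = 2  # assumed: 8-bit indexes


class CountedBranch:
    def __init__(self):
        self.count = 0                 # cardinality of this subtree
        self.slots = [None] * FANOUT   # child branch, or True at last level


def _digit(index, level):
    return (index >> (BITS * (LEVELS - 1 - level))) & (FANOUT - 1)


def contains(root, index):
    node = root
    for level in range(LEVELS):
        node = node.slots[_digit(index, level)]
        if node is None:
            return False
    return True


def insert(root, index):
    if contains(root, index):
        return  # duplicates must not inflate the counts
    node = root
    for level in range(LEVELS):
        node.count += 1  # update every affected count on the path
        digit = _digit(index, level)
        if level == LEVELS - 1:
            node.slots[digit] = True
        else:
            if node.slots[digit] is None:
                node.slots[digit] = CountedBranch()
            node = node.slots[digit]


def select(root, ordinal):
    """Return the stored index at zero-based position `ordinal`."""
    if not 0 <= ordinal < root.count:
        raise IndexError(ordinal)
    node, index = root, 0
    for level in range(LEVELS):
        for digit, slot in enumerate(node.slots):
            if slot is None:
                continue
            size = 1 if level == LEVELS - 1 else slot.count
            if ordinal < size:
                index = (index << BITS) | digit
                node = slot
                break
            ordinal -= size  # skip the whole subtree using its count
    return index


def rank(root, index):
    """Return how many stored indexes are smaller than `index`."""
    node, below = root, 0
    for level in range(LEVELS):
        digit = _digit(index, level)
        for slot in node.slots[:digit]:
            if slot is not None:
                below += 1 if level == LEVELS - 1 else slot.count
        node = node.slots[digit]
        if node is None:
            break
    return below
```

Because whole subtrees are skipped by adding or subtracting their counts, neither `select` nor `rank` visits more than one branch per level.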
Accordingly, a need exists for techniques and tools to optimize performance characteristics of digital tree and similar structures.
A system and data structure according to the present invention include a self-modifying data structure based on a digital tree (or “trie”) data structure which is stored in the memory, can be treated as a dynamic array, and is accessed through a root pointer. For an empty tree, this root pointer is null, otherwise it points to the first of a hierarchy of branch nodes of the digital tree. Low-fanout branches are avoided or replaced with alternative structures that are less wasteful of memory while retaining most or all of the performance advantages of a conventional digital tree structure, including index insertion, search, access and deletion performance. This improvement reduces or eliminates memory otherwise wasted on null pointers prevalent in sparsely populated and/or wide/shallow digital trees. Additional processing time required to effectuate and accommodate the branch modification is minimal, particularly in comparison to processing advantages inherent in reducing the size of the structure so that data fetching from memory is more efficient, capturing more data and fewer null pointers in each CPU cache line fill. The invention includes linear and bitmap branches and leaves implemented, for example, using a rich pointer structure. Opportunistic reconfiguration of nodes automatically readjusts for changing subexpanse population.
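One possible reading of linear and bitmap branches, and of reconfiguration as population changes, is sketched below for illustration only. The class layouts, the 256-way digit range, the threshold `MAX_LINEAR`, and the helper `set_child` are all assumptions and are not taken from the disclosure, which does not specify them in this section.

```python
import bisect

MAX_LINEAR = 7  # hypothetical threshold for leaving the linear form


class LinearBranch:
    """Low-fanout branch: sorted digits with a parallel list of children."""

    def __init__(self):
        self.digits = []
        self.children = []

    def get(self, digit):
        pos = bisect.bisect_left(self.digits, digit)
        if pos < len(self.digits) and self.digits[pos] == digit:
            return self.children[pos]
        return None

    def set(self, digit, child):
        pos = bisect.bisect_left(self.digits, digit)
        if pos < len(self.digits) and self.digits[pos] == digit:
            self.children[pos] = child
        else:
            self.digits.insert(pos, digit)
            self.children.insert(pos, child)


class BitmapBranch:
    """One bit per possible digit; children are packed with no null slots."""

    def __init__(self):
        self.bitmap = 0
        self.children = []

    def _position(self, digit):
        # Number of set bits below `digit` gives the packed position.
        return bin(self.bitmap & ((1 << digit) - 1)).count("1")

    def get(self, digit):
        if not (self.bitmap >> digit) & 1:
            return None
        return self.children[self._position(digit)]

    def set(self, digit, child):
        pos = self._position(digit)
        if (self.bitmap >> digit) & 1:
            self.children[pos] = child
        else:
            self.bitmap |= 1 << digit
            self.children.insert(pos, child)


def set_child(branch, digit, child):
    """Store `child` and return the branch, converted if it outgrew its form."""
    branch.set(digit, child)
    if isinstance(branch, LinearBranch) and len(branch.digits) > MAX_LINEAR:
        bitmap = BitmapBranch()
        for d, c in zip(branch.digits, branch.children):
            bitmap.set(d, c)
        return bitmap
    return branch
```

In both forms of this sketch, storage grows with the number of populated digits rather than with the width of the branch, so no memory is spent on null pointers.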