The present invention relates generally to the field of data structures, and more particularly to a dynamic data structure which dynamically adapts storage allocated to describe a population to the size of the population.
Computer processors and associated memory components continue to increase in speed. As hardware approaches physical speed limitations, however, other methods for generating appreciable decreases in data access times are required. Even when such limitations are not a factor, maximizing software efficiency maximizes the efficiency of the hardware platform, extending the capabilities of the hardware/software system as a whole. One method of increasing system efficiency is by providing effective data management, achieved by the appropriate choice of data structure and related storage and retrieval algorithms. For example, various prior art data structures and related storage and retrieval algorithms have been developed for data management including arrays, hashing, binary trees, AVL trees (height-balanced binary trees), b-trees, and skiplists. In each of these prior art data structures and related storage and retrieval algorithms an inherent trade-off has existed between providing faster access times and providing lower memory overhead. For example, an array allows for fast indexing through the calculation of the address of a single array element but requires the pre-allocation of the entire array in memory before a single value is stored, and unused intervals of the array waste memory resources. Alternatively, binary trees, AVL trees, b-trees and skiplists do not require the pre-allocation of memory for the data structure and attempt to minimize allocation of unused memory but exhibit an access time which increases as the population increases.
An array is a prior art data structure which has a simplified structure and allows for rapid access of the stored data. However, memory must be allocated for the entire array and the structure is inflexible. An array value is looked up "positionally", or "digitally", by multiplying the index by the size (e.g., number of bytes) allocated to each element of the array and adding the offset of the base address of the array. Typically, a single Central Processing Unit (CPU) cache line fill is required to access the array element and value stored therein. As described and typically implemented, the array is memory inefficient and relatively inflexible. Access, however, is provided as O(1), i.e., independent of the size of the array (ignoring disk swapping).
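The positional lookup described above can be illustrated with a minimal sketch in C. The helper names `element_address` and `lookup_u32` are hypothetical and are introduced here only for illustration; the sketch assumes a flat array of 32-bit elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: address of element `index` in an array that starts at
 * `base` and whose elements occupy `elem_size` bytes each.
 * address = base + index * elem_size */
static const void *element_address(const void *base, size_t elem_size, size_t index)
{
    return (const unsigned char *)base + index * elem_size;
}

/* Read one 32-bit element positionally: a single address calculation,
 * independent of how many elements the array holds. */
static uint32_t lookup_u32(const uint32_t *array, size_t length, size_t index)
{
    assert(index < length); /* the whole array must already be allocated */
    return *(const uint32_t *)element_address(array, sizeof array[0], index);
}
```

The cost of the lookup does not depend on `length`, which is the O(1) behavior noted above; the price is that all `length` elements exist in memory whether or not they hold useful values.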
Alternatively, other data structures previously mentioned, including binary trees, b-trees, skiplists and hash tables, are available which are more memory efficient but include undesirable features. For example, hashing is used to convert sparse, possibly multi-word indexes (such as strings) into array indexes. The typical hash table is a fixed-size array, and each index into it is the result of a hashing algorithm performed on the original index. However, in order for hashing to be efficient, the hash algorithm must be matched to the indexes which are to be stored. Hash tables also require every data node to contain a copy of (or a pointer to) the original index (key) so that nodes in each synonym chain (or other type of list) can be distinguished. Like an array, use of hashing requires some preallocation of memory, but it is normally a fraction of the memory which must be allocated for a flat array, if well designed, i.e., the characteristics of the data to be stored are well known, well behaved and matched to the hashing algorithm, collision resolution technique and storage structure implemented.
In particular, digital trees, or tries, provide rapid access to data, but are generally memory inefficient. Memory efficiency may be enhanced for handling sparse index sets by keeping tree branches narrow, resulting in a deeper tree and an increase in the average number of memory references, indirections, and cache line fills, all resulting in slower access to data. This latter factor, i.e., maximizing cache efficiency, is often ignored when such structures are discussed yet may be a dominant factor affecting system performance. A trie is a tree of smaller arrays, or branches, where each branch decodes one or more bits of the index. Prior art digital trees have branch nodes that are arrays of simple pointers or addresses. Typically, the size of the pointers or addresses is minimized to improve the memory efficiency of the digital tree.
At the "bottom" of the digital tree, the last branch decodes the last bits of the index, and the element points to some storage specific to the index. The "leaves" of the tree are these memory chunks for specific indexes, which have application-specific structures.
Digital trees have many advantages including not requiring memory to be allocated to branches which have no indexes or zero population (also called an empty subexpanse). In this case the pointer which points to the empty subexpanse is given a unique value and is called a null pointer indicating that it represents an empty range of indexes. Additionally, the indexes which are stored in a digital tree are accessible in sorted order which allows identification of neighbors. An "expanse" of a digital tree as used herein is the range of values which could be stored within the digital tree, while the population of the digital tree is the set of values that are actually stored within the tree. Similarly, the expanse of a branch of a digital tree is the range of indexes which could be stored within the branch, and the population of a branch is the number of values (e.g., count) which are actually stored within the branch. (As used herein, the term "population" refers to either the set of indexes or the count of those indexes, the meaning of the term being apparent to those skilled in the art from the context in which the term is used.)
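By way of illustration only, a prior art style digital tree may be sketched in C as follows. The sketch assumes an 8-bit index decoded 4 bits per level, so that each branch has 16 slots; the type and function names (`Branch`, `Leaf`, `trie_insert`, `trie_contains`, `trie_free`) are hypothetical. A null pointer in the top branch marks an empty subexpanse, for which no memory is allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

enum { BITS_PER_LEVEL = 4, FANOUT = 1 << BITS_PER_LEVEL };

/* Bottom branch: decodes the last 4 bits of the index. */
typedef struct { bool present[FANOUT]; } Leaf;

/* Top branch: decodes the first 4 bits. A NULL child is an empty subexpanse. */
typedef struct { Leaf *child[FANOUT]; } Branch;

/* Store an index, allocating the bottom branch only when first needed. */
static bool trie_insert(Branch *root, uint8_t index)
{
    unsigned hi = index >> BITS_PER_LEVEL;
    unsigned lo = index & (FANOUT - 1);
    if (root->child[hi] == NULL) {
        root->child[hi] = calloc(1, sizeof(Leaf));
        if (root->child[hi] == NULL)
            return false; /* allocation failed; tree unchanged */
    }
    root->child[hi]->present[lo] = true;
    return true;
}

/* Look up an index: one array access per level, no key comparisons. */
static bool trie_contains(const Branch *root, uint8_t index)
{
    const Leaf *leaf = root->child[index >> BITS_PER_LEVEL];
    return leaf != NULL && leaf->present[index & (FANOUT - 1)];
}

/* Release every allocated bottom branch. */
static void trie_free(Branch *root)
{
    for (int i = 0; i < FANOUT; i++) {
        free(root->child[i]);
        root->child[i] = NULL;
    }
}
```

In this sketch the expanse of the top branch is the 256 possible index values, the expanse of each bottom branch is 16 values, and the population is however many `present` flags are set.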
"Adaptive Algorithms for Cache-Efficient Tree Search" by Acharya, Zhu and Shen (1999), the disclosure of which is hereby incorporated herein by reference, describes cache-efficient algorithms for tree search. Each of the algorithms uses different data structures, including a partitioned-array, B-tree, hashtable, and vectors, to represent different nodes in a trie. The data structure selected depends on cache characteristics as well as the fanout of the node. The algorithms further adapt to changes in the fanout at a node by dynamically switching the data structure used to represent the node. Finally, the size and the layout of individual data structures are determined based on the size of the symbols in the alphabet as well as characteristics of the cache(s). The publication further includes an evaluation of the performance of the algorithms on real and simulated memory hierarchies.
Other publications known and available to those skilled in the art describing data structures include Fundamentals of Data Structures in Pascal, 4th Edition; Horowitz and Sahni; pp. 582-594; The Art of Computer Programming, Volume 3; Knuth; pp. 490-492; Algorithms in C; Sedgewick; pp. 245-256, 265-271; "Fast Algorithms for Sorting and Searching Strings"; Bentley, Sedgewick; "Ternary Search Trees"; 5871926, INSPEC Abstract Number: C9805-6120-003; Dr. Dobb's Journal; "Algorithms for Trie Compaction", ACM Transactions on Database Systems, 9(2):243-63, 1984; "Routing on longest-matching prefixes"; 5217324, INSPEC Abstract Number: B9605-6150M-005, C9605-5640-006; "Some results on tries with adaptive branching"; 6845525, INSPEC Abstract Number: C2001-03-6120-024; "Fixed-bucket binary storage trees"; 01998027, INSPEC Abstract Number: C83009879; "DISCS and other related data structures"; 03730613, INSPEC Abstract Number: C90064501; and "Dynamical sources in information theory: a general analysis of trie structures"; 6841374, INSPEC Abstract Number: B2001-03-6110-014, C2001-03-6120-023, the disclosures of which are hereby incorporated herein by reference.
An enhanced storage structure is described in U.S. patent application Ser. No. 09/457,164 filed Dec. 8, 1999, currently pending, entitled "A FAST EFFICIENT ADAPTIVE, HYBRID TREE," (the '164 application) assigned in common with the instant application and hereby incorporated herein by reference in its entirety. The data structure and storage methods described therein provide a self-adapting structure which self-tunes and configures "expanse" based storage nodes to minimize storage requirements and provide efficient, scalable data storage, search and retrieval capabilities.
An enhancement to the storage structure described in the '164 application is detailed in U.S. Pat. No. 6,735,595, filed Nov. 29, 2000, issued May 11, 2004, entitled "A DATA STRUCTURE AND STORAGE AND RETRIEVAL METHOD SUPPORTING ORDINALITY BASED SEARCHING AND DATA RETRIEVAL", assigned in common with the instant application and hereby incorporated herein by reference. This latter application describes a data structure and related data storage and retrieval method which rapidly provides a count of elements stored or referenced by a hierarchical structure of ordered elements (e.g., a tree), access to elements based on their ordinal value in the structure, and identification of the ordinality of elements. In an ordered tree implementation of the structure, a count of indexes present in each subtree is stored, i.e., the cardinality of each subtree is stored either at or associated with a higher level node pointing to that subtree or at or associated with the head node of the subtree. In addition to data structure specific requirements (e.g., creation of a new node, reassignment of pointers, balancing, etc.) data insertion and deletion includes steps of updating affected counts.
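The use of stored subtree counts for ordinal access may be sketched as follows. This is a simplified illustration in C and is not taken from the referenced patent; the `CountedBranch` type, the fanout of 4, and the function names are assumptions made for the example. Each branch keeps the population of every subexpanse, so the child holding the element of a given ordinal is found without visiting the subtrees that precede it.

```c
#include <assert.h>
#include <stddef.h>

enum { FANOUT = 4 };

/* Hypothetical branch that stores the population of each subexpanse. */
typedef struct {
    size_t count[FANOUT];
} CountedBranch;

/* Total population beneath this branch: the sum of the stored counts. */
static size_t branch_population(const CountedBranch *b)
{
    size_t total = 0;
    for (int i = 0; i < FANOUT; i++)
        total += b->count[i];
    return total;
}

/* Find the child slot holding the element with the given 0-based ordinal.
 * On success, *rest receives the ordinal relative to that child.
 * Returns -1 when the ordinal is not less than the population. */
static int child_for_ordinal(const CountedBranch *b, size_t ordinal, size_t *rest)
{
    assert(rest != NULL);
    for (int i = 0; i < FANOUT; i++) {
        if (ordinal < b->count[i]) {
            *rest = ordinal;
            return i;
        }
        ordinal -= b->count[i]; /* skip this whole subexpanse */
    }
    return -1;
}
```

Insertion and deletion in such a scheme would also adjust the affected `count` entries along the path from the root, which is the count-updating step mentioned above.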
While digital trees provide an "expanse" based storage of information, other structures are also used to store data including, for example, b-trees, AVL trees, and binary trees that use a divide-by-population storage scheme (referred to as a binary storage tree) in which keys are compared with whole key values stored in each node. In these, and other storage structures, dynamic manipulation of the structure (including insertion and deletion of indexes and rebalancing operations) is highly dependent upon pointer structures, i.e., a special type of variable that holds a memory address (that is, it points to a memory location). However, while pointers provide for efficient "traversal" of dynamic data structures, each "redirection" to another portion of the structure often entails a memory access operation to retrieve the "pointed to" node. That is, when data locality is not maintained, traversal of the data structure suffers by requiring the completion of relatively slow memory access operations before processing can continue.
Accordingly, a need exists for techniques and tools to optimize performance characteristics of a data structure to more effectively utilize pointer objects and constructs.
The invention is directed to a dynamic pointer construct which dynamically allocates memory storing data about a referenced structure commensurate with the size of the population. When the population is null or small, information about the population, such as a count of the population, is hidden within unused portions of an otherwise conventional pointer. As the population grows, an auxiliary data structure is spawned and may be inserted between the pointer and the referenced structure. Thus, this overhead information describing a target object is provided using unused bits of a pointer until such time as the population referenced grows to a point where additional memory may be allocated for this and additional information, the additional memory overhead being amortized over the larger population.
While the invention is applicable to a wide variety of objects that might be referenced by a pointer, including, for example, arrays, parameter lists, executables, etc., it is particularly applicable to structures having a large number of pointers, such as trees. Thus, the invention may be incorporated into a data structure including a root pointer addressing or "pointing to" a tree including a plurality of nodes comprising branches and/or leaves. The nodes are preferably arranged in a hierarchical structure so that each interior or branch node forms the root of a subtree, pointing to one or more subsidiary nodes of the tree using a pointer construct. Preferably, each node is some minimum size which is some whole multiple of a minimum addressable unit supported by a pointer (e.g., word addressability) and consistently aligned in memory (i.e., each node begins at an address having the same value of its least significant bits), such that some number of least significant bits of the parent pointer are always unused for addressing purposes. These unused bits are used instead to store information about the target object of the pointer, i.e., the pointed-to node in the case of small populations, such that these low order bits comprise an auxiliary data field. Since a pointer is typically a single word in size, the auxiliary data field is accessed by appropriate masking of the pointer word, while conventional pointer operations are supported by "masking out" the auxiliary data field bits to bring the pointer back into proper node pointing alignment.
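The masking operations described above may be sketched in C as follows. This is a minimal illustration under stated assumptions and not a definitive implementation: nodes are assumed to be aligned on 8-byte boundaries so that the three least significant bits of a node address are always zero, and the names `TaggedPtr`, `AUX_BITS`, `tagged_make`, `tagged_aux` and `tagged_node` are hypothetical. The round trip between a pointer and `uintptr_t` is implementation-defined in C, although it behaves as shown on common platforms.

```c
#include <assert.h>
#include <stdint.h>

enum { AUX_BITS = 3 }; /* assumption: nodes are 8-byte aligned */
#define AUX_MASK ((uintptr_t)((1u << AUX_BITS) - 1))

/* Hypothetical node type, forced to 8-byte alignment for the example. */
typedef struct { _Alignas(8) uint64_t words[2]; } Node;

/* One word holding both the node address and the auxiliary data field. */
typedef uintptr_t TaggedPtr;

/* Combine an aligned node address with a small auxiliary value. */
static TaggedPtr tagged_make(Node *node, unsigned aux)
{
    uintptr_t addr = (uintptr_t)node;
    assert((addr & AUX_MASK) == 0); /* low bits must be unused */
    assert(aux <= AUX_MASK);        /* value must fit in the field */
    return addr | aux;
}

/* Read the auxiliary data field by masking the pointer word. */
static unsigned tagged_aux(TaggedPtr p)
{
    return (unsigned)(p & AUX_MASK);
}

/* Recover a conventional pointer by masking out the auxiliary bits. */
static Node *tagged_node(TaggedPtr p)
{
    return (Node *)(p & ~AUX_MASK);
}
```

With three auxiliary bits the field can hold a value from 0 to 7, for example a small population count, at no additional memory cost beyond the pointer word itself.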
As the population size of a tree or subtree exceeds a threshold value, i.e., a threshold number of indexes, a separate data structure may be created to store additional data about the target object, the tree pointer redirected to the separate data structure, and the auxiliary data field set to indicate that the auxiliary structure is now the target of the pointer.
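The transition from an in-pointer count to a separate auxiliary structure may be sketched as follows. The specific encoding is an assumption made for illustration: a 3-bit auxiliary field holds counts 0 through 6 directly, and the reserved value 7 indicates that the pointer now targets a hypothetical `Aux` record which holds the count and the address of the original node. The threshold, type names and function names are not taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define AUX_MASK     ((uintptr_t)7)
#define AUX_INDIRECT 7u /* assumed tag: pointer targets an Aux record */
#define THRESHOLD    6u /* assumed largest count stored in the pointer */

typedef struct { _Alignas(8) uint64_t words[2]; } Node;
typedef struct { _Alignas(8) Node *node; size_t count; } Aux;
typedef uintptr_t TaggedPtr;

/* Reference to a node with a population count of zero. */
static TaggedPtr ref_make(Node *node) { return (uintptr_t)node; }

/* Population count, read from the pointer bits or from the Aux record. */
static size_t ref_count(TaggedPtr p)
{
    unsigned tag = (unsigned)(p & AUX_MASK);
    return tag == AUX_INDIRECT ? ((const Aux *)(p & ~AUX_MASK))->count : tag;
}

/* Referenced node, following the Aux record when one is present. */
static Node *ref_node(TaggedPtr p)
{
    void *target = (void *)(p & ~AUX_MASK);
    return (p & AUX_MASK) == AUX_INDIRECT ? ((Aux *)target)->node : (Node *)target;
}

/* Count one more index; spawn the Aux record once the field would overflow.
 * Returns false, leaving *p unchanged, if allocation fails. */
static bool ref_increment(TaggedPtr *p)
{
    unsigned tag = (unsigned)(*p & AUX_MASK);
    if (tag == AUX_INDIRECT) {
        ((Aux *)(*p & ~AUX_MASK))->count++;
        return true;
    }
    if (tag < THRESHOLD) {
        *p += 1; /* count still fits in the unused pointer bits */
        return true;
    }
    Aux *aux = malloc(sizeof *aux);
    if (aux == NULL)
        return false;
    assert(((uintptr_t)aux & AUX_MASK) == 0);
    aux->node = ref_node(*p);
    aux->count = (size_t)tag + 1;
    *p = (uintptr_t)aux | AUX_INDIRECT; /* redirect pointer to the Aux record */
    return true;
}

/* Free the Aux record, if one was spawned. */
static void ref_release(TaggedPtr p)
{
    if ((p & AUX_MASK) == AUX_INDIRECT)
        free((void *)(p & ~AUX_MASK));
}
```

In this sketch a small population costs no memory beyond the pointer word, while the one-time allocation of the `Aux` record is incurred only after the population exceeds the threshold and is thereafter amortized over the larger population.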