1. Technical Field
The invention relates generally to PATRICIA tries. More specifically, the invention relates to a cascading index of PATRICIA tries.
2. Discussion of the Prior Art
The Practical Algorithm To Retrieve Information Coded In Alphanumeric (PATRICIA) is a trie introduced by D. R. Morrison in 1968. It is well known in the industry as a compact structure for indexing, and it is commonly used in databases, as well as in networking technologies. In a PATRICIA implementation, nodes that have only one child are eliminated. Each remaining node is labeled with a character position number that indicates the node's depth in the uncompressed trie. FIG. 1 shows an example of such an implementation of a PATRICIA trie for an alphabetical case. The words to be stored are ‘greenbeans,’ ‘greentea,’ ‘grass,’ ‘corn,’ and ‘cow.’ The first three words differ from the last two words in the first letter, i.e. the first three words begin with the letter ‘g,’ while the other two begin with the letter ‘c.’ Hence, there is a difference at the first position. Therefore, there is a node 110-1 at depth ‘0’ separating the ‘g’ words from the ‘c’ words. The edge connecting nodes 110-1 and 110-2 holds the characters ‘gr’ and the edge connecting nodes 110-1 and 110-3 holds the characters ‘co.’ Moving on the ‘gr’ side, the next time a difference is found is in the third position, where two words have an ‘e’ while one word has an ‘a.’ Therefore, a node 110-2 at that level indicates a depth level of ‘2,’ i.e. the depth level equivalent to the length of the string ‘gr.’ Continuing down the left path reveals that the next time a different letter is found is at the sixth position of the ‘greenbeans’ and ‘greentea’ words, where one word has a ‘b’ while the other has a ‘t.’ Therefore, there is a node 110-4 at depth ‘5.’ The words, i.e. the keys, are stored in the leaves 120. For example, leaf 120-1 contains the key ‘greenbeans,’ the leaf 120-2 contains the key ‘greentea,’ and so on.
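By way of illustration only, the depth labels described for FIG. 1 can be reproduced with a minimal Python sketch. The function names `build` and `lookup` and the tuple representation are assumptions made for this example and are not part of the figure. The sketch further assumes that the keys are distinct and that no key is a prefix of another.

```python
import os

def build(keys):
    """Return a leaf (the key string) or a node (depth, {char: subtree}).

    Assumes distinct keys, none of which is a prefix of another.
    """
    if len(keys) == 1:
        return keys[0]
    # The node's depth is the length of the prefix shared by all its keys.
    depth = len(os.path.commonprefix(keys))
    groups = {}
    for k in keys:
        groups.setdefault(k[depth], []).append(k)
    return (depth, {c: build(g) for c, g in groups.items()})

def lookup(trie, key):
    """Follow only the labeled positions, then compare with the stored key."""
    while not isinstance(trie, str):
        depth, children = trie
        if depth >= len(key) or key[depth] not in children:
            return False
        trie = children[key[depth]]
    # Skipped characters were never inspected, so the leaf key must be checked.
    return trie == key

words = ["greenbeans", "greentea", "grass", "corn", "cow"]
trie = build(words)
```

In this sketch the root carries depth 0, the node reached through ‘g’ carries depth 2, and the node reached through ‘g’ and then ‘e’ carries depth 5, in agreement with the nodes 110-1, 110-2 and 110-4 described above. A probe such as ‘gxass’ reaches the leaf for ‘grass’ and is rejected only by the final comparison.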
The problem with this implementation is that keys are not uniquely specified by the search path. Hence, the key itself has to be stored in the appropriate leaf. The advantage of this implementation is that only t*n pointers are required, where ‘t’ is the size of the alphabet and ‘n’ is the number of leaves. For purposes of the discussion herein, an alphabet is a group of symbols, where the size of the alphabet is determined by the number of symbols in the group. That is, an alphabet having t=2 is a binary alphabet having only two symbols, e.g. ‘0’ and ‘1.’ FIG. 2 shows an exemplary implementation for such an alphabet with two nodes 210-1 and 210-2 and three leaves 220-1, 220-2 and 220-3, including the keys ‘1000,’ ‘1110,’ and ‘1111,’ respectively. For binary PATRICIA tries, the number of internal nodes 210 is equal to the number of leaves 220 minus 1. The height of the PATRICIA trie is bounded by the number of leaves ‘n.’
A PATRICIA trie is either a leaf L(k) containing a key k or a node N(d, l, r) containing a bit offset d≧0 along with a left sub-tree l, and a right sub-tree r. This is a recursive description of the nodes of a PATRICIA trie, and leaves descending from a node N(d, l, r) must agree on the first d−1 bits. A description of PATRICIA tries may be found in A Compact B-Tree, Bumbulis and Bowman, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 533-541, which is herein incorporated in its entirety by this reference thereto. A block of pointers may now be prepared using the PATRICIA trie architecture, the block having pointers that allow for efficient retrieval of the data. The number of pointers, or fanout, of the block may be calculated based on several parameters.
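The recursive description above may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the structure of the cited reference: the class names `Leaf` and `Node`, the helper names `build`, `count` and `search`, and the use of strings of ‘0’ and ‘1’ as keys are all hypothetical. The sketch assumes distinct keys of equal length and takes d to be the 0-based index of the bit tested at a node, which may differ from the numbering convention of the cited reference.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    key: str        # full key, held as a string of '0'/'1'

@dataclass
class Node:
    d: int          # 0-based offset of the bit tested here (assumed convention)
    left: "Trie"    # keys whose bit d is '0'
    right: "Trie"   # keys whose bit d is '1'

Trie = Union[Leaf, Node]

def build(keys):
    """Build a binary PATRICIA trie from distinct, equal-length bit strings."""
    if len(keys) == 1:
        return Leaf(keys[0])
    # First offset at which the keys disagree.
    d = next(i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1)
    zeros = [k for k in keys if k[d] == "0"]
    ones = [k for k in keys if k[d] == "1"]
    return Node(d, build(zeros), build(ones))

def count(trie):
    """Return (internal_nodes, leaves)."""
    if isinstance(trie, Leaf):
        return (0, 1)
    left_nodes, left_leaves = count(trie.left)
    right_nodes, right_leaves = count(trie.right)
    return (1 + left_nodes + right_nodes, left_leaves + right_leaves)

def search(trie, key):
    """Test only the labeled bits, then confirm against the stored key."""
    while isinstance(trie, Node):
        trie = trie.right if key[trie.d] == "1" else trie.left
    return trie.key == key

trie = build(["1000", "1110", "1111"])
```

For the keys ‘1000,’ ‘1110,’ and ‘1111,’ this sketch yields two internal nodes and three leaves, i.e. one fewer internal node than leaves, as stated for binary PATRICIA tries.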
A PATRICIA trie is split when the trie no longer fits on a disk block. The PATRICIA trie is split such that the split operation returns two PATRICIA tries, each conforming to the PATRICIA trie characteristics. Reference is made to FIG. 3A, where a parent and child node, N(d, l, s) and s respectively, are shown. A split operation takes place (FIG. 3B) returning two PATRICIA tries, T2 320 containing s, and T1 310 containing the original parent node and a descendant pointer to s such that T1 and T2 contain all of the leaves of the original trie. The trie T2 320 consists of the new root node s and all labeled nodes and leaves which are on the path from the original root to d that additionally have a 1 at position d+1. The trie T1 310 consists of all other nodes in the original trie. The split operation can take place at any required depth to accommodate overflow of data in a given disk block. Similarly, merge operations are also possible.
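A simplified sketch of such a split is given below in Python. It is an assumption-laden illustration rather than the operation of FIGS. 3A and 3B in full: it detaches only the right sub-trie s of the root N(d, l, s), whereas a split may take place at any depth. The names `BlockRef` and `split_right` are hypothetical, and `BlockRef` merely stands in for a descendant pointer to another block.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    key: str

@dataclass
class Node:
    d: int
    left: "Trie"
    right: "Trie"

@dataclass
class BlockRef:
    """Hypothetical stand-in for a descendant pointer to another block."""
    block_id: int

Trie = Union[Leaf, Node, BlockRef]

def split_right(root, new_block_id):
    """Detach the right sub-trie s of root N(d, l, s).

    Returns (t1, t2): t1 keeps the parent node with a pointer in place of s,
    and t2 is s itself, now the root of its own trie.
    """
    if not isinstance(root, Node):
        raise ValueError("only an internal node can be split")
    t2 = root.right
    t1 = Node(root.d, root.left, BlockRef(new_block_id))
    return t1, t2

def leaves(trie):
    """Collect the keys stored in this trie, not following block pointers."""
    if isinstance(trie, Leaf):
        return [trie.key]
    if isinstance(trie, BlockRef):
        return []
    return leaves(trie.left) + leaves(trie.right)

original = Node(1, Leaf("1000"), Node(3, Leaf("1110"), Leaf("1111")))
t1, t2 = split_right(original, new_block_id=2)
```

In this sketch T1 and T2 together hold exactly the leaves of the original trie, and the original trie is left unmodified.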
The arbitrary split process is a contributing factor to the difficulty of using PATRICIA tries within block-based systems, where the blocks reside in a differentiated memory hierarchy. This is true even with a simple hierarchy, such as main-memory to disk. With a PATRICIA trie, some pointers to data are near the root, while others are quite far from it. This inherent imbalance comes both from the key values inserted, and from the insertion order of the keys. For some ideal PATRICIA tries, this means that query performance can be O(log N), where N is the number of blocks in the PATRICIA trie. However, for other PATRICIA tries, worst case query performance is O(N). PATRICIA tries therefore present several difficulties: they are not well suited to low-latency operations, they are difficult to plan with, they are probabilistic data structures, and they are difficult to allocate resources to.
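The dependence of the imbalance on the key values can be illustrated with a short Python sketch. It measures height in nodes of a binary PATRICIA trie rather than in blocks, so it only approximates the block-level behavior discussed above. The function name `height` and both key sets are assumptions chosen for the illustration, and the keys are assumed to be distinct bit strings of equal length.

```python
def height(keys):
    """Height, in internal nodes, of the binary PATRICIA trie over `keys`.

    Assumes distinct, equal-length strings of '0'/'1'.
    """
    if len(keys) <= 1:
        return 0
    # First offset at which the keys disagree; this is the node's label.
    d = next(i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1)
    zeros = [k for k in keys if k[d] == "0"]
    ones = [k for k in keys if k[d] == "1"]
    return 1 + max(height(zeros), height(ones))

# Eight keys covering every 3-bit pattern: each test splits the keys evenly.
balanced = [format(i, "03b") for i in range(8)]

# Eight one-hot keys: each test peels off a single key.
skewed = ["0" * i + "1" + "0" * (7 - i) for i in range(8)]
```

Under these assumptions, `height(balanced)` is 3, which is logarithmic in the eight keys, while `height(skewed)` is 7, which is linear in the eight keys.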
Prior art solutions have attempted to use an additional indexing dimension. However, the techniques used resulted in complex solutions, such as those proposed, for example, in U.S. Pat. No. 6,208,993 by Shadmon, which uses a layered index approach to balance the trie. Specifically, there is the possibility of multiple erroneous path selections due to the compacting of the PATRICIA trie block. Additionally, in Shadmon the upper layers are not managed by an actual PATRICIA trie, but through a more complex, proprietary structure.
It would therefore be advantageous to provide a solution taking advantage of the strengths of the PATRICIA trie, while overcoming at least the weaknesses discussed above, including, but not limited to, the inherent imbalance of PATRICIA trie blocks. It would be further advantageous if all data structures were PATRICIA tries and, furthermore, if the number of errors were bounded.