1. Technical Field
The invention relates generally to PATRICIA tries. More specifically, this invention relates to an improved termination of variable length keys using ternary PATRICIA tries.
2. Background of the Invention
The trie is a data structure that allows for fast search and data retrieval over a large text. Tries are used to implement the dictionary abstract data type (ADT), on which basic operations, such as search, insert, and delete, can be performed. Further, a trie can be used for encoding and compression of text.
One type of trie known in the art is PATRICIA, the practical algorithm to retrieve information coded in alphanumeric (D. R. Morrison, PATRICIA: Practical algorithm to retrieve information coded in alphanumeric, J. ACM, 15 (1968) pp. 514-534). The PATRICIA trie was introduced by D. R. Morrison in 1968. It is well known in the industry as a compact structure for indexing, and is commonly used in databases, as well as in networking technologies.
In a PATRICIA implementation, trie nodes that have only one child are eliminated, i.e. unary nodes are collapsed. The remaining nodes are labeled with a character position number that indicates the node's depth in the uncompressed trie. FIG. 1 shows an example of such an implementation of a PATRICIA trie for an alphabetical case. The words to be stored are “greenbeans,” “greentea,” “grass,” “corn,” and “cow.” The first three words differ from the last two words in the first letter, i.e. three words begin with the letter “g,” while the other two words begin with the letter “c.” Hence, there is a difference at the first position. Therefore, there is a node 110-1 at depth “0” separating the “g” words from the “c” words. The edge connecting nodes 110-1 and 110-2 holds the characters “gr” and the edge connecting nodes 110-1 and 110-3 holds the characters “co.” Moving down the “gr” side, the next difference is found in the third position, where two words have an “e” while one word has an “a.” Therefore, a node 110-2 at that level indicates a depth level of “2,” i.e. the depth level equivalent to the length of the string “gr.” Continuing down the left path, the next difference is found at the sixth position of the words “greenbeans” and “greentea,” where one word has a “b” while the other has a “t.” Therefore, there is a node 110-4 at depth “5.” The words, i.e. the keys, are stored in the leaves 120. For example, leaf 120-1 contains the key “greenbeans,” leaf 120-2 contains the key “greentea,” and so on.
The problem with this implementation is that keys are not uniquely specified by the search path. Hence, the key itself has to be stored in the appropriate leaf. An advantage of this PATRICIA implementation is that only about t*n bits of storage are required, where t is the size of the alphabet and n is the number of leaves.
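By way of a non-limiting illustration, the structure of FIG. 1 may be sketched in Python as follows. The names Leaf, Node, build, and search are hypothetical and do not appear in the cited references, and the sketch assumes that the keys are unique and that no key is a prefix of another. The final comparison in search reflects the point made above: the search path alone does not identify a key, so the full key is kept in the leaf and checked there.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    key: str  # the full key is stored in the leaf


@dataclass
class Node:
    depth: int  # character position at which the subtrees differ
    children: dict = field(default_factory=dict)  # branching character -> subtree


def build(keys, start=0):
    """Build a compressed trie over keys that agree on their first `start` characters.

    Assumption of this sketch: keys are unique and no key is a prefix of another.
    """
    if len(keys) == 1:
        return Leaf(keys[0])
    # Skip positions on which all keys agree (the collapsed unary nodes).
    depth = start
    while len({k[depth] for k in keys}) == 1:
        depth += 1
    # Partition the keys by the character found at the first differing position.
    groups = {}
    for k in keys:
        groups.setdefault(k[depth], []).append(k)
    return Node(depth, {c: build(g, depth + 1) for c, g in groups.items()})


def search(trie, key):
    """Return the stored key equal to `key`, or None if it is absent."""
    node = trie
    while isinstance(node, Node):
        if node.depth >= len(key) or key[node.depth] not in node.children:
            return None
        node = node.children[key[node.depth]]
    # Only the branching characters were inspected, so the candidate leaf
    # must be compared against the full search key.
    return node.key if node.key == key else None


trie = build(["greenbeans", "greentea", "grass", "corn", "cow"])
```

For the five words of the example, this sketch yields a root at depth 0, a node at depth 2 on each of the “g” and “c” sides, and a node at depth 5 separating “greenbeans” from “greentea.” A lookup of a string such as “gxeentea” follows the same path as “greentea” and is rejected only by the comparison at the leaf.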
An alphabet is a group of symbols, where the size of an alphabet is determined by the number of symbols in the group. That is, an alphabet in which t=2 is a binary alphabet having only two symbols, possibly 0 and 1. FIG. 2 shows an exemplary implementation for such an alphabet with two nodes 210-1 and 210-2, and three leaves 220-1, 220-2, and 220-3, containing the keys 1000, 1110, and 1111 respectively. For binary PATRICIA tries, the number of internal nodes 210 is equal to the number of leaves 220 minus 1. The height of the PATRICIA trie is bounded by the number of leaves n.
A PATRICIA trie is either a leaf L(k) containing a key k or a node N(d, l, r) containing a bit offset d≥0 along with a left sub-tree l and a right sub-tree r. This is a recursive description of the nodes of a PATRICIA trie, and leaves descending from a node N(d, l, r) must agree on the first d-1 bits. A description of PATRICIA tries may be found in Bumbulis and Bowman, A Compact B-Tree, Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 533-541, which is herein incorporated in its entirety by this reference thereto.
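The recursive description above may be illustrated, purely as a sketch, by the following Python fragment applied to the binary keys 1000, 1110, and 1111 of FIG. 2. The function names build_binary, count, and lookup are hypothetical. The sketch further assumes that d is a 0-based index of the discriminating bit, that keys are given as strings of the characters 0 and 1, and that the keys are unique with no key being a prefix of another.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class L:
    k: str  # key, written as a string of '0' and '1' characters


@dataclass
class N:
    d: int       # bit offset (0-based in this sketch) at which l and r differ
    l: "Trie"    # sub-tree whose keys have a 0 at offset d
    r: "Trie"    # sub-tree whose keys have a 1 at offset d


Trie = Union[L, N]


def build_binary(keys, d=0):
    """Build a binary PATRICIA trie; assumes unique, prefix-free keys."""
    if len(keys) == 1:
        return L(keys[0])
    # Advance to the first bit offset on which the keys disagree.
    while len({k[d] for k in keys}) == 1:
        d += 1
    zeros = [k for k in keys if k[d] == "0"]
    ones = [k for k in keys if k[d] == "1"]
    return N(d, build_binary(zeros, d + 1), build_binary(ones, d + 1))


def count(t):
    """Return (number of internal nodes, number of leaves)."""
    if isinstance(t, L):
        return (0, 1)
    left_nodes, left_leaves = count(t.l)
    right_nodes, right_leaves = count(t.r)
    return (1 + left_nodes + right_nodes, left_leaves + right_leaves)


def lookup(t, key):
    """Follow one bit per node, then verify the candidate leaf."""
    while isinstance(t, N):
        if t.d >= len(key):
            return None
        t = t.r if key[t.d] == "1" else t.l
    return t.k if t.k == key else None


fig2 = build_binary(["1000", "1110", "1111"])
```

Under these assumptions the sketch produces two internal nodes and three leaves for the keys of FIG. 2, consistent with the number of internal nodes being the number of leaves minus 1.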
A block of pointers may now be prepared using the PATRICIA trie architecture, the block having pointers that allow for efficient retrieval of the data. The number of pointers, or fanout, of the block may be calculated based on several parameters.
It is assumed that the keys ki are unique. Where keys are not unique, unique keys must be created. Several strategies are suggested by the prior art, such as appending the record identifier (RID) of the record to the respective key. Assuming that all keys can be normalized to binary strings in an order preserving fashion, the normalization can be implemented such that no key is a prefix of another. This is trivially possible for fixed length keys. For variable length keys, an end marker would have to be added while maintaining order. For bounded length keys, one strategy is to pad all keys with binary 1s to a length that is greater than the length in bits of any key that could possibly be encountered. Such a strategy simplifies the algorithms and also serves alignment purposes. The deficiencies of the prior art are clear: the difficulty of handling indexes over data sets containing duplicate values, the complexity of handling prefix keys, and the need to pad keys with bits in order to terminate them.
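The two prior art strategies mentioned above, appending a RID and padding with binary 1s, may be sketched as follows. The helper names append_rid, pad_with_ones, and is_prefix_free, the 8-bit RID width, and the padded width of 8 bits are assumptions chosen only for illustration. The sketch does not attempt to preserve any particular key order, nor to guarantee that distinct keys remain distinct after padding.

```python
def is_prefix_free(keys):
    """True if no key is a proper prefix of a different key."""
    return not any(a != b and b.startswith(a) for a in keys for b in keys)


def append_rid(bits, rid, rid_bits=8):
    """Make duplicate keys distinct by appending a fixed-width record identifier."""
    return bits + format(rid, "0{}b".format(rid_bits))


def pad_with_ones(bits, width):
    """Pad a bounded length key with binary 1s up to `width` bits.

    `width` must exceed the length of every key that may be encountered.
    """
    if len(bits) >= width:
        raise ValueError("width must be greater than the key length")
    return bits.ljust(width, "1")


keys = ["10", "100", "0110"]          # "10" is a prefix of "100"
padded = [pad_with_ones(k, 8) for k in keys]
```

In this example the unpadded keys are not prefix free, whereas the padded keys all have the same length and therefore cannot be prefixes of one another. The cost is that every key is stored at the padded width.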
It would therefore be advantageous to provide a practical solution for handling the termination of variable length keys of a PATRICIA trie. It would be furthermore advantageous if such a solution eliminated the need for tricks or for padding keys beyond the length of the longest possible key. It would be further advantageous if such a solution were applicable to the indexing of infinite strings.