The present invention relates to radix-tree search structures and, in particular, it concerns radix-tree search structures having the ability to handle wildcards in any location within an entry.
A large number of computing or networking tasks require recognizing keys from a database, such as a lookup table in a computer network or dictionaries in general. After establishing a match between the search key and a data sequence from the database, either the information linked to this search key is retrieved or a program driven by this search key is executed.
In communication networks consisting of a number of interconnected users or nodes, data can be sent from one node to any other. Specialized nodes called routers are responsible for delivering or "forwarding" the data to their destination. By analogy, the routers act as post offices. Like letters, the data sent through a communication network carry information about the destination address, generally as part of a so-called header.
Each router compares this information or at least part of it with a list of addresses stored internally. If a match between stored addresses and the destination address is found, the router establishes a path leading to the destination node. Depending on the size of the network and its structure, the data are either directly forwarded to their destination or sent to another (intermediate) router, very much the same way a letter is passed through several post offices until reaching its final address (if ever).
The explosive growth of the Internet has forced a review of address assignment policies. The traditional uses of general purpose networks have been modified to achieve better use of IP's 32-bit address space. Classless Inter Domain Routing (CIDR) is a method currently being deployed in the Internet backbones to achieve this added efficiency. CIDR depends on deploying and routing to arbitrarily sized networks. In this model, hosts and routers make no assumptions about the use of addressing in the internet. The Class D (IP Multicast) and Class E (Experimental) address spaces are preserved, although this is primarily an assignment policy.
By definition, CIDR comprises three elements:
topologically significant address assignment,
routing protocols that are capable of aggregating network layer reachability information, and
a consistent forwarding algorithm ("longest match").
The use of networks and subnets is now historical, although the language used to describe them remains in current use. They have been replaced by the more tractable concept of a network prefix. A network prefix is, by definition, a contiguous set of bits at the more significant end of the address that defines a set of systems; host numbers select among those systems. There is no requirement that all the internet use network prefixes uniformly. To collapse routing information, it is useful to divide the internet into addressing domains. Within such a domain, detailed information is available about constituent networks; outside it, only the common network prefix is advertised.
The classical IP addressing architecture used addresses and subnet masks to discriminate the host number from the network prefix. With network prefixes, it is sufficient to indicate the number of bits in the prefix. Both representations are in common use. Architecturally correct subnet masks are capable of being represented using the prefix length description.
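By way of illustration only, the equivalence between the two representations may be sketched as follows. The function names `mask_to_prefix_len` and `prefix_len_to_mask` are hypothetical and chosen for this sketch; only the standard Python `ipaddress` module is assumed. A mask whose one-bits are not contiguous is rejected, since it has no prefix-length description.

```python
import ipaddress

def mask_to_prefix_len(mask: str) -> int:
    """Return the prefix length for a contiguous dotted-quad subnet mask."""
    value = int(ipaddress.IPv4Address(mask))
    inverted = value ^ 0xFFFFFFFF  # host bits become ones
    # A contiguous mask inverts to 2**h - 1, so adding 1 clears every set bit.
    if inverted & (inverted + 1):
        raise ValueError(f"non-contiguous subnet mask: {mask}")
    return 32 - inverted.bit_length()

def prefix_len_to_mask(length: int) -> str:
    """Return the dotted-quad subnet mask for a prefix length of 0 to 32."""
    if not 0 <= length <= 32:
        raise ValueError(f"prefix length out of range: {length}")
    value = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value))
```

For example, under these assumptions a mask of 255.255.240.0 corresponds to a prefix length of 20, and the conversion is reversible.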
An effect of the use of CIDR is that the sets of destinations associated with address prefixes in the routing table may exhibit a subset relationship. A route describing a smaller set of destinations (a longer prefix) is said to be more specific than a route describing a larger set of destinations (a shorter prefix); similarly, a route describing a larger set of destinations (a shorter prefix) is said to be less specific than a route describing a smaller set of destinations (a longer prefix). Routers must use the most specific matching route (the longest matching network prefix) when forwarding traffic.
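The longest-match rule described above may be illustrated by the following minimal sketch. It uses a plain linear scan over a small table rather than any tree structure, and the route table, the next-hop names, and the function name `longest_prefix_match` are all hypothetical values chosen for the example.

```python
import ipaddress

# Hypothetical routing table: nested prefixes mapped to made-up next hops.
ROUTES = {
    "10.0.0.0/8": "router-a",
    "10.1.0.0/16": "router-b",
    "10.1.2.0/24": "router-c",
}

def longest_prefix_match(routes, destination):
    """Return the next hop of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the longest matching prefix so far
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return None if best is None else best[1]
```

In this sketch an address such as 10.1.2.3 matches all three prefixes, and the /24 route is chosen because it is the most specific.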
Thus, the CIDR routing method provides an excellent example of the necessity of having partial or prefix matching ability within a data structure for retrieving keys.
The data structure of the invention is a database organized as a tree. A tree consists of a number of nodes, each of which may branch or point to other nodes.
A class of trees is known in which the key is stored in the node. A decision whether and to which node to branch implies a comparison between the search key and those stored in the node. The result of this comparison determines the choice of the following subtree. Typically, binary search trees belong to this class. The main disadvantage of this class of trees and its corresponding search method is that all bits of the key are in the worst case compared k times, k denoting the maximum number of levels within the tree.
In another approach, termed radix-search methods, the search proceeds by examining the search keys one small piece at a time, rather than using full comparisons between keys at each step. Radix-search methods are particularly suited to handling variable-length keys. They also provide reasonable worst-case performance. Two important disadvantages, noted by R. Sedgewick ("Algorithms in C++", Addison Wesley, 1998), are that such search methods can make inefficient use of space, and that performance can suffer if efficient access to the bytes of the keys is not available.
In another class of trees known as "tries", the keywords or data belonging to these keywords are stored in the terminal nodes or leaves of the tree. When a key is searched or inserted, the bits of the key determine the path to be followed down to a leaf. Trees of this class are called "tries" in the technical field, a term introduced as an allusion to the term "retrieval". If a key, for example, is a sequence of n bits (either 0 or 1), the decision to branch to the left (represented by 0) or to the right (1) at the kth level of the tree is made by using the kth bit of the key sequence, with k being a number between 1 and n. In a data structure belonging to this class, each bit of the key is compared only once.
This data structure is, however, storage consuming: many of these internal nodes may have one empty subtree, and, thus, are traversed without a gain of information.
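The bit-by-bit branching described above, together with the storage overhead it incurs, may be sketched as follows. This is a minimal illustration that assumes keys are equal-length strings of the characters 0 and 1; the class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = [None, None]  # index 0: left branch, index 1: right branch
        self.value = None             # set only on the node reached by a full key

def insert(root, key, value):
    node = root
    for bit in key:  # the kth bit selects the branch taken at the kth level
        index = int(bit)
        if node.children[index] is None:
            node.children[index] = TrieNode()
        node = node.children[index]
    node.value = value

def search(root, key):
    node = root
    for bit in key:  # each bit of the key is inspected exactly once
        node = node.children[int(bit)]
        if node is None:
            return None
    return node.value

def count_one_way_nodes(node):
    """Count nodes with exactly one non-empty subtree."""
    if node is None:
        return 0
    present = [child for child in node.children if child is not None]
    return (len(present) == 1) + sum(count_one_way_nodes(c) for c in present)
```

Inserting only the keys 0000 and 0001 into such a trie produces three one-way nodes above the single point at which the keys actually diverge, which is the overhead noted above.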
Such unnecessary internal nodes are eliminated in a compact data structure known as the Patricia trie. The Patricia trie is characterized by having a minimum number of internal nodes, or, equally, pointers or key-bit inspections. Each node within a Patricia trie contains information about the bits which characterize the compacted path, e.g., the number of bits to skip before making another bit comparison to decide the direction to branch to. An example of a Patricia trie can be found in: D. E. Knuth, "The Art of Computer Programming", Vol. 3: Sorting and Searching, 1973, pp. 490-493.
Patricia tries handle keys of varying lengths and are characterized by a deterministic structure. Additional characteristics of Patricia tries are summarized by R. Sedgewick ("Algorithms in C++", Addison Wesley, 1998), pp. 637-645. Typical applications include locating strings containing names of different lengths, searching for IP addresses, etc.
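The path compression underlying a Patricia trie may be illustrated by the following simplified sketch. It is not the structure given by Knuth or Sedgewick: it assumes distinct, equal-length bit strings, builds the tree in one batch rather than by incremental insertion, and stores in each internal node the absolute index of the bit to inspect instead of a skip count (the skip count can be derived from consecutive indices). All names are hypothetical.

```python
def build(keys, start=0):
    """Build a path-compressed binary trie from distinct, equal-length bit strings.

    A leaf is the key itself (a str); an internal node is a tuple
    (bit_index, left_subtree, right_subtree).
    """
    if not keys:
        raise ValueError("at least one key is required")
    if len(keys) == 1:
        return keys[0]
    bit = start
    # Skip every position at which all remaining keys agree.
    while len({key[bit] for key in keys}) == 1:
        bit += 1
    zeros = [key for key in keys if key[bit] == "0"]
    ones = [key for key in keys if key[bit] == "1"]
    return (bit, build(zeros, bit + 1), build(ones, bit + 1))

def search(node, key):
    """Return True if key is stored in the trie."""
    while isinstance(node, tuple):
        bit, left, right = node
        node = left if key[bit] == "0" else right
    # Skipped bits were never inspected, so the full key is compared at the leaf.
    return node == key

def count_internal(node):
    if not isinstance(node, tuple):
        return 0
    return 1 + count_internal(node[1]) + count_internal(node[2])
```

For the three keys 0000, 0001 and 1111, this sketch yields two internal nodes, one fewer than the number of keys, in contrast to the longer one-way chains of an uncompressed trie.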
To date, however, Patricia tries have been fundamentally incapable of handling keys containing wildcards, with the exception of wildcards at the suffix. When a key is stored in a leaf of a Patricia trie, and the path to the leaf examines no more than the first k bits out of n bits (n > k), the bits that follow the kth bit have no effect on the tree, such that a plurality of keys that are identical over the first k bits can be stored in a single leaf. Thus, the plurality of keys may be allowed to contain one or more wildcards in the remaining (n-k) bits. It must be emphasized that such wildcards are not part of the tree/node structure, but are contained in the leaves.
The use of strings containing wildcards in the node structure of radix-search trees is particularly problematic. Bits are binary, having values of 0 or 1. A wildcard bit can be either, which in effect constitutes a third possible value. More importantly, a string having wildcards is actually a plurality of strings, the number of strings (Ns) being determined by the following relationship:
Ns = 2^{Nw}
wherein Nw is the number of wildcards in the string. Thus, the presence of wildcards creates a representation problem compounded by a significant increase in entries in the tree structure, such that both the size and the depth of the tree are appreciably increased.
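The relationship above may be checked with the following short sketch, which enumerates every single key represented by a string containing wildcards. The use of the character `*` to denote a wildcard bit and the function name `expand` are assumptions made for the example.

```python
from itertools import product

def expand(pattern, wildcard="*"):
    """Return every fully specified bit string represented by pattern."""
    # Each wildcard position contributes two choices; a fixed bit contributes one.
    slots = [("0", "1") if ch == wildcard else (ch,) for ch in pattern]
    return ["".join(bits) for bits in product(*slots)]
```

A string with two wildcards, such as 1*0*, thus stands for four single keys, and each additional wildcard doubles the count.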
U.S. Pat. No. 5,787,430 to Doeringer et al. discloses a database having a trie-like structure for storing entries and retrieving an at least partial match, preferably the longest partial match, or all partial matches of a search argument (input key) from the entries, the database having nodes, with each node containing first link information leading to at least one previous node (parent pointer) and second link information leading to at least one following node (child pointer), at least a stored key, or a combination thereof. The particular structure of the nodes allows a two-step search process, in which segments of a search argument are first used to determine a search path through the trie-like database. The search path is subsequently backtracked in the second part of the search. During the second part of the search, the entire search argument is compared to entries stored in the nodes until a match is found. It is claimed that the described database allows an efficient use of memories and is advantageously applied to fast data retrieval, in particular related to communication within computer networks.
In "IP Lookups Using Multiway and Multicolumn Search" (IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 3, JUNE 1999), B. Lampson et al. teach an IP lookup routine using a binary search scheme that is adapted to solve the longest matching prefix problem.
Although several prefix-matching techniques are known, to the best of our knowledge, there are no radix-search trees that are capable of handling wildcards positioned in any location (bit) in the string. The ability to assimilate strings having wildcards positioned in any location in the string is desirable in many applications. For example, if we would like to give special routing priority to all packets addressed to particular individuals within a company, the URL would generally have wildcards located in the middle of the string. In another application example, if we would like to give special routing priority to all sessions between departments of a company or between two companies, the session ID may be represented as a concatenation of the two individual addresses, each having wildcard(s) in the suffix. As a result, the concatenated string will have wildcard(s) in the middle of the string as well as in the suffix.
There is therefore a recognized need for, and it would be highly advantageous to have, a method for incorporating and retrieving entries in a radix-search tree, wherein one or more wildcards can be present at any location within the string.
According to the teachings of the present invention there is provided a leaf within a radix-search tree, the leaf including: at least a first entry containing at least one wildcard, the at least a first entry forming at least two single keys that are distinguishable by a node.
According to another aspect of the present invention there is provided a system for storing and retrieving data using a radix-search tree having a plurality of sub-trees containing nodes and leaves, the system including: (a) a data storage module designed and configured for storing the plurality of sub-trees, wherein at least one of the leaves contains at least one entry having at least one wildcard in a primary position, and (b) a processor that is operative to perform operations including: (i) building the radix-search tree in the data storage module.
According to further features in the described preferred embodiments, the processor is further operative to retrieve data from the radix-search tree in the data storage module.
According to yet another aspect of the present invention there is provided a method for storing and retrieving data using a radix-search tree having a plurality of sub-trees for storing at least one set of entries, the plurality of sub-trees containing nodes and leaves, the method including: (a) providing a system including: (i) a data storage module designed and configured for storing the plurality of sub-trees, and (ii) a processor that is operative to perform operations including building the radix-search tree in the data storage module, and (b) storing within the radix-search tree at least one set of entries containing at least one entry having at least one wildcard in a primary position.
According to yet another aspect of the present invention there is provided a method for storing and retrieving data using a radix-search tree having a plurality of sub-trees for storing at least one set of entries, the plurality of sub-trees containing nodes and leaves, the method including: (i) finding at least two split bits within the set of entries; and (ii) selecting one of the at least two split bits according to a best-balanced split criterion.
According to further features in the described preferred embodiments, the at least one wildcard in a primary position is followed by a bit forming a pseudo-node.
According to still further features in the described preferred embodiments, the at least one wildcard in a primary position is followed by a split bit.
According to still further features in the described preferred embodiments, the at least one entry is a plurality of entries, and the at least one wildcard in a primary position is in a suffix position.
According to still further features in the described preferred embodiments, the at least one entry having at least one wildcard in a primary position has a split bit preceding the at least one wildcard and a split bit following the at least one wildcard.
According to still further features in the described preferred embodiments, at least one bit is checked out of order of appearance.
According to still further features in the described preferred embodiments, the building of the radix-search tree in the data storage module includes ordering the nodes of at least one of the sub-trees by examining a bit having a minimum number of wildcards.
According to still further features in the described preferred embodiments, the building of the radix-search tree in the data storage module includes ordering the nodes of at least one of the sub-trees by examining a bit providing a best-balanced split of single keys in the sub-tree.
According to still further features in the described preferred embodiments, the bit is selected according to a combination of selection criteria including: (a) a minimum number of wildcards in the sub-tree, and (b) a best-balanced split of single keys in the sub-tree.
According to still further features in the described preferred embodiments, if at least two bits have the minimum number of wildcards, the nodes are ordered at least according to a bit providing a best-balanced split of single keys in the sub-tree.
According to still further features in the described preferred embodiments, the inventive leaf further includes: (b) at least a second entry, the first entry and the second entry being distinguishable by a pseudo-node.
According to still further features in the described preferred embodiments, the (at least a) second entry is a subset of the first entry.
According to still further features in the described preferred embodiments, the (at least a) second entry contains at least one wildcard.
According to still further features in the described preferred embodiments, the (at least a) second entry includes a third entry.
According to still further features in the described preferred embodiments, the storing of at least one set of entries in the radix-search tree includes: (i) finding at least two split bits within the set of entries, and (ii) selecting at least one of the at least two split bits according to a minimum wildcard criterion.
According to still further features in the described preferred embodiments, the storing of at least one set of entries in the radix-search tree includes: (i) finding at least two split bits within the set of entries, and (ii) selecting one of the at least two split bits according to a best-balanced split criterion.
According to still further features in the described preferred embodiments, the storing of at least one set of entries in the radix-search tree includes: (i) finding at least two split bits within the set of entries, and (ii) selecting one of the at least two split bits according to criteria including a minimum wildcard criterion and a best-balanced split criterion.
According to still further features in the described preferred embodiments, the method further includes: (iii) if the minimum wildcard criterion is met by at least two of the split bits, selecting one of the at least two split bits according to a best-balanced split criterion.
According to still further features in the described preferred embodiments, the method further includes: (iv) splitting the selected bit to form a left sub-tree containing at least one entry from the set of entries, and a right sub-tree containing at least one entry from the set of entries.
According to still further features in the described preferred embodiments, the storing of at least one set of entries in the radix-search tree includes: (i) looking for split bits within the set of entries, and (ii) if no split bits are found, building a leaf containing the set of entries.
According to still further features in the described preferred embodiments, the storing of at least one set of entries in the radix-search tree includes: (i) looking for split bits within the set of entries, and (ii) if no split bits are found, and if the entries cannot be inserted in a single leaf, selecting the bit according to a maximal number of pairs criterion.
According to still further features in the described preferred embodiments, the storing of at least one set of entries in the radix-search tree includes: (i) looking for at least one split bit within the set of entries; (ii) if the at least one split bit is found within the set of entries, selecting at least one of the split bits according to a minimum wildcard criterion; (iii) if the minimum wildcard criterion is met by at least two of the split bits, selecting one of the split bits according to a best-balanced split criterion; (iv) if no split bit is found in step (i), and if the set of entries can be inserted in a single leaf, building a leaf containing the set of entries; (v) if no split bit is found in step (i), and if the entries cannot be inserted in a single leaf, selecting the bit according to a maximal number of pairs criterion; and (vi) splitting the selected bit to form a left sub-tree containing at least one entry from the set of entries, and a right sub-tree containing at least one entry from the set of entries.
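Steps (i) to (iv) and (vi) above, together with a simple retrieval, may be sketched as follows. The sketch rests on several assumptions that are not taken from this section: entries are equal-length strings over 0, 1 and `*` (wildcard); a split bit is taken to be a position at which at least one entry holds 0 and at least one holds 1; an entry with a wildcard at the selected bit is copied into both sub-trees; balance is measured by counting entries rather than expanded single keys; and a leaf may hold any number of entries, so the maximal number of pairs criterion of step (v), which is not defined here, is omitted. All function names are hypothetical.

```python
WILDCARD = "*"  # assumed notation for a wildcard bit

def find_split_bits(entries):
    """Step (i): positions holding both a 0 and a 1 among the entries."""
    width = len(entries[0]) if entries else 0
    return [p for p in range(width)
            if any(e[p] == "0" for e in entries)
            and any(e[p] == "1" for e in entries)]

def select_split_bit(entries, candidates):
    """Steps (ii) and (iii): fewest wildcards first, then best-balanced split."""
    def score(p):
        column = [e[p] for e in entries]
        wildcards = column.count(WILDCARD)
        imbalance = abs(column.count("0") - column.count("1"))
        return (wildcards, imbalance)
    return min(candidates, key=score)  # ties resolve to the lowest position

def build(entries):
    candidates = find_split_bits(entries)
    if not candidates:
        return {"leaf": list(entries)}  # step (iv)
    p = select_split_bit(entries, candidates)
    # Step (vi): wildcard entries go to both sides, so each side shrinks.
    left = [e for e in entries if e[p] in ("0", WILDCARD)]
    right = [e for e in entries if e[p] in ("1", WILDCARD)]
    return {"bit": p, "left": build(left), "right": build(right)}

def lookup(tree, key):
    """Follow one path for a fully specified key, then filter the leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if key[tree["bit"]] == "0" else tree["right"]
    return [e for e in tree["leaf"]
            if all(c in (WILDCARD, k) for c, k in zip(e, key))]
```

Because wildcard entries are duplicated across sub-trees in this sketch, a lookup never needs to backtrack; the cost is additional storage, which is one reason a criterion favoring bits with few wildcards is attractive.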
According to still further features in the described preferred embodiments, the method further includes: (b) retrieving at least one entry from the radix-search tree.