The communications industry is rapidly changing to adjust to emerging technologies and ever increasing customer demand. This customer demand for new applications and increased performance of existing applications is driving communications network and system providers to employ networks and systems having greater speed and capacity (e.g., greater bandwidth). In trying to achieve these goals, a common approach taken by many communications providers is to use packet switching technology. Increasingly, public and private communications networks are being built and expanded using various packet technologies, such as Internet Protocol (IP).
A network device, such as a switch or router, typically receives, processes, and forwards or discards a packet based on one or more criteria, including the type of protocol used by the packet, addresses of the packet (e.g., source, destination, group), and type or quality of service requested. Additionally, one or more security operations are typically performed on each packet. But before these operations can be performed, a packet classification operation must typically be performed on the packet.
IP forwarding requires a longest matching prefix computation at wire speeds. The current IP version, IPv4, uses 32-bit destination addresses and a core Internet router can have over 200,000 prefixes. A prefix is typically denoted by a bit string followed by a ‘*’ (e.g., 01*) to indicate that the value of the trailing bits does not matter. For destination routing, each prefix entry in a routing table typically consists of a prefix and a next hop value. For example, suppose the database consists of only two prefix entries (01*->L1; 0100*->L2). If the router receives a packet with a destination address that starts with 01000, the address matches both the first prefix (01*) and the second prefix (0100*). Because the second prefix is the longest match, the packet should be sent to next hop L2. On the other hand, a packet with a destination address that starts with 01010 should be sent to next hop L1. The next hop information will typically specify an output port on the router and possibly a data link address.
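The two-entry example above can be sketched in a few lines of Python. This is only an illustrative linear scan, not a wire-speed technique; the function name longest_prefix_match and the representation of addresses and prefixes as strings of '0' and '1' characters are assumptions made for this sketch.

```python
def longest_prefix_match(table, address_bits):
    """Return the next hop of the longest prefix matching address_bits, or None.

    `table` maps a prefix (bit string without the trailing '*') to a next hop.
    This is a naive linear scan used only to illustrate the matching rule.
    """
    best_prefix = None
    for prefix in table:
        if address_bits.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    return None if best_prefix is None else table[best_prefix]


# The two prefix entries from the example: 01* -> L1 and 0100* -> L2.
routing_table = {"01": "L1", "0100": "L2"}

print(longest_prefix_match(routing_table, "01000"))  # matches both; longest is 0100*
print(longest_prefix_match(routing_table, "01010"))  # matches only 01*
```

A linear scan such as this touches every entry, which is why trie-based structures are used when the table holds hundreds of thousands of prefixes.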
FIG. 1A illustrates an example of a set of prefixes P1-P9 shown as nodes 1A-9A in table 10A and as nodes 1B-9B in unibit trie 10B. Also shown in unibit trie 10B are placeholder/vacant nodes 11B-18B, which represent non-matching nodes (i.e., nodes that are not possible results as a longest matching prefix). For example, a string of 1110000 matches prefixes P1 (1B), P2 (2B) and P5 (5B), with the longest matching prefix being P5 (5B).
One known approach is typically referred to as “tree bitmap”, described in Eatherton et al., “Data Structure Using a Tree Bitmap and Method for Rapid Classification of Data in a Database,” U.S. patent application Ser. No. 09/371,907, filed Aug. 10, 1999, which issued as U.S. Pat. No. 6,560,610 on May 6, 2003, with this application being hereby incorporated by reference in its entirety. Tree bitmap is a multibit trie algorithm that implements a representation of the trie by grouping nodes into sets of strides. A stride is typically defined as the number of tree levels of the binary trie that are grouped together or as the number of levels in a tree accessed in a single read operation representing multiple levels in a tree or trie. FIG. 1B illustrates one such partitioning of nodes P1-P9 (1B-9B) and vacant nodes 11B-18B (FIG. 1A) into strides 20-25. In this example, the stride is of size three.
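As a small illustration of what a stride means for a lookup key, the following sketch breaks an address into consecutive groups of stride bits. The helper name split_into_strides and the bit-string representation are assumptions for illustration only; a final group shorter than the stride is returned as-is.

```python
def split_into_strides(address_bits, stride):
    """Split a bit string into consecutive chunks of `stride` bits.

    The last chunk may be shorter when the address length is not a
    multiple of the stride.
    """
    return [address_bits[i:i + stride] for i in range(0, len(address_bits), stride)]


# The string 1110000 from the example, grouped with a stride of three.
print(split_into_strides("1110000", 3))
```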
In a known implementation of the tree bitmap algorithm, all child nodes of a given trie node are stored contiguously, which allows the use of just one pointer for all children (the pointer points to the start of the child node block), as the address of each child node can be calculated as an offset from the single pointer. This can reduce the number of required pointers and cut down the size of trie nodes.
In addition, there are two bit maps per trie node, one for all the internally stored prefixes and one for the external pointers. The internal bit map has a 1 bit set for every prefix stored within this node. Thus, for an r-bit trie node, there are (2^r)-1 possible prefixes of lengths less than r, and hence, a bit map of (2^r)-1 bits is used. The external bit map contains a bit for each of the 2^r possible child pointers. A trie node is of fixed size and only contains an external pointer bit map, an internal next hop information bit map, and a single pointer to the block of child nodes. The next hops associated with the internal prefixes are stored in a separate array associated with this trie node. For memory allocation purposes, result arrays are normally a whole multiple of the common node size (e.g., with 16-bit next hop pointers and 8-byte nodes, one result node is needed for up to four next hop pointers, two result nodes are needed for up to eight, etc.). Putting next hop pointers in a separate result array potentially requires two memory accesses per trie node (one for the trie node and one to fetch the result node for stored prefixes). A simple lazy strategy of not accessing the result nodes until the search terminates is typically used. The result node corresponding to the last trie node encountered in the path that contained a valid prefix is then accessed. This adds only a single memory reference at the end, in addition to the one memory reference required per trie node.
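The sizing arithmetic in the preceding paragraph can be checked with a short sketch. The function names are hypothetical, and the default values (16-bit next hop pointers, 8-byte nodes) are simply those of the example in the text.

```python
def internal_bitmap_bits(r):
    """Number of possible prefixes of length less than r: 2^0 + ... + 2^(r-1)."""
    return (1 << r) - 1


def external_bitmap_bits(r):
    """Number of possible child pointers of an r-bit trie node."""
    return 1 << r


def result_nodes_needed(num_next_hops, next_hop_bits=16, node_bytes=8):
    """Result nodes needed when results are allocated in whole node-sized units."""
    per_node = (node_bytes * 8) // next_hop_bits   # next hop pointers per result node
    return -(-num_next_hops // per_node)           # ceiling division


for r in (1, 2, 3, 4):
    print(r, internal_bitmap_bits(r), external_bitmap_bits(r))
print([result_nodes_needed(n) for n in (1, 4, 5, 8, 9)])
```

For a stride of three, this gives a 7-bit internal bit map and an 8-bit external bit map, which is the geometry assumed in the position examples of the search description.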
FIG. 1C illustrates one representation of a tree bitmap implementation of the prefix example shown in FIGS. 1A-B. As shown, root node 30 represents the first level trie. Child pointer 31 connects root node 30 to child array 40 containing the second level strides. In level 3, there are two child arrays 50 and 60, which are connected from child array 40 respectively by child pointers 41 and 42.
A longest prefix match is found by starting with the root node. The first bits of the destination address (corresponding to the stride of the root node, three in this example) are used to index into the external bit map at the root node at, say, position P. If a 1 is located in this position, then there is a valid child pointer. The number of 1's to the left of this 1, not including it (say I), is determined. Because the pointer to the start position of the child block (say C) and the size of each trie node (say S) are known, the pointer to the child node can be computed as C+(I*S).
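The child pointer computation C+(I*S) can be sketched as follows. The bit map is modeled as a string whose leftmost character is position 0, and the function name child_address, the example bit map, and the numeric values chosen for C and S are all assumptions for illustration.

```python
def child_address(external_bitmap, stride_bits, child_base, node_size):
    """Return the address of the child selected by stride_bits, or None.

    external_bitmap: string of '0'/'1' with 2^r characters, leftmost is position 0.
    child_base (C):  address of the start of the contiguous child block.
    node_size (S):   size of each trie node.
    """
    position = int(stride_bits, 2)                        # position P
    if external_bitmap[position] != "1":
        return None                                       # no valid child pointer
    ones_to_left = external_bitmap[:position].count("1")  # the count I
    return child_base + ones_to_left * node_size          # C + (I * S)


# Hypothetical node with children at positions 4, 5 and 7, C = 1000, S = 8.
bitmap = "00001101"
print(child_address(bitmap, "101", 1000, 8))  # second child in the block
print(child_address(bitmap, "000", 1000, 8))  # no child at this position
```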
Before moving on to the child, the internal bit map is checked to see if there is a stored prefix corresponding to position P. To do so, imagine successively removing bits of P starting from the right and indexing into the corresponding position of the internal bit map, looking for the first 1 encountered. For example, suppose P is 101 and a three bit stride is used at the root node bit map. The rightmost bit is first removed, which results in the prefix 10*. Because 10* corresponds to the sixth bit position in the internal bit map, a check is made to determine if there is a 1 in that position. If not, the rightmost two bits are removed (resulting in the prefix 1*). Because 1* corresponds to the third position in the internal bit map, a check is made to determine if a 1 is there. If a 1 is found there, then the search ends. If a 1 is not found there, then all three bits are removed and a search is performed for the entry corresponding to * in the first entry of the internal bit map.
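The positions quoted above (10* in the sixth position, 1* in the third, * in the first) are consistent with an internal bit map ordered by prefix length and then by value, i.e., *, 0*, 1*, 00*, 01*, 10*, 11*. Under that assumed ordering, the check can be sketched as below; the function names are hypothetical and positions are zero-based in the code, so the sixth position is index 5.

```python
def internal_position(prefix_bits):
    """Zero-based position of a prefix in the internal bit map.

    Prefixes of length L occupy positions (2^L - 1) .. (2^(L+1) - 2),
    ordered by value, so '' (i.e., *) -> 0, '1' -> 2, '10' -> 5.
    """
    length = len(prefix_bits)
    value = int(prefix_bits, 2) if prefix_bits else 0
    return (1 << length) - 1 + value


def longest_internal_match(internal_bitmap, stride_bits):
    """Strip bits from the right until a set bit is found; return its position."""
    for length in range(len(stride_bits) - 1, -1, -1):
        position = internal_position(stride_bits[:length])
        if internal_bitmap[position] == "1":
            return position
    return None


# P = 101 against a node that stores only * and 1*: 10* misses, 1* hits.
print(longest_internal_match("1010000", "101"))
```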
Once it has been determined that a matching stored prefix exists within a trie node, the information corresponding to the next hop from the result node associated with the trie node is not immediately retrieved. Rather, the number of 1 bits before the prefix position is counted to indicate its position in the result array. Accessing the result array would take an extra memory reference per trie node. Instead, the child node is examined while remembering the stored prefix position and the corresponding parent trie node. The intent is to remember the last trie node T in the search path that contained a stored prefix, and the corresponding prefix position. When the search terminates (i.e., a trie node with a 0 set in the corresponding position of the external bit map is encountered), the result array corresponding to T at the position already computed is accessed to read off the next hop information.
FIG. 1D illustrates pseudocode of one implementation of the full tree bitmap search. It assumes a function treeFunction that can find the position of the longest matching prefix, if any, within a given node by consulting the internal bitmap. “LongestMatch” keeps track of a pointer to the longest match seen so far. The loop terminates when there is no child pointer (i.e., no bit set in external bit map of a node) upon which the lazy access of the result node pointed to by LongestMatch is performed to get the final next hop. The pseudocode assumes that the address being searched is already broken into strides and stride[i] contains the bits corresponding to the ith stride.
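The following Python sketch follows the search as described in the text; it is not a transcription of FIG. 1D. The TrieNode layout, the helper internal_match (standing in for the role the text assigns to treeFunction), the string-based bit maps, and the assumption that every entry of strides carries a full stride of bits are all choices made for this illustration. The example data encodes the two-entry table 01*->L1, 0100*->L2 with a stride of three.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    internal: str                                  # (2^r - 1)-bit map of stored prefixes
    external: str                                  # 2^r-bit map of child pointers
    children: list = field(default_factory=list)   # contiguous child block
    results: list = field(default_factory=list)    # next hops, in bit map order


def internal_match(internal, bits):
    """Position of the longest stored prefix of `bits` in this node, or None."""
    for length in range(len(bits) - 1, -1, -1):
        value = int(bits[:length], 2) if length else 0
        position = (1 << length) - 1 + value
        if internal[position] == "1":
            return position
    return None


def search(root, strides):
    """Walk the trie; read the result array only once, after the walk ends."""
    node, longest = root, None           # longest = (node, index into its results)
    for bits in strides:
        position = internal_match(node.internal, bits)
        if position is not None:
            longest = (node, node.internal[:position].count("1"))
        child_pos = int(bits, 2)
        if node.external[child_pos] != "1":
            break                        # no child pointer: search terminates
        node = node.children[node.external[:child_pos].count("1")]
    if longest is None:
        return None
    match_node, index = longest
    return match_node.results[index]     # the single lazy result access


# 0100* lives in the child reached via stride 010, as prefix 0* (position 1).
child = TrieNode(internal="0100000", external="00000000", results=["L2"])
# 01* lives in the root at position 4; the only child is at external position 2.
root = TrieNode(internal="0000100", external="00100000",
                children=[child], results=["L1"])

print(search(root, ["010", "000"]))
print(search(root, ["010", "100"]))
```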
Keeping the stride constant, one method of reducing the size of each random access is to split the internal and external bitmaps, which is sometimes referred to as split tree bitmaps. This is done by placing only the external bitmap in each trie node. If there is no memory segmentation, the children trie nodes and the internal nodes from the same parent can be placed contiguously in memory. If memory segmentation exists, it is a bad design to have the internal nodes scattered across multiple memory banks. In the case of segmented memory, one option is for a trie node to have pointers to the child array, the internal node, and to the results array.
An alternative, as illustrated in FIG. 1E, has the trie node point at the internal node, and the internal node point at the results array. To make this optimization work, each child must have a bit indicating if the parent node contains a prefix that is a longest match so far. If there was a prefix in the path, the lookup engine records the location of the internal node (calculated from the data structure of the last node) as containing the longest matching prefix thus far. Then, when the search terminates, the corresponding internal node is accessed and then the results node corresponding to the internal node is accessed. Notice that the core algorithm accesses the next hop information lazily; the split tree algorithm accesses even the internal bit map lazily. What makes this work is that any time a prefix P is stored in a node X, all children of X that match P can store a bit saying that the parent has a stored prefix. The software reference implementation uses this optimization to save internal bit map processing; the hardware implementations use it only to reduce the access width size (because bit map processing is not an issue in hardware). A nice benefit of split tree bitmaps is that if a node contained only paths and no internal prefixes, a null internal node pointer can be used and no space will be wasted on the internal bitmap.
With this optimization, the external and internal bitmaps are split between the search node and the internal node, respectively. Splitting the bitmaps in this way results in reduced node size, which benefits hardware implementations. Each search node Sj has two pointers: one pointing to the children and the other to the internal node Ij. The internal node Ij maintains a pointer to the leaf array LAj of leaves corresponding to prefixes that belong to this node. For example, FIG. 1E illustrates search nodes S1 (111), S2 (112) and S3 (113), internal nodes I1 (121), I2 (115) and I3 (114), and leaf arrays LA1 (122), LA2 (116) and LA3 (123), and their interconnection by pointers. Additionally, leaf arrays LA1 (122), LA2 (116) and LA3 (123) respectively include leaf nodes L1 (122A), L2 (116A), and L3 (123A). Note that nodes illustrated in solid lines are the nodes accessed during a tree bitmap lookup example described hereinafter.
Now, consider the case where a lookup proceeds accessing search nodes S1 (111), S2 (112) and S3 (113). If the parent_has_match flag is set in S3 (113), this implies there is some prefix in one of the leaf nodes L2 (116A) in the leaf array LA2 (116) which is the current longest match. In this case, the address of internal node I2 (115) is saved in the lookup context. Now suppose that S3 (113) is not extending paths for this lookup. There could be some prefix in leaf array LA3 (123) which is the longest matching prefix. Hence I3 (114) is first accessed and its internal bitmap checked for a longest matching prefix. If no longest matching prefix is found, internal node I2 (115), whose address has been saved, is retrieved, its bitmap parsed, and leaf node L2 (116A) corresponding to the longest match is returned. The above access sequence is S1 (111), S2 (112), S3 (113), I3 (114), I2 (115), L2 (116A). This example shows that there are cases where two internal nodes need to be accessed and two internal bitmaps parsed before the longest match can be determined.
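The access sequence just described can be reproduced with a small model of a split tree bitmap lookup. This is a sketch under stated assumptions, not the lookup engine itself: the class names, the string bit maps, the stride of three, the particular prefixes placed in I1, I2 and I3, and the lookup key are all hypothetical, chosen only so that the walk visits S1, S2, S3, I3, I2 and L2 in that order. The sketch also assumes the root never has parent_has_match set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InternalNode:
    name: str
    bitmap: str      # internal bit map, ordered *, 0*, 1*, 00*, ...
    leaves: list     # leaf array, in bit map order


@dataclass
class SearchNode:
    name: str
    external: str
    children: list = field(default_factory=list)
    internal: Optional[InternalNode] = None
    parent_has_match: bool = False


def internal_match(bitmap, bits):
    """Position of the longest stored prefix of `bits`, or None."""
    for length in range(len(bits) - 1, -1, -1):
        position = (1 << length) - 1 + (int(bits[:length], 2) if length else 0)
        if bitmap[position] == "1":
            return position
    return None


def split_lookup(root, strides):
    """Return (leaf, trace); internal nodes are read only after the walk ends."""
    trace, saved, last = [], None, None
    node, parent, parent_bits = root, None, None
    for bits in strides:
        trace.append(node.name)
        if node.parent_has_match:                  # remember the parent's internal node
            saved = (parent.internal, parent_bits)
        last = (node.internal, bits)
        child_pos = int(bits, 2)
        if node.external[child_pos] != "1":
            break                                  # node is not extending the path
        parent, parent_bits = node, bits
        node = node.children[node.external[:child_pos].count("1")]
    for candidate in (last, saved):                # last node first, then saved one
        if candidate is None or candidate[0] is None:
            continue
        internal, bits = candidate
        trace.append(internal.name)
        position = internal_match(internal.bitmap, bits)
        if position is not None:
            leaf = internal.leaves[internal.bitmap[:position].count("1")]
            trace.append(leaf)
            return leaf, trace
    return None, trace


i1 = InternalNode("I1", "1000000", ["L1"])         # stores *
i2 = InternalNode("I2", "0100000", ["L2"])         # stores 0*
i3 = InternalNode("I3", "0000001", ["L3"])         # stores 11*
s3 = SearchNode("S3", "00000000", [], i3, parent_has_match=True)
s2 = SearchNode("S2", "10000000", [s3], i2, parent_has_match=True)
s1 = SearchNode("S1", "00000001", [s2], i1)

print(split_lookup(s1, ["111", "000", "101"]))
```

In this model the second internal node access occurs only when the internal node of the last search node yields no match, which is the case the text identifies as requiring two internal nodes to be read.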
In hardware implementations, the memory access speeds are generally the bottleneck as opposed to node processing time. A typical implementation of a hardware based tree bitmap lookup engine uses multiple memory channels to store the tree bitmap data structure. In this case the tree bitmap nodes are spread out across the memory channels in such a way that per lookup, successive nodes accessed fall in different memory channels. If a single memory channel can sustain ‘x’ accesses per second, then with multiple lookups in progress simultaneously, ‘x’ lookups per second on average can be achieved provided each memory channel is accessed at most once per lookup. If any of the channels is accessed twice per lookup, then the packet forwarding rate drops by half because that particular channel becomes the bottleneck.
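The throughput argument can be expressed as a one-line model: the sustained lookup rate is the per-channel access rate divided by the largest number of accesses any single channel receives per lookup. The function name lookup_rate and the figure of 100 million accesses per second are assumptions for illustration only.

```python
from collections import Counter


def lookup_rate(channel_rate, channels_accessed):
    """Sustained lookups per second for a given per-lookup channel access pattern.

    channel_rate:      accesses per second one memory channel can sustain ('x').
    channels_accessed: the channel used by each node access of a single lookup.
    """
    busiest = max(Counter(channels_accessed).values())  # accesses on the busiest channel
    return channel_rate / busiest


x = 100_000_000  # hypothetical per-channel access rate
print(lookup_rate(x, [0, 1, 2, 3]))  # every channel touched once: x lookups/s
print(lookup_rate(x, [0, 1, 2, 1]))  # channel 1 touched twice: x/2 lookups/s
```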
Therefore, all the internal nodes along any path from the root to the bottom of the tree need to be stored in different memory channels. Accessing two internal nodes presents a problem when there are a limited number of memory channels, as both internal nodes need to be placed in different memory channels, and which two internal nodes are going to be accessed depends on the particular tree bitmap and the particular lookup value. Referring to FIG. 1E, for example, the internal nodes accessed could be I3 (114) and I2 (115), or I3 (114) and I1 (121), or I2 (115) and I1 (121). Therefore, in this example, all seven nodes S1 (111), S2 (112), S3 (113), I1 (121), I2 (115), I3 (114), and L2 (116) need to be in separate memory modules. This is problematic when there are fewer than seven memory modules. Needed are new methods and apparatus for storing and retrieving elements of a tree bitmap and other data structures.