The communications industry is rapidly changing to adjust to emerging technologies and ever increasing customer demand. This customer demand for new applications and increased performance of existing applications is driving communications network and system providers to employ networks and systems having greater speed and capacity (e.g., greater bandwidth). In trying to achieve these goals, a common approach taken by many communications providers is to use packet switching technology. Increasingly, public and private communications networks are being built and expanded using various packet technologies, such as Internet Protocol (IP). Note, nothing described or referenced in this document is admitted as prior art to this application unless explicitly so stated.
A network device, such as a switch or router, typically receives, processes, and forwards or discards a packet based on one or more criteria, including the type of protocol used by the packet, addresses of the packet (e.g., source, destination, group), and type or quality of service requested. Additionally, one or more security operations are typically performed on each packet. But before these operations can be performed, a packet classification operation must typically be performed on the packet.
IP forwarding requires a longest matching prefix computation at wire speeds. The current IP version, IPv4, uses 32-bit destination addresses, and a core Internet router can have over 200,000 prefixes. A prefix is typically denoted by a bit string followed by a ‘*’ (e.g., 01*), which indicates that the value of the trailing bits does not matter. For destination routing, each prefix entry in a routing table typically consists of a prefix and a next hop value. For example, suppose the database consists of only two prefix entries (01*-->L1; 0100*-->L2). If the router receives a packet with a destination address that starts with 01000, the address matches both the first prefix (01*) and the second prefix (0100*). Because the second prefix is the longest match, the packet should be sent to next hop L2. On the other hand, a packet with a destination address that starts with 01010 should be sent to next hop L1. The next hop information will typically specify an output port on the router and possibly a data link address.
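The two-entry example above can be checked with a minimal sketch. The function name and the use of bit strings and a dictionary are assumptions made only for illustration; a linear scan like this is not how a router would perform the lookup at wire speed.

```python
def longest_prefix_match(table, address_bits):
    """Return the next hop of the longest prefix in `table` that matches.

    `table` maps prefix bit strings (without the trailing '*') to next hops.
    Returns None when no prefix matches.
    """
    best_len, best_hop = -1, None
    for prefix, next_hop in table.items():
        if address_bits.startswith(prefix) and len(prefix) > best_len:
            best_len, best_hop = len(prefix), next_hop
    return best_hop


# The database from the text: 01* --> L1 and 0100* --> L2.
table = {"01": "L1", "0100": "L2"}
print(longest_prefix_match(table, "01000"))  # L2
print(longest_prefix_match(table, "01010"))  # L1
```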
FIG. 1A illustrates an example of a set of prefixes P1-9 shown as nodes 1A-9A in table 10A and as nodes 1B-9B in unibit trie 10B. Also shown in unibit trie 10B are placeholder/vacant nodes 11B-18B, which represent non-matching nodes (i.e., nodes that are not possible results as a longest matching prefix). For example, a string of 1110000 matches prefixes P1 (1B), P2 (2B) and P5 (5B), with the longest matching prefix being P5 (5B).
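A unibit trie of this kind might be sketched as follows. The class and function names are hypothetical, and the three prefix values used here (the empty prefix for P1, 1* for P2, and 111* for P5) are assumed only so that the string 1110000 matches P1, P2 and P5 as stated; the actual values appear in table 10A, which is not reproduced in this text.

```python
class TrieNode:
    """One unibit trie node; next_hop is None for a placeholder/vacant node."""

    def __init__(self):
        self.children = {}    # maps "0" or "1" to a child TrieNode
        self.next_hop = None


def insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
    node.next_hop = next_hop


def lookup(root, address_bits):
    """Walk one bit at a time, remembering the last non-vacant node seen."""
    node, best = root, root.next_hop
    for bit in address_bits:
        node = node.children.get(bit)
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best


root = TrieNode()
# Assumed prefix values, chosen only to be consistent with the example string.
for prefix, name in [("", "P1"), ("1", "P2"), ("111", "P5")]:
    insert(root, prefix, name)

print(lookup(root, "1110000"))  # P5
```

In this sketch the node reached by the bits 11 holds no prefix, so it plays the role of a vacant node: the search passes through it without updating the best match.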
One known approach is typically referred to as “tree bitmap”, described in Eatherton et al., “Data Structure Using a Tree Bitmap and Method for Rapid Classification of Data in a Database,” U.S. Pat. No. 6,560,610, issued May 6, 2003, which is hereby incorporated by reference. Tree bitmap is a multibit trie algorithm that implements a representation of the trie by grouping nodes into sets of strides. A stride is typically defined as the number of tree levels of the binary trie that are grouped together or as the number of levels in a tree accessed in a single read operation representing multiple levels in a tree or trie. FIG. 1B illustrates one such partitioning of nodes P1-P9 (1B-9B) and vacant nodes 11B-18B (FIG. 1A) into strides 20-25. In this example, the stride is of size three.
In a known implementation of the tree bitmap algorithm, all child nodes of a given trie node are stored contiguously, which allows the use of just one pointer for all children (the pointer points to the start of the child node block), as each child node can be calculated as an offset from the single pointer. This can reduce the number of required pointers and cuts down the size of trie nodes.
In addition, there are two bit maps per trie node, one for all the internally stored prefixes and one for the external pointers. The internal bit map has a 1 bit set for every prefix stored within this node. Thus, for an r-bit trie node, there are (2^r)−1 possible prefixes of lengths less than r, and hence, a ((2^r)−1)-bit map is used. The external bit map contains a bit for all possible 2^r child pointers. A trie node is of fixed size and only contains an external pointer bit map, an internal next hop information bit map, and a single pointer to the block of child nodes. The next hops associated with the internal prefixes are stored within each trie node in a separate array associated with this trie node. For memory allocation purposes, result (e.g., leaf) arrays are normally an even multiple of the common node size (e.g., with 16-bit next hop pointers, and 8-byte nodes, one result node is needed for up to four next hop pointers, two result nodes are needed for up to 8, etc.) Putting next hop pointers in a separate result array potentially requires two memory accesses per trie node (one for the trie node and one to fetch the result node for stored prefixes). A simple lazy strategy to not access the result nodes till the search terminates is typically used. The result node corresponding to the last trie node encountered in the path that contained a valid prefix is then accessed. This adds only a single memory reference at the end besides the one memory reference required per trie node.
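The sizing arithmetic in the preceding paragraph can be worked through with a short sketch. The function names and default parameters are illustrative assumptions; the defaults simply mirror the 16-bit next hop pointer and 8-byte node example.

```python
import math


def bitmap_sizes(r):
    """Return (internal_bits, external_bits) for an r-bit stride trie node."""
    internal_bits = 2 ** r - 1   # one bit per possible prefix of length < r
    external_bits = 2 ** r       # one bit per possible child pointer
    return internal_bits, external_bits


def result_nodes_needed(num_next_hops, pointer_bits=16, node_bytes=8):
    """Number of node-sized result slots needed to hold the next hop pointers."""
    pointers_per_node = (node_bytes * 8) // pointer_bits
    return math.ceil(num_next_hops / pointers_per_node)


print(bitmap_sizes(3))          # (7, 8)
print(result_nodes_needed(4))   # 1
print(result_nodes_needed(8))   # 2
```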
FIG. 1C illustrates one representation of a tree bitmap implementation of the prefix example shown in FIGS. 1A-B. As shown, root node 30 represents the first level trie. Child pointer 31 connects root node 30 to child array 40 containing the second level strides. In level 3, there are two child arrays 50 and 60, which are connected from child array 40 respectively by child pointers 41 and 42.
A longest prefix match is found by starting with the root node. The first bits of the destination address (corresponding to the stride of the root node, three in this example) are used to index into the external bit map at the root node at, say, position P. If a 1 is located in this position, then there is a valid child pointer. The number of 1's to the left of, and not including, this 1 (say I) is determined. Because the pointer to the start position of the child block (say C) and the size of each trie node (say S) are known, the pointer to the child node can be computed as C+(I*S).
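The child pointer computation C+(I*S) might be sketched as follows. Representing the external bit map as a string with its leftmost bit at index 0, as well as the bit map contents, base address, and node size in the example, are assumptions made for illustration.

```python
def child_address(external_bitmap, stride_bits, child_base, node_size):
    """Compute the child node address, or None if no child exists.

    external_bitmap: string of '0'/'1', leftmost bit is position 0.
    stride_bits:     the address bits consumed at this node, e.g. "101".
    child_base:      C, the start of the contiguous child block.
    node_size:       S, the fixed size of a trie node.
    """
    p = int(stride_bits, 2)                  # position P in the external bit map
    if external_bitmap[p] != "1":
        return None                          # no valid child pointer
    i = external_bitmap[:p].count("1")       # I: 1's strictly to the left of P
    return child_base + i * node_size        # C + (I * S)


# Hypothetical 8-bit external bit map for a stride of three.
bitmap = "00001101"
print(child_address(bitmap, "101", 1000, 8))  # 1008
print(child_address(bitmap, "000", 1000, 8))  # None
```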
Before moving on to the child, the internal bit map is checked to see if there is a stored prefix corresponding to position P. To do so, imagine successively removing bits of P starting from the right and indexing into the corresponding position of the internal bit map, looking for the first 1 encountered. For example, suppose P is 101 and a three bit stride is used at the root node bit map. The right most bit is first removed, which results in the prefix 10*. Because 10* corresponds to the sixth bit position in the internal bit map, a check is made to determine if there is a 1 in that position. If not, the right most two bits are removed (resulting in the prefix 1*). Because 1* corresponds to the third position in the internal bit map, a check is made to determine if a 1 is there. If a 1 is found there, then the search ends. If a 1 is not found there, then all three bits are removed and a search is performed for the entry corresponding to * in the first entry of the internal bit map.
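The internal bit map search can be sketched as below. The position formula (a prefix of length k with value v sits at 1-based position 2^k + v) is inferred from the positions given in the text (10* at the sixth position, 1* at the third, * at the first); the function names and the example bit map are hypothetical.

```python
def internal_position(prefix_bits):
    """1-based position of a prefix in the internal bit map."""
    value = int(prefix_bits, 2) if prefix_bits else 0
    return (1 << len(prefix_bits)) + value


def longest_internal_match(internal_bitmap, stride_bits):
    """Strip bits from the right until a stored prefix is found.

    Returns (prefix_bits, position), or None if the node stores no match.
    """
    for length in range(len(stride_bits) - 1, -1, -1):
        prefix = stride_bits[:length]
        pos = internal_position(prefix)
        if internal_bitmap[pos - 1] == "1":
            return prefix, pos
    return None


print(internal_position("10"))                     # 6
print(internal_position("1"))                      # 3
# Hypothetical node storing only * and 1* (positions 1 and 3 set).
print(longest_internal_match("1010000", "101"))    # ('1', 3)
```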
Once it has been determined that a matching stored prefix exists within a trie node, the information corresponding to the next hop from the result node associated with the trie node is not immediately retrieved. Rather, the number of 1 bits before the prefix position is counted to indicate its position in the result array. Accessing the result array would take an extra memory reference per trie node. Instead, the child node is examined while remembering the stored prefix position and the corresponding parent trie node. The intent is to remember the last trie node T in the search path that contained a stored prefix, and the corresponding prefix position. When the search terminates (i.e., a trie node with a 0 set in the corresponding position of the external bit map is encountered), the result array corresponding to T at the position already computed is accessed to read off the next hop information.
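Putting the steps together, the lazy lookup might look like the following sketch. The dictionary-based node layout, the field names, and the two-node trie (built from the 01*-->L1 and 0100*-->L2 example with a stride of three) are assumptions for illustration only; an actual implementation would use packed fixed-size nodes in memory.

```python
STRIDE = 3  # assumed stride, matching the three-bit example


def internal_match_position(internal, chunk):
    """1-based internal bit map position of the longest stored prefix, or None."""
    for length in range(min(len(chunk), STRIDE - 1), -1, -1):
        pos = (1 << length) + (int(chunk[:length], 2) if length else 0)
        if internal[pos - 1] == "1":
            return pos
    return None


def lookup(root, address_bits):
    node, offset = root, 0
    last_node, last_pos = None, None          # T and its prefix position
    while node is not None:
        chunk = address_bits[offset:offset + STRIDE]
        pos = internal_match_position(node["internal"], chunk)
        if pos is not None:
            last_node, last_pos = node, pos   # remember, but do not fetch yet
        child = None
        if len(chunk) == STRIDE:
            p = int(chunk, 2)
            if node["external"][p] == "1":
                i = node["external"][:p].count("1")
                child = node["children"][i]   # stands in for C + (I * S)
        node, offset = child, offset + STRIDE
    if last_node is None:
        return None
    # Single result-array access at the end: count 1 bits before the position.
    index = last_node["internal"][:last_pos - 1].count("1")
    return last_node["results"][index]


# Child node reached via 010: stores the remainder 0* of prefix 0100*.
child = {"internal": "0100000", "external": "00000000",
         "children": [], "results": ["L2"]}
# Root node: stores 01* (position 5) and has one child at position 010.
root = {"internal": "0000100", "external": "00100000",
        "children": [child], "results": ["L1"]}

print(lookup(root, "01000"))  # L2
print(lookup(root, "01010"))  # L1
```

Note that the result array is indexed only once, after the walk ends, which is the single extra memory reference described above.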
In hardware implementations, the memory access speeds are generally the bottleneck as opposed to node processing time. A typical implementation of a hardware based tree bitmap lookup engine uses multiple memory channels to store the tree bitmap data structure. In this case the tree bitmap nodes are spread out across the memory channels in such a way that per lookup, successive nodes accessed fall in different memory channels. If a single memory channel can sustain ‘x’ accesses per second, then with multiple lookups in progress simultaneously, ‘x’ lookups per second on average can be achieved provided each memory channel is accessed at most once per lookup. If any of the channels is accessed twice per lookup, then the packet forwarding rate drops by half because that particular channel becomes the bottleneck.
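The throughput reasoning can be expressed as a small calculation. This is a simplified model assuming that the most heavily used channel alone limits the rate; the function name and the figure of 100 million accesses per second are hypothetical.

```python
def lookups_per_second(channel_rate, accesses_per_channel):
    """Sustained lookup rate when the busiest memory channel is the bottleneck.

    channel_rate:         'x', accesses per second one channel can sustain.
    accesses_per_channel: accesses each channel receives per lookup.
    """
    return channel_rate / max(accesses_per_channel)


x = 100_000_000  # hypothetical channel rate
balanced = lookups_per_second(x, [1, 1, 1, 1])
skewed = lookups_per_second(x, [1, 2, 1, 1])   # one channel hit twice
print(skewed / balanced)  # 0.5
```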
Another known approach for performing lookup operations is described in Wilkinson, III et al., U.S. Pat. No. 5,781,772, issued Jul. 14, 1998, which is hereby incorporated by reference. Wilkinson, III et al. describes a previous system in which each node has an array of n pointers, wherein n is the number of possible next values that can occur in an input string. Additionally, Wilkinson, III et al. describes uncompressed and compressed routing data structures.
In another known prior approach, sometimes referred to as “mtree” or “mtrie”, the next child node of a parent node during a lookup operation is determined by an offset value corresponding to a next bit or stride of a lookup value from a common base pointer. Thus, if the next value is 138, the child node is located at base pointer + 138 × the size of a node. For a 16-bit stride, this requires each parent node to have a unique, non-overlapping memory block of 64K entries (i.e., nodes) × the size of a node. This typically wastes a lot of space as these memory blocks are often sparsely populated. Moreover, each entry must be populated with the value of a node or an indication that no child node exists.
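The offset arithmetic and the per-parent memory cost can be illustrated with a brief sketch. The function names, the base pointer, and the 8-byte node size are assumed values, not taken from any particular mtrie implementation.

```python
def mtrie_child_address(base_pointer, next_value, node_size):
    """Child location: base pointer + next value x the size of a node."""
    return base_pointer + next_value * node_size


def mtrie_block_bytes(stride_bits, node_size):
    """Memory reserved per parent node: 2^stride entries x the size of a node."""
    return (1 << stride_bits) * node_size


print(mtrie_child_address(0x1000, 138, 8))  # 5200
print(mtrie_block_bytes(16, 8))             # 524288
```

With these assumed sizes, every parent node reserves 512 KB for its children regardless of how many of the 64K entries are actually in use.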
The approaches previously described perform searches based on a Patricia tree representation of the prefix space. A different approach for identifying a longest prefix match is described in Lampson et al., “IP Lookups Using Multiway and Multicolumn Search,” IEEE/ACM Transactions on Networking, Vol. 7, No. 3, June 1999, which is hereby incorporated by reference. Lampson et al. treats a prefix as a range and encodes it using the start and end of range, and describes techniques for performing binary and multiway searches to identify a longest prefix match for an input IP address.
FIG. 1D visually illustrates a set of prefixes 100, which include prefixes 101-106. As shown, each longer matching prefix lies within the boundaries identified by the shorter prefixes that contain it. For example, prefix 105 is longer than prefix 104, which is longer than prefix 103, which is longer than prefix 101. For example, prefix 101 could represent the IP address of 10.*.*.*, prefix 103 could represent 10.3.*.*, and so on. The space of possible input values is divided into matching ranges 110 based on prefixes 101-106, with the individual matching ranges 111 corresponding to prefix 101, range 112 corresponding to prefix 102, ranges 113 corresponding to prefix 103, ranges 114 corresponding to prefix 104, range 115 corresponding to prefix 105, and range 116 corresponding to prefix 106. Note, there are two ranges 113 for prefix 103 because there are longer matching prefixes 104, 105, 106. Thus, finding a range 111-116 matching an input value identifies the corresponding longest matching prefix 101-106. As such, the mapping of the ranges to their respective longest matching prefix can be pre-computed. However, a problem with this technique, as acknowledged within Lampson et al., is that as presented, it does not provide an efficient mechanism to update the search space, which is a major obstacle for use in a network, as routing tables are typically continuously updated.
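The idea of pre-computing ranges and then searching them can be sketched as follows. This is a simplified illustration of the general range-based idea, not the specific encoding of Lampson et al.; the 8-bit address width, the function names, and the reuse of the 01* and 0100* prefixes are assumptions.

```python
from bisect import bisect_right

WIDTH = 8  # assumed address width, kept small for readability


def prefix_range(prefix):
    """Inclusive (start, end) of the addresses covered by a prefix bit string."""
    pad = WIDTH - len(prefix)
    start = (int(prefix, 2) << pad) if prefix else 0
    return start, start + (1 << pad) - 1


def build_ranges(prefixes):
    """Pre-compute range start points and the longest prefix owning each range."""
    points = {0}
    for p in prefixes:
        lo, hi = prefix_range(p)
        points.add(lo)
        if hi + 1 < (1 << WIDTH):
            points.add(hi + 1)
    starts = sorted(points)
    owners = []
    for s in starts:
        best = None
        for p in prefixes:
            lo, hi = prefix_range(p)
            if lo <= s <= hi and (best is None or len(p) > len(best)):
                best = p
        owners.append(prefixes[best] if best is not None else None)
    return starts, owners


def range_lookup(starts, owners, address):
    """Binary search for the range containing the address."""
    return owners[bisect_right(starts, address) - 1]


starts, owners = build_ranges({"01": "L1", "0100": "L2"})
print(starts)                                       # [0, 64, 80, 128]
print(range_lookup(starts, owners, 0b01000111))     # L2
print(range_lookup(starts, owners, 0b01010000))     # L1
```

In this sketch, adding or removing a prefix means rebuilding the tables from scratch, which reflects the update difficulty noted above.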