The Internet is becoming ubiquitous: everyone wants to join in. Statistics show that the number of computers on the Internet is tripling approximately every two years. Traffic on the Internet is also increasing exponentially. The traffic increase can be traced not only to the increased number of hosts, but also to new applications (e.g., the Web, video conferencing, remote imaging) which have higher bandwidth needs than traditional applications. One can only expect further increases in users, computers, and traffic. The possibility of a global Internet with multiple addresses per user (e.g., each user may have several appliances on the Internet) has necessitated a transition from the older Internet routing protocol (called IPv4, with small 32 bit addresses) to the proposed next generation protocol (called IPv6, with much larger 128 bit addresses).
The increasing traffic demand placed on the network forces two key factors to keep pace: first, the speed of communication links; and second, the rate at which routers (routers are boxes that route messages in the Internet, very much like automated Post Offices in the postal network) can forward messages. With the advent of fiber optic links, it is easily and economically possible to solve the first problem. For example, MCI is currently upgrading its Internet backbone links from 45 Mbits/s to 155 Mbits/s, and they plan to switch to 622 Mbits/s within a year. However, improving the speed of communication links is insufficient unless router forwarding speeds increase proportionately.
Today's fastest routers (built by say CISCO systems) forward messages at a maximum rate of 100,000 to 500,000 messages a second. However, communication link speeds are already reaching speeds of 1 Gigabit/sec (1 Gigabit=1000 million bits per second). A router has to forward 5 million messages (of average size say 200 bits) per second to keep up with the speed of a Gigabit link. With the popularity of the Internet and the larger traffic volumes expected, every router vendor (CISCO, BAY NETWORKS, etc.) wants to increase the forwarding performance of its routers.
The major problem that routers face in forwarding an Internet message is something known as address lookup. To understand this, we must first have an intuitive idea of what a router does. Consider a hypothetical fragment of the Internet linking users in Europe with users in the United States. Consider a source user (see label called Source in the left of FIG. 1) in Paris. If this user wishes to send say an email message to San Francisco, the user will send its message to a router R1 which is, say, in Paris. The Paris router may send this message on the communication link L4 to router R, say in London. The London Router R may then send the message on line L2 to router R3 in San Francisco; R3 then sends the message to the destination.
Notice how a message travels from source to destination alternating between communication links and routers. This is almost identical to the way a postal letter travels from post office to post office using some communication channel (e.g., an airplane). How does each post office decide where to forward the letter? Each post office does so, using the destination address that is placed on the envelope containing the letter. In the same way, routers must decide to forward a message based on a destination address that is placed in an easily accessible portion of the message called a header.
With this context, we can understand how a router forwards an incoming message. Consider the router R in FIG. 1. We show a schematic description of router R in FIG. 2. When a message arrives on say link L4, the message carries its destination address San Francisco in its message header. Router R is a special computer whose job is to forward all messages that are sent to it towards their final destinations. To do so, router R consults a Forwarding Table (sometimes also called a Forwarding Database). This is a table in the memory of R, which lists each possible destination and the corresponding output link. Please do verify that the Forwarding Table contents are consistent with FIG. 1.
Thus when a message to San Francisco arrives on link L4, router R looks up the destination address San Francisco in its forwarding table. Since the table says L2, the router then switches the entire message to the output link L2. It then proceeds to service the next arriving message. Notice that so far the word "lookup" is no different from looking up a word in a dictionary or a phone number in the phone book. We will show it is a lot harder than dictionary or phone book lookup shortly.
Thus the two main functions of a router are to look up the destination address (address lookup) and then to send the packet to the right output link (message switching). To be more precise, there are some additional chores such as incrementing a visit count in a message; but these chores are fairly trivial compared to lookup and switching. Both lookup and switching must be done at very high speeds. Fortunately, the problem of message switching has become very well understood in recent years because of advances in ATM Switching Technology. Economical gigabit message switching is quite feasible today because of the work of co-inventors Jon Turner and others. (Thus one can imagine a router as having an ATM core to switch packets.)
We have already seen that of the two main functions of a router, message switching is a solved problem and already available in many commercial products. Despite this, the problem of doing address lookups at Gigabit speeds remains. Current vendor speeds for lookups are quite slow. For example, Ascend's product has hardware assistance for lookups and can take up to 3 μs for a single lookup in the worst case and 1 μs on average. Our invention, on the other hand, gives ten times faster address lookup performance (lookups in around 0.1 μs).
Before we describe how our invention works, it is important to understand why Internet address lookup is hard. It is hard for two reasons. First, Internet addresses are not specially created (like ATM addresses) to be easy to look up. Second, the Internet deals with scaling issues by using address prefixes, which require a more complex lookup. We describe details below.
First, looking up Internet addresses is a lot harder than, say, looking up ATM addresses. ATM addresses (VCs) are carefully chosen to be simple to look up in switch tables. Unfortunately, ATM addresses must be set up for each conversation, which adds delay; by contrast, Internet addresses (like postal addresses) are relatively fixed and there is no additional set up delay per conversation. Secondly, ATM addresses do not currently make much provision for hierarchical networks and so are perceived not to be scalable to truly global networks. IP, through the use of prefixes (see below), has provision for scaling. Thus for various reasons, some technical and some political, the Internet and ATM seem to be each going their own way. In the future, they are likely to coexist with ATM backbones and ATM LANs in the Internet. Thus i) IP address lookup is a lot harder and ii) the Internet is unlikely, if at all, to change completely to ATM.
The second thing to realize is that the Internet lookup problem is a lot harder than looking up a phone number in a phone book, or a word in a dictionary. In those problems, we can search quite fast by first sorting all the words or names. Once sorted, if we are looking for a word starting with Sea, we simply go to the pages of S entries and then look for words starting with Sea etc. Clearly, such lookup is a lot faster than looking up all entries in a dictionary. In fact, such lookup is called exact matching lookup; standard solutions based on hashing and binary search provide very fast times for exact matching.
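As a minimal illustration of exact matching, the following Python sketch looks up a word both by binary search over a sorted list and by hashing. The word list and the function name `contains_sorted` are assumptions made only for this example.

```python
import bisect

# Illustrative word list (an assumption for this sketch), kept sorted
# so that binary search applies.
WORDS = sorted(["Salt", "Sea", "Seal", "Search", "Sun"])

# A hash-based structure holding the same words.
WORD_SET = set(WORDS)


def contains_sorted(sorted_words, target):
    """Exact match by binary search: about log2(n) comparisons."""
    i = bisect.bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target


def contains_hashed(word_set, target):
    """Exact match by hashing: expected constant time."""
    return target in word_set
```

Both functions answer only the question "is this exact key present?". Neither one, as written, can report the longest stored entry that is a prefix of the key, which is the question a router has to answer.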
The Internet lookup problem is a lot harder than dictionary search because Internet routers store address prefixes in their forwarding tables to reduce the size of their tables. However, the use of such address prefixes makes the lookup problem one of longest matching prefix instead of exact matching. The longest matching prefix problem is a lot harder. Before we explain why, let us digress briefly and explain why routers store prefixes in their tables.
Consider FIG. 3. The situation is similar to that in FIG. 1. However, we show the geographic significance of the addresses more clearly. Router R has link L1 to get to Boston as before, but Boston is also the "hub" for the whole of the U.S. Assume that we can get to any destination in the U.S. from a hub router in Boston. As before, line L3 leads to California, from where a message can be sent directly to any location in California. Finally, as before, link L2 leads directly to San Francisco.
If we were to use the naive database in FIG. 2, we would have to list every destination in the U.S. (possibly thousands) in the database. For example, we might list Denver, Kansas, and other cities as being reachable through Boston on link L1. This would lead to an enormously large table in router R, which would be difficult to store and maintain.
Instead, we prefer to store prefixes in the modified database of FIG. 4. Notice that we now store all the destinations such as Denver and Kansas by the single entry USA.* (anything in the USA). We store California as USA.CA.* (anything in California), and San Francisco as USA.CA.SF. Thus we have used only three entries to store the same amount of information. Of course, to make this work we have to modify the destination address in a message from, say, SanFrancisco (see FIG. 2) to, say, USA.CA.SF. But this is easily done.
The use of prefixes introduces a new dimension to the lookup problem: multiple prefixes may match a given address. If a packet matches multiple prefixes, it is intuitive that the packet should be forwarded according to the most specific prefix, i.e., the longest prefix match. Thus a packet addressed to USA.CA.SF matches the USA.*, USA.CA.*, and USA.CA.SF entries. Intuitively, it should be sent to L2, corresponding to the most specific match USA.CA.SF. This is because (see FIG. 3) we have a direct line to San Francisco and want to use it in place of possibly longer routing through Boston. Similarly, a packet addressed to USA.CA.LA matches the USA.* and USA.CA.* entries. Intuitively, it should be sent to L3, corresponding to the most specific match USA.CA.*.
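The rule of forwarding on the most specific match can be sketched in a few lines of Python. The three entries and their links are taken from the discussion of FIG. 4; the function names `matches`, `specificity`, and `lookup` are hypothetical and chosen only for this illustration.

```python
# Prefix entries and output links as described for FIG. 4.
FORWARDING_TABLE = {
    "USA.*": "L1",       # anything in the USA, via Boston
    "USA.CA.*": "L3",    # anything in California
    "USA.CA.SF": "L2",   # San Francisco, direct line
}


def matches(prefix, address):
    """True if the dotted address is covered by the table entry."""
    if prefix.endswith(".*"):
        stem = prefix[:-2].split(".")
        return address.split(".")[:len(stem)] == stem
    return prefix == address


def specificity(prefix):
    """Number of concrete labels in the entry (the wildcard is not counted)."""
    return len([label for label in prefix.split(".") if label != "*"])


def lookup(address):
    """Return the output link of the most specific matching entry, or None."""
    candidates = [p for p in FORWARDING_TABLE if matches(p, address)]
    if not candidates:
        return None
    return FORWARDING_TABLE[max(candidates, key=specificity)]
```

With this table, `lookup("USA.CA.SF")` selects L2 and `lookup("USA.CA.LA")` selects L3, in agreement with the reasoning above. The sketch scans every entry, so it illustrates the matching rule only and not an efficient lookup method.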
In summary, routers obtain massive savings in table size by summarizing several address entries by using a single prefix entry. Unfortunately, this leads to possibly multiple prefixes matching a given address, with the result that routers must solve a harder problem called best matching prefix.
With this interlude behind us, we can define the Internet address lookup problem. First, Internet addresses are strings of bits, not words using English characters, as we used above for the sake of illustration. A bit is either a 0 or 1. A bit string is a sequence of bits like 0101. The length of a bit string is the number of bits it contains. Thus the length of bit string 0101 is 4. Internet addresses come in two flavors. The current Internet (sometimes called IPv4, for Internet Protocol, version 4) uses addresses that are bit strings of length 32. We often say that IPv4 uses 32 bit addresses. The Internet is expected to evolve to a next generation Internet (sometimes called IPv6, for Internet Protocol, version 6) which uses 128 bit addresses. As we will see, the longer length of IPv6 addresses will only compound the problems of routers.
Except for this minor difference (bit strings instead of character strings), the Internet lookup problem is exactly the best matching prefix problem described above. To make things more concrete, consider the forwarding table of Internet address prefixes shown in FIG. 5. We will use this table, with minor variations, for all the examples herein.
Except for the fact that we use bit strings (and we have labeled the prefixes for convenience), the situation is identical to the table in FIG. 4.
Now suppose we have a 32 bit IPv4 destination address whose first 6 bits are 10101. Clearly its best matching prefix is Prefix P4 though it also matches Prefix P3 and P2. Thus any message to such a destination address should be sent to the output link corresponding to P4, which is L2.
The naivest method to solve the best matching prefix problem is to scan the entire forwarding table looking for the best matching prefix of an address. This would be grossly inefficient for large tables.
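A minimal Python sketch of this naive scan follows. FIG. 5 is not reproduced in the text, so the prefix table below is a stand-in: the values of P1, P3, P4, and P6 are inferred from the worked examples elsewhere in this description, while the bit patterns of P5 and P8 are assumptions made only so that the example runs.

```python
# Stand-in prefix table (bit strings without the trailing '*').
PREFIXES = {
    "P1": "10",
    "P3": "11001",
    "P4": "1",
    "P5": "0",         # assumed: described only as the sole prefix starting with 0
    "P6": "100000",
    "P8": "1000000",   # assumed: described only as having length 7
}


def best_matching_prefix(address_bits):
    """Scan every entry and keep the longest prefix that matches the address."""
    best = None
    for name, bits in PREFIXES.items():
        if address_bits.startswith(bits):
            if best is None or len(bits) > len(PREFIXES[best]):
                best = name
    return best
```

Every lookup touches every table entry, so the cost grows linearly with the size of the forwarding table.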
We now describe two standard solutions that attempt to solve the IP best matching prefix problem. The first solution is based on converting the best matching prefix problem into an exact match problem. The second solution is based on using a data structure called a trie. We will see that both solutions examine a destination address one bit at a time, and hence can take up to 32 steps for IPv4 (and 128 for IPv6). This can be too slow.
From now, we will describe all schemes with respect to IPv4 (32 bit) addresses unless we specifically generalize to include IPv6.
In the first solution, we divide the forwarding table into several (at most 32) separate forwarding tables such that Table i contains all prefixes of length i. Thus, if we redraw the forwarding table of FIG. 5 using this idea, we get FIG. 6. Notice that prefix 1* is in the Length 1 table, prefix 10* is in the Length 2 table, and so on. We have simply separated prefixes into separate tables according to prefix length.
The idea now is to look for the longest possible prefix by starting with the table of longest prefixes and working backwards until we reach a table in which we get a match. So consider an address A whose first 8 bits are 11000000. Since our longest prefix length is 7, we first try for a 7 bit match. We take the first 7 bits of address A (i.e., 1100000) and use any technique for exact matching to match these first 7 bits of address A against any prefix in the Length 7 database. A good technique to use for this is hashing. Since we fail to find a match, we move to the next length table (Length 6). This time we take the first 6 bits of address A (i.e., 110000) and we search the Length 6 table (see FIG. 6). Since we fail to find a match, we try again with the first 5 bits of A in the Length 5 table, and so on through the shorter lengths, until we finally try the first bit of A (i.e., 1) and get a match with prefix P4. Notice that we have tried all possible length tables in the database before we got a match.
On the other hand, if we were to search for an address B whose first 8 bits were 10000011, we would try the length 7 table and fail, but when we try the first six bits, we will find a match in the length 6 database with P6. This time we only searched 2 tables. However, while the best case can involve searching only a few tables, the worst case can involve searching all possible prefix lengths. If we use W bit addresses, this can take W table searches, where W is 32 for IPv4 and 128 for IPv6. Each search through a table requires what we call an exact match (unlike finding the best matching prefix).
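The search over per-length tables can be sketched in Python as follows, using one hash table (a `dict`) per prefix length. FIG. 5 and FIG. 6 are not reproduced in the text, so the prefix values are a stand-in: P1, P3, P4, and P6 are inferred from the worked examples, and the bit patterns of P5 and P8 are assumptions. The function names are hypothetical, and only non-empty length tables are probed.

```python
from collections import defaultdict

# Stand-in prefix table (bit strings without the trailing '*').
PREFIXES = {
    "P1": "10",
    "P3": "11001",
    "P4": "1",
    "P5": "0",         # assumed bit pattern
    "P6": "100000",
    "P8": "1000000",   # assumed bit pattern
}


def build_length_tables(prefixes):
    """Group prefixes into one exact-match hash table per prefix length."""
    tables = defaultdict(dict)
    for name, bits in prefixes.items():
        tables[len(bits)][bits] = name
    return tables


def lookup(tables, address_bits):
    """Probe tables from longest to shortest; return (prefix name, probes used)."""
    probes = 0
    for length in sorted(tables, reverse=True):
        probes += 1
        name = tables[length].get(address_bits[:length])
        if name is not None:
            return name, probes
    return None, probes
```

For address A (11000000) this sketch probes every non-empty table before matching P4, whereas for address B (10000011) it stops after two probes at P6. The worst case therefore still grows with the number of distinct prefix lengths.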
This method can cost up to 32 exact matches (often done using hashing in software) for IPv4 and 128 exact matches for IPv6. (To see this, consider an address that matches a 1 bit prefix, in a table that contains prefixes of all possible lengths.) An example of a patent that does this is U.S. Pat. No. 5,493,564 by Mullan. This is often too time consuming in software. A Bellcore patent proposes doing all the exact matches in parallel using hardware. Each exact match is done using a Content Addressable Memory (CAM). Unfortunately, the hardware cost of this solution is also formidable, as we have to use 32 CAMs for IPv4 (128 for IPv6); each CAM is expensive. Other methods have proposed pipelining the CAMs instead of doing the searches in parallel.
We will describe a considerable improvement of this scheme that improves the worst case time from W to log₂ W.
A trie is a data structure which allows us to search for prefixes a bit at a time and to do so incrementally. A trie is a tree of nodes, each node containing a table of pointers. The standard solutions for IPv4 (e.g., the solution used in BSD UNIX) use binary tries, in which each trie node is a table consisting of two pointers.
An example will explain how tries work. Consider FIG. 7. The root node is shown on the top left. Each trie node is a table whose topmost entry can contain a prefix. Each table also can contain two pointers, each of which points to other trie nodes (FIG. 7) or to prefixes. This trie stores the same table as FIG. 5. The root node (topmost node) has two pointers. The first pointer, corresponding to the value `0`, points to a subtrie that contains all prefixes that start with `0`. Since there is only one such prefix, i.e., P5, the `0` pointer points directly to P5. On the other hand, all other prefixes begin with `1`. Thus the `1` pointer in the root node, points to a subtrie that contains the remaining prefixes.
Each subtrie is a smaller trie with a smaller number of prefixes. In addition to pointers, each node may also have a stored prefix P. Define the path of a trie node N to be the sequence of bits corresponding to the pointers used to reach N starting from the root. Thus in FIG. 7, the path of the trie node containing P4 is 1 and the path of the trie node containing P1 is 10. We store a prefix P inside node N if the path of node N is equal to prefix P, ignoring the * character. Thus in FIG. 7, we see that the path of the node that stores P1 (which is equal to 10*) is indeed 10.
If there is at most one pointer at a node and there are no prefixes stored, then we can collapse a general trie node into a simpler primitive node that only contains a bit and a pointer. For example, the path to prefix P3=11001 (stored at bottom left of FIG. 7) starts at the root and follows the 1 pointer (first bit of P3); then goes to the node containing P4 and follows the 1 pointer (second bit of P3); at the next node the path follows the 0 pointer (third bit of P3). After this there are no other prefixes that share the same path with P3 and thus we have 2 primitive nodes corresponding to the fourth and fifth bits of P3 (0 and 1 respectively) which finally lead to P3.
Thus the bits in a prefix can be used to trace a path through the trie that leads to the prefix by itself (e.g., P3) or to a node that stores the prefix (e.g., P4).
Now consider searching the trie table for the best matching prefix corresponding to an address A whose first 8 bits are 11000000. We use the bits of an address, starting with the leftmost bit, to follow a path through the trie. We always begin at the root node. Since the first bit of A is 1, we follow the `1` pointer. Since the node contains a prefix, P4, we remember this as a possible matching prefix. Then, since the second bit of A is 1, we follow the `1` pointer. We then keep following the path of P3 (because the first four bits of A are the same as those of P3). But when we try the fifth bit of A we find a 0 instead of a 1 and the search fails. At this point, the search terminates with the best matching prefix equal to P4.
On the other hand, if we are searching for the best matching prefix of address B whose first 8 bits are 10010000, the 1 pointer at the root will lead us to P4's node (and we remember P4 as the longest prefix seen so far). Then the 0 pointer will lead us to P1's node (and we now remember P1 as the longest prefix seen so far). The 0 pointer (corresponding to the 3rd bit of B) at P1's node will lead us to a primitive node containing a 0. But at this point we fail because the fourth bit of the address is a 1 and not a 0. Thus the best matching prefix corresponding to address B is P1.
Thus, to find a best match prefix in a trie, we use successive bits of the address to trace a path through the trie, starting from the root, until we fail to find a pointer or we end at a prefix. As we walk through the trie, we remember the last prefix we saw at a previous node, if any. When we fail to find a pointer, this is the best matching prefix.
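The trie walk summarized above can be sketched in Python as follows. This is a plain binary trie under stated assumptions: it does not collapse single-pointer nodes into primitive nodes, the class and function names are hypothetical, and the prefix values stand in for FIG. 5 (P1, P3, P4, and P6 are inferred from the worked examples; the bit patterns of P5 and P8 are assumed).

```python
class TrieNode:
    """One binary trie node: up to two child pointers and an optional prefix."""

    def __init__(self):
        self.children = {}   # maps '0' or '1' to a child TrieNode
        self.prefix = None   # name of the prefix stored at this node, if any


def insert(root, bits, name):
    """Store a prefix at the node whose path equals its bits."""
    node = root
    for bit in bits:
        node = node.children.setdefault(bit, TrieNode())
    node.prefix = name


def search(root, address_bits):
    """Follow address bits from the root, remembering the last prefix seen."""
    node = root
    best = node.prefix
    for bit in address_bits:
        node = node.children.get(bit)
        if node is None:          # no pointer for this bit: the walk fails here
            break
        if node.prefix is not None:
            best = node.prefix
    return best


def build_example_trie():
    """Build a trie from the stand-in prefix table."""
    root = TrieNode()
    table = {"P1": "10", "P3": "11001", "P4": "1",
             "P5": "0", "P6": "100000", "P8": "1000000"}
    for name, bits in table.items():
        insert(root, bits, name)
    return root
```

Searching this trie for address A (11000000) returns P4 and for address B (10010000) returns P1, matching the two walks described above. Each loop iteration corresponds to following one pointer, which is one memory READ in the cost model used in this description.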
The worst case time to walk through a trie path is the maximum number of nodes in a trie path. In the example of FIG. 7, the path to P8 requires following 7 pointers. In general, if we have the prefixes 1*, 11*, 111*, 1111*, etc., then we can easily have a trie path equal to the maximum address length (32 for IPv4, 128 for IPv6). Thus the time for trie search of an address can be as bad as following 32 (or 128 for IPv6) pointers. This is somewhat better than the 32 exact matches required in FIG. 6, but it is still slow for real routers. The problem is that the following of each pointer requires at least one READ of memory. The fastest reads to reasonably inexpensive memory take about 0.06 μs. Thus 32 READs take about 1.9 μs (32 × 0.06 μs), which is the fastest that trie search can do today.
A description of tries can be found in Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1973. A description of a particular kind of trie (called a Patricia trie, which is optimized to reduce storage) applied to Internet lookups can be found in Keith Sklower, A Tree-Based Routing Table for Berkeley Unix, Technical report, University of California, Berkeley, and in W. Richard Stevens and Gary R. Wright, TCP/IP Illustrated, Volume 2: The Implementation, Addison-Wesley, 1995. H. Wilkinson, G. Varghese and N. Poole, Compressed Prefix Matching Database Searching, U.S. patent application 07/378,718, December 89, issued in Australia as Patent 620994, describes another variant of tries that reduces storage using a technique called path compression. All the existing trie schemes assume that trie search must be performed 1 bit at a time if the prefixes can be of arbitrary length. This greatly slows down trie search, as it requires W memory READs, where W is the size of a destination address.
Trie search that searches multiple bits at a time is described in Tong-Bi Pei and Charles Zukowski, Putting routing tables in silicon, IEEE Network Magazine, January 1992. However, this work only applies to exact matching and not to prefix matching.
The work in U.S. Pat. No. 5,414,704 by Spinney applies only to exact matching and not to prefix matching. Radia Perlman, Interconnections: Bridges and Routers, Addison-Wesley, 1992, describes a method based on binary search on all prefixes. Unfortunately, binary search takes time proportional to the logarithm of the number of entries. For a typical router database of 30,000 entries (growing as the Internet grows) this takes about 15 READs to memory (log₂ 30,000 is approximately 14.9), which is too slow. The work in U.S. Pat. No. 5,261,090 applies to range checking, which is a similar problem, but it also uses binary search.