1. Field of the Invention
The invention relates to network communication routing and, in particular, to performing longest prefix matching for network address lookup using hash functions and Bloom filters.
2. Description of the Related Art
Internet core routers need to forward packets as fast as possible. Forwarding decisions on a data path are made through IP (Internet Protocol) lookup, also known as Longest Prefix Matching (LPM). A prefix lookup table (LUT) includes prefixes of different lengths. Each prefix value in the prefix LUT is associated with an output interface connecting to the next hop along the data path. To forward a packet, a router processor uses the destination IP address contained in the packet header to search the prefix LUT and then extracts the output interface associated with the longest matching prefix.
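The lookup described above can be illustrated with a minimal Python sketch. It is not taken from any router implementation: the function names build_lut and longest_prefix_match, the IPv4-only masking, and the interface labels are assumptions chosen for illustration. The sketch groups prefixes by length and probes the groups from longest to shortest, so the first hit is the longest matching prefix.

```python
import ipaddress

def build_lut(routes):
    """Group IPv4 prefixes by length: {length: {network_as_int: interface}}."""
    lut = {}
    for cidr, interface in routes:
        net = ipaddress.ip_network(cidr)
        lut.setdefault(net.prefixlen, {})[int(net.network_address)] = interface
    return lut

def longest_prefix_match(lut, address):
    """Return the interface of the longest matching prefix, or None."""
    addr = int(ipaddress.ip_address(address))
    # Try the longest prefix length first; the first hit is the answer.
    for length in sorted(lut, reverse=True):
        mask = ((1 << length) - 1) << (32 - length)  # length 0 gives mask 0
        interface = lut[length].get(addr & mask)
        if interface is not None:
            return interface
    return None

# Hypothetical routes and interface names, for illustration only.
lut = build_lut([
    ("0.0.0.0/0", "if0"),     # default route
    ("10.0.0.0/8", "if1"),
    ("10.1.0.0/16", "if2"),
])
print(longest_prefix_match(lut, "10.1.2.3"))     # matches /16, /8, and /0; /16 wins
print(longest_prefix_match(lut, "10.2.3.4"))     # matches /8 and /0; /8 wins
print(longest_prefix_match(lut, "192.168.1.1"))  # only the default route matches
```

Probing one table per prefix length in sequence is simple but slow, which is why practical designs try to reduce the number of probes per lookup.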
As the central function of an Internet router, IP lookup is often the performance bottleneck. The number of IPv4 prefixes in a core router prefix LUT has recently exceeded 250K and is increasing at a rate of a few tens of thousands of prefixes each year. While the current IPv6 table is still relatively small, the foreseeable large-scale deployment of IPv6 will result in a table size no smaller than that of IPv4. Recently, 40G line cards have been installed in high-end core routers such as Cisco's CRS-1 router and Juniper's T640 router, which support a packet forwarding rate of about 50 million packets per second (Mpps). Driven by media-rich Internet applications, the IEEE has started to standardize 100 Gigabit Ethernet (GbE) and plans to finish the standard in 2010, to partially fulfill the insatiable demand for more network bandwidth. Pre-standard 100 GbE products are expected to be available in the market in about the same time frame. Accordingly, the required packet lookup rate for a 100G line card will be further boosted to 150 Mpps. This more-than-two-times leap beyond 40G creates a vast technical challenge that the currently adopted IP lookup solutions cannot address.
It is tempting to consider using TCAM (Ternary Content-Addressable Memory) devices for IP lookups. Indeed, with a remarkable rate of 250M+ searches per second, TCAMs would seem able to support even the next-generation IP lookup demand with ease. Unfortunately, even though cost is a secondary consideration for core routers, TCAMs are rarely used in core routers in practice. The major reasons are their inherently high power dissipation and large footprint. In addition to these disadvantages, an incremental prefix update in a TCAM involves as many memory operations as the number of unique prefix lengths.
The recurring goals in designing an efficient IP lookup algorithm are to (1) achieve more compact storage and (2) sustain a faster lookup rate. Note that compact storage has an important implication: it potentially enables the use of smaller yet faster memory components, such as SRAM devices or even on-chip embedded memory blocks, and, as a result, it also benefits throughput.
As the throughput requirement of modern routers outpaces improvements in SRAM speed, designers have begun to consider using on-chip memory as a cache to facilitate faster IP lookups. Thanks to technology advancements, a few tens of megabits of fast memory can now be embedded on a chip. This scarce resource has proven critical to satisfying the throughput requirements of next-generation network applications.
U.S. Patent Application Publication No. US 2005/0195832 A1 (“the '832 publication”), the teachings of which are incorporated herein by reference in their entirety, discloses an IP lookup algorithm that relies on the use of Bloom filters. Bloom filters allow the use of fast on-chip memory and take advantage of the massive parallel processing power of hardware. This Bloom-filter-based IP lookup algorithm, which is described in more detail later in this specification, is relatively simple and promises very good average performance. However, it also has some drawbacks that prevent it from being used in real applications.
First, in the worst case, when all the Bloom filters report false positives, the prefix table needs to be searched as many times as there are Bloom filters. One way to improve the worst-case performance is to reduce the number of Bloom filters. This means that prefixes with different lengths need to be “compressed” into a single Bloom filter by using a technique known as prefix expansion. The improvement in worst-case performance comes at the cost of more memory consumption, because the size of the prefix table can grow significantly, even when the expansion is controlled. In addition, prefix expansion makes routing updates much more time-consuming and awkward, even though incremental updates happen fairly frequently in core routers. Multiple expanded prefixes must be handled when only a single original prefix is inserted or deleted. In short, the algorithm does not scale very well for larger tables and longer prefixes.
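The cost of prefix expansion can be seen in a small Python sketch. This is a generic illustration of the technique, not the procedure of the '832 publication; the function names expand_prefix and expand_table and the bit-string representation of prefixes are assumptions. Each prefix of length l expanded to a target length L becomes 2^(L-l) entries, and where expansions overlap, the more specific original prefix takes priority.

```python
def expand_prefix(bits, target_len):
    """Expand a binary prefix string into all prefixes of length target_len."""
    extra = target_len - len(bits)
    if extra < 0:
        raise ValueError("prefix is longer than the target length")
    if extra == 0:
        return [bits]
    # Append every possible combination of the missing low-order bits.
    return [bits + format(i, f"0{extra}b") for i in range(2 ** extra)]

def expand_table(table, target_len):
    """Expand every prefix in {prefix_bits: interface} to target_len."""
    expanded = {}
    # Shorter prefixes are processed first so that longer, more specific
    # prefixes overwrite them where their expansions collide.
    for bits in sorted(table, key=len):
        for entry in expand_prefix(bits, target_len):
            expanded[entry] = table[bits]
    return expanded

# Hypothetical two-entry table expanded to length 4.
table = {"10": "A", "101": "B"}
print(expand_table(table, 4))
# {'1000': 'A', '1001': 'A', '1010': 'B', '1011': 'B'}
```

In this example two original prefixes become four stored entries, and deleting the single prefix "101" would require rewriting both "1010" and "1011" so that they point back to interface "A". That is the update overhead the paragraph above refers to.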
Second, the distribution of prefix lengths is highly asymmetric and changes dynamically with incremental updates. To reduce the false positive rate and best utilize the scarce memory resources, the size of each Bloom filter, as well as the number of hash functions, needs to be customized according to the number of prefixes that need to be programmed into the Bloom filter. The memory allocation must also be adjusted dynamically to adapt to the current prefix distribution. Engineering such a system is difficult and expensive. It requires either over-provisioning or the capability of reconfiguration. Over-provisioning can easily be ruled out, because fast on-chip memory is still a scarce and costly resource. Theoretically, reconfiguration can be done in field-programmable gate arrays (FPGAs); however, in practice, it takes seconds to finish and can interrupt router services. In fixed application-specific integrated circuit (ASIC) devices, reconfiguration is simply impossible.
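The sensitivity to prefix count can be quantified with the standard Bloom filter approximation: for a filter of m bits holding n elements with k hash functions, the false positive rate is approximately (1 - e^(-kn/m))^k, which is minimized near k = (m/n) ln 2. The Python sketch below evaluates these textbook formulas; the function names and the example sizes are assumptions chosen for illustration, not values from the '832 publication.

```python
import math

def false_positive_rate(num_bits, num_prefixes, num_hashes):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_prefixes / num_bits)) ** num_hashes

def optimal_num_hashes(num_bits, num_prefixes):
    """Integer k closest to (m/n) * ln 2, and at least 1."""
    return max(1, round(num_bits / num_prefixes * math.log(2)))

# Hypothetical filter sized at 10 bits per prefix.
m, n = 1_000_000, 100_000
k = optimal_num_hashes(m, n)
print(k, false_positive_rate(m, n, k))

# Same filter after the number of prefixes of this length doubles.
print(false_positive_rate(m, 2 * n, k))
```

With 10 bits per prefix, the sketch selects k = 7 and a false positive rate of roughly 0.8%. If the number of prefixes of that length doubles while the filter size and k stay fixed, the rate rises above 10%, which is why a fixed allocation degrades as the prefix distribution shifts.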
Third, in order to achieve the desired goal of one cycle per lookup, the '832 publication assumes that a Bloom filter is implemented in a k-port memory, where k equals the number of hash functions. This is impractical in real hardware implementations for even modest values of k (e.g., greater than two).
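The first and third drawbacks can be made concrete with a software sketch of a generic scheme that keeps one Bloom filter per prefix length. It is an illustration under stated assumptions, not the implementation of the '832 publication: the BloomFilter class, the SHA-256-based hashing, the bit-string prefixes, and the helper names build_filters and lookup are all hypothetical. Each membership test reads k bit positions, which is what a k-port memory would have to serve in a single cycle, and each positive filter answer costs one access to the prefix table.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive k positions by hashing the key with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # Up to k bit reads per query; hardware would issue them in parallel.
        return all(self.bits[pos] for pos in self._positions(key))

def build_filters(table, num_bits, num_hashes):
    """Create one Bloom filter per prefix length found in the table."""
    filters = {}
    for prefix in table:
        if len(prefix) not in filters:
            filters[len(prefix)] = BloomFilter(num_bits, num_hashes)
        filters[len(prefix)].add(prefix)
    return filters

def lookup(filters, table, addr_bits):
    """Return (interface, number of prefix-table probes)."""
    probes = 0
    for length in sorted(filters, reverse=True):
        candidate = addr_bits[:length]
        if filters[length].might_contain(candidate):
            probes += 1  # every positive answer costs one table access
            if candidate in table:
                return table[candidate], probes
    return None, probes

# Hypothetical table with two prefix lengths.
table = {"10": "A", "1011": "B"}
filters = build_filters(table, num_bits=64, num_hashes=3)
print(lookup(filters, table, "10110000"))  # ('B', 1): longest filter hits first
```

If every filter returns a false positive for an address that matches no prefix, the probe count equals the number of filters, which is the worst case described above.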