1. Layer 3 Routing
As known in the art, a router is a network device that interconnects multiple networks and forwards data packets between the networks (a process referred to as Layer 3, or L3, routing). To determine the best path to use in forwarding an ingress packet, a router examines the destination IP address of the packet and compares the destination IP address to routing entries in a routing table. Each routing entry corresponds to a subnet route (e.g., 192.168.2.0/24) or a host route (e.g., 192.168.2.129/32). If the destination IP address matches the subnet/host route of a particular routing entry, the router forwards the packet out of an egress port to a next hop address specified by the entry, thereby sending the packet towards its destination. In some cases, the destination IP address of an ingress packet may match multiple routing entries corresponding to multiple subnet/host routes. For example, the IP address 192.168.2.129 matches subnet routes 192.168.2.128/26 and 192.168.2.0/24, as well as host route 192.168.2.129/32. When this occurs, the router can perform its selection via longest prefix match (LPM), which means that the router will select the matched routing entry with the longest subnet mask (i.e., the most specific entry). In the example above, the router would select host route 192.168.2.129/32.
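By way of illustration only, the LPM selection described above can be sketched in Python as follows. The three routes are the ones used in the example; the next hop addresses and the function name `longest_prefix_match` are hypothetical and are chosen purely for this sketch, which makes no claim about how any particular router implements the lookup.

```python
import ipaddress

# Routes taken from the example in the text. The next hop addresses are
# made up for illustration and do not come from the source.
ROUTES = [
    (ipaddress.ip_network("192.168.2.0/24"), "10.0.0.1"),
    (ipaddress.ip_network("192.168.2.128/26"), "10.0.0.2"),
    (ipaddress.ip_network("192.168.2.129/32"), "10.0.0.3"),
]


def longest_prefix_match(dest, routes):
    """Return the (network, next_hop) entry with the longest matching prefix.

    Returns None when no routing entry matches the destination address.
    """
    addr = ipaddress.ip_address(dest)
    # Collect every routing entry whose subnet/host route contains the address.
    matches = [entry for entry in routes if addr in entry[0]]
    if not matches:
        return None
    # The most specific entry is the one with the largest prefix length.
    return max(matches, key=lambda entry: entry[0].prefixlen)
```

In this sketch, a lookup for 192.168.2.129 matches all three routes and resolves to the /32 host route, while a lookup for 192.168.2.130 matches only the /26 and /24 subnet routes and resolves to the /26 route.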
For performance reasons, many conventional routers perform the routing operations described above using a combination of software and hardware routing tables. For instance, FIG. 1 depicts an exemplary router 100 that includes a management CPU 102, a software (SW) routing table 104, and a packet processor 106 comprising a hardware (HW) routing engine 108, a HW routing table 110, and data ports 112. SW routing table 104 is implemented using a data structure that is stored in a random access memory (not shown) accessible by management CPU 102. Generally speaking, SW routing table 104 contains all of the subnet/host routes known to router 100, such as statically configured routes and routes that are dynamically learned from routing protocols (e.g., RIP, OSPF, or BGP). On the other hand, HW routing table 110 is implemented using a hardware-based memory component, such as a ternary content-addressable memory (TCAM) or other similar associative memory. Due to its specialized hardware design, HW routing table 110 can enable faster table lookups than SW routing table 104, but is limited in size. Thus, HW routing table 110 typically includes only a subset of the routing entries in SW routing table 104.
When an ingress packet is received at router 100, HW routing engine 108 of packet processor 106 first looks for an LPM match for the packet's destination IP address in HW routing table 110. As mentioned above, HW routing engine 108 can perform this lookup very quickly (e.g., at line rate) because of table 110's hardware design. If a match is found, HW routing engine 108 forwards the packet to the next hop specified in the matched entry, without involving management CPU 102. If a match is not found, HW routing engine 108 takes a predefined action, such as dropping the packet or sending it to management CPU 102. In the latter case, management CPU 102 can perform additional inspection/processing to determine how the packet should be forwarded (such as performing a lookup in SW routing table 104).
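The lookup order described above (HW routing table first, then a predefined miss action) can be modeled with the following Python sketch. It is a software approximation under stated assumptions: both tables are represented as ordinary dictionaries, and the names `lpm`, `forward`, and `miss_action`, as well as the returned tuples, are hypothetical conventions introduced only for this illustration.

```python
import ipaddress


def lpm(dest, table):
    """Longest prefix match over a dict that maps prefix strings to next hops."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None


def forward(dest, hw_table, sw_table, miss_action="to_cpu"):
    """Model the HW-first lookup; returns a (handled_by, next_hop) tuple.

    miss_action is the predefined action on a HW miss: "drop" or "to_cpu".
    """
    next_hop = lpm(dest, hw_table)
    if next_hop is not None:
        # HW hit: forwarded without involving the management CPU.
        return ("hw", next_hop)
    if miss_action == "drop":
        return ("drop", None)
    # HW miss with "to_cpu": the CPU consults the full SW routing table.
    # next_hop is None here if the SW table has no matching route either.
    return ("cpu", lpm(dest, sw_table))
```

For example, if the SW table holds routes for 192.168.2.0/24 and 10.1.0.0/16 but only the first is installed in the HW table, a packet destined for 10.1.2.3 misses in hardware and is either dropped or resolved by the CPU, depending on the configured miss action.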
2. Routing Tries
In certain implementations, router 100 maintains SW routing table 104 as a binary trie (referred to as a “routing trie”), which makes traversal and searching of SW routing table 104 more efficient. FIG. 2 depicts an exemplary routing trie 200 that may be used to represent SW routing table 104 of FIG. 1. As shown, routing trie 200 includes both branch nodes (denoted by the unmarked circles) and route nodes (denoted by the circles marked with “R”). Each branch node corresponds to a “fork” in routing trie 200, and thus forms the root of a sub-trie. Each route node corresponds to a routing entry in SW routing table 104. Further, each node (whether branch or route) is associated with an IP address prefix. There are generally two rules regarding how the nodes of a routing trie may be positioned:
    1. Assume node 1 has a prefix of mask1/m-bit, node 2 has a prefix of mask2/n-bit, and m>n. If mask2 is the same as the first n bits of mask1, node 1 is a descendant of node 2.
    2. In that case, if the (n+1)-th bit of mask1 is 0, node 1 is in the left sub-trie of node 2. Otherwise, it is in the right sub-trie of node 2.
To illustrate the rules above, consider a routing table that includes two routing entries corresponding to two routes: 01001010/8 and 01010101/8 (represented in binary form). In this example, the routing trie for the table will contain three nodes: two route nodes (one for each of the two routes), and a root node that is a branch node associated with prefix 010/3 (because its two child nodes share their first three bits and differ starting from the 4th bit). Note that if a new route 010/3 is added, the routing trie will still contain three nodes; the branch node associated with prefix 010/3 will simply become a route node.
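The two positioning rules and the three-node example above can be reproduced with the following minimal Python sketch of a binary routing trie. It is one possible realization under stated assumptions rather than the structure of routing trie 200 itself: prefixes are represented as bit strings (e.g., "010" for 010/3), and the names `Node`, `insert`, `common_prefix`, and `count_nodes` are hypothetical.

```python
class Node:
    def __init__(self, prefix, is_route):
        self.prefix = prefix          # bit string, e.g. "010" for 010/3
        self.is_route = is_route      # True for a route node, False for a branch node
        self.children = [None, None]  # index 0: left sub-trie, index 1: right sub-trie


def common_prefix(a, b):
    """Return the longest bit string that both a and b start with."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def insert(node, prefix):
    """Insert a route into the sub-trie rooted at node; return the new sub-trie root."""
    if node is None:
        return Node(prefix, True)
    shared = common_prefix(node.prefix, prefix)
    if shared == node.prefix and shared == prefix:
        # Same prefix: an existing branch node becomes a route node.
        node.is_route = True
        return node
    if shared == node.prefix:
        # Rule 1: the new prefix is a descendant of node.
        # Rule 2: the next bit selects the left (0) or right (1) sub-trie.
        bit = int(prefix[len(shared)])
        node.children[bit] = insert(node.children[bit], prefix)
        return node
    if shared == prefix:
        # The new route is an ancestor of node, so node moves beneath it.
        parent = Node(prefix, True)
        parent.children[int(node.prefix[len(shared)])] = node
        return parent
    # The prefixes diverge: create a branch node at the shared prefix (the "fork").
    branch = Node(shared, False)
    branch.children[int(node.prefix[len(shared)])] = node
    branch.children[int(prefix[len(shared)])] = Node(prefix, True)
    return branch


def count_nodes(node):
    """Count all nodes (branch and route) in the sub-trie rooted at node."""
    if node is None:
        return 0
    return 1 + count_nodes(node.children[0]) + count_nodes(node.children[1])
```

Inserting "01001010" and then "01010101" into an empty trie yields a branch node with prefix "010" whose left and right children are the two route nodes. Inserting "010" afterwards leaves the node count at three and merely marks the root as a route node, consistent with the example above.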
3. Multi-Packet Processor Networking Systems
While router 100 of FIG. 1 is depicted as a standalone device with a single packet processor and a single HW routing table, some routers are implemented as a system of interconnected devices/modules, where each device/module incorporates a separate packet processor (with a separate HW routing table). Such systems are referred to herein as “multi-packet processor” (MPP) networking systems.
For example, FIG. 3 depicts a stacking system 300 (also known as a “stack”), which is one type of MPP networking system. As shown, stacking system 300 comprises a number of stackable switches 302(1)-302(3) that are interconnected via stacking ports 314(1)-314(3). Each stackable switch 302(1)-302(3) includes components that are similar to router 100 of FIG. 1, such as management CPU 304(1)-304(3), SW routing table 306(1)-306(3), packet processor 308(1)-308(3), HW routing engine 310(1)-310(3), HW routing table 312(1)-312(3), and data ports 316(1)-316(3). However, rather than acting as individual switches/routers, stackable switches 302(1)-302(3) of system 300 can act in concert as a single, logical switch/router. For instance, stackable switch 302(1) can receive a packet on an ingress data port 316(1), perform a lookup (via HW routing engine 310(1)) into its local HW routing table 312(1), and determine based on the lookup that the packet should be forwarded out of, e.g., an egress data port 316(3) of stackable switch 302(3) in order to reach its next hop destination. Stackable switch 302(1) can then send the packet over stacking link 318 to stackable switch 302(3), thereby allowing switch 302(3) to forward the packet out of the appropriate egress data port.
FIG. 4 depicts a chassis system 400, which is another type of MPP networking system. Chassis system 400 includes a management module 402 and a number of I/O modules 410(1)-410(3) that are interconnected via an internal switch fabric 408. Management module 402 includes a management CPU 404 and a SW routing table 406 that are similar to management CPU 102 and SW routing table 104 of router 100 of FIG. 1. In addition, each I/O module 410(1)-410(3) includes a packet processor 412(1)-412(3), a HW routing engine 414(1)-414(3), a HW routing table 416(1)-416(3), and data ports 418(1)-418(3) that are similar to components 106-112 of router 100 of FIG. 1. Generally speaking, I/O modules 410(1)-410(3) can act in concert to carry out various data plane functions, including L3 routing, for chassis system 400. For instance, I/O module 410(1) can receive a packet on an ingress data port 418(1), perform a lookup (via HW routing engine 414(1)) into its local HW routing table 416(1), and determine based on the lookup that the packet should be forwarded out of, e.g., an egress data port 418(2) of I/O module 410(2) in order to reach its next hop destination. I/O module 410(1) can then send the packet over switch fabric 408 to I/O module 410(2), thereby allowing I/O module 410(2) to forward the packet out of the appropriate egress data port.
One inefficiency with performing L3 routing in an MPP networking system like stacking system 300 or chassis system 400 as described above pertains to the way in which the multiple HW routing tables of the system are utilized. In particular, since ingress packets may arrive at any packet processor of the system, the same set of routing entries is replicated in the HW routing table of every packet processor. As a result, the HW routing table capacity of the system is constrained by the size of the smallest HW routing table. For instance, in stacking system 300, assume that HW routing table 312(1) supports 16K entries while HW routing tables 312(2) and 312(3) support 32K entries each. In this scenario, every HW routing table 312(1)-312(3) will be limited to holding a maximum of 16K entries (since additional entries beyond 16K cannot be replicated in table 312(1)). This means that a significant percentage of the system's HW routing resources (e.g., 16K entries in each of tables 312(2) and 312(3)) will go unused. This also means that the HW routing table capacity of the system cannot scale upward as additional switches are added.
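The capacity arithmetic in the scenario above can be checked with a short Python sketch. The function name `replicated_capacity` is hypothetical, and 1K is assumed to mean 1024 entries.

```python
def replicated_capacity(table_sizes):
    """Capacity figures when every HW routing table holds the same entries.

    Returns (usable_routes, unused_entries, unused_fraction).
    """
    usable = min(table_sizes)                   # bounded by the smallest table
    total = sum(table_sizes)                    # all HW entries in the system
    unused = total - usable * len(table_sizes)  # entries that can never be filled
    return usable, unused, unused / total


K = 1024  # assumption: 1K = 1024 entries
usable, unused, fraction = replicated_capacity([16 * K, 32 * K, 32 * K])
```

Under these assumptions, the three tables offer 80K entries in total, but the system can hold only 16K distinct routes; 32K entries (16K in each of the two larger tables), or 40% of the total, go unused.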