In recent years, it has become increasingly common for network devices such as load balancers, firewalls, and the like to incorporate multiple general-purpose processing cores (e.g., Intel x86, PowerPC, or ARM-based cores) for network processing purposes. An important aspect of designing such a network device involves determining a mechanism for evenly distributing incoming data packets among the multiple processing cores of the device. By ensuring that each core is assigned a proportional share of the incoming traffic, processing bottlenecks can be avoided and the overall throughput/performance of the network device can be increased.
One approach for distributing data packets among multiple processing cores is to hash, at the packet processor level, certain key fields of an incoming data packet such as the source IP address and the destination IP address. The packet processor can then use the resulting hash value to select one of the processing cores for handling the data packet. Unfortunately, while this approach works well for distributing stateless (e.g., UDP) traffic, it does not work as well for stateful (e.g., TCP) traffic. For example, consider a typical server load balancing scenario where a load balancer receives, from a client device, a data packet that is part of a forward TCP flow destined for a virtual IP address (VIP) configured on the load balancer (i.e., the source IP address of the data packet is the client device's IP address and the destination IP address of the data packet is the VIP). If the load balancer is using a standard hash-based distribution algorithm, the load balancer will hash the data packet's source and destination IP addresses and use the hash value to distribute the packet to a particular processing core (e.g., “core 1”). As part of its processing, core 1 will access or generate TCP state information for the TCP session. The load balancer will then select a real (i.e., physical) server based on server load, perform network address translation (NAT) on the data packet to change its destination IP address from the VIP to the real server IP address, and forward the packet to the real server.
The problem with standard hash-based distribution in this scenario occurs when the real server generates a reply data packet that is part of a reverse TCP flow destined for the client device (i.e., the source IP address of the reply data packet is the real server IP address and the destination IP address of the reply data packet is the client IP address). As with the forward TCP flow, upon intercepting the reply data packet, the load balancer will hash the packet's source and destination IP addresses and use the hash value to distribute the packet to a processing core. However, since the source and destination IP addresses of the reply data packet are different from those of the data packet that originated from the client device, this hashing will result in a hash value that is different from the hash value calculated during the forward TCP flow. This, in turn, will likely cause the reply data packet to be distributed to a different processing core (e.g., “core 2”) that does not have access to the same TCP state information as core 1. As a result, core 2 will not be able to perform stateful processing of the reply data packet.
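The mismatch described above can be illustrated with a minimal Python sketch. The IP addresses, the four-core count, and the toy XOR-based hash are all illustrative assumptions; they do not represent the hash function of any particular packet processor.

```python
import ipaddress

NUM_CORES = 4  # assumed core count, for illustration only


def standard_hash_core(src_ip: str, dst_ip: str) -> int:
    """Select a core by hashing both the source and destination IP addresses.

    A toy hash (XOR of the two 32-bit addresses, modulo the core count)
    stands in for whatever hash a real packet processor would compute.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % NUM_CORES


# Hypothetical addresses for the server load balancing scenario.
CLIENT_IP = "10.0.0.5"
VIP = "192.0.2.10"
REAL_SERVER_IP = "10.1.1.20"

# Forward flow: client -> VIP (as received by the load balancer).
forward_core = standard_hash_core(CLIENT_IP, VIP)
# Reverse flow: real server -> client.
reverse_core = standard_hash_core(REAL_SERVER_IP, CLIENT_IP)

print(forward_core, reverse_core)  # prints "3 1"
```

With these assumed inputs, the forward packet is distributed to core 3 and the reply packet to core 1, so the core that receives the reply does not hold the TCP state created for the forward flow. Note that the toy hash is symmetric in its two inputs, yet the mismatch still occurs because the reply carries the real server IP address rather than the VIP.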
To address this problem, it is possible to implement a shared memory design in the load balancer/network device that allows multiple processing cores/processors to access a single, common pool of memory. In the scenario above, this would allow core 1 and core 2 to read and write the same state information. However, the scalability of this design is usually limited by the processor architecture being used (e.g., some processor architectures may only support 2-core memory sharing, others may only support 4-core memory sharing, etc.), and thus the design cannot be arbitrarily scaled out by the network device vendor to meet market demands. Further, there is often a performance penalty with such shared memory designs due to synchronization mechanisms and increased memory latency.
Another solution is to leverage the ternary content addressable memory (TCAM) that is commonly included in (or implemented in conjunction with) existing packet processors to perform a rule-based hash. With this solution (referred to herein as the “TCAM-only solution”), the TCAM is populated with one entry for each VIP and each real server IP address that is configured on the load balancer/network device. Each VIP entry is associated with a rule or action that hashes the source IP address of an incoming data packet if the destination IP address matches the corresponding VIP, and each real server IP entry is associated with a rule/action that hashes the destination IP address of an incoming data packet if the source IP address matches the corresponding real server IP address. These entries essentially enable the packet processor to use the TCAM for (1) identifying a data packet as being part of a forward flow or a reverse flow of a stateful connection, and (2) hashing the common portion of the IP header that appears in both flows, namely the client IP address (which appears in the source IP address field in the forward flow and the destination IP address field in the reverse flow). By hashing the common client IP address, data packets in the forward and reverse flows will always result in the same hash value, and thus will always be distributed to the same processing core.
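The rule-based hash of the TCAM-only solution can be sketched in Python as follows. This is a simplified model under stated assumptions: two Python sets stand in for the TCAM entries, the addresses and four-core count are hypothetical, and the modulo operation is a toy hash rather than a real packet processor's function.

```python
import ipaddress
from typing import Optional

NUM_CORES = 4  # assumed core count, for illustration only

# Stand-in for the TCAM: one entry per VIP and one per real server IP address.
VIP_ENTRIES = {"192.0.2.10"}         # destination match -> hash the source IP
REAL_SERVER_ENTRIES = {"10.1.1.20"}  # source match -> hash the destination IP


def toy_hash(ip: str) -> int:
    """Toy hash: the 32-bit address modulo the core count."""
    return int(ipaddress.ip_address(ip)) % NUM_CORES


def rule_based_core(src_ip: str, dst_ip: str) -> Optional[int]:
    """Select a core by hashing only the client IP address."""
    if dst_ip in VIP_ENTRIES:
        # Forward flow: the client IP is in the source field.
        return toy_hash(src_ip)
    if src_ip in REAL_SERVER_ENTRIES:
        # Reverse flow: the client IP is in the destination field.
        return toy_hash(dst_ip)
    # No entry matched; handling of such traffic is outside this sketch.
    return None


forward_core = rule_based_core("10.0.0.5", "192.0.2.10")  # client -> VIP
reverse_core = rule_based_core("10.1.1.20", "10.0.0.5")   # real server -> client

print(forward_core, reverse_core)  # prints "1 1"
```

Because both lookups hash the same client IP address, the forward and reverse packets map to the same core in this sketch, regardless of which hash function is substituted for `toy_hash`.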
Unfortunately, although the TCAM-only solution works well for both shared memory and distributed memory network devices, this solution suffers from its own scalability limitations. First, TCAM is a relatively expensive type of memory and consumes a significant amount of power. As a result, existing packet processors/network devices typically include only small TCAMs (e.g., at most a few thousand entries). Second, the trend in newer-generation packet processor designs is to reduce internal TCAM sizes even further from prior generations, as well as to eliminate support for external TCAMs. These factors will likely limit the number of TCAM entries available in network devices moving forward, which in turn will adversely affect the ability of network device vendors to scale out the TCAM-only solution. For instance, since this solution requires one TCAM entry for each VIP and each real server IP address for server load balancing, the number of VIPs and real servers that can be supported will be directly constrained by the amount of available TCAM space.
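The capacity constraint can be made concrete with a short Python sketch. The TCAM capacity and the VIP/real server counts below are hypothetical figures chosen for illustration, and the sketch optimistically assumes that the entire TCAM is available for this purpose.

```python
def tcam_entries_needed(num_vips: int, num_real_servers: int) -> int:
    """One TCAM entry per VIP plus one per real server IP address."""
    return num_vips + num_real_servers


def fits_in_tcam(num_vips: int, num_real_servers: int, tcam_capacity: int) -> bool:
    """Check whether a configuration fits within the available TCAM entries."""
    return tcam_entries_needed(num_vips, num_real_servers) <= tcam_capacity


TCAM_CAPACITY = 2048  # hypothetical capacity, in entries

print(fits_in_tcam(256, 1024, TCAM_CAPACITY))  # prints "True" (1280 entries)
print(fits_in_tcam(512, 2048, TCAM_CAPACITY))  # prints "False" (2560 entries)
```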