Software-based network packet processing on commodity servers, together with Software Defined Networking (SDN) and Network Function Virtualization (NFV), promises better flexibility, manageability, and scalability, and has therefore gained tremendous industry momentum in recent years. However, with the rapid growth of network bandwidth consumption, software is hard-pressed to keep pace with the speed and scale of packet processing workloads. As an example, Telecommunications (Telco) workloads require support of network Quality of Service (QoS) on millions of concurrently active flows. To achieve this QoS support, it is necessary to: (1) perform flow classification based on an arbitrary portion of each packet (used as the input key) and assign a QoS priority to the flow; and (2) enforce a given transmission rate for each flow according to the priority assigned to it in step (1).
Under current software processing approaches, such as those supported by the open-source Data Plane Development Kit (DPDK), all of the foregoing functionality, including the rate-limiting action itself, is implemented in software running on a commodity server. On one hand, flow classification and QoS priority assignment can be done very efficiently with carefully designed software modules (for example, using Longest Prefix Match (LPM), Exact Match, or even packet payload information). On the other hand, performing rate-limiting related operations on each flow (including time stamping, rate limiting, leaky bucket, etc.) proves to be very difficult to scale using a software-based approach, since these operations consume a significant number of CPU (Central Processing Unit) cycles.
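For illustration only, the software classification step described above can be sketched as follows. This is a minimal sketch and not DPDK code: the table contents, the 5-tuple key layout, the priority values, and the names EXACT_MATCH, LPM_TABLE, DEFAULT_PRIORITY, and classify are all hypothetical, and a linear scan stands in for the optimized LPM structures a production data plane would use.

```python
import ipaddress

# Hypothetical exact-match table: 5-tuple flow key -> QoS priority.
EXACT_MATCH = {
    ("10.0.0.1", "192.168.1.10", 5060, 5060, "udp"): 0,
}

# Hypothetical LPM table: destination prefix -> QoS priority.
LPM_TABLE = [
    (ipaddress.ip_network("192.168.0.0/16"), 2),
    (ipaddress.ip_network("192.168.1.0/24"), 1),
]

# Priority assumed for flows that match no table entry.
DEFAULT_PRIORITY = 3


def classify(src, dst, sport, dport, proto):
    """Return a QoS priority: exact match first, then longest prefix match."""
    key = (src, dst, sport, dport, proto)
    if key in EXACT_MATCH:
        return EXACT_MATCH[key]

    addr = ipaddress.ip_address(dst)
    best = None  # (network, priority) of the longest matching prefix so far
    for net, prio in LPM_TABLE:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, prio)
    return best[1] if best is not None else DEFAULT_PRIORITY
```

In this sketch a packet is first looked up by its full flow key and falls back to the destination prefix, so the priority handed to the rate-limiting step depends only on table lookups and not on any notion of time.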
Specifically, rate limiting performed by software executing on a CPU has to use the CPU clock to account for time, which requires reading the CPU cycle count or the system time during each loop iteration in order to calculate the elapsed time and release packets when appropriate (e.g., using a leaky bucket algorithm). However, obtaining an accurate reading with the cycle-count instruction (RDTSC) requires serializing instructions to guarantee that earlier instructions in the out-of-order pipeline have completed before the cycle count is read. As a result, the cost of reading the cycle count can vary widely, which can add significant latency and throughput overhead to the packet processing pipeline, to the extent that the CPU might not be able to process certain network flows with very strict QoS requirements. The problem worsens as the number of flows increases.
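The per-packet clock read described above can be made concrete with a minimal leaky bucket sketch. The class name LeakyBucket, its parameters, and the injectable clock argument are assumptions introduced for illustration; Python's time.monotonic stands in for the cycle counter and is not RDTSC.

```python
import time


class LeakyBucket:
    """Leaky bucket used as a meter for a single flow (illustrative only)."""

    def __init__(self, rate_bytes_per_s, capacity_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_s      # drain rate of the bucket
        self.capacity = capacity_bytes    # maximum burst the bucket can hold
        self.clock = clock                # time source, injectable for testing
        self.level = 0.0                  # bytes currently held in the bucket
        self.last = clock()               # timestamp of the previous update

    def allow(self, packet_len):
        """Return True if a packet of packet_len bytes may be released now."""
        now = self.clock()                # one clock read per packet
        elapsed = now - self.last
        self.last = now
        # Drain the bucket at the configured rate for the elapsed time.
        self.level = max(0.0, self.level - elapsed * self.rate)
        if self.level + packet_len <= self.capacity:
            self.level += packet_len
            return True
        return False
```

Every call to allow reads the clock once, and one such bucket is needed per flow, so in this sketch the number of time reads grows with both the packet rate and the number of active flows.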
QoS rate limiting is also performed today in switches and Network Interface Controllers (NICs); however, this approach lacks the flexibility of packet classification in the CPU, because the hardware supports only a limited number of flows or packet classification fields and has limited TCAM (ternary content-addressable memory) capacity. TCAM is generally very costly and power hungry, and thus can generally support only a limited number of flows.