There are different approaches for reducing the processing power required for TCP/IP stack processing. In a TCP offload engine (TOE), the offloading engine performs all or most of the TCP processing and presents a stream of data to the upper layer. There may be various disadvantages to this approach. The TOE may be tightly coupled with the operating system and may therefore require solutions that are dependent on the operating system, as well as changes in the operating system to support it. The TOE may require a side-by-side stack solution that needs some kind of manual configuration by the application, for example, by explicitly specifying a socket address family for accelerated connections. The TOE may also require some kind of manual configuration by an IT administrator, for example, by explicitly specifying an IP subnet address for accelerated connections, to select which of the TCP flows will be offloaded. Moreover, the offload engine is very complex, as it needs to implement TCP packet processing.
Large segment offload (LSO)/transmit segment offload (TSO) may be utilized to reduce the required host processing power by reducing the transmit packet processing. In this approach, the host sends to the network interface card (NIC) transmit units that are bigger than the maximum transmission unit (MTU), and the NIC cuts them into segments according to the MTU. Since part of the host processing is linear in the number of transmitted units, this reduces the required host processing power. While LSO is efficient in reducing the transmit packet processing, it does not help with receive packet processing. In addition, for each single large transmit unit sent by the host, the host receives from the far end multiple ACKs, one for each MTU-sized segment. The multiple ACKs consume scarce and expensive bandwidth, thereby reducing throughput and efficiency.
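The segmentation step described above may be illustrated with a minimal sketch. The function name `segment_large_send`, the parameter `max_segment_payload`, and the numeric values are hypothetical and chosen only for illustration; a real NIC also replicates and adjusts the TCP/IP headers for each segment, which this sketch omits.

```python
from typing import List


def segment_large_send(payload: bytes, max_segment_payload: int) -> List[bytes]:
    """Cut one large transmit unit into wire-sized segments, as the NIC would."""
    if max_segment_payload <= 0:
        raise ValueError("max_segment_payload must be positive")
    return [
        payload[offset:offset + max_segment_payload]
        for offset in range(0, len(payload), max_segment_payload)
    ]


# Assumed numbers: a 64 KiB transmit unit and 1460 payload bytes per segment
# (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers without options).
large_unit = bytes(64 * 1024)
segments = segment_large_send(large_unit, 1460)

host_transmit_units = 1          # what the host stack processes on transmit
acks_expected = len(segments)    # one ACK per segment, per the description above
print(host_transmit_units, len(segments), acks_expected)
```

Under these assumed numbers, the host hands over a single transmit unit while the wire carries 45 segments, so the host may have to process up to 45 returning ACKs for that one unit.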
In large receive offload (LRO), a stateless receive offload mechanism, the TCP flows may be split across multiple hardware queues according to a hash function that guarantees that a specific TCP flow is always directed into the same hardware queue. For each hardware queue, the mechanism takes advantage of interrupt coalescing to scan the queue and aggregate subsequent packets on the queue that belong to the same TCP flow into a single large receive unit.
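The queue selection and aggregation may be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism of any particular NIC or driver: the hash is a CRC-32 over the connection 4-tuple, only adjacent packets of the same flow are merged, and sequence-number and header checks are omitted. The names `queue_index` and `aggregate_queue` are hypothetical.

```python
import zlib
from itertools import groupby
from typing import List, Tuple

Flow = Tuple[str, int, str, int]   # (src IP, src port, dst IP, dst port)
Packet = Tuple[Flow, bytes]        # (flow identifier, TCP payload)


def queue_index(flow: Flow, num_queues: int) -> int:
    """Deterministic hash: the same flow always maps to the same queue."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_queues


def aggregate_queue(packets: List[Packet]) -> List[Packet]:
    """Merge runs of adjacent packets from one flow into single receive units."""
    return [
        (flow, b"".join(payload for _, payload in run))
        for flow, run in groupby(packets, key=lambda packet: packet[0])
    ]


flow_a: Flow = ("10.0.0.1", 5000, "10.0.0.2", 80)
flow_b: Flow = ("10.0.0.3", 5001, "10.0.0.2", 80)

# Three back-to-back packets of one flow collapse into one receive unit.
single_flow = aggregate_queue([(flow_a, b"ab"), (flow_a, b"cd"), (flow_a, b"ef")])

# Packets of two flows alternating on one queue are not merged by this sketch.
interleaved = aggregate_queue([(flow_a, b"ab"), (flow_b, b"xy"), (flow_a, b"cd")])
```

In this sketch, `single_flow` holds one unit with the payload `b"abcdef"`, while `interleaved` still holds three separate units.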
While this mechanism does not require any additional hardware from the NIC besides multiple hardware queues, it may have various performance limitations. For example, if the number of flows is larger than the number of hardware queues, multiple flows fall into the same queue, resulting in no LRO aggregation for that queue. If the number of flows is larger than twice the number of hardware queues, no LRO aggregation is performed on any of the flows. The aggregation may be limited to the number of packets available to the host in one interrupt period. If the interrupt period is short, and the number of flows is not small, the number of packets that are available to the host CPU for aggregation on each flow may be small, resulting in limited or no LRO aggregation. The limited or no LRO aggregation may be present even in instances where the number of hardware queues is large. The LRO aggregation may be performed on the host CPU, resulting in additional processing. The driver may deliver to the TCP stack a linked list of buffers comprising a header buffer followed by a series of data buffers, which may require more processing than in the case where all the data is delivered contiguously in one buffer.
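The dependence of aggregation on the interrupt period may be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes a fully utilized link, equal-sized packets, and traffic shared evenly among the flows; the function name and all numeric values are hypothetical and are not taken from any measurement.

```python
def packets_per_flow_per_interrupt(
    link_bits_per_s: float,
    packet_bytes: int,
    interrupt_period_s: float,
    num_flows: int,
) -> float:
    """Average packets of one flow that arrive within one interrupt period."""
    packets_per_second = link_bits_per_s / (packet_bytes * 8)
    packets_per_period = packets_per_second * interrupt_period_s
    return packets_per_period / num_flows


# Assumed scenario: 10 Gb/s link, 1500-byte packets, 50 microsecond period.
few_flows = packets_per_flow_per_interrupt(10e9, 1500, 50e-6, 4)
many_flows = packets_per_flow_per_interrupt(10e9, 1500, 50e-6, 100)
print(round(few_flows, 2), round(many_flows, 2))
```

Under these assumptions, roughly ten packets per flow are available in each period with 4 flows, but fewer than one packet per flow with 100 flows, which leaves little or nothing to aggregate.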
When the host processor has to perform a read/write operation, a data buffer has to be allocated in the user space. A read operation may be utilized to copy data from the file into this allocated buffer. A write operation may be utilized to transmit the contents of the buffer to a network. The OS kernel has to copy all data from the user space into the kernel space. Copy operations are CPU-intensive and memory-bandwidth-intensive, limiting system performance.
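The conventional read/write path may be sketched as follows, using a local socket pair in place of a real network connection. The names `send_file_with_copies` and `demo`, the buffer size, and the payload are hypothetical; the comments mark where the copies between kernel space and user space take place.

```python
import os
import socket
import tempfile


def send_file_with_copies(path: str, sock: socket.socket, buf_size: int = 4096) -> int:
    """Send a file through a user-space buffer; returns the bytes sent."""
    buf = bytearray(buf_size)          # data buffer allocated in user space
    view = memoryview(buf)
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)        # read: kernel copies file data into buf
            if not n:
                break
            sock.sendall(view[:n])     # write: kernel copies buf into socket buffers
            total += n
    return total


def demo(payload: bytes = b"x" * 8000) -> bytes:
    """Round-trip a small payload over a local socket pair (illustrative only)."""
    sender, receiver = socket.socketpair()
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(payload)
        tmp.close()
        send_file_with_copies(tmp.name, sender)
        sender.close()                 # signals end of stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        os.unlink(tmp.name)
```

Every byte in this sketch crosses the user/kernel boundary twice, once on the read and once on the write, which is the copy overhead described above.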
The host processing power may be consumed by the copying of data between user space and kernel space in the TCP/IP stack. Some solutions have been proposed to reduce the required host processing power. For example, utilizing remote direct memory access (RDMA) avoids memory copy in both transmit and receive directions. However, this requires a new application programming interface (API), a new wire protocol, and modifications to existing applications at both sides of the wire. A local DMA engine may be utilized to offload memory copy in both transmit and receive directions. Although a local DMA engine may offload copying operations from the CPU, it does not reduce the required memory bandwidth. The memory bandwidth may be a severe bottleneck in high-speed networking applications as platforms shift towards multiple-CPU architectures, with multiple cores in each CPU, all sharing the same memory.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.