There are different approaches for reducing the processing power required for TCP/IP stack processing. In a TCP Offload Engine (TOE), the offloading engine performs all or most of the TCP processing, presenting to the upper layer a stream of data. There may be various disadvantages to this approach. The TOE is tightly coupled with the operating system and therefore requires solutions that are dependent on the operating system and may require changes in the operating system to support it. The TOE may require a side-by-side stack solution, which requires some kind of manual configuration to select which of the TCP flows will be offloaded. The configuration may be performed by the application, for example, by explicitly specifying a socket address family for accelerated connections, or by an IT administrator, for example, by explicitly specifying an IP subnet address for accelerated connections. In addition, the offload engine is very complex as it needs to implement TCP packet processing.
Large segment offload (LSO)/transmit segment offload (TSO) may be utilized to reduce the required host processing power by reducing the transmit packet processing. In this approach, the host sends to the NIC transmit units that are bigger than the maximum transmission unit (MTU), and the NIC cuts them into segments according to the MTU. Since part of the host processing is linear in the number of transmitted units, this reduces the required host processing power. While efficient in reducing the transmit packet processing, LSO does not help with receive packet processing. In addition, for each single large transmit unit sent by the host, the host would receive from the far end multiple ACKs, one for each MTU-sized segment. The multiple ACKs consume scarce and expensive bandwidth, thereby reducing throughput and efficiency.
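By way of illustration only, the segmentation step described above may be sketched as follows. The function name segment_large_send, the 64 KiB transmit unit, and the 1460-byte per-segment payload size (a 1500-byte MTU less 40 bytes of IPv4 and TCP headers without options) are assumptions chosen for the example and are not taken from any particular NIC.

```python
def segment_large_send(payload: bytes, mss: int) -> list[bytes]:
    """Cut one large transmit unit into segments of at most mss bytes each."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


# Assumed example values: one 64 KiB transmit unit, 1460 payload bytes per segment.
large_unit = bytes(65536)
segments = segment_large_send(large_unit, 1460)

host_transmit_units = 1          # units processed by the host stack
wire_segments = len(segments)    # segments actually placed on the wire
print(host_transmit_units, wire_segments, len(segments[-1]))
```

Under these assumed values the host processes one transmit unit while 45 segments are placed on the wire, so a far end that acknowledges each segment individually would return 45 ACKs for that single unit.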
In large receive offload (LRO), a stateless receive offload mechanism, the TCP flows may be split into multiple hardware queues according to a hash function that guarantees that a specific TCP flow is always directed into the same hardware queue. For each hardware queue, the mechanism takes advantage of interrupt coalescing to scan the queue and aggregate subsequent packets on the queue belonging to the same TCP flow into a single large receive unit.
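A minimal sketch of such a flow-to-queue mapping is given below. The function queue_for_flow and the use of CRC32 over the connection 4-tuple are assumptions made for illustration; an actual NIC may use a different hash function. The only property relied upon is that the mapping is deterministic, so that a given flow always lands in the same queue.

```python
import zlib


def queue_for_flow(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                   num_queues: int) -> int:
    """Map a TCP 4-tuple to a hardware queue index (same flow, same queue)."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    # CRC32 is a stand-in hash here, not the hash of any specific device.
    return zlib.crc32(key) % num_queues


# Eight hypothetical flows spread over four hardware queues.
flows = [("10.0.0.1", 40000 + i, "10.0.0.2", 80) for i in range(8)]
queues = [queue_for_flow(*flow, num_queues=4) for flow in flows]
print(queues)
```

Because there are more flows than queues in this example, at least two flows necessarily share a queue, whatever hash function is chosen.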
While this mechanism does not require any additional hardware from the NIC besides multiple hardware queues, it may have various performance limitations. For example, if the number of flows is larger than the number of hardware queues, multiple flows fall into the same queue, resulting in no LRO aggregation for that queue. If the number of flows is larger than twice the number of hardware queues, no LRO aggregation is performed on any of the flows. The aggregation may be limited to the number of packets available to the host in one interrupt period. If the interrupt period is short, and the number of flows is not small, the number of packets that are available to the host CPU for aggregation on each flow may be small, resulting in limited or no LRO aggregation, even if the number of hardware queues is large. The LRO aggregation may be performed on the host CPU, resulting in additional processing. The driver may deliver to the TCP stack a linked list of buffers comprising a header buffer followed by a series of data buffers, which may require more processing than in the case where all the data is delivered contiguously in one buffer.
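The effect of queue sharing on aggregation may be illustrated with the following sketch. It assumes, for illustration only, that the host merges only packets that are adjacent on the queue, belong to the same flow, and carry contiguous sequence numbers; the Packet type and the function aggregate_queue are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    flow: str       # hypothetical flow identifier
    seq: int        # sequence number of the first payload byte
    payload: bytes


def aggregate_queue(packets: list[Packet]) -> list[Packet]:
    """Merge runs of adjacent, in-sequence packets of the same flow."""
    out: list[Packet] = []
    for p in packets:
        last = out[-1] if out else None
        if (last is not None and last.flow == p.flow
                and last.seq + len(last.payload) == p.seq):
            # Extend the current large receive unit with this packet.
            out[-1] = Packet(last.flow, last.seq, last.payload + p.payload)
        else:
            out.append(Packet(p.flow, p.seq, p.payload))
    return out


data = b"x" * 100
# One flow alone on a queue during an interrupt period.
single_flow = [Packet("A", 0, data), Packet("A", 100, data), Packet("A", 200, data)]
# Two flows interleaved on the same queue.
shared_queue = [Packet("A", 0, data), Packet("B", 0, data),
                Packet("A", 100, data), Packet("B", 100, data)]

print(len(aggregate_queue(single_flow)), len(aggregate_queue(shared_queue)))
```

In this sketch, three packets of a single flow collapse into one receive unit, whereas four interleaved packets from two flows sharing a queue remain four separate units.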
In general, one challenge faced by TCP implementers wishing to design a flow-through NIC is that TCP segments may arrive out-of-order with respect to the order in which they were transmitted. This may prevent or otherwise hinder the immediate processing of the TCP control data and prevent the placing of the data in a host buffer. Accordingly, an implementer may be faced with the option of dropping out-of-order TCP segments or storing the TCP segments locally on the NIC until all the missing segments have been received. Once all the TCP segments have been received, they may be reordered and processed accordingly. In instances where the TCP segments are dropped or otherwise discarded, the sending side may have to re-transmit all the dropped TCP segments, which in some instances may result in about a fifty percent (50%) decrease in throughput or bandwidth utilization.
Accordingly, the computational power of the offload engine needs to be very high, or at least the system needs a very large buffer, to compensate for any additional delays due to the delayed processing of the out-of-order segments. When host memory is used for temporary storage of out-of-order segments, additional system memory bandwidth may be consumed when the previously out-of-order segments are copied to their respective buffers. The additional copying presents a challenge for present memory subsystems, and as a result, these memory subsystems are unable to support high rates such as 10 Gbps.
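The option of storing out-of-order segments until the missing segments arrive may be sketched as follows. The class ReorderBuffer is a hypothetical name, and the sketch is deliberately simplified: it assumes non-overlapping segments and ignores sequence number wraparound, both of which a real TCP receiver has to handle.

```python
class ReorderBuffer:
    """Hold out-of-order segments until the missing bytes arrive."""

    def __init__(self, next_seq: int = 0) -> None:
        self.next_seq = next_seq              # next byte expected in order
        self.pending: dict[int, bytes] = {}   # seq -> payload held out of order

    def receive(self, seq: int, payload: bytes) -> bytes:
        """Accept one segment and return any bytes now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return b""                        # already delivered or already held
        self.pending[seq] = payload
        delivered = bytearray()
        # Drain every held segment that has become contiguous.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            delivered += chunk                # the extra copy out of temporary storage
            self.next_seq += len(chunk)
        return bytes(delivered)


buf = ReorderBuffer()
first = buf.receive(4, b"efgh")    # arrives early, must be held
held = len(buf.pending)
second = buf.receive(0, b"abcd")   # fills the gap, releases both segments
print(first, held, second)
```

The early segment has to be held in temporary storage and copied again when the gap is filled, which corresponds to the additional buffering and memory bandwidth discussed above.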
During conventional TCP processing, each of the plurality of TCP segments received would have to be individually processed by a host processor in the host system. TCP processing requires extensive CPU processing power in terms of both protocol processing and data placement on the receiver side. Current processing systems and methods involve the transfer of TCP state to dedicated hardware such as a NIC, where significant changes to the host TCP stack and/or underlying hardware are required.
The host processing power may be consumed by the copying of data between user space and kernel space in the TCP/IP stack. Some solutions have been proposed to reduce the host processing power. For example, utilizing remote direct memory access (RDMA) avoids memory copy in both transmit and receive directions. However, this requires a new application programming interface (API), a new wire protocol, and modifications to existing applications at both sides of the wire. A local DMA engine may be utilized to offload memory copy in both transmit and receive directions. Although a local DMA engine may offload copying operations from the CPU, it does not reduce the memory bandwidth required. The memory bandwidth may be a severe bottleneck in high-speed networking applications as platforms are shifting to multiple CPU architectures, with multiple cores in each CPU architecture, all sharing the same memory.
When the host processor has to perform a read/write operation, a data buffer has to be allocated in the user space. A read operation may be utilized to copy data from a file into this allocated buffer. A write operation may be utilized to transmit the contents of the buffer to a network. The OS kernel has to copy all data from the user space into the kernel space. Copy operations are CPU and memory bandwidth intensive, limiting system performance.
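The conventional read/write path described above may be sketched as follows. The function send_file_via_user_buffer and the chunk size are assumptions for illustration, and the language runtime may add buffering of its own beyond the copies noted in the comments.

```python
import os
import socket
import tempfile


def send_file_via_user_buffer(path: str, sock: socket.socket,
                              chunk_size: int = 4096) -> int:
    """Send a file by reading into a user-space buffer, then writing it out."""
    buf = bytearray(chunk_size)        # data buffer allocated in user space
    view = memoryview(buf)
    total = 0
    with open(path, "rb") as f:
        while True:
            n = f.readinto(buf)        # read: file data copied into the user buffer
            if not n:
                break
            sock.sendall(view[:n])     # write: user buffer copied into kernel space
            total += n
    return total


# Small self-contained demonstration over a local socket pair.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3000)
a, b = socket.socketpair()
sent = send_file_via_user_buffer(tmp.name, a, chunk_size=1024)
a.close()
received = b""
while chunk := b.recv(4096):
    received += chunk
b.close()
os.unlink(tmp.name)
print(sent, len(received))
```

Every byte in this sketch crosses the boundary between user space and kernel space twice, once on the read and once on the write, which is the copy overhead identified above.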
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.