The rapid growth in computer networking has spurred the development of ever-faster network media rates. For instance, over the last ten years, Ethernet-format maximum media rates have gone from 10 megabits-per-second (Mbps), to 100 Mbps (fast Ethernet), and now to 1000 Mbps (gigabit Ethernet). Future increases are planned to allow even faster network communications.
Traditionally, networked host computers have handled communication tasks at the network and transport layers (and some tasks at the link layer) using host software, while leaving the remaining link and physical layer communication tasks to an attached network adapter (which may also be partially implemented in host-resident driver software). Thus, for virtually every packet transmitted or received by the network adapter, the host processor must expend resources in handling packetization, header manipulation, data acknowledgment, and error control. At gigabit Ethernet speeds, even sophisticated server systems will often have a maximum network transmission rate limited by the ability of the host processor to handle its network and transport layer tasks, rather than by the speed of the physical connection. Consequently, host-implemented networking tasks can reduce bandwidth utilization and consume processor capacity that could otherwise be devoted to running applications.
Some network adapter vendors have attempted to increase network performance by offloading the entire transport and lower-layer protocol stack to the network adapter. This approach greatly eases the burden on the host processor, but increases the complexity and expense of the adapter. It also limits flexibility, limits upgradability, and makes platform-specific tailoring difficult. Such an adapter may also require that the entire network stack be rewritten to allow the hardware solution to integrate with the operating system.
Several less drastic modifications to the traditional division of labor between a host processor and a network adapter have also been proposed. One of the more appealing of these proposals is a feature known as “TCP segmentation offload” (see the Microsoft Windows 2000 Device Driver Development Kit for detailed information). Transmission Control Protocol/Internet Protocol (TCP/IP) is perhaps the most popular transport/network layer protocol suite in use today (see Network Working Group, RFC 791, Internet Protocol (1981); Network Working Group, RFC 793, Transmission Control Protocol (1981)). With TCP segmentation offload, the host processor can indicate to the network adapter that a large block of data is ready for TCP transmission, rather than passing numerous smaller TCP packets (each containing part of the large block of data) to the network adapter. The network adapter then segments the block of data into the smaller packets, builds the TCP, IP, and link-layer headers for each packet, and transmits the packets.
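The segmentation step described above can be illustrated with a minimal sketch. It is not taken from any particular adapter or driver: the names `Segment` and `segment_block` are hypothetical, header construction is reduced to tracking the TCP sequence number, and the 1460-byte maximum segment size is an assumed value (a 1500-byte Ethernet payload less 20-byte IP and 20-byte TCP headers without options).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One packet's worth of a larger block (hypothetical illustration type)."""
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes  # at most `mss` bytes taken from the original block


def segment_block(block: bytes, start_seq: int, mss: int = 1460) -> List[Segment]:
    """Split `block` into consecutive payloads of at most `mss` bytes.

    `mss` = 1460 is an assumed maximum segment size, not a value from the text.
    Sequence numbers advance by the payload length and wrap at 2**32, as TCP
    sequence numbers are 32 bits wide.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(block), mss):
        chunk = block[offset:offset + mss]
        seq = (start_seq + offset) % 2**32
        segments.append(Segment(seq=seq, payload=chunk))
    return segments


if __name__ == "__main__":
    # A 4000-byte block becomes two full segments and one partial segment.
    for seg in segment_block(bytes(4000), start_seq=1000):
        print(seg.seq, len(seg.payload))
```

In an actual offload, the adapter would also replicate and update the TCP, IP, and link-layer headers for each segment; the sketch omits that step to keep the segmentation logic visible.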
TCP segmentation offload benefits overall system performance in several ways. First, sending a large block of data requires fewer calls down through the software protocol stack than does sending multiple small blocks, thus reducing CPU utilization for a given workload. Second, when the headers are built in the network adapter hardware, header-building host overhead is avoided, and header information need only be transferred across the host bus once per block rather than once per packet, reducing latency and lowering bus utilization. Third, the network adapter hardware can reduce the number of host interrupts that it generates in order to indicate data transmission, in some instances down to one per block.
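As a rough illustration of these three factors, the following sketch counts per-block host costs with and without offload. It is a simplified model under stated assumptions, not a measurement: the function name `host_costs` is hypothetical, the 1460-byte segment size and 54-byte combined header length (14-byte Ethernet, 20-byte IP, 20-byte TCP) are assumed values, and the offload case takes the best case of one interrupt per block.

```python
import math


def host_costs(block_len: int, mss: int = 1460, header_len: int = 54,
               offload: bool = False) -> dict:
    """Count host-side work for transmitting one block (simplified model).

    Without offload, each cost is paid once per packet; with offload, once per
    block. `mss` and `header_len` are assumed values chosen for illustration.
    """
    packets = math.ceil(block_len / mss)
    units = 1 if offload else packets
    return {
        "stack_calls": units,                       # calls down the protocol stack
        "header_bytes_over_bus": units * header_len,  # header data crossing the host bus
        "tx_interrupts": units,                     # transmit-complete interrupts (best case)
    }


if __name__ == "__main__":
    print(host_costs(14600, offload=False))
    print(host_costs(14600, offload=True))
```

For a 14600-byte block (ten full-size packets under these assumptions), the model reports ten stack calls, 540 header bytes, and ten interrupts without offload, against one, 54, and one with offload.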
I have now recognized that, despite its benefits, TCP segmentation offload has several significant limitations. First, the size of the block offloaded cannot be larger than the receiving endpoint's TCP window size (typically equal to somewhere between two and ten maximum-sized Ethernet packets). Second, the host processor must still process roughly the same number of acknowledgment packets (ACKs) from the receiving endpoint (roughly one-half to one ACK per data packet sent), despite the segmentation offloading.
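Both limitations can be made concrete with a short calculation. The sketch below is illustrative only: the function name `offload_plan` is hypothetical, the 1460-byte segment size is an assumed value, and the ACK range simply applies the one-half to one ACK per data packet figure given above.

```python
import math


def offload_plan(total_bytes: int, window_bytes: int, mss: int = 1460) -> dict:
    """Estimate offload blocks and host-processed ACKs for one transfer.

    Each offloaded block is capped at the receiver's TCP window, and the host
    still sees between one ACK per two data packets and one ACK per packet.
    `mss` = 1460 is an assumed segment size.
    """
    if total_bytes <= 0 or window_bytes <= 0 or mss <= 0:
        raise ValueError("all sizes must be positive")
    block = min(total_bytes, window_bytes)           # window caps the block size
    full_blocks, remainder = divmod(total_bytes, block)
    blocks = full_blocks + (1 if remainder else 0)
    # Each block is segmented independently by the adapter.
    packets = full_blocks * math.ceil(block / mss) + math.ceil(remainder / mss)
    return {
        "block_bytes": block,
        "blocks": blocks,
        "data_packets": packets,
        "acks_min": math.ceil(packets / 2),  # about one ACK per two packets
        "acks_max": packets,                 # one ACK per packet
    }


if __name__ == "__main__":
    # 64 KiB to send, receiver window of eight 1460-byte packets.
    print(offload_plan(65536, window_bytes=8 * 1460))
```

Under these assumptions, a 64 KiB transfer to a receiver advertising an eight-packet window cannot be handed to the adapter as one block; it takes six blocks, and the host still handles between 23 and 45 ACKs for the 45 data packets sent.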