In recent years, as a result of the introduction of faster (e.g., multi-gigabit) network adapters, there has been a tremendous increase in available network bandwidth. Unfortunately, there has not been a comparable increase in the processing power of central processing units (CPUs) to take advantage of the available bandwidth. In particular, the processing of received packets is still a CPU-intensive task and a common bottleneck in network input/output (I/O). Many technologies have been developed to alleviate this problem, including: “checksum task offload,” which delegates the calculating and verifying of Internet Protocol (IP) and Transmission Control Protocol (TCP) checksums to the Network Interface Card (NIC) hardware; “TCP Chimney offload,” which offloads the handling of the entire TCP connection to the hardware; “Remote DMA” (RDMA), which makes it possible for a NIC to employ direct memory access (DMA) techniques to send incoming packets directly to the application buffer (without CPU assistance); and “Receive Side Scaling” (RSS), which distributes the processing of received packets across multiple processors.
One of the most CPU-intensive tasks during receive processing (i.e., the processing of packets received from a network) is copying an incoming packet from a NIC receive buffer to an application buffer. This copy results from the following process. At the time of receiving a network packet, NIC hardware does not know the final destination of the packet payload. Therefore, the hardware copies the packet to a temporary buffer (i.e., a NIC receive buffer). After TCP/IP processing of the packet identifies the application buffer (I/O request buffer) to which the packet payload should be copied, the CPU is utilized to copy the payload to the application buffer. A DMA engine can be used to perform this copy without CPU intervention, which frees up CPU processing power to perform other tasks, such as processing other incoming packets. Using the freed-up CPU processing time to process other incoming packets allows incoming packets to be processed at a faster rate overall, thereby improving the throughput of the network from which the packets are received.
DMA engines that can perform memory-to-memory DMA are now available on chips made by Intel Corp., and it is expected that DMA engines having this capability will become available from other vendors as well. Because the purpose of using a DMA engine is to free a CPU for other purposes, an interface with the DMA engine (i.e., a DMA interface) must support submitting a DMA copy request to the DMA engine on behalf of a processing entity (e.g., a TCP/IP stack) and returning control to the processing entity immediately, without waiting for the copy operation to finish. This requirement implies that the copy operation performed by the DMA engine must be “asynchronous.” That is, the copy happens concurrently with the CPU performing other tasks. For such asynchronous copying, a mechanism must be in place to discover when the copy operation is completed.
There are two common methods for handling completion of an asynchronous operation: polling and interrupt. In the “polling” method, the component that is performing the asynchronous operation (e.g., a DMA engine) updates a register (e.g., a completion status register) when the operation is completed. In this scheme, to positively confirm the completion of the asynchronous operation, the requesting entity or requester (e.g., the entity processing a packet) repeatedly polls the completion status register (e.g., by reading it) until the state of the register indicates that the operation is complete.
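The polling method can be illustrated with a minimal simulation. The sketch below is not an interface to any real DMA engine: the class `SimulatedDmaEngine`, its methods, and the timing values are hypothetical, a worker thread stands in for the hardware, and a `threading.Event` plays the role of the completion status register.

```python
import threading
import time


class SimulatedDmaEngine:
    """Hypothetical stand-in for a DMA engine; a worker thread plays the hardware."""

    def __init__(self):
        # Plays the role of the completion status register.
        self._complete = threading.Event()

    def submit_copy(self, src, dst, delay_s=0.01):
        """Start an asynchronous copy and return to the caller immediately."""
        self._complete.clear()

        def worker():
            time.sleep(delay_s)       # simulated transfer time (assumed value)
            dst[:len(src)] = src      # the copy itself
            self._complete.set()      # the "hardware" updates the status register

        threading.Thread(target=worker, daemon=True).start()

    def is_complete(self):
        """Read the completion status register once."""
        return self._complete.is_set()


def poll_until_complete(engine, interval_s=0.001):
    """Read the status register until it reports completion; return the number of reads."""
    polls = 1
    while not engine.is_complete():
        polls += 1
        time.sleep(interval_s)
    return polls


payload = b"packet payload"
app_buffer = bytearray(len(payload))

engine = SimulatedDmaEngine()
engine.submit_copy(payload, app_buffer)   # control returns at once
polls = poll_until_complete(engine)       # requester polls for completion
print(f"copy confirmed after {polls} poll(s)")
```

Because the requester in this sketch begins polling immediately after submitting the request, most of its reads find the register unchanged, which is the wasted work that makes early polling costly.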
In the “interrupt” method, the component that performs the asynchronous operation (e.g., a DMA engine) interrupts the CPU when the operation (e.g., a DMA copy) is complete. A DMA engine driver typically handles the interrupt for an asynchronous DMA copy operation. When the DMA copy is complete, the driver calls a pre-registered function in the requesting entity to notify the entity that the requested copy operation is complete.
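The notification path of the interrupt method can be sketched in the same simulated style. All names here (`SimulatedDmaDriver`, `register_completion_callback`, `copy_completed`) are hypothetical; no real interrupt is raised, and a worker thread stands in for both the DMA engine and the driver's interrupt handler.

```python
import threading
import time


class SimulatedDmaDriver:
    """Hypothetical DMA engine driver; a thread stands in for hardware and interrupt handler."""

    def __init__(self):
        self._on_complete = None

    def register_completion_callback(self, callback):
        """Record the function the requester wants called when a copy finishes."""
        self._on_complete = callback

    def submit_copy(self, src, dst, delay_s=0.01):
        """Start an asynchronous copy and return to the caller immediately."""

        def worker():
            time.sleep(delay_s)      # simulated transfer time (assumed value)
            dst[:len(src)] = src     # the copy itself
            # On real hardware the engine would interrupt the CPU at this point
            # and the driver's handler would run; here the callback is simply
            # invoked from the worker thread.
            if self._on_complete is not None:
                self._on_complete(dst)

        threading.Thread(target=worker, daemon=True).start()


done = threading.Event()
completed = []


def copy_completed(buffer):
    """Pre-registered function in the requesting entity."""
    completed.append(bytes(buffer))
    done.set()


payload = b"packet payload"
app_buffer = bytearray(len(payload))

driver = SimulatedDmaDriver()
driver.register_completion_callback(copy_completed)
driver.submit_copy(payload, app_buffer)   # control returns at once

# The requester never reads a status register; it is notified instead.
notified = done.wait(timeout=1.0)
```

The `done.wait` call exists only so that the example can observe the notification before exiting; a requesting entity in a real stack would continue with other work and react when the callback runs.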
There are advantages and disadvantages to both the polling method and the interrupt method. The polling method can be very inefficient if polling begins too early, because the requester spends CPU cycles repeatedly reading a register that has not yet changed, but it is very cheap if the requester polls only when the operation has most likely completed. The interrupt method is generally expensive because servicing an interrupt requires considerable processing by the host CPU, but it can be useful in situations where checking for completion happens infrequently. Both completion models can be used with a DMA engine.
As mentioned above, the cost of a polling method depends heavily on the timing of the poll operation. For an entity (e.g., a layer of a TCP/IP stack) that employs TCP processing to use a DMA engine efficiently, the entity must overlap the TCP processing that it performs (e.g., analyzing the received packet, finding the active connection to which the packet belongs, acknowledging the received packet to the sender, etc.) with the performance of the DMA copy by the DMA engine, and poll for completion at an appropriate time.
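The overlap described above can be sketched as follows. This is an illustration under stated assumptions, not an implementation of any particular TCP/IP stack: `dma_copy`, `tcp_processing`, `receive`, the packet dictionary layout, and the sleep durations are all hypothetical, the destination buffer is assumed to be known already, and a `ThreadPoolExecutor` worker stands in for the DMA engine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def dma_copy(src, dst, delay_s=0.01):
    """Hypothetical stand-in for the copy performed by the DMA engine."""
    time.sleep(delay_s)       # simulated transfer time (assumed value)
    dst[:len(src)] = src


def tcp_processing(packet):
    """Hypothetical stand-in for connection lookup, acknowledgment, and so on."""
    time.sleep(0.02)          # simulated CPU work (assumed value)
    return {
        "connection": packet["connection"],
        "ack": packet["seq"] + len(packet["payload"]),
    }


def receive(packet, app_buffer, executor):
    # 1. Submit the copy; control returns immediately.
    copy = executor.submit(dma_copy, packet["payload"], app_buffer)

    # 2. Perform the remaining TCP processing while the copy is in flight.
    result = tcp_processing(packet)

    # 3. Poll only now, when the copy has most likely finished.
    polls = 1
    while not copy.done():
        polls += 1
        time.sleep(0.001)
    copy.result()             # surface any error raised by the simulated copy
    return result, polls


packet = {"connection": "conn-1", "seq": 1000, "payload": b"packet payload"}
app_buffer = bytearray(len(packet["payload"]))

with ThreadPoolExecutor(max_workers=1) as executor:
    result, polls = receive(packet, app_buffer, executor)

print(f"ack={result['ack']}, polls={polls}")
```

Deferring the first poll until the overlapped processing is done means the requester usually finds the copy complete on an early read, so little CPU time is spent on polling.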