Computer applications that communicate over a network require a considerable amount of central processing unit (CPU) processing power to handle the complex, packet-based Transmission Control Protocol (TCP)/Internet Protocol (IP) suite. Each network packet must be processed through a protocol stack (multiple protocols) at both the transmit and receive ends of a network connection.
During protocol stack processing, multiple protocol headers (e.g., TCP, IP, Ethernet) are added in a specific order to the data payload at the transmit end of the connection. These headers are necessary for transmitting the data across the network. When a packet is received at the receive end of the connection, the packet is processed again through the protocol stack and the protocol headers are removed in the opposite order, until the data is recovered and available to a user.
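The add-then-strip ordering described above can be illustrated with a minimal sketch. The header values below are hypothetical text labels chosen for readability, not real wire formats, and the function names are assumptions made for this illustration.

```python
# Toy headers: hypothetical text labels, not real TCP/IP/Ethernet wire formats.
# They are listed in the order in which they are added on transmit.
HEADERS = [b"TCP|", b"IP|", b"ETH|"]


def encapsulate(payload: bytes) -> bytes:
    """Transmit side: prepend each header in order, so the last one added is outermost."""
    frame = payload
    for header in HEADERS:
        frame = header + frame
    return frame


def decapsulate(frame: bytes) -> bytes:
    """Receive side: strip the headers in the opposite order to recover the payload."""
    for header in reversed(HEADERS):
        if not frame.startswith(header):
            raise ValueError(f"expected header {header!r}")
        frame = frame[len(header):]
    return frame


frame = encapsulate(b"hello")
print(frame)               # b'ETH|IP|TCP|hello'
print(decapsulate(frame))  # b'hello'
```

Every packet passes through both loops once, which is why the per-packet cost grows with the number of protocol layers.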
Packet size is determined by the network maximum transmission unit (MTU). When data transmitted between two applications is longer than the MTU, the data is divided into multiple separate packets. More CPU resources are needed as the number of packets increases. When the speed of the network increases, the demands on the CPU escalate as well. Using a direct memory access (DMA) device can help free CPU resources by allowing the system to access system memory without CPU intervention. However, DMA does not reduce the CPU protocol stack processing and usually requires additional memory to organize received packets before sending them to the application memory. This step adds latency to the data transfer and takes up precious resources.
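As a rough illustration of how the packet count grows with the amount of data, the sketch below splits a buffer into MTU-limited payloads. The function name `segment` is hypothetical, and the 40-byte default header length assumes a 20-byte IP header plus a 20-byte TCP header with no options.

```python
def segment(data: bytes, mtu: int, header_len: int = 40) -> list[bytes]:
    """Split data into payloads that fit in one packet each.

    header_len is an assumption: 20 bytes of IP plus 20 bytes of TCP, no options.
    """
    max_payload = mtu - header_len
    if max_payload <= 0:
        raise ValueError("MTU must be larger than the header length")
    return [data[i:i + max_payload] for i in range(0, len(data), max_payload)]


packets = segment(bytes(64 * 1024), mtu=1500)
print(len(packets))      # 45 packets for 64 KiB at a 1500-byte MTU
print(len(packets[-1]))  # 1296 bytes in the final, partially filled packet
```

Each of those packets incurs its own pass through the protocol stack, so the CPU cost scales with the packet count rather than with a single transfer.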
TCP offload engine (TOE) devices have been developed to free CPU processing resources by performing some or all of the TCP and IP processing for the computer. The data payloads of the processed packets still need to be aggregated in order using a dedicated memory and then transferred to the application memory, because the application expects to receive the data in order. Normally a memory is used to hold “out of order” received packets until all the “holes” in the sequential data are filled. Thus, the offloading process does not eliminate the need for data aggregation.
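The aggregation step described above can be sketched as a small reorder buffer. This is a simplified illustration, not the behavior of any particular TOE device: the class name `ReorderBuffer` is hypothetical, and the sketch assumes byte-based sequence numbers and segments that do not overlap.

```python
class ReorderBuffer:
    """Holds out-of-order segments until the holes before them are filled.

    Simplifying assumptions: sequence numbers count bytes, and segments
    never overlap or partially repeat one another.
    """

    def __init__(self, start_seq: int = 0) -> None:
        self.next_seq = start_seq             # next byte the application expects
        self.pending: dict[int, bytes] = {}   # seq -> payload waiting for a hole to fill

    def receive(self, seq: int, payload: bytes) -> bytes:
        """Store one segment and return whatever can now be delivered in order."""
        if seq < self.next_seq:
            return b""  # already delivered; treated as a duplicate
        self.pending[seq] = payload
        delivered = bytearray()
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            delivered += chunk
            self.next_seq += len(chunk)
        return bytes(delivered)


buf = ReorderBuffer()
print(buf.receive(5, b"world"))  # b'' because bytes 0..4 are still missing
print(buf.receive(0, b"hello"))  # b'helloworld' once the hole is filled
```

The `pending` dictionary stands in for the dedicated memory, and every delivered byte is copied once more on its way to the application.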
Direct Data Placement (DDP) is a developing protocol described in the “DDP Protocol Specification,” published by the Internet Engineering Task Force (IETF) working group on Oct. 21, 2002. DDP may enable an Upper Layer Protocol (ULP) to send data to a Data Sink (i.e., a computer or any other medium capable of receiving data) without requiring the Data Sink to place the data in an intermediate buffer. When data arrives at the Data Sink, a network interface card (NIC) can place the data directly into the ULP's receive buffer. This may enable the Data Sink to consume substantially less memory bandwidth than a buffered model, because the Data Sink is not required to move the data from an intermediate buffer to the final destination. It can also enable the network protocol to consume substantially fewer CPU cycles than if the CPU were used to move the data, and it removes the bandwidth limitation of being able to move data only as fast as the CPU can copy it.
DDP is much harder to achieve with network applications over TCP/IP (where, for example, data can arrive out of order) because of the nature of the sockets application programming interface (API) used by applications. One protocol that does achieve DDP over TCP/IP is iSCSI, which transports the SCSI storage protocol over TCP/IP. The iSCSI protocol benefits from the fact that storage applications generally do not use the sockets API and are required to provide buffers for all data before it is received from the network. The iSCSI protocol uses tags that indicate exactly where received data should be placed and has mechanisms to limit the expense of dealing with out-of-order TCP/IP data. However, iSCSI is a network storage protocol, not a general communication protocol.
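The idea of tagged placement into a buffer provided in advance can be sketched as follows. This is an illustration only and does not reproduce the iSCSI or DDP wire formats: the function name `place_direct` is hypothetical, and the sketch assumes each arriving segment carries the byte offset at which its payload belongs.

```python
def place_direct(receive_buffer: bytearray, offset: int, payload: bytes) -> None:
    """Write a payload straight into its final position in a pre-posted buffer.

    Assumption: the offset arrives with the segment, so no intermediate
    buffer is needed even when segments arrive out of order.
    """
    end = offset + len(payload)
    if offset < 0 or end > len(receive_buffer):
        raise ValueError("segment falls outside the posted buffer")
    receive_buffer[offset:end] = payload


posted = bytearray(10)            # buffer the application provided ahead of time
place_direct(posted, 5, b"world")  # arrives first, lands at its final position
place_direct(posted, 0, b"hello")  # arrives second, fills the earlier range
print(bytes(posted))               # b'helloworld'
```

Because the destination is known on arrival, the order of arrival affects only when the buffer is complete, not where the data is written.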
Various attempts to solve some of the problems above are known in the art. For example, US Patent Application No. 20040249998 by Rajagopalan et al. deals with uploading TCP frame data to user buffers and buffers in system memory. The payload data is uploaded to user buffers in system memory and partially processed frame data is uploaded to legacy buffers allocated in operating system memory space. U.S. Pat. No. 7,012,918 to Williams deals with DDP, disclosing a system comprising a host and a NIC or host bus adapter. The host is configured to perform transport protocol processing. The NIC is configured to directly place data from a network into a buffer memory in the host. U.S. Pat. No. 7,010,626 to Kahle deals with data pre-fetch, disclosing a method and an apparatus for pre-fetching data from a system memory to a cache for a DMA mechanism in a computer system. U.S. Pat. No. 6,996,070 to Starr et al. deals with a TCP/IP offload device with reduced sequential processing and discloses a TOE device that includes a state machine that performs TCP/IP protocol processing operations in parallel. Even where some of these solutions write the data directly to memory, they either need to process the TCP stack or use additional memory for data payload aggregation.
There is therefore a widely recognized need for, and it would be highly advantageous to have, new and efficient ways to access an application memory. In particular, there is a need for inexpensive solutions or methods of transferring data to and from the application memory with less CPU processing power or less dedicated processing time for protocol processing, and with minimum latency.