One classic development in computing systems is direct memory access (DMA), in which a device accesses the main memory directly while the CPU is free to perform other tasks. In a network with remote direct memory access (RDMA), for example, the network 110 shown in FIG. 1, data transfer between the two computers or devices 100a and 100b can be achieved by the following process: the sending host channel adapter (HCA) 106a uses DMA to read data from a user-specified buffer in the main memory 104a and transmits the data as a self-contained message across the network 110; the receiving HCA 106b then uses DMA to place the data into another user-specified buffer in the main memory 104b. Throughout this process, there is no intermediary copying, and all of these actions occur without involvement of the CPUs 102a and 102b, which has the added benefit of lower CPU utilization. As demonstrated in FIG. 1, RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, thereby eliminating the need to copy data between the application memory and data buffers in the operating system, and requiring no work from CPUs or caches and no context switches. In other words, RDMA allows data transfers to proceed in parallel with other system operations and thus reduces latency in message transfers across the network.
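The transfer path described above can be illustrated with a toy model. The following is a minimal sketch only: the class names `Host` and `HCA`, the methods `send` and `receive`, and the buffer offsets are hypothetical names introduced for illustration, and an in-process byte string stands in for the network 110. It is not an RDMA API and does not model real adapter hardware.

```python
class Host:
    """Toy stand-in for a computer with main memory (104a / 104b in FIG. 1)."""

    def __init__(self, size):
        self.memory = bytearray(size)


class HCA:
    """Toy stand-in for a host channel adapter (106a / 106b in FIG. 1)."""

    def __init__(self, host):
        self.host = host

    def send(self, offset, length):
        # "DMA read": take the data straight from the user-specified buffer.
        # No operating-system staging buffer is modeled on the host side.
        view = memoryview(self.host.memory)[offset:offset + length]
        return bytes(view)  # the self-contained message placed on the wire

    def receive(self, message, offset):
        # "DMA write": place the message directly into the user-specified
        # buffer of the receiving host's main memory.
        view = memoryview(self.host.memory)
        view[offset:offset + len(message)] = message


host_a, host_b = Host(64), Host(64)
hca_a, hca_b = HCA(host_a), HCA(host_b)

host_a.memory[0:5] = b"hello"          # application fills its send buffer
message = hca_a.send(offset=0, length=5)
hca_b.receive(message, offset=8)       # lands in the receiver's buffer
```

In this sketch neither host object runs any code between filling the send buffer and finding the data in the receive buffer, which mirrors the point that the CPUs 102a and 102b are not involved in the transfer itself.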
However, the adoption of RDMA is currently limited by the need to install a different networking infrastructure. To solve this problem, new standards have been developed that enable RDMA implementations using Ethernet at the physical layer and existing protocols such as TCP/IP for transport. As a result, the performance and latency advantages of RDMA can be combined with a low-cost, standards-based solution.
Among these standards, one protocol is called Internet Wide Area RDMA Protocol (iWARP), which essentially implements RDMA over TCP/IP. The iWARP protocol is typically implemented in hardware RDMA NICs because a kernel implementation of the TCP stack is seen as a bottleneck. Furthermore, the handling of iWARP-specific protocol details is often isolated from the TCP implementation so that the NIC can be used both for RDMA offload and for TCP offload; the portion of the hardware that implements the TCP protocol is known as the TCP Offload Engine (TOE). Another standard that enables RDMA implementations over Ethernet is RDMA over Converged Ethernet (RoCE). This protocol essentially implements RDMA over the InfiniBand® Architecture (IBA) by utilizing the transport services defined in the InfiniBand® Architecture Specification (e.g., InfiniBand® Architecture Specification Volume 1, Release 1.2.1 and Supplement to InfiniBand® Architecture Specification Volume 1, Release 1.2.1 - RoCE Annex A16), including Reliable Connected (RC) service, Reliable Datagram (RD) service, Unreliable Connected (UC) service, Unreliable Datagram (UD) service and Extended Reliable Connected (XRC) service.
In comparison, the above two RDMA implementations, one with RoCE and the other with iWARP, both operate on top of reliable transport services, IBA and TCP, respectively, although each protocol may define different packet formats, headers, verbs, etc. For example, the complete iWARP header stack consists of an Ethernet header, IP/TCP headers and MPA/DDP/RDMAP headers, while the RoCE header stack consists of an Ethernet header, a Global Routing Header (GRH) and IBA transport headers. In particular, the IBA transport headers include a Base Transport Header (BTH) that contains a Packet Sequence Number (PSN), which is used by reliable transport services in both the request and response directions to determine packet delivery order and to detect duplicate packets and out-of-sequence or missing packets so that they are not processed. Unlike TCP, IBA does not support selective packet retransmission or out-of-order reception of packets. In addition, IBA provides a packet boundary on MTU size for a given connection. This simplifies packet retransmission and reception handling at the transport level, which renders RoCE a more hardware-friendly solution. A need also arises to further improve RoCE implementations in hardware, e.g., a network adapter, for accelerating RoCE packet sequence transmission and reception.
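The role of the PSN described above can be sketched in a few lines. The following is an illustrative simplification rather than the procedure defined in the InfiniBand® Architecture Specification: it assumes a 24-bit PSN that wraps around, and it treats any PSN up to half the sequence space ahead of the expected value as out-of-sequence and anything else as a duplicate. The names `classify_psn` and `Receiver` are hypothetical and introduced only for this example.

```python
PSN_BITS = 24                 # assumed PSN width in the BTH
PSN_MOD = 1 << PSN_BITS       # PSNs wrap around modulo 2**24
HALF_WINDOW = PSN_MOD // 2    # simplifying split between "ahead" and "behind"


def classify_psn(expected, received):
    """Classify a received PSN relative to the next expected PSN."""
    delta = (received - expected) % PSN_MOD
    if delta == 0:
        return "in_order"
    if delta < HALF_WINDOW:
        # The packet is ahead of what was expected: one or more packets
        # are missing, so this one must not be processed.
        return "out_of_sequence"
    # The packet is behind the expected PSN: it was already received.
    return "duplicate"


class Receiver:
    """Accepts packets strictly in order; no out-of-order reception."""

    def __init__(self, start_psn=0):
        self.expected = start_psn % PSN_MOD
        self.delivered = []

    def on_packet(self, psn, payload):
        verdict = classify_psn(self.expected, psn)
        if verdict == "in_order":
            self.delivered.append(payload)
            self.expected = (self.expected + 1) % PSN_MOD
        return verdict


rx = Receiver(start_psn=PSN_MOD - 1)        # start just before wrap-around
results = [
    rx.on_packet(PSN_MOD - 1, b"first"),    # in order
    rx.on_packet(1, b"third"),              # PSN 0 is missing
    rx.on_packet(0, b"second"),             # in order after the wrap
    rx.on_packet(PSN_MOD - 1, b"first"),    # retransmitted duplicate
]
```

Because the receiver only ever tracks a single expected PSN and never buffers packets that arrive ahead of it, the state per connection stays small, which is consistent with the observation that the absence of selective retransmission and out-of-order reception makes the transport easier to implement in hardware.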