The present application is in the field of Remote Direct Memory Access (RDMA) and, in particular, relates to RDMA WRITE completion indications.
Remote Direct Memory Access (RDMA) is a protocol via which data can be moved directly from the memory of a first computing device to the memory of a second computing device, coupled to the first computing device via a network, generally without involvement of the operating system of either computing device. More generically, this is known as “kernel bypass.” This permits high-throughput and low-latency networking.
RDMA provides a channel interface to an application running on the first computing device, traditionally providing the following three RDMA data transfer mechanisms:
    RDMA WRITE
    RDMA READ
    Sequenced reliable datagram (Send)
For example, considering the first computing device to be “Station A” and the second computing device to be “Station B,” an RDMA WRITE data transfer operates to transfer data directly from a source buffer of Station A to a sink buffer of Station B. In particular, an application on Station A may post a write work request (WR) into a Send Work Request queue (SQ), and Station A then notifies an RDMA Network Interface Controller (RDMA NIC) attached to Station A, such as by a doorbell mechanism, that a work request is available in the SQ to be processed. The RDMA NIC fetches the work request, which specifies that an RDMA WRITE operation to Station B is to be performed for the payload pointed to by the SQ WR. The RDMA NIC subsequently performs one or more DMA read operations, encapsulates the data within headers appropriate for communication over the network (e.g., Ethernet/TCP/IP packet(s)) and sends the encapsulated data (e.g., an Ethernet frame) over the network to Station B.
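The sequence described above (post a WR to the SQ, ring the doorbell, fetch the WR, DMA read, encapsulate, send) can be illustrated with a small software model. This is only a toy sketch and not a real RDMA NIC or driver interface: the names `WorkRequest`, `ToyRdmaNic`, `ring_doorbell`, `deliver` and the 4-byte `mtu` are all hypothetical, and the "network" is simply a Python list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkRequest:
    opcode: str        # e.g. "RDMA_WRITE"
    src_offset: int    # start of the payload in Station A's memory
    length: int        # payload length in bytes
    sink_offset: int   # where the payload should land in Station B's memory


@dataclass
class ToyRdmaNic:
    host_memory: bytearray                        # memory of the attached station
    send_queue: deque = field(default_factory=deque)
    mtu: int = 4                                  # tiny per-frame payload, for illustration only
    wire: list = field(default_factory=list)      # frames "sent" over the network

    def ring_doorbell(self):
        """Host notification: process every work request waiting in the SQ."""
        while self.send_queue:
            wr = self.send_queue.popleft()        # NIC fetches the work request
            for off in range(0, wr.length, self.mtu):
                n = min(self.mtu, wr.length - off)
                start = wr.src_offset + off
                payload = bytes(self.host_memory[start:start + n])   # DMA read
                header = (wr.opcode, wr.sink_offset + off, n)        # stand-in for network headers
                self.wire.append((header, payload))                  # "send" the frame


def deliver(wire, sink_memory):
    """Station B side: place each payload directly into the sink buffer."""
    for (_opcode, offset, n), payload in wire:
        sink_memory[offset:offset + n] = payload


station_a = bytearray(b"hello, rdma!")
station_b = bytearray(16)
nic_a = ToyRdmaNic(host_memory=station_a)
nic_a.send_queue.append(WorkRequest("RDMA_WRITE", 0, len(station_a), 2))
nic_a.ring_doorbell()
deliver(nic_a.wire, station_b)
```

In this model the 12-byte source buffer is split into three frames and written at offset 2 of Station B's buffer, without any code on the Station B "host" touching the data path other than the placement step.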
According to various RDMA protocols, when the RDMA WR operation has completed, a completion queue entry (CQE) is placed in the WC (Work Request Completion Queue) of the station that posted the RDMA WR. That is, the consumer (which in general is an application or an upper layer protocol (ULP)) can request that a completion be generated. Generally, an implementation generates a completion from the hardware into the completion queue. However, the device driver/library will only indicate completion for operations for which a completion request has been made or that have an implicit completion (such as RDMA READs).
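The filtering behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the hardware is assumed to generate a CQE for every finished work request, and the driver/library reports only those that were requested or are implicit. The names `hardware_complete`, `poll_completions`, the `signaled` flag and `IMPLICIT_COMPLETION_OPCODES` are hypothetical and do not correspond to any particular verbs library.

```python
from collections import deque

# Assumption taken from the text: RDMA READs have an implicit completion.
IMPLICIT_COMPLETION_OPCODES = {"RDMA_READ"}


def hardware_complete(hw_cq, wr_id, opcode, signaled):
    """Hardware side: a CQE is generated for every finished work request."""
    hw_cq.append({"wr_id": wr_id, "opcode": opcode, "signaled": signaled})


def poll_completions(hw_cq):
    """Driver/library side: report only requested or implicit completions."""
    reported = []
    while hw_cq:
        cqe = hw_cq.popleft()
        if cqe["signaled"] or cqe["opcode"] in IMPLICIT_COMPLETION_OPCODES:
            reported.append(cqe["wr_id"])
    return reported


cq = deque()
hardware_complete(cq, wr_id=1, opcode="RDMA_WRITE", signaled=False)  # consumed silently
hardware_complete(cq, wr_id=2, opcode="RDMA_WRITE", signaled=True)   # completion requested
hardware_complete(cq, wr_id=3, opcode="RDMA_READ", signaled=False)   # implicit completion
visible = poll_completions(cq)
```

Here only work requests 2 and 3 are indicated to the consumer, although the hardware produced an entry for all three.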
While the above general discussion is correct for various different RDMA protocols, such as IETF RDDP and InfiniBand, particular protocol specifications dictate different particular completion semantics. Referring still to the example of the RDMA WRITE data transfer to transfer data directly from a source buffer of Station A to a sink buffer of Station B, the CQE can be created, for example, when:
    the last byte of the source data reaches the RDMA NIC associated with Station A (this is the IETF RDDP specification for RDMA NIC semantics); or
    the last byte of the source data reaches the RDMA NIC associated with Station B (this is the InfiniBand specification for RDMA NIC semantics); or
    the last byte of the data reaches the sink memory of Station B.
Other options are possible as well.
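The three completion points listed above lie in order along the data path, so a CQE created at a later point implies that the earlier points have also been passed. A minimal sketch of that ordering follows; the enum `CompletionPoint` and the helper names are hypothetical and serve only to make the comparison explicit.

```python
from enum import Enum


class CompletionPoint(Enum):
    SOURCE_NIC = 1    # last byte reached the RDMA NIC of Station A (IETF RDDP)
    SINK_NIC = 2      # last byte reached the RDMA NIC of Station B (InfiniBand)
    SINK_MEMORY = 3   # last byte reached the sink memory of Station B


# Order in which the last byte of the payload passes each point.
PATH = [
    CompletionPoint.SOURCE_NIC,
    CompletionPoint.SINK_NIC,
    CompletionPoint.SINK_MEMORY,
]


def cqe_step(semantics):
    """Return the 1-based step along PATH at which the CQE would be created."""
    return PATH.index(semantics) + 1


def guaranteed_at_completion(semantics):
    """Points the last byte is known to have reached when the CQE appears."""
    return PATH[:cqe_step(semantics)]
```

For example, `guaranteed_at_completion(CompletionPoint.SOURCE_NIC)` contains only the source NIC, which is why a completion under those semantics says nothing about whether the data has arrived at Station B.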
Thus, for example, under IETF RDDP completion semantics, an RDMA WRITE completes at the source, Station A, as soon as the source buffer has been DMA read and the RDMA NIC no longer needs to access the source buffer. The Upper Layer Protocol (ULP)/Application is free to reuse the buffer (and potentially change its contents) as soon as the RDMA WRITE has completed. If a transport error prevents the source data from being delivered from the source RDMA NIC of Station A to the sink RDMA NIC of Station B, then the source RDMA NIC raises an asynchronous error to inform the ULP/Application about the failure, and places an indication of the event in the asynchronous error queue (AE). As another example, the RDMA NIC on Station A could itself fail without Station A failing; in that case, no AE is generated, and it would thus not be known in general whether the data reached Station B.
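The three outcomes described above (successful delivery, transport error with an asynchronous error, and NIC failure with no notification at all) can be summarized in a toy model. This is an illustrative sketch only: the function `rddp_write`, its parameters, and the string labels are hypothetical, and `None` is used here to mean "unknown to Station A".

```python
def rddp_write(transport_ok, nic_alive_after_dma):
    """Toy outcome of one RDMA WRITE under source-side completion semantics."""
    cq, ae_queue = [], []

    # The CQE is created once the source buffer has been DMA read,
    # so the ULP/Application may already reuse the buffer at this point.
    cq.append("WRITE_COMPLETE")

    if not nic_alive_after_dma:
        # The NIC itself failed: nothing is left to raise an asynchronous
        # error, so Station A cannot tell whether the data arrived.
        return {"cq": cq, "ae": ae_queue, "delivered": None}

    if not transport_ok:
        # Delivery failed after completion: reported asynchronously.
        ae_queue.append("TRANSPORT_ERROR")
        return {"cq": cq, "ae": ae_queue, "delivered": False}

    return {"cq": cq, "ae": ae_queue, "delivered": True}


ok = rddp_write(transport_ok=True, nic_alive_after_dma=True)
transport_failure = rddp_write(transport_ok=False, nic_alive_after_dma=True)
nic_failure = rddp_write(transport_ok=True, nic_alive_after_dma=False)
```

In all three cases the completion queue looks identical to the consumer; only the asynchronous error queue, when the NIC survives to fill it, distinguishes a failed delivery from a successful one.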