The present invention relates to remote direct memory access (RDMA) and, more specifically, to optimization of RDMA with cache-aligned operations.
In computing, RDMA relates to direct memory access operations from the real local memory of one computer into the real local memory of another computer without the need to involve certain components of either computer's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. Applications of RDMA support zero-copy networking by enabling local network adapters to transfer data directly to or from application memory, thereby eliminating a need to copy data between application memory and data buffers in the operating system. Such transfers require no work by central processing units (CPUs), involve no caches or context switches, and can continue in parallel with other system operations. That is, when an application performs an RDMA read or write operation, the relevant application data is delivered directly to the peer's physical memory via the network to reduce latency and enable fast message or data transfer.
RDMA technology broadly supports write, read and autonomous updates of computer system memory, and there are many communication protocols that allow application programming interfaces (APIs) to enable exploitation of RDMA-based technology over various communications media, such as InfiniBand, Ethernet and long-distance networks (WANs). However, when RDMA-based technology is to be exploited, there are numerous performance considerations relating to remote memory access processes that should be addressed.
One such consideration is that RDMA operations (e.g., RDMA-write accesses) should, if possible, be handled on a processor cache line basis, and this applies to both the local and the remote hosts. That is, when data is written to a remote peer's memory, it may be beneficial to perform write operations on a cache line boundary and on a full cache line basis (as opposed to non-aligned or partial write operations) whenever possible, since the penalty for not aligning the write operations can be moderate to severe latency with respect to the local host computer's DMA operations to the local host memory sub-system. Indeed, an unaligned large write operation can result in hundreds of unaligned DMA write operations (depending on total transfer size and packet size), with the eventual amount of latency varying based on the remote peer's platform hardware and memory sub-system (i.e., the remote peer's adapter card, PCIe bus, memory sub-system architecture, etc.).
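By way of illustration only, the alignment consideration described above can be sketched as follows. The sketch assumes a 64-byte cache line and a 4096-byte packet payload; both values, and the function names `align_up`, `split_write` and `count_unaligned_packets`, are hypothetical choices made for this example and are not taken from any particular platform or RDMA API. It shows how a write may be divided into an unaligned head, a cache-line-aligned body and a partial tail, and how many packets of a large transfer would begin at an unaligned destination address.

```python
# Assumed cache line size in bytes; real platforms vary.
CACHE_LINE = 64


def align_up(addr, line=CACHE_LINE):
    """Round addr up to the next multiple of the cache line size."""
    return (addr + line - 1) // line * line


def split_write(addr, length, line=CACHE_LINE):
    """Split the byte range [addr, addr + length) into three byte counts:
    an unaligned head, a body made of full cache lines, and a partial tail."""
    end = addr + length
    body_start = min(align_up(addr, line), end)
    body_end = max(end // line * line, body_start)
    return body_start - addr, body_end - body_start, end - body_end


def count_unaligned_packets(addr, length, payload, line=CACHE_LINE):
    """Count the packets of a transfer whose destination start address
    does not fall on a cache line boundary. `payload` is the assumed
    number of data bytes carried per packet."""
    count = 0
    offset = 0
    while offset < length:
        if (addr + offset) % line != 0:
            count += 1
        offset += payload
    return count


if __name__ == "__main__":
    # A 200-byte write starting 16 bytes past a cache line boundary.
    print(split_write(0x1010, 200))
    # A 1 MiB write with 4096-byte packets, unaligned versus aligned start.
    print(count_unaligned_packets(0x1010, 1 << 20, 4096))
    print(count_unaligned_packets(0x1000, 1 << 20, 4096))
```

Under these assumptions, a 1 MiB write that starts 16 bytes past a cache line boundary leaves every one of its 256 packets beginning at an unaligned address, whereas the same write starting on a boundary has none, which is consistent with the observation that a single unaligned large write can produce hundreds of unaligned DMA write operations.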
The latency injected into DMA operations can cause local congestion that results in overall network latency and even packet loss, which in turn results in retransmission, pause frames and other congestion control actions that lead to poor overall performance.