Remote Direct Memory Access (RDMA) is an existing standard that supports one-sided memory-transfer operations that access data within user buffers. RDMA is a computer-to-computer transfer of data: data is transmitted from the memory of one computer to the memory of another computer, typically over a network. This is unlike direct memory access (DMA), which is performed internally within a single computer.
RDMA provides for “zero-copy” networking, which lets a network interface controller (NIC) in a computer transfer data directly to or from application memory, eliminating the need to copy data between application memory and kernel buffers in the operating system (OS). This permits high-throughput, low-latency networking, which is especially useful for clustering and storage in data centers. Note that “zero-copy” is sometimes referred to as copy free. In addition to being copy free, RDMA provides for autonomous data transfer. Autonomous means that a remote process is not interrupted while an RDMA get or put is in progress.
RDMA was first implemented for InfiniBand, followed by an iWARP implementation for Ethernet. In these implementations, RDMA involves two steps: registration and the actual put or get (i.e., write or read, respectively). Registration prepares a virtual buffer for RDMA. To register a buffer, a task or application makes a system call into a kernel component of the RDMA service. The kernel initializes the control data and creates a handle that encodes buffer access rights. It then swaps in and pins all buffer pages. The kernel component sends the NIC the mapping between the handle and the pinned physical pages and waits for a reply. In other words, the kernel pins all the pages of the virtual buffer, and the NIC receives the virtual-to-physical address mapping for that buffer. Finally, the buffer handle is passed back to the local user process, which in turn sends it to a remote process on another computer system to be used in a subsequent RDMA get or put request.
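The registration sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the interface of any real RDMA stack: the class names ToyNic and ToyKernel, the methods install_mapping and register, the 4096-byte page size, and the use of small integers as handles and physical frame numbers are all hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


class ToyNic:
    """Hypothetical NIC that stores handle -> (rights, pinned frames)."""

    def __init__(self):
        self.table = {}

    def install_mapping(self, handle, rights, frames):
        # Record the mapping and return the reply the kernel waits for.
        self.table[handle] = (rights, tuple(frames))
        return True


class ToyKernel:
    """Hypothetical kernel component of the RDMA service."""

    def __init__(self, nic, physical_frames):
        self.nic = nic
        self.free_frames = list(range(physical_frames))
        self.pinned = {}       # handle -> frames pinned for that buffer
        self.next_handle = 1

    def register(self, virt_addr, length, rights="rw"):
        # Count every page the virtual buffer touches, including
        # partially covered pages at either end.
        first_page = virt_addr // PAGE_SIZE
        last_page = (virt_addr + length - 1) // PAGE_SIZE
        n_pages = last_page - first_page + 1
        if n_pages > len(self.free_frames):
            raise MemoryError("not enough physical memory to pin buffer")

        # Create a handle; the access rights travel with it to the NIC.
        handle = self.next_handle
        self.next_handle += 1

        # "Swap in and pin": take frames out of the free pool up front,
        # regardless of when the transfer will actually happen.
        frames = [self.free_frames.pop(0) for _ in range(n_pages)]
        self.pinned[handle] = frames

        # Send the handle-to-frames mapping to the NIC and wait for a reply.
        if not self.nic.install_mapping(handle, rights, frames):
            raise RuntimeError("NIC rejected the mapping")

        # The handle goes back to the user process, which would then
        # send it to the remote process for a later get or put.
        return handle


kernel = ToyKernel(ToyNic(), physical_frames=8)
handle = kernel.register(virt_addr=0x1000, length=3 * PAGE_SIZE)
print(handle, kernel.nic.table[handle], len(kernel.free_frames))
```

Note that in this sketch all frames leave the free pool at registration time, before any get or put is issued, which is the pre-pinning behavior at issue.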
This registration procedure is not only very expensive in terms of CPU cycles, but it also pre-pins all pages regardless of how far in the future they are needed for RDMA. The pre-pinning operation is performed to make sure that the memory area involved in the communications will be kept in main memory, at least for the duration of the transfer. This approach facilitates OS-bypass, because RDMA includes direct access to application data without disturbing the OS. However, this approach wastes physical memory, and the total size of registered buffers cannot exceed the physical memory size. This leads to several major drawbacks.
One major drawback is that the pre-pinning of pages may cause the OS to crash due to lack of available memory. For example, if many processes on a local computer each register a virtual buffer for RDMA, a large amount of the system memory is pinned, even if the pinned pages are not going to be used for a significant number of cycles. This may reduce the amount of available memory below the amount needed by the OS to operate, resulting in the OS crashing.
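The effect can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the function names, the 4096-byte page size, the 16 GiB machine, the 1 GiB buffers, and the 2 GiB the OS is assumed to need are all hypothetical figures chosen for the example.

```python
GIB = 1024 ** 3


def pinned_bytes(num_processes, buffer_bytes, page_size=4096):
    # Pinning is done in whole pages, so round each buffer up to a
    # page boundary before multiplying by the number of processes.
    pages_per_buffer = -(-buffer_bytes // page_size)  # ceiling division
    return num_processes * pages_per_buffer * page_size


def memory_left_for_os(physical_bytes, num_processes, buffer_bytes):
    # Memory not pinned by registrations is all that remains for the
    # OS and for every unpinned page in the system.
    return physical_bytes - pinned_bytes(num_processes, buffer_bytes)


physical = 16 * GIB    # hypothetical machine
os_minimum = 2 * GIB   # hypothetical amount the OS needs to operate

for n in (4, 8, 15):
    left = memory_left_for_os(physical, n, 1 * GIB)
    status = "ok" if left >= os_minimum else "below OS minimum"
    print(f"{n} processes: {left // GIB} GiB unpinned ({status})")
```

Under these assumed numbers, the unpinned memory falls below the OS minimum once 15 processes have each registered a buffer, even if none of them has started a transfer.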
Another drawback is that pre-pinning complicates software, as programmers must tune applications to work with a known amount of physical memory. While this may be reasonable for high-performance computing, it is less desirable within data centers. Such tuning affects portability, as applications that run well on one platform may not execute on machines with less physical memory. It is also hard to time-multiplex complex applications, because RDMA memory is not virtualized and physical memory constraints must be jointly satisfied across all processes.