1. Field of the Invention
This invention generally relates to memory management in computer systems, and more specifically, to methods and systems for registration and deregistration of memory pages. The preferred embodiment of the invention relates to such methods and systems for use in multi-node, distributed computer systems that employ remote direct memory access to transfer data between the nodes.
2. Background Art
An important factor in the performance of a computer or a network of computers is the ease or difficulty with which data is accessed when needed during processing. To this end, direct memory access (DMA) was developed early on to relieve the central processing unit (CPU) of a computer of having to manage transfers of data between long-term memory, such as magnetic or optical memory, and short-term memory, such as dynamic random access memory (DRAM), static random access memory (SRAM) or cache of the computer. Accordingly, memory controllers such as DMA controllers, cache controllers, hard disk controllers and optical disc controllers were developed to manage the transfer of data between such memory units, allowing the CPU to spend more time processing the accessed data. Such memory controllers manage the movement of data between the aforementioned memory units in a manner that is either independent or semi-independent of the operation of the CPU, through commands and responses that are exchanged between the CPU and the respective memory controller by way of one or more lower protocol layers of an operating system that operate in the background and consume few CPU resources (time, memory).
However, in the case of networked computers, access to data located on other computers, referred to as “nodes”, has traditionally required management by an upper communication protocol layer running on the CPU of a node on the network. The lower layers of traditional asynchronous packet mode protocols, e.g., User Datagram Protocol (UDP) and Transmission Control Protocol/Internet Protocol (TCP/IP), which run on a network adapter element of each node today, do not have sufficient capabilities to independently (without host side engagement in the movement of data) manage direct transfers of stored data between nodes of a network, referred to as “remote DMA” or “RDMA operations.” In addition, the transport of packets through such networks was considered too unreliable to permit RDMA operations. In most asynchronous networks, packets that are inserted into a network in one order of transmission are subject to being received in a different order. This occurs chiefly because networks almost always provide multiple paths between nodes; some paths involve a greater number of hops between intermediate nodes, e.g., bridges, routers, etc., than other paths, and some paths may be more congested than others.
To support RDMA in pinning-based networks (for example, Infiniband (see Infiniband Architecture Specification, Infiniband Trade Association, 2004) and Myrinet (see Myricom, Inc, “Myrinet”, [http://www.myrinet.com])), the source and destination buffers holding the pages to be transferred from the sender to the receiver must be pinned (registered) to physical memory for the duration of the RDMA operation. Unpinning involves deregistering the memory at some later point in time, after the transfer has completed; it is needed mainly because only a fraction of the actual physical memory can be pinned. However, pinning/unpinning (registration/deregistration) of pages in memory is a costly operation, adding to the overhead of message passing interfaces such as MPI (see MPI: A Message Passing Interface Standard, MPI forum). As used herein, the terms registration and pinning (and likewise deregistration and unpinning) are used synonymously.
To address this overhead of pinning/unpinning and enable better computation-communication overlap in MPI-based code, various MPI implementations or layers underneath, which are entrusted with the task of registering or deregistering pages, may employ one of several solutions.
One approach is to restrict RDMA operations to a static memory region. This allows the memory region to be registered once, with the cost amortized over a possibly large number of RDMA operations. However, this approach restricts the application to a static memory region. For many applications, this is inappropriate and forces the user to copy data to and from the registered memory. For larger messages, copy costs quickly become a bottleneck. Nevertheless, this policy may still be applied to “eager” messages (see High Performance RDMA-based MPI implementation over Infiniband, [ICS 2003], J. Liu, J. Wu, S. Kini, P. Wyckoff, et al.).
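The copy cost of the static-region approach can be illustrated with a minimal sketch. It is a simulation only: no memory is actually pinned and no RDMA transfer takes place, and the `StaticRegion` class, its `send` method and the 4096-byte region size are hypothetical names and values chosen for illustration.

```python
class StaticRegion:
    """A single staging buffer that is 'registered' once (simulated)."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.registrations = 1   # the one-time registration cost
        self.bytes_copied = 0    # extra copying forced on the application

    def send(self, payload):
        """Stage the payload through the region piece by piece.

        Returns the bytes a receiver would see; appending to `sent`
        stands in for the RDMA transfer of the staged piece.
        """
        sent = bytearray()
        view = memoryview(payload)
        region_size = len(self.buffer)
        for offset in range(0, len(payload), region_size):
            piece = view[offset:offset + region_size]
            self.buffer[:len(piece)] = piece          # copy into registered memory
            self.bytes_copied += len(piece)
            sent += self.buffer[:len(piece)]          # simulated transfer
        return bytes(sent)


region = StaticRegion(4096)
message = bytes(range(256)) * 64      # 16384-byte application buffer
received = region.send(message)
```

In this sketch the region is registered exactly once, but every byte of the message is copied before it can be sent, which is why the copy cost grows with message size.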
Another approach is to register memory on the fly. The source and destination buffers are registered before the RDMA operation and then deregistered upon completion of the transfer. This approach unfortunately incurs the high cost of registering the memory prior to each RDMA operation. A third approach is to maintain some sort of a cache. In the OpenMPI implementation, this is called an Rcache (registration cache) (see Infiniband Scalability in Open MPI [IPDPS 2006], Galen M. Shipman, Tim S. Woodall, Rich L. Graham, Arthur B. Maccabe and Patrick G. Bridges). Once a new unregistered address is encountered and entered in the cache, subsequent accesses can avoid the overhead of registration. For applications which regularly reuse source and destination buffers (exhibit temporal locality) for RDMA operations, the cost of the initial registration is effectively amortized over later RDMA operations. This approach was first available in MPICH-GM.
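The effect of a registration cache can be sketched as follows. This is a toy model under stated assumptions, not the Rcache of any actual MPI implementation: the `RegistrationCache` class, the least-recently-used eviction policy, the 4096-byte page size and the limit of 8 pinned pages are all hypothetical, and pinning is only counted, not performed.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class RegistrationCache:
    """Toy LRU cache of pinned pages that counts simulated pin/unpin calls."""

    def __init__(self, max_pinned_pages):
        self.max_pinned_pages = max_pinned_pages
        self.pinned = OrderedDict()   # page number -> True, oldest first
        self.registrations = 0
        self.deregistrations = 0

    def ensure_registered(self, address, length):
        """Pin every page touched by [address, address + length)."""
        first = address // PAGE_SIZE
        last = (address + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page in self.pinned:
                self.pinned.move_to_end(page)      # hit: refresh LRU position
                continue
            if len(self.pinned) >= self.max_pinned_pages:
                self.pinned.popitem(last=False)    # evict least recently used
                self.deregistrations += 1
            self.pinned[page] = True               # miss: pay for registration
            self.registrations += 1


def count_registrations(addresses, length, max_pinned_pages):
    """Run one transfer per address and return the number of registrations."""
    cache = RegistrationCache(max_pinned_pages)
    for address in addresses:
        cache.ensure_registered(address, length)
    return cache.registrations


# The same buffer is reused for 100 transfers: only the first one registers.
reused = count_registrations([0] * 100, PAGE_SIZE, max_pinned_pages=8)

# 100 different pages are each used once: every transfer registers.
streamed = count_registrations(
    [i * PAGE_SIZE for i in range(100)], PAGE_SIZE, max_pinned_pages=8
)
```

Here `reused` is 1 and `streamed` is 100, which illustrates that the cache amortizes registration cost only when buffers are actually reused.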
The first two solutions are not generic or effective enough. Regarding the cache-based solution, in many instances, even with a cache present, registration/deregistration overhead becomes unavoidable due to the absence of temporal locality in the pages accessed in a message. This situation may arise, for example, when adjacent pages of an array are accessed in a loop. OpenMPI has tried to overcome this problem (partially) for large messages by pipelining the RDMA/registration process (see High Performance RDMA Protocols in HPC [Euro-PVM-MPI Conf. 2006], Tim S. Woodall, Galen M. Shipman, George Bosilca and Arthur B. Maccabe). It breaks a large message into several units and registers the chunks to be sent next while, at the same time, transferring the current chunks by RDMA. However, shorter messages cannot be handled by this mechanism. Results reported in the same reference show that the pipelined strategy works well for message sizes of 100K bytes or more. Also, current registration/deregistration implementations are synchronous, resulting in more delay.
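The benefit and the limitation of pipelining can be illustrated with an idealized cost model. This is a sketch, not the protocol described in the cited reference: it assumes every chunk has the same registration time and the same transfer time, that registration of the next chunk fully overlaps the transfer of the current one, and that times are in arbitrary units. The function names and the example values are hypothetical.

```python
def sequential_time(num_chunks, reg_time, xfer_time):
    """Each chunk is registered and then transferred, with no overlap."""
    return num_chunks * (reg_time + xfer_time)


def pipelined_time(num_chunks, reg_time, xfer_time):
    """Registration of chunk i+1 overlaps the transfer of chunk i.

    The first registration and the last transfer cannot be overlapped
    with anything; every step in between costs the slower of the two.
    """
    if num_chunks == 0:
        return 0
    return reg_time + (num_chunks - 1) * max(reg_time, xfer_time) + xfer_time


# A large message split into 8 chunks (registration 2 units, transfer 3 units).
large_sequential = sequential_time(8, 2, 3)   # 40
large_pipelined = pipelined_time(8, 2, 3)     # 26

# A short message that fits in a single chunk gains nothing from pipelining.
short_sequential = sequential_time(1, 2, 3)   # 5
short_pipelined = pipelined_time(1, 2, 3)     # 5
```

Under these assumptions the saving comes entirely from the overlapped middle steps, so a message too short to be split into several chunks sees no improvement.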
On the deregistration side, the cache-based strategy suffers from the usual cache eviction problem of when and what to deregister. In addition, for dynamically allocated pages, deregistration must happen before the pages are deallocated. This is difficult to do at run-time because a program can deallocate either non-registered or registered pages. The usual strategy is to rewrite allocation library routines such as free() so that, during a free operation, the registration cache is checked to see whether the freed pages are present in it and, if so, they are deregistered. This results in undue overhead and complications (see Infiniband Scalability in Open MPI [IPDPS 2006], Galen M. Shipman, Tim S. Woodall, Rich L. Graham, Arthur B. Maccabe and Patrick G. Bridges). In Wyckoff, et al., work has been done to address the deregistration issue for arbitrary allocation/deallocation by providing special register/deregister functions (dreg_register/dreg_deregister) that call a kernel module, dreg. The register/deregister functions are available in user space, and the dreg module in the kernel keeps track of VM (virtual memory) allocations and deallocations. By setting up a polling/signaling mechanism between the dreg module and the register/deregister functions, the registration cache can be maintained consistently.
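The overhead of checking the registration cache on every free operation can be sketched as follows. This is a simulation under stated assumptions: the `TrackedAllocator` class and its methods are hypothetical, addresses are plain integers, and neither real allocation nor real pinning takes place. It does not model the dreg kernel module.

```python
class TrackedAllocator:
    """Toy allocator whose free() consults a simulated registration cache."""

    def __init__(self):
        self.live = {}            # base address -> size of live buffers
        self.registered = set()   # base addresses of currently pinned buffers
        self.next_address = 0x1000
        self.cache_lookups = 0
        self.deregistrations = 0

    def malloc(self, size):
        address = self.next_address
        self.next_address += size
        self.live[address] = size
        return address

    def register(self, address):
        self.registered.add(address)

    def free(self, address):
        # Every free pays for a cache lookup, registered or not.
        self.cache_lookups += 1
        if address in self.registered:
            # Deregister before the memory is returned to the allocator.
            self.registered.remove(address)
            self.deregistrations += 1
        del self.live[address]


allocator = TrackedAllocator()
buffers = [allocator.malloc(4096) for _ in range(10)]
for address in buffers[:2]:       # only two buffers are ever used for RDMA
    allocator.register(address)
for address in buffers:
    allocator.free(address)
```

In this sketch all ten free operations perform a lookup although only two buffers were registered, which is the kind of overhead on unrelated deallocations that the modified allocation routines introduce.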
The major drawback of all the current strategies used to reduce the overhead of pinning/unpinning is that pinning/unpinning is implemented by layers such as MPI or ones below it. These layers do not have a view of the locality of the message pages accessed, a view that is available at the higher abstraction level of a program.