The present invention is generally directed to the transfer of information residing in one computer system or on one data processing node to another data processing node. The present invention is more particularly directed to data transfers in which data is transferred by the network adapter directly into the target user buffer in the address space of the receiving system or node from the address space of the sending system or node. This is referred to as remote direct memory access (RDMA). Even more particularly, the present invention is directed to systems and methods for carrying out RDMA without the automatic assumption that data which has been sent is data which has also been received. This assumption is referred to as reliable transport but which should really be thought of as the “reliable transport assumption.” As used herein, the concept of reliable transport refers to a communication modality which is based upon a “send and forget” model for Upper Layer Protocols (ULPs) running on the data processing nodes themselves, as opposed to adapter operations. Correspondingly, the concept of unreliable transport refers to a communication modality which is not “send and forget” with respect to the ULP. Also, as used herein the term “datagram” refers to a message packet that is both self-contained as to content and essential heading descriptions and which is not guaranteed to arrive at any given time.
For a proper understanding of the contributions made to the data communication arts by the present invention it should be fully appreciated that the present invention is designed to operate not only in an environment which employs DMA data transfers, but that this data transfer occurs across a network, that is, remotely. Accordingly, the context of RDMA data transfer is an important aspect for understanding the operation and benefits of the present invention. In the RDMA environment, the programming model allows the end user or middleware user to issue a read or write command (or request) directed to specific virtual memory locations defined at both a sending node and at a remote data processing node. The node issuing the command is called (for the purposes of the RDMA transaction) the master node; the other node is referred to as the slave node. For purposes of better understanding the advantages offered by the present invention, it is noted here that the existing RDMA state of the art paradigm includes no functionality for referencing more than one remote node. It is also understood that the RDMA model assumes that there is no software in the host processor at the slave end of the transaction which operates to affect RDMA data transfers. There are no intermediate packet arrival interrupts, nor is there any opportunity for even notifying the master side that a certain portion of the RDMA data sent has now been received by the target. There is no mechanism in existing RDMA transport mechanisms to accept out-of-order packet delivery.
An example of the existing state of the art in RDMA technology is seen in the Infiniband architecture (also referred to as IB).
The state of the art RDMA paradigm also includes the limitation that data packets sent over the communications fabric are received in the same order as they were sent since they assume the underlying network transport to be reliable for RDMA to function correctly. This means that transmitted packets can easily accumulate in the sending side network communications adapters waiting for acknowledgment. This behavior has the tendency to create situations in which, at any given time, there are a large number of “packets in flight” that are buffered at the sending side network adapter waiting to be acknowledged. This tends to bog down adapter operation and produces its own form of bandwidth limiting effect in addition to the bandwidth limiting effect caused by the fact that the source and destination nodes are constrained by having all of the packets pass in order through a single communications path. In addition, adapter design itself is unnecessarily complicated since this paradigm requires the buffering of unacknowledged in-flight packets.
The DMA and RDMA environments are essentially hardware environments. This provides advantages but it also entails some risk and limitations. Since the RDMA function is provided in hardware, RDMA data transfers possess significant advantages in terms of data transfer rates and, as with any DMA operation, data transfer workload is offloaded from central processing units (CPUs) at both ends of the transfer. RDMA also helps reduce the load on the memory subsystem. Furthermore, the conventional RDMA model is based upon the assumption that data packets are received in the order that they are sent. Just as importantly, the “send and forget” RDMA model (RDMA with the underlying reliable network transport assumption) unnecessarily limits bandwidth and precludes the use of many other features and functions such as efficient striping multiple packets across a plurality of paths. These features also include data striping, broadcasting, multicasting, third party RDMA operations, conditional RDMA operations, half RDMA and half FIFO operations, safe and efficient failover operations, and “lazy” deregistration. None of these functions can be carried out as efficiently within the existing state of the art RDMA “send and forget” paradigm as they are herein.
The RDMA feature is also referred to in the art as “memory semantics” for communication across a cluster network, or as “hardware put/get” or as “remote read/write” or as “Remote Direct Memory Access (RDMA).”
It should also be understood that the typical environment in which the present invention is employed is one in which a plurality of data processing nodes communicate with one another through a switch, across a network, or through some other form of communication fabric. In the present description, these terms are used essentially synonymously since the only requirement imposed on these devices is the ability to transfer data from source to destination, as defined in a data packet passing through the switch. Additionally the typical environment for the operation of the present invention includes communication adapters coupled between the data processing nodes and the switch (network, fabric). It is also noted that while a node contains at least one central processing unit (CPU), it may contain a plurality of such units. In data processing systems in the pSeries line of products currently offered by the assignee of the present invention a node possibly contains up to thirty-two CPUs on Power4 based systems and up to sixty-four CPUs on Power5 based systems. (Power4 and Power5 are microprocessor systems offered by the assignee of the present invention). To ensure good balance between computational and communication capacity, nodes are typically equipped with one RDMA capable network adapter for every four CPUs. Each node, however, possesses its own address space. That is, no global shared memory is assumed to exist for access from across the entire cluster. This address space includes random access memory (RAM) and larger scale, but slower external direct access storage devices (DASD) typically deployed in the form of rotating magnetic disk media which works with the CPUs to provide a virtual address space in accordance with well known memory management principles. Other nonvolatile storage mechanisms such as tape are also typically employed in data processing systems as well.
The use of Direct Memory Address (DMA) technology provides an extremely useful mechanism for reducing CPU (processor) workload in the management of memory operations. Workload that would normally have to be processed by the CPU is handled instead by the DMA engine. However, the use of DMA technology has been limited by the need for tight hardware controls and coordination of memory operations. The tight coupling between memory operations and CPU operations poses some challenges, however, when the data processing system comprises a plurality of processing nodes that communicate with one another over a network. These challenges include the need for the sending side to have awareness of remote address spaces, multiple protection domains, locked down memory requirements (also called pinning), notification, striping and recovery models. The present invention is directed to a mechanism for reliable RDMA protocol over a possibly unreliable network transport model.
If one wishes to provide the ability to perform reliable RDMA transport operations over a possibly unreliable underlying network transport path, there are many important issues that should be addressed. For example, how does one accomplish efficient data striping over multiple network interfaces available on a node by using RDMA? How does one provide an efficient notification mechanism on either end (master and slave) regarding the completion of RDMA operations? How would one define an RDMA interface that lends itself to efficient implementation? How does one design a recovery model for RDMA operations (in the event when a single network interface exists and in the event when multiple network interfaces exist)? How does one implement an efficient third party transfer model using RDMA for DLMs (Distributed Lock Managers) and other parallel subsystems? How does one implement an efficient resource management model for RDMA resources? How does one design a lazy deregistration model for efficient implementation of the management of the registered memory for RDMA? The answers to these questions and to other related problems, that should be addressed as part of a complete, overall RDMA model, are presented herein.
As pointed out above, prior art RDMA models (such as Infiniband referred to above) do not tolerate receipt of packets in other than their order of transmission. In such systems, an RDMA message containing data written to or read from one node to another node is segmented into multiple packets and transmitted across a network between the two nodes. The size of data blocks which are being transferred, together with the packet size supported by the network or fabric, are the driving force for the partitioning of the data into multiple packets. In short, the need to transmit multiple packets as the result of a single read or write request is important given the constraints and state of the art of existing communication network fabrics. Furthermore, at another level, it is advantageous in the method of the present invention to divide the transfer into several independent multi-packet segments to enable striping across multiple paths in the communication fabric. At the node receiving the message (the receiving node), the packets are then placed in a buffer in the order received and the data payload is extracted from the packets and is assembled directly into the memory of the receiving node. The existing state of the art mechanisms are built on the assumption that the receipt of packets occurs in the same order in which they were transmitted. If this assumption is not true, then the communication transport protocols could mistake the earlier arriving packets as being the earlier transmitted packets, even though earlier arriving packets might actually have been transmitted relatively late in the cycle. If a packet was received in a different order than it was transmitted, serious data integrity problems could result. This occurs, for example, if a packet containing data that is intended to be written to a higher range of addresses of a memory, is received prior to another packet containing data that is intended to be written to a lower range of addresses. If the reversed order of delivery went undetected, the data intended for the higher range of addresses could be written to the lower range of addresses, and vice versa, as well. In addition, in existing RDMA schemes, a packet belonging to a current more recently initiated operation could be mistaken for one belonging to an earlier operation that is about to finish.
Accordingly, prior art RDMA schemes focused on enhancing network transport function to guarantee reliable delivery of packets across the network. With reliable datagram and reliably connected transport mechanisms such as this, the packets of a message are assured of arriving in the same order in which they are transmitted, thus avoiding the serious data integrity problems which could otherwise result. The present invention provides a method to overcome this dependence on reliable transport and on the in-order delivery of packets requirements and is implemented over an unreliable datagram network transport.
The prior art “reliably connected and reliable datagram” RDMA model also has many other drawbacks. Transport of message packets or “datagram” between the sending and receiving nodes is limited to a single communication path over the network that is selected prior to beginning data transmission from one node to the other. In addition, the reliable delivery model requires that no more than a few packets (equal to the buffering capability on the sending side network adapter) be outstanding at any one time. Thus, in order to prevent packets from being received out of transmission order, transactions in existing RDMA technologies have to be assigned small time-out values, so that a time-out is forced to occur unless the expected action (namely, the receipt of an acknowledgment of the packet from the receiver to the sender) occurs within an undesirably short period of time. All of these restrictions impact the effective bandwidth that is apparent to a node for the transmission of RDMA messages across the network. The present invention provides solutions to all of these problems.