The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for persistent memory replication in remote direct memory access (RDMA) capable networks.
InfiniBand™ is an industry-standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. A true fabric architecture, InfiniBand (IB) leverages switched, point-to-point channels with data transfers that generally lead the industry, both in chassis backplane applications and through external copper and optical fiber connections. Reliable messaging (send/receive) and memory manipulation semantics (remote direct memory access (RDMA)) without software intervention in the data movement path provide low latency and high application performance. InfiniBand APIs and protocols can also be used on an Ethernet fabric when the RoCE transport (RDMA over Converged Ethernet) is deployed.
This low-latency, high-bandwidth interconnect requires only minimal processing overhead and is ideal to carry multiple traffic types (clustering, communications, storage, management) over a single connection. As a mature and field-proven technology, InfiniBand is used in thousands of data centers, high-performance compute clusters and embedded applications that scale from two nodes up to clusters utilizing thousands of nodes. Through the availability of long reach InfiniBand and Fast Ethernet over Metro and wide area network (WAN) technologies, InfiniBand and RoCE are able to efficiently move large amounts of data between data centers, from across the campus to around the globe.
Direct memory access (DMA) can be used for “memory to memory” copying or moving of data within memory. Either source or destination memory can be IO memory that belongs to a hardware device (for example PCI IO memory). DMA can offload expensive memory operations, such as large copies or scatter-gather operations, from the CPU to a dedicated DMA engine. One implementation example is I/O Acceleration Technology. Without DMA, when the CPU is using programmed input/output, it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the DMA master first initiates the transfer, then performs other operations while the transfer is in progress, and finally receives notification from the DMA slave when the operation is done. IO accelerators typically have dedicated DMA master engines, which allow the hardware to copy data without loading the CPU. This feature is useful whenever the CPU cannot keep up with the rate of data transfer, or when the CPU needs to perform useful work while waiting for a relatively slow I/O data transfer. Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without DMA channels. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed in parallel.
Remote direct memory access (RDMA) is direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require little CPU work and few context switches, and they continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer. However, this strategy presents several problems related to the fact that the target node is not notified of the completion of the request (single-sided communications).
RDMA-capable applications exchange messages via objects called queue pairs (QPs). Each QP comprises a send queue and a receive queue, and in order to exchange messages, the local and remote QPs need to connect to each other. The process of connection establishment involves sending and receiving connection management (CM) management datagrams (MADs) and is covered by the InfiniBand™ specification.
Applications can use RDMA technology only after they have established reliable connections. Modern RDMA adapters are powerful, and their full performance cannot be utilized without multiple hardware event queues and multiple application threads. For example, a dual-port 100 Gbit adapter can process 6 million sends and 6 million receives per second (using message sizes of 4 KB). Such adapters have at least 100 event queues, and commodity servers with that many CPUs are widely available. One scalable way to utilize interconnect and CPU performance is a multi-domain approach, where each application thread opens its own device context and binds to its own device event queue. Each thread can be pinned to a given CPU, and its event queue can be pinned to receive interrupts on the same CPU. This approach minimizes context switches, cross-CPU communication and cross-CPU locks, allowing maximization of system performance. At the same time, it requires each application thread to establish connections of its own.
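The message rate quoted above can be checked with simple line-rate arithmetic. The following is a rough sketch only: the function name is hypothetical, 4 KB is assumed to mean 4096 bytes, each port is assumed to run at full line rate in each direction, and protocol header overhead is ignored.

```python
def messages_per_second(ports: int, gbit_per_port: float, message_bytes: int) -> float:
    """Upper bound on the one-direction message rate, ignoring protocol overhead."""
    # Convert the aggregate line rate from gigabits to bytes per second.
    bytes_per_second = ports * gbit_per_port * 1e9 / 8
    return bytes_per_second / message_bytes


# Dual-port 100 Gbit adapter, 4 KB messages (assumed to be 4096 bytes).
rate = messages_per_second(ports=2, gbit_per_port=100, message_bytes=4096)
print(f"{rate / 1e6:.1f} million messages per second per direction")
# prints "6.1 million messages per second per direction"
```

Under these assumptions the bound is about 6.1 million messages per second in each direction, which is consistent with the figure of roughly 6 million sends and 6 million receives.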
To implement failover and data redundancy, modern data-center applications may replicate memory. For example, storage write transactions can be replicated to a number of backup nodes before acknowledgment of the write request is returned to the initiator. Trade transactions can be mirrored to backup trading servers before being acknowledged. Databases may replicate journal or other transactions before completing the store operations. All these applications strive to achieve minimal latency while consuming minimal CPU resources. The use of RDMA for these applications allows meeting these requirements.
Applications that use RDMA for memory replication typically use one of two approaches:
1. Use of conventional storage protocols that support RDMA. Examples of such protocols include SRP (SCSI RDMA Protocol), iSER (iSCSI Extensions for RDMA) and the XBAND protocol deployed by XIV enterprise storage. In these protocols, the initiator (the party that wants to replicate) sends a request to the target. The request specifies the source addresses and their keys, and information about what is being replicated. When using SRP or iSCSI, which are standard storage protocols, the destination may be a virtual storage volume in memory (identified by the volume ID and the offset within the volume) that corresponds to the source memory being replicated. When using XBAND, a more direct representation of the transaction being replicated is possible. The target may then allocate memory at the destination and perform a set of RDMA read operations from the initiator to the target. When the RDMA read operations are complete, a reply message is sent to the initiator reporting the status of the transfer. This approach suffers from several performance limitations:
Each transfer requires multiple messages that consume resources on both the initiator and the target: initiator send, target receive, target RDMA read, target send reply, initiator receive reply. By contrast, a single RDMA transaction from the initiator to a pre-negotiated address on the target would suffice, where such a transaction is possible.
RDMA reads are more expensive than RDMA writes. An implementation that can use RDMA writes for memory replication would be more efficient.
Memory allocations per IO on the target can be expensive.
2. Use of active-to-passive memory replication to a static memory log on a passive remote node. In this approach, a standby instance of the application runs on a remote node. When a new passive instance is started, the active and passive instances connect. The passive instance allocates a static memory log and exchanges the size and address of the log with the active instance. More than one memory window may be used, and windows may be added or resized dynamically. The active instance of the application replicates its transactions to one or more memory windows provided by the target. Should the active application fail, the standby application assumes the active role and restarts transactions from the last known positions in the memory logs. This approach has the advantage of good performance (no allocations per IO, RDMA writes as opposed to RDMA reads, and a single operation on the initiator). Its disadvantages are the inability to deploy active-to-active implementations and poor error recovery: upon a single replication error to a standby instance, the whole memory log is assumed lost and must be re-synchronized.
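The second approach can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not an RDMA implementation: the class names PassiveLog and ActiveReplicator, the 4-byte length-prefix record framing, and the recovery scan are all hypothetical, and a plain slice assignment into a bytearray stands in for an RDMA write to a pre-negotiated address.

```python
import struct

HEADER = struct.Struct("<I")  # hypothetical framing: 4-byte little-endian length prefix


class PassiveLog:
    """Standby side: a fixed-size log allocated once, when the instances connect."""

    def __init__(self, size: int):
        self.buf = bytearray(size)

    def advertise(self) -> dict:
        # A real deployment would also exchange the registered address and key.
        return {"size": len(self.buf)}

    def recover(self):
        """After failover, scan records up to the first empty slot."""
        records, pos = [], 0
        while pos + HEADER.size <= len(self.buf):
            (length,) = HEADER.unpack_from(self.buf, pos)
            if length == 0:
                break
            start = pos + HEADER.size
            records.append(bytes(self.buf[start:start + length]))
            pos = start + length
        return records, pos


class ActiveReplicator:
    """Active side: appends each transaction to the remote log in one operation."""

    def __init__(self, remote: PassiveLog):
        self.remote = remote
        self.size = remote.advertise()["size"]
        self.tail = 0  # next free offset in the remote log

    def replicate(self, payload: bytes) -> None:
        if not payload:
            raise ValueError("empty payloads are not representable in this framing")
        record = HEADER.pack(len(payload)) + payload
        if self.tail + len(record) > self.size:
            raise BufferError("static log is full")
        # Stand-in for a single RDMA write; no per-IO allocation on the target.
        self.remote.buf[self.tail:self.tail + len(record)] = record
        self.tail += len(record)


log = PassiveLog(64)
active = ActiveReplicator(log)
active.replicate(b"txn-1")
active.replicate(b"txn-2")
records, position = log.recover()
print(records, position)
```

In this sketch the standby never participates in the data path, which mirrors the performance advantage described above. It also makes the recovery limitation visible: the standby has no per-record acknowledgment, so a failed write leaves it unable to tell which part of the log is trustworthy.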