RDMA technology reduces processor workload when data are sent and received across a network between two computer nodes. It does so by transferring data directly from the memory of a local computer node to the memory of a remote computer node without involving the CPU of the remote node. RDMA technology is typically implemented by specialized hardware that resides on each computer node. An RDMA write operation transfers data from the memory of a local computer node directly to the memory of a remote computer node; an RDMA read operation requests a transfer of data from the memory of a remote computer node directly to the memory of the local computer node. Each RDMA connection uses a pair of memory data structures, a send queue and a receive queue, that allow the computer node to post work requests to the RDMA-capable hardware. There is also a completion queue that stores completion notifications for the submitted work requests. Together, a send queue, a receive queue and a completion queue are referred to as a queue structure (QS) throughout this document. Once the RDMA connection is established, a computer node can post a request in one of the queues (the send queue or the receive queue). Each queue stores a request from the time it is posted by the node until the time it is processed. An interconnect driver on the node then notifies an interconnect adapter on the same node that the request has been posted. The adapter reads the request from the queue and performs the actual data transfer over the network. After the data are received, the interconnect adapter at the receiving (second) computer node writes the data directly to the destination memory at that node. A completion result is then sent back to the first computer node, and the interconnect adapter at the first computer node posts the result to its completion queue.
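The request flow described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names `WorkRequest`, `QueueStructure`, `Node`, `post_write` and `process_send_queue` are hypothetical, the "network" is a direct Python call, and no real RDMA hardware or verbs library is involved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkRequest:
    """Hypothetical work request: an RDMA write of `payload` to a remote offset."""
    request_id: int
    remote_offset: int
    payload: bytes


@dataclass
class QueueStructure:
    """Send queue, receive queue and completion queue grouped as one QS."""
    send_queue: deque = field(default_factory=deque)
    receive_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)


class Node:
    """Toy stand-in for a computer node together with its interconnect adapter."""

    def __init__(self, memory_size: int) -> None:
        self.memory = bytearray(memory_size)
        self.qs = QueueStructure()

    def post_write(self, request: WorkRequest) -> None:
        # The request stays in the send queue until the adapter processes it.
        self.qs.send_queue.append(request)

    def process_send_queue(self, remote: "Node") -> None:
        # Plays the role of the adapter after the driver has notified it.
        while self.qs.send_queue:
            request = self.qs.send_queue.popleft()
            end = request.remote_offset + len(request.payload)
            # Data lands directly in remote memory; no method of the remote
            # node runs, mirroring the fact that its CPU is not involved.
            remote.memory[request.remote_offset:end] = request.payload
            # The completion result is posted to the initiator's completion queue.
            self.qs.completion_queue.append((request.request_id, "success"))


first, second = Node(64), Node(64)
first.post_write(WorkRequest(request_id=1, remote_offset=8, payload=b"log"))
first.process_send_queue(second)
```

After the last line runs, the payload sits in the second node's memory and the first node's completion queue holds one notification for the request, which is the ordering the paragraph describes.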
According to RDMA protocols and known implementations, when an error occurs in a queue structure (QS), all pending requests in the QS are flushed and returned in error. QS management logic then destroys the QS in error and creates a new QS for the purpose of establishing a new connection. The error status is communicated to an upper subsystem module (for example, a file system), which stops posting requests until the new QS is created. This, in turn, disrupts operation of the applications using the RDMA connection. Accordingly, it is highly desirable to maintain the RDMA connection between two or more computer nodes barring legitimate error cases, e.g., transient software or hardware errors when processing an I/O request.
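The conventional error handling described above can be sketched as follows. The class and method names (`QueueStructure`, `Connection`, `fail`, `handle_error`) and the status string are assumptions made for illustration; they do not correspond to any particular RDMA library.

```python
from collections import deque


class QueueStructure:
    """Toy QS holding pending requests and completion notifications."""

    def __init__(self) -> None:
        self.pending = deque()      # stands in for the send and receive queues
        self.completions = deque()  # completion queue
        self.in_error = False

    def post(self, request_id: int) -> None:
        if self.in_error:
            raise RuntimeError("QS is in error; posting is not allowed")
        self.pending.append(request_id)

    def fail(self) -> None:
        """Flush every pending request and return it in error."""
        self.in_error = True
        while self.pending:
            self.completions.append((self.pending.popleft(), "flushed_in_error"))


class Connection:
    """Hypothetical QS management logic for one RDMA connection."""

    def __init__(self) -> None:
        self.qs = QueueStructure()

    def handle_error(self) -> list:
        self.qs.fail()
        flushed = list(self.qs.completions)
        # The QS in error is discarded and a fresh one takes its place.
        self.qs = QueueStructure()
        return flushed


conn = Connection()
for request_id in (1, 2, 3):
    conn.qs.post(request_id)
old_qs = conn.qs
flushed = conn.handle_error()
```

In this model every request that was pending at the time of the error comes back in error, and the caller cannot post again until the replacement QS exists, which is the disruption the paragraph refers to.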
One useful application of RDMA technology is controller failover in a cluster storage environment, in which a first computer node may have a predetermined failover “partner” node (a second computer node) that may take over or resume the storage services of the first computer node upon a failure at the first computer node. For write requests received from one or more clients, a node may produce write logs and store them in its non-volatile storage device, from which the node may later flush the write logs to the storage devices. To ensure data consistency and provide high data availability, the write logs may also be stored remotely in a non-volatile storage device at a partner node. The transfer of write logs between two partner nodes in a cluster storage system typically takes place using RDMA technology, so that data in a local non-volatile storage device at a first computer node may be transferred directly to a non-volatile storage device of a second computer node to provide failover protection (e.g., in case the first computer node crashes).
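As a rough illustration of write-log mirroring between partner nodes, the sketch below keeps a local log and a mirror of the partner's log on each node. Everything here is an assumption for illustration: `StorageNode`, `handle_write` and `take_over` are hypothetical names, a list append stands in for the RDMA write, and a dictionary stands in for the storage devices.

```python
class StorageNode:
    """Toy storage node that mirrors its write logs to a failover partner."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local_log = []    # write logs produced for this node's own clients
        self.partner_log = []  # mirror of the partner's write logs
        self.partner = None

    def pair_with(self, other: "StorageNode") -> None:
        self.partner = other
        other.partner = self

    def handle_write(self, key: str, value: bytes) -> None:
        entry = (key, value)
        # Store the write log locally (stand-in for local non-volatile storage).
        self.local_log.append(entry)
        if self.partner is not None:
            # Stand-in for the RDMA write into the partner's non-volatile storage.
            self.partner.partner_log.append(entry)

    def take_over(self, partner_storage: dict) -> None:
        # On partner failure, replay the mirrored logs onto the storage devices.
        for key, value in self.partner_log:
            partner_storage[key] = value
        self.partner_log.clear()


node_a, node_b = StorageNode("a"), StorageNode("b")
node_a.pair_with(node_b)
node_a.handle_write("block-0", b"hello")
node_a.handle_write("block-1", b"world")

# Suppose node_a crashes before flushing: node_b replays the mirrored logs.
storage_of_a = {}
node_b.take_over(storage_of_a)
```

Because each log entry is mirrored before the write is considered logged, the partner holds everything it needs to resume service, as long as the transfer between the nodes keeps working.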
Currently, when an error occurs on an RDMA connection (for example, during the transfer of write logs to a partner node), the error status is communicated to the upper subsystem on the first computer node. Because the RDMA connection is in error, the first computer node no longer transfers the write logs to its partner node, so the logs on the two nodes become unsynchronized. As a result, high availability functionality is no longer available to the clients accessing the nodes. As a corollary, one computer node can no longer initiate takeover of its partner node, which disrupts the clients when a failure occurs at either one of the nodes. Accordingly, it is desirable to significantly increase RDMA connection uptime between the nodes for purposes of RDMA transfer.