Mass data storage systems are used for many purposes, including storing user and system data for data processing, backup, and transmission applications. A typical mass storage system includes numerous computer disk drives that cooperatively store data, for example, as a single logically contiguous storage space, often referred to as a volume or a logical unit. One or more such volumes/logical units may be configured in a storage system. The storage system therefore performs much like a single computer disk drive when viewed by a host computer system. For example, the host computer system can access data of the storage system much like it would access data of a single internal disk drive, in essence, without regard to the substantially transparent underlying control of the storage system.
A mass storage system may include one or more storage modules, with each individual storage module comprising multiple disk drives coupled to one or more storage controllers. In one common configuration, a storage module may be coupled through its storage controller(s) directly to a host system as a standalone storage module. Typical storage controllers include significant cache memory capacity to improve the performance of I/O operations. Write requests may be completed when the supplied data is written to the higher speed cache memory. At some later point, the data in cache memory may be flushed or posted to the persistent storage of the storage modules. Also, read requests may often be satisfied by accessing data already resident in the higher speed cache memory of the storage controller.
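The caching behavior described above can be illustrated with a minimal sketch. It is a toy model only, not the controller design of any particular system; the class name `WriteBackCache`, its methods, and the dictionary standing in for persistent storage are all hypothetical names introduced for illustration.

```python
class WriteBackCache:
    """Toy model of a storage controller's write-back cache (hypothetical names)."""

    def __init__(self):
        self.cache = {}     # block address -> data held in fast cache memory
        self.dirty = set()  # addresses written to cache but not yet flushed
        self.disk = {}      # stands in for the persistent storage of the module

    def write(self, addr, data):
        # The write request is completed as soon as the data is in cache.
        self.cache[addr] = data
        self.dirty.add(addr)
        return "completed"

    def read(self, addr):
        # Satisfy the read from cache when the data is already resident there;
        # otherwise fetch it from persistent storage and keep a cached copy.
        if addr in self.cache:
            return self.cache[addr], "cache hit"
        data = self.disk[addr]
        self.cache[addr] = data
        return data, "cache miss"

    def flush(self):
        # At some later point, post every dirty block to persistent storage.
        for addr in sorted(self.dirty):
            self.disk[addr] = self.cache[addr]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed
```

In this sketch a write returns before anything reaches `disk`, which is why a real controller must protect dirty cache contents (for example, by mirroring them to a partner controller) until `flush` has run.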
In a standalone configuration, it is common to enhance reliability and performance by providing a redundant pair of storage controllers. The redundant pair of controllers enhances reliability in that an inactive storage controller may assume control when an active controller is sensed to have failed in some manner. Redundant pairs of storage controllers may also enhance performance of the standalone storage system in that both storage controllers may be active, each acting as backup for the other, while both simultaneously process different I/O requests or different portions of an I/O request. In such a configuration with redundant storage controllers, the storage controllers typically exchange information to maintain coherency of data between the cache memories resident in each controller. Some storage systems use the communication path between the controllers and the storage modules for the additional cache coherency information exchanges.
In another standard system configuration, a storage module may be part of a larger storage network or “cluster.” For a cluster-type architecture, multiple storage modules and corresponding storage controllers are typically coupled through a switched network communication medium, known as a “fabric,” to one or more host systems. This form of storage module system is often referred to as a Storage Area Network (SAN) architecture and the switching fabric is, concomitantly, referred to as a SAN switching fabric. In such a clustered configuration, it is common that all of the storage controllers exchange coherency information, information for load balancing of I/O request processing, and other control information. Such control information may be exchanged over the same network fabric that couples the storage controllers to the host systems (e.g., a “front-end” connection) or over another fabric that couples the storage controllers to the storage modules (e.g., a “back-end” connection).
Remote direct memory access (RDMA) technology, also referred to as “RDMA protocol,” provides a useful method for reducing processor workload in the transmission of data in network-related processing. In general, RDMA technology reduces central processing unit (CPU) workload in the transmission and reception of data across a network between two computer nodes by transferring data directly from memory of a local computer node to memory of a remote computer node without continuously involving the CPU of the remote node. RDMA technology is typically used by, for example, commercial data centers and mass data storage systems that support high performance computing services. Specialized hardware is often required on both the client (remote computer node) and the server (local computer node) to implement the RDMA protocol. Network interface card (NIC) hardware fabricated to implement RDMA technology, for example, can process operations that were previously performed by a CPU.
An RDMA write operation transfers data from the memory of a local computer node to the memory of a remote computer node. An RDMA read operation, in contrast, requests transfer of data from the memory of a remote computer node to the memory of a local computer node. Each RDMA connection typically uses a pair of memory data structures, namely a send queue and a receive queue, that allow the computer node to post work requests to the RDMA-capable hardware. There is also a completion queue that stores completion notifications for the submitted work requests. A send queue, a receive queue, and a completion queue are oftentimes collectively referred to as a queue structure (QS). Once the RDMA connection is established, a computer node can post a request in one of the queues, either the send queue or the receive queue. Each queue stores a request from the time it is posted by the node until the time it is processed. An interconnect adapter on the node is then notified by an interconnect driver on the same node that the request is posted; the adapter reads the request in the queue and performs the actual data transfer over a network. After receipt of the requested data is completed, the interconnect adapter at the computer node that receives the data writes the data directly to destination memory at that second computer node. Then a completion result is sent back to the first computer node. The interconnect adapter at the first computer node posts the result to its completion queue.
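The request flow just described can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions, not an RDMA implementation: `QueueStructure`, `Node`, `post_write`, and `adapter_process` are hypothetical names, node memory is modeled as a dictionary, and a single function stands in for the interconnect adapters on both nodes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueueStructure:
    """Send, receive, and completion queues grouped as one queue structure (QS)."""
    send_queue: deque = field(default_factory=deque)
    receive_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)


@dataclass
class Node:
    """A computer node with registered memory and one queue structure."""
    name: str
    memory: dict = field(default_factory=dict)
    qs: QueueStructure = field(default_factory=QueueStructure)

    def post_write(self, request_id, local_key, remote_key):
        # The work request stays in the send queue until it is processed.
        self.qs.send_queue.append((request_id, local_key, remote_key))


def adapter_process(local, remote):
    """Stand-in for the interconnect adapters: drain the local send queue."""
    while local.qs.send_queue:
        request_id, local_key, remote_key = local.qs.send_queue.popleft()
        # The data is written directly into the remote node's memory;
        # no request-handling code of the remote node is involved.
        remote.memory[remote_key] = local.memory[local_key]
        # The completion result is posted to the requester's completion queue.
        local.qs.completion_queue.append((request_id, "success"))


first = Node("first", memory={"buf": b"payload"})
second = Node("second")
first.post_write(1, "buf", "dst")
adapter_process(first, second)
```

After the call, the payload sits in the second node's memory and the first node finds a single completion entry for request 1, mirroring the write path in the paragraph above. The receive queue is present for completeness but unused, since this sketch covers only the write case.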
RDMA upper layer protocols (ULPs), such as server message block direct (SMBD) protocols and like application-layer network protocols, typically use a model in which the initiator (client) requests an RDMA operation after registering memory. The host server is then expected to complete the operation using RDMA. Clients connecting to a scale-out file server may oftentimes choose to connect to any node in a cluster depending on the load balancing model. While this option aids “scale out” of the system, e.g., the ability to incrementally increase the storage capacity (storage modules) of the system, there is a performance penalty associated with having to go over the cluster interconnect. Typically, requests that go to a remote node can result in higher client-perceived latency. There is therefore a need for an RDMA protocol that reduces latency while minimizing utilization of the cluster interconnect.
The present disclosure is susceptible to various modifications and alternative forms, and some representative embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the aspects and features of this disclosure are not limited to the particular forms illustrated in the drawings. Rather, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure as defined by the appended claims.