A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage (NAS) environment, a storage area network (SAN) and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD). Storage devices may also comprise solid state devices, such as flash memory, battery backed up non-volatile random access memory, etc. As such, the description of storage devices being disks should be taken as exemplary only.
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on volumes as a hierarchical structure of data containers, such as files and logical unit numbers (luns). For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system may also assign each data block in the file a corresponding “file offset” or file block number (fbn). The file system typically assigns sequences of fbns on a per-file basis, whereas vbns are assigned over a larger volume address space. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although is not necessarily, associated with its own file system.
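By way of illustration only, the relationship between per-file fbns and volume-wide vbns may be sketched as follows. The class names `Volume` and `File` and their methods are hypothetical and are not drawn from any particular file system; the sketch assumes a simple in-memory list in place of actual disk blocks.

```python
class Volume:
    """Toy volume: one flat vbn space shared by every file (assumption)."""

    def __init__(self):
        self.blocks = []  # the index into this list plays the role of the vbn

    def allocate(self, data):
        # Hand out the next free vbn in the volume address space.
        self.blocks.append(data)
        return len(self.blocks) - 1


class File:
    """Toy file: fbns are assigned sequentially on a per-file basis."""

    def __init__(self, volume):
        self.volume = volume
        self.fbn_to_vbn = []  # the index into this list plays the role of the fbn

    def append_block(self, data):
        vbn = self.volume.allocate(data)
        self.fbn_to_vbn.append(vbn)
        return len(self.fbn_to_vbn) - 1  # the fbn just assigned

    def read_block(self, fbn):
        return self.volume.blocks[self.fbn_to_vbn[fbn]]


vol = Volume()
file_a, file_b = File(vol), File(vol)
file_a.append_block(b"a0")
file_b.append_block(b"b0")
file_a.append_block(b"a1")
```

In this sketch both files number their blocks from fbn 0, while their vbns interleave in the shared volume address space.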
A known type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block is retrieved (read) from disk into a memory of the storage system and “dirtied” (i.e., updated or modified) with new data, the data block is thereafter stored (written) to a new location on disk to optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL®) file system available from NetApp, Inc., Sunnyvale, Calif.
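The write-anywhere behavior described above may be sketched, under simplifying assumptions, as follows. The class `WriteAnywhereFile` is hypothetical and is not a description of the WAFL file system; it models the disk as an append-only list and shows only that a dirtied block is written to a new location rather than over the old one.

```python
class WriteAnywhereFile:
    """Toy model: a modified block goes to a new location, never in place."""

    def __init__(self):
        self.disk = []       # append-only; the list index stands in for a disk location
        self.block_map = {}  # fbn -> location currently holding that block

    def write(self, fbn, data):
        # Store the data at a fresh location and repoint the file's block map.
        self.disk.append(data)
        self.block_map[fbn] = len(self.disk) - 1

    def read(self, fbn):
        return self.disk[self.block_map[fbn]]


f = WriteAnywhereFile()
f.write(0, b"old")
old_location = f.block_map[0]
f.write(0, b"new")  # the block is "dirtied" and rewritten
new_location = f.block_map[0]
```

After the second write, the old location still holds the original data, and the block map points at the new location. Reclaiming the old location is outside the scope of this sketch.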
The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access data containers stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing file-based and block-based protocol messages (in the form of packets) to the system over the network.
To enable high-performance communications among storage systems, a plurality of storage systems may be organized as nodes of a cluster that is configured to implement distributed operations to increase overall bandwidth. Intra-cluster communications typically require high-performance communication pathways. An example of such a pathway that cluster members may utilize is a remote direct memory access (RDMA) network. Typically, RDMA networks use network protocol offloads and/or direct access interfaces to reduce the load on a main processor of a cluster member. To achieve network protocol offload, an RDMA compatible network adapter typically implements network protocol processing up to and including the transport layer. Offloading protocol processing from the cluster member's main processor provides additional compute cycles for other tasks.
In addition to its protocol offload capabilities, an RDMA compatible network adapter may provide a direct access interface to applications via specialized hardware and/or operating system coordination. As part of its direct access interface, the RDMA compatible network adapter typically provides a plurality of communication primitives, e.g., RDMA READ and RDMA WRITE operations. An RDMA READ operation requests that a data buffer on a target node (e.g., a remote cluster member) be transferred (or read) into a local destination buffer of a source node (e.g., a local cluster member). That is, an RDMA READ operation causes data stored in a defined memory region, i.e., a buffer, on the target node to be transferred to a buffer that is allocated on the source node, i.e., the node that originated the RDMA READ operation. An RDMA WRITE operation transfers a local data buffer to a remote destination buffer.
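The direction of data movement in the two primitives described above may be illustrated by the following sketch. It is an in-memory simulation only: the `Node` class and the `rdma_read` and `rdma_write` functions are hypothetical names, and no RDMA hardware or library is involved. The sketch assumes equally sized buffers on both nodes.

```python
class Node:
    """Toy cluster member holding named, registered buffers (assumption)."""

    def __init__(self, name):
        self.name = name
        self.buffers = {}

    def register(self, key, size):
        self.buffers[key] = bytearray(size)
        return self.buffers[key]


def rdma_read(source, local_key, target, remote_key):
    # Pull: the target's buffer is transferred into the source's local buffer.
    source.buffers[local_key][:] = target.buffers[remote_key]


def rdma_write(source, local_key, target, remote_key):
    # Push: the source's local buffer is transferred into the target's buffer.
    target.buffers[remote_key][:] = source.buffers[local_key]


local, remote = Node("local"), Node("remote")
local.register("l", 4)
remote.register("r", 4)

remote.buffers["r"][:] = b"abcd"
rdma_read(local, "l", remote, "r")    # local buffer now mirrors the remote one
after_read = bytes(local.buffers["l"])

local.buffers["l"][:] = b"wxyz"
rdma_write(local, "l", remote, "r")   # remote buffer now mirrors the local one
after_write = bytes(remote.buffers["r"])
```

In both cases the source node originates the operation; only the direction in which the data flows differs.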
In a typical implementation, an RDMA READ operation consumes more resources and is slower than an RDMA WRITE operation. Unlike RDMA WRITE operations, RDMA READ operations require dedicated resources on the RDMA hardware of the target and source nodes when transferring the data into the local destination buffer. If not managed appropriately, consumption of such resources may adversely impact (e.g., throttle) RDMA operations. For this reason, typical RDMA network adapters limit the number of RDMA READ operations that can be issued in parallel on a single connection, i.e., the total number of such operations that may be outstanding at any time. For example, RDMA adapters typically allow only a small number of RDMA READ operations to be outstanding at a time compared to the number of RDMA WRITE operations that may be outstanding at a time. RDMA READ operations are also typically slower than RDMA WRITE operations because they typically require a transaction on the target system's I/O bus (e.g., the PCI bus, PCI-X bus, PCI Express bus, etc.) before the target's RDMA adapter can send an acknowledgement completing the RDMA READ operation. As will be appreciated by one skilled in the art, this presents a challenge to data access protocols that rely on RDMA READ operations, as those protocols must use RDMA READ operations sparingly to avoid being throttled due to RDMA hardware limitations.
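The effect of a per-connection limit on outstanding RDMA READ operations may be sketched as follows. The `ReadLimiter` class is hypothetical, and the limit of two used in the example is an arbitrary assumption; actual limits depend on the adapter and are not specified here. The sketch models one way a protocol might queue excess reads until earlier ones complete, and is not a description of any adapter's internal behavior.

```python
from collections import deque


class ReadLimiter:
    """Toy per-connection cap on outstanding RDMA READ operations."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding  # assumed, adapter-dependent limit
        self.outstanding = set()                # reads currently in flight
        self.pending = deque()                  # reads waiting for a free slot

    def submit(self, read_id):
        # Issue immediately if a slot is free; otherwise queue the read.
        if len(self.outstanding) < self.max_outstanding:
            self.outstanding.add(read_id)
            return True
        self.pending.append(read_id)
        return False

    def complete(self, read_id):
        # Free the slot and promote the oldest queued read, if any.
        self.outstanding.remove(read_id)
        if self.pending:
            promoted = self.pending.popleft()
            self.outstanding.add(promoted)
            return promoted
        return None


limiter = ReadLimiter(max_outstanding=2)
issued = [limiter.submit(i) for i in range(4)]  # reads 2 and 3 must wait
promoted = limiter.complete(0)                  # read 2 takes the freed slot
```

With four reads submitted against a limit of two, the last two are deferred until earlier reads complete, which illustrates why a protocol that relies heavily on RDMA READ operations may be throttled.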