A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. A “disk” may refer to a hard disk drive (HDD), a solid state drive (SSD) or any other persistent data storage technology.
The storage system may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access data containers stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing access requests (read/write requests) as file-based and block-based protocol messages (in the form of packets) to the system over the network.
One technology used by data storage systems that operate on a client/server model is remote direct memory access (RDMA). RDMA allows a local computer to directly access the memory of a remote computer without involving the remote computer's operating system. RDMA permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.
In an RDMA system, the local computer, or local “node,” is not notified of the completion of an operation at the time the request is posted. Instead, completions of I/O operations are reported asynchronously. Completions are usually reported by events, or they can be polled for using CPU cycles, but these mechanisms increase the memory footprint and network latency.
RDMA may be useful in applications such as remote mirroring of data. Currently, remote mirroring of data implements an “in-order delivery” (IOD) requirement, whereby mirroring applications and connections between the nodes typically support in-order delivery of data between the nodes. For in-order delivery of data, the data is expected to be received at the remote node in the same time order as it was sent at the local node. For example, if data sets are sent at the local node in a time order comprising data sets W, X, and then Y, the IOD requirement requires that the remote node receives the data sets in the same time order (i.e., receive in order W, X, and then Y). IOD of data results when there is a single connection path between the local and remote nodes.
In contrast, “out-of-order delivery” (OOD) of data results when there are multiple connection paths between the local and remote nodes. Multiple connection paths may be implemented to increase data throughput and bandwidth between nodes. For OOD of data, the data is not expected to be received at the remote node in the same time order as it was sent at the local node and may arrive in any order. As such, in the above example, data set Y may arrive at the remote node prior to data sets W and X in OOD.
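The difference between IOD and OOD can be illustrated with a minimal sketch. It assumes a simplified model that is not part of the description above: one data set is sent per time tick, data sets are assigned to connection paths round-robin, and each path has a fixed transit time. The function name `arrival_order` and the latency values are hypothetical and chosen only for illustration.

```python
def arrival_order(data_sets, path_latencies):
    """Return the order in which data sets reach the remote node.

    Assumed model: data set i is sent at tick i over path (i mod number of
    paths), and path_latencies[p] is the transit time of path p in ticks.
    """
    arrivals = []
    for send_time, name in enumerate(data_sets):
        path = send_time % len(path_latencies)
        arrival_time = send_time + path_latencies[path]
        # send_time breaks ties so that equal arrival times keep send order
        arrivals.append((arrival_time, send_time, name))
    arrivals.sort()
    return [name for _, _, name in arrivals]


# Single connection path: arrival order matches send order (IOD).
single_path = arrival_order(["W", "X", "Y"], [5])

# Three paths, the third one faster: Y overtakes W and X (OOD).
multi_path = arrival_order(["W", "X", "Y"], [5, 5, 1])

print(single_path)  # ['W', 'X', 'Y']
print(multi_path)   # ['Y', 'W', 'X']
```

With a single path every data set experiences the same transit time, so the send order is preserved. With multiple paths of differing transit times, a later data set may arrive first.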
OOD of data from the local node to the remote node may compromise data integrity at the remote node. Typically, for a group of related data sets (e.g., data sets W, X, Y), there may also be a metadata set (e.g., metadata set Z) that describes each of the related data sets (e.g., metadata set Z describes data sets W, X, Y), and the metadata set is also to be stored to the local and remote non-volatile storage devices. As used herein, a “related group” of data and metadata sets may comprise one or more data sets and one metadata set that describes and is associated with each of the one or more data sets. Also as used herein, “data integrity” exists when the metadata set of a related group is written to the remote non-volatile storage device only after each of the data sets within the related group is written to the remote non-volatile storage device. If the metadata set of a related group is written before each of the data sets within the same related group is written, data corruption and inconsistency in the remote non-volatile storage device may result.
For example, a related group may comprise data sets W, X, Y and metadata set Z, where metadata set Z specifies that there are 3 valid data sets and the time order of transmitting to the remote node is W, X, Y, and then Z. A “valid” data set may comprise client data that is pending to be stored to the local and remote non-volatile storage devices. In IOD of data, data integrity is intact since the time order of receiving and writing at the remote node is also W, X, Y, and then Z (where metadata set Z is written to the remote non-volatile storage device only after data sets W, X, and Y are written). When the metadata set Z is written to the remote non-volatile storage device, this indicates that 3 valid data sets have already been successfully written to the remote non-volatile storage device. As such, in IOD of data, the data and metadata stored at the remote node would be consistent, as metadata set Z written to the remote non-volatile storage device would accurately reflect that 3 valid data sets W, X, and Y have been written to the remote non-volatile storage device.
However, in OOD of data, data integrity may not exist if, for example, metadata set Z is received and written to the remote node prior to data sets X and Y. In this example, the data and metadata stored at the remote node would not be consistent as metadata set Z being written to the remote non-volatile storage device would indicate that the 3 valid data sets W, X, and Y have already been written to the remote non-volatile storage device, when this in fact is not true. If a crash were to occur at the remote node before data sets X and Y were written to the remote non-volatile storage device, data corruption at the remote non-volatile storage device would result. As such, use of OOD of data typically does not provide data integrity at the remote non-volatile storage device at each point in time.
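The data integrity condition defined above can be expressed as a short check over the order in which sets are written at the remote node. This is a minimal sketch rather than a description of any particular implementation; the function name `integrity_holds` and the representation of a write order as a list of set names are assumptions made for illustration.

```python
def integrity_holds(write_order, data_sets, metadata_set):
    """Check the data integrity condition for one related group.

    Returns True if the metadata set is written only after every data set
    of the related group has been written, and False otherwise.
    """
    required = set(data_sets)
    written = set()
    for item in write_order:
        if item == metadata_set and not required <= written:
            # Metadata would claim data sets that are not yet on storage.
            return False
        written.add(item)
    return True


group = ["W", "X", "Y"]

in_order = integrity_holds(["W", "X", "Y", "Z"], group, "Z")
metadata_early = integrity_holds(["W", "Z", "X", "Y"], group, "Z")
data_reordered = integrity_holds(["Y", "W", "X", "Z"], group, "Z")

print(in_order)        # True
print(metadata_early)  # False
print(data_reordered)  # True
```

The third case shows that, under this definition, reordering among the data sets themselves does not violate data integrity. Only the position of the metadata set relative to its data sets matters.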
IOD for remote mirroring has significant drawbacks. For example, multiple connection paths between the nodes may be used to increase data throughput and connection bandwidth between nodes, but multiple connection paths may cause OOD of data. As such, remote mirroring that requires IOD of data may not take advantage of the increased data throughput and connection bandwidth provided by multiple connection paths between the nodes and OOD of data. In implementations of OOD, however, data integrity is at risk because the sending or local node does not have any indication that all data has been received. The local node may therefore send subsequent data write requests or metadata write requests before data has been written to a persistent data storage device, or even before all previous data write requests have been received. As such, there is a need for an improved method for remote mirroring of data and metadata between nodes of a cluster storage system. Consequently, it would be advantageous if a method and apparatus existed that are suitable for enforcing data integrity during OOD of data through an execution thread on a remote node in an RDMA data storage system.