A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on volumes as a hierarchical structure of data containers, such as files and logical units (LUs). For example, each “on-disk” file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. These data blocks are organized within a volume block number (vbn) space that is maintained by the file system. The file system may also assign each data block in the file a corresponding “file offset” or file block number (fbn). The file system typically assigns sequences of fbns on a per-file basis, whereas vbns are assigned over a larger volume address space. The file system organizes the data blocks within the vbn space as a “logical volume”; each logical volume may be, although is not necessarily, associated with its own file system.
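The relationship between per-file fbns and volume-wide vbns can be illustrated with a minimal sketch. The class names `Volume` and `File` and their methods are hypothetical and chosen only for illustration; they do not describe any particular file system implementation.

```python
class Volume:
    """Hypothetical volume that assigns volume block numbers (vbns)."""

    def __init__(self):
        self.blocks = {}    # vbn -> block contents
        self.next_vbn = 0   # vbns are assigned over the whole volume

    def allocate(self, data):
        vbn = self.next_vbn
        self.next_vbn += 1
        self.blocks[vbn] = data
        return vbn


class File:
    """Hypothetical file whose blocks are numbered by file block number (fbn)."""

    def __init__(self, volume):
        self.volume = volume
        self.block_map = []  # list index = fbn, value = vbn

    def append_block(self, data):
        fbn = len(self.block_map)  # fbns are sequential per file
        self.block_map.append(self.volume.allocate(data))
        return fbn

    def read_block(self, fbn):
        return self.volume.blocks[self.block_map[fbn]]


vol = Volume()
file_a, file_b = File(vol), File(vol)
file_a.append_block(b"a0")
file_b.append_block(b"b0")
file_a.append_block(b"a1")
```

In this sketch both files start at fbn 0, while their blocks are interleaved in the shared vbn space of the volume.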
A known type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block is retrieved (read) from disk into a memory of the storage system and “dirtied” (i.e., updated or modified) with new data, the data block is thereafter stored (written) to a new location on disk to optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL®) file system available from NetApp, Inc., Sunnyvale, Calif.
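The write-anywhere behavior may be sketched as follows. This is a simplified illustration under assumed names (`WriteAnywhereStore`, `write`, `read`), not a description of the WAFL file system: a modified block is stored at a new location and the block map is updated, so the old location is never overwritten.

```python
class WriteAnywhereStore:
    """Hypothetical store that never overwrites a block in place."""

    def __init__(self):
        self.disk = {}        # location -> block contents
        self.block_map = {}   # logical block id -> current location
        self.next_free = 0    # next unused location (assumed unbounded)

    def write(self, block_id, data):
        location = self.next_free       # always pick a new location
        self.next_free += 1
        self.disk[location] = data
        self.block_map[block_id] = location
        return location

    def read(self, block_id):
        return self.disk[self.block_map[block_id]]


store = WriteAnywhereStore()
first_location = store.write("blk", b"old")
second_location = store.write("blk", b"new")  # a "dirtied" block goes elsewhere
```

After the second write, the old contents remain at the first location while reads of the block return the new data.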
The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access data containers stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing access requests (read/write requests) as file-based and block-based protocol messages (in the form of packets) to the system over the network.
A plurality of storage systems may be interconnected to provide a storage system architecture configured to service many clients. In some embodiments, the storage system architecture provides one or more aggregates and one or more volumes distributed across a plurality of nodes interconnected as a cluster. The aggregates may be configured to contain one or more volumes. The volumes may be configured to store content of data containers, such as files and logical units, served by the cluster in response to multi-protocol data access requests issued by clients.
Each node of the cluster may include (i) a storage server (referred to as a “D-blade”) adapted to service a particular aggregate or volume and (ii) a multi-protocol engine (referred to as an “N-blade”) adapted to redirect the data access requests to any storage server of the cluster. In the illustrative embodiment, the storage server of each node is embodied as a disk element (D-blade) and the multi-protocol engine is embodied as a network element (N-blade). The N-blade receives a multi-protocol data access request from a client, converts that access request into a cluster fabric (CF) message and redirects the message to an appropriate D-blade of the cluster.
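The redirection performed by the N-blade may be sketched as follows. The names `CFMessage`, `DBlade`, `NBlade`, and the volume-to-owner lookup are assumptions made for illustration; the actual format of a CF message and the lookup mechanism are not specified here.

```python
from dataclasses import dataclass


@dataclass
class CFMessage:
    """Hypothetical cluster fabric message."""
    volume: str
    op: str
    payload: bytes = b""


class DBlade:
    """Hypothetical storage server that services particular volumes."""

    def __init__(self, name, volumes):
        self.name = name
        self.volumes = set(volumes)

    def handle(self, msg):
        return f"{self.name} handled {msg.op} on {msg.volume}"


class NBlade:
    """Hypothetical multi-protocol engine that redirects requests."""

    def __init__(self, d_blades):
        # Assumed lookup table: volume name -> D-blade servicing it.
        self.owner = {v: d for d in d_blades for v in d.volumes}

    def handle_client_request(self, volume, op, payload=b""):
        msg = CFMessage(volume, op, payload)   # convert to a CF message
        return self.owner[volume].handle(msg)  # redirect to the owning D-blade


d1 = DBlade("d1", ["vol_a"])
d2 = DBlade("d2", ["vol_b"])
n_blade = NBlade([d1, d2])
result = n_blade.handle_client_request("vol_b", "read")
```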
The nodes of the cluster may be configured to communicate with one another to act collectively to increase performance or to offset any single node failure within the cluster. Each node in the cluster may have a predetermined failover “partner” node that may take over/resume storage functions of the node upon failure of the node. When a node failure occurs (where the failed node is no longer capable of processing access requests for clients), the access requests sent to the failed node may be re-directed to the partner node for processing. As such, the cluster may be configured such that a partner node may take over the work load of a failed node. As used herein, a local/source node may have data and metadata that is mirrored/copied to a remote/destination node in the cluster storage system (as discussed below). The remote node may comprise a predetermined failover partner node of the local node. As used herein, various components residing on the local node may likewise be referred to as a local component (e.g., local memory, local de-staging layer, etc.) and various components residing on a remote node may likewise be referred to as a remote component (e.g., remote memory, remote de-staging layer, etc.).
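The failover redirection described above may be sketched as follows. The names `ClusterNode` and `route` are hypothetical, and failure detection is reduced to a simple `alive` flag for the purposes of the illustration.

```python
class ClusterNode:
    """Hypothetical cluster node with a predetermined failover partner."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.partner = None
        self.processed = []

    def process(self, request):
        self.processed.append(request)
        return self.name


def route(node, request):
    """Send the request to the node, or to its partner if the node has failed."""
    target = node if node.alive else node.partner
    if target is None or not target.alive:
        raise RuntimeError("no live node available")
    return target.process(request)


local, remote = ClusterNode("local"), ClusterNode("remote")
local.partner, remote.partner = remote, local

served_first = route(local, "write-1")   # local node is healthy
local.alive = False                      # simulate failure of the local node
served_second = route(local, "write-2")  # request is redirected to the partner
```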
A cluster provides data-access service to clients by providing access to shared storage (comprising a set of storage devices). Typically, clients will connect with a node of the cluster for data-access sessions with the node. During a data-access session with a node, a client may submit access requests (read/write requests) that are received and performed by the node. For the received write requests, the node may produce write logs that represent the write requests and locally store the write logs to a volatile storage device (from which the node may, at a later time, perform the write logs on the storage devices).
To ensure data consistency and provide high data availability, the write logs may also be stored to two non-volatile storage devices. Typically, the write logs of the node may be locally stored to a non-volatile storage device and also be stored remotely to a non-volatile storage device at a partner node (sometimes referred to herein as mirroring data to a remote node). As such, if the local node fails, the remote partner node will have a copy of the write logs and will still be able to perform the write logs on the storage devices. Also, if the write logs stored at the partner node are corrupted or lost, the write logs stored locally in the non-volatile storage device at the local node can be extracted/retrieved and used to perform the write logs on the storage devices.
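The mirroring of write logs to a local and a remote non-volatile storage device may be sketched as follows. The names `Node`, `local_nvlog`, `mirror_nvlog`, and `perform_write_logs` are hypothetical, and Python lists stand in for the non-volatile storage devices.

```python
class Node:
    """Hypothetical node that keeps write logs locally and at its partner."""

    def __init__(self, name):
        self.name = name
        self.local_nvlog = []    # write logs produced by this node
        self.mirror_nvlog = []   # copies of the partner node's write logs
        self.partner = None

    def handle_write(self, key, value):
        entry = (key, value)
        self.local_nvlog.append(entry)           # store locally
        self.partner.mirror_nvlog.append(entry)  # mirror to the remote node


def perform_write_logs(log, storage):
    """Apply the logged writes, in log order, to the shared storage."""
    for key, value in log:
        storage[key] = value
    return storage


local, remote = Node("local"), Node("remote")
local.partner, remote.partner = remote, local
local.handle_write("blk1", "A")
local.handle_write("blk2", "B")
local.handle_write("blk1", "C")

# If the local node fails, the partner performs the mirrored write logs.
storage_after_failover = perform_write_logs(remote.mirror_nvlog, {})
```

Either copy of the write logs yields the same result when performed on the storage devices, which is why the loss of one copy can be tolerated.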
As such, data in a local non-volatile storage device at a local node may be mirrored to a remote non-volatile storage device of a remote node to provide failover protection (e.g., in case the local node crashes) and high availability of data in the cluster storage system. The mirrored data may comprise write logs, or any other data that is to be stored to the non-volatile storage devices.
Currently, remote mirroring of data implements an “in-order delivery” (IOD) requirement, whereby mirroring applications and connections between the nodes typically support in-order delivery of data between the nodes. For in-order delivery of data, the data is expected to be received at the remote node in the same time order as it was sent at the local node. For example, if data sets are sent at the local node in a time order comprising data sets W, X, and then Y, the IOD requirement requires that the remote node receives the data sets in the same time order (i.e., receive in order W, X, and then Y). IOD of data results when there is a single connection path between the local and remote nodes.
In contrast, “out-of-order delivery” (OOD) of data results when there are multiple connection paths between the local and remote nodes. Multiple connection paths may be implemented to increase data throughput and bandwidth between nodes. For OOD of data, the data is not expected to be received at the remote node in the same time order as it was sent at the local node and may arrive in any order. As such, in the above example, data set Y may arrive at the remote node prior to data sets W and X in OOD.
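The effect of multiple connection paths on arrival order may be sketched as follows. The function `arrival_order`, the round-robin assignment of data sets to paths, and the fixed per-path latencies are assumptions made for illustration only.

```python
def arrival_order(data_sets, path_latencies):
    """Return the order in which data sets arrive at the remote node.

    Data set i is sent at time i over path (i mod number of paths); each
    path has a fixed latency (a simplifying assumption).
    """
    arrivals = []
    for send_time, name in enumerate(data_sets):
        latency = path_latencies[send_time % len(path_latencies)]
        arrivals.append((send_time + latency, send_time, name))
    # Sort by arrival time; ties are broken by send time.
    return [name for _, _, name in sorted(arrivals)]


single_path = arrival_order(["W", "X", "Y"], [5])        # one connection path
multi_path = arrival_order(["W", "X", "Y"], [5, 5, 1])   # third path is faster
```

With a single path, the arrival order equals the send order. With three paths of differing latency, data set Y arrives before data sets W and X.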
OOD of data from the local node to the remote node may compromise data integrity at the remote node. Typically, for a group of related data sets (e.g., data sets W, X, Y), there may also be a metadata set (e.g., metadata set Z) that describes each of the related data sets (e.g., metadata set Z describes data sets W, X, Y); the metadata set is also to be stored to the local and remote non-volatile storage devices. As used herein, a “related group” of data and metadata sets may comprise one or more data sets and one metadata set that describes and is associated with each of the one or more data sets. As used herein, “data integrity” exists when the metadata set of a related group is written to the remote non-volatile storage device only after each of the data sets within the related group is written to the remote non-volatile storage device. If the metadata set of a related group is written before each of the data sets within the same related group is written, data corruption and inconsistency in the remote non-volatile storage device may result.
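The definition of data integrity given above may be expressed as a simple check over the order in which sets are written at the remote node. The function name `has_data_integrity` and its parameters are hypothetical.

```python
def has_data_integrity(write_order, data_sets, metadata_set):
    """Check that the metadata set is written only after every data set.

    write_order is the sequence of sets written so far to the remote
    non-volatile storage device.
    """
    if metadata_set not in write_order:
        # The metadata set is not yet written, so it describes nothing yet.
        return True
    metadata_position = write_order.index(metadata_set)
    written_before_metadata = set(write_order[:metadata_position])
    return set(data_sets) <= written_before_metadata


in_order = has_data_integrity(["W", "X", "Y", "Z"], ["W", "X", "Y"], "Z")
out_of_order = has_data_integrity(["W", "Z", "X", "Y"], ["W", "X", "Y"], "Z")
```

Here `in_order` is true because Z follows W, X, and Y, whereas `out_of_order` is false because Z is written before X and Y.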
For example, the data sets of a related group may comprise data sets W, X, Y and metadata set Z, where metadata set Z specifies that there are 3 valid data sets and the time order of transmitting to the remote node is W, X, Y, and then Z. A “valid” data set may comprise client data that is pending to be stored to the local and remote non-volatile storage devices. In IOD of data, data integrity is intact since the time order of receiving and writing to the remote node is also W, X, Y, and then Z (where metadata set Z is written to the remote non-volatile storage device only after data sets W, X, and Y are written). When the metadata set Z is written to the remote non-volatile storage device, this indicates that 3 valid data sets have already been successfully written to the remote non-volatile storage device. As such, in IOD of data, the data and metadata stored at the remote node would be consistent as metadata set Z written to the remote non-volatile storage device would accurately reflect that 3 valid data sets W, X, and Y have been written to the remote non-volatile storage device.
However, in OOD of data, data integrity may not exist if, for example, metadata set Z is received and written to the remote node prior to data sets X and Y. In this example, the data and metadata stored at the remote node would not be consistent as metadata set Z being written to the remote non-volatile storage device would indicate that the 3 valid data sets W, X, and Y have already been written to the remote non-volatile storage device, when this in fact is not true. If a crash were to occur at the remote node before data sets X and Y were written to the remote non-volatile storage device, data corruption at the remote non-volatile storage device would result. As such, use of OOD of data typically does not provide data integrity at the remote non-volatile storage device at each point in time.
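The crash scenario above may be sketched as follows, assuming that metadata set Z claims 3 valid data sets as in the example. The names `METADATA`, `remote_state_after_crash`, and `is_consistent` are hypothetical, and a crash is modeled simply as stopping after a given number of writes.

```python
# Assumed contents of metadata set Z from the example: 3 valid data sets.
METADATA = {"Z": {"valid_data_sets": 3}}


def remote_state_after_crash(arrival_order, writes_before_crash):
    """Return (data sets written, valid data sets claimed by metadata)."""
    written = arrival_order[:writes_before_crash]
    data_written = [s for s in written if s not in METADATA]
    claimed = sum(METADATA[s]["valid_data_sets"] for s in written if s in METADATA)
    return data_written, claimed


def is_consistent(data_written, claimed):
    """The metadata must not claim more data sets than were actually written."""
    return len(data_written) >= claimed


# The remote node crashes after the second write in both cases.
iod_state = remote_state_after_crash(["W", "X", "Y", "Z"], 2)
ood_state = remote_state_after_crash(["W", "Z", "X", "Y"], 2)
```

In the in-order case the remote device holds W and X with no metadata claim, which is consistent. In the out-of-order case it holds only W while Z claims 3 valid data sets, which is the inconsistency described above.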
Thus, IOD is typically used for remote mirroring as it provides data integrity at the remote node at any point in time. However, use of IOD for remote mirroring has significant drawbacks. For example, multiple connection paths between the nodes may be used to increase data throughput and connection bandwidth between nodes, but multiple connection paths between nodes may cause OOD of data. As such, IOD of data for remote mirroring may not take advantage of the increased data throughput and connection bandwidth provided by multiple connection paths between the nodes and OOD of data. Accordingly, there is a need for an improved method for remote mirroring of data and metadata between nodes of a cluster storage system.