A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained. The storage system includes a storage operating system that functionally organizes the system by invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage (NAS) environment, a storage area network (SAN) and a storage device assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a storage array, although other devices, including solid-state devices such as flash memory, may also constitute the storage array.
The storage operating system may implement a high-level module or layer of abstraction (e.g., a file system) to logically organize the information as a hierarchical structure of data containers, such as volumes, directories, and files. For example, each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. The disk blocks may be organized within a volume block number (vbn) space that is maintained by the file system, which may also assign each disk block in a file a corresponding “file offset” or file block number (fbn). Sequences of fbns are typically assigned by the file system on a per-file basis, whereas vbns are assigned over a larger volume address space. In addition, the file system may organize the disk blocks within the vbn space as a volume such that each volume may be, although is not necessarily, associated with its own file system.
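The relationship between fbns and vbns may be illustrated by the following minimal sketch. The class names `Volume` and `File`, and the simple sequential allocation policy, are assumptions made for illustration only and do not describe any particular file system.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical volume that hands out vbns from one shared address space."""
    next_vbn: int = 0
    blocks: dict = field(default_factory=dict)  # vbn -> block data

    def allocate(self, data: bytes) -> int:
        # Assumed policy: vbns are handed out sequentially across all files.
        vbn = self.next_vbn
        self.next_vbn += 1
        self.blocks[vbn] = data
        return vbn


@dataclass
class File:
    """Hypothetical file whose fbns start at 0 and map to vbns in the volume."""
    volume: Volume
    fbn_to_vbn: list = field(default_factory=list)  # list index is the fbn

    def append_block(self, data: bytes) -> int:
        vbn = self.volume.allocate(data)
        self.fbn_to_vbn.append(vbn)
        return len(self.fbn_to_vbn) - 1  # fbn of the block just added

    def read_block(self, fbn: int) -> bytes:
        return self.volume.blocks[self.fbn_to_vbn[fbn]]


vol = Volume()
file_a, file_b = File(vol), File(vol)
file_a.append_block(b"a0")
file_b.append_block(b"b0")
file_a.append_block(b"a1")
```

In this sketch each file numbers its own blocks from fbn 0, while the two files interleave within the single vbn space of the volume.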
The storage system may be configured to operate according to a client/server model of information delivery to allow many clients to access data containers stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the storage system by issuing access requests (e.g., read or write requests) as file-based or block-based protocol messages to the system over the network.
Multiple storage systems may be interconnected to provide a clustered storage system (cluster) configured to service many clients, where each storage system may be a “node” of the cluster. An advantage of a clustered architecture is that the nodes may be configured to communicate with one another and act collectively to increase performance or to offset any single node failure within the cluster. For instance, one node of the cluster (local node) may have a predetermined failover “partner” node (remote node) that may take over or resume storage services of the local node upon failure at the local node. When a failure occurs, access requests intended for the local node may be re-directed to the remote node for servicing. For ease of explanation, components residing on the local node may be referred to as local components, whereas components residing on the remote node may be referred to as remote components.
Preferably, data access services may be provided by the cluster using shared storage of the cluster, constituting a set of storage devices commonly accessible by the nodes. Clients may then connect to a node of the cluster to submit read or write requests that are received and serviced by the node accessing the shared storage. With write requests, the node may further implement a write log in local non-volatile storage (local log cache) for aggregating write requests and writing data to shared storage at a later point in time. In one instance, deferring write operations may be desirable to optimize system resources during peak access request periods by clients.
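The deferred write behavior described above may be sketched as follows. The class `LoggingNode`, the use of a dictionary to stand in for shared storage, and the explicit `flush` call are illustrative assumptions rather than a description of any particular implementation.

```python
class LoggingNode:
    """Hypothetical node that logs writes locally and flushes them later."""

    def __init__(self, shared_storage: dict):
        self.shared_storage = shared_storage  # stands in for the shared storage devices
        self.local_log_cache = []             # stands in for local non-volatile storage

    def write(self, key: str, data: bytes) -> None:
        # Record the request in the log; shared storage is not touched yet.
        self.local_log_cache.append((key, data))

    def flush(self) -> int:
        # Apply the logged writes in arrival order, then empty the log.
        count = len(self.local_log_cache)
        for key, data in self.local_log_cache:
            self.shared_storage[key] = data
        self.local_log_cache.clear()
        return count


shared = {}
node = LoggingNode(shared)
node.write("blk1", b"x")
node.write("blk2", b"y")
node.write("blk1", b"z")
untouched_before_flush = (shared == {})
flushed = node.flush()
```

Three write requests are aggregated in the log and applied to shared storage in a single later pass, with the most recent write to `blk1` taking effect last.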
When configured to protect against a node failure, the remote node of the cluster may implement a remote log cache which mirrors the local log cache. The remote log cache may then be accessed by the remote node to carry out any remaining write requests on the shared storage. A consistent view between log caches is thus desirable to ensure that write operations not executed by the local node are executed by the remote node in a failover. Similarly, if write logs in the local log cache become corrupt or lost, the remote node may access the remote log cache to carry out any remaining write operations on behalf of the local node.
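As a minimal sketch of the mirroring and failover behavior described above, the following assumes a hypothetical `Node` class in which each write is copied synchronously to the partner node. The synchronous copy and the `take_over` method are simplifying assumptions for illustration only.

```python
class Node:
    """Hypothetical cluster node with its own log cache and a mirror of its partner's."""

    def __init__(self, name: str, shared_storage: dict):
        self.name = name
        self.shared_storage = shared_storage
        self.log_cache = []     # this node's own pending writes
        self.mirror_cache = []  # copy of the partner's pending writes
        self.partner = None

    def write(self, key: str, data: bytes) -> None:
        entry = (key, data)
        self.log_cache.append(entry)
        if self.partner is not None:
            # Assumed: mirroring is synchronous and always succeeds.
            self.partner.mirror_cache.append(entry)

    def take_over(self) -> int:
        # Replay the partner's mirrored writes onto shared storage in order.
        replayed = len(self.mirror_cache)
        for key, data in self.mirror_cache:
            self.shared_storage[key] = data
        self.mirror_cache.clear()
        return replayed


shared = {}
local, remote = Node("local", shared), Node("remote", shared)
local.partner = remote
local.write("a", b"1")
local.write("b", b"2")
local.log_cache.clear()        # simulate loss of the local log cache
replayed = remote.take_over()  # remote node completes the pending writes
```

Because the remote log cache holds the same entries as the local log cache, the pending writes reach shared storage even though the local copy was lost.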
Known techniques for mirroring log caches between nodes include “out-of-order delivery” (OOD), in which data is transmitted across multiple connection paths between the nodes while a particular ordering of write logs must still be enforced at the remote node. In contrast with “in-order delivery” (IOD), where data is transmitted to the remote node across a single connection in the same order as it is received at the local node, data under OOD may arrive at the remote node in any order due to the multiple connection paths between nodes. The local node must thus manage the timing of data transmission to enforce the same ordering constraints at the remote node. As an example, data may be transmitted to the remote node in the form of data sets, where multiple data sets (e.g., W, X, and Y) may be grouped together and associated with a metadata set (Z). Metadata set Z may describe each of the related data sets and, for instance, indicate the number of data sets in the group (i.e., 3) and the order of transmission of the data sets (i.e., W, X, Y, and then Z) to enable a consistent view between log caches.
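The grouping of data sets W, X, and Y under metadata set Z may be sketched as follows. The `MetadataSet` fields and the `group_is_consistent` check are hypothetical names introduced for illustration; the check simply tests whether every described data set arrived before the metadata set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataSet:
    """Hypothetical metadata set describing a group of related data sets."""
    name: str
    count: int    # number of data sets in the group
    order: tuple  # transmission order, ending with the metadata set itself


def group_is_consistent(arrived: list, meta: MetadataSet) -> bool:
    """Return True only if all described data sets arrived before the metadata set."""
    if meta.name not in arrived:
        return False
    before_meta = arrived[:arrived.index(meta.name)]
    expected = set(meta.order) - {meta.name}
    return len(expected) == meta.count and expected <= set(before_meta)


meta_z = MetadataSet(name="Z", count=3, order=("W", "X", "Y", "Z"))

# Data sets may arrive in any relative order over multiple paths.
ok = group_is_consistent(["X", "W", "Y", "Z"], meta_z)
# The metadata set overtaking a data set is the hazardous case.
bad = group_is_consistent(["W", "Z", "X", "Y"], meta_z)
```

Here the first arrival sequence is acceptable because Z is last, whereas the second is not because Z arrived ahead of X and Y.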
In certain instances, however, corruption and inconsistency may occur if metadata set Z is received by the remote node and written to storage before data sets W, X, and Y. To avoid this problem, the local node may implement strict data set ordering constraints by maintaining a request queue for managing the data sets already sent to the remote node. Once confirmation of a successful mirroring operation is returned by the remote node for a data set, the local node may remove the completed request from the queue and, when no related requests remain outstanding, transmit the related metadata set. Avoidance of data corruption and inconsistency may thus be achieved since data sets are stored at the remote node prior to transmission of their related metadata set, thereby corresponding to the transmission order by the local node.
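The request queue technique described above may be sketched as follows. The `LocalNode` class, the `wire` list standing in for the connection paths, and the `mirror` helper that delivers in-flight messages in random order are all illustrative assumptions; the sketch also assumes the metadata set is held until every data set in its group has been confirmed.

```python
import random


class LocalNode:
    """Hypothetical local node that holds metadata until its data sets are confirmed."""

    def __init__(self):
        self.request_queue = []    # data sets sent but not yet confirmed
        self.held_metadata = None  # metadata set waiting for the queue to drain
        self.wire = []             # messages handed to the connection paths

    def send_group(self, data_sets, metadata_set):
        for name in data_sets:
            self.request_queue.append(name)
            self.wire.append(("data", name))
        self.held_metadata = metadata_set

    def on_ack(self, name):
        # Remove the completed request; release the metadata once none remain.
        self.request_queue.remove(name)
        if not self.request_queue and self.held_metadata is not None:
            self.wire.append(("meta", self.held_metadata))
            self.held_metadata = None


def mirror(local, rng):
    """Deliver in-flight messages in arbitrary order; return the remote store order."""
    remote_log = []
    while local.wire:
        # Multiple paths: any in-flight message may be the next to arrive.
        kind, name = local.wire.pop(rng.randrange(len(local.wire)))
        remote_log.append(name)
        if kind == "data":
            local.on_ack(name)
    return remote_log


local = LocalNode()
local.send_group(["W", "X", "Y"], "Z")
remote_order = mirror(local, random.Random(0))
```

Whatever order W, X, and Y arrive in, Z is not placed on the wire until all three are confirmed, so it is always stored last at the remote node. The cost is the extra bookkeeping and round trip at the local node.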
One known disadvantage with the conventional techniques involves the additional processing required at the local node in enforcing ordering constraints on the data and metadata sets. For OOD as well as IOD, the remote node may further be heavily involved in processing incoming write requests from clients. To avoid remote processing overhead, specialized data placement protocols such as remote direct memory access (RDMA) may be implemented to facilitate a consistent view of log caches between nodes. With RDMA, log writes may be transmitted from the local log cache to the remote log cache and stored directly to a specified location in the remote log cache without consuming processing overhead at the remote node.
An RDMA implementation involves, however, highly specialized interconnects and data placement semantics which may be complex and costly to implement. As storage demands grow, specialized components and semantics are an impractical approach to a flexible and scalable cluster since such configurations are typically vendor-specific and require a tradeoff between high data availability (e.g., protection against node failures) and cost. As such, there is a need for an improved method for mirroring log cache data while optimizing performance and scalability of the cluster.