A storage system is a computer that provides storage service relating to the organization of information on storage devices, such as disks. The storage system may be deployed within a network attached storage (NAS) environment and, as such, may be embodied as a file server. The file server or filer includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Each client may request the services of the file system on the filer by issuing file system protocol messages (in the form of packets) to the filer over the network.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures, such as inodes and data blocks, on disk are typically fixed. An inode is a data structure used to store information, such as meta-data, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permission for the file, size of the file, file type and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers, which may further reference indirect blocks that, in turn, reference the data blocks, depending upon the quantity of data in the file. Changes to the inodes and data blocks are made “in-place” in accordance with the write in-place file system. If an update to a file extends the quantity of data for the file, an additional data block is allocated and the appropriate inode is updated to reference that data block.
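By way of illustration only, the inode arrangement described above may be sketched as follows. The block size, the number of direct pointers, the single level of indirection and all names (`Inode`, `Disk`, `append_block`, `data_block_numbers`) are assumptions chosen for the sketch; they are not taken from the Berkeley fast file system or from any particular implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_SIZE = 4096        # assumed block size in bytes
NUM_DIRECT = 4           # assumed number of direct pointers per inode
PTRS_PER_INDIRECT = 8    # assumed capacity of one indirect block


@dataclass
class Inode:
    owner: str
    mode: int                        # access permissions
    file_type: str
    size: int = 0                    # file size in bytes
    direct: List[int] = field(default_factory=list)  # data block numbers
    indirect: Optional[int] = None   # block number of an indirect block


class Disk:
    """Toy disk: maps a block number to bytes or to a list of pointers."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0

    def allocate(self, payload):
        block_no = self.next_free
        self.next_free += 1
        self.blocks[block_no] = payload
        return block_no


def append_block(disk, inode, data):
    """Extend a file by one data block and update the inode to reference it."""
    if len(data) > BLOCK_SIZE:
        raise ValueError("data does not fit in one block")
    if len(inode.direct) < NUM_DIRECT:
        inode.direct.append(disk.allocate(data))
    else:
        # Direct pointers are exhausted, so go through the indirect block.
        if inode.indirect is None:
            inode.indirect = disk.allocate([])
        pointers = disk.blocks[inode.indirect]
        if len(pointers) >= PTRS_PER_INDIRECT:
            raise OSError("file too large for this sketch")
        pointers.append(disk.allocate(data))
    inode.size += len(data)


def data_block_numbers(disk, inode):
    """Return the file's data block numbers in file order."""
    numbers = list(inode.direct)
    if inode.indirect is not None:
        numbers.extend(disk.blocks[inode.indirect])
    return numbers
```

In this sketch a small file is reached through the direct pointers alone, while a larger file additionally consumes one block for the indirect pointers, which reflects the dependence on the quantity of data in the file.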
Another type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. A particular example of a hybrid write-anywhere file system that is configured to operate on a filer is the SpinFS file system available from Network Appliance, Inc. of Sunnyvale, Calif. The exemplary SpinFS file system utilizes a write-anywhere technique for user and directory data but writes metadata using a write in-place technique. The SpinFS file system is implemented within a storage operating system of the filer as part of the overall protocol stack and associated disk storage.
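The difference between the two write techniques may be sketched as follows. This is a minimal illustration under assumed names (`BlockStore`, `write_in_place`, `write_anywhere`) and an assumed logical-to-physical block map; it is not a description of how the SpinFS file system is implemented.

```python
class BlockStore:
    """Toy disk: physical block number -> contents."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0

    def allocate(self, data):
        block_no = self.next_free
        self.next_free += 1
        self.blocks[block_no] = data
        return block_no


def write_in_place(store, block_map, logical, data):
    """Overwrite the physical block already assigned to the logical block."""
    store.blocks[block_map[logical]] = data


def write_anywhere(store, block_map, logical, data):
    """Write dirtied data to a new physical block and repoint the map.

    The previous physical block is left untouched; its number is returned
    so that a caller could later reclaim it.
    """
    old_physical = block_map[logical]
    block_map[logical] = store.allocate(data)
    return old_physical
```

Under these assumptions, a write-anywhere update changes only the block map and a newly allocated block, so the old contents remain on disk until the old block is reclaimed.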
Disk storage is typically implemented as one or more storage “volumes” that comprise physical storage disks, defining an overall logical arrangement of storage space. Currently available filer implementations can serve a large number of discrete volumes (150 or more, for example). Each volume is associated with its own file system and, for purposes hereof, volume and file system shall generally be used synonymously. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of parity information with respect to the striped data. As described herein, a volume typically comprises at least one data disk and one associated parity disk (or possibly data/parity partitions in a single disk) arranged according to a RAID 4, or equivalent high-reliability, implementation.
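The role of the parity disk in a RAID 4 arrangement may be illustrated with a short sketch. Parity here is the byte-wise XOR of the data blocks in a stripe, which is the standard RAID 4 calculation; the function names (`xor_blocks`, `write_stripe`, `reconstruct`) and the tiny block contents are assumptions for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of one or more equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks in a stripe must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Return the stripe's data blocks and the block for the parity disk."""
    return list(data_blocks), xor_blocks(data_blocks)


def reconstruct(data_blocks, parity, lost_index):
    """Rebuild the block of a failed data disk from the survivors and parity."""
    survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])
```

For example, with three data disks holding `b"\x01\x02"`, `b"\x10\x20"` and `b"\xff\x00"`, the parity block is `b"\xee\x22"`, and XOR-ing that parity with any two of the data blocks yields the third.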
It is advantageous for the services and data provided by a filer or storage system to be available for access to the greatest degree possible. Accordingly, some storage system environments provide a plurality of storage systems (i.e. nodes) in a cluster where data access request processing may be distributed among the various nodes of the cluster. Executing on each node is a collection of management processes that provides management of configuration information (management data) “services” for the nodes. Each of these processes has an interface to a replicated database (RDB) that provides a persistent object store for the management data. In addition, the RDB replicates and synchronizes changes (updates) made to the management data by the management processes across all nodes to thereby provide services based on replicated data throughout the cluster. This data replication is the key to providing management from any node in the cluster (a single system image). To be clear, these are not “replicated services” in the normal use of the term (independent services without data sharing); rather, these are independent services which utilize replicated data as a means to enhance availability and autonomous capabilities within a cluster-wide single system image.
In cluster environments the concept of a quorum exists to ensure the correctness of the data replication algorithm, even in the event of a failure of one or more nodes of the cluster. By “quorum” it is meant generally a majority of the “healthy” (i.e. operational) nodes of the cluster. That is, a cluster is in quorum when a majority of the nodes are operational and have connectivity to other nodes; in addition, all nodes in the quorum have read/write (RW) access to the replicated management data (i.e. can participate in incremental updates to that data). By requiring that each update be synchronously propagated to a majority of the nodes (a quorum), the replication algorithm is guaranteed to retain all updates despite failures.
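The majority requirement for updates may be sketched as follows. The `Replica` class, the per-replica version counter and the `replicate_update` function are hypothetical names introduced for this illustration; the sketch models only the rule that an update proceeds when a majority of nodes can receive it, not the RDB itself.

```python
class Replica:
    """One node's copy of the management data (toy model)."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.version = 0
        self.data = {}


def replicate_update(replicas, key, value):
    """Apply an update only if a majority of all replicas is reachable.

    Returns True if the update was applied to every reachable replica,
    or False (leaving all replicas unchanged) when quorum is not met.
    """
    reachable = [r for r in replicas if r.online]
    if 2 * len(reachable) <= len(replicas):
        return False  # out of quorum: updates are refused
    new_version = max(r.version for r in reachable) + 1
    for r in reachable:
        r.data[key] = value
        r.version = new_version
    return True
```

With three replicas of which one is offline, the update is applied to the two reachable replicas; if a second replica then goes offline, further updates are refused and the surviving copy stays at its last committed version.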
Broadly stated, a quorum of nodes is established by (1) ensuring connectivity among a majority of operational nodes; (2) synchronizing a baseline of management data among the nodes; and (3) allowing a majority of operational nodes to participate in incremental changes to that baseline data. In clusters containing an even number of nodes, e.g., four nodes, one of the nodes typically has an epsilon value added to its quorum weight, thereby enabling quorum formation without a strict majority, e.g., (2+e)/4 is sufficient. In a two-node cluster, (1+e)/2 is sufficient, i.e., the single epsilon node alone constitutes a quorum. The epsilon assignment is an aspect of cluster configuration; all the nodes must agree on the epsilon assignment.
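The epsilon-weighted quorum test may be sketched as follows. The sketch assumes one vote per node and an epsilon weight strictly between zero and one; the concrete value of `EPSILON` and the function name `in_quorum` are illustrative assumptions.

```python
from fractions import Fraction

# Assumed: any weight strictly between 0 and 1 behaves the same way here.
EPSILON = Fraction(1, 2)


def in_quorum(reachable, all_nodes, epsilon_node=None):
    """Return True if the reachable nodes outweigh half of the cluster.

    Each node carries one vote; the node holding epsilon, if reachable,
    contributes an extra fractional weight that breaks an even split.
    """
    members = set(reachable) & set(all_nodes)
    weight = Fraction(len(members))
    if epsilon_node is not None and epsilon_node in members:
        weight += EPSILON
    return weight > Fraction(len(set(all_nodes)), 2)
```

For a four-node cluster `{"a", "b", "c", "d"}` with epsilon on `"a"`, the set `{"a", "b"}` is in quorum with weight 2+e, whereas `{"c", "d"}` is not. For a two-node cluster with epsilon on `"a"`, `{"a"}` alone is in quorum and `{"b"}` alone is not.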
The value of requiring quorum for update in a cluster lies in the correctness and completeness of replication. This is illustrated by the “partition” problem. A partition occurs when connectivity is lost to one set of nodes as a result of a power failure or other failure in the cluster. The cluster may continue to operate with the remaining set of nodes; all nodes can read their management data, and if that set is sufficient to meet quorum requirements, all the member nodes can update the data. If connectivity is then lost within this set, and is subsequently restored to the other set of nodes, it is possible that the second set will have access only to the management data present before the first connectivity failure. In this case, the second set will not be able to form quorum, and will not be able to perform updates. If, instead, the second set contained a sufficient number of nodes to establish quorum, then at least one of them would have seen the latest updates (as a participant in the prior quorum), and all the nodes in the second set would have the update capability. The quorum requirement guarantees that updates can only be made when the latest data is available.
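The guarantee rests on the fact that any two majorities of the same cluster share at least one node, so a later quorum always includes a participant of the earlier one. The following sketch checks this exhaustively for a small cluster; the function names are illustrative and epsilon weighting is deliberately left out for simplicity.

```python
from itertools import combinations


def majorities(nodes):
    """Return every subset of the cluster that is a strict majority."""
    nodes = sorted(nodes)
    needed = len(nodes) // 2 + 1
    return [
        frozenset(group)
        for size in range(needed, len(nodes) + 1)
        for group in combinations(nodes, size)
    ]


def all_quorums_overlap(nodes):
    """Return True if every pair of majorities has a node in common."""
    quorums = majorities(nodes)
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))
```

For a five-node cluster there are 16 possible majorities, and `all_quorums_overlap` confirms that every pair of them intersects. Two minority groups, by contrast, can be disjoint, which is why a set of nodes lacking quorum cannot be trusted to hold the latest data.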
A noted disadvantage of such quorum-based data replication systems is the inability to meet quorum requirements, thereby preventing any update (write) operations from occurring. This can occur as a result of communication failures, problems with the health of one or more individual nodes, or a combination of these problems. In clusters that utilize these services, the inability to form quorum may prevent an administrator from modifying the management data so as to reconfigure the cluster into an operational state.
This is particularly problematic in the case of a two-node cluster, wherein the failure of either node forces the cluster out of quorum (1/2), as a single node is not a majority of the number of nodes in the cluster. A solution that enables the two-node cluster to achieve a measure of failover is the use of the epsilon value. Here, one of the nodes in the cluster is assigned the epsilon value in addition to its own voting value. If the non-epsilon node fails, the surviving node may continue to operate in quorum as it has a (1+e)/2 quorum voting value (weight). However, a noted disadvantage of the use of an epsilon value is that failover may only occur in one direction, i.e., to the node with epsilon. That is, if the node with epsilon fails, the cluster will not fail over to the node without epsilon as it would not be in quorum (1/2). Thus, the use of an epsilon factor in assigning quorum voting weights only partially alleviates the problem of two-node clusters; i.e., only a unidirectional failover is possible when using an epsilon-assigned quorum voting system.