The proliferation of computers and computing systems has resulted in a continually growing need for efficient and reliable data storage. Storage servers are often used to provide one or more clients with storage services related to the organization and storage of data. The data is typically stored on writable persistent storage media, such as non-volatile memories and disks. A storage server is configured to operate according to a client/server model of information delivery to enable one or more clients (devices or applications) to access the data served by the system. A storage server can employ a storage architecture that serves the data with both random and streaming access patterns at either a file level, as in network attached storage (NAS) environments, or at a block level, as in a storage area network (SAN).
In some data storage systems, groups of storage servers field input/output (I/O) operations (i.e., reads and writes) independently, but are exposed to hosts or clients as a single device. A group of storage servers operating in this manner is often called a “storage cluster.” Each storage server in a cluster may be called a “storage node,” a “data node,” or just a “node.” It is common to “stripe” data across storage nodes in a manner similar to how data is striped across disks in RAID arrays. Striping the data across nodes in this manner can improve random I/O performance without decreasing sequential I/O performance. In this configuration, each stripe of data may be called a storage zone, a data zone, or simply a zone. Each node may contain multiple zones. In some cases, error detection or correction information may also be stored in one or more of the nodes in a cluster. The error detection or correction information is often stored in dedicated stripes, which are often referred to as checksum zones or parity zones.
In an erasure coded data system, forward error correction codes are used to improve data reliability and the ability to recover from data errors. Erasure coding transforms a data set containing n data elements into a longer data set containing m additional data elements that are often referred to as checksum elements. The checksum elements are generated in a manner such that the original n data elements can be recovered from one or more subsets of the combined m+n data elements. Similar to the parity concept in RAID systems, the checksum elements provide an error protection scheme for the data elements. In case one or more data elements are inaccessible, fail, or contain erroneous data, the checksum elements may be utilized in combination with the remaining valid data elements to correct the error or restore the data elements. In this way, the original data can be recovered even though some of the original n data elements may be lost or corrupted.
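As an illustration only, the simplest case of such a scheme can be sketched with a single XOR checksum element (n = 3, m = 1), which tolerates the loss of any one element. This is a minimal sketch and not the coding used by any particular system; practical erasure codes such as Reed-Solomon codes support m greater than one. The function names xor_checksum and recover_missing are hypothetical.

```python
from functools import reduce

def xor_checksum(elements):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*elements))

def recover_missing(surviving, checksum):
    """Rebuild the single missing data element from the survivors and the checksum."""
    return xor_checksum(list(surviving) + [checksum])

# n = 3 data elements and m = 1 checksum element (illustrative values).
data = [b"abcd", b"efgh", b"ijkl"]
checksum = xor_checksum(data)

# Suppose data[1] becomes inaccessible; the remaining elements restore it.
recovered = recover_missing([data[0], data[2]], checksum)
```

Because XOR is its own inverse, combining the checksum with the surviving elements cancels their contributions and leaves the missing element.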
In a distributed erasure coded data system, the data zones and the checksum zones are spread across multiple nodes. The various nodes that contain the data zones and the checksum zones for a data set are often referred to as a reliability group. Each data zone in a reliability group may reside on a separate node, or several data zones in the reliability group may reside on the same node. In addition, the parity zones may also reside on separate nodes. In some cases, the nodes associated with a reliability group are each in a different physical location.
In order to properly recover from an error at any point in time, updates to the data zones and the associated checksum zones must typically remain synchronized. If an attempt to recover from an error in a recently changed data element is made using a checksum zone that has not yet been updated with respect to a change in an associated data zone, the recovery attempt will likely fail or produce an incorrect result.
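The failure mode described above can be illustrated with a small sketch that assumes a single XOR checksum zone protecting two data zones. The zone contents and the helper xor_bytes are hypothetical and chosen only for illustration.

```python
def xor_bytes(*blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)

# Two data zones and the checksum zone computed from them.
zone_a, zone_b = b"AAAA", b"BBBB"
stale_checksum = xor_bytes(zone_a, zone_b)

# Zone A is updated, but the checksum zone has not yet been updated.
new_zone_a = b"ZZZZ"

# Zone B is then lost. Recovery with the stale checksum is incorrect.
bad_recovery = xor_bytes(new_zone_a, stale_checksum)

# Once the checksum reflects the update, recovery is correct again.
fresh_checksum = xor_bytes(new_zone_a, zone_b)
good_recovery = xor_bytes(new_zone_a, fresh_checksum)
```

In this sketch the stale recovery silently yields wrong bytes rather than raising an error, which is why the data zones and checksum zones need to remain synchronized.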
The traditional method for maintaining data synchronization or consistency across independent storage nodes in a distributed storage system is through the use of multi-phase commit protocols, for example two-phase and three-phase commit protocols. In multi-phase commit protocols, data elements and checksum elements are updated in lockstep such that decisions to commit changes or to roll back to previous versions of the data are made in a coordinated, atomic manner. Using these protocols, a data element will typically not commit data to storage until data or checksum elements in other nodes have indicated that the nodes are ready to perform corresponding data storage steps at the same time.
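A two-phase commit can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the Node class, its can_prepare flag, and the two_phase_commit function are hypothetical, and a real protocol would also handle timeouts, logging, and coordinator failure.

```python
class Node:
    """Hypothetical storage node holding one committed value and one staged value."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare  # simulates a node that cannot accept the update
        self.committed = None
        self.staged = None

    def prepare(self, value):
        """Phase 1: stage the value and vote on whether the node is ready to commit."""
        if not self.can_prepare:
            return False
        self.staged = value
        return True

    def commit(self):
        """Phase 2 (success): make the staged value durable."""
        self.committed = self.staged
        self.staged = None

    def rollback(self):
        """Phase 2 (failure): discard the staged value and keep the previous version."""
        self.staged = None


def two_phase_commit(updates):
    """Apply (node, value) updates atomically; return True if all nodes committed."""
    votes = [node.prepare(value) for node, value in updates]  # first round
    if all(votes):
        for node, _ in updates:  # second round: commit everywhere
            node.commit()
        return True
    for node, _ in updates:  # second round: roll back everywhere
        node.rollback()
    return False


data_node = Node("data")
checksum_node = Node("checksum")
ok = two_phase_commit([(data_node, b"new data"), (checksum_node, b"new checksum")])

# A node that cannot prepare forces every node to keep its previous version.
busy_node = Node("checksum-2", can_prepare=False)
failed = two_phase_commit([(data_node, b"newer data"), (busy_node, b"newer checksum")])
```

In the sketch, no node commits until every node has voted, so the data and checksum values change together or not at all.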
While multi-phase commit protocols provide a number of benefits, they also suffer from a variety of problems. First, as the name suggests, they involve multiple rounds of communication. These multiple rounds of communication among the nodes in a cluster introduce additional latency and resource demands. Second, the error scenarios that can occur when using multi-phase commit protocols are often complex. Third, in known techniques, when a group of nodes is involved in a process utilizing a multi-phase commit protocol, each of the nodes in the group must move in lockstep with the others. Consequently, the progress made by each of the nodes in the group is limited by the node of the group that is making the least or slowest progress. In other words, synchronization requires that the nodes of a reliability group wait for other nodes of the group to complete certain steps before they can proceed.