In distributed computer systems including multiple computer nodes, data may be replicated across computer nodes and storage units to decrease the chance of data loss and/or to increase the percentage of time that the systems are available as compared to non-replicated systems. When replicating, many applications desire single copy consistency semantics, where all clients see the same version of data and data writes that may have been observed do not revert to a prior state. For example, consider a single register with replicas A and B and an initial value of 1. A client changes the register value to 2. Once the value 2 is observed, no reader is allowed to observe the value 1 regardless of which replica is read, even if the observation occurs indirectly, such as by knowing that the write completed. A split brain scenario, where some clients read the value 1 and others read the value 2, is avoided.
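The register example can be stated as a simple check over a history of observations. The following is a minimal sketch, not part of any described system: the function name `first_reversion` and the representation of a history as `(replica, value)` pairs in real-time order are assumptions made for illustration only.

```python
def first_reversion(observations, write_order):
    """Return the index of the first observation that reverts to an older
    value, or None if the history respects single copy consistency.

    observations: list of (replica, value) pairs in real-time order
                  (hypothetical representation).
    write_order:  list of values in the order they were written,
                  starting with the initial value.
    """
    rank = {value: i for i, value in enumerate(write_order)}
    newest_seen = -1
    for index, (_replica, value) in enumerate(observations):
        if rank[value] < newest_seen:
            # A value older than one already observed was read:
            # this is the split brain symptom described above.
            return index
        newest_seen = max(newest_seen, rank[value])
    return None


# Register starts at 1 and is then written to 2.
consistent = [("A", 1), ("A", 2), ("B", 2)]
split_brain = [("A", 2), ("B", 1)]  # replica B still serves the old value
```

Here `first_reversion(consistent, [1, 2])` finds no violation, while the `split_brain` history is flagged at its second observation, because the value 1 is read from replica B after the value 2 was already observed.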
This is sometimes solved by designating one replica as the “master” and additional replicas as “slaves,” with a more reliable hardware and software component recording which replica is the current master and which slaves may become masters. When a slave fails, the current master uses the component (i.e., the more reliable hardware and software component) to designate the failed slave non-authoritative before completing additional data writes. When the master fails, an authoritative slave is made master and the old master is marked as non-authoritative by the component before input-output (IO) requests are satisfied. This scheme may be undesirable because some embodiments of the component can still be single points of failure. The scheme may also be intolerant of sequential failures, which are common because failures with a correlated cause often manifest sequentially rather than simultaneously. For example, consider three replicas A, B, and C with A acting as master. A correlated cause such as overheating may produce abnormal shutdowns of all three nodes far enough apart in time for B to replace A and then C to replace B before C fails. If the fault causes a permanent failure of C, all data is lost because neither A nor B is authoritative.
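The A, B, C failure sequence can be traced with a small simulation. This is a sketch under stated assumptions: the class `AuthorityTracker` is a hypothetical stand-in for the more reliable hardware and software component, and promotion simply picks the first remaining authoritative replica, a policy the text does not specify.

```python
class AuthorityTracker:
    """Hypothetical model of the component that records which replica
    is master and which replicas are still authoritative."""

    def __init__(self, replicas):
        self.order = list(replicas)
        self.authoritative = set(replicas)
        self.master = self.order[0]

    def fail(self, replica):
        """Mark a failed replica non-authoritative; if it was the master,
        promote an authoritative slave (assumed policy: first in order)."""
        self.authoritative.discard(replica)
        if replica == self.master:
            candidates = [r for r in self.order if r in self.authoritative]
            self.master = candidates[0] if candidates else None

    def recoverable(self):
        """Data can be served only while some replica is authoritative."""
        return bool(self.authoritative)


def simulate(failures):
    """Apply a sequence of failures to replicas A, B, C with A as master."""
    tracker = AuthorityTracker(["A", "B", "C"])
    for replica in failures:
        tracker.fail(replica)
    return tracker


after_two = simulate(["A", "B"])         # C has become master
after_three = simulate(["A", "B", "C"])  # no authoritative replica remains
```

After the third failure, `after_three.recoverable()` is false even if A and B restart with intact storage, because both were already marked non-authoritative. This is the sequential failure intolerance described above.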
Consensus protocols such as Paxos can be applied to solve the problem, exploiting the mathematical property that every majority (more than n/2 members in an n-replica system) shares at least one member with every other majority. The system remains available through any sequence of failures that leaves a majority reachable and reliable, as long as a complete data set exists, regardless of what sequential failures occurred. When replication is implemented with a consensus protocol, reads and writes complete when a majority agree on the current value. Additional metadata in the form of sequence numbers or time stamps is included to identify which of the disagreeing replicas is correct when a different quorum participates in a read. The replication is often implemented as a distributed state machine, with an instance of the consensus protocol determining the Nth command, which may be “write key A=value B” (where the current value of A is the latest of its writes), “replica 1 is no longer authoritative”, or “add node 23 to the cluster”. Naive implementations explicitly store sequence numbers for each command, use separate storage for undecided commands, and always store at least three copies of data. Due to these space and time overheads, consensus is often applied only to determining which replicas are authoritative. While this avoids replica authority determination as a single point of failure, the system may still be vulnerable to sequential failures.
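The majority property follows from counting: two sets that each contain more than n/2 of n replicas together contain more than n memberships, so at least one replica belongs to both. The sketch below checks this exhaustively for a small cluster and shows how a sequence number can select the correct value when the replicas in a read quorum disagree. It is an illustration only and not an implementation of Paxos; the function names and the `(sequence_number, value)` response format are assumptions.

```python
from itertools import combinations


def majorities(replicas):
    """Enumerate every subset containing more than half of the replicas."""
    n = len(replicas)
    smallest = n // 2 + 1
    return [set(group)
            for size in range(smallest, n + 1)
            for group in combinations(replicas, size)]


def all_majorities_intersect(replicas):
    """Check that every pair of majorities shares at least one member."""
    groups = majorities(replicas)
    return all(a & b for a in groups for b in groups)


def quorum_read(responses, n):
    """Return the value carrying the highest sequence number.

    responses: dict mapping replica name -> (sequence_number, value),
               containing only the replicas that answered (assumed format).
    n:         total number of replicas in the system.
    """
    if len(responses) <= n // 2:
        raise RuntimeError("no majority reachable")
    _seq, value = max(responses.values(), key=lambda pair: pair[0])
    return value


# Replica A missed the write of value 2; B has it. C is unreachable.
answers = {"A": (1, 1), "B": (2, 2)}
current = quorum_read(answers, n=3)
```

Because a completed write reached a majority, any majority that answers a later read includes at least one replica holding that write, and the sequence number identifies it. Here the stale response from A is ignored in favor of the newer value held by B.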
A reallocate-on-write policy may be implemented with a scheme that implies the temporal order of writes, such as a log ordering the writes, or sequence numbers on written blocks. The reallocate-on-write policy may be used to provide low-latency IO to storages requiring a separate erase phase and/or to accommodate storages that may have bad blocks, such as flash memories. The reallocate-on-write policy implicitly retains old copies of data. The mechanism used for reallocate-on-write may imply an ordering that can be used for consensus processing without requiring that additional consensus sequence numbers be stored for the consensus protocol. For example, time stamps or sequence numbers stored with blocks of data could be used for consensus ordering, as could the order of blocks in a log implemented as a linked list. An offset into a block or region could be used alone or with one of these other methods. However, there is a need for techniques that allow consensus-based replication tolerant of more sequential failure modes to be implemented with the same time and space overhead as simpler master-slave replication schemes.
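The idea that a reallocate-on-write mechanism already implies an ordering can be sketched with an append-only log in which the position of an entry serves as its sequence number. This is a simplified illustration, not the technique of any particular embodiment: the class `ReallocatingLog`, the function `resolve`, and the assumption that replicas apply writes in the same order (so that a lagging replica holds a prefix of the log) are all introduced here for the example.

```python
class ReallocatingLog:
    """Hypothetical append-only log: a rewrite of a block is appended at a
    new location rather than overwriting the old copy in place."""

    def __init__(self):
        self.entries = []  # list of (block_id, data); the index implies order

    def write(self, block_id, data):
        """Append the new copy and return its log position, which doubles
        as an implied sequence number (nothing extra is stored)."""
        self.entries.append((block_id, data))
        return len(self.entries) - 1

    def latest(self, block_id):
        """Return (position, data) for the newest copy, or None if absent."""
        for position in range(len(self.entries) - 1, -1, -1):
            if self.entries[position][0] == block_id:
                return position, self.entries[position][1]
        return None

    def history(self, block_id):
        """Old copies are implicitly retained; list them oldest first."""
        return [data for bid, data in self.entries if bid == block_id]


def resolve(logs, block_id):
    """Pick the copy with the highest log position among the replicas."""
    found = [log.latest(block_id) for log in logs]
    found = [item for item in found if item is not None]
    return max(found, key=lambda item: item[0])[1] if found else None


# Replica a has both writes to block 7; replica b lags by one write.
a, b = ReallocatingLog(), ReallocatingLog()
a.write(7, b"old")
a.write(7, b"new")
b.write(7, b"old")
winner = resolve([a, b], 7)
```

In this sketch the disagreement between the two replicas is settled purely by log position, so no separate consensus sequence number is kept alongside the data, and `a.history(7)` still exposes the retained old copy.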