In computer systems, redundant copies of important data may be used to provide data availability, reliability, and consistency. One way of recovering data from a failure (e.g., a crash of a disk, disk array, or other storage device, or corruption of a database, application program, or operating system) is to scan entire datasets to determine and then reconcile inconsistencies. This process, however, may be expensive and may introduce significant recovery delays. Another conventional approach to data recovery is to maintain a log of updates that have not been applied to all replicas of a data item. With this approach, only the data segments recorded in the log need to be examined to determine and reconcile any differences between replicas, but additional disk writes are required in order to maintain the log.
Conventional systems may involve one or more data item replicas, stored on different storage devices (e.g., physical devices, disk drives, disk arrays, RAIDs, solid state memories, and the like). A replica is a copy or duplicate of a data item, individual field, record, or other item within a dataset. A conventional system may access data on the physical devices using commands (e.g., “data=read (replica, offset),” “write (replica, offset, data),” and others). An offset provides an indication of a storage location for a particular set of data. For example, the command “dataBuffer=read (A,4352)” reads data stored at offset 4352 on device A into a buffer, and the command “write (B,2343, ‘wombat’)” writes the character string “wombat” at offset 2343 on device B. In a system where a data item is replicated at two or more independent locations (also referred to as “replicas”), each replica must be updated in such a way as to maintain consistency with all other replicas. That is, changes made to an item in the dataset must be reflected identically in all replicas of that dataset. We refer to this consistency guarantee as the “replication invariant.”
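The read and write commands and the replication invariant described above can be illustrated with a minimal sketch. The Device class, the length parameter on read, and the invariant_holds helper are hypothetical names introduced only for this illustration; they are not part of any described system, and an in-memory byte array stands in for a physical device.

```python
class Device:
    """Hypothetical in-memory stand-in for a physical storage device."""

    def __init__(self, size):
        self.cells = bytearray(size)


def read(replica, offset, length):
    # Assumption: an explicit length is passed, whereas the text's
    # read(replica, offset) leaves the amount of data implicit.
    return bytes(replica.cells[offset:offset + length])


def write(replica, offset, data):
    # Overwrite len(data) bytes starting at the given offset.
    replica.cells[offset:offset + len(data)] = data


def invariant_holds(replicas, offset, length):
    # The replication invariant: every replica holds identical data
    # at the same offset.
    first = read(replicas[0], offset, length)
    return all(read(r, offset, length) == first for r in replicas[1:])


A = Device(8192)
B = Device(8192)

# Update both replicas of the same data item.
write(A, 2343, b"wombat")
write(B, 2343, b"wombat")

data_buffer = read(A, 2343, 6)
consistent = invariant_holds([A, B], 2343, 6)
```

In this sketch the invariant holds only because both writes completed; nothing in the write command itself enforces it.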
Replicating data items using write commands, however, presents several problems when failures occur. For example, the replication invariant is violated when the system fails after completing the write to device A but before completing the write to device B. To correct this problem, the system must perform an expensive recovery procedure after a failure. In such cases, data on devices A and B must be read to determine whether there are any differences in the copies of the replicated data item stored on those devices. Any differences in the copies resulting from the failure of a storage device require corrections to properly restore the copies.
The process by which all copies of the replicated data items are made identical to each other is referred to as reconciliation. Reconciliation is performed by copying device A's version of the data to device B or vice versa. In this way, a complete and correct copy of the data item is restored. Repeating this process for each replicated data item, however, may be costly in terms of both time and effort, since the entire dataset must be analyzed. As discussed above, another conventional technique for implementing a replicated write on a data item is to use a replication log, which keeps track of the offsets that have not been consistently updated on all replicas.
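The cost of full-scan reconciliation can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular system: the devices are modeled as in-memory byte arrays, the 4-byte block size and the crash_between flag are hypothetical, and device A's version is arbitrarily chosen as the one to keep.

```python
BLOCK = 4  # hypothetical block size, in bytes


def replicated_write(a, b, offset, data, crash_between=False):
    # Write to device A, then to device B, with no log.
    a[offset:offset + len(data)] = data
    if crash_between:
        return  # simulated failure: device B never receives the update
    b[offset:offset + len(data)] = data


def reconcile_full_scan(a, b):
    """Compare every block of both devices and copy A's version to B
    wherever they differ. Returns (blocks examined, blocks repaired)."""
    examined = 0
    repaired = 0
    for start in range(0, len(a), BLOCK):
        examined += 1
        if a[start:start + BLOCK] != b[start:start + BLOCK]:
            b[start:start + BLOCK] = a[start:start + BLOCK]
            repaired += 1
    return examined, repaired


dev_a = bytearray(64)
dev_b = bytearray(64)

replicated_write(dev_a, dev_b, 8, b"emu!")                          # completes on both
replicated_write(dev_a, dev_b, 20, b"wombat", crash_between=True)   # fails after A

examined, repaired = reconcile_full_scan(dev_a, dev_b)
```

Here every one of the 16 blocks is examined even though only the two blocks touched by the interrupted write need repair, which is the expense the passage above describes.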
In cases where a replication log is used, a replicated Write(offset, data) operation is logically equivalent to three steps: “log(offset); write(data, A), write(data, B); unlog(offset),” where the two writes together form the second step and the unlog operation erases the prior log entry. This type of replicated write operation enables quick recovery from a failure of a physical storage device. In the event of a failure, log entries are examined for updates that were in progress at the time of the failure, and affected replicas are reconciled. Using replication logs is quicker than analyzing the entire set of data for consistency because only a subset of the data items needs to be analyzed (i.e., the portion that was being modified at the time of the failure). In order for the log to be persistent (and thus survive failures), however, it must be written to stable storage. Thus, an extra disk write is required for every replicated write operation using a replication log, which significantly increases the latency of write operations and severely degrades the performance of write-intensive workloads.
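The logged write sequence and the corresponding recovery can be sketched as below. The StableLog class, the log_writes counter, and the crash_after_a flag are hypothetical and exist only for illustration. The sketch assumes that each log entry costs one write to stable storage and that unlog does not need a synchronous write; both are simplifying assumptions rather than statements about a real system.

```python
class StableLog:
    """Hypothetical persistent replication log (kept in memory here)."""

    def __init__(self):
        self.pending = {}    # offset -> length of the in-progress update
        self.log_writes = 0  # assumed stable-storage writes for log entries

    def log(self, offset, length):
        self.pending[offset] = length
        self.log_writes += 1  # the extra disk write per replicated write

    def unlog(self, offset):
        # Assumption: erasing an entry needs no synchronous write.
        self.pending.pop(offset, None)


def logged_write(log, a, b, offset, data, crash_after_a=False):
    log.log(offset, len(data))               # step 1: log(offset)
    a[offset:offset + len(data)] = data      # step 2: write to A ...
    if crash_after_a:
        return                               # simulated failure before B
    b[offset:offset + len(data)] = data      # ... and write to B
    log.unlog(offset)                        # step 3: unlog(offset)


def recover(log, a, b):
    """Reconcile only the offsets still present in the log by copying
    A's version to B. Returns the number of regions examined."""
    examined = 0
    for offset, length in list(log.pending.items()):
        b[offset:offset + length] = a[offset:offset + length]
        log.unlog(offset)
        examined += 1
    return examined


rlog = StableLog()
dev_a = bytearray(64)
dev_b = bytearray(64)

logged_write(rlog, dev_a, dev_b, 8, b"emu!")
logged_write(rlog, dev_a, dev_b, 20, b"wombat", crash_after_a=True)

regions_examined = recover(rlog, dev_a, dev_b)
```

Recovery touches only the single region left in the log, but the log_writes counter shows the trade-off: one additional stable-storage write was charged to each of the two replicated writes, including the one that never failed.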
Thus, what is needed is a solution for recovering from a failure without incurring the extra log write for each replicated write.