Information is the most crucial asset of many businesses, and any disruption of access to this information may cause extensive damage. Some businesses, such as banks, airlines (with e-tickets), auction sites, and on-line merchants, may actually stop functioning without access to their information. No matter how reliable a data center is, there can still be site failures (floods, earthquakes, fires, etc.) that can destroy the data stored on a storage device and any co-located backup media.
Geographic replication is the only way to avoid service disruptions caused by such site failures. Geographic replication has challenges, however: performance needs to be maintained, and different sites might run at different speeds and have different latencies. Having multiple remote copies may increase reliability, but for most purposes the replicas need to be kept in sync in real time. If a site fails and comes back on-line, its data needs to be recovered without an excessive impact on the rest of the system. Generally, when a failure occurs, a replica that does not contain up-to-date data needs to recover from another site that does contain up-to-date data.
In a replicated storage system, it is desirable to keep a plurality of replicas of data consistent in spite of failures. The data consists of a plurality of data items, which may be disk blocks or any other information. In a replicated storage system, a source, such as a host computer, issues a sequence of requests, which may be either read or write requests to a particular data item or to a group of data items. A data item is the smallest unit that can be read or written in one request. Read and write requests are atomic in the sense that they are either executed completely or not at all. In other words, there is no possibility that a data item will contain a mixture of old and new information after a write request to this data item is executed.
Generally, a replicated storage system consists of a source and a plurality of replicas connected by a communication network, for example, an IP network such as the Internet. The source, such as a host computer, receives the requests from the outside world, and/or generates them internally. The source sends write requests to all of the replicas in the system, and sends read requests to one or more of the replicas. The replicas keep the data in a non-volatile storage device, such as a magnetic disk or a non-volatile memory (e.g., a non-volatile random access memory [NVRAM]), so that the data is not lost when the replica fails. Requests are sent to all replicas in the same order.
The source communicates with the replicas using a reliable communication protocol, such as TCP/IP, which ensures that the information sent by the source is received by the replicas without reordering or corruption. In case of a full or partial network failure, the affected replicas are disconnected from the source and do not receive any further communication from it. When a replica fails, it stops updating its data (a fail-stop assumption).
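By way of illustration only, the system model described above may be sketched in Python as follows. The class names `Replica` and `Source`, the `connected` flag, and the use of an in-memory dictionary in place of a non-volatile storage device are assumptions made for this sketch and do not appear in the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica; a dict stands in for non-volatile storage."""
    name: str
    store: dict = field(default_factory=dict)
    connected: bool = True  # False models a replica or partial network failure

    def write(self, item, value):
        # Fail-stop assumption: a failed or disconnected replica
        # stops updating its data.
        if not self.connected:
            return False
        # Atomic write: the whole data item is replaced in one step.
        self.store[item] = value
        return True

    def read(self, item):
        return self.store.get(item)


class Source:
    """Hypothetical source that issues requests to the replicas."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, item, value):
        # Every write request goes to all replicas, in the same order.
        return [r.name for r in self.replicas if r.write(item, value)]

    def read(self, item):
        # A read request goes to one reachable replica.
        for r in self.replicas:
            if r.connected:
                return r.read(item)
        raise RuntimeError("no replica reachable")


a, b = Replica("A"), Replica("B")
src = Source([a, b])
first = src.write("block-7", b"old")   # both replicas apply the write
b.connected = False                     # replica B fails
second = src.write("block-7", b"new")  # only replica A applies the write
```

In this sketch, replica B still holds the old value of `block-7` after it fails, which is the situation a recovery process has to repair.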
A system containing multiple replicas is said to be consistent if all of the replicas contain the same data after the source stops sending write requests and all outstanding write requests have been processed by all of the replicas. In other words, a definition of consistency is that each replica contains the identical data after all of the replicas have finished processing the same sequence of write requests.
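The definition of consistency given above may be illustrated by the following minimal Python sketch. The helper names `apply_writes` and `is_consistent`, and the representation of a replica as a dictionary, are assumptions made for illustration only.

```python
def apply_writes(writes):
    """Return the state of a replica after it processes the given
    sequence of (data item, value) write requests in order."""
    store = {}
    for item, value in writes:
        store[item] = value
    return store


def is_consistent(replicas):
    """True if every replica contains identical data."""
    return all(r == replicas[0] for r in replicas[1:])


writes = [("x", 1), ("y", 2), ("x", 3)]

r1 = apply_writes(writes)
r2 = apply_writes(writes)                      # same sequence, same order
r3 = apply_writes(writes[:2])                  # missed the last write request
r4 = apply_writes(list(reversed(writes)))      # same requests, different order

same_sequence = is_consistent([r1, r2])
missed_write = is_consistent([r1, r3])
reordered = is_consistent([r1, r4])
```

The last case suggests why requests are sent to all replicas in the same order: two write requests to the same data item leave different final values when applied in different orders.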
A replicated storage system needs to keep the data consistent after a failure. A failure can be of the source, the network, or one or more of the replicas. The system thus needs to recover any changes to the data that a replica missed due to failures. If the source failed, no data was changed during the failure. Thus, after the source recovers, no additional operation needs to be performed with respect to any of the replicas to maintain consistency. Similarly, a complete network failure requires no additional operation with respect to the replicas, since no data was changed during the network failure: all write requests must pass through the network in order to reach the replicas. A replica failure causes the affected replica to miss some write requests that the other replicas performed; the failed replica needs to recover these requests in order to become consistent again. A partial network failure prevents communication between the source and some of the replicas. Since the affected replicas fail to receive some write requests, recovery of the affected replicas needs to be handled in the same way as a replica failure.
A recovery process is needed to ensure that by the end of the recovery the affected replica will contain the same data as the other replicas that did not fail. In other words, a recovery process is used to achieve consistency after a replica failure or a partial network failure by allowing the affected replica to recover the changes to the data that it missed during the failure. A problem arises, however, in that during the recovery period the source is likely to issue new write requests to the data.
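The failure cases discussed above may be summarized by the following Python sketch, which is offered only as an illustration. The enumeration `Failure` and the function `replicas_needing_recovery` are hypothetical names introduced here.

```python
from enum import Enum, auto


class Failure(Enum):
    SOURCE = auto()
    COMPLETE_NETWORK = auto()
    PARTIAL_NETWORK = auto()
    REPLICA = auto()


def replicas_needing_recovery(failure, affected=()):
    """Return the set of replicas that must recover missed write requests.

    A source failure or a complete network failure changes no data at any
    replica, so no replica needs recovery. A replica failure or a partial
    network failure causes the affected replicas to miss write requests.
    """
    if failure in (Failure.SOURCE, Failure.COMPLETE_NETWORK):
        return set()
    return set(affected)


after_source_failure = replicas_needing_recovery(Failure.SOURCE)
after_replica_failure = replicas_needing_recovery(Failure.REPLICA, ["B"])
after_partial_failure = replicas_needing_recovery(
    Failure.PARTIAL_NETWORK, ["B", "C"]
)
```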
In some prior art systems (see, e.g., Sun Microsystems, “Sun StorEdge™ Network Data Replicator Software Boosts Data Center Resilience”, White Paper, http://www.sun.com/storage/white-papers/sndr.html), only two replicas (primary and remote copies) are possible, since the replicas use a scoreboard of bit values to track data changes during a single failure. This prior art system does not specify whether the source can issue write requests while the recovery is in progress. In other prior art methods (see, e.g., Richard P. King, Nagui Halim, Hector Garcia-Molina, and Christos A. Polyzois, “Management of a Remote Backup Copy for Disaster Recovery”, ACM Transactions on Database Systems, Volume 16, Number 2, June 1991, pp. 338–368), the source is allowed to continue making write requests while the recovery is in progress, but the entire contents of the current replica must be copied to the affected replica. This may cause the current replica to transfer more data than necessary, which lengthens the duration of the recovery. A longer recovery reduces the reliability of the entire system, since a replica that is not operational cannot protect the system from further failures. In prior art database recovery (see, e.g., Abraham Silberschatz, Henry F. Korth, and S. Sudarshan, “Database System Concepts (3rd Edition)”, McGraw-Hill, 1997, Chapter 15, pp. 511–531), a single copy of the data is recovered from the transaction log. Also, the source is prohibited from making write requests while the recovery is in progress.
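The difference in transferred data between copying the entire contents of a replica and copying only the data items that changed during a failure may be illustrated by the following Python sketch. It is not the implementation of any of the cited systems; the `Scoreboard` class, the item count, and the sample write requests are assumptions chosen for illustration.

```python
class Scoreboard:
    """Hypothetical scoreboard: one bit per data item, set when the
    item is written while the other replica is unavailable."""

    def __init__(self, n_items):
        self.bits = [False] * n_items

    def mark(self, index):
        self.bits[index] = True

    def changed(self):
        return [i for i, bit in enumerate(self.bits) if bit]


N_ITEMS = 1000
current = [0] * N_ITEMS      # replica that stays up to date
stale = list(current)        # replica that fails and misses writes
board = Scoreboard(N_ITEMS)

# Write requests issued while the second replica is failed (assumed data).
for index, value in [(5, 11), (42, 22), (5, 33), (900, 44)]:
    current[index] = value
    board.mark(index)

# Full copy: every data item is transferred.
full_copy_items = len(current)

# Scoreboard-based copy: only items marked as changed are transferred.
changed_items = board.changed()
for index in changed_items:
    stale[index] = current[index]

recovered = stale == current
```

Under these assumptions, the full copy transfers 1000 data items, whereas the scoreboard-based copy transfers 3. The sketch does not address write requests that arrive while the copy is in progress.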
Therefore, an efficient recovery process is needed for a replicated data storage system, one that does not require the source to stop generating write requests while recovery is taking place and that minimizes the amount of information transferred to the recovering replica in order to make it consistent with the other replicas in the system.