1. Field of the Invention
The present invention relates in general to the field of computer system backup, and more particularly to a system and method for remote recovery with checkpoints and intention logs.
2. Description of the Related Art
Business enterprises often maintain disaster recovery computing resources that can take over when primary computing resources, such as an application that runs on a server to support multiple clients in performing critical tasks, fail. Although disaster recovery resources can be co-located with primary resources, businesses often use remote disaster recovery resources to improve disaster recovery reliability. For instance, co-located primary and disaster recovery resources will both fail if power to their common location is interrupted, whereas simultaneous power failures at primary and remote locations are unlikely.
Remote disaster recovery of a server application workload often involves a hot or cold start of the application workload at a server of the remote location and a pairing with an inexact mirror of the application workload data available at the remote location. Typically, the copy of application workload data available at the remote location is an “inexact mirror” because even synchronous mirroring of the application workload data to the remote location will sometimes result in incomplete copies of the application workload data at the remote location if a failure occurs before a mirror update of currently changing data has been completely processed. Thus, a hot or cold start of a workload at a remote location requires that the application at the remote location include code that understands and can recover from an inexact mirror, such as by starting the application from a known data recovery rendezvous point. The rendezvous point allows the application at the remote recovery location to bring the application workload data to a consistent state and to proceed forward with new processing after the recovery. A disadvantage of this approach is that the remote recovery software is typically complex and application-specific.
An alternative to remote recovery with an inexact mirror is the use of hypervisor-checkpointed workload instances as a form of hot start of an application. By pairing a workload checkpoint with a checkpoint of the workload data storage taken at the same instant, the need for a data recovery rendezvous point is removed. This type of checkpointing is used to save a workload and its data for a subsequent restart, such as for end-of-year compliance, but is not practical for use in a remote disaster recovery situation because communicating a checkpoint of a workload and its data through a network to a remote location is prohibitive in both resources and time. An example that illustrates this difficulty in a transaction-sensitive environment is a set of ATM clients of a bank that request transactions to a database server. In the example, a database server that handles ATM transactions is checkpointed along with its data at time T1. The checkpoint is mirrored to a remote disaster recovery site, with the mirror completed at time T2. For purposes of the example, checkpoints continue at odd-numbered times and mirroring continues at even-numbered times, although these operations could be interleaved at different intervals in other example embodiments. If an ATM client receives confirmation of an intermediate transaction to the database between a checkpoint and a mirror, i.e., between T1 and T2, and a failure occurs after the checkpoint but before the mirroring completes, the disaster recovery site will not have the confirmed transaction as a part of its hot start, both because the transaction occurred after the checkpoint was taken and because the mirror is incomplete. Thus, as the result of the failure at a primary server, a bank customer who made a deposit at the ATM will not get credit for the deposit. Most solutions that attempt to avoid such lost transactions are complex and very application-specific.
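The timing of the ATM example can be illustrated with a minimal sketch, offered only as an aid to understanding and not as an implementation of any particular system. The class names PrimaryServer and RecoverySite, the function simulate_lost_deposit, the account identifier, and the balance amounts are all hypothetical. The sketch assumes the recovery site hot starts from the last checkpoint whose mirror completed.

```python
import copy


class PrimaryServer:
    """Hypothetical stand-in for the database server that handles ATM requests."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def deposit(self, account, amount):
        # Apply the transaction and confirm it to the ATM client.
        self.balances[account] = self.balances.get(account, 0) + amount
        return True

    def checkpoint(self):
        # Snapshot of the workload data at this instant.
        return copy.deepcopy(self.balances)


class RecoverySite:
    """Hypothetical remote site that keeps only the last fully mirrored checkpoint."""

    def __init__(self, initial_mirror):
        self.last_complete_mirror = dict(initial_mirror)

    def complete_mirror(self, checkpoint):
        self.last_complete_mirror = dict(checkpoint)

    def hot_start(self):
        # Recovery resumes from whatever checkpoint was last mirrored in full.
        return dict(self.last_complete_mirror)


def simulate_lost_deposit(mirror_completes):
    """Return (confirmed, primary_balance, recovered_balance) for one failure."""
    primary = PrimaryServer({"acct-1": 100})
    remote = RecoverySite({"acct-1": 100})  # mirror of an earlier checkpoint

    checkpoint_t1 = primary.checkpoint()        # time T1
    confirmed = primary.deposit("acct-1", 50)   # between T1 and T2, confirmed to client

    if mirror_completes:
        remote.complete_mirror(checkpoint_t1)   # time T2

    # The primary fails here; the recovery site performs its hot start.
    recovered = remote.hot_start()
    return confirmed, primary.balances["acct-1"], recovered["acct-1"]


if __name__ == "__main__":
    for completes in (False, True):
        print(completes, simulate_lost_deposit(completes))
```

Under these assumptions the confirmed deposit is absent from the recovered data whether or not the mirror of the T1 checkpoint completes, because the deposit was made after the checkpoint was taken.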