1. Field of the Invention
This invention relates to computer systems and, more particularly, to disaster recovery of computer systems.
2. Description of the Related Art
Many business organizations and governmental entities today increasingly rely upon mission-critical applications to provide services to both internal and external customers. Large data centers in such organizations may support complex mission-critical applications utilizing hundreds of processors and terabytes of data. Application down time, e.g., due to hardware or software failures, bugs, malicious intruders, or events such as power outages or natural disasters, may sometimes result in substantial revenue losses and/or loss of good will among customers in such environments. The importance of maintaining a high level of application availability has therefore been increasing over time.
Various approaches may be taken to increase the availability of the computing services provided at a data center, such as the use of redundant and/or fault-tolerant hardware and software, the deployment of security software and/or hardware such as anti-virus programs, firewalls and the like, extensive debugging of software prior to deployment in a production environment, etc. However, it may be hard or impossible to completely eliminate the occurrence of certain types of events at a given site, such as earthquakes, floods, fires, tornadoes, large-scale power outages, or terrorist attacks, any of which may lead to substantial application down time. In order to be able to respond to such situations effectively, enterprises often choose to implement disaster recovery techniques of various kinds.
A typical disaster recovery technique may include replicating the data of a production application at a site that is physically remote from the primary data center where the production application is running. Such remote replication may be motivated by the consideration that, in the event of a disaster at the primary data center, a second instance or copy of the production application may be started, and such a second instance may continue providing the desired services using the replica of the production application data. Updates to the production application data may often be replicated as soon as they occur at the primary data center; that is, the replica of the production data may be kept synchronized, or close to synchronized, with the production version of the data. This replication may be fairly expensive in itself, because the remote site must typically maintain at least as much storage space as is being used at the primary data center. For example, if the production application data requires X terabytes of storage, an additional X terabytes of storage may be needed to replicate the production application data at the remote site.
However, even replicating the entire production data set at the remote site may be insufficient to ensure effective and reliable disaster recovery for complex applications. Mission-critical applications may require non-trivial configuration or setup steps, and may rely upon numerous software packages with frequently changing versions. Simply maintaining a replica of the application data, without exercising the application in combination with the replicated data from time to time to ensure that it operates correctly, may not result in the desired level of confidence that application recovery would actually succeed in the event of a disaster. In order to exercise the application, disaster recovery rehearsals may be performed from time to time. To simulate disaster conditions as closely as possible, and to ensure that the production data replication continues while a disaster recovery rehearsal is performed, a second replica of the production application data may therefore have to be maintained at the remote site for use during disaster recovery rehearsals. Updates performed at the primary data center may continue to be replicated to the first replica, so that the first replica remains current with respect to the production application data. The second replica, used for disaster recovery rehearsal, may be a fixed or point-in-time copy of the first replica. Creating such a second replica may greatly increase the expense involved in disaster recovery: for example, if the production application data requires X terabytes of storage, X terabytes of storage may be needed for the first replica, and an additional X terabytes of storage may be needed for the replica used for disaster recovery rehearsals. Especially for applications with large data sets, storage costs associated with disaster recovery rehearsals may therefore become prohibitive.
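The storage arithmetic described above may be illustrated with a minimal sketch. It assumes that each replica is a full-size copy occupying exactly as much storage as the production data (the text says "at least as much"), and the function and field names are hypothetical, chosen only for this illustration.

```python
def dr_storage_tb(production_tb, with_rehearsal_copy):
    """Return storage (in TB) needed at each location.

    Assumption for illustration: every copy is full-size, i.e. exactly
    as large as the production data set.
    """
    if production_tb < 0:
        raise ValueError("production_tb must be non-negative")

    # First replica: kept synchronized with the production data.
    first_replica = production_tb
    # Second replica: point-in-time copy used only for rehearsals.
    rehearsal_copy = production_tb if with_rehearsal_copy else 0

    remote_total = first_replica + rehearsal_copy
    return {
        "primary": production_tb,
        "first_replica": first_replica,
        "rehearsal_copy": rehearsal_copy,
        "remote_total": remote_total,
        "overall_total": production_tb + remote_total,
    }


if __name__ == "__main__":
    # With X = 10 TB of production data and a rehearsal copy, the
    # remote site holds 2X = 20 TB and the overall total is 3X = 30 TB.
    print(dr_storage_tb(10, with_rehearsal_copy=True))
```

Under these assumptions, adding the rehearsal replica doubles the storage required at the remote site, from X to 2X terabytes, which is the cost increase the paragraph above identifies.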