Data and the applications that use it underpin a multitude of activities in everyday life, for example through banking records, travel itineraries, shipping schedules, and medical records.
Despite advances in the performance and longevity of computer systems and network elements, events can and will occur that disrupt operation. For example, floods, fires, earthquakes, and storms are all natural disasters that in recent times have resulted in the loss of computer systems and their data. Of course, disasters are not restricted to events of nature. Anything that leads to loss of data or loss of access to the data may be considered a disaster. For example, computer viruses, power surges, human errors, software malfunctions, and component failures are also relatively common events that can and do plague computer systems.
When critical data and applications are involved, it is now common to employ a variety of different data protection and recovery methodologies. Generally speaking, “recovery” is viewed as the restoration of the system through the use of copied data. Recovery may employ reconstruction of the system from the copied data, a switch to a mirror system (e.g., a failover), and/or combinations of different processes utilizing the copied data.
A primary copy of data is protected by making one or more secondary copies of data on the same or separate hardware, such as other disk arrays, a tape library, optical storage, memory banks, or even paper printouts. As used herein, the term secondary copy is understood and appreciated to refer to all data (i.e., application binaries, application metadata and user data) utilized by a functioning system. To ensure survival of the primary copy, there may even be local copies (e.g., a redundant disk array on site) as well as remote copies (e.g., a tape warehoused in a secure vault). There may also be copies that are currently up-to-date and copies that represent specific points in past time.
When a system or application is lost, for whatever reason, more often than not there are associated financial penalties. Especially with business systems, such penalties may include penalties for downtime, penalties for loss of the most recent updates, penalties for vulnerability to subsequent failures during the recovery effort, and/or other such penalties.
Many factors influence how fast recovery is performed, how complete the recovery is, and how reliable the restored system will be. Available bandwidth, shipping time from offsite storage locations, and other factors may affect the speed of recovery. In the event of data corruption, an up-to-date copy may minimize the loss of recent data but may include the same virus or elements that caused the corruption, whereas an older copy made prior to the virus infection may provide a clean, but less up-to-date, system. Moreover, in many settings there is a choice between restoring from a local copy that might not be up-to-date and a remote copy that is very current but expensive to transfer.
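By way of illustration only, the trade-off between a fast but stale local copy and a current but slow remote copy can be sketched as a simple penalty comparison. The linear cost model, the names `CopyOption`, `total_penalty`, and `cheapest`, and all numeric values below are hypothetical assumptions introduced for this example; they are not taken from any particular system or method.

```python
from dataclasses import dataclass


@dataclass
class CopyOption:
    """A candidate secondary copy to recover from (hypothetical model)."""
    name: str
    restore_hours: float    # assumed time to bring the system back from this copy
    data_age_hours: float   # assumed staleness, i.e. hours of recent updates lost


def total_penalty(option, downtime_rate, data_loss_rate):
    """Assumed linear model: downtime cost plus recent-data-loss cost."""
    return (option.restore_hours * downtime_rate
            + option.data_age_hours * data_loss_rate)


def cheapest(options, downtime_rate, data_loss_rate):
    """Return the option with the lowest total penalty under the assumed rates."""
    return min(options,
               key=lambda o: total_penalty(o, downtime_rate, data_loss_rate))


# Illustrative values only.
local = CopyOption("local snapshot", restore_hours=2.0, data_age_hours=12.0)
remote = CopyOption("remote mirror", restore_hours=10.0, data_age_hours=0.5)

# When downtime is the dominant penalty, the stale local copy costs less.
downtime_heavy = cheapest([local, remote], downtime_rate=1000, data_loss_rate=500)

# When lost updates are the dominant penalty, the current remote copy costs less.
loss_heavy = cheapest([local, remote], downtime_rate=100, data_loss_rate=2000)
```

Under these assumed rates, the preferred copy changes with the relative weight of the downtime and data-loss penalties, which is why a single fixed rule for choosing a copy may not suit every disaster.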
When faced with a need to determine a recovery schedule in the face of a disaster, system administrators often use rules of thumb, such as serializing recovery of workloads and data based on criticality to the business at hand. So as to minimize the chaos likely to ensue after a disaster, many businesses institute disaster recovery plans or business continuity plans. These plans describe what steps to take after a disaster and who the responsible parties are for each step. Recovery plans centralize information from a variety of disparate sources about the structure of the information technology system and how to recover the system.
These recovery plans are designed ahead of time and thoroughly tested to ensure that the personnel are familiar enough with the plans and procedures to be able to execute them smoothly during a time of high stress and emotion. This process is a lengthy one. Updating a recovery plan because of a change in system hardware or location, changes in the business requirements, changes in the service providers, or other major or minor changes affecting the overall system is often not an easy task.
Teams generally develop a collection of plans, each tailored to respond to the most likely disasters. Developed in advance and for a general disaster, these plans, if implemented, may not specifically address the current needs and requirements of the business, let alone the nature of the disaster at issue. The traditional methods for revisions to these pre-existing plans may well require time that is not available, or at the very least time that increases penalty costs for the delay in achieving the recovery.
Various guides are commercially available that offer rules of thumb and high-level guidance for the design of dependable storage systems and various methods for recovering the dependable storage system in the event of a disaster. However, these guides typically focus mainly on the logistical, organizational, and human aspects of disaster recovery, such as the need for a documented disaster recovery plan, the information it should contain, the need to maintain a list of critical personnel, and other such information. More specifically, these guides do not cover recovery of a system in sufficient detail to be of practical use in an actual disaster recovery operation.
Hence, there is a need for a method to determine a recovery schedule for recovering data, such as from dependable storage, that overcomes one or more of the drawbacks identified above.