1. Field of the Invention
This invention is related to the field of data protection and recovery in computer systems.
2. Description of the Related Art
Data protection for computer systems is an important part of ensuring that the information generated on a computer system and/or stored on the computer system is not lost due to the occurrence of a hardware failure, a software failure, user error, or other environmental event (e.g. power outage, natural disaster, intentionally-caused disaster, accidental disaster, etc.). Generally, events that the data protection scheme is designed to protect against are referred to herein as disaster events. The data protection scheme attempts to make redundant copies of the data and locate those copies such that the data is safe from the disaster events and such that the data can be restored to the computer system or to another computer system rapidly enough to be acceptable given the nature of the data, its importance to the creator of the data, etc.
There are numerous data protection products available in the marketplace, implementing various protection methods and having different options. For example, the protection methods may include clustering, backup, snapshot, and replication.
The cluster method is implemented across multiple computer systems, usually configured substantially identically. Cluster server software monitors the systems to detect failure, and fails over applications from a failing system to a different system so that applications keep executing even if a system failure occurs.
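The failover behavior described above can be sketched as follows. This is a minimal illustration only, not the behavior of any particular cluster server product: the function name `fail_over`, the application-to-system mapping, and the round-robin choice of target system are all assumptions made for the example.

```python
from typing import Dict, Set


def fail_over(placement: Dict[str, str], healthy: Set[str]) -> Dict[str, str]:
    """Return a new application-to-system mapping.

    Applications on healthy systems stay where they are. Applications on
    failed systems are moved to healthy systems in round-robin order
    (a hypothetical policy chosen only for this sketch).
    """
    if not healthy:
        raise RuntimeError("no healthy system available for failover")
    targets = sorted(healthy)
    new_placement: Dict[str, str] = {}
    moved = 0
    for app, system in sorted(placement.items()):
        if system in healthy:
            new_placement[app] = system
        else:
            # The system hosting this application has failed: move it.
            new_placement[app] = targets[moved % len(targets)]
            moved += 1
    return new_placement
```

For example, with applications `db` and `mail` on a failed system `n1` and `web` on a healthy system `n2`, calling `fail_over` with healthy systems `n2` and `n3` moves `db` and `mail` while leaving `web` in place.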
The backup method generally includes copying the data stored on non-volatile storage in a system (or a selected subset of the data), usually according to a backup schedule and often at times when utilization of the system is expected to be lower (e.g. at night, on weekends, etc.). Backup methods include both full backups, in which a copy of the entirety of the selected data is made, and incremental or differential backups, in which only data that has changed since a previous backup is copied (the most recent backup of any kind for an incremental backup, or the most recent full backup for a differential backup). In some cases, a backup includes in-memory state as well.
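The difference between a full backup and a backup that copies only changed data can be sketched as follows. This is an illustrative sketch, assuming that each file has a modification timestamp and that the time of the reference backup is known; the function name `select_for_backup` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional


def select_for_backup(mtimes: Dict[str, float],
                      since: Optional[float] = None) -> List[str]:
    """Return the paths to copy in this backup run.

    mtimes maps each path to its last-modification time.
    since=None requests a full backup (every selected path is copied);
    otherwise only paths modified after `since` are copied.
    """
    if since is None:
        return sorted(mtimes)
    return sorted(path for path, mtime in mtimes.items() if mtime > since)
```

Passing the time of the most recent backup of any kind as `since` yields an incremental selection, while passing the time of the most recent full backup yields a differential selection.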
Snapshot methods generally attempt to make a synchronized copy of the state of a computer system at a particular point in time, typically including the state of any processes executing at the time and the in-memory state of the computer system in addition to the data stored in non-volatile storage. In other cases, snapshot methods make a synchronized copy of the state of an application that may be executing on one or more computer systems. If the application is executing on more than one computer system, the snapshot image may be a logical image that comprises one or more physical images of storage objects from the various computer systems. Snapshots are often created with a higher frequency than backup, and often while the system is under higher utilization. The definition of the snapshot state varies from product to product. For example, the state may include a file system, a volume, a disk drive, all of the disk drives in a computer system, all of the disk drives and the in-memory state, etc. Additionally, some snapshot products support creating snapshots to remote computer systems rather than local media.
Replication methods generally replicate data objects from a computer system to another computer system over time. Data objects may be defined differently in different implementations. For example, a data object may be one of the following, in various implementations: a file, a directory structure of files, a volume, a disk block, etc. Replication methods may be incremental, in which the changes to the data object are replicated, or may replicate an entire data object when a change or changes have been made to the data object.
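As an illustration of incremental replication at disk-block granularity, the sketch below compares two versions of a data object and reports which fixed-size blocks differ, so that only those blocks would need to be sent to the other computer system. The function name `changed_blocks` and the default block size are assumptions for the example; actual replication products may track changes differently (e.g. by logging writes rather than comparing contents).

```python
from typing import List


def changed_blocks(old: bytes, new: bytes, block_size: int = 4) -> List[int]:
    """Return the indices of the fixed-size blocks that differ.

    Blocks present in only one version (because the object grew or
    shrank) are reported as changed.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    length = max(len(old), len(new))
    return [
        offset // block_size
        for offset in range(0, length, block_size)
        if old[offset:offset + block_size] != new[offset:offset + block_size]
    ]
```

A non-incremental method would instead transfer the entire object whenever the returned list is non-empty.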
Increasingly, organizations are adopting formal service level agreements (SLAs) with their information technology (IT) departments or third party IT providers. Disaster recovery planners (and/or business continuity planners) in the organization assign recovery requirements to various information assets based on the importance of the information assets to the continued functioning of the organization. Currently, the disaster recovery planners specify a recovery point objective (RPO) and a recovery time objective (RTO). The RPO indicates how close in time to a specified point in time it must be possible to recover the state of the corresponding information asset. For example, an RPO of 0 indicates that it must be possible to recover the state of the information asset at any point in time. On the other hand, an RPO of 30 minutes indicates that it must be possible to recover the state of the information asset to a state within 30 minutes of the specified point in time. The RTO specifies the maximum amount of time that the recovery operation may take.
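The RPO and RTO definitions above can be expressed as simple checks. The sketch below is illustrative only: it assumes that the available recovery points (e.g. backup or snapshot times) are known, that a usable recovery point must not be later than the specified point in time, and that the duration of a recovery operation can be estimated. The function names `meets_rpo` and `meets_rto` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def meets_rpo(recovery_points: Iterable[datetime],
              target: datetime,
              rpo: timedelta) -> bool:
    """True if some recovery point lies at or before `target`
    and no more than `rpo` earlier than it."""
    return any(timedelta(0) <= target - point <= rpo
               for point in recovery_points)


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """True if the estimated recovery duration does not exceed the RTO."""
    return estimated_recovery <= rto
```

For instance, with recovery points at 12:00 and 12:30 and a specified point in time of 12:45, an RPO of 30 minutes is met (the 12:30 point is 15 minutes earlier), whereas an RPO of 10 minutes is not.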
The RTO and RPO are objectives aligned to the organization's needs, but they may not actually be achievable given data protection technology, budgetary constraints, etc. Accordingly, corresponding recovery targets (recovery time target (RTT) and recovery point target (RPT)) are negotiated by the disaster recovery planners with the IT department/provider. The RTT and the RPT are formalized as the SLA. Typically, SLAs only cover the immediate recovery of the current state of an asset in response to a disaster event.
Once the SLAs are in place, the IT department/provider must then establish a protection scheme for the information assets that will meet the SLA. As mentioned above, there are myriad protection methods and protection products available which may provide pieces of an overall protection solution that would meet an SLA. However, the number of combinations and permutations of schemes is dauntingly large. Additionally, protection schemes and products are typically focused on the protection provided, not on the recovery metrics that may be achievable using the schemes/products to recover from a disaster event. Consideration must generally be given to the available resources and/or the resources to be consumed to implement the desired protection. Additionally, each protection method/product may have various restrictions (e.g. the supported operating system platforms and/or supported hardware platforms, the supported media, etc.). The data to be protected may have its own similar set of restrictions. Thus, it is difficult to determine a protection scheme that may meet a given SLA with an acceptable consumption of resources and conformance with restrictions. The process of determining and implementing a protection solution may be complex, time-consuming, and error-prone. In many cases, the selected protection solution may be insufficient or over-provisioned due to the inability to properly weigh the various factors in implementing a protection solution.
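To make the factors in this selection problem concrete, the sketch below filters candidate protection schemes by the SLA targets and by a platform restriction, then picks the one that consumes the fewest resources. It is a deliberately simplified illustration, not a method described in this document: the `Scheme` fields, the single numeric cost, and the function `choose_scheme` are all assumptions, and real schemes combine multiple methods and restrictions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Scheme:
    """Hypothetical summary of one candidate protection scheme."""
    name: str
    rpt_minutes: float          # recovery point the scheme can achieve
    rtt_minutes: float          # recovery time the scheme can achieve
    cost: float                 # abstract measure of resources consumed
    platforms: FrozenSet[str]   # platforms the scheme supports


def choose_scheme(candidates: Iterable[Scheme],
                  sla_rpt_minutes: float,
                  sla_rtt_minutes: float,
                  platform: str) -> Optional[Scheme]:
    """Return the lowest-cost scheme that meets the SLA on `platform`,
    or None if no candidate qualifies."""
    feasible = [
        s for s in candidates
        if s.rpt_minutes <= sla_rpt_minutes
        and s.rtt_minutes <= sla_rtt_minutes
        and platform in s.platforms
    ]
    return min(feasible, key=lambda s: s.cost, default=None)
```

Even in this reduced form, a scheme that is cheapest on one platform may be unavailable on another, so the outcome depends on weighing the SLA targets, resource consumption, and restrictions together.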