As businesses and individuals rely ever more heavily on computing, uninterrupted computing service has become increasingly vital. Many organizations develop business continuity plans to ensure that critical business functions operate continuously and remain available in the face of machine malfunctions, power outages, natural disasters, and other disruptions that can sever normal business continuity.
Local disruptions may be caused, for example, by hardware or other failures in local servers, or by software or firmware issues that result in system stoppage and/or reboot. Local solutions may include server clustering and virtualization techniques to facilitate failover. Local failover techniques using virtualization provide the ability to continue operating on a different machine or virtual machine if the original machine or virtual machine fails. Software can recognize that an operating system and/or application is no longer working, and another instance of the operating system and application(s) can be initiated in another machine or virtual machine to pick up where the previous one left off. For example, a hypervisor may be configured to determine that an operating system is no longer running, or application management software may determine that an application is no longer working and may in turn notify a hypervisor or operating system that the application is no longer running. High availability solutions may configure failover to occur, for example, from one machine to another at a common site or, as described below, from one site to another.
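By way of illustration only, the local failover behavior described above can be sketched as follows. The sketch assumes a heartbeat-based liveness check, which is one common way for software to recognize that an instance is no longer working; the names `Instance`, `is_failed`, and `fail_over`, the timeout value, and the host names are hypothetical and are not drawn from any particular hypervisor or management product.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """A running operating system/application instance (hypothetical model)."""
    name: str
    host: str
    last_heartbeat: float  # time of the last heartbeat, in seconds


def is_failed(instance, now, timeout):
    """Assumed liveness rule: no heartbeat within `timeout` seconds means failed."""
    return now - instance.last_heartbeat > timeout


def fail_over(instance, standby_hosts, now, timeout=5.0):
    """Return the healthy instance, or a replacement started on another host."""
    if not is_failed(instance, now, timeout):
        return instance
    for host in standby_hosts:
        if host != instance.host:
            # Start another instance on a different machine to take over.
            return Instance(instance.name, host, last_heartbeat=now)
    raise RuntimeError("no standby host available")


primary = Instance("app", "host-a", last_heartbeat=0.0)
replacement = fail_over(primary, ["host-a", "host-b"], now=10.0)
print(replacement.host)
```

In this sketch the replacement instance is simply created on the first host other than the failed one; an actual high availability solution would also need to restore the workload's state so that the new instance can pick up where the previous one left off.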
Disaster recovery relates to maintaining business continuity on a larger scale. Certain failure scenarios impact more than an operating system, virtual machine, or physical machine. Malfunctions at a higher level can cause power failures or other problems that affect an entire site, such as a business's information technology (IT) or other computing center. Natural and other disasters can impact an enterprise and cause some, and often all, of a site's computing systems to go down. To provide disaster recovery, enterprises today may back up a running system onto tape or other physical media, and mail or otherwise deliver the media to another site. The backup copies can also be provided electronically to a remote location. Because a duplicate copy of the data is available there, applications can be resumed at the remote location when disaster strikes the source server site.
When using virtual machines, disaster recovery may involve tracking changes to virtual disks in order to replicate these changes at the remote site. Current approaches for tracking changes result in additional read and write overhead for data that has changed. These change tracking mechanisms consume storage input/output operations per second (IOPS) that would otherwise be available for server workloads. For example, differencing disks have primary purposes in areas such as test and development, and may not have been developed with change tracking and replication in mind. While differencing disks enable changes to be written to them, processing differencing disks for the purpose of replication is I/O-intensive. Where response times of the workloads are impacted, the overall value of a replication solution is adversely affected.
Limited network bandwidth can affect a replication solution and negatively impact the recovery point objective (RPO). If the network bandwidth is insufficient, it can take a long time to transfer large virtual disk files. Compounding the problem, a virtual disk block identified as changed may be larger than the actual quantity of data that changed, resulting in even higher quantities of data needing transfer. For example, a two megabyte (2 MB) block may be created to capture changes. Even if only a small change is made (e.g., 4 KB), the entire 2 MB block is used. These and other inefficiencies and shortcomings of the prior art create still more concern for the RPO.
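The effect of block granularity in the two megabyte example above can be made concrete with a short calculation. The following sketch is illustrative only: the function names are hypothetical, binary units are assumed (1 MB = 1,048,576 bytes), and the 10 Mbit/s link speed is an assumed figure that does not appear in the description above.

```python
BLOCK_SIZE = 2 * 1024 * 1024  # 2 MB change-tracking block, per the example
CHANGE_SIZE = 4 * 1024        # 4 KB of data actually modified


def blocks_touched(offset, length, block_size=BLOCK_SIZE):
    """Number of tracking blocks overlapped by a write at `offset`."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


def bytes_to_transfer(offset, length, block_size=BLOCK_SIZE):
    """Whole blocks are replicated, regardless of how little data changed."""
    return blocks_touched(offset, length, block_size) * block_size


def amplification(offset, length, block_size=BLOCK_SIZE):
    """Ratio of bytes transferred to bytes actually changed."""
    return bytes_to_transfer(offset, length, block_size) / length


def transfer_seconds(num_bytes, link_bits_per_second):
    """Idealized transfer time, ignoring protocol overhead and latency."""
    return num_bytes * 8 / link_bits_per_second


print(amplification(0, CHANGE_SIZE))                # 512.0
print(blocks_touched(BLOCK_SIZE - 2048, CHANGE_SIZE))  # 2
print(transfer_seconds(BLOCK_SIZE, 10_000_000))     # assumed 10 Mbit/s link
```

Under these assumptions, a 4 KB change causes 512 times as much data to be transferred as actually changed, and a 4 KB write that straddles a block boundary marks two blocks as changed, doubling that figure.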