As businesses and individuals rely ever more heavily on computing, uninterrupted computing service has become increasingly vital. Many organizations develop business continuity plans to ensure that critical business functions remain available in the face of machine malfunctions, power outages, natural disasters, and other disruptions that can interrupt normal operations.
Local disruptions may be caused, for example, by hardware or other failures in local servers, or by software or firmware issues that result in a system stoppage and/or reboot. Local solutions may include server clustering and virtualization techniques to facilitate failover. Local failover techniques using virtualization provide the ability to continue operating on a different machine or virtual machine if the original machine or virtual machine fails. Software can recognize that an operating system and/or application is no longer working, and another instance of the operating system and application(s) can be initiated in another machine or virtual machine to pick up where the previous one left off. For example, a hypervisor may be configured to determine that an operating system is no longer running, or application management software may determine that an application is no longer working and may in turn notify a hypervisor or operating system of that fact. High availability solutions may configure failover to occur, for example, from one machine to another at a common site or, as described below, from one site to another. Other failover configurations are also possible for other purposes such as testing, where failover may even be enabled from one virtual machine to another virtual machine within the same machine.
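By way of illustration only, the failure detection and failover behavior described above might be sketched as follows. The sketch assumes a heartbeat-style health check and a fixed threshold of missed heartbeats, neither of which is specified in this description; the names `Host`, `FailoverMonitor`, and `max_missed` are hypothetical and chosen solely for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """A machine or virtual machine that can run the workload (hypothetical)."""
    name: str
    healthy: bool = True


@dataclass
class FailoverMonitor:
    """Hypothetical monitor that moves a workload to another host once the
    active host has missed `max_missed` consecutive heartbeats."""
    hosts: list
    max_missed: int = 3  # assumed threshold, not taken from the description
    active: int = 0      # index of the host currently running the workload
    missed: int = 0      # consecutive heartbeats missed by the active host

    def tick(self) -> str:
        """Process one heartbeat interval and return the active host's name."""
        if self.hosts[self.active].healthy:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self._fail_over()
        return self.hosts[self.active].name

    def _fail_over(self) -> None:
        """Start the workload on the first healthy host found."""
        for index, host in enumerate(self.hosts):
            if host.healthy:
                self.active = index
                self.missed = 0
                return
        raise RuntimeError("no healthy host available for failover")


if __name__ == "__main__":
    hosts = [Host("server-a"), Host("server-b")]
    monitor = FailoverMonitor(hosts)
    hosts[0].healthy = False  # simulate a failure of the original machine
    print([monitor.tick() for _ in range(3)])
    # prints ['server-a', 'server-a', 'server-b']
```

Requiring several consecutive missed heartbeats before failing over is one possible design choice that avoids reacting to a single transient fault; an actual hypervisor or cluster manager may use different detection criteria.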
Disaster recovery relates to maintaining business continuity even in the event of large-scale disruptions. For example, certain failure scenarios impact more than an operating system, virtual machine, or physical machine. Malfunctions at a higher level can cause power failures or other problems that affect multiple machines, or an entire site such as a business's information technology (IT) or other computing center. Natural and other disasters can cause some, and often all, of a site's computing systems to go down. To provide disaster recovery, enterprises may replicate information from one or more computing systems at a first or “primary” site to one or more computing systems at a remote, secondary or “recovery” site. Replicating information may involve continuous, or at least repeated, updates of information from the primary site to the recovery site.
To provide high availability, either or both of the primary and recovery sites may utilize failover clustering as described above, where a virtual machine or other information may remain available even when its host server fails. Using disaster recovery techniques between sites in combination with clustering techniques between servers at either or both sites creates some complexities. For example, the use of failover clustering techniques at the recovery site may involve running another instance of a first recovery server's virtual machine in at least one other recovery server, such as when the first recovery server fails or otherwise becomes unavailable. When this first recovery server, and possibly some or all of the other recovery servers, are offline due to planned or unplanned events, the source or “primary” server would be unable to send any further replicas (e.g., replicated virtual machine base information and/or updates thereto) to the offline recovery server(s). The virtual machine replication would be suspended, but the virtual machine at the primary site would continue its workload, which would result in changes to the virtual machine. These changes to the virtual disk would continue to accumulate at the primary site, because the recovery server is unavailable to receive the replicas that would otherwise be sent more frequently. When the offline recovery server becomes available again, there would be spikes in resource utilization, as the amount of data to be sent could be very large. In cases of prolonged downtime of the recovery server, a complete replication may need to be started from scratch, resulting in loss of data and exposing the business to an extended unprotected period. This could further impact operations, as the initial replication may be significantly larger than “delta” replicas, and the virtual machine may require additional configuration in view of the initial replication.
Further, if disaster strikes at the primary site during the time the recovery server is down, business continuity would be lost. A significant amount of data would likely be lost as well, as the data on the recovery server would be substantially behind the primary server due to the interruption of the replication process.
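By way of illustration only, the accumulation of unsent changes described above, and the resulting transfer spike once the recovery server returns, might be modeled as follows. This is a simplified sketch under stated assumptions: changes are measured in arbitrary units per replication interval, all pending changes are sent in a single interval once the recovery server is reachable, and the function name `simulate_backlog` and its parameters are hypothetical.

```python
def simulate_backlog(changes_per_interval, recovery_online):
    """Model the replication backlog at the primary site (hypothetical helper).

    changes_per_interval: amount of changed data produced by the primary
        virtual machine in each interval (arbitrary units).
    recovery_online: one boolean per interval, True when the recovery
        server can receive replicas.

    Returns (sent, pending): the amount sent in each interval, and the
    amount still unreplicated after the last interval.
    """
    pending = 0
    sent = []
    for changed, online in zip(changes_per_interval, recovery_online):
        pending += changed        # the primary workload keeps changing data
        if online:
            sent.append(pending)  # the whole backlog is sent at once
            pending = 0
        else:
            sent.append(0)        # replication is suspended
    return sent, pending


if __name__ == "__main__":
    # Recovery server is offline during intervals 2 through 4.
    sent, pending = simulate_backlog(
        [10, 10, 10, 10, 10],
        [True, False, False, False, True],
    )
    print(sent, pending)
    # prints [10, 0, 0, 0, 40] 0
```

In this model the interval in which the recovery server returns carries four times the usual amount of data, corresponding to the spike in resource utilization. The `pending` value at any point represents the data that would be at risk if a disaster struck the primary site at that moment.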