Business continuity and disaster recovery refer to the capability to restore normal (or near-normal) business operations, from a critical business application perspective, after a disaster interrupts those operations. Business continuity and disaster recovery may require the ability to bring up mission-critical applications, together with the data on which these applications depend, and to make them available to users as quickly as business requirements dictate. Where downtime is costly, the recovery process may involve automation. For mission-critical applications that demand minimal downtime, the disaster recovery process may need to be highly automated and resilient. Clustering technologies may provide such highly automated and resilient disaster recovery.
Clusters may include multiple systems connected in various combinations to shared storage devices. Cluster server software may monitor and control applications running in the cluster and may restart applications in response to a variety of hardware or software faults. For failover service groups running in traditional clusters, the time to fail over includes the time needed to offline all the resources of the service group on the failed node plus the time needed to online all the resources of the service group on the failover node. Unfortunately, waiting until a service group is completely offline before beginning the process of bringing the service group back online may be inefficient and may result in failure to comply with a service level agreement. What is needed, therefore, is a more efficient mechanism for failing over service groups in cluster environments.
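The timing relationship described above for traditional clusters may be illustrated with a minimal sketch. The resource names, the per-resource durations, the service level limit, and the function name are all hypothetical and chosen only for illustration. The sketch also assumes that resources are taken offline and brought online one at a time, which may not hold for a cluster that processes independent resources in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One resource of a service group (hypothetical model)."""
    name: str
    offline_seconds: float  # assumed time to take the resource offline
    online_seconds: float   # assumed time to bring the resource online


def traditional_failover_seconds(resources):
    """Failover time when onlining starts only after every resource is offline.

    Assumes strictly sequential processing, so the total is the sum of all
    offline times plus the sum of all online times.
    """
    offline_total = sum(r.offline_seconds for r in resources)
    online_total = sum(r.online_seconds for r in resources)
    return offline_total + online_total


# Illustrative values only; they are not measurements from any real cluster.
service_group = [
    Resource("ip_address", offline_seconds=2.0, online_seconds=3.0),
    Resource("filesystem_mount", offline_seconds=10.0, online_seconds=15.0),
    Resource("database", offline_seconds=30.0, online_seconds=45.0),
]

SLA_LIMIT_SECONDS = 90.0  # hypothetical service level agreement limit

total = traditional_failover_seconds(service_group)
meets_sla = total <= SLA_LIMIT_SECONDS
print(f"failover time: {total} s, within SLA: {meets_sla}")
```

With these assumed values, the offline phase alone accounts for 42 of the 105 seconds, and the service group cannot begin coming online on the failover node until that phase has finished, which is how a fully serialized failover may exceed a service level limit.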