Data centers generally rely on high-availability computer clusters, e.g., IBM High Availability Cluster Multi-Processing (HACMP) systems, to provide continuous computing services. High-availability computer clusters are networks of computers that are specially configured for the purpose of providing uninterrupted availability of computing resources in case of failures in one or more resources. These resources may include servers, networks interconnecting the servers, storage subsystems, storage area networks (SANs) and other hardware and software components supporting the operation of the cluster storage network switches, and server application programs. Computer clusters typically employ redundant components, e.g., server (nodes), networks, switches and data storage systems, that are set up to automatically switch over to a functioning component when a component of a similar type fails. The automatic switching to operational resources to provide uninterrupted services is referred to as a failover. The failover allows the cluster to continue providing the same computing services to hardware and software components that were receiving services from the failed component before the failure. Normally, when a computer hosting a particular application fails, the application will be unavailable to the users until the failed computer is serviced. A high-availability cluster avoids service interruption by monitoring for faults in the computers and networks, and immediately restarting the application on another computer in the cluster when it detects a fault without requiring administrative intervention.
In order to provide failover support, a data center typically employs a cluster management software component that closely manages resources and groups of resources in the cluster. The cluster management software component configures resources in a cluster before their operation and monitors their status and performance characteristics during the operation. In the event of a failure of a resource, the services provided by the failed resource are migrated to another cluster resource to continue supporting components and applications receiving the services. Typically, resources and resource groups are transferred to a surviving computing node without regard for whether or not that action will resolve the issue that caused the failover in the first place.
In geographically separated clusters, an efficient failover process becomes more significant when both local and remote components are involved in a failover and affect the effectiveness of the failover. Most geo-cluster configurations will first try to failover resource groups locally. Only in the case of local failure of a resource group that the resource group is geographically migrated to the remote site. Since each resource group failover causes the resource group to be stopped and restarted on the node to which it is migrated, application downtime results for each attempt at resource group failover. Obviously, minimizing the number of failover attempts reduces the total application outage.