A cluster is a set of interconnected computer system servers arranged as nodes that provide access to resources such as server application programs. One reason to have a server cluster is that multiple linked computer systems significantly improve computing availability and reliability, and also provide more processing power, speed and other resources by distributing the load.
With respect to availability and reliability in a cluster, if one node or a hosted application fails, its resources failover to other surviving nodes. In general, failover means that the other nodes host applications that correspond to those previously provided by the now-failed node. Types of failures include a computer system crash, a break in a communications link between nodes, intentional shutdowns for maintenance or the like, inadvertent shutdowns such as accidentally unplugging power or a communications cable, and so on.
To handle failures in a controlled way so that failed applications properly restart on other nodes, one attempt was made to have groups of resources failover to a preferred node based on a list of preferred nodes. Each such group, referred to as a resource group, is a collection of one or more resources, such as application programs and related resources such as network names, IP addresses and the like, that is managed as a single unit with respect to failover. However, this tended to overwhelm certain nodes because many resource groups had the same default configuration for their preferred nodes. To avoid this problem, present clustering technology provides that when more than one surviving node is available and no preferred owners list is configured for a resource group, an algorithm based on random numbers chooses the destination node for that resource group (at least among nodes that are capable of hosting the group), so that no one node is overwhelmed by taking on too many resource groups of the failed node or nodes.
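The two selection policies described above (a preferred owners list, with a random choice when no list is configured) can be sketched as follows. This is only an illustrative sketch: the function name `choose_failover_node`, its parameters, and the node names are hypothetical and are not taken from any actual clustering product or API.

```python
import random
from typing import Optional, Sequence


def choose_failover_node(
    preferred_owners: Sequence[str],
    surviving_nodes: Sequence[str],
    capable_nodes: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick a destination node for one resource group (hypothetical helper).

    preferred_owners: ordered preference list; may be empty if none is configured.
    surviving_nodes:  nodes still running after the failure.
    capable_nodes:    nodes able to host this resource group.
    """
    rng = rng or random.Random()

    # Only surviving nodes that are capable of hosting the group are candidates.
    candidates = [n for n in surviving_nodes if n in capable_nodes]
    if not candidates:
        return None  # nowhere to fail over to

    # Policy 1: honor the preferred owners list, in order, when one exists.
    for node in preferred_owners:
        if node in candidates:
            return node

    # Policy 2: no usable preference, so spread groups by random choice.
    return rng.choice(candidates)


if __name__ == "__main__":
    survivors = ["node2", "node3", "node4"]
    # Many groups sharing the same default list all land on the same node.
    same_default = [
        choose_failover_node(["node2", "node3"], survivors, survivors)
        for _ in range(5)
    ]
    print(same_default)
    # With no list configured, the destinations are spread at random.
    spread = [choose_failover_node([], survivors, survivors) for _ in range(5)]
    print(spread)
```

In this sketch, every resource group that shares the same default preferred list is sent to the same first surviving node, which mirrors the overload problem noted above, while the random branch spreads groups without regard to where the candidate nodes are located.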
With respect to computing power/speed, physically close computing systems in a cluster are typically linked by very high bandwidth network connections. However, not all cluster nodes are physically close, as enterprises (particularly large enterprises) often separate two or more subsets of such closely-interconnected clustered nodes from one another by relatively large geographic distances. A purpose of this is disaster protection, so as to still have some number of nodes operating in the event of a hurricane, fire, earthquake or the like that can cause an entire physically close subset of interconnected nodes to fail as a whole, whether the reason for the failure is the actual failure of the nodes, or a break in the transmission medium between that subset of nodes and other distant nodes.
A problem with disaster protection by geographic separation is that the communications bandwidth between one subset of closely-interconnected nodes and another subset is far lower than the communications bandwidth within a subset. As a result, some cluster administrators do not necessarily want resource groups to automatically failover from one closely-interconnected subset to another (unless an entire subset fails), because the time and expense of failing over resources from even one node is significant, given the low-bandwidth connection. Instead, cluster administrators often would prefer to have the resource groups failover only to closely-interconnected nodes. In the event that an entire subset fails, some administrators would prefer to assess the cause and fix the problem (e.g., an unplugged cable) if possible, and manually failover the resource groups only if necessary, which may require some reconfiguration of the other subset to accept the failed-over resource groups. Still other administrators want failover to be automatic, at least to an extent, if an entire subset fails. Further, when dealing with consolidation clusters, which are clusters hosting multiple applications, many administrators would like to constrain the set of nodes on which an application composed of various components may be hosted.
However, with the above-described random failover mechanism that was heretofore in place, as well as other prior mechanisms, administrators are not able to configure their clusters for failover in the way they desire. In fact, the random mechanism makes no distinction between physically close and physically distant nodes when failing over resource groups. What is needed is a flexible way for cluster administrators to manage the automatic actions that a cluster will take on failures.