The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for dynamically updating failover policies in a high availability cluster of computing devices so as to increase application availability.
A high availability cluster is a group of loosely coupled processors, computing devices, or the like, collectively referred to as “nodes,” that all work together to ensure a reliable service to clients, e.g., client computing devices, processors, or the like. Each node in the high availability cluster runs a clusterware product, such as High Availability Cluster MultiProcessing (HACMP), available from International Business Machines Corporation of Armonk, N.Y., which detects node, network, or communication adapter failures and ensures that applications are automatically restarted on a backup node. Up to 32 nodes may run HACMP, each either actively running an application or waiting to take over should another node fail. Data on file systems of the nodes can be shared between the nodes in the cluster. With HACMP, daemon applications are used to monitor the state of the nodes of the cluster and coordinate responses to events.
In the event of a failure of one of the nodes, the HACMP clusterware selects one of the surviving nodes of the cluster as a target for application recovery based on a predefined node failover order. Alternatively, HACMP may dynamically determine the target node for failover based on free processor resources, free memory, traffic considerations, or the like. This order of failover is referred to as the “failover policy.”
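The two selection approaches described above can be illustrated with a minimal sketch. It is not HACMP code and does not reflect any HACMP interface; the `Node` class, its fields, and the functions `select_static` and `select_dynamic` are hypothetical names introduced only for illustration, and ranking by free processor capacity first and free memory second is an assumed tie-breaking rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a cluster node as seen by the clusterware."""
    name: str
    alive: bool
    free_cpu: float    # assumed: fraction of processor capacity free, 0.0 to 1.0
    free_mem_mb: int   # assumed: free memory in megabytes


def select_static(nodes: List[Node], failover_order: List[str]) -> Optional[Node]:
    """Return the first surviving node in a predefined failover order."""
    by_name = {n.name: n for n in nodes}
    for name in failover_order:
        node = by_name.get(name)
        if node is not None and node.alive:
            return node
    return None  # no surviving node appears in the order


def select_dynamic(nodes: List[Node]) -> Optional[Node]:
    """Return the surviving node with the most free resources.

    Assumed ranking: free processor capacity first, free memory second.
    """
    survivors = [n for n in nodes if n.alive]
    if not survivors:
        return None
    return max(survivors, key=lambda n: (n.free_cpu, n.free_mem_mb))


if __name__ == "__main__":
    cluster = [
        Node("a", alive=False, free_cpu=0.9, free_mem_mb=4096),
        Node("b", alive=True, free_cpu=0.2, free_mem_mb=1024),
        Node("c", alive=True, free_cpu=0.7, free_mem_mb=2048),
    ]
    print(select_static(cluster, ["a", "b", "c"]).name)
    print(select_dynamic(cluster).name)
```

In this sketch the predefined order skips the failed node "a" and picks "b", whereas the dynamic approach picks "c" because it reports more free processor capacity.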
Many high availability clusters are implemented within a single site, i.e., the nodes of the cluster are geographically local to one another. However, some high availability cluster mechanisms extend the scope of high availability from within a lab or datacenter to sites separated by geographical distances. This ensures that even when an entire cluster in a site/location fails, applications will fail over to a node in another site located miles away.
Application failover within a site is fast and seamless because the clusters within a site have more reliable and redundant heartbeat networks. Furthermore, a shared disk setup enables applications to access the latest copy of data after the recovery is performed. However, with distributed high availability clusters that span multiple sites that are geographically remote from one another, heartbeat paths across the sites are limited to Internet Protocol (IP) networks, wide area networks, or the like, which are less reliable than the connections within a single site. Moreover, the slower rate of data mirroring between the sites in distributed high availability clusters denies the application access to the latest copy of data after the recovery is performed. Therefore, failover to nodes in a remote site is preferred only when all nodes in the local site are down.
To better handle such failover preferences, failover scopes are defined, e.g., a local scope and a global scope, each comprising a subset of identified nodes of a cluster. Applications may be associated with an ordered list of one or more failover scopes. When a failover occurs, each application automatically fails over to a surviving node listed within one of the failover scopes in its ordered list. Based on the ordering, failover is attempted sequentially until it is performed successfully. For example, failover may first be attempted to each of the nodes within the current failover scope (local site) before attempting failover to a node in a next failover scope (remote site). If no node within the first failover scope is able to accept the failover, e.g., none of the nodes has survived the failure, the resource group may be set to automatically fail over to a node listed in the next failover scope, and so on until no failure is detected.
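The scope-ordered failover described above can be sketched as follows. This is an illustrative sketch only, under stated assumptions: the function `fail_over`, the representation of a scope as a (name, node list) pair, and the `try_start` callback standing in for an attempt to restart the application on a node are all hypothetical and are not taken from any clusterware product.

```python
from typing import Callable, List, Optional, Set, Tuple

# Assumed representation: an ordered list of (scope name, ordered node names).
Scopes = List[Tuple[str, List[str]]]


def fail_over(
    scopes: Scopes,
    alive: Set[str],
    try_start: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Attempt failover scope by scope, in the order the scopes are listed.

    Every node in the current scope is tried before moving to the next scope.
    `try_start` is a hypothetical callback that returns True when the
    application is successfully restarted on the given node.
    Returns (scope name, node name) on success, or None if every scope
    is exhausted.
    """
    for scope_name, node_names in scopes:
        for node in node_names:
            if node in alive and try_start(node):
                return scope_name, node
    return None


if __name__ == "__main__":
    scopes = [
        ("local", ["n1", "n2"]),    # nodes at the local site
        ("global", ["r1", "r2"]),   # nodes at the remote site
    ]
    # The whole local site is down, so the remote scope is used.
    print(fail_over(scopes, alive={"r1", "r2"}, try_start=lambda n: True))
    # A local node survives and accepts the application, so it is preferred.
    print(fail_over(scopes, alive={"n2", "r1"}, try_start=lambda n: True))
```

Listing the local scope first captures the preference stated earlier: a remote node is chosen only after every node in the local scope is either down or unable to accept the application.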