High Availability (HA) Clusters are a class of distributed systems that provide high availability for applications. High availability is achieved by using hardware redundancy, so that the system can recover when any single component fails. HA Clusters generally include two or more computer systems called “nodes.” For this reason, HA Clusters are generally referred to as Node Availability Management Systems. Node Availability Management Systems manage both the nodes and the applications running on the nodes. Each node runs a local operating system kernel. The cluster software, which may be considered an extension of the operating system, starts applications on one or more nodes of the cluster and monitors various aspects of the software and hardware stack. The component of the cluster software that handles application availability is generally referred to as an Availability Manager (AM).
In the event of a hardware or software failure, the AM automatically restarts applications on the same node or “fails over” the applications to other nodes in order to keep the applications available. In addition, the AM is able to bring applications online or offline in response to administrative requests. The AM can be thought of as reacting to events. These events generally include administrative commands and error notifications from other parts of the system (e.g., application death, node death, application non-responsiveness, etc.). HA Clusters typically have a single node, referred to as the president node, that makes all decisions regarding the actions to execute following an event. The president node dictates orders to the remaining nodes, referred to as worker or slave nodes, which carry out the actions.
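The event-driven behavior described above can be illustrated with a minimal sketch. The names used here (`Cluster`, `handle_event`, the event kinds, and the rule of failing over to the first surviving node in sorted order) are hypothetical and chosen only for illustration; they are not drawn from any particular cluster product.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster view kept by the Availability Manager."""
    nodes: set                                     # names of nodes currently alive
    placement: dict = field(default_factory=dict)  # application -> hosting node


def handle_event(cluster, kind, target):
    """Return the (action, application, node) decisions made for one event."""
    actions = []
    if kind == "app_death":
        # Application failure: restart on the node where it was running.
        actions.append(("restart", target, cluster.placement[target]))
    elif kind == "node_death":
        # Node failure: fail over every application the dead node hosted.
        cluster.nodes.discard(target)
        survivors = sorted(cluster.nodes)
        for app, node in sorted(cluster.placement.items()):
            if node == target and survivors:
                new_node = survivors[0]  # assumed placement policy
                cluster.placement[app] = new_node
                actions.append(("failover", app, new_node))
    elif kind == "admin_offline":
        # Administrative request: take the application offline.
        node = cluster.placement.pop(target)
        actions.append(("stop", target, node))
    return actions
```

In this sketch, an application death leads to a restart on the same node, while a node death moves each affected application to a surviving node, mirroring the two recovery paths described above.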
Numerous execution models are available for carrying out decisions made by the president node. A common model is the standard procedural approach, in which each decision is processed by a separate code path in the president node. When the president node needs to dictate orders, it makes decision-specific inter-node communication calls to the worker nodes to process the event.
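A minimal sketch of the procedural approach follows. Each decision has its own method on the president node, and each method issues its own decision-specific calls. The class names, message names (`restart_app`, `mark_node_dead`, `start_app`), and the recording transport are assumptions made for illustration; a real system would send these messages over the cluster interconnect.

```python
class RecordingTransport:
    """Stand-in for inter-node communication; records calls instead of sending."""

    def __init__(self):
        self.sent = []

    def call(self, dest, message, **args):
        self.sent.append((dest, message, args))


class President:
    """Hypothetical president node with one code path per decision."""

    def __init__(self, transport, workers):
        self.transport = transport
        self.workers = list(workers)

    def on_app_death(self, app, node):
        # Decision-specific path: restart the application where it ran.
        self.transport.call(node, "restart_app", app=app)

    def on_node_death(self, dead, apps, target):
        # Decision-specific path: inform the survivors, then start the
        # dead node's applications on the chosen target node.
        for worker in self.workers:
            if worker != dead:
                self.transport.call(worker, "mark_node_dead", dead_node=dead)
        for app in apps:
            self.transport.call(target, "start_app", app=app)
```

Because every decision carries its own sequence of calls, adding a new kind of decision in this model means adding another code path of the same shape.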
Any HA Cluster that uses a president node must account for the possible failure or death of the president node. A common approach to this possibility involves “checkpointing” or “state propagation.” Using this approach, state information is saved to other nodes or to persistent storage so that a new president node can take over operations following the death or malfunction of the previous president node.
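Checkpointing to persistent storage can be sketched as follows. The function names, the JSON encoding, and the layout of the state dictionary are assumptions made for illustration. The write goes to a temporary file that is then renamed over the checkpoint, so a president node that dies partway through a write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write the president node's state to persistent storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())     # force the data to disk before renaming
    os.replace(tmp_path, path)   # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Read the last saved state, as a newly chosen president node would."""
    with open(path) as f:
        return json.load(f)
```

A new president node would call `load_checkpoint` on taking over and resume from the returned state. Propagating state to other nodes instead of to storage would follow the same save-then-load pattern, with the write replaced by inter-node messages.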