In some computer systems, it is important to maximize the availability of critical services and applications. Generally, this is achieved by using a fault tolerant system or by using high availability (“HA”) software, which is implemented on a cluster of multiple nodes. Both types of systems are described briefly in “A High-Availability Cluster for Linux,” Phil Lewis (May 2, 2000).
A fault tolerant computer system includes duplicate hardware and software. For example, a fault tolerant server may have redundant power supplies, storage devices, fans, network interface cards, and so on. When one or more of these components fails, the fault is detected, and a redundant component takes over to correct the problem. In many cases, fault tolerant systems are able to provide failure recovery that is nearly seamless (i.e., imperceptible to system users). However, because these systems rely on duplicate hardware, they tend to be expensive. In addition, these systems typically are proprietary, and are tightly coupled to the operating system, whatever that system may be.
HA software also provides fault detection and correction procedures. In contrast to fault tolerant systems, HA software is implemented on two or more nodes, which are arranged in a “cluster” and communicate over a link (e.g., a network). Typically, one node operates as the “master” for a particular application, where the master is responsible for executing the application. One or more other nodes within the cluster are “slaves” for that application, where each slave is available to take over the application from a failed master, if necessary.
Generally, an HA software implementation is loosely coupled to the operating system, and therefore may be more portable to different types of systems and nodes than a fault tolerant system would be. However, one disadvantage to an HA system is that failure recovery typically takes much longer than it would with a fault tolerant system. Therefore, significant system downtimes may be perceived by system users.
One reason for the relatively slow failure recovery times is the way that failures are detected and responded to. In some systems, each slave periodically “pings” other nodes to determine whether they are reachable. If a slave is unable to reach a master node before a certain timeout period expires, the slave declares a failure and attempts to take over as master. Because this process relies on timeout periods and network communications, it provides slower recovery than is possible using fault tolerant systems. Besides being somewhat slower to recover, these systems have another disadvantage: it is not possible to detect a failure of a single application within a master node. Instead, the entire node must fail in order for a failure to be detected.
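The ping-based detection described above can be sketched as follows. This is a minimal illustration rather than any particular product's implementation: the function name `node_failed`, the `ping` callable, and the `timeout_s` and `interval_s` parameters are all hypothetical, and the clock and sleep functions are injected only so that the logic can be exercised without real delays.

```python
import time
from typing import Callable


def node_failed(
    ping: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the node never answers a ping within timeout_s.

    `ping` is a hypothetical callable that returns True when the remote
    node is reachable. It reports on the node as a whole, so a single
    failed application on a reachable node goes unnoticed.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if ping():
            return False  # node answered; no failure declared
        sleep(interval_s)  # wait before the next attempt
    return True  # timeout expired without a reply; declare a failure
```

Note that the slave cannot declare a failure until the full timeout has elapsed, which is one source of the recovery delay discussed above.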
Alternatively, a node within an HA system may periodically send out a “heartbeat” message for an application that it is executing as a master. The heartbeat message indicates that the master node continues to be able to execute the application. If a slave node does not receive a heartbeat message for a particular application within a certain timeout period, then the slave assumes that the master has failed, and an election process is initiated to determine which slave should take over as master.
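A slave's side of this heartbeat scheme might look like the sketch below. The class `HeartbeatMonitor` and its method names are assumptions made for illustration, as is the choice of a single timeout shared by all applications; the election that follows a detected failure is not modeled.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat a slave has seen for each application."""

    def __init__(
        self,
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def watch(self, app: str) -> None:
        # Start the timeout window for an application this slave backs up.
        self.last_seen[app] = self.clock()

    def record_heartbeat(self, app: str) -> None:
        # Called when a heartbeat message for `app` arrives from its master.
        self.last_seen[app] = self.clock()

    def expired(self) -> List[str]:
        # Applications whose master has been silent for longer than the
        # timeout; each would be a candidate for an election.
        now = self.clock()
        return sorted(
            app
            for app, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Because the timer is kept per application rather than per node, this approach can notice that one application has stopped while the node itself remains reachable.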
The “Time Synchronization Protocol” (TSP) is an example of such an HA protocol, which is used by the clock synchronization programs timed and TEMPO. TSP is described in detail in “The Berkeley UNIX Time Synchronization Protocol,” Gusella et al. (1986). TSP supports messages for the election that occurs among slaves when, for any reason, the master disappears, as is described in detail in “An Election Algorithm for a Distributed Clock Synchronization Program,” Gusella et al. (December 1985). The election process chooses a new master from among the available slaves when the original master ceases to send out heartbeat messages.
One major disadvantage to TSP is that synchronization messages are sent out at a very slow rate (e.g., on the order of once every several minutes). Therefore, if a master does fail, it may take several minutes for a slave to respond and conduct an election. This characteristic of TSP can result in an unacceptably slow assumption of master tasks.
TSP functions well in the context of supporting messages and elections that occur as part of a clock synchronization program. However, its portability to other types of applications is limited, and it is not well adapted to inclusion in modern systems for several reasons.
In modern networked computer systems, each machine may be capable of simultaneously running multiple applications, each of which is executed using a master-slave configuration. In such systems, it may be necessary to exchange status information between machines for each task and/or application. The primary limitation of TSP's application to modern systems is that TSP is capable of supporting message transfer for only a single application (e.g., a time daemon) per machine. Accordingly, TSP is not adapted to exchange status information for multiple tasks or applications between machines. In addition, when a task or application has updateable configuration information associated with it, TSP has no facility to monitor or support the transfer of new configuration information between nodes. Therefore, TSP is not an acceptable protocol for providing status and configuration messaging capabilities for modern networked computer systems.
What is needed is a protocol and method that can provide efficient failure recovery and configuration information exchanges between nodes of a networked computer system. Further needed is a protocol and method that is scalable and efficient, so that heartbeat and configuration messaging between nodes can be performed for potentially many tasks and applications without burdening the network (or networks) with excessive network traffic.