The determination of adapter and node liveness lies at the heart of any highly available distributed data processing network whose nodes can be divided into clusters, each typically dedicated to a particular application. To provide high availability services, a cluster system should be able to determine which nodes, networks, and network adapters in the system are working. The failure of any such component should be detected early and the resulting information passed on to a higher-level software subsystem; where possible, recovery operations should then be initiated by a cluster recovery manager and application-level software.
Determination of node, network, and network adapter liveness is often made through the use of daemon processes running in each node of the distributed system. Daemons run distributed protocols and exchange liveness messages that are forced through the different network paths in the system. If no such liveness messages are received within a predetermined interval then the sending node or network adapter is assumed not to be working (“dead”) by the others.
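The timeout rule described above can be illustrated with a minimal sketch. It is not taken from any particular cluster product: the class name `LivenessMonitor`, the peer names, and the 3-second timeout are all assumptions chosen for illustration, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LivenessMonitor:
    """Tracks when a liveness message last arrived from each peer (hypothetical helper)."""

    timeout: float  # seconds of silence after which a peer is assumed dead (assumed value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_message(self, peer: str, now: float) -> None:
        # Called whenever a liveness message from `peer` is received.
        self.last_seen[peer] = now

    def dead_peers(self, now: float) -> List[str]:
        # A peer is assumed dead if nothing was heard from it within `timeout`.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


monitor = LivenessMonitor(timeout=3.0)
monitor.record_message("node-a", now=0.0)
monitor.record_message("node-b", now=0.0)
monitor.record_message("node-a", now=2.5)  # node-a keeps sending
print(monitor.dead_peers(now=4.0))         # node-b has been silent for 4.0 s
```

Note that the monitor cannot distinguish a peer that has failed from one whose daemon was merely delayed; both appear as silence.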
This method of liveness determination imposes real-time constraints on the corresponding daemons: if a daemon is delayed for any reason, the hosting node may be falsely detected as dead, a “false down” event. False down events result in unnecessary, and often costly, recovery procedures which can disrupt the operations of the cluster.
Making daemons obey these real-time constraints is often far from trivial, however, since the underlying operating system is seldom real-time. Only real-time operating systems can guarantee finite response times under any circumstances.
If the load on one of the nodes of the system is such that the demand for physical memory greatly exceeds the amount of memory present, heavy paging starts to occur, which occasionally leads to processes making little progress. For the liveness determination daemon, this paging can prevent it from sending liveness messages in a timely fashion.
Although some operating systems do provide primitives that allow processes to keep their pages from being “stolen” by other applications, in practice this solution is not perfect: either the primitives do not work on the entire address space (for example, they may not work with shared libraries) or the operating system itself is often pageable.
Besides memory starvation, other causes are known to prevent processes from making adequate progress: a high interrupt rate, which blocks any process in the system from running, and the presence of high-priority processes that monopolize the CPU.
Different approaches could be used to prevent these “false down” events caused by process blockage:
1) Increasing the threshold of the number of missing incoming liveness messages before the remote entity is declared “down”;
2) Making the daemon as real time as possible, employing real-time scheduling priority and operating system primitives to prevent paging; and
3) Incorporating the code responsible for sending the liveness messages into the kernel.
The first method has the drawback that real failures take longer to be detected, which may lengthen the periods during which the end-user service offered by the cluster is unavailable.
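The trade-off in the first method can be made concrete with a simplified model. The sketch below assumes liveness messages are sent at a fixed interval and that a peer is declared down once a given number of consecutive messages are missed; network delay and timing phase are ignored. The function names and the numbers used are illustrative assumptions only.

```python
def detection_time(interval: float, missed_threshold: int) -> float:
    """Approximate silence (seconds) needed before a peer is declared down."""
    return interval * missed_threshold


def causes_false_down(stall: float, interval: float, missed_threshold: int) -> bool:
    """True if a daemon stall of `stall` seconds is long enough to be mistaken for a failure."""
    return stall >= detection_time(interval, missed_threshold)


# Assumed scenario: messages every 1.0 s, daemon blocked by paging for 5.0 s.
for threshold in (2, 4, 8):
    print(
        threshold,
        detection_time(1.0, threshold),
        causes_false_down(5.0, 1.0, threshold),
    )
```

Under these assumptions, raising the threshold from 2 to 8 avoids the false down event for a 5-second stall, but a real failure then also goes undetected for roughly 8 seconds instead of 2.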
The second method is only partially effective. Not only does it require the use of multiple real-time primitives offered by the operating system, but it also demands careful design to avoid known causes of blocking, such as communication and I/O. Even so, the operating system may be unable to guarantee that the process will always make progress.
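As one illustration of the kind of primitive the second method relies on, the sketch below asks a POSIX-style operating system for a real-time (SCHED_FIFO) scheduling policy through Python's `os` module. It is a minimal sketch, not a complete hardening recipe: the function name `request_realtime_priority` is hypothetical, memory locking is not shown, and the request usually fails without sufficient privileges, in which case the function simply reports that.

```python
import os


def request_realtime_priority() -> bool:
    """Try to switch the calling process to SCHED_FIFO; return True on success."""
    if not (hasattr(os, "sched_setscheduler") and hasattr(os, "SCHED_FIFO")):
        return False  # platform does not expose POSIX scheduling calls
    try:
        # Use the lowest real-time priority; any SCHED_FIFO priority
        # already preempts ordinary time-shared processes.
        priority = os.sched_get_priority_min(os.SCHED_FIFO)
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        return False  # typically a permission error for unprivileged processes


if __name__ == "__main__":
    print("real-time scheduling granted:", request_realtime_priority())
```

Even when the request succeeds, the daemon can still block on paging or I/O, which is why this approach is described as only partially effective.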
The third method may produce good results, but at a sometimes prohibitive development cost, since code needs to be introduced into the operating system kernel, which severely impairs portability and serviceability. A subtle problem with this approach is that it can only provide “kernel liveness,” being ill-suited to detect situations where the kernel is able to run but user programs are not. In such situations, the node becomes useless and declaring it dead is likely to be the correct decision.