A Distributed Computing Environment (DCE) includes a group of machines (nodes) that interact over a network to solve a common problem. To ensure proper operation of the environment, each machine needs to know that its peer machines are alive and active. As such, each machine employs health detection logic to determine whether a peer machine is alive or not.
Existing detection technologies involve some form of pinging: every machine in the group sends a ping message (e.g., “Are you alive?”) to every other machine at periodic intervals (the Ping Interval) and expects an acknowledgement reply (e.g., “I am alive”) if the peer machine is alive and operating. If the requesting machine receives no reply to some number of consecutive ping messages, it declares the peer machine dead. Subsequently, the DCE reconfigures the network topology to exclude the dead machine and resumes work with the current group of active machines.
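The miss-count rule described above can be illustrated with a minimal sketch. The class name PeerMonitor, its fields, and the method record_ping_result are hypothetical names chosen for illustration and do not come from any particular DCE; the sketch models only the counting of consecutive missed replies for one peer, not the network traffic itself.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks consecutive missed pings for one peer (hypothetical helper)."""

    miss_count_limit: int          # consecutive misses needed to declare death
    consecutive_misses: int = 0
    declared_dead: bool = False

    def record_ping_result(self, reply_received: bool) -> bool:
        """Update state after one ping; return True once the peer is declared dead."""
        if self.declared_dead:
            return True
        if reply_received:
            # Any reply resets the count, so only *consecutive* misses matter.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= self.miss_count_limit:
                self.declared_dead = True
        return self.declared_dead
```

Because a single reply resets the counter, an isolated lost ping does not by itself cause the peer to be declared dead; only an unbroken run of misses reaching the limit does.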
If a machine dies in an unplanned manner, for example, as a result of a power reset, forced reboot, or kernel panic, the DCE could end up freezing for some period of time. This may occur when the dead machine holds locks that guard shared resources (e.g., storage devices, database records, memory, and so on). Because the machine is now dead, the locks it holds cannot be acquired by other machines, and thus the shared resources cannot be used until the situation is detected and resolved. This causes a brownout.
The delay in detecting the dead machine may occur as follows. Suppose pings are sent every 2 seconds and a machine is not declared dead until 15 consecutive pings go without a reply (e.g., Ping Interval=2 and Miss Count=15). The brownout upon machine death would then last anywhere from 30 to 32 seconds. The ping interval and miss count values cannot be set too aggressively, as this can result in false positives. An active machine may fail to reply to a ping for various reasons (e.g., the ping does not reach the machine, or the process that replies to ping messages has crashed and is being restarted). Thus, failure to respond to one or two ping messages is not by itself an accurate determination that a machine is dead.
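The 30 to 32 second range can be reproduced with a short calculation under one possible timing model, which is an assumption made here for illustration: a ping is counted as missed only once a full ping interval elapses without a reply, and the machine may die at any point within the interval following its last answered ping. The function name detection_delay_bounds is hypothetical.

```python
from typing import Tuple


def detection_delay_bounds(ping_interval: float, miss_count: int) -> Tuple[float, float]:
    """Return (shortest, longest) seconds from machine death to declaration of death.

    Assumed model: pings go out every `ping_interval` seconds, and each ping is
    counted as missed one interval after it was sent. The last miss is therefore
    registered (miss_count + 1) intervals after the last answered ping.
    """
    # Death just before the first unanswered ping is sent: shortest wait.
    shortest = miss_count * ping_interval
    # Death just after the last answered ping: one extra interval of waiting.
    longest = (miss_count + 1) * ping_interval
    return shortest, longest


print(detection_delay_bounds(2, 15))  # prints (30, 32)
```

Under this model, lowering either parameter shortens the brownout proportionally, which is why the trade-off against false positives matters.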
Being able to detect a machine's death as quickly and as reliably as possible may improve the availability of resources on a network.