Distributed computing systems include multiple services and/or applications that operate on different machines (computing devices) connected via a network. Some services or applications may rely on other services and/or applications to operate. However, machines, and the services and applications that operate on them, may occasionally become unavailable (e.g., when a machine loses power, an application crashes, or a network connection to the machine is lost).
In some distributed computing systems, to determine which machines, services and applications are operative at a given time, each machine in the distributed computing system can periodically transmit status inquiry messages, which are typically referred to as “are-you-alive messages” or “heartbeat messages.” A status inquiry message is a small control message that is generated and sent between machines or between services on machines. A queried machine that receives the status inquiry message generates a status response message and sends it back to the querying machine. Receipt of the status response message confirms to the querying machine that the queried machine and/or service is still active. Such status inquiry and status response messages may be continuously transmitted between machines within a distributed computing system at a specified frequency.
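The inquiry and response exchange described above can be illustrated with a minimal sketch of the bookkeeping a querying machine might perform. This is an illustration under stated assumptions, not a description of any particular system: the class name `HeartbeatMonitor`, the `interval` and `missed_limit` parameters, and the rule that a peer is presumed unavailable after several missed responses are all hypothetical. Timestamps are passed in explicitly so that the example does not depend on a real network or clock.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker of status responses seen by a querying machine."""

    interval: float        # assumed seconds between status inquiry messages
    missed_limit: int = 3  # assumed number of missed responses tolerated
    last_response: Dict[str, float] = field(default_factory=dict)

    def register(self, peer: str, now: float) -> None:
        # Start tracking a queried machine or service as of time `now`.
        self.last_response[peer] = now

    def on_status_response(self, peer: str, now: float) -> None:
        # Record that a status response message arrived from `peer`.
        self.last_response[peer] = now

    def is_alive(self, peer: str, now: float) -> bool:
        # A peer counts as active if it responded within the tolerated window.
        last = self.last_response.get(peer)
        if last is None:
            return False
        return (now - last) <= self.interval * self.missed_limit

    def unavailable_peers(self, now: float) -> List[str]:
        # Tracked peers whose last response is older than the tolerated window.
        return sorted(p for p in self.last_response if not self.is_alive(p, now))


monitor = HeartbeatMonitor(interval=1.0, missed_limit=3)
monitor.register("machine-a", now=0.0)
monitor.register("machine-b", now=0.0)
monitor.on_status_response("machine-a", now=4.0)  # machine-b stays silent
print(monitor.unavailable_peers(now=5.0))
```

In this sketch, only the time of the most recent status response is kept per peer, so the monitor can report that a machine or service stopped responding but not which container or thread within it failed.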
Conventional distributed computing systems can determine whether a machine or a service operating on a machine has failed. However, conventional distributed computing systems cannot detect failure at a fine-grained level, such as failure of a container that houses a service or failure of individual threads within a service. Therefore, conventional distributed computing systems offer only coarse-grained failure detection.