Distributed computing systems include multiple services and/or applications that operate on different machines (computing devices) connected via a network. Some services or applications may rely on other services and/or applications to operate. However, machines, and the services and applications that operate on them, may occasionally become unavailable (e.g., when a machine loses power, an application crashes, a network connection to the machine is lost, etc.).
In some distributed computing systems, to determine which machines, services and applications are operative at a given time, each machine in the distributed computing system can periodically transmit status inquiry messages, which are typically referred to as “are-you-alive messages” or “heartbeat messages.” The status inquiry message is a small control message that is generated and sent between machines or services on machines (services may fail independently of machines, so simply detecting that the machine is alive may not be sufficient). A queried machine that receives the status inquiry message generates a status response message. The status response message is then sent back to the original querying machine that sent the status inquiry message. The querying machine can then receive the status response message, which provides confirmation that the queried machine and/or service is still active. Such status inquiry and status response messages may be continuously transmitted between machines within a distributed computing system at a specified frequency.
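The inquiry and response exchange described above can be sketched in a few lines. This is a minimal in-memory illustration, not an implementation taken from any particular system: the names `StatusInquiry`, `StatusResponse`, `Machine`, `Querier`, and `poll_once` are hypothetical, and a direct method call stands in for the network, so a lost connection is simulated by returning no response.

```python
import time
from dataclasses import dataclass


@dataclass
class StatusInquiry:
    """Small control message sent by the querying machine (hypothetical layout)."""
    sender: str
    target: str
    sequence: int


@dataclass
class StatusResponse:
    """Reply generated by the queried machine (hypothetical layout)."""
    sender: str
    sequence: int


class Machine:
    """A queried machine or service; `alive` is toggled to simulate a failure."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle_inquiry(self, inquiry):
        # An unavailable machine produces no response at all.
        if not self.alive:
            return None
        return StatusResponse(sender=self.name, sequence=inquiry.sequence)


class Querier:
    """A querying machine that tracks when each peer last answered."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = peers
        self.sequence = 0
        self.last_seen = {}

    def poll_once(self, now=None):
        """Send one inquiry to every peer and return the names that answered."""
        now = time.monotonic() if now is None else now
        active = set()
        for peer in self.peers:
            self.sequence += 1
            inquiry = StatusInquiry(self.name, peer.name, self.sequence)
            response = peer.handle_inquiry(inquiry)
            # Only a response that echoes the inquiry's sequence number counts.
            if response is not None and response.sequence == inquiry.sequence:
                self.last_seen[peer.name] = now
                active.add(peer.name)
        return active
```

In a real deployment `poll_once` would run at the specified frequency, and a peer would presumably be treated as unavailable only after its `last_seen` timestamp exceeds some timeout; both choices are assumptions of this sketch rather than details given in the text.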
Each machine within a distributed computing system typically includes a management application that monitors the activities of other applications, services and machines in the distributed computing system. The management applications generate and exchange management messages that typically include management information about services that are available within the distributed computing system, such as how long a service has been active, how many users a service has had, the present and past workload of the service, software versions of the service, etc., and about the machines on which the services operate, such as the number of services that operate on the machine, capabilities of the machine, etc. The management messages exchanged by the management applications are separate and distinct from the status inquiry and status response messages that are transmitted between machines. Each of the status inquiry messages, status response messages, and management messages consumes bandwidth of the distributed computing system.
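To make the contents of a management message concrete, the following sketch models one as a plain data structure and serializes it. The field names, the `ServiceInfo` and `ManagementMessage` classes, and the choice of JSON as the wire format are all assumptions made for illustration; the text does not specify any of them.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ServiceInfo:
    """Per-service management information (hypothetical field names)."""
    name: str
    uptime_seconds: float    # how long the service has been active
    total_users: int         # how many users the service has had
    current_workload: float  # present workload, e.g. as a load fraction
    software_version: str


@dataclass
class ManagementMessage:
    """Information about one machine and the services operating on it."""
    machine: str
    capabilities: list
    services: list = field(default_factory=list)

    def encode(self):
        """Serialize to bytes; JSON is an assumed wire format."""
        payload = asdict(self)  # also converts the nested ServiceInfo objects
        payload["service_count"] = len(self.services)
        return json.dumps(payload).encode("utf-8")
```

Calling `len(message.encode())` gives the payload size in bytes, which is one simple way to see that these messages add to network traffic on top of the separate status inquiry and status response messages.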