Monitoring the availability of a data processing system (and especially a large network of computers) is a key issue in several applications. This activity allows detecting any unit of the system that is not working properly, so that suitable actions can be taken in an attempt to remedy the situation. For example, it is possible to replace a crashed unit with a corresponding backup unit (previously in a standby mode), or to distribute the workload of the system across the other units (working properly). As a result, a high degree of fault-tolerance can be achieved, thereby avoiding any interruption in a service offered by the system; this is of the utmost importance in systems that implement critical applications (for example, financial transactions, air traffic control, and the like).
A commonplace solution for monitoring the availability of the system is that of having each unit transmit a heartbeat signal at regular intervals to a central monitor. The heartbeat signal indicates that the unit is alive; therefore, if the central monitor does not receive the heartbeat signal as expected, it assumes a crash of the unit. Different policies can be adopted to make the detection of the crash more flexible; for example, the crash may be assumed only when a predefined number of heartbeat signals have not been received within a given time frame.
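The heartbeat scheme described above can be illustrated with a minimal sketch. It is not taken from any particular product; the class name `HeartbeatMonitor`, the parameter `max_missed`, and the use of explicit timestamps (instead of a real clock and network transport) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical central monitor; names and defaults are illustrative."""
    period: float                      # expected interval between heartbeats (seconds)
    max_missed: int = 3                # assumed policy: misses tolerated before a crash is assumed
    last_seen: dict = field(default_factory=dict)

    def record(self, unit: str, now: float) -> None:
        # Called whenever a heartbeat signal from `unit` reaches the monitor.
        self.last_seen[unit] = now

    def missed(self, unit: str, now: float) -> int:
        # Number of whole heartbeat periods elapsed since the last signal.
        return int((now - self.last_seen[unit]) // self.period)

    def crashed(self, now: float) -> list:
        # Units whose missed-heartbeat count has reached the policy threshold.
        return sorted(u for u in self.last_seen
                      if self.missed(u, now) >= self.max_missed)


monitor = HeartbeatMonitor(period=10.0, max_missed=3)
monitor.record("unit-a", now=0.0)
monitor.record("unit-b", now=0.0)
monitor.record("unit-b", now=25.0)     # unit-b keeps reporting, unit-a goes silent
suspected = monitor.crashed(now=31.0)  # unit-a has missed three periods
```

With `max_missed=1` the monitor would behave like the basic scheme, assuming a crash as soon as a single heartbeat signal fails to arrive on time.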
A drawback of the above-described technique is that it imposes a significant overhead on the system. For example, the heartbeat signals consume bandwidth of a network that is also used for the actual flow of data. As a result, the performance of the applications running on the system is adversely affected.
A possible solution is that of reducing the transmission rate of the heartbeat signals. However, in this case the delay between any crash and its detection is accordingly increased (being equal to the heartbeat signal period in the worst situation). The resulting degradation in the availability of the system is unacceptable in several situations (for example, when the system implements critical applications).
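The trade-off between network load and detection delay can be made concrete with a small calculation. The function below is only a sketch; the figures used (1,000 units, 100-byte heartbeat messages) are hypothetical and serve solely to show how the two quantities move in opposite directions as the heartbeat period changes.

```python
def heartbeat_tradeoff(units: int, message_bytes: int, period_s: float):
    """Return (network load in bytes/second, worst-case detection delay in seconds).

    Assumes every unit sends one fixed-size heartbeat per period and that a
    crash is assumed as soon as one expected heartbeat is missing.
    """
    load = units * message_bytes / period_s
    worst_case_delay = period_s  # a crash right after a heartbeat goes unnoticed for one period
    return load, worst_case_delay


fast = heartbeat_tradeoff(units=1000, message_bytes=100, period_s=1.0)
slow = heartbeat_tradeoff(units=1000, message_bytes=100, period_s=10.0)
```

Under these assumptions, lengthening the period from 1 second to 10 seconds cuts the heartbeat traffic tenfold but also lengthens the worst-case detection delay tenfold.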
A different solution is disclosed in U.S. Pat. No. 6,370,656. This document proposes varying the heartbeat rate of each unit adaptively. For example, the heartbeat rate is updated according to the age of the unit, its temperature, or the number of errors that occurred in the past. In this way, it is possible to have a low heartbeat rate for units that are unlikely to experience any problem; at the same time, the heartbeat rate increases as the probability of crashes rises.
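The general idea of an adaptive heartbeat rate can be sketched as follows. This is not the formula of the cited patent, which is not reproduced here; the linear risk score, the weights `w_age`, `w_temp`, `w_err`, and the nominal temperature are all hypothetical choices made purely for illustration.

```python
def adaptive_heartbeat_rate(base_rate_hz: float,
                            age_years: float,
                            temperature_c: float,
                            past_errors: int,
                            nominal_temp_c: float = 40.0,  # assumed normal operating temperature
                            w_age: float = 0.2,            # hypothetical weight per year of age
                            w_temp: float = 0.05,          # hypothetical weight per degree above nominal
                            w_err: float = 0.1) -> float:  # hypothetical weight per past error
    """Scale the heartbeat rate by a simple, made-up risk score."""
    risk = (w_age * age_years
            + w_temp * max(0.0, temperature_c - nominal_temp_c)
            + w_err * past_errors)
    return base_rate_hz * (1.0 + risk)


new_unit = adaptive_heartbeat_rate(0.1, age_years=0, temperature_c=40.0, past_errors=0)
old_unit = adaptive_heartbeat_rate(0.1, age_years=10, temperature_c=45.0, past_errors=5)
```

In this sketch a new, cool, error-free unit keeps the base rate, while an older unit with a history of errors reports more frequently. Because the score is unbounded, the rate keeps growing as the unit ages.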
However, even this solution is not completely satisfactory. Indeed, the system always suffers an overhead that is not negligible. Particularly, when a unit becomes too old the corresponding heartbeat rate may get so high as to be untenable.
An additional drawback of the solutions known in the art is that they can lead to wrong conclusions about the conditions of the different units. Particularly, a unit may be considered crashed even if it is working properly; for example, this happens when an agent running on the unit is unable to transmit the heartbeat signals as required (because it is blocked or temporarily busy), when a transport infrastructure of the heartbeat signals is unavailable (for example, because of a lack of connection between the unit and the central monitor), and the like.