In a multi-node system, a node may fail from time to time. Typically, the failure of a node is not an orderly, planned event and has to be inferred by the other nodes in the multi-node system. To detect a node failure, each node in the multi-node system may employ a heartbeat mechanism in which the node expects every other node to send a heartbeat message periodically. If a node fails to send heartbeat messages for an extended period of time, that node may be deemed to have failed by the other nodes in the multi-node system.
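The heartbeat mechanism described above can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, its methods, the `timeout_s` parameter, and the injectable `clock` are all hypothetical names chosen for illustration; the sketch assumes that heartbeat messages are delivered to the monitor by some transport that is not shown, and it is not meant to represent any particular system's implementation.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (illustrative only)."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # timeout_s: the "extended period" after which a silent node is
        # deemed failed. The value is an assumption left to the caller.
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def register(self, node_id: str) -> None:
        # Start the timer for a peer at registration time.
        self.last_seen[node_id] = self.clock()

    def record_heartbeat(self, node_id: str) -> None:
        # Called whenever a heartbeat message from node_id arrives.
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self) -> List[str]:
        # A node is inferred to have failed if no heartbeat has been
        # recorded for it within the last timeout_s seconds.
        now = self.clock()
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

In such a sketch, a node would call `record_heartbeat` from its message-receiving loop and poll `failed_nodes` periodically. Note that the check is binary: a node that keeps sending heartbeats is treated as healthy regardless of how slowly it is actually performing its work.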
However, a node may be a complicated computing system with many entities involved. There may be many different failure modes for such a computing system. In some failure modes, a node may fail rapidly, while in other failure modes, a node may not fail rapidly but rather degrade slowly over a period of time. A node may fail as the culmination of one or more problems that have been developing over an extended period of time. Even though these problems could be easily fixed if addressed early on, under the existing techniques, detection of these problems is often too late and the actions taken to deal with them are often too drastic, thereby causing many adverse side effects and overall performance degradation. For example, a process on a node may have a memory leak due to a programming error. This problem can be easily resolved if addressed early on, for example, by killing the process and starting a new one. Left unaddressed, however, an operation that the process should complete within 10 seconds may at some point take 100 seconds to complete, because the process may incur numerous page faults related to the memory leak. The process may eventually become unresponsive, leading to additional failures of counterpart processes that interact with the process. Sometimes, this may result in a node failure or a cluster failure.
As the foregoing shows, techniques are needed to dynamically monitor process activeness in computer systems.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.