A computer cluster is a collection of one or more computer systems or nodes, having associated processes, that work together to provide a single, unified computing capability, such that the cluster operates as though it were a single system. Computing tasks can be distributed across multiple nodes within the cluster.
In such a distributed computing environment with many computing nodes hosting services, it is critical that each node used to provide the services be “healthy”. One way to determine whether a node is unhealthy is to compare some “health parameter” of the node against established threshold limits. Unfortunately, a determination based on static information, such as threshold limits, may not accurately reflect the health of the node. For example, a node may be operating within the threshold limits, but still be performing significantly worse than it should be under the current runtime conditions. A need exists for improved ways to monitor distributed computing systems and to identify and correct unhealthy or underperforming nodes.
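The limitation of a static threshold can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the choice of latency as the health parameter, the 500 ms limit, the node names, the sample values, and the peer-median comparison are hypothetical and are not taken from the description above. The sketch shows a node that passes a static check while still performing far worse than its peers under the same conditions.

```python
from statistics import median

# Hypothetical static limit: a node counts as unhealthy above 500 ms latency.
LATENCY_LIMIT_MS = 500.0


def within_static_limit(latency_ms: float, limit_ms: float = LATENCY_LIMIT_MS) -> bool:
    """Static check: compare one health parameter against a fixed limit."""
    return latency_ms <= limit_ms


def underperforming_vs_peers(latencies_ms: dict[str, float], factor: float = 2.0) -> list[str]:
    """Illustrative runtime check: flag nodes whose latency exceeds
    `factor` times the median latency currently observed across the cluster."""
    baseline = median(latencies_ms.values())
    return sorted(name for name, value in latencies_ms.items() if value > factor * baseline)


# Made-up sample readings taken under the same runtime conditions.
latencies = {"node-a": 40.0, "node-b": 45.0, "node-c": 42.0, "node-d": 300.0}

# Every node is inside the static limit, so the static check reports no problem.
passes_static = all(within_static_limit(value) for value in latencies.values())

# Relative to its peers, node-d is clearly underperforming.
flagged = underperforming_vs_peers(latencies)
```

With these sample values, every node satisfies the static limit, yet the comparison against the current cluster median singles out node-d. The median is used here only because it is less sensitive to a single slow node than the mean; other baselines would serve equally well for the illustration.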