Multi-node high-performance computing (HPC) systems can include several thousand nodes. At that scale, the probability rises that one or more nodes or components in the system are “bad”, meaning that there is a problem with the processor core, memory subsystem, I/O subsystem, messaging framework, or the like. With so many nodes and components, it can be very difficult to identify which parts of a multi-node system need to be fixed or replaced, and the identification process can be slow and costly.
Several methods have been proposed to identify bad nodes in a multi-node system. One such method is a voting system that identifies bad nodes and votes them off. These voting systems are very costly in terms of resources. In one scenario, good nodes are forced to expend resources while bad nodes are not, so the cost falls on the good nodes. An approach that lessens the burden on the good nodes would therefore be less costly.
While these methods describe how to dynamically identify dependencies between components of the system, they do not scale down to the lower hardware and software levels required to identify bad nodes in a multi-node HPC system. In addition, none of the current methods can compare the same results in different contexts.