1. Technical Field
This disclosure relates to failure detection and, more particularly, to fault detection in computing machines that utilize multiple processor cores and accelerators.
2. Related Art
The task of detecting processor faults is complicated and grows more complex as processors are added to computing systems. As more processors are added, more sockets, layers of memory, memory buses, HyperTransport (HT) links, and other components are needed. Because of the complexity of these architectures and the large number of components, the systems may fail in many ways, including through failures of Arithmetic and Logic Units (ALUs), memory elements, transport components, etc. Some failures are difficult to detect at the node and board levels and consequently leave no opportunity for corrective measures, such as checkpoint recovery or process migration, when they occur. As a result, computations may run to completion with little or no indication that a fault has occurred.
Such outcomes may be observed in some high performance systems, such as supercomputers and large compute clusters. When software is executed repeatedly, faults can corrupt the results, leading to different outputs for the same input. Repeatedly executing existing application codes with known outputs for diagnostic purposes can require significant execution time and may still fail to detect many errors, including errors at the board or node level.