In distributed processing systems, multiple processors communicate with each other and with memory devices to perform a shared computation. Because the types of computations involved are generally very complex or require a great deal of processing power, this type of communication often must be very high speed.
High-performance computing (“HPC”) systems further increase speed by using specialized hardware that is not generally available commercially off-the-shelf for use in, for example, desktop or server computers. This specialized hardware often includes a plurality of computing nodes having customized application-specific integrated circuits (“ASICs”) with a number of communications channels for communicating with other ASICS on other nodes (and components on the same node). Such hardware also includes the processors, memory, and other specialized hardware unique to implement a tightly-coupled HPC system. HPC systems thus often divide execution of complex computations across multiple of these interconnected nodes.
HPC systems produce significant amounts of heat. As such, proper cooling is important to their effective functioning. HPC systems also are prone to error conditions that can impair their ability to complete a task.