Massively parallel processing (“MPP”) systems may have tens of thousands of nodes connected via a communications mechanism. Each node may include a processor (e.g., an AMD Opteron processor), memory (e.g., 1 to 8 gigabytes), a communications interface (e.g., HyperTransport technology), and a router with routing ports. Each router may be connected via its routing ports to some number of other routers, and through them to other nodes, to form a routing topology (e.g., torus, hypercube, or fat tree) that is the primary system network interconnect. Each router may include a routing table specifying how to route incoming packets from a source node to a destination node. The nodes may be organized into modules (e.g., boards) with a certain number (e.g., 4) of nodes each, and the modules may be organized into cabinets with multiple (e.g., 64) modules in each cabinet. Such a system may be considered scalable when an increase in the number of nodes results in a proportional increase in its computational capacity.
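By way of illustration only, the node, module, and cabinet organization described above may be sketched as simple index arithmetic. The sketch below uses the example figures from the text (4 nodes per module, 64 modules per cabinet). The sequential numbering of nodes and the names `NodeLocation`, `locate`, and `node_id_of` are assumptions introduced here for illustration and are not taken from any particular system.

```python
from typing import NamedTuple

NODES_PER_MODULE = 4      # example figure from the text
MODULES_PER_CABINET = 64  # example figure from the text


class NodeLocation(NamedTuple):
    """Hypothetical physical address of a node."""
    cabinet: int
    module: int
    slot: int


def locate(node_id: int) -> NodeLocation:
    """Map a sequential node number to (cabinet, module, slot).

    Assumes nodes are numbered consecutively within a module and
    modules consecutively within a cabinet.
    """
    if node_id < 0:
        raise ValueError("node_id must be non-negative")
    module_index, slot = divmod(node_id, NODES_PER_MODULE)
    cabinet, module = divmod(module_index, MODULES_PER_CABINET)
    return NodeLocation(cabinet, module, slot)


def node_id_of(location: NodeLocation) -> int:
    """Inverse of locate(): recover the sequential node number."""
    module_index = location.cabinet * MODULES_PER_CABINET + location.module
    return module_index * NODES_PER_MODULE + location.slot
```

Under these assumptions each cabinet holds 256 nodes, so `locate(256)` would identify the first node of the first module in the second cabinet.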
The nodes of an MPP system may be designated as service nodes or compute nodes. Compute nodes are primarily used to perform computations. A service node may be dedicated to providing operating system and programming environment services (e.g., file systems, external I/O, compilation, and editing) to application programs executing on the compute nodes and to users logged in to the service nodes. The operating system services may include I/O services (e.g., access to mass storage), processor allocation services, login capabilities, and so on. The service nodes and compute nodes may employ different operating systems, each customized to support the processing performed by that type of node.
MPP systems may be susceptible to many different types of hardware and software failures. Each of the thousands of nodes may have many different hardware components that may fail, including processors, memory, routers, cooling systems, power supplies, physical connections, disk drives, and so on. In addition, the operating system and other system services executing on each node may have various software components that may fail, including message handlers, monitors, memory allocators, and so on.
Because MPP systems have thousands of nodes with many different possible points of failure, failures are likely to be common. If an MPP system can effectively detect failures, it may be able to take appropriate remedial action to mitigate their effects. For example, if the processor of a compute node fails while executing a certain task of an application program, the processor allocation service of the operating system may select another compute node and restart execution of the task on the new compute node. As another example, if a connection between two routers breaks, the operating system may adjust the routing tables of the routers to bypass the break.
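The two remedial actions described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any particular operating system: routing tables are modeled as next-hop maps recomputed by breadth-first search over the surviving links, and task restart is modeled as moving a task to the first idle node. The names `build_routing_tables` and `reassign_tasks` are hypothetical.

```python
from collections import deque


def build_routing_tables(links):
    """Return {router: {destination: next_hop}} for bidirectional links.

    Next hops follow shortest paths found by breadth-first search.
    Calling this again without a broken link yields tables that
    bypass the break, provided an alternate path exists.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    tables = {}
    for source in neighbors:
        table = {}
        queue = deque()
        # Directly connected routers are reached through themselves.
        for n in sorted(neighbors[source]):
            table[n] = n
            queue.append(n)
        # Farther routers inherit the first hop of the router that found them.
        while queue:
            current = queue.popleft()
            for n in sorted(neighbors[current]):
                if n != source and n not in table:
                    table[n] = table[current]
                    queue.append(n)
        tables[source] = table
    return tables


def reassign_tasks(assignments, failed_node, idle_nodes):
    """Move every task on failed_node to an idle node; return moved tasks.

    assignments maps task -> node and is updated in place. Restarting
    the task itself is outside the scope of this sketch.
    """
    restarted = []
    for task, node in sorted(assignments.items()):
        if node == failed_node:
            if not idle_nodes:
                raise RuntimeError("no idle compute node available")
            assignments[task] = idle_nodes.pop(0)
            restarted.append(task)
    return restarted


# Four routers in a ring: 0-1-2-3-0.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
before = build_routing_tables(ring)
# The link between routers 0 and 1 breaks; rebuild from surviving links.
after = build_routing_tables([link for link in ring if link != (0, 1)])

tasks = {"t1": "n7", "t2": "n8"}
moved = reassign_tasks(tasks, failed_node="n7", idle_nodes=["n9"])
```

In this example router 0 initially reaches router 1 directly; after the break its table directs traffic for router 1 through router 3 instead. Recomputing every table from scratch is chosen here for brevity; a real system might update only the affected entries.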
Extensive message passing between the nodes of an MPP system is typically needed to monitor for failures. Such message passing, however, may place an unacceptably high burden on the primary system network interconnect. As a result of this burden, the performance of the application programs executing on the compute nodes and of the system services provided by the service nodes may be significantly diminished.
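To give a rough sense of scale, the following sketch estimates the monitoring traffic under one purely hypothetical scheme in which every node sends a periodic status message to every other node. The all-to-all scheme, the message size, and the interval are assumptions chosen for illustration; the text does not specify any particular monitoring protocol.

```python
def monitoring_messages_per_second(node_count: int, interval_s: float) -> float:
    """Messages per second if each node messages every other node once per interval."""
    if node_count < 0 or interval_s <= 0:
        raise ValueError("node_count must be >= 0 and interval_s must be > 0")
    return node_count * (node_count - 1) / interval_s


def monitoring_bytes_per_second(node_count: int, interval_s: float,
                                message_bytes: int) -> float:
    """Aggregate load placed on the interconnect by the messages above."""
    return monitoring_messages_per_second(node_count, interval_s) * message_bytes


# Hypothetical figures: 10,000 nodes, one 64-byte message per pair per second.
messages = monitoring_messages_per_second(10_000, 1.0)
load = monitoring_bytes_per_second(10_000, 1.0, 64)
```

Because the message count in this scheme grows with the square of the number of nodes, doubling the node count roughly quadruples the monitoring load carried by the interconnect.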