1. Field of the Invention
The present invention relates generally to the provision of fault tolerance in a parallel computer's interconnection networks by software-controlled dynamic repartitioning.
2. Discussion of the Prior Art
A large class of important computations can be performed by massively parallel computer systems. Such systems consist of many identical compute nodes, each of which typically consists of one or more CPUs, memory, and one or more network interfaces to connect it with other nodes.
The computer described in related U.S. provisional application Ser. No. 60/271,124, filed Feb. 24, 2001, for A Massively Parallel Supercomputer, leverages system-on-a-chip (SOC) technology to create a scalable, cost-efficient computing system with high throughput. SOC technology has made it feasible to build an entire multiprocessor node on a single chip using libraries of embedded components, including CPU cores with integrated first-level caches. Such packaging greatly reduces the component count of a node, allowing for the creation of a reliable, large-scale machine.