1. Field of the Invention
The present invention is directed to a system that allows a multiprocessor system with cpu-set capability to incur a hardware failure and continue running and, more particularly, to a system that assigns the processes of the cpu-set containing the failed processor to a new cpu-set.
2. Description of the Related Art
Large multiprocessor systems have complex operating systems that allow multiple processors (CPUs) to work on the same problem or data set. Such systems can have as many as 512 processors, which are used to tackle one or more tasks. Often, the failure of a single CPU in such a system causes the entire system to fail.
One means of preventing a complete system failure is to use a partitioning capability to subdivide a large system into a cluster of smaller systems. This can be effective at firewalling a single processor failure within its partition node. However, partitioning changes require a reboot of all nodes to reconfigure, and a large number of parallel programming applications cannot readily run across a cluster. What is needed is a system that does not require such rebooting overhead but that can firewall a failed CPU.
These large systems can also be divided into sets of CPUs (cpu-sets) that can be allocated to performing particular functions. The cpu-set feature is very dynamic: it provides a rapid run-time ability to soft partition a large system into subsets and to reconfigure those subsets on the fly. This reduces the rebooting overhead, but certain fatal hardware errors, such as CPU instruction cache errors, can still cause the entire system to halt.
What is needed to reduce total system failures and reduce overhead is a cpu-set type system in which hardware errors halt only some of the processors while the remainder of the system continues to run.