A computer system that performs high-performance calculation includes a node connection network in which a plurality of processors, called nodes, are connected by links. The nodes to which a job such as calculation processing is allocated perform the processing in parallel while communicating with one another. The performance of such a parallel computer system increases with the number of processors. At the same time, the more processors there are, the more likely it is that a failure occurs somewhere in the system.
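The relationship between node count and failure likelihood can be illustrated with a minimal sketch. It assumes, purely for illustration, that each node fails independently with the same probability during a job; the function name `prob_any_failure` and the sample values are hypothetical and do not come from the cited documents.

```python
def prob_any_failure(num_nodes: int, p_node: float) -> float:
    """Probability that at least one of num_nodes nodes fails.

    Assumption (illustrative only): node failures are independent and
    each node fails with the same probability p_node.
    """
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be within [0, 1]")
    # All nodes survive with probability (1 - p_node) ** num_nodes;
    # the complement is the probability of at least one failure.
    return 1.0 - (1.0 - p_node) ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability, chosen only for the demo.
    for n in (1, 100, 10_000):
        print(n, prob_any_failure(n, 1e-4))
```

Under these assumptions the result grows monotonically with the number of nodes, which is the trend described above.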
Failures in a parallel computer system in which a large number of processors are connected by links include failures of the processors and memories in the nodes, failures of the routers in the nodes, and disconnection of the links that connect the nodes. When a failure occurs somewhere during job execution, some measure needs to be taken, because the failure hinders execution of two kinds of jobs: a job being executed in a region that includes the failure location, and a job executed using a communication route that passes through the failure location.
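The two ways in which a failure can hinder a job can be sketched as follows. This is a hedged illustration only: the `Job` record, the representation of nodes as (x, y) coordinates, the representation of routes as node sequences, and the function `hindered_jobs` are all assumptions introduced here, not structures taken from the cited documents.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Set, Tuple

Node = Tuple[int, int]  # hypothetical (x, y) coordinate of a node


@dataclass
class Job:
    name: str
    nodes: Set[Node]  # region (set of nodes) allocated to the job
    # Communication routes used by the job, each a sequence of nodes.
    routes: List[List[Node]] = field(default_factory=list)


def route_links(route: List[Node]) -> Set[FrozenSet[Node]]:
    """Return the undirected links between consecutive nodes of a route."""
    return {frozenset(pair) for pair in zip(route, route[1:])}


def hindered_jobs(
    jobs: Iterable[Job],
    failed_nodes: Iterable[Node] = (),
    failed_links: Iterable[Tuple[Node, Node]] = (),
) -> List[str]:
    """Names of jobs whose region or communication routes touch a failure."""
    bad_nodes = set(failed_nodes)
    bad_links = {frozenset(link) for link in failed_links}
    result = []
    for job in jobs:
        # Case 1: the failure location lies inside the job's region.
        in_region = bool(job.nodes & bad_nodes)
        # Case 2: a route passes through a failed node or a failed link.
        on_route = any(
            (set(route) & bad_nodes) or (route_links(route) & bad_links)
            for route in job.routes
        )
        if in_region or on_route:
            result.append(job.name)
    return result
```

In this sketch, a job whose route relays through a failed node is reported even when that node is outside the job's own region, which corresponds to the second kind of hindered job described above.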
Japanese Translation of PCT Application No. 2007-533031 and Japanese Patent Application Laid-Open No. H06-266684 describe the following: processing for, when a node failure occurs, stopping the job of the subset that includes the failed node, allocating nodes to the job anew, and executing the job; processing for securing a channel that avoids the failed route when a communication route among processors fails; and an interconnect of a parallel computer system.