In a parallel computer system provided with multiple nodes, each including a processor, the nodes are communicably connected to one another via a network having a topology exemplified by a fat tree consisting of leaf switches and spine switches (see FIG. 1).
In such a network, multiple nodes are connected to each leaf switch, and each of the multiple spine switches is connected to the multiple leaf switches via multiple links (see FIG. 1).
In a fat tree, the number of subordinate nodes connected to each leaf switch is the same as the number of links to spine switches connected to that leaf switch. This makes it possible to secure the bandwidth for inter-node communication that is carried out via the links between the leaf switch and the spine switches.
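The balance between subordinate nodes and spine links described above can be illustrated with a minimal sketch. The function name `build_fat_tree`, its parameters, and the numbering of nodes, leaf switches, and spine switches are hypothetical and chosen only for illustration; the sketch assumes one link between each leaf switch and each spine switch.

```python
from typing import Dict, Set, Tuple


def build_fat_tree(num_leaves: int, nodes_per_leaf: int) -> Tuple[Dict[int, int], Set[Tuple[int, int]]]:
    """Build a two-level fat tree (illustrative model, not taken from the source).

    Returns a mapping from node serial number to leaf switch index, and the
    set of (leaf, spine) links. The number of spine switches is set equal to
    nodes_per_leaf so that each leaf has as many uplinks as subordinate nodes.
    """
    if num_leaves <= 0 or nodes_per_leaf <= 0:
        raise ValueError("num_leaves and nodes_per_leaf must be positive")

    # Assumption: one spine switch per subordinate node of a leaf switch.
    num_spines = nodes_per_leaf

    # Nodes are numbered serially, leaf by leaf.
    node_to_leaf = {
        serial: serial // nodes_per_leaf
        for serial in range(num_leaves * nodes_per_leaf)
    }

    # Every leaf switch has exactly one link to every spine switch.
    links = {
        (leaf, spine)
        for leaf in range(num_leaves)
        for spine in range(num_spines)
    }
    return node_to_leaf, links
```

With three leaf switches and four nodes per leaf switch, this model yields twelve nodes and four uplinks per leaf switch, so the number of uplinks equals the number of subordinate nodes.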
In the above parallel computer system, a job scheduler allocates a user job to one or more nodes that are to process the user job, and the user job is then processed in batch processing.
In the batch processing, serial numbers are assigned to the multiple nodes. When a user job is to be allocated to two or more nodes, the job scheduler secures two or more nodes having successive serial numbers from among the nodes in an unoccupied state, that is, nodes to which no job is allocated (unoccupied nodes). Then, the job scheduler allocates the job to the two or more secured nodes having successive serial numbers (see FIG. 2).
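The allocation by successive serial numbers described above can be sketched as follows. This is a minimal first-fit illustration, not the scheduler of any particular system; the function name `allocate_contiguous` and the representation of node state as a list of booleans are assumptions made for the example.

```python
from typing import List, Optional


def allocate_contiguous(occupied: List[bool], count: int) -> Optional[range]:
    """Secure `count` unoccupied nodes with successive serial numbers.

    `occupied[i]` is True when node i already has a job allocated to it.
    On success, the secured nodes are marked occupied and their serial
    numbers are returned as a range; otherwise None is returned.
    """
    if count <= 0:
        raise ValueError("count must be positive")

    run_start = 0  # serial number where the current unoccupied run begins
    run_len = 0    # length of the current unoccupied run

    for serial, busy in enumerate(occupied):
        if busy:
            # An occupied node breaks the run; restart after it.
            run_len = 0
            run_start = serial + 1
            continue
        run_len += 1
        if run_len == count:
            # First run that is long enough: secure it (first fit).
            for s in range(run_start, run_start + count):
                occupied[s] = True
            return range(run_start, run_start + count)

    return None  # no run of successive unoccupied nodes is long enough
```

For example, when nodes 0 and 3 of seven nodes are occupied, a job that needs three nodes is allocated to nodes 4 to 6, because nodes 1 and 2 form a run of only two successive unoccupied nodes.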
Patent Literature 1: Japanese National Publication of International Patent Application No. 2012-532481
During system operation, it is common for some of the links between leaf switches and spine switches to operate incorrectly or to fail. When a failure occurs on a link between a leaf switch and a spine switch, the bandwidth available after the allocation of jobs narrows, which generates a conflict (see FIG. 3) and results in deterioration of system performance.
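The effect of a link failure on one leaf switch can be illustrated with a simplified check. The sketch assumes that every node allocated under the leaf switch communicates through a spine switch at the same time and that each working link carries the traffic of one node; the function name `has_uplink_conflict` and its parameters are hypothetical.

```python
def has_uplink_conflict(nodes_in_use: int, uplinks: int, failed_uplinks: int) -> bool:
    """Return True when allocated nodes outnumber the working uplinks of a leaf switch.

    Simplifying assumption: each node in use needs one working link to a
    spine switch, so a conflict arises when nodes must share a link.
    """
    if nodes_in_use < 0 or uplinks < 0 or not 0 <= failed_uplinks <= uplinks:
        raise ValueError("invalid node or link counts")

    working_uplinks = uplinks - failed_uplinks
    return nodes_in_use > working_uplinks
```

Under these assumptions, a leaf switch with four nodes in use and four links has no conflict while all links operate, but a single failed link leaves three working links for four nodes, so two nodes must share a link.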