A parallel computer has a plurality of computer nodes (hereinafter simply called nodes) connected to each other via a network. Each of the nodes has a calculation processing unit (or a CPU), a memory (or a main memory), and an interconnect (IC) that controls communications with other nodes connected via the network and the routing of packets between the nodes.
When trouble or a failure occurs in a node of a parallel computer, the parallel computer performs dump processing, which transfers the data (or a dump file) inside the memory of the failure node to a storage unit such as an HDD. The dumped data is then analyzed to examine the failure. The dump processing of a failure node is described in Japanese Patent Application Laid-open No. 2010-176345.
In general, the dump processing of data inside the memory of a failure node (hereinafter simply called dump processing) writes the data to a single file server. For example, when a node fails, a dump kernel inside an operating system (OS) starts, executes dump processing to transfer the data inside the memory of the node to the single file server, and shuts down the failure node. After that, the failure node restarts as an ordinary node and again serves as a target to which a new job is to be allocated.
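The general sequence described above (failure, dump kernel start, transfer to a single file server, shutdown, restart) can be illustrated with a minimal sketch. All names here (`Node`, `FileServer`, `NodeState`, `handle_failure`) are hypothetical and chosen only for illustration; the sketch models the ordering of the steps, not an actual dump kernel, and it assumes that the memory contents are cleared on restart.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeState(Enum):
    NORMAL = auto()     # ordinary node, a target for job allocation
    FAILED = auto()     # failure detected, dump not yet taken
    SHUT_DOWN = auto()  # dump finished, node shut down


@dataclass
class FileServer:
    """Hypothetical single file server holding one dump file per node id."""
    dumps: dict = field(default_factory=dict)

    def store(self, node_id: int, data: bytes) -> None:
        self.dumps[node_id] = data


@dataclass
class Node:
    node_id: int
    memory: bytes = b""
    state: NodeState = NodeState.NORMAL

    def fail(self) -> None:
        self.state = NodeState.FAILED

    def run_dump_kernel(self, server: FileServer) -> None:
        """Transfer the memory contents to the file server, then shut down."""
        if self.state is not NodeState.FAILED:
            raise RuntimeError("dump kernel starts only on a failure node")
        server.store(self.node_id, self.memory)
        self.state = NodeState.SHUT_DOWN

    def restart(self) -> None:
        """Restart as an ordinary node (assumption: memory is cleared)."""
        if self.state is not NodeState.SHUT_DOWN:
            raise RuntimeError("restart expects a shut-down node")
        self.memory = b""
        self.state = NodeState.NORMAL


def handle_failure(node: Node, server: FileServer) -> None:
    """Run the steps in the order given in the text."""
    node.fail()
    node.run_dump_kernel(server)
    node.restart()
```

In this sketch every failure node writes to the same `FileServer` instance, which mirrors the single-file-server arrangement described above.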