A parallel computer can process large-scale calculations by, for example, connecting a plurality of computers (hereinafter referred to as calculation nodes) through a network, distributing a calculation job among the calculation nodes, and executing the calculation job in parallel. Accordingly, demand for parallel computers is increasing rapidly.
In general, a parallel computer includes a calculation node group including a plurality of calculation nodes, and a node that manages the calculation node group (hereinafter simply referred to as a management node). The parallel computer may require technology that enables the management node to recognize information (hereinafter simply referred to as job information) such as the usage of resources, for example a CPU, a memory, and a file, used in each calculation node by a currently executed calculation job, and the number of commands executed by the calculation job.
Thus, the calculation nodes executing a calculation job may need to acquire their job information at the same time, that is, to acquire a snapshot. FIG. 14 is a diagram illustrating a snapshot acquisition method for a parallel computer. In a parallel computer 110 illustrated in FIG. 14, a management node 112 managing a plurality of calculation nodes 111 manages the current time, and requests each calculation node 111 to acquire job information when the current time reaches a predetermined time (step S211). In response to the job information acquisition request, each calculation node 111 acquires its own job information (step S212). After acquiring the job information, each calculation node 111 transmits the acquired job information to the management node 112 (step S213). As a result, the management node 112 of the parallel computer 110 illustrated in FIG. 14 can acquire the job information of each calculation node 111 for the same time (timing), that is, a snapshot.
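The request-driven flow of FIG. 14 can be sketched as a small single-process simulation. This is only an illustrative sketch: the class names, the `cpu_usage` field, and the `tick` method are hypothetical and do not appear in the original description, and network transmission is replaced by direct method calls.

```python
from dataclasses import dataclass


@dataclass
class JobInfo:
    node_id: int
    acquired_at: float  # time at which the node sampled its job information
    cpu_usage: float    # hypothetical stand-in for resource usage figures


class CalculationNode:
    def __init__(self, node_id, cpu_usage):
        self.node_id = node_id
        self.cpu_usage = cpu_usage

    def acquire_job_info(self, now):
        # Step S212: the node acquires its own job information on request.
        return JobInfo(self.node_id, now, self.cpu_usage)


class ManagementNode:
    def __init__(self, nodes, snapshot_time):
        self.nodes = nodes
        self.snapshot_time = snapshot_time

    def tick(self, now):
        # Step S211: the management node tracks the time and issues the
        # request only once the predetermined time has been reached.
        if now < self.snapshot_time:
            return None
        # Step S213: each node hands back what it acquired (a direct return
        # value here, a network transmission in a real system).
        return [node.acquire_job_info(now) for node in self.nodes]


nodes = [CalculationNode(i, cpu_usage=10.0 * i) for i in range(3)]
manager = ManagementNode(nodes, snapshot_time=100.0)
early = manager.tick(99.0)       # before the predetermined time: no request
snapshot = manager.tick(100.0)   # at the predetermined time: one entry per node
```

Because the sketch delivers every request instantly, all entries carry the same acquisition time; it deliberately ignores the network delay that a real system would have.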
FIG. 15 is a diagram illustrating another snapshot acquisition method for a parallel computer 120. In the parallel computer 120 illustrated in FIG. 15, each calculation node 121 manages the current time. When the current time reaches a predetermined time, each calculation node 121 acquires its own job information (step S221). After acquiring its own job information, each calculation node 121 transmits the acquired job information to a management node 122 (step S222). As a result, the management node 122 of the parallel computer 120 illustrated in FIG. 15 can acquire the job information of each calculation node 121 for the same time (timing), that is, a snapshot.
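The node-driven flow of FIG. 15 can be sketched in the same simulated style. Again this is an assumption-laden illustration: `TimedCalculationNode`, its `tick` method, and the plain list standing in for the management node's receive queue are hypothetical names, not part of the original description.

```python
class TimedCalculationNode:
    def __init__(self, node_id, snapshot_time):
        self.node_id = node_id
        self.snapshot_time = snapshot_time
        self.sent = False

    def tick(self, local_now, inbox):
        # Step S221: the node itself watches the time and acquires its own
        # job information once the predetermined time has been reached.
        if self.sent or local_now < self.snapshot_time:
            return
        info = {"node_id": self.node_id, "acquired_at": local_now}
        # Step S222: the node transmits the job information to the
        # management node (modelled here as appending to a list).
        inbox.append(info)
        self.sent = True


inbox = []  # hypothetical receive queue of the management node
timed_nodes = [TimedCalculationNode(i, snapshot_time=100.0) for i in range(3)]
for now in (99.0, 100.0, 101.0):
    for node in timed_nodes:
        node.tick(now, inbox)
```

In this sketch every node shares one simulated clock, so each node reports exactly once and all reports carry the same acquisition time.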
Patent Document 1: Japanese Laid-open Patent Publication No. 8-44680
Patent Document 2: Japanese Laid-open Patent Publication No. 63-136176
In the parallel computer 110 illustrated in FIG. 14, when the time taken for the job information acquisition request from the management node 112 to arrive differs among the respective calculation nodes 111, the job information acquisition timing is not synchronized between the calculation nodes 111, so that an accurate snapshot is difficult to acquire.
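The effect of unequal request arrival times can be made concrete with a little arithmetic. The following sketch assumes made-up per-node delivery delays (in seconds) and hypothetical helper names; it only shows how the delays translate directly into a spread of acquisition times.

```python
def acquisition_times(request_sent_at, delays):
    # Each node acquires its job information when the request arrives,
    # i.e. at the send time plus that node's delivery delay.
    return {node: request_sent_at + delay for node, delay in delays.items()}


def skew(times):
    # Spread between the earliest and the latest acquisition time.
    return max(times.values()) - min(times.values())


# Illustrative delays only; node2 sits behind a congested link.
delays = {"node0": 0.001, "node1": 0.004, "node2": 0.250}
times = acquisition_times(100.0, delays)
spread = skew(times)  # roughly 0.249 s between the first and last sample
```

Any job activity that occurs within that spread is counted by the late node but not by the early ones, which is why the resulting set of values is not a true snapshot.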
Also, in the parallel computer 120 illustrated in FIG. 15, since the job information is transmitted asynchronously from the respective calculation nodes 121, the job information acquired at the same time (same timing) by the respective calculation nodes 121 may not all be received at the management node 122 before the next job information acquisition time. As a result, job information of different times may be received in a mixed manner. That is, in the parallel computer 120, since the management node 122 cannot tell which pieces of job information from the respective calculation nodes 121 belong to the same timing, an accurate snapshot is difficult to acquire.
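The mixing of job information from different acquisition times can be illustrated with a small simulation. The sketch below assumes a management node that groups messages purely by the collection window in which they arrive; the function name, the 10-second interval, and all timestamps are hypothetical. The sampling time is carried in each tuple only so that the mixing is visible in the result.

```python
def group_by_arrival_window(messages, interval):
    # Group messages by the collection window in which they ARRIVE,
    # ignoring when each node actually sampled its job information.
    windows = {}
    for sampled_at, arrived_at, node in messages:
        window = int(arrived_at // interval)
        windows.setdefault(window, []).append((node, sampled_at))
    return windows


# (sampled_at, arrived_at, node); acquisition times are 100.0 and 110.0.
messages = [
    (100.0, 100.5, "node0"),
    (100.0, 100.7, "node1"),
    (100.0, 111.0, "node2"),  # delayed past the next acquisition time
    (110.0, 110.4, "node0"),
    (110.0, 110.6, "node1"),
    (110.0, 110.9, "node2"),
]
windows = group_by_arrival_window(messages, interval=10.0)
# Distinct sampling times that ended up in each arrival window.
mixed = {w: sorted({t for _, t in entries}) for w, entries in windows.items()}
```

Here the first window is missing node2 entirely, while the second window contains values sampled at both 100.0 and 110.0, so neither window is a consistent snapshot.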