A parallel computation system formed by coupling a large number of computers, referred to as nodes, is often used in the field of high performance computing (HPC). A node may be, for example, one chip set or the like. In recent years, parallel computation systems have also been used for deep learning and the like.
There are a mesh connection and a torus connection as forms of coupling of nodes in parallel computation systems. The mesh connection is a form of coupling in which nodes are arranged in the form of a mesh in a plurality of axial directions, and nodes adjacent to each other in each of the axial directions are coupled to each other by a high-speed network referred to as an interconnect. The torus connection is a form of coupling in which the mesh connection is made, and then nodes at both ends on each of the axes are coupled to each other. There are also networks where all of the axes have the mesh connection or the torus connection and forms of coupling such that a part of the axes have the mesh connection and the other axes have the torus connection. For example, parallel computation systems include devices having a topology as a six-dimensional torus structure.
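The difference between the mesh connection and the torus connection can be illustrated with a minimal sketch. The function name `neighbors` and the parameters `shape` and `torus_axes` are hypothetical names introduced here for illustration only. On an axis listed in `torus_axes` the nodes at both ends are coupled (wraparound); on the other axes they are not.

```python
def neighbors(coord, shape, torus_axes):
    """Return the coordinates adjacent to `coord`.

    coord      -- tuple with one index per axis
    shape      -- tuple with the number of nodes along each axis
    torus_axes -- set of axis numbers that have the torus connection;
                  all other axes have the mesh connection
    """
    result = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            pos = coord[axis] + step
            if axis in torus_axes:
                pos %= size          # couple the nodes at both ends of the axis
            elif not 0 <= pos < size:
                continue             # mesh axis: no link beyond the end
            nxt = coord[:axis] + (pos,) + coord[axis + 1:]
            # Skip self-links and duplicates that arise on very short axes.
            if nxt != coord and nxt not in result:
                result.append(nxt)
    return result


# Corner node of a 4x4 arrangement under three forms of coupling.
mesh_only = neighbors((0, 0), (4, 4), set())
torus_only = neighbors((0, 0), (4, 4), {0, 1})
mixed = neighbors((0, 0), (4, 4), {0})
```

In this sketch a corner node has two neighbors when all axes have the mesh connection, four when all axes have the torus connection, and three in the mixed form.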
Further, a parallel computation system may adopt a configuration that includes a plurality of system boards each having a plurality of nodes mounted thereon. A coupling between nodes arranged on a same system board is established by a high-speed dedicated interconnect. On the other hand, a coupling between nodes arranged on different system boards is established via a network switch using peripheral component interconnect (PCI) and InfiniBand (registered trademark). Here, the coupling between the nodes within the same system board will be referred to as an “inside coupling,” and the coupling between the nodes via the network switch between the different system boards will be referred to as an “outside coupling.” The inside coupling, which is established by the dedicated interconnect, has a wide bandwidth as compared with the outside coupling using PCI and InfiniBand, and thus enables communication at high speed.
Each of the nodes of the parallel computer system then processes, at high speed, a program used in solving a complex problem. For example, the parallel computer system divides a job, which is an executable unit of the program, into a plurality of processes, and allocates the divided processes to the respective nodes. Here, a process is the unit of the program on which each node actually performs arithmetic processing. When a node obtains a process, the node performs the arithmetic processing of the obtained process. When the node completes the arithmetic processing of the process, the node transmits an arithmetic result to a management server, and ends the arithmetic processing. In addition, the parallel computer system transmits a new process to the node that has ended the arithmetic processing, and makes the node perform arithmetic processing. Then, the parallel computer system integrates, on the management server, the results of the arithmetic processing performed by the respective nodes, and obtains an arithmetic result of the whole of the job.
The parallel computation system may perform the processing of Allreduce in such arithmetic processing. Allreduce is processing of integrating values calculated by respective processes, and sharing, in all of the processes, a result obtained by performing an operation using the integrated values. In this case, each node performs group communication. When the group communication is performed, the process performed by each node retains the arithmetic result of the values possessed by all of the processes. Thus, when the processing of Allreduce is performed, each node obtains the values possessed by all of the other nodes. However, a network load is increased when the value possessed by each of the nodes is transmitted to all of the other nodes, for example, in the processing of Allreduce.
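As a point of reference, the result that Allreduce is expected to produce, and the traffic of the straightforward approach in which every node transmits its values to all of the other nodes, can be sketched as follows. The function names are hypothetical, and addition is only an assumed default operation; the sketch models the outcome, not the communication itself.

```python
from functools import reduce
import operator


def allreduce_reference(vectors, op=operator.add):
    """Return what every process holds after Allreduce.

    vectors -- one equally sized list of values per process
    op      -- the reduction operation (assumed here to be addition)
    """
    reduced = [reduce(op, column) for column in zip(*vectors)]
    # Every process ends with its own copy of the same reduced vector.
    return [list(reduced) for _ in vectors]


def naive_traffic(num_processes, vector_len):
    """Elements sent in total when each process transmits its whole
    vector to every other process."""
    return num_processes * (num_processes - 1) * vector_len


shared = allreduce_reference([[1, 2], [3, 4], [5, 6]])
total_sent = naive_traffic(8, 1024)
```

The quadratic growth of `naive_traffic` in the number of processes is the network load that the text refers to.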
It is therefore desirable to reduce communication data amounts between the nodes when the processing of Allreduce is performed. Accordingly, a Halving+Doubling method is proposed as a technology of reducing the communication data amounts in Allreduce. Halving+Doubling may be referred to also as Reduce_scatter+Allgather.
When a Halving operation in the Halving+Doubling method is performed, the communication data amount is halved in each communication step. When a Doubling operation is performed, on the other hand, the communication data amount is doubled in each communication step. For example, in the Halving+Doubling method, performing Halving after a start of processing reduces the communication data amount as the steps advance, and subsequently performing Doubling increases the communication data amount as the steps advance. Therefore, in the Halving+Doubling method, mutual communication of a large amount of data is performed in the early steps, mutual communication of a small amount of data is performed in the middle steps, and mutual communication of a large amount of data is again performed in the later steps.
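The way the communication data amount shrinks and then grows can be sketched with a small single-machine simulation. The source does not specify how partners are paired in each step; the sketch assumes the commonly used formulation in which a process exchanges data with the process whose rank differs in one bit, and it assumes a power-of-two number of processes, addition as the operation, and a vector length divisible by the number of processes. All names are hypothetical.

```python
def halving_doubling_allreduce(vectors):
    """Simulate Halving (reduce-scatter) followed by Doubling (allgather).

    Returns (data, sent_per_step): the vector each process holds at the
    end, and the number of elements each process sends in each step.
    """
    p, n = len(vectors), len(vectors[0])
    if p & (p - 1) or n % p:
        raise ValueError("need a power-of-two process count dividing the length")
    data = [list(v) for v in vectors]
    lo, hi = [0] * p, [n] * p          # segment each process is working on
    sent_per_step = []

    # Halving: each step keeps one half of the segment and sends the other.
    dist = p // 2
    while dist >= 1:
        before = [list(d) for d in data]   # values prior to this exchange
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            keep = (lo[r], mid) if (r & dist) == 0 else (mid, hi[r])
            for i in range(*keep):
                data[r][i] = before[r][i] + before[partner][i]
            lo[r], hi[r] = keep
        sent_per_step.append(hi[0] - lo[0])
        dist //= 2

    # Doubling: each step exchanges the owned segment, doubling its size.
    dist = 1
    while dist < p:
        before = [list(d) for d in data]
        old_lo, old_hi = lo[:], hi[:]
        sent_per_step.append(old_hi[0] - old_lo[0])
        for r in range(p):
            partner = r ^ dist
            for i in range(old_lo[partner], old_hi[partner]):
                data[r][i] = before[partner][i]
            lo[r] = min(old_lo[r], old_lo[partner])
            hi[r] = max(old_hi[r], old_hi[partner])
        dist *= 2
    return data, sent_per_step


inputs = [[rank * 10 + i for i in range(8)] for rank in range(4)]
outputs, sizes = halving_doubling_allreduce(inputs)
```

With four processes and eight elements, `sizes` shows the amount sent per step falling during Halving and rising again during Doubling, while every process ends with the elementwise sum.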
Here, as described above, in the parallel computation system having the inside coupling and the outside coupling, the outside coupling has a narrow bandwidth, and therefore the amount of data transmitted and received in the outside coupling is desirably small. Accordingly, when the processing of Allreduce is performed in the parallel computation system having the inside coupling and the outside coupling, it is desirable to perform communication in the outside coupling after reducing the size of the data as much as possible in the inside coupling. For example, the processing of Allreduce may be performed by the following method. First, the amount of data transmitted and received is reduced by performing Halving in the inside coupling, and thereafter Allreduce processing of the data is performed in the outside coupling. Thereafter, the amount of data handled is increased by performing Doubling in the inside coupling, and the processing of Allreduce is completed.
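The three phases of this method can be sketched as follows. The sketch models only the net effect of each phase (not the individual communication steps), assumes addition as the operation, and assumes that the vector length is divisible by the number of nodes per system board. The function name and the `boards[b][k]` layout are hypothetical.

```python
def hierarchical_allreduce(boards):
    """boards[b][k] is the vector held by node k on system board b.

    Returns (result, outside_len), where result has the same layout as
    boards and outside_len is the number of elements each node has to
    handle over the outside coupling.
    """
    m = len(boards[0])          # nodes per system board
    n = len(boards[0][0])       # elements per vector
    if n % m:
        raise ValueError("vector length must be divisible by nodes per board")
    seg = n // m

    # Phase 1 (inside coupling, Halving): node k of each board ends up
    # owning the board-local sum of segment k only.
    owned = [
        [[sum(node[i] for node in board) for i in range(k * seg, (k + 1) * seg)]
         for k in range(m)]
        for board in boards
    ]

    # Phase 2 (outside coupling): nodes with the same index k on different
    # boards perform Allreduce on their short segment.
    global_seg = [
        [sum(owned[b][k][j] for b in range(len(boards))) for j in range(seg)]
        for k in range(m)
    ]

    # Phase 3 (inside coupling, Doubling): segments are gathered so that
    # every node holds the complete result again.
    full = [x for k in range(m) for x in global_seg[k]]
    result = [[list(full) for _ in board] for board in boards]
    return result, seg


boards = [
    [[1, 2, 3, 4], [10, 20, 30, 40]],
    [[100, 200, 300, 400], [1000, 2000, 3000, 4000]],
]
result, outside_len = hierarchical_allreduce(boards)
```

In this example each node handles two elements over the outside coupling instead of the four it started with; with more nodes per system board the reduction is proportionally larger.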
When such Allreduce processing is performed, the form of coupling of the nodes desirably has a connection forming a hypercube. An n-dimensional hypercube has the following features. The n-dimensional hypercube is constituted of 2^n nodes, and each node has n links. Further, when a binary index is assigned to each node, each node is adjacent to and coupled to the nodes whose assigned indexes differ from its own index in the value of exactly one bit. For example, in the case where the nodes have a form of coupling constituting a hypercube, it is easy to identify data transmission destinations, and the processing load is reduced because data transmission and reception in performing the processing of Allreduce become easy.
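The ease of identifying transmission destinations in a hypercube can be seen in a short sketch: the neighbors of a node are obtained by inverting one bit of its binary index at a time. The function name is hypothetical.

```python
def hypercube_neighbors(node, n):
    """Return the indexes of the n nodes coupled to `node` in an
    n-dimensional hypercube of 2**n nodes."""
    if not 0 <= node < (1 << n):
        raise ValueError("node index out of range for this dimension")
    # Flipping bit `bit` gives the neighbor along that dimension.
    return [node ^ (1 << bit) for bit in range(n)]


# Node 0b101 in a three-dimensional hypercube (8 nodes, 3 links per node).
adjacent = hypercube_neighbors(0b101, 3)
```

Here `adjacent` contains the indexes 0b100, 0b111, and 0b001, each differing from 0b101 in exactly one bit.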
Further, as a technology of group communication in a parallel computer system, there is a technology that calculates an entire processing time, switches between an entire data communication and a partial data communication so as to select a shorter processing time, and performs the communication.
A related technology is disclosed in Japanese Laid-open Patent Publication No. 2001-325239.