In recent years, cluster systems, in which a plurality of computers are connected together by a high-speed network, have become widely known as High Performance Computing (HPC) systems. When a parallel processing program is executed on such a cluster system, the parallel processes of the program are distributed to a plurality of computers and then started. Thus, data exchange between the parallel processes involves communication between the computers. Therefore, the performance of inter-computer communication affects the performance of the cluster system.
In order to improve the performance of inter-computer communication, both a high-performance network, such as InfiniBand or Myrinet, and a communication library that makes full use of the high-performance network are needed. In many cases, a parallel processing program written using a communication Application Program Interface (API) called the Message Passing Interface (MPI) is executed on the cluster system. Therefore, various communication libraries based on the MPI specification have been provided.
For example, as illustrated in FIG. 1, if processes “0” to “N−1” are distributed to “N” computers and started, inter-process communication for data exchange is performed in many cases during the computation defined by the parallel processing program. Although FIG. 1 illustrates an example in which communication is performed among all started processes, one-to-one communication between specific processes may also be performed. For the inter-process communication, the applicable routine of the MPI library is called to perform the communication.
The patterns of inter-process communication in parallel processing programs vary and are determined by each program. Among these patterns, the “All-to-All” communication pattern illustrated in FIG. 1, in which data exchange is performed among all the started parallel processes, is the communication pattern of interest here. The MPI specification defines all-to-all communication as the function MPI_Alltoall(). Although various algorithms exist for implementing all-to-all communication, a ring algorithm is used in many cases where data sizes are comparatively large and performance is limited by the bandwidth of the network.
The ring algorithm will be described with reference to FIG. 2A to FIG. 2H. As illustrated in FIG. 2A, a case with eight processes, processes “0” to “7”, will be considered. In this case, as illustrated in FIG. 2B, each process receives data from the process with the preceding process number and sends data to the process with the next process number. Here, the process numbers wrap around: the process following the process “7” is the process “0”, and the process preceding the process “0” is the process “7”.
Then, as illustrated in FIG. 2C, each process receives data from the process with the process number two places behind and sends data to the process with the process number two places ahead. Then, as illustrated in FIG. 2D, each process receives data from the process with the process number three places behind and sends data to the process with the process number three places ahead. Likewise, the offset is four places in FIG. 2E, five places in FIG. 2F, six places in FIG. 2G, and seven places in FIG. 2H. In each of these steps, each process receives data from the process that many places behind and sends data to the process that many places ahead, with the process numbers wrapping around as described above. After these seven steps, every process has exchanged data with every other process.
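The step-by-step exchange described above can be sketched as follows. This is a minimal illustration, not an implementation from any MPI library; the function name `ring_schedule` is hypothetical, and the sketch assumes that the wrap-around rule given for FIG. 2B applies uniformly to every step, with each process sending ahead and receiving from behind.

```python
def ring_schedule(num_procs):
    """Return, for each step, the (send_to, recv_from) partner of every process.

    Assumption: in step k (k = 1 .. num_procs - 1), process r sends to
    (r + k) mod num_procs and receives from (r - k) mod num_procs.
    """
    schedule = []
    for step in range(1, num_procs):
        partners = []
        for rank in range(num_procs):
            send_to = (rank + step) % num_procs    # wraps past the last process
            recv_from = (rank - step) % num_procs  # wraps before process 0
            partners.append((send_to, recv_from))
        schedule.append(partners)
    return schedule


if __name__ == "__main__":
    # Show the partners of process 0 for the eight-process case of FIG. 2A.
    for step, partners in enumerate(ring_schedule(8), start=1):
        send_to, recv_from = partners[0]
        print(f"step {step}: process 0 sends to {send_to}, receives from {recv_from}")
```

With eight processes the schedule has seven steps, and in the fourth step the send and receive partners of each process coincide (process “0” exchanges with process “4”, and so on).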
In the case of employing the ring algorithm, the all-to-all communication may be efficiently performed as long as the configuration of the network does not correspond to the problematic configuration described below.
Next, the network configuration will be examined. As illustrated in FIG. 3, if the number of computers to be used is small, these computers may be connected together with a single network switch SW (hereinafter, simply referred to as a switch). In the example of FIG. 3, eight computers are connected to one switch, and one process is started in each computer. This case is equivalent to a crossbar connection, so that no competition occurs among network links even if all-to-all communication is performed among the started processes.
On the other hand, if the number of computers to be used exceeds the number that can be connected to one switch, a multistage switch configuration may be used. A simple network with a multistage switch configuration is a tree network as illustrated in FIG. 4. In FIG. 4, the number of computers and the number of processes to be started are the same as those of the example illustrated in FIG. 3, but the number of ports of each switch SW is less than eight. In addition, FIG. 4 illustrates the communication state in which the processes “0” to “3” perform data transmission in FIG. 2E. The switches are arranged in two stages and the number of upper switches is limited to “1”. Thus, the link bandwidth between the upper switch and the lower switches is insufficient. Therefore, if all-to-all communication is performed among the eight processes, competition for links may occur between the upper switch and the lower switches as represented by the encircled portion in FIG. 4, which decreases throughput.
Therefore, in the case of putting a high priority on the network performance, a fat tree network has been employed in many cases. An example illustrated in FIG. 5 includes four upper switches, so that the number of links from the lower switches to the upper switches (also referred to as up-links) may be equal to the number of links from the lower switches to the computers (also referred to as down-links). The data transmission to the computers where the processes “0” and “4” are started is set up to pass through the upper switch “A”. In addition, the data transmission to the computers where the processes “1” and “5” are started is set up to pass through the upper switch “B”. Furthermore, the data transmission to the computers where the processes “2” and “6” are started is set up to pass through the upper switch “C”. Furthermore, the data transmission to the computers where the processes “3” and “7” are started is set up to pass through the upper switch “D”. Thus, as illustrated in FIG. 5, communication is performed without any link competition when the computers in which the processes “0” to “3” are started transmit data in FIG. 2E.
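The contrast between the tree network of FIG. 4 and the fat tree network of FIG. 5 can be illustrated by counting how many transfers share each up-link in the step of FIG. 2E. The following is only a sketch under stated assumptions: the figures are not reproduced here, so it assumes two lower switches with four computers each, one up-link from each lower switch to each upper switch, and a destination-based rule that selects the upper switch by the destination process number modulo the number of upper switches (which matches the assignment of the processes “0” and “4” to “A”, “1” and “5” to “B”, and so on). The function name `uplink_load` is hypothetical.

```python
from collections import Counter

UPPER_SWITCHES = "ABCD"


def uplink_load(senders, step, num_procs, num_upper, procs_per_lower):
    """Count transfers per (lower switch, upper switch) up-link for one ring step."""
    load = Counter()
    for src in senders:
        dst = (src + step) % num_procs
        src_lower = src // procs_per_lower
        dst_lower = dst // procs_per_lower
        if src_lower == dst_lower:
            continue  # both computers hang off the same lower switch
        # Assumed rule: the destination alone decides the upper switch.
        upper = UPPER_SWITCHES[dst % num_upper]
        load[(src_lower, upper)] += 1
    return load


if __name__ == "__main__":
    senders = range(4)  # processes 0 to 3 transmit, as in FIG. 2E
    print("tree, 1 upper switch:    ", dict(uplink_load(senders, 4, 8, 1, 4)))
    print("fat tree, 4 upper switches:", dict(uplink_load(senders, 4, 8, 4, 4)))
```

Under these assumptions, the single up-link of the tree network carries all four transfers, whereas in the fat tree network each of the four up-links carries exactly one transfer.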
However, in a network such as the InfiniBand mentioned above, in which a packet transfer path may not be dynamically changed (i.e., a static-routing network), communication may be concentrated on a specific link between the upper switches and the lower switches. This case is exemplified in FIG. 6. The example illustrated in FIG. 6 includes 16 computers which are connected to a fat tree network having four upper switches and four lower switches. In this system, one computer is selected depending on the execution statuses or the like of the respective computers, and the selected computer starts processing. However, packets transmitted from the selected computer may unevenly pass through the upper switches.
In the example illustrated in FIG. 6, the data transmission to the computers where the processes “0”, “4”, “6”, and “7” are respectively started is set up to pass through the upper switch “A”. The data transmission to the computers where the processes “1” and “5” are respectively started is set up to pass through the upper switch “B”. The data transmission to the computer where the process “2” is started is set up to pass through the upper switch “C”. The data transmission to the computer where the process “3” is started is set up to pass through the upper switch “D”. Thus, when the computers where the processes “0” to “3” are started transmit data in the communication state of FIG. 2E, every data communication uses an up-link to the upper switch “A”, except the data communication from the computer with the started process “1” to the computer with the started process “5”. As a result, link competition occurs as represented by the circled portion in FIG. 6.
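The concentration on the upper switch “A” can be checked with a short count. The routing table below is taken from the description of FIG. 6 above; the function name `upper_switch_usage` is hypothetical, and the sketch assumes that every transfer in the step of FIG. 2E crosses an upper switch, because the placement of the processes on the lower switches is not given in the text.

```python
from collections import Counter

# Destination process -> upper switch, as described for FIG. 6.
ROUTE_FIG6 = {0: "A", 4: "A", 6: "A", 7: "A", 1: "B", 5: "B", 2: "C", 3: "D"}


def upper_switch_usage(senders, step, route):
    """Count how many transfers of one ring step pass through each upper switch."""
    num_procs = len(route)
    usage = Counter()
    for src in senders:
        dst = (src + step) % num_procs  # ring partner for this step
        usage[route[dst]] += 1          # the destination fixes the upper switch
    return usage


if __name__ == "__main__":
    # Processes 0 to 3 transmit to processes 4 to 7 (the step of FIG. 2E).
    print(dict(upper_switch_usage(range(4), 4, ROUTE_FIG6)))
```

Three of the four transfers pass through the upper switch “A” and only the transfer from the process “1” to the process “5” passes through the upper switch “B”, while the upper switches “C” and “D” are unused in this step.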
In this way, the communication load may be concentrated on a certain link even though a fat tree network is applied, and communication performance may decrease. This situation may occur when a static-routing network, such as InfiniBand, is employed. Specifically, as illustrated in FIG. 7, each computer is assigned only one network identifier (referred to as an “LID” in InfiniBand) used for routing in the network. In addition, each switch fixedly associates each network identifier with a packet destination port, which causes the disadvantage described above. For example, the LID of the leftmost computer is “1”, so that a packet addressed to this computer fixedly passes through the upper switch “A”. Similarly, the LID of the rightmost computer is “16”, so that a packet addressed to this computer fixedly passes through the upper switch “D”.
Furthermore, as described above, the selection of computers that start parallel processes is performed without consideration of the relationship between the packet destination port and the LID assigned to the computer. Thus, communication paths for data transmission (specifically, the upper switches as routes for data transmission) may be unequally used.
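The effect of selecting computers without regard to the fixed LID-to-port correspondence can be sketched as follows. The text states only that LID “1” is routed through the upper switch “A” and LID “16” through the upper switch “D”; the cyclic rule used for the LIDs in between is an assumption made for illustration, and the two placements of processes on LIDs are hypothetical. The names `upper_switch_for_lid` and `switch_counts` are likewise hypothetical.

```python
from collections import Counter

UPPER_SWITCHES = "ABCD"


def upper_switch_for_lid(lid):
    """Assumed fixed rule: LIDs 1..16 cycle over the upper switches A..D."""
    return UPPER_SWITCHES[(lid - 1) % len(UPPER_SWITCHES)]


def switch_counts(lid_of_process):
    """Count how many processes are reached through each upper switch."""
    return Counter(upper_switch_for_lid(lid) for lid in lid_of_process.values())


# Hypothetical placement on consecutive LIDs: destinations spread evenly.
even_placement = {proc: proc + 1 for proc in range(8)}

# Hypothetical placement chosen without regard to the LIDs: it reproduces
# the uneven assignment described for FIG. 6 (0, 4, 6, 7 -> "A").
uneven_placement = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 9, 7: 13}

if __name__ == "__main__":
    print("even:  ", dict(switch_counts(even_placement)))
    print("uneven:", dict(switch_counts(uneven_placement)))
```

Under the assumed rule, the first placement reaches two processes through each upper switch, whereas the second placement reaches four processes through the upper switch “A” and only one each through the upper switches “C” and “D”.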