In recent years, clusters have been constructed by connecting plural computers through a high-speed network or the like to realize High Performance Computing (HPC). When a parallel processing program is executed in such a cluster system, parallel processes are invoked in a distributed manner on plural computers. Therefore, when data is exchanged between the parallel processes, communication is carried out between the computers, and the performance of this communication affects the performance of the cluster system.
In order to enhance the performance of the communication between the computers, the use of a high-performance network such as InfiniBand or Myrinet, together with communication libraries that utilize such a network, is considered. On the cluster system, parallel processing programs written with the communication Application Program Interface (API) called Message Passing Interface (MPI) are often executed. Therefore, various communication libraries based on the MPI specification are provided.
For example, as illustrated in FIG. 1, when processes 0 to N−1 are invoked in a distributed manner on N computers, the interprocess communication for the data exchange is often carried out between the periods of calculation processing defined in the parallel processing program. FIG. 1 depicts an example in which communication is carried out among all of the invoked processes. However, there are also cases where peer-to-peer communication is carried out between specific processes. In this interprocess communication, a corresponding subroutine of the MPI library is invoked to conduct the communication.
The pattern of the interprocess communication in the parallel processing program varies from program to program. Among the various communication patterns, All-to-All, in which data is exchanged among all of the invoked parallel processes as illustrated in FIG. 1, is one of the communication patterns that attract attention. The MPI specification also defines a subroutine MPI_Alltoall() that realizes the All-to-All communication. Although various algorithms for realizing the All-to-All communication exist, a Ring algorithm is frequently utilized when the size of the communication data is relatively large and the performance is limited by the bandwidth of the network.
The Ring algorithm will be explained by using FIGS. 2A to 2H. As illustrated in FIG. 2A, a case is considered in which 8 processes, namely, processes 0 to 7, are invoked. In such a case, as illustrated in FIG. 2B, first, each process receives data from the process whose process number is calculated by subtracting 1 from its own process number. Incidentally, the process numbers wrap around: the process after the process having the process number “7” is the process having the process number “0”, and the process before the process having the process number “0” is the process having the process number “7”.
Then, as illustrated in FIG. 2C, each process receives data from a process having the process number calculated by subtracting 2 from the process number of the corresponding process. Furthermore, as illustrated in FIG. 2D, each process receives data from a process having the process number calculated by subtracting 3 from the process number of the corresponding process. Moreover, as illustrated in FIG. 2E, each process receives data from a process having the process number calculated by subtracting 4 from the process number of the corresponding process. In addition, as illustrated in FIG. 2F, each process receives data from a process having the process number calculated by subtracting 5 from the process number of the corresponding process. Moreover, as illustrated in FIG. 2G, each process receives data from a process having the process number calculated by subtracting 6 from the process number of the corresponding process. Furthermore, as illustrated in FIG. 2H, each process receives data from a process having the process number calculated by subtracting 7 from the process number of the corresponding process.
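The steps of FIGS. 2B to 2H can be summarized as follows: in step s (s = 1 to 7), each process p receives data from the process (p − s) mod 8. The following is a minimal sketch of that schedule only; the function name `ring_schedule` is hypothetical and no actual data transfer is performed.

```python
def ring_schedule(num_procs):
    """Return one list of (sender, receiver) pairs per step of the Ring algorithm."""
    steps = []
    for s in range(1, num_procs):
        # In step s, process p receives from process (p - s) mod N.
        # Equivalently, process q sends to process (q + s) mod N.
        steps.append([((p - s) % num_procs, p) for p in range(num_procs)])
    return steps


schedule = ring_schedule(8)
for s, pairs in enumerate(schedule, start=1):
    print(f"step {s}: " + ", ".join(f"{src}->{dst}" for src, dst in pairs))
```

After the 7 steps, every process has received data from every other process exactly once, which is what makes the schedule an All-to-All exchange.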
When the Ring algorithm is employed, All-to-All communication is carried out efficiently unless a network configuration having the problem described below is employed.
Next, the network configuration is considered. As illustrated in FIG. 3, when only several computers are used, these computers can be connected by one network switch SW (hereinafter, simply referred to as a switch). Here, 8 computers are connected to one switch, and one process is invoked on each of the computers. In such a case, this network configuration is equivalent to a crossbar connection, and even when the All-to-All communication is conducted between the invoked processes, no conflict occurs at the network links.
On the other hand, when the number of computers to be used exceeds the number of computers that can be connected to one switch, a multi-stage network configuration in which the switches are arranged in two or more stages is employed. A simple multi-stage switch configuration is a tree network as illustrated in FIG. 4. In FIG. 4, for convenience of explanation, the number of computers and the number of processes that are invoked are the same as those in FIG. 3. However, it is assumed that the number of ports of the switch SW is less than 8. In addition, FIG. 4 illustrates a communication state when the processes 0 to 3 transmit data in the case of FIG. 2E. Because the number of switches in the upper layer of the two-stage switch configuration is limited to “1”, which is the minimum, the bandwidth of the links between the upper-layer switch and the lower-layer switches is insufficient. Therefore, when the All-to-All communication is conducted among the 8 processes, conflicts occur at the links between the upper-layer switch and the lower-layer switches, as indicated by circles in FIG. 4, and the throughput is lowered.
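The conflict in the tree network can be illustrated by counting the transfers that share one link. The following sketch rests on assumptions that are not stated in the text: the processes 0 to 3 are connected to one lower-layer switch, the processes 4 to 7 to another, and each lower-layer switch has a single link to the upper-layer switch. The names `lower_switch` and `uplink_load` are hypothetical.

```python
from collections import Counter

PROCS_PER_LOWER_SWITCH = 4  # assumed layout: processes 0-3 and processes 4-7


def lower_switch(proc):
    """Lower-layer switch to which the computer running this process is connected."""
    return proc // PROCS_PER_LOWER_SWITCH


def uplink_load(transfers):
    """Count the transfers that cross each lower-layer switch's single uplink."""
    load = Counter()
    for src, dst in transfers:
        # Only traffic between different lower-layer switches uses the uplink.
        if lower_switch(src) != lower_switch(dst):
            load[lower_switch(src)] += 1
    return load


# Step of FIG. 2E: processes 0 to 3 transmit to processes 4 to 7.
transfers = [(p, (p + 4) % 8) for p in range(4)]
load = uplink_load(transfers)
print(dict(load))
```

Under these assumptions all four transfers share the uplink of the lower-layer switch 0, so each transfer can use at most a quarter of the link bandwidth.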
Therefore, when the network performance is emphasized, a Fat-tree is often employed. In the example of FIG. 5, the number of upper-layer switches is increased to “3”, and the number of links from each lower-layer switch to the upper-layer switches (also called uplinks) is identical to the number of links from the lower-layer switch to the computers (also called downlinks). Incidentally, the simple squares connected to the lower-layer switches in FIG. 5 represent the computers, and the numbers within parentheses in the squares represent identifiers of the computers. Moreover, circles represent processes, and the number within each circle represents a process identifier. FIG. 5 represents a state in which the processes “0” to “8” are respectively invoked on the computers “3” to “11”. Furthermore, in the state of FIG. 5, the upper-layer switch whose number is shown under the square of a computer is allocated to that computer. For example, the upper-layer switch 0 is allocated to the computer 3 on which the process 0 is invoked, and when data is transmitted to the computer 3, the data reaches the computer 3 through the upper-layer switch 0. In addition, the upper-layer switch 2 is allocated to the computer 17, and when data is transmitted to the computer 17, the data reaches the computer 17 through the upper-layer switch 2.
Then, a case is considered in which the three processes 0 to 2 respectively transmit data to the processes 3 to 5, each of which is identified by adding “3” to the process number of the transmitting process. In such a case, data is transmitted from the process 0 through the upper-layer switch 0, which is allocated to the process 3, from the process 1 through the upper-layer switch 1, which is allocated to the process 4, and from the process 2 through the upper-layer switch 2, which is allocated to the process 5. Thus, no link conflict occurs.
However, in a network such as the aforementioned InfiniBand, in which the packet transfer route cannot be dynamically changed (in other words, a static routing network), the communication may be concentrated on a specific link between an upper-layer switch and a lower-layer switch. FIG. 6 represents an example of such concentration of the communication. In the example of FIG. 6, 16 computers are connected by a Fat-tree network including 4 upper-layer switches and 4 lower-layer switches. Generally, in an HPC system, the computers to be used are selected according to the operational states of the respective computers, and processes are invoked on the selected computers. However, depending on the combination of the selected computers, the upper-layer switches through which packets addressed to the selected computers pass may be unevenly distributed.
In the example of FIG. 6, data transmission to the computers on which the processes “0”, “4”, “6” and “7” were invoked is set so as to pass through the upper-layer switch “0”. Data transmission to the computers on which the processes “1” and “5” were invoked is set so as to pass through the upper-layer switch “1”. Data transmission to the computer on which the process “2” was invoked is set so as to pass through the upper-layer switch “2”. Data transmission to the computer on which the process “3” was invoked is set so as to pass through the upper-layer switch “3”. Then, when the computers on which the processes “0” to “3” are invoked transmit data as illustrated in FIG. 2E, the uplink to the upper-layer switch “0” is used for all of the communication except the data transmission from the computer on which the process “1” was invoked to the computer on which the process “5” was invoked. Therefore, as illustrated by the circle in FIG. 6, the link conflict occurs.
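The concentration in FIG. 6 can be checked by looking up, for each destination, the upper-layer switch through which its packets pass. In the following minimal sketch, the table `UPPER_SWITCH_FOR_DEST` holds the settings quoted above; the assumption that the computers of the processes “0” to “3” share one lower-layer switch, and therefore one uplink per upper-layer switch, is made for illustration and the names are hypothetical.

```python
from collections import Counter

# Upper-layer switch through which packets addressed to each process's
# computer pass (process number -> upper-layer switch), as quoted in the text.
UPPER_SWITCH_FOR_DEST = {0: 0, 4: 0, 6: 0, 7: 0, 1: 1, 5: 1, 2: 2, 3: 3}

# Step of FIG. 2E: processes 0 to 3 transmit to processes 4 to 7.
transfers = [(p, p + 4) for p in range(4)]

# Assumption: the four senders hang off one lower-layer switch, so transfers
# routed through the same upper-layer switch share the same uplink.
uplink_load = Counter(UPPER_SWITCH_FOR_DEST[dst] for _, dst in transfers)
conflicts = {sw: n for sw, n in uplink_load.items() if n > 1}
print(dict(uplink_load), conflicts)
```

Three of the four transfers are routed through the upper-layer switch “0”, while the uplinks to the upper-layer switches “2” and “3” stay idle in this step.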
Thus, even when the Fat-tree network is employed, the communication load may be concentrated on a specific link, and the communication performance is lowered. This may occur when a static routing network such as InfiniBand is employed. In other words, the problem appears when only one network identifier used for routing in the network (called LID in InfiniBand) is allocated to each computer, and the association between this network identifier and the packet transfer destination port is fixed in each switch. Specifically, as illustrated in the lowest portion of FIG. 6, the LID of the leftmost computer is “1”, and packets addressed to this computer always pass through the upper-layer switch “0”. Similarly, the LID of the rightmost computer is “16”, and packets addressed to this computer always pass through the upper-layer switch “3”.
Furthermore, as described above, the computers on which the parallel processes are invoked are selected without taking into account the relation between the LID and the route setting. Therefore, the communication routes used for the data transmission (here, the upper-layer switches through which packets pass) may be unevenly distributed.
Incidentally, the All-to-All communication was explained as an example in which the link conflict easily occurs. However, when the computers illustrated in FIG. 6 are selected and the processes are invoked, data communication passing through the upper-layer switch “0” also occurs easily in other cases (e.g. when peer-to-peer communication is simultaneously carried out between plural combinations of the processes), and the possibility that the link conflict occurs increases as a result.
In order to solve such a problem, a method has been proposed in which plural LIDs, each associated with one of the upper-layer switches, are allocated to each computer, and an LID is selected for each computer so that the same upper-layer switch is not allocated repeatedly among the computers on which the parallel processes are invoked. FIG. 7 illustrates an example.
In FIG. 7, a Fat-tree network including 3 upper-layer switches and 6 lower-layer switches is represented, and 18 computers in total are connected to the lower-layer switches 0 to 5. Here, an LID for a communication route passing through the upper-layer switch A, LID for a communication route passing through the upper-layer switch B and LID for a communication route passing through the upper-layer switch C are allocated to each of the computers. In the lower portion of FIG. 7, for example, as for the computer “0”, an LID “4” for the communication route passing through the upper-layer switch A (“A” is illustrated in the parentheses), LID “5” for the communication route passing through the upper-layer switch B and LID “6” for the communication route passing through the upper-layer switch C are allocated. Similarly, as for the computer “10” (which is a computer disposed at the left of the computer “11”), an LID “44” for the communication route passing through the upper-layer switch A, LID “45” for the communication route passing through the upper-layer switch B and LID “46” for the communication route passing through the upper-layer switch C are allocated.
In such a state, as surrounded by a dotted line in the lower portion of FIG. 7, an LID “16” is used by the computer “3” on which the process “0” is invoked, an LID “21” is used by the computer “4” on which the process “1” is invoked, an LID “26” is used by the computer “5” on which the process “2” is invoked, an LID “28” is used by the computer “6” on which the process “3” is invoked, an LID “33” is used by the computer “7” on which the process “4” is invoked, an LID “38” is used by the computer “8” on which the process “5” is invoked, an LID “40” is used by the computer “9” on which the process “6” is invoked, and an LID “45” is used by the computer “10” on which the process “7” is invoked. Thus, the LIDs are selected so that the associated upper-layer switches are evenly distributed. Such a method for selecting the LIDs is effective when the number of computers on which the processes are invoked is a multiple of the number of upper-layer switches to be used. However, such a condition is not always satisfied.
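The LID values quoted for FIG. 7 are consistent with a simple numbering in which the computer c owns the LIDs 4(c + 1), 4(c + 1) + 1 and 4(c + 1) + 2 for the upper-layer switches A, B and C, and in which the process p uses the LID of the upper-layer switch numbered p mod 3. The following sketch reproduces the quoted values under that inferred numbering; the formula and the names `lids_of` and `select_lid` are assumptions made for illustration, not a scheme stated in the text.

```python
UPPER_SWITCHES = ["A", "B", "C"]


def lids_of(computer):
    """LIDs allocated to one computer, keyed by upper-layer switch (inferred numbering)."""
    base = 4 * (computer + 1)
    return {sw: base + i for i, sw in enumerate(UPPER_SWITCHES)}


def select_lid(process, computer):
    """Pick the upper-layer switch round-robin by process number and return (switch, LID)."""
    sw = UPPER_SWITCHES[process % len(UPPER_SWITCHES)]
    return sw, lids_of(computer)[sw]


# Processes 0 to 7 are invoked on the computers 3 to 10.
selection = {p: select_lid(p, p + 3) for p in range(8)}
for p, (sw, lid) in selection.items():
    print(f"process {p} on computer {p + 3}: LID {lid} via upper-layer switch {sw}")
```

With 8 processes and 3 upper-layer switches, the switches A and B are each selected three times and the switch C only twice, which is the situation in which the multiple-of-the-switch-count condition does not hold.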
For example, as specifically illustrated in FIG. 7, a case is considered in which 3 upper-layer switches are used and 8 processes are invoked, so that the aforementioned condition is not satisfied. In such a case, it is assumed that the processes “3” to “5” transmit data to the processes “6”, “7” and “0”, respectively, each of which is identified by adding “3” to the process number of the transmitting process. The process “0” is selected because there is no process “8”. Then, the lower-layer switch 2, which is connected to the computers “6” to “8” on which the processes “3” to “5” are invoked, is commonly used, and both the LID “40” of the process “6”, which is the transmission destination of the process “3”, and the LID “16” of the process “0”, which is the transmission destination of the process “5”, are associated with the upper-layer switch “A”. Therefore, the link conflict occurs.
No conventional technique addresses such a problem.
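The wrap-around conflict can be reproduced by counting, for each uplink, the transfers that use it. The following minimal sketch assumes that the process p runs on the computer p + 3, that the computers are connected three by three to the lower-layer switches in numerical order, and that packets addressed to the process p pass through the upper-layer switch numbered p mod 3 (A, B, C). These assumptions match the values quoted for FIG. 7 but are not stated as a general rule in the text, and the function names are hypothetical.

```python
from collections import Counter

NUM_UPPER = 3                   # upper-layer switches A, B and C
COMPUTERS_PER_LOWER_SWITCH = 3  # 18 computers on 6 lower-layer switches


def upper_switch(proc):
    """Upper-layer switch associated with the LID selected for this process."""
    return "ABC"[proc % NUM_UPPER]


def lower_switch(proc):
    """Lower-layer switch of the computer (p + 3) on which process p is invoked."""
    return (proc + 3) // COMPUTERS_PER_LOWER_SWITCH


def uplink_conflicts(shift, num_procs):
    """Return the uplinks (lower switch, upper switch) used by more than one transfer."""
    load = Counter()
    for src in range(num_procs):
        dst = (src + shift) % num_procs  # destination wraps around to process 0
        if lower_switch(src) != lower_switch(dst):
            load[(lower_switch(src), upper_switch(dst))] += 1
    return {link: n for link, n in load.items() if n > 1}


print(uplink_conflicts(3, 8))  # 8 processes: not a multiple of 3
print(uplink_conflicts(3, 9))  # 9 processes: a multiple of 3
```

With 8 processes the uplink from the lower-layer switch 2 to the upper-layer switch A carries two transfers, whereas with 9 processes the same shift produces no shared uplink under these assumptions.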