The present invention relates to a method of data communication between a plurality of computers, including parallel computers connected to a network. More specifically, the present invention relates to a method of speeding up Alltoall communications in a message passing interface MPI, which is a standard interface for data communication between computers.
Parallel computers can be broadly divided into distributed-memory parallel computers and shared-memory parallel computers. Distributed-memory parallel computers are currently more popular since they offer greater ease in scalability. The programming model for standard distributed-memory parallel computers usually involves a message-passing model. In a message-passing model, programs must explicitly specify communication functions. In other words, if data to be used in a computation is not present on the local computer, data transfer is performed by issuing a send function to the computer that holds the data and by issuing a receive function at the local computer.
In order to standardize the interface for these communication functions, the MPI Forum recently put together specifications for a message-passing interface known as MPI (Message Passing Interface). In addition to one-to-one communications between a pair of computers, MPI defines collective communications that take place between all computers. This type of collective communication includes a form of communication known as Alltoall communication. Alltoall communications involve a communication operation where each computer sends distinct data to each sending computer and is used in matrix transposition and n-body problems. The following is a description of the Alltoall function interface. Some of the arguments are simplified to facilitate the discussion. The name and arguments of the Alltoall function are as follows.
Alltoall (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype)
The arguments sendbuf, sendcount and sendtype refer, respectively, to the starting address at the local computer holding the data array to be sent, the number of data units to be sent to each computer, and the data type. The arguments recvbuf, recvcount and recvtype refer, respectively, to the starting address at the local computer holding the data to be received, the number of data units to be received from each computer, and the data type. Possible data types include INTEGER, which represents integers, REAL, which represents real numbers, and COMPLEX, which represents complex numbers. In order to be able to clearly identify communication targets, MPI assigns each computer an identification number referred to as its rank and this rank is used to specify the communication target. In the following presentation, a computer assigned rank n is referred to as the computer n.
FIG. 4 (A) shows a parallel computer in which six computers (numbered 402 through 407 and representing the computer 0 through the computer 5) are connected via a network (401). A detail drawing 408 shows the memory regions in the computer 0. Data arrays 409, 410 are the data arrays stored in the sendbuf send buffer and the recvbuf receive buffer, respectively, as specified by the parameters to Alltoall. The i-th unit 0i (where i is a number from 0 to 5) in the data array 409 is the data to be sent by the computer 0 to the computer i. The computer 0 stores data received from the computer i to the i-th unit i0 (where i is a number from 0 to 5) to the data array 410.
FIG. 2 is a representation of the data transfer operation performed in Alltoall communication between the six computers. Reference numeral 201 denotes the data arrays stored in the send buffer of each computer before Alltoall communications are begun, as ordered by rank. Reference 202 denotes the data arrays stored in the receive buffer of each computer after Alltoall communications have been performed, as ordered by rank. In this figure, data arrays 203, 205, 207, 209, 211, 213 represent the data in the send buffers of the computer 0 through the computer 5, respectively. Data arrays 204, 206, 208, 210, 212, 214 represent the data in the receive buffers of the computer 0 through the computer 5, respectively.
The following is a description of a sample data array transfer operation performed by the computer 0. The following send operations are performed by the computer 0.
(a) The 0-th unit 00 in the send buffer 203 is sent to the 0-th unit in the local receive buffer 204.
(b) The first unit 01 in the send buffer 203 is sent to the 0-th unit in the receive buffer 206.
(c) The second unit 02 in the send buffer 203 is sent to the 0-th unit in the receive buffer 208.
(d) The third unit 03 in the send buffer 203 is sent to the 0-th unit in the receive buffer 210.
(e) The fourth unit 04 in the send buffer 203 is sent to the 0-th unit in the receive buffer 212.
(f) The fifth unit 05 in the send buffer 203 is sent to the 0-th unit in the computer 5 of the receive buffer 214.
The following receive operations are performed by the computer 0.
(a) The 0-th unit 00 from the send buffer 203 at the local computer is received in the 0-th unit of the receive buffer 204.
(b) The 0-th unit 10 from the send buffer 205 at the first computer is received in the first unit of the receive buffer 204.
(c) The 0-th unit 20 from the send buffer 207 at the second computer is received in the second unit of the receive buffer 204.
(c) The 0-th unit 30 from the send buffer 209 at the third computer is received in the third unit of the receive buffer 204.
(e) The 0-th unit 40 from the send buffer 211 at the fourth computer is received in the fourth unit of the receive buffer 204.
(f) The 0-th unit 50 from the send buffer 211 at the fifth computer is received in the fifth unit of the receive buffer 204.
In general, the j-th unit from the send buffer at the computer i is sent to the i-th unit in the receive buffer of the computer j, and the i-th unit from the computer j is received in the j-th unit of the receive buffer at the computer i. In other words, looking at the figures showing the data from the send and receive buffer of each computer, the Alltoall function involves a manner of transferring data similar to performing a matrix transpose operation.
For parallel computers in which the computers can perform one-to-one communication with each other, Japanese laid-open patent publication number 8-263449 describes a method of allowing each computer to send data to all other computers. This method can be used to implement Alltoall communications in parallel computers. This method does not include an operation for the computer to send data to itself, but Alltoall communication can be performed in a parallel manner by adding an operation to send data to itself. Using this method along with this additional operation of sending data to itself, n computers require 2**L steps to perform the data transfer (with L being a positive integer where 2**(L-1)&lt;n&gt;=2 **L). Data transfer is performed by having, at each step, each computer communicating with the computer assigned to a rank number resulting from an exclusive OR operation performed on the binary representation of the local computer rank number from 0 to n-1 and the binary representation of the step number from 1 to 2**(L-1).
The following is a description of a specific example of this method when used with six computers (2**2&lt;6&lt;=2**3) as shown in FIGS. 3(A) and 3(B). FIG. 3(A) is a table showing, for each computer, the communication target computer at each step. For example, at step 2, the binary representation of the rank of each computer (0, 1, 2, 3, 4, 5) would be (000, 001, 010, 011, 100, 101). Taking the exclusive OR of these values and the binary representation of step number 2 results in (010, 011, 000, 001, 110, 111). In other words, the communication targets for the computers (0, 1, 2, 3, 4, 5) at step 2 are the computers (2, 3, 0, 1, 6, 7). Since the j-th unit from the send buffer of a computer i must be sent to the i-th unit in the receive buffer of a computer j, the following communication operations are performed at step 2.
(a) The computer 0 sends the second unit 02 from its send buffer to the 0-th unit in the receive buffer of the computer 2.
(b) The computer 1 sends the third unit 13 from its send buffer to the first unit in the receive buffer of the computer 3.
(c) The computer 2 sends the 0-th unit 20 from its send buffer to the second unit in the receive buffer of the computer 0.
(d) The computer 3 sends the first unit 31 from its send buffer to the third unit in the receive buffer of the computer 1.
Since the communication targets for the computers (4, 5) are the computers (6, 7), which do not exist, the computers (4, 5) must wait until the communication operations for the computers (0, 1, 2, 3) are completed. With operations being performed as in step 2 for the other steps, six computers would complete communication operations after seven (=2** 3-1) steps using this method. FIG. 3(B) shows the data exchanges performed at each step according to FIG. 3(A). When the operation for step 0 (each computer sends the i-the unit ii from its send buffer to the i-th unit in the receive buffer) is added to send data to the local computer, this method achieves Alltoall communication in eight steps. This method is referred to as the pairwise method in this specification.
With the pairwise method described above, the number of communication steps required for Alltoall communications in n computers is a power of 2. This means that the processing time increases in proportion to the number of computers.