In recent years, cluster systems, in which a large number of small-scale computers are coupled to execute parallel processing, have become available as HPC (high performance computing) systems. In particular, the PC (personal computer) cluster system, in which IA (Intel architecture) servers are coupled through a high-speed network, is widely used.
When a parallel program is executed in the cluster system, the processes started for the program are distributed across the multiple servers. Thus, when the processes need to exchange data, communication between the servers is required. Accordingly, improving the performance of inter-server communication is crucial to improving the processing performance of the cluster system. To achieve high-performance inter-server communication, it is important to prepare a high-performance communication library in addition to a high-performance network such as InfiniBand or Myrinet. In many cases, a parallel program executed in the cluster system is written using a communication API (application program interface) called MPI (message passing interface), and various MPI communication libraries have been implemented and provided.
The types of communication between processes vary a great deal from one parallel program to another, and one type considered particularly important is all-to-all communication. All-to-all communication is, as the name implies, a communication pattern in which every process sends data to, and receives data from, all processes. In MPI, all-to-all communication is provided by the function MPI_Alltoall( ).
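The data movement of an all-to-all exchange can be illustrated with a small single-machine simulation. This is a minimal sketch in plain Python, not a call into any MPI library; the helper name `alltoall` and the string payloads are hypothetical and serve only to show which block ends up where.

```python
def alltoall(send):
    """Simulate an all-to-all exchange among n processes.

    send[i][j] is the block that process i sends to process j.
    The result recv satisfies recv[j][i] == send[i][j], i.e. process j
    ends up holding one block from every process.
    """
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


# Three hypothetical processes; each block is labeled "sender->receiver".
send = [[f"{i}->{j}" for j in range(3)] for i in range(3)]
recv = alltoall(send)
print(recv[2])  # ['0->2', '1->2', '2->2']
```

In effect the exchange transposes the matrix of blocks: row i of the send side becomes column i of the receive side.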
Various communication algorithms for achieving all-to-all communication are available. Of these, a ring algorithm is often used when the data size is relatively large and performance is limited by network bandwidth.
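The text does not spell out the ring algorithm, so the following is a sketch of one common formulation, stated here as an assumption: with n processes, the exchange takes n - 1 steps, and in step s the process of rank r sends to rank (r + s) mod n while receiving from rank (r - s) mod n. The function name `ring_schedule` is hypothetical.

```python
def ring_schedule(n):
    """Return the communication steps of a ring-style all-to-all.

    Assumed formulation: in step s (1 <= s <= n - 1), rank r sends its
    block to rank (r + s) % n. Each step is returned as a dict mapping
    sender rank -> destination rank.
    """
    return [{r: (r + s) % n for r in range(n)} for s in range(1, n)]


steps = ring_schedule(4)
for s, step in enumerate(steps, start=1):
    print(f"step {s}: {step}")
```

Under this formulation every step is a permutation of the ranks, so each process sends exactly one block and receives exactly one block per step, and after n - 1 steps every ordered pair of distinct ranks has communicated once.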
As a result of increased utilization of multiple cores for processors, such as IA processors, servers included in a cluster system are typically equipped with multi-core processors. In a multi-core processor, each processor core often executes a process. For example, in a cluster system including servers each having two quad-core CPUs (a total of eight cores), it is not uncommon for eight processes to be executed per server during execution of a parallel program. The number of processes per server will hereinafter be referred to as the “number of per-server processes”.
Many currently available communication algorithms, such as the ring algorithm, are devised and implemented on the premise of a single process per server, and are not appropriate for a cluster system including servers equipped with multi-core processors. In practice, when the effective network bandwidth is measured during all-to-all communication based on the ring algorithm using 16 servers while the number of per-server processes is varied among 1, 2, 4, and 8, the effective network bandwidth is found to decrease when the number of per-server processes is large. With two or more per-server processes, all-to-all communication using the ring algorithm causes a conflict called HOL (head of line) blocking in a network switch, which reduces the effective network bandwidth. HOL blocking is a phenomenon that occurs when packets are simultaneously transferred from multiple input ports to the same output port, and it delays packet transfer because the packets contend for a buffer in the output port.
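The condition for this conflict, several input ports targeting the same output port at once, can be checked for a ring-style schedule with a short counting sketch. It rests on assumptions that are not stated in the text: ranks are placed on servers in contiguous blocks (rank r runs on server r // p), every server hangs off one port of a single switch, and in step s rank r sends to rank (r + s) mod n. The function name is hypothetical, and the sketch only counts contending source servers; it is not a bandwidth model.

```python
from collections import defaultdict


def max_source_servers_per_destination(num_servers, procs_per_server):
    """Worst case, over all ring steps, of how many distinct source
    servers send to the same destination server in a single step.

    Assumes block placement of ranks (rank r on server r // p) and the
    ring rule "rank r sends to (r + step) % n". Intra-server transfers
    are ignored because they do not cross the switch.
    """
    n = num_servers * procs_per_server
    worst = 0
    for step in range(1, n):
        sources = defaultdict(set)  # destination server -> source servers
        for rank in range(n):
            dst = (rank + step) % n
            src_server = rank // procs_per_server
            dst_server = dst // procs_per_server
            if src_server != dst_server:
                sources[dst_server].add(src_server)
        if sources:
            worst = max(worst, max(len(s) for s in sources.values()))
    return worst


for p in (1, 2, 4, 8):
    print(p, max_source_servers_per_destination(16, p))
```

Under these assumptions, one process per server yields exactly one source server per destination in every step, whereas two or more processes per server produce steps in which two different servers send to the same destination server simultaneously, which matches the situation in which HOL blocking is described as arising.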
Thus, the known all-to-all inter-process communication algorithm is not appropriate for a cluster system including servers that each execute multiple processes. As a result, when the known algorithm is used to perform inter-process communication in such a cluster system, the performance of the entire system may not be fully exploited.