1. Field of the Invention
This invention relates to network arrangements and protocols for real-time communications. More particularly, this invention relates to organizing the transmission of messages in a fabric.
2. Description of the Related Art
The meanings of certain acronyms and abbreviations used herein are given in Table 1.
TABLE 1
Acronyms and Abbreviations
DLID    Destination LID (Destination Address)
HPC     High Performance Computing
LID     Local Identifier (Address)
MPI     Message Passing Interface
NCC     NIC Communicator Controller
NIC     Network Interface Card
QP      Queue Pair
WQE     Work Queue Element
Message Passing Interface (MPI) is a communication protocol that is widely used for exchange of messages among processes in high-performance computing (HPC) systems. The current MPI standard is published by the MPI Forum as the document MPI: A Message-Passing Interface Standard, Ver. 3.1; Jun. 4, 2015, which is available on the Internet and is herein incorporated by reference.
MPI supports collective communication in accordance with a message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process in a process group. MPI provides point-to-point and collective operations that can be used by applications. These operations are associated with a defined object called a communicator. Communicators provide a mechanism to construct distinct communication spaces in which process groups can operate. Each process group is associated with a communicator and has a communicator identifier that is unique with respect to all processes inside the communicator. There is a default communicator that contains all the processes in an MPI job, which is called MPI_COMM_WORLD.
Typically, high-performance computing (HPC) systems contain thousands of nodes, each having tens of cores. It is common in MPI to bind each process to a core. When launching an MPI job, the user specifies the number of processes to allocate for the job. These processes are distributed among the different nodes in the system. The MPI operations alltoall and alltoallv are some of the collective operations (sometimes referred to herein as “collectives”) supported by MPI. These collective operations scatter or gather data from all members to all members of a process group. In the operation alltoall, each process in the communicator sends a fixed-size message to each of the other processes. The operation alltoallv is similar to the operation alltoall, but the messages may differ in size.
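By way of illustration only, the data movement of the alltoall operation described above can be modeled as a transpose of per-process send buffers. The following is a minimal sketch in plain Python and does not use any MPI library; the function name alltoall and the list-of-lists buffer layout are assumptions made for this example. As in the MPI operation, each process also holds a block addressed to itself, which simply stays local.

```python
def alltoall(send_bufs):
    """Simulate alltoall in one address space (illustrative, not an MPI call).

    send_bufs[i][j] is the block that process i addresses to process j.
    The result recv[j][i] is the block that process j receives from process i.
    """
    p = len(send_bufs)
    if any(len(row) != p for row in send_bufs):
        raise ValueError("each process must supply one block per process")
    # Process j collects, in rank order, the block every process i addressed to it.
    return [[send_bufs[i][j] for i in range(p)] for j in range(p)]


# Three processes; block "i->j" travels from process i to process j.
send = [[f"{i}->{j}" for j in range(3)] for i in range(3)]
recv = alltoall(send)
print(recv[2])  # ['0->2', '1->2', '2->2']
```

Because the blocks in this sketch may be arbitrary Python objects, the same function also models alltoallv when the blocks differ in size.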
Typically, MPI jobs allocate thousands of processes, spread among thousands of nodes. The number of nodes in an MPI job is denoted as N, and the number of processes of the MPI job per node as P, which leads to a total number of N*P processes. Thus, in alltoall (or alltoallv) collectives among the N*P processes of the MPI job, each process sends (N−1)*P messages to the processes on other nodes. Therefore, each node outputs (N−1)*P^2 messages to the network, leading to a total number of N*(N−1)*P^2 messages in the fabric.
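The message counts above can be checked with a short calculation, taking P as the number of processes on each node, as the total of N*P processes implies. The following sketch is illustrative only; the function name alltoall_message_counts and the sample values N=1000 and P=10 are assumptions chosen to match "thousands of nodes" and "tens of cores."

```python
def alltoall_message_counts(num_nodes, procs_per_node):
    """Return (per_process, per_node, total) inter-node message counts."""
    n, p = num_nodes, procs_per_node
    per_process = (n - 1) * p      # one message to every process on the other nodes
    per_node = per_process * p     # P processes per node, i.e. (N-1)*P^2
    total = n * per_node           # N nodes, i.e. N*(N-1)*P^2
    return per_process, per_node, total


print(alltoall_message_counts(1000, 10))  # (9990, 99900, 99900000)
```

With these sample values, a single alltoall places nearly one hundred million messages in the fabric.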
Assuming the value of N to be in the thousands and P in the tens, the number of messages in the fabric creates network congestion and incurs overhead in posting them to the network interface. The overhead becomes especially significant when the message payload is small, as each message requires both MPI and transport headers. Some MPI software implementations attempt to moderate the number of messages, but still do not make optimal use of the bandwidth of the fabric.
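The effect of per-message headers on small payloads can be illustrated by computing the fraction of transmitted bytes that carry payload. The following is a sketch under stated assumptions: the function name payload_efficiency is hypothetical, and the header sizes of 16 and 32 bytes are placeholder values chosen for illustration, not figures taken from the MPI standard or from any particular transport.

```python
def payload_efficiency(payload_bytes, mpi_header_bytes, transport_header_bytes):
    """Fraction of the bytes on the wire that are application payload."""
    total = payload_bytes + mpi_header_bytes + transport_header_bytes
    return payload_bytes / total


# Hypothetical header sizes (assumptions for illustration only).
MPI_HEADER = 16
TRANSPORT_HEADER = 32

for payload in (8, 64, 4096):
    eff = payload_efficiency(payload, MPI_HEADER, TRANSPORT_HEADER)
    print(f"{payload:5d} bytes payload: {eff:.1%} of the bytes sent are payload")
```

Under these assumed sizes, an 8-byte payload accounts for only a small fraction of each message, whereas a 4096-byte payload accounts for nearly all of it.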