Parallel computing systems comprise a plurality of nodes. For instance, a parallel computing system may include a plurality of processors and/or a plurality of processor cores. Each node of a parallel computing system is capable of performing data computation independently of the other nodes of the parallel computing system. Applications written for parallel computing systems exploit this parallelism by distributing their workload across multiple nodes. Each node of a parallel computing system may independently execute one or more processes (each process being part of a larger application run on the parallel computing system). In such parallel computing systems, processes communicate with other processes to share data. A parallel computing system typically uses a communication protocol to implement this sharing of data.
A Message Passing Interface (MPI) is a language-independent communication protocol used by many parallel computing systems. An MPI may be implemented in any number of programming languages. An MPI provides virtual topology, synchronization, and communication functionality among a set of processes. Among other operations, an MPI typically supports both point-to-point and collective communications between processes. Point-to-point operations involve the communication of data between two processes (e.g., one process sends a message that a second process receives). Collective operations involve the communication of data among all processes in a process group (e.g., one process broadcasts a value to every other process in the group). A process group may include all of the processes running on the parallel computing system or an application-defined subset of those processes.
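By way of illustration only, the distinction between point-to-point and collective operations can be sketched in a few lines of Python. The sketch below is not an MPI implementation and uses no MPI library: threads stand in for processes, in-memory queues stand in for the communication fabric, and the names `run_group`, `point_to_point`, and `broadcast` are hypothetical helpers introduced for this example.

```python
import queue
import threading


def run_group(size, target):
    """Run target(rank, size, mailboxes) once per simulated process.

    Each "process" is a thread with its own inbox queue. Returns the
    list of return values, indexed by rank.
    """
    mailboxes = [queue.Queue() for _ in range(size)]
    results = [None] * size

    def worker(rank):
        results[rank] = target(rank, size, mailboxes)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def point_to_point(rank, size, mailboxes):
    """Point-to-point: rank 0 sends one message, only rank 1 receives it.

    Requires a group of at least two ranks.
    """
    if rank == 0:
        mailboxes[1].put("hello from 0")
        return None
    if rank == 1:
        return mailboxes[1].get()  # blocks until the message arrives
    return None  # all other ranks take no part in the exchange


def broadcast(rank, size, mailboxes):
    """Collective: every rank in the group takes part in the operation."""
    root = 0
    if rank == root:
        value = 42
        for dest in range(size):
            if dest != root:
                mailboxes[dest].put(value)
        return value
    return mailboxes[rank].get()  # every non-root rank receives the value


if __name__ == "__main__":
    print(run_group(4, point_to_point))
    print(run_group(4, broadcast))
```

With a group of four simulated processes, the point-to-point call leaves only rank 1 holding the message, whereas the broadcast leaves every rank holding the value 42. This simple broadcast sends one message per destination from the root; actual implementations may use other algorithms, and the choice of algorithm is one kind of parameter that may be tuned.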
To increase performance, the configuration parameters of a communication protocol, such as an MPI, may be tuned for a particular application and/or a particular parallel computing system. Manual testing and selection of these configuration parameters often require many hours of tedious tuning work. This tuning work must be repeated for each application run on the parallel computing system. Any change to an application or to the composition of the parallel computing system (e.g., the number of nodes) may also require re-tuning of the configuration parameters of the communication protocol.
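The cost of such tuning can be seen in a minimal sketch of an exhaustive parameter sweep, offered as an illustration only. The parameter names `eager_threshold_bytes` and `bcast_algorithm`, their candidate values, and the cost model in `fake_measure` are all assumptions made up for this example; they are not taken from any particular MPI implementation. In practice, `measure` would launch the application on the parallel computing system and time it.

```python
import itertools


def tune(param_grid, measure):
    """Try every combination in param_grid; return (best_settings, best_cost).

    param_grid maps a parameter name to a list of candidate values.
    measure(settings) returns a cost (lower is better), e.g. run time.
    """
    names = sorted(param_grid)
    best_settings, best_cost = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        settings = dict(zip(names, values))
        cost = measure(settings)
        if cost < best_cost:
            best_settings, best_cost = settings, cost
    return best_settings, best_cost


# Hypothetical parameters and candidate values (assumptions for this sketch).
GRID = {
    "eager_threshold_bytes": [1024, 4096, 16384],
    "bcast_algorithm": ["binomial_tree", "ring"],
}


def fake_measure(settings):
    """Made-up cost model standing in for a timed run of the application."""
    base = {"binomial_tree": 1.0, "ring": 1.5}[settings["bcast_algorithm"]]
    return base + abs(settings["eager_threshold_bytes"] - 4096) / 4096


if __name__ == "__main__":
    print(tune(GRID, fake_measure))
```

Even this two-parameter grid requires six measurements, and the count grows multiplicatively with each added parameter or candidate value. Because the measured cost depends on the application and on the composition of the system, the entire sweep would have to be rerun whenever either changes.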