Parallel computing systems with distributed memory generally consist of many nodes, each of which can perform computation independently of the others. Applications written to exploit this parallelism distribute their workload across multiple nodes as separate processes. Because each process can directly access only its own memory, the nodes must communicate with one another to share data, and this exchange of data follows a communication protocol.
MPI (Message Passing Interface) is a language-independent communications protocol used to program parallel computers. Although MPI is not sanctioned by any major standards body, it has become the de facto standard for communication among the processes of a parallel program running on a distributed memory system, and such programs are commonly run on distributed memory supercomputers such as computer clusters. MPI is a specification, not an implementation: it provides Language Independent Specifications (LIS) for its function calls, together with language bindings. An MPI library is generally implemented in a different language from the language or languages it supports at runtime. Most MPI implementations are written in a combination of C, C++, and assembly language and target C, C++, and Fortran programmers; in principle, however, the implementation language and the end-user language are always decoupled.
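To make the distributed-memory, message-passing model concrete, the following is a minimal sketch that uses Python's standard `multiprocessing` module rather than MPI itself. Each worker process holds a private copy of its share of the data, computes locally, and shares its result only by sending an explicit message to a coordinator. In an MPI program the workers would be ranks and the messages would travel through calls such as `MPI_Send` and `MPI_Recv` (or a collective such as `MPI_Reduce`). The function names `worker` and `distributed_sum_of_squares` are hypothetical, and the sketch assumes a POSIX system because it requests the `fork` start method.

```python
import multiprocessing as mp


def worker(rank, chunk, conn):
    """Runs in its own process with a private copy of `chunk` (no shared memory)."""
    partial = sum(x * x for x in chunk)  # purely local computation
    conn.send((rank, partial))           # share the result via an explicit message
    conn.close()


def distributed_sum_of_squares(data, nprocs=4):
    """Split `data` across `nprocs` processes and combine their partial sums.

    Assumption: the 'fork' start method (POSIX only) is used so that the sketch
    also runs when it is not imported from a file.
    """
    ctx = mp.get_context("fork")
    chunks = [data[r::nprocs] for r in range(nprocs)]  # round-robin decomposition
    receivers, procs = [], []
    for rank, chunk in enumerate(chunks):
        # One-way channel: the worker sends, the coordinator receives.
        recv_end, send_end = ctx.Pipe(duplex=False)
        p = ctx.Process(target=worker, args=(rank, chunk, send_end))
        p.start()
        send_end.close()  # the coordinator never sends on this channel
        receivers.append(recv_end)
        procs.append(p)

    total = 0
    for recv_end in receivers:
        _, partial = recv_end.recv()  # blocks until that worker's message arrives
        recv_end.close()
        total += partial
    for p in procs:
        p.join()
    return total


if __name__ == "__main__":
    print(distributed_sum_of_squares(list(range(10))))  # 285
```

The coordinator never reads a worker's memory directly; every piece of data it combines arrives as a message. That is the property that MPI standardizes across nodes and languages.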
One challenge in tuning the performance of a specific application that uses an MPI library is obtaining a representative application kernel, workload, or part thereof. Likewise, debugging works best when the investigating developer has a small but representative program (a so-called reproducer).
Unfortunately, more often than not the applications, the workloads, or both are sensitive, cannot be used without a special and very expensive license, or cannot be provided to the MPI development team for export control reasons. Likewise, writing a debugging reproducer requires a deep understanding of the application's internals, which is very time-consuming, or outright impossible if the original developer is unavailable. Moreover, even once a reproducer has been written, it may not be shareable with external parties for the reasons mentioned above. Any of these obstacles makes it impossible to reproduce the application's computational and communication load on machines that are not licensed to run it.