Many computational problems can be subdivided into independent or loosely dependent tasks, which can be distributed among a group of processors or systems (a “cluster”) and executed in parallel. This often permits the main problem to be solved faster than would be possible if all the tasks were performed by a single processor or system. In favorable cases, the processing time can be reduced in proportion to the number of processors or systems working on the sub-tasks.
Cooperating processors and systems (“workers”) can be coordinated as necessary by transmitting messages between them. Messages can also be used to distribute work and to collect results. Clusters that operate by passing messages along these lines are called message-passing interface, or “MPI,” clusters.
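The distribute-and-collect pattern described above can be illustrated with a minimal sketch. It is not an MPI program: threads stand in for workers and in-process queues stand in for the message channel, and the names `worker`, `run_cluster`, and the `None` end-of-work sentinel are hypothetical choices made only for this illustration.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Receive task messages until a None sentinel arrives; reply with results."""
    while True:
        message = tasks.get()
        if message is None:  # hypothetical sentinel meaning "no more work"
            break
        task_id, numbers = message
        # Each sub-task is independent: sum one slice of the data.
        results.put((task_id, sum(numbers)))


def run_cluster(data, num_workers):
    """Split `data` into sub-tasks, send them as messages, and combine the replies."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()

    # Distribute work: one message per sub-task.
    chunks = [data[i::num_workers] for i in range(num_workers)]
    for task_id, chunk in enumerate(chunks):
        tasks.put((task_id, chunk))

    # Tell every worker that no further tasks will arrive.
    for _ in threads:
        tasks.put(None)

    # Collect results: exactly one reply message per sub-task.
    partial_sums = [results.get() for _ in chunks]
    for thread in threads:
        thread.join()
    return sum(value for _, value in partial_sums)
```

In an actual MPI cluster, the queue operations would be replaced by send and receive calls that carry the same messages over one of the communication channels discussed below.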
Messages may be transferred from worker to worker over a number of different communication channels, or “fabrics.” For example, workers executing on the same physical machine may be able to communicate efficiently using shared memory. Workers on different machines may communicate through a high-speed network such as InfiniBand® (a registered trademark of the InfiniBand Trade Association), Myrinet® (a registered trademark of Myricom, Inc. of Arcadia, Calif.), Scalable Coherent Interface (“SCI”), or QSNet by Quadrics, Ltd. of Bristol, United Kingdom. When no other communication channel is available, a traditional data communication network such as Ethernet may be used.
Worker systems often have more than one communication channel available. For example, a system might have both an InfiniBand® interface and an Ethernet interface. (A system with more than one network interface is called “multi-homed.”) The faster InfiniBand® interface may be preferred for exchanging messages with other workers that also have an InfiniBand® interface, while the Ethernet interface may be used to communicate with a control or display system, since the speed (and expense) of a corresponding specialized network interface may not be justified on the control system.
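The channel preference described above can be sketched as choosing the most favorable fabric that two endpoints have in common. The preference order, the fabric labels, and the names `Worker` and `select_fabric` are assumptions made for this illustration; they are not drawn from any particular MPI implementation.

```python
from dataclasses import dataclass

# Hypothetical preference order, most favorable first. Real clusters may
# rank fabrics differently or include others (Myrinet, SCI, QSNet).
FABRIC_PREFERENCE = ["shared_memory", "infiniband", "ethernet"]


@dataclass(frozen=True)
class Worker:
    name: str
    host: str           # physical machine on which the worker runs
    fabrics: frozenset  # fabric labels available to this worker


def select_fabric(a, b):
    """Return the most favorable fabric both workers can use, or None."""
    common = a.fabrics & b.fabrics
    for fabric in FABRIC_PREFERENCE:
        # Shared memory is only usable between workers on the same machine.
        if fabric == "shared_memory" and a.host != b.host:
            continue
        if fabric in common:
            return fabric
    return None
```

Under these assumptions, two multi-homed workers on different machines that both list `"infiniband"` and `"ethernet"` would be paired over `"infiniband"`, while either of them would fall back to `"ethernet"` when paired with a control system that lists only `"ethernet"`.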
When a cluster includes many multi-homed systems, it can be difficult to configure the systems so that each one uses the most favorable communication channel to reach other workers. Systems may be geographically dispersed and/or may be administered by different managers or according to different conventions. Inconsistent or incorrect system configurations may result in workers choosing sub-optimal channels to communicate, which may in turn cause the cluster to fall short of its expected computational performance. Worse, cluster users (software developers and users of MPI software) may not have the skills needed to detect misconfigurations, or the access permissions needed to correct them.
Methods to alleviate the impact of incorrect and/or inconsistent system configurations on MPI cluster performance may be of value in the field.