Electronic circuit chips (or integrated semiconductor circuit) are being built with increasing numbers components integrated on the chips. A single chip is fabricated to hold an integration of multiple nodelets. Even still, each nodelet on a single chip can have a number of processors. Processors in a nodelet can be homogeneous (i.e., of the same type) or heterogeneous (i.e., of different types). Each nodelet has its memory system, however, memory between nodelets are not shared. That is, each nodelet has a separate memory coherence domain.
In a multi-node system, nodes communicate between each other by using one or more network protocols. For many applications, the amount of communication between neighboring nodes is higher than between remote nodes. Similarly, communications between neighboring nodes is more frequent than between the more remote nodes. Mapping logically “close” nodes to physically neighboring nodes reduces latency and power consumption. By mapping logically close nodes to nodes on the same chip, significant part of the communication stays on the chip. Nodelets participate in a larger multi-node system by network connections using a network protocol, typically using Message Passing Interface (MPI) protocol.
Network communication, however, still involves overhead such as the work that needs to be implemented for network protocol tasks, transmitting packets, and receiving packets.
Message Passing Interface (MPI) is a programming paradigm used for high performance computing (HPC). The model has become popular mainly due to its portability and support across HPC platforms. Because MPI programs are written in a portable manner, programmers optimize application-related aspects, such as computation and communication, but typically do not optimize for the execution environment. In particular, MPI tasks are often mapped to the processors in a linear order.
Determining the communication patterns of applications have been studied by A. Aggarwal, A. K. Chandra, and M. Snir. On communication latency in PRAM computation. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 11-21, June 1989, and by A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. Log GP: Incorporating long messages into the Log P model for parallel computation. Journal of Parallel and Distributed Computing, 44(1):71-79, 1997.
Independently of such communication pattern studies, another category of existing technology provides a model to guide the MPI programmer. However, early models explicitly ignored hardware characteristics to simplify the model. More recent models (see, D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. Log P: Towards a realistic model parallel computation. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, May 1993; and M. I. Frank, A. Agarwal, and M. K. Vernon. LoPC: Modeling contention in parallel algorithms. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, pages 276-287, June 1997) attempt to develop a theoretical model for generic networks. However, such modeling has not employed empirical data to improve the model accuracy. With the existing techniques, it is difficult to obtain performance benefits.