1. Field of the Invention
This invention relates in general to the field of supercomputer systems and architectures and, more particularly, to optimally mapping an arbitrary parallel application in order to minimize execution times on supercomputer systems.
2. Description of the Related Art
Massively parallel computing structures (also referred to as “ultra-scale computers” or “supercomputers”) interconnect large numbers of compute nodes. Quite often, the interconnect topology is in the form of regular structures, such as trees or grids, sometimes with periodic boundary conditions. The conventional approach for the most cost-effective ultra-scale computers has been to use standard processors (nodes) configured in uniprocessor or symmetric multiprocessor (SMP) configurations, with the interconnect network supporting message passing communications. Today, these supercomputing machines exhibit computing performance at the teraOPS scale. One example of a supercomputer is the Blue Gene/L Supercomputer (BG/L) announced by International Business Machines of Armonk, N.Y. General information regarding the BG/L architecture is available online (http://sc2002.org/paperpdfs/pap.pap207.pdf) and in a paper entitled “The Blue Gene/L Supercomputer”, by G. Bhanot, D. Chen, A. Gara and P. Vranas, Nucl. Phys. B (Proc. Suppl.) 119 (2003) 114, which is hereby incorporated by reference in its entirety. BG/L is a massively parallel computer with two data communication networks: a nearest neighbor network, with the topology of a 3-D torus, and a global tree. In normal usage, the torus is the primary communications network and is used both for point-to-point and for global or collective communications. The tree is typically used for collective communications.
Compute nodes on BG/L are logically arranged into a 3-D lattice, and the torus communications network provides physical links only between nearest neighbors in that lattice. All communications between nodes must therefore be routed over the available physical connections, and the cost of communication between nodes varies depending on the distance between the nodes involved and on other effects such as the availability of buffers, the number of available paths through the network, network contention, etc. A major challenge, then, is to optimally map an arbitrary parallel application so as to minimize the total execution time, which is a function of the time for communication and the time for computation.
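The dependence of communication cost on lattice distance can be illustrated with a minimal sketch. It assumes, for illustration only, that cost is the traffic volume multiplied by the minimum hop count between the two nodes on the torus; buffer availability, path multiplicity and contention, which the text above also names, are ignored. The names torus_distance, mapping_cost, traffic and placement, and the ring-shaped workload, are hypothetical and do not come from BG/L software.

```python
from itertools import product

def torus_distance(a, b, dims):
    """Minimum hop count between lattice coordinates a and b on a torus.

    Each dimension wraps around, so the distance along one axis is the
    shorter of the direct path and the path through the periodic boundary.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def mapping_cost(traffic, placement, dims):
    """Assumed cost model: sum of (traffic volume * hops) over all task pairs."""
    return sum(
        volume * torus_distance(placement[src], placement[dst], dims)
        for (src, dst), volume in traffic.items()
    )

# Hypothetical example: a 4 x 4 x 4 torus with one task per node.
dims = (4, 4, 4)
nodes = list(product(*(range(d) for d in dims)))

# Hypothetical workload: each task sends one unit of data to its successor.
traffic = {(t, (t + 1) % len(nodes)): 1 for t in range(len(nodes))}

# Naive placement: task t goes to the t-th node in row-major order.
placement = {t: nodes[t] for t in range(len(nodes))}

print(mapping_cost(traffic, placement, dims))
```

Under this assumed model, two placements of the same application can be compared by evaluating mapping_cost for each; a lower value means less hop-weighted traffic on the torus links.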
The problem of assigning tasks to the processors of a parallel processing computer so as to achieve the optimal load balance and to minimize the cost of interprocessor communication is important if efficient use is to be made of parallel computers. This issue has been studied by many groups in recent years. However, the relative emphasis placed on computational balance as opposed to communication costs, and differing assumptions made as to the numbers of processors and the interprocessor network architecture, have led to many different approaches to the problem.
For small numbers of processors, many techniques can be successfully applied. For instance, a simple heuristic followed by an iterative improvement process is developed to optimize the layout of tasks on processor networks with up to 64 processors in “Task Assignment on Distributed-Memory Systems with Adaptive Wormhole Routing”, Vibha A. Dixit-Radiya and Dhabaleswar K. Panda, Symposium on Parallel and Distributed Processing (SPDP '93), pp. 674-681, which is hereby incorporated by reference in its entirety. This work is unusual in that it includes link contention as well as total traffic volume during layout. A more complex algorithm that produces good results for small numbers of heterogeneous processors is presented in “An efficient algorithm for a task allocation problem”, A. Billionnet, M. C. Costa and A. Sutter, J. ACM, Vol. 39, No. 3, 1992, pp. 502-518, which is hereby incorporated by reference in its entirety. However, that algorithm assumes that communication costs are independent of the communication endpoints, so while it is useful for processors linked by a switched network, it is less applicable to parallel computers using more complex network topologies.
Graph partitioning techniques have been used in the load balancing problem, and also in clustering tasks to optimize locality in hierarchical networks, for instance a cluster of SMP nodes linked by a switched network in “Implementing the MPI Process Topology Mechanism”, Jesper Larsson Träff, Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002, pp. 1-14, which is hereby incorporated by reference in its entirety. Graph bipartitioning has also been used for task clustering and mapping on eight-node hypercubes in “Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning”, F. Ercal, J. Ramanujam and P. Sadayappan, Journal of Parallel and Distributed Computing, Vol. 10, No. 1, 1990, pp. 35-44 (hereinafter “F. Ercal et al.”), which is hereby incorporated by reference in its entirety.
Simulated Annealing and related techniques have been applied to the mapping problem by several groups, including in F. Ercal et al., noted in the paragraph above, and in “A New Mapping Heuristic Based on Mean Field Annealing”, Tevfik Bultan and Cevdet Aykanat, Journal of Parallel and Distributed Computing, Vol. 16, No. 4, 1992, pp. 292-305 (hereinafter “Tevfik Bultan et al.”), which is hereby incorporated by reference in its entirety. Simulated Annealing is useful for creating good mappings; however, it is computationally expensive. Mean Field Annealing is an algorithm with similarities to Simulated Annealing. In Tevfik Bultan et al., it is applied to problems with up to 400 tasks and 32 processors in hypercube and mesh topologies, and compared to Simulated Annealing.
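A generic Simulated Annealing loop for a one-task-per-node mapping can be sketched as follows. This is a textbook-style illustration and not the procedure of F. Ercal et al. or Tevfik Bultan et al.; the swap move, the geometric cooling schedule, the hop-weighted cost model, the 3 x 3 x 3 torus and all names (anneal, cost, traffic, placement) are assumptions made for the example. The sketch recomputes the full cost at every step, which hints at why the technique becomes expensive as the task count grows.

```python
import math
import random
from itertools import product

def torus_distance(a, b, dims):
    """Minimum hop count between two coordinates on a torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def cost(traffic, placement, dims):
    """Assumed cost model: traffic volume times hop count, summed over pairs."""
    return sum(v * torus_distance(placement[s], placement[t], dims)
               for (s, t), v in traffic.items())

def anneal(traffic, placement, dims, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a cheaper placement by randomly swapping pairs of tasks."""
    rng = random.Random(seed)
    current = dict(placement)
    current_cost = cost(traffic, current, dims)
    best, best_cost = dict(current), current_cost
    tasks = list(current)
    for step in range(steps):
        # Geometric cooling from t_start towards t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.sample(tasks, 2)
        current[a], current[b] = current[b], current[a]
        new_cost = cost(traffic, current, dims)
        delta = new_cost - current_cost
        # Always accept improvements; accept a worse state with
        # probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
    return best, best_cost

# Hypothetical example: 27 tasks in a ring pattern on a 3 x 3 x 3 torus,
# starting from a random placement.
dims = (3, 3, 3)
nodes = list(product(*(range(d) for d in dims)))
traffic = {(t, (t + 1) % len(nodes)): 1 for t in range(len(nodes))}
shuffled = nodes[:]
random.Random(1).shuffle(shuffled)
start = dict(enumerate(shuffled))
start_cost = cost(traffic, start, dims)
best, best_cost = anneal(traffic, start, dims, steps=5000)
print(start_cost, "->", best_cost)
```

Because every move is a swap, the result always remains a one-to-one assignment of tasks to nodes. A practical implementation would update the cost incrementally from the two swapped tasks rather than recomputing it in full.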
Other work has limited itself to problems displaying certain communication patterns. For instance, “Rectilinear Partitioning of Irregular Data Parallel Computations”, David Nicol, Journal of Parallel and Distributed Computing, Vol. 23, No. 2, November 1994, pp. 119-134, which is hereby incorporated by reference in its entirety, develops an algorithm for mapping problems with a rectilinear topology. This is extended to problems with a k-ary n-cube work and communication pattern in “On Bottleneck Partitioning of k-ary n-cubes”, David Nicol and Weizhen Mao, Parallel Processing Letters, Vol. 6, No. 6, June 1996, pp. 389-399, which is hereby incorporated by reference in its entirety.
Although the above solutions are useful, none of them is directed to optimizing the mapping of N tasks to N processors, which is a more constrained problem than mapping M>>N tasks to N processors. Accordingly, a need exists to optimally map N tasks to N processors for an arbitrary parallel application in order to minimize execution times on supercomputer systems.
Further, computer systems such as BG/L are far larger than the target architectures of previous research; scaling the mapping to thousands of nodes is essential. The 3-D torus of BG/L adds complexity to the mapping problem. Accordingly, a need exists for optimal mappings of an arbitrary parallel application onto supercomputers such as BG/L, and onto other supercomputers using 3-D torus interconnects, in order to minimize execution times on supercomputer systems.
Continuing further, much attention has been paid to achieving partitions of tasks that balance computational load and minimize inter-partition communication. Far less attention has been paid to placing those partitions on processors linked by a network, such as a torus or mesh, in which the communication costs between different processor pairs vary considerably. This is especially important for a computer such as BG/L, since the cost differential between a good and a bad mapping in a torus increases with processor count. Accordingly, a need exists to find a balanced layout for an arbitrary parallel application in order to minimize communication times between tasks on supercomputer systems.