The increasing complexity and heterogeneity of supercomputers, as we move beyond petaflop systems, calls for the urgent development of programming and runtime systems that manage this complexity automatically while running computations efficiently in terms of both performance and energy. The main challenges to address in the context of parallel computers include, inter alia, effective parallelization and management of communication between parallel processors. As the cost of communication has increased significantly relative to the cost of computation, it has become crucial to develop new techniques that minimize communication in parallel computations.
To this end, there has been a significant amount of research on automatic cluster parallelization. Compiler algorithms that use the polyhedral model to generate the required communication, i.e., the receive and send instructions for a given computation and data distribution, have been described. Techniques to reduce inefficiencies in the communication generation schemes of earlier works have also been proposed.
Communication minimization in general has also received considerable attention from the research community. Communication-avoiding algorithms for various numerical linear algebra problems, such as matrix multiplication and LU decomposition, have been developed and operate on 2.5D processor grids (three-dimensional grids in which one dimension is of a constant size, hence the name 2.5D). These techniques generally trade higher memory use (via data replication) for reduced communication. The algorithms replicate either read-only data or reduction arrays and are applicable only to certain processor grid configurations, namely 2.5D.
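The memory-for-communication trade-off of a 2.5D grid can be illustrated with a short sketch. It assumes the commonly cited leading-order estimates for 2.5D matrix multiplication of n×n matrices on p processors with replication factor c: roughly c·n²/p words of memory and n²/√(c·p) words communicated per processor. Constant factors and latency terms are dropped, and the function names `grid_2_5d` and `per_processor_estimates` are hypothetical, introduced here only for illustration.

```python
import math


def grid_2_5d(p, c):
    """Return the (q, q, c) shape of a 2.5D grid of p processors.

    c is the replication factor (the constant-size third dimension).
    Assumes p / c is a perfect square, as the 2.5D layout requires.
    """
    if c <= 0 or p % c != 0:
        raise ValueError("c must be a positive divisor of p")
    q = math.isqrt(p // c)
    if q * q * c != p:
        raise ValueError("p / c must be a perfect square")
    return (q, q, c)


def per_processor_estimates(n, p, c):
    """Leading-order per-processor costs for an n x n problem.

    Returns (memory_words, communicated_words). These are asymptotic
    estimates with constants dropped, not measured values.
    """
    memory_words = c * n * n / p
    communicated_words = n * n / math.sqrt(c * p)
    return memory_words, communicated_words
```

Under these assumptions, moving from c = 1 to c = 4 on 64 processors changes the grid from 8×8×1 to 4×4×4, quadruples the per-processor memory estimate, and halves the communication estimate.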
Some source-to-source compilers, such as R-Stream™, can perform automatic parallelization of sequential programs. R-Stream™, for example, accepts loop nests written in C or another programming language and produces parallelized code for different targets, including multi-core machines, GPUs, and FPGAs. R-Stream™ can also perform cluster parallelization. The R-Stream™ compiler uses the polyhedral model for program analysis and transformation, and implements high-performance techniques that enhance data locality and perform parallelization.
The generated cluster-parallel programs have the SPMD (Single Program Multiple Data) form. R-Stream™, for example, can aggregate loop iterations into tasks as part of its parallelization process. The aggregation process may use the tiling program transformation. Data communication between processors is typically performed at the boundaries of these tasks. Communication operations are abstracted as logical DMA (Direct Memory Access) primitives: each task issues logical DMA GETs to fetch the data needed for its computation and PUTs to store the live-out data it produces. The logical DMA operations are in turn implemented as R-Stream™ runtime layer functionality using the Global Arrays™ toolkit. Global Arrays (GAs) may provide a global address space for creating and accessing data structures such as one-dimensional and/or multi-dimensional arrays. Some techniques, such as those described in co-pending U.S. patent application Ser. No. 14/181,201, entitled, “Methods and Apparatus for Data Transfer Optimization,” describe efficient use of bulk transfer operations such as DMA commands. Some techniques, such as those described in co-pending U.S. patent application Ser. No. 13/712,659, entitled “Methods and Apparatus for Automatic Communication Optimizations in a Compiler Based on a Polyhedral Representation,” describe minimization of communication cost by replacing data exchanges between local and global memories with exchanges between two or more local memories.
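The task structure described above, with a GET at task entry, local computation, and a PUT of live-out data at task exit, can be sketched as follows. This is a toy, single-process illustration and not the R-Stream™ runtime or the Global Arrays™ API: the class `GlobalArray`, its `get` and `put` methods, and the functions `run_task` and `spmd_body` are hypothetical names, and the three-point stencil and round-robin tile assignment are assumptions chosen only to keep the example small.

```python
class GlobalArray:
    """Toy stand-in for a globally addressable 1-D array (hypothetical)."""

    def __init__(self, data):
        self._data = list(data)
        self.words_moved = 0  # counts elements transferred by GET/PUT

    def __len__(self):
        return len(self._data)

    def get(self, lo, hi):
        """Logical DMA GET: copy the half-open range [lo, hi) to a local buffer."""
        self.words_moved += hi - lo
        return self._data[lo:hi]

    def put(self, lo, local):
        """Logical DMA PUT: write a local buffer back, starting at index lo."""
        self.words_moved += len(local)
        self._data[lo:lo + len(local)] = local

    def snapshot(self):
        return list(self._data)


def run_task(src, dst, lo, hi):
    """One task covers tile [lo, hi) of dst[i] = src[i-1] + src[i] + src[i+1]."""
    halo = src.get(lo - 1, hi + 1)  # GET at the task boundary (tile plus halo)
    out = [halo[k] + halo[k + 1] + halo[k + 2] for k in range(hi - lo)]
    dst.put(lo, out)                # PUT of live-out data at the task boundary


def spmd_body(rank, nprocs, src, dst, tile):
    """Same program on every rank; each rank runs the tiles assigned to it."""
    n = len(src)
    tiles = [(lo, min(lo + tile, n - 1)) for lo in range(1, n - 1, tile)]
    for t, (lo, hi) in enumerate(tiles):
        if t % nprocs == rank:  # assumed round-robin tile-to-rank mapping
            run_task(src, dst, lo, hi)
```

Calling `spmd_body(rank, nprocs, src, dst, tile)` once per rank simulates the SPMD execution sequentially. All data movement is confined to the `get` and `put` calls at the start and end of each task, which is where a runtime layer could substitute bulk transfers.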