Due to hardware scaling and low-power requirements, new processor and system architectures are being investigated, designed, and developed for exascale and extreme-scale computing. As a common theme, these architectures have large numbers (e.g., tens, hundreds, or thousands) of cores that can react heterogeneously to their environment, and may be constrained by their global energy consumption. The computing devices may be operated at “near threshold voltage” (NTV), as lowering the supply voltage can produce a quadratic improvement in the power efficiency of computing devices with generally only a linear slowdown in throughput. Consequently, it is possible to obtain improved power utilization as long as an increase in parallelism can be found to offset the linear slowdown. Another important consequence of lowering the supply voltage near threshold is that variations in device performance are exacerbated. Thus, beyond any intrinsic imbalance from the application itself, the hardware often creates imbalance.
The need to extract more concurrency, reduce synchronization, and address the hardware imbalance imposes demanding requirements on the software. The software to be executed desirably should be as parallel as possible to take advantage of the cores, should be adaptable to changing core capabilities, and should avoid, or at least minimize, wasted energy.
One way to address this problem is to depart from the Bulk-Synchronous Programming (BSP) model. While the BSP model has promoted parallelism by enabling simple programming models such as loop parallelism and Single Program Multiple Data (SPMD) computations, the model may stand in the way of the amounts of parallelism sought. First, bulk synchronizations (across iterations of a “for” loop, for instance) often express an over-approximation of the actual dependences among computation instances (whether they are tasks or loop iterations). Also, synchrony often results in a loss of parallelism and a waste of energy, since cores spend a portion of their time waiting for some condition to occur (e.g., a barrier to be reached by other cores, a spawned task to return, etc.).
The event-driven task (EDT) model is emerging as an effective solution for new extreme-scale architectures. In this model, programs may be written as graphs of event-driven tasks, and can be asynchronous and non-bulk. Tasks are “scheduled” for asynchronous execution and become runnable whenever their input data is ready. In this model, the more accurate the inter-task dependences are with respect to the semantics of the program, the more parallelism can be exposed. This model can improve dynamic load balancing, which makes it an attractive choice for extreme-scale systems, especially near-threshold computing (NTC) systems.
It is impractical, however, to expect programmers to write directly in the EDT form; the expression of explicit dependences between tasks is cumbersome, requiring a significant expansion in the number of lines of code, and making the code opaque to visual inspection and/or debugging. Therefore, in general a high-level compiler and optimization tool is a key component of an extreme-scale/exascale software stack, to attain performance, programmability, productivity, and sustainability for such application software.
Previously Published and Commercialized Version of R-Stream™ Compiler
A previously published and commercialized version of R-Stream™ (referred to as “Published R-Stream™”) is an example of a source-to-source automatic parallelization and optimization tool targeted at a wide range of architectures including multicores, GPGPU, and other hierarchical, heterogeneous architectures including exascale architectures such as Traleika Glacier. Without automatic mapping, the management of extreme scale features would generally require writing longer software programs (having more lines of code), thus requiring more effort to produce software, and such programs may be less portable, and may be error-prone. Published R-Stream™ provides advanced polyhedral optimization methods and is known for features that can transform programs to find more concurrency and locality, and for features that can manage communications and memory hardware explicitly as a way of saving energy.
Published R-Stream™ is a high-level automatic parallelization tool, performing mapping tasks, which may include parallelism extraction, locality improvement, processor assignment, managing the data layout, and generating explicit data movements. Published R-Stream™ can read sequential programs written in C as input, automatically determine the mapping of the code portions to processing units based on the target machine, and output transformed code. Published R-Stream™ can handle the high-level transformations described above, and the resulting source code output by Published R-Stream™ generally needs to be compiled using a traditional low-level compiler.
Published R-Stream™ typically works by creating a polyhedral abstraction from the input source code. This abstraction is encapsulated by a generalized dependence graph (GDG), the representation used in the Published R-Stream™ polyhedral mapper. Published R-Stream™ can explore a unified space of all semantically legal sequences of traditional loop transformations. From a statement-centric point of view in the polyhedral abstraction, such a sequence of transformations can be represented by a single schedule (e.g., a rectangular parametric integer matrix). The Published R-Stream™ optimizer may add capabilities to express the mathematical link between high-level abstract program properties and variables in this unified space. These properties include parallelism, locality, contiguity of memory references, vectorization/SIMDization, and data layout permutations.
Event-Driven Task (EDT) Based Runtimes/Platforms
There are several EDT-based runtimes (generally referred to as EDT platforms) that are being developed in the community for exascale systems, such as Open Community Runtime (OCR), Concurrent Collections (CnC), SWift Adaptive Runtime Machine (SWARM), Realm, Charm++, and others. We have developed a hierarchical mapping solution using auto-parallelizing compiler technology to target three different EDT runtimes, namely, OCR, CnC, and SWARM. Specifically, we developed (1) a mapping strategy with selective trade-offs between parallelism and locality to extract fine-grained EDTs, and (2) a retargetable runtime API that captures common aspects of the EDT programming model and allows for uniform translation, porting, and comparisons between the different runtimes. We also developed a generic polyhedral compilation approach to compile programs for execution on EDT platforms.
OCR
OCR is an open-source EDT runtime platform that presents a set of runtime APIs for asynchronous task-based parallel programming models suited for exascale systems. The main paradigms in OCR are: (1) Event-driven tasks (EDTs), (2) Data Blocks (DBs), and (3) Events. All EDTs, DBs, and events have a globally unique ID (GUID) that identifies them across the platform. EDTs are the units of computation in OCR. All EDTs need to declare a set of dependences to which DBs or events can be associated. An EDT does not begin execution until all its dependences have been satisfied. EDTs are intended to be non-blocking pieces of code, and they are expected to communicate with other EDTs through the DBs (which are the units of storage) and events. All user data needs to be in the form of DBs and to be controlled by the runtime, since the runtime can relocate and replicate DBs for performance, power, or resilience reasons.
Events provide a mechanism for creating data and control dependencies in OCR. An event can be associated with a DB or can be empty. An event that is associated with a DB can be used to pass data to the EDTs waiting on that event. This dependence can be understood as control+data dependence. An event without a DB associated therewith can be used to trigger EDTs waiting on the event. This dependence can be understood as control dependence. Pure data dependence can be encoded by attaching a DB in a dependence slot to an EDT.
A compiler generally creates an OCR program by constructing a dynamically built acyclic graph of EDTs, DBs, and events. To this end, a compiler such as Published R-Stream™ can generate tasks (EDTs) and events from the specified program source code. Data blocks, however, generally need to be defined in the source code. Moreover, various known compilation and manual techniques that can guide the creation of data blocks by the EDT runtime/platform do not guide the EDT platform as to when the DBs may be created and/or destroyed. The known techniques also do not customize access to the DBs based on their usage by the tasks, and rely simply on the access mechanisms provided by the EDT runtime/platform. This can limit the performance of the EDT platform while executing a program, e.g., in terms of speed of execution, memory load, and/or power and/or energy consumption.