Despite predictions, made on both physical and economic grounds, of the end of Moore's law, Intel has recently declared Moore's law alive and well. However, as the number of transistors fitting a given chip area continues to grow, so does the energy required to enable them, with the result that the heat envelope supported by the packaging is being reached. The era of sequential computing relying on ever-increasing clock speeds and decomposition of the processing pipeline into ever shorter stages indeed appears to have come to an end. Clock speeds stopped increasing, and performance metrics started shifting as Gflops per Watt replaced traditional GHz. Subsequently, due to the same power wall that halted frequency scaling, the end of multi-core scaling was predicted. Some commentators estimate that for any chip organization and topology, multi-core scaling will also be power limited. To meet the power budget, they project, ever more significant portions of the chip will have to be turned off to accommodate the increase in static power loss from increasing transistor count. We are thus entering the “dark silicon” era.
From the point of view of programming models, one answer pursued by researchers for meeting the power-consumption requirements and providing the levels of parallelism necessary to keep the hardware busy is the exploration of large-scale dataflow-driven execution models. In the dark silicon era, as well as at Exascale levels of parallelism, the envisioned architectures are likely to be ill-balanced and to exhibit highly volatile performance and failure characteristics. It is envisioned that applications will, at least partially, steer away from the MPI bulk-synchronous model and may rely on relocatable tasks, scheduled by a dynamic, adaptive, work-stealing runtime.
These relocatable tasks are known as Event-Driven Tasks (EDTs). At least one of the runtimes, the Open Community Runtime (OCR), can support the execution model on the Intel Runnemede research architecture. In this context, communication and locality are performance and energy bottlenecks. Latencies to remote data will generally grow to accommodate lower energy budgets devoted to communication channels. As such, to hide these long-latency operations, it is beneficial to overprovision the software, and massive amounts of parallelism may need to be uncovered and balanced efficiently and dynamically. In some systems, such as GPGPU-based systems, and in particular in CUDA, a user may specify more parallelism than can be exploited, for the purpose of hiding latencies. The user specification of parallelism, however, is generally not based on any systematic analysis of the loop-carried dependencies and, as such, may not lead to the parallelization necessary to simultaneously meet the performance requirements and power budgets.
Traditional approaches to parallelism typically require the programmer to describe explicitly the sets of operations that can be parallelized in the form of communicating sequential processes (CSPs). The fork-join model and the bulk-synchronous model are commonly used methodologies for expressing CSPs, for shared and distributed memory systems, respectively. As multi-socket, multi-core computers are becoming ubiquitous and are trending towards ever more cores on chip, new parallel programming patterns are emerging. Among these patterns, the task-graph pattern is being actively pursued as an answer to the overprovisioning and load-balancing problems. This model can support a combination of different styles of parallelism (data, task, pipeline). At a very high level, the programmer writes computation tasks which can: (1) produce and consume data, (2) produce and consume control events, (3) wait for data and events, and (4) produce or cancel other tasks. Dependences between tasks must be declared to the runtime, which keeps distributed queues of ready tasks (i.e., tasks whose dependences have all been met) and decides where and when to schedule tasks for execution. Work-stealing can be used for load-balancing purposes. Specifying tasks and dependences that are satisfied at runtime is common to CnC, OCR, SWARM and to other event-driven runtimes.
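The task-graph pattern described above can be illustrated with a minimal sketch. It is not the API of CnC, OCR, or SWARM; the names `Task`, `ToyRuntime`, and `diamond_demo` are hypothetical, and the workers are simulated round-robin in a single thread purely to keep the example deterministic. Each task declares its dependences, a task enters a ready queue only when its last dependence is satisfied, and an idle worker steals the oldest ready task from another worker's queue.

```python
from collections import deque


class Task:
    """A unit of work that becomes ready once all its dependences are met."""

    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn                    # called with the results of `deps`
        self.deps = list(deps)
        self.pending = len(self.deps)   # number of unmet dependences
        self.successors = []
        self.result = None
        for dep in self.deps:           # declare the dependence in the graph
            dep.successors.append(self)


class ToyRuntime:
    """Hypothetical runtime: per-worker ready queues plus work stealing."""

    def __init__(self, num_workers=2):
        self.queues = [deque() for _ in range(num_workers)]
        self.trace = []                 # (worker, task name) in execution order
        self.steals = 0

    def _take(self, worker):
        if self.queues[worker]:
            return self.queues[worker].pop()        # own work: newest first
        for victim, queue in enumerate(self.queues):
            if victim != worker and queue:
                self.steals += 1
                return queue.popleft()              # stolen work: oldest first
        return None

    def run(self, tasks):
        # Seed worker 0 with every task that has no dependences.
        for task in tasks:
            if task.pending == 0:
                self.queues[0].append(task)
        remaining = len(tasks)
        while remaining:
            progressed = False
            for worker in range(len(self.queues)):
                task = self._take(worker)
                if task is None:
                    continue
                progressed = True
                task.result = task.fn(*[d.result for d in task.deps])
                self.trace.append((worker, task.name))
                remaining -= 1
                for succ in task.successors:
                    succ.pending -= 1
                    if succ.pending == 0:           # last dependence satisfied
                        self.queues[worker].append(succ)
            if not progressed:
                raise RuntimeError("no ready task: cyclic or missing dependence")
        return self.trace


def diamond_demo():
    """Builds a four-task diamond: a feeds b and c, which both feed d."""
    a = Task("a", lambda: 2)
    b = Task("b", lambda x: x + 1, [a])
    c = Task("c", lambda x: x * 10, [a])
    d = Task("d", lambda x, y: x + y, [b, c])
    runtime = ToyRuntime(num_workers=2)
    runtime.run([a, b, c, d])
    return runtime, d
```

In `diamond_demo`, task `d` only becomes ready after both `b` and `c` have run, so it receives 3 and 20 and produces 23. Because all initially ready work is seeded on one worker, the second worker obtains its tasks by stealing, which is the load-balancing behavior the paragraph refers to.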
The user specification of tasks, however, is generally not based on any systematic analysis of the program to be executed, such as would enable a partitioning of the operations of the program into tasks that can fully exploit the parallel-processing power of a target runtime. Because the tasks themselves are often defined without the benefit of a systematic analysis, the dependencies associated with the tasks usually do not expose the parallelism necessary to achieve the required performance and/or to meet a power budget.
One transformation system for expressing tasks and the dependencies therebetween is based on the polyhedral model. Some transformation systems allow for intricate transformation compositions, but the applicability of these systems is generally limited because they employ static dependence analysis. Such transformation systems generally decide at compile time whether to parallelize a loop structure or not and, as such, typically require excessive compile times and/or may not achieve the parallelization that can be obtained using EDT-based runtimes. Some techniques can expand the scope of analyzable codes by (1) computing inter-procedural over- and under-approximations that present a conservative abstraction to the polyhedral toolchain, and (2) introducing more general predicates that can be evaluated at runtime through fuzzy-array dataflow analysis. In practice, conservative solutions mix well with the polyhedral toolchain through a stubbing (a.k.a. blackboxing) mechanism, and parallelism can be expressed across irregular code regions. Unfortunately, this is not sufficient because the decision to parallelize or not remains an all-or-nothing compile-time decision performed at the granularity of the loop. In contrast, EDT-based runtimes allow the expression of fine-grain parallelism down to the level of the individual instruction (overhead permitting), and the transformation systems discussed above do not permit runtime exploration of parallelism. Some techniques allow for performing speculative and runtime parallelization using the expressiveness of the polyhedral model. In these techniques, the speculation may be erroneous and/or the compile time can be too long.
In some techniques, a dependence analysis based on a directed acyclic graph (DAG) of linear-memory array descriptors can generate lightweight and sufficient runtime predicates to enable adaptive runtime parallelism. These methods may enable runtime evaluation of predicates, and can result in significant speedups on benchmarks with difficult dependence structures. In these techniques, however, parallelism is still exploited in a fork-join model via the generation of OpenMP annotations and, as such, these techniques generally cannot deliver the full parallelization and performance benefits achievable with EDT-based runtimes that use the event-driven task model.
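To make the idea of a lightweight, sufficient runtime predicate concrete, the following sketch guards a single loop, `for i in range(n): a[i + k] = a[i] + 1`, whose dependence structure depends on a shift `k` known only at run time. It is an illustration under stated assumptions, not the descriptor-DAG analysis referred to above: the function names are hypothetical, the predicate is derived by hand for this one loop, and a Python thread pool stands in for the fork-join execution that OpenMP annotations would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def independent(n, k):
    """Sufficient runtime predicate for the guarded loop.

    The loop reads a[0:n] and writes a[k:k+n]. With k == 0 every iteration
    touches only its own element; with k >= n the two ranges are disjoint.
    """
    return k == 0 or k >= n


def _run_chunk(a, k, start, stop):
    # Executes iterations start..stop-1 of the guarded loop in order.
    for i in range(start, stop):
        a[i + k] = a[i] + 1


def shift_add(a, n, k, workers=4):
    """Runs the loop, choosing parallel or sequential execution at run time."""
    if not (0 <= k and n + k <= len(a)):
        raise ValueError("expected 0 <= k and n + k <= len(a)")
    if n > 0 and independent(n, k):
        step = max(1, -(-n // workers))             # ceiling division
        bounds = [(s, min(s + step, n)) for s in range(0, n, step)]
        # Fork: one chunk of iterations per submitted job.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(_run_chunk, a, k, s, e) for s, e in bounds]
            for future in futures:
                future.result()                     # join, re-raising errors
        return "parallel"
    _run_chunk(a, k, 0, n)                          # keep the original order
    return "sequential"
```

For example, with `n = 4` and `k = 2` iteration 2 reads the value written by iteration 0, so the predicate fails and the loop runs in its original order; with `k = 4` the read and write ranges are disjoint and the chunks may run concurrently. Note that the parallel path is still a fork followed by a join over the whole loop, which is the limitation the paragraph attributes to these techniques.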