Polyhedral Model Concepts
The polyhedral model is a mathematical abstraction for representing and reasoning about programs compactly. It uses an intermediate representation (IR) built on a generalized dependence graph (GDG), which contains the following information.
Statement. A statement S is a set of operations grouped together in our internal representation. Statements are the nodes of the GDG. A statement in the model often corresponds to a statement in the original program. Depending on the level of abstraction, a statement can be arbitrarily simple (e.g., a micro-operation) or arbitrarily complex (e.g., an external pre-compiled object).
Iteration Domain. An iteration domain DS is an ordered set of iterations associated with each statement S. It describes the loop iterations in the original program that control the execution of S. To model multiple levels of nested loops, iteration domains are multi-dimensional sets. We denote the order between two iterations i1 and i2 of S by i1 « i2 if S(i1) occurs before S(i2) in the program. Operations to manipulate domains and their inverses include projection, to extract information along a sub-domain; image by a function, to transform a domain into another domain; intersection, to construct the iterations that are common to a list of domains; and index-set splitting, to break a domain into disjoint pieces.
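The domain operations above can be illustrated with a minimal sketch. It assumes a triangular loop nest (for i in 0..n-1, for j in 0..i) and enumerates the domain point by point; a polyhedral compiler would instead keep the domain symbolic as a set of affine constraints. All function names here are hypothetical and chosen only for exposition.

```python
from itertools import product


def domain_S(n):
    """Iteration domain of S in: for i in range(n): for j in range(i + 1): S(i, j)."""
    return [(i, j) for i, j in product(range(n), repeat=2) if j <= i]


def precedes(a, b):
    """Order between iterations: True if iteration a executes before iteration b."""
    return a < b  # Python compares tuples lexicographically


def project(domain, dim):
    """Projection of a domain onto a single dimension."""
    return sorted({pt[dim] for pt in domain})


def intersect(a, b):
    """Iterations common to two domains, kept in the order of the first."""
    in_b = set(b)
    return [pt for pt in a if pt in in_b]


def split(domain, pred):
    """Index-set splitting: two disjoint pieces that together cover the domain."""
    return ([pt for pt in domain if pred(pt)],
            [pt for pt in domain if not pred(pt)])


D = domain_S(4)
diag, below = split(D, lambda pt: pt[0] == pt[1])
```

Splitting the diagonal from the rest of the triangle is a typical use of index-set splitting when the two pieces need different treatment.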
Dependence. A dependence (T→S) is a relation between the sets of iterations of S and T. It conveys the information that some iteration iT ∈ DT depends on iS ∈ DS (i.e., they access the same memory location by application of a memory reference) and that iS « iT in the original program. We write the set relation {(iT, iS) ∈ (T→S)} to refer to the specific iterations of T and S that take part in the dependence. Dependences between statements form the edges of the GDG and give it a multi-graph structure.
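As a minimal sketch of such a relation, consider a hypothetical loop "for i in 1..n-1: S: A[i] = ...; T: B[i] = A[i-1]". The code below builds {(iT, iS) ∈ (T→S)} by brute force over small domains; the loop, the access functions, and the enumeration strategy are assumptions for illustration, whereas an actual dependence analysis would solve the same conditions symbolically.

```python
def dependence_T_on_S(n):
    """Pairs (iT, iS) such that T(iT) reads the location written by S(iS).

    Assumed loop body, executed for i = 1 .. n-1:
        S: A[i] = ...
        T: B[i] = A[i - 1]
    """
    dom_S = range(1, n)
    dom_T = range(1, n)

    def write_S(i):   # memory reference of the write in S
        return i

    def read_T(i):    # memory reference of the read in T
        return i - 1

    return [(iT, iS)
            for iT in dom_T
            for iS in dom_S
            # same location, and S(iS) precedes T(iT) in program order
            # (S comes before T inside one iteration, so iS <= iT suffices)
            if write_S(iS) == read_T(iT) and iS <= iT]


deps = dependence_T_on_S(5)
```

Each pair relates a reading iteration of T to the iteration of S that wrote the value one loop iteration earlier.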
Dataflow dependence. A dataflow dependence (T→S)d is a special kind of read-after-write (RAW) dependence. It conveys additional last-write information. When it is exact, it does not carry any redundancy (i.e., each memory value read has at most one producer). Array dataflow analysis is a global process involving all the statements in the considered portion of the program to determine precise dependences.
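The difference between a memory-based RAW dependence and an exact dataflow dependence can be sketched as follows. The example assumes a hypothetical nest in which S(i, j) writes A[j] for every i, followed by a loop in which T(j) reads A[j]; the brute-force search for the last writer stands in for array dataflow analysis and is not the analysis itself.

```python
def raw_and_dataflow(n, m):
    """Return (raw, flow) dependences for an assumed program:

        for i in range(n):
            for j in range(m):
                S: A[j] = ...
        for j in range(m):
            T: ... = A[j]
    """
    writes = [(i, j) for i in range(n) for j in range(m)]  # iterations of S
    reads = list(range(m))                                  # iterations of T

    # Memory-based RAW: every earlier write to the same location.
    raw = [(jT, iS) for jT in reads for iS in writes if iS[1] == jT]

    # Dataflow: only the last write before the read (at most one producer).
    flow = []
    for jT in reads:
        producers = [iS for iS in writes if iS[1] == jT]
        if producers:
            flow.append((jT, max(producers)))  # lexicographically last write
    return raw, flow


raw, flow = raw_and_dataflow(3, 2)
```

In this sketch the RAW relation lists n writers per read, while the dataflow relation keeps exactly one.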
Memory reference. A memory reference F is a function that maps domain iterations to locations in the memory space. The image of DS by F represents the set of memory locations read or written by S through memory reference F. If F is injective, distinct memory locations are touched; otherwise, memory reuse exists within the program. Each statement can access multiple memory references in read and/or write mode.
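A minimal sketch of the image and injectivity properties, assuming a small rectangular domain and two hypothetical references (A[i][j] and A[i + j]); the helper names are illustrative only.

```python
def image(domain, ref):
    """Set of memory locations touched by applying ref to every iteration."""
    return {ref(*pt) for pt in domain}


def is_injective(domain, ref):
    """True if distinct iterations always touch distinct locations."""
    return len(image(domain, ref)) == len(set(domain))


domain = [(i, j) for i in range(3) for j in range(3)]


def ref_2d(i, j):       # models A[i][j]
    return (i, j)


def ref_sum(i, j):      # models A[i + j]
    return (i + j,)


locations_sum = image(domain, ref_sum)
```

The second reference maps nine iterations onto five locations, which is the memory reuse described above.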
Scheduling function. A scheduling function θS is a function that maps the iterations of S to time. It induces a partial order that represents the execution order of each iteration of S relative to all other iterations of any statement in the program. If the scheduling function is injective, the output program is sequential; otherwise parallel iterations exist. In particular, the order « extends to time after scheduling is applied. Scheduling functions allow the global reordering of statement iterations. In particular, affine scheduling functions subsume many classical high-level loop transformations in traditional compiler terminology.
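The following sketch checks whether a candidate schedule preserves the order « on dependent iterations. It assumes a hypothetical stencil S(i, j): A[i][j] = A[i-1][j] + A[i][j-1] and tests three affine schedules by enumeration; the statement, the schedules, and the helper names are assumptions for illustration, not the scheduling algorithm of any particular compiler.

```python
def stencil_dependences(n):
    """Domain and (target, source) pairs for the assumed statement
    S(i, j): A[i][j] = A[i-1][j] + A[i][j-1], with 1 <= i, j < n."""
    dom = [(i, j) for i in range(1, n) for j in range(1, n)]
    points = set(dom)
    deps = [((i, j), src)
            for (i, j) in dom
            for src in ((i - 1, j), (i, j - 1))
            if src in points]
    return dom, deps


def is_legal(theta, deps):
    """A schedule is legal if every source is scheduled strictly before its target."""
    return all(theta(*src) < theta(*tgt) for tgt, src in deps)


def wavefront(i, j):     # one-dimensional, non-injective schedule
    return (i + j,)


def interchange(i, j):   # classical loop interchange
    return (j, i)


def reverse_outer(i, j): # reversal of the outer loop
    return (-i, j)


dom, deps = stencil_dependences(4)
time_steps = {wavefront(*pt) for pt in dom}
```

The wavefront schedule is not injective: iterations that share a time step have no dependence between them in this example and may run in parallel.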
Loop types. We extend our scheduling representation with information pertaining to the kind of parallelism available in a loop. This information corresponds to common knowledge in the compiler community, and we use traditional terminology: (1) doall loops do not carry any dependence and can be executed in parallel; (2) permutable bands of loops carry forward-only dependencies and may be safely interchanged and tiled; (3) sequential loops must be executed in the specified order (not necessarily by the same processor); and (4) reduction loops can be executed in any sequential order (assuming the reduction operator is associative and commutative, otherwise they are degraded to sequential loops). Both schedule and loop type information are local to the statement nodes of the GDG.
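A simplified sketch of how doall loops, sequential loops, and permutable bands can be recognized from dependence distance vectors. It assumes uniform dependences described by constant distance vectors and ignores reductions; the function names are hypothetical, and a production classifier would work on the full dependence relations.

```python
def carried_level(distance):
    """Index of the loop that carries the dependence (first nonzero component),
    or None for a loop-independent dependence."""
    for level, d in enumerate(distance):
        if d != 0:
            return level
    return None


def classify(distances, depth):
    """Label each loop 'sequential' if it carries a dependence, else 'doall'."""
    carrying = {carried_level(d) for d in distances}
    return ["sequential" if level in carrying else "doall"
            for level in range(depth)]


def is_permutable_band(distances):
    """All loops form a permutable band if every distance component is
    non-negative (forward-only dependences)."""
    return all(c >= 0 for d in distances for c in d)


stencil = [(1, 0), (0, 1)]   # e.g. A[i][j] = A[i-1][j] + A[i][j-1]
skewed = [(1, -1)]           # e.g. A[i][j] = A[i-1][j+1]
```

Under these assumptions the stencil nest has two sequential loops that nevertheless form a permutable (tilable) band, while the skewed example has a doall inner loop but is not permutable as written.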
Placement function. A placement function PS is a function that maps the iterations of S to hierarchies of processing elements. Its application to the iteration domain dictates (or provides hints at run time) which iterations of a statement execute where. There is an implicit relation between the type of loop and the placement function. Sequential loops synchronize linearly if executed by multiple processors, doall loops are synchronization-free, and reduction loops use tree-based synchronizations. Depending on the dependencies, sequential and reduction loops may be transformed into doall loops using locks. Placement information is local to the statement nodes of the GDG.
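A minimal sketch of a placement function for a doall loop, assuming a two-level hierarchy (nodes, then cores within a node) and a block-cyclic distribution. The hierarchy, the default parameter values, and the distribution policy are illustrative assumptions only.

```python
def placement(i, nodes=2, cores=4, block=8):
    """Map iteration i to a (node, core) pair.

    Iterations are grouped into blocks of `block` consecutive iterations;
    blocks are dealt to cores first, then to nodes, and wrap around.
    """
    tile = i // block
    return (tile // cores) % nodes, tile % cores


assignments = {}
for i in range(64):
    assignments.setdefault(placement(i), []).append(i)
```

With the default values, 64 iterations spread evenly over the 8 processing elements, 8 consecutive iterations each.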
Primary Compiler-Mapping Phases
A polyhedral model based compiler (e.g. R-Stream™) can perform high-level automatic mapping to heterogeneous architectures. This mapping includes parallelism extraction, task formation, locality improvement, processor assignment, data layout management, memory consumption management, generation of explicit data movements (as well as their reuse optimization and pipelining with computations), and generation of explicit synchronization. Many high-level optimizations in such a compiler take a GDG as input and generate a new GDG with additional or altered information. Low-level optimizations occur on a different, SSA-based IR after the high-level transformations are applied. The generated output code depends on the target architecture. It may be C extended with annotations and target-specific communication and synchronization library calls (OpenMP, pthreads, etc.) for SMP, CUDA for GPGPUs, etc.
Affine scheduling. A polyhedral model based compiler (e.g. R-Stream™) can perform exact dependence analysis and state-of-the-art polyhedral transformations through its joint parallelism, locality, contiguity, vectorization, and data layout (JPLCVD) affine scheduling framework. The strengths of this phase include the following: (1) it balances fusion, parallelism, contiguity of accesses, and data layout, and comes up with a communication- and synchronization-minimized program schedule; (2) it ensures that the degree of parallelism is not sacrificed when loops are fused, and it exposes and extracts all the available parallelism in the program, including both coarse-grained and fine-grained parallelism; and (3) it is applied as a single mapper phase, which makes the algorithm well suited to iterative optimization and auto-tuning.
Tiling. An important phase in the mapping process is “tiling.” A tile in traditional compiler terminology represents an atomic unit of execution. The affine scheduling algorithm identifies “permutable loops” that can be tiled to create an atomic unit of execution. Tiling is done for two primary reasons: (1) to divide the computation into tasks to distribute across processors, and (2) to block the computation into chunks such that each chunk requires data that can fit in a smaller but faster memory (enabling good data locality and reuse, both temporal and spatial).
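Tiling of a permutable two-dimensional band can be sketched as follows. The generator assumes a rectangular n-by-m iteration space and rectangular tiles of size ti-by-tj; it yields each iteration together with the tile that owns it. The function name and the rectangular shapes are assumptions for illustration.

```python
def tiled_iterations(n, m, ti, tj):
    """Visit the n x m iteration space tile by tile.

    The two outer loops enumerate tiles (the atomic units of execution);
    the two inner loops enumerate the iterations inside one tile.
    """
    for ii in range(0, n, ti):
        for jj in range(0, m, tj):
            for i in range(ii, min(ii + ti, n)):      # min() handles partial tiles
                for j in range(jj, min(jj + tj, m)):
                    yield (ii // ti, jj // tj), (i, j)


visited = list(tiled_iterations(5, 7, 2, 3))
tiles = {tile for tile, _ in visited}
points = [pt for _, pt in visited]
```

The tiled traversal covers every original iteration exactly once; only the order changes, which is why the tiled loops must be permutable.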
A polyhedral model based compiler (e.g. R-Stream™) can partition statements into groups that can be tiled together to fit within a constrained memory space. Such a group forms an atomic unit of memory allocation. Grouping of statements determines the tile shape as well as the allocation and lifespan of local arrays (data buffers in faster memories). The tiling algorithm is guaranteed to choose tile sizes that satisfy the following criteria: (1) the data footprint of the tile does not exceed the size of the fast memories, and (2) the tile size balances the amount of computation and communication (among tiles).
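The two tile-size criteria can be illustrated with a deliberately simple heuristic. The sketch assumes a cubic t-by-t-by-t tile of a matrix multiplication C += A * B, whose footprint is three t-by-t blocks, and picks the largest t that fits a given capacity. This is a hypothetical stand-in for exposition, not the tile-size selection algorithm referred to above.

```python
def footprint(t):
    """Elements touched by one t x t x t tile of C += A * B (one block each of A, B, C)."""
    return 3 * t * t


def compute_to_transfer_ratio(t):
    """Multiply-add operations per element moved into fast memory."""
    return (t ** 3) / footprint(t)


def pick_tile(capacity_elems):
    """Largest cubic tile whose footprint fits in the fast memory.

    Assumes capacity_elems >= footprint(1), i.e. at least 3 elements.
    """
    if capacity_elems < footprint(1):
        raise ValueError("fast memory too small for even a 1 x 1 x 1 tile")
    t = 1
    while footprint(t + 1) <= capacity_elems:
        t += 1
    return t


tile = pick_tile(48)
```

Because the ratio of computation to data movement grows with t in this model, the largest tile that fits also gives the best balance; with other access patterns the two criteria can pull in different directions.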
Placement. The placement phase determines the placement function that maps the iterations of statements to hierarchies of processing elements in the given target system. The placement decision is dictated by the affine schedule that carries key information regarding parallelism available in a loop and potential communication/synchronization resulting from the loop. The kind of parallelism available in a loop has direct implications on how it may be executed on a hierarchical and heterogeneous parallel machine.
Local memory management. A polyhedral model based compiler (e.g. R-Stream™) can support automatic creation and management of local arrays. These arrays are placed in smaller, faster local memories (caches in x86 systems and scratchpad memory or registers in GPUs), and the compiler creates bulk copies (DMA or explicit copy loops) to and from them. When data is migrated explicitly from one memory to another, opportunities arise to restructure the data layout at a reduced relative cost. Such reorderings help reduce storage utilization and can enable further optimizations (e.g., simdization).
For each parametric affine array reference A[f(x)] in the program, this phase gives a mapping to its new local references A′i[gi(x)], where the A′i are the new arrays to be allocated in the local memory. Non-overlapping references to the same original array can be placed into distinct local arrays. The local arrays are created to be as compact as possible.
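One simple way to derive a local reference g from a global reference f is to shift f by the lower corner of the region a tile touches. The sketch below assumes a rectangular bounding box computed by enumeration; the text does not state how compact local arrays are actually derived, so the method and the names are illustrative assumptions.

```python
def localize(tile, ref):
    """Return (shape, origin, local_ref) for the data touched by one tile.

    tile      : list of iteration points
    ref       : global reference f, mapping an iteration to an index tuple
    local_ref : shifted reference g, with g(x) = f(x) - origin
    """
    touched = [ref(*pt) for pt in tile]
    dims = range(len(touched[0]))
    origin = tuple(min(idx[d] for idx in touched) for d in dims)
    upper = tuple(max(idx[d] for idx in touched) for d in dims)
    shape = tuple(hi - lo + 1 for lo, hi in zip(origin, upper))

    def local_ref(*pt):
        return tuple(x - lo for x, lo in zip(ref(*pt), origin))

    return shape, origin, local_ref


tile = [(i, j) for i in range(8, 12) for j in range(4, 6)]


def f(i, j):            # models the global reference A[i + 1][j - 1]
    return (i + 1, j - 1)


shape, origin, g = localize(tile, f)
```

The local array needs only `shape` elements, regardless of the size of the original array A.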
Communication (data transfer) generation. Communication generation is invoked when there is a need (whether it arises from programmability or profitability) to explicitly transfer data between different memories (slower DRAM to faster local buffer, for example). For shared memory machines, R-Stream performs communication generation to generate DMA instructions or explicit copies that benefit from hardware prefetches. For GPUs, it generates explicit copy code to transfer data between global memory and scratchpad memory/registers.
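The shape of the generated code can be sketched as a copy-in, compute, copy-out sequence per tile. The example assumes a one-dimensional array scaled by a constant and uses Python list slices as a stand-in for DMA transfers or explicit copy loops; the function name and the operation are hypothetical.

```python
def scale_with_local_buffer(data, factor, tile):
    """Scale every element, working on one tile-sized local buffer at a time."""
    out = list(data)                      # models the array in slower memory
    for start in range(0, len(out), tile):
        end = min(start + tile, len(out))
        local = out[start:end]            # copy-in: slow memory -> local buffer
        for k in range(len(local)):       # compute only on the local buffer
            local[k] = factor * local[k]
        out[start:end] = local            # copy-out: local buffer -> slow memory
    return out


result = scale_with_local_buffer([1, 2, 3, 4, 5], 2, 2)
```

Separating transfers from computation in this way is also what makes it possible to overlap the copy of one tile with the computation of another.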
One or more of the optimizations described above can enhance the execution of a software program on a target platform, i.e., a data processing system. Some data processing systems include one or more central processing units (CPUs), co-processor(s) such as math co-processor(s), dedicated and/or shared memory banks, data buffer(s), single or multi-level cache memory unit(s), etc. The above-described optimizations can improve performance, e.g., by improving locality of data, reducing data communication, increasing parallelization, etc. These optimizations typically do not attempt to minimize energy/power consumption of the target platform during execution of the software program, however.