The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads per processing core. One limitation of these architectures is that currently available commercial compilers cannot efficiently take advantage of the increase in computational resources.
In the software design and implementation process, compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously, which requires the optimizer to combine multiple program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality to reduce the ever-widening disparity between memory bandwidth and the frequency of processors: the memory wall. Indeed, the speed and bandwidth of the memory subsystems have always been a bottleneck, and this bottleneck worsens with the move to multi-core architectures. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved by current compilers, resulting in weak scalability and disappointing sustained performance.
Even when programming models are explicitly parallel (threads, data parallelism, vectors), they usually rely on advanced compiler technology to relieve the programmer from scheduling and mapping the application to computational cores and from understanding the memory model and communication details. Even when provided with enough static information or annotations (OpenMP directives, pointer aliasing, separate compilation assumptions), compilers have a hard time exploring the huge and unstructured search space associated with these mapping and optimization challenges. Indeed, the task of the compiler can hardly be called optimization anymore, in the traditional meaning of reducing the performance penalty entailed by the level of abstraction of a higher-level language. Together with the run-time system (whether implemented in software or hardware), the compiler is responsible for most of the combinatorial code generation decisions needed to map the simplified and ideal operational semantics of the source program to the highly complex and heterogeneous machine.
The polyhedral model promises to be a powerful framework to unify coarse-grained and fine-grained parallelism extraction with locality and communication optimizations. To date, this promise has yet to be fulfilled, as no existing affine scheduling and fusion techniques can perform all these optimizations in a unified (i.e., non-phase-ordered) and unbiased manner. Typically, parallelism optimization algorithms optimize for degrees of parallelism, but cannot be used to optimize locality or communication. In like manner, algorithms used for locality optimization cannot be used to extract parallelism. Additional difficulties arise when optimizing source code for the particular architecture of a target computing apparatus.
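By way of illustration only, the tension between fusion for locality and the extraction of parallelism can be seen in a small loop nest. The following is a minimal sketch, not a technique taken from any particular compiler; the function names `distributed` and `fused`, the `order` parameter, and the arithmetic performed are all hypothetical and chosen for brevity. Python is used only to keep the sketch short; the loop structure is what matters.

```python
def distributed(b, order=None):
    """Two separate loops over the same index range.

    Neither loop carries a dependence between its own iterations, so each
    loop could be executed in parallel. The array `a` is, however,
    traversed twice, which is the less favorable case for locality.
    """
    n = len(b)
    idx = list(range(n)) if order is None else list(order)
    a = [0] * n
    c = [0] * n
    for i in idx:              # loop 1: produces every a[i]
        a[i] = 2 * b[i]
    for i in idx:              # loop 2: consumes a[i] and a[i - 1]
        if i >= 1:
            c[i] = a[i] + a[i - 1]
    return c


def fused(b, order=None):
    """The same computation after fusing the two loops into one.

    a[i] is reused immediately after it is produced (better locality),
    but iteration i now reads a[i - 1] written by iteration i - 1.
    This loop-carried dependence prevents running the iterations in an
    arbitrary order without further transformation.
    """
    n = len(b)
    idx = list(range(n)) if order is None else list(order)
    a = [0] * n
    c = [0] * n
    for i in idx:
        a[i] = 2 * b[i]
        if i >= 1:
            c[i] = a[i] + a[i - 1]
    return c


if __name__ == "__main__":
    b = [1, 2, 3, 4]
    backwards = range(len(b) - 1, -1, -1)
    # In the original order, both versions compute the same result.
    print(distributed(b) == fused(b))
    # Reordering iterations is harmless for the distributed loops ...
    print(distributed(b, backwards) == distributed(b))
    # ... but changes the result of the fused loop.
    print(fused(b, backwards) == fused(b))
```

In this sketch, the iteration order stands in for a parallel schedule: a loop whose result does not depend on the order of its iterations is a candidate for parallel execution. A technique that considers only locality would favor the fused form, while one that considers only parallelism would favor the distributed form.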
Therefore, there exists a need for improved source code optimization methods and apparatus that can optimize both parallelism and locality.