The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads per processing core and have access to vector units, which require specific know-how to be fully utilized. One limitation of these architectures is that currently available commercial compilers cannot efficiently take advantage of the increase in computational resources.
In the software design and implementation process, compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously; this requires the optimizer to combine multiple program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality to reduce the ever-widening disparity between memory bandwidth and the frequency of processors, known as the memory wall. The tension between parallelism and locality of memory references is an important topic in the field of compiler optimization. More parallelism allows more concurrent execution of the parallel portions of a program, and therefore makes more computational operations per second available. Increasing locality directly translates into reduced communication between memories and processing elements. Typically, however, the portions of a program that may be executed in parallel are not interdependent, and as such these portions together may access non-local data or data that are distributed throughout the memory. Because of these program semantics constraints, increasing parallelism may decrease locality and vice versa.
An additional architectural phenomenon related to both parallelism and the memory wall is the ability of processors to better process data elements whose addresses in memory are evenly spaced (also referred to as constant strides). Such regularity of memory accesses allows the program to take advantage of hardware streaming prefetchers, which increase the memory bandwidth available to processors, as well as of vector units, which allow the execution of multiple logical instructions as a single hardware instruction. This additional constant-stride memory constraint conflicts with parallelism and locality in the sense that programs with good parallelism and locality may not exhibit constant strides and vice versa.
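The notion of a constant stride can be illustrated with a minimal sketch. It assumes a two-dimensional array stored in row-major order; the helper names (row_major_offset, access_strides, is_constant_stride) and the small 4x4 example are hypothetical and serve only to show which access patterns have evenly spaced addresses.

```python
def row_major_offset(i, j, ncols):
    """Linear element offset of A[i][j] in a row-major array with ncols columns."""
    return i * ncols + j


def access_strides(offsets):
    """Differences between consecutive offsets touched by a loop."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


def is_constant_stride(offsets):
    """True when every consecutive pair of accesses is the same distance apart."""
    return len(set(access_strides(offsets))) <= 1


N = 4  # assumed 4x4 array, chosen only for illustration

# Innermost loop over j with i fixed: A[1][j] walks along a row.
row_walk = [row_major_offset(1, j, N) for j in range(N)]

# Innermost loop over i with j fixed: A[i][1] walks down a column.
col_walk = [row_major_offset(i, 1, N) for i in range(N)]

# Indirect access A[0][index[k]]: the addresses depend on run-time data.
index = [0, 3, 1, 2]
gather = [row_major_offset(0, k, N) for k in index]

print(access_strides(row_walk))   # unit stride
print(access_strides(col_walk))   # constant stride equal to the row length
print(is_constant_stride(gather)) # irregular accesses
```

In this sketch the row walk and the column walk are both constant-stride, although only the row walk touches adjacent elements, while the indirect access is not constant-stride at all. Which loop ends up innermost therefore determines the stride seen by the prefetcher and the vector unit.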
Current trends in computer architecture increase the utilization of vector units on a chip. Modern multiple-core computer architectures, which include general purpose multi-core architectures and specialized parallel architectures such as the IBM Cell Broadband Engine, Intel Xeon processors with SSE4 SIMD instructions, Intel Many Integrated Core Architecture with AVX extensions and NVIDIA Graphics Processing Units (GPUs), have very high computation power per chip thanks to the use of wide vector units. Current and future architectures are increasingly evolving towards heterogeneous mixes of general purpose and specialized parallel architectures. Such an execution model comes with the need for the application to jointly exhibit parallelism, locality and constant-strided memory accesses. Increased parallelism may be obtained by explicitly writing programs with more parallelism or by using auto-parallelizing compilers.
While programming such systems by hand has been demonstrated for a range of applications, this is a difficult and costly endeavor, and one that likely must be revisited each time the application is ported to rapidly arriving new generations and configurations of heterogeneous architectures and programming abstractions that change the optimization tradeoffs.
Even when programming models are explicitly parallel (threads, data parallelism, vectors), they usually rely on advanced compiler technology to relieve the programmer from scheduling and mapping the application to computational cores and from understanding the memory model and communication details. Even when provided with enough static information or annotations (OpenMP directives, pointer aliasing, separate compilation assumptions), compilers have a hard time exploring the huge and unstructured search space associated with these mapping and optimization challenges. Indeed, the task of the compiler can hardly be called optimization anymore, in the traditional meaning of reducing the performance penalty entailed by the level of abstraction of a higher-level language. Together with the run-time system (whether implemented in software or hardware), the compiler is responsible for most of the combinatorial code generation decisions needed to map the simplified and ideal operational semantics of the source program to the highly complex and heterogeneous machine.
The polyhedral model is a powerful framework to unify coarse-grained and fine-grained parallelism extraction with locality and constant-strided memory access optimizations. To date, this promise has not been completely fulfilled, as no existing affine scheduling and constant-strided memory technique can perform all these optimizations in a unified (i.e., non-phase-ordered) and unbiased manner. Typically, parallelism optimization algorithms optimize for degrees of parallelism, but cannot be used to also optimize for constant-strided memory accesses. In like manner, algorithms used for data layout transformations reshape the position of data elements in memory but cannot also be used for extracting parallelism and locality. Additional difficulties arise when optimizing source code for the particular architecture of a target computing apparatus.
Therefore, there exists a need for improved source code optimization methods and apparatus that can jointly optimize scheduling and constant-stride memory accesses at multiple levels of the heterogeneous hardware hierarchy.