The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads per processing core and have access to multiple types of memories that require specific know-how to be fully utilized. One limitation of these architectures is that current commercially available compilers cannot efficiently take advantage of the different constraints imposed by the different types of memories.
In the software design and implementation process, compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously; this requires the optimizer to combine multiple program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality to reduce the ever-widening disparity between memory bandwidth and the frequency of processors: the memory wall. The tension between parallelism and locality of memory references is an important topic in the field of compiler optimization. More parallelism allows more concurrent execution of the parallel portions of a program, and thus implicitly translates into more available computational operations per second. Increasing locality directly translates into reduced communication between memories and processing elements. Typically, however, the portions of a program that may be executed in parallel are not interdependent, and as such these portions together may access non-local data or data that are distributed throughout the memory. Because of these program semantics constraints, increasing parallelism may decrease locality and vice versa.
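The tension described above can be illustrated with a small sketch. The functions `f`, `distributed`, and `fused` are hypothetical names chosen for illustration only, and Python stands in here for whatever loop-based source language a compiler would actually process. The distributed form exposes two loops whose iterations are all independent, while the fused form consumes each value immediately after it is produced, at the price of an ordering between consecutive iterations.

```python
def f(i):
    # Hypothetical per-element computation, used only for illustration.
    return 3 * i + 1


def distributed(n):
    """Two separate loops: within each loop, all iterations are independent."""
    a = [0] * n
    b = [0] * n
    for i in range(n):
        # Parallelizable: no iteration reads a value written by another.
        a[i] = f(i)
    for i in range(1, n):
        # Parallelizable as well, but the whole array a[] was written out
        # by the first loop and must now be read back from memory.
        b[i] = a[i] + a[i - 1]
    return b


def fused(n):
    """One loop: each value is reused right after it is produced (locality),
    but iteration i needs the value produced by iteration i - 1."""
    b = [0] * n
    prev = f(0)
    for i in range(1, n):
        cur = f(i)
        b[i] = cur + prev  # 'prev' comes from the previous iteration
        prev = cur         # loop-carried value: iterations are now ordered
    return b
```

Both functions compute the same result; they differ only in how much parallelism and how much reuse they expose, which is the choice an optimizer would have to make.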
An additional architectural phenomenon related to both parallelism and the memory wall is the ability of processors to better process data elements whose addresses in memory are properly organized. Such organization of memory accesses allows the executing program to take advantage of multiple banks of memory, which increase the raw memory bandwidth available to processors, as well as local memory regions, which exhibit lower latency than main memory. This additional memory organization constraint conflicts with parallelism and locality in the sense that programs with good parallelism and locality may not exhibit proper organization of memory accesses for the purpose of bandwidth and latency optimization, and vice versa.
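As a rough illustration of why the organization of addresses matters, the following sketch assumes a simple interleaved bank model in which address `addr` lives in bank `addr % num_banks`. That mapping, the constant `NUM_BANKS`, and the helper `worst_bank_load` are assumptions made for this example and do not describe any particular hardware. Under this model, a group of simultaneous accesses is served fastest when it spreads evenly over the banks.

```python
from collections import Counter


def worst_bank_load(addresses, num_banks):
    """Return the largest number of simultaneous accesses hitting one bank,
    assuming (hypothetically) that address addr maps to bank addr % num_banks."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    return max(per_bank.values())


NUM_BANKS = 4  # assumed bank count, for illustration only

# Four concurrent accesses to consecutive addresses: one access per bank.
unit_stride = [i for i in range(4)]

# Four concurrent accesses whose stride equals the bank count:
# every access falls into the same bank and must be serialized.
bank_stride = [i * NUM_BANKS for i in range(4)]

unit_load = worst_bank_load(unit_stride, NUM_BANKS)
conflict_load = worst_bank_load(bank_stride, NUM_BANKS)
```

In this model the unit-stride pattern keeps every bank equally busy, whereas the strided pattern funnels all accesses through a single bank, even though both patterns touch the same number of elements.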
Current trends in computer architecture amplify the utilization of private local memories on a chip and shared memory across multiple chips. Modern general-purpose multi-core architectures exhibit a private first-level cache and shared second- and third-level caches. Specialized parallel architectures such as the IBM Cell Broadband Engine and NVIDIA Graphics Processing Units (GPUs) exhibit both shared and private memory regions that must be explicitly programmed: the IBM Cell BE has a globally shared memory and local scratchpad memories that are accessible through DMA calls, while NVIDIA GPUs have a globally shared device memory (the main memory), locally shared memory, and locally private memory (the registers). Current and future architectures are increasingly evolving towards heterogeneous mixes of general-purpose and specialized parallel architectures. Such an execution model comes with the need for the application to properly manage data transfers between shared memory regions and private memory regions. Even when a partitioned global address space or a machine-wide memory coherence mechanism is available, performance and energy requirements dictate that the transfers be optimized explicitly.
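The kind of explicit transfer management described above can be sketched with a toy model. In the following example, a Python list slice stands in for a bulk transfer (such as a DMA call or a copy into locally shared memory), and a counter `loads` stands in for traffic to the globally shared memory. The three-point stencil, the function names, and the cost model are all assumptions chosen for illustration, not a description of any specific architecture or compiler.

```python
def stencil_direct(src):
    """Each output reads its three inputs directly from 'global' memory."""
    n = len(src)
    out = [0] * n
    loads = 0
    for i in range(1, n - 1):
        out[i] = src[i - 1] + src[i] + src[i + 1]
        loads += 3  # three global reads per output element
    return out, loads


def stencil_tiled(src, tile):
    """Each tile is first copied (with a one-element halo on each side) into
    a 'local' buffer, and all reuse then happens inside that buffer."""
    n = len(src)
    out = [0] * n
    loads = 0
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        local = src[start - 1:end + 1]  # one bulk transfer, halo included
        loads += len(local)             # each element is fetched once per tile
        for i in range(start, end):
            j = i - (start - 1)         # position of src[i] inside 'local'
            out[i] = local[j - 1] + local[j] + local[j + 1]
    return out, loads


data = list(range(10))
direct_out, direct_loads = stencil_direct(data)
tiled_out, tiled_loads = stencil_tiled(data, tile=4)
```

Both versions produce the same output, but in this model the tiled version fetches each input element roughly once per tile instead of three times, which is the sort of reuse that explicit transfer optimization aims to exploit.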
While programming such systems by hand has been demonstrated for a range of applications, it is a difficult and costly endeavor, and one that will likely have to be repeated each time the application is ported to the rapidly arriving new generations and configurations of heterogeneous architectures and programming abstractions that change the optimization tradeoffs.
Even when programming models are explicitly parallel (threads, data parallelism, vectors), they usually rely on advanced compiler technology to relieve the programmer from scheduling and mapping the application to computational cores and from understanding the memory model and communication details. Even when provided with enough static information or annotations (OpenMP directives, pointer aliasing, separate compilation assumptions), compilers have a hard time exploring the huge and unstructured search space associated with these mapping and optimization challenges. Indeed, the task of the compiler can hardly be called optimization anymore, in the traditional meaning of reducing the performance penalty entailed by the level of abstraction of a higher-level language. Together with the run-time system (whether implemented in software or hardware), the compiler is responsible for most of the combinatorial code generation decisions needed to map the simplified and ideal operational semantics of the source program to the highly complex and heterogeneous machine.
The polyhedral model is a powerful framework for unifying parallelism and locality extraction with memory access optimizations. To date, however, its promise has not been completely fulfilled, as no existing technique can perform advanced communication optimization that exploits reuse opportunities to reduce the overall cost of data transfers. Typically, memory and communication optimization algorithms try to minimize the size of local memory and to hide communication latencies behind computations. Additional difficulties arise when optimizing source code for the particular architecture of a target computing apparatus with multiple types of memories.
Therefore, there exists a need for improved source code optimization methods and apparatus that can optimize communication reuse at multiple levels of the heterogeneous hardware hierarchy.