The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads of execution per processing core. One limitation of these architectures is that current commercially available compilers cannot efficiently take advantage of the increase in computational resources.
In the software design and implementation process, compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously; this requires the optimizer to combine multiple program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality to reduce the ever-widening disparity between memory bandwidth and the frequency of processors: the memory wall. Indeed, the speed and bandwidth of the memory subsystems have always been a bottleneck, and this bottleneck worsens when moving to multi-core architectures. The memory wall is further exacerbated by non-contiguous memory accesses.
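The tradeoff between parallelism and locality can be illustrated with a minimal C sketch. The function names, the array names and the computation itself are hypothetical and chosen only for illustration; they are not taken from any particular compiler or application.

```c
#include <assert.h>
#include <stddef.h>

/* Distributed form (hypothetical example): two separate loops.
 * The iterations of each loop are independent of one another, so each
 * loop could be run in parallel. However, b[] is written in full by the
 * first loop and read back in full by the second; for large n the early
 * elements may have left the cache before they are reused. */
void smooth_distributed(const double *a, double *b, double *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
    c[0] = b[0];
    for (size_t i = 1; i < n; i++)
        c[i] = b[i] + b[i - 1];
}

/* Fused form (hypothetical example): one loop.
 * b[i] and b[i - 1] are consumed right after they are produced, which
 * improves locality. Iteration i now reads a value written by iteration
 * i - 1, so this loop as written cannot simply be run in parallel. */
void smooth_fused(const double *a, double *b, double *c, size_t n)
{
    assert(n > 0);
    b[0] = 2.0 * a[0];
    c[0] = b[0];
    for (size_t i = 1; i < n; i++) {
        b[i] = 2.0 * a[i];
        c[i] = b[i] + b[i - 1];
    }
}
```

Both functions compute the same values of b and c. Which form performs better depends on the number of cores, the cache sizes and the problem size, which is the kind of decision the optimizer is expected to make.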
On many architectures, the order in which memory locations are read and written has a profound effect on how the accesses are issued in hardware. Bad memory access patterns may reduce the achieved memory bandwidth by a factor of several. Since optimization problems are associated with huge and unstructured search spaces, the combinatorial task of optimizing a program while balancing these hardware requirements is poorly achieved by current compilers, resulting in weak scalability and disappointing sustained performance.
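The effect of access order can be seen in a minimal C sketch that sums the same row-major matrix in two different loop orders. The function names and the flat array layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an n x m matrix stored in row-major order, visiting elements in
 * the order they are laid out in memory (stride of one element). */
double sum_row_order(const double *a, size_t n, size_t m)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            s += a[i * m + j];
    return s;
}

/* Same sum with the two loops interchanged. Consecutive accesses are
 * now m elements apart, so for large m each access tends to touch a
 * different cache line. */
double sum_column_order(const double *a, size_t n, size_t m)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * m + j];
    return s;
}
```

Both functions visit every element exactly once and differ only in traversal order. On large matrices the contiguous version is typically faster, although the size of the gap depends on the memory hierarchy of the target machine.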
Even when programming models are explicitly parallel (threads, data parallelism, vectors), they usually rely on advanced compiler technology to relieve the programmer from scheduling and mapping the application to computational cores, and from understanding the memory model and communication details. Even when provided with enough static information or annotations (OpenMP directives, pointer aliasing, separate compilation assumptions), compilers have a hard time exploring the huge and unstructured search space associated with these mapping and optimization challenges. Indeed, the task of the compiler can hardly be called optimization anymore, in the traditional meaning of reducing the performance penalty entailed by the level of abstraction of a higher-level language. Together with the run-time system (whether implemented in software or hardware), the compiler is responsible for most of the combinatorial code generation decisions needed to map the simplified and ideal operational semantics of the source program to the highly complex and heterogeneous machine.
Current trends in computer architecture increase the utilization of multiple processor cores on a chip. Modern multi-core computer architectures, which include general purpose multi-core architectures and specialized parallel architectures such as the Cell Broadband Engine and Graphics Processing Units (GPUs), have very high computation power per chip. Current and future architectures are increasingly evolving towards heterogeneous mixes of general purpose and specialized parallel architectures. One architectural concept of particular interest is the massively multi-threaded execution model. In this model, a large number of virtual threads of execution are mapped to a multiplicity of physical execution units. These virtual threads can be quickly switched in and out of the execution unit by the hardware runtime. In particular, when a long latency memory access is requested, another thread is scheduled to hide the latency of the memory access. Such an execution model requires the application to exhibit enough parallelism. Increased parallelism may be obtained by explicitly writing programs with more parallelism or by using auto-parallelizing compilers.
While programming such systems by hand has been demonstrated for a range of applications, this is a difficult and costly endeavor, and one that is likely to be revisited each time the application is ported to rapidly arriving new generations and configurations of heterogeneous architectures and programming abstractions that change the optimization tradeoffs. Recent programming models and abstractions include but are not limited to Partitioned Global Address Space (PGAS), Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). In addition to the memory wall, the application developer is also confronted with a programmability wall and is responsible for writing a correct parallel application using one of these recent programming abstractions. Obtaining reasonable performance is an additional difficult task best left to a compiler.
The polyhedral model is a powerful framework to unify coarse-grained and fine-grained parallelism extraction with locality and communication contiguity optimizations. To date, this promise has not been completely fulfilled, as no existing affine scheduling, fusion and communication contiguity technique can perform all these optimizations in a unified (i.e., non-phase ordered) and unbiased manner. Typically, parallelism optimization algorithms optimize for degrees of parallelism, but cannot be used to optimize both locality and contiguity of communications. In like manner, algorithms used for locality optimization cannot be used both for extracting parallelism and optimizing the contiguity of communications. Additional difficulties arise when optimizing source code for the particular architecture of a target computing apparatus.
Therefore, there exists a need for improved source code optimization methods and apparatus that can optimize parallelism, locality and contiguity of memory accesses at multiple levels of the heterogeneous hardware hierarchy.