The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads per processing core. One limitation of these architectures is that current commercially available compilers cannot efficiently take advantage of the increase in computational resources.
In the software design and implementation process, compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously; this requires the optimizer to combine multiple program transformations. For instance, there is often a trade-off between exploiting parallelism and exploiting locality to mitigate the “memory wall”, i.e., the ever-widening disparity between memory bandwidth and the frequency of processors. Indeed, the speed and bandwidth of the memory subsystems are a performance bottleneck for the vast majority of computers, including single-core computers. Since traditional program optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved by current compilers, resulting in poor scalability of the compilation process and disappointing sustained performance of the supposedly optimized program.
Even when programming models are explicitly parallel (threads, data parallelism, vectors), they usually rely on advanced compiler technology to relieve the programmer from scheduling and mapping the application to computational cores, and from understanding the memory model and communication details. Even provided with enough static information and code annotations (OpenMP directives, pointer aliasing, separate compilation assumptions), traditional compilers have a hard time exploring the huge and unstructured search space associated with the mapping and optimization challenges. Indeed, the task of the compiler can hardly be called “optimization” anymore, in the traditional meaning of reducing the performance penalty entailed by the level of abstraction of a higher-level language. Together with the run-time system (whether implemented in software or hardware), the compiler is responsible for most of the combinatorial code generation decisions to map the simplified and ideal operational semantics of the source program to a highly complex and heterogeneous target machine.
Generating efficient code for deep parallelism and deep memory hierarchies with complex and dynamic hardware components is a difficult task. The compiler (along with the run-time system) now has to take on the burden of much more sophisticated tasks that only expert programmers would be able to carry out. In order to exploit parallelism, the first necessary step is to compute a representation which models the producer/consumer relationships of a program as closely as possible. The power of an automatic optimizer or parallelizer greatly depends on its capacity to decide whether two portions of the program execution may be run one after another on the same processing element or on different processing elements, or at the same time (“in parallel”). Such knowledge is related to the task of dependence analysis, which aims at precisely disambiguating memory references. One issue is to statically form a compact description of the dynamic properties of a program. This problem is undecidable in general, so approximations have to be made.
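As an illustration of the producer/consumer relationships mentioned above, the following minimal sketch lists flow (read-after-write) dependences between iterations of a single loop. It is a brute-force enumeration over a small, fixed trip count, not a static analysis, and the names `flow_dependences`, `write_index` and `read_index` are hypothetical helpers introduced only for this example.

```python
from itertools import product


def flow_dependences(n, write_index, read_index):
    """Return pairs (i, j) with i < j where iteration j reads the
    array cell that the earlier iteration i wrote."""
    pairs = []
    for i, j in product(range(n), repeat=2):
        if i < j and write_index(i) == read_index(j):
            pairs.append((i, j))
    return pairs


# Loop A:  for i in range(6): a[i] = a[i] + 1
# Each iteration touches only its own cell, so no iteration consumes
# a value produced by another one; the iterations may run in parallel.
loop_a = flow_dependences(6, lambda i: i, lambda i: i)

# Loop B:  for i in range(6): a[i + 1] = a[i] * 2
# Iteration i produces the cell that iteration i + 1 consumes, so the
# iterations must run one after another.
loop_b = flow_dependences(6, lambda i: i + 1, lambda i: i)

print(loop_a)
print(loop_b)
```

An empty list for loop A means its iterations can be distributed across processing elements, whereas the chain of pairs found for loop B forces sequential execution. A real compiler has to reach such conclusions symbolically, without enumerating iterations.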
Once dependence analysis has been performed, a compiler applies program transformations to the code with respect to different, sometimes conflicting, performance criteria. Any program transformation must ultimately respect the dependence relations in order to guarantee the correct execution of the program. An important class of transformations targets the loop nests of a program (such as “DO” loops in the FORTRAN language, and “for” and “while” loops in languages derived from the C language), which are known to account for the most compute-intensive parts of many programs.
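The requirement that a transformation respect the dependence relations can be sketched as a simple check: in the transformed execution order, the source of every dependence must still run before its target. The example below tests loop interchange on a small two-deep nest. The statement shape `a[i][j] = f(a[i - di][j - dj])` and the helper names are assumptions made for illustration only.

```python
def dependence_pairs(n, di, dj):
    """(source, target) iteration pairs for the hypothetical statement
    a[i][j] = f(a[i - di][j - dj]) inside an n x n loop nest."""
    pairs = []
    for i in range(n):
        for j in range(n):
            si, sj = i - di, j - dj
            if 0 <= si < n and 0 <= sj < n:
                pairs.append(((si, sj), (i, j)))
    return pairs


def is_legal(pairs, schedule):
    """A schedule maps an iteration to its position in the new order.
    Tuples compare lexicographically, like nested loop counters."""
    return all(schedule(src) < schedule(dst) for src, dst in pairs)


def original_order(it):
    return it


def interchanged_order(it):
    i, j = it
    return (j, i)  # the j loop becomes the outer loop


diagonal = dependence_pairs(4, 1, 1)        # reads a[i - 1][j - 1]
anti_diagonal = dependence_pairs(4, 1, -1)  # reads a[i - 1][j + 1]

print(is_legal(diagonal, interchanged_order))       # True
print(is_legal(anti_diagonal, interchanged_order))  # False
```

Interchange is legal for the first statement because every source iteration still precedes its target, while for the second statement it would make some iterations read a cell before it has been produced.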
Traditional optimizing compilers perform syntactic transformations, i.e., transformations based on a representation that reflects the way the program source code text was written, such as the Abstract Syntax Tree. This makes the optimizations brittle, since they are highly dependent on the way that the input program is written, as opposed to the more abstract representation of the program's execution offered by the polyhedral model. Moreover, syntactic transformations are not amenable to global optimizations, since the problem of optimally ordering elementary syntactic transformations is as yet unsolved. Many interesting optimizations are also not available, such as fusion of loops with different bounds or tiling of imperfectly nested loops.
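To make “fusion of loops with different bounds” concrete, the sketch below shows two loops with different trip counts and a hand-written fused version that uses a guard. It only illustrates what the result of such a transformation might look like; it is not the output of any particular compiler, and it assumes `m <= len(a)` so that the second loop reads only cells the first loop has produced.

```python
def separate_loops(a, m):
    """Two loops with different bounds (len(a) and m)."""
    n = len(a)
    b = [0] * n
    c = [0] * m
    for i in range(n):
        b[i] = 2 * a[i]
    for i in range(m):
        c[i] = b[i] + 1
    return b, c


def fused_loop(a, m):
    """One loop over the larger bound; the guard restricts the second
    statement to its original iteration range (assumes m <= len(a))."""
    n = len(a)
    b = [0] * n
    c = [0] * m
    for i in range(n):
        b[i] = 2 * a[i]
        if i < m:
            # b[i] was written just above, so it is consumed while it
            # is still "hot" instead of being reloaded in a later loop.
            c[i] = b[i] + 1
    return b, c


data = [3, 1, 4, 1, 5, 9]
print(separate_loops(data, 4) == fused_loop(data, 4))  # True
```

The fused form improves locality because each `b[i]` is consumed immediately after it is produced, yet both versions compute the same arrays.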
In some situations, such as in high performance signal and image processing, the applications may primarily operate on “dense” matrices and arrays. This class of applications primarily consists of do-loops with loop bounds that are affine functions of outer indices and parameters, and array indexing functions that are affine functions of loop indices and parameters. Other classes of programs can be approximated by that class of programs.
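A loop nest of the kind described above can be sketched by writing each bound and each array subscript as an affine function, i.e., a linear combination of the loop indices `i` and `j`, a parameter `n`, and a constant. The particular coefficients below, and the array reference `a[i + j][n - 1 - j]`, are arbitrary choices made for illustration.

```python
def affine(coeffs, i, j, n):
    """Evaluate ci*i + cj*j + cn*n + c0."""
    ci, cj, cn, c0 = coeffs
    return ci * i + cj * j + cn * n + c0


# Bounds of the inner loop: 0 <= j <= i (affine in the outer index i).
LOWER_J = (0, 0, 0, 0)
UPPER_J = (1, 0, 0, 0)

# Array reference a[i + j][n - 1 - j]: one affine function per subscript.
ACCESS = ((1, 1, 0, 0), (0, -1, 1, -1))


def accessed_cells(n):
    """Enumerate the cells touched by the loop nest
    for i in 0..n-1: for j in 0..i: use a[i + j][n - 1 - j]."""
    cells = []
    for i in range(n):
        lo = affine(LOWER_J, i, 0, n)
        hi = affine(UPPER_J, i, 0, n)
        for j in range(lo, hi + 1):
            cells.append(tuple(affine(row, i, j, n) for row in ACCESS))
    return cells


print(accessed_cells(3))
# [(0, 2), (1, 2), (2, 1), (2, 2), (3, 1), (4, 0)]
```

Because every bound and subscript is described by a small table of coefficients, the iteration space and the set of accessed cells can be reasoned about symbolically for any value of `n`, rather than by enumeration as done here.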
One significant area of concern in these large scale systems is memory management. For example, in a program, a large multi-dimensional array may be allocated and used to store data. This large block of data is typically stored in memory in contiguous memory cells. Certain operations on the array may not access all portions of the data. For example, in nested loops, an outer loop may be indexed by the columns of the array and an inner loop may be indexed by the rows of the array. In a situation where the loop operation only accesses a portion of the elements of the array, it would be inefficient to transfer the entire array to a processing element that is assigned the access task. Further, since portions of the array are not accessed, the loop indices may be rewritten so that the loop addresses only a smaller, locally stored copy of the accessed portion on a processing element.
There have been a number of approaches used to implement these program transformations. Typical goals of these approaches include reducing the memory size requirements to increase the amount of useful data in local memory and to reduce communication volumes. One such algorithm is described in U.S. Pat. No. 6,952,821 issued to Schreiber. Schreiber's method is applicable to non-parametric rectangular iteration spaces and employs the Lenstra-Lenstra-Lovász (LLL) lattice basis reduction algorithm. Schreiber's methods are additionally incapable of handling non-convex sets of accessed data.
Therefore, a need exists for more efficient compiler architectures that optimize the compilation of source code.