The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads per processing core. One limitation of these architectures is that many of the current commercially available compilers do not take advantage of the increased computational resources, e.g., multiple processors, multiple cores, etc.
In the software design and implementation process, compilers are typically responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena usually occur and interact simultaneously, requiring the optimizer to combine various program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality of memory references to reduce the ever-widening disparity between memory bandwidth and the processing capacity of the system, a disparity commonly known as the memory wall. Balancing the tension between parallelism and locality of memory references is important in compiler optimization.
More parallelism may allow more concurrent execution of the parallel portions of a program. Additional parallelism usually translates into more computational operations executed per second, often increasing the performance of a program. On the other hand, increasing locality generally translates directly into reduced communication between memories and processing elements, which reduces the memory bandwidth required to execute the program. Because of constraints imposed by program semantics, increasing parallelism typically decreases locality and increases the required bandwidth, while increasing locality of memory references generally results in decreased parallelism.
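The tension described above can be illustrated with a small sketch. The two functions below are hypothetical examples, not taken from any particular compiler: they compute the same result, once as two separate loops (more parallelism, more memory traffic) and once as a single fused loop (better locality, but with a loop-carried dependence). The names `distributed` and `fused` and the arithmetic are assumptions chosen only for illustration.

```python
def distributed(b):
    """Two loops: each has independent iterations, but the whole
    intermediate array `a` is written to memory and then read back."""
    n = len(b)
    a = [0.0] * n
    c = [0.0] * n
    for i in range(n):          # iterations independent of each other
        a[i] = 2.0 * b[i]
    for i in range(1, n):       # iterations independent of each other
        c[i] = a[i] + a[i - 1]
    return c


def fused(b):
    """One loop: the intermediate value stays in a scalar (good locality),
    but iteration i now needs the value produced by iteration i - 1."""
    n = len(b)
    c = [0.0] * n
    if n == 0:
        return c
    prev = 2.0 * b[0]
    for i in range(1, n):
        cur = 2.0 * b[i]
        c[i] = cur + prev
        prev = cur              # loop-carried dependence on `prev`
    return c
```

In this sketch the fused form never materializes the intermediate array, which lowers the bandwidth requirement, yet as written its iterations can no longer run concurrently. Other formulations are possible (for example, recomputing the neighboring value in each iteration), each with its own cost.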
In determining a good parallel schedule of a program, compilers are often limited by memory-based dependencies. These dependencies do not always directly contribute to the flow of values read and written while performing the computations required by the program. Sometimes, these dependencies arise when multiple temporary results must be stored in memory at the same time, thereby limiting the amount of parallelism in the program. Techniques to lessen the impact of such dependencies have been studied, but they are subject to phase ordering issues. For instance, array privatization requires the loop to already be in near-parallel form (i.e., it must not have any loop-carried dependencies) as a result of prior scheduling decisions. On the other hand, techniques for performing array expansion and conversion to single assignment form suffer from increased memory usage and require additional techniques, such as array contraction, to reduce the memory footprint.
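As a minimal sketch of a memory-based dependency and of array expansion, consider the two hypothetical functions below. The function names, the scratch buffer `tmp`, and the arithmetic are assumptions made for illustration only. In the first version every iteration reuses the same scratch storage, so iterations conflict on memory even though no value flows from one iteration to the next. In the second version `tmp` is expanded with an extra dimension indexed by the loop counter, which removes the conflict at the cost of more memory.

```python
def shared_tmp(a):
    """All iterations reuse one two-element scratch buffer. The only
    dependencies between iterations come from reusing that storage."""
    n = len(a)
    out = [0.0] * n
    tmp = [0.0, 0.0]                      # 2 scratch cells in total
    for i in range(n):
        tmp[0] = a[i] * 2.0
        tmp[1] = a[i] + 1.0
        out[i] = tmp[0] + tmp[1]
    return out


def expanded_tmp(a):
    """Array expansion: each iteration gets its own scratch cells, so
    producing and consuming can be split into two independent loops."""
    n = len(a)
    out = [0.0] * n
    tmp = [[0.0, 0.0] for _ in range(n)]  # 2 * n scratch cells in total
    for i in range(n):                    # no storage shared across i
        tmp[i][0] = a[i] * 2.0
        tmp[i][1] = a[i] + 1.0
    for i in range(n):                    # no storage shared across i
        out[i] = tmp[i][0] + tmp[i][1]
    return out
```

Splitting the loop this way would be incorrect with the shared buffer, because the second loop would see only the values written by the last iteration. The expanded version permits the split, but its scratch storage grows from 2 cells to 2n cells, which is the kind of memory increase that array contraction is later used to reduce.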
In general, the algorithms that optimize for parallelism allow for degrees of parallelism but cannot be used to control the amount of memory usage. In like manner, algorithms used for array privatization, array expansion, and array contraction generally depend on a given schedule and cannot be used for extracting or improving parallelism. Therefore, there exists a need for improved systems and methods for source-code compilation.