From the very beginning of the computer industry, there has been a constant demand for improving the performance of systems, either to run applications faster than before or to run applications that can produce results in an acceptable time frame. One method for improving the performance of computer systems is to have the system run applications or portions of an application (e.g., a thread) in parallel with one another on a system having multiple processors. In order to run an application or thread in parallel, the application or thread must be independent, that is, it cannot depend on the results produced by another application or thread.
The process of specifying which applications or threads can be run in parallel with one another may be referred to as partitioning. One particular type of partitioning used by current systems is referred to as affine partitioning. An affine partition may be used to uniformly represent many program transformations, such as loop interchange, loop reversal, loop skewing, loop fusion, and statement reordering. Further, space partitioning in an affine partitioning framework may be used to parallelize code for multiprocessor systems.
In general, an affine partition typically comprises a linear transformation and a translation of a vector or matrix operation within one or more loops, including transformation and translation of loop index variables. Various loop manipulations may be performed using affine partitioning. For example, loop interchange, loop reversal and loop skewing may be represented by linear transformations and translations performed in affine partitioning. The affine partition is an extension of the unimodular transformation in use by current compilers. The affine partition extends the concept of unimodular transformation by:
        i) Allowing each statement to have its own linear transformation. In comparison, for a unimodular transformation, all statements inside the loop body share one linear transformation.
        ii) Allowing the partitioning to be applied to general loop structures. In comparison, a unimodular transformation can only be applied to perfectly nested loops.
        iii) Allowing the degree of the linear transformation to be less than the nesting level of loops in the program. That is, the transformation matrix may be a non-square matrix. In contrast, a unimodular transformation requires the transformation matrix to be a square matrix.
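As an illustration only, the linear transformation and translation described above can be sketched as an integer map of the form A * index + b applied to a loop iteration vector (i, j). The function name affine_map, the loop bound N, the sample matrices, and the statement labels S1 and S2 are assumptions chosen for this sketch and do not come from any particular compiler.

```python
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[int]]


def affine_map(A: Matrix, b: Sequence[int], index: Sequence[int]) -> Tuple[int, ...]:
    """Return A * index + b for an integer iteration vector."""
    return tuple(
        sum(a * x for a, x in zip(row, index)) + offset
        for row, offset in zip(A, b)
    )


N = 4  # assumed trip count of the outer loop, i in 0..N-1

# Square (2x2) maps corresponding to classic loop transformations.
INTERCHANGE = ([[0, 1], [1, 0]], [0, 0])    # (i, j) -> (j, i)
REVERSAL = ([[-1, 0], [0, 1]], [N - 1, 0])  # (i, j) -> (N-1-i, j)
SKEW = ([[1, 0], [1, 1]], [0, 0])           # (i, j) -> (i, i+j)

# Non-square (1x2) map: fewer rows than loop levels, as in item iii).
# Here it is read as a space partition, processor = i.
SPACE_PARTITION = ([[1, 0]], [0])

# One map per statement, as in item i). S1 and S2 are hypothetical
# statements in the same loop body; S2 is shifted by one processor.
STATEMENT_MAPS = {
    "S1": ([[1, 0]], [0]),
    "S2": ([[1, 0]], [1]),
}

point = (1, 2)
results = {
    "interchange": affine_map(*INTERCHANGE, point),
    "reversal": affine_map(*REVERSAL, point),
    "skew": affine_map(*SKEW, point),
    "space": affine_map(*SPACE_PARTITION, point),
    "S1": affine_map(*STATEMENT_MAPS["S1"], point),
    "S2": affine_map(*STATEMENT_MAPS["S2"], point),
}
```

The three square maps are unimodular (determinant of plus or minus one). The space partition and the per-statement maps are the cases that a single square matrix shared by the whole loop body cannot express.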
While affine partitioning has provided benefits in producing code that can be run in parallel on multiprocessor systems, there remain significant issues. For example, successive iterations of a loop in a program will often make contiguous accesses to memory. Before partitioning, memory accesses of successive iterations of a loop may be near one another, resulting in a high likelihood that a memory reference will be available in faster cache memory. However, in affine partitioning, instances of instructions in a loop may be divided across multiple processors, and code may be transformed such that memory access patterns are much different than prior to partitioning. As a result, it is more likely that memory accesses will no longer be contiguous or near one another, and in fact memory accesses may be quite far from one another. In this case, there is a higher likelihood of a cache miss, thereby increasing the time required to access memory. Thus some or all of the performance gains realized by executing instructions in parallel may be lost to the increased memory access times caused by cache misses.
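The locality effect described above can be made concrete with a toy model. The sketch below assumes a row-major two-dimensional array, a cache line holding LINE elements, and a hypothetical partition that assigns each column to its own processor. It counts how many accesses in a sequence land in a different cache line than the access before them. It is not a model of any real cache, and the sizes are arbitrary.

```python
from typing import Iterable

ROWS, COLS = 8, 8  # assumed array shape
LINE = 4           # assumed number of elements per cache line


def row_major_address(i: int, j: int, cols: int) -> int:
    """Element offset of a[i][j] in a row-major array."""
    return i * cols + j


def new_line_accesses(addresses: Iterable[int], line_size: int) -> int:
    """Count accesses that touch a different line than the previous access."""
    count = 0
    previous_line = None
    for address in addresses:
        line = address // line_size
        if line != previous_line:
            count += 1
            previous_line = line
    return count


# Original loop nest: i outer, j inner, so addresses are contiguous.
original = [
    row_major_address(i, j, COLS) for i in range(ROWS) for j in range(COLS)
]

# Hypothetical partition: processor 0 executes every iteration with j == 0,
# so its successive accesses are COLS elements apart.
processor0 = [row_major_address(i, 0, COLS) for i in range(ROWS)]

original_ratio = new_line_accesses(original, LINE) / len(original)
partitioned_ratio = new_line_accesses(processor0, LINE) / len(processor0)
```

Under these assumptions, one in four accesses of the original loop starts a new cache line, whereas every access made by processor 0 after partitioning does, which is the kind of locality loss the paragraph above describes.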