As contemporary processor design approaches its physical limits, current technology is turning more and more to parallel processing to speed up computing. One way to accomplish parallel processing is to have developers write code designed for parallel operation. Another way to accomplish parallel processing (to an extent) is to have the program compiler locate code that can be parallelized, e.g., two sets of code whose relative execution order does not matter because neither set relies on the results of the other's execution. The compiler can then arrange such code for parallel execution on an appropriate multiprocessor machine.
In high-performance compilers, one standard optimization is “tiling” (also known as “blocking”), in which a loop nest is transformed into an equivalent loop nest of tiles with a different iteration order and better cache locality. For example, consider a program that includes code with a loop of ten thousand iterations nested inside another loop of ten thousand iterations, creating a ten-thousand by ten-thousand iteration space:
for (i = 0; i < 10000; i++) {
  for (j = 0; j < 10000; j++) {
    b[i,j] = a[i-1,j] + a[i,j] + a[i+1,j];
  }
}
Such a loop nest may be transformed into equivalent code having fewer iterations per loop but more nested loops: four nested loops of one hundred iterations each, in which the two outer loops step from tile to tile and the two inner loops traverse the iterations within a single tile. Example code for this equivalent loop nest is set forth below:
for (ii = 0; ii < 10000; ii += 100) {
  for (jj = 0; jj < 10000; jj += 100) {
    for (i = ii; i < ii + 100; i++) {
      for (j = jj; j < jj + 100; j++) {
        b[i,j] = a[i-1,j] + a[i,j] + a[i+1,j];
      }
    }
  }
}
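The equivalence of the two loop nests can be checked with a small self-contained sketch in C. This is only an illustration under stated assumptions, not code from the original program. The function names (stencil_plain, stencil_tiled, tiled_matches_plain), the flattened row-major arrays, and the clamping of partial tiles are all hypothetical. The sketch also computes only the interior rows, so that a[i-1] and a[i+1] stay in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Untiled stencil: b[i][j] = a[i-1][j] + a[i][j] + a[i+1][j].
 * Assumption: arrays are n*n, row-major, and only interior rows
 * 1..n-2 are written so that a[i-1] and a[i+1] stay in bounds. */
void stencil_plain(size_t n, const double *a, double *b) {
    if (n < 3) return; /* no interior rows to compute */
    for (size_t i = 1; i < n - 1; i++) {
        for (size_t j = 0; j < n; j++) {
            b[i * n + j] = a[(i - 1) * n + j] + a[i * n + j] + a[(i + 1) * n + j];
        }
    }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled stencil: the same iterations, visited one tile at a time.
 * The two outer loops select a tile; the two inner loops walk it.
 * Tile ends are clamped so n need not be a multiple of tile. */
void stencil_tiled(size_t n, size_t tile, const double *a, double *b) {
    assert(tile > 0);
    if (n < 3) return;
    for (size_t ii = 1; ii < n - 1; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            size_t i_end = min_size(ii + tile, n - 1);
            size_t j_end = min_size(jj + tile, n);
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    b[i * n + j] = a[(i - 1) * n + j] + a[i * n + j] + a[(i + 1) * n + j];
                }
            }
        }
    }
}

/* Returns 1 if both orders produce identical output, 0 if they
 * differ, and -1 if memory allocation fails. */
int tiled_matches_plain(size_t n, size_t tile) {
    double *a = malloc(n * n * sizeof *a);
    double *b_plain = calloc(n * n, sizeof *b_plain);
    double *b_tiled = calloc(n * n, sizeof *b_tiled);
    int same = -1;
    if (a && b_plain && b_tiled) {
        /* Deterministic, arbitrary input values. */
        for (size_t k = 0; k < n * n; k++) a[k] = (double)(k % 101);
        stencil_plain(n, a, b_plain);
        stencil_tiled(n, tile, a, b_tiled);
        same = 1;
        for (size_t k = 0; k < n * n; k++) {
            if (b_plain[k] != b_tiled[k]) { same = 0; break; }
        }
    }
    free(a);
    free(b_plain);
    free(b_tiled);
    return same;
}
```

Each element of b is written exactly once, by the same expression, in both versions, so only the order of the writes changes. A call such as tiled_matches_plain(64, 16) should therefore return 1. A tile size that does not divide n evenly (for example, n = 50 with tile = 7) exercises the clamped partial tiles.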
The net effect of this transformation is to subdivide the ten-thousand by ten-thousand iteration space into tiles of one hundred by one hundred iterations each (ten thousand tiles in all), and to proceed one tile at a time. Because the program exhibits two-dimensional memory locality, such tiling, when used with multiple processors, reduces overall memory traffic by increasing the number of cache hits. However, existing tiling approaches consider only one loop nest at a time.