1. Field of the Invention
The present invention relates in general to software optimization that may be used by a programmer or an automatic program optimizer, and more particularly to a technique for loop blocking or loop tiling applied to a perfect nest of loops.
2. Description of the Related Art
Loop tiling or loop blocking is a well-known program transformation used by programmers and program optimizers to improve the instruction-level parallelism, data access locality, and register locality and to decrease the branching overhead of a set of perfectly nested program loops. Many optimizing compilers employ the loop tiling transformation to some degree.
To block a given loop, a compiler replaces the original loop with two perfectly nested loops, an outer control loop and an inner blocked loop, as shown in the following example:
______________________________________
Original Loop      Perfectly Nested Loops
______________________________________
do i = 1, n        do ii = 1, n, B                  Control Loop
  . . .              do i = ii, min(ii+B-1, n)      Blocked Loop
enddo                  . . .
                     enddo
                   enddo
______________________________________
In this example, B is the block size specifying the maximum number of iterations executed by the inner blocked loop.
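For illustration only, the single-loop blocking transformation above may be sketched in Python as follows. The function names `original_order` and `blocked_order` are hypothetical and do not appear in the prior art; the sketch merely records the order in which iterations execute, and shows that blocking a single loop by itself leaves that order unchanged.

```python
def original_order(n):
    """Iteration order of the original loop: do i = 1, n."""
    return list(range(1, n + 1))


def blocked_order(n, B):
    """Iteration order after blocking with block size B (illustrative helper)."""
    order = []
    for ii in range(1, n + 1, B):                    # control loop: do ii = 1, n, B
        for i in range(ii, min(ii + B - 1, n) + 1):  # blocked loop: do i = ii, min(ii+B-1, n)
            order.append(i)
    return order


# The blocked loop visits the same iterations in the same order.
print(blocked_order(10, 4) == original_order(10))
```

The upper bound min(ii+B-1, n) handles a final partial block when B does not evenly divide n.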
Blocking a single loop, as shown above, does not in itself improve data locality, since the loop iterations are performed in exactly the same order as before. The real benefit of blocking is realized when it is combined with loop interchange. In the following example, a matrix transpose loop nest is transformed by first blocking both the j and i loops with block sizes B.sub.j and B.sub.i respectively, and then by interchanging loops j and ii so that the control loops are moved outward and the blocked loops are moved inward.
______________________________________
Original Loops     Loops After Blocking                   Loops After Interchange
______________________________________
do j=1,n           do jj=1,n,B.sub.j                      do jj=1,n,B.sub.j
  do i=1,n           do j=jj,min(jj+B.sub.j -1,n)           do ii=1,n,B.sub.i
    A(i,j)=B(j,i)      do ii=1,n,B.sub.i                      do j=jj,min(jj+B.sub.j -1,n)
  enddo                  do i=ii,min(ii+B.sub.i -1,n)           do i=ii,min(ii+B.sub.i -1,n)
enddo                      A(i,j)=B(j,i)                          A(i,j)=B(j,i)
                         enddo                                  enddo
                       enddo                                  enddo
                     enddo                                  enddo
                   enddo                                  enddo
______________________________________
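By way of a hedged illustration, the interchanged (tiled) form of the matrix transpose may be sketched in Python as shown below. The names `transpose_original`, `transpose_tiled`, `src`, `Bj`, and `Bi` are hypothetical; the sketch uses 0-based indexing and nested lists rather than Fortran arrays, and the block sizes passed in the usage line are arbitrary, since selecting them is precisely the problem left unsolved by the prior art.

```python
def transpose_original(src, n):
    """Untiled transpose: A(i,j) = B(j,i) for all i, j."""
    A = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            A[i][j] = src[j][i]
    return A


def transpose_tiled(src, n, Bj, Bi):
    """Transpose after blocking j and i and interchanging loops j and ii."""
    A = [[0] * n for _ in range(n)]
    for jj in range(0, n, Bj):                        # control loop for j
        for ii in range(0, n, Bi):                    # control loop for i
            for j in range(jj, min(jj + Bj, n)):      # blocked loop for j
                for i in range(ii, min(ii + Bi, n)):  # blocked loop for i
                    A[i][j] = src[j][i]
    return A


# Usage: both forms compute the same result for arbitrarily chosen block sizes.
n = 5
src = [[r * n + c for c in range(n)] for r in range(n)]
print(transpose_tiled(src, n, 2, 3) == transpose_original(src, n))
```

Only the order in which the n*n assignments execute differs between the two forms; each inner pair of loops now works on a tile of at most Bj by Bi elements.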
Although the prior art defined the above blocking transformation, and recognized that the block sizes may affect program performance, the prior art has failed to provide a solution to the problem of determining optimized block/tile sizes for improved data locality.
For instance, Rajic et al. (Hrabri Rajic and Sanjiv Shah, "Maximum Performance Code Restructuring for Hierarchical Memory RISC Computers", SIAM Conference, Houston, Tex., March 1991.) teaches the loop blocking/tiling transformation and gives examples. Although Rajic et al. teaches that the block sizes may be variables (iblock, jblock, kblock), Rajic et al. does not teach or suggest how to select the block size values.
Irigoin et al. (Francois Irigoin and Remi Triolet, "Supernode Partitioning", Conference Record of Fifteenth ACM Symposium on Principles of Programming Languages, 1988.) teaches supernode partitioning with multiple hyperplanes, which can be used to form general hyperparallelepiped tiles of the iteration space rather than just rectilinear tiles. Irigoin et al. also fails to teach or suggest how to select the block size values.
Similarly, Ramanujam et al. (J. Ramanujam and P. Sadayappan, "Compile-Time Techniques for Data Distribution in Distributed Memory Machines", IEEE Transactions on Parallel and Distributed Systems, 2(4) p. 472-482, October 1991.) teaches data partitioning in multicomputers with local memory, akin to the supernode partitioning introduced in Irigoin et al. Ramanujam et al.'s teachings produce communication-free hyperplane partitions for loops containing array references with affine index expressions, when communication-free partitions exist. However, Ramanujam et al. also fails to teach or suggest a method for selecting block or tile sizes.
Schreiber et al. (Robert Schreiber and Jack J. Dongarra, "Automatic Blocking of Nested Loops", Technical Report 90.38, RIACS, August 1990.) addresses the problem of deriving an optimized tiled (hyperparallelepiped) iteration space to minimize communication traffic. The teachings of Schreiber et al. only address the restricted case in which all block sizes are assumed to be equal and the iteration and data spaces are isomorphic.
Abraham et al. (S. G. Abraham and D. E. Hudak, "Compile-Time Partitioning of Iterative Parallel Loops to Reduce Cache Coherency Traffic", IEEE Transactions on Parallel and Distributed Systems, 2(3), p. 318-328, July 1991.) teaches loop partitioning for multiprocessors with caches, and selecting tile sizes for parallelism rather than data locality. The teachings of Abraham et al. are limited by the assumptions that the number of loops in a nest matches the number of dimensions in an array variable being processed, and that an array location being updated in a single iteration of the loops has the form A[i.sub.1, i.sub.2, . . . ] for loop index variables i.sub.1, i.sub.2, . . . .
As with the loop blocking prior art, the prior art of caching does not teach or suggest a method of determining optimized block/tile sizes for improved data locality.
Ferrante et al. (Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash, "On Estimating and Enhancing Cache Effectiveness", Lecture Notes in Computer Science, p. 589, 1991, Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, Santa Clara, Calif., USA, August 1991.) teaches how to efficiently estimate the number of distinct cache lines used by a given loop in a nest of loops. Ferrante et al. presents simulation results indicating that the estimates are reasonable for loop nests such as matrix multiply.
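For illustration only, the quantity at issue, namely the number of distinct cache lines touched by a loop nest, may be counted by brute-force enumeration for a small case, as in the Python sketch below. This is not the estimation method of Ferrante et al.; it is an exhaustive count under stated assumptions: column-major storage as in Fortran, 0-based offsets, each array starting on a cache line boundary, and a hypothetical line size of `line_elems` array elements. The function name `lines_touched` is likewise hypothetical.

```python
def lines_touched(n, jj, ii, Bj, Bi, line_elems):
    """Count distinct cache lines touched by one tile of A(i,j) = B(j,i).

    Assumes n-by-n column-major arrays, each aligned to a cache line,
    with line_elems array elements per cache line.
    """
    lines = set()
    for j in range(jj, min(jj + Bj, n)):
        for i in range(ii, min(ii + Bi, n)):
            lines.add(("A", (i + j * n) // line_elems))  # element A(i,j)
            lines.add(("B", (j + i * n) // line_elems))  # element B(j,i)
    return len(lines)


# Two tiles with the same number of iterations (64) but different shapes.
print(lines_touched(64, 0, 0, 8, 8, 8))   # 8 by 8 tile
print(lines_touched(64, 0, 0, 1, 64, 8))  # 1 by 64 tile
```

Under these assumptions the 8 by 8 tile touches 16 distinct lines while the 1 by 64 tile touches 72, although both execute 64 iterations; with a line size of one element the two shapes are indistinguishable.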
The problem of estimating the amount of local memory needed by array references contained within a nest of loops was considered in Gannon et al. (Dennis Gannon, William Jalby, and Kyle Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformations", Proceedings of the First ACM International Conference on Supercomputing, June 1987.) and Gallivan et al. (Kyle Gallivan, William Jalby, and Dennis Gannon, "On the Problem of Optimizing Data Transfers for Complex Memory Systems", Proc. of ACM 1988 Int'l. Conf. on Supercomputing, St. Malo, France, Jul. 4-8, 1988, pp. 238-253, 1988.). They introduced the notion of uniformly generated data dependences and based their analysis on this class of dependences. Since many data dependences are not uniformly generated, this restriction limits the applicability of their technique in practice. As evidence of this, Shen et al. (Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew, "An Empirical Study on Array Subscripts and Data Dependences", Technical Report CSRD Rpt. No. 840, University of Illinois-CSRD, May 1989.) reports that, in a sample of Fortran programs (including library packages such as Linpack and Eispack and numeric programs such as SPICE), 86% of dependences found had non-constant distance vectors. Gannon et al. and Gallivan et al. focused on optimizing for a software-controlled local memory (as in a distributed-memory multiprocessor), rather than for a hardware-controlled cache memory. Therefore, factors like cache line size, set associativity and cache size were not taken into account.
The problem of estimating the number of cache lines for uniprocessor machines was considered by Porterfield (Allan K. Porterfield, "Software Methods for Improvement of Cache Performance on Supercomputer Applications", PhD Thesis, Rice University, May 1989, Rice COMP TR89-93). However, Porterfield's technique assumes a cache line size of one element, and many machines have a cache line size greater than one. Further, the analysis in Porterfield only applies to the special case of data dependences with constant direction vectors, which is not the usual case in practice, as indicated by Shen et al. The approach in Porterfield is based on computing all cache dependences of a program (similar to data dependences); therefore, the number of dependence tests performed may be quadratic in the number of array references.
Wolf et al. (Michael E. Wolf and Monica S. Lam, "A Data Locality Optimization Algorithm", Proceedings of the ACM SIGPLAN Symposium on Programming Language Design and Implementation, June 1991.) teaches a cache cost model based on a number of loops carrying reuse. Such reuse may either be temporal (relating to the same data item) or spatial (relating to data items in the same cache line), and is given for both single and multiple references.
Related work in summarizing array dependence information for interprocedural analysis can be found in Balasundaram (Vasanth Balasundaram, "A Mechanism for Keeping Useful Internal Information in Parallel Programming Tools: The Data Access Descriptor", Journal of Parallel and Distributed Computing, 9, p. 154-170, 1990.). While less costly than full dependence analysis, such summaries are usually more costly and precise than needed for cache analysis.
Thus, despite extensive effort in the compiler art related to loop blocking/tiling and cache data locality, the prior art has failed to provide a solution to the problem of determining optimized block/tile sizes for improved data locality. Accordingly, there is a clearly felt need in the art for a method of, system for, and computer program product for, providing optimized block/tile sizes for improved data locality.