1. Field of the Invention
The present invention relates in general to a computer program analysis method for use by an optimizing or parallelizing compiler or by a computer program analysis tool, and more particularly to a technique for performing weighted loop fusion.
2. Description of the Related Art
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. As a result, being able to handle loops efficiently is of fundamental importance. Much of the past work in optimizing the performance of loops has focused on individual loop nests rather than on collections of loop nests. This is represented by the teachings of Allen et al. (R. Allen and K. Kennedy. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages and Systems. 9:491-542, 1987); Banerjee (U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston, Mass., 1988); Irigoin et al. (Francois Irigoin and Remi Triolet. Supernode Partitioning. Conference Record of Fifteenth ACM Symposium on Principles of Programming Languages, 1988); Sarkar et al. (Vivek Sarkar and Radhika Thekkath. A General Framework for Iteration-Reordering Loop Transformations. Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 175-187, June 1992); Wolf et al. (Michael E. Wolf and Monica S. Lam. A Data Locality Optimization Algorithm. Proceedings of the ACM SIGPLAN Symposium on Programming Language Design and Implementation, pages 30-44, June 1991); Wolf et al. (Michael E. Wolf and Monica S. Lam. A Loop Transformation Theory and an Algorithm to Maximize Parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452-471, October 1991); and Wolfe (Michael J. Wolfe. Optimizing Supercompilers for Supercomputers. Pitman, London and The MIT Press, Cambridge, Mass., 1989. In the series, Research Monographs in Parallel and Distributed Computing).
In the weighted loop fusion problem, each pair of loop nests has an associated non-negative weight which is the cost savings that would be obtained if the two loop nests were fused. The weight values depend on the target hardware; contributions to the weights can arise from savings of messages on distributed-memory multiprocessors, and from savings of load/store instructions and cache misses on shared-memory multiprocessors and uniprocessors. The loop nests may contain parallel or sequential loops; care is taken to ensure that a parallel loop does not get serialized after fusion.
A fusion partition is a partition of the loop nests into disjoint fusion clusters such that each fusion cluster represents a set of loop nests to be fused. There are two conditions that must be satisfied by a legal fusion partition. First, any two loop nests that are specified as being a "noncontractable" pair must be placed in distinct fusion clusters. Second, the inter-cluster dependence graph defined by the fusion partition must be acyclic. This general definition of a fusion partition permits fusion of non-adjacent loops and subsumes restricted definitions of "horizontal" and "vertical" loop fusion that have been considered in past work (see Goldberg et al. for a brief summary).
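The two legality conditions above can be illustrated with a minimal sketch. The function name `is_legal_partition`, the representation of loop nests as string identifiers, and the use of Kahn's topological-sort algorithm for the acyclicity test are all assumptions made for illustration; they are not taken from the cited work.

```python
from collections import defaultdict

def is_legal_partition(clusters, dep_edges, noncontractable):
    """Check the two legality conditions for a fusion partition.

    clusters        : list of disjoint sets of loop-nest ids (hypothetical encoding)
    dep_edges       : iterable of (src, dst) dependences between loop nests
    noncontractable : iterable of (a, b) pairs that must not be fused
    """
    cluster_of = {}
    for idx, cluster in enumerate(clusters):
        for nest in cluster:
            if nest in cluster_of:
                raise ValueError("clusters must be disjoint")
            cluster_of[nest] = idx

    # Condition 1: every noncontractable pair lies in two distinct clusters.
    for a, b in noncontractable:
        if cluster_of[a] == cluster_of[b]:
            return False

    # Condition 2: the inter-cluster dependence graph is acyclic.
    # Build the cluster-level graph, then run Kahn's algorithm.
    succ = defaultdict(set)
    indeg = [0] * len(clusters)
    for src, dst in dep_edges:
        cs, cd = cluster_of[src], cluster_of[dst]
        if cs != cd and cd not in succ[cs]:
            succ[cs].add(cd)
            indeg[cd] += 1
    ready = [c for c in range(len(clusters)) if indeg[c] == 0]
    visited = 0
    while ready:
        c = ready.pop()
        visited += 1
        for d in succ[c]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return visited == len(clusters)

# Three loop nests with dependences A -> B -> C.
deps = [("A", "B"), ("B", "C")]
adjacent_ok = is_legal_partition([{"A", "B"}, {"C"}], deps, [])
# Fusing A with C while leaving B out creates a cycle between the two clusters.
skip_middle = is_legal_partition([{"A", "C"}, {"B"}], deps, [])
# Marking (A, B) noncontractable forbids the first partition.
forbidden = is_legal_partition([{"A", "B"}, {"C"}], deps, [("A", "B")])
```

In this example only the first partition is legal: the second violates acyclicity and the third violates the noncontractable constraint.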
Weighted loop fusion is the problem of finding a legal fusion partition of loop nests into fusion clusters so as to minimize the total weight of the node pairs that are placed in different clusters. Kennedy et al. have shown that the weighted loop fusion problem is NP-hard (Ken Kennedy and Kathryn S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. Springer-Verlag Lecture Notes in Computer Science, 768. Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, Portland, Oreg., August 1993). Hence, greedy algorithms are used in practice to obtain heuristic solutions to the weighted loop fusion problem, with no proven bounds on how far the heuristic solutions may be from optimal solutions.
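The objective and the limitation of greedy heuristics may be sketched as follows. This is a generic heaviest-legal-merge-first heuristic written for illustration only; it is not the greedy algorithm of Kennedy et al., and the names `inter_cluster_weight`, `is_legal`, and `greedy_fusion`, as well as the example weights, are assumptions.

```python
from itertools import combinations

def inter_cluster_weight(clusters, weights):
    """Objective: total weight of node pairs that end up in different clusters."""
    cluster_of = {n: i for i, c in enumerate(clusters) for n in c}
    return sum(w for (a, b), w in weights.items()
               if cluster_of[a] != cluster_of[b])

def is_legal(clusters, deps, noncontractable):
    """Legal if no noncontractable pair is fused and the cluster graph is acyclic."""
    cluster_of = {n: i for i, c in enumerate(clusters) for n in c}
    if any(cluster_of[a] == cluster_of[b] for a, b in noncontractable):
        return False
    edges = {(cluster_of[s], cluster_of[d]) for s, d in deps
             if cluster_of[s] != cluster_of[d]}
    remaining = set(range(len(clusters)))
    while remaining:
        # Peel off clusters with no incoming edge from a remaining cluster.
        sources = {c for c in remaining
                   if not any(d == c and s in remaining for s, d in edges)}
        if not sources:
            return False  # every remaining cluster has a predecessor: cycle
        remaining -= sources
    return True

def greedy_fusion(nests, weights, deps, noncontractable):
    """Repeatedly apply the legal cluster merge with the largest weight gain."""
    clusters = [{n} for n in nests]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            gain = sum(w for (a, b), w in weights.items()
                       if (a in clusters[i] and b in clusters[j])
                       or (a in clusters[j] and b in clusters[i]))
            if gain <= 0:
                continue
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] | clusters[j])
            if is_legal(merged, deps, noncontractable) and (
                    best is None or gain > best[0]):
                best = (gain, merged)
        if best is None:
            return clusters
        clusters = best[1]

# Hypothetical instance with no dependences and two noncontractable pairs.
nests = ["A", "B", "C", "D"]
weights = {("A", "B"): 5, ("A", "C"): 4, ("B", "D"): 4}
deps = []
noncontractable = [("B", "C"), ("A", "D")]

greedy = greedy_fusion(nests, weights, deps, noncontractable)
optimal = [{"A", "C"}, {"B", "D"}]
print(inter_cluster_weight(greedy, weights))   # 8
print(inter_cluster_weight(optimal, weights))  # 5
```

In this instance the heuristic first fuses A and B (weight 5), after which neither C nor D can legally join the cluster, leaving an inter-cluster weight of 8. The partition {A, C}, {B, D} is also legal and leaves only 5.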
Loop distribution is a well-known loop transformation that separates a single loop nest into multiple conformable loop nests and is thus the inverse of loop fusion (Wolfe). Loop distribution is effective in controlling register pressure and in creating a larger number of loop nests to feed into loop fusion. An understanding of the interaction between loop distribution and loop fusion may be reached by observing that the result of any sequence of fusion and distribution transformations is a regrouping of the statements in the bodies of the loop nests in the original program. All sequences of fusion and distribution transformations that result in the same regrouping of statements and in the same ordering of regrouped loop nests are equivalent. The goal of combining distribution and fusion is to automatically select an optimized fusion/distribution configuration, i.e., an optimized regrouping of statements. Therefore, without any loss of generality, it may be assumed that all loop nests are maximally distributed (Wolfe) before any fusion transformation is applied. Maximal distribution also yields a larger number of perfect loop nests that can be subject to iteration-reordering loop transformations (e.g., interchange, tiling) before loop fusion. The problem of selecting a fusion/distribution configuration thus becomes equivalent to an optimal weighted loop fusion problem after maximal distribution.
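The view of fusion and distribution as regroupings of the same statements can be illustrated with a small sketch. The statements S1 and S2, the array names, and the function names `fused` and `distributed` are hypothetical; the example assumes that S2 reads only the value of `b[i]` produced in the same iteration, so both groupings preserve the dependence.

```python
def fused(a):
    """One loop nest containing both statements S1 and S2."""
    n = len(a)
    b = [0] * n
    c = [0] * n
    for i in range(n):
        b[i] = a[i] * 2      # S1
        c[i] = b[i] + 1      # S2 reads b[i] written in the same iteration
    return b, c

def distributed(a):
    """Maximal distribution: each statement gets its own loop nest."""
    n = len(a)
    b = [0] * n
    c = [0] * n
    for i in range(n):       # loop nest holding only S1
        b[i] = a[i] * 2
    for i in range(n):       # loop nest holding only S2
        c[i] = b[i] + 1
    return b, c

data = [3, 1, 4, 1, 5]
same_result = fused(data) == distributed(data)
```

Both versions compute the same arrays. The fused form can keep `b[i]` in a register between S1 and S2, while the distributed form exposes two separate loop nests that a fusion pass may regroup with other loops.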
Gao et al. (G. R. Gao, R. Olsen, V. Sarkar, and R. Thekkath. Collective loop fusion for array contraction. Springer-Verlag Lecture Notes in Computer Science, 757. Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing, Yale University, August 1992) studied the weighted loop fusion problem in the context of array contraction, and presented a polynomial-time algorithm based on the max-flow/min-cut algorithm as a heuristic solution. Kennedy et al. proved that the weighted loop fusion problem is NP-hard and presented two polynomial-time algorithms as heuristic solutions: a simple greedy algorithm and a more powerful algorithm based on the max-flow/min-cut algorithm. They also reported uniprocessor performance improvements due to loop fusion in the range of 4-17% (depending on the processor) for the Erlebacher benchmark, thus demonstrating the benefits of weighted loop fusion even in a uniprocessor context.
Unweighted loop fusion is the problem of finding a legal fusion partition that minimizes the number of fusion clusters (there are no edge weights in this problem statement and hence no consideration of locality savings for pairs of loops). Callahan (David Callahan. A Global Approach to Detection of Parallelism. PhD thesis, Rice University, April 1987. Rice COMP TR87-50) presented a greedy partitioning algorithm for unweighted loop fusion and proved its optimality. Kennedy et al. (Ken Kennedy and Kathryn S. McKinley. Typed Fusion with Applications to Parallel and Sequential Code Generation. Technical report, Department of Computer Science, Rice University, 1993. TR93-208) extended Callahan's result by addressing the problem of (unweighted) typed fusion, an extension to unweighted loop fusion in which each loop has an assigned type and only loops of the same type can be fused together. In ordered typed fusion, there is a prioritized ordering of types, t.sub.1, . . . , t.sub.k, and the objective is to find a legal fusion partition with the lexicographically smallest value of the tuple (Nt.sub.1, . . . , Nt.sub.k), where Nt.sub.i is the number of fusion clusters of type t.sub.i. The authors presented a polynomial-time algorithm for finding an optimal solution to this ordered typed fusion problem. An important application of ordered typed fusion is the case of fusing a collection of parallel and serial loops in which priority is given to the parallel type over the serial type. However, in this work, the authors did not address the issue of preventing fusion when fusing two parallel loops introduces a loop-carried data dependence. In unordered typed fusion, there is no prioritization among types and the objective is to find a legal partition with the minimum number of fusion clusters. The authors proved that the unordered typed fusion problem can be solved optimally in polynomial time for two types, but is NP-hard in general.
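The difference between the ordered and unordered objectives can be shown with a short sketch. It only compares two candidate partitions that are assumed to be legal; it does not implement the polynomial-time algorithm of Kennedy et al. The function name `cluster_counts`, the loop names, and the encoding of a partition as (type, loops) pairs are hypothetical.

```python
def cluster_counts(partition, type_order):
    """Return the tuple of cluster counts per type, in priority order."""
    return tuple(sum(1 for loop_type, _ in partition if loop_type == t)
                 for t in type_order)

# Priority is given to the parallel type over the serial type.
type_order = ("parallel", "serial")

# Two candidate partitions of loops L1..L4, both assumed legal.
p1 = [("parallel", {"L1", "L3"}), ("serial", {"L2"}), ("serial", {"L4"})]
p2 = [("parallel", {"L1"}), ("parallel", {"L3"}), ("serial", {"L2", "L4"})]

counts_p1 = cluster_counts(p1, type_order)   # (1, 2)
counts_p2 = cluster_counts(p2, type_order)   # (2, 1)

# Ordered typed fusion: Python tuples compare lexicographically,
# so min() selects the partition with the fewest parallel clusters first.
best_ordered = min([p1, p2], key=lambda p: cluster_counts(p, type_order))

# Unordered typed fusion: only the total number of clusters matters.
totals = (len(p1), len(p2))
```

Under the ordered objective `p1` is preferred because it has fewer clusters of the highest-priority type, whereas under the unordered objective the two partitions tie with three clusters each.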
Goldberg et al. (A. Goldberg and R. Paige. Stream processing. 1984 ACM Symposium on Lisp and Functional Programming, pages 53-62, August 1984. Austin, Tex.) studied the problem of stream processing, an optimization technique that is related to loop fusion. They showed how stream processing and loop fusion techniques can be used to avoid intermediate storage in database queries and thus reduce the execution time of the queries. Their work highlights another important application area for loop fusion.
Conventional methods for performing weighted loop fusion provide sub-optimal solutions through the use of heuristics. When these conventional methods instead rely on an exhaustive search algorithm, they may have large execution times, which is not generally practical for use in a product-quality optimizing compiler. Thus, there is a clearly felt need for a method of, system for, and computer program product for, providing optimal weighted loop fusion. There is also a clearly felt need for a method of, system for, and computer program product for, providing a more efficient and practical weighted loop fusion.