1. Field of the Invention
This invention relates to the field of Optimizing Compilers for computer systems. More specifically, the invention is a method and apparatus for efficient determination of a load latency value for modulo scheduling of target program instructions during the code optimization pass of an optimizing compiler.
2. Background
It is desirable that computer programs be as efficient as possible in their execution time and memory usage. This need has spawned the development of computer architectures capable of executing target program instructions in parallel. A recent trend in processor design is to build processors with increasing instruction issue capability and many functional units. Some examples of such designs are Sun's UltraSparc.TM. (4 issue), IBM's PowerPC.TM. series (2-4 issue), MIPS' R10000.TM. (5 issue) and Intel's Pentium-Pro.TM. (aka P6) (3 issue). (These processor names are the trademarks respectively of Sun Microsystems, Inc., IBM Corporation, MIPS Technologies, Inc., and Intel Corporation). At the same time the push toward higher clock frequencies has resulted in deeper pipelines and longer instruction latencies. These and other computer processor architectures contain multiple functional units such as I/O memory ports, integer adders, floating point adders, multipliers, etc. which permit multiple operations to be executed in the same machine cycle. The process of optimizing the target program's execution speed becomes one of scheduling the execution of the target program instructions to take advantage of these multiple computing resource units or processing pipelines. This scheduling task is performed as one function of an optimizing compiler. Optimizing compilers typically contain a Code Optimization section which sits between a compiler front end and a compiler back end. The Code Optimization section takes as input the "intermediate code" output by the compiler front end, and performs various transformations on this code which will result in a faster and more efficient target program. The transformed code is passed to the compiler back end which then converts the code to a binary version for the particular machine involved (e.g. SPARC, X86, IBM).
The Code Optimization section itself needs to be as fast and memory efficient as it possibly can be and needs some indication of the computer resource units available and pipelining capability of the computer platform for which the target program code is written.
In the past, attempts have been made to develop optimizing compilers generally, and code optimizer modules specifically, which themselves run as efficiently as possible. A general discussion of optimizing compilers and the related techniques used can be found in the text book "Compilers: Principles, Techniques and Tools" by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, Addison-Wesley Publishing Co., 1988, ISBN 0-201-10088-6, especially chapters 9 & 10, pages 513-723. One such attempt at optimizing the scheduling of instructions in inner loops in computer platforms with one or more pipelined functional units is a technique called "modulo scheduling." Modulo scheduling is known in the art and is generally described in the paper titled "Parallelization of WHILE Loops on Pipelined Architectures" by Parthasarathy P. Tirumalai, Meng Lee and Michael S. Schlansker, The Journal of Supercomputing, 5, pages 119-136 (1991), which is incorporated fully herein by reference. Modulo scheduling is one form of software pipelining that extracts instruction level parallelism from inner loops by overlapping the execution of successive iterations. Modulo scheduling makes use of two values, the Minimum Iteration Interval (MII) and the Recurrence Minimum Iteration Interval (RMII), as lower bounds on the Iteration Interval (II), which is a key metric in the modulo scheduling process. These values, MII and RMII, are also used in selecting the optimal load latency value to be used in the modulo scheduling process. A brief summary of modulo scheduling is contained in the detailed description section below.
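By way of illustration only, the role of MII and RMII as lower bounds on II can be sketched as follows. The formulas used to compute the two bounds are the commonly used textbook ones and are an assumption here, not taken from this disclosure; the function names, the resource table and the dependence cycles are likewise hypothetical.

```python
import math

def resource_mii(uses_per_iteration, units_available):
    """Resource bound (assumed formula): for each resource class,
    ceil(uses per iteration / units available); take the largest."""
    return max(
        math.ceil(uses / units_available[res])
        for res, uses in uses_per_iteration.items()
    )

def recurrence_mii(dependence_cycles):
    """Recurrence bound (assumed formula): for each dependence cycle,
    ceil(total latency / total iteration distance); take the largest."""
    return max(
        (math.ceil(latency / distance) for latency, distance in dependence_cycles),
        default=1,
    )

def ii_lower_bound(mii, rmii):
    """II can be no smaller than either bound."""
    return max(mii, rmii)

# Hypothetical loop body: 3 memory operations, 2 FP adds, 1 FP multiply,
# on a machine with one unit of each kind.
uses = {"mem": 3, "fadd": 2, "fmul": 1}
units = {"mem": 1, "fadd": 1, "fmul": 1}
# Hypothetical dependence cycles as (total latency, iteration distance).
cycles = [(4, 1), (6, 2)]

mii = resource_mii(uses, units)
rmii = recurrence_mii(cycles)
print(mii, rmii, ii_lower_bound(mii, rmii))
```

In this hypothetical case the recurrence bound is the larger of the two, so no schedule can start a new iteration more often than that bound allows, regardless of how many functional units are free.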
Most modern computer systems architectures include a multi-level memory hierarchy. A consequence of such a design is that the latency of a load operation ("load latency") can vary for different operations and on different target computer platforms. "Load latency" is defined as the time elapsed between the issuance of the load command and the return of the requested data. As an example of various architectural differences, it is common today for a computer processor to have an on-chip cache, an off-chip cache and main memory. If the requested data is in the on-chip cache (L1) it is usually returned in a few clock cycles (1-3). If it is not in L1, but is in the off-chip cache (L2) the data is typically returned in about 10 clock cycles. And if the data has to be fetched from memory, it can take tens of clock cycles. Some computer systems architectures also possess a non-blocking cache feature, wherein the processor can continue to execute instructions even after a previous load instruction has not found the requested data in one of the caches (say L1). And if the processor can sustain multiple cache misses, it only needs to stall (i.e. wait or stop executing) when the requested data is actually needed for a computation. In such systems there is an advantage in separating the load instruction from the use of the requested data. However, there are problems associated with the separation. Too much separation can be harmful because it increases "register pressure" (i.e. the longer a value is held in a register, the greater the likelihood that other instructions will compete for use of that register) and because some cases simply do not benefit from the separation owing to data dependency constraints. On the other hand, too little separation can result in processor stalls or wasted processor cycles. The invention disclosed herein provides a process for determining what this separation, called load latency, should be.
It applies in the context of optimizing compilers which use modulo scheduling for loops in the target computer program where varying load latency conditions exist in the target computer architecture. Such systems can benefit through improved program execution speed by optimally separating the load instructions from the use of the requested data.
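The trade-off described above can be illustrated with a minimal sketch, assuming a non-blocking cache in which the processor stalls only when the loaded value is actually used. The cycle counts below are illustrative values chosen within the ranges mentioned above, and the function and variable names are hypothetical; this is not the selection process of the invention.

```python
def stall_cycles(actual_latency, scheduled_separation):
    """Cycles the processor waits at the use of a loaded value when the
    load was issued `scheduled_separation` cycles before that use."""
    return max(0, actual_latency - scheduled_separation)

# Illustrative latencies in clock cycles (assumed, not measured).
LATENCY = {"L1": 2, "L2": 10, "memory": 40}

# Compare a short and a long load-to-use separation chosen by the scheduler.
for separation in (2, 10):
    stalls = {level: stall_cycles(lat, separation) for level, lat in LATENCY.items()}
    print(f"separation={separation}: {stalls}")
```

A larger separation removes the stall for data found in the off-chip cache, at the cost of holding the destination register live for more cycles, which is the register pressure concern noted above.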
The prior art does not describe attempts to automatically select the load latency to be used in modulo scheduling loops as described in this invention. In the prior art, compilers have relied on user-supplied directives to select the load latency. This method has the disadvantage of requiring user intervention and user expertise in selecting a proper load latency. The invention described herein consists of a scheme to derive the desired load latency value automatically without any user-supplied information. Moreover, even with the same source program, the desirable latency value to be used can vary depending on the target machine characteristics. The ability to automatically select a good load latency value allows the same program to work well on a variety of target machines. This is possible because the invention can be embedded within a compiler which can then automatically generate different executables targeting different machines.