It is desirable that computer programs be as efficient as possible in their execution time and memory usage. This need has spawned the development of computer architectures capable of executing target program instructions in parallel. A recent trend in processor design is to build processors with increasing instruction issue capability and many functional units. Some examples of such designs are Sun's UltraSparc.TM. (4 issue), IBM's PowerPC.TM. series (2-4 issue), MIPS' R10000.TM. (5 issue) and Intel's Pentium-Pro.TM. (aka P6) (3 issue). (These processor names are the trademarks respectively of Sun Microsystems, Inc., IBM Corporation, MIPS Technologies, Inc., and Intel Corporation.) At the same time, the push toward higher clock frequencies has resulted in deeper pipelines and longer instruction latencies. These and other computer processor architectures contain multiple functional units, such as I/O memory ports, integer adders, floating point adders, multipliers, etc., which permit multiple operations to be executed in the same machine cycle. The process of optimizing the target program's execution speed becomes one of scheduling the execution of the target program instructions to take advantage of these multiple computing resource units or processing pipelines.
The task of scheduling these instructions is performed as one function of an optimizing compiler. Optimizing compilers typically contain a Code Optimization section which sits between a compiler front end and a compiler back end. The Code Optimization section takes as input the "intermediate code" output by the compiler front end, and performs various transformations on this code which will result in a faster and more efficient target program. The transformed code is passed to the compiler back end, which then converts the code to a binary version for the particular machine involved (e.g., SPARC, X86, IBM).
The Code Optimization section itself needs to be as fast and memory efficient as it possibly can be and needs some indication of the computer resource units available and pipelining capability of the computer platform for which the target program code is written.
In the past, attempts have been made to develop optimizing compilers generally, and code optimizer modules specifically, which themselves run as efficiently as possible. A general discussion of optimizing compilers and the related techniques used can be found in the textbook "Compilers: Principles, Techniques and Tools" by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, Addison-Wesley Publishing Co 1988, ISBN 0-201-10088-6, especially chapters 9 & 10, pages 513-723. One such attempt at optimizing the scheduling of instructions in inner loops in computer platforms with one or more pipelined functional units is a technique called "modulo scheduling." Modulo scheduling is known in the art and is generally described in the paper entitled "Some Scheduling Techniques and An Easily Schedulable Horizontal Architecture For High Performance Scientific Computing" by Rau, B. R. and Glaeser, C. D., Proceedings of the Fourteenth Annual Workshop on Microprogramming, Advanced Processor Technology Group, ESL, Inc., October 1981, pages 183-198, which is incorporated fully herein by reference. Modulo scheduling is one form of software pipelining that extracts instruction level parallelism from inner loops by overlapping the execution of successive iterations.
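The overlap of successive iterations can be illustrated with a minimal sketch. It assumes a fixed initiation interval (here called `ii`, the number of cycles between the starts of consecutive iterations) and a hypothetical four-instruction loop body whose schedule times are chosen only for illustration; neither the names nor the numbers come from the cited paper.

```python
def kernel_slots(schedule, ii):
    """Map each instruction's schedule time to (pipeline stage, kernel slot).

    The slot is the schedule time modulo the initiation interval, which is
    where the name "modulo scheduling" comes from.
    """
    return {name: (t // ii, t % ii) for name, t in schedule.items()}


def issue_cycle(schedule, ii, name, iteration):
    """Absolute cycle at which `name` issues for a given loop iteration."""
    return iteration * ii + schedule[name]


# Hypothetical schedule for one iteration (times in cycles from iteration start).
schedule = {"load": 0, "mul": 2, "add": 4, "store": 5}
ii = 2  # assumed initiation interval: a new iteration starts every 2 cycles

slots = kernel_slots(schedule, ii)
# At cycle 4 the kernel runs "add" of iteration 0, "mul" of iteration 1
# and "load" of iteration 2 together, since all three map to slot 0.
same_cycle = [issue_cycle(schedule, ii, "add", 0),
              issue_cycle(schedule, ii, "mul", 1),
              issue_cycle(schedule, ii, "load", 2)]
```

In this sketch one iteration takes six cycles from first issue to last, yet a new iteration begins every two cycles, so three iterations are in flight at once in the steady state.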
The modulo schedule is derived by traversing the data dependency graph for the loop and assigning time-stamps to the instructions. Since a data dependency graph represents precedence relationships between instructions, the traditional approach is to schedule the sources of dependencies before the targets. A problem arises when the scheduling of a target needs to be delayed, either because of unsatisfied precedence relationships with other sources or because of modulo constraints. In such cases, the lifetime of the register between the source and the target is lengthened. This has two negative consequences for the software pipelined loop:
1) Since register lifetimes are lengthened, increased register pressure may result in more register spills.
2) Since the number of times the kernel is unrolled depends on the longest register lifetime, greater code expansion may occur.
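The effect described above can be sketched with a toy forward (sources-before-targets) modulo scheduler. Everything in the sketch is an illustrative assumption rather than the method of any particular compiler: the four-instruction graph, the latencies, one functional unit per resource class, a register lifetime measured from the issue of the producer to the issue of its last consumer, and a kernel unroll count taken as the longest lifetime divided by the initiation interval, rounded up.

```python
import math


def forward_modulo_schedule(nodes, edges, ii):
    """Schedule nodes in the given topological order, sources first.

    nodes: list of (name, resource); edges: list of (src, dst, latency).
    Returns {name: issue time}. One unit per resource class is assumed.
    """
    times = {}
    reserved = set()  # modulo reservation table entries: (resource, slot)
    for name, resource in nodes:
        # Earliest time allowed by already-scheduled predecessors.
        earliest = max(
            (times[s] + lat for s, d, lat in edges if d == name), default=0
        )
        # Modulo constraint: try the ii candidate slots starting at earliest.
        for t in range(earliest, earliest + ii):
            if (resource, t % ii) not in reserved:
                reserved.add((resource, t % ii))
                times[name] = t
                break
        else:
            raise ValueError(f"no free slot for {name} at II={ii}")
    return times


def register_lifetimes(times, edges):
    """Cycles from the issue of each producer to the issue of its last consumer."""
    last_use = {}
    for s, d, _ in edges:
        last_use[s] = max(last_use.get(s, times[s]), times[d])
    return {s: last_use[s] - times[s] for s in last_use}


# Hypothetical loop body: add = load_x + (load_y * constant)
nodes = [("load_x", "mem"), ("load_y", "mem"), ("mul", "fmul"), ("add", "fadd")]
edges = [("load_x", "add", 2), ("load_y", "mul", 2), ("mul", "add", 3)]
ii = 2

times = forward_modulo_schedule(nodes, edges, ii)
lifetimes = register_lifetimes(times, edges)
unroll = math.ceil(max(lifetimes.values()) / ii)  # assumed unroll rule
```

Here `load_x` is placed at cycle 0 because it has no predecessors, but its only consumer, `add`, cannot issue until cycle 6 because it must also wait for `mul`. The register holding the loaded value therefore stays live for 6 cycles, and under the assumed rule the kernel would need 3 copies at an initiation interval of 2.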
This invention addresses this problem by introducing a time-reversed scheduling approach for modulo scheduling. Forms of data dependency graphs, flow graphs, etc. have been known for some time in various fields requiring data flow analysis, such as Operations Research and optimizing compilers. Time-reversed scheduling of such data flow graphs is a technique that has also been known in several of these fields. However, there is no known prior art which uses or suggests the use of time-reversed data dependency graph scheduling in modulo scheduling a target program's loop instructions in an optimizing compiler.