1. Field of the Invention
The invention generally relates to methods and devices for optimizing computer register allocation and assignment, particularly as implemented in an optimizing compiler using instruction level scheduling.
2. Related Art
A compiler is a computer program that transforms a source computer program written in one language, such as Fortran or C, into a target computer program that has the same meaning but is written in another language, such as an assembler or machine language. A compiler's tasks may be divided into an analysis stage followed by a synthesis stage, as explained in Compilers: Principles, Techniques, and Tools by A. Aho et al. (Addison Wesley, 1988) pp. 2-22. The product of the analysis stage may be thought of as an intermediate representation of the source program; i.e., a representation in which lexical, syntactic, and semantic evaluations and transformations have been performed to make the source code easier to synthesize. The synthesis stage may be considered to consist of two tasks: code optimization, in which the goal is generally to increase the speed at which the target program will run on the computer, or possibly to decrease the amount of resources required to run the target program; and code generation, in which the goal is to actually generate the target code, typically relocatable machine code or assembly code.
A compiler that is particularly well suited to one or more aspects of the code optimization task may be referred to as an “optimizing compiler.” Optimizing compilers are of increasing importance for several reasons. First, the work of an optimizing compiler frees programmers from undue concerns regarding the efficiency of the high-level programming code that they write. Instead, the programmers can focus on high-level program constructs and on ensuring that errors in program design or implementation are avoided. Second, designers of computers that are to employ optimizing compilers can configure hardware based on parameters dictated by the optimization process rather than by the non-optimized output of a compiled high-level language. Third, increased use of microprocessors that are designed for instruction level parallel processing, such as RISC and VLIW microprocessors, presents new opportunities to exploit this processing through a balancing of instruction level scheduling and register allocation.
There are various strategies that an optimizing compiler may pursue. Many of them are described in S. Muchnick, Advanced Compiler Design and Implementation (Morgan Kaufmann Publishers, 1997). One large group of these strategies focuses on optimizing transformations, such as are described in D. Bacon et al., “Compiler Transformations for High-Performance Computing,” in ACM Computing Surveys, Vol. 26, No. 4 (December 1994) at pp. 345-520. These transformations often involve high-level, machine-independent, programming operations: for example, removing redundant operations, simplifying arithmetic expressions, removing code that will never be executed, removing invariant computations from loops, and storing values of common sub-expressions rather than repeatedly computing them. These machine-independent transformations are hereafter referred to as high level optimizations.
Other strategies employ machine-dependent transformations. These machine-dependent transformations are hereafter referred to as low level optimizations. Two important types of low level optimizations are: (a) instruction scheduling and (b) register allocation. An important portion of both types of low level optimization strategies is focused on loops in the code, where in many applications the majority of execution time is spent.
A principal goal of some instruction scheduling strategies is to permit two or more operations within a loop to be executed in parallel, a process referred to as instruction level parallel (ILP) processing. ILP processing generally is implemented in processors with multiple execution units. One way of communicating with the central processing unit (CPU) of the computer system is to create “very long instruction words” (VLIWs). VLIWs specify the multiple operations that are to be executed in a single machine cycle. For example, a VLIW may instruct one execution unit to begin a memory load and a second to begin a memory store, while a third execution unit is processing a floating point multiplication. Each of these execution tasks has a latency period; i.e., the task may take one, two, or more cycles to complete. The objective of ILP processing is thus to optimize the use of the execution units by minimizing the instances in which an execution unit is idle during an execution cycle. ILP processing may be implemented by the CPU or, alternatively, by an optimizing compiler. Utilizing CPU hardware, however, may be complex and result in an approach that is not as easy to change or update as the use of an appropriately designed optimizing compiler.
One known technique for improving instruction level parallelism in loops is referred to as software pipelining. As described in the work by D. Bacon et al. referred to above, the operations of a single loop iteration are separated into s stages. After transformation, which may require the insertion of startup code to fill the pipeline for the first s-1 iterations and cleanup code to drain it for the last s-1 iterations, a single iteration of the transformed code will perform stage 1 from pre-transformation iteration i, stage 2 from pre-transformation iteration i-1, and so on. This single iteration is known as the kernel of the transformed code. A particular known class of algorithms for achieving software pipelining is referred to as modulo scheduling, as described in James C. Dehnert and Ross A. Towle, “Compiling for the Cydra 5,” in The Journal of Supercomputing, vol. 7, pp. 181, 190-197 (1993; Kluwer Academic Publishers).
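The stage overlap described above may be illustrated with a minimal sketch. It is not taken from the cited works; the function name pipeline_trace and its parameters are hypothetical, and every stage is assumed to take exactly one time step. For each time step, the sketch lists which stage of which pre-transformation iteration is running.

```python
def pipeline_trace(num_iterations, num_stages):
    """Return, per time step, the (stage, iteration) pairs that execute.

    Assumption: each stage takes one time step, stages are numbered from 1,
    and iterations are numbered from 0.
    """
    steps = []
    for t in range(num_iterations + num_stages - 1):
        active = []
        for stage in range(1, num_stages + 1):
            iteration = t - (stage - 1)  # stage k works on iteration t-(k-1)
            if 0 <= iteration < num_iterations:
                active.append((stage, iteration))
        steps.append(active)
    return steps


trace = pipeline_trace(num_iterations=5, num_stages=3)
for t, active in enumerate(trace):
    print(t, active)
```

With five iterations and three stages, the first two steps (s-1) only fill the pipeline, the last two only drain it, and each step in between is a kernel iteration in which all three stages are active on three different pre-transformation iterations.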
Typically, the application of an instruction scheduling algorithm depends on information provided by a dependence graph (as well as information about the machine on which the instructions will be executed). As is known to those skilled in the art, the dependence graph represents source program dependencies at the machine instruction level. The construction of the dependence graph is based upon general data flow information that may be computed and maintained across several optimization phases. There are several alternative forms of data flow representation described in the literature, and a typical optimizer may choose to use any one or more of these. Among them are so-called “def-use” (definition-use) chains, static single assignment (SSA) form, and dynamic single assignment (DSA) form. From the instruction scheduling point of view, the fewer dependencies there are in the dependence graph, the more freedom the scheduler has to achieve higher degrees of ILP. Some forms of data flow representation (such as SSA) enable more accurate and more resource-efficient construction of instruction dependence graphs than others.
As noted, another group of low level optimization strategies involves register allocation and assignment. Some of these strategies have as their goal improved allocation and assignment of registers used in performing loop operations. The allocation of registers generally involves the selection of variables to be stored in registers during certain portions of the compiled computer program. The subsequent step of assignment of registers involves the choosing of specific registers in which to place the variables. The term “variable” will generally be understood to refer to a quantity that has a “live range” during the portion of the computer program under consideration. Specifically, a variable has a live range at a particular point in the computer program if that point may be included in a control path having a preceding point at which the variable is defined and a subsequent point at which the variable is used. Thus, register allocation may be described as referring to the selection of live ranges to be stored in registers, and register assignment as the assignment of a specific physical register to one of the live ranges previously allocated for these assignments.
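The notion of a live range may be illustrated with a minimal sketch restricted to straight-line code. The instruction encoding (a pair of a defined name and a list of used names), the function names, and the assumption that each name is defined only once are all hypothetical simplifications; a real allocator would follow control paths rather than a single instruction sequence.

```python
def live_ranges(instructions):
    """Map each variable to (definition index, index of last use).

    Assumption: straight-line code in which every variable is defined once.
    Each instruction is a pair (defined_name_or_None, [used_names]).
    """
    ranges = {}
    for idx, (defined, used) in enumerate(instructions):
        for var in used:
            if var in ranges:
                ranges[var] = (ranges[var][0], idx)  # extend to this use
        if defined is not None and defined not in ranges:
            ranges[defined] = (idx, idx)
    return ranges


def max_live(ranges, num_instructions):
    """Largest number of values live between two consecutive instructions."""
    best = 0
    for boundary in range(num_instructions - 1):
        live = sum(1 for start, end in ranges.values() if start <= boundary < end)
        best = max(best, live)
    return best


code = [
    ("a", []),          # a = load
    ("b", []),          # b = load
    ("c", ["a", "b"]),  # c = a * b
    (None, ["c"]),      # store c
]
ranges = live_ranges(code)
print(ranges)
print(max_live(ranges, len(code)))
```

In this example the live ranges of a and b overlap, so under these assumptions at least two registers are needed, while c can reuse a register freed when a and b are last used.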
Registers are high-speed memory locations in the CPU generally used to store the value of variables. They are a high-value resource because they may be read from or written to very quickly. Typically, at least two registers can be read and a third written within a single machine cycle. In comparison, a single access to random access memory (RAM) may require several cycles to complete. Registers typically are also a relatively scarce resource. In comparison to the large number of words of RAM addressable by the CPU, typically numbered in the tens or hundreds of millions and requiring tens of bits to address, the number of registers will often be on the order of ten or a hundred and therefore require only a small number of bits to address. Because of their high value in terms of speed, the decisions of how many and which kind of registers to assign may be the most important decisions in determining how quickly the program will run. For example, a decision to assign a frequently used variable to a register may eliminate a multitude of time-consuming reads and writes of that variable from and to memory. This assignment decision often will be the responsibility of an optimizing compiler.
Register allocation and assignment are particularly difficult problems, however, when combined with the goal of minimizing the idle time of multiple execution units using instruction level scheduling. In particular, there is the well known problem, sometimes referred to as “phase ordering,” of which task should be performed first. In order to provide full freedom to the instruction scheduler to achieve a high degree of ILP, it is better to perform instruction scheduling before register allocation. However, having an insufficient number of registers to perform all the operations would cause the register allocator/assigner to insert “spill” instructions to spill one or more registers. That is, the contents of the spilled registers are temporarily moved to RAM to provide registers for the remaining operations that must be performed, and loaded back again into registers when required for subsequent operations. In order to schedule these spill instructions, the instruction scheduler must execute after the register allocator. Typically, compilers overcome this problem by executing the instruction scheduler twice: once before the register allocator/assigner executes, and once after.
Modulo scheduling and rotating register allocation/assignment introduce additional considerations into this already complex situation. Typically, modulo scheduling is performed as part of the instruction-scheduling phase before general register allocation/assignment in order to exploit more instruction level parallelism, as mentioned above. One would be able to arrive at the exact register requirements (rotating or static) for a loop only after a modulo schedule is determined. It is quite possible, however, that after a modulo schedule is determined, the register allocator/assigner may determine that spill code must be inserted due to an insufficient number of registers.
One attempt to address this problem is described in Q. Ning and Guang R. Gao, “A Novel Framework of Register Allocation for Software Pipelining,” in Proceedings of the SIGPLAN'93 Conference on POPL (1993) at pp. 29-42. The method described in that article (hereafter, the “Ning-Gao method”) makes use of register allocation as a constraint on the software pipelining process. The Ning-Gao method generally consists of determining time-optimal schedules for a loop using an integer linear programming technique and then choosing the schedule that imposes the least restrictions on the use of registers. One disadvantage of this method, however, is that it is quite complex and may thus significantly contribute to the time required for the compiler to compile a source program. Another significant disadvantage of the Ning-Gao method is that it does not address the need for, or impact of, inserting spill code. That is, the method assumes that the minimum-restriction criterion for register usage can be met because there will always be a sufficient number of available registers. However, this is not always a realistic assumption as applied to production compilers. (A production compiler is one intended for commercial production, as contrasted, for example, with a research compiler for experimental use.)
Another known method that attempts to provide for loop scheduling and register allocation while taking into account the potential need for inserting spill code is described in Jian Wang, et al., “Software Pipelining with Register Allocation and Spilling,” in Proceedings of the MICRO-27 (1994) at pp. 95-99. The method described in this article (hereafter, the “Wang method”) generally assumes that all spill code for a loop to be software pipelined is generated during instruction-level scheduling. Thus, the Wang method requires assumptions about the number of registers that will be available for assignment to the operations within the loop after taking into account the demand on register usage imposed by live ranges in the subprogram outside of the loop. These assumptions may, however, prove to be inaccurate, thus requiring either unnecessarily conservative assumptions to avoid this possibility, repetitive loop scheduling and register allocation, or other variations on the method.
Thus, a better method and system are needed for performing loop instruction scheduling and register allocation/assignment. This improved method and system should be capable of generating schedules with high degrees of instruction level parallelism. They should take into account practical constraints on the number of available registers and thus the potential need to insert spill code. However, the need to insert spill code should be minimized. The improved method and system should be efficient in terms of resource consumption (memory usage and compile time) for incorporation into production compilers.
The foregoing and other objects, features, and advantages are achieved in a system, method, and product for instruction scheduling and register allocation/assignment in an optimizing compiler. In one aspect of the invention, a scheduler-assigner for allocating rotating registers is disclosed. The scheduler-assigner is used in a computer with a memory unit, in which is stored a first intermediate representation (first IR) of source code. The first IR has data flow information in SSA form.
The scheduler-assigner includes a software-pipelined instruction scheduler that generates a first software-pipelined instruction schedule based on the first IR. The scheduler-assigner also includes a rotating register allocator that designates live ranges of loop-variant variables in the first software-pipelined instruction schedule as being allocated to rotating registers, when available. If a live range is exposed, the rotating register allocator may determine that none of the rotating registers should be designated as allocated to the exposed live range.
The first software-pipelined instruction schedule may be a modulo schedule. When a rotating register is not available, the software-pipelined instruction scheduler may generate a second software-pipelined instruction schedule having an initiation interval greater than the initiation interval of the first software-pipelined instruction schedule. In this case, the rotating register allocator may designate live ranges of loop-variant variables in the second software-pipelined instruction schedule as being allocated to rotating registers, when available. If rotating registers are not available for all these live ranges, the process may be repeated one or more times. For example, the software-pipelined instruction scheduler may generate a third software-pipelined instruction schedule having an initiation interval greater than the initiation interval of the second software-pipelined instruction schedule.
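The retry behavior described above may be sketched as follows. This is an illustration under stated assumptions rather than the disclosed scheduler: the function names are hypothetical, each live range is represented only by its lifetime in cycles, the lifetimes are assumed not to change when the loop is rescheduled, and the commonly used estimate that a lifetime of L cycles at initiation interval II occupies ceil(L / II) rotating registers is adopted.

```python
import math


def rotating_registers_needed(lifetimes, ii):
    """Estimate rotating registers needed at initiation interval ii.

    Assumption: a live range lasting L cycles overlaps ceil(L / ii)
    in-flight iterations and needs one register for each of them.
    """
    return sum(math.ceil(length / ii) for length in lifetimes)


def schedule_with_rotating_registers(lifetimes, available, min_ii, max_ii):
    """Try increasing initiation intervals until all live ranges fit.

    Returns (ii, registers_used), or None if no interval up to max_ii fits.
    Assumption: lifetimes stay fixed across reschedules, which a real
    modulo scheduler would not guarantee.
    """
    for ii in range(min_ii, max_ii + 1):
        needed = rotating_registers_needed(lifetimes, ii)
        if needed <= available:
            return ii, needed
    return None


print(schedule_with_rotating_registers([5, 3, 7], available=6, min_ii=1, max_ii=8))
```

Under these assumptions, initiation intervals of 1 and 2 would require 15 and 9 rotating registers respectively, so the search settles on an interval of 3, at which the three live ranges fit in the 6 available registers. A larger initiation interval lowers register demand at the cost of starting loop iterations less frequently.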
The scheduler-assigner may also include a modulo schedule code generator that generates, based on the designations of the live ranges as being allocated to the rotating registers, a rotating register assigned intermediate representation that includes an assignment of the rotating registers to the live ranges. The modulo schedule code generator includes a software-pipelined instruction-schedule code inserter that generates from the first IR a software-pipelined IR having one or more instructions that are software-pipelined based on the first software-pipelined instruction schedule. The modulo schedule code generator also includes a rotating register assigner that assigns the first rotating register in the software-pipelined IR to the first live range, thereby generating a rotating-register assigned IR. The assignment is based upon the designation of the first live range as being allocated to the first rotating register.
The rotating-register assigned IR may have one or more phi functions including a first phi function having an operand to which the rotating register assigner has assigned the first rotating register. The modulo schedule code generator includes an SSA updater that propagates the first rotating register to at least one use of the operand, thereby generating a data-flow updated IR. When the first rotating register has been propagated to at least one use of the operand, the SSA updater removes the first phi function from the data-flow updated IR, thereby generating an SSA-updated IR.
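The propagation and phi-removal step may be pictured with a toy sketch. The dictionary-based IR, the register name r32, and the function name are hypothetical, and the sketch assumes, purely for illustration, that the operand carrying the assigned rotating register is the result of the phi function; the text above does not fix that detail.

```python
def propagate_and_remove_phis(ir, assigned):
    """Replace uses of register-assigned phi results and drop those phis.

    ir: list of dicts with keys "op", "dest", and "args" (a toy IR).
    assigned: maps a phi result name to its rotating register (assumption:
    the register has been assigned to the phi's result operand).
    """
    rename = {}
    kept = []
    for instr in ir:
        if instr["op"] == "phi" and instr["dest"] in assigned:
            rename[instr["dest"]] = assigned[instr["dest"]]
            continue  # every use will be rewritten, so the phi is removed
        kept.append(instr)
    # Propagate the rotating register to each use of the removed phi result.
    return [
        {**instr, "args": [rename.get(arg, arg) for arg in instr["args"]]}
        for instr in kept
    ]


ir = [
    {"op": "phi", "dest": "x3", "args": ["x1", "x2"]},
    {"op": "add", "dest": "x4", "args": ["x3", "y"]},
    {"op": "store", "dest": None, "args": ["x4"]},
]
for instr in propagate_and_remove_phis(ir, {"x3": "r32"}):
    print(instr)
```

After the rewrite, the add instruction reads r32 directly and no phi function remains, which corresponds to the transition from the data-flow updated IR to the SSA-updated IR in this simplified picture.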
The scheduler-assigner may have an SSA discarder that eliminates data flow information from the SSA-updated IR, thereby generating an SSA-discarded IR. In some implementations, the SSA discarder eliminates the data flow information using a sibling relationship technique. In some implementations, the computer has static registers and the SSA-discarded IR includes one or more static virtual registers. In these implementations, the scheduler-assigner may include a static register assigner and memory spiller that assigns a first static register, when available, to replace a first of the one or more static virtual registers, thereby generating a static-register assigned IR. When the first static register is not available, the static register assigner and memory spiller inserts in the static-register assigned IR one or more spill code instructions for a live range corresponding to the first static virtual register. The scheduler-assigner may further include a machine code generator that transforms the static-register assigned IR into a set of machine code instructions suitable for execution by the computer's processor.
In some aspects of the invention, a method for allocating rotating registers is described. The method includes the steps of: (a) generating a first software-pipelined instruction schedule based on a first IR of source code; and (b) designating live ranges of loop-variant variables in the first software-pipelined instruction schedule as being allocated to rotating registers. The first IR includes data flow information in SSA form. The first software-pipelined instruction schedule may be a modulo schedule.
In some implementations of the method, step (b) includes, if a rotating register is not available for having a live range designated to it, generating a second software-pipelined instruction schedule having an initiation interval greater than the initiation interval of the first software-pipelined instruction schedule. Live ranges of a loop-variant variable in the second software-pipelined instruction schedule may be designated as being allocated to the first rotating register, when available. If rotating registers are not available for all these live ranges, the method includes generating a third software-pipelined instruction schedule having an initiation interval greater than the initiation interval of the second software-pipelined instruction schedule. These steps of generating software-pipelined instruction schedules with increasing initiation intervals, and attempting to designate all live ranges as being allocated to rotating registers, may continue to be repeated to find a schedule such that a sufficient number of rotating registers are available. In some aspects, step (a) includes, when a rotating register is not available for having a live range designated to it, (i), inserting one or more spill code instructions in the first IR for the live range, and (ii) generating another software-pipelined instruction schedule based on the first IR including the spill code instructions.
The method may also have a step (c) of generating, based on the software-pipelined instruction schedule and the designation of live ranges as being allocated to rotating registers, a rotating register assigned IR that includes an assignment of the rotating registers to the live ranges. In some implementations, this step (c) includes (i) generating from the first IR a software-pipelined IR having one or more instructions that are software-pipelined based on the first software-pipelined instruction schedule, and (ii) assigning the rotating registers in the software-pipelined IR to the live ranges, thereby generating a rotating-register assigned IR, wherein the assignment is based upon the designation of the live ranges as being allocated to the rotating registers.
The rotating-register assigned IR generated in accordance with this method may have one or more phi functions including a first phi function having an operand to which a first rotating register has been assigned. In this implementation, step (c) of the method further includes (iii) propagating the first rotating register to at least one use of the operand, thereby generating a data-flow updated IR. When the first rotating register has been propagated to at least one use of the operand, step (c) (iii) further includes the step of removing the first phi function from the data-flow updated IR, thereby generating an SSA-updated IR. Another step in the method may be (d) eliminating data flow information from the SSA-updated IR.
In yet other aspects of the invention, an optimizing compiler is described. The compiler is for use in a computer that has rotating registers. The compiler includes a front end processor that applies high-level, machine independent, optimizing transformations to a source code image, thereby generating a low level intermediate representation (low level IR) of the source code. The compiler also includes a low-level code optimizer that has a control and data flow information generator that generates a low level IR with control and data flow information. The data flow information is based upon data flow in the low level IR, and is in SSA form. Also included in the compiler is a global and loop optimizer that applies global, low level optimization techniques to the low level IR with control and data flow information, thereby generating a low-level optimized IR. A global scheduler then applies instruction scheduling techniques to the low-level optimized IR, thereby generating a list scheduled IR with control and data flow information (list-scheduled IR). Also included in the compiler is a scheduler-assigner that allocates rotating registers. The scheduler-assigner includes a software-pipelined instruction scheduler that generates a first software-pipelined instruction schedule based on the list scheduled IR, and a rotating register allocator that designates live ranges of loop-variant variables in the first software-pipelined instruction schedule as being allocated to rotating registers. The first software-pipelined instruction schedule may be a modulo schedule.
In a further aspect of the invention, a computer system is described. The computer system has a processor, one or more rotating registers, and a memory unit having stored therein a first intermediate representation (first IR) of source code and a set of scheduling-assignment instructions for execution by the processor. The first IR includes data flow information in SSA form. The set of scheduling-assignment instructions includes a set of software-pipelined instruction scheduler instructions that generate a first software-pipelined instruction schedule based on the first IR. The set of scheduling-assignment instructions also includes a set of rotating register allocator instructions that designate live ranges of loop-variant variables in the first software-pipelined instruction schedule as being allocated to rotating registers. The first software-pipelined instruction schedule may be a modulo schedule.
Storage media are described in another aspect of the invention. The storage media contain software that, when executed on a computing system, performs a method for allocating rotating registers. The method includes the steps of: (a) generating a software-pipelined instruction schedule based on a first intermediate representation (first IR) of source code stored in a memory unit of the computer; and (b) designating live ranges of loop-variant variables in the software-pipelined instruction schedule as being allocated to rotating registers. The first IR includes data flow information in SSA form. The software-pipelined instruction schedule may be a modulo schedule.
The above aspects and implementations of the invention are not necessarily inclusive or exclusive of each other and may be combined in any manner that is non-conflicting and otherwise possible, whether they be presented in association with a same, or a different, aspect or implementation of the invention. The description of one aspect is not intended to be limiting with respect to other aspects. Also, any one or more function, step, operation, or technique described elsewhere in this specification may, in alternative aspects, be combined with any one or more function, step, operation, or technique described in the summary. Thus, the above aspects are illustrative rather than limiting.