1. Field of the Invention
The present invention relates generally to software compiler systems used by processors and computing devices. More particularly, the invention relates to a method and apparatus for optimizing object-code sequencing by such a compiler.
2. Description of the Related Art
Processors can be made faster either by using faster circuitry or by increasing the amount of simultaneous computation they perform. The degree of simultaneous computation may be increased by including instruction-level parallelism (ILP), which exploits the inherent parallelism that most programs naturally contain. The basis of ILP is that the order in which operations are performed by the processor may be rearranged so that there is overlap in the execution of operations. A simple example is where the operation of locating a memory address is performed while the value to be stored at that address is computed. Significant improvements in efficiency can be achieved by appropriate rearrangement of operations.
Some ILP processors, such as Superscalar processors, reorder operations that are known to depend on each other because of the hardware. Other processors, such as Very Long Instruction Word (VLIW) processors or Explicitly Parallel Instruction Computing (EPIC) processors, rely on the instructions themselves to express parallelism. An advantage of ILP exploitation is that individual users are not required to rewrite programs to make use of the parallelism; instead, compilation techniques are invoked automatically to map a program""s ILP to the parallelism available for the target architecture.
The approaches that have traditionally been taken for reorganizing the instruction set, referred to generally as xe2x80x9cinstruction scheduling,xe2x80x9d may be categorized as either local or global scheduling, such terminology depending on the organization of the instruction sequence into xe2x80x9cbasic blocks.xe2x80x9d A basic block is a straight-line sequence of intermediate instructions having a single entry and a single exit, with no internal branches. Thus, each operation specified in a basic block will be executed once each time execution enters the block. Local scheduling refers to reorganization of instructions only within individual basic blocks while global scheduling encompasses the rearrangement of instructions between multiple basic blocks. While local scheduling may have been adequate for early processors, it is generally insufficient for modern architectures because it is inherently unable to exploit parallelism between basic blocks. This is particularly relevant for an EPIC architecture, which requires global scheduling to achieve adequate improvements.
There are several approaches to global scheduling that extract parallelism from a program by moving operations across basic block boundaries and inserting compensation copies to maintain program semantics. One is the use of a trace scheduling algorithm, which attempts to optimize the most frequently used program paths at the expense of less frequently used paths. This technique reduces the problem to a local-scheduling problem and is described, for example, in P. Geoffrey Lowney et al., xe2x80x9cThe Multiflow Trace Scheduling Compiler,xe2x80x9d J. Supercomputing 7, 51 (1993) and Joseph A. Fisher, xe2x80x9cTrace Scheduling: A Technique for Global Microcode Compaction,xe2x80x9d IEEE Trans. Comps. C-30, 478 (1981). The technique functions by identifying a frequently executed acyclic path, i.e. a xe2x80x9ctrace,xe2x80x9d in the flow of operations and then allowing operations to move past branches or labels within that path. Corrections to the code are then inserted to compensate for such movement where basic blocks branch into the middle of traces or are branched into from traces. In particular, operations that are moved from one basic block on the trace to another basic block on the trace need to be copied to certain basic blocks that are not on the trace. While the technique reduces the overall number of instructions in the trace, this may be at the expense of increasing the number of instructions in program segments that are not part of the trace. Since those program segments that are not part of the trace are executed less frequently, the scheduling results in an overall improvement in program execution time.
Another approach to global scheduling is percolation scheduling as described, for example, in A. Nicolau, xe2x80x9cPercolation Scheduling: A Parallel Computation Technique,xe2x80x9d Tech. Rep, Cornell Univ. (1984). This technique is based on a set of core transformations applicable to a program graph. It is a xe2x80x9cgreedyxe2x80x9d algorithm that increases parallelism by moving operations upwards in the graph as much as possible, i.e. as close as possible to the graph node that represents the particular function being compiled. Because the technique ignores system requirements, operations that are executed with small probability nevertheless consume significant system resources. An attempt to address this deficiency is included in enhanced percolation scheduling, where operation movement is delayed until scheduling time. Enhanced percolation scheduling is described in, for example, K. Ebcioglu and A. Nicolau, xe2x80x9cA Global Resource-Constrained Parallelization Technique,xe2x80x9d Proc. 3d Int""l Conf. Supercomp., 154 (1989).
Like trace scheduling, percolation scheduling requires that compensatory code corrections be introduced. Other related techniques that also require compensatory code corrections include sentinel scheduling and region scheduling. Sentinel scheduling uses profiling information to schedule discrete multiple basic blocks (xe2x80x9csuperblocksxe2x80x9d) and is described, for example, in S. A. Mahlke et al., xe2x80x9cSentinel scheduling for VLIW and superscalar processors,xe2x80x9d asplos5 27, 238 (1992). Region scheduling, in which program transformation rules are used in an attempt to balance the program graph, is described, for example, in R. Gupta and M. Soffa, xe2x80x9cRegion scheduling: An approach for detecting and redistributing parallelism,xe2x80x9d IEEE Trans. Software Eng. 16, 421 (1990).
Global instruction scheduling differs from these in that it attempts to limit code duplication. While operations may be moved beyond basic block boundaries, the technique constrains the movement of operations to be within an enclosing loop. The technique is described, for example, in D. Bernstein and M. Rodeh, xe2x80x9cGlobal instruction scheduling for superscalar machines,xe2x80x9d Conf. Prog. Lang. Design and Implementation, SIGPLAN xe2x80x291, p. 241 (1991).
A further approach to global scheduling is described in U.S. Pat. No. 5,557,761, issued to Chan et al. The technique presented there operates by first selecting source and target basic blocks, which are identified according to the relationships of the instructions within them. An attempt is made to identify a maximal set of instructions contained in the source basic block that may by moved to the target basic block without violating any data dependency relationships and remaining within the confines of available resources. A model (called the xe2x80x9cLatency Subsumption Cost of Movement Modelxe2x80x9d) is used to specify the overall cost of such movement, which is then used to decide whether to move the identified maximal instruction set.
The present invention is directed to a method and apparatus for generating an optimized intermediate representation of source code for a computer program. The optimized intermediate representation is generated from an initial intermediate representation, which is extracted from the source code by organizing it as a plurality of basic blocks. Each of the basic blocks contains at least one program instruction. The basic blocks in the initial intermediate representation are ordered according to an estimated profit value that is assigned to each of the basic blocks. A goal function, which is designed as an appropriate measure of the degree of optimization of the program according to its intermediate representation, is then calculated. The effect on the goal function of modifying the intermediate representation by moving an instruction from one of the basic blocks to each of its predecessors is tested iteratively. If the modified intermediate representation causes a reduction in the goal function, it is adopted. The process iterates, progressively testing movement of instructions to predecessors, until the intermediate representation converges and the value of the goal function is minimized.
In a preferred embodiment, the iterative process proceeds by selecting a current basic block and storing the profit value of the basic block subsequent to it in the ordered representation of basic blocks. The first instruction of the current basic block is then moved to all of its predecessors to create a test intermediate representation. A new goal function is calculated based on the test intermediate representation. If the difference between the previous goal function and the new goal function exceeds the stored profit value, the test intermediate representation is adopted, but is not adopted if the difference is less than or equal to the stored profit value.
In another embodiment, the invention is directed to a computer apparatus for generating the optimized object code. The computer apparatus includes an input device capable of reading the set of source code instructions, a compiler, and an output device for outputting the optimized object code. The compiler includes a front end, an optimizer, and a code generator. The front end is programmed to generate an initial representation of the source code in terms of basic blocks, which it then orders according to an estimated profit value assigned to each basic block. The front end also calculates an initial value for the goal function. The optimizer is programmed to test iteratively the effect on the goal function of modifying the intermediate representation by moving an instruction from a basic block to each of its predecessors; if the modified intermediate representation has a lower goal function, it is adopted. The iterative procedure is carried out by the optimizer until the intermediate representation has converged. The converged intermediate representation is then converted by the code generator into the optimized object code.