Most computers today, particularly Reduced Instruction Set Computer (RISC) systems, use instruction pipelines, which keep the next instruction to execute available to the arithmetic logic unit (ALU) at all times, theoretically allowing a one-to-one correspondence between clock rate and execution speed. Under certain well-defined conditions within the hardware, a pipeline interlock prevents execution of a particular instruction until its operands are available as results of previous computations. The effect of this interlock may be a lost processing cycle. The most frequent cause of such a delay is the processing of a conditional branch. At this point all instructions in the pipeline after the branch instruction are potentially nonexecutable, because the branch may direct control of the program into a different instruction stream; if it does so, they must be replaced by instructions from that stream. While the pipeline is being reloaded, the interlock holds the ALU idle, waiting for the new instruction stream. The problem is compounded if the redirected instruction stream immediately contains another conditional branch, introducing another potential interlock.
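The cost of branch-induced interlocks can be illustrated with a deliberately simplified cycle count. The sketch below is not a model of any particular machine: it assumes that every instruction issues in one cycle and that each taken conditional branch idles the ALU for a fixed number of cycles while the pipeline is reloaded. The names `Instr`, `count_cycles`, and `refill_penalty` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    is_branch: bool = False
    taken: bool = False  # only meaningful when is_branch is True


def count_cycles(stream, refill_penalty=1):
    """Return (total_cycles, lost_cycles) for an executed instruction stream.

    Assumed model: each instruction takes one cycle, and each taken
    branch adds `refill_penalty` idle cycles while the pipeline reloads.
    """
    lost = sum(refill_penalty for i in stream if i.is_branch and i.taken)
    return len(stream) + lost, lost


# Straight-line code: no branches, so no cycles are lost.
straight = [Instr("load"), Instr("add"), Instr("store")]

# Two taken branches in quick succession: the penalty is paid twice.
branchy = [
    Instr("cmp"),
    Instr("bc", is_branch=True, taken=True),
    Instr("cmp"),
    Instr("bc", is_branch=True, taken=True),
    Instr("add"),
]

print(count_cycles(straight))
print(count_cycles(branchy))
```

Under these assumptions the second stream spends two of its cycles idle, which is the compounding effect described above when one redirected stream immediately contains another conditional branch.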
Several attempts have been made in the past to provide a constantly replenished instruction pipeline.
Early RISC machines provide a post-execute form of branch instruction which allows certain instructions, which normally are executed prior to the branch, to appear in the instruction stream after the branch. This makes it possible for some instructions to be executed while the pipeline is being refilled. Such is the case with the IBM RT PC computer system. (IBM and RT are trademarks of International Business Machines Corp.)
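How a compiler might exploit such a post-execute branch can be sketched as follows. This is an illustrative toy, not the method of any cited compiler or of the RT PC: it assumes a single slot after the branch, tracks register dependences only (memory dependences are ignored), and uses the hypothetical names `Instr`, `independent`, and `fill_delay_slot`. It looks for the latest instruction before the branch that can safely be moved past everything that follows it, and places that instruction after the branch; if none qualifies, it inserts a no-op.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dest: Optional[str]  # register written, or None
    srcs: tuple          # registers read


def independent(cand, later):
    """True if `cand` can be moved past `later` without changing results."""
    return (
        cand.dest not in later.srcs                         # later does not read cand's result
        and (cand.dest is None or cand.dest != later.dest)  # both do not write the same register
        and later.dest not in cand.srcs                     # later does not overwrite cand's inputs
    )


def fill_delay_slot(block):
    """Reorder a block that ends in a branch so the slot after it is filled.

    Assumption: the instruction placed after the branch always executes,
    so only instructions that originally preceded the branch are eligible.
    """
    *body, branch = block
    for i in range(len(body) - 1, -1, -1):
        cand = body[i]
        if all(independent(cand, later) for later in body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, cand]
    return body + [branch, Instr("nop", None, ())]


block = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("cmp", "cr", ("r4", "r5")),
    Instr("bc", None, ("cr",)),
]
print([i.op for i in fill_delay_slot(block)])
```

In this example the `add` does not feed the compare or the branch, so it can be placed after the branch and executed while the pipeline is being refilled. The compare cannot be moved, because the branch reads the condition register it sets.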
Finding appropriate instructions to insert between the definition and use points of a condition register, to increase instruction pre-fetch opportunities, forms part of the class of compiler optimizations called instruction scheduling, and is part of the background of this invention. Prior art code hoisting techniques generally look for computations which occur along parallel paths of execution and hoist them to a node which dominates both paths. The resulting module is typically smaller, because two computations are replaced with one. This invention employs a refinement of such techniques to find and hoist instructions along paths of conditional execution in order to maintain a full instruction pipeline. The result is that instructions are scheduled across basic block boundaries. The articles below outline some attempts to keep the instruction pipeline full of executable instructions:
Arya S., Optimal Instruction Scheduling for a Class of Vector Processors: An Integer Programming Approach, Tech. Rept. CRL-TR-19-83, Computer Research Laboratory, Univ. of Mich., Ann Arbor, April 1983.
Auslander M. and Hopkins M., An Overview of the PL.8 Compiler, Proc. ACM SIGPLAN Symp. on Compiler Construction, Boston, June 1982, pp. 22-31.
Gibbons P. and Muchnick S., Efficient Instruction Scheduling for a Pipelined Architecture, Proc. SIGPLAN 86 Symp. on Compiler Construction, Palo Alto, 1986, pp. 11-16.
Gross T. R., Code Optimization of Pipeline Constraints, Tech. Rept. 83-255, Computer Systems Lab., Stanford Univ., December 1983.
Hennessy J. L. and Gross T. R., Postpass Code Optimization of Pipeline Constraints, ACM Trans. on Prog. Lang. and Sys., Vol. 5, No. 3, July 1983, pp. 422-448.
Sites R. L., Instruction Ordering for the Cray-1 Computer, Tech. Rept. 78-CS-023, Univ. of Calif., San Diego, July 1978.
Research on compile-time pipeline scheduling is relatively sparse. Gibbons et al. (1986), Gross (1983), Hennessy et al. (1983), and Sites (1978) discuss scheduling performed during a pass after code generation and register allocation. Instruction scheduling prior to register allocation has been implemented in several compilers, including the IBM PL.8 compiler described by Auslander et al. (1982), but in all the referenced works scheduling is limited to reducing interlocks within basic blocks.
Despite these attempts, there remains a need for a way to reduce delays caused by potentially nonexecutable code being present in the instruction pipeline when control in the program flows across multiple basic block boundaries.