1. Technical Field of the Invention
The present invention relates generally to computer architectures for instruction decoding, and more particularly to apparatus and method for improved prefetching and decoding for computer architectures with instruction pipelines.
2. Background Art
Instruction pipelining of computer operations has long been used to increase the performance of von Neumann computers. In the simplest von Neumann architecture, each phase of the execution of an instruction (e.g., fetch, decode, execute) is done sequentially even if the hardware needed for the phases never interacts. Pipelining allows phases of more than one instruction to be processed by non-interacting sections of hardware at the same time. Typically, there are three phases of instruction execution: the instruction fetch, instruction decode and instruction execute. These three operations can be processed independently of one another as long as the phases of each instruction remain sequential. Thus, the fetch of a first instruction can be performed in one cycle. When this phase is complete, the decode unit processes the first instruction. The fetch unit, however, can now begin the processing of a second instruction in parallel with the decoding of the first instruction. This pipelining of instructions allows the completion of an instruction every cycle, once the pipeline is filled, even though each instruction takes more than one cycle to process.
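By way of illustration only, the cycle counts described above can be sketched in Python. The sketch assumes that each phase takes exactly one cycle and that no branches or stalls occur. The function names are hypothetical and are not taken from any of the cited systems.

```python
def sequential_cycles(num_instructions: int, num_stages: int = 3) -> int:
    """Cycles needed when every phase of every instruction runs one after another."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 3) -> int:
    """Cycles needed when phases of successive instructions overlap.

    The first instruction needs num_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def pipeline_schedule(num_instructions: int,
                      stages=("fetch", "decode", "execute")) -> dict:
    """Map each cycle number to the (instruction, phase) pairs active in it."""
    schedule = {}
    for i in range(num_instructions):
        for s, name in enumerate(stages):
            # Instruction i+1 enters phase s in cycle i+s+1.
            schedule.setdefault(i + s + 1, []).append((i + 1, name))
    return schedule


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(3).items()):
        print(cycle, work)
```

Under these assumptions, ten instructions take 30 cycles when processed sequentially but only 12 cycles when pipelined, and in the second cycle the first instruction is being decoded while the second is being fetched.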
However, program flow changes, such as branch statements and procedure calls and returns, which cause the program instruction code to be non-sequential, create problems for pipelined architectures. For example, with the three-stage pipelines discussed above, if a branch instruction is instruction 1, two other instructions will have been partially processed when the direction of the branch is determined in the execute stage. If these instructions are not the target of the branch, the instructions at the proper location must be fed into the pipeline and the processing already done has been wasted. This delay causes a two cycle "bubble" in the execution stream. Also, if this processing is not suspended before it changes the state of the machine, some of it could produce incorrect results which must be fixed before the correct instruction can begin. This situation would cause a larger bubble. Since branches and other program flow changes can account for 12% to 33% of the instructions executed in a program, the branch problem can cause a significant degradation in performance because each branch can potentially delay the execution of the pipeline if the incorrect target is processed before the branch is executed.
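As a rough illustration of this degradation, the following Python sketch estimates the average number of cycles per instruction. It assumes a pipeline that otherwise completes one instruction per cycle and, as a worst case, that every program flow change incurs the full two-cycle bubble. The function and parameter names are hypothetical.

```python
def effective_cpi(branch_fraction: float,
                  bubble_cycles: int = 2,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when each branch adds a fixed bubble.

    Worst-case assumption: every branch pays the whole bubble.
    """
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must lie between 0 and 1")
    return base_cpi + branch_fraction * bubble_cycles


def fraction_of_peak(branch_fraction: float, bubble_cycles: int = 2) -> float:
    """Sustained throughput as a fraction of the one-per-cycle peak."""
    return 1.0 / effective_cpi(branch_fraction, bubble_cycles)


if __name__ == "__main__":
    for f in (0.12, 0.33):
        print(f, effective_cpi(f), fraction_of_peak(f))
```

Under these assumptions, the 12% and 33% figures cited above correspond to roughly 1.24 and 1.66 cycles per instruction, that is, to sustained throughput of about 81% and 60% of peak.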
The branch problem thus contributes to the discrepancy between the peak and sustained performance of a machine. The peak performance is the maximum attainable instruction throughput. To determine this performance, the instruction code is organized to take advantage of all the features of an architecture and to avoid all of its possible bottlenecks. Sustained performance is a measure of throughput based on a normal load on a machine. If pipeline bubbles cause a performance degradation in a particular machine, the peak performance would be obtained using a workload or program with very few or no branches. Standard computer programs, of course, have branches, causing the sustained performance to be a function of the number of these branches.
Many approaches have been used to reduce the performance degradation due to these pipeline branch effects. One early and simple approach to this problem is to allow the instruction prefetching mechanism to continue down one direction of the branch. This approach is used, for example, in the control unit of the ILLIAC IV; see the paper by Barnes, et al., "The ILLIAC IV Computer", IEEE Transactions, pp. 746-757, August, 1968. In this approach, if the correct direction has been prefetched, the pipeline continues operating without bubbles. If the wrong direction is prefetched, however, the pipeline must be flushed and restarted at the target instruction. The simplest implementation of this method involves prefetching the instructions immediately following the branch. If the branch is not taken, no bubble will occur. In typical programs, however, branches are taken over 60% of the time, so this implementation incurs a bubble on the majority of branches.
Instead of prefetching in one direction, prefetching the instructions in both directions of the branch has also been tried. Systems using variations of this method include the IBM 360/91, which is described in the texts by Hwang et al., Computer Architecture and Parallel Processing, McGraw-Hill, 1984; and by Kogge, Architecture of Pipelined Computers, McGraw-Hill, 1981. Both directions of the branch are prefetched. The instructions in one of the directions are decoded until the branch has been executed. If the other direction is taken, the decoded instructions are flushed, and the prefetched instructions of the other direction are decoded.
Prefetching in both directions of a branch improves performance if only one branch is in the pipeline at a time. If multiple branches are being processed, all the possible targets of those branches need to be prefetched. The performance improvement based on the number of branches which have been prefetched is proportional to the square root of the number of branches, without taking into account the distance between branches. If branches are separated in the code, they could be loaded into different fetching units one after another. The amount of prefetching called for by this approach greatly increases the complexity of the instruction fetch unit.
In order to decrease the number of prefetched instructions which are not used, branch prediction can be used. This prediction of which direction a branch may take can be either dynamic (during execution) or static (during compilation). Several dynamic methods are discussed in the paper by Lee et al., "Branch Prediction Strategies and Branch Target Buffer Design", Computer, Vol. 17, No. 1, January, 1984. Another dynamic approach, which is described in the paper by McFarling, et al., "Reducing the Cost of Branches", Proceedings of the 13th International Symposium on Computer Architectures, pp. 396-403, June, 1986, uses a cache-like table containing lines of two prediction bits. Access to this table is determined by the low-order bits of the branch address. The two bits give the recent history of the activity of the branch. This history is used to predict the most likely direction the branch will take, and the prefetch of the branch target is based on this decision. Once the true branch direction is decided, a finite state machine updates the history bits. If the prediction is correct, the branch penalty is only one cycle since the decode phase is still suspended until the branch is executed.
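A minimal Python sketch of such a two-bit prediction table is given below for illustration. The particular finite state machine used here, a saturating counter, is an assumption: it is a common choice, but the state machine of the cited paper is not reproduced in this description. The table size, the initial state and all names are likewise hypothetical.

```python
class TwoBitPredictor:
    """Table of two-bit history entries indexed by low-order address bits.

    Assumed encoding: 0 and 1 predict "not taken", 2 and 3 predict "taken".
    """

    def __init__(self, index_bits: int = 4):
        self.mask = (1 << index_bits) - 1
        # Assumed initial state: weakly "not taken" for every entry.
        self.table = [1] * (1 << index_bits)

    def _index(self, branch_address: int) -> int:
        # Only the low-order bits of the branch address select the entry.
        return branch_address & self.mask

    def predict(self, branch_address: int) -> bool:
        """Return True if the branch is predicted taken."""
        return self.table[self._index(branch_address)] >= 2

    def update(self, branch_address: int, taken: bool) -> None:
        """Advance the history bits once the true direction is known."""
        i = self._index(branch_address)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, branch_address, outcomes) -> int:
    """Run a sequence of actual outcomes through the predictor."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_address) != taken:
            misses += 1
        predictor.update(branch_address, taken)
    return misses
```

With these assumed settings, a loop-closing branch that is taken nine times and then falls through is mispredicted twice on the first pass through the loop (once on entry and once on exit) and only once, on exit, on each later pass.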
Static prediction involves having the compiler set a single prediction bit. This bit is not changed during program execution. One such system is the Bell Labs CRISP microprocessor, which is described in the paper by Ditzel, et al., "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero", Proceedings of the 14th International Symposium on Computer Architectures, pp. 2-9, 1987. The CRISP system relies on a special compiler to assign the static prediction bit.
Both static and dynamic prediction involve an increase in the complexity of the system, either in software or in hardware. Neither scheme is able to predict the direction of a branch with 100% accuracy. While these schemes certainly improve performance, they do not solve the non-sequential program flow problem.
Branch target buffers or branch history tables are further extensions to branch prediction methods. They use a cache-like structure to store the target which the branch has recently addressed. Such systems are described in the text by Stone, High-Performance Computer Architecture, Addison-Wesley, 1987. When a branch is encountered, its address is used as a tag into the cache, which contains the last target address of that branch. From this point, the procedure progresses in the same manner as other branch prediction methods. When the target of the current branch is actually determined, the cache is updated. If this prediction is wrong, a full branch penalty is incurred.
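The lookup and update steps just described can be sketched in Python as follows. This is an illustrative model only: it assumes a fully associative buffer with least-recently-used replacement, neither of which is stated for the systems cited above, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class BranchTargetBuffer:
    """Cache of the most recent target address seen for each branch."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        # Maps branch address (the tag) to its last target address.
        self.entries = OrderedDict()

    def lookup(self, branch_address: int):
        """Return the predicted target, or None if the branch is not cached."""
        target = self.entries.get(branch_address)
        if target is not None:
            # Assumed LRU policy: a hit makes the entry most recently used.
            self.entries.move_to_end(branch_address)
        return target

    def update(self, branch_address: int, actual_target: int) -> None:
        """Record the target once the branch has actually been resolved."""
        self.entries[branch_address] = actual_target
        self.entries.move_to_end(branch_address)
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
```

In this model, a prediction is correct when lookup returns the same address that is later passed to update for that branch. A return value of None, or a differing address, corresponds to the full branch penalty.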
The size of the branch target buffer obviously has an effect on its performance. It has been shown that the buffer must be fairly large. For example, the MU-5, a high speed general purpose computer built in the early 1970's at Manchester University, had the correct target in its eight-entry branch target buffer only 40-60% of the time. This hit rate can be increased to 93% with a larger buffer of 256 entries.
The best improvement obtained by a branch target buffer is for unconditional branches and subroutine calls. After the target of one of these types of instructions is stored, the prediction is always correct provided the line is not removed from the buffer due to the replacement policy of the cache. This system also works well when predicting the branches of loop control structures. A loop will branch many times to one target and only once to the other. A branch target buffer will only make the wrong prediction once. The buffer can also be constructed to contain the next few instructions after the predicted branch, as described in the article by Lilja, "Reducing the Branch Penalty in Pipelined Processors", Computer, pp. 47-55, July, 1988. If a loop is small, this branch target buffer resembles an instruction cache. The hardware complexity, however, is greater than that of other solutions, and this scheme is also not 100% accurate.
Another method of dealing with the branch problem involves the use of code reorganization to fill the bubbles with useful work. Delayed branching uses a compiler to fill the gap following a branch with instructions normally occurring before the branch. When the compiler detects a branch, it searches through the instructions preceding it looking for instructions on which the branch computation is not dependent. If any are found, they are relocated into delay slots following the branch. The number of delay slots corresponds to the delay involved in obtaining the target. No matter what the outcome of the branch, the delayed instructions will always need to be executed because they were originally located before the branch in the program. If all the delay slots have been filled, the target of the branch will be ready to be input into the pipeline after the delayed instructions have started. This process produces no pipeline bubbles. Delayed branching is used in various RISC systems such as the IBM 801, see the paper by Radin, "The 801 Minicomputer", Proceedings on the Architectural Support for Programming Languages and Operating Systems, pp. 39-47, March, 1982; the Berkeley RISC I, see the paper by Patterson et al., "RISC-I: A Reduced Instruction Set VLSI Computer", Proceedings of the 8th International Symposium on Computer Architectures, May, 1981; MIPS, see the paper by Hennessy et al., "MIPS: A VLSI Processor Architecture", Proceedings of the CMU Conference on VLSI Systems and Computations, October, 1981, and the paper by Moussouris, et al., "A CMOS RISC Processor With Integrated System Functions", Proceedings of the Spring COMPCON, p. 126, 1986; and the HP Spectrum, see the paper by Birnbaum, et al., "Beyond RISC: High Precision Architecture", Proceedings of the Spring COMPCON, p. 40, 1986.
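The compiler search described above can be sketched in Python as follows. This is a simplified illustration rather than the algorithm of any cited compiler: it assumes a single basic block, models each instruction only by the registers it reads and writes, and refuses to move an instruction past any later instruction with which it shares a register dependence. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    writes: frozenset = frozenset()
    reads: frozenset = frozenset()


NOP = Instr("nop")


def conflicts(earlier: Instr, later: Instr) -> bool:
    """True if executing `earlier` after `later` could change the results."""
    return bool(earlier.writes & later.reads      # later needs earlier's result
                or earlier.reads & later.writes   # later overwrites an input
                or earlier.writes & later.writes)  # both write the same place


def fill_delay_slots(block, branch: Instr, num_slots: int):
    """Move independent instructions from before the branch into its delay slots.

    `block` holds the instructions preceding `branch` in program order.
    Unfilled slots are padded with no-operations.
    """
    remaining = list(block)
    moved = []
    i = len(remaining) - 1
    # Scan backward from the instruction nearest the branch.
    while i >= 0 and len(moved) < num_slots:
        candidate = remaining[i]
        # Everything the candidate would be moved past, including the branch.
        passed_over = remaining[i + 1:] + [branch]
        if not any(conflicts(candidate, later) for later in passed_over):
            # Insert at the front so moved instructions keep program order.
            moved.insert(0, remaining.pop(i))
        i -= 1
    slots = moved + [NOP] * (num_slots - len(moved))
    return remaining + [branch] + slots


if __name__ == "__main__":
    block = [
        Instr("add r1, r2, r3", frozenset({"r1"}), frozenset({"r2", "r3"})),
        Instr("sub r4, r5, r6", frozenset({"r4"}), frozenset({"r5", "r6"})),
        Instr("cmp r1, r7", frozenset({"flags"}), frozenset({"r1", "r7"})),
    ]
    branch = Instr("beq target", reads=frozenset({"flags"}))
    for instr in fill_delay_slots(block, branch, num_slots=2):
        print(instr.text)
```

In the example at the end of the sketch, only the subtraction is independent of both the comparison and the branch, so it fills one delay slot and the second slot receives a no-operation.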
The success of delayed branching, however, is dependent on finding instructions to fill the delay slots. The instructions cannot affect the outcome of the comparison or the branch in any way since, once they are relocated, they will be executed after the branch begins. In the MIPS, for example, one delay slot can be filled 70% of the time. A second slot can only be filled 25% of the time. Unfilled slots are filled with so-called "no operations" (NOPs) and are essentially wasted. Delayed branches also introduce some complexity into the construction of a machine's compiler, since the compiler is the mechanism which searches for and relocates the appropriate code.
Branch folding is another type of code reorganization, which is used by the CRISP microprocessor. See the Ditzel et al. paper referenced hereinabove; the paper by Berenbaum et al., "Architectural Innovations in the CRISP Microprocessor", Proceedings of the Spring COMPCON, pp. 91-95, February, 1987; the paper by Ditzel et al., "The Hardware Architecture of the CRISP Microprocessor", Proceedings of the 14th Annual International Symposium on Computer Architectures, pp. 309-19, June, 1987; and the paper by Berenbaum et al., "A Pipelined 32b Microprocessor with 13kb of Cache Memory", Proceedings of the International Solid States Circuits Conference, pp. 34-35, February, 1987. CRISP uses a horizontal microcode where each microinstruction contains two fields, the Next-PC and the Alternate Next-PC. These fields determine the address of the next instruction. During the instruction decoding, the hardware can identify a branch instruction and "fold" its two target addresses into the field of the previous microinstruction. In a sense, each instruction can be thought of as a branch instruction because each instruction contains the address (or addresses) of the next instruction. Static branch prediction is used to decide which direction of the branch is to be prefetched. If the prediction is correct, the execution pipeline continues uninterrupted. In this case, branch folding actually eliminates the branch. Otherwise, the instruction fetch pipeline is flushed and the correct target is fetched. In an ideal situation, therefore, CRISP can execute more than one instruction per cycle.
The implementation of branch folding requires a complex decoding unit. Nor does branch folding, in itself, improve performance by solving the pipeline branch problem. It does, however, decrease the code size, in some cases considerably, which yields a nearly offsetting performance improvement.
In summary, many methods, both hardware and software based, have been tried in order to improve the efficiency of instruction pipelines. The simplest solutions, such as prefetching and simple prediction, unfortunately do not result in great improvements. Other methods generate performance gains but at the cost of greater hardware and software complexity. None of the prior art solutions can guarantee that it will work all of the time. For this reason, branch problems are still a large factor in the gap between the peak performance and sustained performance.
The need for more effective processing of non-sequential programs is substantial. Most real-time applications contain various program flow changing instructions such as branches and subroutine calls. Aircraft flight control systems, particularly for aircraft with artificial stability, for example, require real-time decision making based on continually changing sensory inputs. A computing system which will not suffer performance degradation when these decisions grow in number would be extremely useful.
A large class of time-consuming algorithms also depends on conditional statements. If these branches are contained in tight loops, the performance degradation is compounded. Examples of such algorithms are fractal algorithms, and circuit testing algorithms like the D-algorithm and the PODEM algorithm. Symbolic processing also has many applications with an intrinsically serial nature, especially in the processing of linked lists, where dependence and connectivity are very localized. An architecture which could remove the performance degradation due to non-sequential program flow could greatly speed up this type of processing.