1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a method and apparatus for prefetching superblocks in a computer processing system.
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One form of parallel processing relates to the recent introduction of microprocessors of the "superscalar" type, which can effect parallel instruction computation. Typically, superscalar microprocessors have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and, thus, have multiple "pipelines". As such, multiple machine instructions may be executed simultaneously in a superscalar microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as "pipelined" microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
In general, instruction flow in a microprocessor requires that the instructions are fetched and decoded from sequential locations in a memory. Unfortunately, computer programs also include branch instructions. A branch instruction is an instruction that causes a disruption in this flow, e.g., a taken branch causes decoding to be discontinued along the sequential path, and resumed at a new location in memory. The new location in memory may be referred to as a target address of the branch. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
There are various types of branch instructions. One type of branch instruction is known as an unconditional branch in that it unconditionally transfers control from the branch instruction to the target instruction. That is, at the time t hat the branch instruction is decoded, it is known that the transfer of control to the target instruction will take place. Examples of unconditional branches include subroutine CALL/RETURN and GOTO. In terms of performance, a more costly branch instruction is known as a conditional branch. This instruction specifies that control is to be transferred to the target instruction only if some condition, as determined by the outcome of a previous instruction, is met. Examples of conditional branch constructs include the DO LOOP and the IF/THEN/ELSE.
Unconditional branches, such as subroutine CALL/RETURN and GOTO, are always taken. Thus, their behavior seems easy to predict. On the other hand, conditional branching constructs, such as the DO LOOP and the IF/THEN/ELSE block, do not always follow the same path. In actuality, a DO LOOP is a fairly well-behaved structure, usually resulting in a taken branch to a destination at a negative displacement off the program counter. However, the IF/THEN/ELSE block may not exhibit predictable behavior since it may change direction on subsequent iterations.
If it can be determined at instruction decode time that a conditional branch instruction will not be taken, then there is no penalty associated with the execution of the conditional branch instruction. That is, the next sequential instruction may be decoded immediately following the decoding of the branch instruction. However, if it is determined that the branch will be taken, a multi-cycle penalty associated with the branch is incurred in that the target address must be generated and the target instruction must be fetched.
Accordingly, many pipelined microprocessors use branch prediction mechanisms that predict the existence (i.e., taken or not taken) and the outcome of branch instructions within an instruction stream. The instruction fetch unit uses the branch predictions to fetch subsequent instructions.
When a branch prediction mechanism predicts the outcome of a branch instruction and the microprocessor executes subsequent instructions along the predicted path, the microprocessor is said to have "speculatively executed" along the predicted instruction path. During speculative execution the microprocessor is performing useful processing if the branch instruction was predicted correctly.
However, if the branch prediction mechanism mispredicted the branch instruction, the microprocessor is executing instructions down the wrong path and, thus, accomplishes nothing. When the microprocessor eventually detects the mispredicted branch, the microprocessor must flush the instructions that were speculatively fetched from the instruction pipeline and restart execution at the correct address. The effect of the above described non-sequential operation, and of the resultant flushing of the pipeline, is exacerbated in the case of superscalar pipelined microprocessors. For example, if a branch or other interruption in the sequential instruction flow of the microprocessor occurs, the number of lost pipeline slots, or lost execution opportunities, is multiplied by the number of parallel execution units (i.e., parallel pipelines). The performance degradation due to branches and corresponding non-sequential program execution is therefore amplified in superscalar pipelined microprocessors.
Thus, when the number of stages in a processor pipeline increases, the latency between the instruction execute stage and the instruction fetch stage becomes longer. This causes an increased delay in the launching of instructions after a conditional branch which depends on the execution of a prior instruction. Unless useful work can be inserted between the instructions and the conditional branch by the compiler, the delay results in lost cycles.
To reduce the number of lost cycles, especially in the common situation where the compiler is unable to insert more than a couple of instructions during the time in which the branch instruction is being resolved, it is necessary for the hardware to accurately anticipate (predict) the result of executing a particular branch and fetch instructions along the appropriate stream. When the prediction is correct, lost cycles are avoided. However, when the prediction is incorrect, it is possible that the number of lost cycles is greater than if no prediction was attempted. It therefore becomes essential to increase the success rate of the predictions.
Prediction techniques have included the use of Branch History Tables (BHTs) and Branch Target Buffers (BTBs). In its simplest form, a BHT maintains the outcomes of previously executed branches. The table is accessed by the instruction prefetch unit and decides whether prefetching should be redirected or not. The table is searched for a valid entry, just as a cache is searched. The table is typically set-associative, as is the case with many cache organizations. An entry is only added to the table when a taken branch is executed by the processor. On each BHT hit, the historical information in that entry is used by the prediction algorithm. The algorithm redirects prefetching for a taken prediction, or continues with the next sequential instruction for a not-taken prediction. Some implementations invalidate the entry when the branch changes to not taken. In this case, a BHT miss will occur subsequently, and next-sequential prefetching will ensue. If the prediction is wrong, the processor must be equipped with a back-out strategy to restore the necessary state. The use of BHTs is not without deficiency. For example, one problem with a conventional BHT is that it is typically indexed by the address of the branch that is to be predicted. This implies that the stream of instructions fetched from the cache has to be decoded to detect a branch before the prediction can be made. The decoding of instructions could possibly be done multiple instructions at a time when the instruction length is fixed. In some instruction sets where instructions are variable in length, this is a bigger problem.
Conventional BTBs are cache-like buffers that are used in the fetch units of microprocessors to store an identifier of a previously performed branch instruction as a tag, along with the target address (i.e., the address to which the branch points in its predicted state) and an indication of the branch's history. Upon subsequent fetches of the branch, the target address is used (depending on the branch history) as the next address to fetch in the pipeline. Upon execution of the branch instruction itself, the target address is compared against the actual next instruction address determined by the execution unit to verify whether the speculative execution was valid. However, the use of BTBs is not without deficiency. For example, as with the BHT, a BTB indexed using the address of the branch is able to provide the address of the target only when the branch is decoded in the instruction stream.
Thus, it would be desirable and highly advantageous to have a method and apparatus for accurately predicting the existence and the outcome of branch instructions within an instruction stream.